Hey @rclement! Sorry for the delay, but here's a continuation of the discussion from rclement/datasette-ml#3
Supporting queries in sqlite-loadable-rs
I created a new issue in sqlite-loadable-rs to track adding querying support to that library. That way we can run statements like CREATE TABLE / INSERT INTO from the extension itself, which I know is a blocker for this work. Not sure if I'll have an ETA soon, but once that's in, it should unblock us here to make a proper loadable extension.
In Pure Rust, no Python?
The native-ext branch uses PyO3 for ML algorithms, which is great for getting started, but bundling Python in an extension can be tricky. I have the sqlite-python project that lets you define loadable SQLite extensions with Python, which could be useful here, but it comes with problems:
- It'll assume the user already has a pre-configured Python environment with the right packages installed
- It can be unstable when switching between different Python versions
- Python <-> Rust isn't very fun in general
There's the linfa project that could help us move to a pure-Rust extension. It's the most complete scikit-learn-like Rust crate I can find, so we could use its algorithms in sqlite-ml to remove the Python dependency.
I played around with it a bit and it seems pretty advanced: most of the models seem to support serializing to a byte array (so we can persist a trained model across connections). It may not have 100% of the algorithms that scikit-learn has, but probably enough for this?
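To make that last point concrete, here's a rough sketch of train → serialize → reload with linfa's logistic regression. It assumes the serde feature is enabled on linfa-logistic and pulls in bincode for the byte encoding; treat the exact API details as approximate:

use linfa::prelude::*;
use linfa_logistic::{FittedLogisticRegression, LogisticRegression};
use ndarray::array;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Tiny stand-in for iris: two features, binary labels
    let records = array![[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [5.9, 3.2]];
    let targets = array![0usize, 0, 1, 1];
    let dataset = Dataset::new(records, targets);

    // Train a model
    let model = LogisticRegression::default()
        .max_iterations(100)
        .fit(&dataset)?;

    // Serialize to bytes: this is what we'd persist (e.g. in an
    // ml_models table) so a later connection can reload the model
    let bytes: Vec<u8> = bincode::serialize(&model)?;
    let restored: FittedLogisticRegression<f64, usize> =
        bincode::deserialize(&bytes)?;

    // Predict with the reloaded model
    let predictions = restored.predict(&dataset);
    println!("{:?}", predictions);
    Ok(())
}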
Defining the SQL API
I've been thinking about a few different ways to express sqlite-ml operations in pure SQL, using eponymous virtual tables, table functions, and regular scalar functions. Here are some of my thoughts, but definitely not complete:
ml_experiments and friends
I think all these tables can be shadow tables that are read-only to users, since I don't think users will ever need to insert/update rows in these directly (there's a rough schema sketch after this list):
- ml_experiments
- ml_runs
- ml_models
- ml_metrics
- ml_deployments
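Once querying support lands in sqlite-loadable-rs, the extension could create these at load time. Here's a rough sketch of what that bootstrap DDL might look like as a constant the extension would execute; every table and column name below is made up just to make the idea concrete, not a committed schema:

// Hypothetical bootstrap DDL the extension could run once querying
// support lands in sqlite-loadable-rs; the schema is illustrative only.
const BOOTSTRAP_SQL: &str = r#"
CREATE TABLE IF NOT EXISTS ml_experiments (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    prediction_type TEXT NOT NULL, -- 'classification' or 'regression'
    algorithm TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT (datetime('now'))
);

CREATE TABLE IF NOT EXISTS ml_runs (
    id INTEGER PRIMARY KEY,
    experiment_id INTEGER NOT NULL REFERENCES ml_experiments(id),
    started_at TEXT NOT NULL DEFAULT (datetime('now'))
);

-- the serialized linfa model bytes would live here
CREATE TABLE IF NOT EXISTS ml_models (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES ml_runs(id),
    model BLOB NOT NULL
);
"#;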
ml_train
ml_train can be an eponymous virtual table that users can INSERT into, to create new experiments/models.
insert into ml_train
values (
  'Iris prediction',     -- name of experiment
  'classification',      -- prediction type
  'logistic_regression', -- algorithm
  'ml_datasets.iris',    -- source data: a table/view name, with optional schema
  'target'               -- target column
);
ml_predict
A table function that takes in a JSON array of values and predicts the target column:
select
  iris.*,
  prediction.prediction
from ml_datasets.iris as iris
join ml_predict(
  'Iris prediction',
  json_array(
    iris.sepal_length,
    iris.sepal_width,
    iris.petal_length,
    iris.petal_width
  )
) as prediction;
ml_load_dataset
If we wanted to just inline those default datasets into the extension, we could have eponymous virtual tables for each one like so:
select * from ml_datasets_iris;
select * from ml_datasets_breast_cancer;
select * from ml_datasets_diabetes;
Or if we don't want to bloat the size of the extension, we could offer a separate pre-built database file that people can attach themselves:
-- here ml_datasets_path() can return the path of the pre-built SQLite database, or create one if it doesn't exist
attach database ml_datasets_path() as ml_datasets;
select * from ml_datasets.iris;
select * from ml_datasets.breast_cancer;
select * from ml_datasets.diabetes;
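On the extension side, ml_datasets_path() could be a plain scalar function. Here's a rough sketch following the scalar-function pattern from the sqlite-loadable-rs README; the entrypoint name, the path choice, and the create-if-missing behavior are all placeholders:

use sqlite_loadable::prelude::*;
use sqlite_loadable::{api, define_scalar_function, Result};

// Return the path to the pre-built datasets database. Deciding where
// that file should live (and creating it if it's missing) is left as
// a placeholder here.
fn ml_datasets_path(context: *mut sqlite3_context, _values: &[*mut sqlite3_value]) -> Result<()> {
    let path = "/tmp/ml_datasets.db"; // placeholder location
    api::result_text(context, path)?;
    Ok(())
}

#[sqlite_entrypoint]
pub fn sqlite3_mldatasets_init(db: *mut sqlite3) -> Result<()> {
    define_scalar_function(db, "ml_datasets_path", 0, ml_datasets_path, FunctionFlags::UTF8)?;
    Ok(())
}

Again, some very loose thoughts and notes, so feel free to ask about anything!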