
As a loadable extension (with Rust) #1

@asg017


Hey @rclement! Sorry for the delay, but here's a continuation of the discussion from rclement/datasette-ml#3

Supporting queries in sqlite-loadable-rs

I created a new issue in sqlite-loadable-rs to track adding querying support to that library. That way we can run queries like CREATE TABLE / INSERT INTO from inside the extension itself, which I know is a blocker for this work. I don't have an ETA yet, but once that lands, it should unblock us here to make a proper loadable extension.

In Pure Rust, no Python?

The native-ext branch uses PyO3 for ML algorithms, which will be great to get started, but bundling Python in an extension can be tricky. I have the sqlite-python project that lets you define loadable SQLite extensions with Python, which could be useful here, but it comes with problems:

  • It assumes the user already has a pre-configured Python environment with the right packages installed
  • It can be unstable when switching between different Python versions
  • Python <-> Rust interop isn't very fun in general

There's the linfa project that could help us move to a pure-Rust extension. It's the most complete scikit-learn-like Rust crate I can find, so we could use its algorithms in sqlite-ml to remove the Python dependency.

I played around with it a bit and it seems pretty advanced; most of the models seem to support serializing to a byte array (so we can persist a trained model across connections). It may not have 100% of the algorithms that scikit-learn has, but probably enough for this?
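To make that persistence point concrete, here's a minimal Python sketch (stdlib only, nothing from this repo) of the flow: serialize a model to bytes, stash it in a table as a BLOB, and restore it on a later connection. The real extension would do the serialization in Rust (e.g. a linfa model plus a serde-based byte format); the `ml_models` table name and dict-of-coefficients "model" here are just stand-ins for illustration.

```python
import os
import pickle
import sqlite3
import tempfile

# Stand-in "trained model": just a dict of coefficients. In the real
# extension this would be a linfa model serialized to bytes on the Rust side.
model = {"weights": [0.5, -1.2, 0.8, 0.3], "bias": 0.1}

db_path = os.path.join(tempfile.mkdtemp(), "ml.db")

# Connection 1: persist the serialized model as a BLOB.
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE ml_models (name TEXT PRIMARY KEY, model BLOB)")
conn.execute(
    "INSERT INTO ml_models VALUES (?, ?)",
    ("Iris prediction", pickle.dumps(model)),
)
conn.commit()
conn.close()

# Connection 2 (later, possibly a different process): restore the model.
conn = sqlite3.connect(db_path)
blob = conn.execute(
    "SELECT model FROM ml_models WHERE name = ?", ("Iris prediction",)
).fetchone()[0]
restored = pickle.loads(blob)
conn.close()

print(restored == model)  # True: the trained model survives across connections
```

The only requirement on linfa's side is that a fitted model round-trips through bytes, which is what the serde support appears to give us.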

Defining the SQL API

I've been thinking about a few different ways to express sqlite-ml operations in pure SQL, using eponymous virtual tables, table functions, and regular scalar functions. Here are some of my thoughts, though they're definitely not complete:

ml_experiments and friends

I think all these tables can be shadow tables that are read-only to users, since I don't think users will ever need to insert/update rows in these directly:

  • ml_experiments
  • ml_runs
  • ml_models
  • ml_metrics
  • ml_deployments

ml_train

ml_train can be an eponymous virtual table that users can INSERT into to create new experiments/models.

insert into ml_train
    values (
      'Iris prediction',  -- name of experiment
      'classification',  -- prediction type
      'logistic_regression',  -- algorithm
      'ml_datasets.iris',  -- source data. can be a table/view name, optional schema
      'target' -- target column
    );

ml_predict

A table function that takes in a JSON array of values and predicts the target column:

select
  iris.*,
  prediction.prediction
from ml_datasets.iris as iris
join ml_predict(
  'Iris prediction',
  json_array(
    iris.sepal_length,
    iris.sepal_width,
    iris.petal_length,
    iris.petal_width
  )
) as prediction;

ml_load_dataset

If we wanted to just inline those default datasets into the extension, we could have eponymous virtual tables for each one like so:

select * from ml_datasets_iris;
select * from ml_datasets_breast_cancer;
select * from ml_datasets_diabetes;

Or if we don't want to bloat the size of the extension, we could offer a separate pre-built database file that people can attach themselves:

-- here ml_datasets_path() can return the path of the pre-built SQLite database, or create one if it doesn't exist
attach database ml_datasets_path() as ml_datasets;

select * from ml_datasets.iris;
select * from ml_datasets.breast_cancer;
select * from ml_datasets.diabetes;
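ml_datasets_path() doesn't exist yet, but the attach flow it would enable is plain SQLite. Here's a quick Python sketch (stdlib only; the file path and iris row are made up) of building a stand-in datasets file and attaching it under the ml_datasets schema:

```python
import os
import sqlite3
import tempfile

# Build a tiny stand-in for the pre-built datasets database file.
datasets_path = os.path.join(tempfile.mkdtemp(), "ml_datasets.db")
ds = sqlite3.connect(datasets_path)
ds.execute(
    "CREATE TABLE iris (sepal_length REAL, sepal_width REAL, "
    "petal_length REAL, petal_width REAL, target TEXT)"
)
ds.execute("INSERT INTO iris VALUES (5.1, 3.5, 1.4, 0.2, 'setosa')")
ds.commit()
ds.close()

# A user's own database can then ATTACH it and query across schemas.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ? AS ml_datasets", (datasets_path,))
rows = conn.execute("SELECT * FROM ml_datasets.iris").fetchall()
print(rows)  # [(5.1, 3.5, 1.4, 0.2, 'setosa')]
```

All ml_datasets_path() would add on top of this is resolving (and lazily creating) that file in a well-known location, so the attach becomes a one-liner.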

Again, some very loose thoughts and notes, feel free to ask about anything!
