@jkanche jkanche commented Nov 22, 2024

This PR introduces major improvements to matrix handling, storage, and performance, including support for multiple matrices in H5AD/AnnData workflows and optimizations for ingestion and querying.

Support for multiple matrices:

  • Both build_cellarrdataset and CellArrDataset now support multiple matrices. During ingestion, a TileDB group called "assays" is created to store all matrices, along with group-level metadata.

This may be a breaking change, depending on how these classes were used with their default parameters. Previously, the TileDB files were built like this:

dataset = build_cellarrdataset(
    output_path=tempdir,
    files=[adata1, adata2],
    matrix_options=MatrixOptions(matrix_name="counts", dtype=np.int16),
    num_threads=2,
)

Now you can provide a list of matrix options, one for each layer in the files:

dataset = build_cellarrdataset(
    output_path=tempdir,
    files=[adata1, adata2],
    matrix_options=[
        MatrixOptions(matrix_name="counts", dtype=np.int16),
        MatrixOptions(matrix_name="log-norm", dtype=np.float32),
    ],
    num_threads=2,
)
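Since each entry in matrix_options names a layer, it can help to check up front that every input file actually contains the requested layers. The following is a minimal sketch of such a check, not part of the cellarr API: LayerSpec is a hypothetical stand-in for MatrixOptions, missing_layers is a hypothetical helper, and plain sets of layer names stand in for the AnnData objects.

```python
from dataclasses import dataclass


@dataclass
class LayerSpec:
    """Hypothetical stand-in for MatrixOptions with only the fields used here."""
    matrix_name: str
    dtype: str


def missing_layers(files, specs):
    """Map each file name to the requested layers it does not contain."""
    missing = {}
    for name, layers in files.items():
        absent = [s.matrix_name for s in specs if s.matrix_name not in layers]
        if absent:
            missing[name] = absent
    return missing


# Requested matrices, mirroring the two MatrixOptions entries above.
specs = [LayerSpec("counts", "int16"), LayerSpec("log-norm", "float32")]

# Assumption: each file is represented only by the set of layer names it holds.
files = {"adata1": {"counts", "log-norm"}, "adata2": {"counts"}}

problems = missing_layers(files, specs)
print(problems)
```

An empty result means every file provides every requested layer; otherwise the mapping shows which files would need attention before ingestion.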

Querying follows a similar structure:

cd = CellArrDataset(
    dataset_path=tempdir,
    assay_tiledb_group="assays",
    assay_uri=["counts", "log-norm"]
)

assay_uri is relative to assay_tiledb_group. For backwards compatibility, assay_tiledb_group can be an empty string.
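To illustrate how relative assay URIs could resolve, here is a small sketch. The helper resolve_assay_uris and the example paths are assumptions made for illustration and are not the library's implementation; the sketch only shows the idea that an empty group name falls back to the older flat layout.

```python
import posixpath


def resolve_assay_uris(dataset_path, assay_tiledb_group, assay_uri):
    """Join each assay URI onto the dataset path and the optional group name.

    Hypothetical helper for illustration; not part of the cellarr API.
    """
    # Accept a single name or a list of names.
    names = [assay_uri] if isinstance(assay_uri, str) else list(assay_uri)
    # An empty group name means the matrices sit directly under dataset_path,
    # which mirrors the backwards-compatible layout described above.
    if assay_tiledb_group:
        base = posixpath.join(dataset_path, assay_tiledb_group)
    else:
        base = dataset_path
    return [posixpath.join(base, name) for name in names]


grouped = resolve_assay_uris("/data/my_dataset", "assays", ["counts", "log-norm"])
legacy = resolve_assay_uris("/data/my_dataset", "", "counts")
print(grouped)
print(legacy)
```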

  • Parallelized ingestion:
    The build process now uses num_threads to ingest matrices concurrently. Two new columns in the sample metadata, cellarr_sample_start_index and cellarr_sample_end_index, track sample offsets, improving matrix processing.

    • Note: The process pool uses the spawn method on UNIX systems, which may affect usage on Windows machines.
  • TileDB query condition fixes:
    Fixed a few issues with fill values represented as bytes (this seems to be common when ascii is the column type) and, more generally, with filtering operations on TileDB dataframes.

  • Index remapping:
    Improved remapping of indices from sliced TileDB arrays for both dense and sparse matrices. This is not a user-facing function but an internal slicing operation.

  • Get a sample:
    Added a method to access all cells for a particular sample. You can provide either an index or a sample ID.

sample_1_slice = cd.get_cells_for_sample(0)

  • Other updates to documentation, tutorials, the README, and additional tests.
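
The sample offset columns and the per-sample lookup described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: sample_offsets and cells_for_sample are hypothetical helpers, samples are assumed to be laid out back to back in ingestion order, and the end index is treated as exclusive here (the PR does not say whether cellarr_sample_end_index is inclusive).

```python
from itertools import accumulate


def sample_offsets(cells_per_sample):
    """Return (start, end) cell-index pairs, one per sample, in ingestion order.

    Assumption: end is exclusive, so consecutive samples share a boundary.
    """
    ends = list(accumulate(cells_per_sample))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


def cells_for_sample(sample, sample_ids, offsets):
    """Return the range of cell indices for a sample given by position or ID."""
    position = sample if isinstance(sample, int) else sample_ids.index(sample)
    start, end = offsets[position]
    return range(start, end)


# Hypothetical sample IDs and per-sample cell counts.
sample_ids = ["sample_1", "sample_2", "sample_3"]
offsets = sample_offsets([3, 5, 2])

by_index = cells_for_sample(1, sample_ids, offsets)
by_id = cells_for_sample("sample_3", sample_ids, offsets)
print(offsets, list(by_index), list(by_id))
```

Storing the offsets alongside the sample metadata means a per-sample query only needs one contiguous slice of the cell dimension rather than a filter over all cells.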

@jkanche jkanche self-assigned this Nov 22, 2024
@jkanche jkanche changed the title from "EOD" to "refactor the build functionality" Nov 23, 2024
@jkanche jkanche marked this pull request as ready for review November 25, 2024 22:35
@jkanche jkanche requested review from keviny2 and tony-kuo November 25, 2024 22:35

jkanche commented Nov 26, 2024

@tony-kuo and @keviny2 I'll update and merge this tomorrow, but it would be good to get another set of eyes here.


jkanche commented Nov 28, 2024

This PR has dependencies in #61 and #60. Will merge this PR first to keep it small and avoid merge conflicts later.

@jkanche jkanche changed the title from "refactor the build functionality" to "Support for multiple matrices and improving construction of TileDB objects" Nov 28, 2024
@jkanche jkanche merged commit 48de52c into master Nov 28, 2024
6 checks passed
@jkanche jkanche deleted the refactor-layers branch November 28, 2024 20:23