Support for multiple matrices and improving construction of TileDB objects #53

jkanche · 2024-11-22T02:06:17Z

This PR introduces major improvements to matrix handling, storage, and performance, including support for multiple matrices in H5AD/AnnData workflows and optimizations for ingestion and querying.

Support for multiple matrices:

Both build_cellarrdataset and CellArrDataset now support multiple matrices. During ingestion, a TileDB group called "assays" is created to store all matrices, along with group-level metadata.

This may introduce breaking changes, depending on how these classes are used with their default parameters. Previously, the TileDB files were built like this:

dataset = build_cellarrdataset(
    output_path=tempdir,
    files=[adata1, adata2],
    matrix_options=MatrixOptions(matrix_name="counts", dtype=np.int16),
    num_threads=2,
)

Now you may provide a list of matrix options, one for each layer in the files:

dataset = build_cellarrdataset(
    output_path=tempdir,
    files=[adata1, adata2],
    matrix_options=[
        MatrixOptions(matrix_name="counts", dtype=np.int16),
        MatrixOptions(matrix_name="log-norm", dtype=np.float32),
    ],
    num_threads=2,
)
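
One way a builder could keep accepting the old single-object form alongside the new list form is to normalize the argument up front. The sketch below is only an illustration of that idea, not the code in this PR: `MatrixOptionsSketch` and `normalize_matrix_options` are hypothetical names, and the uniqueness check on matrix names is an assumption.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union


@dataclass
class MatrixOptionsSketch:
    """Hypothetical stand-in for MatrixOptions with only the two fields used above."""

    matrix_name: str
    dtype: str


def normalize_matrix_options(
    options: Union[MatrixOptionsSketch, Sequence[MatrixOptionsSketch]],
) -> List[MatrixOptionsSketch]:
    """Return a list of options whether one object or a sequence was passed."""
    # Old-style call: a single options object becomes a one-element list.
    if isinstance(options, MatrixOptionsSketch):
        options = [options]
    options = list(options)
    if not options:
        raise ValueError("at least one matrix option is required")
    # Assumption: each matrix is stored under its own name, so names must be unique.
    names = [o.matrix_name for o in options]
    if len(set(names)) != len(names):
        raise ValueError("matrix names must be unique")
    return options
```

With this shape, both call styles shown above would reach the ingestion code as a list.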

Querying follows a similar structure:

cd = CellArrDataset(
    dataset_path=tempdir,
    assay_tiledb_group="assays",
    assay_uri=["counts", "log-norm"]
)

assay_uri is relative to assay_tiledb_group. For backwards compatibility, assay_tiledb_group can be an empty string.
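
To make the relative-path rule concrete, here is a minimal sketch of how the matrix locations could be resolved from these three arguments. `resolve_assay_paths` is a hypothetical helper written for illustration; the actual resolution inside CellArrDataset may differ.

```python
import posixpath
from typing import List, Sequence, Union


def resolve_assay_paths(
    dataset_path: str,
    assay_tiledb_group: str,
    assay_uri: Union[str, Sequence[str]],
) -> List[str]:
    """Join each assay URI onto the group, or onto the dataset root if no group is given."""
    uris = [assay_uri] if isinstance(assay_uri, str) else list(assay_uri)
    # An empty group string means the matrices sit directly under the dataset path,
    # which is the layout assumed here for datasets built before this change.
    base = (
        posixpath.join(dataset_path, assay_tiledb_group)
        if assay_tiledb_group
        else dataset_path
    )
    return [posixpath.join(base, uri) for uri in uris]
```

For example, with a group of "assays" the two matrices above would resolve under `<dataset_path>/assays/`, while an empty group resolves them directly under `<dataset_path>/`.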

Parallelized ingestion:

The build process now uses num_threads to ingest matrices concurrently. Two new columns in the sample metadata, cellarr_sample_start_index and cellarr_sample_end_index, track sample offsets, improving matrix processing.

- Note: The process pool uses the spawn method on UNIX systems, which may affect usage on Windows machines.
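
As a rough illustration of what the two offset columns could hold, the sketch below derives a start and end index per sample from per-sample cell counts. It assumes half-open ranges (the end index is one past the last cell), which this PR description does not state; `sample_index_ranges` is a hypothetical helper.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def sample_index_ranges(cell_counts: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (start, end) offsets per sample, in ingestion order.

    Assumption: ranges are half-open, i.e. cells of a sample are start..end-1.
    """
    # Running total of cells gives the end offset of each sample.
    ends = list(accumulate(cell_counts))
    # Each sample starts where the previous one ended; the first starts at 0.
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))
```

Knowing each sample's offsets ahead of time means every worker can write its block of rows independently, without coordinating with the others.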

TileDB query condition fixes:

Fixed a few issues with fill values represented as bytes (this seems to be common when ascii is used as the column type) and, more generally, with filtering operations on TileDB Dataframes.

Index remapping:

Improved remapping of indices from sliced TileDB arrays for both dense and sparse matrices. This is not a user-facing function but an internal slicing operation.
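
The general idea behind this kind of remapping can be sketched as follows: coordinates returned from a slice refer to positions in the full array, and they need to be translated to positions in the requested subset. This is a generic illustration under the assumption that the queried indices are unique; `remap_indices` is a hypothetical name and not the internal function changed in this PR.

```python
from typing import Dict, List, Sequence


def remap_indices(global_coords: Sequence[int], queried: Sequence[int]) -> List[int]:
    """Translate coordinates in the full array to positions in the queried subset.

    Assumption: `queried` contains unique indices, in the order the caller asked for.
    """
    # Position of each queried index within the request.
    lookup: Dict[int, int] = {g: i for i, g in enumerate(queried)}
    return [lookup[g] for g in global_coords]
```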

Get a sample:

Added a method to access all cells for a particular sample. You can provide either an index or a sample id.

sample_1_slice = cd.get_cells_for_sample(0)
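
A lookup that accepts either an index or a sample id could be built on the two offset columns, roughly as sketched below. This is not the implementation of get_cells_for_sample: the "sample_id" key, the list-of-dicts metadata, and the half-open range are all assumptions made for illustration.

```python
from typing import Dict, List, Union


def cell_range_for_sample(
    sample_metadata: List[Dict[str, object]], sample: Union[int, str]
) -> range:
    """Return the cell positions belonging to one sample.

    Assumptions: metadata rows are dicts, the id lives under a hypothetical
    "sample_id" key, and the end index is exclusive.
    """
    if isinstance(sample, int):
        # Integer input is treated as the row position of the sample.
        row = sample_metadata[sample]
    else:
        matches = [r for r in sample_metadata if r["sample_id"] == sample]
        if not matches:
            raise KeyError(sample)
        row = matches[0]
    return range(row["cellarr_sample_start_index"], row["cellarr_sample_end_index"])
```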

Also includes updates to documentation, tutorials, and the README, plus additional tests.

jkanche · 2024-11-26T02:43:14Z

@tony-kuo and @keviny2 I'll update and merge this tomorrow, but it would be good to get another set of eyes here.

jkanche · 2024-11-28T18:23:00Z

This PR has dependencies in #61 and #60. Will merge this PR first to keep it small and avoid merge conflicts later.

EOD

4ea4677

jkanche self-assigned this Nov 22, 2024

pre-commit-ci bot and others added 4 commits November 22, 2024 02:06

[pre-commit.ci] auto fixes from pre-commit.com hooks

8be9e3e

for more information, see https://pre-commit.ci

Merge branch 'master' into refactor-layers

298f4d1

Merge branch 'refactor-layers' of https://github.com/BiocPy/cellarr i…

7f72009

…nto refactor-layers

EOD

acbc55f

jkanche changed the title ~~EOD~~ refactor the build functionality Nov 23, 2024

pre-commit-ci bot and others added 4 commits November 23, 2024 00:28

[pre-commit.ci] auto fixes from pre-commit.com hooks

88e1146

for more information, see https://pre-commit.ci

there's many changes to support building the cellarr collection and q…

8e22118

…uerying; updating tests and documentation as well

Merge branch 'refactor-layers' of https://github.com/BiocPy/cellarr i…

0ba6f4f

…nto refactor-layers

[pre-commit.ci] auto fixes from pre-commit.com hooks

09b41ca

for more information, see https://pre-commit.ci

jkanche marked this pull request as ready for review November 25, 2024 22:35

jkanche requested review from keviny2 and tony-kuo November 25, 2024 22:35

jkanche and others added 4 commits November 25, 2024 14:46

reset sample index

d66a593

does the pool need to return?

89b4358

is fork the problem?

fb67352

[pre-commit.ci] auto fixes from pre-commit.com hooks

b9b46b5

for more information, see https://pre-commit.ci

jkanche added 2 commits November 25, 2024 18:48

add checks with threads

6548d59

run autoencoder tests only on github action

6f84688

keviny2 approved these changes Nov 26, 2024

src/cellarr/CellArrDataset.py (outdated, resolved)

jkanche and others added 8 commits November 26, 2024 09:03

fix docstring typos

03eb747

update assets

7014544

[pre-commit.ci] auto fixes from pre-commit.com hooks

6835ad1

for more information, see https://pre-commit.ci

update README

08d94a3

update docstrings throughout

fb0f2f2

filter dataframes with tiledb query expressions

fa6c3d1

[pre-commit.ci] auto fixes from pre-commit.com hooks

bf71cd5

for more information, see https://pre-commit.ci

fix dataloader when filtering query conditions

62365ff

pre-commit-ci bot and others added 7 commits November 26, 2024 21:17

[pre-commit.ci] auto fixes from pre-commit.com hooks

60c93b8

for more information, see https://pre-commit.ci

back to using remap

0db584c

get all cells for a sample

b28f037

[pre-commit.ci] auto fixes from pre-commit.com hooks

eef1a41

for more information, see https://pre-commit.ci

minor changes to README

0577a8d

separate assay group

f7cd450

add caching and with clause support

d21f5a3

jkanche and others added 2 commits November 28, 2024 10:33

update changelog

756111a

[pre-commit.ci] auto fixes from pre-commit.com hooks

128e091

for more information, see https://pre-commit.ci

jkanche changed the title ~~refactor the build functionality~~ Support for multiple matrices and improving construction of TileDB objects Nov 28, 2024

jkanche and others added 3 commits November 28, 2024 12:18

update changelog

9c9b360

Merge branch 'refactor-layers' of https://github.com/BiocPy/cellarr i…

1146902

…nto refactor-layers

[pre-commit.ci] auto fixes from pre-commit.com hooks

21397c5

for more information, see https://pre-commit.ci

jkanche merged commit 48de52c into master Nov 28, 2024
6 checks passed

jkanche deleted the refactor-layers branch November 28, 2024 20:23

This was referenced Nov 29, 2024

filter the frame when query conditions are used #55

Closed

support multiple layers during the build process #45

Closed
