-
-
Notifications
You must be signed in to change notification settings - Fork 16
Zarr support #190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: devel
Are you sure you want to change the base?
Zarr support #190
Conversation
…dataR into keller-mark/zarr
…milar to HDF5AnnData
Simplify how obs and var names handled in ZarrAnnData
More zarr-related changes
Update comments
rcannood
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fantastic work @keller-mark and @Artur-man !
I went through the PR for a first time and left some minor comments. I will review the code by running it a couple of times next :)
R/read_zarr_helpers.R
Outdated
| attrs <- g$get_attrs()$to_list() | ||
|
|
||
| if (!all(c("encoding-type", "encoding-version") %in% names(attrs))) { | ||
| path <- name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where are a lot of linting issues in this file -- could you run lintr::lint_package() and fix any issues that pop up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done ... did a full lint_package check and corrected some R check issues too!
inst/extdata/example2.zarr/.zgroup
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file should probably be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Bisaloo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, the patterns I highlight here can also be detected via lintr if you use all_linters() and then selectively turn off the linters you don't want.
For example: https://github.com/Huber-group-EMBL/Rarr/blob/devel/.lintr
R/Rarr_utils.R
Outdated
| #' @param version zarr version | ||
| #' @export | ||
| create_zarr_group <- function(store, name, version = "v2"){ | ||
| split.name <- strsplit(name, split = "\\/")[[1]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'll be slightly faster if regexes are not used:
| split.name <- strsplit(name, split = "\\/")[[1]] | |
| split.name <- strsplit(name, split = "/", fixed = TRUE)[[1]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
R/Rarr_utils.R
Outdated
| create_zarr_group <- function(store, name, version = "v2"){ | ||
| split.name <- strsplit(name, split = "\\/")[[1]] | ||
| if(length(split.name) > 1){ | ||
| split.name <- vapply(seq_len(length(split.name)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| split.name <- vapply(seq_len(length(split.name)), | |
| split.name <- vapply(seq_along(split.name), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
R/write_zarr_helpers.R
Outdated
| if (!is.null(dim(value))) { | ||
| dims <- dim(value) | ||
| } else { | ||
| dims <- length(value) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're using a recent version of R:
| if (!is.null(dim(value))) { | |
| dims <- dim(value) | |
| } else { | |
| dims <- length(value) | |
| } | |
| dims <- dim(value) %||% length(value) |
(you can also redefine the operator in this pkg for compatibility with older R versions)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I will change this directly, anndataR requires 4.5 and later anyways. Thanks Hugo.
I think they are slightly different because one will call |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will probably need some more rounds of reviews but hopefully I've caught a lot of stuff already.
Additional comments:
- I commented on some style things the first time I saw them but please do the same wherever else it is relevant
- The documentation in
AnnData-usageneeds to be updated to addZarrAnnData - Round trip tests for Zarr need to be added
- We should probably show Zarr usage somewhere in a main vignette
- Please try to get the CI checks to pass
inst/scripts/example_files.py
Outdated
| # os.chdir("inst/extdata/") | ||
| # zip = zipfile.ZipFile("example.zarr.zip", "w", zipfile.ZIP_DEFLATED) | ||
| # zip.write("example.zarr") | ||
| # shutil.rmtree("example.zarr") | ||
| # zip.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should be uncommented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are adding this we can probably remove the versions from the script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I learned about this recently, which can be a good lightweight way to document versions without having to create a dedicated environment file: https://docs.astral.sh/uv/guides/scripts/#declaring-script-dependencies
Note that it's not uv-specific but it's also documented in the official python docs.
R/AbstractAnnData.R
Outdated
| ) | ||
| }, | ||
| #' @description | ||
| #' Convert to an [`ZarrAnnData`] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| #' Convert to an [`ZarrAnnData`] | |
| #' Convert to a [`ZarrAnnData`] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
R/AbstractAnnData.R
Outdated
| #' @param compression See [as_ZarrAnnData()] | ||
| #' @param mode See [as_ZarrAnnData()] | ||
| #' | ||
| #' @return An [`ZarrAnnData`] object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| #' @return An [`ZarrAnnData`] object | |
| #' @return A [`ZarrAnnData`] object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
R/AbstractAnnData.R
Outdated
| write_h5ad(object = self, path, compression = compression, mode = mode) | ||
| }, | ||
| #' @description | ||
| #' Write the `AnnData` object to an H5AD file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| #' Write the `AnnData` object to an H5AD file | |
| #' Write the `AnnData` object to a Zarr file |
| }) | ||
|
|
||
| test_that("reading recarrays works", { | ||
| skip("read_zarr_rec_array is not implemented yet") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is implemented now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, ok the skip here is misleading, since Rarr currently doesn't read arbitrary structured data types, but my current implementation under read_zarr_rec_array only works for the example data. So it should stay like this until the following is fixed:
Huber-group-EMBL/Rarr#81
tests/testthat/test-h5ad-zarr.R
Outdated
| expect_equal(mapping_h5ad, mapping_zarr) | ||
| }) | ||
|
|
||
| test_that("reading dataframes works", { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| test_that("reading dataframes works", { | |
| test_that("reading dataframes is the same for h5ad and zarr", { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| }) | ||
|
|
||
| test_that("reading H5AD as SingleCellExperiment is same for h5ad and zarr", { | ||
| skip("for now, example.zarr and example.h5ad are not identical!") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are they not identical?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some readers are not implemented yet, but will check again.
#190 (comment)
| expect_equal(df_h5ad, df_zarr) | ||
| }) | ||
|
|
||
| test_that("reading H5AD as SingleCellExperiment is same for h5ad and zarr", { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably add a test for Seurat
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, but skipping for now until identical.
| - The `ReticulateAnnData` access data stored in an `AnnData` object in a concurrent Python session. | ||
| This comes with the overhead and complexity of using `r CRANpkg("reticulate")` but is sometimes useful to access functionality that has not yet been implemented in `r Biocpkg("anndataR")`. | ||
| - The planned `ZarrAnnData` will provide an interface to an `AnnData` Zarr store, similar to `HDF5AnnData`. | ||
| - The `ZarrAnnData` provides an interface to a Zarr data (i.e. Zarr store) and, similar to a H5AD file, minimal data is stored in memory until it is requested by the user. It is as an intermediate object when reading/writing Zarr store but can be useful for accessing parts of large files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| - The `ZarrAnnData` provides an interface to a Zarr data (i.e. Zarr store) and, similar to a H5AD file, minimal data is stored in memory until it is requested by the user. It is as an intermediate object when reading/writing Zarr store but can be useful for accessing parts of large files. | |
| - The `ZarrAnnData` provides an interface to a Zarr store and, similar to a H5AD file, minimal data is stored in memory until it is requested by the user. It is as an intermediate object when reading/writing Zarr stores but can be useful for accessing parts of large files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
awesome @lazappi I will get to it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I learned about this recently, which can be a good lightweight way to document versions without having to create a dedicated environment file: https://docs.astral.sh/uv/guides/scripts/#declaring-script-dependencies
Note that it's not uv-specific but it's also documented in the official python docs.
R/read_zarr_helpers.R
Outdated
| if (is.null(dim(data))) { | ||
| data <- as.vector(data) | ||
| dim(data) <- length(data) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rarr::read_zarr_array() currently always returns arrays (i.e., dim() is not NULL). I don't expect this to change as I like the idea of a consistent output class.
README.md
Outdated
| - To convert to/from `Seurat` objects, install [SeuratObject](https://cran.r-project.org/package=SeuratObject): | ||
| `install.packages("SeuratObject")` | ||
| - To read/write \*.zarr files, you need to install | ||
| [zarr](https://www.bioconductor.org/packages/release/bioc/html/Rarr.html): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| [zarr](https://www.bioconductor.org/packages/release/bioc/html/Rarr.html): | |
| [Rarr](https://www.bioconductor.org/packages/release/bioc/html/Rarr.html): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
Errors I am getting at lint and pkgdown GHAs look unrelated to the package: |
|
@Bisaloo also the tests that are passing for macos, ubuntu and windows are failing for macos (intel) and ubuntu-bioc-release. I understand the error in release (for obvious reasons), but macos-15 I don't know. Any ideas ? https://github.com/scverse/anndataR/actions/runs/20411185095/job/58648259569?pr=190 |
|
Yes, it looks that on this runner, Rarr 1.11.9 is installed, when support for reading scalars was added in Rarr 1.11.12. I think it's related to BioC devel not having more recent binaries for macOS x86 yet. Maybe try setting: Imports:
Rarr (>= 1.11.12)In This will not work for the runner using BioC release but it's already failing anyways. |
|
I see. Then I will assume all checks are passing for now. thanks man |
|
Current issues:
|
Fixes #91
These changes are from both me and @Artur-man
The main public-facing changes here are:
ZarrAnnDataclassread_zarrandwrite_zarrtop-level functionsfrom_Seurat(output_class="ZarrAnnData")from_SingleCellExperiment(output_class="ZarrAnnData")Internally:
read_zarr_helpers.Ris the zarr analog ofread_h5ad_helpers.Rwrite_zarr_helpers.Ris the zarr analog ofwrite_h5ad_helpers.Rinst/extdata/example.zarr(this makes the diff noisy, apologies)test-Zarr-read.R(35 new tests)test-Zarr-write.R(70)test-ZarrAnnData.R(26)test-h5ad-zarr.R(17)A number of these functions generate warnings in the R console that are intended to be followed up on to improve the code (and should probably be resolved as end users may not appreciate them), but the tests still pass despite these warnings.
Known things that are not implemented here:
recarraysmode = c("r", "r+", "a", "w", "w-", "x")parameter value