Skip to content

Conversation

@keller-mark
Copy link

@keller-mark keller-mark commented Nov 5, 2024

Fixes #91

These changes are from both me and @Artur-man

The main public-facing changes here are:

  • The ZarrAnnData class
  • read_zarr and write_zarr top-level functions
  • Support for from_Seurat(output_class="ZarrAnnData")
  • Support for from_SingleCellExperiment(output_class="ZarrAnnData")

Internally:

  • read_zarr_helpers.R is the zarr analog of read_h5ad_helpers.R
  • write_zarr_helpers.R is the zarr analog of write_h5ad_helpers.R
  • Test fixtures within inst/extdata/example.zarr (this makes the diff noisy, apologies)
  • Lots of tests:
    • test-Zarr-read.R (35 new tests)
    • test-Zarr-write.R (70)
    • test-ZarrAnnData.R (26)
    • test-h5ad-zarr.R (17)

A number of these functions generate warnings in the R console that are intended to be followed up on to improve the code (and should probably be resolved as end users may not appreciate them), but the tests still pass despite these warnings.

Known things that are not implemented here:

  • support for recarrays
  • usage of mode = c("r", "r+", "a", "w", "w-", "x") parameter value

Copy link
Member

@rcannood rcannood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic work @keller-mark and @Artur-man !

I went through the PR for a first time and left some minor comments. I will review the code by running it a couple of times next :)

attrs <- g$get_attrs()$to_list()

if (!all(c("encoding-type", "encoding-version") %in% names(attrs))) {
path <- name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where are a lot of linting issues in this file -- could you run lintr::lint_package() and fix any issues that pop up?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done ... did a full lint_package check and corrected some R check issues too!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file should probably be removed

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link

@Bisaloo Bisaloo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, the patterns I highlight here can also be detected via lintr if you use all_linters() and then selectively turn off the linters you don't want.

For example: https://github.com/Huber-group-EMBL/Rarr/blob/devel/.lintr

R/Rarr_utils.R Outdated
#' @param version zarr version
#' @export
create_zarr_group <- function(store, name, version = "v2"){
split.name <- strsplit(name, split = "\\/")[[1]]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll be slightly faster if regexes are not used:

Suggested change
split.name <- strsplit(name, split = "\\/")[[1]]
split.name <- strsplit(name, split = "/", fixed = TRUE)[[1]]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

R/Rarr_utils.R Outdated
create_zarr_group <- function(store, name, version = "v2"){
split.name <- strsplit(name, split = "\\/")[[1]]
if(length(split.name) > 1){
split.name <- vapply(seq_len(length(split.name)),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
split.name <- vapply(seq_len(length(split.name)),
split.name <- vapply(seq_along(split.name),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 250 to 254
if (!is.null(dim(value))) {
dims <- dim(value)
} else {
dims <- length(value)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're using a recent version of R:

Suggested change
if (!is.null(dim(value))) {
dims <- dim(value)
} else {
dims <- length(value)
}
dims <- dim(value) %||% length(value)

(you can also redefine the operator in this pkg for compatibility with older R versions)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I will change this directly, anndataR requires 4.5 and later anyways. Thanks Hugo.

@lazappi
Copy link
Collaborator

lazappi commented Dec 11, 2025

Some functions such as read_h5ad_data_frame and read_h5ad_collection would repeat themselves if we were to define a, e.g. read_zarr_data_frame/_collection since they do not directly call rhdf5 or Rarr methods.

I think they are slightly different because one will call read_h5ad_element() and the other will call read_zarr_element(). I'm open to reducing redundancy where it makes sense and is still easy to follow but maybe it's better to leave for a separate PR so it's easier to review the Zarr functionality without other changes.

@Artur-man
Copy link

Artur-man commented Dec 11, 2025

Then @lazappi perhaps we can start your review, let me know if you guys want me to change anything (there are lots of changes!). Meanwhile, @Bisaloo will keep working on fixing a few small issues in Rarr.

Copy link
Collaborator

@lazappi lazappi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will probably need some more rounds of reviews but hopefully I've caught a lot of stuff already.

Additional comments:

  • I commented on some style things the first time I saw them but please do the same wherever else it is relevant
  • The documentation in AnnData-usage needs to be updated to add ZarrAnnData
  • Round trip tests for Zarr need to be added
  • We should probably show Zarr usage somewhere in a main vignette
  • Please try to get the CI checks to pass

Comment on lines 108 to 112
# os.chdir("inst/extdata/")
# zip = zipfile.ZipFile("example.zarr.zip", "w", zipfile.ZIP_DEFLATED)
# zip.write("example.zarr")
# shutil.rmtree("example.zarr")
# zip.close()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be uncommented?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are adding this we can probably remove the versions from the script.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I learned about this recently, which can be a good lightweight way to document versions without having to create a dedicated environment file: https://docs.astral.sh/uv/guides/scripts/#declaring-script-dependencies

Note that it's not uv-specific but it's also documented in the official python docs.

)
},
#' @description
#' Convert to an [`ZarrAnnData`]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#' Convert to an [`ZarrAnnData`]
#' Convert to a [`ZarrAnnData`]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

#' @param compression See [as_ZarrAnnData()]
#' @param mode See [as_ZarrAnnData()]
#'
#' @return An [`ZarrAnnData`] object
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#' @return An [`ZarrAnnData`] object
#' @return A [`ZarrAnnData`] object

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

write_h5ad(object = self, path, compression = compression, mode = mode)
},
#' @description
#' Write the `AnnData` object to an H5AD file
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#' Write the `AnnData` object to an H5AD file
#' Write the `AnnData` object to a Zarr file

})

test_that("reading recarrays works", {
skip("read_zarr_rec_array is not implemented yet")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is implemented now?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, ok the skip here is misleading, since Rarr currently doesn't read arbitrary structured data types, but my current implementation under read_zarr_rec_array only works for the example data. So it should stay like this until the following is fixed:
Huber-group-EMBL/Rarr#81

expect_equal(mapping_h5ad, mapping_zarr)
})

test_that("reading dataframes works", {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
test_that("reading dataframes works", {
test_that("reading dataframes is the same for h5ad and zarr", {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

})

test_that("reading H5AD as SingleCellExperiment is same for h5ad and zarr", {
skip("for now, example.zarr and example.h5ad are not identical!")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are they not identical?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some readers are not implemented yet, but will check again.
#190 (comment)

expect_equal(df_h5ad, df_zarr)
})

test_that("reading H5AD as SingleCellExperiment is same for h5ad and zarr", {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably add a test for Seurat

Copy link

@Artur-man Artur-man Dec 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, but skipping for now until identical.

- The `ReticulateAnnData` access data stored in an `AnnData` object in a concurrent Python session.
This comes with the overhead and complexity of using `r CRANpkg("reticulate")` but is sometimes useful to access functionality that has not yet been implemented in `r Biocpkg("anndataR")`.
- The planned `ZarrAnnData` will provide an interface to an `AnnData` Zarr store, similar to `HDF5AnnData`.
- The `ZarrAnnData` provides an interface to a Zarr data (i.e. Zarr store) and, similar to a H5AD file, minimal data is stored in memory until it is requested by the user. It is as an intermediate object when reading/writing Zarr store but can be useful for accessing parts of large files.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- The `ZarrAnnData` provides an interface to a Zarr data (i.e. Zarr store) and, similar to a H5AD file, minimal data is stored in memory until it is requested by the user. It is as an intermediate object when reading/writing Zarr store but can be useful for accessing parts of large files.
- The `ZarrAnnData` provides an interface to a Zarr store and, similar to a H5AD file, minimal data is stored in memory until it is requested by the user. It is as an intermediate object when reading/writing Zarr stores but can be useful for accessing parts of large files.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@Artur-man
Copy link

We will probably need some more rounds of reviews but hopefully I've caught a lot of stuff already.

Additional comments:

  • I commented on some style things the first time I saw them but please do the same wherever else it is relevant
  • The documentation in AnnData-usage needs to be updated to add ZarrAnnData
  • Round trip tests for Zarr need to be added
  • We should probably show Zarr usage somewhere in a main vignette
  • Please try to get the CI checks to pass

awesome @lazappi I will get to it.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I learned about this recently, which can be a good lightweight way to document versions without having to create a dedicated environment file: https://docs.astral.sh/uv/guides/scripts/#declaring-script-dependencies

Note that it's not uv-specific but it's also documented in the official python docs.

Comment on lines 144 to 147
if (is.null(dim(data))) {
data <- as.vector(data)
dim(data) <- length(data)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rarr::read_zarr_array() currently always returns arrays (i.e., dim() is not NULL). I don't expect this to change as I like the idea of a consistent output class.

README.md Outdated
- To convert to/from `Seurat` objects, install [SeuratObject](https://cran.r-project.org/package=SeuratObject):
`install.packages("SeuratObject")`
- To read/write \*.zarr files, you need to install
[zarr](https://www.bioconductor.org/packages/release/bioc/html/Rarr.html):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[zarr](https://www.bioconductor.org/packages/release/bioc/html/Rarr.html):
[Rarr](https://www.bioconductor.org/packages/release/bioc/html/Rarr.html):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@Artur-man
Copy link

Errors I am getting at lint and pkgdown GHAs look unrelated to the package:

Error in dyn.load(file, DLLpath = DLLpath, ...) : 
    unable to load shared object '/home/runner/work/_temp/Library/Rhdf5lib/libs/Rhdf5lib.so':
    libsz.so.2: cannot open shared object file: No such file or directory

@Artur-man
Copy link

Artur-man commented Dec 21, 2025

@Bisaloo also the tests that are passing for macos, ubuntu and windows are failing for macos (intel) and ubuntu-bioc-release. I understand the error in release (for obvious reasons), but macos-15 I don't know. Any ideas ?

Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
   5.         └─base::array(metadata$fill_value, dim = lengths(index))
  ── Error ('test-Zarr-read.R:87:3'): reading numeric scalars works ──────────────
  Error in `array(metadata$fill_value, dim = lengths(index))`: 'dim' cannot be of length 0
  Backtrace:
      ▆
   1. └─anndataR:::read_zarr_numeric_scalar(store, "uns/IntScalar") at test-Zarr-read.R:87:3
   2.   └─anndataR:::read_zarr_array(store, name)
   3.     └─Rarr::read_zarr_array(file.path(store, name))
   4.       └─Rarr:::read_data(...)
   5.         └─base::array(metadata$fill_value, dim = lengths(index))
  
  [ FAIL 4 | WARN 447 | SKIP 12 | PASS 1842 ]

https://github.com/scverse/anndataR/actions/runs/20411185095/job/58648259569?pr=190

@Bisaloo
Copy link

Bisaloo commented Dec 21, 2025

Yes, it looks that on this runner, Rarr 1.11.9 is installed, when support for reading scalars was added in Rarr 1.11.12. I think it's related to BioC devel not having more recent binaries for macOS x86 yet.

Maybe try setting:

Imports:
    Rarr (>= 1.11.12)

In DESCRIPTION to force installation from source?

This will not work for the runner using BioC release but it's already failing anyways.

@Artur-man
Copy link

I see. Then I will assume all checks are passing for now. thanks man

@Artur-man
Copy link

Artur-man commented Dec 21, 2025

Current issues:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement Zarr backend

5 participants