Conversation

@pratikunterwegs (Contributor) commented Apr 14, 2025

This is an initial trial-balloon PR for the general user-facing side of {vimclimate}.

There's a single function, get_vimc_climate(), which accepts arguments similar to those of the data-preparation function in clim2parquet.

Eventually, the data_location argument will point to a remote repository which can be passed by the user (with some sensible default if we know where the data is).

I'm copying the fix in {hubutils} (hubverse-org/hubValidations#74) to avoid CI failing on Windows; this is because the {tzdb} package is only in Suggests for {arrow} - see more in apache/arrow#40073.

Mod global state checker for Windows

Remove check on no conditions
@pratikunterwegs pratikunterwegs marked this pull request as ready for review April 14, 2025 14:53
@pratikunterwegs pratikunterwegs changed the title Read and serve climate data from local files Read climate data from local files Apr 14, 2025
@richfitz (Member) left a comment:

Overall comments here - happy to discuss in person if you prefer as some are fairly large changes in direction.

Can you please also set the repo up to use a merge-based flow? I would like this project to match how other VIMC projects tend to work, as I want and expect contributions from the science side.

),
admin_level = c(0, 1, 2, 3)
) {
# more checks below
@richfitz (Member):

Can you please consider using the existing utilities we have for this in reside.utils - we use them everywhere else, and they are IMO much more compact than most of these checkmate-then-throw versions. They're easy enough to expand if you need more.

usethis::use_standalone("reside-ic/reside.utils", "utils-assert")

https://github.com/reside-ic/reside.utils/

That said, I don't think we have date handling in there (yet), but it would be good to add, in the same style. Perhaps start by writing similar assertions in this package, using lower-level ones there as required.

@richfitz (Member):

A nice final state of this would probably look like:

```r
country <- validate_country(country)
date_range <- validate_date_range(date_range)
admin_level <- validate_admin_level(admin_level)
target_file <- locate_parquet_file(data_location, country, admin_level)
# ...then read/process
```

Comment on lines 77 to 100
nchar_country <- as.character(nchar(country))
country_iso3c <- switch(
nchar_country,
"2" = countrycode::countrycode(
country,
"iso2c",
"iso3c",
warn = FALSE,
nomatch = NA
),
"3" = countrycode::countrycode(
country,
"iso3c",
"iso3c",
warn = FALSE,
nomatch = NA
),
countrycode::countryname(
country,
"iso3c",
warn = FALSE,
nomatch = NA
)
)
@richfitz (Member):

Move all this into a validate_country_code() function or similar. Please use an if/else ladder rather than a switch() on the character conversion of an integer; the semantics are not obvious.
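The suggested refactor could look roughly like the sketch below; the helper name validate_country_code() and the error text are assumptions, while the countrycode calls mirror those in the diff above.

```r
# Hedged sketch of the suggested if/else ladder; the function name and
# error message are assumptions, not the final implementation.
validate_country_code <- function(country) {
  n <- nchar(country)
  if (n == 2L) {
    # 2-letter input: treat as ISO2 and convert
    iso3c <- countrycode::countrycode(
      country, "iso2c", "iso3c", warn = FALSE, nomatch = NA
    )
  } else if (n == 3L) {
    # 3-letter input: validate as ISO3 (round-trips to itself if valid)
    iso3c <- countrycode::countrycode(
      country, "iso3c", "iso3c", warn = FALSE, nomatch = NA
    )
  } else {
    # anything else: try matching a country name
    iso3c <- countrycode::countryname(
      country, "iso3c", warn = FALSE, nomatch = NA
    )
  }
  if (is.na(iso3c)) {
    stop(sprintf("Could not match '%s' to an ISO3 country code", country))
  }
  iso3c
}
```

The ladder makes the dispatch on input length explicit, and a single NA check at the end covers all three branches.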


# read files and return data.frame
data <- arrow::read_parquet(target_file)
data <- as.data.frame(data)
@richfitz (Member):

Why convert back to a data.frame here? Surely a tibble or an Arrow Table would be OK?

@pratikunterwegs (Contributor, Author) Apr 16, 2025:

It's a known format (to users), and avoids a dependency on {tibble}. No strong feelings either way if expected end users are fine with Arrow tables, or if they'd prefer tibbles.

data$Date <- lubridate::as_date(data$Date)

data[
lubridate::`%within%`(
@richfitz (Member):

Please don't call infix operators in prefix form - really, lubridate should provide a proper prefix form of this, but I think it would probably be less alarming to use @importFrom above this function.

That said, it might be easier just to use

```r
data$Date >= min(date_range) & data$Date <= max(date_range)
```

which would make the inclusiveness (or not) of date ranges more explicit.
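A minimal base-R illustration of the inequality version (the data frame here is made up for demonstration):

```r
# toy data standing in for the parquet contents
data <- data.frame(
  Date  = as.Date(c("2019-12-31", "2020-01-01", "2020-06-15", "2021-01-01")),
  value = 1:4
)
date_range <- as.Date(c("2020-01-01", "2020-12-31"))

# inclusive on both endpoints, and robust to the order in which
# the user supplied the two dates
keep <- data$Date >= min(date_range) & data$Date <= max(date_range)
data[keep, ]  # keeps the 2020-01-01 and 2020-06-15 rows
```

Using min()/max() means a reversed date_range still selects the same rows.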

@pratikunterwegs (Contributor, Author):

Sure. I liked the interval feature and it would have been nice to use, but I can switch to simpler inequality testing.

```r
# installation from R-universe
install.packages(
  "vimclimate",
```
@richfitz (Member):

https://github.com/vimc/vimc.r-universe.dev - can you make a PR there to add this? I think it can be added immediately.

@pratikunterwegs (Contributor, Author):

Done.

README.Rmd Outdated
# )

``` r
install.packages("pak")
pak::pak("vimc/vimclimate")
```
@richfitz (Member):

In the past, we have found that remotes quite regularly leaves Windows libraries in a corrupt state - I don't think many RStudio folks actually develop on Windows. Do you know if pak is better in this regard? My (fairly strong) preference is just to point people at r-universe, as we get very few complaints of things not working from there.

@pratikunterwegs (Contributor, Author):

No idea really. I can remove the {pak} instructions and point to r-universe.

#' @return A `<data.frame>` of climate data.
#'
#' @export
get_vimc_climate <- function(
@richfitz (Member):

I have a real dislike of function names starting with get_ - it seems both too much of an implementation detail (like pointless getters and setters in OOP) and too little: is this loading from a local file, or is it fetching from a remote resource? It would be nice to sketch out the possible total package API somewhere in a ticket, so that we get a nice set of functions that hint at their functionality.

@richfitz (Member):

I think that this function is really a "load" or "read" function, but sketching the whole interface will make this clearer

@pratikunterwegs (Contributor, Author):

It's a load function for now, but the idea is to have a single function that can access different types of locations, both local directories and packit. Is there something specific to rename this to?

@pratikunterwegs (Contributor, Author):

I've renamed this to the longer and more descriptive load_local_vimc_climate().

@@ -0,0 +1,150 @@
#' Get climate data Parquets prepared for VIMC
@richfitz (Member):

I think the use of Parquets here might be confusing in the context where packets are also something that we might end up referring to. It's also an implementation detail that they are stored as parquet files

@pratikunterwegs (Contributor, Author):

I've changed the doc text to be more accurate.

#' @param country A string for the country name, or for the country ISO 2- or 3-
#' character code. E.g. one of "Canada", "CA", or "CAN". Input is checked
#' against names and codes in \pkg{countrycode}.
#' @param date_range A two element vector of either strings or `<Date>`s, giving
@richfitz (Member):

Can you please put newlines between @param entries so they're easier to scan

@pratikunterwegs (Contributor, Author):

Done.

@pratikunterwegs (Contributor, Author):

Overall comments here - happy to discuss in person if you prefer as some are fairly large changes in direction.

These all seem doable changes. I'll try to add some documentation to {reside.utils} as well.

Can you please also set the repo up to do a merge based flow - I would like this project to match how other VIMC projects tend to work as I want and expect contributions from the science side.

I think it already is, or it's been changed to allow merges. I've allowed non-linear histories on main as well if that helps, but ideally folks would at least squash if they have a messy PR.

@pratikunterwegs (Contributor, Author):

I've restructured a good bit - just pop in any comments if there's something I haven't understood.

@richfitz (Member) left a comment:

Thanks - this is looking good.

A couple of minor suggestions here.

I think it already is, or it's been changed to allow merges. I've allowed non-linear histories on main as well if that helps, but ideally folks would at least squash if they have a messy PR.

The PR is still saying it wants to do a squash and merge, so does something else need changing?

Comment on lines 34 to 42
data_source = c(
"CHIRPS",
"ERA5_mean",
"ERA5_min",
"ERA5_max",
"ERA5_RH",
"ERA5_SH",
"PERSIANN"
),
@richfitz (Member):

This is fine for now, but we should be getting this information back from whatever we do for the data catalogue, I think. I'm also not really a fan of R's argument-default-as-first-element-of-a-vector pattern (I get that it's theoretically self-documenting, but it makes for a messy signature and is really ambiguous about types).

@richfitz (Member):

Given the existence of data_source_names, I'd use our usual pattern of data_source = NULL as the argument, then

```r
match_value(data_source, names(data_source_names))
```

as otherwise the data here is split across two places.
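For illustration, a minimal stand-in for that pattern; the real match_value() lives in reside.utils and may differ in signature and error wording, and the name/id pairs here are invented:

```r
# illustrative lookup; the real names/ids live in the package
data_source_names <- c(CHIRPS = "chirps", ERA5_mean = "era5-mean")

# minimal stand-in for a match_value()-style helper: NULL picks the
# first choice, anything else must be exactly one of the choices
match_value <- function(value, choices, name = deparse(substitute(value))) {
  if (is.null(value)) {
    return(choices[[1L]])
  }
  if (!(is.character(value) && length(value) == 1L && value %in% choices)) {
    stop(sprintf("'%s' must be one of: %s",
                 name, paste(choices, collapse = ", ")))
  }
  value
}

data_source <- NULL
match_value(data_source, names(data_source_names))  # "CHIRPS"
```

Keeping the valid names in one named vector means the signature stays clean and the name-to-id mapping lives in a single place.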

@pratikunterwegs (Contributor, Author):

we should be getting this information back from whatever we do for the data catalogue

I don't agree with any of the rest apart from this, but sure, I can default to NULL with that pattern.

load_local_vimc_climate <- function(
country,
date_range,
data_location = here::here(),
@richfitz (Member):

Is this really a sensible default? For now, perhaps require an explicit path, and then we can add magic later once we know how things will work.

R/locate_data.R Outdated
Comment on lines 20 to 27
checkmate::assert_directory_exists(
data_location
)
data_location <- file.path(data_location, country_iso3c)
checkmate::assert_directory_exists(
data_location
)

@richfitz (Member):

You can use assert_file_exists and assert_is_directory here, or update reside.utils to offer assert_directory_exists by wrapping these together if you want

@pratikunterwegs (Contributor, Author) Apr 23, 2025:

I'll write a package-specific validator for the data location. I think assert_is_directory satisfies the use case here, so I don't think it's worth adding a new assertion function.
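A base-R sketch of what such a package-specific validator might look like; the function name, behaviour, and messages are assumptions (the real version would use the reside.utils assertions instead of dir.exists()):

```r
# hypothetical validator: checks the data root and the per-country
# subdirectory, returning the resolved country path
validate_data_location <- function(data_location, country_iso3c) {
  if (!dir.exists(data_location)) {
    stop(sprintf("'%s' is not an existing directory", data_location))
  }
  country_dir <- file.path(data_location, country_iso3c)
  if (!dir.exists(country_dir)) {
    stop(sprintf("No data for '%s' under '%s'", country_iso3c, data_location))
  }
  country_dir
}
```

Returning the joined path keeps the caller from repeating the file.path() construction.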

R/locate_data.R Outdated
target_file <- files[grepl(search_pattern, files)]

# check that file is found
if (identical(target_file, character(0))) {
@richfitz (Member):

Suggested change:

```diff
-if (identical(target_file, character(0))) {
+if (length(target_file) == 0) {
```

@pratikunterwegs (Contributor, Author) Apr 23, 2025:

I'll throw in an assertion before this that makes this unnecessary. Added in a commit.

R/locate_data.R Outdated
)

files <- list.files(data_location, full.names = TRUE)
data_source_id <- data_source_names[data_source]
@richfitz (Member):

Suggested change:

```diff
-data_source_id <- data_source_names[data_source]
+data_source_id <- data_source_names[[data_source]]
```

@pratikunterwegs (Contributor, Author):

Added in a commit.

Comment on lines 30 to 38
search_pattern <- sprintf(
"%s_admin_%s.parquet",
data_source_id,
admin_level
)

# search for files
target_file <- files[grepl(search_pattern, files)]

@richfitz (Member):

Any reason here not to use the pattern argument to list.files?

```r
pattern <- sprintf("%s_admin_%s\\.parquet$", data_source_id, admin_level)
target_file <- list.files(data_location, pattern = pattern)
```

@pratikunterwegs (Contributor, Author) Apr 23, 2025:

I've simplified this to use the search pattern. I think I initially expected to use some information about the files found versus those matching the search.
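The pattern-based search can be exercised end to end with temporary files; the file names and data-source id below are invented for the demonstration:

```r
# set up a throwaway directory with two matching files and one decoy
data_location <- file.path(tempdir(), "vimclimate-demo")
dir.create(data_location, showWarnings = FALSE)
file.create(file.path(data_location, c(
  "chirps_admin_1.parquet", "chirps_admin_2.parquet", "notes.txt"
)))

data_source_id <- "chirps"
admin_level <- 1

# escape the dot and anchor the end so "notes.txt" cannot match;
# list.files() treats pattern as a regular expression, not a glob
pattern <- sprintf("%s_admin_%s\\.parquet$", data_source_id, admin_level)
target_file <- list.files(data_location, pattern = pattern, full.names = TRUE)
basename(target_file)  # "chirps_admin_1.parquet"
```

Anchoring with `$` also prevents an admin level of 1 from matching files for levels 10 or 11, should those ever exist.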

)

# data source not included
expect_error(
@richfitz (Member):

Please always include an expected error message with expect_error() - we're testing our own errors here, so that's fine.
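For example, passing a regexp as the second argument makes the test assert on the message rather than on any error at all; the call and the expected message here are hypothetical:

```r
# asserting on the message, not just that *some* error is thrown
testthat::expect_error(
  load_local_vimc_climate(
    country = "Canada",
    date_range = c("2020-01-01", "2020-12-31"),
    data_location = "no/such/dir"
  ),
  "is not an existing directory"
)
```

Without the regexp, a typo elsewhere (say, a misspelled argument name) would also raise an error and the test would silently pass for the wrong reason.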

@pratikunterwegs (Contributor, Author):

The PR is still saying it wants to do a squash and merge, so does something else need changing?

It might only apply to future PRs - I can try closing and reopening this PR to check, but the repo settings are set up to allow merges.

@richfitz (Member):

That makes sense - don't worry about it for this one.

@pratikunterwegs pratikunterwegs merged commit 3105a9a into main Apr 25, 2025
8 checks passed