-
Notifications
You must be signed in to change notification settings - Fork 0
Read climate data from local files #4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Update Readme
0154e19 to
a1ca1ea
Compare
Mod global state checker for Windows Remove check on no conditions
5588b95 to
e52dbfd
Compare
richfitz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall comments here - happy to discuss in person if you prefer as some are fairly large changes in direction.
Can you please also set the repo up to do a merge based flow - I would like this project to match how other VIMC projects tend to work as I want and expect contributions from the science side
R/get_vimc_climate.R
Outdated
| ), | ||
| admin_level = c(0, 1, 2, 3) | ||
| ) { | ||
| # more checks below |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please consider using the existing utilities we have for this in reside.utils - we use them everywhere else and they are IMO much more compact than most of these checkmate-then-throw versions. They're easy enough to expand if you need more.
usethis::use_standalone("reside-ic/reside.utils", "utils-assert")
https://github.com/reside-ic/reside.utils/
That said, we don't have date handling in there (yet) I think but they'd be good to add - in the same style. Perhaps start by writing similar assertions in this package, using lower-level ones there as required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nice final state of this would probably look like
country <- validate_country(country)
date_range <- validate_date_range(date_range)
admin_level <- validate_admin_level(admin_level)
target_file <- locate_parquet_file(data_location, country, admin_level)
...then read/process
R/get_vimc_climate.R
Outdated
| nchar_country <- as.character(nchar(country)) | ||
| country_iso3c <- switch( | ||
| nchar_country, | ||
| "2" = countrycode::countrycode( | ||
| country, | ||
| "iso2c", | ||
| "iso3c", | ||
| warn = FALSE, | ||
| nomatch = NA | ||
| ), | ||
| "3" = countrycode::countrycode( | ||
| country, | ||
| "iso3c", | ||
| "iso3c", | ||
| warn = FALSE, | ||
| nomatch = NA | ||
| ), | ||
| countrycode::countryname( | ||
| country, | ||
| "iso3c", | ||
| warn = FALSE, | ||
| nomatch = NA | ||
| ) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move all this into a validate_country_code function or similar. Please use an if/else ladder rather than switch with a conversion from integer to character, the semantics are not obvious
R/get_vimc_climate.R
Outdated
|
|
||
| # read files and return data.frame | ||
| data <- arrow::read_parquet(target_file) | ||
| data <- as.data.frame(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why convert back to df here? Surely tibble or Arrow Table would be ok here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a known format (to users), and avoids a dependency on {tibble}. No strong feelings either way if expected end users are fine with Arrow tables, or if they'd prefer tibbles.
R/get_vimc_climate.R
Outdated
| data$Date <- lubridate::as_date(data$Date) | ||
|
|
||
| data[ | ||
| lubridate::`%within%`( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't use infix operators as infix - really lubridate should provide a prefix form of this, but I think it would be less alarming to use @importFrom above this function probably
That said, it might be easier just to use
data$Date >= min(date_range) & data$Date <= max(date_range)
which would make more explicit the inclusiveness (or not) of date ranges
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I liked the interval feature and it would be nice to use, but can switch to simpler inequality testing.
| ```r | ||
| installation from R-universe | ||
| install.packages( | ||
| "vimclimate", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/vimc/vimc.r-universe.dev - can you make a PR here to add this, I think it can be added immediately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
README.Rmd
Outdated
| # ) | ||
| ``` r | ||
| install.packages("pak") | ||
| pak::pak("vimc/vimclimate") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the past, we have found that remotes leaves Windows libraries in a corrupt state quite regularly - I don't think many Rstudio folks actually develop on windows. Do you know if pak is better in this regard? My (fairly strong) preference is just to point people at r-universe as we get very few complaints of things not working from there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No idea really. I can remove the {pak} instructions and point to r-universe.
R/get_vimc_climate.R
Outdated
| #' @return A `<data.frame>` of climate data. | ||
| #' | ||
| #' @export | ||
| get_vimc_climate <- function( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a real dislike of function names starting with get_ - it seems both too much of an implementation detail (like pointless getters and setters in OOP) and too little - is this loading from a local file or is it fetching from a remote resource. It would be nice to sketch out somewhere the possible total package API in a ticket so that we get a nice set of functions that hint at their functionality
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this function is really a "load" or "read" function, but sketching the whole interface will make this clearer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a load function for now, but the idea is to have a single function that can access different types of locations, both local dirs and packit. Is there something specific to rename this to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've renamed this to the longer and more descriptive load_local_vimc_climate().
R/get_vimc_climate.R
Outdated
| @@ -0,0 +1,150 @@ | |||
| #' Get climate data Parquets prepared for VIMC | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the use of Parquets here might be confusing in the context where packets are also something that we might end up referring to. It's also an implementation detail that they are stored as parquet files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed the doc text to be more accurate.
R/get_vimc_climate.R
Outdated
| #' @param country A string for the country name, or for the country ISO 2- or 3- | ||
| #' character code. E.g. one of "Canada", "CA", or "CAN". Input is checked | ||
| #' against names and codes in \pkg{countrycode}. | ||
| #' @param date_range A two element vector of either strings or `<Date>`s, giving |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please put newlines between @param entries so they're easier to scan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
These all seem do-able changes. I'll try and add some documentation to {reside.utils} as well.
I think it already is, or it's been changed to allow merges. I've allowed non-linear histories on |
|
I've restructured a good bit - just pop in any comments if there's something I haven't understood. |
richfitz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - this is looking good.
A couple of minor suggestions here.
I think it already is, or it's been changed to allow merges. I've allowed non-linear histories on main as well if that helps, but ideally folks would at least squash if they have a messy PR.
The PR is still saying it wants to do a squash and merge, so does something else need changing?
R/get_vimc_climate.R
Outdated
| data_source = c( | ||
| "CHIRPS", | ||
| "ERA5_mean", | ||
| "ERA5_min", | ||
| "ERA5_max", | ||
| "ERA5_RH", | ||
| "ERA5_SH", | ||
| "PERSIANN" | ||
| ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine for now, but we should be getting this information back from whatever we do for the data catalogue I think. I'm also not really a fan of R's argument-default-as-first-in-a-vector (I get that it's theoretically self-documenting, but it makes for a messy signature and it's really ambiguous about types)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the existence of data_source_names, I'd use our usual pattern of data_source = NULL as the argument, then
match_value(data_source, names(data_source_names))as otherwise the data here is split across two places
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should be getting this information back from whatever we do for the data catalogue
I don't agree with any of the rest apart from this, but sure I can default to NULL with the pattern.
R/get_vimc_climate.R
Outdated
| load_local_vimc_climate <- function( | ||
| country, | ||
| date_range, | ||
| data_location = here::here(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really a sensible default? For now, perhaps require an explicit path and then we can add magic later once we know how things will work
R/locate_data.R
Outdated
| checkmate::assert_directory_exists( | ||
| data_location | ||
| ) | ||
| data_location <- file.path(data_location, country_iso3c) | ||
| checkmate::assert_directory_exists( | ||
| data_location | ||
| ) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use assert_file_exists and assert_is_directory here, or update reside.utils to offer assert_directory_exists by wrapping these together if you want
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll write a package specific validator for the data location. I think assert_is_directory satisfies the use case here so don't think it's worth adding a new assertion function.
R/locate_data.R
Outdated
| target_file <- files[grepl(search_pattern, files)] | ||
|
|
||
| # check that file is found | ||
| if (identical(target_file, character(0))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if (identical(target_file, character(0))) { | |
| if (length(target_file) == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll throw in an assertion before this that makes this unnecessary. Added in a commit.
R/locate_data.R
Outdated
| ) | ||
|
|
||
| files <- list.files(data_location, full.names = TRUE) | ||
| data_source_id <- data_source_names[data_source] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| data_source_id <- data_source_names[data_source] | |
| data_source_id <- data_source_names[[data_source]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in a commit.
| search_pattern <- sprintf( | ||
| "%s_admin_%s.parquet", | ||
| data_source_id, | ||
| admin_level | ||
| ) | ||
|
|
||
| # search for files | ||
| target_file <- files[grepl(search_pattern, files)] | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason here not to use the pattern argument to list.files?
pattern <- sprintf("*%s_admin_%s.parquet", data_source_id, admin_level)
target_file <- llist.files(data_location, pattern = pattern)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've simplified this to use the search pattern. I think initially I was expecting to use some information on any files found vs those matching the search.
| ) | ||
|
|
||
| # data source not included | ||
| expect_error( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please always include an error message with expect_error() - we're testing our errors here so that's fine.
It might only apply to future PRs - I can try closing and reopening this PR and check, but the repo settings are set up to allow merges. |
|
that makes sense - don't worry about it for this one |
This is an initial trial-balloon PR for the general user-facing side of {vimclimate}.
There's a single function,
get_vimc_climate()that accepts similar arguments to the data preparation function in clim2parquet.Eventually, the
data_locationargument will point to a remote repository which can be passed by the user (with some sensible default if we know where the data is).I'm copying the fix in {hubutils}: hubverse-org/hubValidations#74 to avoid CI failing on Windows; this is because the {tzdb} package is in Suggests for {arrow}, see more in apache/arrow#40073 (comment).