Conversation

@pratikunterwegs (Contributor) commented Apr 14, 2025

This is an initial trial-balloon PR for the general user-facing side of {vimclimate}.

There's a single function, get_vimc_climate(), which accepts arguments similar to those of the data-preparation function in clim2parquet.

Eventually, the data_location argument will point to a remote repository which can be passed by the user (with some sensible default if we know where the data is).

I'm copying the fix in {hubutils} (hubverse-org/hubValidations#74) to avoid CI failing on Windows; this is because the {tzdb} package is only in Suggests for {arrow} - see more in apache/arrow#40073.

Mod global state checker for Windows

Remove check on no conditions
@pratikunterwegs pratikunterwegs marked this pull request as ready for review April 14, 2025 14:53
@pratikunterwegs pratikunterwegs changed the title Read and serve climate data from local files Read climate data from local files Apr 14, 2025
@richfitz (Member) left a comment:

Overall comments here - happy to discuss in person if you prefer as some are fairly large changes in direction.

Can you please also set the repo up to use a merge-based flow? I would like this project to match how other VIMC projects tend to work, as I want and expect contributions from the science side.

),
admin_level = c(0, 1, 2, 3)
) {
# more checks below
@richfitz (Member):

Can you please consider using the existing utilities we have for this in reside.utils - we use them everywhere else, and they are IMO much more compact than most of these checkmate-then-throw versions. They're easy enough to expand if you need more.

usethis::use_standalone("reside-ic/reside.utils", "utils-assert")

https://github.com/reside-ic/reside.utils/

That said, I don't think we have date handling in there (yet), but it would be good to add, in the same style. Perhaps start by writing similar assertions in this package, using lower-level ones there as required.

@richfitz (Member):

A nice final state of this would probably look like:

```r
country <- validate_country(country)
date_range <- validate_date_range(date_range)
admin_level <- validate_admin_level(admin_level)
target_file <- locate_parquet_file(data_location, country, admin_level)
# ...then read/process
```

Comment on lines 77 to 100
nchar_country <- as.character(nchar(country))
country_iso3c <- switch(
nchar_country,
"2" = countrycode::countrycode(
country,
"iso2c",
"iso3c",
warn = FALSE,
nomatch = NA
),
"3" = countrycode::countrycode(
country,
"iso3c",
"iso3c",
warn = FALSE,
nomatch = NA
),
countrycode::countryname(
country,
"iso3c",
warn = FALSE,
nomatch = NA
)
)
@richfitz (Member):

Move all this into a validate_country_code() function or similar. Please use an if/else ladder rather than a switch() on the character conversion of an integer; the semantics are not obvious.
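The suggested refactor could look roughly like the sketch below; the helper name validate_country_code() and the error text are assumptions, while the countrycode calls mirror those in the diff above.

```r
# Hedged sketch of the suggested if/else ladder; the function name and
# error message are assumptions, not the final implementation.
validate_country_code <- function(country) {
  n <- nchar(country)
  if (n == 2L) {
    # 2-letter input: treat as ISO2 and convert
    iso3c <- countrycode::countrycode(
      country, "iso2c", "iso3c", warn = FALSE, nomatch = NA
    )
  } else if (n == 3L) {
    # 3-letter input: validate as ISO3 (round-trips to itself if valid)
    iso3c <- countrycode::countrycode(
      country, "iso3c", "iso3c", warn = FALSE, nomatch = NA
    )
  } else {
    # anything else: try matching a country name
    iso3c <- countrycode::countryname(
      country, "iso3c", warn = FALSE, nomatch = NA
    )
  }
  if (is.na(iso3c)) {
    stop(sprintf("Could not match '%s' to an ISO3 country code", country))
  }
  iso3c
}
```

The ladder makes the dispatch on input length explicit, and a single NA check at the end covers all three branches.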


# read files and return data.frame
data <- arrow::read_parquet(target_file)
data <- as.data.frame(data)
@richfitz (Member):

Why convert back to a data.frame here? Surely a tibble or an Arrow Table would be OK?

@pratikunterwegs (Contributor, Author) Apr 16, 2025:

It's a known format (to users), and avoids a dependency on {tibble}. No strong feelings either way if expected end users are fine with Arrow tables, or if they'd prefer tibbles.

data$Date <- lubridate::as_date(data$Date)

data[
lubridate::`%within%`(
@richfitz (Member):

Please don't call infix operators in prefix form - really, lubridate should provide a proper prefix form of this, but I think it would probably be less alarming to use @importFrom above this function.

That said, it might be easier just to use

```r
data$Date >= min(date_range) & data$Date <= max(date_range)
```

which would make the inclusiveness (or not) of date ranges more explicit.
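A minimal base-R illustration of the inequality version (the data frame here is made up for demonstration):

```r
# toy data standing in for the parquet contents
data <- data.frame(
  Date  = as.Date(c("2019-12-31", "2020-01-01", "2020-06-15", "2021-01-01")),
  value = 1:4
)
date_range <- as.Date(c("2020-01-01", "2020-12-31"))

# inclusive on both endpoints, and robust to the order in which
# the user supplied the two dates
keep <- data$Date >= min(date_range) & data$Date <= max(date_range)
data[keep, ]  # keeps the 2020-01-01 and 2020-06-15 rows
```

Using min()/max() means a reversed date_range still selects the same rows.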

@pratikunterwegs (Contributor, Author):

Sure. I liked the interval feature and it would have been nice to use, but I can switch to simpler inequality testing.

```r
# installation from R-universe
install.packages(
  "vimclimate",
```
@richfitz (Member):

https://github.com/vimc/vimc.r-universe.dev - can you make a PR there to add this? I think it can be added immediately.

@pratikunterwegs (Contributor, Author):

Done.

README.Rmd Outdated
# )

``` r
install.packages("pak")
pak::pak("vimc/vimclimate")
```
@richfitz (Member):

In the past, we have found that remotes quite regularly leaves Windows libraries in a corrupt state - I don't think many RStudio folks actually develop on Windows. Do you know if pak is better in this regard? My (fairly strong) preference is just to point people at r-universe, as we get very few complaints of things not working from there.

@pratikunterwegs (Contributor, Author):

No idea really. I can remove the {pak} instructions and point to r-universe.

#' @return A `<data.frame>` of climate data.
#'
#' @export
get_vimc_climate <- function(
@richfitz (Member):

I have a real dislike of function names starting with get_ - it seems both too much of an implementation detail (like pointless getters and setters in OOP) and too little: is this loading from a local file, or is it fetching from a remote resource? It would be nice to sketch out the possible total package API somewhere in a ticket, so that we get a nice set of functions that hint at their functionality.

@richfitz (Member):

I think that this function is really a "load" or "read" function, but sketching the whole interface will make this clearer

@pratikunterwegs (Contributor, Author):

It's a load function for now, but the idea is to have a single function that can access different types of locations, both local directories and packit. Is there something specific to rename this to?

@pratikunterwegs (Contributor, Author):

I've renamed this to the longer and more descriptive load_local_vimc_climate().

@@ -0,0 +1,150 @@
#' Get climate data Parquets prepared for VIMC
@richfitz (Member):

I think the use of Parquets here might be confusing in the context where packets are also something that we might end up referring to. It's also an implementation detail that they are stored as parquet files

@pratikunterwegs (Contributor, Author):

I've changed the doc text to be more accurate.

#' @param country A string for the country name, or for the country ISO 2- or 3-
#' character code. E.g. one of "Canada", "CA", or "CAN". Input is checked
#' against names and codes in \pkg{countrycode}.
#' @param date_range A two element vector of either strings or `<Date>`s, giving
@richfitz (Member):

Can you please put newlines between @param entries so they're easier to scan

@pratikunterwegs (Contributor, Author):

Done.

@pratikunterwegs (Contributor, Author):

Overall comments here - happy to discuss in person if you prefer as some are fairly large changes in direction.

These all seem doable changes. I'll try to add some documentation to {reside.utils} as well.

Can you please also set the repo up to do a merge based flow - I would like this project to match how other VIMC projects tend to work as I want and expect contributions from the science side.

I think it already is, or it's been changed to allow merges. I've allowed non-linear histories on main as well if that helps, but ideally folks would at least squash if they have a messy PR.

@pratikunterwegs (Contributor, Author):

I've restructured a good bit - just pop in any comments if there's something I haven't understood.

@richfitz (Member) left a comment:

Thanks - this is looking good.

A couple of minor suggestions here.

I think it already is, or it's been changed to allow merges. I've allowed non-linear histories on main as well if that helps, but ideally folks would at least squash if they have a messy PR.

The PR is still saying it wants to do a squash and merge, so does something else need changing?

Comment on lines 34 to 42
data_source = c(
"CHIRPS",
"ERA5_mean",
"ERA5_min",
"ERA5_max",
"ERA5_RH",
"ERA5_SH",
"PERSIANN"
),
@richfitz (Member):

This is fine for now, but we should be getting this information back from whatever we do for the data catalogue, I think. I'm also not really a fan of R's argument-default-as-first-element-of-a-vector pattern (I get that it's theoretically self-documenting, but it makes for a messy signature and is really ambiguous about types).

@richfitz (Member):

Given the existence of data_source_names, I'd use our usual pattern of data_source = NULL as the argument, then

```r
match_value(data_source, names(data_source_names))
```

as otherwise the data here is split across two places.
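For illustration, a minimal stand-in for that pattern; the real match_value() lives in reside.utils and may differ in signature and error wording, and the name/id pairs here are invented:

```r
# illustrative lookup; the real names/ids live in the package
data_source_names <- c(CHIRPS = "chirps", ERA5_mean = "era5-mean")

# minimal stand-in for a match_value()-style helper: NULL picks the
# first choice, anything else must be exactly one of the choices
match_value <- function(value, choices, name = deparse(substitute(value))) {
  if (is.null(value)) {
    return(choices[[1L]])
  }
  if (!(is.character(value) && length(value) == 1L && value %in% choices)) {
    stop(sprintf("'%s' must be one of: %s",
                 name, paste(choices, collapse = ", ")))
  }
  value
}

data_source <- NULL
match_value(data_source, names(data_source_names))  # "CHIRPS"
```

Keeping the valid names in one named vector means the signature stays clean and the name-to-id mapping lives in a single place.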

@pratikunterwegs (Contributor, Author):

we should be getting this information back from whatever we do for the data catalogue

I don't agree with any of the rest apart from this, but sure, I can default to NULL with that pattern.

load_local_vimc_climate <- function(
country,
date_range,
data_location = here::here(),
@richfitz (Member):

Is this really a sensible default? For now, perhaps require an explicit path, and then we can add magic later once we know how things will work.

R/locate_data.R Outdated
Comment on lines 20 to 27
checkmate::assert_directory_exists(
data_location
)
data_location <- file.path(data_location, country_iso3c)
checkmate::assert_directory_exists(
data_location
)

@richfitz (Member):

You can use assert_file_exists and assert_is_directory here, or update reside.utils to offer assert_directory_exists by wrapping these together if you want

@pratikunterwegs (Contributor, Author) Apr 23, 2025:

I'll write a package-specific validator for the data location. I think assert_is_directory satisfies the use case here, so I don't think it's worth adding a new assertion function.
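A base-R sketch of what such a package-specific validator might look like; the function name, behaviour, and messages are assumptions (the real version would use the reside.utils assertions instead of dir.exists()):

```r
# hypothetical validator: checks the data root and the per-country
# subdirectory, returning the resolved country path
validate_data_location <- function(data_location, country_iso3c) {
  if (!dir.exists(data_location)) {
    stop(sprintf("'%s' is not an existing directory", data_location))
  }
  country_dir <- file.path(data_location, country_iso3c)
  if (!dir.exists(country_dir)) {
    stop(sprintf("No data for '%s' under '%s'", country_iso3c, data_location))
  }
  country_dir
}
```

Returning the joined path keeps the caller from repeating the file.path() construction.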

R/locate_data.R Outdated
target_file <- files[grepl(search_pattern, files)]

# check that file is found
if (identical(target_file, character(0))) {
@richfitz (Member):

Suggested change:

```diff
-if (identical(target_file, character(0))) {
+if (length(target_file) == 0) {
```

@pratikunterwegs (Contributor, Author) Apr 23, 2025:

I'll throw in an assertion before this that makes this unnecessary. Added in a commit.

R/locate_data.R Outdated
)

files <- list.files(data_location, full.names = TRUE)
data_source_id <- data_source_names[data_source]
@richfitz (Member):

Suggested change:

```diff
-data_source_id <- data_source_names[data_source]
+data_source_id <- data_source_names[[data_source]]
```

@pratikunterwegs (Contributor, Author):

Added in a commit.

Comment on lines 30 to 38
search_pattern <- sprintf(
"%s_admin_%s.parquet",
data_source_id,
admin_level
)

# search for files
target_file <- files[grepl(search_pattern, files)]

@richfitz (Member):

Any reason here not to use the pattern argument to list.files?

```r
pattern <- sprintf("%s_admin_%s\\.parquet$", data_source_id, admin_level)
target_file <- list.files(data_location, pattern = pattern)
```

@pratikunterwegs (Contributor, Author) Apr 23, 2025:

I've simplified this to use the search pattern. I think I initially expected to use some information about the files found versus those matching the search.
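The pattern-based search can be exercised end to end with temporary files; the file names and data-source id below are invented for the demonstration:

```r
# set up a throwaway directory with two matching files and one decoy
data_location <- file.path(tempdir(), "vimclimate-demo")
dir.create(data_location, showWarnings = FALSE)
file.create(file.path(data_location, c(
  "chirps_admin_1.parquet", "chirps_admin_2.parquet", "notes.txt"
)))

data_source_id <- "chirps"
admin_level <- 1

# escape the dot and anchor the end so "notes.txt" cannot match;
# list.files() treats pattern as a regular expression, not a glob
pattern <- sprintf("%s_admin_%s\\.parquet$", data_source_id, admin_level)
target_file <- list.files(data_location, pattern = pattern, full.names = TRUE)
basename(target_file)  # "chirps_admin_1.parquet"
```

Anchoring with `$` also prevents an admin level of 1 from matching files for levels 10 or 11, should those ever exist.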

)

# data source not included
expect_error(
@richfitz (Member):

Please always include an expected error message with expect_error() - we're testing our own errors here, so that's fine.
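For example, passing a regexp as the second argument makes the test assert on the message rather than on any error at all; the call and the expected message here are hypothetical:

```r
# asserting on the message, not just that *some* error is thrown
testthat::expect_error(
  load_local_vimc_climate(
    country = "Canada",
    date_range = c("2020-01-01", "2020-12-31"),
    data_location = "no/such/dir"
  ),
  "is not an existing directory"
)
```

Without the regexp, a typo elsewhere (say, a misspelled argument name) would also raise an error and the test would silently pass for the wrong reason.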

@pratikunterwegs (Contributor, Author):

The PR is still saying it wants to do a squash and merge, so does something else need changing?

It might only apply to future PRs - I can try closing and reopening this PR to check, but the repo settings are set up to allow merges.

@richfitz (Member):

That makes sense - don't worry about it for this one.

@pratikunterwegs pratikunterwegs merged commit 3105a9a into main Apr 25, 2025
8 checks passed