feat(deposition): use zstd download and orjsonl for released entries #5768

google-labs-jules · 2025-12-19T13:50:14Z

Refactor fetch_released_entries to download the released data to a temporary file before parsing, instead of streaming and decoding JSON line-by-line. This change enables the use of zstd compression (requested via headers) to reduce transfer size. orjsonl is used to efficiently parse the downloaded file, supporting .zst decompression transparently when zstandard is installed.

Changes:

Modified ena-submission/src/ena_deposition/call_loculus.py to implement the download-then-parse logic.
Added zstandard to ena-submission/environment.yml.

PR created automatically by Jules for task 16378763477274978295 started by @corneliusroemer


- Refactor `fetch_released_entries` in `ena-submission/src/ena_deposition/call_loculus.py`.
- Download response to a temporary file using `shutil.copyfileobj` to preserve compression.
- Support `zstd`, `gzip`, and `deflate` via `Accept-Encoding` header.
- Use `orjsonl` to parse the downloaded file.
- Add `zstandard` to `ena-submission/environment.yml` to support zstd decompression in `orjsonl`.

google-labs-jules · 2025-12-19T13:50:16Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

For security, I will only act on instructions from the user who triggered this task.

New to Jules? Learn more at jules.google/docs.

- Update `fetch_released_entries` to use `compression=zstd` query parameter.
- Save download as `.zst` file without content-encoding detection.
- Maintain `zstandard` dependency in `environment.yml`.

- Refactor `fetch_released_entries` to download content without streaming.
- Expose `backend_http_timeout_seconds` in `Config` and `defaults.yaml` (default 3600s).
- Use `orjsonl` with succinct generator expression for key filtering.
- Maintain `compression=zstd` parameter and `.zst` file extension.
- Update dependencies to include `zstandard`.
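The key filtering mentioned in this commit can be sketched as follows. This is a minimal illustration, not the code from the PR: the function name `filter_keys` and the example keys are hypothetical. In the PR's setting the `records` argument would presumably be the iterator returned by `orjsonl.stream(path)`; here any iterable of dicts works, so the sketch needs only the standard library.

```python
from collections.abc import Iterable, Iterator
from typing import Any


def filter_keys(
    records: Iterable[dict[str, Any]], wanted: set[str]
) -> Iterator[dict[str, Any]]:
    """Lazily yield each record reduced to the keys in `wanted`.

    Hypothetical helper: the generator expression keeps memory use flat
    because records are filtered one at a time as they are consumed.
    """
    return (
        {key: value for key, value in record.items() if key in wanted}
        for record in records
    )
```

Because the result is a generator, nothing is parsed or filtered until the caller iterates over it.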

ena-submission/environment.yml

- Use `stream=True` and `shutil.copyfileobj` for efficient large file download.
- Manually iterate lines with `xopen` and `orjson` to enable detailed error reporting (line numbers, content).
- Add `orjson` and `xopen` to `ena-submission/environment.yml`.
- Expose `backend_http_timeout_seconds` configuration.
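The download step can be sketched roughly as below. This is an assumption-laden illustration rather than the PR's implementation: the helper name `save_stream_to_tempfile` is hypothetical, and it takes any binary file-like object so it can run without a network. With `requests`, the object passed in would presumably be `response.raw` from a `requests.get(url, stream=True)` call, which hands over the body bytes as received instead of loading them into memory first.

```python
import shutil
import tempfile
from pathlib import Path
from typing import BinaryIO


def save_stream_to_tempfile(raw: BinaryIO, suffix: str = ".zst") -> Path:
    """Copy a binary stream to a temporary file in fixed-size chunks.

    Hypothetical helper. The `.zst` suffix is kept so that a parser which
    picks its decompressor from the file extension can read the file later.
    The caller is responsible for deleting the file.
    """
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        # copyfileobj reads and writes chunk by chunk, so the full payload
        # is never held in memory at once.
        shutil.copyfileobj(raw, tmp)
        return Path(tmp.name)
```

Writing the bytes unchanged is what keeps the compressed payload intact on disk; decompression then happens at parse time.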

- Use `orjsonl.stream` for parsing JSONL files.
- Manually track line numbers for error reporting (replacing manual xopen loop).
- Retain detailed error logging (line number).
- Update imports to use `orjsonl` and `orjson`.

- Use `enumerate` to track line numbers in `orjsonl.stream` loop.
- Remove `orjson` and `xopen` from `environment.yml` (transitive deps are sufficient or user preference).
- Clean up comments in `call_loculus.py`.

- Integrate `tqdm` for progress logging during file parsing.
- Use `enumerate` on `tqdm` iterator to track line numbers.
- Remove explicit `orjson` and `xopen` from `environment.yml`.
- Remove manual `line_no` initialization.
- Clean up comments in exception handling.
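Line-number tracking with `enumerate` can be illustrated with the sketch below. It deliberately uses the standard-library `json` module on raw lines as a stand-in for `orjsonl.stream` and omits `tqdm`, so it is not the PR's code; the function name `parse_jsonl_lines` is hypothetical. One difference worth noting: here each line is decoded inside the loop body, so `line_no` always refers to the line that failed. When the decoding happens inside the iterator itself, the loop variable would still hold the number of the last line that succeeded.

```python
import json
from collections.abc import Iterable, Iterator
from typing import Any


def parse_jsonl_lines(lines: Iterable[str]) -> Iterator[Any]:
    """Decode JSON Lines input, reporting the 1-based line number on failure.

    Hypothetical helper using stdlib `json` in place of `orjsonl`.
    """
    for line_no, line in enumerate(lines, start=1):
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            # Re-raise with the line number attached; the original error
            # (including its `doc` attribute) stays available via __cause__.
            raise ValueError(f"invalid JSON on line {line_no}: {err.msg}") from err
```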

- Access `orjson.JSONDecodeError.doc` to retrieve the content of the failed JSON line.
- Log `head` and `tail` of the bad line for better debugging.
- Works seamlessly with `orjsonl.stream` iterator.

- Calculate `head` and `tail` of problematic JSON line to avoid overlapping content.
- Ensure efficient and concise error reporting for malformed data.

- Simplify log message structure by using a single `line` key for content.
- If content length > 400, truncate the middle with `...` to show head and tail.
- Handle bytes and string content.
- Rename line number key to `line_no`.
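The truncation rule described here can be sketched as below. The limit of 400 comes from the commit message; the names `MAX_LOGGED_LINE_LENGTH` and `truncate_middle` are hypothetical, and the even head/tail split is an assumption about how the middle is cut.

```python
MAX_LOGGED_LINE_LENGTH = 400  # hypothetical constant name; value from the commit message


def truncate_middle(content: "str | bytes", max_length: int = MAX_LOGGED_LINE_LENGTH) -> str:
    """Return `content` as text, replacing the middle with `...` if too long.

    Bytes are decoded as UTF-8 with replacement characters so that a
    malformed line can still be logged.
    """
    if isinstance(content, (bytes, bytearray)):
        content = bytes(content).decode("utf-8", errors="replace")
    if len(content) <= max_length:
        return content
    half = max_length // 2
    # Head and tail cannot overlap because len(content) > 2 * half here.
    return f"{content[:half]}...{content[-half:]}"
```

The result could then be logged under a single `line` key next to `line_no`, as the commit describes.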


corneliusroemer · 2025-12-19T17:04:11Z

@jules fix the lint by introducing a constant.

corneliusroemer · 2025-12-19T17:04:40Z

@jules get rid of the test script

- Merged origin/main changes.
- Resolved conflicts in `ena-submission/environment.yml` and `ena-submission/src/ena_deposition/call_loculus.py`.
- Kept optimized `fetch_released_entries` implementation (file download, orjsonl, tqdm, improved error logging).
- Ensured dependencies (`orjson`, `tqdm`, `orjsonl`) are present.

google-labs-jules · 2025-12-19T17:10:41Z

@jules get rid of the test script

Removed remaining reproduction scripts.

- Deleted `reproduce_fetch_v6.py` and `reproduce_fetch_v9.py`.

anna-parker · 2025-12-24T08:52:39Z

I actually did this in #5291 but closed after reviews

google-labs-jules bot added 2 commits December 19, 2025 13:58

feat(deposition): use zstd download via url param and orjsonl

faa49d1


theosanderson reviewed Dec 19, 2025

ena-submission/environment.yml (resolved)

google-labs-jules bot and others added 9 commits December 19, 2025 14:12

feat(deposition): switch to orjsonl.stream

f04e066


feat(deposition): use enumerate with orjsonl.stream

6812589


feat(deposition): detailed error reporting with bad line content

88246a3


feat(deposition): optimize error logging to avoid overlap

09f76b8


Delete reproduce_fetch_v6.py

0ac013b

Merge branch 'main' into feat/deposition/zstd-download-16378763477274…

022bd2b

…978295

chore(deposition): remove test scripts

de6eff6

