@google-labs-jules
Contributor

@google-labs-jules google-labs-jules bot commented Dec 19, 2025

Refactor fetch_released_entries to download the released data to a temporary file before parsing, instead of streaming and decoding JSON line-by-line. This change enables the use of zstd compression (requested via headers) to reduce transfer size. orjsonl is used to efficiently parse the downloaded file, supporting .zst decompression transparently when zstandard is installed.

Changes:

  • Modified ena-submission/src/ena_deposition/call_loculus.py to implement the download-then-parse logic.
  • Added zstandard to ena-submission/environment.yml.

PR created automatically by Jules for task 16378763477274978295 started by @corneliusroemer


- Refactor `fetch_released_entries` in `ena-submission/src/ena_deposition/call_loculus.py`.
- Download response to a temporary file using `shutil.copyfileobj` to preserve compression.
- Support `zstd`, `gzip`, and `deflate` via `Accept-Encoding` header.
- Use `orjsonl` to parse the downloaded file.
- Add `zstandard` to `ena-submission/environment.yml` to support zstd decompression in `orjsonl`.
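A minimal sketch of the download-then-parse idea from the commit above, not the code in this PR. The helper name `save_stream_to_tempfile` and the `.ndjson.zst` suffix are assumptions made for illustration. With `requests`, the stream passed in would typically be `response.raw` from a `requests.get(url, stream=True, ...)` call; an in-memory buffer stands in for it here so the example runs without a network.

```python
import io
import shutil
import tempfile
from pathlib import Path
from typing import BinaryIO


def save_stream_to_tempfile(raw: BinaryIO, suffix: str = ".ndjson.zst") -> Path:
    """Copy a binary stream to a named temporary file without decoding it.

    The bytes are written exactly as received, so a compressed body stays
    compressed on disk. The suffix is a hypothetical choice for this sketch.
    """
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        shutil.copyfileobj(raw, tmp)
        return Path(tmp.name)


# Stand-in for a compressed HTTP body (opaque bytes, not a valid zstd frame).
payload = b"\x28\xb5\x2f\xfd opaque bytes standing in for a compressed body"
path = save_stream_to_tempfile(io.BytesIO(payload))
saved = path.read_bytes()
path.unlink()  # the caller is responsible for cleaning up the temporary file
```

Because the file is written byte for byte, decompression is left to whatever reads the file afterwards, which is the role the description assigns to `orjsonl` with `zstandard` installed.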
@google-labs-jules
Copy link
Contributor Author

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!


For security, I will only act on instructions from the user who triggered this task.

New to Jules? Learn more at jules.google/docs.

- Update `fetch_released_entries` to use `compression=zstd` query parameter.
- Save download as `.zst` file without content-encoding detection.
- Maintain `zstandard` dependency in `environment.yml`.
- Refactor `fetch_released_entries` to download content without streaming.
- Expose `backend_http_timeout_seconds` in `Config` and `defaults.yaml` (default 3600s).
- Use `orjsonl` with succinct generator expression for key filtering.
- Maintain `compression=zstd` parameter and `.zst` file extension.
- Update dependencies to include `zstandard`.
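The key filtering and the timeout setting mentioned in these commits could look roughly like the sketch below. Only the `backend_http_timeout_seconds` name, its 3600 s default, and the `compression=zstd` parameter come from the commit messages; the `Config` shape, the `filter_keys` helper, and the record fields are hypothetical. The standard-library `json` module is used so the example is self-contained, whereas the PR parses with `orjsonl`.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, Set


@dataclass
class Config:
    # Hypothetical minimal config; the real Config has more fields.
    backend_http_timeout_seconds: int = 3600


# Query parameter named in the commit message; how it is passed to the
# HTTP client is an assumption of this sketch.
params = {"compression": "zstd"}


def filter_keys(
    entries: Iterable[Dict[str, Any]], wanted: Set[str]
) -> Iterator[Dict[str, Any]]:
    """Lazily keep only the wanted top-level keys of each entry."""
    return ({k: v for k, v in entry.items() if k in wanted} for entry in entries)


# Made-up JSONL records for illustration.
lines = [
    '{"accession": "A1", "metadata": {"country": "CH"}, "sequence": "ACGT"}',
    '{"accession": "A2", "metadata": {"country": "DE"}, "sequence": "TTGA"}',
]
entries = (json.loads(line) for line in lines)
filtered = list(filter_keys(entries, {"accession", "metadata"}))
```

The generator expression keeps memory use flat: each entry is parsed, trimmed, and handed on before the next line is read.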
google-labs-jules bot and others added 9 commits December 19, 2025 14:12
- Use `stream=True` and `shutil.copyfileobj` for efficient large file download.
- Manually iterate lines with `xopen` and `orjson` to enable detailed error reporting (line numbers, content).
- Add `orjson` and `xopen` to `ena-submission/environment.yml`.
- Expose `backend_http_timeout_seconds` configuration.
- Use `orjsonl.stream` for parsing JSONL files.
- Manually track line numbers for error reporting (replacing manual xopen loop).
- Retain detailed error logging (line number).
- Update imports to use `orjsonl` and `orjson`.
- Use `enumerate` to track line numbers in `orjsonl.stream` loop.
- Remove `orjson` and `xopen` from `environment.yml` (transitive deps are sufficient or user preference).
- Clean up comments in `call_loculus.py`.
- Integrate `tqdm` for progress logging during file parsing.
- Use `enumerate` on `tqdm` iterator to track line numbers.
- Remove explicit `orjson` and `xopen` from `environment.yml`.
- Remove manual `line_no` initialization.
- Clean up comments in exception handling.
- Access `orjson.JSONDecodeError.doc` to retrieve the content of the failed JSON line.
- Log `head` and `tail` of the bad line for better debugging.
- Works seamlessly with `orjsonl.stream` iterator.
- Calculate `head` and `tail` of problematic JSON line to avoid overlapping content.
- Ensure efficient and concise error reporting for malformed data.
- Simplify log message structure by using a single `line` key for content.
- If content length > 400, truncate the middle with `...` to show head and tail.
- Handle bytes and string content.
- Rename line number key to `line_no`.
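The error-reporting behaviour described in the commits above (line numbers via `enumerate`, middle truncation above 400 characters, bytes or string content) can be sketched as follows. The constant name `MAX_LOGGED_LINE_LENGTH` and both function names are hypothetical, and the loop parses each line itself with the standard-library `json` module rather than iterating `orjsonl.stream` as the PR does. `orjson.JSONDecodeError` subclasses `json.JSONDecodeError`, so the `.doc` attribute is available in both cases.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Union

# Hypothetical constant name; the 400-character threshold is from the commit message.
MAX_LOGGED_LINE_LENGTH = 400


def truncate_middle(
    content: Union[bytes, str], max_len: int = MAX_LOGGED_LINE_LENGTH
) -> str:
    """Return content unchanged if short, else its head and tail joined by '...'."""
    text = content.decode("utf-8", errors="replace") if isinstance(content, bytes) else content
    if len(text) <= max_len:
        return text
    half = max_len // 2
    # len(text) > max_len >= 2 * half, so head and tail never overlap.
    return f"{text[:half]}...{text[-half:]}"


def parse_lines(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Parse JSONL lines, reporting the 1-based line number of a bad line."""
    for line_no, line in enumerate(lines, start=1):
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(
                f"Invalid JSON: line_no={line_no} line={truncate_middle(err.doc)}"
            ) from err
```

For example, `list(parse_lines(['{"a": 1}', '{bad']))` raises a `ValueError` whose message contains `line_no=2` and the offending line.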
@corneliusroemer
Contributor

@jules fix the lint by introducing a constant.

@corneliusroemer
Contributor

@jules get rid of the test script

- Merged origin/main changes.
- Resolved conflicts in `ena-submission/environment.yml` and `ena-submission/src/ena_deposition/call_loculus.py`.
- Kept optimized `fetch_released_entries` implementation (file download, orjsonl, tqdm, improved error logging).
- Ensured dependencies (`orjson`, `tqdm`, `orjsonl`) are present.
@google-labs-jules
Contributor Author

> @jules get rid of the test script

Removed remaining reproduction scripts.

- Deleted `reproduce_fetch_v6.py` and `reproduce_fetch_v9.py`.
@anna-parker
Contributor

I actually did this in #5291 but closed it after reviews.


4 participants