feat(deposition): use zstd download and orjsonl for released entries #5768
base: main
- Refactor `fetch_released_entries` in `ena-submission/src/ena_deposition/call_loculus.py`.
- Download response to a temporary file using `shutil.copyfileobj` to preserve compression.
- Support `zstd`, `gzip`, and `deflate` via `Accept-Encoding` header.
- Use `orjsonl` to parse the downloaded file.
- Add `zstandard` to `ena-submission/environment.yml` to support zstd decompression in `orjsonl`.
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me. For security, I will only act on instructions from the user who triggered this task. New to Jules? Learn more at jules.google/docs.
- Update `fetch_released_entries` to use `compression=zstd` query parameter.
- Save download as `.zst` file without content-encoding detection.
- Maintain `zstandard` dependency in `environment.yml`.

- Refactor `fetch_released_entries` to download content without streaming.
- Expose `backend_http_timeout_seconds` in `Config` and `defaults.yaml` (default 3600s).
- Use `orjsonl` with succinct generator expression for key filtering.
- Maintain `compression=zstd` parameter and `.zst` file extension.
- Update dependencies to include `zstandard`.

- Use `stream=True` and `shutil.copyfileobj` for efficient large file download.
- Manually iterate lines with `xopen` and `orjson` to enable detailed error reporting (line numbers, content).
- Add `orjson` and `xopen` to `ena-submission/environment.yml`.
- Expose `backend_http_timeout_seconds` configuration.
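
The streaming download described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `save_stream` is a hypothetical helper name, and an in-memory `io.BytesIO` stands in for the HTTP response body so the example runs without a network. With `requests`, the equivalent source object would be `response.raw` from a `requests.get(..., stream=True)` call, which yields the body bytes without decoding them.

```python
import io
import shutil
import tempfile
from pathlib import Path


def save_stream(source, suffix=".zst"):
    """Copy a binary file-like object to a temporary file in chunks.

    Hypothetical helper for illustration. The bytes are written unchanged,
    so a compressed payload stays compressed on disk.
    """
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        shutil.copyfileobj(source, tmp)
        return Path(tmp.name)


# Stand-in for a streamed response body (assumption: real code would pass
# `response.raw` from `requests.get(url, stream=True, timeout=...)`).
payload = b"\x28\xb5\x2f\xfd" + b"opaque bytes, not a real zstd frame"
path = save_stream(io.BytesIO(payload))
saved = path.read_bytes()
path.unlink()  # clean up the temporary file
```

Because `shutil.copyfileobj` copies in fixed-size chunks, memory use stays bounded regardless of how large the download is.
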

- Use `orjsonl.stream` for parsing JSONL files.
- Manually track line numbers for error reporting (replacing manual xopen loop).
- Retain detailed error logging (line number).
- Update imports to use `orjsonl` and `orjson`.

- Use `enumerate` to track line numbers in `orjsonl.stream` loop.
- Remove `orjson` and `xopen` from `environment.yml` (transitive deps are sufficient or user preference).
- Clean up comments in `call_loculus.py`.
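
As a rough sketch of tracking line numbers with `enumerate` while parsing a compressed JSONL file: the example below swaps in the standard-library `json` and `gzip` modules for `orjsonl` and zstd so that it runs without third-party packages. The function name `iter_jsonl` and the sample file are made up for illustration and are not taken from the PR.

```python
import gzip
import json
import tempfile
from pathlib import Path


def iter_jsonl(path):
    """Yield (line_no, record) pairs from a gzip-compressed JSONL file.

    Raises ValueError naming the 1-based line number of the first bad line.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            try:
                yield line_no, json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"invalid JSON on line {line_no}: {err.msg}") from err


# Build a small sample file whose third line is malformed.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "released.jsonl.gz"
    with gzip.open(sample, "wt", encoding="utf-8") as out:
        out.write('{"id": 1}\n{"id": 2}\n{"id": \n')

    parsed = []
    error_message = None
    try:
        for line_no, record in iter_jsonl(sample):
            parsed.append((line_no, record))
    except ValueError as err:
        error_message = str(err)
```

Starting `enumerate` at 1 keeps the reported number aligned with what a text editor or `sed -n` would show for the same file.
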

- Integrate `tqdm` for progress logging during file parsing.
- Use `enumerate` on `tqdm` iterator to track line numbers.
- Remove explicit `orjson` and `xopen` from `environment.yml`.
- Remove manual `line_no` initialization.
- Clean up comments in exception handling.

- Access `orjson.JSONDecodeError.doc` to retrieve the content of the failed JSON line.
- Log `head` and `tail` of the bad line for better debugging.
- Works seamlessly with `orjsonl.stream` iterator.

- Calculate `head` and `tail` of problematic JSON line to avoid overlapping content.
- Ensure efficient and concise error reporting for malformed data.

- Simplify log message structure by using a single `line` key for content.
- If content length > 400, truncate the middle with `...` to show head and tail.
- Handle bytes and string content.
- Rename line number key to `line_no`.
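
The middle-truncation rule in this commit could look something like the sketch below. The 400-character threshold comes from the commit message; the constant name, the function name `shorten_for_log`, and the even split between head and tail are assumptions made for illustration.

```python
MAX_LOGGED_LINE_LENGTH = 400  # threshold from the commit message; name is hypothetical


def shorten_for_log(content, limit=MAX_LOGGED_LINE_LENGTH):
    """Return content unchanged if short, else its head and tail joined by '...'."""
    if isinstance(content, (bytes, bytearray)):
        # Decode leniently so a bad byte cannot break the error report itself.
        content = bytes(content).decode("utf-8", errors="replace")
    if len(content) <= limit:
        return content
    half = limit // 2  # assumed split: half the budget each for head and tail
    return f"{content[:half]}...{content[len(content) - half:]}"


short = shorten_for_log("x" * 10)
long_text = shorten_for_log("a" * 300 + "b" * 300)
from_bytes = shorten_for_log(b'{"id": 1}')
```

Since truncation only happens when the content is longer than `limit`, the head and tail slices of `limit // 2` characters each can never overlap.
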

@jules fix the lint by introducing a constant.

@jules get rid of the test script

- Merged origin/main changes.
- Resolved conflicts in `ena-submission/environment.yml` and `ena-submission/src/ena_deposition/call_loculus.py`.
- Kept optimized `fetch_released_entries` implementation (file download, orjsonl, tqdm, improved error logging).
- Ensured dependencies (`orjson`, `tqdm`, `orjsonl`) are present.

Removed remaining reproduction scripts.
- Deleted `reproduce_fetch_v6.py` and `reproduce_fetch_v9.py`.

I actually did this in #5291 but closed after reviews |
Refactor `fetch_released_entries` to download the released data to a temporary file before parsing, instead of streaming and decoding JSON line-by-line. This change enables the use of `zstd` compression (requested via headers) to reduce transfer size. `orjsonl` is used to efficiently parse the downloaded file, supporting `.zst` decompression transparently when `zstandard` is installed.

Changes:
- Update `ena-submission/src/ena_deposition/call_loculus.py` to implement the download-then-parse logic.
- Add `zstandard` to `ena-submission/environment.yml`.

PR created automatically by Jules for task 16378763477274978295 started by @corneliusroemer
🚀 Preview: Add `preview` label to enable