Problem
The Python client's upload implementation uses far more memory than the TypeScript CLI when processing tens of thousands of files.
Issues
- Duplicate check phase — stores all checksums in memory
  - Location: `immich/_internal/upload.py:155-217` in `check_duplicates()`
  - Problem: all checksums are accumulated in a list before batching:

    ```python
    checksums: list[tuple[Path, str]] = []
    for filepath in files:
        checksum = await asyncio.to_thread(compute_sha1_sync, filepath)
        checksums.append((filepath, checksum))  # all stored in memory
    ```

  - Impact: for 10k+ files, every checksum is held at once (e.g. ~10k × ~100 bytes ≈ 1 MB+ for the checksums alone, plus list and tuple overhead).
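A streaming alternative can flush each batch as soon as it fills, so memory stays bounded by the batch size. The sketch below is illustrative only: `check_batch` is a hypothetical callback standing in for the client's duplicate-check request, and `compute_sha1_sync` is reimplemented here since its body is not shown in the snippet above.

```python
import asyncio
import hashlib
from pathlib import Path
from typing import Awaitable, Callable, Iterable


def compute_sha1_sync(filepath: Path) -> str:
    """Hash a file in fixed-size chunks so only one chunk is in memory."""
    sha1 = hashlib.sha1()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            sha1.update(chunk)
    return sha1.hexdigest()


async def check_duplicates_streaming(
    files: Iterable[Path],
    check_batch: Callable[[list[tuple[Path, str]]], Awaitable[None]],
    batch_size: int = 5000,
) -> None:
    """Send each batch of checksums as soon as it fills up, so at most
    `batch_size` entries are held in memory at any point."""
    batch: list[tuple[Path, str]] = []
    for filepath in files:
        checksum = await asyncio.to_thread(compute_sha1_sync, filepath)
        batch.append((filepath, checksum))
        if len(batch) >= batch_size:
            await check_batch(batch)
            batch = []
    if batch:  # flush the final partial batch
        await check_batch(batch)
```

Peak memory is then proportional to `batch_size` rather than to the total file count.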
- Upload phase — creates all coroutines upfront
  - Location: `immich/_internal/upload.py:394` in `upload_files()`
  - Problem: all coroutines are created at once:

    ```python
    await asyncio.gather(*[upload_with_semaphore(f) for f in files])
    ```

  - Impact: for 10k+ files, this materializes 10k+ coroutine objects in memory before any of them run.
Comparison with TypeScript CLI
The TypeScript CLI (immich/cli/src/commands/asset.ts) handles this better:
- Streaming duplicate checks: batches checksums as they're computed (batches of 5000), avoiding storing all in memory
- Queue-based uploads: uses a `Queue` that processes files incrementally rather than creating all tasks upfront
Recommended fixes
- Stream duplicate checks: batch checksums as they're computed instead of storing all first
- Use a task queue: replace `asyncio.gather` with a queue/worker pattern that processes files incrementally
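One possible shape for the queue/worker fix, as a minimal sketch (the `upload_file` callable, worker count, and queue bound are assumptions, not the client's actual API):

```python
import asyncio
from pathlib import Path
from typing import Awaitable, Callable, Iterable


async def upload_files_queued(
    files: Iterable[Path],
    upload_file: Callable[[Path], Awaitable[None]],
    concurrency: int = 8,
) -> None:
    """Feed files through a bounded queue so only a handful of pending
    items exist at any moment, instead of one coroutine per file."""
    # Bounded queue: the producer blocks instead of buffering everything.
    queue: asyncio.Queue = asyncio.Queue(maxsize=concurrency * 2)

    async def worker() -> None:
        while True:
            filepath = await queue.get()
            if filepath is None:  # sentinel: no more work
                queue.task_done()
                return
            try:
                await upload_file(filepath)
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(concurrency)]
    for filepath in files:
        await queue.put(filepath)  # blocks when the queue is full
    for _ in range(concurrency):
        await queue.put(None)  # one sentinel per worker
    await queue.join()
    await asyncio.gather(*workers)
```

This caps in-flight work at `concurrency` uploads plus a small queue buffer, mirroring the TypeScript CLI's incremental queue approach described above.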
Expected outcome
- Lower memory usage for large uploads (10k+ files)
- Better scalability without memory spikes
- Behavior aligned with the TypeScript CLI
Priority
Medium — functional but inefficient for very large uploads.