Memory inefficiency in upload for large file sets #26

@timonrieger

Description

Problem

The Python client's upload implementation uses more memory than the TypeScript CLI when processing tens of thousands of files.

Issues

  1. Duplicate check phase — stores all checksums in memory

    • Location: immich/_internal/upload.py:155-217 in check_duplicates()
    • Problem: All checksums are accumulated in a list before batching:
    checksums: list[tuple[Path, str]] = []
    for filepath in files:
        checksum = await asyncio.to_thread(compute_sha1_sync, filepath)
        checksums.append((filepath, checksum))  # All stored in memory
    • Impact: Memory grows linearly with the number of files, because every (path, checksum) pair is held until hashing finishes (e.g., ~10k × ~100 bytes = ~1MB+ just for checksums, plus overhead for the Path objects and tuples).
  2. Upload phase — creates all coroutines upfront

    • Location: immich/_internal/upload.py:394 in upload_files()
    • Problem: The list comprehension passed to asyncio.gather creates one coroutine per file before any upload starts:
    await asyncio.gather(*[upload_with_semaphore(f) for f in files])
    • Impact: For 10k+ files, this creates 10k+ coroutine objects in memory before processing.

Comparison with TypeScript CLI

The TypeScript CLI (immich/cli/src/commands/asset.ts) handles this better:

  • Streaming duplicate checks: batches checksums as they're computed (batches of 5000), avoiding storing all in memory
  • Queue-based uploads: uses a Queue that processes files incrementally rather than creating all tasks upfront
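
A Python equivalent of the streaming duplicate check could look roughly like the sketch below. It is not the client's actual code: sha1_of_file is a hypothetical stand-in for compute_sha1_sync, iter_checksum_batches is an invented name, and the batch size of 5000 is borrowed from the TypeScript CLI behaviour described above.

```python
import asyncio
import hashlib
from pathlib import Path
from typing import AsyncIterator, Iterable

BATCH_SIZE = 5000  # assumption: same batch size the TypeScript CLI is said to use


def sha1_of_file(path: Path) -> str:
    """Hypothetical stand-in for compute_sha1_sync: hash a file in 1 MiB chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


async def iter_checksum_batches(
    files: Iterable[Path], batch_size: int = BATCH_SIZE
) -> AsyncIterator[list[tuple[Path, str]]]:
    """Yield (path, checksum) batches so at most batch_size pairs are buffered here."""
    batch: list[tuple[Path, str]] = []
    for filepath in files:
        # Hash in a worker thread so the event loop stays responsive.
        checksum = await asyncio.to_thread(sha1_of_file, filepath)
        batch.append((filepath, checksum))
        if len(batch) >= batch_size:
            yield batch
            batch = []  # start a fresh list; the consumer owns the old one
    if batch:
        yield batch  # flush the final partial batch
```

A caller would consume it with `async for batch in iter_checksum_batches(files): ...` and send each batch to the server's duplicate check before the next one is computed. Memory only stays bounded if the caller also avoids keeping every batch result around.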

Recommended fixes

  1. Stream duplicate checks: batch checksums as they're computed instead of storing all first
  2. Use a task queue: replace asyncio.gather with a queue/worker pattern that processes files incrementally
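
For the second fix, a minimal queue/worker sketch follows. It assumes a per-file coroutine function (called upload_one here, a hypothetical name) and a fixed worker count in place of the semaphore; upload_all and its signature are illustrative, not the existing upload_files() API. The bounded queue means only a handful of paths and coroutines exist at any moment, regardless of how many files are passed in.

```python
import asyncio
from pathlib import Path
from typing import Awaitable, Callable, Iterable


async def upload_all(
    files: Iterable[Path],
    upload_one: Callable[[Path], Awaitable[None]],  # hypothetical per-file upload
    concurrency: int = 5,
) -> list[tuple[Path, Exception]]:
    """Upload files with a fixed pool of workers fed by a bounded queue.

    Returns the (path, exception) pairs for uploads that failed.
    """
    # A bounded queue makes the producer wait instead of buffering every path.
    queue: asyncio.Queue = asyncio.Queue(maxsize=concurrency * 2)
    failures: list[tuple[Path, Exception]] = []

    async def worker() -> None:
        while True:
            item = await queue.get()
            if item is None:  # sentinel: no more work for this worker
                return
            try:
                await upload_one(item)
            except Exception as exc:
                # Record the failure and keep the worker alive, so the
                # producer can never block on a queue nobody is draining.
                failures.append((item, exc))

    workers = [asyncio.create_task(worker()) for _ in range(concurrency)]
    for filepath in files:
        await queue.put(filepath)  # suspends while the queue is full
    for _ in workers:
        await queue.put(None)  # one sentinel per worker
    await asyncio.gather(*workers)
    return failures
```

Because files is only iterated lazily, it can be a generator (for example one that walks a directory tree), which avoids materialising the full file list as well.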

Expected outcome

  • Lower memory usage for large uploads (10k+ files)
  • Better scalability without memory spikes
  • Behavior aligned with the TypeScript CLI

Priority

Medium — functional but inefficient for very large uploads.
