☁️ Cross-Cloud Event-Driven Storage Replicator

This project is a Python service that operates as an event-driven replication worker. It exposes an HTTP endpoint to receive a notification about a new file in an AWS S3 bucket and replicates it to a Google Cloud Storage (GCS) bucket, ensuring the process is both robust and idempotent.

This project was developed as part of a take-home assignment and showcases best practices in API design, cloud integration, and local development workflows.

This README outlines the technical approach, design choices, and provides step-by-step instructions for running and testing the service.

Key Features and Design Approach

The service is built on modern Python best practices and designed with a focus on reliability, developer experience, and production readiness. The architecture is not just reactive: it is self-validating and built to fail fast, so it never runs in a misconfigured or broken state.

  • Event-Driven Endpoint: A simple HTTP POST endpoint at /v1/replicate to trigger the replication process based on external events.

  • Flexible Configuration: A dependency injection pattern allows the service to seamlessly switch between local emulators (MinIO for S3, fake-gcs-server for GCS) and real cloud environments. This is controlled by environment variables, requiring no code changes and enabling safe, isolated testing.
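
As an illustration, the emulator-versus-cloud switch can be wired roughly like this (names here are illustrative; the actual logic lives in app/dependencies.py and app/config.py):

import os

import boto3
from google.cloud import storage


def get_s3_client():
    # If S3_ENDPOINT_URL is set (e.g. http://localhost:9000 for MinIO), boto3 talks
    # to the emulator; if it is unset, boto3 falls back to the real AWS endpoints.
    return boto3.client(
        "s3",
        endpoint_url=os.getenv("S3_ENDPOINT_URL") or None,
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
        region_name=os.getenv("AWS_REGION", "us-east-1"),
    )


def get_gcs_client():
    # google-cloud-storage honours STORAGE_EMULATOR_HOST automatically, so the same
    # code path works against fake-gcs-server and against real GCS.
    if os.getenv("STORAGE_EMULATOR_HOST"):
        return storage.Client.create_anonymous_client()
    return storage.Client()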

Proactive Startup Validation

A key feature of this service is its proactive health check on startup. Before the API becomes available to accept requests, it performs a series of critical validations:

  1. Configuration Loading: It uses Pydantic to strictly load and validate all required environment variables from .env files. If a required variable is missing, the service provides a clean, human-readable error and exits, preventing runtime failures due to missing configuration.
  2. Endpoint Connectivity Test: It actively attempts to connect to the configured S3 and GCS endpoints (localhost emulators or real cloud services).

This "fail-fast" approach ensures that the service only runs when its core dependencies are available and correctly configured, which is a critical practice for building reliable distributed systems.

Advanced Configuration Management

The service utilizes a sophisticated configuration pattern in app/config.py:

  • Layered Environments: It correctly loads from .env.local first, allowing developers to easily override production settings for local testing without modifying shared files.
  • Conditional Validation: It contains business logic to enforce conditional rules, such as requiring GOOGLE_APPLICATION_CREDENTIALS only when not using the local GCS emulator.
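
A condensed sketch of this pattern, assuming Pydantic v2 and pydantic-settings (app/config.py is the source of truth and may differ in detail):

from pydantic import model_validator
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    # .env.local is listed last, so its values override .env during local development
    model_config = SettingsConfigDict(env_file=(".env", ".env.local"), extra="ignore")

    aws_access_key_id: str
    aws_secret_access_key: str
    aws_region: str = "us-east-1"
    gcs_bucket_name: str
    s3_endpoint_url: str | None = None
    storage_emulator_host: str | None = None
    google_application_credentials: str | None = None

    @model_validator(mode="after")
    def require_gcp_credentials_outside_emulator(self) -> "Settings":
        # Credentials are only mandatory when talking to real GCS
        if not self.storage_emulator_host and not self.google_application_credentials:
            raise ValueError(
                "GOOGLE_APPLICATION_CREDENTIALS is required when the GCS emulator is not used"
            )
        return self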

High-Throughput and Efficient API Design

Beyond the basic requirements, the API has been enhanced for real-world use cases:

  • Batch Replication Endpoint: A /v1/replicate/batch endpoint was added to allow clients to replicate multiple files in a single API call. This is far more efficient than sending one request per file, reducing network overhead and improving throughput.
  • Memory-Efficient Streaming: Files are streamed directly from S3 to GCS without being saved to the local disk. This ensures a minimal memory footprint, allowing the service to handle large files efficiently.
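
For illustration, a stripped-down version of the streaming path might look like this (the real logic lives in app/services/replicator.py and includes retries and richer error handling):

import boto3
from google.cloud import storage


def replicate_object(s3_bucket: str, s3_key: str, gcs_bucket_name: str) -> str:
    s3 = boto3.client("s3")
    blob = storage.Client().bucket(gcs_bucket_name).blob(s3_key)

    if blob.exists():  # idempotency: already replicated, nothing to do
        return "skipped"

    # get_object returns a streaming body; upload_from_file consumes it in chunks,
    # so the object is never buffered fully in memory or written to local disk.
    body = s3.get_object(Bucket=s3_bucket, Key=s3_key)["Body"]
    blob.upload_from_file(body)
    return "replicated"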

Strategy for Robust Error Handling

Transient network errors are inevitable. The service is designed to be resilient:

  • Automatic Retries: It uses the tenacity library to automatically retry failed network operations (both downloading from S3 and uploading to GCS).
  • Exponential Backoff: The waiting time between retries increases exponentially (e.g., 2s, 4s, 8s). This prevents overwhelming a temporarily struggling downstream service and increases the chance of a successful recovery.
  • Specific Error Handling: The application returns clear HTTP status codes (404 Not Found for missing files, 503 Service Unavailable for connection failures), providing meaningful feedback to the client.
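
The retry policy can be expressed declaratively with tenacity; a sketch with illustrative parameters (the service's actual values may differ):

from botocore.exceptions import EndpointConnectionError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


@retry(
    retry=retry_if_exception_type(EndpointConnectionError),  # only retry transient connection errors
    wait=wait_exponential(multiplier=2, min=2, max=16),      # roughly 2s, 4s, 8s between attempts
    stop=stop_after_attempt(4),                              # give up after a handful of tries
    reraise=True,                                            # surface the original error to the caller
)
def download_object(s3_client, bucket: str, key: str):
    """Fetch an object from S3, backing off exponentially on connection failures."""
    return s3_client.get_object(Bucket=bucket, Key=key)["Body"]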

Strategy for Guaranteed Idempotency

An idempotent service guarantees that receiving the same request multiple times produces the same result as receiving it once. This is critical to prevent data duplication and wasted processing.

  • Core Implementation: The primary strategy is to check for the file's existence in the destination before uploading. Before any replication attempt, the service makes a blob.exists() API call to GCS. If the file is already there, the operation is considered a success, and the service gracefully skips the download and upload steps.

  • Scaling Considerations and Future Improvements: The current blob.exists() check is simple and effective for this assignment. However, in a high-throughput system processing hundreds of files per second, this approach would introduce a performance bottleneck, as it doubles the number of API calls to GCS for new files (one to check, one to upload).

A more scalable, production-grade solution would involve using an external, high-speed metadata store (like Redis or DynamoDB) to track processed files. The workflow would be:

  1. Receive a request for s3_bucket/s3_key.
  2. Generate a unique key for the file (e.g., s3:source-bucket:path/to/file).
  3. Check for the existence of this key in a Redis set—a millisecond-level operation.
  4. If the key exists, the file has been processed; skip.
  5. If not, perform the replication and add the key to the Redis set upon successful upload.

This improved design decouples the idempotency check from the storage provider, significantly reducing latency and API costs at scale.
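
A hypothetical sketch of that Redis-backed check (not part of the current implementation; it assumes a reachable Redis instance and the redis-py client):

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
PROCESSED_SET = "replicator:processed"


def already_replicated(s3_bucket: str, s3_key: str) -> bool:
    # Millisecond-level membership check against the set of processed files
    return bool(r.sismember(PROCESSED_SET, f"s3:{s3_bucket}:{s3_key}"))


def mark_replicated(s3_bucket: str, s3_key: str) -> None:
    # Called only after the GCS upload has succeeded
    r.sadd(PROCESSED_SET, f"s3:{s3_bucket}:{s3_key}")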


Technology Stack

The technologies were chosen to align with modern, high-performance backend development practices.

  • FastAPI (Web Framework): Chosen for its high performance, automatic data validation with Pydantic, and interactive API documentation.
  • uv (Package Manager): A next-generation, high-speed package manager that significantly accelerates dependency installation.
  • Docker (Emulation): Runs local, containerized emulators (MinIO & fake-gcs-server), enabling a complete and isolated local development loop.
  • Pydantic (Data Validation): Used for both request body validation and robust, type-safe settings management from environment variables.
  • Rich (Console Logging): Provides clean, readable, and beautifully formatted terminal output for a superior developer experience.
  • Tenacity (Retry Logic): A powerful library for adding robust, declarative retry mechanisms to network operations.
  • Pre-commit & Ruff (Code Quality): Enforce a consistent, high-quality codebase with automated linting and formatting on every commit.

API Documentation

The service exposes three primary endpoints. Full interactive documentation is also available at the /docs endpoint when the service is running.

- GET /

Confirms that the API is online and returns the currently active configuration, indicating whether the service is connected to local emulators or live cloud environments.

Example Response (Local Emulator Mode)

When the emulator URLs are set in the environment:

{
  "status": "ok",
  "message": "Welcome to the Cross-Cloud Replicator!",
  "current_config": {
    "s3_target": "http://localhost:9000",
    "gcs_target": "http://localhost:4443"
  }
}

Example Response (Production Mode)

When no emulator URLs are set:

{
  "status": "ok",
  "message": "Welcome to the Cross-Cloud Replicator!",
  "current_config": {
    "s3_target": "REAL AWS",
    "gcs_target": "REAL GCS"
  }
}

- POST /v1/replicate

Triggers the replication of a single file.

  • Request Body:
    {
      "s3_bucket": "source-bucket",
      "s3_key": "path/to/your/file"
    }
  • Success Responses:
    • 200 OK: If the file is successfully replicated or if it already exists in the destination (idempotency).
  • Error Responses:
    • 404 Not Found: If the specified s3_key does not exist in the s3_bucket.
    • 503 Service Unavailable: If the service cannot connect to S3 or GCS after multiple retries.

- POST /v1/replicate/batch

Triggers the replication of multiple files from the same bucket in one call.

  • Request Body:
    {
      "s3_bucket": "source-bucket",
      "s3_keys": [
        "path/to/file1.txt",
        "path/to/image.jpg",
        "data/report.csv"
      ]
    }
  • Success Response (200 OK): Returns a detailed breakdown of the status for each file.
    {
      "status": "completed",
      "results": [
        { "key": "path/to/file1.txt", "status": "success", "message": "Successfully replicated..." },
        { "key": "path/to/image.jpg", "status": "not_found", "error": "Object '...' not found..." }
      ]
    }
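
The batch endpoint can also be exercised from Python; a small example using the requests library (any HTTP client works the same way):

import requests

payload = {
    "s3_bucket": "source-bucket",
    "s3_keys": ["path/to/file1.txt", "path/to/image.jpg", "data/report.csv"],
}
response = requests.post("http://127.0.0.1:8000/v1/replicate/batch", json=payload, timeout=60)
response.raise_for_status()
for result in response.json()["results"]:
    print(result["key"], "->", result["status"])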

Automated Code Quality

To ensure code is clean, consistent, and maintainable, this project uses a two-layered approach to automated code quality checks with ruff.

  1. Pre-commit Hooks: The repository is configured with pre-commit hooks that run automatically on every git commit. These hooks format the code and check for linting errors before the code is even committed. This provides immediate feedback to the developer and maintains a high standard of quality on the local machine.

  2. Continuous Integration (CI): A GitHub Actions workflow is defined in .github/workflows/ci.yml. This workflow runs on every push or pull request to the main branch. It performs a fresh installation of dependencies and runs the linter and formatter checks on a clean runner. This serves as a final validation gate to ensure that all code integrated into the main branch adheres to the project's quality standards.


Project Structure

.
├── .github/                    # GitHub Actions CI/CD Workflows
│   └── workflows/
│       └── ci.yml
├── app/                        # Main application source code
│   ├── services/               # Core business logic
│   │   └── replicator.py
│   ├── config.py               # Pydantic settings management & validation
│   ├── dependencies.py         # Cloud client dependency injection
│   ├── logging_config.py       # Logging configuration
│   └── main.py                 # FastAPI application and endpoints
├── assets/                     # Asset files (e.g., Sequence Diagram)
├── .env.example                # Example environment file
├── .gitignore
├── .pre-commit-config.yaml     # Configuration for local pre-commit hooks
├── pyproject.toml              # Project definition and dependencies (for uv)
├── README.md
└── uv.lock                     # Lock file for reproducible dependencies

Getting Started: Running the Service Locally

This guide provides a complete, step-by-step walkthrough to get the application running on your local machine using Docker-based emulators.

Step 1: Prerequisites

Ensure you have the following tools installed on your system:

  • Python (3.11 or newer)
  • Git for version control
  • Docker Desktop for running the cloud emulators. Make sure Docker is running.

Step 2: Clone and Install Dependencies

First, clone the repository and set up the Python environment using uv.

  1. Clone the repository:

    git clone https://github.com/Anshulgada/cross-cloud-replicator.git
    cd cross-cloud-replicator
  2. Create and activate a virtual environment:

    uv venv .venv
    # On Windows:
    .venv\Scripts\activate
    # On Linux/macOS:
    # source .venv/bin/activate
  3. Install all dependencies (including dev tools):

    uv pip install -e ".[dev]"
  4. Set up the Git hooks (for developers): This installs the pre-commit hooks, which will run automatically to ensure code quality.

    pre-commit install

Step 3: Set Up the Emulated Cloud Environment

This service uses Docker to run local versions of S3 (MinIO) and GCS (fake-gcs-server).

  1. Start the S3 Emulator (MinIO): Open a new terminal and run:

    docker run -d --rm -p 9000:9000 -p 9001:9001 --name minio \
      -e "MINIO_ROOT_USER=minioadmin" \
      -e "MINIO_ROOT_PASSWORD=minioadmin" \
      quay.io/minio/minio server /data --console-address ":9001"
    • The S3 API will be available at http://localhost:9000.
    • You can access the MinIO web console at http://localhost:9001.
  2. Start the GCS Emulator (fake-gcs-server): In another terminal, run:

    docker run -d --rm -p 4443:4443 --name fake-gcs-server fsouza/fake-gcs-server
    • The GCS API will be available at http://localhost:4443.

Step 4: Configure Local Environment Variables

The application uses environment variables for configuration.

  1. Create a local environment file: Copy the .env.example file to a new file named .env.local. This file is ignored by Git and is safe for your local settings.

    # On Windows
    copy .env.example .env.local
    # On Linux/macOS
    cp .env.example .env.local
  2. Verify the content: The default values in .env.example are already configured for the local emulator setup. Your .env.local should look like this:

    AWS_ACCESS_KEY_ID="minioadmin"
    AWS_SECRET_ACCESS_KEY="minioadmin"
    AWS_REGION="us-east-1"
    GCS_BUCKET_NAME="destination-bucket"
    S3_ENDPOINT_URL="http://localhost:9000"
    STORAGE_EMULATOR_HOST="http://localhost:4443"

Step 5: Run the Application

Now, with the environment and dependencies ready, you can start the API service.

  • Start the FastAPI server:
    uvicorn app.main:app --reload
  • The API is now running at http://127.0.0.1:8000.
  • The interactive API documentation is available at http://127.0.0.1:8000/docs.

Step 6: Test the Service

Finally, let's send a request to confirm everything is working end-to-end.

  1. Create test data:

    • Navigate to the MinIO console at http://localhost:9001.
    • Log in with minioadmin / minioadmin.
    • Create a new bucket named source-bucket.
    • Inside source-bucket, upload a small test file (e.g., sample.txt).
  2. Send a replication request: Use curl or an API client like Postman to send a POST request to the service.

    curl -X POST "http://127.0.0.1:8000/v1/replicate" \
         -H "Content-Type: application/json" \
         -d '{"s3_bucket": "source-bucket", "s3_key": "sample.txt"}'
  3. Verify the result:

    • You should receive a 200 OK success response.
    • Idempotency Check: Send the exact same request again. You should receive another 200 OK response with a message indicating the file already exists and was skipped. This confirms the idempotency logic is working.

Sequence Diagram

This diagram illustrates the flow for a single replication request, including the idempotency check.

(Sequence diagram image: see the assets/ directory.)


Running in Production

To run the service against real AWS and GCP environments:

  1. Create a .env file from the .env.example template.
  2. Fill in your actual AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, and GCS_BUCKET_NAME.
  3. Ensure you have a gcp-credentials.json file for your service account and that the GOOGLE_APPLICATION_CREDENTIALS variable in the .env file points to it.
  4. Make sure the emulator endpoint URLs (S3_ENDPOINT_URL, STORAGE_EMULATOR_HOST) are not set in the .env file. The application will automatically detect their absence and connect to the real cloud services.
