This project is a Python service that operates as an event-driven replication worker. It exposes an HTTP endpoint to receive a notification about a new file in an AWS S3 bucket and replicates it to a Google Cloud Storage (GCS) bucket, ensuring the process is both robust and idempotent.
This project was developed as part of a take-home assignment and showcases best practices in API design, cloud integration, and local development workflows.
This README outlines the technical approach, design choices, and provides step-by-step instructions for running and testing the service.
The service is built on modern Python best practices and designed with a focus on reliability, developer experience, and production readiness. The architecture is not just reactive; it is self-validating and built to fail fast, preventing it from running in a misconfigured or broken state.
- Event-Driven Endpoint: A simple HTTP `POST` endpoint at `/v1/replicate` to trigger the replication process based on external events.
- Flexible Configuration: A dependency injection pattern allows the service to seamlessly switch between local emulators (MinIO for S3, fake-gcs-server for GCS) and real cloud environments. This is controlled by environment variables, requiring no code changes and enabling safe, isolated testing (a short sketch follows this list).
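As an illustration of how this switch can be wired, here is a minimal sketch assuming boto3 and google-cloud-storage clients and the variable names from `.env.example`; the helper names are illustrative, not the project's actual code:

```python
import os

import boto3
from google.cloud import storage


def get_s3_client():
    """S3 client factory: endpoint_url=None means "talk to real AWS"."""
    # Locally, S3_ENDPOINT_URL points at the MinIO emulator.
    return boto3.client("s3", endpoint_url=os.getenv("S3_ENDPOINT_URL"))


def get_gcs_client() -> storage.Client:
    """GCS client factory: the library honours STORAGE_EMULATOR_HOST on its own."""
    # When STORAGE_EMULATOR_HOST is set, requests go to fake-gcs-server instead.
    return storage.Client()
```

In FastAPI, factories like these would typically be exposed as `Depends(...)` providers so endpoints never construct clients themselves.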
A key feature of this service is its proactive health check on startup. Before the API becomes available to accept requests, it performs a series of critical validations:
- Configuration Loading: It uses Pydantic to strictly load and validate all required environment variables from `.env` files. If a required variable is missing, the service provides a clean, human-readable error and exits, preventing runtime failures due to missing configuration.
- Endpoint Connectivity Test: It actively attempts to connect to the configured S3 and GCS endpoints (`localhost` emulators or real cloud services).
This "fail-fast" approach ensures that the service only runs when its core dependencies are available and correctly configured, which is a critical practice for building reliable distributed systems.
The service utilizes a sophisticated configuration pattern in `app/config.py`:
- Layered Environments: It correctly loads from `.env.local` first, allowing developers to easily override production settings for local testing without modifying shared files.
- Conditional Validation: It contains business logic to enforce conditional rules, such as requiring `GOOGLE_APPLICATION_CREDENTIALS` only when not using the local GCS emulator (sketched below).
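A trimmed-down sketch of that pattern using pydantic-settings (field names follow `.env.example`; the validator is illustrative, not a copy of `app/config.py`):

```python
from pydantic import model_validator
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    # Files later in the tuple win, so .env.local overrides .env for local runs.
    model_config = SettingsConfigDict(env_file=(".env", ".env.local"), extra="ignore")

    aws_access_key_id: str
    aws_secret_access_key: str
    aws_region: str = "us-east-1"
    gcs_bucket_name: str

    # Optional emulator endpoints; when absent, the real clouds are used.
    s3_endpoint_url: str | None = None
    storage_emulator_host: str | None = None
    google_application_credentials: str | None = None

    @model_validator(mode="after")
    def require_gcp_credentials_without_emulator(self) -> "Settings":
        if self.storage_emulator_host is None and not self.google_application_credentials:
            raise ValueError("GOOGLE_APPLICATION_CREDENTIALS is required without a GCS emulator")
        return self


settings = Settings()
```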
Beyond the basic requirements, the API has been enhanced for real-world use cases:
- Batch Replication Endpoint: A `/v1/replicate/batch` endpoint was added to allow clients to replicate multiple files in a single API call. This is far more efficient than sending one request per file, reducing network overhead and improving throughput.
- Memory-Efficient Streaming: Files are streamed directly from S3 to GCS without being saved to the local disk. This ensures a minimal memory footprint, allowing the service to handle large files efficiently (see the sketch below).
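A minimal sketch of that streaming path, combined with the existence check described under idempotency below (client wiring and the function name are illustrative):

```python
import boto3
from google.cloud import storage


def replicate_object(s3_bucket: str, s3_key: str, gcs_bucket_name: str) -> str:
    """Copy a single S3 object into GCS without buffering it on local disk."""
    s3 = boto3.client("s3")
    blob = storage.Client().bucket(gcs_bucket_name).blob(s3_key)

    if blob.exists():  # idempotency: already replicated, nothing to do
        return "skipped"

    # boto3's StreamingBody is file-like, so GCS can read it chunk by chunk.
    obj = s3.get_object(Bucket=s3_bucket, Key=s3_key)
    blob.upload_from_file(obj["Body"], size=obj["ContentLength"])
    return "replicated"
```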
Transient network errors are inevitable. The service is designed to be resilient:
- Automatic Retries: It uses the `tenacity` library to automatically retry failed network operations (both downloading from S3 and uploading to GCS).
- Exponential Backoff: The waiting time between retries increases exponentially (e.g., 2s, 4s, 8s). This prevents overwhelming a temporarily struggling downstream service and increases the chance of a successful recovery (illustrated below).
- Specific Error Handling: The application returns clear HTTP status codes (`404 Not Found` for missing files, `503 Service Unavailable` for connection failures), providing meaningful feedback to the client.
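Roughly how such a policy is declared with tenacity (the wrapped function and the exception types here are placeholders, not the project's exact code):

```python
from botocore.exceptions import EndpointConnectionError
from google.api_core.exceptions import ServiceUnavailable
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


@retry(
    retry=retry_if_exception_type((EndpointConnectionError, ServiceUnavailable)),
    wait=wait_exponential(multiplier=1, min=2, max=8),  # roughly 2s, 4s, 8s between tries
    stop=stop_after_attempt(4),
    reraise=True,  # surface the final error so the API can map it to a 503
)
def upload_with_retry(blob, stream) -> None:
    blob.upload_from_file(stream)
```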
An idempotent service guarantees that receiving the same request multiple times produces the same result as receiving it once. This is critical to prevent data duplication and wasted processing.
- Core Implementation: The primary strategy is to check for the file's existence in the destination before uploading. Before any replication attempt, the service makes a `blob.exists()` API call to GCS. If the file is already there, the operation is considered a success, and the service gracefully skips the download and upload steps.
- Scaling Considerations and Future Improvements: The current `blob.exists()` check is simple and effective for this assignment. However, in a high-throughput system processing hundreds of files per second, this approach would introduce a performance bottleneck, as it doubles the number of API calls to GCS for new files (one to check, one to upload).
A more scalable, production-grade solution would involve using an external, high-speed metadata store (like Redis or DynamoDB) to track processed files. The workflow would be:
- Receive a request for `s3_bucket/s3_key`.
- Generate a unique key for the file (e.g., `s3:source-bucket:path/to/file`).
- Check for the existence of this key in a Redis set, a millisecond-level operation.
- If the key exists, the file has already been processed; skip it.
- If not, perform the replication and add the key to the Redis set upon successful upload.
This improved design decouples the idempotency check from the storage provider, significantly reducing latency and API costs at scale.
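A sketch of that workflow with redis-py (the key naming and client wiring are hypothetical):

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
PROCESSED_SET = "replicator:processed"  # hypothetical name for the idempotency set


def already_replicated(s3_bucket: str, s3_key: str) -> bool:
    # Millisecond-level membership check, e.g. "s3:source-bucket:path/to/file".
    return bool(r.sismember(PROCESSED_SET, f"s3:{s3_bucket}:{s3_key}"))


def mark_replicated(s3_bucket: str, s3_key: str) -> None:
    # Called only after the GCS upload has succeeded.
    r.sadd(PROCESSED_SET, f"s3:{s3_bucket}:{s3_key}")
```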
The technologies were chosen to align with modern, high-performance backend development practices.
| Technology | Purpose | Justification |
|---|---|---|
| FastAPI | Web Framework | For its high performance, automatic data validation with Pydantic, and interactive API documentation. |
| uv | Package Manager | A next-generation, high-speed package manager that significantly accelerates dependency installation. |
| Docker | Emulation | For running local, containerized emulators (MinIO & fake-gcs-server), enabling a complete and isolated local development loop. |
| Pydantic | Data Validation | Used for both request body validation and robust, type-safe settings management from environment variables. |
| Rich | Console Logging | Provides clean, readable, and beautifully formatted terminal output for a superior developer experience. |
| Tenacity | Retry Logic | A powerful library for adding robust, declarative retry mechanisms to network operations. |
| Pre-commit & Ruff | Code Quality | For enforcing a consistent, high-quality codebase with automated linting and formatting on every commit. |
The service exposes three primary endpoints. Full interactive documentation is also available at the `/docs` endpoint when the service is running.
The health-check endpoint confirms that the API is online and returns the currently active configuration, indicating whether the service is connected to local emulators or live cloud environments.
When the emulator URLs are set in the environment:
```json
{
  "status": "ok",
  "message": "Welcome to the Cross-Cloud Replicator!",
  "current_config": {
    "s3_target": "http://localhost:9000",
    "gcs_target": "http://localhost:4443"
  }
}
```

When no emulator URLs are set:
```json
{
  "status": "ok",
  "message": "Welcome to the Cross-Cloud Replicator!",
  "current_config": {
    "s3_target": "REAL AWS",
    "gcs_target": "REAL GCS"
  }
}
```

The `POST /v1/replicate` endpoint triggers the replication of a single file.
- Request Body:

  ```json
  { "s3_bucket": "source-bucket", "s3_key": "path/to/your/file" }
  ```

- Success Responses:
  - `200 OK`: If the file is successfully replicated or if it already exists in the destination (idempotency).
- Error Responses (illustrated in the sketch after this list):
  - `404 Not Found`: If the specified `s3_key` does not exist in the `s3_bucket`.
  - `503 Service Unavailable`: If the service cannot connect to S3 or GCS after multiple retries.
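As an illustration, the mapping from storage errors to these status codes could look roughly like this in FastAPI (the exception handling shown is a sketch, not the service's exact code):

```python
from botocore.exceptions import ClientError, EndpointConnectionError
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()


class ReplicationRequest(BaseModel):
    s3_bucket: str
    s3_key: str


@app.post("/v1/replicate")
def replicate(req: ReplicationRequest):
    try:
        ...  # download from S3 and stream to GCS (see the replicator sketch above)
    except ClientError as exc:
        if exc.response["Error"]["Code"] in ("404", "NoSuchKey"):
            raise HTTPException(status_code=404, detail=f"Object '{req.s3_key}' not found")
        raise
    except EndpointConnectionError:
        raise HTTPException(status_code=503, detail="Could not reach S3 or GCS")
    return {"status": "success", "key": req.s3_key}
```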
The `POST /v1/replicate/batch` endpoint triggers the replication of multiple files from the same bucket in one call.
- Request Body:

  ```json
  {
    "s3_bucket": "source-bucket",
    "s3_keys": [
      "path/to/file1.txt",
      "path/to/image.jpg",
      "data/report.csv"
    ]
  }
  ```

- Success Response (`200 OK`): Returns a detailed breakdown of the status for each file (a client-side usage example follows below).

  ```json
  {
    "status": "completed",
    "results": [
      { "key": "path/to/file1.txt", "status": "success", "message": "Successfully replicated..." },
      { "key": "path/to/image.jpg", "status": "not_found", "error": "Object '...' not found..." }
    ]
  }
  ```
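For example, a client could call the batch endpoint with `httpx` against the local development server (hostname and port assume the quick-start setup below):

```python
import httpx

payload = {
    "s3_bucket": "source-bucket",
    "s3_keys": ["path/to/file1.txt", "path/to/image.jpg", "data/report.csv"],
}

resp = httpx.post("http://127.0.0.1:8000/v1/replicate/batch", json=payload, timeout=60.0)
resp.raise_for_status()
for result in resp.json()["results"]:
    print(result["key"], "->", result["status"])
```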
To ensure code is clean, consistent, and maintainable, this project uses a two-layered approach to automated code quality checks with `ruff`.
- Pre-commit Hooks: The repository is configured with `pre-commit` hooks that run automatically on every `git commit`. These hooks format the code and check for linting errors before the code is even committed. This provides immediate feedback to the developer and maintains a high standard of quality on the local machine.
- Continuous Integration (CI): A GitHub Actions workflow is defined in `.github/workflows/ci.yml`. This workflow runs on every push or pull request to the `main` branch. It performs a fresh installation of dependencies and runs the linter and formatter checks on a clean runner. This serves as a final validation gate to ensure that all code integrated into the main branch adheres to the project's quality standards.
```
.
├── .github/                  # GitHub Actions CI/CD Workflows
│   └── workflows/
│       └── ci.yml
├── app/                      # Main application source code
│   ├── services/             # Core business logic
│   │   └── replicator.py
│   ├── config.py             # Pydantic settings management & validation
│   ├── dependencies.py       # Cloud client dependency injection
│   ├── logging_config.py     # Logging configuration
│   └── main.py               # FastAPI application and endpoints
├── assets/                   # Asset files (e.g., Sequence Diagram)
├── .env.example              # Example environment file
├── .gitignore
├── .pre-commit-config.yaml   # Configuration for local pre-commit hooks
├── pyproject.toml            # Project definition and dependencies (for uv)
├── README.md
└── uv.lock                   # Lock file for reproducible dependencies
```
This guide provides a complete, step-by-step walkthrough to get the application running on your local machine using Docker-based emulators.
Ensure you have the following tools installed on your system:
- Python (3.11 or newer)
- Git for version control
- Docker Desktop for running the cloud emulators. Make sure Docker is running.
First, clone the repository and set up the Python environment using uv.
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd cross-cloud-replicator
  ```

- Create and activate a virtual environment:

  ```bash
  uv venv .venv

  # On Windows:
  .venv\Scripts\activate

  # On Linux/macOS:
  source .venv/bin/activate
  ```

- Install all dependencies (including dev tools):

  ```bash
  uv pip install -e ".[dev]"
  ```

- Set up the Git hooks (for developers): This installs the pre-commit hooks, which will run automatically to ensure code quality.

  ```bash
  pre-commit install
  ```
This service uses Docker to run local versions of S3 (MinIO) and GCS (fake-gcs-server).
- Start the S3 Emulator (MinIO): Open a new terminal and run:

  ```bash
  docker run -d --rm -p 9000:9000 -p 9001:9001 --name minio \
    -e "MINIO_ROOT_USER=minioadmin" \
    -e "MINIO_ROOT_PASSWORD=minioadmin" \
    quay.io/minio/minio server /data --console-address ":9001"
  ```

  - The S3 API will be available at `http://localhost:9000`.
  - You can access the MinIO web console at `http://localhost:9001`.

- Start the GCS Emulator (fake-gcs-server): In another terminal, run:

  ```bash
  docker run -d --rm -p 4443:4443 --name fake-gcs-server fsouza/fake-gcs-server
  ```

  - The GCS API will be available at `http://localhost:4443`.
The application uses environment variables for configuration.
- Create a local environment file: Copy the `.env.example` file to a new file named `.env.local`. This file is ignored by Git and is safe for your local settings.

  ```bash
  # On Windows
  copy .env.example .env.local

  # On Linux/macOS
  cp .env.example .env.local
  ```

- Verify the content: The default values in `.env.example` are already configured for the local emulator setup. Your `.env.local` should look like this:

  ```ini
  AWS_ACCESS_KEY_ID="minioadmin"
  AWS_SECRET_ACCESS_KEY="minioadmin"
  AWS_REGION="us-east-1"
  GCS_BUCKET_NAME="destination-bucket"
  S3_ENDPOINT_URL="http://localhost:9000"
  STORAGE_EMULATOR_HOST="http://localhost:4443"
  ```
Now, with the environment and dependencies ready, you can start the API service.
- Start the FastAPI server:

  ```bash
  uvicorn app.main:app --reload
  ```

- The API is now running at `http://127.0.0.1:8000`.
- The interactive API documentation is available at `http://127.0.0.1:8000/docs`.
Finally, let's send a request to confirm everything is working end-to-end.
- Create test data:
  - Navigate to the MinIO console at `http://localhost:9001`.
  - Log in with `minioadmin` / `minioadmin`.
  - Create a new bucket named `source-bucket`.
  - Inside `source-bucket`, upload a small test file (e.g., `sample.txt`).

- Send a replication request: Use `curl` or an API client like Postman to send a `POST` request to the service.

  ```bash
  curl -X POST "http://127.0.0.1:8000/v1/replicate" \
    -H "Content-Type: application/json" \
    -d '{"s3_bucket": "source-bucket", "s3_key": "sample.txt"}'
  ```

- Verify the result:
  - You should receive a `200 OK` success response.
  - Idempotency Check: Send the exact same request again. You should receive another `200 OK` response with a message indicating the file already exists and was skipped. This confirms the idempotency logic is working.
This diagram illustrates the flow for a single replication request, including the idempotency check.
To run the service against real AWS and GCP environments:
- Create a `.env` file from the `.env.example` template.
- Fill in your actual `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`, and `GCS_BUCKET_NAME`.
- Ensure you have a `gcp-credentials.json` file for your service account and that the `GOOGLE_APPLICATION_CREDENTIALS` variable in the `.env` file points to it.
- Make sure the emulator endpoint URLs (`S3_ENDPOINT_URL`, `STORAGE_EMULATOR_HOST`) are not set in the `.env` file. The application will automatically detect their absence and connect to the real cloud services.
