Copilot AI commented Dec 27, 2025

Implements an ETL pipeline in Scala that streams rows from PostgreSQL and writes them to S3 in Parquet format, with a Docker-based local development environment and AWS deployment instructions.

Core Implementation

  • Database Layer: Doobie with Hikari connection pooling, FS2 streaming queries
  • S3 Writer: Parquet4s with Hadoop S3A filesystem integration, configurable Hadoop settings
  • Configuration: PureConfig with type-safe case classes, kebab-case field mapping
  • Pipeline Orchestration: Cats Effect Resource management, functional error handling
  • Domain Model: Sample schema (id, name, value, category, created_at)
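
// Streaming pipeline using FS2 and Doobie
import doobie._
import doobie.implicits._
import fs2.Stream

// `F` and `xa: Transactor[F]` are assumed to come from the enclosing repository class.
def streamData(query: String, batchSize: Int): Stream[F, SampleData] =
  Fragment.const(query)                    // raw SQL: pass only trusted query strings
    .query[SampleData]
    .stream                                // rows are fetched with Doobie's default chunk size
    .transact(xa)
    .chunkN(batchSize)                     // regroup rows into chunks of at most batchSize
    .flatMap(chunk => Stream.chunk(chunk)) // flatten back into a stream of rows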
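
To make the re-chunking step concrete, here is a minimal sketch of what `chunkN` does on a pure FS2 stream. The integers are stand-ins for `SampleData` rows, and the object name `ChunkSketch` is hypothetical, not part of this PR.

```scala
import fs2.Stream

object ChunkSketch {
  // Ten stand-in records (1 to 10) regrouped into batches of at most four.
  // The last batch is smaller because chunkN allows a short final chunk by default.
  val batches: List[List[Int]] =
    Stream.range(1, 11).chunkN(4).map(_.toList).toList
  // batches == List(List(1, 2, 3, 4), List(5, 6, 7, 8), List(9, 10))
}
```

Re-chunking only changes how rows are grouped downstream. It does not change how many rows the JDBC driver fetches per round trip.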

// Parquet writing with proper Hadoop configuration
ParquetWriter.of[SampleData]
  .options(ParquetWriter.Options(hadoopConf = hadoopConf))
  .writeAndClose(path, records)
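
The kebab-case mapping mentioned under Configuration can be sketched as follows. This is an illustration only: the case classes, field names, and HOCON values are hypothetical and are not taken from the PR. It assumes PureConfig's automatic derivation for Scala 2 (`pureconfig.generic.auto._`), whose default mapping reads camelCase fields from kebab-case keys.

```scala
import pureconfig._
import pureconfig.error.ConfigReaderFailures
import pureconfig.generic.auto._

// Hypothetical config models; the PR's real case classes may differ.
final case class DatabaseConfig(url: String, user: String, poolSize: Int)
final case class S3Config(bucket: String, pathPrefix: String)
final case class AppConfig(database: DatabaseConfig, s3: S3Config, batchSize: Int)

object ConfigSketch {
  // Inline HOCON stands in for an application.conf file.
  val hocon: String =
    """
      |database {
      |  url = "jdbc:postgresql://localhost:5432/etl"
      |  user = "etl"
      |  pool-size = 4
      |}
      |s3 {
      |  bucket = "etl-output"
      |  path-prefix = "sample-data"
      |}
      |batch-size = 1000
      |""".stripMargin

  // Left carries every decoding failure; Right carries the typed config.
  def load(): Either[ConfigReaderFailures, AppConfig] =
    ConfigSource.string(hocon).load[AppConfig]
}
```

Returning an `Either` keeps configuration errors as values, which fits the functional error handling described above. A real application would more likely read `ConfigSource.default`.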

Local Development

  • Docker Compose with PostgreSQL 15 and LocalStack S3
  • Auto-initialization: database schema, sample data (10 records), S3 bucket creation
  • Integration test script for environment verification
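
For local runs, the S3A filesystem has to be pointed at LocalStack instead of AWS. A minimal sketch of such a Hadoop configuration follows. The object name, the endpoint default (LocalStack's standard edge port 4566), and the dummy `test` credentials are assumptions; the PR's actual settings are not shown in this description.

```scala
import org.apache.hadoop.conf.Configuration

object LocalStackS3A {
  // Builds a Hadoop Configuration that routes s3a:// paths to a local endpoint.
  def hadoopConf(endpoint: String = "http://localhost:4566"): Configuration = {
    val conf = new Configuration()
    conf.set("fs.s3a.endpoint", endpoint)
    // LocalStack is usually addressed path-style (host/bucket) rather than bucket.host.
    conf.set("fs.s3a.path.style.access", "true")
    conf.set("fs.s3a.connection.ssl.enabled", "false")
    // Placeholder credentials, assumed to be accepted by the local emulator.
    conf.set("fs.s3a.access.key", "test")
    conf.set("fs.s3a.secret.key", "test")
    conf
  }
}
```

A configuration built this way could be passed as `hadoopConf` to `ParquetWriter.Options`. On AWS, the endpoint and static keys would normally be dropped in favour of IAM role credentials.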

AWS Deployment Options

  • EC2: IAM role configuration, systemd service setup
  • ECS/Fargate: Task definition with Secrets Manager integration
  • Scheduled Execution: EventBridge rules for recurring jobs

Project Structure

src/main/scala/com/async2databricks/
├── config/           # PureConfig models
├── database/         # Doobie connection pool, streaming repository
├── etl/             # Pipeline orchestration
├── model/           # Domain entities
└── s3/              # Parquet S3 writer with Hadoop config
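
How the `etl/` module might tie the other packages together with Cats Effect `Resource` can be sketched as below. The two resources are hypothetical stand-ins for the connection pool and the S3 writer, and the log buffer exists only to make the acquire and release order visible. None of these names come from the PR.

```scala
import cats.effect.{IO, Resource}
import cats.effect.unsafe.implicits.global
import scala.collection.mutable.ListBuffer

object ResourceSketch {
  // A stand-in resource that records when it is opened and closed.
  def managed(name: String, log: ListBuffer[String]): Resource[IO, String] =
    Resource.make(IO { log += s"open $name"; name })(_ => IO { log += s"close $name"; () })

  // Acquires both resources, fails inside `use`, and returns the outcome plus the log.
  def run(): (Either[Throwable, Unit], List[String]) = {
    val log = ListBuffer.empty[String]
    val pipeline: IO[Unit] =
      (for {
        db <- managed("db", log)
        s3 <- managed("s3", log)
      } yield (db, s3)).use(_ => IO.raiseError[Unit](new RuntimeException("write failed")))
    // `attempt` turns the failure into a Left instead of throwing.
    val outcome = pipeline.attempt.unsafeRunSync()
    (outcome, log.toList)
  }
}
```

Even though the pipeline body fails, both resources are released in reverse order of acquisition, and the error surfaces as a `Left` value.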

Technical Stack

Doobie 1.0.0-RC4 · FS2 3.9.3 · Cats Effect 3.5.2 · Parquet4s 2.15.0 · PureConfig 0.17.4 · Hadoop AWS 3.3.4 · Scala 2.13.12

Documentation

  • QUICKSTART.md: 5-minute local setup
  • DEPLOYMENT.md: EC2, ECS/Fargate, Lambda deployment with IAM policies, monitoring, cost optimization
  • CONTRIBUTING.md: Development workflow, code style, testing guidelines

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • 127.0.0.11
    • Triggering command: REDACTED, pid is -1 (packet block)
  • 8.8.8.8
    • Triggering command: REDACTED, pid is -1 (packet block)
  • analytics.localstack.cloud
    • Triggering command: /opt/code/localstack/.venv/bin/python /opt/code/localstack/.venv/bin/python -m localstack.runtime.main /run/containerd/io.containerd.runtime.v2.task/moby/bb08ad27515013774d13c17070d1ad58d74dc3bba9dffaafb52ad3054247b580 --pid-file /run/containerd/io.containerd.runtime.v2.task/moby/bb08ad27515013774d13c17070d1ad58d74dc3bba9dffaafb52ad3054247b580/init.pid bb08ad27515013774d13c17070d1ad58d74dc3bba9dffaafb52ad3054247b580 p/bin/infocmp (dns block)
  • s3.localhost.localstack.cloud
    • Triggering command: /opt/code/localstack/.venv/bin/python /opt/code/localstack/.venv/bin/python -m localstack.runtime.main /run/containerd/io.containerd.runtime.v2.task/moby/bb08ad27515013774d13c17070d1ad58d74dc3bba9dffaafb52ad3054247b580 --pid-file /run/containerd/io.containerd.runtime.v2.task/moby/bb08ad27515013774d13c17070d1ad58d74dc3bba9dffaafb52ad3054247b580/init.pid bb08ad27515013774d13c17070d1ad58d74dc3bba9dffaafb52ad3054247b580 p/bin/infocmp (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Original prompt

Build an ETL pipeline in Scala:

  • the data source is a postgres database with sample data
  • we use doobie to connect to the database and query it (as a stream using doobie's fs2 integration)
  • then, we use parquet4s to ingest data into s3
  • configuration should be loaded with pureconfig
  • things should run locally using docker
  • add instructions how to deploy & run on AWS

Work like a professional software engineer. Use modular and tested code.



Copilot AI and others added 6 commits December 27, 2025 22:56
Co-authored-by: akreit <64101884+akreit@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Build ETL pipeline using Scala with Postgres and S3" to "Build streaming ETL pipeline: Postgres → Parquet/S3 using Doobie, FS2, and Parquet4s" Dec 27, 2025
Copilot AI requested a review from akreit December 27, 2025 23:19
akreit marked this pull request as ready for review December 28, 2025 20:04
akreit merged commit 46e0943 into main Dec 28, 2025
1 check passed
akreit deleted the copilot/build-etl-pipeline-scala branch December 28, 2025 20:04