Build streaming ETL pipeline: Postgres → Parquet/S3 using Doobie, FS2, and Parquet4s #1

Copilot · 2025-12-27T22:49:40Z

Implements a production-grade ETL pipeline in Scala that streams data from PostgreSQL to S3 in Parquet format, with a local Docker development environment and AWS deployment support.

Core Implementation

Database Layer: Doobie with Hikari connection pooling, FS2 streaming queries
S3 Writer: Parquet4s with Hadoop S3A filesystem integration, configurable Hadoop settings
Configuration: PureConfig with type-safe case classes, kebab-case field mapping
Pipeline Orchestration: Cats Effect Resource management, functional error handling
Domain Model: Sample schema (id, name, value, category, created_at)
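
To make the configuration item above concrete, here is a minimal sketch of loading kebab-case HOCON into case classes with PureConfig. The case class names, fields, and values are hypothetical, since this description does not show the actual config model, and the scala-cli directives at the top are only an assumption about how one might run it.

```scala
//> using scala 2.13.12
//> using dep "com.github.pureconfig::pureconfig:0.17.4"

import pureconfig._
import pureconfig.error.ConfigReaderFailures
import pureconfig.generic.auto._ // derives ConfigReader instances for the case classes below

// Hypothetical config model; names and fields are illustrative, not taken from the PR.
final case class DatabaseConfig(url: String, user: String, poolSize: Int)
final case class EtlConfig(batchSize: Int)
final case class AppConfig(database: DatabaseConfig, etl: EtlConfig)

object ConfigExample {
  // HOCON keys are kebab-case; PureConfig's default field mapping matches them
  // to camelCase fields (pool-size -> poolSize, batch-size -> batchSize).
  val hocon: String =
    """
      |database {
      |  url = "jdbc:postgresql://localhost:5432/etl"
      |  user = "etl"
      |  pool-size = 4
      |}
      |etl {
      |  batch-size = 1000
      |}
      |""".stripMargin

  // Left carries the failures PureConfig reports; Right carries the typed config.
  def load(source: String): Either[ConfigReaderFailures, AppConfig] =
    ConfigSource.string(source).load[AppConfig]

  def main(args: Array[String]): Unit =
    println(load(hocon))
}
```

Loading returns an Either rather than throwing, so a missing or mistyped key surfaces as a value the caller has to handle before the pipeline starts.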

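// Streaming pipeline using FS2 and Doobie.
// Assumes a Transactor[F] named `xa` is in scope and that `query` is trusted SQL,
// since Fragment.const performs no escaping.
import doobie._
import doobie.implicits._
import fs2.Stream

def streamData(query: String, batchSize: Int): Stream[F, SampleData] = {
  Fragment.const(query)
    .query[SampleData]
    .stream
    .transact(xa)
    .chunkN(batchSize) // regroup rows into chunks of at most batchSize
    .unchunks          // flatten back to a stream of rows, keeping the new chunking
}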
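
The chunkN step followed by flattening in the snippet above can look like a no-op, so the following sketch shows what it does on a pure FS2 stream: the elements and their order stay the same, and only the chunk boundaries change. The integer stream is a stand-in for database rows, and the object and method names are hypothetical.

```scala
//> using scala 2.13.12
//> using dep "co.fs2::fs2-core:3.9.3"

import fs2.{Pure, Stream}

object ChunkingExample {
  // Stand-in for database rows: a pure stream of ten integers.
  val rows: Stream[Pure, Int] = Stream.range(0, 10)

  // chunkN groups elements into chunks of at most `batchSize`;
  // the final chunk may be smaller.
  def batchSizes(batchSize: Int): List[Int] =
    rows.chunkN(batchSize).map(_.size).toList

  // unchunks flattens the chunks back into single elements, so the
  // elements and their order are unchanged; only chunk boundaries differ.
  def rebatched(batchSize: Int): List[Int] =
    rows.chunkN(batchSize).unchunks.toList

  def main(args: Array[String]): Unit = {
    println(batchSizes(4))
    println(rebatched(4))
  }
}
```

With a batch size of 4, the ten elements are grouped into chunks of 4, 4, and 2. Downstream stages that work per chunk then see batches of that size.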

// Parquet writing with proper Hadoop configuration
ParquetWriter.of[SampleData]
  .options(ParquetWriter.Options(hadoopConf = hadoopConf))
  .writeAndClose(path, records)
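
The hadoopConf value passed to the writer above is not shown in this description. As a rough sketch of what "configurable Hadoop settings" could mean, the plain-Scala helper below assembles S3A properties for either an S3-compatible emulator such as LocalStack or real AWS. The property keys are standard Hadoop S3A keys; the helper name, the endpoint URL, and the "test" credentials are assumptions for illustration.

```scala
object S3aSettings {
  // Builds Hadoop S3A properties as plain key/value pairs.
  // endpoint:    Some(url) for an S3-compatible emulator, None for real AWS.
  // credentials: Some((accessKey, secretKey)) for static keys, None when the
  //              default credential chain (for example an IAM role) should apply.
  def build(
      endpoint: Option[String],
      credentials: Option[(String, String)]
  ): Map[String, String] = {
    val base = Map("fs.s3a.impl" -> "org.apache.hadoop.fs.s3a.S3AFileSystem")

    val endpointSettings = endpoint.toList.flatMap { url =>
      List(
        "fs.s3a.endpoint" -> url,
        // Emulators usually need path-style URLs instead of virtual-host buckets.
        "fs.s3a.path.style.access" -> "true",
        "fs.s3a.connection.ssl.enabled" -> url.startsWith("https://").toString
      )
    }

    val credentialSettings = credentials.toList.flatMap { case (accessKey, secretKey) =>
      List("fs.s3a.access.key" -> accessKey, "fs.s3a.secret.key" -> secretKey)
    }

    base ++ endpointSettings ++ credentialSettings
  }

  def main(args: Array[String]): Unit = {
    // Assumed local setup: LocalStack on its default edge port with dummy keys.
    val local = build(Some("http://localhost:4566"), Some(("test", "test")))
    local.toList.sortBy(_._1).foreach { case (k, v) => println(s"$k = $v") }
  }
}
```

Each pair can then be copied onto an org.apache.hadoop.conf.Configuration with its set(key, value) method before the configuration is handed to ParquetWriter.Options.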

Local Development

Docker Compose with PostgreSQL 15 and LocalStack S3
Auto-initialization: database schema, sample data (10 records), S3 bucket creation
Integration test script for environment verification

AWS Deployment Options

EC2: IAM role configuration, systemd service setup
ECS/Fargate: Task definition with Secrets Manager integration
Scheduled Execution: EventBridge rules for recurring jobs

Project Structure

src/main/scala/com/async2databricks/
├── config/           # PureConfig models
├── database/         # Doobie connection pool, streaming repository
├── etl/             # Pipeline orchestration
├── model/           # Domain entities
└── s3/              # Parquet S3 writer with Hadoop config

Technical Stack

Doobie 1.0.0-RC4 · FS2 3.9.3 · Cats Effect 3.5.2 · Parquet4s 2.15.0 · PureConfig 0.17.4 · Hadoop AWS 3.3.4 · Scala 2.13.12

Documentation

QUICKSTART.md: 5-minute local setup
DEPLOYMENT.md: EC2, ECS/Fargate, Lambda deployment with IAM policies, monitoring, cost optimization
CONTRIBUTING.md: Development workflow, code style, testing guidelines

Warning

Firewall rules blocked me from connecting to one or more addresses.

I tried to connect to the following addresses, but was blocked by firewall rules:

127.0.0.11
- Triggering command: REDACTED, pid is -1 (packet block)
8.8.8.8
- Triggering command: REDACTED, pid is -1 (packet block)
analytics.localstack.cloud
- Triggering command: /opt/code/localstack/.venv/bin/python /opt/code/localstack/.venv/bin/python -m localstack.runtime.main /run/containerd/io.containerd.runtime.v2.task/moby/bb08ad27515013774d13c17070d1ad58d74dc3bba9dffaafb52ad3054247b580 --pid-file /run/containerd/io.containerd.runtime.v2.task/moby/bb08ad27515013774d13c17070d1ad58d74dc3bba9dffaafb52ad3054247b580/init.pid bb08ad27515013774d13c17070d1ad58d74dc3bba9dffaafb52ad3054247b580 p/bin/infocmp (dns block)
s3.localhost.localstack.cloud
- Triggering command: /opt/code/localstack/.venv/bin/python /opt/code/localstack/.venv/bin/python -m localstack.runtime.main /run/containerd/io.containerd.runtime.v2.task/moby/bb08ad27515013774d13c17070d1ad58d74dc3bba9dffaafb52ad3054247b580 --pid-file /run/containerd/io.containerd.runtime.v2.task/moby/bb08ad27515013774d13c17070d1ad58d74dc3bba9dffaafb52ad3054247b580/init.pid bb08ad27515013774d13c17070d1ad58d74dc3bba9dffaafb52ad3054247b580 p/bin/infocmp (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

Original prompt

Build an ETL pipeline in Scala:

the data source is a postgres database with sample data

we use doobie to connect to the database and query it (as a stream using doobie's fs2 integration)

then, we use parquet4s to ingest data into s3

configuration should be loaded with pureconfig

things should run locally using docker

add instructions how to deploy & run on AWS

Work like a professional software engineer. Use modular and tested code.

Co-authored-by: akreit <64101884+akreit@users.noreply.github.com>

Initial plan

5c2bf2f

Copilot AI assigned Copilot and akreit Dec 27, 2025

Copilot started work on behalf of akreit December 27, 2025 22:50

Copilot AI and others added 6 commits December 27, 2025 22:56

Add Scala project structure and core implementation

b8f9875

Co-authored-by: akreit <64101884+akreit@users.noreply.github.com>

Fix compilation errors and tests

2598966

Co-authored-by: akreit <64101884+akreit@users.noreply.github.com>

Fix docker-compose and add integration test script

2b6aa3a

Co-authored-by: akreit <64101884+akreit@users.noreply.github.com>

Add comprehensive documentation (QUICKSTART, CONTRIBUTING, DEPLOYMENT)

6a58dea

Co-authored-by: akreit <64101884+akreit@users.noreply.github.com>

Address code review feedback - fix S3Writer Hadoop configuration

e13fb10

Co-authored-by: akreit <64101884+akreit@users.noreply.github.com>

Add project summary document

9f77c93

Co-authored-by: akreit <64101884+akreit@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Build ETL pipeline using Scala with Postgres and S3~~ Build streaming ETL pipeline: Postgres → Parquet/S3 using Doobie, FS2, and Parquet4s Dec 27, 2025

Copilot AI requested a review from akreit December 27, 2025 23:19

Copilot finished work on behalf of akreit December 27, 2025 23:19

akreit added 7 commits December 28, 2025 20:21

rework copilot implementation

9fef63a

add CI pipeline

29ecff0

remove code coverage batch for now

076a398

run ci on sbt container

09ec677

run ci on sbt container

eeec32f

add comment

3018dab

apply scalafmt

6289108

akreit approved these changes Dec 28, 2025

akreit marked this pull request as ready for review December 28, 2025 20:04

akreit merged commit 46e0943 into main Dec 28, 2025
1 check passed

akreit deleted the copilot/build-etl-pipeline-scala branch December 28, 2025 20:04
