PySpark Concurrent Loader

This project is a Spark-based data processing application designed to efficiently handle large datasets by loading data from multiple sources in parallel. It uses ThreadPoolExecutor to process data concurrently and writes results directly into a unified target table, eliminating the need for intermediate storage. The application supports dynamic partitioning to enable efficient data overwriting, comprehensive logging for tracking processes, and error handling at a granular level. Additionally, it features automatic journal writing for status updates and a customizable Spark session configuration to optimize performance for different workloads.
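The repository's own code is not reproduced here, but the core idea can be sketched roughly as follows. The snippet assumes Parquet sources described by plain dicts with hypothetical `name`, `path`, and `target` keys and an existing partitioned target table; the actual loader may differ in its details:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import logging

logger = logging.getLogger("concurrent_loader")

def load_source(spark, source):
    """Read one source and write it straight into the unified target table."""
    df = spark.read.parquet(source["path"])                  # hypothetical source descriptor
    # With spark.sql.sources.partitionOverwriteMode=dynamic, only the partitions
    # present in `df` are replaced; the rest of the table is left untouched.
    df.write.mode("overwrite").insertInto(source["target"])
    return source["name"]

def run(spark, sources, max_workers=4):
    """Load all sources concurrently; a failure in one source does not stop the others."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(load_source, spark, src): src for src in sources}
        for future in as_completed(futures):
            name = futures[future]["name"]
            try:
                future.result()
                results[name] = "SUCCESS"
            except Exception as exc:                          # granular, per-source error handling
                logger.error("Source %s failed: %s", name, exc)
                results[name] = f"FAILED: {exc}"
    return results  # e.g. persisted to a journal table for status tracking
```

Submitting jobs from several threads against one SparkSession is supported by Spark, so each source becomes its own Spark job while sharing the same cluster resources.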

Features

  • Loading data from multiple sources in parallel using ThreadPoolExecutor.
  • Parallel writing directly to a unified result table without intermediate storage.
  • Dynamic partition overwrite with Spark to optimize data handling.
  • Granular error handling for individual data sources.
  • Comprehensive logging to track process status and errors.
  • Automatic journal writing for status updates and process results.
  • Configurable Spark session with customizable parameters (see the session sketch after this list).
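Dynamic partition overwrite and the customizable session are ordinary Spark session settings. A minimal sketch is shown below; the concrete parameter values are illustrative, not the project's actual configuration:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pyspark_concurrent_loader")
    # Replace only the partitions present in the incoming data instead of the whole table.
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    # FAIR scheduling shares executors more evenly between jobs submitted from multiple threads.
    .config("spark.scheduler.mode", "FAIR")
    # Workload-dependent tuning knob, shown with an example value.
    .config("spark.sql.shuffle.partitions", "200")
    .enableHiveSupport()
    .getOrCreate()
)
```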

Project Status

This project is currently under development. Features and functionality are subject to change as development progresses.

Requirements

  • Python 3.6+
  • Apache Spark 3.0+
  • PySpark

Contact

Feel free to reach out with any questions or feedback.
