Stock ETF Pipeline

This project consists of an ETL pipeline to process stocks and ETFs (Exchange Traded Funds) data. The pipeline performs raw data processing and calculates the Moving Average and Rolling Mean of the data. Note: Since available dataset contains large volume of data, I've decided to use PySpark framework to process data in distributed manner and entire process is dockerized.

Requirements

Python 3.6+
PySpark
Pandas
os
datetime
sys
glob
src.exception.CustomException
src.logger.logging

Usage

The pipeline is executed by running the docker-compose up script. Before running the script, the following configuration parameters need to be set in the config_dict dictionary in the script:

data: Path to the input data directory..
stocks_input: The path to the directory containing the input stocks data in CSV format.
etfs_input: The path to the directory containing the input ETFs data in CSV format.
nos_file_to_process: The number of files to process per iteration.
stocks_output: The path to the directory where the processed stocks data will be stored in Parquet format.
etfs_output: The path to the directory where the processed ETFs data will be stored in Parquet format.
stocks_parquet: The path and prefix of the Parquet files containing the processed stocks data.
etfs_parquet: The path and prefix of the Parquet files containing the processed ETFs data.
stocks_etfs_combined: The path to the directory where the combined stocks and ETFs data will be stored in Parquet format.
stocks_calculate: The path to the directory where the calculated Moving Average and Rolling Mean for the stocks data will be stored in Parquet format.
etfs_calculate: The path to the directory where the calculated Moving Average and Rolling Mean for the ETFs data will be stored in Parquet format.
stocks_parquet_location: The path to the Parquet file containing the processed stocks data.
etfs_parquet_location: The path to the Parquet file containing the processed ETFs data.
symbols_meta_file: The path to the CSV file containing the metadata for the stocks and ETFs.
combined_parquet_location: Path to the directory where combined stock and ETF data Parquet files will be saved..
KAGGLE_API: Path to the Kaggle API JSON file.

Functionality

Download File

Raw Data Processing

The ConvertToStructured class is responsible for the raw data processing of both stocks and ETFs data. The stocks_csv_to_parquet_file function reads CSV files from the stocks_input directory in batches and converts them to Parquet files. The function takes a symbols_meta_df dataframe as input, which contains metadata for the stocks and ETFs.

The convert_data_types function is used to convert the CSV files to the appropriate data format. The function takes a dataframe as input and returns a processed dataframe.

Moving Average and Rolling Mean Calculation

The FeatureEngineering class is responsible for calculating the Moving Average and Rolling Mean of both stocks and ETFs data. The calculate_moving_average function takes a dataframe as input and returns a new dataframe containing the calculated Moving Average.

The calculate_rolling_mean function takes a dataframe as input and returns a new dataframe containing the calculated Rolling Mean.

Error Handling

The CustomException class, defined in src.exception.py, is used for error handling. The class extends the built-in Exception class and takes a message and a module as input.

The logging module, defined in src.logger.py, is used for logging. The module logs messages

Usage

To run the pipeline, execute the docker-compose up file. The pipeline consists of the following steps:

Download data from Kaggle.
Convert stock and ETF CSV files to Parquet files.
Calculate moving averages and rolling means for stock and ETF data.
Combine stock and ETF data.
Calculate moving averages and rolling means for combined stock and ETF data.
Train model for stocks and etfs and stored trained model data into specified location

To predict the volume using the machine learning model, navigate to the `src` and `pipeline` directories and execute the command `uvicorn app:app --reload`.

To predict the volume using the machine learning model, you can access this link `https://model-prediction.onrender.com/docs` which is hosted on Render

           +----------------------------------+
           |                                  |
           |              Kaggle              |
           |                                  |
           +-----------------+----------------+
                             |
                             | download
                             |
           +-----------------v----------------+
           |                                  |
           |            Raw Data              |
           |           Processing             |
           |                                  |
           +-----------------+----------------+
                             |
                             | transform
                             |
           +-----------------v----------------+
           |                                  |
           |       Moving Average & Rolling   |
           |              Calculation         |
           |                                  |
           +-----------------+----------------+
                             |
                             | load
                             |
           +-----------------v----------------+
           |                                  |
           |            STRUCTURED DATA        |
           |                                  |
           +-----------------+----------------+
                             |
                             | pyspark dataframe
                             |
           +-----------------v----------------+
           |    calculate vol_moving_avg and  |
           |      adj_close_rolling_med       |
           |                                  |
           +-----------------+----------------+    
                             |
                             |
                             | train
                             |
           +-----------------v----------------+
           |                                  |
           |            Machine Learning       |
           |                                  |
           +-----------------+----------------+
                             |
                             | predict
                             |
           +-----------------v----------------+
           |                                  |
           |            Application            |
           |                                  |
           +-----------------+----------------+
                             |
                             | schedule tasks
                             |
           +-----------------v----------------+
           |                                  |
           |              Airflow             |
           |              Pipeline            |
           |                                  |
           +----------------------------------+

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
src/pipeline		src/pipeline
.gitignore		.gitignore
Project-explanation.md		Project-explanation.md
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Stock ETF Pipeline

Requirements

Usage

Functionality

Download File

Raw Data Processing

Moving Average and Rolling Mean Calculation

Error Handling

Usage

To predict the volume using the machine learning model, navigate to the `src` and `pipeline` directories and execute the command `uvicorn app:app --reload`.

To predict the volume using the machine learning model, you can access this link `https://model-prediction.onrender.com/docs` which is hosted on Render

About

Uh oh!

Releases

Packages

Uh oh!

Languages

nabinelnino/work-sample

Folders and files

Latest commit

History

Repository files navigation

Stock ETF Pipeline

Requirements

Usage

Functionality

Download File

Raw Data Processing

Moving Average and Rolling Mean Calculation

Error Handling

Usage

To predict the volume using the machine learning model, navigate to the src and pipeline directories and execute the command uvicorn app:app --reload.

To predict the volume using the machine learning model, you can access this link https://model-prediction.onrender.com/docs which is hosted on Render

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

To predict the volume using the machine learning model, navigate to the `src` and `pipeline` directories and execute the command `uvicorn app:app --reload`.

To predict the volume using the machine learning model, you can access this link `https://model-prediction.onrender.com/docs` which is hosted on Render

Packages