Title: Codecademy Portfolio Project, Data Ingestion Pipeline.
To see the full codebase for this project:
Link to my GitHub account
Description:
A project that automates the ingestion of subscriber-cancellation data for an online learning company, ultimately producing a tidy, analytics-ready CSV file and SQLite database.
Features:
- Using Jupyter notebooks and the pandas Python library to explore, clean, and transform datasets (a sketch of the cleaning step follows this list).
- Automating the data cleaning and transformation in Python, validating the results with the built-in unittest module and recording errors with the logging module (see the second sketch below).
- Harnessing Python's built-in sqlite3 module to read data from a relational database and produce an analytics-ready data warehouse (see the third sketch below).
- Using a Bash script to automate file management and run the pipeline scripts.
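For illustration, here is a minimal sketch of the kind of pandas cleaning step described above. The file paths and column names (subscribers.csv, subscriber_id, signup_date, cancel_date) are hypothetical placeholders, not the project's actual schema:

    import pandas as pd

    # Load the raw export (file and column names are hypothetical).
    raw = pd.read_csv("data_dev/subscribers.csv")

    # Drop exact duplicate rows and rows missing a subscriber id.
    clean = raw.drop_duplicates().dropna(subset=["subscriber_id"])

    # Parse date columns so downstream tools see real datetimes;
    # unparseable values become NaT rather than raising an error.
    for col in ("signup_date", "cancel_date"):
        clean[col] = pd.to_datetime(clean[col], errors="coerce")

    # Write the tidy, analytics-ready file.
    clean.to_csv("data_prod/subscribers_clean.csv", index=False)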
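A similarly hedged sketch of how unittest and logging might be combined to validate the cleaned output; the test cases, file paths, and log location are assumptions for illustration, not the repository's actual tests:

    import logging
    import unittest

    import pandas as pd

    # Send test output to the logs folder described below.
    logging.basicConfig(filename="logs/tests.log", level=logging.INFO)
    logger = logging.getLogger(__name__)

    class TestCleanedData(unittest.TestCase):
        def setUp(self):
            # Hypothetical path to the cleaned output from the previous sketch.
            self.df = pd.read_csv("data_prod/subscribers_clean.csv")

        def test_no_duplicate_rows(self):
            self.assertEqual(len(self.df), len(self.df.drop_duplicates()))

        def test_subscriber_id_is_complete(self):
            self.assertFalse(self.df["subscriber_id"].isna().any())

    if __name__ == "__main__":
        logger.info("Starting validation tests")
        unittest.main()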
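And a sqlite3 sketch of promoting a cleaned table from the development database into the production one; the database file names and table name are assumed for illustration:

    import sqlite3

    import pandas as pd

    # Pull the cleaned table out of the development database
    # (database and table names are hypothetical).
    dev = sqlite3.connect("data_dev/dev.db")
    df = pd.read_sql_query("SELECT * FROM subscribers_clean", dev)
    dev.close()

    # Load it into the production analytics database, replacing
    # any earlier version of the table.
    prod = sqlite3.connect("data_prod/analytics.db")
    df.to_sql("subscribers_clean", prod, if_exists="replace", index=False)
    prod.close()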
Technologies:
- Python and various standard-library modules.
- The pandas and NumPy third-party packages.
- SQLite databases.
- Data cleaning and tidying techniques.
- The command line and Bash scripting.
Folder Structure:
Main level: includes the Python, testing, and Bash scripts, as well as the following folders:
- /logs - contains the logs for the testing and main production scripts.
- /data_dev - the repository for the main development database and CSV file.
- /data_prod - the final location for the SQLite analytics database and CSV file.
Running the Bash Script:
The entire pipeline can be run by executing the following script from the command line:
bash_script_runall.sh
Collaborators:
None; this project is based on lesson #13 from the Codecademy Data Engineering Career Path.
License:
N/A.