This boilerplate is designed to kickstart data science projects by providing a basic setup for database connections, data processing, and machine learning model development. It includes a structured folder organization for your datasets and a set of pre-defined Python packages necessary for most data science tasks.
The project is organized as follows:
- src/app.py → Main Python script where your project will run.
- src/explore.ipynb → Notebook for exploration and testing. Once exploration is complete, migrate the clean code to app.py.
- src/utils.py → Auxiliary functions, such as database connection.
- requirements.txt → List of required Python packages.
- models/ → Will contain your SQLAlchemy model classes.
- data/ → Stores datasets at different stages:
  - data/raw/ → Raw data.
  - data/interim/ → Temporarily transformed data.
  - data/processed/ → Data ready for analysis.
No manual setup is required, as Codespaces is automatically configured with the predefined files created by the academy for you. Just follow these steps:
- Wait for the environment to configure automatically.
- All necessary packages and the database are installed automatically.
- The automatically created username and db_name are in the .env file at the root of the project.
- Once Codespaces is ready, you can start working immediately.
Prerequisites
Make sure you have Python 3.11+ installed on your machine. You will also need pip to install the Python packages.
Installation
Clone the project repository to your local machine.
Navigate to the project directory and install the required Python packages:
pip install -r requirements.txt
Create a database (if necessary)
Create a new database within the Postgres engine by customizing and executing the following command:
$ psql -U postgres -c "CREATE USER my_user WITH PASSWORD 'my_password';" -c "CREATE DATABASE my_database OWNER my_user;"
Connect to the Postgres engine to use your database, manipulate tables, and data:
$ psql -U my_user -d my_database
Once inside PSQL, you can create tables, run queries, insert, update, or delete data, and much more!
Environment Variables
Create a .env file in the root directory of the project to store your environment variables, such as your database connection string:
DATABASE_URL="postgresql://<USER>:<PASSWORD>@<HOST>:<PORT>/<DB_NAME>"
# Example
DATABASE_URL="postgresql://my_user:my_password@localhost:5432/my_database"
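Before handing the connection string to a database library, it can help to check that every placeholder was filled in. The following is a minimal sketch using only the standard library. The function name describe_database_url is hypothetical and not part of this template, and the sketch assumes DATABASE_URL has already been loaded into the environment (for example by a .env loader); otherwise it falls back to the example value above.

```python
import os
from urllib.parse import urlsplit

EXAMPLE_URL = "postgresql://my_user:my_password@localhost:5432/my_database"


def describe_database_url(url: str) -> dict:
    """Split a connection string into its parts and fail early if one is missing.

    Hypothetical helper for illustration; it is not part of the template.
    """
    parts = urlsplit(url)
    info = {
        "scheme": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "db_name": parts.path.lstrip("/"),
    }
    missing = [key for key, value in info.items() if not value]
    if missing:
        raise ValueError(f"DATABASE_URL is missing: {', '.join(missing)}")
    return info


# Assumption: the variable is already in the environment; otherwise use the example.
database_url = os.environ.get("DATABASE_URL", EXAMPLE_URL)
example_info = describe_database_url(EXAMPLE_URL)
print(example_info["host"], example_info["port"], example_info["db_name"])
```

The password is deliberately left out of the returned dictionary so that it is not printed or logged by accident.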
To run the application, execute the app.py script from the root directory of the project:
python src/app.py
To add SQLAlchemy model classes, create new Python script files within the models/ directory. These classes should be defined according to your database schema.
Example model definition (models/example_model.py):
from sqlalchemy.orm import Mapped, declarative_base, mapped_column

# Base class that all model classes inherit from
Base = declarative_base()

class ExampleModel(Base):
    __tablename__ = 'example_table'

    # Column types are inferred from the Mapped[...] annotations
    id: Mapped[int] = mapped_column(primary_key=True)
    username: Mapped[str] = mapped_column(unique=True)
You can place your raw datasets in the data/raw directory, intermediate datasets in data/interim, and processed datasets ready for analysis in data/processed.
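The folder layout above can be captured in a small path helper so that scripts do not hard-code directory strings. This is only a sketch: the function names data_dirs and ensure_data_dirs are hypothetical, and a temporary directory stands in for the project root so the example leaves no files behind.

```python
import tempfile
from pathlib import Path

STAGES = ("raw", "interim", "processed")


def data_dirs(project_root: Path) -> dict[str, Path]:
    """Map each stage name to its folder under <project_root>/data."""
    return {stage: project_root / "data" / stage for stage in STAGES}


def ensure_data_dirs(project_root: Path) -> dict[str, Path]:
    """Create any stage folder that does not exist yet and return the mapping."""
    dirs = data_dirs(project_root)
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs


# Demo: a temporary directory plays the role of the project root.
with tempfile.TemporaryDirectory() as tmp:
    created = ensure_data_dirs(Path(tmp))
    demo_names = sorted(path.name for path in created.values())
    all_exist = all(path.is_dir() for path in created.values())

print(demo_names, all_exist)
```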
To process data, you can modify the app.py script to include your data processing steps, using pandas for data manipulation and analysis.
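As an illustration of what such a processing step could look like, the sketch below cleans a small DataFrame and writes the result to a CSV file. It assumes pandas is installed. The column names, the sample rows, the clean_dataset function, and the users.csv file name are all invented for the example, and a temporary folder stands in for data/processed.

```python
import tempfile
from pathlib import Path

import pandas as pd


def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows and rows with missing values, then normalize column names."""
    cleaned = df.drop_duplicates().dropna().copy()
    cleaned.columns = [col.strip().lower().replace(" ", "_") for col in cleaned.columns]
    return cleaned.reset_index(drop=True)


# Made-up raw data: one duplicate row and one row with a missing name.
raw = pd.DataFrame(
    {
        "User Name": ["ana", "ana", "luis", None],
        "Age": [31, 31, 45, 28],
    }
)

processed = clean_dataset(raw)

# Assumption: a temporary folder stands in for data/processed.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "processed" / "users.csv"
    out_path.parent.mkdir(parents=True)
    processed.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(processed)
```

Keeping the cleaning logic in a function that takes and returns a DataFrame makes it easy to try out in explore.ipynb first and then move into app.py unchanged.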
This template was built as part of the 4Geeks Academy Data Science and Machine Learning Bootcamp by Alejandro Sanchez and many other contributors. Learn more about 4Geeks Academy BootCamp programs here.
Other templates and resources like this can be found on the school's GitHub page.