Data Pipeline¶

Current Implementation¶

The data pipeline consists of data loading and preprocessing stages.

V1.0 Data Pipeline Scope

The dataset is already pre-cleaned and contains no missing values, so no additional cleaning or imputation is required. At this stage, the only preprocessing step is applying label encoding to categorical features to make them compatible with the models.

Data Loading¶

Implementation¶

File: src/load_data.py

Functions Available¶

load_train_data() - Loads training data from raw CSV
load_test_data() - Loads test data from raw CSV
load_data() - Loads both training and test data
load_processed_data() - Loads already processed data
load_label_encoders() - Loads saved label encoder objects

Data Loading Process¶

# Load raw data
train_df, test_df = load_data()
print(f"Training data shape: {train_df.shape}")
print(f"Test data shape: {test_df.shape}")

The function prints data shapes and returns both datasets as pandas DataFrames.

Data Preprocessing¶

Implementation¶

File: src/preprocess.py

The preprocessing performs label encoding on categorical features:

def preprocess_data() -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Apply label encoding to categorical features and save encoders.

    Fits encoders on combined train+test data to ensure consistency.
    Saves processed datasets and encoders to configured paths.
    """
    config = load_config()

    # Load raw data
    train_df, test_df = load_data()

    # Get categorical features from config
    categorical_features = config["features"]["categorical_features"]

    # Process each categorical feature
    for feature in categorical_features:
        # Create label encoder
        le = LabelEncoder()

        # Fit on combined data for consistent encoding
        combined_values = pd.concat([
            train_df[feature].astype(str), 
            test_df[feature].astype(str)
        ]).unique()

        le.fit(combined_values)

        # Transform both datasets
        train_df[feature] = le.transform(train_df[feature].astype(str))
        test_df[feature] = le.transform(test_df[feature].astype(str))

Output Files¶

data/processed/
├── train_processed.csv    # Processed training data
├── test_processed.csv     # Processed test data
└── label_encoders.pkl     # Saved label encoder objects

Configuration¶

The data pipeline is controlled by settings in config/config.yaml:

data:
  raw_data_path: "data/raw/"
  processed_data_path: "data/processed/"
  train_file: "train.csv"
  test_file: "test.csv"
  target_column: "y"
  id_column: "id"

features:
  categorical_features: [job, marital, education, default, housing, loan, contact, month, poutcome]
  numerical_features: [age, balance, day, duration, campaign, pdays, previous]
  features_to_drop: [id]
  unknown_values: ["unknown"]

Execution¶

Run Preprocessing¶

# Via MLFlow entry point
mlflow run . -e data_preprocessing

# Direct execution
python src/preprocess.py

Load Processed Data¶

from src.load_data import load_processed_data, load_label_encoders

# Load processed datasets
train_df, test_df = load_processed_data()

# Load saved encoders for future use
encoders = load_label_encoders()