Project Architecture¶
UML Class Diagram¶
Extensible Architecture
The system uses the Factory Pattern combined with configuration-driven design to make adding new machine learning models effortless. The ModelFactory maintains a registry of available models, while hyperparameters are centralized in config.yaml. To add a new model (e.g., RandomForest), you simply:

- Create a new class inheriting from BaseModel
- Add it to the factory's _models dictionary
- Define its hyperparameters in the config file

The model then immediately becomes available through MLflow entry points.
This design separates algorithm implementation from pipeline orchestration, enabling rapid experimentation without modifying core training logic.
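For concreteness, here is a minimal sketch of what those three steps could look like for a hypothetical RandomForest model. The class body, the "random_forest" key, and the import paths are illustrative assumptions that follow the conventions of the class diagram below; they are not code copied from the repository.

```python
# Hypothetical sketch: adding a RandomForest model. Names, signatures, and
# import paths are assumptions that mirror the pattern described above.
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier

from src.models.base_model import BaseModel, ModelFactory  # assumed path


class RandomForestModel(BaseModel):
    """Only algorithm-specific details live here; the workflow is inherited."""

    @property
    def model_type(self) -> str:
        return "random_forest"

    def train(self, X_train, y_train, X_val=None, y_val=None):
        # Hyperparameters are injected from config.yaml via the base class.
        self.model = RandomForestClassifier(**self.hyperparams)
        self.model.fit(X_train, y_train)
        self.is_trained = True

    def predict(self, X):
        return self.model.predict(X)

    def predict_proba(self, X):
        return self.model.predict_proba(X)

    def log_model(self, X_val, model_name):
        # One possible choice for logging the fitted estimator to MLflow.
        mlflow.sklearn.log_model(self.model, model_name)


# Step 2: register the class in the factory's _models dictionary
# (in the project this entry would live next to the others in base_model.py).
ModelFactory._models["random_forest"] = RandomForestModel
```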
```mermaid
classDiagram
    class BaseModel {
        <<abstract>>
        -config: Dict
        -hyperparams: Dict
        -model: Any
        -is_trained: bool
        +train(X_train, y_train, X_val, y_val)*
        +predict(X)*
        +predict_proba(X)*
        +evaluate(X_test, y_test): Dict
        +log_metrics(metrics): void
        +log_model(X_val, model_name)*
        +model_type*: str
    }

    class LightGBMModel {
        +train(X_train, y_train, X_val, y_val)
        +predict(X)
        +predict_proba(X)
        +log_model(X_val, model_name)
        +model_type: str
    }

    class XGBoostModel {
        +train(X_train, y_train, X_val, y_val)
        +predict(X)
        +predict_proba(X)
        +log_model(X_val, model_name)
        +model_type: str
    }

    class CatBoostModel {
        +train(X_train, y_train, X_val, y_val)
        +predict(X)
        +predict_proba(X)
        +log_model(X_val, model_name)
        +model_type: str
    }

    class ModelFactory {
        <<static>>
        -_models: Dict[str, BaseModel]
        +create_model(model_type, config, hyperparams): BaseModel
        +get_supported_models(): List[str]
    }

    class DataLoader {
        <<utility>>
        +load_train_data(): DataFrame
        +load_test_data(): DataFrame
        +load_processed_data(): Tuple[DataFrame, DataFrame]
        +load_label_encoders(): Dict[str, LabelEncoder]
    }

    class DataPreprocessor {
        <<utility>>
        +preprocess_data(): Tuple[DataFrame, DataFrame]
    }

    class TrainModel {
        <<utility>>
        +train_model(model_type): str
    }

    BaseModel <|-- LightGBMModel
    BaseModel <|-- XGBoostModel
    BaseModel <|-- CatBoostModel
    ModelFactory --> BaseModel : creates
    TrainModel --> ModelFactory : uses
    TrainModel --> DataLoader : uses
    DataPreprocessor --> DataLoader : uses
```
UML Sequence Diagram - Training Pipeline¶
```mermaid
sequenceDiagram
    participant MLProject as MLproject
    participant TrainFunc as train_model()
    participant Config as config_loader
    participant DataLoader as Data Functions
    participant Factory as ModelFactory
    participant Model as BaseModel Instance
    participant MLFlow as MLflow Server

    MLProject->>TrainFunc: mlflow run . -e train
    TrainFunc->>Config: load_config()
    Config-->>TrainFunc: config dict

    rect rgb(240, 248, 255)
        Note over MLProject,DataLoader: DATA LOADING PHASE
        TrainFunc->>DataLoader: load_processed_data()
        DataLoader-->>TrainFunc: train_processed.csv
        TrainFunc->>TrainFunc: train_test_split(X, y)
    end

    rect rgb(245, 255, 245)
        Note over TrainFunc,MLFlow: MODEL TRAINING PHASE
        TrainFunc->>MLFlow: set_experiment()
        TrainFunc->>MLFlow: start_run()
        TrainFunc->>MLFlow: log_param(model_type, hyperparams...)
        TrainFunc->>Factory: create_model(model_type, config, hyperparams)
        Factory->>Model: __init__(config, hyperparams)
        Factory-->>TrainFunc: model_instance
        TrainFunc->>Model: train(X_train, y_train, X_val, y_val)
        Model-->>TrainFunc: trained model
    end

    rect rgb(255, 248, 240)
        Note over TrainFunc,MLFlow: EVALUATION & LOGGING PHASE
        TrainFunc->>Model: evaluate(X_val, y_val)
        Model->>Model: calculate metrics
        Model-->>TrainFunc: metrics dict
        TrainFunc->>Model: log_metrics(metrics)
        Model->>MLFlow: log_metric() for each metric
        TrainFunc->>Model: log_model(X_val, model_name)
        Model->>MLFlow: log_model() & register_model()
    end

    TrainFunc-->>MLProject: run_id
```
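To connect the sequence diagram to code, the sketch below condenses the same flow into a single function. The helper names (load_config, load_processed_data), the config keys, and the target column are illustrative assumptions; the mlflow.* calls are standard MLflow tracking APIs.

```python
# Condensed, hypothetical sketch of the training flow in the sequence diagram.
# Helper names, config keys, and the "target" column are assumptions.
import mlflow
from sklearn.model_selection import train_test_split

from config.config_loader import load_config       # assumed helper
from src.load_data import load_processed_data      # assumed helper
from src.models.base_model import ModelFactory


def train_model(model_type: str) -> str:
    config = load_config()
    hyperparams = config["models"][model_type]      # assumed config layout

    # DATA LOADING PHASE
    train_df, _ = load_processed_data()
    X = train_df.drop(columns=["target"])           # assumed target column
    y = train_df["target"]
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

    # MODEL TRAINING PHASE
    mlflow.set_experiment(config["mlflow"]["experiment_name"])  # assumed key
    with mlflow.start_run() as run:
        mlflow.log_param("model_type", model_type)
        mlflow.log_params(hyperparams)

        model = ModelFactory.create_model(model_type, config, hyperparams)
        model.train(X_train, y_train, X_val, y_val)

        # EVALUATION & LOGGING PHASE
        metrics = model.evaluate(X_val, y_val)
        model.log_metrics(metrics)                  # logs each metric to MLflow
        model.log_model(X_val, model_name=f"{model_type}_model")

        return run.info.run_id
```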
Module Structure¶
```
src/
├── load_data.py          # Data loading functions
├── preprocess.py         # Data preprocessing functions
├── train.py              # Model training pipeline
├── predict.py            # Model prediction pipeline
└── models/
    ├── __init__.py
    ├── base_model.py     # BaseModel abstract class and ModelFactory
    ├── lightgbm_model.py # LightGBM implementation
    ├── xgboost_model.py  # XGBoost implementation
    └── catboost_model.py # CatBoost implementation
config/
├── config.yaml           # Main configuration file
└── config_loader.py      # Configuration loading functions
MLproject                 # MLflow project definition
main.py                   # Main pipeline runner
```
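The configuration side is the other half of the extensibility story. Below is a minimal sketch of what config/config_loader.py and the config.yaml layout might look like; the specific keys are assumptions, and only the idea of hyperparameters centralized in config.yaml comes from the architecture description above.

```python
# Minimal sketch of config/config_loader.py. The layout in the comment is an
# assumption used for illustration.
#
# Assumed config.yaml shape:
#   models:
#     lightgbm: {num_leaves: 31, learning_rate: 0.05}
#     xgboost:  {max_depth: 6, learning_rate: 0.1}
#     catboost: {depth: 6, learning_rate: 0.1}
from pathlib import Path
from typing import Any, Dict

import yaml

CONFIG_PATH = Path(__file__).parent / "config.yaml"


def load_config(path: Path = CONFIG_PATH) -> Dict[str, Any]:
    """Load the central configuration consumed by the training pipeline."""
    with open(path, "r", encoding="utf-8") as fh:
        return yaml.safe_load(fh)
```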
Design Patterns Used¶
Factory Pattern¶
- ModelFactory: Creates model instances based on string identifiers
- Static method approach with dictionary mapping
- Supports easy addition of new model types
- Encapsulates model instantiation logic
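A minimal sketch of such a factory is shown below, assuming the registry maps string identifiers to BaseModel subclasses as described in the class diagram; the dictionary keys and the error handling are illustrative assumptions.

```python
# Hypothetical sketch of the Factory Pattern described above (in the project
# it sits in src/models/base_model.py). Import paths mirror the module
# structure section; string keys are assumptions.
from typing import Dict, List, Type

from src.models.base_model import BaseModel
from src.models.catboost_model import CatBoostModel
from src.models.lightgbm_model import LightGBMModel
from src.models.xgboost_model import XGBoostModel


class ModelFactory:
    """Creates model instances from string identifiers."""

    _models: Dict[str, Type[BaseModel]] = {
        "lightgbm": LightGBMModel,
        "xgboost": XGBoostModel,
        "catboost": CatBoostModel,
    }

    @staticmethod
    def create_model(model_type: str, config: dict, hyperparams: dict) -> BaseModel:
        if model_type not in ModelFactory._models:
            raise ValueError(
                f"Unsupported model type '{model_type}'; "
                f"choose one of {ModelFactory.get_supported_models()}"
            )
        return ModelFactory._models[model_type](config, hyperparams)

    @staticmethod
    def get_supported_models() -> List[str]:
        return list(ModelFactory._models)
```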
Template Method Pattern¶
- BaseModel: Abstract class defining training/prediction workflow
- Concrete implementations provide algorithm-specific details
- Ensures consistent interface across all models
- Common evaluation logic in base class
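The following sketch illustrates the idea, assuming scikit-learn metrics for the shared evaluate() step; the chosen metrics are assumptions, and the MLflow-related methods are abbreviated.

```python
# Hypothetical sketch of the Template Method Pattern: the abstract base fixes
# the workflow, subclasses supply only the algorithm-specific steps.
from abc import ABC, abstractmethod
from typing import Any, Dict

from sklearn.metrics import accuracy_score, f1_score  # assumed metric choice


class BaseModel(ABC):
    def __init__(self, config: Dict[str, Any], hyperparams: Dict[str, Any]):
        self.config = config
        self.hyperparams = hyperparams
        self.model: Any = None
        self.is_trained = False

    # Algorithm-specific steps provided by each concrete model
    @abstractmethod
    def train(self, X_train, y_train, X_val, y_val): ...

    @abstractmethod
    def predict(self, X): ...

    @abstractmethod
    def predict_proba(self, X): ...

    @property
    @abstractmethod
    def model_type(self) -> str: ...

    # Shared template step: identical evaluation logic for every model
    # (log_metrics() and log_model() are omitted here for brevity)
    def evaluate(self, X_test, y_test) -> Dict[str, float]:
        preds = self.predict(X_test)
        return {
            "accuracy": accuracy_score(y_test, preds),
            "f1": f1_score(y_test, preds, average="weighted"),
        }
```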
Strategy Pattern¶
- Model Implementations: Interchangeable algorithms (LightGBM, XGBoost, CatBoost)
- Runtime selection of training strategy based on configuration
- Consistent evaluation metrics across strategies
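Because every strategy exposes the same interface, comparing algorithms reduces to a loop, as in this illustrative snippet; it reuses the assumed helpers from the training sketch above and is not part of the pipeline itself.

```python
# Illustrative Strategy Pattern usage: the orchestration code is identical,
# only the algorithm behind the BaseModel interface changes at runtime.
from sklearn.model_selection import train_test_split

from config.config_loader import load_config    # assumed helper
from src.load_data import load_processed_data   # assumed helper
from src.models.base_model import ModelFactory

config = load_config()
train_df, _ = load_processed_data()
X, y = train_df.drop(columns=["target"]), train_df["target"]  # assumed target column
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

for model_type in ModelFactory.get_supported_models():
    model = ModelFactory.create_model(model_type, config, config["models"][model_type])
    model.train(X_train, y_train, X_val, y_val)
    print(model_type, model.evaluate(X_val, y_val))  # same metrics for every strategy
```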