Categorical Feature Engineering (Step 3)¶

Overview¶

The EDA analysis revealed substantial differences in conversion rates across categorical variables, particularly job types and seasonal patterns. This step transforms these insights into engineered features that capture the underlying segmentation patterns.

Key Findings from EDA¶

Students and retirees show significantly higher conversion rates at 34.1% and 30.8% respectively, compared to other job categories. March campaigns achieve remarkable success at 57.1% while August drops to just 4.7%. Additionally, customers with previous successful campaigns convert at 76.4%, indicating strong relationship continuity.

Feature Engineering Approach¶

Job Performance Segmentation¶

Created three-tier job groupings based on observed conversion patterns:

High performers: students and retirees
Medium performers: self-employed, unemployed, management roles
Standard performers: remaining job categories

Seasonal Campaign Effectiveness¶

Grouped months by historical success rates:

Peak performance: March campaigns
Solid performance: September, October, December
Standard performance: remaining months

Campaign History Encoding¶

Previous campaign outcomes serve as strong predictors, with successful prior interactions showing 76.4% conversion rates. This relationship warranted direct binary encoding.

Customer Segment Identification¶

Combined high-performing job categories into a single high-value customer flag for targeting purposes.

Implementation¶

The approach uses modular functions for each feature type, allowing independent testing and validation. Each transformation maintains the original categorical values while adding the derived features.

def create_job_conversion_groups(train_df, test_df):
    """Create job category groups based on conversion rates from EDA."""
    high_conversion_jobs = ["student", "retired"]  # 34.1%, 30.8%
    medium_conversion_jobs = ["self-employed", "unemployed", "management"]

    def categorize_job(job):
        if job in high_conversion_jobs:
            return "high_conversion"
        elif job in medium_conversion_jobs:
            return "medium_conversion"
        else:
            return "low_conversion"

    train_df["job_conversion_group"] = train_df["job"].map(categorize_job)
    test_df["job_conversion_group"] = test_df["job"].map(categorize_job)

    return train_df, test_df

The main orchestrator function coordinates all transformations:

def apply_categorical_feature_engineering(train_df, test_df):
    """Apply categorical feature engineering based on EDA insights."""

    train_df, test_df = create_job_conversion_groups(train_df, test_df)
    train_df, test_df = create_monthly_success_patterns(train_df, test_df)
    train_df, test_df = create_previous_campaign_features(train_df, test_df)
    train_df, test_df = create_high_value_segments(train_df, test_df)

    return train_df, test_df

Data Transformation Example¶

Before Processing¶

job	month	poutcome	y	Customer Profile
technician	aug	unknown	0	Blue-collar worker, August campaign
student	may	unknown	0	Student, May campaign
blue-collar	jun	success	1	Previous successful contact
retired	mar	failure	1	Retired customer, March timing

After Processing¶

job	month	poutcome	job_conversion_group	month_success_group	previous_success	high_value_segment	y	Insights
technician	aug	unknown	low_conversion	low_success	0	0	0	Standard segment, poor timing
student	may	unknown	high_conversion	low_success	0	1	0	High-value segment, average timing
blue-collar	jun	success	low_conversion	low_success	1	0	1	Previous relationship success
retired	mar	failure	high_conversion	high_success	0	1	1	Premium segment, peak timing

Features Created¶

Feature	Type	Description
`job_conversion_group`	Categorical	3-tier grouping (high/medium/low conversion) based on job success rates
`month_success_group`	Categorical	3-tier seasonal grouping based on campaign timing effectiveness
`previous_success`	Binary	Flag for customers with previous successful campaigns (76.4% conversion)
`high_value_segment`	Binary	Flag for high-performing job categories (students, retirees)

Expected Impact¶

Capture conversion patterns: Job and seasonal groupings reveal underlying success drivers
Leverage relationship history: Previous campaign success as strong predictor
Enable targeted segmentation: High-value customer identification for focused campaigns
Preserve original features: Keep original categorical values alongside engineered features

Results¶

MLflow Performance¶

Categorical feature engineering results:

Single Model Performance (80/20 split): - Test AUC: 0.9685

K-Fold Cross-Validation (5 folds): - Average AUC: 0.9687

Performance is comparable to duration feature treatment, with consistent cross-validation results.

Classification Metrics¶

Performance comparison with duration feature treatment results:

False Positives: Reduced from 5,374 to 3,848 (-1,526)
False Negatives: Increased from 4,409 to 5,657 (+1,248)
Trade-off: Higher precision but lower recall compared to duration treatment

The categorical engineering approach shows different error patterns compared to duration treatment.

Kaggle Competition Results¶

Competition submission results:

Competition Score: 0.96907
Leaderboard Position: 1159

Spoiler Warning

The following section contains additional details about the model and its evaluation. Click to expand to see more.

Additional Model Information

Based on the strong performance results, model V19 was identified as the best performing model and promoted to production with the champion alias in the MLflow Model Registry.

Model Version: V19
Status: Champion (Production)
Deployment: Ready for inference and business use
Registry: Versioned and tracked in MLflow Model Registry