Categorical Feature Engineering (Step 3)

Overview

Exploratory data analysis (EDA) revealed substantial differences in conversion rates across categorical variables, particularly job type and campaign month. This step turns those findings into engineered features that capture the underlying segmentation patterns.

Key Findings from EDA

Students and retirees show significantly higher conversion rates at 34.1% and 30.8% respectively, compared to other job categories. March campaigns achieve remarkable success at 57.1% while August drops to just 4.7%. Additionally, customers with previous successful campaigns convert at 76.4%, indicating strong relationship continuity.
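Rates like these are the per-category mean of the binary target. The following is a minimal sketch, assuming a pandas DataFrame with the `job` and `y` columns used elsewhere on this page. The rows are toy data for illustration only, not the competition dataset, and `conversion_rate_by` is a hypothetical helper name.

```python
import pandas as pd


def conversion_rate_by(df: pd.DataFrame, column: str, target: str = "y") -> pd.Series:
    """Return the mean of a 0/1 target per category, highest rate first."""
    return df.groupby(column)[target].mean().sort_values(ascending=False)


# Toy rows for illustration only (not the real dataset).
toy = pd.DataFrame(
    {
        "job": ["student", "student", "retired", "technician", "technician", "technician"],
        "y": [1, 0, 1, 0, 0, 1],
    }
)

rates = conversion_rate_by(toy, "job")
print(rates)
```

The same helper can be pointed at `month` or `poutcome` to rank those categories.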

Feature Engineering Approach

Job Performance Segmentation

Created three-tier job groupings based on observed conversion patterns:

  • High performers: students and retirees
  • Medium performers: self-employed, unemployed, management roles
  • Standard performers: remaining job categories (labeled low_conversion in the code)

Seasonal Campaign Effectiveness

Grouped months by historical success rates:

  • Peak performance: March campaigns
  • Solid performance: September, October, December
  • Standard performance: remaining months
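The month grouping above could be sketched as follows. The function name `create_monthly_success_patterns` and the labels `high_success` and `low_success` appear elsewhere on this page. The function body, the `medium_success` label for the middle tier, and the helper `categorize_month` are assumptions made for illustration, not the original implementation.

```python
import pandas as pd

# Month codes follow the abbreviations used in the data ("mar", "aug", ...).
PEAK_MONTHS = {"mar"}
SOLID_MONTHS = {"sep", "oct", "dec"}


def categorize_month(month: str) -> str:
    """Map a month abbreviation to its success tier."""
    if month in PEAK_MONTHS:
        return "high_success"
    if month in SOLID_MONTHS:
        return "medium_success"  # assumed label for the middle tier
    return "low_success"


def create_monthly_success_patterns(train_df, test_df):
    """Add month_success_group to both frames (sketch, not the original body)."""
    for df in (train_df, test_df):
        df["month_success_group"] = df["month"].map(categorize_month)
    return train_df, test_df


# Small illustrative frames.
train_months = pd.DataFrame({"month": ["mar", "sep", "aug"]})
test_months = pd.DataFrame({"month": ["dec", "may"]})
train_months, test_months = create_monthly_success_patterns(train_months, test_months)
print(train_months)
```

Because the tiers are fixed sets taken from the EDA, the same mapping applies to train and test data without refitting anything.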

Campaign History Encoding

Previous campaign outcomes serve as strong predictors, with successful prior interactions showing 76.4% conversion rates. This relationship warranted direct binary encoding.
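A minimal sketch of that binary encoding, assuming the `poutcome` column holds the values shown in the example tables on this page (`success`, `failure`, `unknown`). The function name matches the orchestrator call, but the body is an assumption.

```python
import pandas as pd


def create_previous_campaign_features(train_df, test_df):
    """Add previous_success: 1 if the prior campaign succeeded, else 0 (sketch)."""
    for df in (train_df, test_df):
        # Only "success" maps to 1; "failure" and "unknown" both map to 0.
        df["previous_success"] = (df["poutcome"] == "success").astype(int)
    return train_df, test_df


# Small illustrative frames.
train_prev = pd.DataFrame({"poutcome": ["unknown", "success", "failure"]})
test_prev = pd.DataFrame({"poutcome": ["success"]})
train_prev, test_prev = create_previous_campaign_features(train_prev, test_prev)
print(train_prev)
```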

Customer Segment Identification

Combined high-performing job categories into a single high-value customer flag for targeting purposes.
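The flag could be derived with a membership test on the job column. This is a sketch under the assumption that the high-value jobs are exactly the two high performers named above (students and retirees). The constant `HIGH_VALUE_JOBS` is a hypothetical name, and the function body is not taken from the original code.

```python
import pandas as pd

HIGH_VALUE_JOBS = ["student", "retired"]  # assumed to mirror the high-performer tier


def create_high_value_segments(train_df, test_df):
    """Add high_value_segment: 1 for high-performing job categories, else 0 (sketch)."""
    for df in (train_df, test_df):
        df["high_value_segment"] = df["job"].isin(HIGH_VALUE_JOBS).astype(int)
    return train_df, test_df


# Small illustrative frames.
train_jobs = pd.DataFrame({"job": ["technician", "student", "blue-collar", "retired"]})
test_jobs = pd.DataFrame({"job": ["management"]})
train_jobs, test_jobs = create_high_value_segments(train_jobs, test_jobs)
print(train_jobs)
```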

Implementation

The approach uses modular functions for each feature type, allowing independent testing and validation. Each transformation maintains the original categorical values while adding the derived features.

def create_job_conversion_groups(train_df, test_df):
    """Create job category groups based on conversion rates from EDA."""
    high_conversion_jobs = ["student", "retired"]  # 34.1%, 30.8%
    medium_conversion_jobs = ["self-employed", "unemployed", "management"]

    def categorize_job(job):
        if job in high_conversion_jobs:
            return "high_conversion"
        elif job in medium_conversion_jobs:
            return "medium_conversion"
        else:
            return "low_conversion"

    train_df["job_conversion_group"] = train_df["job"].map(categorize_job)
    test_df["job_conversion_group"] = test_df["job"].map(categorize_job)

    return train_df, test_df

The main orchestrator function coordinates all transformations:

def apply_categorical_feature_engineering(train_df, test_df):
    """Apply categorical feature engineering based on EDA insights."""

    train_df, test_df = create_job_conversion_groups(train_df, test_df)
    train_df, test_df = create_monthly_success_patterns(train_df, test_df)
    train_df, test_df = create_previous_campaign_features(train_df, test_df)
    train_df, test_df = create_high_value_segments(train_df, test_df)

    return train_df, test_df

Data Transformation Example

Before Processing

| job | month | poutcome | y | Customer Profile |
|---|---|---|---|---|
| technician | aug | unknown | 0 | Technician, August campaign |
| student | may | unknown | 0 | Student, May campaign |
| blue-collar | jun | success | 1 | Previous successful contact |
| retired | mar | failure | 1 | Retired customer, March timing |

After Processing

| job | month | poutcome | job_conversion_group | month_success_group | previous_success | high_value_segment | y | Insights |
|---|---|---|---|---|---|---|---|---|
| technician | aug | unknown | low_conversion | low_success | 0 | 0 | 0 | Standard segment, poor timing |
| student | may | unknown | high_conversion | low_success | 0 | 1 | 0 | High-value segment, average timing |
| blue-collar | jun | success | low_conversion | low_success | 1 | 0 | 1 | Previous relationship success |
| retired | mar | failure | high_conversion | high_success | 0 | 1 | 1 | Premium segment, peak timing |

Features Created

| Feature | Type | Description |
|---|---|---|
| job_conversion_group | Categorical | 3-tier grouping (high/medium/low conversion) based on job success rates |
| month_success_group | Categorical | 3-tier seasonal grouping based on campaign timing effectiveness |
| previous_success | Binary | Flag for customers with previous successful campaigns (76.4% conversion) |
| high_value_segment | Binary | Flag for high-performing job categories (students, retirees) |

Expected Impact

  • Capture conversion patterns: Job and seasonal groupings reveal underlying success drivers
  • Leverage relationship history: Previous campaign success as strong predictor
  • Enable targeted segmentation: High-value customer identification for focused campaigns
  • Preserve original features: Keep original categorical values alongside engineered features

Results

MLflow Performance

Categorical feature engineering results:

MLflow Results - Categorical Engineering

Single Model Performance (80/20 split):

  • Test AUC: 0.9685

K-Fold Cross-Validation (5 folds):

  • Average AUC: 0.9687

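The single-split test AUC (0.9685) and the 5-fold average (0.9687) differ by only 0.0002, so the estimate is stable across splits. Overall performance is comparable to the duration feature treatment.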

Classification Metrics

Performance comparison with duration feature treatment results:

MLflow Categorical Engineering Metrics

  • False Positives: Reduced from 5,374 to 3,848 (-1,526)
  • False Negatives: Increased from 4,409 to 5,657 (+1,248)
  • Trade-off: Higher precision but lower recall compared to duration treatment

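Relative to the duration treatment, the categorical features shift errors from false positives to false negatives: the model flags fewer non-converters as converters but misses more actual conversions.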
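The arithmetic behind the counts above can be checked directly. This sketch uses only the four reported numbers. The last step assumes both runs were scored on the same rows, which this page does not state explicitly.

```python
# Error counts reported above for the two feature treatments.
duration = {"fp": 5374, "fn": 4409}
categorical = {"fp": 3848, "fn": 5657}

fp_delta = categorical["fp"] - duration["fp"]  # change in false positives
fn_delta = categorical["fn"] - duration["fn"]  # change in false negatives
total_error_delta = fp_delta + fn_delta        # change in total misclassifications

# Assumption: same evaluation rows in both runs, so the number of actual
# positives is fixed and each extra false negative is one fewer true positive.
# Predicted positives = TP + FP, so their change is (-fn_delta) + fp_delta.
predicted_positive_delta = fp_delta - fn_delta

print(fp_delta, fn_delta, total_error_delta, predicted_positive_delta)
```

Under that assumption, total misclassifications fall slightly while the model predicts a conversion noticeably less often, which is consistent with the precision and recall trade-off noted above.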

Kaggle Competition Results

Competition submission results:

Kaggle Submission - Categorical Treatment

  • Competition Score: 0.96907
  • Leaderboard Position: 1159

Additional Model Information

Based on these results, model V19 was identified as the best-performing model and promoted to production under the champion alias in the MLflow Model Registry.

Model V19 Promoted to Production

  • Model Version: V19
  • Status: Champion (Production)
  • Deployment: Ready for inference and business use
  • Registry: Versioned and tracked in MLflow Model Registry
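Assigning the champion alias might look like the sketch below. `set_registered_model_alias` is a method of MLflow's `MlflowClient` in releases that support model aliases. The helper `promote_to_champion`, the registered model name in the usage comment, and the reading of "V19" as registry version 19 are assumptions for illustration, not details taken from this project.

```python
def promote_to_champion(client, model_name: str, version: int) -> None:
    """Point the "champion" alias at one version of a registered model.

    `client` is expected to be an mlflow.MlflowClient (or any object with
    the same set_registered_model_alias method).
    """
    client.set_registered_model_alias(
        name=model_name,
        alias="champion",
        version=str(version),
    )


# Usage against a real registry (requires mlflow and a configured tracking
# server; the model name below is hypothetical):
#
#   from mlflow import MlflowClient
#   promote_to_champion(MlflowClient(), "bank_marketing_classifier", 19)
```

Inference code can then load the model by alias rather than by a hard-coded version number, so promoting a later version does not require changing the consumers.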