Duration Feature Treatment (Step 2)¶
Overview¶
Duration emerges from the EDA as the strongest predictor, showing the highest correlation with the target variable (r=0.519). This step focuses on extracting as much signal as possible from this feature while keeping potential data leakage in mind, since a call's duration is only known once the call has ended.
Key EDA Findings¶
- Strongest correlation: Duration shows r=0.519 with target (highest among all features)
- Clear engagement pattern: Subscribers average 525 seconds vs 212 seconds for non-subscribers
- Business insight: Call duration serves as an early success indicator during campaigns
- Data quality: Duration has a right-skewed distribution that calls for transformation
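Findings of this kind can be checked with a few pandas calls. The following is a minimal sketch on synthetic stand-in data: the lognormal parameters and the 12% positive rate are assumptions chosen for illustration, not values from the competition dataset. It computes the Pearson correlation with a binary target, the mean duration per class, and the skewness before and after `np.log1p()`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): NOT the competition dataset.
rng = np.random.default_rng(0)
n = 5_000
y = rng.binomial(1, 0.12, size=n)
# Lognormal durations, longer on average for subscribers (assumed shape)
duration = rng.lognormal(mean=np.where(y == 1, 6.0, 5.0), sigma=0.8)
df = pd.DataFrame({"duration": np.round(duration), "y": y})

# Pearson r between a numeric feature and a 0/1 target
corr = df["duration"].corr(df["y"])

# Average duration for non-subscribers (0) and subscribers (1)
group_means = df.groupby("y")["duration"].mean()

# Skewness of the raw and the log-transformed feature
skew_raw = df["duration"].skew()
skew_log = np.log1p(df["duration"]).skew()

print(f"r = {corr:.3f}")
print(group_means)
print(f"skew raw = {skew_raw:.2f}, skew log1p = {skew_log:.2f}")
```

On the real data, the same calls would be run on the training DataFrame instead of the synthetic one.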
Implementation Strategy¶
1. Duration Binning¶
Created categorical bins based on engagement patterns:
- very_short (0-120s): Quick rejections
- short (120-300s): Standard interactions
- medium (300-600s): Engaged prospects
- long (>600s): High engagement calls
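The ranges above share their endpoints, so it matters which bin a value of exactly 120, 300, or 600 seconds falls into. The sketch below assumes the pandas default of right-closed intervals with `include_lowest=True`, and uses made-up edge-case durations to show where each boundary value lands.

```python
import pandas as pd

bins = [0, 120, 300, 600, float("inf")]
labels = ["very_short", "short", "medium", "long"]

# Illustrative edge cases around each bin boundary
edge_cases = pd.Series([0, 120, 121, 300, 301, 600, 601])

# Right-closed intervals: (0, 120], (120, 300], (300, 600], (600, inf)
# include_lowest=True also places duration == 0 in the first bin
binned = pd.cut(edge_cases, bins=bins, labels=labels, include_lowest=True)

result = dict(zip(edge_cases.tolist(), binned.astype(str).tolist()))
for seconds, label in result.items():
    print(f"{seconds:>4}s -> {label}")
```

Under these assumptions a call of exactly 300 seconds is binned as short, not medium. Negative durations, if any existed, would fall outside all bins and become NaN.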
2. Log Transformation¶
Applied an np.log1p() transformation to handle the right-skewed distribution and reduce the impact of extreme outliers.
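A short sketch of why `np.log1p()` is a convenient choice here: it computes log(1 + x), so a duration of 0 maps to 0 instead of negative infinity, and `np.expm1()` inverts it. The duration values are made up for illustration.

```python
import numpy as np

# Illustrative durations in seconds, including a zero-length call
durations = np.array([0, 10, 100, 1000, 5000], dtype=float)

logged = np.log1p(durations)   # log(1 + x), defined at x = 0
restored = np.expm1(logged)    # exp(x) - 1, the inverse transform

raw_range = durations.max() - durations.min()
log_range = logged.max() - logged.min()

print(np.round(logged, 2))
print(f"raw range = {raw_range:.0f}, log range = {log_range:.2f}")
```

The transformed values span less than 10 units while the raw values span 5000, which is how the transform limits the influence of very long calls.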
3. High Engagement Flag¶
The binary feature duration_high_engagement flags calls exceeding 300 seconds, capturing the threshold above which conversion likelihood increases significantly.
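Because the 300-second threshold coincides with the boundary between the short and medium bins, the flag carries the same information as membership in the medium or long bin. The following sketch, using made-up sample durations and assuming right-closed bin edges, builds the flag and checks that equivalence.

```python
import pandas as pd

# Illustrative durations around the 300-second threshold
durations = pd.Series([30, 120, 299, 300, 301, 450, 600, 900], name="duration")

# Flag: strictly greater than 300 seconds
high_engagement = (durations > 300).astype(int)

# Same bin edges as the binning step, right-closed intervals assumed
duration_bin = pd.cut(
    durations,
    bins=[0, 120, 300, 600, float("inf")],
    labels=["very_short", "short", "medium", "long"],
    include_lowest=True,
)

# The flag should equal "bin is medium or long"
flag_from_bin = duration_bin.isin(["medium", "long"]).astype(int)
flags_match = bool((high_engagement == flag_from_bin).all())
print(flags_match)
```

The flag is therefore a coarser, two-level view of duration_bin and not an independent signal.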
Code Implementation¶
    from typing import Tuple

    import numpy as np
    import pandas as pd


    def apply_duration_feature_treatment(train_df: pd.DataFrame, test_df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
        """Apply duration feature treatment based on EDA insights.

        Note: the new columns are added to the input DataFrames in place.
        """
        # Duration bins based on engagement patterns (right-closed intervals)
        duration_bins = [0, 120, 300, 600, float('inf')]
        duration_labels = ['very_short', 'short', 'medium', 'long']
        train_df['duration_bin'] = pd.cut(train_df['duration'], bins=duration_bins, labels=duration_labels, include_lowest=True)
        test_df['duration_bin'] = pd.cut(test_df['duration'], bins=duration_bins, labels=duration_labels, include_lowest=True)

        # Log transformation for skewed distribution
        train_df['duration_log'] = np.log1p(train_df['duration'])
        test_df['duration_log'] = np.log1p(test_df['duration'])

        # High engagement binary flag
        train_df['duration_high_engagement'] = (train_df['duration'] > 300).astype(int)
        test_df['duration_high_engagement'] = (test_df['duration'] > 300).astype(int)

        return train_df, test_df
Features Created¶
Feature | Type | Description |
---|---|---|
duration_bin | Categorical | 4 engagement levels (very_short, short, medium, long) |
duration_log | Numerical | Log-transformed duration to handle skewness |
duration_high_engagement | Binary | Flag for calls > 300 seconds |
Data Transformation Example¶
Before Processing¶
duration | y | Description |
---|---|---|
117 | 0 | 117 seconds, no subscription |
185 | 0 | 185 seconds, no subscription |
111 | 0 | 111 seconds, no subscription |
10 | 0 | 10 seconds, no subscription |
After Processing¶
duration | duration_bin | duration_log | duration_high_engagement | y | Analysis |
---|---|---|---|---|---|
117 | very_short | 4.77 | 0 | 0 | Very short call, low engagement |
185 | short | 5.23 | 0 | 0 | Standard call, low engagement |
111 | very_short | 4.72 | 0 | 0 | Quick rejection, very low engagement |
10 | very_short | 2.40 | 0 | 0 | Immediate hang-up, no engagement |
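The table above can be reproduced from the four sample durations. This is a minimal sketch that applies the three transformations directly to a small DataFrame; the variable name `sample` is only illustrative.

```python
import numpy as np
import pandas as pd

# The four rows from the "Before Processing" table
sample = pd.DataFrame({"duration": [117, 185, 111, 10], "y": [0, 0, 0, 0]})

sample["duration_bin"] = pd.cut(
    sample["duration"],
    bins=[0, 120, 300, 600, float("inf")],
    labels=["very_short", "short", "medium", "long"],
    include_lowest=True,
).astype(str)

# Rounded to two decimals only for display, as in the table
sample["duration_log"] = np.log1p(sample["duration"]).round(2)
sample["duration_high_engagement"] = (sample["duration"] > 300).astype(int)

print(sample.to_string(index=False))
```

None of the four calls exceeds 300 seconds, so the flag is 0 in every row.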
Expected Impact¶
- Capture non-linear patterns: Binning reveals engagement thresholds
- Handle distribution skewness: Log transformation compresses the long right tail, which is expected to help models that are sensitive to feature scale
- Create interpretable features: Business-meaningful engagement levels
- Preserve original signal: Keep original duration alongside engineered features
Results¶
MLflow Performance¶
Duration feature treatment results:
Single Model Performance (80/20 split):
- Test AUC: 0.96844
K-Fold Cross-Validation (5 folds):
- Average AUC: 0.9686
Performance is consistent: the single-split test AUC (0.96844) and the 5-fold average (0.9686) agree closely, with low variance across cross-validation folds. The improvement over baseline is marginal.
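One property worth keeping in mind when reading AUC numbers for this step: AUC depends only on how examples are ranked, so a strictly increasing transform such as `np.log1p()` cannot change the AUC that duration achieves on its own. The sketch below illustrates this on synthetic stand-in data (the distribution parameters are assumptions), with a hypothetical helper `pairwise_auc` that computes AUC directly from its pairwise definition. Whether this accounts for the marginal improvement reported here is not something the source confirms.

```python
import numpy as np


def pairwise_auc(y_true, scores):
    """AUC as P(score_pos > score_neg) + 0.5 * P(tie), over all pos/neg pairs."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (pos.size * neg.size)


# Synthetic stand-in data (assumption): NOT the competition dataset.
rng = np.random.default_rng(42)
y = rng.binomial(1, 0.12, size=2_000)
duration = np.round(rng.lognormal(mean=np.where(y == 1, 6.0, 5.0), sigma=0.8))

auc_raw = pairwise_auc(y, duration)
auc_log = pairwise_auc(y, np.log1p(duration))

print(f"AUC raw = {auc_raw:.4f}, AUC log1p = {auc_log:.4f}")
```

The two values coincide because the ordering of calls by duration is the same before and after the transform. Models that split on thresholds, such as decision trees, are similarly unaffected by monotone rescaling of a single feature, while linear models are not.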
Classification Metrics¶
Performance comparison with baseline results:
- False Positives: Increased from 3,702 to 5,374 (+1,672)
- False Negatives: Reduced from 6,187 to 4,409 (-1,778)
- Trade-off: Higher recall at the cost of more false positives; total misclassifications fall slightly, from 9,889 to 9,783 (-106)
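The arithmetic behind this comparison can be checked directly from the reported counts. The sketch also derives a break-even cost ratio, which is an added illustration and not a figure from the source: the change pays off whenever a false negative costs more than about 0.94 times a false positive.

```python
# Error counts as reported above
baseline = {"fp": 3702, "fn": 6187}
treated = {"fp": 5374, "fn": 4409}

delta_fp = treated["fp"] - baseline["fp"]   # additional false positives
delta_fn = treated["fn"] - baseline["fn"]   # change in false negatives (negative = fewer)
net_error_change = delta_fp + delta_fn      # change in total misclassifications

# Extra false positives accepted per false negative avoided
break_even_ratio = delta_fp / -delta_fn

print(f"FP change: {delta_fp:+d}, FN change: {delta_fn:+d}, net: {net_error_change:+d}")
print(f"break-even FN/FP cost ratio: {break_even_ratio:.2f}")
```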
Kaggle Competition Results¶
Competition submission results:
- Competition Score: 0.969
- Leaderboard Position: 1170