Introduction
Credit risk assessment is one of the most critical applications of machine learning in the financial industry. Banks and lending institutions need to evaluate whether a loan applicant is likely to repay their debt or default. In this fourth episode of our Mastering Financial Data Science with Kaggle series, we’ll dive deep into building a robust credit risk scoring model using gradient boosting algorithms.
Building on the feature engineering techniques we explored in episode 3, we’ll now apply advanced machine learning models—specifically XGBoost and LightGBM—to predict loan approval outcomes. We’ll also explore feature importance analysis and model evaluation metrics that are essential for production credit scoring systems.
Understanding Credit Risk Modeling
Credit risk modeling aims to quantify the probability that a borrower will default on their financial obligations. In the classic logistic regression formulation, the default probability is expressed as:

$$P(\text{default}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n)}}$$

Where:
- $P(\text{default})$ is the probability of default
- $\beta_0$ is the intercept term
- $\beta_1, \ldots, \beta_n$ are the coefficients for each feature
- $x_1, \ldots, x_n$ are the input features (income, credit history, debt-to-income ratio, etc.)
Modern machine learning models learn these relationships automatically from data, often capturing non-linear patterns that traditional logistic regression cannot.
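To see why this matters, here is a small synthetic sketch (not the Kaggle data): a target driven purely by a feature interaction, which a plain logistic regression cannot separate but a gradient-boosted model captures easily. All names and values below are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(4000, 2))
# "Default" only when both features have the same sign (an XOR-like interaction)
y = ((X[:, 0] * X[:, 1]) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lr_auc = roc_auc_score(y_te, LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
gb_auc = roc_auc_score(y_te, GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
print(f"Logistic regression AUC: {lr_auc:.2f}")  # near 0.5 (chance level)
print(f"Gradient boosting AUC:   {gb_auc:.2f}")  # near 1.0
```

The linear model scores at chance because no linear combination of the raw features separates the classes; the tree ensemble learns the interaction automatically.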
Dataset Selection and Loading
For this tutorial, we’ll use the Home Credit Default Risk dataset from Kaggle, which contains real-world loan application data with various demographic and financial features.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, classification_report
import xgboost as xgb
import lightgbm as lgb
import warnings
warnings.filterwarnings('ignore')
# Load the dataset
application_train = pd.read_csv('application_train.csv')
application_test = pd.read_csv('application_test.csv')
print(f"Training data shape: {application_train.shape}")
print(f"Test data shape: {application_test.shape}")
print(f"\nTarget distribution:\n{application_train['TARGET'].value_counts(normalize=True)}")
Exploratory Data Analysis for Credit Risk
Before building models, let’s examine the distribution of key features and their relationship with default probability.
# Examine basic statistics
print(application_train.describe())
# Check missing values
missing_data = application_train.isnull().sum().sort_values(ascending=False)
missing_percent = (missing_data / len(application_train)) * 100
missing_df = pd.DataFrame({'Missing_Count': missing_data, 'Percentage': missing_percent})
print(missing_df.head(20))
# Visualize target distribution
plt.figure(figsize=(8, 5))
sns.countplot(x='TARGET', data=application_train)
plt.title('Distribution of Loan Default (0=Repaid, 1=Default)')
plt.xlabel('Target')
plt.ylabel('Count')
plt.show()
Analyzing Key Features
Let’s examine how critical features correlate with default probability:
# Income vs Default Rate
income_groups = pd.qcut(application_train['AMT_INCOME_TOTAL'], q=10, duplicates='drop')
default_by_income = application_train.groupby(income_groups, observed=True)['TARGET'].mean()
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
default_by_income.plot(kind='bar')
plt.title('Default Rate by Income Decile')
plt.xlabel('Income Group')
plt.ylabel('Default Rate')
plt.xticks(rotation=45)
# Credit Amount vs Default Rate
credit_groups = pd.qcut(application_train['AMT_CREDIT'], q=10, duplicates='drop')
default_by_credit = application_train.groupby(credit_groups, observed=True)['TARGET'].mean()
plt.subplot(1, 2, 2)
default_by_credit.plot(kind='bar', color='orange')
plt.title('Default Rate by Credit Amount Decile')
plt.xlabel('Credit Amount Group')
plt.ylabel('Default Rate')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Feature Engineering for Credit Scoring
Effective feature engineering significantly improves credit risk models. Let’s create domain-specific features:
def create_credit_features(df):
    """
    Create engineered features for credit risk modeling
    """
    df = df.copy()
    # Credit to Income Ratio
    df['CREDIT_INCOME_RATIO'] = df['AMT_CREDIT'] / (df['AMT_INCOME_TOTAL'] + 1)
    # Annuity to Income Ratio (monthly payment burden)
    df['ANNUITY_INCOME_RATIO'] = df['AMT_ANNUITY'] / (df['AMT_INCOME_TOTAL'] + 1)
    # Approximate credit term in months (total credit / monthly annuity)
    df['CREDIT_TERM'] = df['AMT_CREDIT'] / (df['AMT_ANNUITY'] + 1)
    # Days employed to age ratio
    df['EMPLOYED_TO_AGE_RATIO'] = df['DAYS_EMPLOYED'] / (df['DAYS_BIRTH'] + 1)
    # External source combinations (credit bureau scores)
    df['EXT_SOURCE_MEAN'] = df[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean(axis=1)
    df['EXT_SOURCE_STD'] = df[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].std(axis=1)
    df['EXT_SOURCE_PROD'] = df['EXT_SOURCE_1'] * df['EXT_SOURCE_2'] * df['EXT_SOURCE_3']
    # Age in years (DAYS_BIRTH is stored as a negative day count)
    df['AGE_YEARS'] = -df['DAYS_BIRTH'] / 365
    # Employment length in years (clip negatives caused by sentinel values to 0)
    df['EMPLOYMENT_YEARS'] = (-df['DAYS_EMPLOYED'] / 365).clip(lower=0)
    return df
# Apply feature engineering
application_train = create_credit_features(application_train)
application_test = create_credit_features(application_test)
print("New features created successfully")
print(f"Total features: {application_train.shape[1]}")
Data Preprocessing Pipeline
from sklearn.preprocessing import LabelEncoder
def preprocess_data(train_df, test_df):
    """
    Preprocess data: handle missing values and encode categoricals
    """
    # Separate the target from the training features
    train_labels = train_df['TARGET']
    train_df = train_df.drop(columns=['TARGET'])
    # Identify categorical and numerical columns
    categorical_cols = train_df.select_dtypes(include=['object']).columns.tolist()
    numerical_cols = train_df.select_dtypes(include=['int64', 'float64']).columns.tolist()
    # Handle categorical variables with Label Encoding
    le = LabelEncoder()
    for col in categorical_cols:
        train_df[col] = train_df[col].astype(str)
        test_df[col] = test_df[col].astype(str)
        # Fit on combined data so categories unseen in train are still encodable
        combined = pd.concat([train_df[col], test_df[col]])
        le.fit(combined)
        train_df[col] = le.transform(train_df[col])
        test_df[col] = le.transform(test_df[col])
    # Impute missing numerical values with the *training* median
    # (assignment instead of chained fillna(..., inplace=True), which
    # pandas deprecates and which can silently fail on a slice)
    for col in numerical_cols:
        median = train_df[col].median()
        train_df[col] = train_df[col].fillna(median)
        test_df[col] = test_df[col].fillna(median)
    # Replace infinite values (e.g. from ratio features), then fill remaining NaNs
    train_df.replace([np.inf, -np.inf], np.nan, inplace=True)
    test_df.replace([np.inf, -np.inf], np.nan, inplace=True)
    train_df.fillna(0, inplace=True)
    test_df.fillna(0, inplace=True)
    return train_df, test_df, train_labels
# Preprocess the data
X_train_processed, X_test_processed, y_train = preprocess_data(
application_train.copy(),
application_test.copy()
)
print(f"Processed training shape: {X_train_processed.shape}")
print(f"Processed test shape: {X_test_processed.shape}")
Building the XGBoost Model
XGBoost (Extreme Gradient Boosting) is a powerful ensemble method that builds decision trees sequentially, where each tree corrects the errors of the previous ones. The objective function XGBoost minimizes is:

$$\mathcal{L} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

Where:
- $l$ is the loss function (log loss for binary classification)
- $y_i$ is the true label
- $\hat{y}_i$ is the predicted probability
- $\Omega(f_k)$ is the regularization term for tree $k$
- $K$ is the number of trees
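To make the loss term concrete, here is binary log loss computed by hand on a toy label/probability pair (values are made up) and cross-checked against scikit-learn's implementation:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.8, 0.1])

# Binary cross-entropy: the l(y_i, y-hat_i) term in the objective above
manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(f"Manual log loss:  {manual:.4f}")
print(f"sklearn log loss: {log_loss(y_true, y_prob):.4f}")
```

Confident, well-calibrated probabilities (0.9 for a true 1, 0.1 for a true 0) contribute little to the loss; the 0.6 prediction on a true default contributes the most.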
# Split data for validation
X_train, X_val, y_train_split, y_val = train_test_split(
X_train_processed, y_train, test_size=0.2, random_state=42, stratify=y_train
)
print(f"Training set: {X_train.shape}")
print(f"Validation set: {X_val.shape}")
# XGBoost model configuration
xgb_params = {
'objective': 'binary:logistic',
'eval_metric': 'auc',
'max_depth': 6,
'learning_rate': 0.05,
'subsample': 0.8,
'colsample_bytree': 0.8,
'min_child_weight': 3,
'gamma': 0.1,
'reg_alpha': 0.1,
'reg_lambda': 1.0,
'random_state': 42,
'n_jobs': -1
}
# Create DMatrix for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train_split)
dval = xgb.DMatrix(X_val, label=y_val)
# Train the model with early stopping
evallist = [(dtrain, 'train'), (dval, 'eval')]
xgb_model = xgb.train(
xgb_params,
dtrain,
num_boost_round=1000,
evals=evallist,
early_stopping_rounds=50,
verbose_eval=100
)
print(f"\nBest iteration: {xgb_model.best_iteration}")
print(f"Best AUC score: {xgb_model.best_score:.4f}")
Building the LightGBM Model
LightGBM is another gradient boosting framework. It uses a leaf-wise tree growth strategy (growing the leaf with the largest loss reduction first, rather than level by level), which often makes it faster and more memory-efficient than XGBoost on large datasets.
# LightGBM model configuration
lgb_params = {
'objective': 'binary',
'metric': 'auc',
'boosting_type': 'gbdt',
'num_leaves': 31,
'learning_rate': 0.05,
'feature_fraction': 0.8,
'bagging_fraction': 0.8,
'bagging_freq': 5,
'min_child_samples': 20,
'reg_alpha': 0.1,
'reg_lambda': 1.0,
'random_state': 42,
'n_jobs': -1,
'verbose': -1
}
# Create Dataset for LightGBM
lgb_train = lgb.Dataset(X_train, label=y_train_split)
lgb_val = lgb.Dataset(X_val, label=y_val, reference=lgb_train)
# Train the model
lgb_model = lgb.train(
lgb_params,
lgb_train,
num_boost_round=1000,
valid_sets=[lgb_train, lgb_val],
valid_names=['train', 'valid'],
callbacks=[lgb.early_stopping(stopping_rounds=50), lgb.log_evaluation(100)]
)
print(f"\nBest iteration: {lgb_model.best_iteration}")
print(f"Best AUC score: {lgb_model.best_score['valid']['auc']:.4f}")
Feature Importance Analysis
Understanding which features drive predictions is crucial for credit risk models, both for interpretability and regulatory compliance.
# XGBoost feature importance
xgb_importance = xgb_model.get_score(importance_type='gain')
xgb_importance_df = pd.DataFrame({
'feature': list(xgb_importance.keys()),
'importance': list(xgb_importance.values())
}).sort_values('importance', ascending=False)
# LightGBM feature importance
lgb_importance_df = pd.DataFrame({
'feature': X_train.columns,
'importance': lgb_model.feature_importance(importance_type='gain')
}).sort_values('importance', ascending=False)
# Visualize top features
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# XGBoost top 20 features
axes[0].barh(xgb_importance_df.head(20)['feature'], xgb_importance_df.head(20)['importance'])
axes[0].set_xlabel('Importance (Gain)')
axes[0].set_title('XGBoost: Top 20 Features')
axes[0].invert_yaxis()
# LightGBM top 20 features
axes[1].barh(lgb_importance_df.head(20)['feature'], lgb_importance_df.head(20)['importance'], color='orange')
axes[1].set_xlabel('Importance (Gain)')
axes[1].set_title('LightGBM: Top 20 Features')
axes[1].invert_yaxis()
plt.tight_layout()
plt.show()
print("\nTop 10 Most Important Features (XGBoost):")
print(xgb_importance_df.head(10))
Model Evaluation with AUC-ROC
The Area Under the ROC Curve (AUC-ROC) is the primary metric for credit risk models. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various classification thresholds:

$$\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}$$

Where:
- $TP$ = True Positives (correctly predicted defaults)
- $FN$ = False Negatives (missed defaults)
- $FP$ = False Positives (incorrectly predicted defaults)
- $TN$ = True Negatives (correctly predicted non-defaults)
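A quick sanity check of these definitions on a made-up confusion matrix (the counts below are purely illustrative):

```python
# Hypothetical counts: 100 actual defaults, 900 actual non-defaults
TP, FN, FP, TN = 80, 20, 150, 750

tpr = TP / (TP + FN)  # share of real defaults we catch (recall)
fpr = FP / (FP + TN)  # share of good applicants we wrongly flag
print(f"TPR = {tpr:.3f}, FPR = {fpr:.3f}")  # TPR = 0.800, FPR = 0.167
```

Each point on the ROC curve is one such (FPR, TPR) pair at a particular threshold; the AUC integrates over all of them.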
# Make predictions on validation set
dval_pred = xgb.DMatrix(X_val)
xgb_preds = xgb_model.predict(dval_pred)
lgb_preds = lgb_model.predict(X_val, num_iteration=lgb_model.best_iteration)
# Calculate AUC scores
xgb_auc = roc_auc_score(y_val, xgb_preds)
lgb_auc = roc_auc_score(y_val, lgb_preds)
print(f"XGBoost Validation AUC: {xgb_auc:.4f}")
print(f"LightGBM Validation AUC: {lgb_auc:.4f}")
# Plot ROC curves
xgb_fpr, xgb_tpr, _ = roc_curve(y_val, xgb_preds)
lgb_fpr, lgb_tpr, _ = roc_curve(y_val, lgb_preds)
plt.figure(figsize=(10, 7))
plt.plot(xgb_fpr, xgb_tpr, label=f'XGBoost (AUC = {xgb_auc:.4f})', linewidth=2)
plt.plot(lgb_fpr, lgb_tpr, label=f'LightGBM (AUC = {lgb_auc:.4f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier', linewidth=1)
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curve Comparison', fontsize=14)
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.show()
Confusion Matrix and Classification Metrics
# Select optimal threshold based on validation set
from sklearn.metrics import precision_recall_curve
def find_optimal_threshold(y_true, y_pred_proba):
    """
    Find the classification threshold that maximizes the F1 score
    """
    precision, recall, thresholds = precision_recall_curve(y_true, y_pred_proba)
    # precision/recall have one more entry than thresholds, hence the bounds check below
    f1_scores = 2 * (precision * recall) / (precision + recall + 1e-8)
    optimal_idx = np.argmax(f1_scores)
    return thresholds[optimal_idx] if optimal_idx < len(thresholds) else 0.5
xgb_threshold = find_optimal_threshold(y_val, xgb_preds)
lgb_threshold = find_optimal_threshold(y_val, lgb_preds)
print(f"Optimal XGBoost threshold: {xgb_threshold:.4f}")
print(f"Optimal LightGBM threshold: {lgb_threshold:.4f}")
# Convert probabilities to binary predictions
xgb_binary = (xgb_preds >= xgb_threshold).astype(int)
lgb_binary = (lgb_preds >= lgb_threshold).astype(int)
# Confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.heatmap(confusion_matrix(y_val, xgb_binary), annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_title(f'XGBoost Confusion Matrix (threshold={xgb_threshold:.3f})')
axes[0].set_ylabel('True Label')
axes[0].set_xlabel('Predicted Label')
sns.heatmap(confusion_matrix(y_val, lgb_binary), annot=True, fmt='d', cmap='Oranges', ax=axes[1])
axes[1].set_title(f'LightGBM Confusion Matrix (threshold={lgb_threshold:.3f})')
axes[1].set_ylabel('True Label')
axes[1].set_xlabel('Predicted Label')
plt.tight_layout()
plt.show()
# Detailed classification reports
print("\n" + "="*60)
print("XGBoost Classification Report:")
print("="*60)
print(classification_report(y_val, xgb_binary, target_names=['Non-Default', 'Default']))
print("\n" + "="*60)
print("LightGBM Classification Report:")
print("="*60)
print(classification_report(y_val, lgb_binary, target_names=['Non-Default', 'Default']))
Cross-Validation for Robust Evaluation
K-Fold cross-validation provides a more reliable estimate of model performance:
def cross_validate_model(X, y, model_type='xgboost', n_splits=5):
    """
    Perform stratified k-fold cross-validation
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    auc_scores = []
    for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), 1):
        X_train_fold, X_val_fold = X.iloc[train_idx], X.iloc[val_idx]
        y_train_fold, y_val_fold = y.iloc[train_idx], y.iloc[val_idx]
        if model_type == 'xgboost':
            dtrain = xgb.DMatrix(X_train_fold, label=y_train_fold)
            dval = xgb.DMatrix(X_val_fold, label=y_val_fold)
            # Early stopping on the fold's validation set, mirroring the LightGBM branch
            model = xgb.train(xgb_params, dtrain, num_boost_round=500,
                              evals=[(dval, 'eval')], early_stopping_rounds=50,
                              verbose_eval=False)
            preds = model.predict(dval)
        else:  # lightgbm
            train_data = lgb.Dataset(X_train_fold, label=y_train_fold)
            val_data = lgb.Dataset(X_val_fold, label=y_val_fold)
            model = lgb.train(lgb_params, train_data, num_boost_round=500,
                              valid_sets=[val_data],
                              callbacks=[lgb.early_stopping(50), lgb.log_evaluation(0)])
            preds = model.predict(X_val_fold, num_iteration=model.best_iteration)
        auc = roc_auc_score(y_val_fold, preds)
        auc_scores.append(auc)
        print(f"Fold {fold} AUC: {auc:.4f}")
    print(f"\nMean AUC: {np.mean(auc_scores):.4f} (+/- {np.std(auc_scores):.4f})")
    return auc_scores
print("XGBoost Cross-Validation:")
xgb_cv_scores = cross_validate_model(X_train_processed, y_train, model_type='xgboost')
print("\nLightGBM Cross-Validation:")
lgb_cv_scores = cross_validate_model(X_train_processed, y_train, model_type='lightgbm')
Model Interpretation: SHAP Values
For production credit scoring systems, model interpretability is essential for regulatory compliance and stakeholder trust. SHAP (SHapley Additive exPlanations) provides local and global interpretability:
import shap
# Calculate SHAP values for LightGBM (TreeExplainer is fast for tree ensembles)
explainer = shap.TreeExplainer(lgb_model)
shap_values = explainer.shap_values(X_val[:1000])  # subsample for visualization speed
# Older SHAP versions return a list [class_0, class_1] for binary models;
# keep the contributions toward the default class in that case
if isinstance(shap_values, list):
    shap_values = shap_values[1]
# Summary plot: global feature importance
plt.figure(figsize=(12, 8))
shap.summary_plot(shap_values, X_val[:1000], plot_type="bar", show=False)
plt.title('SHAP Feature Importance (Global)', fontsize=14)
plt.tight_layout()
plt.show()
# Detailed SHAP summary plot
plt.figure(figsize=(12, 8))
shap.summary_plot(shap_values, X_val[:1000], show=False)
plt.title('SHAP Values Distribution by Feature', fontsize=14)
plt.tight_layout()
plt.show()
Production Deployment Considerations
When deploying credit scoring models to production:
| Consideration | Best Practice |
|---|---|
| Model Monitoring | Track AUC, precision, recall over time; detect distribution drift |
| Threshold Calibration | Adjust classification threshold based on business costs (FP vs FN) |
| Fairness Testing | Evaluate model across demographic groups; ensure no discriminatory bias |
| Explainability | Provide rejection reasons using SHAP or LIME for transparency |
| Model Versioning | Use MLflow or similar tools to track model versions and performance |
| Regulatory Compliance | Document model development, validation, and monitoring for audits |
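As a sketch of the threshold-calibration row above: instead of maximizing F1, a lender can pick the threshold that minimizes expected business cost. The per-error costs below (`COST_FN`, `COST_FP`) and the toy data are entirely made up for illustration; in practice they come from loss-given-default and opportunity-cost estimates.

```python
import numpy as np

# Hypothetical costs: a missed default (FN) is assumed 10x costlier
# than wrongly rejecting a good applicant (FP)
COST_FN, COST_FP = 10.0, 1.0

def cost_optimal_threshold(y_true, y_prob, thresholds=np.linspace(0.01, 0.99, 99)):
    """Pick the threshold that minimizes total expected business cost."""
    costs = []
    for t in thresholds:
        pred = (y_prob >= t).astype(int)
        fp = np.sum((pred == 1) & (y_true == 0))
        fn = np.sum((pred == 0) & (y_true == 1))
        costs.append(COST_FP * fp + COST_FN * fn)
    return thresholds[int(np.argmin(costs))]

# Toy scores: the asymmetric costs push the threshold low enough
# to catch the borderline default at 0.35
y_true = np.array([0, 0, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.4, 0.35, 0.9])
print(cost_optimal_threshold(y_true, y_prob))
```

Because false negatives dominate the cost, the chosen threshold accepts one extra false positive (the 0.4 score) in exchange for catching both defaults.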
Conclusion
In this episode, we built a production-ready credit risk scoring model using XGBoost and LightGBM; validation AUC scores in the 0.75 range are typical for this dataset with these features. We covered:
- Domain-specific feature engineering for credit data
- Training and tuning gradient boosting models
- Comprehensive model evaluation with AUC-ROC, confusion matrices, and cross-validation
- Feature importance analysis and SHAP-based interpretability
- Production deployment considerations for credit scoring systems
The techniques we explored—particularly feature engineering and model evaluation—are directly applicable to other financial prediction tasks. In our next episode, Detecting Financial Fraud using Anomaly Detection Techniques, we’ll shift from supervised learning to unsupervised methods, exploring how to identify fraudulent transactions in highly imbalanced datasets using isolation forests, autoencoders, and statistical anomaly detection.
Stay tuned as we continue our journey through financial data science!