Introduction
Credit risk assessment is one of the most critical applications of machine learning in the financial industry. Banks and lending institutions need to evaluate whether a loan applicant is likely to repay their debt or default. In this fourth episode of our Mastering Financial Data Science with Kaggle series, we’ll dive deep into building a robust credit risk scoring model using gradient boosting algorithms.
Building on the feature engineering techniques we explored in episode 3, we’ll now apply advanced machine learning models—specifically XGBoost and LightGBM—to predict loan approval outcomes. We’ll also explore feature importance analysis and model evaluation metrics that are essential for production credit scoring systems.
Understanding Credit Risk Modeling
Credit risk modeling aims to quantify the probability that a borrower will default on their financial obligations. In the classic logistic regression formulation, the default probability is expressed as:

$$P(\text{default}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n)}}$$

Where:
- $P(\text{default})$ is the probability of default
- $\beta_0$ is the intercept term
- $\beta_1, \ldots, \beta_n$ are the coefficients for each feature
- $x_1, \ldots, x_n$ are the input features (income, credit history, debt-to-income ratio, etc.)
Modern machine learning models learn these relationships automatically from data, often capturing non-linear patterns that traditional logistic regression cannot.
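To see why this matters, here is a small synthetic sketch (not the Kaggle data): a target driven purely by a feature interaction, which a plain logistic regression cannot separate but a gradient-boosted model captures easily. All names and values below are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(4000, 2))
# "Default" only when both features have the same sign (an XOR-like interaction)
y = ((X[:, 0] * X[:, 1]) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lr_auc = roc_auc_score(y_te, LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
gb_auc = roc_auc_score(y_te, GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
print(f"Logistic regression AUC: {lr_auc:.2f}")  # near 0.5 (chance level)
print(f"Gradient boosting AUC:   {gb_auc:.2f}")  # near 1.0
```

The linear model scores at chance because no linear combination of the raw features separates the classes; the tree ensemble learns the interaction automatically.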
Dataset Selection and Loading
For this tutorial, we’ll use the Home Credit Default Risk dataset from Kaggle, which contains real-world loan application data with various demographic and financial features.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, classification_report
import xgboost as xgb
import lightgbm as lgb
import warnings
warnings.filterwarnings('ignore')
# Load the dataset
application_train = pd.read_csv('application_train.csv')
application_test = pd.read_csv('application_test.csv')
print(f"Training data shape: {application_train.shape}")
print(f"Test data shape: {application_test.shape}")
print(f"\nTarget distribution:\n{application_train['TARGET'].value_counts(normalize=True)}")
Exploratory Data Analysis for Credit Risk
Before building models, let’s examine the distribution of key features and their relationship with default probability.
# Examine basic statistics
print(application_train.describe())
# Check missing values
missing_data = application_train.isnull().sum().sort_values(ascending=False)
missing_percent = (missing_data / len(application_train)) * 100
missing_df = pd.DataFrame({'Missing_Count': missing_data, 'Percentage': missing_percent})
print(missing_df.head(20))
# Visualize target distribution
plt.figure(figsize=(8, 5))
sns.countplot(x='TARGET', data=application_train)
plt.title('Distribution of Loan Default (0=Repaid, 1=Default)')
plt.xlabel('Target')
plt.ylabel('Count')
plt.show()
Analyzing Key Features
Let’s examine how critical features correlate with default probability:
# Income vs Default Rate
income_groups = pd.qcut(application_train['AMT_INCOME_TOTAL'], q=10, duplicates='drop')
default_by_income = application_train.groupby(income_groups, observed=True)['TARGET'].mean()
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
default_by_income.plot(kind='bar')
plt.title('Default Rate by Income Decile')
plt.xlabel('Income Group')
plt.ylabel('Default Rate')
plt.xticks(rotation=45)
# Credit Amount vs Default Rate
credit_groups = pd.qcut(application_train['AMT_CREDIT'], q=10, duplicates='drop')
default_by_credit = application_train.groupby(credit_groups, observed=True)['TARGET'].mean()
plt.subplot(1, 2, 2)
default_by_credit.plot(kind='bar', color='orange')
plt.title('Default Rate by Credit Amount Decile')
plt.xlabel('Credit Amount Group')
plt.ylabel('Default Rate')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Feature Engineering for Credit Scoring
Effective feature engineering significantly improves credit risk models. Let’s create domain-specific features:
def create_credit_features(df):
    """
    Create engineered features for credit risk modeling
    """
    df = df.copy()
    # Credit to Income Ratio
    df['CREDIT_INCOME_RATIO'] = df['AMT_CREDIT'] / (df['AMT_INCOME_TOTAL'] + 1)
    # Annuity to Income Ratio (monthly payment burden)
    df['ANNUITY_INCOME_RATIO'] = df['AMT_ANNUITY'] / (df['AMT_INCOME_TOTAL'] + 1)
    # Approximate credit term in months (total credit / monthly annuity)
    df['CREDIT_TERM'] = df['AMT_CREDIT'] / (df['AMT_ANNUITY'] + 1)
    # Days employed to age ratio
    df['EMPLOYED_TO_AGE_RATIO'] = df['DAYS_EMPLOYED'] / (df['DAYS_BIRTH'] + 1)
    # External source combinations (credit bureau scores)
    df['EXT_SOURCE_MEAN'] = df[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean(axis=1)
    df['EXT_SOURCE_STD'] = df[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].std(axis=1)
    df['EXT_SOURCE_PROD'] = df['EXT_SOURCE_1'] * df['EXT_SOURCE_2'] * df['EXT_SOURCE_3']
    # Age in years (DAYS_BIRTH is stored as a negative day count)
    df['AGE_YEARS'] = -df['DAYS_BIRTH'] / 365
    # Employment length in years (clip negatives caused by sentinel values to 0)
    df['EMPLOYMENT_YEARS'] = (-df['DAYS_EMPLOYED'] / 365).clip(lower=0)
    return df
# Apply feature engineering
application_train = create_credit_features(application_train)
application_test = create_credit_features(application_test)
print("New features created successfully")
print(f"Total features: {application_train.shape[1]}")
Data Preprocessing Pipeline
from sklearn.preprocessing import LabelEncoder
def preprocess_data(train_df, test_df):
    """
    Preprocess data: handle missing values and encode categoricals
    """
    # Separate the target from the training features
    train_labels = train_df['TARGET']
    train_df = train_df.drop(columns=['TARGET'])
    # Identify categorical and numerical columns
    categorical_cols = train_df.select_dtypes(include=['object']).columns.tolist()
    numerical_cols = train_df.select_dtypes(include=['int64', 'float64']).columns.tolist()
    # Handle categorical variables with Label Encoding
    le = LabelEncoder()
    for col in categorical_cols:
        train_df[col] = train_df[col].astype(str)
        test_df[col] = test_df[col].astype(str)
        # Fit on combined data so categories unseen in train are still encodable
        combined = pd.concat([train_df[col], test_df[col]])
        le.fit(combined)
        train_df[col] = le.transform(train_df[col])
        test_df[col] = le.transform(test_df[col])
    # Impute missing numerical values with the *training* median
    # (assignment instead of chained fillna(..., inplace=True), which
    # pandas deprecates and which can silently fail on a slice)
    for col in numerical_cols:
        median = train_df[col].median()
        train_df[col] = train_df[col].fillna(median)
        test_df[col] = test_df[col].fillna(median)
    # Replace infinite values (e.g. from ratio features), then fill remaining NaNs
    train_df.replace([np.inf, -np.inf], np.nan, inplace=True)
    test_df.replace([np.inf, -np.inf], np.nan, inplace=True)
    train_df.fillna(0, inplace=True)
    test_df.fillna(0, inplace=True)
    return train_df, test_df, train_labels
# Preprocess the data
X_train_processed, X_test_processed, y_train = preprocess_data(
application_train.copy(),
application_test.copy()
)
print(f"Processed training shape: {X_train_processed.shape}")
print(f"Processed test shape: {X_test_processed.shape}")
Building the XGBoost Model
XGBoost (Extreme Gradient Boosting) is a powerful ensemble method that builds decision trees sequentially, where each tree corrects the errors of the previous ones. The objective function XGBoost minimizes is:

$$\mathcal{L} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

Where:
- $l$ is the loss function (log loss for binary classification)
- $y_i$ is the true label
- $\hat{y}_i$ is the predicted probability
- $\Omega(f_k)$ is the regularization term for tree $k$
- $K$ is the number of trees
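To make the loss term concrete, here is binary log loss computed by hand on a toy label/probability pair (values are made up) and cross-checked against scikit-learn's implementation:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.8, 0.1])

# Binary cross-entropy: the l(y_i, y-hat_i) term in the objective above
manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(f"Manual log loss:  {manual:.4f}")
print(f"sklearn log loss: {log_loss(y_true, y_prob):.4f}")
```

Confident, well-calibrated probabilities (0.9 for a true 1, 0.1 for a true 0) contribute little to the loss; the 0.6 prediction on a true default contributes the most.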
# Split data for validation
X_train, X_val, y_train_split, y_val = train_test_split(
X_train_processed, y_train, test_size=0.2, random_state=42, stratify=y_train
)
print(f"Training set: {X_train.shape}")
print(f"Validation set: {X_val.shape}")
# XGBoost model configuration
xgb_params = {
'objective': 'binary:logistic',
'eval_metric': 'auc',
'max_depth': 6,
'learning_rate': 0.05,
'subsample': 0.8,
'colsample_bytree': 0.8,
'min_child_weight': 3,
'gamma': 0.1,
'reg_alpha': 0.1,
'reg_lambda': 1.0,
'random_state': 42,
'n_jobs': -1
}
# Create DMatrix for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train_split)
dval = xgb.DMatrix(X_val, label=y_val)
# Train the model with early stopping
evallist = [(dtrain, 'train'), (dval, 'eval')]
xgb_model = xgb.train(
xgb_params,
dtrain,
num_boost_round=1000,
evals=evallist,
early_stopping_rounds=50,
verbose_eval=100
)
print(f"\nBest iteration: {xgb_model.best_iteration}")
print(f"Best AUC score: {xgb_model.best_score:.4f}")
Building the LightGBM Model
LightGBM is another gradient boosting framework. It uses a leaf-wise tree growth strategy (growing the leaf with the largest loss reduction first, rather than level by level), which often makes it faster and more memory-efficient than XGBoost on large datasets.
# LightGBM model configuration
lgb_params = {
'objective': 'binary',
'metric': 'auc',
'boosting_type': 'gbdt',
'num_leaves': 31,
'learning_rate': 0.05,
'feature_fraction': 0.8,
'bagging_fraction': 0.8,
'bagging_freq': 5,
'min_child_samples': 20,
'reg_alpha': 0.1,
'reg_lambda': 1.0,
'random_state': 42,
'n_jobs': -1,
'verbose': -1
}
# Create Dataset for LightGBM
lgb_train = lgb.Dataset(X_train, label=y_train_split)
lgb_val = lgb.Dataset(X_val, label=y_val, reference=lgb_train)
# Train the model
lgb_model = lgb.train(
lgb_params,
lgb_train,
num_boost_round=1000,
valid_sets=[lgb_train, lgb_val],
valid_names=['train', 'valid'],
callbacks=[lgb.early_stopping(stopping_rounds=50), lgb.log_evaluation(100)]
)
print(f"\nBest iteration: {lgb_model.best_iteration}")
print(f"Best AUC score: {lgb_model.best_score['valid']['auc']:.4f}")
Feature Importance Analysis
Understanding which features drive predictions is crucial for credit risk models, both for interpretability and regulatory compliance.
# XGBoost feature importance
xgb_importance = xgb_model.get_score(importance_type='gain')
xgb_importance_df = pd.DataFrame({
'feature': list(xgb_importance.keys()),
'importance': list(xgb_importance.values())
}).sort_values('importance', ascending=False)
# LightGBM feature importance
lgb_importance_df = pd.DataFrame({
'feature': X_train.columns,
'importance': lgb_model.feature_importance(importance_type='gain')
}).sort_values('importance', ascending=False)
# Visualize top features
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# XGBoost top 20 features
axes[0].barh(xgb_importance_df.head(20)['feature'], xgb_importance_df.head(20)['importance'])
axes[0].set_xlabel('Importance (Gain)')
axes[0].set_title('XGBoost: Top 20 Features')
axes[0].invert_yaxis()
# LightGBM top 20 features
axes[1].barh(lgb_importance_df.head(20)['feature'], lgb_importance_df.head(20)['importance'], color='orange')
axes[1].set_xlabel('Importance (Gain)')
axes[1].set_title('LightGBM: Top 20 Features')
axes[1].invert_yaxis()
plt.tight_layout()
plt.show()
print("\nTop 10 Most Important Features (XGBoost):")
print(xgb_importance_df.head(10))
Model Evaluation with AUC-ROC
The Area Under the ROC Curve (AUC-ROC) is the primary metric for credit risk models. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various classification thresholds:

$$\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}$$

Where:
- $TP$ = True Positives (correctly predicted defaults)
- $FN$ = False Negatives (missed defaults)
- $FP$ = False Positives (incorrectly predicted defaults)
- $TN$ = True Negatives (correctly predicted non-defaults)
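A quick sanity check of these definitions on a made-up confusion matrix (the counts below are purely illustrative):

```python
# Hypothetical counts: 100 actual defaults, 900 actual non-defaults
TP, FN, FP, TN = 80, 20, 150, 750

tpr = TP / (TP + FN)  # share of real defaults we catch (recall)
fpr = FP / (FP + TN)  # share of good applicants we wrongly flag
print(f"TPR = {tpr:.3f}, FPR = {fpr:.3f}")  # TPR = 0.800, FPR = 0.167
```

Each point on the ROC curve is one such (FPR, TPR) pair at a particular threshold; the AUC integrates over all of them.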
# Make predictions on validation set
dval_pred = xgb.DMatrix(X_val)
xgb_preds = xgb_model.predict(dval_pred)
lgb_preds = lgb_model.predict(X_val, num_iteration=lgb_model.best_iteration)
# Calculate AUC scores
xgb_auc = roc_auc_score(y_val, xgb_preds)
lgb_auc = roc_auc_score(y_val, lgb_preds)
print(f"XGBoost Validation AUC: {xgb_auc:.4f}")
print(f"LightGBM Validation AUC: {lgb_auc:.4f}")
# Plot ROC curves
xgb_fpr, xgb_tpr, _ = roc_curve(y_val, xgb_preds)
lgb_fpr, lgb_tpr, _ = roc_curve(y_val, lgb_preds)
plt.figure(figsize=(10, 7))
plt.plot(xgb_fpr, xgb_tpr, label=f'XGBoost (AUC = {xgb_auc:.4f})', linewidth=2)
plt.plot(lgb_fpr, lgb_tpr, label=f'LightGBM (AUC = {lgb_auc:.4f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier', linewidth=1)
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curve Comparison', fontsize=14)
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.show()
Confusion Matrix and Classification Metrics
# Select optimal threshold based on validation set
from sklearn.metrics import precision_recall_curve
def find_optimal_threshold(y_true, y_pred_proba):
    """
    Find the classification threshold that maximizes the F1 score
    """
    precision, recall, thresholds = precision_recall_curve(y_true, y_pred_proba)
    # precision/recall have one more entry than thresholds, hence the bounds check below
    f1_scores = 2 * (precision * recall) / (precision + recall + 1e-8)
    optimal_idx = np.argmax(f1_scores)
    return thresholds[optimal_idx] if optimal_idx < len(thresholds) else 0.5
xgb_threshold = find_optimal_threshold(y_val, xgb_preds)
lgb_threshold = find_optimal_threshold(y_val, lgb_preds)
print(f"Optimal XGBoost threshold: {xgb_threshold:.4f}")
print(f"Optimal LightGBM threshold: {lgb_threshold:.4f}")
# Convert probabilities to binary predictions
xgb_binary = (xgb_preds >= xgb_threshold).astype(int)
lgb_binary = (lgb_preds >= lgb_threshold).astype(int)
# Confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.heatmap(confusion_matrix(y_val, xgb_binary), annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_title(f'XGBoost Confusion Matrix (threshold={xgb_threshold:.3f})')
axes[0].set_ylabel('True Label')
axes[0].set_xlabel('Predicted Label')
sns.heatmap(confusion_matrix(y_val, lgb_binary), annot=True, fmt='d', cmap='Oranges', ax=axes[1])
axes[1].set_title(f'LightGBM Confusion Matrix (threshold={lgb_threshold:.3f})')
axes[1].set_ylabel('True Label')
axes[1].set_xlabel('Predicted Label')
plt.tight_layout()
plt.show()
# Detailed classification reports
print("\n" + "="*60)
print("XGBoost Classification Report:")
print("="*60)
print(classification_report(y_val, xgb_binary, target_names=['Non-Default', 'Default']))
print("\n" + "="*60)
print("LightGBM Classification Report:")
print("="*60)
print(classification_report(y_val, lgb_binary, target_names=['Non-Default', 'Default']))
Cross-Validation for Robust Evaluation
K-Fold cross-validation provides a more reliable estimate of model performance:
def cross_validate_model(X, y, model_type='xgboost', n_splits=5):
    """
    Perform stratified k-fold cross-validation
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    auc_scores = []
    for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), 1):
        X_train_fold, X_val_fold = X.iloc[train_idx], X.iloc[val_idx]
        y_train_fold, y_val_fold = y.iloc[train_idx], y.iloc[val_idx]
        if model_type == 'xgboost':
            dtrain = xgb.DMatrix(X_train_fold, label=y_train_fold)
            dval = xgb.DMatrix(X_val_fold, label=y_val_fold)
            # Early stopping on the fold's validation set, mirroring the LightGBM branch
            model = xgb.train(xgb_params, dtrain, num_boost_round=500,
                              evals=[(dval, 'eval')], early_stopping_rounds=50,
                              verbose_eval=False)
            preds = model.predict(dval)
        else:  # lightgbm
            train_data = lgb.Dataset(X_train_fold, label=y_train_fold)
            val_data = lgb.Dataset(X_val_fold, label=y_val_fold)
            model = lgb.train(lgb_params, train_data, num_boost_round=500,
                              valid_sets=[val_data],
                              callbacks=[lgb.early_stopping(50), lgb.log_evaluation(0)])
            preds = model.predict(X_val_fold, num_iteration=model.best_iteration)
        auc = roc_auc_score(y_val_fold, preds)
        auc_scores.append(auc)
        print(f"Fold {fold} AUC: {auc:.4f}")
    print(f"\nMean AUC: {np.mean(auc_scores):.4f} (+/- {np.std(auc_scores):.4f})")
    return auc_scores
print("XGBoost Cross-Validation:")
xgb_cv_scores = cross_validate_model(X_train_processed, y_train, model_type='xgboost')
print("\nLightGBM Cross-Validation:")
lgb_cv_scores = cross_validate_model(X_train_processed, y_train, model_type='lightgbm')
Model Interpretation: SHAP Values
For production credit scoring systems, model interpretability is essential for regulatory compliance and stakeholder trust. SHAP (SHapley Additive exPlanations) provides local and global interpretability:
import shap
# Calculate SHAP values for LightGBM (TreeExplainer is fast for tree ensembles)
explainer = shap.TreeExplainer(lgb_model)
shap_values = explainer.shap_values(X_val[:1000])  # subsample for visualization speed
# Older SHAP versions return a list [class_0, class_1] for binary models;
# keep the contributions toward the default class in that case
if isinstance(shap_values, list):
    shap_values = shap_values[1]
# Summary plot: global feature importance
plt.figure(figsize=(12, 8))
shap.summary_plot(shap_values, X_val[:1000], plot_type="bar", show=False)
plt.title('SHAP Feature Importance (Global)', fontsize=14)
plt.tight_layout()
plt.show()
# Detailed SHAP summary plot
plt.figure(figsize=(12, 8))
shap.summary_plot(shap_values, X_val[:1000], show=False)
plt.title('SHAP Values Distribution by Feature', fontsize=14)
plt.tight_layout()
plt.show()
Production Deployment Considerations
When deploying credit scoring models to production:
| Consideration | Best Practice |
|---|---|
| Model Monitoring | Track AUC, precision, recall over time; detect distribution drift |
| Threshold Calibration | Adjust classification threshold based on business costs (FP vs FN) |
| Fairness Testing | Evaluate model across demographic groups; ensure no discriminatory bias |
| Explainability | Provide rejection reasons using SHAP or LIME for transparency |
| Model Versioning | Use MLflow or similar tools to track model versions and performance |
| Regulatory Compliance | Document model development, validation, and monitoring for audits |
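As a sketch of the threshold-calibration row above: instead of maximizing F1, a lender can pick the threshold that minimizes expected business cost. The per-error costs below (`COST_FN`, `COST_FP`) and the toy data are entirely made up for illustration; in practice they come from loss-given-default and opportunity-cost estimates.

```python
import numpy as np

# Hypothetical costs: a missed default (FN) is assumed 10x costlier
# than wrongly rejecting a good applicant (FP)
COST_FN, COST_FP = 10.0, 1.0

def cost_optimal_threshold(y_true, y_prob, thresholds=np.linspace(0.01, 0.99, 99)):
    """Pick the threshold that minimizes total expected business cost."""
    costs = []
    for t in thresholds:
        pred = (y_prob >= t).astype(int)
        fp = np.sum((pred == 1) & (y_true == 0))
        fn = np.sum((pred == 0) & (y_true == 1))
        costs.append(COST_FP * fp + COST_FN * fn)
    return thresholds[int(np.argmin(costs))]

# Toy scores: the asymmetric costs push the threshold low enough
# to catch the borderline default at 0.35
y_true = np.array([0, 0, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.4, 0.35, 0.9])
print(cost_optimal_threshold(y_true, y_prob))
```

Because false negatives dominate the cost, the chosen threshold accepts one extra false positive (the 0.4 score) in exchange for catching both defaults.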
Conclusion
In this episode, we built a production-ready credit risk scoring model using XGBoost and LightGBM; validation AUC scores in the 0.75 range are typical for this dataset with these features. We covered:
- Domain-specific feature engineering for credit data
- Training and tuning gradient boosting models
- Comprehensive model evaluation with AUC-ROC, confusion matrices, and cross-validation
- Feature importance analysis and SHAP-based interpretability
- Production deployment considerations for credit scoring systems
The techniques we explored—particularly feature engineering and model evaluation—are directly applicable to other financial prediction tasks. In our next episode, Detecting Financial Fraud using Anomaly Detection Techniques, we’ll shift from supervised learning to unsupervised methods, exploring how to identify fraudulent transactions in highly imbalanced datasets using isolation forests, autoencoders, and statistical anomaly detection.
Stay tuned as we continue our journey through financial data science!