
Project - Machine Learning for Healthcare

Healthcare Utilization Prediction

Predicting healthcare utilization and medical expenditure using machine learning models on a comprehensive dataset of 108 features from a nationwide U.S. patient survey.

built with
  • Python
  • scikit-learn
  • XGBoost
  • Pandas
  • NumPy
  • GridSearchCV
  • Statistical Analysis
Artificial Intelligence and machine learning models have tremendous potential in the healthcare domain for precision medicine, diagnosis, treatment planning, and administrative optimization. This project develops predictive models for healthcare utilization using a dataset compiled from a nationwide survey of U.S. patients. The dataset comprises 108 features spanning demographics, personal characteristics, and health-related information, providing a comprehensive view of patient profiles. The project addresses both regression (predicting total medical expenditure) and classification (determining low vs. high healthcare utilization) tasks through rigorous preprocessing and feature engineering.

Project Objectives

Regression Task
Predict the continuous variable of total medical expenditure in US dollars to understand spending patterns and identify high-cost patients.
Classification Task
Classify patients into low or high healthcare utilization categories to enable targeted interventions and resource allocation.

ML Pipeline Overview

Four-Stage Pipeline

The ML workflow is structured in four sequential stages, from raw data to final model selection and deployment readiness.
Input: 108 features
Regression task: Medical Cost Prediction
Classification task: Utilization Category
Stage 1: Data Analysis
EDA, correlation matrix, statistical significance testing
Stage 2: Preprocessing
Encoding, scaling, transformation, feature engineering
Stage 3: Model Training
GridSearchCV tuning, cross-validation, parameter optimization
Stage 4: Evaluation
Performance metrics, model comparison, best selection

Data Analysis and Preprocessing

Exploratory Data Analysis

Extensive EDA was conducted on all 108 features using statistical summaries, visualizations (boxplots, regression plots, heatmaps), correlation analysis, and statistical testing (ANOVA, T-tests, chi-square) to identify key predictors.

Statistical Testing

ANOVA, T-tests, and chi-square tests were used to assess the significance of categorical features in relation to target variables, guiding feature selection for model development.

Data Preprocessing Pipeline

The preprocessing pipeline transformed raw survey data through four critical stages, each addressing specific data quality and feature engineering challenges to prepare the 108 features for optimal model performance.
Step 1: Missing Value Handling
Context-Aware Imputation Strategy
Survey data contains special codes (-1, -2, -7) indicating "inapplicable", "don't know", and "refused" responses. Each code was interpreted in context before applying appropriate imputation methods.
Special Code Interpretation
-1: Inapplicable (logical skip)
-2: Don't know (true missing)
-7: Refused to answer
Continuous Variables
Mean: Normally distributed
Median: Skewed distributions
Example: INCOME, AGE
Categorical Variables
Mode: Most frequent value
Preserves distribution
Example: REGION, MARITAL_STATUS
Value Interpretation
Contextual interpretation of categorical codes
Example: Binary values (1, 2)
→ Mapped to "Yes"/"No" based on feature context
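The imputation rules above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper name impute_survey_codes, the skewness threshold of 1.0, and the toy data are all assumptions, and treating all three codes as missing is a simplification (the text says -1 was interpreted per feature).

```python
import numpy as np
import pandas as pd

# Special survey codes named in the text. Treating all three as missing is an
# assumption; the original pipeline interpreted each code in context.
SPECIAL_CODES = [-1, -2, -7]

def impute_survey_codes(df, continuous_cols, categorical_cols, skew_threshold=1.0):
    """Replace special codes with NaN, then impute per column type."""
    out = df.copy()
    cols = list(continuous_cols) + list(categorical_cols)
    out[cols] = out[cols].replace(SPECIAL_CODES, np.nan)
    for col in continuous_cols:
        # Median for skewed columns, mean otherwise (threshold is illustrative).
        skewed = abs(out[col].skew()) > skew_threshold
        fill = out[col].median() if skewed else out[col].mean()
        out[col] = out[col].fillna(fill)
    for col in categorical_cols:
        # Mode imputation: most frequent observed value.
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Toy data: one refused AGE (-7) and one "don't know" REGION (-2).
demo = pd.DataFrame({
    "AGE": [34, 51, -7, 29, 62],
    "REGION": [1, 2, 2, -2, 3],
})
clean = impute_survey_codes(demo, ["AGE"], ["REGION"])
```

In a real pipeline the fill values would be computed on the training split only and then reused on the test split.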
Step 2: Feature Encoding
Multi-Strategy Encoding Approach
Different encoding techniques were applied based on feature cardinality and relationship to target variables, optimizing information retention while preventing dimensionality explosion.
Label Encoding
Used for: Binary features
SEX: Male → 0, Female → 1
MARRIED: No → 0, Yes → 1
Preserves natural ordering
One-Hot Encoding
Used for: Nominal categories (≤10 values)
REGION: Northeast, Midwest, South, West → 4 binary cols
Prevents ordinal assumptions
Target Encoding
Used for: High-cardinality features
Mean target per category
+ smoothing to avoid overfitting
Captures target relationship
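As a rough sketch of the one-hot and smoothed target encoding steps: the smoothing formula below is one common variant, not necessarily the one used in the project, and the OCCUPATION column, the helper name, and the toy values are hypothetical.

```python
import pandas as pd

def smoothed_target_encode(categories, target, smoothing=10.0):
    """Blend each category's mean target with the global mean.

    encoded = (n * category_mean + smoothing * global_mean) / (n + smoothing)
    Small categories are pulled toward the global mean.
    """
    df = pd.DataFrame({"cat": categories, "y": target})
    global_mean = df["y"].mean()
    stats = df.groupby("cat")["y"].agg(["mean", "count"])
    encoded = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return df["cat"].map(encoded)

demo = pd.DataFrame({
    "REGION": ["South", "West", "South", "Midwest"],
    "OCCUPATION": ["a", "a", "b", "c"],  # hypothetical high-cardinality column
    "EXPENDITURE": [100.0, 300.0, 200.0, 400.0],
})

# One-hot encoding: one binary column per observed REGION value.
one_hot = pd.get_dummies(demo["REGION"], prefix="REGION")

# Target encoding with a small smoothing weight for the toy data.
demo["OCC_TE"] = smoothed_target_encode(
    demo["OCCUPATION"], demo["EXPENDITURE"], smoothing=2.0
)
```

To avoid target leakage, the category statistics should be computed on training folds only and then applied to held-out rows.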
Step 3: Transformation
Distribution Normalization & Feature Engineering
Applied mathematical transformations to address skewness, create meaningful feature groups, and improve model convergence by reshaping distributions closer to normal.
Logarithmic Transforms
Target: Highly skewed features (income, expenditure)
$\log(\text{TOT\_INCOME} + 1)$
Normalizes distribution for improved stability
Impact: Improved regression stability and reduced outlier influence
Binning Strategies
Created: Categorical age groups from continuous age
0-17, 18-30, 31-45, 46-64, 65+
Benefit: Captures non-linear age effects on healthcare utilization
Feature Engineering Examples
Income Consolidation
Grouped TOT_INCOME and FAM_INCOME together to reduce dimensionality while preserving economic information
Chronic Condition Count
Aggregated binary indicators (diabetes, hypertension, heart disease) into single numeric feature
Healthcare Access Score
Combined insurance status, regular provider, and visit frequency
Socioeconomic Index
Weighted combination of income, education, and employment
Result: Domain-specific engineered features added to enhance model expressiveness
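The log transform, age binning, and chronic condition count could look roughly like this in pandas. The indicator column names (DIABETES, HYPERTENSION, HEART_DISEASE) and their 0/1 coding are assumptions for illustration; only TOT_INCOME and the age bands come from the text.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "TOT_INCOME": [0.0, 25_000.0, 480_000.0],
    "AGE": [12, 40, 70],
    "DIABETES": [0, 1, 1],        # hypothetical 0/1 indicator columns
    "HYPERTENSION": [0, 1, 0],
    "HEART_DISEASE": [0, 1, 0],
})

# log(x + 1) keeps zero incomes defined and compresses the right tail.
demo["LOG_TOT_INCOME"] = np.log1p(demo["TOT_INCOME"])

# Age bands from the text: 0-17, 18-30, 31-45, 46-64, 65+.
# right=False makes each bin include its left edge, e.g. [18, 31).
age_edges = [0, 18, 31, 46, 65, np.inf]
age_labels = ["0-17", "18-30", "31-45", "46-64", "65+"]
demo["AGE_GROUP"] = pd.cut(demo["AGE"], bins=age_edges, labels=age_labels, right=False)

# Chronic condition count: row-wise sum of the binary indicators.
demo["CHRONIC_COUNT"] = demo[["DIABETES", "HYPERTENSION", "HEART_DISEASE"]].sum(axis=1)
```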
Step 4: Scaling & Validation
Normalization & Multicollinearity Check
Final preprocessing stage ensuring features are on comparable scales and free from harmful collinearity that could destabilize model training and interpretation.
Min-Max Scaling
Range: [0, 1]
$x' = \frac{x - \min}{\max - \min}$
Applied to: Tree-based models (Random Forest, XGBoost)
Preserves zero values and distributional shape
Z-Score Normalization
Mean: 0, Std: 1
$z = \frac{x - \mu}{\sigma}$
Applied to: Linear models, Logistic Regression
Critical for gradient descent convergence
VIF Analysis
Threshold: VIF < 10
Removed features with VIF > 10
Result: Eliminated multicollinearity, improved coefficient stability
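A NumPy-only sketch of the two scalers and the VIF check is shown below. It relies on the identity that the VIF of each feature equals the corresponding diagonal entry of the inverse correlation matrix; the function names and synthetic data are illustrative and not taken from the project.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column to [0, 1]: (x - min) / (max - min)."""
    X = np.asarray(X, dtype=float)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0  # constant columns map to 0 instead of dividing by zero
    return (X - X.min(axis=0)) / span

def z_score(X):
    """Standardize each column to mean 0, std 1: (x - mu) / sigma."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0
    return (X - X.mean(axis=0)) / std

def vif(X):
    """VIF per column, read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)  # nearly a copy of `a`, so highly collinear
X = np.column_stack([a, b, c])

vifs = vif(X)
keep = vifs < 10  # threshold from the text
```

Here both collinear columns exceed the threshold at once. A common refinement is to drop only the highest-VIF feature, recompute, and repeat, so that one of a redundant pair survives.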
Outlier Handling Strategy
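Winsorization at the 1st and 99th percentiles was applied to extreme values in the expenditure and income features. Values beyond those percentiles are clipped rather than dropped, so every row is kept and the central 98% of values are unchanged, while the leverage of statistical outliers on the regression models is reduced.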
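Winsorization at fixed percentiles amounts to clipping, as in this minimal sketch. The function name and the synthetic log-normal "expenditure" values are assumptions for illustration.

```python
import numpy as np

def winsorize(values, lower_pct=1.0, upper_pct=99.0):
    """Clip values to the [lower_pct, upper_pct] percentile range."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

rng = np.random.default_rng(42)
expenditure = rng.lognormal(mean=8.0, sigma=1.5, size=1_000)  # synthetic, right-skewed
clipped = winsorize(expenditure)
```

The percentile bounds would normally be estimated on the training split and reused unchanged on the test split.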

Machine Learning Models

Regression Models (RMSE Focus)

Model Performance Comparison
Linear Regression
Role: Baseline model (simple, interpretable approach)
RMSE: 11,131.18
ElasticNet
Strength: Balanced regularization (L1 + L2 penalties combined)
RMSE: 10,964.77
Random Forest
Approach: Ensemble learning (multiple decision trees)
RMSE: 12,012.17
XGBoost
Method: Gradient boosting (sequential tree correction)
RMSE: 12,298.58

Classification Models (F1-Score Focus)

Model Comparison
Logistic Regression
Role: Baseline classifier (simple binary classifier)
F1 Score: 0.711
XGBoost
Method: Gradient boosting (sequential optimization)
F1 Score: 0.853
Random Forest
Strength: Ensemble diversity (multiple trees averaging)
F1 Score: 0.858

Model Selection & Hyperparameter Tuning

GridSearchCV was employed to systematically tune hyperparameters for each model, evaluating parameter combinations using cross-validation.
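A minimal sketch of such a search for the ElasticNet regressor is shown below. The grid values, the 5-fold setting, and the synthetic data are assumptions; the write-up does not state the grids or fold counts that were actually searched.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed feature matrix and expenditure target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + rng.normal(scale=0.5, size=200)

# Illustrative grid: 3 x 3 = 9 candidate settings.
param_grid = {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.1, 0.5, 0.9]}

search = GridSearchCV(
    ElasticNet(max_iter=10_000),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence negated RMSE
    cv=5,
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
```

The same pattern applies to the classifiers by swapping in the estimator and using scoring="f1".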

Regression Configuration

Linear Regression
fit_intercept: False
Selected because it gave a lower RMSE while keeping the model interpretable. With fit_intercept set to False the model has no intercept term, so the fit is constrained to pass through the origin.
ElasticNet (Best)
alpha: 0.1
l1_ratio: 0.9
90% L1 + 10% L2 penalty. Optimal balance between feature selection and regularization.
Random Forest
max_depth: 10
min_samples_split: 5
n_estimators: 200
Effective bias-variance trade-off. 200 trees with reasonable depth constraints.
XGBoost
learning_rate: 0.1
max_depth: 3
n_estimators: 100
subsample: 0.8
Conservative settings. Shallow trees prevent overfitting while capturing patterns.
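For reference, and assuming the model was fit with scikit-learn's ElasticNet (the parameter names alpha and l1_ratio match that API), the objective minimized with the selected ElasticNet settings is:

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
  \;+\; \alpha\,\rho\,\lVert w \rVert_1
  \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2,
\qquad \alpha = 0.1,\quad \rho = \texttt{l1\_ratio} = 0.9
```

With these values the L1 term carries weight 0.09 and the squared L2 term carries weight 0.005, which is where the "90% L1 + 10% L2" description comes from.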

Classification Configuration

Logistic Regression
C: 0.1
penalty: 'l1'
solver: 'saga'
L1 regularization for feature selection and interpretability.
XGBoost
learning_rate: 0.1
max_depth: 3
n_estimators: 200
subsample: 1.0
Gradual learning with conservative depth. All samples per iteration.
Random Forest (Best)
max_depth: None
min_samples_leaf: 2
min_samples_split: 2
n_estimators: 100
Unconstrained tree growth. Minimal leaf requirements for flexibility.

Empirical Results

Regression Performance

Best Model: ElasticNet
RMSE (Test Set)
10,964.77
ElasticNet's combination of L1 and L2 regularization balanced feature selection and generalization, achieving the lowest test RMSE of the four regression models.
Model Comparison
Linear Regression: 11,131
ElasticNet: 10,965
Random Forest: 12,012
XGBoost: 12,299

Classification Performance

Best Model: Random Forest
F1-Score / Accuracy
0.858 / 86.4%
Random Forest's ensemble approach achieved superior performance, with 95% recall on high-utilization patients (Class 1), enabling effective resource targeting.
Model Comparison
Logistic Regression: 0.711
XGBoost: 0.853
Random Forest: 0.858
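For readers less familiar with the metrics reported here, the sketch below shows how F1, recall, and accuracy follow from the confusion-matrix counts for the positive class (Class 1, high utilization). The labels are toy values, not project data.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1 and accuracy for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Toy example: 3 true positives, 1 false negative, 1 false positive, 3 true negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```

Because F1 ignores true negatives, it can diverge from accuracy when the classes are imbalanced, which is why both are reported.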
[Figure: Test Set Distribution & Class Imbalance]

Feature Selection Experiments

Recursive Feature Elimination with Cross-Validation (RFECV) and Sequential Feature Selection (SFS) were tested. Unexpectedly, both methods degraded performance with higher RMSE and lower R² values. This indicated that seemingly unimportant features contributed significantly when combined with others. Therefore, the full 108-feature set was retained for optimal results.
RFECV Results
Finding: Feature selection degraded performance
Removing features that appeared individually weak still reduced model accuracy. Complex feature interactions are crucial.
Decision: Retained full feature set
Feature Interactions
Implication: Non-linear combinations matter
Individual feature importance doesn't capture synergistic effects in ensemble and regularized models.
Result: 108 features optimal
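The comparison described above (full feature set versus an RFECV-selected subset) can be sketched as follows. The estimator settings mirror the reported ElasticNet configuration, but the synthetic data, fold count, and step size are assumptions, and this toy run is not expected to reproduce the project's finding.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 8 features, all of which contribute to the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=150)

estimator = ElasticNet(alpha=0.1, l1_ratio=0.9)
scoring = "neg_root_mean_squared_error"

# Cross-validated RMSE with every feature retained.
full_rmse = -cross_val_score(estimator, X, y, cv=5, scoring=scoring).mean()

# RFECV drops the weakest feature one step at a time and keeps the best subset size.
selector = RFECV(estimator, step=1, cv=5, scoring=scoring, min_features_to_select=1)
selector.fit(X, y)
X_selected = X[:, selector.support_]

# Cross-validated RMSE on the selected subset, for a like-for-like comparison.
selected_rmse = -cross_val_score(estimator, X_selected, y, cv=5, scoring=scoring).mean()
```

If selected_rmse comes out higher than full_rmse on the real data, as reported here, keeping the full feature set is the defensible choice.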

Key Findings & Insights

Regression Insights
  • Regularized models outperform complex ensemble methods
  • Medical expenditure highly variable
  • Feature interactions critical to predictions
  • L1+L2 balance most effective for this domain
Classification Insights
  • Ensemble methods vastly superior (0.86 vs 0.71 F1)
  • High recall for positive class (95%) is critical
  • Imbalanced data handled well by models
  • 86.4% accuracy suitable for real-world deployment

Conclusion

This project successfully developed robust predictive models for healthcare utilization. ElasticNet emerged as the optimal regression model (RMSE: 10,965), while Random Forest proved most effective for classification (F1: 0.858, 86.4% accuracy). The comprehensive preprocessing pipeline, feature engineering, and rigorous hyperparameter tuning ensured models capable of supporting clinical decision-making. Results demonstrate the viability of these approaches for patient stratification, resource allocation, and operational efficiency in healthcare systems.