Healthcare Utilization Prediction

Medical spending is one of the harder quantities to predict. It is heavily right-skewed, a small number of patients account for most of the total, and the features that explain ordinary spending stop explaining the tail. This project asks the cost question and the utilization question separately, on the same 108-feature profile drawn from a nationwide U.S. patient survey — and the two turn out to reward completely different models.

Source: Nationwide U.S. patient survey
Features: 108
Coverage: Demographics, characteristics, health status
Held-out test set: 3,000 samples
Regression target: Total medical expenditure
Classification target: Low vs high utilization

The same feature matrix feeds both tasks; only the target and the scoring metric change.

Two Questions, One Pipeline

Both tasks share the preparation work and diverge only at the modelling stage. Keeping the pipeline common makes the comparison fair — any difference in the result belongs to the model, not to the preprocessing.

01 · Explore: Distribution checks and statistical testing on demographics, conditions, and prior spend. Outcome: skew and imbalance quantified.

02 · Prepare: Imputation, encoding, scaling, and interaction terms across the full feature set. Outcome: 108 modelling features.

03 · Fit: GridSearchCV over four regressors and three classifiers with cross-validation. Outcome: tuned candidate models.

04 · Judge: RMSE for the cost question, F1 for the utilization question, on untouched test data. Outcome: ElasticNet and Random Forest.

Exploration and preparation are shared; the split into regression and classification happens only at the fitting stage.
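A minimal sketch of how one preparation recipe can feed both tasks, assuming scikit-learn. The column names, the toy data, and the helper `make_preprocessor` are hypothetical stand-ins, not the project's actual code, and the interaction-term step is omitted for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame standing in for the survey features.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.integers(18, 90, n).astype(float),
    "prior_spend": rng.gamma(2.0, 2000.0, n),
    "region": rng.choice(["north", "south", "west"], n),
})
X.loc[::25, "age"] = np.nan  # inject missing values to exercise imputation

# Two targets built from the same rows: a continuous cost and a binary label.
y_cost = 1.5 * X["prior_spend"] + rng.normal(0, 500, n)
y_util = (y_cost > np.median(y_cost)).astype(int)


def make_preprocessor():
    """Build the shared recipe: impute, then scale numerics and one-hot encode categoricals."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric, ["age", "prior_spend"]),
        ("cat", categorical, ["region"]),
    ])


# Same preparation, different estimator and target at the final step.
cost_model = Pipeline([
    ("prep", make_preprocessor()),
    ("model", ElasticNet(alpha=0.1, l1_ratio=0.9, max_iter=10_000)),
])
util_model = Pipeline([
    ("prep", make_preprocessor()),
    ("model", RandomForestClassifier(n_estimators=100, random_state=0)),
])

cost_model.fit(X, y_cost)
util_model.fit(X, y_util)
```

Wrapping the preparation inside each pipeline means imputation and scaling statistics are learned only from the rows passed to `fit`, which keeps cross-validation folds and the held-out set from leaking into the preprocessing.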

The Imbalance That Shapes Everything

Before any model runs, the class split already predicts where the difficulty will be. High utilizers outnumber low utilizers roughly 3.6 to one in the test set (2,353 against 647), which means a classifier can score well overall while barely recognising the minority class at all.

Low utilization: Class 0 · 647 samples

High utilization: Class 1 · 2,353 samples

Test-set composition. Predicting the majority class for every patient would already score 78.4% accuracy, which is why accuracy alone is not the metric this project optimises.
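The baseline arithmetic can be checked directly from the two class counts. The sketch below scores a hypothetical classifier that predicts class 1 for every patient; balanced accuracy is included here only as an illustration of a metric that ignores class size, not as a metric the project reports.

```python
# Test-set class counts reported in the text.
n_low, n_high = 647, 2353
n_total = n_low + n_high

# A classifier that always predicts class 1 (high utilization).
accuracy = n_high / n_total      # every high utilizer is counted as correct
recall_high = n_high / n_high    # all high utilizers are "found"
recall_low = 0 / n_low           # no low utilizer is ever found
balanced_accuracy = (recall_high + recall_low) / 2

print(f"accuracy          = {accuracy:.3f}")
print(f"recall (class 1)  = {recall_high:.2f}")
print(f"recall (class 0)  = {recall_low:.2f}")
print(f"balanced accuracy = {balanced_accuracy:.2f}")
```

The accuracy comes out at 0.784 while class 0 recall is zero, which is the gap between looking good overall and recognising the minority class.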

The Cost Question

Four regressors were tuned with GridSearchCV and scored by RMSE on the held-out set. The result is not what a model-complexity ordering would predict: both linear models beat both tree ensembles.

Test RMSE by regression model

The test RMSE of the four models spans a range of roughly 1,300. Regularisation beats ensembling here, which suggests the usable signal is closer to linear than the tree models can exploit without overfitting the skew.

Selected regression configurations

Linear Regression: fit_intercept = false
ElasticNet: alpha 0.1 · l1_ratio 0.9
Random Forest: depth 10 · split 5 · 200 trees
XGBoost: rate 0.1 · depth 3 · 100 trees · subsample 0.8
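As an illustration of the tuning step, here is a minimal sketch of a grid search over ElasticNet scored by RMSE, assuming scikit-learn. The synthetic right-skewed target and the grid values are assumptions for the example; only `alpha=0.1` and `l1_ratio=0.9` come from the configuration listed above, and the search on this toy data is not expected to reproduce them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; exponentiating makes the target right-skewed.
X, y = make_regression(n_samples=400, n_features=20, noise=10.0, random_state=0)
y = np.exp(y / y.std())

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical grid; it contains the reported winning values among others.
param_grid = {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.1, 0.5, 0.9]}

search = GridSearchCV(
    ElasticNet(max_iter=10_000),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximises, hence the sign
    cv=5,
)
search.fit(X_train, y_train)

# Score the refit best model once on the held-out rows.
test_rmse = float(np.sqrt(np.mean((y_test - search.predict(X_test)) ** 2)))
print("best params:", search.best_params_)
print("cv RMSE:", -search.best_score_)
print("test RMSE:", test_rmse)
```

The held-out rows are touched only after the search has finished, so the test RMSE is not influenced by the hyperparameter choice.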

The Utilization Question

The classification task reverses the finding. Here the ensembles win, and the interesting part is not the headline accuracy — it is how far the models separate once the metric stops rewarding majority-class guessing.

Metric profile across classifiers

All three models look comparable on accuracy. Logistic Regression then falls away on F1, because accuracy rewards it for following the majority class while F1 does not.

Selected classifier configurations

Logistic Regression: C 0.1 · L1 · saga
XGBoost: rate 0.1 · depth 3 · 200 trees
Random Forest: unbounded depth · leaf 2 · split 2 · 100 trees

Where Random Forest Actually Fails

Aggregate scores hide the failure mode. Broken out by class, one cell carries the entire weakness of the model: it recovers just over half of the genuinely low-utilization patients.

Random Forest, per class

Class 0 recall at 0.54 is the outlier. Nearly half of low-utilization patients are predicted as high-utilization — the direction of error a cost-planning model can least afford. Support is 647 samples for class 0 against 2,353 for class 1, which is why the weighted column tracks class 1 so closely.

86.4% accuracy is a real result, but it sits only about eight points above what predicting the majority class everywhere would score. The honest reading is that the model is strong on high utilizers and mediocre on the minority it was hardest to learn.
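The reported figures translate into patient counts with a few lines of arithmetic. Because the recall is given to two decimals, the counts below are approximate.

```python
# Figures reported in the text.
n_low, n_high = 647, 2353
recall_low = 0.54   # class 0 recall, rounded to two decimals in the source
accuracy = 0.864    # Random Forest test accuracy

# Approximate number of low utilizers found and missed.
found_low = round(recall_low * n_low)
missed_low = n_low - found_low

# Gap between the model and the always-predict-class-1 baseline.
majority_baseline = n_high / (n_low + n_high)
gap_points = (accuracy - majority_baseline) * 100

print(f"low utilizers found:  ~{found_low}")
print(f"low utilizers missed: ~{missed_low}")
print(f"accuracy gap over baseline: {gap_points:.1f} points")
```

Roughly 300 of the 647 low-utilization patients end up labelled as high utilizers, which is where most of the remaining error sits.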

Feature Selection Did Not Help

Recursive Feature Elimination with cross-validation and Sequential Feature Selection were both tested, and both made the models worse — higher RMSE and lower R². Features that look uninformative in isolation were contributing through interactions, so the full set was kept.

RFECV and SFS each degraded held-out performance.
Individually weak features still contributed through interactions, in both the non-linear and the regularised models.
The final models use all 108 features.
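A comparison of this kind could be set up as follows. This is a sketch assuming scikit-learn's `RFECV` on synthetic data, not the project's experiment; the feature counts and estimator settings are illustrative, and whether selection helps or hurts depends on the data. Sequential Feature Selection could be swapped in via `sklearn.feature_selection.SequentialFeatureSelector`.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data with a mix of informative and noise features.
X, y = make_regression(
    n_samples=300, n_features=15, n_informative=8, noise=5.0, random_state=0
)

scoring = "neg_root_mean_squared_error"


def make_estimator():
    return ElasticNet(alpha=0.1, l1_ratio=0.9, max_iter=10_000)


# Baseline: cross-validated RMSE using every feature.
full_rmse = -cross_val_score(make_estimator(), X, y, cv=5, scoring=scoring).mean()

# Selection runs inside the pipeline, so each outer fold selects features
# from its own training rows only and the comparison stays fair.
with_selection = Pipeline([
    ("select", RFECV(make_estimator(), step=1, cv=5, scoring=scoring,
                     min_features_to_select=3)),
    ("model", make_estimator()),
])
reduced_rmse = -cross_val_score(with_selection, X, y, cv=5, scoring=scoring).mean()

# Fit the selector once on all rows just to inspect how many features it keeps.
selector = RFECV(make_estimator(), step=1, cv=5, scoring=scoring,
                 min_features_to_select=3).fit(X, y)

print(f"RMSE, all features:   {full_rmse:.2f}")
print(f"RMSE, RFECV selected: {reduced_rmse:.2f}")
print(f"features kept: {selector.n_features_} of {X.shape[1]}")
```

Keeping the full set is the reasonable call when the second number is not lower than the first on held-out folds.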

What the Two Answers Say Together

The cost question and the utilization question disagree about which model family to trust, and that disagreement is the finding. Regularised linear regression handles a skewed continuous target better than boosted trees do; ensembles handle a thresholded, imbalanced label better than a linear classifier does. Choosing one model for both would have cost real accuracy on whichever task lost.

ElasticNet reaches 10,965 RMSE on expenditure and Random Forest reaches 0.858 F1 at 86.4% accuracy on utilization. The remaining headroom is almost entirely in the minority class, which points at resampling or class-weighted training rather than at more features.
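One way to try the class-weighted route is sketched below, assuming scikit-learn's `class_weight="balanced"` option on a Random Forest. The synthetic data only mimics the roughly 22/78 class split; it says nothing about what the change would do on the survey data, so no outcome is claimed here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with about 22% class 0 and 78% class 1.
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.22, 0.78], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Fit the same forest twice: once unweighted, once with weights inversely
# proportional to class frequency, and record minority-class recall.
recalls = {}
for label, class_weight in [("unweighted", None), ("balanced", "balanced")]:
    clf = RandomForestClassifier(
        n_estimators=100,
        min_samples_leaf=2,
        class_weight=class_weight,
        random_state=0,
    )
    clf.fit(X_train, y_train)
    recalls[label] = recall_score(y_test, clf.predict(X_test), pos_label=0)

for label, value in recalls.items():
    print(f"class 0 recall ({label}): {value:.3f}")
```

Tracking class 0 recall directly, rather than accuracy, keeps the comparison focused on the cell that the per-class breakdown identified as the weak one.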