Module 04: Model Evaluation, Metrics, and Cross-Validation

Riverbend Roasters now has models. Some predict order counts. Some predict high-demand days. That feels like progress, but it creates a new and more important question:

How do we know whether a model is good enough to trust?

This module is about moving from predictions to evidence. A model is not good because it is complicated, modern, or impressive in a demo. A model is good when it performs well on the kind of data it will face later and when its errors are acceptable for the decision it supports.

The companion script is:

samples/module_04_evaluation_validation.py

It builds both classification and regression examples, prints metrics, compares cross-validation scores, and runs a small hyperparameter search.

Standalone orientation

You can read this article even if you have not read the earlier model-building modules. The key idea is simple: a model’s predictions are not valuable until we compare them with known answers on data the model did not use for learning.

This module is about trust. Whether a model predicts a category or a number, we need evidence that it generalizes. If you are following the whole series, this article is the checkpoint that makes every later model comparison meaningful. If you are reading it alone, focus on the distinction between training a model and evaluating whether the trained model deserves to influence a decision.

How to read the examples: two meanings of `X` and `y`

This module uses two datasets because evaluation looks different for classification and regression.

In the classification demo, X is a generated table of signal columns named signal_0, signal_1, and so on. Each row is an instance. y is a binary class label, where 1 represents the positive class and 0 represents the negative class. The model predicts class labels and class probabilities, then the script evaluates accuracy, ROC AUC, precision, recall, F1-score, and the confusion matrix.

In the regression demo, X is a generated table of numeric columns named feature_0, feature_1, and so on. y is a numeric target. The model predicts numbers, then the script evaluates MAE, RMSE, and R-squared.

The same notation appears in both cases, but the meaning of y changes the whole evaluation story:

classification: X -> classifier -> predicted class
regression:     X -> regressor  -> predicted number

cross_val_score repeatedly creates training and validation folds from X and y. GridSearchCV does the same while trying different hyperparameter values. In both cases, the goal is to estimate performance without letting validation answers leak into model fitting.
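As a rough illustration of the two setups, the sketch below generates one dataset of each kind. The companion script's own generation code is not reproduced in this article, so the use of make_classification and make_regression, the sample sizes, the noise level, and the seeds are all assumptions; only the signal_ and feature_ column names come from the description above.

```python
from sklearn.datasets import make_classification, make_regression

# Assumed generators and sizes; the companion script may build its data differently.
X_cls, y_cls = make_classification(n_samples=200, n_features=4, random_state=0)
X_reg, y_reg = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# Column names follow the naming described in the text.
signal_names = [f"signal_{i}" for i in range(X_cls.shape[1])]
feature_names = [f"feature_{i}" for i in range(X_reg.shape[1])]

# Same notation, different meaning: y_cls holds class labels, y_reg holds numbers.
print(signal_names, "-> labels:", sorted(set(y_cls.tolist())))
print(feature_names, "-> first targets:", y_reg[:3].round(1).tolist())
```

The shapes of X are the same in both cases. What changes is the type of y, and with it the set of metrics that make sense.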

Loss functions and metrics are not the same story

A loss function is usually what an algorithm minimizes during training. A metric is what humans use to judge whether the trained model is useful. Sometimes they are related, but they are not always identical.

For example, logistic regression commonly learns by minimizing log loss. But a business stakeholder may care about recall, precision, or the cost of false alarms. A regression model may train by minimizing squared error, but a planner may care about average absolute error because it is easier to interpret in real units.

This distinction matters because optimization pressure shapes model behavior. If the training objective and the real-world decision are badly misaligned, the model can improve mathematically while becoming less useful operationally.
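A small sketch of that misalignment, using hand-made numbers rather than output from the companion script. The two probability vectors are hypothetical; they are chosen so that one model wins on log loss while the other wins on recall at the usual 0.5 threshold.

```python
import numpy as np
from sklearn.metrics import log_loss, recall_score

y_true = np.array([1, 1, 0, 0])

# Hypothetical predicted probabilities of the positive class from two models.
proba_a = np.array([0.55, 0.55, 0.45, 0.45])  # timid, but always on the right side of 0.5
proba_b = np.array([0.95, 0.45, 0.05, 0.05])  # confident, but misses one positive

results = {}
for name, proba in {"A": proba_a, "B": proba_b}.items():
    predicted = (proba >= 0.5).astype(int)  # default decision threshold
    results[name] = {
        "log_loss": float(log_loss(y_true, proba)),
        "recall": float(recall_score(y_true, predicted)),
    }
    print(name, results[name])
```

Model B has the lower log loss, yet it finds only half of the positives. If a missed high-demand day is the costly error, the model that looks better to the optimizer is the worse choice for the decision.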

Classification metrics: looking inside correctness

Suppose Riverbend uses a classifier to predict high-demand days. A correct prediction is useful, but incorrect predictions are not all equal.

If the model predicts high demand and the day is normal, Riverbend may overstaff and waste food. That is a false positive. If the model predicts normal demand and the day is high demand, the shop may run out of products and disappoint customers. That is a false negative.

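The confusion matrix shows these outcomes:

from sklearn.metrics import confusion_matrix
# Rows are actual classes and columns are predicted classes.
# For binary labels 0 and 1 the layout is [[TN, FP], [FN, TP]].
confusion_matrix(y_test, predictions)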

From the confusion matrix, we calculate:

Metric     Question it answers
Accuracy   How often are predictions correct overall?
Precision  When the model predicts positive, how often is it right?
Recall     Of all actual positives, how many did the model find?
F1-score   How well does the model balance precision and recall (their harmonic mean)?
ROC AUC    How well does the model rank positives above negatives across thresholds?

Accuracy is useful only when the classes and error costs are reasonably balanced. In fraud, disease screening, safety alerts, and rare-event detection, accuracy can be dangerously comforting.
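To see why, consider a sketch with made-up labels: 95 normal days and 5 high-demand days. The arrays below are hypothetical and are not taken from the companion script. One "model" never predicts high demand at all, while the other is imperfect but actually finds some positives.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels: 95 normal days (0) and 5 high-demand days (1).
y_true = np.array([0] * 95 + [1] * 5)

# Baseline that never predicts high demand.
always_negative = np.zeros(100, dtype=int)
# Imperfect model: 3 false alarms, 2 positives found, 3 positives missed.
imperfect = np.array([0] * 92 + [1] * 3 + [1] * 2 + [0] * 3)

tn, fp, fn, tp = confusion_matrix(y_true, imperfect).ravel()
print("TN, FP, FN, TP:", tn, fp, fn, tp)  # 92 3 3 2

metrics = {}
for name, predicted in [("always_negative", always_negative), ("imperfect", imperfect)]:
    metrics[name] = {
        "accuracy": float(accuracy_score(y_true, predicted)),
        "precision": float(precision_score(y_true, predicted, zero_division=0)),
        "recall": float(recall_score(y_true, predicted)),
        "f1": float(f1_score(y_true, predicted, zero_division=0)),
    }
    print(name, metrics[name])
```

The baseline scores 0.95 accuracy with zero recall, and it beats the imperfect model (0.94 accuracy, 0.4 recall) on accuracy alone. Precision, recall, and the confusion matrix expose what accuracy hides.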

Regression metrics: measuring the size of mistakes

Regression errors have units. If a model predicts trip price, the error is in dollars. If it predicts order count, the error is in orders. If it predicts delivery duration, the error is in minutes.

That is why metrics should be explained in the language of the problem.

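MAE is the average absolute size of the errors, in the units of the target. RMSE squares the errors before averaging, so it punishes large mistakes more strongly. R-squared compares the model's squared error with that of always predicting the mean: 1 is a perfect fit, 0 matches the mean baseline, and negative values are worse than the baseline. These metrics are easy to calculate:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, predictions)  # average absolute error, in target units
rmse = np.sqrt(mean_squared_error(y_test, predictions))  # square root returns to target units
r2 = r2_score(y_test, predictions)  # unitless; 1.0 is a perfect fit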

The right metric depends on what hurts. If one huge underprediction is disastrous, RMSE may be more revealing than MAE. If everyday planning tolerance matters, MAE may be more understandable.
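A worked example makes the difference concrete. The numbers below are invented for illustration: two sets of predictions with the same MAE, where one spreads its error evenly and the other is perfect except for a single large underprediction.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Invented order counts for five days.
y_true = np.array([80.0, 90.0, 100.0, 110.0, 120.0])

# Every prediction is off by 10 orders.
steady = y_true + np.array([10.0, -10.0, 10.0, -10.0, 10.0])
# Four perfect predictions and one underprediction of 50 orders.
one_big_miss = y_true + np.array([0.0, 0.0, 0.0, 0.0, -50.0])

summary = {}
for name, predicted in [("steady", steady), ("one_big_miss", one_big_miss)]:
    mae = float(mean_absolute_error(y_true, predicted))
    rmse = float(np.sqrt(mean_squared_error(y_true, predicted)))
    summary[name] = (mae, rmse)
    print(f"{name}: MAE={mae:.2f} RMSE={rmse:.2f}")
# steady: MAE=10.00 RMSE=10.00
# one_big_miss: MAE=10.00 RMSE=22.36
```

MAE cannot tell the two apart. RMSE more than doubles for the second set because the single 50-order miss is squared before averaging.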

Evaluation plots: seeing the errors

Tables of metrics are compact, but plots reveal patterns. A regression residual plot can show whether errors grow as predictions get larger. A predicted-vs-actual plot can show whether the model systematically underpredicts high values. A classification ROC curve can show how true positive rate and false positive rate trade off across thresholds.

Even when you do not generate plots in production, you should think visually during development. Patterns in errors often reveal missing features, leakage, nonlinearity, outliers, or changing data.

The companion script avoids extra plotting dependencies so it stays lightweight, but it prints enough information to motivate the next step: if a metric surprises you, inspect the errors directly.
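One way to inspect errors directly without a plotting library is to bin the residuals by predicted value and compare the bin averages. The sketch below is an assumption about how that could look and is not part of the companion script. It uses a synthetic curved relationship and, to stay short, looks at in-sample residuals; with real data the same check belongs on held-out rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic curved relationship that a straight line cannot capture.
x = np.linspace(0.0, 10.0, 200)
X = x.reshape(-1, 1)
y = x ** 2

predictions = LinearRegression().fit(X, y).predict(X)
residuals = y - predictions  # positive residual means the model underpredicted

# Sort rows by predicted value and split them into four equal-sized bins.
order = np.argsort(predictions)
bin_means = [float(residuals[idx].mean()) for idx in np.array_split(order, 4)]

for i, mean_residual in enumerate(bin_means, start=1):
    print(f"prediction quartile {i}: mean residual = {mean_residual:+.2f}")
```

If the model fit well, every bin would average close to zero. Here the lowest and highest quartiles are underpredicted and the middle ones are overpredicted, which is the textual equivalent of a curved residual plot and a hint that a nonlinear term is missing.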

Cross-validation: one split is only one story

A single train/test split can be lucky or unlucky. Maybe the test set contains unusually easy examples. Maybe it contains a cluster of hard examples. Cross-validation gives a more stable estimate by training and evaluating multiple times.

In k-fold cross-validation, the data is divided into k folds. The model trains on k - 1 folds and validates on the remaining fold. This repeats k times, until each fold has served as validation data exactly once.

In scikit-learn:

from sklearn.model_selection import cross_val_score

# Returns one score per fold. With this scorer the values are negative MAE.
scores = cross_val_score(
    model,
    X,
    y,
    cv=5,
    scoring="neg_mean_absolute_error",
)

scikit-learn follows the convention that higher scores are better, so loss-like metrics are represented as negative values. That is why we often multiply by -1 when reporting MAE or RMSE.
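A self-contained sketch of that sign flip follows. The Ridge model, the synthetic data, and the seeds are assumptions made for illustration and may differ from what the companion script uses.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Assumed synthetic data and model, chosen only to make the example runnable.
X, y = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=0)

neg_mae = cross_val_score(
    Ridge(alpha=1.0),
    X,
    y,
    cv=5,
    scoring="neg_mean_absolute_error",
)

# Flip the sign so the reported numbers are ordinary, positive MAE values.
mae_per_fold = -neg_mae
print("MAE per fold:", mae_per_fold.round(2))
print(f"mean = {mae_per_fold.mean():.2f}, std = {mae_per_fold.std():.2f}")
```

Reporting the mean together with the spread across folds is usually more honest than quoting a single score.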

Cross-validation is not magic. It must respect the structure of the data. Time series require time-aware splits. Grouped data may require group-aware splits. Duplicate users, repeated patients, or shared households can leak information across folds if we split naively.
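scikit-learn ships splitters for both situations. The sketch below uses TimeSeriesSplit and GroupKFold on a toy array of twelve rows; the customer names are hypothetical and the row order is assumed to be chronological.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 rows, assumed to be in time order

# Time-aware: every validation fold comes strictly after its training rows.
time_folds = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, val_idx in time_folds:
    print("train up to row", train_idx.max(), "-> validate rows", val_idx.tolist())

# Group-aware: hypothetical customer IDs, three rows per customer.
customers = np.repeat(["ana", "ben", "cho", "dev"], 3)
group_folds = list(GroupKFold(n_splits=4).split(X, groups=customers))
for train_idx, val_idx in group_folds:
    shared = set(customers[train_idx]) & set(customers[val_idx])
    print("validation customers:", sorted(set(customers[val_idx].tolist())), "shared:", len(shared))
```

Either splitter can be passed as the cv argument of cross_val_score or GridSearchCV in place of an integer; GroupKFold additionally needs the groups array passed to the call.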

Model selection and tuning

Once evaluation is trustworthy, we can compare models and tune hyperparameters. A parameter, such as a coefficient, is learned during training. A hyperparameter, such as the number of neighbors in kNN or the regularization strength in Ridge regression, is chosen before training.

GridSearchCV lets us search over hyperparameter choices using cross-validation:

from sklearn.model_selection import GridSearchCV

# Here model is a Pipeline with a step named "classifier" whose estimator accepts C.
search = GridSearchCV(
    estimator=model,
    param_grid={"classifier__C": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="f1",
)
search.fit(X_train, y_train)

Notice the name classifier__C. In a pipeline, the part before the double underscore names the step and the part after it names the hyperparameter of that step, so this key tunes C on the step called classifier.

A word of caution: hyperparameter tuning can overfit the validation folds, because the winning configuration is simply the one that happened to score best on them. Keep a final test set untouched until the end, especially when many experiments are being tried.
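Putting the pieces together, a minimal end-to-end sketch might look like the following. The scaler, the LogisticRegression estimator, the synthetic data, and the split sizes are assumptions; the article only establishes that the pipeline has a step named classifier with a C hyperparameter.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed synthetic data; the held-out test set is split off before any tuning.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The step name "classifier" is what makes the key "classifier__C" resolve.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

search = GridSearchCV(
    estimator=pipeline,
    param_grid={"classifier__C": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="f1",
)
search.fit(X_train, y_train)  # cross-validation happens inside the training data only

print("best params:", search.best_params_)
print(f"cross-validated F1: {search.best_score_:.3f}")

# The test set is used once, after the search has finished.
test_f1 = float(f1_score(y_test, search.predict(X_test)))
print(f"held-out F1: {test_f1:.3f}")
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so validation rows never influence the scaling statistics.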

What to notice when running the sample

The classification section prints both threshold-dependent metrics and a threshold-independent metric. Accuracy, precision, recall, and F1-score depend on the chosen class threshold. ROC AUC evaluates ranking behavior across thresholds. A model can rank examples well but still need threshold tuning before deployment.
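The contrast can be seen on a handful of made-up scores. The labels and scores below are hypothetical and chosen so the arithmetic is easy to check by hand; they are not output from the companion script.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.80])

# ROC AUC looks only at the ranking: 14 of the 15 positive/negative pairs are ordered correctly.
auc = float(roc_auc_score(y_true, scores))
print(f"ROC AUC = {auc:.3f}")

# Precision and recall change as soon as the decision threshold moves.
by_threshold = {}
for threshold in (0.50, 0.38):
    predicted = (scores >= threshold).astype(int)
    by_threshold[threshold] = (
        float(precision_score(y_true, predicted)),
        float(recall_score(y_true, predicted)),
    )
    print(f"threshold {threshold:.2f}: precision={by_threshold[threshold][0]:.2f} "
          f"recall={by_threshold[threshold][1]:.2f}")
```

The ranking, and therefore the AUC, is the same in both cases. Lowering the threshold from 0.50 to 0.38 raises recall from about 0.67 to 1.00 at the cost of dropping precision from 1.00 to 0.75, which is exactly the kind of trade-off that has to be settled before deployment.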

The regression section uses cross-validation to show that performance is a distribution, not a single number. The fold scores will not be identical. That variation is useful information. If fold performance varies wildly, the model may be sensitive to which examples it sees, or the dataset may contain subgroups that should be evaluated separately.

The grid search section demonstrates a safe pattern: tune hyperparameters inside cross-validation, then evaluate the selected model on held-out data when appropriate. The search object is not magic; it simply automates repeated fitting, scoring, and comparison.

Common evaluation traps

One trap is optimizing for a metric that does not match the decision. Another is tuning repeatedly on the same test set until the test set effectively becomes part of training. A third is using random cross-validation on data with time ordering, repeated groups, or other leakage paths. If tomorrow’s observations are correlated with today’s, or if the same customer appears in both train and test folds, a random split may exaggerate performance.

Evaluation design is part of modeling. The split strategy, metric choice, and validation process encode your assumptions about how the model will be used later.

The module in one journey

Evaluation is the bridge between model building and decision making. Loss functions guide training. Metrics guide judgment. Confusion matrices reveal classification trade-offs. Regression metrics translate mistakes into real units. Cross-validation reduces dependence on one split. Hyperparameter search improves models while preserving a reproducible evaluation process.

Run the sample from the samples directory:

python module_04_evaluation_validation.py