Module 01: Machine Learning Foundations with scikit-learn


Every machine learning project begins before anyone imports scikit-learn, opens a notebook, or trains a model. It begins with a question.

For this first module, imagine a neighbourhood coffee shop called Riverbend Roasters. The shop is popular, but unpredictable. Some mornings the pastry case is empty by 10:00 AM. Other days, trays of unsold croissants remain at closing. The owner has years of receipts, notes about local events, weather records, and a simple question:

Can we predict tomorrow’s demand before we decide how much food to prepare and how many people to schedule?

That question is small enough to understand, but rich enough to introduce the vocabulary and workflow of machine learning. As we follow Riverbend Roasters from messy observations to a working scikit-learn model, each major concept will appear naturally. By the end, words like instance, feature, target, model, algorithm, training, testing, baseline, bias, and variance should feel connected rather than isolated.

The companion code for this article is module_01_foundations.py in the aduwillie/ML-Blog repository (main branch).

It creates a synthetic coffee-demand dataset and walks through a complete scikit-learn workflow. The article explains the ideas; the script turns those ideas into runnable Python.

Standalone orientation

You can read this article as the first stop in the series or as a refresher before jumping into any later module. No earlier machine learning knowledge is assumed. The only background you need is a basic comfort with Python variables, tables, and the idea that a program can use data to make a prediction.

If you are reading the whole series, this module supplies the shared vocabulary used everywhere else: rows become instances, columns become features, the answer column becomes the target, and a scikit-learn pipeline connects preprocessing to a model. If you are reading this article alone, focus on the central workflow: define a question, build X and y, split the data, train a model, evaluate predictions, and ask whether the result is useful.

How to read the examples: X, y, and the moving parts

Before the story continues, it is worth slowing down for the two most common symbols in scikit-learn examples: X and y.

In supervised learning, X means “the information the model is allowed to use.” It is usually a table of feature columns. In the coffee-shop example, X contains columns such as temperature_f, rain_inches, event_score, is_holiday, and day_of_week. Each row is one day at Riverbend Roasters. Each column describes something known before the shop makes tomorrow’s plan.

y means “the answer the model is trying to learn.” For regression, y is orders, the number of coffee orders observed for each day. For classification, y becomes something like high_demand, a yes/no label derived from order volume.

That relationship is the heart of the examples:

			
X = df[["temperature_f", "rain_inches", "event_score", "is_holiday", "day_of_week"]]
y = df["orders"]

The dataframe df is the full synthetic dataset. X_train and y_train are the part used to teach the model. X_test and y_test are the part held back to check whether the model learned a reusable pattern. The Pipeline is the assembly line: it preprocesses the columns and then passes them into an estimator such as Ridge, LogisticRegression, or MLPClassifier. The estimator is the component that actually learns from the training data.

When you see model.fit(X_train, y_train), read it as: “learn the relationship between these feature rows and these known answers.” When you see model.predict(X_test), read it as: “use that learned relationship to estimate answers for rows whose answers were hidden during training.”

The first transformation: from experience to data

The owner of Riverbend already has experience. She knows Saturdays are busier than Tuesdays. She knows rain changes foot traffic. She knows a street festival can turn an ordinary afternoon into a rush. Experience is valuable, but it is hard to reuse consistently unless we record it.

Machine learning starts by turning experience into data.

In our story, each historical day becomes one row in a table. A row is called an instance. An instance is one observed example of the situation we care about. For Riverbend, one instance might be “last Saturday.” Another might be “a rainy Tuesday in March.” Each instance has information we know before making a prediction and an outcome we learn afterward.

The information we know beforehand becomes the features:

			
temperature_f
rain_inches
event_score
is_holiday
day_of_week

		

The outcome we want to predict becomes the target:

orders

That separation is one of the most important habits in machine learning. Features are the inputs. The target is the answer we want the model to learn to predict. If we accidentally include information that would not be available at prediction time, the model may look brilliant during testing and fail in real life.

In Python, we often represent this kind of data as a pandas dataframe:

			
import pandas as pd
df = pd.DataFrame({
    "temperature_f": [61.2, 74.5, 52.8],
    "rain_inches": [0.00, 0.02, 0.40],
    "event_score": [0.10, 0.85, 0.00],
    "day_of_week": ["Monday", "Saturday", "Wednesday"],
    "orders": [103, 211, 94],
})

		

Then we make the feature-target split explicit:

			
X = df[["temperature_f", "rain_inches", "event_score", "day_of_week"]]
y = df["orders"]

This small move has a big meaning. We are saying, “Here is what the model may look at, and here is what it must learn to predict.”

The second transformation: from table columns to machine-readable features

A human can read Saturday and immediately understand something about weekends, errands, brunch, and foot traffic. A machine learning model does not automatically understand that. Most scikit-learn estimators expect numbers. That means we must think carefully about feature types.

Some Riverbend features are already numeric. Temperature, rainfall, and event score are numbers. But even numeric features may need cleaning. Missing rainfall values may need imputation. Temperature might benefit from scaling. Other features are categorical. day_of_week is not a number; it is a label chosen from a small set of possible labels.

This is where preprocessing enters the story. Preprocessing is not busywork before the “real” modeling begins. It is part of the model’s view of the world. If we encode days of the week poorly, the model receives a distorted version of reality. If we handle missing values inconsistently, training and production behavior may diverge.

scikit-learn gives us a clean way to describe this:

			
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
numeric_features = ["temperature_f", "rain_inches", "event_score"]
categorical_features = ["day_of_week"]
numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features),
])

		

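The code says: numeric columns should be filled and scaled; categorical columns should be filled and one-hot encoded. If the model later sees a category that was absent during fitting, handle_unknown="ignore" encodes it as all zeros instead of raising an error at prediction time. Note that ColumnTransformer drops any column not listed in one of its transformers by default (remainder="drop"), so a feature such as is_holiday must be added to one of the lists before the model can use it.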

For a beginner, the key lesson is that raw data rarely goes straight into a model. For an expert, the deeper lesson is that preprocessing must be reproducible, versioned, and attached to the model. A model without its preprocessing steps is only half a model.
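To make the preprocessor's behavior concrete, here is a minimal sketch on a hypothetical four-row table (all values are invented for illustration). It fits the same kind of ColumnTransformer on days with known weekdays, then transforms one new day that has a missing rainfall value and a weekday that never appeared during fitting.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented rows, for illustration only.
train = pd.DataFrame({
    "temperature_f": [61.2, 74.5, 52.8, 68.0],
    "rain_inches": [0.00, 0.02, 0.40, 0.10],
    "event_score": [0.10, 0.85, 0.00, 0.30],
    "day_of_week": ["Monday", "Saturday", "Wednesday", "Saturday"],
})
new_day = pd.DataFrame({
    "temperature_f": [70.0],
    "rain_inches": [np.nan],      # missing value, to be imputed
    "event_score": [0.20],
    "day_of_week": ["Sunday"],    # weekday not present in the training rows
})

preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["temperature_f", "rain_inches", "event_score"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), ["day_of_week"]),
])

# Learn medians, means, standard deviations, and weekday categories from train only.
preprocessor.fit(train)

encoded = preprocessor.transform(new_day)
if hasattr(encoded, "toarray"):   # densify if scikit-learn returned a sparse matrix
    encoded = encoded.toarray()

print(encoded.shape)      # 3 scaled numeric columns + 3 one-hot weekday columns
print(encoded[0, 3:])     # weekday columns for the unseen "Sunday"
```

The missing rainfall value is replaced with the training median before scaling, and the unseen weekday becomes a row of zeros across the weekday columns rather than an error.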

The third transformation: from pattern to model

Once the data has a shape the computer can understand, we can introduce the word model.

A model is a learned function. It takes features as input and produces a prediction as output:

features -> model -> prediction

For Riverbend Roasters, the model might learn that warmer weather usually increases orders, rain usually decreases walk-in demand, weekends tend to be busier, and local events create spikes. The model does not “know” coffee culture the way a person does. It learns statistical relationships from examples.

An algorithm is the procedure used to learn the model. This distinction matters. Ridge regression is an estimator that uses an algorithm to learn coefficients from data. LogisticRegression uses an algorithm to learn how features affect the probability of a class. A decision tree uses an algorithm to learn splitting rules. The model is the result; the algorithm is how we get there.

scikit-learn expresses this with a consistent interface:

			
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Those two lines are simple, but they hide a profound idea. During fit, the algorithm studies examples whose answers are known. During predict, the learned model is asked to estimate answers for examples it has not been trained on.
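The distinction between algorithm and model becomes concrete when you inspect a fitted estimator. The sketch below uses invented single-feature data (temperature only) and is not taken from the companion script. After fit runs the learning algorithm, the learned model is nothing more than a slope and an intercept, and predict simply applies them.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical data: orders rise by roughly 2 per degree of temperature.
temperature = np.array([[50.0], [55.0], [60.0], [65.0], [70.0], [75.0]])
orders = np.array([101.0, 109.0, 121.0, 130.0, 141.0, 149.0])

estimator = Ridge(alpha=1.0)          # the estimator wraps the learning algorithm
estimator.fit(temperature, orders)    # the algorithm runs here

# The learned model is just these numbers.
slope = estimator.coef_[0]
intercept = estimator.intercept_

# predict() applies the learned function; we can reproduce it by hand.
manual = intercept + slope * 62.0
predicted = estimator.predict(np.array([[62.0]]))[0]

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"manual={manual:.3f}, predict={predicted:.3f}")
```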

For Riverbend, we can combine preprocessing and modeling into one pipeline:

			
from sklearn.linear_model import Ridge
model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("regressor", Ridge(alpha=1.0)),
])

		

Now the model object contains the whole journey from raw columns to predictions. When we call fit, it learns preprocessing statistics from the training data and learns regression coefficients. When we call predict, it applies the same preprocessing steps before producing predictions.

That is why pipelines are central to serious scikit-learn work. They turn a sequence of fragile manual steps into a single reproducible object.

The fourth transformation: from memorization to generalization

If Riverbend gives us three years of historical records, it is tempting to train on all of them and celebrate when the model explains the past. But machine learning is not mainly about explaining examples we already know. It is about performing well on future examples.

That is why we split data into training and test sets:

			
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42,
)

		

The training set is the model’s classroom. The test set is the exam. The model is allowed to learn from the classroom examples, but the exam examples are held back until evaluation.

This split introduces a central goal: generalization. A useful model does not merely memorize the training data. It learns patterns that continue to hold for new data.

This is also where data leakage becomes dangerous. Suppose we scale every row before splitting into training and test sets. The scaling step has already seen the test data. The model evaluation now contains a small preview of the exam. In many projects, leakage is subtle and much more damaging than this simple example. It can come from future timestamps, duplicate users, target-derived features, or preprocessing performed before the split.

Pipelines help because calling fit on a pipeline fits every preprocessing step only on the rows passed to fit: the training set, or the training portion of each fold during cross-validation. The same learned transformation is then applied to validation or test data. The pipeline keeps the modeling story honest.
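The difference between leaky and honest scaling can be checked directly. The following sketch uses synthetic temperatures (an assumption for illustration, not the companion dataset) and compares the mean learned by a scaler fitted on every row with the mean learned by a scaler inside a pipeline that was fitted on the training rows only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=65.0, scale=10.0, size=(200, 1))            # synthetic temperatures
y = 100.0 + 2.0 * X[:, 0] + rng.normal(scale=5.0, size=200)    # synthetic orders

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Leaky: the scaler's statistics are computed from every row, test rows included.
leaky_scaler = StandardScaler().fit(X)

# Honest: the pipeline fits the scaler only on the rows passed to fit().
model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("regressor", Ridge(alpha=1.0)),
])
model.fit(X_train, y_train)
pipeline_mean = model.named_steps["scaler"].mean_[0]

print(f"mean from all rows:      {leaky_scaler.mean_[0]:.4f}")
print(f"mean from training rows: {pipeline_mean:.4f}")
```

The two means differ only slightly here, which is why this form of leakage is easy to miss. The pipeline version never touches X_test until model.predict(X_test) or model.score(X_test, y_test) is called.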

Regression: predicting “how much”

Riverbend’s first question is numeric: how many orders will we receive tomorrow? That is a regression problem.

Regression predicts a quantity. The target might be tomorrow’s orders, a home’s sale price, a delivery time, an energy bill, or a patient’s blood pressure. The important point is that the prediction is a number whose size matters.

For Riverbend:

y = df["orders"]

A regression model might predict 148 orders. If the actual value is 153, the model is probably useful. If the actual value is 230, the shop may be in trouble. The size of the error matters because the prediction is tied to a real decision.

In the companion script, the regression workflow compares a simple baseline to a Ridge regression pipeline:

			
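from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
# build_preprocessor() is a helper defined in the companion script.
# Baseline: ignores the features and predicts the mean training target.
baseline = Pipeline(steps=[
    ("preprocess", build_preprocessor()),
    ("regressor", DummyRegressor(strategy="mean")),
])
# Candidate model: same preprocessing, with Ridge regression on top.
ridge_model = Pipeline(steps=[
    ("preprocess", build_preprocessor()),
    ("regressor", Ridge(alpha=1.0)),
])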

		

The baseline always predicts the mean of the training targets and ignores the features entirely. It is intentionally simple. That simplicity makes it powerful: if our trained model cannot beat “always guess the average,” then our model has not earned our trust.

Baselines keep us honest. They prevent us from mistaking complexity for progress.
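A baseline comparison can be sketched in a few lines. The data below is synthetic and the coefficients are assumptions chosen for illustration; preprocessing is skipped for brevity because both features are already numeric. The point is only the shape of the comparison: same split, same metric, two estimators.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n_days = 300

# Synthetic days: warmer weather adds orders, rain removes them (assumed effects).
temperature = rng.uniform(40, 90, n_days)
rain = rng.exponential(0.1, n_days)
orders = 60 + 1.5 * temperature - 80 * rain + rng.normal(0, 8, n_days)

X = np.column_stack([temperature, rain])
X_train, X_test, y_train, y_test = train_test_split(
    X, orders, test_size=0.25, random_state=42
)

baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)

baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))
ridge_mae = mean_absolute_error(y_test, ridge.predict(X_test))

print(f"baseline MAE: {baseline_mae:.1f} orders")
print(f"ridge MAE:    {ridge_mae:.1f} orders")
```

If the second number were not clearly smaller than the first, the sensible conclusion would be that the features or the model need work, not that the model is ready.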

Metrics: turning predictions into evidence

Once the model predicts demand, Riverbend needs to know whether the predictions are good enough to use. That means we need metrics.

For regression, three common metrics tell different parts of the story.

Mean absolute error, or MAE, measures the average absolute difference between predicted and actual values. If the MAE is 11, we can say the model is off by about 11 orders on a typical day. That is easy to explain to a shop owner.

Root mean squared error, or RMSE, also measures prediction error, but it penalizes large mistakes more strongly. If Riverbend can tolerate being off by 8 orders but cannot tolerate being off by 60, RMSE helps reveal those painful misses.

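R-squared, often written as R², compares the model’s squared error with the squared error of always predicting the mean of the target. A value of 1.0 is a perfect fit, 0.0 means the model is no better than predicting the mean, and negative values mean it is worse. It is useful as a relative measure, but it should not replace business interpretation. A model can have a respectable R² and still be unacceptable if its largest errors happen on the busiest days.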

In scikit-learn:

			
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
r2 = r2_score(y_test, predictions)

		

Metrics are not just math. They are a translation layer between statistical performance and human decisions. Riverbend does not care about a metric because it is elegant; Riverbend cares because it informs inventory, staffing, waste, and customer experience.
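It can help to compute the three metrics by hand once. The sketch below uses four invented days, one of which is a badly missed busy day, and checks the hand calculations against scikit-learn’s functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented values: three small misses and one large miss on a busy day.
actual = np.array([150.0, 160.0, 170.0, 230.0])
predicted = np.array([148.0, 163.0, 165.0, 190.0])

errors = actual - predicted                    # [2, -3, 5, 40]

mae = np.mean(np.abs(errors))                  # (2 + 3 + 5 + 40) / 4 = 12.5
rmse = np.sqrt(np.mean(errors ** 2))           # sqrt(1638 / 4), about 20.2

ss_res = np.sum(errors ** 2)                   # squared error of the model
ss_tot = np.sum((actual - actual.mean()) ** 2) # squared error of "predict the mean"
r2 = 1 - ss_res / ss_tot                       # 1 - 1638 / 3875, about 0.58

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")

# The library functions agree with the hand calculations.
assert np.isclose(mae, mean_absolute_error(actual, predicted))
assert np.isclose(rmse, np.sqrt(mean_squared_error(actual, predicted)))
assert np.isclose(r2, r2_score(actual, predicted))
```

The single 40-order miss pulls RMSE well above MAE, which is exactly the sensitivity to large errors described above.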

Classification: predicting “which kind”

After a few weeks, the owner asks a slightly different question:

Can we predict whether tomorrow will be a high-demand day?

Now the exact number of orders matters less than the category. Maybe Riverbend defines a high-demand day as any day with at least 175 orders:

df["high_demand"] = df["orders"] >= 175

This turns the problem into classification. Classification predicts a class label. Instead of predicting 184 orders, the model predicts high or normal.

This shift changes the modeling vocabulary. For classification, we might use logistic regression:

			
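from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# build_preprocessor() is a helper defined in the companion script.
# max_iter=1000 gives the solver more room to converge than the default of 100.
classifier = Pipeline(steps=[
    ("preprocess", build_preprocessor()),
    ("classifier", LogisticRegression(max_iter=1000)),
])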

		

It also changes evaluation. Accuracy may be useful, but it can be misleading if one class is much more common than another. Precision, recall, and F1-score often tell a richer story. If high-demand days are rare, a model that always predicts “normal” may look accurate while completely failing the business need.
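The accuracy trap is easy to reproduce. In this sketch the labels are invented: 2 high-demand days out of 20. A “model” that always predicts normal scores well on accuracy and catches none of the days the owner cares about.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 1 = high demand, 0 = normal. Only 2 of 20 days are high.
y_true = np.array([0] * 18 + [1] * 2)
always_normal = np.zeros(20, dtype=int)   # predicts "normal" every day

accuracy = accuracy_score(y_true, always_normal)
# zero_division=0 returns 0 instead of warning when no positives are predicted.
precision = precision_score(y_true, always_normal, zero_division=0)
recall = recall_score(y_true, always_normal, zero_division=0)
f1 = f1_score(y_true, always_normal, zero_division=0)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Accuracy comes out at 0.90 while recall is 0.00: every high-demand day was missed.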

The conceptual bridge is simple:

			
Regression asks: how much?
Classification asks: which class?

Both are supervised learning tasks because the training data includes known answers. In regression, the known answer is numeric. In classification, the known answer is categorical.

Bias and variance: the model as a storyteller

At this point, Riverbend has a dataset, features, targets, preprocessing, train/test splits, models, and metrics. The next question is harder:

How complex should the model be?

This question leads to the bias-variance tradeoff, one of the most important ideas in machine learning.

Think of a model as a storyteller. A high-bias model tells a story that is too simple. It might say, “Demand only depends on temperature.” That story is easy to understand, but it misses weekends, holidays, rain, and local events. When a model is too simple to capture the real pattern, it underfits.

A high-variance model tells a story that is too specific. It might treat every strange detail in the training data as if it were a permanent law. Maybe one Tuesday had unusually high demand because a tour bus stopped nearby. A high-variance model may overreact and decide that Tuesdays are always explosive. It performs well on training data but poorly on new data. That is overfitting.

Good modeling is not a race toward maximum complexity. It is the search for a level of complexity that generalizes.

The companion script demonstrates this idea by fitting polynomial models of different degrees to one feature. A degree-1 model may be too simple. A very high-degree model may chase noise. A middle option may capture useful structure without becoming too fragile.
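A rough version of that experiment can be sketched as follows. This is not the companion script’s code: the curved relationship between temperature and orders, the noise level, and the chosen degrees are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(3)

def make_days(n):
    """Synthetic days where demand peaks near 70°F (an assumed pattern)."""
    temperature = rng.uniform(40, 90, size=(n, 1))
    orders = 180 - 0.15 * (temperature[:, 0] - 70) ** 2 + rng.normal(0, 10, n)
    return temperature, orders

X_train, y_train = make_days(30)     # small training set makes overfitting visible
X_test, y_test = make_days(200)

results = {}
for degree in (1, 3, 15):
    model = make_pipeline(
        StandardScaler(),                                   # keeps high powers manageable
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    train_mae = mean_absolute_error(y_train, model.predict(X_train))
    test_mae = mean_absolute_error(y_test, model.predict(X_test))
    results[degree] = (train_mae, test_mae)
    print(f"degree {degree:>2}: train MAE {train_mae:6.1f}   test MAE {test_mae:6.1f}")
```

When reading the output, compare each row’s training error with its test error. A straight line should show high error on both (underfitting), while a gap that widens as the degree grows is the signature of overfitting.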

scikit-learn’s cross_val_score helps us estimate this behavior more reliably:

			
from sklearn.model_selection import cross_val_score
scores = cross_val_score(
    model,
    X,
    y,
    scoring="neg_mean_absolute_error",
    cv=5,
)
mae_scores = -scores

		

Cross-validation repeatedly trains and evaluates the model on different folds of the data. Instead of trusting one lucky or unlucky split, we get a more stable estimate of how the model behaves across multiple train/test partitions.

The beginner version is: test the model on data it did not train on. The expert version is: use validation design to estimate generalization under the same conditions the model will face after deployment.
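For daily demand data, one possible way to match deployment conditions is to always train on earlier days and evaluate on later ones. The sketch below uses scikit-learn’s TimeSeriesSplit to show which rows land in each fold. It is an illustration of validation design under the assumption that rows are sorted oldest first, not a description of what the companion script does.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 20 consecutive days, oldest first (row index stands in for the date).
days = np.arange(20).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=4)
folds = list(splitter.split(days))

for train_idx, test_idx in folds:
    print(f"train on days 0-{train_idx[-1]:>2}, "
          f"test on days {test_idx[0]:>2}-{test_idx[-1]:>2}")
```

Every fold tests on days that come after all of its training days. The same splitter object can be passed as cv=splitter to cross_val_score in place of cv=5.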

Ethics: the model enters a human system

It is easy to end a first machine learning module with metrics, but that would leave the story unfinished. Riverbend’s model does not live inside a spreadsheet. It affects people.

If demand predictions are used to prepare ingredients, the model may reduce waste and improve customer experience. If predictions are used to schedule workers, the model may affect income, workload, and fairness. If errors are worse in certain seasons or neighborhoods, the model may create uneven service quality. Even a friendly coffee-shop example has human consequences.

Ethical machine learning begins with questions like these:

Question	Why it matters
Who benefits if the model is accurate?	The model should create value for more than one stakeholder.
Who is harmed when the model is wrong?	Some mistakes are more costly than others.
Are the features appropriate?	Some variables can be proxies for sensitive or unfair signals.
Is the training data representative?	Historical data can preserve past inequities or blind spots.
Can a person challenge the prediction?	Human oversight matters, especially when decisions affect people.
Will performance be monitored after deployment?	Data changes, and models decay.

For Riverbend, a responsible deployment might use predictions to guide inventory while allowing managers to override unusual local knowledge. It might monitor errors by weekday and season. It might avoid employee-level features unless there is a clear, fair, and transparent governance process.
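Monitoring errors by weekday can start as a simple grouped summary. In this sketch the prediction log and its column names (predicted_orders, abs_error) are hypothetical; a real deployment would read them from wherever predictions and outcomes are stored.

```python
import pandas as pd

# Hypothetical prediction log: actual and predicted orders for six days.
log = pd.DataFrame({
    "day_of_week": ["Monday", "Saturday", "Monday", "Saturday", "Wednesday", "Wednesday"],
    "orders": [103, 211, 98, 240, 94, 101],
    "predicted_orders": [100, 190, 105, 205, 96, 99],
})

log["abs_error"] = (log["orders"] - log["predicted_orders"]).abs()

# Mean absolute error per weekday, worst first.
error_by_day = (
    log.groupby("day_of_week")["abs_error"]
    .mean()
    .sort_values(ascending=False)
)
print(error_by_day)
```

In this invented log the Saturday error is several times larger than the others, the kind of uneven performance an overall MAE would hide.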

For higher-stakes fields like healthcare, lending, hiring, insurance, education, and public safety, this ethical layer is not optional. It is part of the engineering work.

A short detour: recommendation systems are still built from these ideas

Before we close Module 1, consider a very different-seeming example: a music app recommending songs.

At first, recommendations feel far away from coffee demand. But the ingredients are familiar. The instances might be listening sessions. The features might include genre, tempo, artist, time of day, skip history, replay count, or similarity to other songs. The target might be whether a user replayed a song, skipped it, liked it, or added it to a playlist.

The model may be more complex than our Ridge regression example, but the conceptual chain remains:

instances -> features -> target -> model -> prediction -> metric -> decision

This is why foundations matter. If a recommendation system optimizes clicks, it may learn to promote addictive or low-quality content. If it optimizes long-term satisfaction, it needs different targets and metrics. The machine learning terms are not just vocabulary; they shape the product.

Running the companion script

The companion script is intentionally complete enough to serve as a starting point for a GitHub sample. It generates data, builds pipelines, trains models, evaluates results, and demonstrates bias and variance.

Install the dependencies:

pip install scikit-learn pandas numpy

Run the script:

python samples/module_01_foundations.py

Optionally write the generated dataset to a CSV file:

python samples/module_01_foundations.py --write-data

Read the script from top to bottom as a second version of the article. The function make_coffee_demand_data() creates the world. build_preprocessor() turns raw columns into model-ready features. run_regression_workflow() answers “how many orders?” run_classification_workflow() answers “high demand or normal?” demonstrate_bias_variance() shows why model complexity must be chosen carefully.

What to notice before moving on

The most important habit in this first article is not memorizing model names. It is learning to ask what role each object plays in the workflow. The dataframe is the observed world. X is the part of that world we are allowed to use as evidence. y is the answer we want to learn. The split protects the test set. The pipeline protects preprocessing. The estimator learns the relationship. The metric translates model behavior into a decision-relevant summary.

When you run the companion script, do not only look for the best number. Compare the baseline to the trained model. Notice how the classification task is created from the regression target. Notice that the bias-variance demonstration intentionally limits the model to one feature so model complexity becomes easier to see. These are teaching choices. In a real project, you would iterate on features, data quality, metrics, and deployment constraints, not just swap estimators.

If this module feels like a lot, that is normal. It is the vocabulary layer for the rest of the series. Every later article reuses the same grammar: define the question, construct X and y, fit a model, evaluate honestly, and decide whether the result is useful.

Common foundation traps

The first trap is starting with an algorithm instead of a question. “Use neural networks” is not a project goal. “Predict high-demand days early enough to plan staffing” is a project goal. The question determines the target, the data, the metric, and the acceptable error.

The second trap is confusing training performance with future performance. A model that explains the past may still fail tomorrow. That is why train/test splits, cross-validation, baselines, and leakage prevention appear so early in the series.

The third trap is forgetting that every dataset is a simplified view of reality. Features are not the world itself; they are measurements chosen by people and systems. A model can only learn from what was recorded, how it was recorded, and which outcomes were chosen as targets.

The journey so far

The Riverbend Roasters story started with a business question, not a model. That matters. Machine learning is not a bag of algorithms looking for somewhere to be applied. It is a disciplined way to turn examples into predictions that support decisions.

We began with raw experience and converted it into instances, features, and a target. We learned that feature types determine preprocessing. We wrapped preprocessing and modeling into a scikit-learn pipeline so the workflow stays reproducible. We split data so evaluation measures generalization rather than memorization. We used baselines so our model has to prove it adds value. We used metrics to translate predictions into evidence. We separated regression from classification. We introduced bias and variance as the tension between underfitting and overfitting. Finally, we placed the model back into the human system it affects.

The code may begin with:

			
model.fit(X_train, y_train)
predictions = model.predict(X_test)

But the real craft is everything surrounding those lines: the question, the data, the features, the target, the split, the pipeline, the baseline, the metric, the diagnosis, and the decision about whether the prediction should be used.

That is the foundation for the rest of the series. In the next module, we can build on this foundation and move deeper into classification models, where the question changes from “how much will happen?” to “which outcome is most likely?”