Module 05: Data Preprocessing and Feature Engineering

If models are engines, features are the road. A powerful engine does not help if the road is broken, mislabeled, or missing bridges.

This module focuses on the work that happens before training: cleaning missing values, transforming numeric features, encoding categories, and building pipelines that apply those steps consistently. It is tempting to treat preprocessing as a chore. In practice, preprocessing is often where machine learning projects succeed or fail.

The companion script is module_05_preprocessing.py, on the main branch of the aduwillie/ML-Blog repository.

It creates a synthetic healthcare-readmission dataset with missing values, skewed features, categorical variables, and a binary target. Then it trains a logistic regression model using a complete preprocessing pipeline.

Standalone orientation

You can start the series here if your immediate interest is messy data. The only machine learning background you need is that models learn from input columns and a target column. This article focuses on what happens before the model sees those inputs.

If you are reading the whole series, Module 5 explains the machinery that quietly supports almost every earlier and later example: imputation, scaling, encoding, transformations, and pipelines. If you are reading this article by itself, think of preprocessing as the translation layer between real-world records and the numeric arrays that scikit-learn estimators can learn from.

How to read the examples: `X`, `y`, and preprocessing components

In this module, df is the full synthetic healthcare dataset. It contains patient-like columns such as age, previous_visits, medication_count, lab_risk_score, length_of_stay, followup_scheduled, discharge_type, and insurance_type, plus the target column readmitted.

X is created by dropping the target:

X = df.drop(columns=["readmitted"])

That means X contains every input the model is allowed to use. y is the target:

y = df["readmitted"]
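As a minimal sketch of that split, the snippet below builds a tiny stand-in for df and separates inputs from the target. The four rows and the reduced column set are made up for illustration; the companion script generates a much larger synthetic dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the script's synthetic dataset.
df = pd.DataFrame({
    "age": [71, 54, 63, 80],
    "previous_visits": [3, 0, 1, 7],
    "insurance_type": ["medicare", "private", "self_pay", "medicare"],
    "readmitted": [1, 0, 0, 1],
})

X = df.drop(columns=["readmitted"])  # every input column the model may use
y = df["readmitted"]                 # the 0/1 target

print(X.columns.tolist())
print(y.tolist())
```

Dropping the target by name, rather than selecting inputs by position, keeps the split correct even if columns are reordered later.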

Because readmitted is a yes/no outcome encoded as 0 or 1, the estimator is a classifier. The important part of this module is not only the classifier, though. The preprocessing components are the main characters:

Component	What it does
`SimpleImputer`	Fills missing numeric or categorical values using training-data statistics.
`FunctionTransformer(np.log1p)`	Applies a log transform to skewed count-like columns.
`StandardScaler`	Standardizes numeric columns after imputation or transformation.
`OneHotEncoder`	Converts categorical columns into numeric indicator columns.
`ColumnTransformer`	Sends each column group through the right preprocessing path.
`Pipeline`	Attaches preprocessing and classification so they are fit together safely.

When you read the script, think of X as raw patient records and the pipeline as the translation system that turns those records into model-ready numbers.

Raw data is a rough draft

Imagine a clinic trying to predict whether a patient is at risk of readmission. The dataset includes age, number of previous visits, lab values, medication count, discharge type, insurance type, and whether follow-up was scheduled.

Some values are missing. Some numeric columns are skewed. Some columns are categories. Some values are measured on different scales. None of that means the dataset is useless. It means the dataset is raw.

Machine learning needs a translation layer between raw records and model-ready arrays. That layer is preprocessing.

Imputation: making missingness explicit

Missing data is not automatically an error. A lab value may be missing because the test was not ordered. Follow-up information may be missing because scheduling happened in a different system. The reason for missingness can itself be meaningful.

The simplest imputation strategies fill missing numeric values with the median and missing categorical values with the most frequent category:

from sklearn.impute import SimpleImputer
numeric_imputer = SimpleImputer(strategy="median")
categorical_imputer = SimpleImputer(strategy="most_frequent")

These strategies are not always enough, but they are reliable starting points. More advanced projects may add missingness indicators or use model-based imputation. The key rule is that imputation must be learned from the training data only, then applied to validation, test, and production data.
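The train-only rule can be sketched as follows. The column values are hypothetical, and add_indicator=True is one possible way to add a missingness indicator, not necessarily what the companion script does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and test values for one numeric column.
X_train = pd.DataFrame({"lab_risk_score": [0.2, np.nan, 0.6, 0.4, 1.0]})
X_test = pd.DataFrame({"lab_risk_score": [np.nan, 0.9]})

# add_indicator=True appends a 0/1 column marking which values were missing.
imputer = SimpleImputer(strategy="median", add_indicator=True)
imputer.fit(X_train)  # the median is learned from training rows only

train_median = imputer.statistics_[0]
X_test_imputed = imputer.transform(X_test)  # test rows reuse the training median

print(train_median)
print(X_test_imputed)
```

The test set never contributes to the median. Its missing value is filled with the number learned from training data, and the extra column records that a fill happened.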

Transformations: changing the shape of information

Some numeric features are skewed. Visit counts, transaction amounts, income, and claim costs often have many small values and a few large values. A log transform can make these features easier for linear models to use. The example uses np.log1p, which computes log(1 + x) and therefore stays defined when a count is zero:

from sklearn.preprocessing import FunctionTransformer
import numpy as np
log_transformer = FunctionTransformer(np.log1p, feature_names_out="one-to-one")

Scaling is another common transformation. Many algorithms behave better when numeric features have comparable ranges, and StandardScaler achieves this by rescaling each column to zero mean and unit variance:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

Scaling is especially important for distance-based methods, regularized linear models, support vector machines, and neural networks.
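A short sketch of scaling in practice, using hypothetical age and lab-score columns that live on very different ranges:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical columns: age in years and a lab score between 0 and 1.
X_train = np.array([[40.0, 0.10], [60.0, 0.30], [80.0, 0.80]])
X_test = np.array([[70.0, 0.50]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from training rows
X_test_scaled = scaler.transform(X_test)        # reuses the training mean and std

print(scaler.mean_)
print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

After scaling, neither column dominates simply because its raw numbers are larger.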

Encoding categories: turning labels into columns

Models do not automatically understand categories like private, medicare, or self_pay. One-hot encoding creates a separate binary column for each category:

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown="ignore")

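The handle_unknown="ignore" option is important. Production data often contains categories not seen during training. With this option, an unseen category is encoded as all zeros across that feature's indicator columns. That is not always ideal, because the model cannot tell one unknown value from another, but it is much better than crashing during prediction.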

Encoding is a modeling choice. If a category has natural order, ordinal encoding may be appropriate. If a category has high cardinality, target encoding or learned embeddings may be considered. The beginner lesson is that categories need conversion. The expert lesson is that encoding controls what relationships the model can learn.
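For the ordered case, a minimal sketch with OrdinalEncoder looks like this. The severity column is hypothetical and is not part of the module's dataset; it is here only because it has a natural order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered category, used only for illustration.
severity = pd.DataFrame({"severity": ["low", "high", "medium", "low"]})

# Passing the categories explicitly fixes the order: low < medium < high.
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
codes = ordinal.fit_transform(severity)

print(codes.ravel())
```

Without the explicit list, the encoder would sort categories alphabetically, which would place high before low and teach the model the wrong order.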

ColumnTransformer: different columns, different treatment

Real datasets contain mixed feature types. scikit-learn’s ColumnTransformer lets each group of columns receive the right preprocessing:

from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_pipeline, numeric_features),
    ("skew", skewed_numeric_pipeline, skewed_features),
    ("cat", categorical_pipeline, categorical_features),
])

This keeps preprocessing readable. It also keeps it attached to the model in a pipeline:

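from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Chain the ColumnTransformer defined above with the classifier.
model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])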

The result is one object that knows how to clean, transform, encode, train, and predict.
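Here is a compact end-to-end sketch of that single object in use. The data generator, the reduced column set, and the new patient record are all hypothetical and are not taken from the companion script.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 300

# Hypothetical synthetic data, not the companion script's generator.
df = pd.DataFrame({
    "age": rng.integers(20, 90, size=n).astype(float),
    "previous_visits": rng.poisson(2, size=n).astype(float),
    "insurance_type": rng.choice(["private", "medicare", "self_pay"], size=n),
})
signal = 0.03 * (df["age"] - 55) + 0.4 * (df["previous_visits"] - 2)
df["readmitted"] = (signal + rng.normal(0, 1, size=n) > 0).astype(int)
df.loc[rng.random(n) < 0.1, "age"] = np.nan  # inject some missing ages

X = df.drop(columns=["readmitted"])
y = df["readmitted"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

numeric_pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_pipeline, ["age", "previous_visits"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["insurance_type"]),
])
model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])

model.fit(X_train, y_train)  # fits imputer, scaler, encoder, and classifier together
print("test accuracy:", round(model.score(X_test, y_test), 3))

# A raw record with a missing age and an unseen insurance type still gets a prediction.
new_patient = pd.DataFrame({
    "age": [np.nan],
    "previous_visits": [6.0],
    "insurance_type": ["employer_plan"],
})
probability = model.predict_proba(new_patient)[0, 1]
print("readmission probability:", round(probability, 3))
```

Notice that the caller never touches the imputer, scaler, or encoder directly. Raw records go in, and the pipeline applies the same learned steps every time.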

Feature engineering is storytelling with constraints

Feature engineering means creating useful model inputs from available data. In healthcare, previous visits may become a risk signal. In retail, day of week may become a seasonality feature. In finance, transaction velocity may be more useful than a single transaction amount.

Good feature engineering respects time. If a feature would not be known at prediction time, it must not be used. A readmission model cannot use information recorded after discharge if the prediction is supposed to happen at discharge.
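The time rule can be made concrete with a small sketch. The visit log and dates below are hypothetical; the point is the filter that keeps only what was known at discharge.

```python
import pandas as pd

# Hypothetical visit log for one patient.
visits = pd.DataFrame({
    "patient_id": [1, 1, 1, 1],
    "visit_date": pd.to_datetime(
        ["2023-01-10", "2023-03-05", "2023-06-20", "2023-07-02"]
    ),
})
discharge_date = pd.Timestamp("2023-06-20")  # the moment the prediction is made

# Count only visits strictly before the discharge being scored.
known_at_discharge = visits[visits["visit_date"] < discharge_date]
previous_visits = len(known_at_discharge)

# The 2023-07-02 visit happened after discharge, so counting it would leak the future.
leaky_count = len(visits)

print(previous_visits, leaky_count)
```

A feature built from the full log would look more predictive in training and then be unavailable, or wrong, when the model is used at discharge.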

This is where domain knowledge and machine learning meet. The model learns patterns, but humans decide what information is fair, available, meaningful, and safe to provide.

What to notice when running the sample

The classifier in the script is intentionally ordinary. That is the point. Module 5 is not trying to impress you with a complex model. It is showing that the path from raw data to numeric model input is itself a major part of machine learning.

Watch how columns are grouped. Some columns are ordinary numeric features. Some are skewed count-like features that receive a log transform. Some are categorical features that need one-hot encoding. Each group follows a different route through the ColumnTransformer, and the outputs are joined into one model-ready matrix.

Also notice that missing values are introduced deliberately. This makes the pipeline more realistic. A model that only works on perfectly complete data is often a classroom artifact. Real workflows need explicit missing-value behavior.
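One simple way to introduce missing values on purpose is a random mask per column. This is an illustrative sketch with made-up columns and a 10% rate; the companion script may inject missingness differently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical complete data before any values are removed.
df = pd.DataFrame({
    "lab_risk_score": rng.normal(0.5, 0.1, size=1000),
    "insurance_type": rng.choice(["private", "medicare", "self_pay"], size=1000),
})

# Blank out roughly 10% of each column, independently at random.
for column in df.columns:
    mask = rng.random(len(df)) < 0.10
    df.loc[mask, column] = np.nan

missing_share = df.isna().mean()
print(missing_share)
```

Masking completely at random is the easiest case. Real missingness often depends on other columns, which is one reason the cause of a gap can itself be informative.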

Common preprocessing traps

The biggest trap is preprocessing outside the pipeline. If you impute, scale, encode, or transform the full dataset before splitting or cross-validation, you risk leakage: medians, means, and category lists are computed from rows that later serve as validation or test data. The preprocessing step has learned from data that should have been hidden.
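The difference between the two patterns is sketched below on hypothetical numeric data. The risky version is left as comments for contrast; the pipeline version is the one that runs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # hypothetical numeric features
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

# Risky pattern (shown for contrast): the scaler sees every row before splitting.
# X_scaled = StandardScaler().fit_transform(X)
# cross_val_score(LogisticRegression(), X_scaled, y, cv=5)

# Safer pattern: the scaler is refit on the training portion of each fold.
pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipeline, X, y, cv=5)

print(scores.round(3))
```

On a small, clean example like this the two scores may barely differ. The habit matters most with imputation, target-dependent encodings, and small datasets, where leaked statistics can inflate results noticeably.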

Another trap is assuming categories are stable. Production data may contain a new insurance type, product category, location, or device. OneHotEncoder(handle_unknown="ignore") is a defensive choice that keeps prediction possible, but unknown categories should still be monitored.
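Monitoring can be as simple as comparing incoming values with the categories the encoder learned. The sketch below assumes a hypothetical batch of incoming records; how the rate is logged or alerted on is left to the surrounding system.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"insurance_type": ["private", "medicare", "self_pay"]})
# Hypothetical production batch containing a category never seen in training.
incoming = pd.DataFrame({
    "insurance_type": ["private", "employer_plan", "employer_plan", "medicare"]
})

encoder = OneHotEncoder(handle_unknown="ignore").fit(train)

# categories_ holds one array of learned categories per encoded column.
known = set(encoder.categories_[0])
unseen_mask = ~incoming["insurance_type"].isin(known)
unseen_rate = unseen_mask.mean()
unseen_values = sorted(incoming.loc[unseen_mask, "insurance_type"].unique())

print(unseen_rate, unseen_values)
```

A rising unseen rate is a signal that the training data no longer describes production, and that retraining may be due.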

A third trap is treating feature engineering as purely technical. In healthcare, finance, education, and hiring, features can encode sensitive social patterns. A feature may be predictive and still inappropriate. Good preprocessing prepares data for models; good feature engineering also prepares models for responsible use.

The module in one journey

Preprocessing turns raw data into trustworthy model input. Imputation handles missing values. Transformations reshape numeric information. Encoding converts categories. ColumnTransformer applies different logic to different columns. Pipelines keep the entire process reproducible and protect evaluation from leakage.

Run the sample:

python module_05_preprocessing.py