As datasets grow, features multiply. A nutrition study might track calories, protein, carbohydrates, fat, fiber, sodium, vitamins, minerals, meal timing, supplements, and lifestyle variables. More features can help, but they can also create noise, redundancy, instability, and slower training.
Module 09 asks a practical question:
How do we keep the useful signal while reducing unnecessary complexity?
There are two broad answers. Feature selection keeps a subset of original features. Feature extraction creates new features that summarize or transform the originals.
The companion script is module_09_feature_selection_extraction.py, on the main branch of the aduwillie/ML-Blog repository.
It generates synthetic nutrition-style data and demonstrates univariate feature selection, recursive feature elimination, PCA, and kernel PCA.
Standalone orientation
You can read this article on its own if you understand that machine learning models learn from input features. This module asks what to do when there are many features, redundant features, noisy features, or features that are expensive to keep.
If you are reading the whole series, Module 09 sits after preprocessing and evaluation because feature selection and extraction must be evaluated carefully inside pipelines. If you are reading it alone, focus on the difference between keeping a smaller set of original columns and creating new compressed representations from the original columns.
How to read the examples: X, y, selectors, and extractors
In the companion script, X is a table of nutrition and lifestyle features: calories, protein, carbs, fat, fiber, sodium, vitamin_c, calcium, added_sugar, exercise_minutes, sleep_hours, and stress.
y is health_score, the numeric target the model tries to predict. Because y is numeric, the final estimator in each pipeline is a regressor, specifically Ridge.
The module compares different middle steps between scaling and regression:
| Component | Type | What it does to X |
|---|---|---|
| SelectKBest | Feature selection | Keeps the top-scoring original columns. |
| RFE | Feature selection | Repeatedly removes weaker original columns based on an estimator. |
| PCA | Feature extraction | Creates new linear components from all original columns. |
| KernelPCA | Feature extraction | Creates nonlinear components using a kernel method. |
Selection keeps original features, so the selected columns still have names like fiber or sleep_hours. Extraction creates new coordinates, so the transformed features are components rather than original columns. Both approaches change the version of X that reaches the regressor.
The critical detail is that selectors and extractors are inside the pipeline. They are fit only on training folds during cross-validation, which prevents test information from influencing the representation.
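To make the selection versus extraction distinction concrete, here is a minimal sketch. It does not reproduce the companion script: the data is generated on the spot, the four column names are borrowed from the list above, and the target formula is an assumption made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical data: column names come from the article, values are random,
# and the target formula below is invented for this sketch.
rng = np.random.default_rng(0)
names = np.array(["fiber", "sodium", "sleep_hours", "stress"])
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=200)

# Selection: the output is a subset of the original columns, names intact.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
selected_names = [str(name) for name in names[selector.get_support()]]

# Extraction: the output is a set of new coordinates; each one mixes all inputs.
pca = PCA(n_components=2).fit(X)
X_pca = pca.transform(X)

print("selected columns:", selected_names)
print("PCA output shape:", X_pca.shape)
print("loadings per component (rows) over original columns:", pca.components_.shape)
```

The selector hands back columns you can still name, while each row of `pca.components_` is a weight vector over every original column. Both are fit on all rows here only for inspection; for evaluation they belong inside a pipeline.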
Feature selection: choosing columns to keep
Feature selection keeps original columns. That makes it attractive when interpretability matters. If a health researcher wants to know which measured variables are most predictive, keeping original feature names is valuable.
One simple approach is univariate selection. Each feature is scored individually against the target, and the top features are retained:
from sklearn.feature_selection import SelectKBest, f_regression

selector = SelectKBest(score_func=f_regression, k=6)
This is fast and easy, but it has a limitation: it evaluates features one at a time. A feature that is weak alone but powerful in combination may be missed.
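The one-at-a-time limitation can be seen on a small invented dataset. In the sketch below, the columns `a`, `b`, and `c` and the target formula are assumptions chosen to make the effect visible: `a` and `b` influence the target only through their product, while `c` has a plain linear effect.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical data, not taken from the companion script.
rng = np.random.default_rng(0)
n = 2000
a, b, c = rng.normal(size=(3, n))
X = np.column_stack([a, b, c])
names = np.array(["a", "b", "c"])

# Most of the target's variance comes from the a*b interaction,
# but neither a nor b is linearly related to y on its own.
y = 2.0 * a * b + 1.0 * c + rng.normal(scale=0.1, size=n)

selector = SelectKBest(score_func=f_regression, k=1).fit(X, y)
kept = [str(name) for name in names[selector.get_support()]]

for name, score in zip(names, selector.scores_):
    print(f"{name}: F = {score:.1f}")
print("kept:", kept)
```

Because `f_regression` scores each column by its individual linear relationship with the target, `a` and `b` receive low scores even though together they drive most of the variation.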
Another approach is recursive feature elimination, or RFE. It trains a model, ranks features, removes weaker features, and repeats:
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge

selector = RFE(estimator=Ridge(), n_features_to_select=6)
RFE is more model-aware, but more expensive. It asks, “Which features help this estimator?”
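As a rough illustration of that elimination loop, the sketch below runs RFE on invented data. The column names are borrowed from the article, but the values and the target formula are assumptions, and scaling is done on all rows only because nothing is being evaluated here.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Hypothetical data: only fiber, sleep_hours, and stress affect the invented target.
rng = np.random.default_rng(0)
names = np.array(["fiber", "sodium", "added_sugar", "sleep_hours", "stress", "calcium"])
X = rng.normal(size=(300, len(names)))
y = 3.0 * X[:, 0] + 2.0 * X[:, 3] - 2.5 * X[:, 4] + rng.normal(scale=0.5, size=300)

# Scale first so Ridge coefficient sizes are comparable across columns.
X_scaled = StandardScaler().fit_transform(X)

# step=1 drops one column per round, refitting Ridge each time.
rfe = RFE(estimator=Ridge(), n_features_to_select=3, step=1).fit(X_scaled, y)
kept = [str(name) for name in names[rfe.support_]]

print("kept:", kept)
# ranking_ is 1 for kept columns; larger numbers were eliminated earlier.
print("ranking:", {str(name): int(rank) for name, rank in zip(names, rfe.ranking_)})
```

The `ranking_` attribute records the elimination order, which is often as informative as the final subset.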
Feature extraction: creating new coordinates
Feature extraction creates new features from existing ones. The new features may not have simple original names, but they can capture structure efficiently.
Principal component analysis, or PCA, is the classic linear extraction method. PCA finds directions of maximum variance in the feature space. The first principal component captures as much variance as possible. The second captures the next largest amount while being orthogonal to the first, and so on.
In scikit-learn:
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
PCA is useful when features are correlated. In nutrition data, many nutrients rise together because they come from overall food quantity or particular food groups. PCA can compress those relationships into fewer components.
The tradeoff is interpretability. A component is a mixture of original features. It may represent a pattern, but it is not as directly explainable as a single original column.
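A small sketch can show both sides of this: compression of correlated columns and the mixing that makes components harder to explain. The hidden `quantity` variable and all values below are assumptions invented for the example, not part of the companion script.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one hidden "overall food quantity" drives three nutrients,
# while sleep_hours is unrelated to it.
rng = np.random.default_rng(0)
n = 500
quantity = rng.normal(size=n)
calories = quantity + 0.2 * rng.normal(size=n)
protein = quantity + 0.2 * rng.normal(size=n)
fat = quantity + 0.2 * rng.normal(size=n)
sleep_hours = rng.normal(size=n)

X = StandardScaler().fit_transform(
    np.column_stack([calories, protein, fat, sleep_hours])
)

pca = PCA(n_components=3).fit(X)

# Share of total variance captured by each component, largest first.
print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
# Weights of the first component over (calories, protein, fat, sleep_hours).
print("first component loadings:", np.round(pca.components_[0], 2))
# Components are orthogonal, so this dot product is numerically zero.
pc_dot = float(pca.components_[0] @ pca.components_[1])
print("PC1 . PC2 =", round(pc_dot, 8))
```

The first component should load on the three correlated nutrients together, which is the compression; it is also a blend rather than a single measured variable, which is the interpretability cost.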
Nonlinear extraction
Some structure is not linear. Kernel PCA uses kernel methods to perform nonlinear feature extraction:
from sklearn.decomposition import KernelPCA

kpca = KernelPCA(n_components=3, kernel="rbf")
Kernel PCA can reveal curved structure, but it introduces additional hyperparameters and can be harder to interpret. Use it when there is a reason to believe nonlinear structure matters and when evaluation supports the extra complexity.
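One way to let evaluation decide is to compare linear and kernel PCA inside otherwise identical pipelines. The sketch below uses an invented two-column dataset whose target depends on squared distance from the origin, a deliberately curved relationship; the `gamma=0.5` and `n_components=10` settings are arbitrary assumptions, not tuned values.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with a purely nonlinear relationship to the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = X[:, 0] ** 2 + X[:, 1] ** 2 + rng.normal(scale=0.1, size=400)


def mean_r2(extractor):
    """Cross-validated R^2 for scaler -> extractor -> Ridge."""
    model = Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("extractor", extractor),
        ("regressor", Ridge()),
    ])
    return cross_val_score(model, X, y, cv=5, scoring="r2").mean()


pca_r2 = mean_r2(PCA(n_components=2))
# gamma controls the RBF kernel width; it is one of the extra hyperparameters.
kpca_r2 = mean_r2(KernelPCA(n_components=10, kernel="rbf", gamma=0.5))

print(f"linear PCA + Ridge: R^2 = {pca_r2:.3f}")
print(f"kernel PCA + Ridge: R^2 = {kpca_r2:.3f}")
```

Linear components cannot express a squared relationship for a linear regressor, so kernel PCA should score higher here. On data without curved structure the gap may vanish, which is why the comparison is worth running rather than assuming.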
Avoiding leakage during feature processing
Feature selection and extraction must be inside the cross-validation or training pipeline. If PCA or feature selection is fit on the full dataset before the train/test split, information from the test set leaks into training.
The correct pattern is:
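from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("selector", SelectKBest(f_regression, k=6)),
    ("regressor", Ridge()),
])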
Now feature selection is learned only from the training fold during cross-validation.
This point is easy to miss and important enough to repeat: selecting features is learning from data. Treat it like model fitting.
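To see that selection really is refit on each training fold, you can ask scikit-learn to return the fitted pipelines and inspect their selectors. This is a sketch on invented data: the column names come from the article, while the values, the target formula, and the choice of `k=3` are assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two strong signals, one weak signal, five noise columns.
rng = np.random.default_rng(0)
names = np.array(["calories", "protein", "fiber", "sodium",
                  "added_sugar", "exercise_minutes", "sleep_hours", "stress"])
X = rng.normal(size=(150, len(names)))
y = 2.0 * X[:, 2] - 1.5 * X[:, 7] + 0.3 * X[:, 6] + rng.normal(size=150)

model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("selector", SelectKBest(f_regression, k=3)),
    ("regressor", Ridge()),
])

cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(
    model, X, y, cv=cv,
    scoring="neg_mean_absolute_error",
    return_estimator=True,  # keep the pipeline fitted on each training fold
)

# Each fold has its own fitted selector, so the chosen columns can differ.
fold_selections = []
for fold, fitted in enumerate(results["estimator"]):
    mask = fitted.named_steps["selector"].get_support()
    fold_selections.append([str(name) for name in names[mask]])
    print(f"fold {fold}: {fold_selections[-1]}")

print("mean MAE:", round(float(-results["test_score"].mean()), 3))
```

Strong features tend to be chosen in every fold, while the last slot may change from fold to fold. That variation is part of what cross-validation is measuring.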
What to notice when running the sample
The sample compares pipelines rather than isolated transformations. This is important because feature processing is part of model training. SelectKBest, RFE, PCA, and KernelPCA all learn something from data. They should be fit only on training folds during cross-validation.
Notice the tradeoff between interpretability and compression. SelectKBest and RFE keep original feature columns, so you can still say which measured variables were used. PCA and Kernel PCA create new components. Those components may improve compression or performance, but they are less directly explainable.
Also notice that more transformation is not automatically better. If the all-features Ridge model performs well, selection or extraction may not improve MAE. Dimensionality reduction is a tool for managing noise, redundancy, cost, and complexity; it is not a guaranteed upgrade.
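A comparison of this kind might be sketched as follows. This is not the companion script: the data is random, the target formula is invented, and the `middle_steps` dictionary and the step name `reduce` are hypothetical names used only in this example.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 12 independent columns, target uses four of them.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))
y = (3.0 * X[:, 4] + 2.0 * X[:, 10] - 2.0 * X[:, 11] - 1.0 * X[:, 8]
     + rng.normal(size=300))

# The only thing that changes between pipelines is the middle step.
middle_steps = {
    "all features": "passthrough",
    "SelectKBest": SelectKBest(score_func=f_regression, k=6),
    "RFE": RFE(estimator=Ridge(), n_features_to_select=6),
    "PCA": PCA(n_components=3),
    "KernelPCA": KernelPCA(n_components=3, kernel="rbf"),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
mae = {}
for label, step in middle_steps.items():
    model = Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("reduce", step),
        ("regressor", Ridge()),
    ])
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_mean_absolute_error")
    mae[label] = float(-scores.mean())  # flip sign: lower MAE is better
    print(f"{label:>12}: MAE = {mae[label]:.3f}")
```

Because the invented columns are independent, PCA has little shared variance to compress and should not be expected to help on this data. With correlated columns the ranking could look different, so treat the output as an illustration of the method, not a verdict on any technique.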
Common feature-processing traps
The biggest trap is fitting selectors or PCA before cross-validation. That leaks information because the transformation has already seen all rows, including validation rows. The second trap is selecting features based only on statistical scores without considering whether they are available, stable, ethical, and meaningful.
A third trap is forgetting the deployment pipeline. If the production system cannot compute a selected feature reliably, that feature is not useful no matter how predictive it was in an experiment.
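The first trap is easiest to believe once you see it on data that contains no signal at all. In the sketch below, both the features and the target are pure random noise (an assumption made for the demonstration), so any apparent predictive power has to come from leakage.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical worst case: many noise columns, few rows, random target.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))
y = rng.normal(size=60)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: pick the 10 "best" columns using ALL rows, then cross-validate.
X_leaky = SelectKBest(score_func=f_regression, k=10).fit_transform(X, y)
leaky_model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("regressor", Ridge()),
])
leaky_r2 = cross_val_score(leaky_model, X_leaky, y, cv=cv, scoring="r2").mean()

# Honest: selection lives inside the pipeline and is refit on each training fold.
honest_model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("selector", SelectKBest(score_func=f_regression, k=10)),
    ("regressor", Ridge()),
])
honest_r2 = cross_val_score(honest_model, X, y, cv=cv, scoring="r2").mean()

print(f"selection before CV (leaky):  R^2 = {leaky_r2:.3f}")
print(f"selection inside CV (honest): R^2 = {honest_r2:.3f}")
```

The leaky version can look predictive because the chosen columns were picked for their chance correlation with the validation rows as well as the training rows. The honest version has no such advantage and should score at or below the level of guessing the mean.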
The module in one journey
Feature selection reduces complexity by keeping useful original columns. Feature extraction reduces complexity by creating new representations. PCA is linear and efficient. Kernel PCA is nonlinear and more flexible. Pipelines keep the process honest.
The best approach depends on the goal. If interpretation matters, selection may be preferable. If compression and predictive performance matter, extraction may help. If deployment simplicity matters, fewer stable features may beat a clever transformation.
Run the sample:
python module_09_feature_selection_extraction.py
