A 10-Part Machine Learning Journey with scikit-learn

Machine learning can feel like a maze when you first encounter it. One path talks about regression. Another talks about classification. Somewhere nearby are metrics, pipelines, clustering, neural networks, and deep learning. Each topic is useful, but the bigger question is harder:

How do these pieces fit together into one coherent way of thinking?

This 10-part series is designed to answer that question. Each article can stand alone, so a reader can jump directly into classification, clustering, preprocessing, or Keras without feeling lost. But the full series also works as a connected journey. It starts with the basic grammar of machine learning and gradually builds toward more flexible models and richer workflows.

The companion code lives in the GitHub repository aduwillie/ML-Blog. The repo is meant to be more than a download folder. It is the hands-on side of the series: standalone Python scripts, reproducible synthetic datasets, and runnable examples that let readers move from “I understand the idea” to “I can run the workflow myself.”

The foundation throughout the series is Python with scikit-learn. That choice is intentional. scikit-learn gives us a consistent way to express the most important machine learning habits:

			
define the problem
build X and y
split the data
preprocess safely
train a model
evaluate honestly
interpret the result
iterate responsibly
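
As a rough illustration of these habits, the sketch below walks through them once on a synthetic dataset. The dataset parameters, the choice of LogisticRegression, and the accuracy metric are assumptions made for this example; they are not taken from the companion scripts.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Build X and y (a synthetic stand-in for a real, well-defined problem).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=42
)

# Split the data before anything is fitted.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Preprocess safely: the scaler sits inside the pipeline,
# so it is fitted on the training rows only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Train a model.
model.fit(X_train, y_train)

# Evaluate honestly: score on rows the model has never seen.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {test_accuracy:.3f}")
```

Interpreting the result and iterating are the human steps: the printed number only matters once it is compared against a baseline and against the cost of the mistakes the model makes.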

		

Even when the final article introduces Keras, the mindset remains grounded in the same workflow: compare against baselines, avoid leakage, validate carefully, and choose metrics that match the decision.
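
One way to make "compare against baselines" concrete is to score a trivial model with the same metric and the same cross-validation as the candidate. The sketch below assumes an imbalanced synthetic dataset and uses balanced accuracy as the metric; both choices are illustrative, not a description of what the companion scripts do.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data: roughly 80% negatives and 20% positives.
X, y = make_classification(
    n_samples=600,
    n_features=10,
    n_informative=5,
    n_clusters_per_class=1,
    class_sep=1.5,
    weights=[0.8, 0.2],
    random_state=0,
)

# The baseline always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")

# The candidate keeps scaling inside the pipeline, so each CV fold
# fits the scaler on its own training portion (no leakage).
candidate = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

baseline_score = cross_val_score(
    baseline, X, y, cv=5, scoring="balanced_accuracy"
).mean()
candidate_score = cross_val_score(
    candidate, X, y, cv=5, scoring="balanced_accuracy"
).mean()

print(f"Baseline balanced accuracy:  {baseline_score:.3f}")
print(f"Candidate balanced accuracy: {candidate_score:.3f}")
```

Plain accuracy would flatter the baseline here, because predicting the majority class is right about 80% of the time. Balanced accuracy scores that same baseline at 0.5, which is why the metric has to match the decision.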

How to read this series

If you are new to machine learning, read the articles in order. The sequence starts gently and repeats the important ideas in different contexts until they become natural. You will see X, y, train/test splits, pipelines, metrics, and model comparison many times, because those ideas are the backbone of applied machine learning.

If you already have experience, you can jump to the topic you need. Every article includes a “Standalone orientation” section and a “How to read the examples” section. Those sections explain the local dataset, what X means, what y means, and which scikit-learn components are doing the work.

Each article also includes a companion Python script. In the GitHub repository, the scripts live at the repository root rather than inside a samples/ folder, for example:

module_01_foundations.py
module_02_classification.py
module_03_regression.py

The scripts generate their own synthetic datasets, so you do not need external CSV files to run the examples. Where useful, a script supports a --write-data option that exports the generated sample data, so you can inspect it or share it on GitHub. Each script can be run independently from the repository root.
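
A script of this kind might be shaped roughly like the sketch below. Only the --write-data option name comes from the series; the function names, the column names, the output file name, and the assumption that the option is a simple on/off flag are hypothetical and may differ from the real modules.

```python
import argparse
import csv
from pathlib import Path

import numpy as np


def make_synthetic_data(n_rows=200, seed=42):
    """Return a feature matrix X and a binary target y from a fixed seed."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, 3))
    # Hypothetical rule: the target depends on two features plus noise.
    noise = rng.normal(scale=0.5, size=n_rows)
    y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)
    return X, y


def write_data(X, y, path):
    """Write the generated rows to a CSV file for inspection."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["feature_1", "feature_2", "feature_3", "target"])
        for row, label in zip(X, y):
            writer.writerow([*row.tolist(), int(label)])


def main(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical module skeleton.")
    parser.add_argument(
        "--write-data",
        action="store_true",
        help="Export the generated dataset as CSV (assumed to be an on/off flag).",
    )
    args = parser.parse_args(argv)

    X, y = make_synthetic_data()
    print(f"Generated {X.shape[0]} rows with {X.shape[1]} features")

    if args.write_data:
        out_path = Path("synthetic_sample.csv")  # hypothetical file name
        write_data(X, y, out_path)
        print(f"Wrote {out_path}")
    return X, y


if __name__ == "__main__":
    main()
```

Because the random seed is fixed, every run produces the same rows, which is what makes the printed metrics reproducible from one reader's machine to the next.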

Companion code on GitHub

The blog series and the code repository are designed to work together:

			
article -> explains the concept and story
script  -> runs the concept end to end
output  -> gives the reader something concrete to inspect

The GitHub repo, github.com/aduwillie/ML-Blog, organizes the examples as standalone Python modules. That is important for learning. A reader should not need to configure a database, download a hidden dataset, or run a full application just to understand classification, regression, clustering, preprocessing, model evaluation, feature engineering, or neural networks.

The repository supports three reader modes:

Reader mode	How the repo helps
Curious beginner	Run one file, inspect the printed metrics, and connect the output back to the article.
Hands-on learner	Modify features, model settings, or metrics and rerun the script to see what changes.
Experienced practitioner	Use the scripts as compact reference patterns for pipelines, validation, synthetic data generation, and model comparison.

The synthetic datasets are intentional. They keep the examples reproducible and portable. Because each script creates its own data, the reader can focus on the machine learning workflow rather than file downloads or data-cleaning surprises. Later, the same pattern can be applied to real data: replace the synthetic dataset function, keep the split/preprocess/train/evaluate structure, and adapt the metric to the decision.
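
The swap described above can be sketched by passing the data-loading function into the workflow as an argument. The function names here are hypothetical, and scikit-learn's bundled breast cancer dataset simply stands in for "real data" because it needs no download.

```python
from sklearn.datasets import load_breast_cancer, make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def synthetic_data():
    """Synthetic stand-in, in the spirit of the companion scripts."""
    return make_classification(n_samples=400, n_features=10, random_state=7)


def real_data():
    """A real dataset that ships with scikit-learn."""
    return load_breast_cancer(return_X_y=True)


def run_workflow(load_data):
    """Split, preprocess, train, and evaluate; only the data source varies."""
    X, y = load_data()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=7
    )
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    return balanced_accuracy_score(y_test, model.predict(X_test))


synthetic_score = run_workflow(synthetic_data)
real_score = run_workflow(real_data)
print(f"Synthetic data: {synthetic_score:.3f}")
print(f"Real data:      {real_score:.3f}")
```

With real data the metric usually deserves a second look as well. Balanced accuracy is only a placeholder here; a screening task, for example, might care far more about recall.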

The full learning path

#	Article	What you learn	Companion script
1	Machine Learning Foundations with scikit-learn	The core vocabulary: instances, features, targets, models, algorithms, pipelines, baselines, metrics, bias, variance, and ethics.	`module_01_foundations.py`
2	Classification Models with scikit-learn	How models predict categories using k-nearest neighbors, logistic regression, naive Bayes, and discriminant analysis.	`module_02_classification.py`
3	Regression Models with scikit-learn	How models predict numbers using linear regression, Ridge, Lasso, Elastic Net, and k-nearest neighbors regression.	`module_03_regression.py`
4	Model Evaluation, Metrics, and Cross-Validation	How to decide whether a model is trustworthy using metrics, confusion matrices, cross-validation, and hyperparameter search.	`module_04_evaluation_validation.py`
5	Data Preprocessing and Feature Engineering	How raw data becomes model-ready through imputation, scaling, transformations, encoding, `ColumnTransformer`, and pipelines.	`module_05_preprocessing.py`
6	Neural Networks with scikit-learn	How multilayer perceptrons learn nonlinear patterns through hidden layers, activations, scaling, and regularization.	`module_06_neural_networks.py`
7	Gradient Descent and Training Neural Networks	How learning rate, regularization, early stopping, and training settings affect neural-network performance.	`module_07_training_neural_networks.py`
8	Clustering and Unsupervised Learning	How to find groups without a target using k-means, hierarchical clustering, DBSCAN, scaling, and cluster interpretation.	`module_08_clustering.py`
9	Feature Selection and Feature Extraction	How to reduce feature complexity using SelectKBest, RFE, PCA, Kernel PCA, and leakage-safe pipelines.	`module_09_feature_selection_extraction.py`
10	Deep Learning with Keras, Grounded in scikit-learn	How to bridge from scikit-learn baselines to Keras text models with vectorization, embeddings, dense layers, and validation.	`module_10_keras_bridge.py`

The story arc

The series begins with a simple question: how can data help us make better predictions? Module 1 answers by introducing the basic machine learning workflow. It shows that the central structure is not an algorithm; it is a disciplined path from question to evidence.

Modules 2 and 3 split supervised learning into its two most common forms. Classification predicts categories. Regression predicts numbers. These articles show that the same X/y structure can support very different kinds of decisions.
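
A small sketch can show the shared structure: the same X feeds both a regressor and a classifier, and only y changes. The variable names and the rule that generates the targets are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))

# Regression target: a number (a hypothetical purchase amount).
amount = 50 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=2.0, size=300)

# Classification target: a category derived from the same quantity.
is_high = (amount > np.median(amount)).astype(int)

regressor = LinearRegression().fit(X, amount)
classifier = LogisticRegression().fit(X, is_high)

number_predictions = regressor.predict(X[:3])    # continuous values
category_predictions = classifier.predict(X[:3])  # class labels 0 or 1

print(number_predictions)
print(category_predictions)
```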

Module 4 slows down and asks whether the predictions are any good. This is where the series becomes practical. Accuracy, precision, recall, MAE, RMSE, R², cross-validation, and grid search are not just technical details. They are how we decide whether a model deserves trust.
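
As a minimal sketch of how these pieces combine, the example below tunes one hyperparameter with cross-validated grid search and then inspects a confusion matrix on held-out data. The parameter grid and the choice of recall as the scoring metric are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name,
# so the regularization strength is addressed as logisticregression__C.
search = GridSearchCV(
    pipeline,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="recall",
)
search.fit(X_train, y_train)  # cross-validation uses training rows only

print("Best C:", search.best_params_["logisticregression__C"])
print(f"Mean cross-validated recall: {search.best_score_:.3f}")

# The test set is touched once, after the search is finished.
matrix = confusion_matrix(y_test, search.predict(X_test))
print(matrix)
```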

Module 5 goes back to the beginning of the workflow and examines the data itself. Missing values, skewed numeric columns, and categorical variables are normal. Preprocessing is the bridge between real-world records and model-ready features.
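
The three problems named above can each be handled by a branch of a ColumnTransformer. The sketch below assumes pandas is installed; the column names and records are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Made-up records: missing values, one skewed column, one categorical column.
records = pd.DataFrame(
    {
        "income": [32000.0, 45000.0, np.nan, 210000.0, 52000.0, 38000.0],
        "age": [25.0, 41.0, 37.0, np.nan, 52.0, 29.0],
        "plan": ["basic", "pro", "basic", "team", "pro", "basic"],
    }
)

# Skewed numeric column: impute, compress the long tail, then scale.
skewed_numeric = Pipeline(
    [
        ("impute", SimpleImputer(strategy="median")),
        ("log", FunctionTransformer(np.log1p)),
        ("scale", StandardScaler()),
    ]
)

# Ordinary numeric column: impute, then scale.
plain_numeric = Pipeline(
    [
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]
)

preprocessor = ColumnTransformer(
    [
        ("skewed", skewed_numeric, ["income"]),
        ("numeric", plain_numeric, ["age"]),
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# One income column + one age column + three one-hot plan columns.
features = preprocessor.fit_transform(records)
print(features.shape)
```

For brevity the transformer is fitted on all six rows. In a real workflow it would be a step in a pipeline and fitted on the training split only.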

Modules 6 and 7 introduce neural networks in two steps. First, you learn what layered models are and why hidden layers matter. Then you learn how training works: gradient descent, learning rate, regularization, convergence, and early stopping.
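
A compact sketch of both steps with scikit-learn's MLPClassifier follows. The layer sizes, learning rate, and regularization strength are illustrative guesses, not settings from the companion scripts.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two interleaving half-moons: a pattern a straight line cannot separate well.
X, y = make_moons(n_samples=600, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

mlp = MLPClassifier(
    hidden_layer_sizes=(16, 16),  # two hidden layers
    activation="relu",
    alpha=1e-3,                   # L2 regularization strength
    learning_rate_init=0.01,      # step size for the optimizer
    early_stopping=True,          # hold out part of the training data
    validation_fraction=0.15,
    n_iter_no_change=10,          # stop after 10 epochs without improvement
    max_iter=500,
    random_state=0,
)

# Scaling matters for neural networks, so it goes in the pipeline.
model = make_pipeline(StandardScaler(), mlp)
model.fit(X_train, y_train)

mlp_accuracy = model.score(X_test, y_test)
epochs_run = model.named_steps["mlpclassifier"].n_iter_
print(f"Test accuracy: {mlp_accuracy:.3f} after {epochs_run} epochs")
```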

Module 8 changes the rules by removing y. Clustering is unsupervised learning: the model does not learn from known answers. It proposes structure, and humans must interpret whether that structure is meaningful.
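
A minimal sketch of clustering without y might look like this. The blob centers and the deliberately mismatched feature scales are assumptions chosen to show why scaling comes first.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# make_blobs also returns labels, but they are discarded: there is no y here.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [-6, 6]],
    cluster_std=1.0,
    random_state=0,
)
X[:, 1] *= 100  # put the second feature on a much larger scale

# k-means relies on distances, so features are scaled first.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_scaled)

# The silhouette score ranges from -1 to 1; higher means tighter, better-separated groups.
silhouette = silhouette_score(X_scaled, cluster_labels)
print(f"Silhouette score: {silhouette:.3f}")
```

The score says the groups are geometrically distinct. Whether they correspond to anything meaningful is still a judgment the reader has to make.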

Module 9 asks what to do when there are too many features. It compares selecting original columns with extracting new components. It also reinforces a theme that matters everywhere: feature processing must happen inside the training and validation workflow to avoid leakage.
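
The sketch below contrasts the two approaches while keeping both inside a pipeline, so that each cross-validation fold selects or extracts features from its own training portion. The dataset shape and the choice of five features or components are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 30 features, of which only a handful carry signal.
X, y = make_classification(
    n_samples=400, n_features=30, n_informative=5, n_redundant=5, random_state=3
)

# Selection keeps 5 of the original columns.
select_pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=5),
    LogisticRegression(max_iter=1000),
)

# Extraction builds 5 new components from all columns.
extract_pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    LogisticRegression(max_iter=1000),
)

select_score = cross_val_score(select_pipeline, X, y, cv=5).mean()
extract_score = cross_val_score(extract_pipeline, X, y, cv=5).mean()

print(f"SelectKBest accuracy: {select_score:.3f}")
print(f"PCA accuracy:         {extract_score:.3f}")
```

Running SelectKBest on the full dataset before cross-validation would let the target values of the validation rows influence which columns are kept, which is the leakage the pipeline prevents.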

Module 10 closes the series by stepping into Keras. The article shows that deep learning is not a break from the earlier workflow. It is an extension of it. Text still becomes X, sentiment still becomes y, training still learns from examples, and evaluation still decides whether the model is useful.

What readers should be able to do by the end

By the end of the series, a beginner should be able to read a scikit-learn script and understand the major pieces:

			
What is the dataset?
What is X?
What is y?
What preprocessing happens?
Which estimator learns from the data?
Which metric evaluates the result?
What mistakes would matter in the real world?

		

An experienced reader should find a practical, reusable structure for teaching or documenting machine learning workflows. The articles deliberately use narratives, synthetic data, and runnable scripts so concepts are not trapped in abstract definitions.

The goal is not to memorize every estimator. The goal is to develop a durable mental model:

Machine learning is the disciplined process of turning examples into predictions, predictions into evidence, and evidence into responsible decisions.

That is the thread that connects all 10 articles.

Running the companion scripts

Install the core dependencies:

pip install -r requirements.txt

Then run any module script directly:

			
python module_01_foundations.py
python module_02_classification.py
python module_03_regression.py

The Keras section of Module 10 requires TensorFlow. If TensorFlow is not installed, the Module 10 script still runs the scikit-learn baseline and explains how to enable the Keras portion.
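
One common way to get this behavior is to check for TensorFlow before using it. The sketch below is an assumption about how such a fallback could be written, not the actual Module 10 code, and the tiny sentiment corpus is made up.

```python
import importlib.util

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus: text becomes X, sentiment becomes y (1 = positive).
texts = [
    "great product and fast delivery",
    "excellent value, works perfectly",
    "really happy with this purchase",
    "love it, great quality",
    "terrible quality, broke in a day",
    "very disappointed, waste of money",
    "awful support and slow delivery",
    "bad experience, would not buy again",
]
sentiments = [1, 1, 1, 1, 0, 0, 0, 0]

# The scikit-learn baseline runs whether or not TensorFlow is installed.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(texts, sentiments)
predictions = baseline.predict(
    ["great value and fast delivery", "terrible quality, very disappointed"]
)
print("Baseline predictions:", predictions)

# Look for TensorFlow without importing it.
HAS_TENSORFLOW = importlib.util.find_spec("tensorflow") is not None

if HAS_TENSORFLOW:
    print("TensorFlow found: the Keras portion could run here.")
else:
    print("TensorFlow not found: skipping the Keras portion.")
    print("To enable it, run: pip install tensorflow")
```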
