Module 07: Gradient Descent and Training Neural Networks

A neural network can be described as a layered prediction machine, but that description leaves out the most important practical question: how does the machine learn useful weights from data?

Training a neural network is an optimization story. The model starts with imperfect weights. It makes predictions. The predictions produce errors. The algorithm adjusts the weights to reduce those errors. This repeats many times.

The central character is gradient descent.

The companion script is module_07_training_neural_networks.py, on the main branch of the aduwillie/ML-Blog repository.

It compares neural-network training settings in scikit-learn and shows how learning rate, regularization, and early stopping affect performance.

Standalone orientation

You can read this article independently if you know only this much: neural networks learn by adjusting internal weights so predictions become less wrong over time. This module explains the training process behind that adjustment.

If you are reading the whole series, this article deepens the neural-network story from Module 6. If you are reading it alone, focus on the knobs that shape training: learning rate controls step size, regularization controls flexibility, max_iter controls opportunity to learn, and early stopping uses validation behavior to decide when enough training is enough.

How to read the examples: `X`, `y`, and training knobs

In the companion script, X is a generated table of numeric features. y is a binary class label. The task is classification, but the module’s focus is not the dataset itself. The focus is how different training choices change the behavior of the same general model family.

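Each configuration creates a pipeline with two components:

from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

Pipeline([
    ("scaler", StandardScaler()),  # standardize each feature
    ("mlp", MLPClassifier(...)),   # "..." stands for the training knobs being compared
])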

StandardScaler prepares X so each feature has a comparable scale. MLPClassifier learns from X_train and y_train. The values inside MLPClassifier(...) are training knobs:

Setting	What it controls
`hidden_layer_sizes`	Model capacity: how many hidden units and layers the network has.
`learning_rate_init`	Initial step size for optimization.
`alpha`	Regularization strength.
`early_stopping`	Whether to stop training when validation performance stops improving.
`validation_fraction`	The share of training data used internally for early-stopping validation.
`max_iter`	The maximum number of optimization iterations.

The script compares predictions against y_test using accuracy and F1-score. If two models use the same X and y but different training settings, changes in performance come from how the model was trained, not from a different problem definition.
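The setup described above can be sketched end to end as follows. The companion script's data generation is not reproduced here, so the use of `make_classification`, the sample counts, and every hyperparameter value are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a synthetic binary classification table stands in for the script's data.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical knob values; the companion script may use different ones.
model = Pipeline([
    ("scaler", StandardScaler()),
    ("mlp", MLPClassifier(
        hidden_layer_sizes=(16,),
        learning_rate_init=0.001,
        alpha=0.0001,
        max_iter=500,
        random_state=0,
    )),
])

model.fit(X_train, y_train)      # the scaler is fitted on the training data only
y_pred = model.predict(X_test)   # the same scaling is reused for the test data

print(f"accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"F1-score: {f1_score(y_test, y_pred):.3f}")
```

Putting the scaler inside the pipeline means the test rows never influence the scaling statistics.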

The landscape metaphor

Imagine standing in a foggy valley. Your goal is to reach the lowest point, but you can only sense the slope under your feet. Gradient descent works like that. The loss function defines a landscape. The model’s weights define a location in that landscape. The gradient points in the direction of steepest increase, so the algorithm steps in the opposite direction.

The size of the step is controlled by the learning rate.

If the learning rate is too small, training may crawl. If it is too large, training may bounce around or fail to settle. A good learning rate helps the model improve steadily.
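The effect of step size can be seen on a toy problem. The sketch below assumes a one-dimensional loss L(w) = w**2, which is far simpler than anything MLPClassifier optimizes; the function name, starting point, and learning rates are all hypothetical.

```python
def gradient_descent(learning_rate, steps=25, start=5.0):
    """Minimize the toy loss L(w) = w**2, whose gradient is 2 * w."""
    w = start
    for _ in range(steps):
        gradient = 2 * w                   # slope at the current position
        w = w - learning_rate * gradient   # step against the gradient
    return w


# Too small, reasonable, and too large for this particular loss.
for lr in (0.001, 0.1, 1.1):
    print(f"learning_rate={lr}: final w = {gradient_descent(lr):.4f}")
```

On this loss, the smallest rate barely moves away from the starting point, the middle rate ends close to the minimum at 0, and the largest rate overshoots further on every step, so the weight grows in magnitude instead of settling.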

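In scikit-learn’s MLP estimators, the step size is set with learning_rate_init:

MLPClassifier(
    solver="adam",             # scikit-learn's default solver
    learning_rate_init=0.001,  # scikit-learn's default initial step size
)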

The default adam solver is a strong general-purpose optimizer. It adapts the effective step size for each weight using running averages of recent gradients and their squares, and it often works well without extensive tuning.

Epochs, iterations, and convergence

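Training happens over repeated passes through the data; one full pass is an epoch. In scikit-learn, max_iter sets the maximum number of iterations, and for the sgd and adam solvers each iteration is one epoch rather than one gradient step. If the model has not converged before that limit, scikit-learn emits a ConvergenceWarning.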

Increasing max_iter can help, but it is not always the right fix. Poor scaling, an unsuitable learning rate, too much model complexity, or noisy data can also cause training trouble.
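One way to check whether training hit the iteration limit is to capture the warning and inspect the fitted estimator. This is a minimal sketch on assumed synthetic data, with max_iter set deliberately low so the limit is reached.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data stands in for the companion script's dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Deliberately tiny iteration budget so the optimizer is cut off early.
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=5, random_state=0)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mlp.fit(X_scaled, y)

hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
print(f"epochs run: {mlp.n_iter_}, ConvergenceWarning raised: {hit_limit}")
print(f"last training loss: {mlp.loss_curve_[-1]:.4f}")
```

If loss_curve_ is still falling steadily when the limit is reached, more iterations are a reasonable fix; if it is flat or erratic, the other causes listed above are more likely.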

This is why neural-network training is not just “make the model bigger.” Training settings are part of the model design.

Regularization: controlling flexibility

MLPs in scikit-learn include an alpha parameter for L2 regularization. Larger values penalize large weights more strongly.

MLPClassifier(alpha=0.01)

Regularization helps reduce overfitting. If the training score is high and validation score is weak, increasing alpha, simplifying the network, or using early stopping may help.

This connects directly to the bias-variance tradeoff. A very flexible neural network can memorize. Regularization asks it to learn a smoother story.
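The gap between training and validation scores can be measured for several alpha values. The sketch below assumes small, noisy synthetic data so that a flexible network has room to overfit; the alpha values and network size are illustrative, not taken from the companion script.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: noisy labels (flip_y) and few rows make overfitting easy to provoke.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

results = {}
for alpha in (0.0001, 0.01, 1.0):   # hypothetical values for illustration
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("mlp", MLPClassifier(
            hidden_layer_sizes=(64, 64),
            alpha=alpha,
            max_iter=1000,
            random_state=0,
        )),
    ])
    model.fit(X_train, y_train)
    # Store (training accuracy, validation accuracy) for this alpha.
    results[alpha] = (model.score(X_train, y_train), model.score(X_val, y_val))

for alpha, (train_acc, val_acc) in results.items():
    gap = train_acc - val_acc
    print(f"alpha={alpha}: train={train_acc:.3f}  validation={val_acc:.3f}  gap={gap:.3f}")
```

A large gap suggests memorization. Whether a larger alpha actually narrows it depends on the data, so the printed numbers should be read rather than assumed.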

Early stopping: knowing when to stop learning

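Early stopping monitors validation performance during training. With early_stopping=True, scikit-learn holds out a share of the training data (validation_fraction) and stops once the validation score has not improved by at least tol for n_iter_no_change consecutive epochs, before the model overfits further.

MLPClassifier(
    early_stopping=True,
    validation_fraction=0.15,
)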

Early stopping is practical because it uses the model’s own learning curve as evidence. Instead of deciding in advance exactly how long training should run, we let validation behavior help decide.
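After fitting, the estimator records how early stopping behaved. A minimal sketch, assuming synthetic data and an illustrative network size:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data stands in for the companion script's dataset.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

mlp = MLPClassifier(
    hidden_layer_sizes=(32,),
    early_stopping=True,
    validation_fraction=0.15,   # share of the training data held out internally
    n_iter_no_change=10,        # scikit-learn's default patience, written out
    max_iter=500,
    random_state=0,
)
mlp.fit(X_scaled, y)

print(f"epochs run: {mlp.n_iter_} (limit was {mlp.max_iter})")
print(f"best validation accuracy: {mlp.best_validation_score_:.3f}")
print(f"validation scores recorded: {len(mlp.validation_scores_)}")
```

When n_iter_ ends well below max_iter, the validation curve, not the iteration limit, decided when training stopped.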

Training options are modeling choices

The companion script compares several MLP configurations:

small network
larger network
stronger regularization
early stopping
different learning rates

The point is not that one setting always wins. The point is that training behavior changes the model. Neural networks are flexible, but that flexibility must be shaped by validation.
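A comparison of this kind might be organized as a loop over named configurations. The dictionary below is hypothetical: the names follow the configurations described above, but the values are illustrative and the companion script may differ.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: one fixed synthetic dataset shared by every configuration.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

big = (64, 64)
configs = {
    "small network": dict(hidden_layer_sizes=(8,)),
    "larger network": dict(hidden_layer_sizes=big),
    "stronger regularization": dict(hidden_layer_sizes=big, alpha=0.1),
    "early stopping": dict(
        hidden_layer_sizes=big, early_stopping=True, validation_fraction=0.15
    ),
    "higher learning rate": dict(hidden_layer_sizes=big, learning_rate_init=0.05),
    "lower learning rate": dict(hidden_layer_sizes=big, learning_rate_init=0.0001),
}

scores = {}
for name, params in configs.items():
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("mlp", MLPClassifier(max_iter=500, random_state=0, **params)),
    ])
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = (accuracy_score(y_test, y_pred), f1_score(y_test, y_pred))

for name, (acc, f1) in scores.items():
    print(f"{name:<24} accuracy={acc:.3f}  F1={f1:.3f}")
```

Fixing random_state for the data split and the network keeps the training settings as the only thing that varies between rows.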

For beginners, the training loop is: predict, measure error, adjust, repeat. For experts, the deeper view is: optimization, regularization, validation design, and data preprocessing all interact.

What to notice when running the sample

The sample holds the dataset fixed and changes training settings. This is intentional. If performance changes, the difference comes from the training configuration rather than a different problem.

Compare the small network with the larger network. The larger network has more capacity, but more capacity is not always better. Compare regularization settings. Stronger regularization may reduce overfitting but can also reduce the model’s ability to fit real structure. Compare early stopping. If it works well, it acts like a guardrail that stops training when validation behavior suggests the model is no longer improving.

Learning rate is especially important. A larger learning rate can speed training, but it can also overshoot useful solutions. A smaller learning rate can be stable but slow. The right value depends on the data, scaling, architecture, optimizer, and loss landscape.

Common training traps

One trap is treating max_iter as the only training control. If the model does not converge, increasing iterations may help, but it may also hide deeper issues. Check scaling, learning rate, architecture size, and regularization.

Another trap is consulting the test set on every tuning attempt. If you repeatedly change settings after looking at test performance, the test set stops being a neutral exam. Use validation or cross-validation for tuning, then reserve the test set for final evaluation.
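One way to keep the test set out of tuning is cross-validated search on the training data alone. The sketch below uses GridSearchCV with a small hypothetical grid; the data, grid values, and fold count are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data stands in for the companion script's dataset.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("mlp", MLPClassifier(hidden_layer_sizes=(16,), max_iter=300, random_state=0)),
])

# Hypothetical grid; the "mlp__" prefix routes each setting to the "mlp" step.
param_grid = {
    "mlp__alpha": [0.0001, 0.01],
    "mlp__learning_rate_init": [0.001, 0.01],
}

# Tuning sees only the training data, split into 3 cross-validation folds.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1")
search.fit(X_train, y_train)

print("best settings:", search.best_params_)
print(f"cross-validated F1: {search.best_score_:.3f}")

# The test set is consulted once, after the settings are chosen.
# With scoring="f1", search.score also reports F1.
print(f"final test F1: {search.score(X_test, y_test):.3f}")
```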

Finally, remember that training curves matter. If training performance improves while validation performance worsens, the model is overfitting. If both remain poor, the model may be underfitting, poorly configured, or missing useful features.
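Training and validation curves can be recorded by training one epoch at a time. This sketch uses MLPClassifier.partial_fit for that purpose; the data, network size, and epoch count are assumptions, and the companion script may not work this way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: noisy synthetic data, so the two curves have a chance to diverge.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

scaler = StandardScaler().fit(X_train)   # fit scaling on the training data only
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)

mlp = MLPClassifier(hidden_layer_sizes=(64, 64), random_state=0)
classes = np.unique(y_train)

n_epochs = 60
train_curve, val_curve = [], []
for _ in range(n_epochs):
    # Each partial_fit call performs one pass over the data it is given.
    mlp.partial_fit(X_train_s, y_train, classes=classes)
    train_curve.append(mlp.score(X_train_s, y_train))
    val_curve.append(mlp.score(X_val_s, y_val))

# Report every tenth epoch.
for epoch in range(9, n_epochs, 10):
    print(
        f"epoch {epoch + 1:>2}: "
        f"train={train_curve[epoch]:.3f}  validation={val_curve[epoch]:.3f}"
    )
```

Training accuracy that keeps rising while validation accuracy stalls or falls is the overfitting pattern described above.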

The module in one journey

Gradient descent explains how models improve. Learning rate controls step size. Iterations control training opportunity. Regularization controls flexibility. Early stopping uses validation performance to stop before overfitting. Together, these ideas make neural networks trainable rather than mysterious.

Run the sample:

python module_07_training_neural_networks.py
