aduwillie.com

So far, our models have learned from labeled examples. We knew the target: order count, satisfaction, readmission, high demand. But sometimes the most interesting question appears before labels exist.

Imagine a basketball analytics team studying player styles. They have measurements: scoring rate, assists, rebounds, defensive activity, shot distance, and minutes played. They do not begin with labels like rim_protector, floor_general, or three_point_specialist. They want the data to suggest groups.

That is the world of unsupervised learning. In clustering, the model receives features but no target. It tries to discover structure.

The companion script is module_08_clustering.py, on the main branch of the aduwillie/ML-Blog repository.

It creates a synthetic player-profile dataset and compares k-means, hierarchical clustering, and DBSCAN using scikit-learn.


Standalone orientation

You can read this article without knowing the supervised-learning modules. Clustering is different because there is no answer column for the model to learn from. The model receives only input features and tries to discover groups.

If you are reading the whole series, this module is the turn from supervised learning to unsupervised learning. If you are reading it alone, keep this distinction in mind: clustering does not prove that natural categories exist. It proposes groupings based on feature similarity, and humans must decide whether those groupings are meaningful.


How to read the examples: X, no y, and cluster labels

Clustering is different from the supervised modules because there is no training target. In the companion script, X contains player-profile features such as points_per_game, assists_per_game, rebounds_per_game, defensive_activity, three_point_attempts, and minutes_per_game.

There is no y used for training because the clustering algorithms are not trying to learn from known answers. They are trying to discover groups from the structure of X itself.

The script includes a column named simulated_archetype, but it is deliberately dropped before clustering:

X = df.drop(columns=["simulated_archetype"])

That column exists only so you, the reader, can compare discovered clusters with the hidden story that generated the synthetic data. Real clustering projects usually do not have such convenient truth labels.

The output of clustering is usually called labels, not predictions, because the algorithm is assigning group IDs rather than predicting known classes. A label of 0, 1, or 2 has no built-in meaning. You interpret it by examining the average feature values and examples in that cluster. DBSCAN can also output -1, which means the point was treated as noise.
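
As a minimal sketch of that interpretation step, the snippet below clusters a small made-up dataset and prints the average feature values per cluster. The two feature names are borrowed from the companion script, but the numbers, the two-group setup, and n_init=10 are assumptions made for illustration, not the script's actual data or settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 50 high-scoring players and 50 high-assist players.
rng = np.random.default_rng(0)
feature_names = ["points_per_game", "assists_per_game"]
X = np.vstack([
    rng.normal([22.0, 3.0], [2.0, 1.0], size=(50, 2)),
    rng.normal([9.0, 7.0], [2.0, 1.0], size=(50, 2)),
])

# Cluster on scaled features, but interpret on the original units.
scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(scaled)

# The IDs 0 and 1 are arbitrary; the per-cluster averages carry the meaning.
for cluster_id in np.unique(labels):
    means = X[labels == cluster_id].mean(axis=0)
    summary = ", ".join(f"{name}={value:.1f}" for name, value in zip(feature_names, means))
    print(f"cluster {cluster_id}: {summary}")
```

Which group ends up as cluster 0 and which as cluster 1 can change between runs or library versions, which is exactly why the averages matter more than the IDs.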


Clustering asks a different kind of question

Supervised learning asks:

Given features and known answers, can we predict answers for new examples?

Clustering asks:

Given features but no known answers, do examples naturally form groups?

The lack of a target changes everything. There is no accuracy score against a known label unless we have external ground truth. Instead, we evaluate whether clusters are compact, separated, stable, and meaningful for the domain.

For basketball players, a cluster is useful only if it helps analysts reason about player roles, development plans, scouting, or roster construction.


k-means: finding centers

k-means clustering tries to place k centers in the data. Each instance is assigned to the nearest center, and each center is then moved to the mean of the instances assigned to it. The algorithm repeats these two steps until the assignments stop changing.
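
The assign-and-update loop can be sketched in plain NumPy. This is an illustrative simplification, not scikit-learn's implementation: the function name kmeans_sketch, the random initialization, and the toy data are all assumptions, and the real KMeans class uses smarter initialization and multiple restarts.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Toy k-means: alternate nearest-center assignment and center updates."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k).
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each center to the mean of its points; keep it if it has none.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so assignments are stable
        centers = new_centers
    return labels, centers

# Two well-separated hypothetical groups of 30 points each.
rng = np.random.default_rng(1)
X_demo = np.vstack([
    rng.normal(0.0, 0.5, size=(30, 2)),
    rng.normal(10.0, 0.5, size=(30, 2)),
])
demo_labels, demo_centers = kmeans_sketch(X_demo, k=2)
print(demo_centers.round(1))
```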

In scikit-learn:

from sklearn.cluster import KMeans
model = KMeans(n_clusters=4, random_state=42, n_init="auto")

k-means is fast and intuitive. It works best when clusters are roughly round, similarly sized, and separated in feature space. It requires you to choose k, the number of clusters, before training.

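That choice should not be arbitrary. You can compare inertia (the sum of squared distances from each point to its assigned center), silhouette scores, and domain usefulness across several values of k. Inertia tends to fall as k grows, so look for the point where the improvement levels off rather than for the lowest value. But no metric can fully replace interpretation. Four statistically neat clusters may be less useful than three clusters that coaches can actually understand.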
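
One way to run that comparison is to loop over candidate values of k and record both metrics. The sketch below uses synthetic blobs from make_blobs as a stand-in for player data; the dataset, the range of k, and n_init=10 are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; a real project would use its own feature matrix.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)
scaled = StandardScaler().fit_transform(X)

results = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, random_state=42, n_init=10).fit(scaled)
    # inertia_: within-cluster sum of squared distances (lower is tighter).
    # silhouette_score: between -1 and 1 (higher is better separated).
    results[k] = (model.inertia_, silhouette_score(scaled, model.labels_))
    print(f"k={k}  inertia={results[k][0]:.1f}  silhouette={results[k][1]:.3f}")
```

The printed table is a starting point for discussion, not a verdict: the value of k that scores best may still not be the one analysts find most useful.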


Hierarchical clustering: building a tree of similarity

Hierarchical clustering creates a nested structure of groups. Agglomerative clustering starts with each instance as its own cluster, then repeatedly merges the closest clusters.

In scikit-learn:

from sklearn.cluster import AgglomerativeClustering
model = AgglomerativeClustering(n_clusters=4)

Hierarchical clustering is useful when you want to explore structure at multiple levels. Basketball players might first divide into guards, wings, and bigs, then subdivide into more specific play styles.

The story is not just “which cluster?” It is “how do clusters relate to each other?”
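
To look at more than one level of the same tree, one option is SciPy's hierarchy tools, which build the merge tree once and let you cut it at different depths. This is a sketch under assumptions: the companion script uses AgglomerativeClustering, while the three toy groups, the Ward linkage choice, and the cut levels below are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Three hypothetical groups of 20 points each in a 2-feature space.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0.0, 0.0], 0.3, size=(20, 2)),
    rng.normal([5.0, 0.0], 0.3, size=(20, 2)),
    rng.normal([5.0, 5.0], 0.3, size=(20, 2)),
])

# Build the full merge tree once (Ward linkage merges to minimize variance).
tree = linkage(X, method="ward")

# Cut the same tree at two depths: a coarse view and a finer view.
coarse = fcluster(tree, t=2, criterion="maxclust")
fine = fcluster(tree, t=3, criterion="maxclust")

# Every fine cluster sits inside exactly one coarse cluster.
for fine_id in np.unique(fine):
    parents = np.unique(coarse[fine == fine_id])
    print(f"fine cluster {fine_id} -> coarse cluster(s) {parents.tolist()}")
```

Because both cuts come from one tree, the finer groups are subdivisions of the coarser ones, which is the nesting the section describes.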


DBSCAN: finding dense regions and noise

Some data does not form neat round clusters. DBSCAN groups points that live in dense neighborhoods and marks isolated points as noise.

In scikit-learn:

from sklearn.cluster import DBSCAN
model = DBSCAN(eps=0.9, min_samples=8)

DBSCAN can discover irregular shapes and identify outliers. It does not require choosing the number of clusters directly, but it is sensitive to eps, the neighborhood radius, and min_samples, the density requirement.

For player profiles, DBSCAN might reveal a group of common role players and mark rare hybrid players as noise. Whether that is useful depends on the analysis goal.
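
A small sketch of the noise behavior, using made-up points: two dense groups plus three isolated points placed far from both. The data and the eps and min_samples values here are assumptions picked to suit this toy example, not the companion script's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense_a = rng.normal([0.0, 0.0], 0.2, size=(60, 2))
dense_b = rng.normal([4.0, 4.0], 0.2, size=(60, 2))
# Three hypothetical "rare hybrid" points, far from both dense groups.
isolated = np.array([[2.0, -3.0], [-3.0, 4.0], [8.0, 0.0]])
X = np.vstack([dense_a, dense_b, isolated])

labels = DBSCAN(eps=0.5, min_samples=8).fit_predict(X)

# -1 is DBSCAN's noise label, so exclude it when counting clusters.
n_clusters = len(set(labels.tolist()) - {-1})
n_noise = int(np.sum(labels == -1))
print(f"clusters found: {n_clusters}, points marked as noise: {n_noise}")
```

Changing eps or min_samples on the same data can change both counts, which is the sensitivity described above.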


Scaling and interpretation

Distance-based clustering is very sensitive to feature scaling. If a feature such as salary were measured in millions while assists are measured per game, salary would dominate the distance calculation. The companion script scales features before clustering:

from sklearn.preprocessing import StandardScaler
scaled_features = StandardScaler().fit_transform(X)
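
To see the effect on distance directly, the sketch below finds the nearest neighbor of one player before and after scaling. The four players, the salary column, and the helper nearest_neighbor are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical columns: salary in dollars, assists per game.
X = np.array([
    [2_000_000.0, 9.0],   # player 0: low salary, high assists
    [2_100_000.0, 1.0],   # player 1: similar salary, low assists
    [4_000_000.0, 8.5],   # player 2: higher salary, similar assists
    [9_000_000.0, 1.5],   # player 3: high salary, low assists
])

def nearest_neighbor(data, i):
    """Index of the row closest to row i by Euclidean distance."""
    distances = np.linalg.norm(data - data[i], axis=1)
    distances[i] = np.inf  # ignore the point itself
    return int(distances.argmin())

raw_nn = nearest_neighbor(X, 0)
scaled_nn = nearest_neighbor(StandardScaler().fit_transform(X), 0)
print(f"nearest to player 0 on raw features: player {raw_nn}")
print(f"nearest to player 0 on scaled features: player {scaled_nn}")
```

On raw features the salary gap decides everything, so player 0 is matched with player 1 despite opposite assist numbers. After scaling, both features contribute and player 2 becomes the closest match.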

After clustering, interpretation returns to the original domain. Cluster labels like 0, 1, and 2 have no inherent meaning. We give them meaning by comparing feature averages, distributions, and examples.

This is the expert habit: clustering produces hypotheses, not final truth.


What to notice when running the sample

The script prints cluster averages. This is the first step in interpretation. If one cluster has high assists and moderate scoring, you might describe it as playmaking guards. If another has high rebounds and defensive activity, you might describe it as interior anchors. The algorithm gives numbers; the analyst gives meaning.

Compare k-means and hierarchical clustering. They may produce similar groups when the data has clear compact clusters. DBSCAN may behave differently because it is looking for dense regions and can mark points as noise. That difference is not a bug. It reflects a different definition of what a cluster is.
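
Because cluster IDs are arbitrary, comparing two algorithms needs a measure that ignores the ID numbering. One option is the adjusted Rand index from scikit-learn. The sketch below applies it to synthetic, well-separated blobs; the data and settings are assumptions, and real player data would likely agree less cleanly.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Four compact, well-separated synthetic groups.
X, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
    cluster_std=0.6,
    random_state=0,
)
scaled = StandardScaler().fit_transform(X)

kmeans_labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(scaled)
hier_labels = AgglomerativeClustering(n_clusters=4).fit_predict(scaled)

# 1.0 means identical groupings (up to renaming); values near 0 mean chance-level agreement.
agreement = adjusted_rand_score(kmeans_labels, hier_labels)
print(f"adjusted Rand index between k-means and hierarchical: {agreement:.3f}")
```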

The silhouette score, which ranges from -1 to 1 with higher values indicating tighter and better-separated clusters, can help compare cluster separation, but it should not be the only judge. A high silhouette score does not guarantee useful basketball archetypes. A lower score may still reveal meaningful hybrid roles that matter to coaches.


Common clustering traps

The first trap is believing cluster IDs are names. Cluster 0 does not mean anything until you inspect the rows assigned to it. The second trap is forcing clusters where none exist. Most algorithms will produce labels even when the data is a continuum.
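
The second trap is easy to reproduce. In the sketch below, k-means is asked for four clusters on uniformly scattered points that have no group structure, and it still returns four labels. The data, the helper cluster_and_score, and the comparison against synthetic blobs are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# A continuum: points spread evenly over a square, with no real groups.
continuum = rng.uniform(0.0, 1.0, size=(400, 2))
# For contrast: four compact, well-separated synthetic groups.
blobs, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
    cluster_std=0.6,
    random_state=0,
)

def cluster_and_score(data):
    """Run k-means with k=4 and return the labels and silhouette score."""
    labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(data)
    return labels, silhouette_score(data, labels)

continuum_labels, continuum_score = cluster_and_score(continuum)
blob_labels, blob_score = cluster_and_score(blobs)

print(f"continuum: {len(np.unique(continuum_labels))} clusters, silhouette={continuum_score:.3f}")
print(f"blobs:     {len(np.unique(blob_labels))} clusters, silhouette={blob_score:.3f}")
```

Both runs report four clusters. Only inspection of the data, helped by a lower separation score on the continuum, suggests that one of the two groupings is an artifact of asking for four.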

The third trap is evaluating unsupervised learning as if it were supervised learning. If you happen to have simulated labels, they can help you understand the exercise, but real clustering often has no ground truth. Stability, interpretability, and usefulness matter more than pretending there is one perfect answer.
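
Stability can be checked without any ground truth. One simple approach, sketched here under assumptions, is to rerun k-means with different random seeds and measure how much the resulting groupings agree with each other. The synthetic data, the five seeds, and the use of the adjusted Rand index are illustrative choices, not the companion script's method.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [8, 0], [4, 7]],
    cluster_std=0.7,
    random_state=0,
)
scaled = StandardScaler().fit_transform(X)

# One single-start run per seed, so each run can land on a different solution.
runs = [
    KMeans(n_clusters=3, random_state=seed, n_init=1).fit_predict(scaled)
    for seed in range(5)
]

# Pairwise agreement between runs; cluster ID numbering does not matter.
scores = [adjusted_rand_score(a, b) for a, b in combinations(runs, 2)]
mean_agreement = float(np.mean(scores))
print(f"mean pairwise agreement across {len(runs)} runs: {mean_agreement:.3f}")
```

Agreement close to 1 suggests the grouping does not depend on the starting point. Low agreement is a warning that the clusters may be fragile, whatever a single run's metrics say.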


The module in one journey

Clustering changes the machine learning story. There is no target and no simple correctness score. k-means finds centers. Hierarchical clustering builds nested groups. DBSCAN finds dense regions and noise. Scaling is essential. Interpretation is a human responsibility.

Run the sample:

python module_08_clustering.py
