ML.CLUSTERING.KMEANS¶
Creates a K-Means model to group similar data.
Syntax¶
=ML.CLUSTERING.KMEANS([n_clusters], [init], [n_init], [max_iter], [tol], [random_state], [algorithm])
All arguments are optional and default to the values listed below.
Arguments¶
| Name | Type | Default | Description |
|---|---|---|---|
| n_clusters | int | 8 | Number of clusters to create. Choose based on your data and business needs. Use techniques like the elbow method to find optimal number. |
| init | str | "k-means++" | How to initialize cluster centers: 'k-means++' (recommended, smarter initialization) or 'random' (faster but less optimal). Use 'k-means++' for better results. |
| n_init | int \| str | "auto" | Number of times the algorithm runs with different centroid seeds; the best run (lowest inertia) is kept. "auto" (default) lets the library choose a sensible count based on init. |
| max_iter | int | 300 | Maximum iterations per run. Higher values may find better clusters but take longer. 300 (default) works well for most cases. |
| tol | float | 0.0001 | Tolerance for convergence. Lower values (0.0001 default) are more precise but may take longer. Increase to 0.001 for faster convergence. |
| random_state | int | None | Seed for random number generation. Use any integer (e.g., 42) to get reproducible results. Leave empty for different results each time. |
| algorithm | str | "lloyd" | Algorithm variant: 'lloyd' (default, works for all cases) or 'elkan' (faster for dense data). Use 'lloyd' unless you have performance issues. |
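The argument names and defaults above match scikit-learn's KMeans estimator (an assumption about the backend, not something this page states). A minimal Python sketch of the same parameters, showing how random_state makes cluster labels reproducible:

```python
# Sketch assuming ML.CLUSTERING.KMEANS mirrors scikit-learn's KMeans arguments.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated groups of rows to cluster.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0, 4, 8)])

def labels(seed):
    km = KMeans(
        n_clusters=3,        # n_clusters
        init="k-means++",    # init
        n_init=10,           # n_init (a fixed count; "auto" lets the library pick)
        max_iter=300,        # max_iter
        tol=0.0001,          # tol
        random_state=seed,   # random_state
    )
    return km.fit_predict(X)

# Same seed -> identical labels on every run.
assert (labels(42) == labels(42)).all()
```

Without a seed, repeated runs may assign different (though usually equivalent) labelings, which is why the table recommends an integer like 42 for reproducibility.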
Returns¶
A K-Means model handle, ready to pass into ML.FIT and then ML.PREDICT.
When to use¶
Reach for K-Means when you have unlabeled numeric data and want to discover groups of similar rows — customer segments, well-log facies, store types, etc. K-Means is fast, simple, and a strong default first pass at clustering when you can guess a reasonable number of groups.
K-Means works best when:
- Your features are all numeric (encode categoricals first with ML.PREPROCESSING.ONE_HOT_ENCODER or ML.PREPROCESSING.ORDINAL_ENCODER).
- Clusters are roughly spherical and similarly sized.
- You can scale features so distances are meaningful (use ML.PREPROCESSING.STANDARD_SCALER first).

If your clusters are non-spherical, very unequal in size, or you don't know n_clusters ahead of time, K-Means may struggle.
Examples¶
Build a K-Means model with 3 clusters and fit it on the unlabeled data in A2:E100, then read the cluster label predicted for each row. Pre-scale the features so distances aren't dominated by the largest column:

In H1: =ML.PREPROCESSING.STANDARD_SCALER()
In H2: =ML.FIT_TRANSFORM(H1, A2:E100)
In H3: =ML.CLUSTERING.KMEANS(3)
In H4: =ML.FIT(H3, H2#)
In H5: =ML.PREDICT(H4, H2#)

H5 spills one cluster label per row. Note that prediction uses the scaled data in H2#, the same data the model was fitted on.
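The spreadsheet pipeline above has a rough Python equivalent, sketched here under the assumption that the ML.* functions wrap scikit-learn's StandardScaler and KMeans (the backend isn't stated on this page; X stands in for the values in A2:E100):

```python
# Hypothetical equivalent of the scale-then-cluster pipeline, assuming a
# scikit-learn style backend.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# 99 rows x 5 columns, like A2:E100, with wildly different column magnitudes.
X = rng.normal(size=(99, 5)) * [1, 10, 100, 1000, 10000]

scaled = StandardScaler().fit_transform(X)   # the STANDARD_SCALER + FIT_TRANSFORM step
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)  # KMEANS(3) + FIT
clusters = model.labels_                     # one cluster label per input row
```

Without the scaling step, the fifth column (magnitude ~10000) would dominate the Euclidean distances and effectively decide the clusters on its own.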
Pass an integer to random_state for reproducible cluster labels across runs, spelling out the earlier arguments at their defaults:

=ML.CLUSTERING.KMEANS(3, "k-means++", "auto", 300, 0.0001, 42)
Remarks¶
- n_clusters is the only argument you usually need to set. There is no universal "right" number; try a few values and pick the one that gives the most useful business interpretation.
- The "elbow method" is a common way to pick n_clusters: fit K-Means for several values, plot the inertia (sum of squared distances to the nearest cluster center), and look for the elbow where the curve flattens.
- Scale your features first with ML.PREPROCESSING.STANDARD_SCALER: K-Means uses Euclidean distance, so unequal feature magnitudes will skew the cluster shapes.
- K-Means assigns every row to exactly one cluster. If you need probabilities or soft assignments, K-Means is not the right tool.
- For reproducible results, pass an integer to random_state.
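The elbow method from the remarks can be sketched in Python, again assuming a scikit-learn style backend (inertia_ is scikit-learn's name for the sum of squared distances the remark describes):

```python
# Elbow-method sketch: three well-separated blobs, so the elbow should sit at k = 3.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_  # sum of squared distances to the nearest center

# The drop from k=2 to k=3 is large; after k=3 the curve flattens.
assert inertias[2] - inertias[3] > inertias[3] - inertias[4]
```

In a spreadsheet you would do the same thing by fitting several ML.CLUSTERING.KMEANS models with different n_clusters values and charting their inertia against k.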