ML.CLUSTERING.KMEANS

Creates a K-Means model to group similar data.

Syntax

ML.CLUSTERING.KMEANS(n_clusters, init, n_init, max_iter, tol, random_state, algorithm)

Arguments

Name          Type        Default       Description
n_clusters    int         8             Number of clusters to create. Choose based on your data and business needs; use a technique like the elbow method to find a good number.
init          str         "k-means++"   How to initialize cluster centers: 'k-means++' (recommended; spreads initial centers apart) or 'random' (faster but less reliable). Use 'k-means++' for better results.
n_init        int or str  "auto"        Number of times the algorithm is run with different centroid seeds; the best run (lowest inertia) is kept. "auto" (default) chooses a sensible count based on init.
max_iter      int         300           Maximum iterations per run. Higher values may find better clusters but take longer; the default of 300 works well for most cases.
tol           float       0.0001        Convergence tolerance. Lower values are more precise but may take longer; increase to 0.001 for faster convergence.
random_state  int         None          Seed for random number generation. Pass any integer (e.g., 42) for reproducible results; leave empty for different results each run.
algorithm     str         "lloyd"       Algorithm variant: 'lloyd' (default; works in all cases) or 'elkan' (can be faster on dense data). Use 'lloyd' unless you hit performance issues.

Returns

A K-Means model handle, ready to pass into ML.FIT and then ML.PREDICT.

When to use

Reach for K-Means when you have unlabeled numeric data and want to discover groups of similar rows — customer segments, well-log facies, store types, etc. K-Means is fast, simple, and a strong default first pass at clustering when you can guess a reasonable number of groups.

K-Means works best when:

  • Your features are all numeric (encode categoricals first with ML.PREPROCESSING.ONE_HOT_ENCODER or ML.PREPROCESSING.ORDINAL_ENCODER).
  • Clusters are roughly spherical and similarly sized.
  • You can scale features so distances are meaningful (use ML.PREPROCESSING.STANDARD_SCALER first).

If your clusters are non-spherical, very unequal in size, or you don't know n_clusters ahead of time, K-Means may struggle.
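For intuition about those constraints: K-Means (the 'lloyd' variant) just alternates two steps — assign each row to its nearest center, then move each center to the mean of its rows. Here is a minimal NumPy sketch of that loop. It is illustrative only, not the add-in's actual implementation, and the initialization is a simplified farthest-point stand-in for 'k-means++':

```python
import numpy as np

def kmeans(X, n_clusters, max_iter=300, tol=1e-4, random_state=None):
    """Toy K-Means (Lloyd's algorithm) for illustration only."""
    rng = np.random.default_rng(random_state)
    # Simplified 'k-means++'-style init: start from one random row,
    # then repeatedly add the row farthest from all chosen centers.
    centers = [X[rng.integers(len(X))]]
    for _ in range(n_clusters - 1):
        dists = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[dists.argmax()])
    centers = np.array(centers)
    for _ in range(max_iter):
        # Assignment step: each row joins its nearest center (Euclidean).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned rows
        # (this toy version does not handle clusters that become empty).
        new_centers = np.array([X[labels == k].mean(axis=0)
                                for k in range(n_clusters)])
        if np.linalg.norm(new_centers - centers) < tol:  # converged
            return labels, new_centers
        centers = new_centers
    return labels, centers
```

Because both steps rely only on Euclidean distance to a single mean per cluster, the method naturally favors roughly spherical, similarly sized clusters — which is exactly the limitation described above.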

Examples

Build a K-Means model with 3 clusters, fit it on the unlabeled data in A2:E100, and read the cluster label predicted for each row (formulas entered in H1, H2, and H3, respectively):

=ML.CLUSTERING.KMEANS(3)
=ML.FIT(H1, A2:E100)
=ML.PREDICT(H2, A2:E100)

Pre-scale the features so distances aren't dominated by the largest column (formulas entered in H1 through H4; H2# is the spilled scaled data):

=ML.PREPROCESSING.STANDARD_SCALER()
=ML.FIT_TRANSFORM(H1, A2:E100)
=ML.CLUSTERING.KMEANS(3)
=ML.FIT(H3, H2#)
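To see why this scaling step matters, here is a small NumPy sketch with hypothetical numbers: before standardization, the large-magnitude second column dominates the Euclidean distance K-Means uses, and can even change which rows look "nearest" to each other.

```python
import numpy as np

# Hypothetical data: a small feature (col 0) and a large one (col 1).
X = np.array([[1.0, 50_000.0],
              [2.0, 51_000.0],
              [9.0, 50_500.0]])

# Unscaled Euclidean distances from row 0 -- dominated by column 1:
raw_d01 = np.linalg.norm(X[0] - X[1])   # ~1000
raw_d02 = np.linalg.norm(X[0] - X[2])   # ~500, so row 2 looks nearer

# Standardize each column to mean 0 and std 1 (the idea behind
# ML.PREPROCESSING.STANDARD_SCALER):
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_d01 = np.linalg.norm(Z[0] - Z[1])
scaled_d02 = np.linalg.norm(Z[0] - Z[2])
# Once both columns count equally, row 1 is nearer to row 0 than row 2 is.
```

The nearest neighbor of row 0 flips after scaling — the kind of distortion that skews cluster shapes when features are left on very different scales.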

Pass an integer to random_state for reproducible cluster labels across runs:

=ML.CLUSTERING.KMEANS(3, "k-means++", "auto", 300, 0.0001, 42)

Remarks

  • n_clusters is the only argument you usually need to set. There is no universal "right" number — try a few values and pick the one that gives the most useful business interpretation.
  • The "elbow method" is a common way to pick n_clusters: fit K-Means for several values, plot the inertia (sum of squared distances to nearest cluster center), and look for the elbow where the curve flattens.
  • Scale your features first with ML.PREPROCESSING.STANDARD_SCALER — K-Means uses Euclidean distance, so unequal feature magnitudes will skew the cluster shapes.
  • K-Means assigns every row to exactly one cluster. If you need probabilities or soft assignments, K-Means is not the right tool.
  • For reproducible results, pass an integer to random_state.
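The elbow heuristic from the remarks above can also be sketched outside the spreadsheet. This NumPy snippet (illustrative only; it reuses a toy Lloyd's loop with a simplified farthest-point init) computes the inertia for several candidate n_clusters — the "elbow" is where adding clusters stops paying off:

```python
import numpy as np

def kmeans_inertia(X, k, max_iter=100):
    """Run a toy K-Means and return its inertia (sum of squared
    distances from each row to its cluster center)."""
    # Farthest-point init (a simplified 'k-means++'):
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its rows; keep the old center
        # if a cluster happens to be empty.
        centers = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return float((np.linalg.norm(X - centers[labels], axis=1) ** 2).sum())

# Three tight, well-separated blobs -> the elbow should appear at k=3.
rng = np.random.default_rng(42)
blobs = [rng.normal(mu, 0.1, (20, 2)) for mu in ([0, 0], [5, 0], [0, 5])]
X = np.vstack(blobs)
inertias = [kmeans_inertia(X, k) for k in (1, 2, 3, 4, 5)]
# Inertia drops sharply up to k=3, then flattens -- that bend is the elbow.
```

In the spreadsheet, the equivalent workflow is to fit several K-Means models with different n_clusters values and chart their inertias against k.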

See also