ML.PREPROCESSING.ORDINAL_ENCODER

Converts categorical variables into numerical values by assigning each unique category a unique integer.

Syntax

ML.PREPROCESSING.ORDINAL_ENCODER(handle_unknown)

Arguments

Name | Type | Default | Description
handle_unknown | Any | "error" | How to handle unknown categories during transform. "error" will raise an error, "ignore" will ignore unknown categories.

Returns

An OrdinalEncoder transformer handle, ready to pass into ML.FIT_TRANSFORM or ML.PIPELINE.

When to use

Reach for ordinal_encoder when a column has ordered categorical values, such as "low" < "medium" < "high", "S" < "M" < "L" < "XL", or satisfaction levels on a Likert scale, and you want to keep that ordering as a single integer column. Codes follow alphabetical order (see Remarks), so remap the labels first if their meaningful order is not alphabetical. It also works as a compact encoding for tree-based models, which can split on integer-coded categories without needing one-hot expansion.

Compared to the alternative in this namespace:

  • Use ML.PREPROCESSING.ONE_HOT_ENCODER for unordered categories, or before linear models / SVMs / neural networks where assigning arbitrary integers would imply a misleading order.
  • Use ordinal_encoder when the order is meaningful, or when feeding a tree-based model (RANDOM_FOREST_CLF / RANDOM_FOREST_REG) that handles integer-coded splits natively.
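
To make the trade-off concrete, here is a minimal Python sketch of the two encodings on a small column. It only illustrates the general idea and is not the add-in's implementation; the helper names ordinal_encode and one_hot_encode are hypothetical.

```python
def ordinal_encode(values):
    """Map each distinct value to one integer code, using sorted order."""
    categories = sorted(set(values))
    code_of = {cat: i for i, cat in enumerate(categories)}
    return [code_of[v] for v in values], categories


def one_hot_encode(values):
    """Map each value to a row of 0/1 indicators, one column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return rows, categories


colors = ["red", "green", "blue", "green"]

ordinal_codes, ordinal_cats = ordinal_encode(colors)
one_hot_rows, one_hot_cats = one_hot_encode(colors)

print(ordinal_cats)   # ['blue', 'green', 'red']
print(ordinal_codes)  # [2, 1, 0, 1]  -> one column, implies blue < green < red
print(one_hot_rows)   # one column per category, no implied order
```

The ordinal result stays one column wide regardless of how many categories exist, while the one-hot result grows by one column per category but carries no ordering.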

Examples

Convert the categorical column in A2:A100 to integer codes. The first formula is assumed to sit in H1, so the second can reference the encoder handle there:

=ML.PREPROCESSING.ORDINAL_ENCODER()
=ML.FIT_TRANSFORM(H1, A2:A100)

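Encode a categorical feature column inside a tree-based regression pipeline. The formulas are assumed to be entered in H1 through H6 in order, so each step can reference the handle produced above it: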

=ML.PREPROCESSING.ORDINAL_ENCODER()
=ML.COMPOSE.DATA_TRANSFORMER(H1, "size")
=ML.COMPOSE.COLUMN_TRANSFORMER(H2)
=ML.REGRESSION.RANDOM_FOREST_REG()
=ML.PIPELINE(H3, H4)
=ML.FIT(H5, A2:E100, F2:F100)

Use handle_unknown="ignore" so categories appearing only at predict time don't raise an error:

=ML.PREPROCESSING.ORDINAL_ENCODER("ignore")
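
The default "error" mode can be pictured with the Python sketch below: a mapping is learned from training values, and any value outside it stops the transform. This is a conceptual illustration only; fit_codes and transform_strict are hypothetical names, and the sketch does not model "ignore" because this page does not say what code an ignored category receives.

```python
def fit_codes(train_values):
    """Learn a category -> integer mapping from the training values."""
    return {cat: i for i, cat in enumerate(sorted(set(train_values)))}


def transform_strict(code_of, values):
    """Encode values, raising if any category was not seen during fit."""
    unknown = sorted({v for v in values if v not in code_of})
    if unknown:
        raise ValueError(f"unknown categories: {unknown}")
    return [code_of[v] for v in values]


train = ["S", "M", "L", "M"]
code_of = fit_codes(train)  # sorted order gives L=0, M=1, S=2

seen = transform_strict(code_of, ["M", "S"])
print(seen)  # [1, 2]

try:
    transform_strict(code_of, ["M", "XL"])  # "XL" never appeared in train
    raised = False
except ValueError as exc:
    raised = True
    print(exc)
```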

Remarks

  • The integer codes assigned by ordinal_encoder follow the alphabetical order of the categories, which may not match the business-meaningful order. If the order matters and is non-alphabetical, remap the values before encoding so that they sort as intended (for example "1_low", "2_medium", "3_high"); sorting the rows alone does not change which code each category receives.
  • Linear models, distance-based models (KMeans, KNN), and SVMs treat the ordinal codes as numeric distances — assigning "low"=0, "medium"=1, "high"=2 implies that "low" and "high" are twice as far apart as "low" and "medium", which may or may not be true. Use ML.PREPROCESSING.ONE_HOT_ENCODER if that assumption breaks.
  • Always fit the encoder on the training data only, then apply ML.TRANSFORM to the test data with the same fitted encoder, so both sets share one category-to-code mapping.
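
The alphabetical-order remark can be checked with a short Python sketch. It assumes, as stated above, that codes are assigned by sorting the category labels; the variable names and the numeric-prefix remapping are illustrative choices, not part of the add-in.

```python
levels = ["low", "high", "medium", "low"]

# Codes assigned by alphabetical order: high=0, low=1, medium=2.
alphabetical = {cat: i for i, cat in enumerate(sorted(set(levels)))}
alphabetical_codes = [alphabetical[v] for v in levels]
print(alphabetical_codes)  # [1, 0, 2, 1]  -> "high" sorts below "low"

# Codes that follow the meaningful order instead.
intended_order = ["low", "medium", "high"]
explicit = {cat: i for i, cat in enumerate(intended_order)}
explicit_codes = [explicit[v] for v in levels]
print(explicit_codes)  # [0, 2, 1, 0]

# One way to remap before encoding: prefix each label with its rank so that
# alphabetical sorting reproduces the intended order.
prefixed = [f"{explicit[v] + 1}_{v}" for v in levels]
print(sorted(set(prefixed)))  # ['1_low', '2_medium', '3_high']
```

A single-digit prefix only sorts correctly for up to nine levels; with more, zero-pad the rank (for example "01_", "02_") so string sorting still matches the intended order.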

See also