ML.PREPROCESSING.ORDINAL_ENCODER
Converts categorical variables into numerical values by assigning each unique category a unique integer.
Syntax
Arguments
| Name | Type | Default | Description |
|---|---|---|---|
| handle_unknown | Any | "error" | How to handle unknown categories during transform. 'error' will raise an error, 'ignore' will ignore unknown categories. |
Returns
An OrdinalEncoder transformer handle, ready to pass into ML.FIT_TRANSFORM or ML.PIPELINE.
When to use
Reach for ordinal_encoder when a column has ordered categorical values
— "low" < "medium" < "high", "S" < "M" < "L" < "XL", satisfaction levels
on a Likert scale — and you want to keep that ordering as a single integer
column. Or use it as a compact encoding for tree-based models, which can
split on integer-coded categories without needing one-hot expansion.
Compared to the alternative in this namespace:
- Use ML.PREPROCESSING.ONE_HOT_ENCODER for unordered categories, or before linear models / SVMs / neural networks where assigning arbitrary integers would imply a misleading order.
- Use ordinal_encoder when the order is meaningful, or when feeding a tree-based model (RANDOM_FOREST_CLF / RANDOM_FOREST_REG) that handles integer-coded splits natively.
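The difference between the two encodings can be sketched in plain Python. This is only an illustration of the idea, not the add-in's implementation: the helpers `ordinal_encode` and `one_hot_encode` are hypothetical, and the explicit `size_order` list is an assumption made here so the codes follow the intended order.

```python
def ordinal_encode(values, order):
    """Map each value to its position in an explicit category order."""
    codes = {category: index for index, category in enumerate(order)}
    return [codes[value] for value in values]


def one_hot_encode(values, categories):
    """Map each value to a 0/1 indicator row, one column per category."""
    return [[1 if value == category else 0 for category in categories]
            for value in values]


sizes = ["S", "XL", "M", "L"]
size_order = ["S", "M", "L", "XL"]  # assumed order, for illustration only

# One integer column that keeps S < M < L < XL.
ordinal_codes = ordinal_encode(sizes, size_order)

# Four indicator columns with no implied order between categories.
one_hot_rows = one_hot_encode(sizes, size_order)

print(ordinal_codes)
print(one_hot_rows)
```

The ordinal form stays a single column however many categories there are, while the one-hot form grows by one column per category.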
Examples
Convert the categorical column in A2:A100 to integer codes:
Encode a categorical feature column inside a tree-based regression pipeline:
=ML.PREPROCESSING.ORDINAL_ENCODER()
=ML.COMPOSE.DATA_TRANSFORMER(H1, "size")
=ML.COMPOSE.COLUMN_TRANSFORMER(H2)
=ML.REGRESSION.RANDOM_FOREST_REG()
=ML.PIPELINE(H3, H4)
=ML.FIT(H5, A2:E100, F2:F100)
Use handle_unknown="ignore" so categories appearing only at predict time
don't raise an error:
Remarks
- The integer codes assigned by ordinal_encoder are based on the sorted alphabetical order of the categories, which may not match the business-meaningful order. If the order matters and is non-alphabetical, remap your column to labels that sort in the intended order (for example "1_low", "2_medium", "3_high") before encoding.
- Linear models, distance-based models (KMeans, KNN), and SVMs treat the ordinal codes as numeric distances: assigning "low"=0, "medium"=1, "high"=2 implies that "low" and "high" are twice as far apart as "low" and "medium", which may or may not be true. Use ML.PREPROCESSING.ONE_HOT_ENCODER if that assumption breaks.
- Always fit on the training data only, then ML.TRANSFORM the test data with the same fitted encoder.
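These remarks can be illustrated with a short plain-Python sketch. It assumes, as described above, that codes follow the sorted order of the categories seen while fitting. The helpers `fit_codes` and `transform` and the prefixed labels in `relabel` are hypothetical and are not part of the add-in.

```python
def fit_codes(training_values):
    """Learn integer codes from the sorted unique training categories."""
    return {category: code
            for code, category in enumerate(sorted(set(training_values)))}


def transform(values, codes):
    """Apply already-fitted codes; raises KeyError on an unseen category."""
    return [codes[value] for value in values]


train = ["low", "high", "medium", "low"]
test = ["medium", "high"]

# Alphabetical sorting puts "high" first, so the codes do not follow
# the intended low < medium < high order.
alphabetical = fit_codes(train)

# Remapping to labels that sort in the intended order fixes the codes.
relabel = {"low": "1_low", "medium": "2_medium", "high": "3_high"}
remapped = fit_codes([relabel[value] for value in train])

# Fit on the training data only, then reuse the same codes for the test data.
test_codes = transform([relabel[value] for value in test], remapped)

# Numeric models read the codes as distances between categories.
gap_low_high = remapped["3_high"] - remapped["1_low"]
gap_low_medium = remapped["2_medium"] - remapped["1_low"]

print(alphabetical)
print(remapped)
print(test_codes)
```

Reusing `remapped` for the test rows, rather than fitting again, is what keeps each category's code identical across both datasets.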