ML.PREPROCESSING.ONE_HOT_ENCODER

Encodes categorical variables as one-hot (0/1) indicator columns, so that machine learning algorithms that expect numeric inputs can use them.

Syntax

ML.PREPROCESSING.ONE_HOT_ENCODER(handle_unknown)

Arguments

Name | Type | Default | Description
handle_unknown | Any | "error" | How to handle categories seen at transform time that were not present during fitting. "error" raises an error; "ignore" encodes them as all zeros.

Returns

A OneHotEncoder transformer handle, ready to pass into ML.FIT_TRANSFORM or ML.PIPELINE.

When to use

Reach for one_hot_encoder when a column holds unordered categories — country, color, product type, well name — and you need numeric inputs for a model. Each unique category becomes its own 0/1 column, so the model can treat the categories independently without imposing any ordering between them.
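
As a rough illustration of what "each unique category becomes its own 0/1 column" means, here is a minimal plain-Python sketch. It is not the add-in's implementation; the helper name one_hot and the choice to sort the categories alphabetically are assumptions made for the example.

```python
def one_hot(values):
    """Return (categories, rows); each row has exactly one 1.

    Hypothetical helper for illustration only. Categories are sorted
    alphabetically here, which is an assumption about column order.
    """
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)  # one slot per distinct category
        row[index[value]] = 1        # mark the slot for this value
        rows.append(row)
    return categories, rows


categories, rows = one_hot(["red", "green", "blue", "green"])
for value, row in zip(["red", "green", "blue", "green"], rows):
    print(value, row)
```

No column is "larger" than another in this representation, which is why it suits categories without a natural order.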

Compared to the alternative in this namespace:

  • Use one_hot_encoder for unordered categories, especially when you have only a handful of distinct values per column.
  • Use ML.PREPROCESSING.ORDINAL_ENCODER when the categories do have a natural order (e.g. "low" < "medium" < "high") — assigning integer ranks then makes sense.
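
For contrast, ordinal encoding can be sketched as mapping each category to its position in an explicitly ordered list. This is a plain-Python illustration, not the add-in's code; the helper ordinal_encode and the level names are hypothetical.

```python
def ordinal_encode(values, order):
    """Map each value to its rank in `order` (hypothetical helper)."""
    rank = {cat: i for i, cat in enumerate(order)}
    return [rank[value] for value in values]


# The ordering is supplied by the caller because it carries meaning.
levels = ["low", "medium", "high"]
encoded = ordinal_encode(["medium", "low", "high", "low"], levels)
print(encoded)
```

The result is a single numeric column whose values can be compared, which only makes sense when "low" really is less than "high".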

Examples

One-hot encode a single categorical column in A2:A100, assuming the encoder formula is entered in H1 so that the second formula can reference it:

=ML.PREPROCESSING.ONE_HOT_ENCODER()
=ML.FIT_TRANSFORM(H1, A2:A100)

Apply one-hot encoding only to the categorical columns and pass the numeric columns through unchanged using ML.COMPOSE.COLUMN_TRANSFORMER (the formulas are assumed to be entered in H1, H2, and H3, in that order):

=ML.PREPROCESSING.ONE_HOT_ENCODER()
=ML.COMPOSE.DATA_TRANSFORMER(H1, "category_col")
=ML.COMPOSE.COLUMN_TRANSFORMER(H2)
=ML.FIT_TRANSFORM(H3, A2:E100)

Use handle_unknown="ignore" so categories appearing only at predict time don't raise an error:

=ML.PREPROCESSING.ONE_HOT_ENCODER("ignore")
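
The difference between the two settings can be sketched in plain Python. This is a conceptual illustration under stated assumptions, not the add-in's implementation: fit_categories and transform are hypothetical helpers, and the sorted column order is an assumption.

```python
def fit_categories(values):
    """Learn the category list from training values (sorted for a stable order)."""
    return sorted(set(values))


def transform(values, categories, handle_unknown="error"):
    """One-hot encode values against a previously learned category list."""
    if handle_unknown not in ("error", "ignore"):
        raise ValueError("handle_unknown must be 'error' or 'ignore'")
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        if value in index:
            row[index[value]] = 1
        elif handle_unknown == "error":
            raise ValueError(f"unknown category: {value!r}")
        # With "ignore", an unseen value keeps its all-zeros row.
        rows.append(row)
    return rows


train = ["red", "green", "red"]
test = ["green", "blue"]  # "blue" never appeared in the training data

categories = fit_categories(train)
ignored = transform(test, categories, handle_unknown="ignore")
print(categories, ignored)
```

Calling transform(test, categories) with the default setting raises a ValueError for "blue" instead. Because the category list comes from the training values only, the encoded test rows have the same columns in the same order as the training rows.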

Remarks

  • One-hot encoding replaces each categorical column with one new column per distinct category, so the output width is the total number of distinct categories across the encoded columns. For a column with hundreds of distinct values, the result is a very wide and very sparse matrix; consider ML.PREPROCESSING.ORDINAL_ENCODER or merging rare categories first.
  • When handle_unknown="error" (default), an unseen category at predict time raises a clear error. Switch to "ignore" to encode unseen categories as all-zeros rows instead.
  • Always fit on the training data only, then ML.TRANSFORM the test data with the same fitted encoder so the column ordering matches.
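
Merging rare categories before encoding, as suggested in the first remark, might look like the following plain-Python sketch. The helper merge_rare, the "other" label, and the min_count threshold are all hypothetical choices for illustration; the add-in itself may not offer such a step.

```python
from collections import Counter


def merge_rare(values, min_count, other="other"):
    """Replace categories seen fewer than min_count times with one label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


wells = ["A", "A", "A", "B", "B", "C", "D"]
merged = merge_rare(wells, min_count=2)

# Each distinct value becomes one output column after one-hot encoding.
width_before = len(set(wells))
width_after = len(set(merged))
print(merged, width_before, width_after)
```

Fewer distinct values means fewer one-hot columns, at the cost of no longer distinguishing the merged categories from one another.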

See also