ML.PREPROCESSING.TRAIN_TEST_SPLIT

Splits your dataset into training and testing sets for machine learning.

Syntax

ML.PREPROCESSING.TRAIN_TEST_SPLIT(data, test_size, random_state, dataset_type)

Arguments

Name          Type    Description
data          object  The input data to split, typically a DataFrame containing your features and/or target variable.
test_size     float   Fraction of the dataset to include in the test split (e.g., 0.2 holds out 20% of the rows for testing).
random_state  int     Seed for the random number generator; any integer. The same seed always reproduces the same split.
dataset_type  int     Which split to return: 0 for the training set, 1 for the test set.

Returns

A DataFrame containing either the training or the test split, depending on dataset_type.

When to use

Use TRAIN_TEST_SPLIT whenever you want to estimate how a model will perform on data it has not seen. Splitting the dataset into a training portion (used to fit the model) and a test portion (used only to score it) is the basic guard against rewarding a model that has merely memorized its training data, and the first step in almost every supervised workflow.

Call this function twice — once with dataset_type=0 to get the training rows and once with dataset_type=1 to get the test rows. Use the same random_state both times so the splits are consistent.
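The two-call pattern can be sketched in plain Python. This is a hypothetical re-implementation of the split logic, not the add-in's actual code: shuffle the row indices with a fixed seed, hold out the first test_size fraction of the shuffled order, and return one half or the other depending on dataset_type.

```python
import random

def train_test_split(rows, test_size, random_state, dataset_type):
    # Hypothetical sketch of the split semantics: a seeded shuffle of
    # the row indices decides which rows land in the test set.
    idx = list(range(len(rows)))
    random.Random(random_state).shuffle(idx)
    n_test = round(len(rows) * test_size)
    test_idx = set(idx[:n_test])
    if dataset_type == 1:                  # 1 -> return the test split
        return [rows[i] for i in range(len(rows)) if i in test_idx]
    return [rows[i] for i in range(len(rows)) if i not in test_idx]

rows = list(range(10))
train = train_test_split(rows, 0.2, 42, 0)  # same seed in both calls...
test = train_test_split(rows, 0.2, 42, 1)
assert sorted(train + test) == rows         # ...so the halves partition the data
```

Because the shuffle is driven entirely by random_state, the two calls agree on which rows are held out; a different seed in either call would break the partition.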

Examples

Load a dataset, then split into 80% train / 20% test with a fixed seed:

H1: =ML.DATASETS.IRIS()
H2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(H1, 0.2, 42, 0)
H3: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(H1, 0.2, 42, 1)

H2 now holds the training set; H3 holds the test set. Fit a model on the train split and score it on the test split:

H4: =ML.CLASSIFICATION.LOGISTIC()
H5: =ML.FIT(H4, ML.DATA.SELECT_COLUMNS(H2, "feature_*"), ML.DATA.SELECT_COLUMNS(H2, "target"))
H6: =ML.EVAL.SCORE(H5, ML.DATA.SELECT_COLUMNS(H3, "feature_*"), ML.DATA.SELECT_COLUMNS(H3, "target"))
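The same fit-on-train, score-on-test loop can be sketched in plain Python with a toy nearest-centroid classifier. This is a hypothetical stand-in for the LOGISTIC model (one numeric feature, for brevity); the point is only that the model sees the training rows when fitting and the test rows only when scoring.

```python
from collections import defaultdict

def fit(train):
    # Toy nearest-centroid "model": average the feature value per class.
    sums = defaultdict(lambda: [0.0, 0])
    for x, y in train:
        sums[y][0] += x
        sums[y][1] += 1
    return {y: total / n for y, (total, n) in sums.items()}

def score(model, test):
    # Accuracy: fraction of test rows whose nearest centroid matches the label.
    hits = sum(min(model, key=lambda c: abs(model[c] - x)) == y for x, y in test)
    return hits / len(test)

train = [(0.1, "a"), (0.2, "a"), (0.9, "b"), (1.0, "b")]  # fitted on these rows
test = [(0.15, "a"), (0.95, "b")]                          # scored on these only
assert score(fit(train), test) == 1.0
```

Scoring on held-out rows is what makes the accuracy number honest: a model scored on its own training rows can look perfect while generalizing poorly.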

Remarks

  • The data argument must be a DataFrame handle. If your data lives in a cell range, pipe it through ML.DATA.CONVERT_TO_DF first; if it comes from a built-in dataset, ML.DATASETS.* already returns a DataFrame.
  • test_size is the fraction held out for testing. 0.2 (20%) is a common default; raise it to 0.3 for very small datasets where the test split needs more examples to be reliable.
  • Pass the same random_state both times so the train and test halves match. Different seeds produce different splits.
  • For class-imbalanced classification problems, plain random splitting can leave one class barely represented in the test set. For more rigorous evaluation, prefer ML.EVAL.CV_SCORE which uses stratified cross-validation.
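The stratification idea behind the last remark can be sketched in plain Python (an illustrative re-implementation, not the add-in's code): instead of shuffling all rows together, sample the test fraction within each class, so a rare class keeps its share of the test set.

```python
import random
from collections import defaultdict

def stratified_split(rows, labels, test_size, random_state):
    # Group row indices by class, then hold out `test_size` of each
    # class separately so rare classes stay represented in the test set.
    rng = random.Random(random_state)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    test_idx = set()
    for idx in by_class.values():
        rng.shuffle(idx)
        test_idx.update(idx[:round(len(idx) * test_size)])
    train = [rows[i] for i in range(len(rows)) if i not in test_idx]
    test = [rows[i] for i in range(len(rows)) if i in test_idx]
    return train, test

# 90/10 class imbalance: a plain random 50% split could easily drop the
# minority class from the test set; the stratified split cannot.
labels = ["maj"] * 18 + ["min"] * 2
rows = list(range(20))
train, test = stratified_split(rows, labels, 0.5, 0)
assert sum(labels[i] == "min" for i in test) == 1  # minority class preserved
```

Stratified cross-validation repeats this per-class balancing across every fold, which is why it gives a more stable estimate on imbalanced data than a single random split.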

See also