ML.PREPROCESSING.TRAIN_TEST_SPLIT¶
Splits your dataset into training and testing sets for machine learning.
Syntax¶
=ML.PREPROCESSING.TRAIN_TEST_SPLIT(data, test_size, random_state, dataset_type)
Arguments¶
| Name | Type | Default | Description |
|---|---|---|---|
| data | object |  | The input data to split, typically a DataFrame containing your features and/or target variable. |
| test_size | float |  | Fraction of the dataset to include in the test split (e.g., 0.2 means 20% of the data is used for testing). |
| random_state | int |  | Seed for the random number generator to ensure reproducible splits. Any integer value works. |
| dataset_type | int |  | Which split to return: 0 for the training set, 1 for the test set. |
Returns¶
A DataFrame containing either the training or the test split, depending on dataset_type.
When to use¶
Use train_test_split whenever you want to estimate how a model will perform
on data it has not seen. Splitting the dataset into a training portion (used
to fit the model) and a test portion (used only to score it) is the standard
way to detect overfitting, and the first step in almost every supervised
workflow.
Call this function twice — once with dataset_type=0 to get the training
rows and once with dataset_type=1 to get the test rows. Use the same
random_state both times so the splits are consistent.
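The two-call pattern can be sketched in plain Python. This is a minimal stdlib illustration, not the library's implementation: it assumes the splitter behaves like a seeded shuffle-and-slice, which is how such functions typically work.

```python
import random

def train_test_split(data, test_size, random_state, dataset_type):
    """Return the train (dataset_type=0) or test (dataset_type=1) rows.

    Sketch: shuffle row indices with a fixed seed, then slice off the
    first `test_size` fraction as the test set.
    """
    indices = list(range(len(data)))
    random.Random(random_state).shuffle(indices)  # seeded, so reproducible
    n_test = round(len(data) * test_size)
    test_idx = set(indices[:n_test])
    if dataset_type == 1:  # 1 -> test rows
        return [row for i, row in enumerate(data) if i in test_idx]
    return [row for i, row in enumerate(data) if i not in test_idx]

rows = [[i, i % 3] for i in range(10)]
train = train_test_split(rows, 0.2, 42, 0)  # 8 rows
test = train_test_split(rows, 0.2, 42, 1)   # 2 rows
```

Because both calls reuse `random_state=42`, the two results partition the data: no row appears in both, and together they cover every input row.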
Examples¶
Load a dataset, then split into 80% train / 20% test with a fixed seed:
In H1: =ML.DATASETS.IRIS()
In H2: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(H1, 0.2, 42, 0)
In H3: =ML.PREPROCESSING.TRAIN_TEST_SPLIT(H1, 0.2, 42, 1)
H2 now holds the training set; H3 holds the test set. Fit a model on
the train split and score it on the test split:
In H4: =ML.CLASSIFICATION.LOGISTIC()
In H5: =ML.FIT(H4, ML.DATA.SELECT_COLUMNS(H2, "feature_*"), ML.DATA.SELECT_COLUMNS(H2, "target"))
In H6: =ML.EVAL.SCORE(H5, ML.DATA.SELECT_COLUMNS(H3, "feature_*"), ML.DATA.SELECT_COLUMNS(H3, "target"))
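For readers more familiar with Python, the same split / fit / score workflow looks like this in scikit-learn. This assumes the spreadsheet functions mirror scikit-learn semantics, which the docs do not state outright; treat it as an equivalent sketch rather than the underlying implementation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the iris features and labels (as ML.DATASETS.IRIS does in-sheet).
X, y = load_iris(return_X_y=True)

# 80% train / 20% test with a fixed seed, mirroring test_size=0.2, random_state=42.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training split only, then score on the held-out test split.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

Note that in both versions the model never sees the test rows during fitting; the score is therefore an estimate of out-of-sample performance.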
Remarks¶
- The `data` argument must be a DataFrame handle. If your data lives in a cell range, pipe it through `ML.DATA.CONVERT_TO_DF` first; if it comes from a built-in dataset, `ML.DATASETS.*` already returns a DataFrame.
- `test_size` is the fraction held out for testing. `0.2` (20%) is a common default; raise it to `0.3` for very small datasets where the test split needs more examples to be reliable.
- Pass the same `random_state` both times so the train and test halves match. Different seeds produce different splits.
- For class-imbalanced classification problems, plain random splitting can leave one class barely represented in the test set. For more rigorous evaluation, prefer `ML.EVAL.CV_SCORE`, which uses stratified cross-validation.
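The stratification mentioned in the last remark can be sketched with stdlib Python: split each class separately so the test set preserves the class ratios. This is illustrative code under that assumption, not the library's implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, random_state):
    """Return (train_indices, test_indices) such that each class
    contributes ~test_size of its rows to the test set, preserving
    class proportions in both halves."""
    rng = random.Random(random_state)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for label, idx in by_class.items():
        rng.shuffle(idx)                       # shuffle within each class
        n_test = round(len(idx) * test_size)   # per-class test quota
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# 90 rows of class "a" and 10 of class "b": a plain random 20% split
# could easily draw 0 or 1 "b" rows, but the stratified split always
# draws exactly round(10 * 0.2) = 2 of them.
labels = ["a"] * 90 + ["b"] * 10
train_idx, test_idx = stratified_split(labels, 0.2, 42)
```

This per-class quota is why stratified evaluation is the safer choice when one class is rare.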