Feature-Preprocessing Leakage During Data Preparation
Are we allowed to transform the input data in supervised learning any way we want?
Assume we have a supervised learning problem, and we would like to preprocess one of the features using a separate supervised model that is itself trained on the label.
Problem example: we would like to predict a 0-1 label from 10 features, 9 numerical and 1 textual, using a decision tree (call it model A). We would like to transform the textual feature with a separate supervised model that takes a textual input and predicts a scalar (call it model B), so that the textual feature becomes a numerical one. That gives us 10 numerical features, and we then train the decision tree to predict the 0-1 label.
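As a concrete illustration, here is a minimal sketch of such a pipeline in scikit-learn. The function name, the TF-IDF + logistic-regression choice for model B, and the use of the predicted probability as the derived feature are illustrative assumptions, not part of the original setup:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

def fit_two_model_pipeline(df: pd.DataFrame, label_col: str, text_col: str, numeric_cols: list):
    # Model B: maps the textual feature to a scalar, trained on the same 0-1 label.
    model_b = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model_b.fit(df[text_col], df[label_col])

    # Replace the textual feature with model B's predicted probability (a scalar),
    # giving 10 numerical features in total.
    text_score = model_b.predict_proba(df[text_col])[:, 1]
    X = np.column_stack([df[numeric_cols].to_numpy(), text_score])

    # Model A: a decision tree over the 9 numeric features plus the derived score.
    model_a = DecisionTreeClassifier()
    model_a.fit(X, df[label_col])
    return model_b, model_a
```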
The question is whether this process is legit. And if it is, what restrictions, if any, are needed to keep it legit?
To make it more specific: can we first train model B any way we want, then transform the feature and train model A? Can we do one random train/test split when training model B and a different random train/test split when training model A? Or must the split be the same for model B and model A, to prevent data/feature leakage? If the splits must match, that can be complicated in real-life scenarios, where the same train/test split procedure has to be enforced across every ML team in the organization that is involved in the project.
If we claim that the split must be identical, we should back the claim with a counterexample, showing what goes wrong when the split is not identical.
Let’s make the problem even more compact: one textual input feature and one 0-1 label. Assume the textual feature is a random string per example, so it carries zero information about the label. Now train model B as a heavily overfitted model (complex, high VC dimension): training accuracy is 100% and test accuracy is 50%. This happens when the model memorizes which random strings correspond to 0-labels and which correspond to 1-labels. As a result, transforming the textual feature with model B converts the training-set features into the labels themselves.
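Here is a toy sketch of such a memorizing model B, assuming a plain Python lookup table as the "high-capacity" model and random lowercase strings as the textual feature (both are illustrative choices):

```python
import random
import string

class MemorizingModelB:
    """A maximally overfitted 'model B': a pure lookup table over the training texts."""
    def fit(self, texts, labels):
        self.table = dict(zip(texts, labels))
        return self

    def predict(self, texts):
        # Memorized strings return their training label; unseen strings are a coin flip.
        return [self.table.get(t, random.randint(0, 1)) for t in texts]

def random_text(k=12):
    # A random string per example: zero information about the label.
    return "".join(random.choices(string.ascii_lowercase, k=k))

train_texts = [random_text() for _ in range(1000)]
train_labels = [random.randint(0, 1) for _ in range(1000)]
test_texts = [random_text() for _ in range(1000)]
test_labels = [random.randint(0, 1) for _ in range(1000)]

model_b = MemorizingModelB().fit(train_texts, train_labels)
train_acc = sum(p == y for p, y in zip(model_b.predict(train_texts), train_labels)) / 1000
test_acc = sum(p == y for p, y in zip(model_b.predict(test_texts), test_labels)) / 1000
print(train_acc, test_acc)  # 1.0 on the training set, ~0.5 on fresh random strings
```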
Now, suppose that when training model A we do a new random train/test split, and as a result 50% of model A's test set consists of examples that were in model B's training set. For that half of the test set, the transformed feature is identical to the label. That means the test accuracy of model A will be at least 50% (and around 75% in expectation, since the non-leaked half is still guessed correctly about half the time), even though the features carry zero information. If the new random split happened to put 75% of model B's training examples into model A's test set, we would get at least 75% test accuracy, again with features that carry zero information. I call this “Preprocessing Leakage”. Had we kept the train/test split identical across model B and model A, the problem would have been avoided.
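A small simulation of this counterexample, under the assumptions above (a dictionary-based memorizer for model B, and a model A that simply copies the single transformed feature, which is what a decision tree would learn on this data):

```python
import random
random.seed(0)

N = 2000

def random_text(k=12):
    return "".join(random.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(k))

texts = [random_text() for _ in range(N)]
labels = [random.randint(0, 1) for _ in range(N)]

# Split 1: model B trains on (memorizes) a random half of the examples.
b_train = set(random.sample(range(N), N // 2))
table = {texts[i]: labels[i] for i in b_train}

# Transform the textual feature with model B for *all* examples.
feature = [table.get(texts[i], random.randint(0, 1)) for i in range(N)]

# Split 2: a *new*, independent random split for model A. Model A's learned rule
# here is simply "predict the transformed feature", so we evaluate that directly.
a_test = random.sample(range(N), N // 2)
leaked = sum(i in b_train for i in a_test) / len(a_test)
acc = sum(feature[i] == labels[i] for i in a_test) / len(a_test)
print(leaked, acc)  # ~0.5 of the test rows are leaked; accuracy ~0.75 on pure noise
```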
What happens if, instead of 1 textual feature, we have 100 textual features (again, completely random) and we train 100 sub-models like model B, each with its own random split? Then, when we randomly choose model A's validation set, each example in it is very likely to have been memorized by a substantial fraction of the sub-models, whose outputs for that example are the label itself. Say the model A classifier uses a majority vote over the 100 transformed features. Then model A's validation accuracy can get very close to 100%, even though the textual features are completely random strings!
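Extending the same simulation to 100 random textual features and 100 independently split memorizing sub-models (the counts and the seed are arbitrary choices):

```python
import random
random.seed(1)

N, K = 2000, 100
labels = [random.randint(0, 1) for _ in range(N)]

def random_text(k=12):
    return "".join(random.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(k))

# Each of the K random textual features gets its own memorizing sub-model,
# trained on an independently chosen random half of the examples.
transformed = []  # transformed[k][i]: sub-model k's output for example i
for _ in range(K):
    texts = [random_text() for _ in range(N)]
    sub_train = set(random.sample(range(N), N // 2))
    table = {texts[i]: labels[i] for i in sub_train}
    transformed.append([table.get(texts[i], random.randint(0, 1)) for i in range(N)])

# Model A: a majority vote over the K transformed features, evaluated on yet
# another independent random split.
a_test = random.sample(range(N), N // 2)
correct = 0
for i in a_test:
    vote = sum(transformed[k][i] for k in range(K))
    correct += int((vote > K // 2) == labels[i])
print(correct / len(a_test))  # close to 1.0, even though every feature is pure noise
```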
Mitigation: the way to prevent this preprocessing leakage is to avoid a random train/test split and instead split deterministically, using a stable hash function over the examples. For example, split by the hash of the user id, account id, etc., so that all sub-models end up with the same train/test split. This also means having full control over the train/test split, and not relying on different third-party libraries that each split the data in their own way. The train/test split is important, and we need to keep it under control.
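A minimal sketch of such a deterministic, hash-based split; the 80/20 threshold, the choice of SHA-256, and the user-id key are illustrative assumptions:

```python
import hashlib

def split_bucket(key: str, train_fraction: float = 0.8) -> str:
    """Deterministic train/test assignment derived from a stable hash of the split key."""
    # Use a stable hash; Python's built-in hash() is salted per process and is NOT stable.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "train" if bucket < int(train_fraction * 100) else "test"

# Every team, pipeline, and sub-model that splits by the same key gets the same answer.
print(split_bucket("user_12345"))
print(split_bucket("user_12345"))  # identical, on any machine, in any run
```

Because the assignment depends only on the key, any sub-model or team that splits by the same key reproduces exactly the same train/test partition, which is what removes the leakage described above.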