Label Shift and Domain Adaptation in Machine Learning
TL;DR: If you want the best accuracy on the target domain, you have to match the class frequencies in the training set to those of the target. You cannot affect the ROC AUC or the PR AUC, but you can affect the accuracy. If you don’t know the target distribution at training time, you can estimate it at test time using only the features, compute a weight correction for every training-example class, and retrain the model so that it matches the newly detected distribution (see this paper). Sometimes you can know the future target distribution in advance. For example, if you build a dice-image classifier, you can expect the rolled dice to follow a uniform distribution, so you can balance the classes at training time.
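To make the weight-correction idea concrete, here is a minimal sketch of the estimation step: given a classifier trained on the source, a labeled validation split from the source, and only the features of the target data, it estimates the per-class weights w[j] = q(y=j)/p(y=j). It follows the confusion-matrix ("black box") estimation idea; the function name and arguments below are illustrative, not part of the code used later in this post.
import numpy as np

def estimate_label_shift_weights(model, X_val, y_val, X_target, n_classes=2):
    """Estimate w[j] = q(y=j) / p(y=j) from labeled SOURCE validation data
    and the FEATURES of the target data only."""
    # Joint confusion matrix on held-out source data: C[i, j] = P(y_hat = i, y = j)
    y_val_pred = model.predict(X_val)
    C = np.zeros((n_classes, n_classes))
    for i in range(n_classes):
        for j in range(n_classes):
            C[i, j] = np.mean((y_val_pred == i) & (y_val == j))
    # Distribution of the model's hard predictions on the unlabeled target data
    mu_target = np.bincount(model.predict(X_target), minlength=n_classes) / len(X_target)
    # Under label shift, mu_target = C @ w, so solve the linear system for w
    w = np.linalg.solve(C, mu_target)
    return np.clip(w, 0.0, None)  # the weights are frequency ratios, so keep them non-negative
The resulting weights can then be fed back into training, for example as per-example sample weights; an end-to-end sketch appears at the end of this post.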
What is a Label Shift?
Label shift is when the class/label/target distribution at deployment time (test time) is different from the one you had at training time. For example, you train a cat/dog classifier using 1000 images of dogs and 4000 images of cats, because that’s the distribution of pets people have at home in France. However, when you deploy the model in Germany, where 50% of the people have cats and 50% have dogs, that’s a label shift. In label shift, the target distribution is different, but the manifestation of targets as features remains the same. That means dogs in Germany look the same as dogs in France: you only have more dogs, but they are the same kind of dogs. If the dogs in Germany look different from the dogs in France, that’s a different phenomenon, not a label shift. More formally, if the source distribution is $p$ and the target is $q$, the priors differ, $p(y) \neq q(y)$, but the feature manifestation remains the same: $p(\boldsymbol{x}|y)=q(\boldsymbol{x}|y)$.
Live example
Let’s see what happens under label shift: we train a classifier on a source distribution of items, then check its performance both when the distribution does not change and when the distribution of true labels changes.
Let’s say a basketball player’s average height is 180 cm with a stdev of 10 cm, and a football player’s average height is 170 cm with a stdev of 10 cm. Assume this is true globally (in every country).
We collect a dataset of players in France, and our dataset contains 70% basketball players and 30% football players. We will train the model and check its performance in France. Then we will check the performance when we deploy the model in Germany, where there are 50% basketball players and 50% football players.
Let’s write some code. First, a few helper functions.
basketball_height, basket_std = 180, 10
football_height, football_std = 170, 10
dataset_length = 20000
test_set_portion = 0.4

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, classification_report,
                             roc_auc_score, average_precision_score,
                             confusion_matrix)
np.random.seed(44)
def generate_dataset(dataset_length, probability_of_0):
    y = np.random.choice([0, 1], size=dataset_length, p=[probability_of_0, 1.0 - probability_of_0])
    num_ones = np.sum(y)
    # print(f"#[0] == {len(y)-num_ones}, #[1] == {num_ones}")
    x = np.empty(dataset_length, dtype=float)
    for i in range(dataset_length):
        if y[i] == 0:
            x[i] = np.random.normal(loc=basketball_height, scale=basket_std, size=1)[0]
        else:
            x[i] = np.random.normal(loc=football_height, scale=football_std, size=1)[0]
    X = x.reshape(-1, 1)  # make X a matrix and not a vector
    return X, y
def print_metrics(y_test, y_proba, threshold):
    y_pred = (y_proba >= threshold).astype(int)
    print(f"\nConfusion Matrix - each row true class (percentage), at threshold of {threshold}:")
    cm = confusion_matrix(y_test, y_pred)
    cm_percentage = (cm / cm.sum()) * 100  # Normalize by the total number of samples
    print(np.round(cm_percentage, 2))  # Print with two decimal places
    # print(f"Model accuracy: {accuracy_score(y_test, y_pred):.2f}")

    # Print classification report
    print(f"\nClassification Report at threshold {threshold}:")
    print(classification_report(y_test, y_pred))

    # Calculate and print ROC AUC
    if len(set(y_test)) == 2:  # Ensure it's binary classification
        roc_auc = roc_auc_score(y_test, y_proba)
        print(f"\nROC AUC: {roc_auc:.2f}")
        # Calculate and print PR AUC
        pr_auc = average_precision_score(y_test, y_proba)
        print(f"PR AUC (Precision-Recall AUC): {pr_auc:.2f}")
    else:
        print("\nROC AUC: Not applicable for multi-class classification")
def test_on_country_with_this_class_0_prob(model, probability_of_0, threshold, train=False):
    X, y = generate_dataset(dataset_length, probability_of_0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_set_portion)
    if train:
        model = LogisticRegression()
        model.fit(X_train, y_train)
    y_proba = model.predict_proba(X_test)[:, 1]
    print_metrics(y_test, y_proba, threshold)
    if train:
        return model
def train_test_on_source_then_test_on_two_more_countries(source_prob, target_probabilities, threshold):
    print(f"\n**** country we train+test at class 0 ratio of {source_prob}")
    model = test_on_country_with_this_class_0_prob(model=None, probability_of_0=source_prob, threshold=threshold, train=True)
    for target_prob in target_probabilities:
        print(f"\n**** test in a country with class 0 ratio of {target_prob}")
        test_on_country_with_this_class_0_prob(model, target_prob, threshold)
Now train + test on the SOURCE country, where the frequency of class 0 in the population is 0.7, and test the same model on countries with class-0 ratios of 0.9 and 0.5:
train_test_on_source_then_test_on_two_more_countries(source_prob=0.7, target_probabilities=[0.9, 0.5], threshold=0.5)
And the results:
**** country we train+test at class 0 ratio of 0.7
Confusion Matrix - each row true class (percentage), at threshold of 0.5:
[[63.3 6.42]
[19.31 10.96]]
Classification Report at threshold 0.5:
precision recall f1-score support
0 0.77 0.91 0.83 5578
1 0.63 0.36 0.46 2422
accuracy 0.74 8000
macro avg 0.70 0.63 0.65 8000
weighted avg 0.73 0.74 0.72 8000
ROC AUC: 0.76
PR AUC (Precision-Recall AUC): 0.58
**** test in a country with class 0 ratio of 0.9
Confusion Matrix - each row true class (percentage), at threshold of 0.5:
[[81.7 8.15]
[ 6.68 3.48]]
Classification Report at threshold 0.5:
precision recall f1-score support
0 0.92 0.91 0.92 7188
1 0.30 0.34 0.32 812
accuracy 0.85 8000
macro avg 0.61 0.63 0.62 8000
weighted avg 0.86 0.85 0.86 8000
ROC AUC: 0.76
PR AUC (Precision-Recall AUC): 0.30
**** test in a country with class 0 ratio of 0.5
Confusion Matrix - each row true class (percentage), at threshold of 0.5:
[[45.5 4.75]
[31.15 18.6 ]]
Classification Report at threshold 0.5:
precision recall f1-score support
0 0.59 0.91 0.72 4020
1 0.80 0.37 0.51 3980
accuracy 0.64 8000
macro avg 0.70 0.64 0.61 8000
weighted avg 0.69 0.64 0.61 8000
ROC AUC: 0.76
PR AUC (Precision-Recall AUC): 0.75
In the country with a class-0 (basketball) ratio of 0.9, the accuracy increased (74%->85%), the ROC AUC remained the same, but the PR AUC decreased.
In the target country where the classes are balanced (0.5), the accuracy is lower than in the first country (74%->64%), but the ROC AUC is the same. The ROC AUC will remain the same no matter which threshold we choose and no matter which class-0 probability the target has. The PR AUC, on the other hand, increased.
Why does accuracy change under label shift? When we train a classifier, it learns not only the relation between the features and the label, but also the class proportions of the training distribution, and those proportions are baked into its predicted probabilities.
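To see why, recall the posterior the classifier is trying to approximate:
$p(y=1|\boldsymbol{x}) = \dfrac{p(\boldsymbol{x}|y=1)\,p(y=1)}{p(\boldsymbol{x}|y=1)\,p(y=1) + p(\boldsymbol{x}|y=0)\,p(y=0)}$
Under label shift the class-conditional terms $p(\boldsymbol{x}|y)$ do not change, but the prior $p(y)$ does. A model fitted on the source data has the source prior baked into its predicted probabilities, so thresholding them at 0.5 is no longer the best decision rule once the prior changes. That is what we fix next by adapting to the target proportions.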
Now let’s try to adapt the model to the 0.9 target country:
train_test_on_source_then_test_on_two_more_countries(source_prob=0.9, target_probabilities=[0.9, 0.5], threshold=0.5)
And the results:
**** country we train+test at class 0 ratio of 0.9
Confusion Matrix - each row true class (percentage), at threshold of 0.5:
[[89.46 0.32]
[ 9.69 0.52]]
Classification Report at threshold 0.5:
precision recall f1-score support
0 0.90 1.00 0.95 7183
1 0.62 0.05 0.09 817
accuracy 0.90 8000
macro avg 0.76 0.52 0.52 8000
weighted avg 0.87 0.90 0.86 8000
ROC AUC: 0.78
PR AUC (Precision-Recall AUC): 0.31
**** test in a country with class 0 ratio of 0.9
Confusion Matrix - each row true class (percentage), at threshold of 0.5:
[[89.64 0.26]
[ 9.6 0.5 ]]
Classification Report at threshold 0.5:
precision recall f1-score support
0 0.90 1.00 0.95 7192
1 0.66 0.05 0.09 808
accuracy 0.90 8000
macro avg 0.78 0.52 0.52 8000
weighted avg 0.88 0.90 0.86 8000
ROC AUC: 0.76
PR AUC (Precision-Recall AUC): 0.31
**** test in a country with class 0 ratio of 0.5
Confusion Matrix - each row true class (percentage), at threshold of 0.5:
[[49.74 0.15]
[47.78 2.34]]
Classification Report at threshold 0.5:
precision recall f1-score support
0 0.51 1.00 0.67 3991
1 0.94 0.05 0.09 4009
accuracy 0.52 8000
macro avg 0.72 0.52 0.38 8000
weighted avg 0.73 0.52 0.38 8000
ROC AUC: 0.76
PR AUC (Precision-Recall AUC): 0.76
We can see that instead of the previous 85% accuracy, we now get 90% accuracy. Our model performs better at deployment time because we adapted it to the new distribution. Label adaptation (training with the target's class proportions) produces the highest accuracy on the target domain.
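As a side note, the same adaptation can be sketched without collecting a new training set: keep the source data and reweight each class by w(y) = q(y)/p(y). This assumes both priors are known (here, the ones used in this post); the variable names below are mine.
from sklearn.linear_model import LogisticRegression

# Priors used in this post: source country 70% class 0, target country 90% class 0
source_prior = {0: 0.7, 1: 0.3}
target_prior = {0: 0.9, 1: 0.1}

# Importance weight per class: w(y) = q(y) / p(y)
class_weights = {c: target_prior[c] / source_prior[c] for c in (0, 1)}

# Reuse the source training data, but let the weights mimic the target proportions
X_train, y_train = generate_dataset(dataset_length, probability_of_0=0.7)
adapted_model = LogisticRegression(class_weight=class_weights).fit(X_train, y_train)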
What will happen if we train our model using balanced classes (0.5)?
train_test_on_source_then_test_on_two_more_countries(source_prob=0.5, target_probabilities=[0.9, 0.5], threshold=0.5)
And the results:
**** country we train+test at class 0 ratio of 0.5
Confusion Matrix - each row true class (percentage), at threshold of 0.5:
[[34.35 14.96]
[16.14 34.55]]
Classification Report at threshold 0.5:
precision recall f1-score support
0 0.68 0.70 0.69 3945
1 0.70 0.68 0.69 4055
accuracy 0.69 8000
macro avg 0.69 0.69 0.69 8000
weighted avg 0.69 0.69 0.69 8000
ROC AUC: 0.76
PR AUC (Precision-Recall AUC): 0.76
**** test in a country with class 0 ratio of 0.9
Confusion Matrix - each row true class (percentage), at threshold of 0.5:
[[62.12 27.86]
[ 3.16 6.85]]
Classification Report at threshold 0.5:
precision recall f1-score support
0 0.95 0.69 0.80 7199
1 0.20 0.68 0.31 801
accuracy 0.69 8000
macro avg 0.57 0.69 0.55 8000
weighted avg 0.88 0.69 0.75 8000
ROC AUC: 0.76
PR AUC (Precision-Recall AUC): 0.30
**** test in a country with class 0 ratio of 0.5
Confusion Matrix - each row true class (percentage), at threshold of 0.5:
[[35.5 15.09]
[15.18 34.24]]
Classification Report at threshold 0.5:
precision recall f1-score support
0 0.70 0.70 0.70 4047
1 0.69 0.69 0.69 3953
accuracy 0.70 8000
macro avg 0.70 0.70 0.70 8000
weighted avg 0.70 0.70 0.70 8000
ROC AUC: 0.76
PR AUC (Precision-Recall AUC): 0.75
We can see that both the accuracy (~0.69) and the ROC AUC (0.76) remained the same across the two different target distributions, but the PR AUC changed (0.30 vs 0.75).
That means that if you train your model on balanced classes, the model will perform with the same accuracy on any future target distribution. Not necessarily optimal accuracy, but constant.
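A short decomposition, using the per-class recalls from the reports above, shows why. Under label shift, the accuracy in a target country with a class-0 ratio of $q_0$ is
$\text{accuracy} = q_0 \cdot \text{recall}_0 + (1 - q_0) \cdot \text{recall}_1$
and the per-class recalls do not depend on $q_0$, only on the threshold and on $p(\boldsymbol{x}|y)$. With balanced training both recalls land around 0.69, so the accuracy is roughly $0.69 \cdot q_0 + 0.69 \cdot (1 - q_0) = 0.69$ for any target ratio.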
Final Conclusions
- If you want the accuracy (the trace of the normalized confusion matrix) not to change under label shift, train with balanced classes. However, this constant accuracy comes with a price: it will be lower than what you get by training with the same class proportions as in the target domain. If you want the highest accuracy on the target domain, you should train (in the source domain) with the same class proportions as in the target domain. This is called domain adaptation.
- It is interesting to see that the PR AUC in a target domain depends only on the class ratio in that domain; it does not depend on the training ratio in the source domain. So the PR AUC is not affected by class balancing during training, and you cannot fix it.
- The ROC AUC does not change under label shift, no matter what your training distribution is.
- When you move from domain A to domain B (under the label-shift assumption), the ROC AUC stays the same, while the accuracy and the PR AUC may each improve or worsen. You cannot affect the ROC AUC or the PR AUC in the target domain, but you can affect the accuracy by retraining the classifier with the right proportions (see the end-to-end sketch below). Even if the accuracy increases when moving from A to B, you can make it higher still by matching the label proportions.
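Tying the pieces together, here is a rough end-to-end sketch of the TL;DR recipe: train on the source, estimate the target prior from features only (using the illustrative estimate_label_shift_weights() helper sketched near the top of this post), and retrain with per-example weights. The data helper is the generate_dataset() defined earlier.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 1. Train on the source country (70% class 0) and hold out a validation split.
X_src, y_src = generate_dataset(dataset_length, probability_of_0=0.7)
X_tr, X_val, y_tr, y_val = train_test_split(X_src, y_src, test_size=0.25)
model = LogisticRegression().fit(X_tr, y_tr)

# 2. At deployment time, estimate w(y) = q(y)/p(y) from target FEATURES only.
X_tgt, _ = generate_dataset(dataset_length, probability_of_0=0.9)  # labels are not used
w = estimate_label_shift_weights(model, X_val, y_val, X_tgt)

# 3. Retrain so the model matches the detected target distribution.
adapted_model = LogisticRegression().fit(X_tr, y_tr, sample_weight=w[y_tr])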