How to compare machine learning networks
So you are given a fixed training set and a test set. Sometimes you even split off part of one of them to create a dev set for testing various hyperparameters during sweeps.
You have network A and network B, and you want to know which is a better architecture. Assume that when I say ‘network A’, I mean the model architecture plus its hyperparameters, before training.
The hyperparameters of a network can have a substantial effect on the performance of the model, so they should be taken seriously, but for the scope of this talk ‘network A’ includes everything besides the trained weights.
Why not just choose the network with the best result on the test set?
There are a few reasons why simply choosing the model with the best result on the eval set is not good enough.
First, if it is for academic research, improving the eval result by a small $\epsilon$ doesn’t mean we have a new discovery. The gap can easily come from a bad seed choice for model A and a good seed for model B, as we will show below. It can also come from bad/good hyperparameters.
Second, assume you work in a company and you see that model B is $\epsilon$ better than A. If you decide to swap the models in production, you spend energy and effort doing so, and you may also reach wrong conclusions and build on insights that aren’t necessarily true. If network A uses 10 features and you find that network B, with 5 additional features, is $\epsilon$ better, you may also spend unnecessary compute, storage, and other resources. Even worse, two months later a different researcher may find that model C, with those 5 features removed (i.e., identical to the original model A), is better than model B by $\epsilon$. And then what? You switch back to model A? This can go on forever, and worse, it can hinder the discovery of networks or features that really do improve the metrics.
So there is a real advantage in knowing whether model B is truly better than model A.
Seed sensitivity
When you run training for network A you don’t always get the same result (same model) unless you fix the stochasticity of the process. This can be done by fixing the random seed.
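For example, here is a minimal sketch of fixing the seed, assuming NumPy and PyTorch (other frameworks have their own equivalents, and full determinism may also require additional settings such as deterministic-algorithm flags and fixed dataloader workers):

```python
# Minimal sketch of fixing the random seed (assumes NumPy and PyTorch;
# adapt to whatever framework you actually use).
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Fix the main sources of randomness so a training run is reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # seeds CPU (and, by default, CUDA) RNGs
    torch.cuda.manual_seed_all(seed)  # explicit, in case of multiple GPUs


set_seed(42)
```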
I’ve reviewed many papers that propose a method B and try to show that it is better than a baseline model A.
But what about the random seed? Sometimes changing the random seed alone can swap the winner. So a single run (a single seed) is definitely not sufficient, and any conclusion drawn from it is not statistically meaningful.
So we can run the experiment with a few seeds, for example 3 or 5, and compute the average and standard deviation for each model. This gives us a notion of whether there is a clear winner.
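A minimal sketch of such a multi-seed comparison is shown below; `train_and_evaluate` is a hypothetical stand-in for your own training-and-evaluation pipeline, assumed to return a single scalar metric (e.g., test accuracy):

```python
# Run one training per seed for a given network and collect the scores.
import numpy as np

SEEDS = [0, 1, 2, 3, 4]


def run_with_seeds(train_and_evaluate, seeds=SEEDS):
    """Train once per seed and return the per-seed eval scores as an array."""
    return np.array([train_and_evaluate(seed=s) for s in seeds])


# Hypothetical usage:
# scores_a = run_with_seeds(train_and_evaluate_network_a)
# scores_b = run_with_seeds(train_and_evaluate_network_b)
# print(f"A: {scores_a.mean():.4f} ± {scores_a.std(ddof=1):.4f}")
# print(f"B: {scores_b.mean():.4f} ± {scores_b.std(ddof=1):.4f}")
```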
But is it enough? This is qualitative, not quantitative. To get a numerical definition of the winner we need something else: the t-test.
Define the results of model A as $A$ and the results of model B as $B$. If the two vectors have the same order, meaning that we know $A_0$ corresponds to $B_0$ (the same seed), and so on, we can use the paired t-test:
\[\begin{equation} \label{eq:1} \begin{aligned} d_i &= A_i - B_i \end{aligned} \end{equation}\]

\[\begin{equation} \label{eq:2} \begin{aligned} \bar{d} &= \frac{1}{n} \sum_{i=1}^{n} d_i \end{aligned} \end{equation}\]

\[\begin{equation} \label{eq:3} \begin{aligned} s_d &= \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (d_i - \bar{d})^2} \end{aligned} \end{equation}\]

\[\begin{equation} \label{eq:4} \begin{aligned} t &= \frac{\bar{d}}{s_d / \sqrt{n}} \quad &\nu = n - 1 \end{aligned} \end{equation}\]

where $t$ is the test statistic and $\nu$ is the number of degrees of freedom.
Then you plug $t$ and $\nu$ into:
\[\begin{equation} \label{eq:5} \begin{aligned} p = 2 \left( 1 - F_{t,\nu}(|t|) \right) \end{aligned} \end{equation}\]

where $F_{t,\nu}(\cdot)$ is the cumulative distribution function (CDF) of the Student’s t-distribution with $\nu$ degrees of freedom.
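Here is a minimal sketch of the paired t-test in Python, computing equations (1)–(5) by hand and cross-checking against SciPy’s `ttest_rel`; the score arrays are placeholder numbers standing in for the per-seed results collected above:

```python
# Paired t-test on per-seed scores: manual computation of eqs. (1)-(5),
# plus SciPy's built-in ttest_rel as a sanity check.
import numpy as np
from scipy import stats

scores_a = np.array([0.912, 0.907, 0.915, 0.909, 0.911])  # placeholder values
scores_b = np.array([0.918, 0.913, 0.916, 0.915, 0.919])  # placeholder values

d = scores_a - scores_b                   # eq. (1): per-seed differences
n = len(d)
d_bar = d.mean()                          # eq. (2): mean difference
s_d = d.std(ddof=1)                       # eq. (3): sample std of differences
t = d_bar / (s_d / np.sqrt(n))            # eq. (4): t statistic
nu = n - 1                                # eq. (4): degrees of freedom
p = 2 * (1 - stats.t.cdf(abs(t), df=nu))  # eq. (5): two-sided p-value

t_scipy, p_scipy = stats.ttest_rel(scores_a, scores_b)
print(f"manual: t={t:.3f}, p={p:.4f};  scipy: t={t_scipy:.3f}, p={p_scipy:.4f}")
```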
In situations where we can’t pair the two, for example when we are given $A$ and $B$ but we don’t know whether the vectors were permuted, we can use Welch’s two-sample t-test:
\[\begin{equation} \label{eq:6} \begin{aligned} t &= \frac{\bar{A} - \bar{B}}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}} \quad &\nu \approx \frac{\left( \frac{s_A^2}{n_A} + \frac{s_B^2}{n_B} \right)^2} {\frac{\left( \frac{s_A^2}{n_A} \right)^2}{n_A - 1} + \frac{\left( \frac{s_B^2}{n_B} \right)^2}{n_B - 1}} \end{aligned} \end{equation}\]

Then, if $p < 0.05$ (or 0.01, depending on your preference), you can reject the null hypothesis and conclude that the difference is statistically significant.
If $p > 0.05$, you can’t conclude anything, other than that you need to improve the experiment (for example, run more seeds) to potentially get a stronger signal (a lower p-value).
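A minimal sketch of Welch’s test, assuming SciPy: passing `equal_var=False` to `ttest_ind` gives Welch’s version, which uses the degrees-of-freedom approximation from equation (6). The score arrays are again placeholders:

```python
# Welch's two-sample t-test for unpaired runs of model A and model B.
import numpy as np
from scipy import stats

scores_a = np.array([0.912, 0.907, 0.915, 0.909, 0.911])  # placeholder values
scores_b = np.array([0.918, 0.913, 0.916, 0.915, 0.919])  # placeholder values

# equal_var=False selects Welch's test rather than the pooled-variance t-test.
t, p = stats.ttest_ind(scores_a, scores_b, equal_var=False)

alpha = 0.05
if p < alpha:
    print(f"p={p:.4f} < {alpha}: the difference is statistically significant")
else:
    print(f"p={p:.4f} >= {alpha}: cannot conclude that one model is better")
```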
Use Multiple Data Splits
If you are the one who decides on the train/test split (unlike many common benchmarks in which the split is already given to you), and you want stronger, more confident results, you can create a few randomized splits, train model A and model B on each of them, and then run the t-test to see whether model B indeed improves over model A. Keep in mind, though, that some research shows that applying the t-test to k-fold results is not strictly correct, because the folds overlap. If you have near-unlimited data, gather $k$ separate train and test datasets to compute $k$ truly independent skill scores for each method; you may then correctly apply the paired Student’s t-test.
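A minimal sketch of the repeated-split comparison, assuming scikit-learn-style estimators; `build_model_a`, `build_model_b`, and the dataset `(X, y)` are hypothetical placeholders:

```python
# Compare two models over several randomized train/test splits,
# then run a paired t-test on the per-split scores.
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split


def compare_over_splits(build_model_a, build_model_b, X, y, n_splits=5):
    """Return per-split scores for both models plus the paired t-test result."""
    scores_a, scores_b = [], []
    for split_seed in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=split_seed)
        scores_a.append(build_model_a().fit(X_tr, y_tr).score(X_te, y_te))
        scores_b.append(build_model_b().fit(X_tr, y_tr).score(X_te, y_te))
    # The splits share data, so treat this p-value as indicative only
    # (see the caveat above about t-tests on overlapping / k-fold splits).
    t, p = stats.ttest_rel(scores_a, scores_b)
    return np.array(scores_a), np.array(scores_b), t, p
```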
What if there’s no seed control and you can’t control the train/test split?
Well, then you can only run each model once. You can still pick the best-scoring model (e.g., model B), but you can’t tell whether model B is really better than model A, or whether the difference is just due to chance.