Classification models almost always lose some accuracy when generalizing to new data, and conventional wisdom attributes this drop to overfitting: the models have adapted to the specific test set through repeated evaluation. However, this paper provides experimental evidence for a different explanation: the accuracy drop occurs because the models fail to generalize to images that are harder to classify than those in the original test set.
The primary goal of machine learning is to produce models that generalize. We usually quantify a model's generalization ability by measuring its performance on a held-out test set. What does good performance on the test set then mean? At a minimum, it should mean that the model also performs well on a new test set drawn from the same data source and collected with the same data cleaning protocol.
In this paper, the researchers carry out exactly this experiment by replicating the creation process of two important benchmark datasets, CIFAR-10 and ImageNet. Contrary to the ideal outcome, they find that a wide range of classification models fail to reach their original accuracy scores: accuracy drops by 3%–15% on CIFAR-10 and by 11%–14% on ImageNet. On ImageNet, the loss in accuracy amounts to roughly five years of progress during a highly active period of machine learning research.
Conventional wisdom holds that such drops occur because the models have adapted to the specific images in the original test set, e.g., through extensive hyperparameter tuning. However, their experiments show that the relative order of the models remains almost unchanged on the new test set: the model with the highest accuracy on the original test set is still the model with the highest accuracy on the new test set. Moreover, there are no diminishing returns in accuracy; in fact, every percentage point of improvement on the original test set translates into a slightly larger improvement on the new test set. So the later models, which had more opportunity to adapt to the test set, actually see a smaller drop in accuracy. These results suggest that exhaustive test set evaluation is an effective way to improve image classification models, and that adaptivity is therefore an unlikely explanation for the accuracy drop.
Instead, the researchers propose an alternative explanation based on the relative difficulty of the original and new test sets. They demonstrate that the original ImageNet accuracies can be almost exactly recovered if the new dataset contains only the easiest images from the candidate pool. This shows that the accuracy scores of even the best image classifiers are highly sensitive to the details of the data cleaning process. It also shows that current classifiers still do not generalize reliably, even in the benign setting of a carefully controlled reproducibility experiment.
Figure 1 shows the main results of the experiments. To support future research, the researchers have also released the new test sets and the corresponding code.
Figure 1: Model accuracy on the original test sets vs. the new test sets. Each data point corresponds to one model in the testbed (shown with 95% Clopper-Pearson confidence intervals). The plots reveal two main phenomena: 1) accuracy drops significantly from the original test set to the new test set; 2) model accuracy closely follows a linear function with slope greater than 1 (1.7 for CIFAR-10 and 1.1 for ImageNet), meaning that every percentage point of improvement on the original test set translates into more than one percentage point of improvement on the new test set. The two plots are drawn with the same aspect ratio so that the slopes can be compared visually. The red shaded area is a 95% confidence region for the linear fit, computed from 100,000 bootstrap samples.
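To make the statistics in this caption concrete, here is a minimal Python sketch (not the authors' released code) of the two quantities reported in Figure 1: an exact 95% Clopper-Pearson interval for a single model's accuracy, and a bootstrapped linear fit between original and new test accuracy. All model accuracies in the sketch are placeholder values, not results from the paper.

```python
# Sketch of the Figure 1 statistics: Clopper-Pearson intervals and a bootstrapped linear fit.
import numpy as np
from scipy import stats

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for k correct out of n examples."""
    lower = stats.beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = stats.beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper

# Hypothetical accuracies (as fractions) for a handful of models.
orig_acc = np.array([0.83, 0.86, 0.89, 0.91, 0.93, 0.95, 0.96, 0.97])
new_acc  = np.array([0.66, 0.71, 0.77, 0.80, 0.83, 0.86, 0.88, 0.90])

# Bootstrap the slope of the linear fit new_acc ~ orig_acc (the paper uses 100,000 samples).
rng = np.random.default_rng(0)
slopes = []
for _ in range(100_000):
    idx = rng.integers(0, len(orig_acc), len(orig_acc))
    if np.ptp(orig_acc[idx]) == 0:  # skip degenerate resamples with identical x values
        continue
    slope, _ = np.polyfit(orig_acc[idx], new_acc[idx], deg=1)
    slopes.append(slope)

print("95% CI for 9,300/10,000 correct:", clopper_pearson(9300, 10000))
print("bootstrap 95% interval for the slope:", np.percentile(slopes, [2.5, 97.5]))
```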
Paper: Do ImageNet Classifiers Generalize to ImageNet?
Paper address: http://people.csail.mit.edu/ludwigs/papers/imagenet.pdf
Abstract: We build new test sets for the CIFAR-10 and ImageNet datasets. Both benchmarks have been the focus of intense research for almost a decade, raising the danger of overfitting to excessively re-used test sets. By closely following the original dataset creation processes, we test to what extent current classification models generalize to new data. We evaluate a broad range of models and find accuracy drops of 3%–15% on CIFAR-10 and 11%–14% on ImageNet. However, accuracy gains on the original test sets translate into even larger gains on the new test sets. Our results suggest that the accuracy drops are not caused by adaptivity, but by the models' inability to generalize to images that are slightly harder to classify than those in the original test sets.
Experimental summary
The main steps of the reproducibility experiment are outlined below; Appendices B and C of the paper describe the methodology in detail. The first step is to choose informative datasets.
Table 1: Model accuracy on the original CIFAR-10 test set, the original ImageNet validation set, and the new test sets. ΔRank is the relative change in ranking from the original test set to the new test set within the full ranking of all models (see Appendices B.3.3 and C.4.4). For example, ΔRank = -2 means that a model has dropped two places in the ranking on the new test set compared to the original test set. Confidence intervals are 95% Clopper-Pearson intervals. Due to space constraints, references for the models are given in Appendices B.3.2 and C.4.3.
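As a small illustration of the ΔRank column, the following sketch (with hypothetical model names and accuracies) ranks the models by original and by new accuracy and reports the change in position; a negative value means the model fell in the ranking on the new test set.

```python
# Hypothetical example of computing the ΔRank column from two accuracy tables.
orig = {"model_a": 0.953, "model_b": 0.947, "model_c": 0.941}
new  = {"model_a": 0.852, "model_b": 0.858, "model_c": 0.839}

# Rank models by accuracy (0 = best) on each test set.
rank_orig = {m: r for r, m in enumerate(sorted(orig, key=orig.get, reverse=True))}
rank_new  = {m: r for r, m in enumerate(sorted(new, key=new.get, reverse=True))}

for m in orig:
    print(m, "ΔRank =", rank_orig[m] - rank_new[m])
# model_a ΔRank = -1 (fell one place), model_b ΔRank = 1, model_c ΔRank = 0
```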
Table 2: Impact of the three sampling strategies on the ImageNet test sets. The table shows the average MTurk selection frequency of the resulting datasets and the average change in model accuracy relative to the original validation set. All three test sets have an average selection frequency above 0.7, yet the model accuracies still vary widely. For comparison, the original ImageNet validation set has an average selection frequency of 0.71 in these MTurk experiments. The changes in average accuracy span 14% in top-1 and 10% in top-5, respectively. This shows that the details of the sampling strategy have a large influence on the resulting accuracies.
Figure 2: Model accuracy on the original ImageNet validation set vs. accuracy on two variants of the new test set. Each data point corresponds to one model in the testbed (shown with 95% Clopper-Pearson confidence intervals). On Threshold0.7, model accuracies are about 3% lower than on the original validation set. On TopImages, which contains the images most frequently selected by MTurk workers, the models perform about 2% better than on the original validation set. The accuracies on both datasets closely follow a linear function, similar to MatchedFrequency in Figure 1. The red shaded area is a 95% confidence region for the linear fit, computed from 100,000 bootstrap samples.
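The Threshold0.7 and TopImages variants differ only in how candidate images are filtered by their MTurk selection frequency. The following is a minimal sketch under that assumption (hypothetical image IDs, frequencies, and per-class budget; not the paper's released code): Threshold0.7 keeps every candidate selected by at least 70% of workers, while TopImages keeps only the most frequently selected candidates for each class.

```python
# Sketch of the Threshold0.7 and TopImages sampling strategies for one class.
candidates = [
    ("img_001", 0.95), ("img_002", 0.72), ("img_003", 0.55),
    ("img_004", 0.88), ("img_005", 0.64),
]
per_class_budget = 2  # number of images kept per class (placeholder value)

# Threshold0.7: keep every candidate selected by at least 70% of MTurk workers.
threshold_0_7 = [img for img, freq in candidates if freq >= 0.7]

# TopImages: keep the most frequently selected candidates for the class.
by_frequency = sorted(candidates, key=lambda c: c[1], reverse=True)
top_images = [img for img, _ in by_frequency[:per_class_budget]]

print(threshold_0_7)  # ['img_001', 'img_002', 'img_004']
print(top_images)     # ['img_001', 'img_004']
```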