Analysis for Image Classification Benchmark CIFAR-10 and CIFAR-100 Datasets

Mar. 22, 2023

Introduction

MATLAB提供了一个使用ResNet分类CIFAR-10图像数据示例:Train Residual Network for Image Classification - MathWorks。CIFAR数据集是图像分类领域的Benchmark,收集自Alex Krizhevsky(AlexNet的作者),分为CIFAR-10和CIFAR-100数据集。该MATLAB示例所使用的CIFAR-10就是下载自Alex Krizhevsky的网站(CIFAR-10 and CIFAR-100 datasets (toronto.edu))。Alex Krizhevsky的网站对这两个数据集进行了简要的介绍,并且这个MATLAB示例展示了如何导入和使用下载的数据集,因此本博客就结合这两方面的资料,整理一下CIFAR-10和CIFAR-100的基本信息。


CIFAR-10 Dataset

MATLAB版本的CIFAR-10图像数据集下载地址:https://www.cs.toronto.edu/~kriz/cifar-10-matlab.tar.gz

MATLAB版本的CIFAR-10数据集共包含7个.mat文件,分别是:

1
2
3
4
5
6
7
8
9
10
>> dir
.                 
..                
batches.meta.mat 
data_batch_1.mat 
data_batch_2.mat
data_batch_3.mat
data_batch_4.mat
data_batch_5.mat
test_batch.mat

Information file: batches.meta.mat

batches.meta.mat文件中仅仅包含着CIFAR-10数据集的label信息:

1
2
3
4
>> load batches.meta.mat
>> whos
  Name              Size            Bytes  Class    Attributes
  label_names      10x1              1140  cell        
1
2
3
4
5
6
7
8
9
10
11
12
13
>> label_names
label_names =
  10×1 cell array
    {'airplane'  }
    {'automobile'}
    {'bird'      }
    {'cat'       }
    {'deer'      }
    {'dog'       }
    {'frog'      }
    {'horse'     }
    {'ship'      }
    {'truck'     }

Training dataset: data_batch_1.mat~data_batch_5.mat

训练数据分别保存在五个.mat文件中:data_batch_1.mat~data_batch_5.mat。每个.mat文件中保存的数据都是类似的。以data_batch_1.mat为例,该文件包含三个变量:

1
2
3
4
5
6
>> load data_batch_1.mat
>> whos
  Name                 Size                 Bytes  Class    Attributes
  batch_label          1x21                    42  char               
  data             10000x3072            30720000  uint8             
  labels           10000x1                  10000  uint8              

其中:

  • batch_label变量中保存着batch的信息:

    1
    2
    3
    
    >> batch_label
    batch_label =
        'training batch 1 of 5'
    
  • 变量data中保存着图像的RGB数据(均为整数),每张图像包含32x32x3(3072)个数据点;data中一共保存了10,000张图像;

  • 变量labels是这10,000张图像所对应的标签,从0到9

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    
    >> unique(labels)
    ans =
      10×1 uint8 column vector
       0
       1
       2
       3
       4
       5
       6
       7
       8
       9
    

我比较关心的是:

  • 每一个训练集的.mat文件中一共包含多少条训练数据?
  • 每一个训练集的.mat文件中所包含的训练数据中每一类样本的数量是一致的吗?
  • 所有的训练数据中每一类样本的数量是一致的吗?

因此,就简单地写了一个脚本文件来查看这些信息:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
clc,clear,close all

numsBatches = 5;
nums = nan(numsBatches,10);

for i = 1:numsBatches
    filename = sprintf("data_batch_%s.mat",num2str(i));
    load(filename)
    disp(batch_label)

    [ClassVariable,Class] = findgroups(labels);
    nums(i,:) = splitapply(@numel, labels, ClassVariable);

    disp("Sample size")
    disp(sum(nums(i,:)))
    disp("Sample size of each class in per batch")
    disp(nums(i,:))

    clear batch_label data labels
end
disp("Sample size of each class in whole training dataset")
disp(sum(nums))

data_batch_1.mat

1
2
3
4
5
training batch 1 of 5
Sample size
       10000
Sample size of each class in per batch
        1005         974        1032        1016         999         937        1030        1001        1025         981

data_batch_2.mat

1
2
3
4
5
training batch 2 of 5
Sample size
       10000
Sample size of each class in per batch
         984        1007        1010         995        1010         988        1008        1026         987         985

data_batch_3.mat

1
2
3
4
5
training batch 3 of 5
Sample size
       10000
Sample size of each class in per batch
         994        1042         965         997         990        1029         978        1015         961        1029

data_batch_4.mat

1
2
3
4
5
training batch 4 of 5
Sample size
       10000
Sample size of each class in per batch
        1003         963        1041         976        1004        1021        1004         981        1024         983

data_batch_5.mat

1
2
3
4
5
training batch 5 of 5
Sample size
       10000
Sample size of each class in per batch
        1014        1014         952        1016         997        1025         980         977        1003        1022

总体样本的数量

1
2
Sample size of each class in whole training dataset
        5000        5000        5000        5000        5000        5000        5000        5000        5000        5000

最终将结果汇总在表格中:

image-20230321180632180

可以看到:

  • 每一个.mat文件中都包含了10,000条数据,一共是50,000条训练数据;
  • 每一个.mat文件中的各类数据的数量并不是一致的,有一些细微的差别;
  • 所有的训练数据中每一类样本的数量是一致的,各5,000条;

Test dataset: test_batch.mat

类似地,测试集数据也可以进行类似的分析:

1
2
3
4
5
6
7
>> load test_batch.mat
>> whos
  Name                 Size                 Bytes  Class    Attributes

  batch_label          1x20                    40  char               
  data             10000x3072            30720000  uint8              
  labels           10000x1                  10000  uint8              
1
2
3
4
5
6
7
8
clc,clear,close all

load test_batch.mat

[ClassVariable,Class] = findgroups(labels);
nums = splitapply(@numel, labels, ClassVariable);

disp(nums')
1
        1000        1000        1000        1000        1000        1000        1000        1000        1000        1000

可以看到测试集中每类有1000个样本。

注:实际上,Alex在其网站中简单的介绍了CIFAR-10数据集的构成情况,上述分析只是简单地做了验证。“The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.”

Preprocess by MATLAB example

博客一开头所提到的MATLAB示例展示了对于这些原始图片数据的处理。使用了自定义的函数文件loadCIFARData.m

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
function [XTrain,YTrain,XTest,YTest] = loadCIFARData(location)
location = fullfile(location,'cifar-10-batches-mat');
[XTrain1,YTrain1] = loadBatchAsFourDimensionalArray(location,'data_batch_1.mat');
[XTrain2,YTrain2] = loadBatchAsFourDimensionalArray(location,'data_batch_2.mat');
[XTrain3,YTrain3] = loadBatchAsFourDimensionalArray(location,'data_batch_3.mat');
[XTrain4,YTrain4] = loadBatchAsFourDimensionalArray(location,'data_batch_4.mat');
[XTrain5,YTrain5] = loadBatchAsFourDimensionalArray(location,'data_batch_5.mat');
XTrain = cat(4,XTrain1,XTrain2,XTrain3,XTrain4,XTrain5);
YTrain = [YTrain1;YTrain2;YTrain3;YTrain4;YTrain5];

[XTest,YTest] = loadBatchAsFourDimensionalArray(location,'test_batch.mat');
end

function [XBatch,YBatch] = loadBatchAsFourDimensionalArray(location,batchFileName)
s = load(fullfile(location,batchFileName));
XBatch = s.data';
XBatch = reshape(XBatch,32,32,3,[]);
XBatch = permute(XBatch,[2 1 3 4]); % Attention here
YBatch = convertLabelsToCategorical(location,s.labels);
end

function categoricalLabels = convertLabelsToCategorical(location,integerLabels)
s = load(fullfile(location,'batches.meta.mat'));
categoricalLabels = categorical(integerLabels,0:9,s.label_names);
end

总得来说,代码非常简单,这里就不再赘述。

调用loadCIFARData.m

1
[XTrain,TTrain,XValidation,TValidation] = loadCIFARData(pwd);

查看图像的效果:

1
2
3
4
figure;
idx = randperm(size(XTrain,4),20);
im = imtile(XTrain(:,:,:,idx),ThumbnailSize=[96,96]);
imshow(im)

image-20230321183317343


CIFAR-100 Dataset

MATLAB版本的CIFAR-100图像数据集下载地址:https://www.cs.toronto.edu/~kriz/cifar-100-matlab.tar.gz

MATLAB版本的CIFAR-100数据集共包含3个.mat文件,分别是:

1
2
3
4
5
6
>> dir
.          
..         
meta.mat   
test.mat   
train.mat  

Label information file: meta.mat

meta.mat中保存着数据的分类信息,具有两个变量:

1
2
3
4
5
>> load meta.mat
>> whos
  Name                      Size            Bytes  Class    Attributes
  coarse_label_names       20x1              2696  cell               
  fine_label_names        100x1             11650  cell               

CIFAR-100一共有20大类(Superclass)的图像,而这20个大类又可以细分为100个小类(Classes),每个大类均包含5个小类:

image-20230321190428322

coarse_label_namesfine_label_names就保存着Superclasses和Classes的名称。

Training dataset: train.mat

训练集数据保存在train.mat中:

1
2
3
4
5
6
7
8
>> load train.mat
>> whos
  Name                   Size                  Bytes  Class    Attributes
  batch_label            1x21                     42  char               
  coarse_labels      50000x1                   50000  uint8              
  data               50000x3072            153600000  uint8              
  filenames          50000x1                 7400190  cell               
  fine_labels        50000x1                   50000  uint8              

其中,比较陌生的是filenames变量,里面保存着是图片的文件名称(尽管压缩包中并没有真正包含这些图像):

1
2
3
4
5
6
7
8
>> filenames(1:5)
ans =
  5×1 cell array
    {'bos_taurus_s_000507.png' }
    {'stegosaurus_s_000125.png'}
    {'mcintosh_s_000643.png'   }
    {'altar_boy_s_001435.png'  }
    {'cichlid_s_000031.png'    }

同样地,我们可以分析一下每个Superclass和Class的样本数量:

1
2
3
4
5
6
7
8
9
10
11
load train.mat

% For superclass
[ClassVariable,~] = findgroups(coarse_labels);
nums_superclass = splitapply(@numel, coarse_labels, ClassVariable);
disp(nums_superclass')

% For class
[ClassVariable,~] = findgroups(fine_labels);
nums_class = splitapply(@numel, fine_labels, ClassVariable);
disp(nums_class')

最后的结果是,每个Class都有500张图片,因此,每个Superclass都有2,500张图片,一共具有50,000张图片的数据。


Test dataset: test.mat

对于测试数据test.mat

1
2
3
4
5
6
7
8
>> load test.mat
>> whos
  Name                   Size                 Bytes  Class    Attributes
  batch_label            1x20                    40  char               
  coarse_labels      10000x1                  10000  uint8              
  data               10000x3072            30720000  uint8              
  filenames          10000x1                1480250  cell               
  fine_labels        10000x1                  10000  uint8              
1
2
3
4
5
6
7
8
9
10
11
12
13
clc,clear,close all

load test.mat

% For superclass
[ClassVariable,~] = findgroups(coarse_labels);
nums_superclass = splitapply(@numel, coarse_labels, ClassVariable);
disp(nums_superclass')

% For class
[ClassVariable,~] = findgroups(fine_labels);
nums_class = splitapply(@numel, fine_labels, ClassVariable);
disp(nums_class')

对于测试集,每个Class都有100张图片,因此,每个Superclass都有500张图片,一共具有10,000张图片的数据。


Datasets Merge

在某些情况下,将训练集和测试集的数据合并成一整个数据集是更加方便的。可以参照MATLAB示例对于CIFAR-10数据集的处理方式,将CIFAR-10和CIFAR-100数据集的训练集数据和测试集数据分别合并起来。

注:以下的脚本文件需要与数据集放在相同的工作文件夹中。

对于CIFAR-10数据集:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
clc,clear,close all

[XTrain,YTrain,XTest,YTest] = loadCIFARData(pwd);

data = cat(4,XTrain,XTest);
labels = [YTrain;YTest];

save("CIFAR-10.mat","data","labels")

function [XTrain,YTrain,XTest,YTest] = loadCIFARData(location)
[XTrain1,YTrain1] = loadBatchAsFourDimensionalArray(location,'data_batch_1.mat');
[XTrain2,YTrain2] = loadBatchAsFourDimensionalArray(location,'data_batch_2.mat');
[XTrain3,YTrain3] = loadBatchAsFourDimensionalArray(location,'data_batch_3.mat');
[XTrain4,YTrain4] = loadBatchAsFourDimensionalArray(location,'data_batch_4.mat');
[XTrain5,YTrain5] = loadBatchAsFourDimensionalArray(location,'data_batch_5.mat');
XTrain = cat(4,XTrain1,XTrain2,XTrain3,XTrain4,XTrain5);
YTrain = [YTrain1;YTrain2;YTrain3;YTrain4;YTrain5];

[XTest,YTest] = loadBatchAsFourDimensionalArray(location,'test_batch.mat');
end

function [XBatch,YBatch] = loadBatchAsFourDimensionalArray(location,batchFileName)
s = load(fullfile(location,batchFileName));
XBatch = s.data';
XBatch = reshape(XBatch,32,32,3,[]);
XBatch = permute(XBatch,[2 1 3 4]);
YBatch = convertLabelsToCategorical(location,s.labels);
end

function categoricalLabels = convertLabelsToCategorical(location,integerLabels)
s = load(fullfile(location,'batches.meta.mat'));
categoricalLabels = categorical(integerLabels,0:9,s.label_names);
end

对于CIFAR-100数据集:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
clc,clear,close all

[X_train,Y_train_coarse,Y_train_fine,filenames_train] = loadBatchAsFourDimensionalArray('train.mat');
[X_test,Y_test_coarse,Y_test_fine,filenames_test] = loadBatchAsFourDimensionalArray('test.mat');
data = cat(4,X_train,X_test);
coarse_labels = [Y_train_coarse;Y_test_coarse];
fine_labels = [Y_train_fine;Y_test_fine];
filenames = [filenames_train;filenames_test];

save("CIFAR-100.mat","data","coarse_labels","fine_labels","filenames")

function [XBatch,coarse_categoricalLabels,fine_categoricalLabels,filenames] = loadBatchAsFourDimensionalArray(filename)
s = load(filename);
XBatch = s.data';
XBatch = reshape(XBatch,32,32,3,[]);
XBatch = permute(XBatch,[2 1 3 4]);
[coarse_categoricalLabels,fine_categoricalLabels] = convertLabelsToCategorical(s.coarse_labels,s.fine_labels);
filenames = s.filenames;
end

function [coarse_categoricalLabels,fine_categoricalLabels] = ...
    convertLabelsToCategorical(coarse_integerLabels,fine_integerLabels)
s = load("meta.mat");
coarse_categoricalLabels = categorical(coarse_integerLabels,0:19,s.coarse_label_names);
fine_categoricalLabels = categorical(fine_integerLabels,0:99,s.fine_label_names);
end


References

[1] CIFAR-10 and CIFAR-100 datasets.

[2] Train Residual Network for Image Classification - MathWorks.