Generate Machine Learning Model .m Code Files Automatically in MATLAB

Aug. 02, 2022

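The MATLAB Classification Learner app provides a convenient, graphical way to train machine learning models, and it can automatically export the entire training process as function code or model code.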


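This post exports a support vector machine (SVM) model with a linear kernel as the function file trainClassifier.m, and uses that file as an example to walk through the overall structure of the generated code.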


(1) Function file comments

function [trainedClassifier, validationAccuracy] = trainClassifier(trainingData)
% [trainedClassifier, validationAccuracy] = trainClassifier(trainingData)
% Returns a trained classifier and its accuracy. This code recreates the
% classification model trained in Classification Learner app. Use the
% generated code to automate training the same model with new data, or to
% learn how to programmatically train models.
%
%  Input:
%      trainingData: A table containing the same predictor and response
%       columns as those imported into the app.
%
%  Output:
%      trainedClassifier: A struct containing the trained classifier. The
%       struct contains various fields with information about the trained
%       classifier.
%
%      trainedClassifier.predictFcn: A function to make predictions on new
%       data.
%
%      validationAccuracy: A double containing the accuracy as a
%       percentage. In the app, the Models pane displays this overall
%       accuracy score for each model.
%
% Use the code to train the model with new data. To retrain your
% classifier, call the function from the command line with your original
% data or new data as the input argument trainingData.
%
% For example, to retrain a classifier trained with the original data set
% T, enter:
%   [trainedClassifier, validationAccuracy] = trainClassifier(T)
%
% To make predictions with the returned 'trainedClassifier' on new data T2,
% use
%   yfit = trainedClassifier.predictFcn(T2)
%
% T2 must be a table containing at least the same predictor columns as used
% during training. For details, enter:
%   trainedClassifier.HowToPredict

% Auto-generated by MATLAB on 2022-08-02 21:36:27

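The comment block mainly describes the function's input and outputs, along with some related usage information.

  • Input: the variable trainingData, a table that contains the predictor columns (features) and the response column (labels).

  • Outputs:

    The variable trainedClassifier, a struct that contains the trained classification model together with related information and functions.

    The variable validationAccuracy, the validation accuracy (here, the cross-validation accuracy).

In addition, trainedClassifier.predictFcn can be used to predict the labels of new, unseen data.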

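⭐ Note:

trainedClassifier is a struct. Besides the trained classification model, it also stores other related information and functions (such as trainedClassifier.predictFcn). In this example, trainedClassifier.ClassificationSVM is the trained model itself. Despite the field name, its data type is 1x1 ClassificationECOC, because the multiclass model is trained with fitcecoc.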
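To make the input and output description concrete, here is a minimal usage sketch. It assumes that the generated trainClassifier.m from this post is on the MATLAB path; the table T is synthetic data invented for illustration and only mimics the column layout of the original data set (15 numeric predictors plus a FaultCode label from 1 to 7).

```matlab
% Sketch only: requires the generated trainClassifier.m on the MATLAB path.
% The data is synthetic (an assumption), not the author's data set.
rng(0);                                        % reproducible random numbers
predictorNames = {'FirstPeakValue', 'ValleyValue', 'SecondPeakValue', ...
    'stage1', 'stage2', 'stage3', 'stage4', 'duration', ...
    'BeginningVoltage', 'MaxVoltage', 'EndingVoltage', ...
    'BeginningTime', 'MaxVoltageTime', 'Stroke', 'Velocity'};
numClasses  = 7;
numPerClass = 30;
FaultCode = repelem((1:numClasses)', numPerClass);   % labels 1..7, 30 rows each
% Shift every feature by the class label so the classes are separable.
X = randn(numel(FaultCode), numel(predictorNames)) + FaultCode;
T = array2table(X, 'VariableNames', predictorNames);
T.FaultCode = FaultCode;

% Train and validate with the generated function.
[trainedClassifier, validationAccuracy] = trainClassifier(T);
fprintf('Validation accuracy: %.1f%%\n', 100 * validationAccuracy);

% Predict labels for a few rows; the extra FaultCode column is ignored.
yfit = trainedClassifier.predictFcn(T(1:5, :));
disp(class(trainedClassifier.ClassificationSVM));    % type of the stored model
```

Note that although the generated header comment calls validationAccuracy a percentage, the function computes it as 1 minus a loss, so the value is a fraction between 0 and 1. That is why the sketch multiplies by 100 before printing.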


(2) Separate the predictor columns and the response column

% Extract predictors and response
% This code processes the data into the right shape for training the
% model.
inputTable = trainingData;
predictorNames = {'FirstPeakValue', 'ValleyValue', 'SecondPeakValue', 'stage1', 'stage2', 'stage3', 'stage4', 'duration', 'BeginningVoltage', 'MaxVoltage', 'EndingVoltage', 'BeginningTime', 'MaxVoltageTime', 'Stroke', 'Velocity'};
predictors = inputTable(:, predictorNames);
response = inputTable.FaultCode;
isCategoricalPredictor = [false, false, false, false, false, false, false, false, false, false, false, false, false, false, false];
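The two indexing styles in this block behave differently, which a small example may make clearer. The following sketch uses a tiny made-up table (the names a, b, and label are hypothetical): indexing with parentheses returns a table, while dot indexing returns the column's underlying array.

```matlab
% Hypothetical three-row table used only to illustrate table indexing.
T = table([1; 2; 3], [4; 5; 6], [1; 2; 1], ...
    'VariableNames', {'a', 'b', 'label'});

P = T(:, {'a', 'b'});   % parentheses: still a table, restricted to two columns
r = T.label;            % dot indexing: the column itself, as a numeric vector

disp(class(P));         % table
disp(class(r));         % double
```

This is why predictors stays a table (so the model remembers the predictor names) while response becomes a plain numeric vector of fault codes.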


(3) Set the classifier hyperparameters and train the model

% Train a classifier
% This code specifies all the classifier options and trains the classifier.
template = templateSVM(...
    'KernelFunction', 'linear', ...
    'PolynomialOrder', [], ...
    'KernelScale', 'auto', ...
    'BoxConstraint', 1, ...
    'Standardize', true);
classificationSVM = fitcecoc(...
    predictors, ...
    response, ...
    'Learners', template, ...
    'Coding', 'onevsone', ...
    'ClassNames', [1; 2; 3; 4; 5; 6; 7]);
  • templateSVM: sets the hyperparameters of the binary SVM learner
  • fitcecoc: trains the multiclass model from those binary learners
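As a rough illustration of what the 'onevsone' coding means, the sketch below trains the same kind of model on synthetic data (the data and the variable names are assumptions made for this example). With K classes, one-vs-one coding trains one binary SVM per pair of classes, that is K(K-1)/2 learners, which is 21 for the 7 fault codes used here.

```matlab
% Synthetic 7-class data, invented for illustration only.
rng(1);
K = 7;
nPerClass = 20;
y = repelem((1:K)', nPerClass);          % labels 1..7
X = randn(numel(y), 3) + 3 * y;          % three features, shifted per class

t = templateSVM('KernelFunction', 'linear', 'Standardize', true);
mdl = fitcecoc(X, y, 'Learners', t, 'Coding', 'onevsone');

% One binary learner per pair of classes: K*(K-1)/2.
fprintf('Binary learners: %d\n', numel(mdl.BinaryLearners));
% The coding matrix has one row per class and one column per learner.
disp(size(mdl.CodingMatrix));
```

Each column of the coding matrix marks one class as positive, one as negative, and leaves the rest out of that binary problem.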


(4) Set the predictFcn function of the struct trainedClassifier, used to predict the labels of new data

% Create the result struct with predict function
predictorExtractionFcn = @(t) t(:, predictorNames);
svmPredictFcn = @(x) predict(classificationSVM, x);
trainedClassifier.predictFcn = @(x) svmPredictFcn(predictorExtractionFcn(x));
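The three lines above rely on anonymous functions capturing the values of workspace variables at the moment the handle is created. The sketch below illustrates that behavior with a made-up table and made-up column names; it is not part of the generated code.

```matlab
% Hypothetical names and data, used only to show variable capture.
names = {'a', 'b'};
extractFcn = @(t) t(:, names);   % captures the current value of names

names = {'c'};                   % changing names later does not affect the handle

T = table(1, 2, 3, 'VariableNames', {'a', 'b', 'c'});
out = extractFcn(T);
disp(out.Properties.VariableNames);   % still the columns a and b
```

In the same way, predictFcn carries its own copy of predictorNames and of the trained model, so the struct can be saved and used later without the original workspace variables.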


(5) Store additional information in the struct

% Add additional fields to the result struct
trainedClassifier.RequiredVariables = {'BeginningTime', 'BeginningVoltage', 'EndingVoltage', 'FirstPeakValue', 'MaxVoltage', 'MaxVoltageTime', 'SecondPeakValue', 'Stroke', 'ValleyValue', 'Velocity', 'duration', 'stage1', 'stage2', 'stage3', 'stage4'};
trainedClassifier.ClassificationSVM = classificationSVM;
trainedClassifier.About = 'This struct is a trained model exported from Classification Learner R2022a.';
trainedClassifier.HowToPredict = sprintf('To make predictions on a new table, T, use: \n  yfit = c.predictFcn(T) \nreplacing ''c'' with the name of the variable that is this struct, e.g. ''trainedModel''. \n \nThe table, T, must contain the variables returned by: \n  c.RequiredVariables \nVariable formats (e.g. matrix/vector, datatype) must match the original training data. \nAdditional variables are ignored. \n \nFor more information, see <a href="matlab:helpview(fullfile(docroot, ''stats'', ''stats.map''), ''appclassification_exportmodeltoworkspace'')">How to predict using an exported model</a>.');
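One practical use of the RequiredVariables field is to check a new table before calling predictFcn. The following is a minimal sketch of such a check; the shortened variable list and the table T2 are stand-ins invented for this example, and in practice the list would come from trainedClassifier.RequiredVariables.

```matlab
% Stand-in for trainedClassifier.RequiredVariables (shortened, hypothetical).
required = {'BeginningTime', 'Stroke', 'Velocity'};

% New data that lacks one of the required columns.
T2 = table(1.5, 0.2, 'VariableNames', {'Stroke', 'Velocity'});

% Required names that are absent from the new table.
missingVars = setdiff(required, T2.Properties.VariableNames);
if isempty(missingVars)
    disp('All required variables are present.');
else
    fprintf('Missing variables: %s\n', strjoin(missingVars, ', '));
end
```

Extra columns in the new table are not a problem, since the exported predict function selects only the predictor columns it needs.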


(6) Separate the predictor columns and the response column

% Extract predictors and response
% This code processes the data into the right shape for training the
% model.
inputTable = trainingData;
predictorNames = {'FirstPeakValue', 'ValleyValue', 'SecondPeakValue', 'stage1', 'stage2', 'stage3', 'stage4', 'duration', 'BeginningVoltage', 'MaxVoltage', 'EndingVoltage', 'BeginningTime', 'MaxVoltageTime', 'Stroke', 'Velocity'};
predictors = inputTable(:, predictorNames);
response = inputTable.FaultCode;
isCategoricalPredictor = [false, false, false, false, false, false, false, false, false, false, false, false, false, false, false];

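This block is identical to the code in part (2). This is probably because I enabled cross-validation for the model, and the block is part of the code generated for the validation step; MATLAB does not simplify the duplicated code.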


(7) Cross-validation

% Perform cross-validation
partitionedModel = crossval(trainedClassifier.ClassificationSVM, 'KFold', 5);

% Compute validation predictions
[validationPredictions, validationScores] = kfoldPredict(partitionedModel);

% Compute validation accuracy
validationAccuracy = 1 - kfoldLoss(partitionedModel, 'LossFun', 'ClassifError');
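  • crossval takes the validation options (here the number of folds), splits the data accordingly, and trains one model per fold on the remaining folds. It returns a partitioned model that is then passed to kfoldPredict and kfoldLoss. See: crossval - MATLAB Documentation
  • kfoldPredict returns, for each observation, the prediction of the fold model that was trained without that observation
  • kfoldLoss computes the overall cross-validated classification error; subtracting it from 1 gives the validation accuracy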
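The three calls can be inspected on a small synthetic problem. The sketch below uses made-up data and variable names; it shows that the object returned by crossval already holds one trained model per fold, and that kfoldPredict yields one out-of-fold prediction per observation. The manually computed accuracy is printed next to the kfoldLoss-based value for comparison; the two should agree closely, but the sketch does not assume they are identical in every configuration.

```matlab
% Synthetic 7-class data, invented for illustration only.
rng(3);
K = 7;
nPerClass = 20;
y = repelem((1:K)', nPerClass);
X = randn(numel(y), 3) + 3 * y;

t = templateSVM('KernelFunction', 'linear', 'Standardize', true);
mdl = fitcecoc(X, y, 'Learners', t, 'Coding', 'onevsone');

% crossval trains the per-fold models and returns a partitioned model.
partitionedModel = crossval(mdl, 'KFold', 5);
fprintf('Folds: %d, trained fold models: %d\n', ...
    partitionedModel.KFold, numel(partitionedModel.Trained));

% One out-of-fold prediction per observation.
pred = kfoldPredict(partitionedModel);
manualAccuracy = mean(pred == y);

% Accuracy as computed in the generated code.
validationAccuracy = 1 - kfoldLoss(partitionedModel, 'LossFun', 'ClassifError');
fprintf('Manual: %.3f, from kfoldLoss: %.3f\n', manualAccuracy, validationAccuracy);
```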

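This part of the code depends on the validation settings chosen when the session was created. The code above, for example, corresponds to k-fold cross-validation with 5 folds.

If you change the validation scheme, for example to holdout validation, the corresponding code becomes: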

...
% Set up holdout validation
cvp = cvpartition(response, 'Holdout', 0.25);
trainingPredictors = predictors(cvp.training, :);
trainingResponse = response(cvp.training, :);
trainingIsCategoricalPredictor = isCategoricalPredictor;

% Train a classifier
% This code specifies all the classifier options and trains the classifier.
template = templateSVM(...
    'KernelFunction', 'linear', ...
    'PolynomialOrder', [], ...
    'KernelScale', 'auto', ...
    'BoxConstraint', 1, ...
    'Standardize', true);
classificationSVM = fitcecoc(...
    trainingPredictors, ...
    trainingResponse, ...
    'Learners', template, ...
    'Coding', 'onevsone', ...
    'ClassNames', [1; 2; 3; 4; 5; 6; 7]);

% Create the result struct with predict function
svmPredictFcn = @(x) predict(classificationSVM, x);
validationPredictFcn = @(x) svmPredictFcn(x);

% Add additional fields to the result struct


% Compute validation predictions
validationPredictors = predictors(cvp.test, :);
validationResponse = response(cvp.test, :);
[validationPredictions, validationScores] = validationPredictFcn(validationPredictors);

% Compute validation accuracy
correctPredictions = (validationPredictions == validationResponse);
isMissing = isnan(validationResponse);
correctPredictions = correctPredictions(~isMissing);
validationAccuracy = sum(correctPredictions)/length(correctPredictions);
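To see what the holdout partition does on its own, here is a short sketch on made-up labels. It uses the function-call forms training(cvp) and test(cvp), which are equivalent to the cvp.training and cvp.test calls in the generated code; every observation ends up in exactly one of the two sets.

```matlab
% Synthetic labels, invented for illustration only: 7 classes, 20 rows each.
rng(2);
y = repelem((1:7)', 20);

% Hold out roughly 25% of the observations for validation.
cvp = cvpartition(y, 'Holdout', 0.25);
trainMask = training(cvp);   % logical vector, true for training rows
testMask  = test(cvp);       % logical vector, true for held-out rows

fprintf('Training rows: %d, validation rows: %d\n', cvp.TrainSize, cvp.TestSize);

% The two masks never overlap and together cover all rows.
disp(all(xor(trainMask, testMask)));
```

Because the model is trained only on the training rows, the accuracy computed on the held-out rows estimates performance on unseen data, at the cost of training on less data.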

If the training set and the validation set are the same data set (resubstitution validation), the code is:

% Compute resubstitution predictions
[validationPredictions, validationScores] = predict(trainedClassifier.ClassificationSVM, predictors);

% Compute resubstitution accuracy
validationAccuracy = 1 - resubLoss(trainedClassifier.ClassificationSVM, 'LossFun', 'ClassifError');
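Resubstitution scores the model on the same rows it was trained on, so the result tends to be optimistic compared with cross-validation. The sketch below computes both numbers for one model on synthetic, deliberately overlapping classes (data and names are assumptions for this example); it prints the two values without assuming which one is larger on this particular sample.

```matlab
% Synthetic 3-class data with overlapping classes, invented for illustration.
rng(4);
y = repelem((1:3)', 30);
X = randn(numel(y), 2) + y;

t = templateSVM('KernelFunction', 'linear', 'Standardize', true);
mdl = fitcecoc(X, y, 'Learners', t, 'Coding', 'onevsone');

% Accuracy on the training data itself (resubstitution).
resubAccuracy = 1 - resubLoss(mdl, 'LossFun', 'ClassifError');

% Accuracy estimated with 5-fold cross-validation on the same data.
cvAccuracy = 1 - kfoldLoss(crossval(mdl, 'KFold', 5), 'LossFun', 'ClassifError');

fprintf('Resubstitution: %.3f, 5-fold CV: %.3f\n', resubAccuracy, cvAccuracy);
```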



Finally, here is the complete code of the function file:

function [trainedClassifier, validationAccuracy] = trainClassifier(trainingData)
% [trainedClassifier, validationAccuracy] = trainClassifier(trainingData)
% Returns a trained classifier and its accuracy. This code recreates the
% classification model trained in Classification Learner app. Use the
% generated code to automate training the same model with new data, or to
% learn how to programmatically train models.
%
%  Input:
%      trainingData: A table containing the same predictor and response
%       columns as those imported into the app.
%
%  Output:
%      trainedClassifier: A struct containing the trained classifier. The
%       struct contains various fields with information about the trained
%       classifier.
%
%      trainedClassifier.predictFcn: A function to make predictions on new
%       data.
%
%      validationAccuracy: A double containing the accuracy as a
%       percentage. In the app, the Models pane displays this overall
%       accuracy score for each model.
%
% Use the code to train the model with new data. To retrain your
% classifier, call the function from the command line with your original
% data or new data as the input argument trainingData.
%
% For example, to retrain a classifier trained with the original data set
% T, enter:
%   [trainedClassifier, validationAccuracy] = trainClassifier(T)
%
% To make predictions with the returned 'trainedClassifier' on new data T2,
% use
%   yfit = trainedClassifier.predictFcn(T2)
%
% T2 must be a table containing at least the same predictor columns as used
% during training. For details, enter:
%   trainedClassifier.HowToPredict

% Auto-generated by MATLAB on 2022-08-02 21:36:27


% Extract predictors and response
% This code processes the data into the right shape for training the
% model.
inputTable = trainingData;
predictorNames = {'FirstPeakValue', 'ValleyValue', 'SecondPeakValue', 'stage1', 'stage2', 'stage3', 'stage4', 'duration', 'BeginningVoltage', 'MaxVoltage', 'EndingVoltage', 'BeginningTime', 'MaxVoltageTime', 'Stroke', 'Velocity'};
predictors = inputTable(:, predictorNames);
response = inputTable.FaultCode;
isCategoricalPredictor = [false, false, false, false, false, false, false, false, false, false, false, false, false, false, false];

% Train a classifier
% This code specifies all the classifier options and trains the classifier.
template = templateSVM(...
    'KernelFunction', 'linear', ...
    'PolynomialOrder', [], ...
    'KernelScale', 'auto', ...
    'BoxConstraint', 1, ...
    'Standardize', true);
classificationSVM = fitcecoc(...
    predictors, ...
    response, ...
    'Learners', template, ...
    'Coding', 'onevsone', ...
    'ClassNames', [1; 2; 3; 4; 5; 6; 7]);

% Create the result struct with predict function
predictorExtractionFcn = @(t) t(:, predictorNames);
svmPredictFcn = @(x) predict(classificationSVM, x);
trainedClassifier.predictFcn = @(x) svmPredictFcn(predictorExtractionFcn(x));

% Add additional fields to the result struct
trainedClassifier.RequiredVariables = {'BeginningTime', 'BeginningVoltage', 'EndingVoltage', 'FirstPeakValue', 'MaxVoltage', 'MaxVoltageTime', 'SecondPeakValue', 'Stroke', 'ValleyValue', 'Velocity', 'duration', 'stage1', 'stage2', 'stage3', 'stage4'};
trainedClassifier.ClassificationSVM = classificationSVM;
trainedClassifier.About = 'This struct is a trained model exported from Classification Learner R2022a.';
trainedClassifier.HowToPredict = sprintf('To make predictions on a new table, T, use: \n  yfit = c.predictFcn(T) \nreplacing ''c'' with the name of the variable that is this struct, e.g. ''trainedModel''. \n \nThe table, T, must contain the variables returned by: \n  c.RequiredVariables \nVariable formats (e.g. matrix/vector, datatype) must match the original training data. \nAdditional variables are ignored. \n \nFor more information, see <a href="matlab:helpview(fullfile(docroot, ''stats'', ''stats.map''), ''appclassification_exportmodeltoworkspace'')">How to predict using an exported model</a>.');

% Extract predictors and response
% This code processes the data into the right shape for training the
% model.
inputTable = trainingData;
predictorNames = {'FirstPeakValue', 'ValleyValue', 'SecondPeakValue', 'stage1', 'stage2', 'stage3', 'stage4', 'duration', 'BeginningVoltage', 'MaxVoltage', 'EndingVoltage', 'BeginningTime', 'MaxVoltageTime', 'Stroke', 'Velocity'};
predictors = inputTable(:, predictorNames);
response = inputTable.FaultCode;
isCategoricalPredictor = [false, false, false, false, false, false, false, false, false, false, false, false, false, false, false];

% Perform cross-validation
partitionedModel = crossval(trainedClassifier.ClassificationSVM, 'KFold', 5);

% Compute validation predictions
[validationPredictions, validationScores] = kfoldPredict(partitionedModel);

% Compute validation accuracy
validationAccuracy = 1 - kfoldLoss(partitionedModel, 'LossFun', 'ClassifError');