MATLAB filloutliers function

Sep. 26, 2022

Introduction

信号数据预处理中比较常见的一个步骤是离群值(outliers)处理,MATLAB提供了相应的函数处理离群值,包括检测并移除离群值的rmoutliers函数,以及检测并替换离群值的fillouliers函数,后者应用更为广泛,因为我们通常想要保持相同类型的信号的长度是一致的,因此本文就简单整理一下fillouliers函数的使用和原理。

fillouliers函数中可供用户设置的属性主要有两大部分,分别是:异常值替换算法fillmethod,异常值检测算法movemethod。本文举出了几个常用的使用示例来进行说明。


Examples

` filloutliers(Data, “linear”, “mean”)`

1
[cleanedData, outlierIndices, thresholdL, thresholdH, center] = filloutliers(Data, "linear", "mean");

该函数:

  • fillmethod使用的是"linear",表示使用的是线性填充离群值,相当于在离群点附近的两个非离群点之间拉一条线段;
  • findmethod使用的是"mean",表示离群值的查找规则是"mean",该方法认为离群值点是超过均值正负三个标准差范围的点。

示例:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
clc, clear, close all

A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];
[cleanedA, outlierIndices, thresholdL, thresholdH, center] = ...
    filloutliers(A, "linear", "mean");

figure, axes 
hold(gca, "on")
box(gca, "on")
grid(gca, "on")

% Plot original data
plot(A, "Color", [77 190 238]/255, "DisplayName", "Input data")
% Plot results
plot(cleanedA, "Color", [0 114 189]/255, "LineWidth", 1.5,...
    "DisplayName", "Cleaned data")
% Plot outliers
plot(find(outlierIndices), A(outlierIndices), "x", "MarkerSize", 10, ...
    "Color", [64 64 64]/255, "DisplayName", "Outliers")
% Plot filled outliers
plot(find(outlierIndices), cleanedA(outlierIndices), ".", "MarkerSize", 20,...
    "Color", [217 83 25]/255, "DisplayName", "Filled outliers")
% Plot outlier thresholds
plot([(1:numel(A))'; missing; (1:numel(A))'],...
    [thresholdH(:)*ones(numel(A), 1); missing; thresholdL(:)*ones(numel(A), 1)], ...
    "Color", [145 145 145]/255, ...
    "DisplayName", "Outlier thresholds")
% Plot outlier center
plot(center*ones(numel(A), 1), "k", "LineWidth", 1.5, "DisplayName", "Outlier center")
title("Number of outliers cleaned: " + nnz(outlierIndices))
legend('Location', 'best')

image-20220925191239519

可以验证一下离群值查找规则"mean"

1
2
3
4
5
6
7
8
9
10
11
>> mean(A)-3*std(A), mean(A)+3*std(A)
ans =
 -109.2459
ans =
  264.9792
  
>> thresholdL, thresholdH
thresholdL =
 -109.2459
thresholdH =
  264.9792

filloutliers(Data, "linear", "movmedian", window)

1
[cleanedData, outlierIndices, thresholdL, thresholdH, center] = filloutliers(Data, "linear", "movmedian", window);

该函数findmethod使用的是"movmedian"方法,该方法:Outliers are defined as elements more than three local scaled MAD from the local median over a window length specified by window. This method is also known as a Hampel filter.

对于具有N个observations的向量A,它的中值绝对偏差(median absolute deviation, MAD)定义为:

\[\mathrm{MAD}=\mathrm{median}\Big(\vert A_i-\mathrm{median}(A)\vert\Big),\quad i=1,2,\cdots, N\notag\]

scaled MAD被定义为:

\[\mathrm{scaled\ MAD}=c\times\mathrm{median}\Big(\vert A_i-\mathrm{median}(A)\vert\Big),\quad i=1,2,\cdots, N\notag\]

其中,c=-1/(sqrt(2)*erfcinv(3/2))erfcinv是逆互补误差函数(Inverse complementary error function),并且有erfcinv(erfc(x))=x

$x$的互补误差函数(Complementary error function,即erfc(x))定义为:

\[\mathrm{erfc}(x)=\dfrac{2}{\sqrt\pi}\int^{\infty}_xe^{-t^2}\mathrm{d}t=1-\mathrm{erf}(x)\notag\]

erf为误差函数(Error function),定义为: \(\mathrm{erf}(x)=\dfrac{2}{\sqrt\pi}\int^{x}_0e^{-t^2}\mathrm{d}t\notag\)

示例:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
clc, clear, close all
rng('default')

A = ceil(randn(100, 1)*10); A(20) = 200;
[cleanedA, outlierIndices, thresholdL, thresholdH, center] = ...
    filloutliers(A, "linear", "movmedian",  15);

figure, axes 
hold(gca, "on")
box(gca, "on")
grid(gca, "on")

% Plot original data
plot(A, "Color", [77 190 238]/255, "DisplayName", "Input data")
% Plot results
plot(cleanedA, "Color", [0 114 189]/255, "LineWidth", 1.5,...
    "DisplayName", "Cleaned data")
% Plot outliers
plot(find(outlierIndices), A(outlierIndices), "x", "MarkerSize", 10, ...
    "Color", [64 64 64]/255, "DisplayName", "Outliers")
% Plot filled outliers
plot(find(outlierIndices), cleanedA(outlierIndices), ".", "MarkerSize", 20,...
    "Color", [217 83 25]/255, "DisplayName", "Filled outliers")
% Plot outlier thresholds
plot([(1:numel(A))'; missing; (1:numel(A))'],...
    [thresholdH(:); missing; thresholdL(:)], ...
    "Color", [145 145 145]/255, ...
    "DisplayName", "Outlier thresholds")
% Plot outlier center
plot(center, "k", "LineWidth", 1.5, "DisplayName", "Outlier center")
title("Number of outliers cleaned: " + nnz(outlierIndices))
legend('Location', 'best')

image-20220925210702401

验证一下算法原理:

1
2
3
4
5
contrast1 = movmedian(A, 15) == center;
c = -1/(sqrt(2)*erfcinv(3/2));
contrast2 = center+3*c*movmedian(abs(A - movmedian(A, 15)), 15)-thresholdH;
contrast3 = center-3*c*movmedian(abs(A - movmedian(A, 15)), 15)-thresholdL;
nnz(contrast1), nnz(contrast2), nnz(contrast3)
1
2
3
4
5
6
ans =
   100
ans =
    72
ans =
    72

可以看到,根据算法原理计算出的值和函数输出的center是完全一致的,但是上下限只有72个点是一致的,其余28个点有偏差,这里的问题不清楚出在了哪里。

filloutliers(Data, "clip")

1
[cleanedData, outlierIndices, thresholdL, thresholdH, center] = filloutliers(Data, "clip");

另一种比较常用的fillmethod"clip"方法:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
clc, clear, close all

A = [60 59 49 49 58 100 61 57 48 58];
[cleanedA, outlierIndices, thresholdL, thresholdH, center] = ...
    filloutliers(A, "clip");

figure, axes 
hold(gca, "on")
box(gca, "on")

% Plot original data
plot(A, "Color", [77 190 238]/255)
% Plot results
plot(cleanedA, "Color", [0 114 189]/255, "LineWidth", 1.5)
% Plot outliers
plot(find(outlierIndices), A(outlierIndices), "x", "MarkerSize", 10, ...
    "Color", [64 64 64]/255)
% Plot filled outliers
plot(find(outlierIndices), cleanedA(outlierIndices), ".", "MarkerSize", 20,...
    "Color", [217 83 25]/255)
% Plot lower thrshold, upper threshold, and center
yline([thresholdL thresholdH center], ":", ["Lower Threshold","Upper Threshold","Center Value"])
title("Number of outliers cleaned: " + nnz(outlierIndices))
legend("Input data", "Cleaned data", "Outliers", "Filled outliers", "", "", "", ...
    "Location", "best")

image-20220926152540973


参考

[1] filloutliers - MathWorks.

[2] median - MathWorks.

[3] erfcinv - MathWorks.

[4] erfc - MathWorks.

[5] erf - MathWorks.

[6] rmoutliers - MathWorks.