The Metrics of Average Precision (AP) and Average Recall (AR) in Object Detection Tasks

May. 01, 2024

Introduction

Recently, I’ve been reading Tsung-Yi Lin1 et al.’s paper, Focal Loss for Dense Object Detection2 (Best Student Paper Award at ICCV 2017). In their ablation experiments on Focal Loss, several metrics are adopted to evaluate detectors, such as AP, AP50, AP75, APS, APM, and APL:

[Figure: ablation results table from Lin et al.’s paper2, reporting AP, AP50, AP75, APS, APM, and APL.]

These metrics are so commonly used in object detection that the authors didn’t explain them explicitly. But I am completely new to this area and had no idea what they meant, which is why I wrote this post.

AP is the abbreviation for Average Precision, but it is not simply the average of detection precision, as its literal meaning might suggest. Rather, like the AUC of the ROC curve (TPR-FPR) in classification tasks, AP is computed from a specific curve, namely the precision-recall curve.

Since I have previously studied the ROC curve for classification problems in detail (see my previous blog3), the concept of the ROC curve felt like a good starting point for introducing AP. Therefore, in the following text, I first explain the ROC curve briefly, and then introduce AP and the metrics related to it. Besides, AR, i.e., Average Recall, is also commonly used to evaluate detectors and is calculated in a similar way to AP, so I explain it as well.


The concepts of ROC curve and AUC in binary classification task

In the binary classification task, a trained classifier outputs predicted probabilities (more precisely, scores) for the input observations. We can set a threshold, and those observations whose scores are above/below the specified threshold are labeled as positive/negative. Then, according to the true labels and the corresponding predicted labels of the observations, we can obtain a 2-by-2 confusion matrix, classifying each observation as TP (true positive), FN (false negative), FP (false positive), or TN (true negative), and hence TPR (true positive rate, a.k.a. recall) and FPR (false positive rate) can be calculated:

$$\mathrm{TPR}\ (\text{a.k.a. recall})=\frac{TP}{TP+FN}\tag{1}$$

$$\mathrm{FPR}=\frac{FP}{FP+TN}\tag{2}$$

By varying the specified threshold, we can obtain different TPR-FPR pairs, based on which an ROC (Receiver Operating Characteristic) curve, showing how TPR changes with FPR, can be plotted. The area under the ROC curve (known as AUC), that is, the integral of the ROC curve over the interval $\mathrm{FPR}\in[0,1]$, is a metric that reflects the average performance of the classifier.
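As a concrete illustration of this procedure, here is a minimal sketch that sweeps the score threshold, computes TPR-FPR pairs by Eqs. (1)-(2), and integrates the resulting ROC curve with the trapezoidal rule. The labels and scores are made-up toy data, not from any real classifier.

```python
import numpy as np

# Toy data (made up for illustration): true labels and classifier scores
y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])

def roc_points(y_true, y_score):
    """Return arrays of (FPR, TPR) obtained by sweeping the score threshold."""
    thresholds = np.sort(np.unique(y_score))[::-1]
    tpr, fpr = [], []
    P, N = y_true.sum(), (1 - y_true).sum()
    for t in thresholds:
        y_pred = (y_score >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        tpr.append(tp / P)   # Eq. (1)
        fpr.append(fp / N)   # Eq. (2)
    # Prepend (0, 0) so the curve starts at the origin
    return np.array([0.0] + fpr), np.array([0.0] + tpr)

fpr, tpr = roc_points(y_true, y_score)
auc = np.trapz(tpr, fpr)   # area under the ROC curve
print(f"AUC = {auc:.3f}")
```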

More detailed information concerning ROC and AUC can be found in my previous blog3.


AP and AP-related metrics

Average Precision (AP): AUC of precision-recall curve

In object detection tasks, the detector usually predicts bounding boxes and corresponding class labels4. AP and the related metrics evaluate how well detectors predict bounding boxes.

Take single-object detection as an example.

The true position of an object (usually annotated by hand) is called the ground-truth bounding box, and the predicted position is the predicted bounding box56:

[Figure: a ground-truth bounding box and a predicted bounding box. By Adrian Rosebrock, CC BY-SA 4.0, Link.]

Based on them, we can define the notion of Intersection over Union (IoU) as follows7:

[Figure: the IoU visual equation, i.e., IoU = area of overlap / area of union. By Adrian Rosebrock, CC BY-SA 4.0, Link.]
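To make the definition concrete, here is a minimal sketch of an IoU function for axis-aligned boxes in (x1, y1, x2, y2) format; the box coordinates in the example are made up for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Toy example: a ground-truth box and a predicted box (coordinates are made up)
print(iou((10, 10, 60, 60), (20, 20, 70, 70)))  # about 0.47
```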

Apparently, different cases yield different IoU values (ranging over [0, 1]); for example8,

[Figure: examples of poor, good, and excellent IoU scores. By Adrian Rosebrock, CC BY-SA 4.0, Link.]

Here, by adopting different IoU thresholds, we can classify the predicted detection results as TP or FP.

Take the above case. If we set the IoU threshold to 0.5, then we have:

$$TP=2,\ FP=1\tag{3}$$

and if the threshold is 0.8, then:

$$TP=1,\ FP=2\tag{4}$$

In object detection, somewhat differently from constructing the ROC curve based on TPR-FPR pairs in the classification task, we want to obtain a precision-recall curve (p-r curve, or $p(r)$), where the definition of precision is9:

$$\mathrm{precision}=\frac{TP}{TP+FP}\tag{5}$$

and the definition of recall is the same as in Eq. (1), where FN represents the case that “a detector does not detect an object that is present—either by silently failing to detect the object’s presence entirely, or by miscalculating the object’s location or classification”10, or rather, “If a ground-truth bounding box has no TP detection to match with, it is labeled as a false negative FN.”11

For the aforementioned two cases, assuming that FN is 2 in both, the precision-recall pairs are (0.67, 0.5) and (0.33, 0.33), respectively. By computing more such precision-recall pairs (in practice, as discussed below, by sweeping the detection confidence threshold at a fixed IoU threshold), the curve $p(r)$ can be plotted.
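The following sketch reproduces those two pairs. The three detection IoUs (0.85, 0.6, 0.3) and the FN count of 2 are hypothetical values chosen to match the worked numbers above, not data from any real detector.

```python
import numpy as np

# Hypothetical IoUs of three detections with their ground-truth boxes
ious = np.array([0.85, 0.60, 0.30])
fn = 2  # ground-truth boxes with no matching detection (assumed, as in the text)

def precision_recall(ious, iou_thr, fn):
    tp = int(np.sum(ious >= iou_thr))   # detections counted as TP at this IoU threshold
    fp = len(ious) - tp                 # the rest are FP
    precision = tp / (tp + fp)          # Eq. (5)
    recall = tp / (tp + fn)             # Eq. (1)
    return precision, recall

for thr in (0.5, 0.8):
    p, r = precision_recall(ious, thr, fn)
    print(f"IoU threshold {thr}: precision = {p:.2f}, recall = {r:.2f}")
# IoU threshold 0.5: precision = 0.67, recall = 0.50
# IoU threshold 0.8: precision = 0.33, recall = 0.33
```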

Similar to the AUC defined for the ROC curve in the binary classification task, for object detection, “Average Precision (AP) computes the average value of $p(r)$ over the interval from $r=0$ to $r=1$, that is, the area under the precision-recall curve”12.

Note that, in practice, the IoU thresholds considered usually start from 0.5 rather than 0. This point will be explained later.
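Given a set of precision-recall points, AP can be approximated numerically as the area under $p(r)$. The sketch below uses a simple trapezoidal rule on made-up points; note that COCO’s official implementation instead interpolates precision at 101 fixed recall levels, so the numbers here are only illustrative.

```python
import numpy as np

# Hypothetical precision-recall pairs, sorted by increasing recall
recall    = np.array([0.0, 0.2, 0.4, 0.5, 0.7, 0.9, 1.0])
precision = np.array([1.0, 0.9, 0.8, 0.75, 0.6, 0.4, 0.2])

# AP as the area under the precision-recall curve (trapezoidal approximation)
ap = np.trapz(precision, recall)
print(f"AP ≈ {ap:.3f}")
```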

The COCO (Common Objects in Context) official website provides a webpage with more detailed information about its evaluation metrics13. In addition to AP, there are several related metrics:

Table I. COCO detection evaluation metrics.

| Metric | Description |
| --- | --- |
| **Average Precision (AP)** | |
| AP | AP at IoU=.50:.05:.95 (primary challenge metric) |
| AP^IoU=.50 | AP at IoU=.50 (PASCAL VOC metric) |
| AP^IoU=.75 | AP at IoU=.75 (strict metric) |
| **AP Across Scales** | |
| AP^small | AP for small objects: area < 32² |
| AP^medium | AP for medium objects: 32² < area < 96² |
| AP^large | AP for large objects: area > 96² |
| **Average Recall (AR)** | |
| AR^max=1 | AR given 1 detection per image |
| AR^max=10 | AR given 10 detections per image |
| AR^max=100 | AR given 100 detections per image |
| **AR Across Scales** | |
| AR^small | AR for small objects: area < 32² |
| AR^medium | AR for medium objects: 32² < area < 96² |
| AR^large | AR for large objects: area > 96² |

It’s not hard to figure out that the metrics AP50, AP75, APS, APM, and APL in Lin’s paper correspond to AP^IoU=.50, AP^IoU=.75, AP^small, AP^medium, and AP^large in Table I, respectively.
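In practice, all of the metrics in Table I are what pycocotools prints when a detector’s results are evaluated against COCO-format annotations. A minimal usage sketch follows; the file paths are placeholders, not files from this post.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: ground-truth annotations and detection results in COCO JSON format
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("detections.json")

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()    # match detections to ground truth at IoU=.50:.05:.95
coco_eval.accumulate()  # build the precision/recall arrays
coco_eval.summarize()   # print AP, AP50, AP75, APS/M/L, AR@1/10/100, ARS/M/L
```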

After learning the previous content, I can understand what AP, APS, APM, and APL mean and how they work. For example, the metrics in the “AP Across Scales” part judge whether detectors behave well when detecting small (area < 32²), medium (32² < area < 96²), and large (area > 96²) objects.

However, I initially felt confused about AP^IoU=.50 and AP^IoU=.75. The references I found all point out that AP metrics are computed based on the precision-recall curve1415, but how can a precision-recall curve be determined from only one precision-recall pair obtained by setting the IoU threshold to 0.5 or 0.75? The resolution is that, for these metrics, the IoU threshold is held fixed (at 0.5 or 0.75) and only serves as the matching criterion deciding whether a detection counts as TP or FP; the precision-recall curve itself is traced by sweeping the detection confidence threshold, i.e., by ranking detections by their predicted scores and accumulating precision and recall down the ranked list, as in the PASCAL VOC protocol15. So AP^IoU=.50 and AP^IoU=.75 are still areas under complete precision-recall curves, just evaluated under a looser or stricter matching criterion.


AR and AR-related metrics

Average Recall (AR): AUC of recall-IoU curve

As can be seen from Table I, besides AP, the COCO detection challenge also adopts AR-related metrics.

AR is the abbreviation for Average Recall. According to Zenggyu’s blog16, similar to how we create the precision-recall curve when calculating AP, we can obtain a series of recall values by varying the IoU threshold. A recall-IoU curve can be plotted based on these recall-IoU pairs, and AR is defined from the area under this recall-IoU curve over the interval [0.5, 1], normalized by the interval length (hence the factor of 2 below). In continuous form it can be expressed as:

$$\mathrm{AR}=2\int_{0.5}^{1}\mathrm{recall}(o)\,\mathrm{d}o\tag{6}$$

where $o$ is the IoU threshold and $\mathrm{recall}(o)$ is the corresponding recall.
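As a sanity check on Eq. (6), the following sketch discretizes the integral. Given, for each ground-truth box, the best IoU achieved by any proposal (the values below are made up), recall(o) is the fraction of ground-truth boxes whose best IoU is at least o, and AR is twice the area under this recall-IoU curve over [0.5, 1].

```python
import numpy as np

# Hypothetical best IoU between each ground-truth box and its closest proposal
best_ious = np.array([0.95, 0.8, 0.75, 0.6, 0.55, 0.3])

# recall(o) on a grid of IoU thresholds o in [0.5, 1]
thresholds = np.linspace(0.5, 1.0, 101)
recalls = np.array([(best_ious >= o).mean() for o in thresholds])

# Eq. (6): AR = 2 * integral of recall(o) over o in [0.5, 1]
ar = 2 * np.trapz(recalls, thresholds)
print(f"AR ≈ {ar:.3f}")
```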

Eq. (6) is consistent with the paper What makes for effective detection proposals?17, which is referred to by the COCO evaluation documentation13 when introducing the AR metric (it is actually the paper that proposed the AR metric):

[Screenshots from Hosang et al.’s paper17: the definition of the average recall (AR) metric.]

Once AP is understood, it’s not difficult to understand how AR is computed. However, as for the specific meaning of “AR^max=X: AR given X (1/10/100) detection(s) per image” in Table I, I was not so sure.

Some references explain them, like Picsellia’s blog18:

[Screenshot from Picsellia’s blog18 explaining AR^max=1/10/100.]

which says that “X” is the confidence threshold used to determine whether a prediction is considered a true positive or a false positive, and that this influences the recall values. But it was not very clear, at least to me: didn’t we take the IoU threshold as the threshold that determines recall when plotting the recall-IoU curve?

So I read Hosang’s paper17 again and found that “X” in “AR^max=X: AR given X (1/10/100) detection(s) per image” represents the maximum number of proposals (detections) per image taken into account.

Ma’s paper19 states that “Object proposal aims to generate a certain amount of candidate bounding boxes to determine the potential objects and their locations in an image, which is widely applied to many visual tasks for pre-processing, e.g., object detection, segmentation, object discovery, and 3D match.” Therefore, a candidate bounding box can be viewed as a proposal.

Although I haven’t done any specific object detection work myself, after searching some references I learned that there is a type of network called a Region Proposal Network (RPN)20, which generates multiple scored proposals for a single image. Therefore, if we calculate AR based only on the highest-scoring proposal of each image, the resulting metric is the so-called AR^max=1; similarly, if we calculate AR based on the 100 highest-scoring proposals of each image, we get AR^max=100; and so on. It’s not hard to conclude that, if the proposal generator is reasonable, AR^max=X tends to increase as X increases, since allowing more proposals per image gives each ground-truth object a better chance of being matched, as shown in another figure in Hosang’s paper17:

[Figure from Hosang et al.’s paper17 showing this tendency.]
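Under that interpretation, a hedged sketch of AR^max=k would keep only the k highest-scoring proposals per image before matching them to the ground truth and then apply Eq. (6). The proposal boxes, scores, and ground-truth boxes below are hypothetical, and the IoU helper mirrors the earlier sketch.

```python
import numpy as np

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes (same as the earlier sketch)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical proposals for one image, as (score, box) pairs, and ground-truth boxes
proposals = [
    (0.95, (12, 10, 58, 62)),
    (0.80, (100, 100, 150, 160)),
    (0.40, (40, 45, 90, 95)),
]
gt_boxes = [(10, 10, 60, 60), (105, 102, 148, 158)]

def average_recall(best_ious):
    """Eq. (6): twice the area under the recall-IoU curve over [0.5, 1]."""
    thresholds = np.linspace(0.5, 1.0, 101)
    recalls = np.array([(best_ious >= o).mean() for o in thresholds])
    return 2 * np.trapz(recalls, thresholds)

for k in (1, 2, 3):
    # Keep only the k highest-scoring proposals, then match each ground truth to its best one
    topk = sorted(proposals, key=lambda p: p[0], reverse=True)[:k]
    best_ious = np.array([max(iou(gt, box) for _, box in topk) for gt in gt_boxes])
    print(f"AR given at most {k} detection(s) per image: {average_recall(best_ious):.3f}")
```

With these made-up numbers, AR increases as k grows, which matches the tendency discussed above.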


By the way …

The integration interval when calculating AP and AR

One thing that should be noted is that, although in theory we could compute the precision-recall/recall-IoU curves with IoU thresholds ranging from 0 to 1, in practice the IoU thresholds used to calculate AP/AR usually start from 0.5, as can be seen in Table I and in the examples of calculating AP and AR above.

On this point, a footnote in Hosang’s paper17 provides a fairly clear explanation; there is no need to discuss it further:

[Screenshot of the relevant footnote from Hosang et al.’s paper17.]

mAP and mAR

As mentioned before, many detectors predict not only the positions of bounding boxes but also the labels (categories) of the objects in the boxes. Therefore, if we calculate AP/AR for each category and then average them over all categories, we obtain the metrics mean Average Precision (mAP) and mean Average Recall (mAR).
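As a trivial sketch of this averaging (the per-category AP values are made up):

```python
import numpy as np

# Hypothetical per-category AP values for a four-class detector
ap_per_category = {"person": 0.62, "car": 0.55, "dog": 0.48, "cat": 0.51}

# mAP is simply the mean of the per-category APs
map_value = np.mean(list(ap_per_category.values()))
print(f"mAP = {map_value:.3f}")
```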

That being said, the nuance between AP/AR and mAP/mAR is sometimes deliberately neglected, as in the COCO detection challenge413:

“AP is averaged over all categories. Traditionally, this is called “mean average precision” (mAP). We make no distinction between AP and mAP (and likewise AR and mAR) and assume the difference is clear from context.”

However, this doesn’t mean that per-category AP doesn’t matter. For example, Hosang’s paper17 made a detailed comparison and analysis of AP at the category level:

[Figure from Hosang et al.’s paper17: comparison of AP at the category level.]

So, although the COCO detection evaluation webpage13 says their code doesn’t emphasize the difference between AP and mAP, I think the COCO API21 still supports computing and outputting AP for each category while evaluating detectors. I would verify this point further if I have the opportunity to work on a specific object detection task.
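For reference, here is a hedged sketch of how per-category AP might be pulled out of a finished COCOeval run. It assumes, based on my reading of the pycocotools code (not verified here), that coco_eval.eval["precision"] is indexed as [iou_threshold, recall, category, area_range, max_detections], and it reuses the coco_eval and coco_gt objects from the earlier sketch.

```python
import numpy as np

# Assumes coco_eval.evaluate() and coco_eval.accumulate() have already been run
# (see the earlier pycocotools sketch). The 'precision' array is assumed to be
# indexed as [iou_thr, recall, category, area_range, max_dets].
precision = coco_eval.eval["precision"]

for k, cat_id in enumerate(coco_eval.params.catIds):
    # All IoU thresholds and recall levels, area range "all", max 100 detections
    p = precision[:, :, k, 0, 2]
    p = p[p > -1]                        # entries of -1 mark undefined precision
    ap_k = p.mean() if p.size else float("nan")
    name = coco_gt.loadCats(cat_id)[0]["name"]
    print(f"AP for category '{name}': {ap_k:.3f}")

# Averaging these per-category APs recovers the overall (m)AP.
```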


References

  1. Tsung-Yi Lin

  2. Lin, Tsung-Yi, et al. “Focal Loss for Dense Object Detection.” Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, available at: [1708.02002] Focal Loss for Dense Object Detection

  3. Receiver Operating Characteristic (ROC) Curve in MATLAB - What a starry night~

  4. mAP (mean Average Precision) might confuse you! | by Shivy Yohanandan | Towards Data Science

  5. Jaccard index - Wikipedia

  6. Intersection over Union - object detection bounding boxes - Jaccard index - Wikipedia

  7. Intersection over Union - visual equation - Jaccard index - Wikipedia

  8. Intersection over Union - poor, good and excellent score - Jaccard index - Wikipedia

  9. Precision and recall - Wikipedia

  10. Miller, Dimity, et al. “What’s in the Black Box? The False Negative Mechanisms Inside Object Detectors.” IEEE Robotics and Automation Letters 7.3 (2022): 8510-8517, available at: [2203.07662] What’s in the Black Box? The False Negative Mechanisms Inside Object Detectors

  11. Sun, Xian, et al. “FAIR1M: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery.” ISPRS Journal of Photogrammetry and Remote Sensing 184 (2022): 116-130, available at: FAIR1M: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery - ScienceDirect

  12. Evaluation measures (information retrieval): Average Precision - Wikipedia

  13. COCO - Common Objects in Context

  14. Evaluation Metrics for Object detection algorithms | by Vijay Dubey | Medium

  15. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Development Kit

  16. Zenggyu’s blog - An Introduction to Evaluation Metrics for Object Detection

  17. Hosang, Jan, et al. “What makes for effective detection proposals?” IEEE Transactions on Pattern Analysis and Machine Intelligence 38.4 (2015): 814-830, available at: [1502.05082] What makes for effective detection proposals?

  18. COCO Evaluation metrics explained — Picsellia

  19. Ma, Jianxiang, et al. “Object-level Proposals.” Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, available at: ICCV 2017 Open Access Repository

  20. Region Proposal Network (RPN) in Object Detection - GeeksforGeeks

  21. cocodataset/cocoapi: COCO API - Dataset @ http://cocodataset.org/