Accuracy,
Efficiency, and Expansion: Surveying the Advancements in DETR during
2021-2023
Abstract
DEtection TRansformer (DETR) is a framework that treats object
detection as a direct set prediction problem, removing the need for
hand-designed components and utilizing a transformer encoder-decoder
architecture to improve both the accuracy and efficiency of object
detection. In the few years since its introduction, DETR has undergone
a remarkable transformation. This survey dissects key advancements,
analyzes its current state, and considers its future, revealing how
DETR is redefining object detection.
Keywords: Object Detection, Transformer, Detection Transformer
Introduction
Object detection is the task of automatically identifying and
localizing objects within an image or video. It involves using computer
vision techniques, such as deep learning models, to analyze and
classify the regions of an image that contain objects of interest. As a
fundamental building block of computer vision, object detection has
undergone a remarkable transformation in recent years.
Early efforts relied on meticulously crafted features and laborious
two-stage pipelines, struggling to achieve both accuracy and efficiency.
However, the emergence of DETR (DEtection TRansformer) in 2020 marked a
pivotal moment, introducing a novel paradigm that transcends these
limitations and opens exciting possibilities for the future of object
detection.
DETR views object detection as a set prediction problem and introduces
a remarkably concise detection pipeline: a Convolutional Neural Network
(CNN) extracts foundational features, which are fed into a Transformer
for relationship modeling, and the resulting predictions are matched
against the ground-truth objects via bipartite matching. Its key design
choices include:
Modeling Object Detection as a Set Prediction Problem:
DETR conceptualizes object detection as a set prediction problem.
Instead of treating each object individually, DETR aims to predict the
entire set of objects collectively. This global perspective is a
departure from the conventional paradigm.
Bipartite Matching for Label Assignment:
To accomplish label assignment, DETR employs a bipartite matching
strategy. The Hungarian algorithm, a combinatorial optimization
algorithm, determines the optimal one-to-one matching between the
predicted objects and the ground truth, ensuring effective and accurate
label assignment without duplicate predictions, as sketched below.
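To make the matching step concrete, here is a minimal sketch using
SciPy's Hungarian solver. It assumes a simplified cost with only a
classification term and a weighted L1 box term (the full DETR cost also
includes a generalized IoU term); the function name, the 5.0 weight,
and the toy data are illustrative, not the original implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes):
    """Match N predictions to M ground-truth objects (N >= M).

    pred_probs: (N, num_classes) class probabilities per prediction
    pred_boxes: (N, 4) predicted boxes as (cx, cy, w, h), normalized
    gt_labels:  (M,) ground-truth class indices
    gt_boxes:   (M, 4) ground-truth boxes, same format
    Returns (pred_indices, gt_indices) of the optimal assignment.
    """
    # Classification cost: negative probability of the correct class.
    cost_class = -pred_probs[:, gt_labels]                            # (N, M)
    # Box cost: L1 distance (DETR additionally uses a GIoU term).
    cost_bbox = np.abs(pred_boxes[:, None] - gt_boxes[None]).sum(-1)  # (N, M)
    cost = cost_class + 5.0 * cost_bbox  # illustrative weighting
    # Hungarian algorithm: minimum-cost one-to-one assignment.
    return linear_sum_assignment(cost)

# Toy example: 4 predictions, 2 ground-truth objects, 3 classes.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=4)
boxes = rng.uniform(0, 1, size=(4, 4))
pred_idx, gt_idx = hungarian_match(probs, boxes,
                                   np.array([0, 2]),
                                   rng.uniform(0, 1, size=(2, 4)))
print(pred_idx, gt_idx)  # which prediction is assigned to each GT object
```

Because the assignment is one-to-one, each ground-truth object is
claimed by exactly one prediction, and the unmatched predictions are
supervised toward the "no object" class.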
Transformer-based Encoder-Decoder Structure:
DETR leverages the Transformer architecture with an encoder-decoder
structure. This choice transforms object detection into an end-to-end
problem, eliminating the need for post-processing steps like Non-Maximum
Suppression (NMS). The Transformer's attention mechanism enables global
context understanding, contributing to improved detection
accuracy.
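As an illustration of this end-to-end structure, the sketch below wires
a toy CNN backbone, a standard PyTorch transformer, and learned object
queries into per-query class and box heads. MiniDETR, its layer sizes,
and its depth are hypothetical stand-ins rather than the original
implementation, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class MiniDETR(nn.Module):
    """Minimal DETR-style model: CNN features -> transformer
    encoder-decoder -> per-query class and box predictions.
    Dimensions are illustrative, not those of the original paper."""

    def __init__(self, num_classes=91, num_queries=100, d_model=256):
        super().__init__()
        # Tiny CNN backbone standing in for ResNet-50.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1),
        )
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=3, num_decoder_layers=3,
            batch_first=True,
        )
        # Learned object queries: one slot per potential detection.
        self.query_embed = nn.Embedding(num_queries, d_model)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1: "no object"
        self.box_head = nn.Linear(d_model, 4)  # (cx, cy, w, h), normalized

    def forward(self, images):                  # (B, 3, H, W)
        feats = self.backbone(images)           # (B, C, H', W')
        B, C, H, W = feats.shape
        # Flatten the feature map into a token sequence for the encoder.
        # Note: real DETR adds positional encodings here; omitted for brevity.
        src = feats.flatten(2).transpose(1, 2)  # (B, H'*W', C)
        tgt = self.query_embed.weight.unsqueeze(0).expand(B, -1, -1)
        hs = self.transformer(src, tgt)         # (B, num_queries, C)
        return self.class_head(hs), self.box_head(hs).sigmoid()

logits, boxes = MiniDETR()(torch.randn(2, 3, 256, 256))
print(logits.shape, boxes.shape)  # (2, 100, 92), (2, 100, 4)
```

Each of the num_queries output embeddings is decoded independently into
a class and a box, so the model emits a fixed-size set of predictions
in a single forward pass, with no NMS step required.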
Avoidance of Handcrafted Anchor Priors:
Unlike traditional methods that rely on manually defined anchor priors,
DETR avoids handcrafted positional priors altogether. This is achieved
through its set-based approach, making the model more flexible and less
dependent on predefined anchor boxes.