My exploration of object detection & computer vision

Hungarian Algorithm, NMS, Greedy Matching, SimOTA

Hungarian Algorithm, NMS, Greedy Matching, SimOTA - similar yet different concepts in object detection and tracking

When training modern object detectors or tracking objects across video frames, a fundamental question arises: How are predicted bounding boxes matched to ground truth objects?

This is a classic combinatorial optimization problem known as the Linear Sum Assignment Problem (LSAP), and in computer vision, it is solved using the Hungarian Algorithm.


What is it, and How Does It Work?

Suppose there are N predicted bounding boxes and M ground-truth bounding boxes. Each predicted box has a different matching cost associated with each ground-truth box. The goal is to match predicted bounding boxes to ground-truth boxes in a 1-to-1 fashion such that the total cost is minimized.

In object detection:

The Hungarian algorithm solves this bipartite matching problem in O(N3) time complexity. In Python, the standard library solver is scipy.optimize.linear_sum_assignment.


1. End-to-End Object Detection (DETR)

Traditionally, object detectors predicted thousands of candidate boxes and relied on handcrafted Non-Maximum Suppression (NMS) to filter duplicate boxes at inference time. NMS is a greedy heuristic that selects the highest-confidence box and discards overlapping boxes of the same class that exceed an IoU threshold.

DETR (DEtection TRansformer) changed this by framing detection as a direct set-prediction problem, removing NMS from the pipeline entirely. It outputs a fixed set of N queries (e.g., 100) and matches them to the M ground truth objects (MN) using bipartite matching during training.

Where is it in the code?

In the official facebookresearch/detr repository, the matching is performed in models/matcher.py by the HungarianMatcher class.

Here is the key logic inside HungarianMatcher.forward (lines 43–88):

# Compute the classification cost (approx. as -prob[target_class])
cost_class = -out_prob[:, tgt_ids]

# Compute the L1 cost between boxes
cost_bbox = torch.cdist(out_bbox, tgt_bbox, p=1)

# Compute the GIoU cost between boxes
cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))

# Final cost matrix C
C = self.cost_bbox * cost_bbox + self.cost_class * cost_class + self.cost_giou * cost_giou
C = C.view(bs, num_queries, -1).cpu()

# Run linear sum assignment per batch element
sizes = [len(v["boxes"]) for v in targets]
indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]

By computing a global 1-to-1 matching cost based on class prediction, L1 box loss, and Generalized IoU (GIoU), the model updates its parameters to ensure unique queries specialize in unique objects, removing the need for NMS.


2. Multi-Object Tracking (SORT & DeepSORT)

In tracking, the challenge is associating objects across video frames. Between frame T1 and frame T, the tracker must determine which predicted Kalman Filter track matches which new detection.

This data association step is solved at each frame using Hungarian matching.

Where is it in the code?

In the canonical implementation of SORT (Simple Online and Realtime Tracking) by Alex Bewley (abewley/sort), this matching occurs in sort.py inside the associate_detections_to_trackers function.

The cost matrix is constructed using negative Intersection-over-Union (IoU) overlap between tracker predictions and new detections, and solved using scipy.optimize.linear_sum_assignment:

from scipy.optimize import linear_sum_assignment

# Minimizing negative IoU is equivalent to maximizing IoU overlap
cost_matrix = -iou_matrix
x, y = linear_sum_assignment(cost_matrix)
matched_indices = np.array(list(zip(x, y)))

If a matched pair has an IoU below a threshold (e.g., 0.3), the match is discarded, allowing the tracker to handle new objects entering the frame or old objects leaving.


The Plot Twist: Is it used in mAP calculation?

It is intuitive to assume that evaluating a model's mean Average Precision (mAP) would also use Hungarian matching. After all, predictions must be matched to ground truths to count True Positives (TP) and False Positives (FP).

However, standard object detection evaluation (like COCO evaluation) does NOT use Hungarian matching. It uses a Greedy Matching strategy.

Why Greedy over Hungarian for mAP?

Evaluation metrics must reward confidence calibration. If the Hungarian algorithm were used, it would globally optimize matching to maximize total IoU, potentially pairing a low-confidence prediction with a ground-truth object instead of a high-confidence one.

By forcing a greedy match sorted by confidence, the metric ensures that high-confidence false positives are penalized correctly and that the precision-recall curve reflects the reliability of the model's confidence scores.

How does Greedy Matching look in code?

Here is a simplified Python representation of greedy evaluation matching for a single image:

def match_predictions_to_ground_truths(predictions, ground_truths, iou_threshold=0.5):
    # 1. Sort predictions by confidence score descending
    predictions = sorted(predictions, key=lambda x: x['score'], reverse=True)
    
    matched_gt_indices = set()
    results = []  # List of ('TP' or 'FP') for each prediction
    
    for pred in predictions:
        best_iou = -1
        best_gt_idx = -1
        
        # 2. Find the unmatched ground truth with the highest IoU overlap
        for gt_idx, gt in enumerate(ground_truths):
            if gt_idx in matched_gt_indices:
                continue
            if pred['class'] != gt['class']:
                continue
                
            iou = compute_iou(pred['bbox'], gt['bbox'])
            if iou > best_iou:
                best_iou = iou
                best_gt_idx = gt_idx
                
        # 3. Apply the IoU gating threshold
        if best_iou >= iou_threshold:
            matched_gt_indices.add(best_gt_idx)
            results.append("TP")  # True Positive
        else:
            results.append("FP")  # False Positive
            
    return results

Dynamic Label Assignment: SimOTA in YOLOX [work in progress]

In object detectors, a major training challenge is label assignment: deciding which predicted bounding boxes (or grid cells) should be supervised as positive samples (foreground) and which as negative samples (background).

Instead of using static IoU thresholds or computationally heavy Hungarian matching during training, YOLOX introduced SimOTA (Simplified Optimal Transport Assignment). SimOTA views label assignment as an assignment cost problem solved via a fast heuristic.

Why is SimOTA Needed? (Prediction-Blind vs. Prediction-Aware)

To train an object detector, predictions must be mapped to ground-truth objects so the loss can be computed.

Historically, older object detectors (such as YOLOv3, YOLOv4, RetinaNet, and SSD) solved this using Prediction-Blind (Static) assignment:

SimOTA replaces these static rules with a Prediction-Aware (Dynamic) assignment strategy:


A Concrete Example of Label Assignment

To illustrate the difference, consider a training image containing a single object: a dog.

Assume the model's output feature map is 16×16 (giving 256 spatial locations).

Scenario A: Static Assignment (YOLOv3 / RetinaNet)

Scenario B: Dynamic Assignment (YOLOX / SimOTA)

Does NMS filter boxes during training?

A common confusion is whether NMS is used during training to select which boxes are backpropagated.

NMS is not used during training. NMS is strictly an inference post-processing step to clean up overlapping predictions for the user. During training, the loss must be computed for all candidate locations. Label assignment algorithms (like the Hungarian matcher, SimOTA, or static IoU matching) are what determine which predicted boxes are designated as foreground (positives) and which as background (negatives) for gradient descent.


How SimOTA Works

  1. Calculate Dynamic k: For each ground truth (GT) object, the top-10 candidate predictions with the highest IoU are selected. The IoU values of these candidates are summed and rounded to the nearest integer. This sum determines the dynamic number of positive anchors (k) allocated to that specific GT object.
  2. Compute Cost Matrix: A matching cost matrix is computed, where each cell represents the cost of assigning a prediction to a GT object: Cost=Losscls+λ·Lossreg
  3. Select Candidates: For each GT object, the k predictions with the lowest cost in the cost matrix are selected.
  4. Resolve Conflicts: If a single prediction is assigned to multiple GT objects, the ambiguity is resolved by assigning it to the GT object that yields the minimum cost.

Where is it in the code?

In the official Megvii-BaseDetection/YOLOX repository, this logic is implemented in the get_assignments and dynamic_k_matching methods inside yolox/models/yolo_head.py.

Here is the exact conflict-resolution snippet from YOLOX's dynamic_k_matching:

# Check if any anchor is matched to more than one ground truth
anchor_matching_gt = matching_matrix.sum(0)
if (anchor_matching_gt > 1).sum() > 0:
    # Find the GT index that yields the minimum cost for the conflicted anchor
    cost_min, cost_argmin = torch.min(cost[:, anchor_matching_gt > 1], dim=0)
    # Reset matching for the conflicted anchors
    matching_matrix[:, anchor_matching_gt > 1] *= 0
    # Assign them exclusively to the lowest-cost GT
    matching_matrix[cost_argmin, anchor_matching_gt > 1] = 1.0

Key Clarifications & Nuances

1. Is SimOTA interchangeable with the Hungarian algorithm?

In theory, both algorithms solve label assignment during training. However, they are not drop-in replacements for one another because they optimize different assignment properties:

2. How does DETR backpropagate through the Hungarian Matcher?

The Hungarian matching step (linear_sum_assignment) is a discrete optimization problem. Because selecting indices is a non-continuous step, the matching process itself is non-differentiable.

DETR circumvents this limitation using a two-stage set-prediction loss workflow:

  1. Discrete Routing (Forward Pass): The model makes predictions. The cost matrix is detached from the computation graph and computed on the CPU using the Hungarian solver. This returns a set of static matched pairs: (prediction_index, ground_truth_index).
  2. Differentiable Loss computation (Backward Pass): Using these fixed matching pairs as a mask, standard differentiable losses (such as Cross Entropy, L1 Box Loss, and GIoU Loss) are calculated only for the matched predictions.

Because the assignment indices are treated as constant index routers during backpropagation, gradients flow smoothly through the neural network parameters that produced those matched predictions, allowing the network to learn.


Comparative Cheat Sheet of Object Detection Concepts

To clarify confusing jargon in object detection architectures, the table below compares core structural concepts:

Concept What It Is Role in Matching / Loss Examples
Bounding Box (Bbox) A rectangle defined by coordinates representing an object's boundary (e.g., [x, y, w, h]). The final output predicted by the network, or the ground-truth target used to calculate regression losses. All detectors (YOLO, DETR, R-CNN, etc.)
Anchor Box (Prior Box / Bbox Prototype) Handcrafted reference boxes of various sizes and aspect ratios tiled across grid locations. Statically matched to ground truths to serve as starting templates for regression offsets. YOLOv3, YOLOv4, RetinaNet, Faster R-CNN
Grid Cell (Feature Point / Pixel) A single coordinate cell on the model's spatial output feature map. Introduced by YOLOv1 (where the input was logically divided into a 7×7 grid corresponding to the output feature map size). Serves as the candidate location that directly predicts class and bounding box coordinates in anchor-free architectures. YOLOX, FCOS, YOLOv8, original YOLOv1
Object Query (Query / Box Proposal) A learnable positional embedding vector in transformer architectures representing an object slot. Dynamically binds to target objects in the image via self/cross-attention and bipartite matching. DETR, RT-DETR, Deformable DETR
Region of Interest (RoI / Region Proposal) Coarse candidate bounding boxes generated by a Region Proposal Network (RPN). Selected based on classification scores, pooled (e.g., RoIAlign), and refined into final predictions. Faster R-CNN, Cascade R-CNN

Comparison of Bipartite and Greedy Matching Algorithms

The table below summarizes the key matching and filtering algorithms discussed:

Algorithm Main Usage Time Complexity Pros Cons Possible Alternatives
Hungarian Algorithm (LSAP) Bipartite matching in training (DETR) and tracking association (SORT) O(N3) Guarantees globally optimal 1-to-1 matching. High time complexity; non-differentiable. Jonker-Volgenant (lapjv), Sinkhorn (Optimal Transport)
Greedy Matching Precision/Recall mapping in evaluation (COCO mAP) O(NlogN+NM) Evaluates confidence calibration; computationally straightforward. Locally optimal but globally suboptimal; highly order-sensitive. Hungarian algorithm
SimOTA Dynamic training label assignment (YOLOX) O(NM) (highly vectorized) Dynamic, prediction-aware positive sample count; fast convergence. Restricted to training phase; complex conflict resolution logic. Task-Aligned Assigner (TOOD), OTA (Exact Sinkhorn)
Non-Maximum Suppression (NMS) Inference filtering of duplicate bounding boxes O(N2) (or O(NlogN)) Extremely fast and effective at cleaning dense candidate outputs. Relies on handcrafted thresholds; struggles with heavy occlusions. Soft-NMS, Matrix NMS, Set Prediction (DETR)