Hungarian Algorithm, NMS, Greedy Matching, SimOTA

27 May, 2026

Hungarian Algorithm, NMS, Greedy Matching, SimOTA - similar yet different concepts in object detection and tracking

When training modern object detectors or tracking objects across video frames, a fundamental question arises: How are predicted bounding boxes matched to ground truth objects?

This is a classic combinatorial optimization problem known as the Linear Sum Assignment Problem (LSAP), and in computer vision, it is solved using the Hungarian Algorithm.

What is it, and How Does It Work?

Suppose there are $N$ predicted bounding boxes and $M$ ground-truth bounding boxes. Each predicted box has a different matching cost associated with each ground-truth box. The goal is to match predicted bounding boxes to ground-truth boxes in a 1-to-1 fashion such that the total cost is minimized.

In object detection:

Predictions = Predicted bounding boxes (or tracker queries).
Targets = Ground-truth objects (or new detections).
Cost = A distance metric (like negative Intersection-over-Union (IoU), L1 distance, or class probability error).

The Hungarian algorithm solves this bipartite matching problem in $O (N^{3})$ time complexity. In Python, the standard library solver is scipy.optimize.linear_sum_assignment.

1. End-to-End Object Detection (DETR)

Traditionally, object detectors predicted thousands of candidate boxes and relied on handcrafted Non-Maximum Suppression (NMS) to filter duplicate boxes at inference time. NMS is a greedy heuristic that selects the highest-confidence box and discards overlapping boxes of the same class that exceed an IoU threshold.

DETR (DEtection TRansformer) changed this by framing detection as a direct set-prediction problem, removing NMS from the pipeline entirely. It outputs a fixed set of $N$ queries (e.g., 100) and matches them to the $M$ ground truth objects ( $M \leq N$ ) using bipartite matching during training.

Where is it in the code?

In the official facebookresearch/detr repository, the matching is performed in models/matcher.py by the HungarianMatcher class.

Here is the key logic inside HungarianMatcher.forward (lines 43–88):

# Compute the classification cost (approx. as -prob[target_class])
cost_class = -out_prob[:, tgt_ids]

# Compute the L1 cost between boxes
cost_bbox = torch.cdist(out_bbox, tgt_bbox, p=1)

# Compute the GIoU cost between boxes
cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))

# Final cost matrix C
C = self.cost_bbox * cost_bbox + self.cost_class * cost_class + self.cost_giou * cost_giou
C = C.view(bs, num_queries, -1).cpu()

# Run linear sum assignment per batch element
sizes = [len(v["boxes"]) for v in targets]
indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]

By computing a global 1-to-1 matching cost based on class prediction, L1 box loss, and Generalized IoU (GIoU), the model updates its parameters to ensure unique queries specialize in unique objects, removing the need for NMS.

2. Multi-Object Tracking (SORT & DeepSORT)

In tracking, the challenge is associating objects across video frames. Between frame $T - 1$ and frame $T$ , the tracker must determine which predicted Kalman Filter track matches which new detection.

This data association step is solved at each frame using Hungarian matching.

Where is it in the code?

In the canonical implementation of SORT (Simple Online and Realtime Tracking) by Alex Bewley (abewley/sort), this matching occurs in sort.py inside the associate_detections_to_trackers function.

The cost matrix is constructed using negative Intersection-over-Union (IoU) overlap between tracker predictions and new detections, and solved using scipy.optimize.linear_sum_assignment:

from scipy.optimize import linear_sum_assignment

# Minimizing negative IoU is equivalent to maximizing IoU overlap
cost_matrix = -iou_matrix
x, y = linear_sum_assignment(cost_matrix)
matched_indices = np.array(list(zip(x, y)))

If a matched pair has an IoU below a threshold (e.g., 0.3), the match is discarded, allowing the tracker to handle new objects entering the frame or old objects leaving.

The Plot Twist: Is it used in mAP calculation?

It is intuitive to assume that evaluating a model's mean Average Precision (mAP) would also use Hungarian matching. After all, predictions must be matched to ground truths to count True Positives (TP) and False Positives (FP).

However, standard object detection evaluation (like COCO evaluation) does NOT use Hungarian matching. It uses a Greedy Matching strategy.

Why Greedy over Hungarian for mAP?

Evaluation metrics must reward confidence calibration. If the Hungarian algorithm were used, it would globally optimize matching to maximize total IoU, potentially pairing a low-confidence prediction with a ground-truth object instead of a high-confidence one.

By forcing a greedy match sorted by confidence, the metric ensures that high-confidence false positives are penalized correctly and that the precision-recall curve reflects the reliability of the model's confidence scores.

How does Greedy Matching look in code?

Here is a simplified Python representation of greedy evaluation matching for a single image:

def match_predictions_to_ground_truths(predictions, ground_truths, iou_threshold=0.5):
    # 1. Sort predictions by confidence score descending
    predictions = sorted(predictions, key=lambda x: x['score'], reverse=True)
    
    matched_gt_indices = set()
    results = []  # List of ('TP' or 'FP') for each prediction
    
    for pred in predictions:
        best_iou = -1
        best_gt_idx = -1
        
        # 2. Find the unmatched ground truth with the highest IoU overlap
        for gt_idx, gt in enumerate(ground_truths):
            if gt_idx in matched_gt_indices:
                continue
            if pred['class'] != gt['class']:
                continue
                
            iou = compute_iou(pred['bbox'], gt['bbox'])
            if iou > best_iou:
                best_iou = iou
                best_gt_idx = gt_idx
                
        # 3. Apply the IoU gating threshold
        if best_iou >= iou_threshold:
            matched_gt_indices.add(best_gt_idx)
            results.append("TP")  # True Positive
        else:
            results.append("FP")  # False Positive
            
    return results

Dynamic Label Assignment: SimOTA in YOLOX [work in progress]

In object detectors, a major training challenge is label assignment: deciding which predicted bounding boxes (or grid cells) should be supervised as positive samples (foreground) and which as negative samples (background).

Instead of using static IoU thresholds or computationally heavy Hungarian matching during training, YOLOX introduced SimOTA (Simplified Optimal Transport Assignment). SimOTA views label assignment as an assignment cost problem solved via a fast heuristic.

To train an object detector, predictions must be mapped to ground-truth objects so the loss can be computed.

Historically, older object detectors (such as YOLOv3, YOLOv4, RetinaNet, and SSD) solved this using Prediction-Blind (Static) assignment:

Fixed anchor boxes were defined at every grid cell.
Any anchor box with an IoU overlap with a ground-truth object greater than a set threshold (e.g., 0.5) was statically assigned as positive.
This rule is prediction-blind because it matches anchors before the model even makes a prediction. It ignores the model's actual capability. A grid cell might have high geometric overlap with a target box, but the features inside that cell might be poor at predicting that class (for instance, due to occlusions). Conversely, a cell slightly further away might predict the object perfectly but get ignored because its static IoU is below the threshold.

SimOTA replaces these static rules with a Prediction-Aware (Dynamic) assignment strategy:

Instead of matching anchors before looking at predictions, the model generates bounding boxes and class probabilities first.
The cost (classification loss + regression loss) is computed for each candidate cell based on the current model weights.
Labels are dynamically assigned to the grid cells that yield the lowest total loss for each object.
The assignment adapts to the model's strengths as training progresses. If a certain cell starts predicting an object accurately, the matcher dynamically routes the supervision to that cell. This makes the training process much more stable and speeds up convergence.

A Concrete Example of Label Assignment

To illustrate the difference, consider a training image containing a single object: a dog.

Assume the model's output feature map is $16 \times 16$ (giving $256$ spatial locations).

Scenario A: Static Assignment (YOLOv3 / RetinaNet)

Candidate Boxes: Each grid location has 9 anchor boxes. Total candidates = $256 \times 9 = 2, 304$ boxes.
The Rule: The overlap (IoU) between all 2,304 anchor box templates and the dog's ground-truth box is calculated.
Matching: Anchors with an IoU > 0.5 are marked as positives. Let's assume 3 specific anchors satisfy this.
Backpropagation: The loss is backpropagated only for these 3 pre-selected anchors to force them to predict the dog.
The Problem: This choice is completely blind to what the model is actually predicting. Even if one of these 3 anchors predicts gibberish, and another anchor with an IoU of 0.45 is predicting the dog perfectly, the model is forced to penalize the 0.45 IoU anchor (treating it as background) and reward the bad static anchors.

Scenario B: Dynamic Assignment (YOLOX / SimOTA)

Candidate Boxes: YOLOX is anchor-free. It has 1 prediction per grid location. Total candidates = $256$ boxes.
The Rule: The model makes its forward pass. The classification probability and bounding box coordinates are predicted.
Matching: A cost is computed for all 256 candidates: $Cost = {Loss}_{cls} + {Loss}_{reg}$ . If a grid cell predicts "dog" with high confidence and matches the dog's coordinates closely, its cost is very low. SimOTA selects the top- $k$ locations with the lowest cost.
Backpropagation: The loss is backpropagated only on these dynamically selected locations. The training target adapts to the model's current strengths.

Does NMS filter boxes during training?

A common confusion is whether NMS is used during training to select which boxes are backpropagated.

NMS is not used during training. NMS is strictly an inference post-processing step to clean up overlapping predictions for the user. During training, the loss must be computed for all candidate locations. Label assignment algorithms (like the Hungarian matcher, SimOTA, or static IoU matching) are what determine which predicted boxes are designated as foreground (positives) and which as background (negatives) for gradient descent.

How SimOTA Works

Calculate Dynamic $k$ : For each ground truth (GT) object, the top-10 candidate predictions with the highest IoU are selected. The IoU values of these candidates are summed and rounded to the nearest integer. This sum determines the dynamic number of positive anchors ( $k$ ) allocated to that specific GT object.
Compute Cost Matrix: A matching cost matrix is computed, where each cell represents the cost of assigning a prediction to a GT object: $Cost = {Loss}_{cls} + λ \cdot {Loss}_{reg}$
Select Candidates: For each GT object, the $k$ predictions with the lowest cost in the cost matrix are selected.
Resolve Conflicts: If a single prediction is assigned to multiple GT objects, the ambiguity is resolved by assigning it to the GT object that yields the minimum cost.

Where is it in the code?

In the official Megvii-BaseDetection/YOLOX repository, this logic is implemented in the get_assignments and dynamic_k_matching methods inside yolox/models/yolo_head.py.

Here is the exact conflict-resolution snippet from YOLOX's dynamic_k_matching:

# Check if any anchor is matched to more than one ground truth
anchor_matching_gt = matching_matrix.sum(0)
if (anchor_matching_gt > 1).sum() > 0:
    # Find the GT index that yields the minimum cost for the conflicted anchor
    cost_min, cost_argmin = torch.min(cost[:, anchor_matching_gt > 1], dim=0)
    # Reset matching for the conflicted anchors
    matching_matrix[:, anchor_matching_gt > 1] *= 0
    # Assign them exclusively to the lowest-cost GT
    matching_matrix[cost_argmin, anchor_matching_gt > 1] = 1.0

Key Clarifications & Nuances

1. Is SimOTA interchangeable with the Hungarian algorithm?

In theory, both algorithms solve label assignment during training. However, they are not drop-in replacements for one another because they optimize different assignment properties:

Hungarian Matching is strictly one-to-one (Bipartite): Every ground truth object is matched to exactly one prediction query. This is essential for query-based models like DETR, where queries must learn to represent unique objects.
SimOTA is many-to-one: Because convolutional models like YOLOX do not use queries and instead predict bounding boxes across grid cells, multiple adjacent anchor points/pixels can represent the same ground truth object. SimOTA allocates a dynamic number of positive anchors ( $k \geq 1$ ) to each object. If $k$ were forced to be exactly 1 for all objects, the greedy conflict resolution in SimOTA would approximate a one-to-one matching, but it would still be a local heuristic, unlike the globally optimized assignment solved by the Hungarian algorithm.

2. How does DETR backpropagate through the Hungarian Matcher?

The Hungarian matching step (linear_sum_assignment) is a discrete optimization problem. Because selecting indices is a non-continuous step, the matching process itself is non-differentiable.

DETR circumvents this limitation using a two-stage set-prediction loss workflow:

Discrete Routing (Forward Pass): The model makes predictions. The cost matrix is detached from the computation graph and computed on the CPU using the Hungarian solver. This returns a set of static matched pairs: (prediction_index, ground_truth_index).
Differentiable Loss computation (Backward Pass): Using these fixed matching pairs as a mask, standard differentiable losses (such as Cross Entropy, L1 Box Loss, and GIoU Loss) are calculated only for the matched predictions.

Because the assignment indices are treated as constant index routers during backpropagation, gradients flow smoothly through the neural network parameters that produced those matched predictions, allowing the network to learn.

Comparative Cheat Sheet of Object Detection Concepts

To clarify confusing jargon in object detection architectures, the table below compares core structural concepts:

Concept	What It Is	Role in Matching / Loss	Examples
Bounding Box (Bbox)	A rectangle defined by coordinates representing an object's boundary (e.g., `[x, y, w, h]`).	The final output predicted by the network, or the ground-truth target used to calculate regression losses.	All detectors (YOLO, DETR, R-CNN, etc.)
Anchor Box (Prior Box / Bbox Prototype)	Handcrafted reference boxes of various sizes and aspect ratios tiled across grid locations.	Statically matched to ground truths to serve as starting templates for regression offsets.	YOLOv3, YOLOv4, RetinaNet, Faster R-CNN
Grid Cell (Feature Point / Pixel)	A single coordinate cell on the model's spatial output feature map. Introduced by YOLOv1 (where the input was logically divided into a $7 \times 7$ grid corresponding to the output feature map size).	Serves as the candidate location that directly predicts class and bounding box coordinates in anchor-free architectures.	YOLOX, FCOS, YOLOv8, original YOLOv1
Object Query (Query / Box Proposal)	A learnable positional embedding vector in transformer architectures representing an object slot.	Dynamically binds to target objects in the image via self/cross-attention and bipartite matching.	DETR, RT-DETR, Deformable DETR
Region of Interest (RoI / Region Proposal)	Coarse candidate bounding boxes generated by a Region Proposal Network (RPN).	Selected based on classification scores, pooled (e.g., RoIAlign), and refined into final predictions.	Faster R-CNN, Cascade R-CNN

Comparison of Bipartite and Greedy Matching Algorithms

The table below summarizes the key matching and filtering algorithms discussed:

Algorithm	Main Usage	Time Complexity	Pros	Cons	Possible Alternatives
Hungarian Algorithm (LSAP)	Bipartite matching in training (DETR) and tracking association (SORT)	$O (N^{3})$	Guarantees globally optimal 1-to-1 matching.	High time complexity; non-differentiable.	Jonker-Volgenant (`lapjv`), Sinkhorn (Optimal Transport)
Greedy Matching	Precision/Recall mapping in evaluation (COCO mAP)	$O (N \log N + N M)$	Evaluates confidence calibration; computationally straightforward.	Locally optimal but globally suboptimal; highly order-sensitive.	Hungarian algorithm
SimOTA	Dynamic training label assignment (YOLOX)	$O (N M)$ (highly vectorized)	Dynamic, prediction-aware positive sample count; fast convergence.	Restricted to training phase; complex conflict resolution logic.	Task-Aligned Assigner (TOOD), OTA (Exact Sinkhorn)
Non-Maximum Suppression (NMS)	Inference filtering of duplicate bounding boxes	$O (N^{2})$ (or $O (N \log N)$ )	Extremely fast and effective at cleaning dense candidate outputs.	Relies on handcrafted thresholds; struggles with heavy occlusions.	Soft-NMS, Matrix NMS, Set Prediction (DETR)

My exploration of object detection & computer vision

Hungarian Algorithm, NMS, Greedy Matching, SimOTA