YOLOv8 vs DeepSORT: Which Actually Works on a Factory Floor?
Here’s the problem: you’ve got a conveyor belt moving parts at 0.5 m/s, multiple operators reaching in and out of frame, and management wants to know cycle time, bottlenecks, and whether anyone’s skipping steps. You need to detect objects and track them across frames—not just spot a wrench in frame 1 and a different wrench in frame 30.
I compared two approaches on real assembly line footage (simulated with a Raspberry Pi camera filming LEGO Technic assembly; judge me all you want, but the motion patterns are nearly identical to PCB assembly). The first uses YOLOv8 detection plus a centroid tracker (simple distance matching). The second uses YOLOv8 + DeepSORT (the re-ID-based tracker everyone talks about but fewer people have actually deployed).
Spoiler: DeepSORT won on accuracy, but the centroid tracker won on “will this actually run on an edge device without catching fire.”

Centroid Tracking: The Dumb Solution That Works
Centroid tracking is embarrassingly simple. For each detected bounding box in frame $t$, compute its center $(c_x, c_y) = \big(\tfrac{x_1 + x_2}{2}, \tfrac{y_1 + y_2}{2}\big)$. For each tracked object from frame $t-1$, compute the Euclidean distance:

$$d = \sqrt{\big(c_x^{(t)} - c_x^{(t-1)}\big)^2 + \big(c_y^{(t)} - c_y^{(t-1)}\big)^2}$$

Match boxes to tracks using the Hungarian algorithm (or just greedy nearest-neighbor if you're lazy). If $d$ exceeds a threshold (say 50 pixels), it's a new object. If a track goes unmatched for $N$ consecutive frames, delete it.
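If you want the optimal assignment rather than greedy matching, SciPy's linear_sum_assignment solves the Hungarian problem in a few lines. A minimal sketch (the implementation below sticks with greedy nearest-neighbor, so this helper is just for illustration):

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def hungarian_match(track_centroids, det_centroids, max_distance=50):
    """Return (track_idx, det_idx) pairs whose distance stays under max_distance."""
    D = cdist(track_centroids, det_centroids)   # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(D)       # globally minimal total distance
    return [(r, c) for r, c in zip(rows, cols) if D[r, c] <= max_distance]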
Here’s the implementation I used, built on top of Ultralytics YOLOv8:
import cv2
import numpy as np
from ultralytics import YOLO
from scipy.spatial.distance import cdist
from collections import defaultdict
class CentroidTracker:
    def __init__(self, max_disappeared=30, max_distance=50):
        self.next_id = 0
        self.objects = {}  # id -> centroid
        self.disappeared = defaultdict(int)
        self.max_disappeared = max_disappeared
        self.max_distance = max_distance

    def register(self, centroid):
        self.objects[self.next_id] = centroid
        self.disappeared[self.next_id] = 0
        self.next_id += 1
        return self.next_id - 1

    def deregister(self, obj_id):
        del self.objects[obj_id]
        del self.disappeared[obj_id]

    def update(self, detections):
        # detections: list of (x1, y1, x2, y2, conf, cls)
        if len(detections) == 0:
            for obj_id in list(self.disappeared.keys()):
                self.disappeared[obj_id] += 1
                if self.disappeared[obj_id] > self.max_disappeared:
                    self.deregister(obj_id)
            return self.objects

        centroids = np.array([[(d[0]+d[2])/2, (d[1]+d[3])/2] for d in detections])

        if len(self.objects) == 0:
            for c in centroids:
                self.register(c)
        else:
            obj_ids = list(self.objects.keys())
            obj_centroids = np.array([self.objects[i] for i in obj_ids])
            D = cdist(obj_centroids, centroids)

            rows = D.min(axis=1).argsort()
            cols = D.argmin(axis=1)[rows]

            used_rows, used_cols = set(), set()
            for (row, col) in zip(rows, cols):
                if row in used_rows or col in used_cols:
                    continue
                if D[row, col] > self.max_distance:
                    continue
                obj_id = obj_ids[row]
                self.objects[obj_id] = centroids[col]
                self.disappeared[obj_id] = 0
                used_rows.add(row)
                used_cols.add(col)

            unused_rows = set(range(D.shape[0])) - used_rows
            unused_cols = set(range(D.shape[1])) - used_cols

            for row in unused_rows:
                obj_id = obj_ids[row]
                self.disappeared[obj_id] += 1
                if self.disappeared[obj_id] > self.max_disappeared:
                    self.deregister(obj_id)

            for col in unused_cols:
                self.register(centroids[col])

        return self.objects
# Usage
model = YOLO('yolov8n.pt') # nano model, ~6MB
tracker = CentroidTracker(max_disappeared=30, max_distance=50)
cap = cv2.VideoCapture('assembly_line.mp4')
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    results = model(frame, verbose=False)[0]
    boxes = results.boxes.data.cpu().numpy()  # x1, y1, x2, y2, conf, cls

    # Filter for specific classes (e.g., parts, tools). A custom-trained model
    # where class 0 is 'part' is assumed here; in the stock COCO weights, class 0 is 'person'.
    parts = boxes[boxes[:, 5] == 0]

    objects = tracker.update(parts)

    for obj_id, centroid in objects.items():
        cv2.circle(frame, (int(centroid[0]), int(centroid[1])), 4, (0, 255, 0), -1)
        cv2.putText(frame, f"ID{obj_id}", (int(centroid[0])-10, int(centroid[1])-10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

    cv2.imshow('Tracking', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
On my test video (640×480, 30fps, Raspberry Pi 4), this ran at 28fps. YOLOv8n inference took ~25ms, tracking overhead was <1ms. The tracker held 8-12 objects simultaneously without breaking a sweat.
But here’s where it failed: occlusions. When an operator’s hand covered a part for 5+ frames, the tracker lost it. When the part reappeared, it got a new ID. For cycle time analysis, that’s a dealbreaker—you’re counting the same part twice.
DeepSORT: The Actually-Smart Tracker
DeepSORT (Simple Online and Realtime Tracking with a Deep Association Metric, Wojke et al., 2017) fixes the occlusion problem by adding a re-identification network. Instead of just matching centroids, it extracts a 128-dimensional appearance feature vector from each bounding box crop. The association metric becomes:

$$c_{i,j} = \lambda\, d^{(1)}(i, j) + (1 - \lambda)\, d^{(2)}(i, j)$$

where $d^{(1)}$ is the Mahalanobis distance (motion prediction from a Kalman filter), $d^{(2)}$ is the cosine distance between appearance features, and $\lambda$ weights the two terms (the original paper found $\lambda = 0$, i.e. appearance only with the Mahalanobis term kept as a gate, worked best under heavy camera motion).
The Kalman filter predicts the next bounding box position assuming constant velocity:

$$\mathbf{x}_t = \mathbf{F}\,\mathbf{x}_{t-1}, \qquad \mathbf{x} = (u, v, \gamma, h, \dot{u}, \dot{v}, \dot{\gamma}, \dot{h})^{\top}$$

where the state vector holds the box center $(u, v)$, aspect ratio $\gamma$, height $h$, and their velocities, and $\mathbf{F}$ is the constant-velocity transition matrix. The observation model just measures the first four components.
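To make that state layout concrete, here's a minimal numpy sketch of the constant-velocity prediction step. This is my own illustration, not DeepSORT's actual filter, which also scales the process and measurement noise with the box height:

import numpy as np

dt = 1.0  # one frame between predictions
# State layout: [cx, cy, aspect_ratio, height, vx, vy, v_aspect, v_height]
F = np.eye(8)
F[:4, 4:] = dt * np.eye(4)                    # position += velocity * dt
H = np.hstack([np.eye(4), np.zeros((4, 4))])  # we only observe the first four components

def predict(x, P, Q):
    """One Kalman prediction step: propagate the state mean and covariance."""
    return F @ x, F @ P @ F.T + Q

# A box drifting right at 2 px/frame
x0 = np.array([320.0, 240.0, 1.0, 80.0, 2.0, 0.0, 0.0, 0.0])
x1, P1 = predict(x0, np.eye(8), 0.01 * np.eye(8))
print(x1[:2])  # -> [322. 240.]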
I used the deep_sort_realtime package (a maintained fork of the original DeepSORT repo, which hasn’t been updated since 2019):
from deep_sort_realtime.deepsort_tracker import DeepSort
import cv2
from ultralytics import YOLO
model = YOLO('yolov8n.pt')
tracker = DeepSort(
    max_age=30,                # drop a track after 30 consecutive missed frames
    n_init=3,                  # require 3 consecutive detections before confirming a track
    max_iou_distance=0.7,
    max_cosine_distance=0.3,
    nn_budget=100,             # max appearance samples kept per track for re-ID matching
    embedder="mobilenet",      # lightweight MobileNetV2 appearance embedder
    half=True,                 # FP16 for speed
    bgr=True                   # frames are passed in OpenCV's BGR order
)
cap = cv2.VideoCapture('assembly_line.mp4')
frame_count = 0
track_history = {} # track_id -> [(x, y, frame), ...]
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    frame_count += 1

    results = model(frame, verbose=False)[0]
    boxes = results.boxes.data.cpu().numpy()

    # DeepSORT expects detections as ([x1, y1, w, h], conf, class)
    detections = []
    for box in boxes:
        x1, y1, x2, y2, conf, cls = box
        w = x2 - x1
        h = y2 - y1
        detections.append(([x1, y1, w, h], conf, int(cls)))

    tracks = tracker.update_tracks(detections, frame=frame)  # pass frame for re-ID features

    for track in tracks:
        if not track.is_confirmed():
            continue
        track_id = track.track_id
        ltrb = track.to_ltrb()
        x1, y1, x2, y2 = map(int, ltrb)
        cx, cy = (x1 + x2) // 2, (y1 + y2) // 2

        if track_id not in track_history:
            track_history[track_id] = []
        track_history[track_id].append((cx, cy, frame_count))

        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, f"ID{track_id}", (x1, y1-10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

    cv2.imshow('DeepSORT', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
# Analyze cycle times
for track_id, positions in track_history.items():
    if len(positions) < 10:  # ignore short tracks
        continue
    start_frame = positions[0][2]
    end_frame = positions[-1][2]
    duration_sec = (end_frame - start_frame) / 30.0  # assuming 30fps
    print(f"Track {track_id}: {duration_sec:.2f}s in frame")
This ran at 12fps on the same Raspberry Pi 4. YOLOv8n still took ~25ms, but DeepSORT's re-ID feature extraction (MobileNet backbone) added another ~50ms per frame. On an NVIDIA Jetson Nano (with CUDA), I got it up to 20fps, still below real-time for 30fps video.
But the tracking accuracy was noticeably better. When a part disappeared behind an operator’s hand for 15 frames and reappeared, DeepSORT reacquired it with the same ID 80% of the time (my informal count—didn’t run a formal benchmark). The centroid tracker gave it a new ID every single time.
What Breaks in Production
Both trackers assume detections are reasonably accurate. When YOLOv8 misses a part (false negative) for several consecutive frames, the tracker kills it. When YOLOv8 hallucinates a part (false positive)—happens more than you’d think under harsh factory lighting—the tracker creates a ghost object that lives for 30 frames.
I added a confidence threshold filter (conf > 0.5) which helped, but introduced a new failure mode: flickering tracks. A part at the edge of frame would alternate between 0.48 and 0.52 confidence, causing the track to constantly die and resurrect with a new ID.
The fix was temporal smoothing—only promote a track to “confirmed” after 3 consecutive high-confidence detections (DeepSORT’s n_init parameter). This reduced false positives but increased latency before a new part got an ID.
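For reference, the detection-side filter is just a mask over the YOLO output array, using the same (x1, y1, x2, y2, conf, cls) layout as the snippets above; the flicker comes from parts hovering right around that cut-off value:

import numpy as np

CONF_THRESH = 0.5  # tune per camera and lighting; parts near this value will flicker

def filter_detections(boxes: np.ndarray, conf_thresh: float = CONF_THRESH) -> np.ndarray:
    """Drop low-confidence rows from an (N, 6) detection array."""
    return boxes[boxes[:, 4] > conf_thresh]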
Another surprise: parts that look identical (e.g., eight black M4 screws on a tray) confuse the re-ID network. DeepSORT's appearance features couldn't distinguish them, so it would swap IDs when two screws crossed paths. My best guess is that the MobileNet embedder was trained on people, not industrial parts (the original DeepSORT appearance descriptor was trained on a person re-identification dataset and benchmarked on pedestrian tracking). Fine-tuning the re-ID network on factory-specific objects would probably help, but I haven't tested this at scale.
Measuring Workflow Metrics
Once you have stable track IDs, the analytics are straightforward. Here’s how I computed dwell time (time spent in each zone) and cycle time (total time on screen):
zones = {
    'input': (0, 100, 200, 400),        # x1, y1, x2, y2
    'assembly': (200, 100, 450, 400),
    'inspection': (450, 100, 640, 400)
}

def point_in_box(x, y, box):
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

track_zones = {}  # track_id -> {zone: frames_spent}
for track_id, positions in track_history.items():
    track_zones[track_id] = {zone: 0 for zone in zones}
    for cx, cy, frame in positions:
        for zone_name, zone_box in zones.items():
            if point_in_box(cx, cy, zone_box):
                track_zones[track_id][zone_name] += 1
                break

for track_id, zone_counts in track_zones.items():
    if sum(zone_counts.values()) < 10:
        continue
    print(f"Track {track_id}:")
    for zone, frames in zone_counts.items():
        seconds = frames / 30.0
        print(f"  {zone}: {seconds:.2f}s")
On my LEGO assembly test, typical dwell times were 3s input, 8s assembly, 2s inspection. One track spent 18s in assembly—turns out I dropped a piece and had to search for it. In a real factory, outliers like that flag bottlenecks or quality issues.
You can also compute flow rates (parts per minute entering each zone), identify backlog buildup (multiple tracks stuck in one zone), and detect deviations from standard work (e.g., a part skipping the inspection zone entirely). As we saw in Part 4, combining this with anomaly detection lets you flag unusual patterns without manually defining rules for every failure mode.
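For the flow rates mentioned above, here's a rough sketch that reuses the track_history, zones, and point_in_box names from the earlier snippets (parts per minute in a zone is just the number of first entries divided by the time span they cover):

FPS = 30.0

def zone_entry_frames(track_history, zones):
    """Collect, per zone, the frame at which each track first entered it."""
    entries = {zone: [] for zone in zones}
    for track_id, positions in track_history.items():
        seen = set()
        for cx, cy, frame in positions:
            for zone_name, zone_box in zones.items():
                if zone_name not in seen and point_in_box(cx, cy, zone_box):
                    entries[zone_name].append(frame)
                    seen.add(zone_name)
    return entries

for zone_name, frames in zone_entry_frames(track_history, zones).items():
    if len(frames) < 2:
        continue
    span_min = (max(frames) - min(frames)) / FPS / 60.0
    rate = len(frames) / span_min if span_min > 0 else 0.0
    print(f"{zone_name}: {rate:.1f} parts/min")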
Edge Deployment: The Uncomfortable Tradeoff
If you’re running this on a $200 edge device (Jetson Nano, Raspberry Pi 4, or similar), you have three knobs to turn:
- Model size: YOLOv8n (3.2M params) vs YOLOv8s (11M) vs YOLOv8m (25M). Bigger = more accurate detections but slower.
- Tracker complexity: Centroid (instant) vs DeepSORT (50ms overhead).
- Frame rate: Process every frame (30fps) vs every 2nd frame (15fps effective) vs every 3rd frame (10fps).
I’d pick YOLOv8n + centroid tracker + process every frame for applications where occlusions are rare (wide shots of a conveyor belt, no operators reaching in). The 28fps throughput gives you real-time monitoring with <50ms latency.
For cluttered scenes with frequent occlusions (assembly stations with multiple operators), I’d use YOLOv8s + DeepSORT + process every other frame. You lose temporal resolution but gain tracking robustness. The effective 10-15fps is still enough to measure cycle times (error bars get wider, but you’re not trying to catch microsecond-level events anyway).
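Frame skipping is just a modulo check around the detect-and-track call. A minimal sketch, assuming the model, tracker, and cap objects from the DeepSORT example above (DETECT_EVERY is my own knob, not a library flag):

DETECT_EVERY = 2   # run detection + tracking on every 2nd frame
frame_count = 0

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    frame_count += 1
    if frame_count % DETECT_EVERY != 0:
        continue   # effective rate drops to fps / DETECT_EVERY

    results = model(frame, verbose=False)[0]
    boxes = results.boxes.data.cpu().numpy()
    detections = [([x1, y1, x2 - x1, y2 - y1], conf, int(cls))
                  for x1, y1, x2, y2, conf, cls in boxes]
    tracks = tracker.update_tracks(detections, frame=frame)

Keep in mind that max_age and n_init count tracker updates, not wall-clock frames, so skipping every other frame roughly doubles their real-time duration unless you scale them down.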
If you have a beefy edge device (Jetson Xavier NX, ~$400) or can offload to a nearby server, use YOLOv8m + DeepSORT + full 30fps. I haven’t run this setup myself, but the online benchmarks suggest it’s feasible.
The Thing Nobody Warns You About
Camera placement matters more than algorithm choice. I spent two days tuning DeepSORT hyperparameters before realizing the camera was mounted at a 45° angle, causing severe perspective distortion. Parts in the foreground moved 3x faster (in pixels) than parts in the background, breaking the constant-velocity assumption in the Kalman filter.
The fix was remounting the camera directly overhead (orthogonal to the conveyor plane). Tracking accuracy jumped from ~60% to ~85% ID preservation across occlusions, using the exact same code.
Another gotcha: lighting flicker interacts with your frame rate. Fluorescent factory lights flicker at 100-120Hz (twice the mains frequency, depending on region). If your camera samples at exactly 30fps, you can get aliasing: frames alternate between bright and dim, which confuses the appearance features. Either sync your camera's exposure to the AC frequency or use high-frequency LED lighting.
Where This Fits in the Pipeline
Object tracking generates time-series data (track positions over time), which feeds into the analytics stack we discussed in Part 8. You can forecast zone occupancy (“inspection station will be overloaded in 20 minutes”), detect anomalies (“track 47 took 3x longer than usual in assembly”), and optimize scheduling (“slow down input conveyor to prevent backlog”).
In Part 10, we’ll look at sensor fusion—combining this vision data with IMU, RFID, and other IoT signals to get a fuller picture of what’s happening on the factory floor. Turns out visual tracking alone misses a lot (what if a part is upside-down but still detected as a valid object?), and multi-modal fusion helps catch those edge cases.
For now, here’s the decision tree I use: if you need <100ms latency and can tolerate occasional ID swaps, go centroid. If you need reliable long-term tracking (minutes, not seconds) and have 50ms+ latency budget, use DeepSORT. And if neither works—maybe because your parts are transparent, reflective, or otherwise hard to detect visually—it’s time to add non-visual sensors, which is exactly what we’ll tackle next.