
Video Tracking with Multi-Attribute Inference

computer-vision · video-processing · tracking · yolo · clip

I wanted to track people across video frames and figure out things like their age, gender, and what they were doing. So I put together a pipeline: YOLO for detection, CLIP for feature extraction, and a custom tracker to keep track of who's who across frames.

System Architecture

The pipeline is pretty straightforward - three stages:

Video Frame
    ↓
[1] Person Detection (YOLO-NAS)
    ↓
    Detections (bboxes + confidence)
    ↓
[2] Feature Extraction (CLIP)
    ↓
    Feature Vectors
    ↓
[3] Multi-Object Tracking
    ↓
    Tracked Persons + Attributes

Stage 1: Person Detection

I use YOLO-NAS Large to detect people in each frame, with a high confidence threshold (0.95). I focus on the upper body region (upper 2/3 of the bounding box) since that's where most attributes are visible.

┌─────────────────────────────────┐
│         Video Frame             │
│                                 │
│  ┌──────┐      ┌──────┐         │
│  │Person│      │Person│         │
│  │  #1  │      │  #2  │         │
│  └──────┘      └──────┘         │
└─────────────────────────────────┘
         ↓ YOLO-NAS
┌─────────────────────────────────┐
│  Detection Results              │
│  • Person #1: bbox, conf=0.97   │
│  • Person #2: bbox, conf=0.96   │
└─────────────────────────────────┘
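As a rough sketch of what this stage does (the Detection tuple and helper names here are illustrative, not the actual wrapper around YOLO-NAS), filtering by confidence and cropping the upper two-thirds looks something like this:

from typing import List, NamedTuple, Tuple

import numpy as np

class Detection(NamedTuple):
    bbox: Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixels
    confidence: float

CONF_THRESHOLD = 0.95

def filter_detections(raw_detections: List[Detection]) -> List[Detection]:
    # Keep only high-confidence person detections
    return [d for d in raw_detections if d.confidence >= CONF_THRESHOLD]

def upper_body_crop(frame: np.ndarray, bbox) -> np.ndarray:
    # Crop the upper 2/3 of the box, where most attributes are visible
    x1, y1, x2, y2 = (int(v) for v in bbox)
    y_cut = y1 + (y2 - y1) * 2 // 3
    return frame[y1:y_cut, x1:x2]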

Stage 2: Feature Extraction & Attribute Inference

Each detected person gets cropped, resized, and passed through CLIP to extract feature vectors. These features do two things:

  1. Re-identification: Matching the same person across frames
  2. Attribute inference: Figuring out age, gender, expression, etc.

Person Crop (224×224)
    ↓
CLIP Image Encoder (ONNX)
    ↓
Feature Vector (1024-dim)
    ↓
    ├─→ Attribute Explainers → [age, gender, expression, ...]
    └─→ Track Matching → [same person?]
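For context, running a CLIP image encoder exported to ONNX looks roughly like this - a minimal sketch assuming a model file named clip_image_encoder.onnx with a single NCHW float32 input (the path is illustrative; the preprocessing constants are the standard CLIP ones, but your export may differ):

import cv2
import numpy as np
import onnxruntime as ort

# Standard CLIP preprocessing statistics (RGB)
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

session = ort.InferenceSession("clip_image_encoder.onnx")   # illustrative path
input_name = session.get_inputs()[0].name

def extract_features(crop_bgr: np.ndarray) -> np.ndarray:
    # Resize to 224x224, convert BGR -> RGB, normalize, reorder to NCHW
    img = cv2.resize(crop_bgr, (224, 224)).astype(np.float32) / 255.0
    img = (img[:, :, ::-1] - CLIP_MEAN) / CLIP_STD
    tensor = np.ascontiguousarray(img.transpose(2, 0, 1)[np.newaxis])
    features = session.run(None, {input_name: tensor})[0][0]
    return features / np.linalg.norm(features)   # L2-normalize for cosine similarity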

The attribute inference uses pre-computed text features for each class:

# Pre-computed text features for each attribute type
explainers = {
    "age": MatriarchExplainer("human-age"),
    "gender": MatriarchExplainer("human-gender"),
    "facial_expression": MatriarchExplainer("human-facial-expression"),
    # ... more attributes
}

# For each detected person
image_features = feature_extractor.infer(crop)
attributes = {
    k: explainer.infer(image_features)
    for k, explainer in explainers.items()
}
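Conceptually, each explainer boils down to a bank of pre-computed CLIP text embeddings plus a similarity comparison against the image features. A stripped-down sketch of that idea (not the actual MatriarchExplainer implementation; class and variable names are illustrative):

import numpy as np

class TextAttributeExplainer:
    """Compares an image embedding against pre-computed text embeddings."""

    def __init__(self, class_names, text_features):
        # text_features: (num_classes, dim) L2-normalized CLIP text embeddings
        self.class_names = class_names
        self.text_features = text_features

    def infer(self, image_features):
        # For normalized vectors, cosine similarity is just a dot product
        logits = 100.0 * (self.text_features @ image_features)   # CLIP-style scaling
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        best = int(np.argmax(probs))
        return self.class_names[best], float(probs[best])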

Stage 3: Multi-Object Tracking

The tracker matches detections across frames using a scoring function that combines feature similarity, spatial distance, and track age:

Frame N                    Frame N+1
┌─────────┐              ┌─────────┐
│ Track 1 │              │ Track 1 │ ← matched
│ Track 2 │              │ Track 2 │ ← matched
│ Track 3 │              │ Track 3 │ ← matched
└─────────┘              │ Track 4 │ ← new
                         └─────────┘

Tracking Algorithm

The tracker keeps a history of each person's movement and uses momentum to predict where they'll be in the next frame. Each candidate track/detection pair is then scored like this:

import numpy as np

def _is_match(distance, age, features_a, features_b):
    # Cosine similarity between the CLIP feature vectors
    score = np.dot(features_a, features_b) / (np.linalg.norm(features_a) * np.linalg.norm(features_b))
    score -= distance * 5        # Penalize movement between frames
    score += min(age, 5) / 30    # Reward track stability
    return score

Matches are found greedily by highest score. A score above 0.7 typically indicates the same person.
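In outline, the greedy assignment looks like this (score_fn wraps the scoring function above with the distance, age, and features for a given track/detection pair; the containers are simplified):

MATCH_THRESHOLD = 0.7

def greedy_match(tracks, detections, score_fn):
    # Score every track/detection pair, then take pairs in descending score order
    pairs = sorted(
        ((score_fn(track, det), ti, di)
         for ti, track in enumerate(tracks)
         for di, det in enumerate(detections)),
        reverse=True,
    )

    matches, used_tracks, used_dets = [], set(), set()
    for score, ti, di in pairs:
        if score < MATCH_THRESHOLD:
            break                        # everything below threshold is a non-match
        if ti in used_tracks or di in used_dets:
            continue                     # each track/detection is matched at most once
        matches.append((ti, di))
        used_tracks.add(ti)
        used_dets.add(di)

    unmatched = [di for di in range(len(detections)) if di not in used_dets]
    return matches, unmatched            # unmatched detections start new tracks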

Temporal Smoothing

Attributes flicker frame-by-frame, so I use a decay-based accumulation system: each frame's prediction adds evidence for its class, and all accumulated scores decay over time so stale evidence fades out.
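A minimal sketch of that accumulator (the decay factor and display threshold here are illustrative values, not the exact ones I use):

class AttributeAccumulator:
    """Accumulates per-class scores with exponential decay to smooth predictions."""

    def __init__(self, decay=0.9, threshold=2.0):
        self.decay = decay          # how quickly old evidence fades (illustrative)
        self.threshold = threshold  # minimum accumulated score before showing a label
        self.scores = {}            # class name -> accumulated score

    def update(self, predicted_class, confidence):
        # Decay every class, then add this frame's evidence
        for cls in self.scores:
            self.scores[cls] *= self.decay
        self.scores[predicted_class] = self.scores.get(predicted_class, 0.0) + confidence

    def current_label(self):
        # Only report a label once it has accumulated enough evidence
        if not self.scores:
            return None
        best = max(self.scores, key=self.scores.get)
        return best if self.scores[best] >= self.threshold else None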

This prevents flickering between similar classes (e.g., "smiling" vs "neutral").

Processing Pipeline

The pipeline is structured like this:

  1. Setup: Load YOLO-NAS and CLIP models
  2. Frame Processing Loop:
    • Detect people in each frame
    • Extract features and infer attributes
    • Store results for tracking
  3. Tracking Phase:
    • Match detections across frames
    • Update track positions with momentum
    • Accumulate attributes over time
  4. Visualization:
    • Draw bounding boxes and labels
    • Render annotated video

I process the entire video first to get all detections, then run tracking as a separate pass. This lets the tracker use temporal information more effectively and makes the code easier to debug.
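In outline, the two-pass structure looks like this (detect_people, extract_features, infer_attributes, run_tracker, and render_annotated_video are stand-ins for the real modules, not their actual names):

import cv2

def process_video(path: str):
    # Pass 1: detection + feature extraction on every frame, results kept in memory
    per_frame = []
    cap = cv2.VideoCapture(path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        entries = []
        for det in detect_people(frame):                # stand-in: YOLO-NAS, conf >= 0.95
            feats = extract_features(upper_body_crop(frame, det.bbox))
            entries.append({
                "bbox": det.bbox,
                "features": feats,
                "attributes": infer_attributes(feats),  # stand-in: CLIP explainers
            })
        per_frame.append(entries)
    cap.release()

    # Pass 2: tracking over the stored detections, then rendering
    tracks = run_tracker(per_frame)                     # stand-in: the tracker above
    render_annotated_video(path, tracks)                # stand-in: draws boxes + labels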

Performance

On 4K video at 30fps, the bottleneck is detection - running each frame through YOLO-NAS. Feature extraction and attribute inference are fast enough to run per-person without much overhead.

Challenges

Occlusion: When people overlap or leave the frame, momentum predicts their return. I keep tracks alive for 10 seconds before deleting them.
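A rough sketch of the momentum update and the 10-second keep-alive (field names and the momentum factor are illustrative):

MAX_MISSING_SECONDS = 10

class Track:
    def __init__(self, position, fps):
        self.position = position        # (x, y) centre of the bounding box
        self.velocity = (0.0, 0.0)      # smoothed frame-to-frame movement
        self.frames_missing = 0
        self.fps = fps

    def update(self, new_position, momentum=0.8):
        # Blend the old velocity with the newly observed displacement
        dx = new_position[0] - self.position[0]
        dy = new_position[1] - self.position[1]
        self.velocity = (momentum * self.velocity[0] + (1 - momentum) * dx,
                         momentum * self.velocity[1] + (1 - momentum) * dy)
        self.position = new_position
        self.frames_missing = 0

    def predict(self):
        # While occluded, keep moving the track along its momentum
        return (self.position[0] + self.velocity[0],
                self.position[1] + self.velocity[1])

    def expired(self):
        # Drop the track after ~10 seconds without a successful match
        return self.frames_missing > MAX_MISSING_SECONDS * self.fps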

Scale variation: People get bigger/smaller as they move. I normalize by bounding box area and mostly rely on feature similarity for matching.

Attribute uncertainty: Low-confidence predictions get thrown out. I only show attributes with accumulated scores above thresholds.

False positives: The high-confidence detection threshold (0.95) filters most noise, and tracking helps validate detections over time.

What I Learned

This project was interesting because it combined several different approaches: off-the-shelf detection (YOLO-NAS), vision-language features (CLIP), and a hand-rolled tracker with temporal smoothing on top.

The combination of fast detection, robust tracking, and flexible attribute inference makes this a useful approach for understanding video content. It's also a good example of how modern vision-language models can be applied beyond their original use cases.

Result

Here's the final output showing tracked people with their inferred attributes: