Video Tracking with Multi-Attribute Inference
I wanted to track people across video frames and figure out things like their age, gender, and what they were doing. So I put together YOLO-NAS for detection, CLIP for feature extraction, and wrote a custom tracker to keep track of who's who across frames.
System Architecture
The pipeline is pretty straightforward - three stages:
Video Frame
↓
[1] Person Detection (YOLO-NAS)
↓
Detections (bboxes + confidence)
↓
[2] Feature Extraction (CLIP)
↓
Feature Vectors
↓
[3] Multi-Object Tracking
↓
Tracked Persons + Attributes
Stage 1: Person Detection
I use YOLO-NAS Large to detect people in each frame, with a high confidence threshold (0.95). I focus on the upper body region (upper 2/3 of the bounding box) since that's where most attributes are visible.
┌─────────────────────────────────┐
│           Video Frame           │
│                                 │
│    ┌──────┐       ┌──────┐      │
│    │Person│       │Person│      │
│    │  #1  │       │  #2  │      │
│    └──────┘       └──────┘      │
└─────────────────────────────────┘
            ↓ YOLO-NAS
┌─────────────────────────────────┐
│        Detection Results        │
│  • Person #1: bbox, conf=0.97   │
│  • Person #2: bbox, conf=0.96   │
└─────────────────────────────────┘
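Here's a rough sketch of this stage. Only the 0.95 threshold and the upper-2/3 crop come from the pipeline described above; the run_detector callable is a stand-in for whatever YOLO-NAS wrapper is actually used:

CONF_THRESHOLD = 0.95  # high threshold to keep only clean person detections

def upper_body_crop(frame, bbox):
    """Crop the upper 2/3 of a person bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = bbox
    y2_upper = y1 + int((y2 - y1) * 2 / 3)
    return frame[y1:y2_upper, x1:x2]

def detect_people(frame, run_detector):
    """run_detector is assumed to return (bbox, confidence) pairs for the person class."""
    detections = []
    for bbox, conf in run_detector(frame):
        if conf < CONF_THRESHOLD:
            continue
        detections.append({"bbox": bbox, "conf": conf, "crop": upper_body_crop(frame, bbox)})
    return detections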
Stage 2: Feature Extraction & Attribute Inference
Each detected person gets cropped, resized, and passed through CLIP to extract feature vectors. These features do two things:
- Re-identification: Matching the same person across frames
- Attribute inference: Figuring out age, gender, expression, etc.
Person Crop (224×224)
↓
CLIP Image Encoder (ONNX)
↓
Feature Vector (1024-dim)
↓
├─→ Attribute Explainers → [age, gender, expression, ...]
└─→ Track Matching → [same person?]
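Feature extraction itself is just a forward pass through the ONNX-exported CLIP image encoder. As a sketch (the FeatureExtractor class, model path, and exact preprocessing are my assumptions; the normalization constants are CLIP's standard ones):

import cv2
import numpy as np
import onnxruntime as ort

# CLIP's standard image normalization constants
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

class FeatureExtractor:
    def __init__(self, model_path: str):
        self.session = ort.InferenceSession(model_path)
        self.input_name = self.session.get_inputs()[0].name

    def infer(self, crop_bgr: np.ndarray) -> np.ndarray:
        """Return an L2-normalized feature vector for one person crop."""
        img = cv2.cvtColor(cv2.resize(crop_bgr, (224, 224)), cv2.COLOR_BGR2RGB)
        img = (img.astype(np.float32) / 255.0 - CLIP_MEAN) / CLIP_STD
        tensor = img.transpose(2, 0, 1)[np.newaxis]  # NCHW batch of one
        features = self.session.run(None, {self.input_name: tensor})[0][0]
        return features / np.linalg.norm(features)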
The attribute inference uses pre-computed text features for each class:
# Pre-computed text features for each attribute type
explainers = {
    "age": MatriarchExplainer("human-age"),
    "gender": MatriarchExplainer("human-gender"),
    "facial_expression": MatriarchExplainer("human-facial-expression"),
    # ... more attributes
}

# For each detected person
image_features = feature_extractor.infer(crop)
attributes = {
    k: explainer.infer(image_features)
    for k, explainer in explainers.items()
}
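The MatriarchExplainer internals aren't shown here, but a zero-shot CLIP classifier over pre-computed text embeddings would look roughly like this (the class and field names are illustrative):

import numpy as np

class ZeroShotExplainer:
    """Illustrative explainer: each class label gets a pre-computed, L2-normalized
    CLIP text embedding; inference is a similarity lookup against image features."""

    def __init__(self, labels, text_features):
        self.labels = labels                # e.g. ["child", "teenager", "adult", "senior"]
        self.text_features = text_features  # shape: (num_labels, feature_dim), L2-normalized

    def infer(self, image_features):
        # Cosine similarity reduces to a dot product on normalized vectors
        logits = 100.0 * (self.text_features @ image_features)  # CLIP-style temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        best = int(np.argmax(probs))
        return {"label": self.labels[best], "score": float(probs[best])}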
Stage 3: Multi-Object Tracking
The tracker matches detections across frames using a scoring function that looks at:
- Feature similarity: Cosine distance between CLIP embeddings
- Spatial distance: How far the person moved
- Track age: Longer tracks get priority
Frame N           Frame N+1
┌─────────┐      ┌─────────┐
│ Track 1 │      │ Track 1 │  ← matched
│ Track 2 │      │ Track 2 │  ← matched
│ Track 3 │      │ Track 3 │  ← matched
└─────────┘      │ Track 4 │  ← new
                 └─────────┘
Tracking Algorithm
The tracker keeps a history of each person's movement and uses momentum to predict where they'll be in the next frame:
def _is_match(distance, age, features_a, features_b):
    # Returns a matching score; higher means more likely the same person
    score = cosine_similarity(features_a, features_b)
    score = score - (distance * 5)       # Penalize movement
    score = score + (min(age, 5) / 30)   # Reward track stability
    return score
Matches are found greedily by highest score. A score above 0.7 typically indicates the same person.
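To make this concrete, here's a sketch of the greedy assignment and the momentum-based position prediction it relies on. The 0.7 threshold and the score function come from above; the track and detection field names are illustrative:

import numpy as np

MATCH_THRESHOLD = 0.7  # scores above this typically indicate the same person

def predict_position(track):
    """Momentum prediction: last known center plus the average recent velocity."""
    positions = np.asarray(track["positions"])  # one (x, y) center per observed frame
    if len(positions) < 2:
        return positions[-1]
    velocity = np.diff(positions[-5:], axis=0).mean(axis=0)
    return positions[-1] + velocity

def greedy_match(tracks, detections, score_fn):
    """Pair each detection with at most one track, highest score first."""
    candidates = []
    for ti, track in enumerate(tracks):
        predicted = predict_position(track)
        for di, det in enumerate(detections):
            # distance is presumably in normalized image coordinates, given the *5 penalty
            distance = np.linalg.norm(predicted - det["center"])
            score = score_fn(distance, track["age"], track["features"], det["features"])
            candidates.append((score, ti, di))
    matches, used_tracks, used_dets = [], set(), set()
    for score, ti, di in sorted(candidates, reverse=True):
        if score < MATCH_THRESHOLD:
            break  # everything after this is below the threshold
        if ti in used_tracks or di in used_dets:
            continue
        matches.append((ti, di))
        used_tracks.add(ti)
        used_dets.add(di)
    return matches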
Temporal Smoothing
Attributes flicker frame-by-frame. I use a decay-based accumulation system:
- Each attribute prediction adds to an accumulated score
- Scores decay by 10% each frame if not reinforced
- I display the top 1-2 attributes by accumulated score
This prevents flickering between similar classes (e.g., "smiling" vs "neutral").
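A minimal sketch of the accumulator (the class and field names are mine; only the 10% decay and the top-1/2 display come from the description above):

DECAY = 0.9  # accumulated scores lose 10% per frame unless reinforced

class AttributeAccumulator:
    def __init__(self):
        # e.g. {"facial_expression": {"smiling": 3.2, "neutral": 1.1}}
        self.scores = {}

    def update(self, predictions):
        """predictions: {attribute_name: (label, confidence)} for the current frame."""
        # Decay all existing evidence first, then add this frame's predictions
        for labels in self.scores.values():
            for label in labels:
                labels[label] *= DECAY
        for attr, (label, conf) in predictions.items():
            attr_scores = self.scores.setdefault(attr, {})
            attr_scores[label] = attr_scores.get(label, 0.0) + conf

    def top(self, attr, n=2):
        """Top-n labels for an attribute by accumulated score."""
        ranked = sorted(self.scores.get(attr, {}).items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:n]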
Processing Pipeline
The pipeline is structured like this:
- Setup: Load YOLO-NAS and CLIP models
- Frame Processing Loop:
  - Detect people in each frame
  - Extract features and infer attributes
  - Store results for tracking
- Tracking Phase:
  - Match detections across frames
  - Update track positions with momentum
  - Accumulate attributes over time
- Visualization:
  - Draw bounding boxes and labels
  - Render annotated video
I process the entire video first to get all detections, then run tracking as a separate pass. This lets the tracker use temporal information more effectively and makes the code easier to debug.
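Sketched out, the two-pass structure looks roughly like this (the helper names reuse the sketches above; tracker.update and tracker.tracks are assumed, not the actual API):

def process_video(frames, run_detector, feature_extractor, explainers, tracker):
    """Illustrative two-pass structure: detect everything first, then track."""
    # Pass 1: per-frame detection, feature extraction, and attribute inference
    per_frame = []
    for frame in frames:
        detections = detect_people(frame, run_detector)  # from the detection sketch above
        for det in detections:
            det["features"] = feature_extractor.infer(det["crop"])
            det["attributes"] = {k: e.infer(det["features"]) for k, e in explainers.items()}
        per_frame.append(detections)

    # Pass 2: tracking over the full detection history
    for frame_idx, detections in enumerate(per_frame):
        tracker.update(frame_idx, detections)
    return tracker.tracks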
Performance
On 4K video at 30fps:
- Detection: ~20ms per frame (GPU-accelerated)
- Feature extraction: ~5ms per person
- Attribute inference: ~5ms per person per attribute set
- Tracking update: ~5ms per frame
- Total: Real-time capable with GPU
The bottleneck is detection - processing each frame through YOLO-NAS. Feature extraction and attribute inference are fast enough to run per-person without much overhead.
Challenges
Occlusion: When people overlap or leave the frame, momentum predicts their return. I keep tracks alive for 10 seconds before deleting them.
Scale variation: People get bigger/smaller as they move. I normalize by bounding box area and mostly rely on feature similarity for matching.
Attribute uncertainty: Low-confidence predictions get thrown out. I only show attributes with accumulated scores above thresholds.
False positives: The high-confidence detection threshold (0.95) filters most noise, and tracking helps validate detections over time.
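For the occlusion handling above, track expiry is easy to sketch (the last_seen field is illustrative; only the 10-second keep-alive comes from the description):

FRAME_RATE = 30
MAX_MISSED_SECONDS = 10  # keep unmatched tracks alive this long before deleting them

def prune_tracks(tracks, current_frame):
    """Drop tracks that haven't been matched for more than 10 seconds of frames."""
    max_missed = MAX_MISSED_SECONDS * FRAME_RATE
    return [t for t in tracks if current_frame - t["last_seen"] <= max_missed]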
What I Learned
This project was interesting because it combined several different approaches:
- CLIP for re-identification: The same features used for attribute inference also work surprisingly well for tracking. No need for separate re-ID models.
- Temporal smoothing matters: Frame-by-frame predictions are noisy. Accumulating over time with decay gives much more stable results.
- Momentum helps with occlusion: When people disappear briefly, momentum predicts where they'll reappear.
- ONNX makes it fast: Converting CLIP to ONNX cut inference time significantly while keeping the same accuracy.
The combination of fast detection, robust tracking, and flexible attribute inference makes this a useful approach for understanding video content. It's also a good example of how modern vision-language models can be applied beyond their original use cases.
Result
Here's the final output showing tracked people with their inferred attributes: