Video Tracking with Multi-Attribute Inference
I wanted to track people across video frames and figure out things like their age, gender, and what they were doing. So I put together YOLO-NAS for detection, CLIP for feature extraction, and wrote a custom tracker to keep track of who's who across frames.
System Architecture
The pipeline is pretty straightforward - three stages:
Video Frame
↓
[1] Person Detection (YOLO-NAS)
↓
Detections (bboxes + confidence)
↓
[2] Feature Extraction (CLIP)
↓
Feature Vectors
↓
[3] Multi-Object Tracking
↓
Tracked Persons + Attributes
Stage 1: Person Detection
I use YOLO-NAS Large to detect people in each frame, with a high confidence threshold (0.95). I focus on the upper body region (upper 2/3 of the bounding box) since that's where most attributes are visible.
┌─────────────────────────────────┐
│           Video Frame           │
│                                 │
│    ┌──────┐       ┌──────┐      │
│    │Person│       │Person│      │
│    │  #1  │       │  #2  │      │
│    └──────┘       └──────┘      │
└─────────────────────────────────┘
            ↓ YOLO-NAS
┌─────────────────────────────────┐
│        Detection Results        │
│  • Person #1: bbox, conf=0.97   │
│  • Person #2: bbox, conf=0.96   │
└─────────────────────────────────┘
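Here's a rough sketch of this stage. Only the 0.95 threshold and the upper-2/3 crop come from the pipeline described above; the run_detector callable is a stand-in for whatever YOLO-NAS wrapper is actually used:

CONF_THRESHOLD = 0.95  # high threshold to keep only clean person detections

def upper_body_crop(frame, bbox):
    """Crop the upper 2/3 of a person bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = bbox
    y2_upper = y1 + int((y2 - y1) * 2 / 3)
    return frame[y1:y2_upper, x1:x2]

def detect_people(frame, run_detector):
    """run_detector is assumed to return (bbox, confidence) pairs for the person class."""
    detections = []
    for bbox, conf in run_detector(frame):
        if conf < CONF_THRESHOLD:
            continue
        detections.append({"bbox": bbox, "conf": conf, "crop": upper_body_crop(frame, bbox)})
    return detections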
Stage 2: Feature Extraction & Attribute Inference
Each detected person gets cropped, resized, and passed through CLIP to extract feature vectors. These features do two things:
- Re-identification: Matching the same person across frames
- Attribute inference: Figuring out age, gender, expression, etc.
Person Crop (224×224)
↓
CLIP Image Encoder (ONNX)
↓
Feature Vector (1024-dim)
↓
├─→ Attribute Explainers → [age, gender, expression, ...]
└─→ Track Matching → [same person?]
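Feature extraction itself is just a forward pass through the ONNX-exported CLIP image encoder. As a sketch (the FeatureExtractor class, model path, and exact preprocessing are my assumptions; the normalization constants are CLIP's standard ones):

import cv2
import numpy as np
import onnxruntime as ort

# CLIP's standard image normalization constants
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

class FeatureExtractor:
    def __init__(self, model_path: str):
        self.session = ort.InferenceSession(model_path)
        self.input_name = self.session.get_inputs()[0].name

    def infer(self, crop_bgr: np.ndarray) -> np.ndarray:
        """Return an L2-normalized feature vector for one person crop."""
        img = cv2.cvtColor(cv2.resize(crop_bgr, (224, 224)), cv2.COLOR_BGR2RGB)
        img = (img.astype(np.float32) / 255.0 - CLIP_MEAN) / CLIP_STD
        tensor = img.transpose(2, 0, 1)[np.newaxis]  # NCHW batch of one
        features = self.session.run(None, {self.input_name: tensor})[0][0]
        return features / np.linalg.norm(features)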
The attribute inference uses pre-computed text features for each class:
# Pre-computed text features for each attribute type
explainers = {
    "age": MatriarchExplainer("human-age"),
    "gender": MatriarchExplainer("human-gender"),
    "facial_expression": MatriarchExplainer("human-facial-expression"),
    # ... more attributes
}

# For each detected person
image_features = feature_extractor.infer(crop)
attributes = {
    k: explainer.infer(image_features)
    for k, explainer in explainers.items()
}
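The MatriarchExplainer internals aren't shown here, but a zero-shot CLIP classifier over pre-computed text embeddings would look roughly like this (the class and field names are illustrative):

import numpy as np

class ZeroShotExplainer:
    """Illustrative explainer: each class label gets a pre-computed, L2-normalized
    CLIP text embedding; inference is a similarity lookup against image features."""

    def __init__(self, labels, text_features):
        self.labels = labels                # e.g. ["child", "teenager", "adult", "senior"]
        self.text_features = text_features  # shape: (num_labels, feature_dim), L2-normalized

    def infer(self, image_features):
        # Cosine similarity reduces to a dot product on normalized vectors
        logits = 100.0 * (self.text_features @ image_features)  # CLIP-style temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        best = int(np.argmax(probs))
        return {"label": self.labels[best], "score": float(probs[best])}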
Stage 3: Multi-Object Tracking
The tracker matches detections across frames using a scoring function that looks at:
- Feature similarity: Cosine distance between CLIP embeddings
- Spatial distance: How far the person moved
- Track age: Longer tracks get priority
Frame N           Frame N+1
┌─────────┐      ┌─────────┐
│ Track 1 │      │ Track 1 │  ← matched
│ Track 2 │      │ Track 2 │  ← matched
│ Track 3 │      │ Track 3 │  ← matched
└─────────┘      │ Track 4 │  ← new
                 └─────────┘
Tracking Algorithm
The tracker keeps a history of each person's movement and uses momentum to predict where they'll be in the next frame:
def _is_match(distance, age, features_a, features_b):
    # Returns a matching score; higher means more likely the same person
    score = cosine_similarity(features_a, features_b)
    score = score - (distance * 5)       # Penalize movement
    score = score + (min(age, 5) / 30)   # Reward track stability
    return score
Matches are found greedily by highest score. A score above 0.7 typically indicates the same person.
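To make this concrete, here's a sketch of the greedy assignment and the momentum-based position prediction it relies on. The 0.7 threshold and the score function come from above; the track and detection field names are illustrative:

import numpy as np

MATCH_THRESHOLD = 0.7  # scores above this typically indicate the same person

def predict_position(track):
    """Momentum prediction: last known center plus the average recent velocity."""
    positions = np.asarray(track["positions"])  # one (x, y) center per observed frame
    if len(positions) < 2:
        return positions[-1]
    velocity = np.diff(positions[-5:], axis=0).mean(axis=0)
    return positions[-1] + velocity

def greedy_match(tracks, detections, score_fn):
    """Pair each detection with at most one track, highest score first."""
    candidates = []
    for ti, track in enumerate(tracks):
        predicted = predict_position(track)
        for di, det in enumerate(detections):
            # distance is presumably in normalized image coordinates, given the *5 penalty
            distance = np.linalg.norm(predicted - det["center"])
            score = score_fn(distance, track["age"], track["features"], det["features"])
            candidates.append((score, ti, di))
    matches, used_tracks, used_dets = [], set(), set()
    for score, ti, di in sorted(candidates, reverse=True):
        if score < MATCH_THRESHOLD:
            break  # everything after this is below the threshold
        if ti in used_tracks or di in used_dets:
            continue
        matches.append((ti, di))
        used_tracks.add(ti)
        used_dets.add(di)
    return matches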
Temporal Smoothing
Attributes flicker frame-by-frame. I use a decay-based accumulation system:
- Each attribute prediction adds to an accumulated score
- Scores decay by 10% each frame if not reinforced
- I display the top 1-2 attributes by accumulated score
This prevents flickering between similar classes (e.g., "smiling" vs "neutral").
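A minimal sketch of the accumulator (the class and field names are mine; only the 10% decay and the top-1/2 display come from the description above):

DECAY = 0.9  # accumulated scores lose 10% per frame unless reinforced

class AttributeAccumulator:
    def __init__(self):
        # e.g. {"facial_expression": {"smiling": 3.2, "neutral": 1.1}}
        self.scores = {}

    def update(self, predictions):
        """predictions: {attribute_name: (label, confidence)} for the current frame."""
        # Decay all existing evidence first, then add this frame's predictions
        for labels in self.scores.values():
            for label in labels:
                labels[label] *= DECAY
        for attr, (label, conf) in predictions.items():
            attr_scores = self.scores.setdefault(attr, {})
            attr_scores[label] = attr_scores.get(label, 0.0) + conf

    def top(self, attr, n=2):
        """Top-n labels for an attribute by accumulated score."""
        ranked = sorted(self.scores.get(attr, {}).items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:n]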
Processing Pipeline
The pipeline is structured like this:
- Setup: Load YOLO-NAS and CLIP models
- Frame Processing Loop:
  - Detect people in each frame
  - Extract features and infer attributes
  - Store results for tracking
- Tracking Phase:
  - Match detections across frames
  - Update track positions with momentum
  - Accumulate attributes over time
- Visualization:
  - Draw bounding boxes and labels
  - Render annotated video
I process the entire video first to get all detections, then run tracking as a separate pass. This lets the tracker use temporal information more effectively and makes the code easier to debug.
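Sketched out, the two-pass structure looks roughly like this (the helper names reuse the sketches above; tracker.update and tracker.tracks are assumed, not the actual API):

def process_video(frames, run_detector, feature_extractor, explainers, tracker):
    """Illustrative two-pass structure: detect everything first, then track."""
    # Pass 1: per-frame detection, feature extraction, and attribute inference
    per_frame = []
    for frame in frames:
        detections = detect_people(frame, run_detector)  # from the detection sketch above
        for det in detections:
            det["features"] = feature_extractor.infer(det["crop"])
            det["attributes"] = {k: e.infer(det["features"]) for k, e in explainers.items()}
        per_frame.append(detections)

    # Pass 2: tracking over the full detection history
    for frame_idx, detections in enumerate(per_frame):
        tracker.update(frame_idx, detections)
    return tracker.tracks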
Performance
On 4K video at 30fps:
- Detection: ~20ms per frame (GPU-accelerated)
- Feature extraction: ~5ms per person
- Attribute inference: ~5ms per person per attribute set
- Tracking update: ~5ms per frame
- Total: Real-time capable with GPU
The bottleneck is detection - processing each frame through YOLO-NAS. Feature extraction and attribute inference are fast enough to run per-person without much overhead.
Challenges
Occlusion: When people overlap or leave the frame, momentum predicts their return. I keep tracks alive for 10 seconds before deleting them.
Scale variation: People get bigger/smaller as they move. I normalize by bounding box area and mostly rely on feature similarity for matching.
Attribute uncertainty: Low-confidence predictions get thrown out. I only show attributes with accumulated scores above thresholds.
False positives: The high-confidence detection threshold (0.95) filters most noise, and tracking helps validate detections over time.
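For the occlusion handling above, track expiry is easy to sketch (the last_seen field is illustrative; only the 10-second keep-alive comes from the description):

FRAME_RATE = 30
MAX_MISSED_SECONDS = 10  # keep unmatched tracks alive this long before deleting them

def prune_tracks(tracks, current_frame):
    """Drop tracks that haven't been matched for more than 10 seconds of frames."""
    max_missed = MAX_MISSED_SECONDS * FRAME_RATE
    return [t for t in tracks if current_frame - t["last_seen"] <= max_missed]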
What I Learned
This project was interesting because it combined several different approaches:
- CLIP for re-identification: The same features used for attribute inference also work surprisingly well for tracking. No need for separate re-ID models.
- Temporal smoothing matters: Frame-by-frame predictions are noisy. Accumulating over time with decay gives much more stable results.
- Momentum helps with occlusion: When people disappear briefly, momentum predicts where they'll reappear.
- ONNX makes it fast: Converting CLIP to ONNX cut inference time significantly while keeping the same accuracy.
The combination of fast detection, robust tracking, and flexible attribute inference makes this a useful approach for understanding video content. It's also a good example of how modern vision-language models can be applied beyond their original use cases.
Result
Here's the final output showing tracked people with their inferred attributes: