Semantic Video Search with DINOv3 and CLIP
I wanted to build a video search system where you could query with natural language and find relevant temporal segments. DINOv3 has great scene understanding, and CLIP has language alignment, but they're in different embedding spaces. So I trained a lightweight projector to map DINO embeddings into CLIP space, giving me the best of both worlds.
System Architecture
The pipeline extracts embeddings, trains a projector, and builds a searchable index:
Video Files (MP4)
↓
[1] Frame Extraction
↓
Frames + Timestamps
↓
[2] Embedding Extraction
↓
DINO Embeddings + CLIP Embeddings
↓
[3] Projector Training
↓
DINO → CLIP Projector
↓
[4] Index Building
↓
FAISS Index (Transformed Embeddings)
↓
[5] Text Query
↓
Temporal Segments
Stage 1: Video Ingestion
I extract frames from MP4 files at their native frame rate and assign timestamps. Since these aren't live streams, I synthesize timestamps: timestamp = NOW + (frame_idx / fps). This creates a consistent temporal reference for grouping results later.
def ingest_video(video_path, fps):
    frames = extract_frames(video_path)
    now = time.time()
    # Synthesize monotonically increasing timestamps from the frame index and fps
    timestamps = [now + (i / fps) for i in range(len(frames))]
    return frames, timestamps
Stage 2: Embedding Extraction
I extract two types of embeddings for each frame:
DINOv3 embeddings: Using facebook/dinov3-vitb16-pretrain-lvd1689m for scene understanding. DINO is great at capturing spatial structure and temporal coherence.
CLIP embeddings: Using open_clip with ViT-B/16 for language alignment. CLIP lets me query with text.
def extract_dinov3(frames, device):
    processor = AutoImageProcessor.from_pretrained("facebook/dinov3-vitb16-pretrain-lvd1689m")
    model = AutoModel.from_pretrained("facebook/dinov3-vitb16-pretrain-lvd1689m").to(device).eval()
    embeddings = []
    with torch.no_grad():
        for frame in frames:
            inputs = processor(images=frame, return_tensors="pt").to(device)
            outputs = model(**inputs)
            # Use the class token as a global descriptor of the frame
            feat = outputs.last_hidden_state[:, 0, :]
            embeddings.append(feat.cpu())
    return torch.cat(embeddings)
def extract_clip(frames, device):
    # Pretrained weights are required; "openai" is one available tag for ViT-B-16
    model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
    model = model.to(device).eval()
    embeddings = []
    with torch.no_grad():
        for frame in frames:
            image = preprocess(frame).unsqueeze(0).to(device)  # add batch dimension
            feat = model.encode_image(image)
            embeddings.append(feat.cpu())
    return torch.cat(embeddings)
I process frames in batches (default 64) to avoid running out of memory on large videos.
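For reference, the batched CLIP path might look roughly like the sketch below; the function name and the stacking logic are illustrative, not the exact implementation.

def extract_clip_batched(frames, model, preprocess, device, batch_size=64):
    # Process frames in fixed-size chunks so GPU memory stays bounded
    embeddings = []
    with torch.no_grad():
        for start in range(0, len(frames), batch_size):
            batch = frames[start:start + batch_size]
            images = torch.stack([preprocess(f) for f in batch]).to(device)
            feats = model.encode_image(images)  # (batch, 512) for ViT-B/16
            embeddings.append(feats.cpu())
    return torch.cat(embeddings)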
Stage 3: Projector Training
This is the key part. I train a small MLP to map DINO embeddings to CLIP space:
class ProjectorMLP(nn.Module):
    def __init__(self, in_dim, out_dim, hidden_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)
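Assuming ViT-B/16 backbones on both sides, the DINO features are 768-dimensional and the CLIP features 512-dimensional, so instantiation is just:

projector = ProjectorMLP(in_dim=768, out_dim=512)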
Training uses a contrastive objective: each projected DINO embedding should land close to its paired CLIP embedding and away from the CLIP embeddings of other frames. I use a combination of a cosine alignment loss and InfoNCE:
def cosine_loss(z_t, z_d):
    z_t = F.normalize(z_t, dim=-1)
    z_d = F.normalize(z_d, dim=-1)
    return (1.0 - (z_t * z_d).sum(dim=-1)).mean()

def info_nce(z_t, z_d, temperature=0.07):
    z_t = F.normalize(z_t, dim=-1)
    z_d = F.normalize(z_d, dim=-1)
    logits = (z_t @ z_d.t()) / temperature
    targets = torch.arange(z_t.size(0), device=z_t.device)
    return F.cross_entropy(logits, targets)
loss = cosine_loss(projected, clip) + lambda_infonce * info_nce(projected, clip)
I train on paired (DINO, CLIP) embeddings from all frames in the corpus. The projector learns to align the two spaces so that semantically similar content maps to similar vectors.
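For concreteness, a minimal training loop might look like the following sketch; the optimizer choice, learning rate, batch size, and epoch count are illustrative assumptions, not the actual hyperparameters.

def train_projector(projector, dino_embs, clip_embs, lambda_infonce=1.0,
                    epochs=10, batch_size=1024, lr=1e-3, device="cuda"):
    projector = projector.to(device)
    opt = torch.optim.AdamW(projector.parameters(), lr=lr)
    n = dino_embs.size(0)
    for _ in range(epochs):
        perm = torch.randperm(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            d = dino_embs[idx].to(device)   # DINO inputs
            c = clip_embs[idx].to(device)   # paired CLIP targets
            projected = projector(d)
            loss = cosine_loss(projected, c) + lambda_infonce * info_nce(projected, c)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return projector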
Stage 4: Index Building
After training, I transform all DINO embeddings through the projector and build a FAISS index:
# Transform DINO embeddings through the trained projector
with torch.no_grad():
    dino_transformed = projector(dino_embeddings)
# Normalize so inner product equals cosine similarity
dino_transformed = F.normalize(dino_transformed, dim=-1)
# Build FAISS index
index = faiss.IndexFlatIP(dino_transformed.shape[1])  # inner product on unit vectors = cosine
index.add(dino_transformed.cpu().numpy().astype("float32"))
I use IndexFlatIP (inner product) for cosine similarity search. For larger corpora, I'd use IVF+PQ or HNSW for faster search, but the flat index works fine for now.
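Switching to an approximate index in FAISS would look something like this sketch, assuming `vectors` holds the normalized projected embeddings as a float32 numpy array; the list count, PQ code size, and HNSW graph degree are illustrative values, not tuned settings.

d = vectors.shape[1]

# IVF + PQ: clusters the corpus and compresses vectors; needs a training pass
ivfpq = faiss.index_factory(d, "IVF4096,PQ64", faiss.METRIC_INNER_PRODUCT)
ivfpq.train(vectors)
ivfpq.add(vectors)
ivfpq.nprobe = 16  # lists probed per query: recall vs. speed trade-off

# HNSW: graph-based index, no training pass, strong recall at low latency
hnsw = faiss.index_factory(d, "HNSW32", faiss.METRIC_INNER_PRODUCT)
hnsw.add(vectors)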
Stage 5: Querying
To query, I encode the text with CLIP's text encoder and search the transformed index:
def query_text(query_text, index, timestamps, topk=200):
    # Encode the text query with CLIP's text encoder
    tokens = tokenizer([query_text])
    with torch.no_grad():
        text_embedding = clip_model.encode_text(tokens)
    text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
    # Search the projected-DINO index (inner product on normalized vectors = cosine)
    D, I = index.search(text_embedding.cpu().numpy().astype("float32"), topk)
    # Group hits that are close in time into contiguous segments
    ranges = group_by_timestamp(I[0], timestamps, max_gap=1.0)
    return ranges
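The `clip_model` and `tokenizer` globals above come from open_clip; loading them would look roughly like this (the "openai" pretrained tag is my assumption, any ViT-B-16 checkpoint with the same 512-dim text space would do):

import open_clip

clip_model, _, _ = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")
clip_model.eval()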
The results are grouped into temporal ranges - if frames 100-120 all match, I return that as a single segment rather than listing each frame individually.
Temporal Grouping
The query results are grouped by temporal proximity:
Query >>> "green"
Output >>> strongest temporal matches:
* timestamp1 - timestamp202
* timestamp599 - timestamp620
This makes the results more useful - instead of getting scattered individual frames, you get continuous segments where the concept appears.
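The `group_by_timestamp` helper referenced in the query code isn't shown above; a minimal version might look like this sketch.

def group_by_timestamp(frame_indices, timestamps, max_gap=1.0):
    # Sort matched frames by time, then merge hits closer than max_gap seconds
    if len(frame_indices) == 0:
        return []
    times = sorted(timestamps[i] for i in frame_indices)
    segments = []
    start = prev = times[0]
    for t in times[1:]:
        if t - prev > max_gap:  # gap too large: close the current segment
            segments.append((start, prev))
            start = t
        prev = t
    segments.append((start, prev))
    return segments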
Why This Approach?
DINO for scene understanding: DINO embeddings capture spatial structure and temporal coherence better than CLIP. They're trained on a huge dataset with self-supervised learning, so they learn rich visual representations.
CLIP for language alignment: CLIP's text encoder lets me query with natural language. But CLIP embeddings are optimized for image-text matching, not necessarily scene understanding.
Projector to bridge the gap: By training a projector DINO→CLIP, I get DINO's scene understanding in CLIP's space. This means I can query with text but still benefit from DINO's superior visual features.
Lightweight training: The projector is just a small MLP (2 layers), so it trains quickly and doesn't require much compute. I can train it on a small subset of frames and it generalizes well.
Performance
On a corpus of ~100k frames:
- Embedding extraction: ~5-10ms per frame (batch processing)
- Projector training: ~5 minutes for 100k frames
- Index building: ~10 seconds
- Query latency: ~50ms for top-200 results
The bottleneck is embedding extraction, but that's done offline during ingestion. Query time is fast enough for interactive use.
Challenges
Memory management: Large videos produce a lot of frames. I use numpy memmaps during pair building to avoid loading everything into RAM at once (a sketch appears at the end of this section).
Temporal alignment: The synthesized timestamps work for grouping, but they don't reflect real-world time. This is fine for most use cases, but could be an issue if you need absolute time references.
Projector generalization: The projector is trained on the corpus being indexed. If you add new videos later, you might need to retrain or fine-tune the projector, though in practice it generalizes reasonably well.
Index size: For very large corpora, the flat index becomes slow. I'd need to switch to IVF or HNSW indexes, but that adds complexity.
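The memmap trick mentioned under memory management is roughly the following; the file names, shapes, and the `iterate_embedding_batches` helper are hypothetical placeholders, not the actual code.

import numpy as np

n_frames, dino_dim, clip_dim = 100_000, 768, 512  # illustrative sizes
dino_mm = np.memmap("dino_pairs.f32", dtype="float32", mode="w+", shape=(n_frames, dino_dim))
clip_mm = np.memmap("clip_pairs.f32", dtype="float32", mode="w+", shape=(n_frames, clip_dim))

# Write each batch of embeddings straight to disk instead of holding them in RAM
for start, (dino_batch, clip_batch) in iterate_embedding_batches():  # hypothetical helper
    dino_mm[start:start + len(dino_batch)] = dino_batch
    clip_mm[start:start + len(clip_batch)] = clip_batch
dino_mm.flush()
clip_mm.flush()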
What I Learned
This project taught me a lot about combining different embedding spaces:
- Projection works well: You don't need to retrain everything from scratch. A small MLP can learn to align two embedding spaces effectively.
- Contrastive learning is powerful: The InfoNCE loss helps the projector learn meaningful alignments, not just approximate mappings.
- Temporal grouping matters: Raw nearest-neighbor results are noisy. Grouping by temporal proximity makes the results much more useful.
- FAISS is fast: Even with flat indexes, FAISS is fast enough for interactive use. The GPU acceleration helps a lot.
- DINO + CLIP is a good combo: DINO's scene understanding plus CLIP's language alignment gives you the best of both worlds.
The system enables semantic search over video content with natural language queries. It's also a good example of how to combine different embedding models to get the benefits of each.