Semantic Video Search with DINOv3 and CLIP
Building a video search system that uses DINOv3 embeddings for scene understanding and CLIP for text queries, with a learned projector to bridge the two embedding spaces.
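A minimal sketch of the bridging idea, assuming precomputed DINOv3 frame embeddings and a CLIP text query embedding; the dimensions (768-d DINOv3 features, 512-d CLIP text space), the hidden size, and the `search` helper are illustrative assumptions rather than the system's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Small MLP that maps DINOv3 frame embeddings into CLIP's text embedding space.
    Dimensions are assumptions (e.g. 768-d ViT-B DINOv3 features, 512-d CLIP text space)."""
    def __init__(self, dino_dim: int = 768, clip_dim: int = 512, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dino_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, clip_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-normalize so cosine similarity reduces to a plain dot product.
        return F.normalize(self.net(x), dim=-1)

def search(frame_embeddings: torch.Tensor, text_embedding: torch.Tensor,
           projector: Projector, top_k: int = 5):
    """Rank frames by cosine similarity between projected DINOv3 embeddings
    and a CLIP text query embedding (both unit-normalized)."""
    projected = projector(frame_embeddings)          # (num_frames, clip_dim)
    query = F.normalize(text_embedding, dim=-1)      # (clip_dim,)
    scores = projected @ query                       # (num_frames,)
    return torch.topk(scores, k=min(top_k, scores.numel()))
```

In this sketch, query time reduces to one projector forward pass over stored frame embeddings and a single matrix multiply, so CLIP's image tower never has to run at search time.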
I'm interested in robotics, vision-language-action models, mathematics, and AI. I build systems that connect these ideas—from edge devices running real-time inference to cloud infrastructure orchestrating large-scale training.
Much of my work involves taking AI models from research into production, optimizing them for real constraints: memory limits, latency requirements, and the practical needs of applications in defense, healthcare, and other domains.
I write about what I'm learning and building in these areas.
Applied AI and full-stack development across defense, mental health, finance, and IoT applications.
Built AI-powered tools for apparel design, optimizing diffusion models for 4K image generation in ~15 seconds.
Engineered real-time, on-device AI systems for large-scale visual data collection with temporal compression models.
A distributed system for monitoring stock sentiment by scraping news, extracting structured information with LLMs, and indexing for semantic search.
Building a system that analyzes keyboard typing patterns to infer cognitive health metrics like attention, impulse control, and mood stability.
Building a real-time video analysis system that tracks people across frames and infers multiple attributes using CLIP and YOLO; a minimal sketch of this detect-track-then-classify pipeline follows this list.
A real-time computer vision system that combines object detection, pose estimation, and ballistics calculations to automatically adjust for environmental conditions.
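As referenced in the people-tracking entry above, here is a minimal sketch of a detect-track-then-classify loop, assuming an Ultralytics YOLO tracker for person boxes and zero-shot CLIP scoring over attribute prompts; the checkpoints, the ATTRIBUTE_PROMPTS, and the analyze_frame helper are illustrative assumptions, not the project's actual implementation:

```python
import cv2
import torch
from ultralytics import YOLO
from transformers import CLIPModel, CLIPProcessor

# Assumed model choices; the project's actual checkpoints are not specified.
detector = YOLO("yolov8n.pt")  # person detection + built-in tracking
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical attribute prompts for zero-shot CLIP classification.
ATTRIBUTE_PROMPTS = ["a person wearing a backpack", "a person not wearing a backpack"]

def analyze_frame(frame):
    """Detect and track people in one BGR frame, then score attributes per crop."""
    results = detector.track(frame, persist=True, classes=[0], verbose=False)[0]
    outputs = []
    for box in results.boxes:
        x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
        track_id = int(box.id[0]) if box.id is not None else -1
        crop = cv2.cvtColor(frame[y1:y2, x1:x2], cv2.COLOR_BGR2RGB)
        inputs = processor(text=ATTRIBUTE_PROMPTS, images=crop,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            probs = clip(**inputs).logits_per_image.softmax(dim=-1)[0]
        outputs.append((track_id, (x1, y1, x2, y2), probs.tolist()))
    return outputs
```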