Technical Whitepaper · Production

BAVI V-JEPA 2.0

Semantic Action Understanding via Latent-Space Video Prediction

Lab: BlueX Research
Authors: J. Hwan, S. Yoon
Published: January 2026
Status: Production

Abstract

We present BAVI V-JEPA 2.0, a video understanding architecture achieving 99.4% recall on sports action classification with 50% fewer parameters than comparable Vision-Language Models (VLMs). Building on Meta's Joint-Embedding Predictive Architecture, our approach predicts masked video patches in latent space rather than pixel space, eliminating the computational overhead of generative reconstruction.

The key insight: action recognition requires understanding what is semantically meaningful, not reconstructing every visual detail. By predicting in embedding space, the model learns to discard unpredictable environmental noise (grass texture, crowd faces) while capturing kinematic patterns essential to action classification.

1 The Generative Overhead Problem

Current approaches to video action recognition fall into two categories—both with fundamental limitations for edge deployment in sports applications.

Vision-Language Models

LLaVA-13B, GPT-4V, Gemini Pro Vision

  • VRAM requirement: 26GB+
  • Inference latency: 2-5 seconds
  • Edge deployment: Impossible

Traditional CNNs

I3D, SlowFast, C3D

  • Optical flow: Required (2× compute)
  • Real-time capable: Not suited
  • Temporal window: Fixed (inflexible)

Root Cause: Pixel Reconstruction

Generative models (including VLMs) learn to reconstruct every pixel in masked regions. This wastes massive compute on:

  • Environmental noise: Grass texture, court surface patterns, lighting variations
  • Crowd details: Individual faces, clothing patterns, movement
  • Unpredictable elements: Ball spin, player expressions, random occlusions

None of these contribute to action classification—yet generative models must predict them all.

1.1 The Information Paradox

Video understanding presents a paradox: more visual detail ≠ better action recognition. A player's serving motion contains the same semantic information whether filmed in 4K or 480p, whether the crowd is visible or cropped out. The kinematic pattern—arm trajectory, body rotation, follow-through—is what matters.

This insight drives our architectural choice: predict in latent space where semantic information is preserved but pixel-level detail is abstracted away.

2 Joint-Embedding Predictive Architecture

V-JEPA (Video Joint-Embedding Predictive Architecture) learns representations by predicting masked video patches in embedding space, not pixel space. This fundamental shift enables the model to focus on semantically meaningful patterns.

The Core Insight

Prediction vs. Reconstruction

Generative (MAE, VLMs)

"Given visible patches, reconstruct the exact pixel values of masked patches."

Loss = ||pixels_pred - pixels_true||²

Must predict grass color, shadow angles, crowd clothing...

Predictive (V-JEPA)

"Given visible patches, predict the semantic embedding of masked patches."

Loss = ||embed_pred - embed_target||²

Only predicts semantic content—action, motion, pose.
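
To make the two objectives concrete, here is a minimal PyTorch sketch of both losses. The tensor shapes, the mean-squared-error form, and the function names are illustrative assumptions; they mirror the formulas above rather than the production implementation.

    import torch
    import torch.nn.functional as F

    def generative_loss(pixels_pred, pixels_true):
        # MAE-style objective: reconstruct raw pixel values of masked patches,
        # so grass texture and crowd detail all contribute to the loss.
        return F.mse_loss(pixels_pred, pixels_true)

    def jepa_loss(embed_pred, embed_target):
        # V-JEPA-style objective: match the target encoder's embedding of the
        # masked patches; the target is detached (stop-gradient).
        return F.mse_loss(embed_pred, embed_target.detach())

    # Illustrative shapes: (batch, num_masked_tokens, embed_dim)
    loss = jepa_loss(torch.randn(8, 128, 384), torch.randn(8, 128, 384))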

2.1 Architecture Components

Context Encoder

Processes visible (unmasked) video patches through a Vision Transformer. Outputs contextualized embeddings that capture spatial and temporal relationships between visible regions.

ViT-L/16 backbone · Patch size: 16×16 · Temporal extent: 16 frames
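
For a rough sense of the encoder's token budget, the sketch below counts the spatiotemporal tokens in one 16-frame clip. The 224×224 input resolution and the temporal tubelet size of 2 are assumptions; the whitepaper specifies only the patch size and frame count.

    # Token count for the ViT-L/16 context encoder (before masking).
    # Assumed: 224x224 input and tubelet size 2; only the 16x16 patch size
    # and 16-frame clip length are given above.
    frames, height, width = 16, 224, 224
    patch, tubelet = 16, 2

    patches_per_slice = (height // patch) * (width // patch)  # 14 * 14 = 196
    num_tokens = (frames // tubelet) * patches_per_slice      # 8 * 196 = 1568
    print(num_tokens)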

Predictor Network

Takes context encoder output + positional embeddings for masked regions. Predicts what the target encoder would have output for those masked patches. Narrower than encoder (asymmetric design prevents collapse).

12 transformer blocks · Hidden dim: 384

Target Encoder (EMA)

Processes masked patches to create target embeddings. Updated via Exponential Moving Average (EMA) of context encoder weights—no gradients flow through. This prevents representation collapse without contrastive negatives.

EMA decay: 0.999 · Stop-gradient
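
A minimal sketch of the EMA update and stop-gradient behaviour described above, assuming standard PyTorch modules; the stand-in encoder is a placeholder for the actual ViT backbone.

    import copy
    import torch
    import torch.nn as nn

    @torch.no_grad()
    def update_target_encoder(context_encoder, target_encoder, decay=0.999):
        # The target encoder tracks an exponential moving average of the
        # context encoder's weights; no gradients flow through it.
        for ctx_p, tgt_p in zip(context_encoder.parameters(),
                                target_encoder.parameters()):
            tgt_p.mul_(decay).add_(ctx_p, alpha=1.0 - decay)

    # Stand-in module for illustration; in practice this is the ViT-L/16.
    context_encoder = nn.Linear(768, 384)
    target_encoder = copy.deepcopy(context_encoder)
    for p in target_encoder.parameters():
        p.requires_grad = False  # stop-gradient: updated only via EMA

    update_target_encoder(context_encoder, target_encoder)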

2.2 Spatiotemporal Masking Strategy

V-JEPA uses aggressive spatiotemporal masking (up to 90%) to force the model to learn robust representations. The masking pattern combines:

  • Mask ratio: 90% (forces learning of strong priors)
  • Spatial blocks: 4×4 (contiguous region masking)
  • Temporal frames: 8 (multi-frame prediction)
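
A simplified sketch of how such a mask could be sampled, assuming a 14×14 token grid per frame and a spatial mask shared across the 8 frames; the actual sampling procedure in V-JEPA differs in detail.

    import torch

    def block_mask(t_frames=8, grid_h=14, grid_w=14, block=4, mask_ratio=0.9):
        # Boolean mask over the token grid (True = masked). Contiguous 4x4
        # spatial blocks are dropped until ~90% of tokens are hidden, and the
        # same spatial pattern is shared across frames so whole tubes vanish.
        mask = torch.zeros(grid_h, grid_w, dtype=torch.bool)
        target = int(mask_ratio * grid_h * grid_w)
        while mask.sum() < target:
            top = torch.randint(0, grid_h - block + 1, (1,)).item()
            left = torch.randint(0, grid_w - block + 1, (1,)).item()
            mask[top:top + block, left:left + block] = True
        return mask.unsqueeze(0).expand(t_frames, -1, -1)

    m = block_mask()
    print(m.float().mean())  # close to 0.9 once the blocks tile the grid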

3 BAVI Sports Adaptation

We fine-tuned the base V-JEPA model on sports-specific data with domain adaptations that improve action classification for racket sports.

Action Classes (Tennis)

  • Serve: 99.8%
  • Forehand: 99.2%
  • Backhand: 99.1%
  • Volley: 98.7%
  • Smash: 99.5%
  • Drop shot: 98.4%

Training Configuration

  • Pre-training data: VideoMix-2M
  • Fine-tuning data: Sports-500K
  • Epochs (pre-train): 800
  • Epochs (fine-tune): 100
  • Batch size: 256
  • Learning rate: 1e-4 (cosine)
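
For orientation, a sketch of the fine-tuning optimiser setup implied by this configuration; the AdamW choice and the stand-in classifier head are assumptions, while the learning rate, cosine schedule, batch size, and epoch count come from the table above.

    import torch
    from torch.optim.lr_scheduler import CosineAnnealingLR

    model = torch.nn.Linear(384, 6)  # stand-in head for the 6 tennis action classes

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # AdamW is assumed
    epochs = 100                              # fine-tuning epochs
    steps_per_epoch = 500_000 // 256          # Sports-500K at batch size 256
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs * steps_per_epoch)

    # Inside the training loop, call scheduler.step() after each optimizer.step().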

Why Latent Prediction Works for Sports

Sports actions have high kinematic regularity. A forehand swing follows predictable biomechanical patterns regardless of:

  • Court surface color
  • Player clothing
  • Camera angle (±30°)

The latent space naturally abstracts these variations while preserving motion signatures.

4 Benchmarks & Results

Performance Metrics

  • Parameters: 650M
  • Recall: 99.4%
  • Latency (G5): 12ms
  • Faster decode: 2.85×

Model              | Params | VRAM | Latency | Accuracy
-------------------|--------|------|---------|---------
LLaVA-13B          | 13B    | 26GB | 2-5s    | 94.2%
I3D + Optical Flow | 25M    | 8GB  | 180ms   | 91.8%
SlowFast-R50       | 34M    | 6GB  | 85ms    | 93.5%
VideoMAE-L         | 305M   | 12GB | 45ms    | 96.1%
BAVI V-JEPA 2.0    | 650M   | 4GB  | 12ms    | 99.4%

Key Findings

  • Latent prediction outperforms pixel reconstruction: V-JEPA achieves higher accuracy than VideoMAE with a comparable backbone; the gain comes from the prediction objective alone
  • No optical flow required: Unlike I3D and traditional methods, V-JEPA learns motion representations implicitly through temporal masking
  • Efficient VRAM usage: 4GB inference enables deployment on consumer GPUs and high-end edge devices
  • 2.85× faster decoding: No pixel reconstruction means faster output generation for real-time applications

5 Production Deployment

Infrastructure

  • AWS g5.2xlarge (A10G)
  • PyTorch 2.x + CUDA 12
  • TorchScript optimization
  • Mixed precision (FP16)

Pipeline

  • FFmpeg frame extraction
  • Sliding window (16 frames)
  • Batch inference (stride: 8)
  • Action smoothing filter

Throughput

  • Port 8006 (Pro Analyzer)
  • 83 FPS @ 1080p
  • Async request handling
  • GPU memory pooling
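
A condensed sketch of the sliding-window inference loop described above (16-frame window, stride 8, FP16 autocast); the model is assumed to be a loaded TorchScript module, and the action-smoothing filter is reduced to a majority vote over recent windows for illustration.

    import torch
    from collections import Counter, deque

    WINDOW, STRIDE = 16, 8

    @torch.inference_mode()
    def classify_clip(model, frames, device="cuda"):
        # frames: (num_frames, 3, H, W) tensor of decoded video frames.
        preds, recent = [], deque(maxlen=5)
        for start in range(0, frames.shape[0] - WINDOW + 1, STRIDE):
            window = frames[start:start + WINDOW].unsqueeze(0).to(device)
            with torch.autocast(device_type="cuda", dtype=torch.float16):
                logits = model(window)                 # (1, num_classes)
            recent.append(int(logits.argmax(dim=-1)))
            # Simple smoothing: majority vote over the last few windows.
            preds.append(Counter(recent).most_common(1)[0][0])
        return preds

    # model = torch.jit.load("bavi_vjepa2.ts").eval().cuda()  # placeholder path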

References

[1] Assran, M. et al. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR.

[2] Bardes, A. et al. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA). arXiv.

[3] He, K. et al. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR.

[4] Feichtenhofer, C. et al. (2019). SlowFast Networks for Video Recognition. ICCV.

[5] Carreira, J. & Zisserman, A. (2017). Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset (I3D). CVPR.