Fighter Detection & Pose Analysis
Multi-Model Pipeline for Combat Sports Analytics
Abstract
This case study documents our multi-model pipeline for fighter detection and pose analysis in combat sports video. The system achieves real-time performance (<100ms latency) by combining a custom-trained YOLOv8 detector for fighter identification, YOLOv8-Pose for 17-keypoint skeletal tracking, and ST-GCN for kick technique classification.
Developed for taekwondo match analysis, the methodology generalizes to other combat sports (MMA, boxing, karate) where dual-fighter tracking, pose estimation, and action recognition are required at broadcast-quality frame rates.
The Combat Sports Detection Challenge
Combat sports video presents unique challenges for computer vision. Unlike static object detection or single-person pose estimation, the domain requires simultaneous tracking of multiple rapidly-moving, frequently-occluding subjects with real-time performance constraints.
Dual-Fighter Tracking
Two fighters must be tracked with consistent IDs across frames, even during clinches, spins, and rapid exchanges where bounding boxes overlap.
Identity Preservation
Fighters must be distinguished by corner (red vs. blue hogu) across occlusions, similar poses, and varying camera angles.
Real-Time Constraint
Match analysis requires 30+ FPS processing to capture rapid kick sequences while streaming results to the client via Server-Sent Events (SSE).
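As a sketch of the streaming side, the snippet below pushes per-frame results to the client over SSE. It assumes a FastAPI server; the endpoint path and payload fields are illustrative, not the project's actual API.

```python
# Minimal SSE sketch (assumption: FastAPI; route and payload are illustrative).
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def frame_results():
    """Yield one SSE event per analyzed frame."""
    frame_idx = 0
    while True:
        # In the real pipeline this payload would come from the detection,
        # pose, and classification stages; here it is a placeholder.
        payload = {"frame": frame_idx, "fighters": [], "events": []}
        yield f"data: {json.dumps(payload)}\n\n"
        frame_idx += 1
        await asyncio.sleep(1 / 30)  # pace events at ~30 FPS

@app.get("/stream/{match_id}")
async def stream(match_id: str):
    return StreamingResponse(frame_results(), media_type="text/event-stream")
```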
Why Generic Person Detection Fails
Standard COCO-trained person detectors cannot distinguish between fighters by corner color. Generic pose estimators struggle with combat-specific poses (high kicks, spinning techniques) that are underrepresented in general datasets.
Multi-Model Pipeline Architecture
The system employs a cascaded architecture where specialized models handle distinct tasks. Each component is optimized for its specific function, enabling parallel execution on GPU and modular upgrades.
Processing Pipeline
Stage 1: Fighter Detection (YOLOv8n-TKD)
Custom-trained YOLOv8 nano model for taekwondo-specific fighter detection. Trained to recognize red and blue hogu (chest protector) as separate classes.
- Classes: [blue, red] (corner-specific detection)
- Model: YOLOv8n (~3.2M parameters)
- Latency: ~8ms (per-frame inference)
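A minimal sketch of how Stage 1 can be called per frame using the Ultralytics Python API; the weights filename is a placeholder, and the class index order simply follows the [blue, red] class list above.

```python
# Stage 1 sketch: per-frame fighter detection with the custom hogu model.
# "yolov8n_tkd.pt" is a placeholder name for the custom-trained weights.
from ultralytics import YOLO

detector = YOLO("yolov8n_tkd.pt")  # classes: 0=blue, 1=red

def detect_fighters(frame):
    """Return {corner: (x1, y1, x2, y2)} using the highest-confidence box per corner."""
    result = detector(frame, verbose=False)[0]
    best = {}
    for box in result.boxes:
        corner = result.names[int(box.cls)]          # "blue" or "red"
        conf = float(box.conf)
        if corner not in best or conf > best[corner][1]:
            best[corner] = (box.xyxy[0].tolist(), conf)
    return {corner: xyxy for corner, (xyxy, _) in best.items()}
```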
Stage 2: Pose Estimation (YOLOv8L-Pose)
Native YOLO pose estimation model providing 17 keypoints per detected person. Handles multiple people naturally, enabling simultaneous dual-fighter tracking.
- Keypoints: 17 (COCO format)
- Latency: ~25ms (dual-person)
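A sketch of Stage 2 under the assumption that each estimated skeleton is assigned to a Stage 1 fighter box by IoU; the source does not specify the association method, and the weights name is the standard Ultralytics release.

```python
# Stage 2 sketch: 17-keypoint pose estimation plus a simple IoU-based
# assignment of skeletons to the Stage 1 corner boxes (assumed method).
from ultralytics import YOLO

pose_model = YOLO("yolov8l-pose.pt")

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def estimate_poses(frame, fighter_boxes):
    """fighter_boxes: {corner: (x1, y1, x2, y2)} from Stage 1.
    Returns {corner: (17, 3) array of [x, y, confidence] keypoints}."""
    result = pose_model(frame, verbose=False)[0]
    keypoints = result.keypoints.data.cpu().numpy()  # (n_persons, 17, 3)
    poses = {}
    for i, box in enumerate(result.boxes):
        person_box = box.xyxy[0].tolist()
        corner = max(fighter_boxes, default=None,
                     key=lambda c: iou(person_box, fighter_boxes[c]))
        if corner is not None:
            poses[corner] = keypoints[i]
    return poses
```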
Stage 3: Technique Classification (ST-GCN)
Spatial-Temporal Graph Convolutional Network for classifying kick techniques from pose sequences. Takes 15-30 frame windows of skeletal data.
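A sketch of how pose sequences could be windowed and fed to the ST-GCN classifier. The (N, C, T, V, M) input layout follows common ST-GCN implementations; the pre-loaded model object, the 30-frame window (the source states 15-30 frame windows), and the class list (taken from the technique names in the statistics output below) are illustrative assumptions.

```python
# Stage 3 sketch: buffer per-fighter keypoints into fixed-length windows and
# run them through a pre-loaded ST-GCN classifier (treated as a black box).
from collections import deque

import numpy as np
import torch

WINDOW = 30  # frames per clip
KICKS = ["roundhouse", "front", "side", "back", "axe"]

class KickClassifier:
    def __init__(self, model: torch.nn.Module, window: int = WINDOW):
        self.model = model.eval()
        self.buffers = {"red": deque(maxlen=window), "blue": deque(maxlen=window)}

    @torch.no_grad()
    def update(self, corner: str, keypoints: np.ndarray):
        """Append one frame of (17, 3) keypoints; classify once the window is full."""
        buf = self.buffers[corner]
        buf.append(keypoints)
        if len(buf) < buf.maxlen:
            return None
        clip = np.stack(buf)                     # (T, 17, 3)
        x = torch.from_numpy(clip).float()
        x = x.permute(2, 0, 1)[None, ..., None]  # (N=1, C=3, T, V=17, M=1)
        logits = self.model(x)
        return KICKS[int(logits.argmax(dim=1))]
```

In the per-frame loop, the Stage 1 boxes and Stage 2 keypoints feed classifier.update(corner, keypoints), and a non-None return value is emitted as a strike event on the SSE stream.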
Custom Training for Fighter Detection
The key innovation is training a lightweight YOLO model specifically for hogu color detection, enabling consistent fighter identification without complex re-identification networks.
Training Data
- Source Videos: 100+
- Labeled Frames: 10,000+
- Pose Sequences: 50,000+
- Source: World Taekwondo
Model Configuration
- Base Model: YOLOv8n
- Classes: 2 (blue, red)
- Image Size: 640px
- Parameters: ~3.2M
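A minimal training sketch matching the configuration above, using the Ultralytics API; the dataset YAML layout, epoch count, and batch size are assumptions not stated here.

```python
# Training sketch for the Stage 1 hogu detector. Only the base model,
# class count, and 640px image size come from the configuration above.
from ultralytics import YOLO

# hogu.yaml (assumed layout):
#   path: datasets/tkd
#   train: images/train
#   val: images/val
#   names: {0: blue, 1: red}

model = YOLO("yolov8n.pt")      # COCO-pretrained nano backbone
model.train(
    data="hogu.yaml",
    imgsz=640,                  # matches the 640px setting above
    epochs=100,
    batch=16,
)
metrics = model.val()           # mAP on the held-out split
model.export(format="onnx")     # optional: export for deployment
```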
Kick Annotation Schema
{
  "video_id": "match_001",
  "frame_start": 1200,
  "frame_end": 1230,
  "technique": "roundhouse",  // dollyeo chagi (roundhouse kick)
  "fighter": "red",
  "target": "body",
  "contact": true,
  "score": 2
}
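For illustration, a small loader that maps this schema onto typed records; the field names mirror the JSON above, while the file layout and helper function are assumptions.

```python
# Sketch: typed records for the annotation schema above.
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass
class KickAnnotation:
    video_id: str
    frame_start: int
    frame_end: int
    technique: str
    fighter: str      # "red" or "blue"
    target: str       # "head" or "body"
    contact: bool
    score: int

    @property
    def n_frames(self) -> int:
        return self.frame_end - self.frame_start + 1

def load_annotations(path: Path) -> list[KickAnnotation]:
    """Read a JSON file containing a list of annotation objects."""
    records = json.loads(path.read_text(encoding="utf-8"))
    return [KickAnnotation(**r) for r in records]
```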
Performance Results
System Performance
| Model Stage | Latency | GPU Memory | Accuracy |
|---|---|---|---|
| YOLOv8n-TKD (Fighter Detection) | ~8ms | ~1GB | 95%+ |
| YOLOv8L-Pose (Pose Estimation) | ~25ms | ~3GB | 92% |
| ST-GCN (Kick Classification) | ~40ms | ~2GB | 90% |
| Total Pipeline | <100ms | <8GB | |
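A sketch of how the per-stage latency figures could be measured; the warm-up/synchronize pattern is standard PyTorch GPU timing practice, and the stage callables are placeholders for the pipeline functions.

```python
# Timing sketch: mean per-stage latency in milliseconds.
import time

import torch

def time_stage(fn, *args, warmup: int = 10, iters: int = 100) -> float:
    """Run fn(*args) repeatedly and return mean latency in ms."""
    for _ in range(warmup):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000 / iters
```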
Per-Fighter Output Statistics
{
  "fighter_id": "red",
  "round": 2,
  "statistics": {
    "total_strikes": 45,
    "kicks": {
      "roundhouse": 18,
      "front": 12,
      "side": 8,
      "back": 4,
      "axe": 3
    },
    "punches": 12,
    "strikes_per_minute": 22.5,
    "head_strikes": 15,
    "body_strikes": 30,
    "accuracy": 0.67,
    "territory_control": 0.55
  }
}
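A sketch of how per-fighter statistics in this shape could be aggregated from classified strike events; the event fields and the strikes-per-minute and accuracy formulas are assumptions, and territory_control is omitted because its computation is not described here.

```python
# Sketch: aggregating per-fighter statistics from classified strike events.
from collections import Counter

def aggregate(events: list[dict], round_seconds: float) -> dict:
    """events: one dict per classified strike, e.g.
    {"type": "kick", "technique": "roundhouse", "target": "body", "contact": True}"""
    kicks = Counter(e["technique"] for e in events if e["type"] == "kick")
    punches = sum(1 for e in events if e["type"] == "punch")
    total = sum(kicks.values()) + punches
    landed = sum(1 for e in events if e.get("contact"))
    return {
        "total_strikes": total,
        "kicks": dict(kicks),
        "punches": punches,
        "strikes_per_minute": round(total / (round_seconds / 60), 1),
        "head_strikes": sum(1 for e in events if e["target"] == "head"),
        "body_strikes": sum(1 for e in events if e["target"] == "body"),
        "accuracy": round(landed / total, 2) if total else 0.0,
    }
```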
Generalization to Other Combat Sports
The multi-model pipeline architecture generalizes to other combat sports with minimal retraining. The key adaptation is training the Stage 1 detector for sport-specific visual markers (glove colors, uniform patterns) and the Stage 3 classifier for sport-specific techniques.
Applicable Domains
- Boxing (glove color detection)
- MMA (corner-based identification)
- Karate (belt color, uniform)
- Fencing (weapon + uniform tracking)
Transferable Components
- Multi-model cascade architecture
- SSE real-time streaming pipeline
- Pose estimation backbone
- Statistics aggregation framework
References
[1] Yan, S. et al. (2018). Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. AAAI.
[2] Jocher, G. et al. (2023). Ultralytics YOLOv8. GitHub repository.
[3] Lugaresi, C. et al. (2019). MediaPipe: A Framework for Building Perception Pipelines. CVPR Workshop.
[4] World Taekwondo. (2024). Competition Rules. worldtaekwondo.org.