From 137,000 community routes and 1.8M pose frames to a live difficulty prediction in under 50 ms — the full pipeline, the feature engineering, the mathematics, and how we measure accuracy honestly.
Every climb starts as a hold sequence and ends as a row of numbers. The pipeline ingests community data, fuses it with pose data extracted from beta videos, and produces a clean feature matrix.
Each climb is reduced to 86 features. Spatial features are computed in centimetres from hold (x, y) coordinates; pose features are imputed for every route by a model trained on real beta videos.
Reach distances, lateral/vertical span, hold density, and wall-zone counts — the raw shape of the climb.
Crux (longest) move, move-distance variance, top vertical gains, path linearity, dyno score — the flow between holds.
angle×reach, angle×dyno, angle×span. The same holds are a different climb at 40° than at 0° — these features encode that.
Body tension, hip angle, arm extension — inferred from MediaPipe's 33 landmarks on scraped beta videos.
A stacked ensemble regresses a continuous difficulty score ŝ ∈ [0,1], which is calibrated and mapped to a V-grade.
A Ridge meta-learner blends XGBoost and LightGBM, then isotonic calibration maps the score to a grade. A validation gate guarantees the model can only improve.
Trained on 357,928 rows with mirror + sequence-reversal augmentation, hard-grade oversampling, and consensus weighting (Eq. 8). Singleton-grade routes are filtered, and frozen-holdout route IDs are dropped to prevent leakage.
| Component | Key hyperparameters |
|---|---|
| XGBoost | 900 trees · lr 0.03 · depth 7 |
| · sampling | subsample 0.8 · colsample 0.7 |
| LightGBM | GBDT · 64 leaves · hist |
| Ridge | α by 5-fold CV |
| Calibration | isotonic (PAV), monotone |
Predicting an exact V-grade is famously hard — a Kilter grade is a crowd-consensus average, and even expert setters disagree by ±1. So within-1 and within-2 accuracy are the metrics that actually matter.
The frozen holdout is the strict, apples-to-apples yardstick: every model graded on identical routes excluded from all training. The test split gives the per-grade breakdown.
| Evaluation set | Exact | ±1 | ±2 |
|---|---|---|---|
| Frozen holdout | 23.5% | 60.7% | — |
| Test split (41.6K) | 27.1% | 68.7% | 83.0% |
83% of predictions land within the ±1 human-disagreement band.
Strongest on core grades V4–V6 (the bulk of community climbing). V9+ remains the open challenge — limited by data scarcity.
| Training corpus | Routes | Within-1 (frozen holdout) |
|---|---|---|
| Baseline | 22.8K | 51.4% |
| Full graded | 139K | 57.9% |
| Full augmented | 357.9K | 60.7% |
Within-1 rises monotonically with corpus size — +9.3 points from 22.8K → 357.9K. Data scale is the dominant lever, which is why the autonomous scraper keeps collecting beta videos.