The Process

How holds become
a grade.

From 137,000 community routes and 1.8M pose frames to a live difficulty prediction in under 50 ms — the full pipeline, the feature engineering, the mathematics, and how we measure accuracy honestly.

137K
Graded routes
86
Features
83%
Within 2 grades
<50ms
Live prediction
Step 01 · The Data Pipeline
From the wall to a feature matrix

Every climb starts as a hold sequence and ends as a row of numbers. The pipeline ingests community data, fuses it with pose data extracted from beta videos, and produces a clean feature matrix.

Aurora API · BoardLib: sync 137K community-graded routes across 0–70° angles
SQLite + PostgreSQL: 357K climbs stored; pose-fused corpus in Postgres
Instagram betas → MediaPipe: 33 body landmarks per frame → 1.8M pose frames
feature_extraction.py: reduce each climb to an 86-dimension vector
Stacked ensemble: XGBoost + LightGBM → Ridge → isotonic calibration
Flask API :5001 → PWA: live grade, uncertainty band, board visualizer

Codebase structure

pipeline/
    feature_extraction.py    # climb → 86 features
    beta_scraper.py          # Instagram pose mining (yt-dlp + MediaPipe)
    auto_scrape.py           # autonomous scrape + retrain supervisor (validation gate)
ml/
    difficulty_model.py      # XGBoost + LightGBM + Ridge stack, isotonic calibration
    grade_models.py          # frozen-holdout grading + cross-model leaderboard
    train_specialists.py     # tiered low / medium / hard experts
api/
    app.py                   # Flask server
    board_config.py          # per-size hold maps (7×10 → 16×12)
js/
    board.js · creator.js · stick-figure.js    # PWA board + pose animation
Step 02 · Feature Engineering
86 features in four families

Each climb is reduced to 86 features. Spatial features are computed in centimetres from hold (x, y) coordinates; pose features are imputed for every route by a model trained on real beta videos.

Geometry

Reach distances, lateral/vertical span, hold density, and wall-zone counts — the raw shape of the climb.

Sequence

Crux (longest) move, move-distance variance, top vertical gains, path linearity, dyno score — the flow between holds.

Angle interactions

angle×reach, angle×dyno, angle×span. The same holds are a different climb at 40° than at 0° — these features encode that.

Pose-imputed

Body tension, hip angle, arm extension — inferred from MediaPipe's 33 landmarks on scraped beta videos.

The two key transforms

$$ z_j = \frac{x_j - \mu_j}{\sigma_j} \tag{1} $$
Per-feature standardization to zero mean, unit variance.
$$ x_{r \times \alpha} = r \cdot \sin(\alpha), \qquad \alpha = \text{board angle} \tag{2} $$
Angle interaction: steeper walls amplify the effective reach and load of every move. Reach between consecutive holds $i$, $i+1$ is $r_i = \sqrt{(\Delta x)^2 + (\Delta y)^2}$; the crux is $\max_i r_i$.
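As a rough illustration of the two transforms, the sketch below computes per-move reach, the crux, the angle×reach interaction (Eq. 2), and a z-score (Eq. 1) in plain Python. The function names and the four-hold climb are hypothetical, and the actual code in feature_extraction.py may differ.

```python
import math

def move_reaches(holds):
    """Euclidean distance between consecutive holds, in the units of (x, y)."""
    return [math.dist(a, b) for a, b in zip(holds, holds[1:])]

def angle_reach(reach, board_angle_deg):
    """Eq. 2: reach scaled by the sine of the board angle (given in degrees)."""
    return reach * math.sin(math.radians(board_angle_deg))

def standardize(column):
    """Eq. 1: z-score one feature column (population std; constant columns map to 0)."""
    mu = sum(column) / len(column)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in column) / len(column))
    return [(v - mu) / sigma if sigma else 0.0 for v in column]

# Hypothetical four-hold climb, coordinates in centimetres.
holds = [(0, 0), (30, 40), (30, 100), (90, 180)]
reaches = move_reaches(holds)   # moves of 50, 60 and 100 cm
crux = max(reaches)             # the longest move
print(reaches, crux)
print(round(angle_reach(crux, 40), 1))  # 64.3
```

Under this definition the interaction is zero on a vertical wall (sin 0° = 0) and grows with steepness, which matches the intent described above.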
Board angle and its interactions are the model's three strongest features by gain (angle_group .25, board_angle .16, angle×reach .15) — the math above is doing the heavy lifting.
Step 03 · The Mathematics
Every piece, formally

A stacked ensemble regresses a continuous difficulty score ŝ ∈ [0,1], which is calibrated and mapped to a V-grade.

Prediction & ensemble

$$ \hat{g}(x) = \arg\min_{g \in \mathcal{G}} \left| C(F(x)) - \mu_g \right| \tag{3} $$
The predicted grade is the one whose centroid $\mu_g$ lies nearest to the isotonic-calibrated score $C(F(x))$.
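The nearest-centroid rule in Eq. 3 can be sketched in a few lines. The centroid values below are invented for illustration; in practice they would be derived from the training data, and the function name is an assumption.

```python
def nearest_grade(calibrated_score, centroids):
    """Eq. 3: return the grade whose centroid is closest to the calibrated score."""
    return min(centroids, key=lambda g: abs(calibrated_score - centroids[g]))

# Hypothetical centroids on the calibrated scale (not values from the real model).
centroids = {"V3": 3.0, "V4": 4.0, "V5": 5.0, "V6": 6.0}
print(nearest_grade(4.6, centroids))  # V5
```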
$$ F(x) = \sum_{m=1}^{M} \eta \cdot h_m(x), \qquad h_m \in \mathcal{H}_{\text{CART}} \tag{4} $$
A gradient-boosted ensemble: M regression trees added stagewise with learning rate η.
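$$ \mathcal{L}(F) = \sum_i \ell(y_i, \hat{y}_i) + \sum_m \left[ \gamma T_m + \tfrac{1}{2} \lambda \lVert w_m \rVert^2 \right] \tag{5} $$
Regularized objective: squared-error loss plus penalties on each tree's leaf count $T_m$ and leaf weights $w_m$.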

Stacking & consensus weighting

$$ \hat{s} = \beta_0 + \beta_1 F_{\text{XGB}}(x) + \beta_2 F_{\text{LGB}}(x) \tag{6} $$
A Ridge meta-learner blends the two base-model scores.
$$ \beta^* = \arg\min_{\beta} \lVert y - Z\beta \rVert^2 + \alpha \lVert \beta \rVert^2 \tag{7} $$
L2-regularized least squares over the base-prediction matrix Z.
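For two base models, the ridge problem in Eq. 7 has a small closed form, sketched below in plain Python. This is an illustration only: it assumes the intercept is left unpenalized (a common convention, not confirmed by the text above), and the function name and toy numbers are hypothetical.

```python
def fit_ridge_blend(z_xgb, z_lgb, y, alpha):
    """Closed-form ridge for two predictors; the intercept is not penalized (assumption)."""
    n = len(y)
    m1, m2, my = sum(z_xgb) / n, sum(z_lgb) / n, sum(y) / n
    a = [v - m1 for v in z_xgb]          # centred XGBoost scores
    b = [v - m2 for v in z_lgb]          # centred LightGBM scores
    t = [v - my for v in y]              # centred targets
    s11 = sum(u * u for u in a) + alpha  # diagonal of Z'Z + alpha*I
    s22 = sum(u * u for u in b) + alpha
    s12 = sum(u * v for u, v in zip(a, b))
    r1 = sum(u * v for u, v in zip(a, t))
    r2 = sum(u * v for u, v in zip(b, t))
    det = s11 * s22 - s12 * s12
    beta1 = (s22 * r1 - s12 * r2) / det  # solve the 2x2 system by Cramer's rule
    beta2 = (s11 * r2 - s12 * r1) / det
    beta0 = my - beta1 * m1 - beta2 * m2
    return beta0, beta1, beta2

# Toy base-model scores and targets, made up for the example.
z1 = [0.1, 0.4, 0.5, 0.9]
z2 = [0.2, 0.3, 0.6, 0.8]
y = [0.1 + 0.6 * p + 0.4 * q for p, q in zip(z1, z2)]
b0, b1, b2 = fit_ridge_blend(z1, z2, y, alpha=0.1)
print(b0 + b1 * 0.7 + b2 * 0.6)  # blended score for one new route (Eq. 6)
```

With alpha set to 0 the fit reduces to ordinary least squares; larger values shrink the two blend weights toward zero.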
$$ w_i = \log\left(1 + \min(a_i, 10^4)\right) \cdot q_i \tag{8} $$
Each route is weighted by its ascents ai and quality qi — well-validated grades dominate noisy ones.
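A minimal sketch of the consensus weight in Eq. 8 follows. It assumes the natural logarithm, since the base is not stated, and the function name is hypothetical.

```python
import math

def consensus_weight(ascents, quality):
    """Eq. 8: log-damped ascent count (capped at 10,000) times the quality score."""
    return math.log1p(min(ascents, 10_000)) * quality

# Ascent counts above the cap add no further weight.
print(consensus_weight(10_000, 1.0) == consensus_weight(250_000, 1.0))  # True
# A route nobody has logged contributes nothing.
print(consensus_weight(0, 3.0))  # 0.0
```

The logarithm keeps a handful of extremely popular routes from swamping the loss, while the cap bounds their influence outright.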

Calibration, uncertainty & metrics

$$ C^* = \arg\min_{C \uparrow} \sum_i \left( C(\hat{s}_i) - y_i \right)^2 \tag{9} $$
Isotonic (monotone) calibration via pool-adjacent-violators maps raw scores to true grades.
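Pool-adjacent-violators is short enough to sketch directly. The version below is a generic textbook implementation, not the project's code; it assumes the targets are already ordered by raw score and that any weights are strictly positive.

```python
def pav(targets, weights=None):
    """Least-squares non-decreasing fit to targets that are sorted by raw score."""
    weights = [1.0] * len(targets) if weights is None else list(weights)
    blocks = []  # each block holds [mean, total weight, number of points]
    for y, w in zip(targets, weights):
        blocks.append([float(y), w, 1])
        # Merge backwards while the newest block breaks monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted

# Made-up grades, listed in order of increasing raw score.
print(pav([1, 3, 2, 4, 3, 5]))  # [1.0, 2.5, 2.5, 3.5, 3.5, 5.0]
```

Each out-of-order pair is replaced by its weighted mean, so the fitted values never decrease as the raw score increases.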
$$ \rho_\tau(u) = u \cdot \left( \tau - \mathbf{1}[u < 0] \right), \qquad \tau \in \{0.1, 0.9\} \tag{10} $$
Pinball loss trains the q10/q90 quantiles, which give the uncertainty band $[\hat{s}_{q10}, \hat{s}_{q90}]$ (e.g. "V5, range V4–V6").
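To see why minimizing Eq. 10 yields quantiles, the sketch below evaluates the loss for constant predictions over a toy set of targets and picks the minimizer. The data and function names are illustrative only.

```python
def pinball_loss(y_true, y_pred, tau):
    """Eq. 10 with u = y_true - y_pred."""
    u = y_true - y_pred
    return u * (tau - (1.0 if u < 0 else 0.0))

def mean_pinball(targets, constant, tau):
    """Average pinball loss when every target gets the same prediction."""
    return sum(pinball_loss(y, constant, tau) for y in targets) / len(targets)

# Toy targets 0..10; the best constant predictions are the 10% and 90% quantiles.
targets = list(range(11))
q10 = min(targets, key=lambda c: mean_pinball(targets, c, 0.1))
q90 = min(targets, key=lambda c: mean_pinball(targets, c, 0.9))
print(q10, q90)  # 1 9
```

At τ = 0.9, under-predictions cost nine times as much as over-predictions, which pushes the fitted value up toward the 90th percentile.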
$$ \text{Acc@}k = \frac{1}{N} \sum_i \mathbf{1}\left[ |\hat{g}_i - g_i| \le k \right] \tag{11} $$
Within-k grade accuracy — the metric that matters, since human graders themselves disagree by ±1.
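Eq. 11 translates directly into code. The grades below are invented to show how exact, within-1, and within-2 accuracy relate on the same predictions.

```python
def acc_at_k(predicted, actual, k):
    """Eq. 11: share of predictions within k grades of the true grade."""
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= k)
    return hits / len(actual)

# Made-up V-grades as integers; absolute errors are 0, 1, 3, 0, 2.
predicted = [4, 5, 7, 2, 6]
actual = [4, 6, 4, 2, 8]
print(acc_at_k(predicted, actual, 0))  # 0.4
print(acc_at_k(predicted, actual, 1))  # 0.6
print(acc_at_k(predicted, actual, 2))  # 0.8
```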
$$ s_p = \hat{s} + \sum_k \beta_k \frac{\theta_k - \bar{\theta}_k}{\sigma_k} \tag{12} $$
Personal difficulty: re-scores a grade for the climber's body, θ ∈ {height, ape index, weight}.
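A sketch of the personal adjustment in Eq. 12 follows. Every number in it (coefficients, population means, and standard deviations) is a placeholder chosen for the example, not a fitted value, and the function name is hypothetical.

```python
def personal_score(base_score, climber, population, betas):
    """Eq. 12: shift the score by the climber's standardized body measurements."""
    adjustment = sum(
        betas[k] * (climber[k] - population[k][0]) / population[k][1]
        for k in betas
    )
    return base_score + adjustment

# Placeholder (mean, std) per trait and placeholder coefficients.
population = {"height_cm": (175.0, 10.0), "ape_index_cm": (0.0, 5.0)}
betas = {"height_cm": -0.02, "ape_index_cm": -0.01}

shorter = {"height_cm": 170.0, "ape_index_cm": 0.0}
print(round(personal_score(0.50, shorter, population, betas), 3))  # 0.51
```

With a negative height coefficient, a climber half a standard deviation shorter than average sees the route re-scored slightly harder, while an exactly average climber gets the base score back unchanged.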
Step 04 · Architecture & Training
Two boosted models, stacked & calibrated

A Ridge meta-learner blends XGBoost and LightGBM, then isotonic calibration maps the score to a grade. A validation gate rolls back any retrain that scores worse on the frozen holdout, so the deployed model's holdout accuracy never regresses.

Architecture

86-feature climb vector x
XGBoost (900 trees)
LightGBM (GBDT)
Ridge meta-learner (stacking)
Isotonic calibration → V-grade
Grade
q10–q90 band
Personal pd-score

Training protocol

Trained on 357,928 rows with mirror + sequence-reversal augmentation, hard-grade oversampling, and consensus weighting (Eq. 8). Singleton-grade routes are filtered, and frozen-holdout route IDs are dropped to prevent leakage.
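The two geometric augmentations can be sketched as below, assuming holds are (x, y) tuples in centimetres and that mirroring reflects about the board's vertical centre line. The board width and function names are hypothetical. The check at the end shows that both variants preserve the set of move lengths, so each adds a new training row without changing the climb's basic geometry.

```python
import math

def mirror(holds, board_width_cm):
    """Reflect every hold about the vertical centre line of the board."""
    return [(board_width_cm - x, y) for x, y in holds]

def reverse_sequence(holds):
    """Return the same holds in the opposite order."""
    return holds[::-1]

def move_reaches(holds):
    """Distance between consecutive holds."""
    return [math.dist(a, b) for a, b in zip(holds, holds[1:])]

holds = [(10, 0), (40, 40), (40, 100)]     # made-up climb
mirrored = mirror(holds, board_width_cm=140)
reversed_ = reverse_sequence(holds)
print(mirrored)                            # [(130, 0), (100, 40), (100, 100)]
print(sorted(move_reaches(holds)) == sorted(move_reaches(mirrored)))   # True
print(sorted(move_reaches(holds)) == sorted(move_reaches(reversed_)))  # True
```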

| Component | Key hyperparameters |
| --- | --- |
| XGBoost | 900 trees · lr 0.03 · depth 7 |
| XGBoost · sampling | subsample 0.8 · colsample 0.7 |
| LightGBM | GBDT · 64 leaves · hist |
| Ridge | α by 5-fold CV |
| Calibration | isotonic (PAV), monotone |
Validation gate: every retrain is scored on the frozen holdout and rolled back if it regresses, so the deployed model's holdout accuracy can hold steady or rise but never fall.
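The gate logic amounts to a single comparison, sketched here with stand-in models. The function name, the dict-based models, and the candidate scores are all hypothetical; only the 0.607 baseline echoes the frozen-holdout figure reported on this page.

```python
def gated_retrain(current_model, candidate_model, holdout_score):
    """Keep the candidate only if it does not score worse on the frozen holdout."""
    current = holdout_score(current_model)
    candidate = holdout_score(candidate_model)
    if candidate < current:
        return current_model, current    # roll back to the live model
    return candidate_model, candidate    # promote the retrained model

def within_1(model):
    """Stand-in scorer: the 'models' are dicts with a precomputed accuracy."""
    return model["within_1"]

live = {"name": "live", "within_1": 0.607}
worse = {"name": "retrain_a", "within_1": 0.598}
better = {"name": "retrain_b", "within_1": 0.611}
print(gated_retrain(live, worse, within_1)[0]["name"])   # live
print(gated_retrain(live, better, within_1)[0]["name"])  # retrain_b
```

Because the comparison always uses the same frozen routes, the guarantee applies to that holdout score only; accuracy on other data is not covered by the gate.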
Step 05 · Honest Evaluation
Measured leakage-free, reported straight

Predicting an exact V-grade is famously hard — a Kilter grade is a crowd-consensus average, and even expert setters disagree by ±1. So within-1 and within-2 accuracy are the metrics that actually matter.

Two protocols, no leakage

The frozen holdout is the strict, apples-to-apples yardstick: every model graded on identical routes excluded from all training. The test split gives the per-grade breakdown.

| Evaluation set | Exact | ±1 | ±2 |
| --- | --- | --- | --- |
| Frozen holdout | 23.5% | 60.7% | - |
| Test split (41.6K) | 27.1% | 68.7% | 83.0% |
Exact
27.1%
±1 grade
68.7%
±2 grades
83.0%

On the test split, 83% of predictions land within ±2 grades, and 68.7% land within the ±1 human-disagreement band.

Per-grade accuracy (within-1)

| Grade | Within-1 accuracy |
| --- | --- |
| V0 | 65.0% |
| V2 | 66.3% |
| V3 | 72.4% |
| V4 | 80.0% |
| V5 | 82.1% |
| V6 | 77.8% |
| V7 | 61.8% |
| V8 | 48.6% |
| V9 | 43.3% |

Strongest on core grades V4–V6 (the bulk of community climbing). V9+ remains the open challenge — limited by data scarcity.

Why scale matters

| Training corpus | Routes | Within-1 (frozen holdout) |
| --- | --- | --- |
| Baseline | 22.8K | 51.4% |
| Full graded | 139K | 57.9% |
| Full augmented | 357.9K | 60.7% |

Within-1 rises monotonically with corpus size — +9.3 points from 22.8K → 357.9K. Data scale is the dominant lever, which is why the autonomous scraper keeps collecting beta videos.