The Process

How holds become a grade.

From 137,000 community routes and 1.8M pose frames to a live difficulty prediction in under 50 ms — the full pipeline, the feature engineering, the mathematics, and how we measure accuracy honestly.

137K graded routes · 86 features · 83% within 2 grades · <50 ms live prediction

Step 01 · The Data Pipeline

From the wall to a feature matrix

Every climb starts as a hold sequence and ends as a row of numbers. The pipeline ingests community data, fuses it with pose data extracted from beta videos, and produces a clean feature matrix.

Aurora API · BoardLib: sync 137K community-graded routes across 0–70° angles

SQLite + PostgreSQL: 357K climbs stored; pose-fused corpus in Postgres

Instagram betas → MediaPipe: 33 body landmarks per frame → 1.8M pose frames

feature_extraction.py: reduce each climb to an 86-dimension vector

Stacked ensemble: XGBoost + LightGBM → Ridge → isotonic calibration

Flask API :5001 → PWA: live grade, uncertainty band, board visualizer

Codebase structure

pipeline/
    feature_extraction.py    # climb → 86 features
    beta_scraper.py          # Instagram pose mining (yt-dlp + MediaPipe)
    auto_scrape.py           # autonomous scrape + retrain supervisor (validation gate)
ml/
    difficulty_model.py      # XGBoost + LightGBM + Ridge stack, isotonic calibration
    grade_models.py          # frozen-holdout grading + cross-model leaderboard
    train_specialists.py     # tiered low / medium / hard experts
api/
    app.py                   # Flask server
    board_config.py          # per-size hold maps (7×10 → 16×12)
js/
    board.js · creator.js · stick-figure.js    # PWA board + pose animation

Step 02 · Feature Engineering

86 features in four families

Each climb is reduced to 86 features. Spatial features are computed in centimetres from hold (x, y) coordinates; pose features are imputed for every route by a model trained on real beta videos.

Geometry

Reach distances, lateral/vertical span, hold density, and wall-zone counts — the raw shape of the climb.

Sequence

Crux (longest) move, move-distance variance, top vertical gains, path linearity, dyno score — the flow between holds.

Angle interactions

angle×reach, angle×dyno, angle×span. The same holds are a different climb at 40° than at 0° — these features encode that.

Pose-imputed

Body tension, hip angle, arm extension — inferred from MediaPipe's 33 landmarks on scraped beta videos.

The two key transforms

$$ z_j = \frac{x_j - \mu_j}{\sigma_j} \tag{1} $$

Per-feature standardization to zero mean, unit variance.

$$ x_{r \times \alpha} = r \cdot \sin(\alpha), \qquad \alpha = \text{board angle} \tag{2} $$

Angle interaction — steeper walls amplify the effective reach and load of every move. Reach between consecutive holds i, i+1 is r = √[(Δx)² + (Δy)²]; the crux is max_i r_i.
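The two transforms and the reach definition can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not the contents of feature_extraction.py: the function names and the example hold coordinates are hypothetical, and the standardization uses the population standard deviation.

```python
import math
from statistics import mean, pstdev

def move_reaches(holds):
    """Euclidean distance (cm) between consecutive holds given as (x, y) pairs."""
    return [math.hypot(x2 - x1, y2 - y1)
            for (x1, y1), (x2, y2) in zip(holds, holds[1:])]

def angle_reach(reach_cm, board_angle_deg):
    """Eq. (2): reach scaled by the sine of the board angle."""
    return reach_cm * math.sin(math.radians(board_angle_deg))

def standardize(column):
    """Eq. (1): z-score one feature column (population standard deviation)."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # constant column carries no signal
    return [(x - mu) / sigma for x in column]

holds = [(0, 0), (30, 40), (30, 100)]   # hypothetical hold positions in cm
reaches = move_reaches(holds)           # [50.0, 60.0]
crux = max(reaches)                     # longest move: 60.0
crux_at_40 = angle_reach(crux, 40)      # crux reach weighted for a 40° board
```

Note that under Eq. (2) the interaction term is zero on a vertical (0°) board, so the raw reach features still carry the geometry there.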

Board angle and its interactions are the model's three strongest features by gain (angle_group 0.25, board_angle 0.16, angle×reach 0.15), so the angle terms above carry much of the predictive weight.

Step 03 · The Mathematics

Every piece, formally

A stacked ensemble regresses a continuous difficulty score ŝ ∈ [0,1], which is calibrated and mapped to a V-grade.

Prediction & ensemble

$$ \hat{g}(x) = \arg\min_{g \in \mathcal{G}} \left| C(F(x)) - \mu_g \right| \tag{3} $$

The predicted grade is the nearest isotonic-calibrated grade centroid μ_g.
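Eq. (3) is a nearest-centroid lookup. A minimal sketch follows; the centroid values are made up for illustration and are not the project's calibrated centroids.

```python
def nearest_grade(calibrated_score, centroids):
    """Eq. (3): return the grade whose centroid is closest to the calibrated score."""
    return min(centroids, key=lambda g: abs(calibrated_score - centroids[g]))

# Hypothetical centroids on the calibrated score scale.
centroids = {"V3": 0.30, "V4": 0.40, "V5": 0.50, "V6": 0.60}
print(nearest_grade(0.47, centroids))   # V5
```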

$$ F(x) = \sum_{m=1}^{M} \eta \, h_m(x), \qquad h_m \in \mathcal{H}_{\mathrm{CART}} \tag{4} $$

A gradient-boosted ensemble: M regression trees added stagewise with learning rate η.

$$ \mathcal{L}(F) = \sum_i \ell(y_i, \hat{y}_i) + \sum_m \left[ \gamma T_m + \tfrac{1}{2} \lambda \lVert w_m \rVert^2 \right] \tag{5} $$

Regularized objective: squared-error loss plus penalties on each tree's leaf count T_m and leaf weights w_m.

Stacking & consensus weighting

$$ \hat{s} = \beta_0 + \beta_1 F_{\mathrm{XGB}}(x) + \beta_2 F_{\mathrm{LGB}}(x) \tag{6} $$

A Ridge meta-learner blends the two base-model scores.

$$ \beta^* = \arg\min_{\beta} \lVert y - Z\beta \rVert^2 + \alpha \lVert \beta \rVert^2 \tag{7} $$

L2-regularized least squares over the base-prediction matrix Z.
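For two base models, Eq. (7) has a small closed form. The sketch below solves the 2×2 normal equations directly. It assumes the intercept β₀ is left unpenalized (the data are centred first), which the text does not specify; `ridge_stack` and the toy predictions are hypothetical.

```python
def ridge_stack(z_xgb, z_lgb, y, alpha=1.0):
    """Fit s = b0 + b1*z_xgb + b2*z_lgb by ridge regression (alpha > 0)."""
    n = len(y)
    m1, m2, my = sum(z_xgb) / n, sum(z_lgb) / n, sum(y) / n
    a = [v - m1 for v in z_xgb]   # centred XGBoost predictions
    b = [v - m2 for v in z_lgb]   # centred LightGBM predictions
    t = [v - my for v in y]       # centred targets
    # Normal equations (Z^T Z + alpha*I) beta = Z^T y for the centred system.
    s11 = sum(u * u for u in a) + alpha
    s22 = sum(v * v for v in b) + alpha
    s12 = sum(u * v for u, v in zip(a, b))
    r1 = sum(u * w for u, w in zip(a, t))
    r2 = sum(v * w for v, w in zip(b, t))
    det = s11 * s22 - s12 * s12   # positive whenever alpha > 0
    b1 = (r1 * s22 - r2 * s12) / det
    b2 = (r2 * s11 - r1 * s12) / det
    b0 = my - b1 * m1 - b2 * m2   # intercept recovered from the means
    return b0, b1, b2

# Toy out-of-fold predictions; the target is their exact average.
z1 = [0.1, 0.4, 0.6, 0.9]
z2 = [0.2, 0.3, 0.7, 0.8]
y = [(u + v) / 2 for u, v in zip(z1, z2)]
weak = ridge_stack(z1, z2, y, alpha=1e-9)    # close to (0, 0.5, 0.5)
strong = ridge_stack(z1, z2, y, alpha=1.0)   # coefficients shrink toward 0
```

Base-model predictions are usually highly correlated, which is where the L2 penalty helps: it keeps the 2×2 system well conditioned and stops the two weights from diverging in opposite directions.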

$$ w_i = \log\left(1 + \min(a_i, 10^4)\right) \cdot q_i \tag{8} $$

Each route is weighted by its ascents a_i and quality q_i — well-validated grades dominate noisy ones.
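Eq. (8) translates directly into code. The sketch assumes a natural logarithm (the base is not stated) and a hypothetical function name; the example quality values are invented.

```python
import math

def consensus_weight(ascents, quality, cap=10_000):
    """Eq. (8): log-damped ascent count, capped at `cap`, scaled by quality."""
    return math.log(1 + min(ascents, cap)) * quality

print(consensus_weight(0, 3.0))        # 0.0: a route nobody has sent gets no weight
print(consensus_weight(50, 3.0) < consensus_weight(5000, 3.0))   # True
```

The logarithm keeps a route with thousands of ascents from swamping the loss, and the cap makes every route beyond 10⁴ ascents count the same.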

Calibration, uncertainty & metrics

$$ C^* = \arg\min_{C \uparrow} \sum_i \left( C(\hat{s}_i) - y_i \right)^2 \tag{9} $$

Isotonic (monotone) calibration via pool-adjacent-violators maps raw scores to true grades.
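Pool-adjacent-violators is short enough to write out. This is an unweighted sketch for illustration (a production pipeline would more likely use a library implementation); `isotonic_fit` is a hypothetical name and the data are toy values.

```python
def isotonic_fit(scores, targets):
    """Return (sorted_scores, fitted_values) with fitted values non-decreasing."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    xs = [scores[i] for i in order]
    ys = [targets[i] for i in order]
    blocks = []  # each block is [sum_of_targets, count]
    for y in ys:
        blocks.append([y, 1])
        # Pool backwards while the previous block's mean exceeds the last block's mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return xs, fitted

xs, fitted = isotonic_fit([0.1, 0.2, 0.3, 0.4], [1, 3, 2, 4])
print(fitted)   # [1.0, 2.5, 2.5, 4.0]
```

The out-of-order pair (3, 2) is pooled to its mean 2.5, which is the least-squares solution subject to the monotonicity constraint in Eq. (9).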

$$ \rho_\tau(u) = u \left( \tau - \mathbb{1}[u < 0] \right), \qquad \tau \in \{0.1, 0.9\} \tag{10} $$

Pinball loss on the residual u = y − ŝ trains the q10/q90 quantiles, giving the uncertainty band [ŝ_q10, ŝ_q90] (e.g. "V5, range V4–V6").
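The asymmetry in Eq. (10) is easiest to see numerically. A minimal sketch, with a hypothetical function name and toy values:

```python
def pinball_loss(y_true, y_pred, tau):
    """Eq. (10) with u = y_true - y_pred."""
    u = y_true - y_pred
    return u * (tau - (1.0 if u < 0 else 0.0))

# For the q90 model, under-predicting by one grade costs about nine times
# more than over-predicting by one grade.
under = pinball_loss(5, 4, tau=0.9)   # u = +1, loss about 0.9
over = pinball_loss(5, 6, tau=0.9)    # u = -1, loss about 0.1
```

That imbalance pushes the τ = 0.9 model toward the upper edge of the plausible grades, and the τ = 0.1 model toward the lower edge.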

$$ \mathrm{Acc@}k = \frac{1}{N} \sum_i \mathbb{1}\left[ |\hat{g}_i - g_i| \le k \right] \tag{11} $$

Within-k grade accuracy — the metric that matters, since human graders themselves disagree by ±1.
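Eq. (11) as a function, assuming grades are encoded as integers (V3 → 3); the encoding and the toy predictions are assumptions for illustration.

```python
def acc_at_k(predicted, actual, k):
    """Eq. (11): share of predictions within k grades of the true grade."""
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= k)
    return hits / len(actual)

predicted = [3, 5, 6, 8]
actual = [3, 4, 8, 5]      # absolute errors: 0, 1, 2, 3
print(acc_at_k(predicted, actual, 0))   # 0.25
print(acc_at_k(predicted, actual, 1))   # 0.5
print(acc_at_k(predicted, actual, 2))   # 0.75
```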

$$ s_p = \hat{s} + \sum_k \beta_k \frac{\theta_k - \bar{\theta}_k}{\sigma_k} \tag{12} $$

Personal difficulty: re-scores a grade for the climber's body, θ ∈ {height, ape index, weight}.
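Eq. (12) adds one z-scored adjustment per body measurement. In the sketch below every number is invented for illustration: the coefficients, the population means and the standard deviations are not the project's fitted values.

```python
def personal_score(base_score, climber, population, betas):
    """Eq. (12): shift the score by beta_k times the climber's z-score on trait k."""
    adjustment = sum(
        betas[k] * (climber[k] - population[k][0]) / population[k][1]
        for k in betas
    )
    return base_score + adjustment

# Hypothetical population (mean, standard deviation) and coefficients.
population = {"height_cm": (175.0, 7.5), "ape_index_cm": (0.0, 4.0), "weight_kg": (70.0, 10.0)}
betas = {"height_cm": -0.02, "ape_index_cm": -0.01, "weight_kg": 0.01}

average = {"height_cm": 175.0, "ape_index_cm": 0.0, "weight_kg": 70.0}
tall = {"height_cm": 190.0, "ape_index_cm": 0.0, "weight_kg": 70.0}

print(personal_score(0.50, average, population, betas))   # 0.5: no adjustment at the mean
tall_score = personal_score(0.50, tall, population, betas)   # about 0.46
```

A climber who matches the population mean on every trait gets the unadjusted score back, which is a useful sanity check on any fitted coefficients.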

Step 04 · Architecture & Training

Two boosted models, stacked & calibrated

A Ridge meta-learner blends XGBoost and LightGBM, then isotonic calibration maps the score to a grade. A validation gate rolls back any retrain that regresses on the frozen holdout, so the deployed model's holdout accuracy never drops.

Architecture

86-feature climb vector x
▼
XGBoost (900 trees) + LightGBM (GBDT)
▼
Ridge meta-learner (stacking)
▼
Isotonic calibration → V-grade
▼
Outputs: grade · q10–q90 band · personal pd-score

Training protocol

Trained on 357,928 rows with mirror + sequence-reversal augmentation, hard-grade oversampling, and consensus weighting (Eq. 8). Singleton-grade routes are filtered, and frozen-holdout route IDs are dropped to prevent leakage.
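Mirror and sequence-reversal augmentation can be sketched on a list of hold coordinates. This is a guess at the general idea, not the project's implementation: it assumes holds are (x, y) pairs in centimetres, that mirroring reflects x across the board's vertical centre line, and that each variant keeps the original grade label. `board_width_cm` and the function names are hypothetical.

```python
def mirror(holds, board_width_cm):
    """Reflect every hold across the board's vertical centre line."""
    return [(board_width_cm - x, y) for x, y in holds]

def reverse_sequence(holds):
    """Same holds, visited in the opposite order."""
    return list(reversed(holds))

def augment(holds, board_width_cm):
    """Original route plus its mirrored, reversed, and mirrored-reversed variants."""
    mirrored = mirror(holds, board_width_cm)
    return [holds, mirrored, reverse_sequence(holds), reverse_sequence(mirrored)]

route = [(10, 0), (30, 50), (60, 120)]   # hypothetical hold positions in cm
variants = augment(route, board_width_cm=100)
```

Whichever variants are generated, they need to inherit the source route's split assignment, otherwise a mirrored copy of a holdout route would leak into training.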

Component	Key hyperparameters
XGBoost	900 trees · lr 0.03 · depth 7 · subsample 0.8 · colsample 0.7
LightGBM	GBDT · 64 leaves · hist
Ridge	α by 5-fold CV
Calibration	isotonic (PAV), monotone

Validation gate: every retrain is scored on the frozen holdout and rolled back if it regresses, so the deployed model's holdout accuracy can only hold or move up.
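The gate reduces to a single comparison on the frozen-holdout metric. A minimal sketch, assuming within-1 accuracy is the gated metric and that a tie counts as "no regression"; the function name and the `min_gain` parameter are hypothetical.

```python
def validation_gate(current_acc, candidate_acc, min_gain=0.0):
    """Promote a retrained model only if it does not regress on the frozen holdout."""
    if candidate_acc >= current_acc + min_gain:
        return "promote"
    return "rollback"

print(validation_gate(0.579, 0.607))   # promote
print(validation_gate(0.607, 0.600))   # rollback
```

Setting `min_gain` slightly above zero would require a retrain to beat the current model by a margin, which guards against promoting on noise.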

Step 05 · Honest Evaluation

Measured leakage-free, reported straight

Predicting an exact V-grade is famously hard — a Kilter grade is a crowd-consensus average, and even expert setters disagree by ±1. So within-1 and within-2 accuracy are the metrics that actually matter.

Two protocols, no leakage

The frozen holdout is the strict, apples-to-apples yardstick: every model graded on identical routes excluded from all training. The test split gives the per-grade breakdown.

Evaluation set	Exact	±1	±2
Frozen holdout	23.5%	60.7%	—
Test split (41.6K)	27.1%	68.7%	83.0%

Test split: exact 27.1% · ±1 grade 68.7% · ±2 grades 83.0%

On the test split, 68.7% of predictions land within the ±1 human-disagreement band, and 83.0% land within ±2 grades.

Per-grade accuracy (within-1)

Within-1 accuracy by grade bucket, in chart order: 65.0% · 66.3% · 72.4% · 80.0% · 82.1% · 77.8% · 61.8% · 48.6% · 43.3%

Strongest on core grades V4–V6 (the bulk of community climbing). V9+ remains the open challenge — limited by data scarcity.

Why scale matters

Training corpus	Routes	Within-1 (frozen holdout)
Baseline	22.8K	51.4%
Full graded	139K	57.9%
Full augmented	357.9K	60.7%

Within-1 rises monotonically with corpus size — +9.3 points from 22.8K → 357.9K. Data scale is the dominant lever, which is why the autonomous scraper keeps collecting beta videos.
