After hand-coded fitness functions kept missing degenerate gaits (the reward-hacking saga) and after LLM-iterated reward design hit the survival cliff (the Eureka post), we tried the obvious next thing: let a vision-language model judge the rollouts.
The hypothesis was that human-like qualitative judgment would catch the failure modes our metrics couldn't articulate. We swapped the hand-coded fitness scorer for Claude (Opus 4.7) looking at 8 rendered keyframes per rollout, scoring 0–100, and writing a plain-language critique.
It half-worked.
The v18 "winner" passed the VLM's is_walking=True check, and on careful inspection had taken zero real steps — its pelvis dropped 26 cm over 30 frames as the robot fell forward. The 8 keyframes captured a moment with bipedal pose during the collapse. The VLM was fooled the same way metrics had been: by static-looking stills that flatter the actual physical trajectory.

This is a case study in honest fitness. We thought adding a VLM would close the supervision gap. It narrowed the gap but didn't close it, and the next gap — frame-time disguise of collapses — needed a different fix.
The fix we landed on: a strict step counter (foot lift ≥ 4 cm, hold ≥ 3 frames, land ≥ 6 cm ahead of where it lifted) cross-validated against the VLM, with results always reported in seconds, not frames. Under that gate, both v17's "best" and v18's "best" score zero real steps. The v19 follow-up — same VLM critic but with multi-angle temporal sampling and warm-start from a working mocap policy — produced 4 of 12 candidates that pass the strict gate. One of them, iter 2 cand 1, is the genuine qualitative winner: pelvis stable, real bipedal walking, sustained 5+ seconds.
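The step counter is small enough to sketch in full. This is a minimal sketch, not the project code: it assumes per-frame arrays of one foot's height and forward position in meters, and uses the thresholds stated above (4 cm lift, 3-frame hold, 6 cm forward landing).

```python
LIFT_M = 0.04      # foot must rise at least 4 cm off the floor
HOLD_FRAMES = 3    # ...and stay above that height for at least 3 frames
ADVANCE_M = 0.06   # ...and touch down at least 6 cm ahead of liftoff

def count_real_steps(foot_z, foot_x):
    """Count lift-hold-land cycles for one foot.

    foot_z, foot_x: per-frame height and forward position (meters).
    Feet swapping during a fall produce lifts that are too brief or
    land in place, so they score zero under this gate.
    """
    steps = 0
    i = 0
    n = len(foot_z)
    while i < n:
        if foot_z[i] >= LIFT_M:
            start = i
            lift_x = foot_x[max(i - 1, 0)]  # position just before liftoff
            # Measure how long the foot stays above the lift threshold.
            while i < n and foot_z[i] >= LIFT_M:
                i += 1
            held = i - start
            # A real step: held long enough AND landed far enough ahead.
            if held >= HOLD_FRAMES and i < n and foot_x[i] - lift_x >= ADVANCE_M:
                steps += 1
        else:
            i += 1
    return steps
```

A brief airborne flicker of the foot, the signature of a collapse-swap, fails the hold test; a lift that lands where it started fails the advance test. Either failure alone is enough to zero the step.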
This post tells the whole story honestly, retraction included.
Hand-coded fitness for humanoid walking is fragile. Across this project we kept encountering policies that satisfied every criterion we had encoded while doing something visually wrong.
Every metric we added caught the immediate failure mode and missed the next one. The general pattern: the space of "satisfies the reward" is strictly larger than the space of "looks correct," and policy optimizers are excellent at finding the gap.
A VLM trained on millions of human walking images doesn't have that limitation by construction. It sees rendered keyframes and describes them in natural language. A single-leg hopper looks nothing like walking; the VLM should say so. Crucially, the VLM cannot be gamed by satisfying specific numerical criteria — it judges rendered images, which encode the entire physical pose.
That was the hypothesis going in. Bear with me through the results.
VLM critic. Claude Opus 4.7 via the Claude Agent SDK, using subscription credentials. 8 keyframes per rollout, side view, rendered at 320×320 from MuJoCo. The prompt asks for three things in JSON: a 0–100 score, an is_walking boolean, and a plain-language critique.
Latency: ~10–15 seconds per critique. API cost: zero incremental (subscription).
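Because the critique comes back as JSON embedded in free-form model output, a defensive parser is worth the few lines. A minimal sketch, assuming the three fields are named score, is_walking, and critique (hypothetical names; our actual prompt schema may differ):

```python
import json
from dataclasses import dataclass

@dataclass
class Critique:
    score: int        # 0-100 quality score
    is_walking: bool  # VLM's walking verdict
    critique: str     # plain-language failure description

def parse_critique(raw: str) -> Critique:
    """Parse the VLM's JSON reply, tolerating surrounding prose."""
    # Models sometimes wrap the JSON in commentary; take the span
    # from the first '{' to the last '}'.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in VLM reply")
    obj = json.loads(raw[start:end + 1])
    return Critique(
        score=int(obj["score"]),
        is_walking=bool(obj["is_walking"]),
        critique=str(obj["critique"]),
    )
```

Failing loudly on malformed replies matters here: a silently defaulted score would feed garbage back into the proposer's history of attempts.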
LLM proposer. Same Eureka loop as v17 (4 iterations × 3 candidates each). The proposer's prompt now describes the VLM critic as the scorer and includes prior VLM descriptions verbatim in the history of attempts. The proposer can see what the VLM said about previous candidates, not just the scores.
Training env. Same UnitreeG1ConfigurableEnv as v17, cold-start, with hard termination on pelvis below 0.55 m and non-foot floor contact. Identical baseline so the comparison is apples-to-apples.
Training budget. 2.5M PPO steps per candidate, 16 parallel envs, ~20 minutes wallclock each. 12 candidates total, ~4 hours.
The comparison table, reported in human time scale rather than control frames:
| Property | v17 (hand-coded fitness) | v18 (VLM critique) |
|---|---|---|
| Best candidate score | 73.8 / 100 | 62 / 100 |
| Episode duration (best) | 61 frames = 1.2 seconds | 30 frames = 0.6 seconds |
| Real steps taken (lift + land at new x ≥ 6 cm) | 0 | 0 |
| Pelvis drop start → end | ~25 cm | 26 cm |
| Forward distance traveled | 0.27 m | 0.09 m |
| Visual: is it walking? | No — backward collapse | No — forward collapse |
Neither "winner" walks. Both fall in under 1.3 seconds, less than the duration of a single human walking step. The "bipedal alternation" we saw in metrics and in the v18 keyframes was the policy's feet swapping as it fell, not actual stepping.
The first draft of the v18 result claimed this was a "real walker" — the VLM said is_walking=True, 62/100, and the keyframes showed bipedal pose. The retraction came when we added the strict step counter and re-rendered the same rollout in real time. 30 frames at 50 Hz is 0.6 seconds. No human walking step completes in 0.6 seconds, let alone the full stance-swing-stance cycle that constitutes a step.
The keyframes captured a moment with bipedal pose during the collapse. The VLM, judging stills, couldn't see the temporal trajectory that ended in a face-plant.
It's the same failure mode as the metrics — different layer, same shape. The metric scored "single-stance fraction > 30%" without knowing the policy was always on the same foot. The VLM scored "looks bipedal across 8 frames" without knowing those 8 frames spanned 0.6 seconds of falling. Each layer was correct given what it could see, and each was wrong in the same way: by being shown an artifact that flattered the underlying trajectory.
This was, again, embarrassing. Worth being explicit about.
Even retracted as a "result," v18 produced two genuinely new things compared to v17.
Every v18 candidate scored lower than its v17 equivalent would have:
| v18 iter | v18 cand | VLM score | What hand-coded fitness would have said |
|---|---|---|---|
| 1 | 0 | 62 | ~75 (matches v17 best) |
| 3 | 1 | 38 | ~70 (passes most v17 criteria) |
| 0 | 0 | 18 | 65 (this is the v17 first-iteration policy, exactly) |
The VLM is consistently 13–47 points harsher than the hand-coded scorer on the same policies. Policies that look similar by metrics get distinct scores from the VLM, and the VLM's score correlates with what a human observer would say.
That's a real upgrade. The Eureka loop trains the LLM on the score signal. When the score signal is harsher and more correlated with human judgment, the LLM's proposals improve. We see this empirically in the next subsection.
Iter 0 of v18 produced three candidates, all scored 14–18 by the VLM, and all flagged with closely related plain-language critiques naming the same backward-pitch failure.
This is the kind of feedback an experienced motion researcher might give. Notice that all three reach the same diagnosis in slightly different words — the VLM is consistent across rollouts when the underlying failure is the same.
In iter 1, the first candidate's reward-design rationale was explicit:
"Stability-gated locomotion — velocity reward is multiplied by an uprightness gate so the policy only earns walking pay after it earns standing, plus explicit angular-velocity damping to fix the 'pitches backward' failure that killed all three priors."
That candidate scored 62 — a 44-point jump in a single iteration — and was confirmed as walking by the VLM (though, as the retraction shows, on artifacts the VLM couldn't see through).
The point isn't the 44-point jump on its own. It's that the LLM proposed a structural fix to a specific failure mode named by the VLM, rather than re-trying the same shape twice. That doesn't happen with metric-only feedback because metric-only feedback can't name structural failures.
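The gated structure the iter-1 candidate describes can be sketched concretely. This is an illustrative reading of the rationale above, not the candidate's actual reward code; the coefficients, the tilt measure, and the exponential velocity shaping are assumptions.

```python
import math

def stability_gated_reward(fwd_vel, tilt_rad, ang_vel, target_vel=1.0,
                           max_tilt=0.4, damp=0.05):
    """Stability-gated locomotion reward (sketch).

    fwd_vel: forward pelvis velocity (m/s); tilt_rad: torso tilt from
    vertical (rad); ang_vel: base angular velocity components (rad/s).
    """
    # Uprightness gate in [0, 1]: the policy earns walking pay
    # only after it earns standing pay.
    gate = max(0.0, 1.0 - abs(tilt_rad) / max_tilt)
    # Velocity reward peaks at the target forward speed.
    vel_reward = math.exp(-(fwd_vel - target_vel) ** 2)
    # Angular-velocity damping targets the "pitches backward" failure.
    return gate * vel_reward - damp * sum(w * w for w in ang_vel)
```

The multiplicative gate is the structural point: a falling policy gets zero credit for forward velocity, where an additive bonus would still pay it for moving while it collapses.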
Several failure modes became visible only with VLM critique.
The hand-coded fitness had no view into any of these. A policy could splay its arms unnaturally, jump-launch, or collapse its pelvis while still satisfying every fitness criterion we'd encoded. The VLM names these failure modes; the LLM can then address them specifically.
The survival cliff persists. Best v18 policy walks for 30 frames before falling forward. Best v17 policy "walks" for 61 frames then falls. The underlying RL problem — that PPO cannot learn balance recovery from cold-start trajectories where it has never seen a near-fall — is not affected by changing the scorer. The reward function, however carefully designed, cannot produce training data that doesn't exist.
VLM critique is an evaluation upgrade, not a training upgrade. It tells you when your policy is wrong, more accurately than metrics do, but it doesn't make your policy right.
After v18 we built what we should have started with: a layered evaluation stack. The strict step counter, the VLM critique, and human-time reporting; each layer catches a class of failure the previous one missed.

Under the combined gate, is_walking=True requires ≥ 4 real steps AND ≤ 10 cm total pelvis drop AND ≥ 5 seconds duration, and reports are always in seconds, not frames. Both v17's "best" and v18's "best" score 0 real steps and is_walking=False. The strict gate caught what the prior gates missed.
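The gate itself is tiny. A sketch, assuming the 50 Hz control rate used in the frame arithmetic throughout this post:

```python
CONTROL_HZ = 50  # control frames per second, matching this post's math

def frames_to_seconds(frames: int) -> float:
    """Always convert before reporting: frames flatter, seconds don't."""
    return frames / CONTROL_HZ

def strict_is_walking(real_steps: int, pelvis_drop_m: float,
                      episode_frames: int) -> bool:
    """All three conditions must hold; no single signal grants walking."""
    duration_s = frames_to_seconds(episode_frames)
    return (real_steps >= 4
            and pelvis_drop_m <= 0.10
            and duration_s >= 5.0)
```

Run against the numbers in the comparison table, the gate rejects both earlier "winners" and passes only a sustained, pelvis-stable, genuinely stepping rollout.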
Once we trusted the gate, we ran the loop again with three changes: multi-angle temporal sampling for the VLM, warm-start from the working mocap policy, and a cross-validation rule under which is_walking=True requires both the VLM saying yes and the step counter passing. No single signal can grant walking on its own.

12 candidates, same compute budget as v17 and v18. Under the strict gate:
| | v17 (cold-start, hand-coded) | v18 (cold-start, single-angle VLM) | v19 (warm-start + multi-angle temporal VLM + step counter) |
|---|---|---|---|
| Candidates passing strict walking gate | 0 / 12 | 0 / 12 | 4 / 12 |
| Best episode duration | 1.2 s | 0.6 s | 5+ s (sustained, capped at 10s rendering) |
| Best pelvis drop | ~25 cm | 26 cm | < 10 cm (one candidate fully stable) |
| Real steps in best rollout | 0 | 0 | ≥ 4 (cross-validated by step counter) |
The qualitative winner is v19 iter 2 cand 1 — VLM score 62, pelvis stable, 6 real steps in 6 seconds. It's the only candidate in the full v17/v18/v19 sweep that fixed the persistent pelvis-drop issue.
The reward for that candidate was a Gaussian uprightness gate plus an asymmetric pelvis penalty — the LLM diagnosed iter 1's failure (an explicit pelvis-height anchor caused a freeze-in-place trap) and proposed a smoother version that doesn't trigger over-rigidity. That kind of corrective iteration is what the loop is for.
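The shape of that fix can be sketched. This is an illustrative reconstruction from the description above, not the candidate's actual reward; sigma, the penalty weight, and the use of a pelvis-height delta are all assumptions.

```python
import math

def gaussian_gate_reward(fwd_vel, tilt_rad, pelvis_delta_m,
                         sigma=0.25, drop_weight=5.0):
    """Gaussian uprightness gate plus asymmetric pelvis penalty (sketch).

    pelvis_delta_m: pelvis height relative to its starting height
    (negative = dropped).
    """
    # Smooth Gaussian gate: no hard height anchor to freeze against,
    # unlike the explicit pelvis-height anchor that trapped iter 1.
    gate = math.exp(-(tilt_rad / sigma) ** 2)
    # Asymmetric: only pelvis *drops* are penalized, so small rises
    # and natural bobbing cost nothing and rigidity isn't rewarded.
    drop_pen = drop_weight * max(0.0, -pelvis_delta_m)
    return gate * fwd_vel - drop_pen
```

The asymmetry is the interesting design choice: a symmetric height penalty recreates the freeze-in-place trap, because the cheapest way to hold an exact height is to stop moving.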
Side-by-side with v12 (the working mocap baseline) for calibration:
v19 candidates stay close to v12 quality. They don't dramatically exceed it. The base mimic reward keeps pulling the refinement back toward the LAFAN1 mocap. Refinement-on-top-of-mocap is real and produces a measurable improvement on the pelvis-stability axis, but it doesn't escape the mocap's quality ceiling.
Even with the better evaluation stack, real limits remained.
The cap on improvement is v12 itself. v19 brings refinement candidates close to v12 quality from below and one candidate goes slightly above on the pelvis-stability axis. None go dramatically above the v12 baseline, because the base mimic reward keeps pulling the policy back. To get fundamentally better than the mocap baseline, you'd need a richer reference dataset or a different supervision signal entirely.
This post does not claim "VLM critique solves humanoid bipedal locomotion." It does not. The cold-start survival cliff is unbroken by VLM critique alone — that's the v18 honest negative result. The warm-start v19 produced refinement-quality improvements, not new capability.
What VLM critique demonstrably does: score policies substantially harsher than hand-coded metrics and in line with human judgment, name failure modes the metrics can't articulate, and give the proposer structural feedback it can act on.
That's a useful contribution. It's not perfect human walking. We won't pretend otherwise.
VLM critique helps when the gap is evaluative: when your metrics can't articulate what "looks wrong" means, and when naming a failure mode lets the proposer fix it structurally.

VLM critique does not help when the gap is in the training data: no scorer, however well calibrated, fixes a policy that has never seen a near-fall. And it can be fooled whenever the artifact it judges (static keyframes) disguises the underlying trajectory.
The step counter / human-time-scale layer is essentially free — a few hundred lines of code that runs in real time — and should be on regardless of what evaluation method you use elsewhere. Always report in seconds. "Survives 30 frames" sounds like progress; "survives 0.6 seconds" doesn't. The unit choice is doing work.
A few directions remain open; we leave them for future work.
This is the last post in the humanoid-sim series. Five posts, one project, a lot of honest failure. The pattern across all of them: every "let's automate the supervision" attempt produced a new class of failure, not zero failures. Hand-coded metrics missed degenerate gaits. LLM-iterated rewards explored the design space without solving coverage. VLM critique closed some gaps and opened new ones around frame-time disguise. The reliable signal in this project remained the same as it started: a real motion-capture reference and the minimum-viable-mocap result that two carefully-ordered poses are enough.
The constructive arc — what actually walks — is the mocap post and the v19 refinement on top of mocap. The diagnostic arc — what doesn't — is reward hacking, LLM iteration, and this one. Both arcs are necessary, and both have to be honest, because the field is small enough that papers without retractions tend to be papers that haven't yet noticed their retractions.
If there's one practitioner lesson across all five posts, it's the one this post ended on: always lead with seconds. Whatever your evaluation function is, when you state your result, state it in time units that a human cares about. 30 frames sounds like progress. 0.6 seconds sounds like a collapse. They are the same fact. The choice of unit is doing the work of either flattering or honest reporting, and the field benefits from the honest one.