After hand-coded fitness functions kept missing degenerate gaits (the reward-hacking saga) and after LLM-iterated reward design hit the survival cliff (the Eureka post), we tried the obvious next thing: let a vision-language model judge the rollouts.
The hypothesis was that human-like qualitative judgment would catch the failure modes our metrics couldn't articulate. We swapped the hand-coded fitness scorer for Claude (Opus 4.7) looking at 8 rendered keyframes per rollout, scoring 0–100, and writing a plain-language critique.
It half-worked.
The v18 "winner" passed the VLM's is_walking=True check, and on careful inspection had taken zero real steps — its pelvis dropped 26 cm over 30 frames as the robot fell forward. The 8 keyframes captured a moment with bipedal pose during the collapse. The VLM was fooled the same way metrics had been: by static-looking stills that flatter the actual physical trajectory.

This is a case study in honest fitness. We thought adding a VLM would close the supervision gap. It narrowed the gap but didn't close it, and the next gap — frame-time disguise of collapses — needed a different fix.
The fix we landed on: a strict step counter (foot lift ≥ 4 cm, hold ≥ 3 frames, land ≥ 6 cm ahead of where it lifted) cross-validated against the VLM, with results always reported in seconds, not frames. Under that gate, both v17's "best" and v18's "best" score zero real steps. The v19 follow-up — same VLM critic but with multi-angle temporal sampling and warm-start from a working mocap policy — produced 4 of 12 candidates that pass the strict gate. One of them, iter 2 cand 1, is the genuine qualitative winner: pelvis stable, real bipedal walking, sustained 5+ seconds.
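The step counter is small enough to sketch in full. This is a minimal sketch, not the project code: it assumes per-frame arrays of one foot's height and forward position in meters, and uses the thresholds stated above (4 cm lift, 3-frame hold, 6 cm forward landing).

```python
LIFT_M = 0.04      # foot must rise at least 4 cm off the floor
HOLD_FRAMES = 3    # ...and stay above that height for at least 3 frames
ADVANCE_M = 0.06   # ...and touch down at least 6 cm ahead of liftoff

def count_real_steps(foot_z, foot_x):
    """Count lift-hold-land cycles for one foot.

    foot_z, foot_x: per-frame height and forward position (meters).
    Feet swapping during a fall produce lifts that are too brief or
    land in place, so they score zero under this gate.
    """
    steps = 0
    i = 0
    n = len(foot_z)
    while i < n:
        if foot_z[i] >= LIFT_M:
            start = i
            lift_x = foot_x[max(i - 1, 0)]  # position just before liftoff
            # Measure how long the foot stays above the lift threshold.
            while i < n and foot_z[i] >= LIFT_M:
                i += 1
            held = i - start
            # A real step: held long enough AND landed far enough ahead.
            if held >= HOLD_FRAMES and i < n and foot_x[i] - lift_x >= ADVANCE_M:
                steps += 1
        else:
            i += 1
    return steps
```

A brief airborne flicker of the foot, the signature of a collapse-swap, fails the hold test; a lift that lands where it started fails the advance test. Either failure alone is enough to zero the step.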
This post tells the whole story honestly, retraction included.
Hand-coded fitness for humanoid walking is fragile. Across this project we kept encountering policies that satisfied every criterion we had encoded while doing something visually wrong.
Every metric we added caught the immediate failure mode and missed the next one. The general pattern: the space of "satisfies the reward" is strictly larger than the space of "looks correct," and policy optimizers are excellent at finding the gap.
A VLM trained on millions of human walking images doesn't have that limitation by construction. It sees rendered keyframes and describes them in natural language. A single-leg hopper looks nothing like walking; the VLM should say so. Crucially, the VLM cannot be gamed by satisfying specific numerical criteria — it judges rendered images, which encode the entire physical pose.
That was the hypothesis going in. Bear with me through the results.
VLM critic. Claude Opus 4.7 via the Claude Agent SDK, using subscription credentials. 8 keyframes per rollout, side view, rendered at 320×320 from MuJoCo. The prompt asks for three things in JSON: a 0–100 score, an is_walking boolean, and a plain-language critique.
Latency: ~10–15 seconds per critique. API cost: zero incremental (subscription).
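Because the critique comes back as JSON embedded in free-form model output, a defensive parser is worth the few lines. A minimal sketch, assuming the three fields are named score, is_walking, and critique (hypothetical names; our actual prompt schema may differ):

```python
import json
from dataclasses import dataclass

@dataclass
class Critique:
    score: int        # 0-100 quality score
    is_walking: bool  # VLM's walking verdict
    critique: str     # plain-language failure description

def parse_critique(raw: str) -> Critique:
    """Parse the VLM's JSON reply, tolerating surrounding prose."""
    # Models sometimes wrap the JSON in commentary; take the span
    # from the first '{' to the last '}'.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in VLM reply")
    obj = json.loads(raw[start:end + 1])
    return Critique(
        score=int(obj["score"]),
        is_walking=bool(obj["is_walking"]),
        critique=str(obj["critique"]),
    )
```

Failing loudly on malformed replies matters here: a silently defaulted score would feed garbage back into the proposer's history of attempts.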
LLM proposer. Same Eureka loop as v17 (4 iterations × 3 candidates each). The proposer's prompt now describes the VLM critic as the scorer and includes prior VLM descriptions verbatim in the history of attempts. The proposer can see what the VLM said about previous candidates, not just the scores.
Training env. Same UnitreeG1ConfigurableEnv as v17, cold-start, with hard termination on pelvis below 0.55 m and non-foot floor contact. Identical baseline so the comparison is apples-to-apples.
Training budget. 2.5M PPO steps per candidate, 16 parallel envs, ~20 minutes wallclock each. 12 candidates total, ~4 hours.
The comparison table, reported in human time scale rather than control frames:
| Property | v17 (hand-coded fitness) | v18 (VLM critique) |
|---|---|---|
| Best candidate score | 73.8 / 100 | 62 / 100 |
| Episode duration (best) | 61 frames = 1.2 seconds | 30 frames = 0.6 seconds |
| Real steps taken (lift + land at new x ≥ 6 cm) | 0 | 0 |
| Pelvis drop start → end | ~25 cm | 26 cm |
| Forward distance traveled | 0.27 m | 0.09 m |
| Visual: is it walking? | No — backward collapse | No — forward collapse |
Neither "winner" walks. Both fall in under 1.3 seconds, less than the duration of a single human walking step. The "bipedal alternation" we saw in metrics and in the v18 keyframes was the policy's feet swapping as it fell, not actual stepping.
The first draft of the v18 result claimed this was a "real walker" — the VLM said is_walking=True, 62/100, and the keyframes showed bipedal pose. The retraction came when we added the strict step counter and re-rendered the same rollout in real time. 30 frames at 50 Hz is 0.6 seconds. No human walking step completes in 0.6 seconds, let alone the full stance-swing-stance cycle that constitutes a step.
The keyframes captured a moment with bipedal pose during the collapse. The VLM, judging stills, couldn't see the temporal trajectory that ended in a face-plant.
It's the same failure mode as the metrics — different layer, same shape. The metric scored "single-stance fraction > 30%" without knowing the policy was always on the same foot. The VLM scored "looks bipedal across 8 frames" without knowing those 8 frames spanned 0.6 seconds of falling. Each layer was correct given what it could see, and each was wrong in the same way: by being shown an artifact that flattered the underlying trajectory.
This was, again, embarrassing. Worth being explicit about.
Even retracted as a "result," v18 produced two genuinely new things compared to v17.
Every v18 candidate scored lower than its v17 equivalent would have:
| v18 iter | v18 cand | VLM score | What hand-coded fitness would have said |
|---|---|---|---|
| 1 | 0 | 62 | ~75 (matches v17 best) |
| 3 | 1 | 38 | ~70 (passes most v17 criteria) |
| 0 | 0 | 18 | 65 (this is the v17 first-iteration policy, exactly) |
The VLM is consistently 13–47 points harsher than the hand-coded scorer on the same policies. Policies that look similar by metrics get distinct scores from the VLM, and the VLM's score correlates with what a human observer would say.
That's a real upgrade. The Eureka loop trains the LLM on the score signal. When the score signal is harsher and more correlated with human judgment, the LLM's proposals improve. We see this empirically in the next subsection.
Iter 0 of v18 produced three candidates, all scored 14–18 by the VLM, and all flagged with closely related plain-language critiques naming the same backward-pitch failure.
This is the kind of feedback an experienced motion researcher might give. Notice that all three reach the same diagnosis in slightly different words — the VLM is consistent across rollouts when the underlying failure is the same.
In iter 1, the first candidate's reward-design rationale was explicit:
"Stability-gated locomotion — velocity reward is multiplied by an uprightness gate so the policy only earns walking pay after it earns standing, plus explicit angular-velocity damping to fix the 'pitches backward' failure that killed all three priors."
That candidate scored 62 — a 44-point jump in a single iteration — and was confirmed as walking by the VLM (though, as the retraction shows, on artifacts the VLM couldn't see through).
The point isn't the 44-point jump on its own. It's that the LLM proposed a structural fix to a specific failure mode named by the VLM, rather than re-trying the same shape twice. That doesn't happen with metric-only feedback because metric-only feedback can't name structural failures.
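The gated structure the iter-1 candidate describes can be sketched concretely. This is an illustrative reading of the rationale above, not the candidate's actual reward code; the coefficients, the tilt measure, and the exponential velocity shaping are assumptions.

```python
import math

def stability_gated_reward(fwd_vel, tilt_rad, ang_vel, target_vel=1.0,
                           max_tilt=0.4, damp=0.05):
    """Stability-gated locomotion reward (sketch).

    fwd_vel: forward pelvis velocity (m/s); tilt_rad: torso tilt from
    vertical (rad); ang_vel: base angular velocity components (rad/s).
    """
    # Uprightness gate in [0, 1]: the policy earns walking pay
    # only after it earns standing pay.
    gate = max(0.0, 1.0 - abs(tilt_rad) / max_tilt)
    # Velocity reward peaks at the target forward speed.
    vel_reward = math.exp(-(fwd_vel - target_vel) ** 2)
    # Angular-velocity damping targets the "pitches backward" failure.
    return gate * vel_reward - damp * sum(w * w for w in ang_vel)
```

The multiplicative gate is the structural point: a falling policy gets zero credit for forward velocity, where an additive bonus would still pay it for moving while it collapses.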
Several failure modes became visible only with VLM critique.
The hand-coded fitness had no view into any of these. A policy could splay its arms unnaturally, jump-launch, or collapse its pelvis while still satisfying every fitness criterion we'd encoded. The VLM names these failure modes; the LLM can then address them specifically.
The survival cliff persists. Best v18 policy walks for 30 frames before falling forward. Best v17 policy "walks" for 61 frames then falls. The underlying RL problem — that PPO cannot learn balance recovery from cold-start trajectories where it has never seen a near-fall — is not affected by changing the scorer. The reward function, however carefully designed, cannot produce training data that doesn't exist.
VLM critique is an evaluation upgrade, not a training upgrade. It tells you when your policy is wrong, more accurately than metrics do, but it doesn't make your policy right.
After v18 we built what we should have started with: a layered evaluation stack. The strict step counter, the VLM critique, and human-time reporting; each layer catches a class of failure the previous one missed.

Under the combined gate, is_walking=True requires ≥ 4 real steps AND ≤ 10 cm total pelvis drop AND ≥ 5 seconds duration, and reports are always in seconds, not frames. Both v17's "best" and v18's "best" score 0 real steps and is_walking=False. The strict gate caught what the prior gates missed.
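The gate itself is tiny. A sketch, assuming the 50 Hz control rate used in the frame arithmetic throughout this post:

```python
CONTROL_HZ = 50  # control frames per second, matching this post's math

def frames_to_seconds(frames: int) -> float:
    """Always convert before reporting: frames flatter, seconds don't."""
    return frames / CONTROL_HZ

def strict_is_walking(real_steps: int, pelvis_drop_m: float,
                      episode_frames: int) -> bool:
    """All three conditions must hold; no single signal grants walking."""
    duration_s = frames_to_seconds(episode_frames)
    return (real_steps >= 4
            and pelvis_drop_m <= 0.10
            and duration_s >= 5.0)
```

Run against the numbers in the comparison table, the gate rejects both earlier "winners" and passes only a sustained, pelvis-stable, genuinely stepping rollout.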
Once we trusted the gate, we ran the loop again with three changes: multi-angle temporal sampling for the VLM, warm-start from the working mocap policy, and a cross-validation rule under which is_walking=True requires both the VLM saying yes and the step counter passing. No single signal can grant walking on its own.

12 candidates, same compute budget as v17 and v18. Under the strict gate:
| | v17 (cold-start, hand-coded) | v18 (cold-start, single-angle VLM) | v19 (warm-start + multi-angle temporal VLM + step counter) |
|---|---|---|---|
| Candidates passing strict walking gate | 0 / 12 | 0 / 12 | 4 / 12 |
| Best episode duration | 1.2 s | 0.6 s | 5+ s (sustained, capped at 10s rendering) |
| Best pelvis drop | ~25 cm | 26 cm | < 10 cm (one candidate fully stable) |
| Real steps in best rollout | 0 | 0 | ≥ 4 (cross-validated by step counter) |
The qualitative winner is v19 iter 2 cand 1 — VLM score 62, pelvis stable, 6 real steps in 6 seconds. It's the only candidate in the full v17/v18/v19 sweep that fixed the persistent pelvis-drop issue.
The reward for that candidate was a Gaussian uprightness gate plus an asymmetric pelvis penalty — the LLM diagnosed iter 1's failure (an explicit pelvis-height anchor caused a freeze-in-place trap) and proposed a smoother version that doesn't trigger over-rigidity. That kind of corrective iteration is what the loop is for.
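The shape of that fix can be sketched. This is an illustrative reconstruction from the description above, not the candidate's actual reward; sigma, the penalty weight, and the use of a pelvis-height delta are all assumptions.

```python
import math

def gaussian_gate_reward(fwd_vel, tilt_rad, pelvis_delta_m,
                         sigma=0.25, drop_weight=5.0):
    """Gaussian uprightness gate plus asymmetric pelvis penalty (sketch).

    pelvis_delta_m: pelvis height relative to its starting height
    (negative = dropped).
    """
    # Smooth Gaussian gate: no hard height anchor to freeze against,
    # unlike the explicit pelvis-height anchor that trapped iter 1.
    gate = math.exp(-(tilt_rad / sigma) ** 2)
    # Asymmetric: only pelvis *drops* are penalized, so small rises
    # and natural bobbing cost nothing and rigidity isn't rewarded.
    drop_pen = drop_weight * max(0.0, -pelvis_delta_m)
    return gate * fwd_vel - drop_pen
```

The asymmetry is the interesting design choice: a symmetric height penalty recreates the freeze-in-place trap, because the cheapest way to hold an exact height is to stop moving.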
Side-by-side with v12 (the working mocap baseline) for calibration:
v19 candidates stay close to v12 quality. They don't dramatically exceed it. The base mimic reward keeps pulling the refinement back toward the LAFAN1 mocap. Refinement-on-top-of-mocap is real and produces a measurable improvement on the pelvis-stability axis, but it doesn't escape the mocap's quality ceiling.
Even with the better evaluation stack, real limits remained.
The cap on improvement is v12 itself. v19 brings refinement candidates close to v12 quality from below and one candidate goes slightly above on the pelvis-stability axis. None go dramatically above the v12 baseline, because the base mimic reward keeps pulling the policy back. To get fundamentally better than the mocap baseline, you'd need a richer reference dataset or a different supervision signal entirely.
This post does not claim "VLM critique solves humanoid bipedal locomotion." It does not. The cold-start survival cliff is unbroken by VLM critique alone — that's the v18 honest negative result. The warm-start v19 produced refinement-quality improvements, not new capability.
What VLM critique demonstrably does: score policies substantially harsher than hand-coded metrics and in line with human judgment, name failure modes the metrics can't articulate, and give the proposer structural feedback it can act on.
That's a useful contribution. It's not perfect human walking. We won't pretend otherwise.
VLM critique helps when the gap is evaluative: when your metrics can't articulate what "looks wrong" means, and when naming a failure mode lets the proposer fix it structurally.

VLM critique does not help when the gap is in the training data: no scorer, however well calibrated, fixes a policy that has never seen a near-fall. And it can be fooled whenever the artifact it judges (static keyframes) disguises the underlying trajectory.
The step counter / human-time-scale layer is essentially free — a few hundred lines of code that runs in real time — and should be on regardless of what evaluation method you use elsewhere. Always report in seconds. "Survives 30 frames" sounds like progress; "survives 0.6 seconds" doesn't. The unit choice is doing work.
A few directions remain open; we leave them for future work.
This is the last post in the humanoid-sim series. Five posts, one project, a lot of honest failure. The pattern across all of them: every "let's automate the supervision" attempt produced a new class of failure, not zero failures. Hand-coded metrics missed degenerate gaits. LLM-iterated rewards explored the design space without solving coverage. VLM critique closed some gaps and opened new ones around frame-time disguise. The reliable signal in this project remained the same as it started: a real motion-capture reference and the minimum-viable-mocap result that two carefully-ordered poses are enough.
The constructive arc — what actually walks — is the mocap post and the v19 refinement on top of mocap. The diagnostic arc — what doesn't — is reward hacking, LLM iteration, and this one. Both arcs are necessary, and both have to be honest, because the field is small enough that papers without retractions tend to be papers that haven't yet noticed their retractions.
If there's one practitioner lesson across all five posts, it's the one this post ended on: always lead with seconds. Whatever your evaluation function is, when you state your result, state it in time units that a human cares about. 30 frames sounds like progress. 0.6 seconds sounds like a collapse. They are the same fact. The choice of unit is doing the work of either flattering or honest reporting, and the field benefits from the honest one.