The VLM That Scored a Collapsing Robot 62/100
After metrics kept missing degenerate gaits and LLM-iterated rewards hit the survival cliff, we tried a vision-language model (VLM) as the fitness scorer. The VLM was harsher than the metrics and produced actionable failure descriptions, yet it still scored a collapsing robot 62/100. A case study in honest fitness, plus the four-layer evaluation stack we landed on.