by MLL LAB · October 2025
Language agents operate in a fully observable text world, where “state” is explicitly written out.
But once we move to visual environments, agents face partial observability: their observations are noisy and incomplete, and the world keeps changing.
This transforms the problem from a standard Markov Decision Process (MDP) into a Partially Observable MDP (POMDP).
In such cases, an agent must maintain a belief about the hidden world state: an internal world model.
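For reference, the textbook way to maintain such a belief is the recursive Bayesian update of a POMDP. In the notation below, $T$ is the transition model, $Z$ the observation model, and $\eta$ a normalizing constant; this is standard background rather than something VAGEN computes explicitly, since a VLM's belief lives implicitly in its reasoning trace:

$$
b_{t+1}(s') \;=\; \eta \, Z(o_{t+1} \mid s', a_t) \sum_{s} T(s' \mid s, a_t)\, b_t(s)
$$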
Core Question:
Can we train Vision-Language Models (VLMs) to explicitly reason about what they see now and what will happen next?
This is where VAGEN comes in: we build a reinforcement learning framework that enforces and rewards explicit visual state reasoning.
VAGEN structures the model’s thinking into two reasoning stages: first, grounding the current visual state (what the model sees now), and second, predicting the next state (what will happen after it acts).
We enforce this structure by making the model output a formatted reasoning trace:
<think>
<observation>...</observation>
<reasoning>...</reasoning>
<prediction>...</prediction>
</think>
<answer>...</answer>
This allows us to reward the reasoning process itself — not just the final action outcome.
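As a rough illustration of how such a structured trace can be checked and rewarded, here is a minimal sketch. The tag names follow the format above, but the function names, regexes, and reward value are ours and purely illustrative, not the actual VAGEN code:

```python
import re

# Reasoning stages expected inside <think>, matching the trace format above.
SECTIONS = ["observation", "reasoning", "prediction"]

def parse_trace(text: str) -> dict | None:
    """Extract the structured sections from a model response, or None if malformed."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if not think or not answer:
        return None
    parsed = {"answer": answer.group(1).strip()}
    for tag in SECTIONS:
        m = re.search(rf"<{tag}>(.*?)</{tag}>", think.group(1), re.DOTALL)
        if not m:
            return None  # a required reasoning stage is missing
        parsed[tag] = m.group(1).strip()
    return parsed

def format_reward(text: str, bonus: float = 0.5) -> float:
    """Illustrative reward for emitting a well-formed reasoning trace."""
    return bonus if parse_trace(text) is not None else 0.0
```

Once the trace is parsed into named sections like this, reward can be attached to the reasoning stages themselves, on top of the usual task reward for the chosen action.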
We evaluate five reasoning formats during RL training: