by MLL LAB · October 2025


From Observation to Understanding: Why Visual Agents Need Internal Models

Language agents operate in a fully observable text world, where the state is written out explicitly.

But once we move to visual environments, agents face partial observability: observations are noisy and incomplete, and the world keeps changing between them.

This transforms the problem from a standard Markov Decision Process (MDP) into a Partially Observable MDP (POMDP).

In such cases, an agent must maintain a belief about the hidden world state: an internal world model.
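
To make this concrete, here is a minimal sketch of the classic discrete POMDP belief update, b'(s') ∝ O(o | s', a) · Σ_s T(s' | s, a) · b(s). The dictionaries T and O below are hypothetical placeholders for the transition and observation models; this is an illustration of the belief-state idea, not VAGEN code.

def update_belief(belief, action, observation, T, O):
    """Bayes-filter update over a discrete state space.

    belief: dict state -> probability
    T: dict (state, action) -> dict next_state -> P(s' | s, a)
    O: dict (next_state, action) -> dict observation -> P(o | s', a)
    """
    # Collect every state reachable under this action.
    successors = {
        s_next
        for (s, a), dist in T.items() if a == action
        for s_next in dist
    }
    new_belief = {}
    for s_next in successors:
        # Predict: marginalize the transition model over the prior belief.
        predicted = sum(
            T.get((s, action), {}).get(s_next, 0.0) * p
            for s, p in belief.items()
        )
        # Correct: reweight by the likelihood of the received observation.
        new_belief[s_next] = O.get((s_next, action), {}).get(observation, 0.0) * predicted
    total = sum(new_belief.values())
    return {s: p / total for s, p in new_belief.items()} if total else new_belief

A VLM cannot maintain such a belief as an explicit probability table, which is why VAGEN instead asks it to express its belief in natural language.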

Core Question:

Can we train Vision-Language Models (VLMs) to explicitly reason about what they see now and what will happen next?

This is where VAGEN comes in: we build a reinforcement learning framework that enforces and rewards explicit visual state reasoning.


The VAGEN Framework: Making Reasoning Explicit

VAGEN structures the model’s thinking into two reasoning stages:

1. State estimation: describe the current visual observation in words (what the agent sees now).
2. Transition modeling: predict how the world state will change after the chosen action (what will happen next).

We enforce this structure by making the model output a formatted reasoning trace:

<think>
  <observation>...</observation>
  <reasoning>...</reasoning>
  <prediction>...</prediction>
</think>
<answer>...</answer>
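
For concreteness, a hypothetical filled-in trace for a single grid-navigation step might look like this (the content is invented purely for illustration):

<think>
  <observation>The agent is in the top-left cell; the goal is two cells to the right; a wall blocks the cell below.</observation>
  <reasoning>Moving right twice reaches the goal without crossing the wall.</reasoning>
  <prediction>After moving right, the agent will be one cell from the goal, with the wall unchanged.</prediction>
</think>
<answer>move right</answer>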

This allows us to reward the reasoning process itself, not just the final action outcome.
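
As a rough sketch, a format-level reward can be computed by parsing the trace. The function name and bonus values below are illustrative assumptions, not VAGEN's actual reward implementation; the tag names mirror the trace format above.

import re

TAGS = ("observation", "reasoning", "prediction")

def format_reward(response: str) -> float:
    """Small bonus for each well-formed, non-empty stage of the trace."""
    reward = 0.0
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if think is None:
        return reward
    for tag in TAGS:
        # Each reasoning stage must be present and non-empty.
        match = re.search(rf"<{tag}>(.*?)</{tag}>", think.group(1), re.DOTALL)
        if match and match.group(1).strip():
            reward += 0.1
    # The final action must appear in a closed <answer> block.
    if re.search(r"<answer>(.+?)</answer>", response, re.DOTALL):
        reward += 0.1
    return reward

In training, a bonus like this would be added to the task reward, so the policy is optimized both to reason in the required structure and to act well.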


Comparing Reasoning Strategies

We evaluate five reasoning formats during RL training: