GigaWorld-1: A Roadmap to
World Models for Robot Policy Evaluation

A systematic study of world models as policy evaluators, introducing WMBench and a practical design roadmap for building evaluator-oriented world models.

GigaAI


Abstract

Evaluating embodied robot foundation models remains a critical bottleneck: unlike digital AI systems, robot policies must be tested through slow, costly, and hardware-limited real-world rollouts. This motivates the use of world models as scalable surrogate evaluators, but what makes a world model reliable for policy assessment is still not well understood.

We introduce WMBench, a real-robot benchmark built from teleoperation data and matched policy rollouts, enabling controlled comparisons across model families, action encodings, rollout horizons, and evaluation metrics. Using WMBench, we analyze 7 video world models, 4 action representations, and 324,000+ simulated rollouts paired with real robot executions, further supported by 12,000+ hours of training videos and CVPR 2026 GigaBrain Challenge submissions.

Our study shows that evaluator quality is driven less by short-term visual realism and more by long-horizon, action-faithful rollout consistency. It also highlights the importance of balancing general world knowledge and robot controllability, as well as architecture choices such as action encoding, memory, and evaluator-focused post-training. These insights lead to GigaWorld-1, a world model optimized for scalable robot policy evaluation.

Key Features

12,980 h Training Data

324,000+ Rollout Trajectory Evaluations

7 World Models Evaluated

1.3B / 5B Model Variants

<24GB / 20FPS Fast Generation

∞ Horizon Infinite Generation

Data Sources & Curation

We build a 12,980-hour heterogeneous training corpus from internet and physics videos, open-source robot datasets, egocentric human-hand data, and Giga-collected demonstrations, combining broad visual-physical priors with embodiment-specific manipulation behaviors.

Data Snapshot

GigaWorld-1

12,980 hours

Category	Representative Sources	Robot Type	Hours	Modality
Physical	Internet Videos, Physics Videos	⚙️ N/A	~1,298	🎥 RGB Video
Robot	Open X, AgiBot	🦿 Single-arm 🤖 Dual-arm 🧍 Humanoid	~5,377	🦾 Robot Demonstration
Human	EgoDex, SynData	✋ Human Hands	~2,411	🎥 RGB Video 🖐️ Hand Pose
Giga	Giga Humanoid, Giga Dual-arm	🧍 Humanoid 🤖 Dual-arm	~3,894	🦾 Robot Demonstration
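
As a quick consistency check on the table above, the four approximate per-category hour counts add up to the stated 12,980-hour total. The short Python sketch below recomputes the total and each category's share of the mixture; the variable names are illustrative, and the hour figures are the rounded values from the table rather than exact counts.

```python
# Approximate hours per category, copied from the data table above.
hours = {"Physical": 1298, "Robot": 5377, "Human": 2411, "Giga": 3894}

total = sum(hours.values())
shares = {name: h / total for name, h in hours.items()}

print(f"total hours: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```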

GigaWorld-1: Model and Experiments

Figure: Overall architecture of GigaWorld-1, an autoregressive diffusion-transformer world generator adapted with LoRA. Historical frames, future noisy latents, and temporally aligned controls such as actions, depth, semantic maps, and captions are fused for embodied rollout generation; each generated window is decoded and appended back into the rollout history.
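
The window-by-window loop described in the caption can be sketched in a few lines of Python. This is only an illustration of the control flow, not the actual GigaWorld-1 code: `generate_window` is a hypothetical stand-in for the denoise-and-decode step, frames are represented as plain lists, and the `context_len` cap on how much history is conditioned on is an assumption that the page does not state.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # placeholder for a decoded image; a real frame would be an array


def autoregressive_rollout(
    first_frames: Sequence[Frame],
    controls_per_window: Sequence[Sequence[float]],
    generate_window: Callable[[Sequence[Frame], Sequence[float]], List[Frame]],
    context_len: int = 8,
) -> List[Frame]:
    """Generate a rollout window by window, feeding outputs back as history.

    generate_window is a hypothetical stand-in for the generator;
    context_len is an assumed bound on the conditioning history.
    """
    rollout = list(first_frames)
    for controls in controls_per_window:
        history = rollout[-context_len:]       # most recent frames only
        new_frames = generate_window(history, controls)
        rollout.extend(new_frames)             # append the decoded window to the history
    return rollout


def toy_generator(history, controls):
    # Toy dynamics: one new frame per control, each offset from the last history frame.
    last = history[-1]
    return [[x + c for x in last] for c in controls]


frames = autoregressive_rollout([[0.0]], [[1.0, 1.0], [2.0]], toy_generator)
print(frames)
```

Because each window conditions only on recent history, the loop can in principle run for arbitrarily many windows, which is the property the long-horizon results rely on.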

Evaluator Benchmark — WMBench Leaderboard

Figure: Left: average score across the six summary metrics. Right: per-metric radar comparison. GigaWorld-1-Plus achieves the best overall score and is especially strong in JEPA Similarity, Semantic Alignment, Subject Consistency, and Trajectory Accuracy.


Rank	Model	Size	Type	Aesthetic↑	Image↑	JEPA↑	Semantic↑	Subject↑	Trajectory↑	AVG↑
1	GigaWorld-1-Plus	5B	Robot/Auto	0.3534	0.6765	0.9337	0.8926	0.8883	0.3561	0.6834
2	GigaWorld-1-Nano	1.3B	Robot/Auto	0.3538	0.6802	0.8911	0.8920	0.8600	0.3528	0.6716
3	Cosmos-Predict2.5	2B	Robot/Auto	0.3491	0.7184	0.6781	0.8764	0.8747	0.1770	0.6123
4	Wan 2.2 (TI2V 5B)	5B	General	0.3538	0.6980	0.5853	0.8789	0.8883	0.1643	0.5948
5	LTX 2.3	22B	General	0.3900	0.6967	0.5380	0.8678	0.8248	0.1479	0.5775
6	CogVideoX	5B	General	0.3303	0.6775	0.6437	0.8633	0.6963	0.1609	0.5620
7	SVD	1.5B	General	0.2861	0.6497	0.6454	0.8411	0.8267	0.0926	0.5569
8	Wan 2.1 (I2V 1.3B)	1.3B	General	0.3422	0.6856	0.6002	0.8705	0.5568	0.1576	0.5355

Table: Robot-oriented models generally outperform generic video backbones on embodied rollout evaluation. Cell background colors indicate ranking: green = 1st, yellow = 2nd, red = last. All metrics are higher-is-better (↑).
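
The AVG column appears to be the unweighted mean of the six metric columns; this is an inference from the numbers rather than something the page states. The Python sketch below recomputes it for the top three rows, using values copied from the table, and the results agree with the reported AVG values to within rounding.

```python
# Per-metric scores copied from the leaderboard rows above, in column order:
# Aesthetic, Image, JEPA, Semantic, Subject, Trajectory.
rows = {
    "GigaWorld-1-Plus": [0.3534, 0.6765, 0.9337, 0.8926, 0.8883, 0.3561],
    "GigaWorld-1-Nano": [0.3538, 0.6802, 0.8911, 0.8920, 0.8600, 0.3528],
    "Cosmos-Predict2.5": [0.3491, 0.7184, 0.6781, 0.8764, 0.8747, 0.1770],
}


def average(scores):
    """Unweighted mean of the metric scores (assumed definition of AVG)."""
    return sum(scores) / len(scores)


avg = {model: average(scores) for model, scores in rows.items()}
ranking = sorted(avg, key=avg.get, reverse=True)

for model in ranking:
    print(f"{model}: {avg[model]:.4f}")
```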

Multi-View Trajectory Accuracy

Trajectory Accuracy. Under multi-view control, the same control signal is faithfully reflected in both head and wrist views, with sub-pixel alignment between predicted and ground-truth keypoints.

Precise Multi-View Control. Pixel-aligned, view-specific control signals keep every camera anchored to the same underlying action.

Closed-Loop Policy Consistency

Figure panels: Closed-loop Success Correlation; Closed-loop Success Rate.

Closed-loop rollout videos (outcome label per clip):

Fold the box: ❌ Fail, ✅ Success
Put banana in basket: ❌ Fail, ✅ Success, ❌ Fail, ✅ Success
Put bowl on plate: ✅ Success, ❌ Fail, ✅ Success
Pour fries on plate: ✅ Success, ❌ Fail

Physically Consistent World Model

Video comparison: five side-by-side pairs, each showing the backbone next to ours.

OOD Generalization

Beyond in-distribution WMBench, we test OOD rollout consistency across novel actions, unseen backgrounds, new containers, and unseen food categories.

Background Material (unseen table surface): Wood Table, Metal Table, Photo Table, Tablecloth

Foreground Container (unseen bowl types): White Bowl, Wooden Bowl, Red Bowl, Stainless Steel Bowl

Food Category (unseen food types): Rice, Noodles, Steak, Braised Pork

Action Trajectory (unseen manipulation patterns): ✅ Bottom-Left, ✅ Center, ❌ Failure

⚡ Flash & Ultra-Long-Horizon Generation

🖥️ Consumer-Ready

< 24 GB

Peak VRAM fits in a single RTX 4090 — runs on off-the-shelf consumer hardware, no datacenter required.

⚡ Flash Inference

> 20 FPS

Generates 33 s of 1920×480 video in 12 s on a single H20, fast enough for interactive use.

♾️ Infinite Horizon

> 10 min

Sustains 11,000+ frames of autoregressive rollout without crashing or quality drift.

📊 Acceleration Benchmark

DMD2 step distillation combined with tensor-parallel scaling yields a 27.23× speedup on 4 GPUs, unlocking real-time interactive use.

♾️ Real-time Generation

⚡ Flash & Ultra-Long-Horizon Generation. Real-time uninterrupted generation demo: over 10 minutes and 11K frames.

Transfer & Generalization

Beyond standard OOD evaluation, GigaWorld-1 transfers quickly to new domains and conditioning modalities. The same world-model design can be adapted to autonomous driving, supports edge and depth-map conditioning, and generalizes across different robot embodiments. Fine-tuning on new datasets or conditioning modalities typically requires only 8 GPUs and a few thousand training steps, which keeps domain adaptation and modality extension efficient.

Cross-Domain Transfer

Autonomous Driving

AGI Robot

Control-Condition Transfer

First frame condition

Edge-to-Video

Depth-to-Video

VLM-assisted Rollout Evaluator

The VLM-assisted rollout evaluator scores generated videos on the same 1–3 ordinal scale as human annotators. Across 5,000+ videos, it reaches 87.80% exact agreement and 99.16% adjacent agreement with human ratings, with only 0.84% off by two score levels. Strong correlations (Spearman 0.7574, Kendall τ_b 0.7507) show that it is a reliable proxy for method-level WMES comparison.
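
To make the agreement statistics concrete, the Python sketch below computes exact agreement, adjacent agreement, and Kendall's τ_b for two lists of ordinal scores. It assumes "adjacent agreement" means the two scores differ by at most one level, which matches the reported figures (99.16% adjacent plus 0.84% off by two covers every video). The toy ratings are made up for illustration and are not taken from the benchmark.

```python
import math
from itertools import combinations


def agreement_rates(human, model):
    """Return (exact, adjacent) agreement; adjacent means scores differ by at most 1."""
    n = len(human)
    exact = sum(h == m for h, m in zip(human, model)) / n
    adjacent = sum(abs(h - m) <= 1 for h, m in zip(human, model)) / n
    return exact, adjacent


def kendall_tau_b(x, y):
    """Kendall's tau-b, which corrects for ties (common on a 3-level scale)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Made-up toy ratings on the 1-3 scale, for illustration only.
human = [1, 1, 2, 2, 3, 3, 3, 1]
model = [1, 1, 2, 3, 3, 3, 2, 1]
exact, adjacent = agreement_rates(human, model)
tau = kendall_tau_b(human, model)
print(f"exact={exact:.2%} adjacent={adjacent:.2%} tau_b={tau:.4f}")
```

On a three-level scale most pairs of videos are tied on at least one rater, which is why the tie-corrected τ_b is a more informative rank statistic here than the uncorrected variant.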

Score 1

Put panda into pink plate.

GT: 1 | Pred: 1 | Match: 0.763 | Levels: All levels match | Reason: 0.472

Overall Video Quality	Medium: Video is split into three views, but all are shaky and blurry with frequent motion blur and fast panning.	Medium: Video is blurry and shaky, but objects (panda, plates, banana) are identifiable across all three views.	0.355
Instruction Following	Poor: The robot arm attempts to pick up a panda but fails to place it into the pink plate; the panda remains on the table.	Poor: The robot arm attempts to grasp the panda but fails to place it into the pink plate; the panda remains on the table.	0.863
Physical Adherence	Medium: Contact is attempted but unstable; the panda is not securely grasped or placed, and final state shows panda outside the plate.	Medium: The robot arm makes contact with the panda, but the grasp is unstable and the panda is not lifted or moved toward the plate.	0.467

GT Reason

The robot fails to place the panda into the pink plate; the panda is not inside the plate in the final state. Video quality is shaky and blurry, but the failed task is clear.

Pred Reason

The robot arm attempts to grasp the panda but fails to lift or move it into the pink plate. The panda remains on the table, and the final state does not show the task completed.

Put green bowl into pink plate.

GT: 1 | Pred: 1 | Match: 0.766 | Levels: All levels match | Reason: 0.473

Overall Video Quality	Poor: Video is extremely blurry, shaky, and lacks clear object details, making it difficult to identify objects or actions.	Poor: Video is extremely blurry, shaky, and lacks clear object visibility, making it difficult to discern actions or final states.	0.737
Instruction Following	Poor: The task 'put green bowl into pink plate' is not completed; the final state does not show a green bowl inside a pink plate.	Poor: The video does not clearly show the green bowl being placed into the pink plate; the final state is ambiguous and not visible.	0.503
Physical Adherence	Unclear: Due to severe visual corruption, it's impossible to assess physical interactions, contact, or stability.	Unclear: Due to severe visual corruption and lack of clear contact or placement, physical interactions cannot be reliably assessed.	0.478

GT Reason

The video is too corrupted to clearly identify objects or actions. The task is not completed, and the final state is ambiguous due to poor video quality.

Pred Reason

The video is severely corrupted with extreme blur and motion, preventing clear identification of objects or actions. The task of putting the green bowl into the pink plate is not visibly completed, and the final state is ambiguous. Therefore, the task is failed.

Put banana into basket.

GT: 2 | Pred: 2 | Match: 0.801 | Levels: All levels match | Reason: 0.661

Overall Video Quality	Medium: The video is clear with consistent views from three cameras, but the action is partially obscured by robotic arms and objects.	Medium: Video is clear but shows a split view; the right panel is often blurry or shows only a partial object.	0.290
Instruction Following	Medium: The robot arm picks up the banana and moves it toward the basket, but the final placement is not clearly visible.	Medium: The robot arm picks up the banana and moves it toward the basket, but the final placement is not clearly shown.	0.922
Physical Adherence	Medium: The banana is grasped and moved plausibly, but the interaction with the basket is not fully shown, leaving stability and containment ambiguous.	Medium: The banana is grasped and moved, but the video ends before confirming stable containment in the basket.	0.477

GT Reason

The robot arm attempts to place the banana into the basket, showing meaningful progress. However, the final state is not clearly visible, making completion ambiguous and supporting a partial score.

Pred Reason

The robot successfully grasps the banana and moves it toward the basket, showing meaningful progress. However, the final state is not clearly visible, making completion ambiguous.

Put panda into pink plate.

GT: 2 | Pred: 2 | Match: 0.788 | Levels: All levels match | Reason: 0.584

Overall Video Quality	Medium: Three camera views are provided, but motion blur and varying camera angles reduce clarity in some frames.	Medium: Video is split into three views; middle and right panels are often blurry or obstructed, making detailed observation difficult.	0.238
Instruction Following	Medium: The robot picks up the panda toy and moves it toward the pink plate, but the final placement is not clearly completed.	Medium: Robot arm picks up the panda toy and moves it toward the pink plate, but the final placement is not clearly shown.	0.883
Physical Adherence	Medium: The robot arm makes plausible contact with the panda toy and moves it, but the final state in the pink plate is ambiguous.	Medium: The robot arm grasps the panda and moves it, but the final state of the panda on the plate is ambiguous due to camera angles.	0.593

GT Reason

The robot successfully picks up the panda toy and initiates movement toward the pink plate, showing meaningful progress. However, the final placement into the plate is not clearly visible or completed in the provided frames.

Pred Reason

The robot arm successfully picks up the panda and moves it toward the pink plate, showing progress. However, the final placement is not clearly visible, making the task completion ambiguous.

Put panda into pink plate.

GT: 3 | Pred: 3 | Match: 0.787 | Levels: All levels match | Reason: 0.554

Overall Video Quality	Good: All three camera views are clear, showing consistent object positions and robot movements without major artifacts.	Good: The video is clear, with consistent object identities and smooth camera movements across all three views.	0.482
Instruction Following	Good: The robot picks up the panda toy and places it directly onto the pink plate, achieving the requested action.	Good: The robot successfully picks up the panda toy and places it into the pink plate, as instructed.	0.639
Physical Adherence	Good: The robot arm makes plausible contact with the panda, lifts it, moves it, and releases it stably onto the plate.	Good: The robot's gripper makes plausible contact with the panda, lifts it, and releases it into the plate without physical impossibilities.	0.637

GT Reason

The video clearly shows the robot arm successfully grasping the panda toy, transporting it, and placing it into the pink plate. All three views confirm the action and stable final state.

Pred Reason

The video shows the robot arm picking up the panda toy and placing it into the pink plate, which is the correct final state. The action is completed, and the final state is stable and correct.

Put panda into pink plate.

GT: 3 | Pred: 3 | Match: 0.754 | Levels: All levels match | Reason: 0.588

Overall Video Quality	Good: Three synchronized camera views are clear, with consistent object identity and no major visual artifacts.	Good: The video is clear, with consistent object identities and smooth camera movements across all three views.	0.453
Instruction Following	Good: The robot arm picks up the panda toy and places it directly into the pink plate, fulfilling the instruction.	Good: The robot correctly identifies the panda and moves it into the pink plate, completing the task as instructed.	0.490
Physical Adherence	Good: The robot's grasp is stable, the panda is lifted and moved plausibly, and it lands securely inside the plate.	Good: The robot's gripper makes plausible contact with the panda, lifts it, and places it stably onto the plate.	0.418

GT Reason

The video clearly shows the robot successfully picking up the panda and placing it into the pink plate, with stable motion and a correct final state across all three views.

Pred Reason

The video shows the robot successfully picking up the panda and placing it into the pink plate, with clear visual evidence of the final state. All aspects of the task are completed correctly and plausibly.

WMBench at a Glance

WMBench evaluates world models as robot policy evaluators, not only as video generators. Its protocol mirrors real deployment: collect real policy rollouts, train world models under a strict holdout split, run closed-loop policy evaluation inside the learned world model, and compare generated outcomes with real-world conclusions.

Step 1

Real-World Collection

Collect physical closed-loop policy rollouts with initial observations, task instructions, multi-view videos, and human success labels.

Step 2

Strict World-Model Holdout

Train on the designated split while holding out test layouts and initial states to measure generalization rather than memorization.

Step 3

Closed-Loop Rollout

The policy acts on generated observations, and the world model feeds predictions back to the policy until task termination.

Step 4

Metrics & Outcome Assessment

Automatic diagnostics measure rollout quality, while WMES from human or VLM evaluators measures task-level outcome alignment.
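
The closed-loop rollout in Step 3 can be sketched as a simple loop in which the policy and the world model take turns. The Python below is a minimal illustration under stated assumptions: `policy`, `world_model_step`, and `is_done` are hypothetical callables, observations are toy tuples rather than multi-view video, and the `max_steps` budget is an added safeguard that the page does not describe.

```python
from typing import Callable, List, Tuple

Observation = Tuple[float, ...]
Action = float


def closed_loop_rollout(
    initial_obs: Observation,
    policy: Callable[[Observation], Action],
    world_model_step: Callable[[Observation, Action], Observation],
    is_done: Callable[[Observation], bool],
    max_steps: int = 100,
) -> List[Observation]:
    """Alternate policy and world model until termination or a step budget."""
    trajectory = [initial_obs]
    obs = initial_obs
    for _ in range(max_steps):
        if is_done(obs):
            break
        action = policy(obs)                 # acts on the real first frame, then on generated ones
        obs = world_model_step(obs, action)  # world model predicts the next observation
        trajectory.append(obs)
    return trajectory


# Toy 1-D example: the "robot" moves toward a goal position at 1.0.
def toy_policy(obs):
    return 0.25 if obs[0] < 1.0 else 0.0


def toy_world_model(obs, action):
    return (obs[0] + action,)


def toy_done(obs):
    return obs[0] >= 1.0


traj = closed_loop_rollout((0.0,), toy_policy, toy_world_model, toy_done)
print(traj)
```

The key property is that after the first step the policy never sees a real observation, so any drift in the world model compounds through the policy's own decisions. This is why long-horizon, action-faithful consistency matters more here than single-frame realism.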

Outcome Evaluation

WMES measures whether generated rollouts support task-level policy comparison and preserve real-world success conclusions.

Human WMES
VLM-assisted WMES
PCC
MMRV
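
For readers unfamiliar with the last two items, the Python sketch below shows one way to compute them over per-policy success rates. It assumes PCC is the Pearson correlation coefficient and MMRV is the Mean Maximum Rank Violation used in prior work on simulated policy evaluation; the page does not expand either acronym, so both readings are assumptions, and the success rates are made-up toy values.

```python
import math


def pearson(real, sim):
    """Pearson correlation between real and world-model success rates."""
    n = len(real)
    mean_r, mean_s = sum(real) / n, sum(sim) / n
    cov = sum((r - mean_r) * (s - mean_s) for r, s in zip(real, sim))
    var_r = sum((r - mean_r) ** 2 for r in real)
    var_s = sum((s - mean_s) ** 2 for s in sim)
    return cov / math.sqrt(var_r * var_s)


def mmrv(real, sim):
    """Mean Maximum Rank Violation: lower is better, 0 means the rankings agree."""
    n = len(real)
    total = 0.0
    for i in range(n):
        worst = 0.0
        for j in range(n):
            # A violation occurs when the world model orders policies i and j
            # differently from the real world; its size is the real-world gap.
            if (sim[i] < sim[j]) != (real[i] < real[j]):
                worst = max(worst, abs(real[i] - real[j]))
        total += worst
    return total / n


# Made-up success rates for three policies, for illustration only.
real = [0.2, 0.5, 0.9]
sim_good = [0.3, 0.6, 0.8]     # same ordering as the real world
sim_swapped = [0.6, 0.3, 0.8]  # first two policies ranked the wrong way round

print(pearson(real, sim_good), mmrv(real, sim_good))
print(pearson(real, sim_swapped), mmrv(real, sim_swapped))
```

The two statistics are complementary: correlation rewards getting the overall trend right, while the rank-violation measure penalizes an evaluator that would lead you to pick the wrong policy, weighted by how much worse that policy really is.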

Frame & Representation Fidelity

Perceptual quality, content stability, and feature-level similarity across generated and reference videos.

Image Quality
Aesthetic Quality
JEPA Similarity
Subject Consistency
Background Consistency
Photometric Consistency

Geometry, Semantics & Interaction

Spatial structure, 3D plausibility, task semantics, instruction following, and action-conditioned robot-object interaction.

Geometry Accuracy
Perspectivity
Semantic Alignment
Instruction Following
Interaction Quality
Trajectory Accuracy

Motion & Long-Horizon Rollout

Short-term motion behavior and long-horizon autoregressive degradation, including temporal smoothness, flow dynamics, and standard video-generation distances.

Dynamic Degree
Flow Score
Motion Smoothness
PSNR
FID
FVD
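
Of the distances listed above, PSNR is the simplest to state precisely. The Python sketch below uses the standard definition, 10·log10(MAX² / MSE), on flattened pixel lists; the 8-bit `max_value` default and the toy pixel values are assumptions for illustration, and the benchmark's own implementation may differ in color space or averaging.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(generated):
        raise ValueError("frames must have the same number of pixels")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 4-pixel "frames", for illustration only.
ref = [0, 64, 128, 255]
gen = [0, 60, 130, 250]
score = psnr(ref, gen)
print(f"{score:.2f} dB")
```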

Summary tables focus on six key diagnostics: Aesthetic Quality, Image Quality, JEPA Similarity, Semantic Alignment, Subject Consistency, and Trajectory Accuracy.

Citation

@article{gigaworld2025,
  title     = {GigaWorld-1: A Roadmap to World Models for Robot Policy Evaluation},
  author    = {{GigaAI}},
  journal   = {arXiv preprint},
  year      = {2025}
}