Video wall background
Humanoid silhouette Giga Logo

GigaWorld-1: A Roadmap to
World Models for Robot Policy Evaluation

A systematic study of world models as policy evaluators, introducing WMBench and a practical design roadmap for building evaluator-oriented world models.

GigaAI

GigaWorld-1 teaser

Abstract

Evaluating embodied robot foundation models remains a critical bottleneck: unlike digital AI systems, robot policies must be tested through slow, costly, and hardware-limited real-world rollouts. This motivates the use of world models as scalable surrogate evaluators, but what makes a world model reliable for policy assessment is still not well understood.


We introduce WMBench, a real-robot benchmark built from teleoperation data and matched policy rollouts, enabling controlled comparisons across model families, action encodings, rollout horizons, and evaluation metrics. Using WMBench, we analyze 7 video world models, 4 action representations, and 324,000+ simulated rollouts paired with real robot executions, further supported by 12,000+ hours of training videos and CVPR 2026 GigaBrain Challenge submissions.


Our study shows that evaluator quality is driven less by short-term visual realism and more by long-horizon, action-faithful rollout consistency. It also highlights the importance of balancing general world knowledge and robot controllability, as well as architecture choices such as action encoding, memory, and evaluator-focused post-training. These insights lead to GigaWorld-1, a world model optimized for scalable robot policy evaluation.

Key Features

0 Data
0 Rollout Traj Evaluation
0 Model Evaluations
1.3B / 5B Model Variants
<24GB / 20FPS Fast Generation
∞ Horizon Infinite Generation

Data Sources & Curation

We build a 12,980-hour heterogeneous training corpus from internet and physics videos, open-source robot datasets, egocentric human-hand data, and Giga-collected demonstrations, combining broad visual-physical priors with embodiment-specific manipulation behaviors.

Data Snapshot
Robot manipulation video wall
GigaWorld-1
12,980 hours
Category Representative Sources Robot Type Hours Modality
Physical Internet Videos, Physics Videos ⚙️ N/A ~1,298 🎥 RGB Video
Robot Open X, AgiBot 🦿 Single-arm 🤖 Dual-arm 🧍 Humanoid ~5,377 🦾 Robot Demonstration
Human EgoDex, SynData ✋ Human Hands ~2,411 🎥 RGB Video 🖐️ Hand Pose
Giga Giga Humanoid, Giga Dual-arm 🧍 Humanoid 🤖 Dual-arm ~3,894 🦾 Robot Demonstration

GigaWorld-1: Model and Experiments

GigaWorld-1 Model Architecture

Figure: Overall architecture of GigaWorld-1, an autoregressive diffusion-transformer world generator adapted with LoRA. Historical frames, future noisy latents, and temporally aligned controls such as actions, depth, semantic maps, and captions are fused for embodied rollout generation; each generated window is decoded and appended back into the rollout history.

Evaluator Benchmark — WMBench Leaderboard

Architecture Comparison

Figure: Left: average score across seven evaluation metrics. Right: per-metric radar comparison. GigaWorld-1-Nano achieves the best overall score and is especially strong in JEPA Similarity, Semantic Alignment, Subject Consistency, and Trajectory Accuracy.

1st 2nd Last All metrics are higher-is-better (↑)
Rank Model Size Type Aesthetic↑ Image↑ JEPA↑ Semantic↑ Subject↑ Trajectory↑ AVG↑
1 GigaWorld-1-Plus 5B Robot/Auto 0.3534 0.6765 0.9337 0.8926 0.8883 0.3561 0.6834
2 GigaWorld-1-Nano 1.3B Robot/Auto 0.3538 0.6802 0.8911 0.8920 0.8600 0.3528 0.6716
3 Cosmos-Predict2.5 2B Robot/Auto 0.3491 0.7184 0.6781 0.8764 0.8747 0.1770 0.6123
4 Wan 2.2 5B TI2V 5B General 0.3538 0.6980 0.5853 0.8789 0.8883 0.1643 0.5948
5 LTX 2.3 22B General 0.3900 0.6967 0.5380 0.8678 0.8248 0.1479 0.5775
6 CogVideoX 5B General 0.3303 0.6775 0.6437 0.8633 0.6963 0.1609 0.5620
7 SVD 1.5B General 0.2861 0.6497 0.6454 0.8411 0.8267 0.0926 0.5569
8 Wan 2.1 1.3B I2V 1.3B General 0.3422 0.6856 0.6002 0.8705 0.5568 0.1576 0.5355

Table: Robot-oriented models generally outperform generic video backbones on embodied rollout evaluation. Cell background colors indicate ranking: green = 1st, yellow = 2nd, red = last. All metrics are higher-is-better (↑).

Multi-View Trajectory Accuracy

Trajectory Accuracy

Trajectory Accuracy. Trajectory accuracy under multi-view control. The same control signal is faithfully reflected in both head and wrist views, with sub-pixel alignment between predicted and ground-truth keypoints.

Multi-view control rollout grid

Precise Multi-View Control. Pixel-aligned, view-specific control signals keep every camera anchored to the same underlying action.

Closed-Loop Policy Consistency

Closed-loop success-rate correlation

Closed-loop Success Correlation

Closed-loop success-rate per model

Closed-loop Success Rate

Fold the box — closed-loop rollout
❌ Fail
❌ Fail
❌ Fail
✅ Success
Put banana in basket — closed-loop rollout
❌ Fail
✅ Success
❌ Fail
✅ Success
Put bowl on plate — closed-loop rollout
✅ Success
✅ Success
❌ Fail
✅ Success
Pour fries on plate — closed-loop rollout
✅ Success
✅ Success
❌ Fail
❌ Fail

Physically Consistent World Model

backbone
ours
backbone
ours
backbone
ours
backbone
ours
backbone
ours

OOD Generalization

Beyond in-distribution WMBench, we test OOD rollout consistency across novel actions, unseen backgrounds, new containers, and unseen food categories.

Background Material — unseen table surface
Wood Table
Metal Table
Photo Table
Tablecloth
Foreground Container — unseen bowl types
White Bowl
Wooden Bowl
Red Bowl
Stainless Steel Bowl
Food Category — unseen food types
Rice
Noodles
Steak
Braised Pork
Action Trajectory — unseen manipulation patterns
✅ Bottom-Left
✅ Center
❌ Failure
❌ Failure

⚡ Flash & Ultra-Long-Horizon Generation

🖥️ Consumer-Ready
< 24 GB
Peak VRAM fits in a single RTX 4090 — runs on off-the-shelf consumer hardware, no datacenter required.
⚡ Flash Inference
> 20 FPS
12 s produces 33 s of 1920×480 video on a single H20 — fast enough for interactive use.
♾️ Infinite Horizon
> 10 min
Sustained 11,000+ frames of autoregressive rollout without crashing or quality drift.
Inference Acceleration Benchmark
📊 Acceleration Benchmark

DMD2 step-distillation + tensor-parallel scaling pushes the world model to 27.23× on 4 GPUs, unlocking real-time interactive use.

♾️ Real-time Generation

⚡ Flash & Ultra-Long-Horizon Generation. Real-time uninterrupted generation demo: over 10 minutes and 11K frames.

Transfer & Generalization

Beyond standard OOD evaluation, GigaWorld-1 demonstrates rapid transfer capabilities across new domains and conditioning modalities. The same world-model design can be effortlessly adapted to autonomous driving, supports edge / depth map conditioning, and generalizes across different robot embodiments. Fine-tuning on new datasets or conditioning modalities typically requires only 8 GPUs and a few thousand training steps, making domain adaptation and modality extension highly efficient.

Cross-Domain Transfer
Autonomous Driving
AGI Robot
Control-Condition Transfer
First frame condition
First frame condition
Edge-to-Video
Depth-to-Video

VLM-assisted Rollout Evaluator

VLM-assisted rollout evaluator pipeline

The VLM-assisted rollout evaluator scores generated videos on the same 1–3 ordinal scale as human annotators. Across 5,000+ videos, it reaches 87.80% exact agreement and 99.16% adjacent agreement with human ratings, with only 0.84% off by two score levels. Strong correlations (Spearman 0.7574, Kendall τb 0.7507) show that it is a reliable proxy for method-level WMES comparison.

Score 1
Put panda into pink plate.
GT1Pred1Match0.763LevelsAll levels matchReason0.472
Overall Video Quality Medium

Video is split into three views, but all are shaky and blurry with frequent motion blur and fast panning.

Medium

Video is blurry and shaky, but objects (panda, plates, banana) are identifiable across all three views.

0.355
Instruction Following Poor

The robot arm attempts to pick up a panda but fails to place it into the pink plate; the panda remains on the table.

Poor

The robot arm attempts to grasp the panda but fails to place it into the pink plate; the panda remains on the table.

0.863
Physical Adherence Medium

Contact is attempted but unstable; the panda is not securely grasped or placed, and final state shows panda outside the plate.

Medium

The robot arm makes contact with the panda, but the grasp is unstable and the panda is not lifted or moved toward the plate.

0.467
GT Reason

The robot fails to place the panda into the pink plate; the panda is not inside the plate in the final state. Video quality is shaky and blurry, but the failed task is clear.

Pred Reason

The robot arm attempts to grasp the panda but fails to lift or move it into the pink plate. The panda remains on the table, and the final state does not show the task completed.

Put green bowl into pink plate.
GT1Pred1Match0.766LevelsAll levels matchReason0.473
Overall Video Quality Poor

Video is extremely blurry, shaky, and lacks clear object details, making it difficult to identify objects or actions.

Poor

Video is extremely blurry, shaky, and lacks clear object visibility, making it difficult to discern actions or final states.

0.737
Instruction Following Poor

The task 'put green bowl into pink plate' is not completed; the final state does not show a green bowl inside a pink plate.

Poor

The video does not clearly show the green bowl being placed into the pink plate; the final state is ambiguous and not visible.

0.503
Physical Adherence Unclear

Due to severe visual corruption, it's impossible to assess physical interactions, contact, or stability.

Unclear

Due to severe visual corruption and lack of clear contact or placement, physical interactions cannot be reliably assessed.

0.478
GT Reason

The video is too corrupted to clearly identify objects or actions. The task is not completed, and the final state is ambiguous due to poor video quality.

Pred Reason

The video is severely corrupted with extreme blur and motion, preventing clear identification of objects or actions. The task of putting the green bowl into the pink plate is not visibly completed, and the final state is ambiguous. Therefore, the task is failed.

Put banana into basket.
GT2Pred2Match0.801LevelsAll levels matchReason0.661
Overall Video Quality Medium

The video is clear with consistent views from three cameras, but the action is partially obscured by robotic arms and objects.

Medium

Video is clear but shows a split view; the right panel is often blurry or shows only a partial object.

0.290
Instruction Following Medium

The robot arm picks up the banana and moves it toward the basket, but the final placement is not clearly visible.

Medium

The robot arm picks up the banana and moves it toward the basket, but the final placement is not clearly shown.

0.922
Physical Adherence Medium

The banana is grasped and moved plausibly, but the interaction with the basket is not fully shown, leaving stability and containment ambiguous.

Medium

The banana is grasped and moved, but the video ends before confirming stable containment in the basket.

0.477
GT Reason

The robot arm attempts to place the banana into the basket, showing meaningful progress. However, the final state is not clearly visible, making completion ambiguous and supporting a partial score.

Pred Reason

The robot successfully grasps the banana and moves it toward the basket, showing meaningful progress. However, the final state is not clearly visible, making completion ambiguous.

Put panda into pink plate.
GT2Pred2Match0.788LevelsAll levels matchReason0.584
Overall Video Quality Medium

Three camera views are provided, but motion blur and varying camera angles reduce clarity in some frames.

Medium

Video is split into three views; middle and right panels are often blurry or obstructed, making detailed observation difficult.

0.238
Instruction Following Medium

The robot picks up the panda toy and moves it toward the pink plate, but the final placement is not clearly completed.

Medium

Robot arm picks up the panda toy and moves it toward the pink plate, but the final placement is not clearly shown.

0.883
Physical Adherence Medium

The robot arm makes plausible contact with the panda toy and moves it, but the final state in the pink plate is ambiguous.

Medium

The robot arm grasps the panda and moves it, but the final state of the panda on the plate is ambiguous due to camera angles.

0.593
GT Reason

The robot successfully picks up the panda toy and initiates movement toward the pink plate, showing meaningful progress. However, the final placement into the plate is not clearly visible or completed in the provided frames.

Pred Reason

The robot arm successfully picks up the panda and moves it toward the pink plate, showing progress. However, the final placement is not clearly visible, making the task completion ambiguous.

Put panda into pink plate.
GT3Pred3Match0.787LevelsAll levels matchReason0.554
Overall Video Quality Good

All three camera views are clear, showing consistent object positions and robot movements without major artifacts.

Good

The video is clear, with consistent object identities and smooth camera movements across all three views.

0.482
Instruction Following Good

The robot picks up the panda toy and places it directly onto the pink plate, achieving the requested action.

Good

The robot successfully picks up the panda toy and places it into the pink plate, as instructed.

0.639
Physical Adherence Good

The robot arm makes plausible contact with the panda, lifts it, moves it, and releases it stably onto the plate.

Good

The robot's gripper makes plausible contact with the panda, lifts it, and releases it into the plate without physical impossibilities.

0.637
GT Reason

The video clearly shows the robot arm successfully grasping the panda toy, transporting it, and placing it into the pink plate. All three views confirm the action and stable final state.

Pred Reason

The video shows the robot arm picking up the panda toy and placing it into the pink plate, which is the correct final state. The action is completed, and the final state is stable and correct.

Put panda into pink plate.
GT3Pred3Match0.754LevelsAll levels matchReason0.588
Overall Video Quality Good

Three synchronized camera views are clear, with consistent object identity and no major visual artifacts.

Good

The video is clear, with consistent object identities and smooth camera movements across all three views.

0.453
Instruction Following Good

The robot arm picks up the panda toy and places it directly into the pink plate, fulfilling the instruction.

Good

The robot correctly identifies the panda and moves it into the pink plate, completing the task as instructed.

0.490
Physical Adherence Good

The robot's grasp is stable, the panda is lifted and moved plausibly, and it lands securely inside the plate.

Good

The robot's gripper makes plausible contact with the panda, lifts it, and places it stably onto the plate.

0.418
GT Reason

The video clearly shows the robot successfully picking up the panda and placing it into the pink plate, with stable motion and a correct final state across all three views.

Pred Reason

The video shows the robot successfully picking up the panda and placing it into the pink plate, with clear visual evidence of the final state. All aspects of the task are completed correctly and plausibly.

WMBench at a Glance

WMBench evaluates world models as robot policy evaluators, not only as video generators. Its protocol mirrors real deployment: collect real policy rollouts, train world models under a strict holdout split, run closed-loop policy evaluation inside the learned world model, and compare generated outcomes with real-world conclusions.

Step 1

Real-World Collection

Collect physical closed-loop policy rollouts with initial observations, task instructions, multi-view videos, and human success labels.

Step 2

Strict World-Model Holdout

Train on the designated split while holding out test layouts and initial states to measure generalization rather than memorization.

Step 3

Closed-Loop Rollout

The policy acts on generated observations, and the world model feeds predictions back to the policy until task termination.

Step 4

Metrics & Outcome Assessment

Automatic diagnostics measure rollout quality, while WMES from human or VLM evaluators measures task-level outcome alignment.

Outcome Evaluation

WMES measures whether generated rollouts support task-level policy comparison and preserve real-world success conclusions.

  • Human WMES
  • VLM-assisted WMES
  • PCC
  • MMRV

Frame & Representation Fidelity

Perceptual quality, content stability, and feature-level similarity across generated and reference videos.

  • Image Quality
  • Aesthetic Quality
  • JEPA Similarity
  • Subject Consistency
  • Background Consistency
  • Photometric Consistency

Geometry, Semantics & Interaction

Spatial structure, 3D plausibility, task semantics, instruction following, and action-conditioned robot-object interaction.

  • Geometry Accuracy
  • Perspectivity
  • Semantic Alignment
  • Instruction Following
  • Interaction Quality
  • Trajectory Accuracy

Motion & Long-Horizon Rollout

Short-term motion behavior and long-horizon autoregressive degradation, including temporal smoothness, flow dynamics, and standard video-generation distances.

  • Dynamic Degree
  • Flow Score
  • Motion Smoothness
  • PSNR
  • FID
  • FVD

Summary tables focus on six key diagnostics: Aesthetic Quality, Image Quality, JEPA Similarity, Semantic Alignment, Subject Consistency, and Trajectory Accuracy.

Citation

@article{gigaworld2025,
  title     = {GigaWorld-1: A Roadmap to World Models for Robot Policy Evaluation},
  author    = {{GigaAI}},
  journal   = {arXiv preprint},
  year      = {2025}
}