Introducing AgiBot World Colosseo:
A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

March 10, 2025

We explore how scalable robot data can address real-world challenges for generalized robotic manipulation. Introducing AgiBot World, a large-scale platform comprising over 1 million trajectories across 217 tasks in five deployment scenarios. Accelerated by a standardized collection pipeline with human-in-the-loop verification, AgiBot World guarantees high-quality and diverse data distribution.

Building on top of AgiBot World, we introduce Genie Operator-1 (GO-1), a novel generalist policy that leverages latent action representations to maximize data utilization, demonstrating predictable performance scaling with increased data volume.

Policies pre-trained on our dataset achieve an average performance improvement of 30% over those trained on Open X-Embodiment. GO-1 exhibits exceptional capability in real-world dexterous and long-horizon tasks, achieving an over-60% success rate on complex tasks and outperforming the prior SOTA policy RDT by 32%.

Full Dataset: A Large-Scale High-Fidelity Robot Dataset

AgiBot World Colosseo is a full-stack and open-source embodied intelligence ecosystem. Based on our hardware platform AgiBot G1, we construct AgiBot World—an open-source robot manipulation dataset collected by more than 100 homogeneous robots, providing high-quality data for challenging tasks spanning a wide spectrum of real-life scenarios.

We go beyond basic tabletop tasks such as pick-and-place in lab environments; instead, we concentrate on real-world scenarios involving dual-arm manipulation, dexterous hands, and collaborative tasks. AgiBot World aims to provide an inclusive benchmark to drive the future development of advanced and robust algorithms.

Figure 1: Data collection pipeline. AgiBot World embraces a human-in-the-loop framework to ensure high quality, enriched with detailed annotations and error recovery behaviors.

Concurrent with feedback collection from data annotators, we adopt a human-in-the-loop approach to assess and refine data quality. This process involves an iterative cycle of collecting a small set of demonstrations, training a policy, and deploying the resulting policy to evaluate how usable the data is for learning. Based on the policy's performance, we iteratively refine the data collection pipeline to address identified gaps or inefficiencies. This feedback-driven methodology ensures continuous improvement in data quality.

GO-1 Model: A Scalable Robot Foundation Policy

Figure 2: GO-1 is a robot foundation policy using latent action representations to unlock web-scale pre-training on heterogeneous data.

To effectively utilize web-scale heterogeneous videos alongside our high-quality AgiBot World dataset and enhance the policy's generalizability, we propose a hierarchical Vision-Language-Latent-Action (ViLLA) framework. Compared to Vision-Language-Action (VLA) models, where actions are conditioned directly on vision and language, the ViLLA model predicts latent action tokens, bridging the gap between image-text inputs and the robot actions generated by the action expert.

Latent Action Model

Despite considerable advancements in gathering diverse robot demonstrations, the volume of action-labeled robot data remains limited relative to web-scale datasets. To broaden the data pool by incorporating internet-scale human videos lacking action labels and cross-embodiment robot data, we employ latent actions to model the inverse dynamics of consecutive frames. This approach enables the transfer of real-world dynamics from heterogeneous data sources into universal manipulation knowledge.

To extract latent actions from video frames, the latent action model is constructed around an inverse dynamics model-based encoder and a forward dynamics model-based decoder. The encoder employs a spatial-temporal transformer with causal temporal masks, while the decoder is a spatial transformer that takes the initial frame and discretized latent action tokens as input. The latent action tokens are quantized using a VQ-VAE objective.
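The quantization step can be illustrated with a toy sketch: each continuous encoder output is snapped to its nearest codebook entry, yielding a discrete token id. All sizes here (8 tokens per clip, a 32-entry codebook, 16-dim embeddings) are hypothetical placeholders, not the model's actual configuration.

```python
import numpy as np

# Toy sketch of VQ-VAE-style quantization of latent actions.
# Sizes are illustrative assumptions, not GO-1's real hyperparameters.
rng = np.random.default_rng(0)
CODEBOOK_SIZE, DIM, NUM_TOKENS = 32, 16, 8

codebook = rng.normal(size=(CODEBOOK_SIZE, DIM))   # learned in practice
encoder_out = rng.normal(size=(NUM_TOKENS, DIM))   # IDM-based encoder features

def quantize(z, codebook):
    """Map each continuous latent to its nearest codebook entry."""
    # Pairwise squared distances between latents and codebook vectors.
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)                         # discrete latent token ids
    return codebook[idx], idx

z_q, token_ids = quantize(encoder_out, codebook)
# The commitment term of the VQ objective pulls encoder outputs
# toward their assigned codes during training.
commitment_loss = ((encoder_out - z_q) ** 2).mean()
print(token_ids.shape, z_q.shape)   # (8,), (8, 16)
```

In the full model, gradients flow through the quantization via a straight-through estimator, and the decoder reconstructs future frames from the initial frame plus these discrete tokens.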

Latent Planner

With the aim of establishing a solid foundation for scene and object understanding and general reasoning ability, the ViLLA model harnesses a VLM pre-trained on web-scale vision-language data and incorporates a latent planner for embodiment-agnostic planning within the latent action space. We use InternVL2.5-2B as the VLM backbone due to its strong transfer learning capabilities. Multiview image observations are first encoded using InternViT before being projected into the language space. The latent planner consists of 24 transformer layers, which enable layer-by-layer conditioning from the VLM backbone with full bidirectional attention.

Specifically, given multiview input images (typically from the head, left wrist, and right wrist) along with a language instruction describing the ongoing task, the latent planner predicts latent action tokens, with supervision produced by the LAM encoder. Since the latent action space is orders of magnitude smaller than the discretized low-level actions used in OpenVLA, this approach also facilitates the efficient adaptation of general-purpose VLMs into robot policies.
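The planner's training signal can be sketched as a classification problem: the planner emits logits over a small latent-action vocabulary, and the token ids produced by the LAM encoder serve as targets. The vocabulary size and token count below are assumptions for illustration only.

```python
import numpy as np

# Hypothetical sketch of the latent planner's supervision: cross-entropy
# between planner logits and LAM-encoder token ids (masked-LM-style targets).
rng = np.random.default_rng(1)
VOCAB, K = 32, 8                               # assumed latent vocab / token count

planner_logits = rng.normal(size=(K, VOCAB))   # would come from the planner
target_ids = rng.integers(0, VOCAB, size=K)    # produced by the LAM encoder

def cross_entropy(logits, targets):
    # Numerically stable log-softmax, then gather target log-probabilities.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

loss = cross_entropy(planner_logits, target_ids)
```

A vocabulary of a few dozen codes is far smaller than per-dimension action binning over many degrees of freedom, which is what makes adapting a general-purpose VLM to this objective comparatively cheap.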

Action Expert

To achieve high-frequency and dexterous manipulation, we integrate an action expert that utilizes a diffusion objective to model the continuous distribution of low-level actions. Although the action expert shares the same architectural framework as the latent planner, their objectives diverge: the latent planner generates discretized latent action tokens through masked language modeling, while the action expert regresses low-level actions via an iterative denoising process. Both expert modules are conditioned hierarchically on preceding modules, including the action expert itself, ensuring coherent integration and information flow within the dual-expert system.
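The iterative denoising idea can be shown with a toy loop: starting from Gaussian noise, each step moves the sample toward the denoiser's prediction. A real diffusion policy uses a learned noise-prediction network and a proper noise schedule; here the "denoiser" is an oracle that already knows the target action, so the sketch only illustrates the iterative refinement, not the actual objective.

```python
import numpy as np

# Toy illustration of iterative denoising toward a low-level action.
# The oracle predictor and 3-DoF action are assumptions for illustration.
rng = np.random.default_rng(2)
target_action = np.array([0.3, -0.1, 0.5])     # hypothetical 3-DoF action

def denoise(x, predict, steps=50, step_size=0.2):
    for _ in range(steps):
        x = x + step_size * (predict(x) - x)   # move toward the prediction
    return x

x0 = rng.normal(size=3)                        # pure-noise initialization
action = denoise(x0, predict=lambda x: target_action)
print(np.abs(action - target_action).max())    # residual shrinks geometrically
```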

During inference, the VLM, latent planner, and action expert are synergistically combined within the generalist policy GO-1, which initially predicts k latent action tokens and subsequently conditions the denoising process to produce the final control signals.
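The inference-time information flow can be sketched with stand-in stubs: the VLM fuses observations and the instruction, the latent planner predicts k discrete latent tokens, and the action expert denoises a continuous action conditioned on both. Every function body below is a placeholder; the dimensions (k = 8 tokens, 14 action dims) are assumptions, not GO-1's real interfaces.

```python
import numpy as np

# Sketch of GO-1's inference flow with stub modules (all placeholders).
rng = np.random.default_rng(3)
K, ACTION_DIM = 8, 14                    # hypothetical token count / action dims

def vlm_encode(images, instruction):
    return rng.normal(size=64)           # stub: fused vision-language features

def latent_planner(features):
    return rng.integers(0, 32, size=K)   # stub: k discrete latent action tokens

def action_expert(features, latent_tokens, steps=10):
    x = rng.normal(size=ACTION_DIM)      # start from noise
    for _ in range(steps):               # stub denoising loop
        x = 0.5 * x                      # real model: learned denoiser step
    return x

feats = vlm_encode(images=["head", "left_wrist", "right_wrist"],
                   instruction="restock the beverage shelf")
tokens = latent_planner(feats)
action = action_expert(feats, tokens)
print(tokens.shape, action.shape)        # (8,), (14,)
```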

Experiment

Does AgiBot World boost policy learning at scale?

Figure 3: Policies pre-trained on our dataset outperform those trained on OXE in both seen (0.77 vs. 0.47) and out-of-distribution scenarios (0.67 vs. 0.38).

We choose the open-source RDT model to study how much the AgiBot World dataset can help policy learning. Models pre-trained on the AgiBot World dataset demonstrate a significant improvement in the “Table Bussing” task, nearly tripling performance. On average, the completion score increases by 0.30 and 0.29 for in-distribution and out-of-distribution setups, respectively. Notably, the AgiBot World alpha dataset, despite having a significantly smaller data volume than OXE (e.g., 236h compared to ~2000h), achieves a higher success rate, underscoring the exceptional data quality of our dataset.

Is GO-1 a more capable generalist policy?

Figure 4: We evaluate different models pre-trained on the AgiBot World dataset, where GO-1 significantly outperforms previous SOTA policy RDT.

We evaluate GO-1 on five tasks of varying complexity, categorized by their visual richness and task horizon. The results are averaged over 30 trials per task, with 10 trials conducted in a seen setup and 20 trials under variations or distractions. GO-1 significantly outperforms RDT, particularly in tasks such as “Pour Water”, which demands robustness to object positions, and “Restock Beverage”, which requires visual robustness and instruction following capabilities. The inclusion of the latent planner in the ViLLA model further improves performance, resulting in an average improvement of 0.12 task completion score.

Does GO-1's ability scale with data size?

Figure 5: We evaluate how model performance scales with data size.

To investigate whether a power-law scaling relationship exists between the size of pre-training data and policy capability, we conduct an analysis using a 10% subset of the alpha dataset, the full alpha dataset, and the beta dataset, with the number of training trajectories ranging from 9.2k to 1M. We evaluate the out-of-the-box performance of the resulting policies on four tasks seen during pre-training. As shown in Fig. 5, the performance of GO-1 exhibits a predictable power-law scaling relationship with the number of trajectories, supported by a Pearson correlation coefficient of r = 0.97.
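A power-law fit of this kind reduces to linear regression in log-log space, with the Pearson correlation computed there. The data points below are synthetic placeholders chosen only to match the quoted trajectory range, not the paper's actual measurements.

```python
import numpy as np

# Sketch of a power-law scaling fit: score ≈ a * N^b, fit in log-log space.
# The scores are synthetic placeholders, not the paper's results.
n_traj = np.array([9.2e3, 9.2e4, 1e6])     # trajectories per checkpoint
score = 0.1 * n_traj ** 0.15               # synthetic scores on an exact power law

log_n, log_s = np.log(n_traj), np.log(score)
b, log_a = np.polyfit(log_n, log_s, deg=1) # slope = power-law exponent
r = np.corrcoef(log_n, log_s)[0, 1]        # Pearson r in log-log space
print(f"exponent={b:.3f}, r={r:.2f}")
```

With real, noisy evaluation scores the fit is identical in form; an r near 1 in log-log space is what supports the power-law claim.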

Demos on AgiBot G1 robots with GO-1

Bottom Line

We introduce AgiBot World, an open-source ecosystem aimed at democratizing access to large-scale, high-quality robot learning datasets. It is complete with toolchains and foundation models to advance embodied general intelligence through community collaboration. Our dataset distinguishes itself through unparalleled scale, diversity, and quality, underpinned by carefully crafted tasks.

Policy learning evaluations confirm AgiBot World’s value in enhancing performance and generalizability. To further explore its impact, we develop GO-1, a generalist policy utilizing latent actions for webscale pre-training. GO-1 excels in real-world complex tasks, outperforming existing generalist policies and demonstrating scalable performance with increased data volume.

We invite the broader community to collaborate in fostering an ecosystem and maximizing the potential of our extensive dataset.

This article is written by Modi Shi, Yuxiang Lu, Huijie Wang, Chengen Xie, Qingwen Bu.
This project was accomplished by a broader team of contributors.