The overall architecture of RoboDual. (a) Generalist: The generalist takes RGB images and language prompts as input and generates the conditioning sources for the specialist model, including latent representations and discretized actions. (b) Specialist: Composed of stacked Diffusion Transformer (DiT) blocks, the specialist is conditioned on multiple sensory inputs and the generalist's outputs through a cross-attention mechanism. It predicts the noise injected into ground-truth actions, providing fast, precise control while leveraging the slower, high-level guidance of the generalist.
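To make the conditioning pathway concrete, here is a minimal PyTorch sketch of a specialist-style DiT block that denoises an action chunk while cross-attending to conditioning tokens. All layer sizes, token counts, and the block depth are illustrative assumptions rather than the paper's actual hyperparameters, and the noising step omits the DDPM noise schedule for brevity.

```python
# Minimal sketch of the specialist's conditioning pathway (assumed sizes).
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """One DiT block: self-attention over noisy action tokens, then
    cross-attention to conditioning tokens (generalist latents, embedded
    discretized actions, and sensory features)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, cond):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        x = x + self.cross_attn(self.norm2(x), cond, cond)[0]  # condition on generalist + sensors
        return x + self.mlp(self.norm3(x))

class Specialist(nn.Module):
    """Stacked DiT blocks predicting the noise added to ground-truth actions."""
    def __init__(self, action_dim=7, dim=256, depth=4):
        super().__init__()
        self.action_in = nn.Linear(action_dim, dim)
        self.blocks = nn.ModuleList([DiTBlock(dim) for _ in range(depth)])
        self.noise_out = nn.Linear(dim, action_dim)

    def forward(self, noisy_actions, cond_tokens):
        x = self.action_in(noisy_actions)   # (B, T, dim)
        for blk in self.blocks:
            x = blk(x, cond_tokens)
        return self.noise_out(x)            # predicted noise, same shape as actions

# Training objective sketch: epsilon-prediction as in standard DDPM.
B, T, A = 2, 8, 7                           # batch, action-chunk length, 7-DoF action
actions = torch.randn(B, T, A)              # ground-truth action chunk
noise = torch.randn_like(actions)
noisy = actions + noise                     # a real DDPM would scale by the noise schedule
cond = torch.randn(B, 16, 256)              # generalist latents + discrete-action embeddings
pred = Specialist()(noisy, cond)
loss = torch.nn.functional.mse_loss(pred, noise)
```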
All real-world experiments are conducted with an AIRBOT Play robotic arm featuring a 7-DoF action space and a third-person-view RGB camera. We evaluate different policies on both single-instruction tasks ("Lift the Pod Lid", "Pour Shrimp into Bowl", and "Push the Block Left") and multi-instruction tasks ("Put [object] into Basket" and "Knock [object] Over").
The following experiments are conducted on an NVIDIA RTX 4060 laptop GPU with only 8 GB of memory. We apply 4-bit quantization to OpenVLA and to our generalist model so that they fit on the device; the specialist of RoboDual still runs at full precision.
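As a concrete illustration, one plausible way to realize this setup is 4-bit loading through Hugging Face transformers with bitsandbytes; the published openvla/openvla-7b checkpoint supports this path, though the exact loading code used in these experiments is our assumption.

```python
# Sketch: loading a VLA checkpoint in 4-bit to fit an 8 GB GPU (assumed setup).
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

quant_cfg = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights via bitsandbytes
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/stability
)
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    quantization_config=quant_cfg,
    device_map="auto",                      # place quantized weights on the GPU
    trust_remote_code=True,
)
```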
RoboDual achieves a control frequency of 15 Hz in our real-world setup using NVIDIA A5000 Ada GPUs, facilitating deployment on more dexterous tasks. Notably, inference latency is a primary factor behind the performance degradation of OpenVLA: operating at only 3.9 Hz within our system, it significantly alters the closed-loop dynamics compared to the 20 Hz non-blocking controller used in our real-world tasks.
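The dual-rate behavior described here can be sketched as two decoupled loops: a slow thread that refreshes the generalist's conditioning, and a fast ~15 Hz loop in which the specialist acts on the most recent cached conditioning. The sketch below is a minimal illustration; read_sensors, run_generalist, run_specialist, and send_action are hypothetical stubs, not the system's actual interfaces.

```python
# Minimal sketch of a dual-rate control loop (all functions are stubs).
import threading, time

def read_sensors(): return {}                    # stub: camera + proprioception
def run_generalist(obs): return {"latents": 0}   # stub: slow VLA forward pass
def run_specialist(obs, cond): return [0.0] * 7  # stub: few-step DiT denoising
def send_action(action): pass                    # stub: non-blocking controller

latest_cond = None
lock = threading.Lock()

def generalist_loop():
    """Slow loop: refresh latents/discretized actions a few times per second."""
    global latest_cond
    while True:
        cond = run_generalist(read_sensors())    # expensive inference
        with lock:
            latest_cond = cond

def specialist_loop(hz=15):
    """Fast loop: act at ~15 Hz using the cached generalist conditioning."""
    period = 1.0 / hz
    while True:
        t0 = time.monotonic()
        with lock:
            cond = latest_cond
        if cond is not None:
            send_action(run_specialist(read_sensors(), cond))
        time.sleep(max(0.0, period - (time.monotonic() - t0)))

threading.Thread(target=generalist_loop, daemon=True).start()
specialist_loop()
```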
We compare RoboDual with other state-of-the-art methods on CALVIN ABC-D. Our method improves the average length of completed tasks from 3.27 to 3.66, and the success rate of completing 5 consecutive tasks rises by 13.2%. We further investigate the robustness of various methods to free-form task instructions, taking RoboFlamingo and LCB as baselines, both of which also utilize LLMs (MPT-1B and LLaVA-7B, respectively). All methods are trained exclusively on the ABC split with the original language annotations and evaluated with GPT-4-enriched ones. While the baselines degrade significantly compared to their results in the default setting, our method is minimally affected and nearly doubles the average length achieved by LCB. This improvement can be attributed to both the semantic understanding of the generalist and the specialist's robustness to variations in the conditioning latents.
@article{bu2024robodual,
  title   = {Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation},
  author  = {Qingwen Bu and Hongyang Li and Li Chen and Jisong Cai and Jia Zeng and Heming Cui and Maoqing Yao and Yu Qiao},
  journal = {arXiv preprint arXiv:2410.08001},
  year    = {2024}
}