Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation

Qingwen Bu1,4    Hongyang Li2,4    Li Chen2    Jisong Cai4    Jia Zeng4    Heming Cui2    Maoqing Yao3    Yu Qiao4   


1Shanghai Jiao Tong University          2The University of Hong Kong          3AgiBot          4Shanghai AI Lab          


Overview of RoboDual. Our objective is to develop a synergistic dual-system framework that combines the generalizability of a large-scale pre-trained generalist with the efficient, task-specific adaptation of a specialist. (a) The fast specialist policy delivers real-time, accurate control, aided by the slower yet broadly generalized outputs of the generalist trained on large-scale data. (b) RoboDual exhibits significant improvements in both performance and efficiency over either standalone option and surpasses previous state-of-the-art methods in the real-robot setting.

Abstract

The increasing demand for versatile robotic systems to operate in diverse and dynamic environments has emphasized the importance of a generalist policy, which leverages a large cross-embodiment data corpus to facilitate broad adaptability and high-level reasoning. However, the generalist struggles with inefficient inference and costly training. The specialist policy, in contrast, is curated for specific domain data and excels at task-level precision with efficiency, yet lacks the generalization capacity required for a wide range of applications. Inspired by these observations, we introduce RoboDual, a synergistic dual-system that combines the merits of both generalist and specialist policies. A diffusion-transformer-based specialist is devised for multi-step action rollouts, conditioned on the high-level task understanding and discretized action outputs of a vision-language-action (VLA) based generalist. Compared to OpenVLA, RoboDual achieves a 12% improvement on CALVIN and a 26.7% improvement in real-world settings while adapting the specialist policy with only 20M trainable parameters. It maintains strong performance with merely 5% of demonstration data, and enables a 3.8× higher control frequency in real-world deployment. Code and models will be made publicly available.

Model Overview


The overall architecture of RoboDual. (a) Generalist: The generalist takes RGB images and language prompts as inputs, generating conditioning sources for the specialist model, including latent representations and discretized actions. (b) Specialist: Comprising stacked Diffusion Transformer (DiT) blocks, the specialist is conditioned on multiple sensory inputs and the generalist's outputs through a cross-attention mechanism. It predicts the noise injected into ground-truth actions, providing fast, precise control by leveraging the slower, high-level guidance of the generalist.
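The specialist's denoising objective described above can be sketched with toy numbers: a ground-truth action chunk is corrupted according to a noise schedule, and the network is trained to predict the injected noise. The shapes, the schedule value, and the `add_noise` helper below are illustrative assumptions, not the paper's implementation; a perfect predictor stands in for the DiT purely to show the loss form.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(action_chunk, alpha_bar_t, noise):
    """Forward diffusion: corrupt a ground-truth action chunk at one timestep."""
    return np.sqrt(alpha_bar_t) * action_chunk + np.sqrt(1.0 - alpha_bar_t) * noise

# Toy 7-DoF action chunk over 8 future steps (shapes are illustrative).
a0 = rng.standard_normal((8, 7))
eps = rng.standard_normal((8, 7))
alpha_bar = 0.5  # cumulative noise-schedule value at the sampled timestep

a_t = add_noise(a0, alpha_bar, eps)

# A specialist eps_hat = DiT(a_t, t, conditioning) would be trained to recover
# eps; here a perfect predictor stands in to make the loss form concrete.
eps_hat = eps
loss = float(np.mean((eps - eps_hat) ** 2))
```

At inference, iterating this denoising step from pure noise, with the generalist's latents and discretized actions injected via cross-attention, yields the multi-step action rollout.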

Real-world Visuomotor Control Tasks

Success Rates of Real-World Experiments

All real-world experiments are conducted with an AIRBOT Play robotic arm featuring a 7-DoF action space and a third-view RGB camera. We evaluate different policies on both single-instruction tasks ("Lift the Pod Lid", "Pour Shrimp into Bowl", and "Push the Block Left") and multi-instruction tasks ("Put [object] into Basket" and "Knock [object] Over").


Single-instruction Tasks

Push block left
Pour shrimp into bowl

Lift the pod lid

Multi-instruction Tasks

RoboDual shows strong instruction-following ability and excels at multi-instruction tasks.

Task 1: Put [object] into the basket

Put banana into basket
Put carrot into basket

Put eggplant into basket

Task 2: Knock the [object] over

Knock the stuffed bear over

Knock the stuffed egg over

Knock the stuffed kitten over

Generalization Experiment Setting

Generalization Setting
Generalizability Results

Generalizability Evaluation in the Real World

Position Variation

Regrasping at position #1

Regrasping at position #2

Position #3

Position #4

Visual Distractor

Original

Changed

Unseen Background

Original

Changed

Novel Object

Put banana into plate

Put eggplant into plate

More Generalization Experiments

The following experiments are conducted on an NVIDIA RTX 4060 laptop GPU with only 8 GB of memory. We apply 4-bit quantization to OpenVLA and to our generalist model so that they fit on the device. The specialist of RoboDual still runs at full precision.
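A back-of-the-envelope calculation shows why 4-bit quantization is necessary on this device. The ~7B parameter count and the weights-only accounting below are illustrative assumptions; activations and the KV cache add further overhead on top of these lower bounds.

```python
# Weights-only memory footprint of a ~7B-parameter VLA at different precisions.
params = 7e9
gib = 1024 ** 3

fp16_gib = params * 2 / gib    # 2 bytes/param: exceeds an 8 GB laptop GPU
int4_gib = params * 0.5 / gib  # 0.5 bytes/param: leaves headroom on 8 GB
```

Under these assumptions, fp16 weights alone need roughly 13 GiB, while 4-bit weights take about 3.3 GiB, leaving room for the full-precision specialist alongside.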

Put Block into Bowl (Original)

Human Intervention

Various Distractions + Novel Object


Inference Efficiency Comparison with OpenVLA

RoboDual achieves a control frequency of 15 Hz in our real-world setup using NVIDIA A5000 Ada GPUs, facilitating deployment in more dexterous tasks. Notably, inference latency is a primary factor contributing to the performance degradation of OpenVLA: operating at only 3.9 Hz within our system, it significantly alters the system dynamics compared to the 20 Hz non-blocking controller used in our real-world tasks.
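The interplay of the two rates can be illustrated with a toy one-second schedule: the specialist ticks at 15 Hz and acts on the most recent generalist latent, which refreshes at only 3.9 Hz, so most control ticks reuse a cached latent rather than blocking on the slow model. The loop below is our simplification for illustration, not the actual asynchronous implementation.

```python
SPECIALIST_HZ = 15    # fast diffusion-based specialist
GENERALIST_HZ = 3.9   # slow VLA-based generalist

latent_version = -1
reused = 0
for tick in range(SPECIALIST_HZ):        # one second of control ticks
    t = tick / SPECIALIST_HZ             # wall-clock time of this tick
    available = int(t * GENERALIST_HZ)   # generalist outputs finished so far
    if available == latent_version:
        reused += 1                      # act on the cached latent
    latent_version = available
    # action = specialist(observation, cached_latent)  # fast rollout each tick
```

In this toy second, only four fresh generalist latents arrive, and the specialist bridges the remaining eleven ticks with cached conditioning, which is what keeps control at 15 Hz.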

Put banana into the basket


Simulation Evaluations on CALVIN

We compare the performance of RoboDual with other state-of-the-art methods on CALVIN ABC-D. RoboDual improves the average length of completed tasks from 3.27 to 3.66, and elevates the success rate of accomplishing 5 consecutive tasks by 13.2%. Additionally, we investigate the robustness of various methods to free-form task instructions. We incorporate RoboFlamingo and LCB, both of which also utilize LLMs (MPT-1B and LLaVA-7B respectively), as our baseline approaches. All methods are trained exclusively on the ABC split using the original language annotations and are evaluated with GPT-4-enriched ones. While the performance of the baseline methods decreases significantly compared to their results in the default setting, our method is minimally affected and nearly doubles the average length compared to LCB. This improvement can be attributed to both the semantic understanding capability of the generalist and the specialist model's robustness to variations in conditioning latents.
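CALVIN's headline metric, the average length over 5-task chains, can be sketched as follows. Each evaluation rollout is a chain of five language-conditioned subtasks that ends at the first failure; `avg_completed_length` is a hypothetical helper written for illustration, not the benchmark's own code.

```python
def avg_completed_length(rollouts):
    """CALVIN-style metric: each rollout is a list of 5 per-subtask booleans;
    a chain ends at the first failure, and the score is the mean chain length."""
    lengths = []
    for chain in rollouts:
        n = 0
        for ok in chain:
            if not ok:
                break
            n += 1
        lengths.append(n)
    return sum(lengths) / len(lengths)

# Toy evaluation: two perfect chains and one chain failing on the third subtask.
result = avg_completed_length([
    [True] * 5,
    [True] * 5,
    [True, True, False, False, False],
])
```

A score of 3.66 on this metric thus means that, on average, nearly four of the five chained subtasks are completed per rollout.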

Language-conditioned visuomotor control on CALVIN ABC→D


Robustness to free-form language instructions


BibTex

@article{bu2024robodual,
  title={Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation},
  author={Bu, Qingwen and Li, Hongyang and Chen, Li and Cai, Jisong and Zeng, Jia and Cui, Heming and Yao, Maoqing and Qiao, Yu},
  journal={arXiv preprint arXiv:2410.08001},
  year={2024}
}