DextrAH-RGB: Visuomotor Policies to Grasp Anything with Dexterous Hands

¹NVIDIA ²University of California, Berkeley

Abstract

One of the most important yet challenging robot skills is dexterous, multi-fingered grasping of a diverse range of objects. Much of the prior work is limited in speed or dexterity, or relies on depth maps. In this paper, we introduce DextrAH-RGB, a system that performs dexterous arm-hand grasping end-to-end from stereo RGB input. We train a teacher policy in simulation through reinforcement learning, acting on a geometric fabric action space to ensure reactivity and safety. We then distill this teacher into an RGB-based student, also in simulation. To our knowledge, this is the first work to demonstrate robust sim2real transfer of an end-to-end RGB-based policy for a complex, dynamic, contact-rich task such as dexterous grasping. Our policies also generalize to novel objects with geometries, textures, and lighting conditions unseen during training.

Training Pipeline

We use a two stage training pipeline. First, we train a state-based teacher policy in simulation through reinforcement learning that acts on a geometric fabric action space to ensure reactivity and safety. Then, we distill this teacher into an RGB-based student in simulation.
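The two-stage pipeline can be sketched as a DAgger-style distillation loop: roll out in simulation, query the teacher for action labels, and regress the student onto them. The sketch below is a minimal stand-in, not the paper's implementation: the linear student, placeholder teacher weights, dimensions, and learning rate are all illustrative assumptions (the real teacher is a state-based RL policy acting in a geometric fabric action space, and the real student consumes stereo RGB).

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_policy(state):
    # Placeholder for the privileged, state-based teacher: maps simulator
    # state to an action in the (fabric) action space.
    W_t = np.eye(state.shape[0])          # illustrative fixed weights
    return np.tanh(W_t @ state)

class Student:
    """Toy stand-in for the RGB-based student policy (linear for brevity)."""
    def __init__(self, obs_dim, act_dim, lr=1e-2):
        self.W = np.zeros((act_dim, obs_dim))
        self.lr = lr

    def act(self, obs):
        return self.W @ obs

    def update(self, obs, target):
        # Regress the student's action onto the teacher's action label.
        err = self.act(obs) - target
        self.W -= self.lr * np.outer(err, obs)
        return float((err ** 2).mean())

# Distillation loop: the student never sees privileged state, only a noisy
# observation standing in for its RGB encoding of the scene.
student = Student(obs_dim=8, act_dim=8)
losses = []
for step in range(500):
    state = rng.standard_normal(8)                   # simulator state
    obs = state + 0.1 * rng.standard_normal(8)       # stand-in for RGB encoding
    losses.append(student.update(obs, teacher_policy(state)))
```

Over the loop, the imitation loss decreases as the student converges toward the teacher's behavior on its own observation distribution.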


Stereo Encoder

Our stereo encoder consists of a ResNet-18 backbone and a stereo attention module. The left and right images are first passed, in a Siamese (weight-shared) manner, through the ResNet-18 backbone with its last two layers removed. Each output feature map is reshaped into 128 keys, each 128-dimensional. These are then fed into the stereo attention transformer along with a learnable [embed] token. The transformer performs cross-attention between the tokens from the left and right images, and the output at the [embed] token is used as the stereo embedding.
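The token shapes and the role of the [embed] token can be sketched numerically. This is a simplified, hedged approximation: the real module uses learned query/key/value projections and cross-attention between views, whereas the sketch below runs a single attention pass with identity projections over the concatenated token set; all tensor values are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def stereo_attention(left_tokens, right_tokens, embed_token, d=128):
    # Concatenate the [embed] token with tokens from both views:
    # 1 + 128 + 128 = 257 tokens of dimension 128.
    tokens = np.concatenate([embed_token[None, :], left_tokens, right_tokens], axis=0)
    # Single-head scaled dot-product attention (identity Q/K/V for brevity).
    scores = tokens @ tokens.T / np.sqrt(d)   # (257, 257) attention logits
    attn = softmax(scores, axis=-1)           # rows sum to 1
    out = attn @ tokens                       # attention-weighted values
    # The output at the [embed] token position is read off as the embedding.
    return out[0]

# 128 keys of dimension 128 per view, as described above.
left = rng.standard_normal((128, 128))
right = rng.standard_normal((128, 128))
embed = rng.standard_normal(128)              # stand-in for the learnable [embed] token
embedding = stereo_attention(left, right, embed)
```

Reading the output only at the [embed] position gives a fixed-size summary of both views, regardless of how many image tokens attend to each other.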

Stereo encoder architecture
Attention map for the transformer

RGB Training in Simulation

Stereo RGB Input