DextrAH-RGB: Visuomotor Policies to Grasp Anything with Dexterous Hands

¹NVIDIA ²University of California, Berkeley

Abstract

One of the most important yet challenging robot skills is dexterous, multi-fingered grasping of a diverse range of objects. Much of the prior work is limited in speed or dexterity, or relies on depth maps. In this paper, we introduce DextrAH-RGB, a system that performs dexterous arm-hand grasping end-to-end from stereo RGB input. We train a teacher policy in simulation through reinforcement learning, acting on a geometric fabric action space to ensure reactivity and safety. We then distill this teacher into an RGB-based student, also in simulation. To our knowledge, this is the first work to demonstrate robust sim2real transfer of an end-to-end RGB-based policy for a complex, dynamic, contact-rich task such as dexterous grasping. Our policies also generalize to novel objects with geometries, textures, and lighting conditions unseen during training.

Training Pipeline

We use a two stage training pipeline. First, we train a state-based teacher policy in simulation through reinforcement learning that acts on a geometric fabric action space to ensure reactivity and safety. Then, we distill this teacher into an RGB-based student in simulation.
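The two-stage pipeline can be sketched as a DAgger-style distillation loop: roll out in simulation, query the teacher for action labels, and regress the student onto them. The sketch below is a minimal stand-in, not the paper's implementation: the linear student, placeholder teacher weights, dimensions, and learning rate are all illustrative assumptions (the real teacher is a state-based RL policy acting in a geometric fabric action space, and the real student consumes stereo RGB).

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_policy(state):
    # Placeholder for the privileged, state-based teacher: maps simulator
    # state to an action in the (fabric) action space.
    W_t = np.eye(state.shape[0])          # illustrative fixed weights
    return np.tanh(W_t @ state)

class Student:
    """Toy stand-in for the RGB-based student policy (linear for brevity)."""
    def __init__(self, obs_dim, act_dim, lr=1e-2):
        self.W = np.zeros((act_dim, obs_dim))
        self.lr = lr

    def act(self, obs):
        return self.W @ obs

    def update(self, obs, target):
        # Regress the student's action onto the teacher's action label.
        err = self.act(obs) - target
        self.W -= self.lr * np.outer(err, obs)
        return float((err ** 2).mean())

# Distillation loop: the student never sees privileged state, only a noisy
# observation standing in for its RGB encoding of the scene.
student = Student(obs_dim=8, act_dim=8)
losses = []
for step in range(500):
    state = rng.standard_normal(8)                   # simulator state
    obs = state + 0.1 * rng.standard_normal(8)       # stand-in for RGB encoding
    losses.append(student.update(obs, teacher_policy(state)))
```

Over the loop, the imitation loss decreases as the student converges toward the teacher's behavior on its own observation distribution.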


Stereo Encoder

Our stereo encoder consists of a ResNet-18 backbone and a stereo attention module. The left and right images are first passed, in a Siamese (weight-shared) manner, through the ResNet-18 backbone with its last two layers removed. Each output feature map is reshaped into 128 keys, each 128-dimensional. These are then fed into the stereo attention transformer along with a learnable [embed] token. The transformer performs cross-attention between the tokens from the left and right images, and the output at the [embed] token is used as the stereo embedding.
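The token shapes and the role of the [embed] token can be sketched numerically. This is a simplified, hedged approximation: the real module uses learned query/key/value projections and cross-attention between views, whereas the sketch below runs a single attention pass with identity projections over the concatenated token set; all tensor values are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def stereo_attention(left_tokens, right_tokens, embed_token, d=128):
    # Concatenate the [embed] token with tokens from both views:
    # 1 + 128 + 128 = 257 tokens of dimension 128.
    tokens = np.concatenate([embed_token[None, :], left_tokens, right_tokens], axis=0)
    # Single-head scaled dot-product attention (identity Q/K/V for brevity).
    scores = tokens @ tokens.T / np.sqrt(d)   # (257, 257) attention logits
    attn = softmax(scores, axis=-1)           # rows sum to 1
    out = attn @ tokens                       # attention-weighted values
    # The output at the [embed] token position is read off as the embedding.
    return out[0]

# 128 keys of dimension 128 per view, as described above.
left = rng.standard_normal((128, 128))
right = rng.standard_normal((128, 128))
embed = rng.standard_normal(128)              # stand-in for the learnable [embed] token
embedding = stereo_attention(left, right, embed)
```

Reading the output only at the [embed] position gives a fixed-size summary of both views, regardless of how many image tokens attend to each other.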

Stereo encoder architecture
Attention map for the transformer

RGB Training in Simulation

Stereo RGB Input