🧠 AI Fetch Interface

This document describes the interface specification for neural network models that control the Fetch mobile manipulator robot in ManiSkill/MS-HAB environments. Models following this specification can be used as drop-in replacements for the default control policies.


Quick Reference

| Component | Specification |
| --- | --- |
| **Input** | Observation (state) |
| Input Shape | [batch_size, state_dim] or [state_dim,] |
| Input Data Type | float32 |
| Input Range | Typically [-inf, inf] or normalized [-1, 1] |
| **Neural Network** | |
| Architecture | MLP / CNN / Transformer / etc. |
| Weights Format | PyTorch (.pt) / ONNX (.onnx) / Keras (.h5) |
| **Output** | Action Vector |
| Output Shape | [batch_size, 13] or [13,] |
| Output Data Type | float32 |
| Output Range | [-1.0, 1.0] (normalized) |

Overview

The Fetch robot is a mobile manipulator with:

  • 7-DOF arm (shoulder, elbow, wrist joints)

  • 2-finger gripper (mimic-controlled)

  • 3-DOF body (head pan/tilt, torso lift)

  • Mobile base (2D translation)

The control interface uses delta position control (pd_joint_delta_pos mode), where actions specify incremental changes to joint positions rather than absolute targets.
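
A minimal sketch of the delta-position semantics (numpy usage and the scaling behavior are assumptions consistent with the action tables below):

```python
import numpy as np

# Under pd_joint_delta_pos, each position-controlled entry is an increment:
#   q_target = q_current + scale(action)
# where scale() maps [-1, 1] to the physical delta range (e.g. [-0.1, 0.1] rad
# for the arm joints, per the action tables below).
action = np.zeros(13, dtype=np.float32)
action[3] = 0.5  # nudge the elbow by half the maximum per-step delta
```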


Model Interface

Input: Observation Space

Format: Dict[str, Array] or flattened Array

Observation Mode: state (default) or rgbd

When obs_mode="state", the observation format depends on the environment:

Format 1: Flattened Tensor (most common, e.g., ReplicaCAD_SceneManipulation-v1)
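
A sketch of what Format 1 looks like at runtime (the gymnasium-style reset API is assumed; batched simulators may prepend a num_envs dimension):

```python
# obs is a single float32 array/tensor, e.g. shape (30,) for
# ReplicaCAD_SceneManipulation-v1.
obs, info = env.reset(seed=0)
print(obs.shape, obs.dtype)  # e.g. (30,) float32
```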

Format 2: Dictionary (some environments)
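
A structural sketch of Format 2; the agent key matches the one mentioned under Input Processing below, while extra is an assumption:

```python
# obs is a nested dict of float32 arrays (key names illustrative):
# obs = {
#     "agent": ...,   # proprioceptive state (joint positions/velocities)
#     "extra": ...,   # task-specific state (object poses, goal info)
# }
```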

State Dimensions by Environment:

  • ReplicaCAD_SceneManipulation-v1: 30 dimensions (flattened tensor)

  • PickCube-v1: Typically 40-50 dimensions

  • SequentialTask-v0: Typically 50-100 dimensions (varies by task)

Data Type: float32

Normalization: Observations may be normalized to [-1, 1] range depending on the training setup.

Model Input Handling: Models should handle both formats for maximum compatibility:
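
A minimal sketch of such handling; preprocess_obs is a hypothetical helper, not part of ManiSkill:

```python
import numpy as np

def preprocess_obs(obs):
    """Flatten either observation format into one float32 vector.

    Dict observations are flattened recursively in sorted-key order so
    the layout is deterministic across calls.
    """
    if isinstance(obs, dict):
        parts = []
        for key in sorted(obs):
            value = obs[key]
            parts.append(preprocess_obs(value) if isinstance(value, dict)
                         else np.asarray(value, dtype=np.float32).ravel())
        return np.concatenate(parts)
    return np.asarray(obs, dtype=np.float32).ravel()
```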

RGBD Observation (Alternative)

When obs_mode="rgbd", the observation includes camera images:
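
A structural sketch only; camera names and nesting vary by ManiSkill version and environment configuration:

```python
# obs = {
#     "agent": ...,                        # proprioceptive state
#     "sensor_data": {
#         "fetch_head": {
#             "rgb":   ...,  # (H, W, 3) uint8
#             "depth": ...,  # (H, W, 1) float32 (dtype is version-dependent)
#         },
#     },
# }
```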

Note: For model compatibility, state observations are recommended as they are more compact and environment-agnostic.


Output: Action Space

Format: Array[float32]

Shape: (13,) - Fixed dimension

Range: [-1.0, 1.0] (normalized)

Action Vector Breakdown:
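
In code form (indices per the table that follows):

```python
import numpy as np

action = np.zeros(13, dtype=np.float32)
# action[0:7]   arm joint deltas (shoulder pan ... wrist roll)
# action[7]     gripper target position (mimic-controlled)
# action[8:11]  body joint deltas (head pan, head tilt, torso lift)
# action[11:13] base planar velocity (x, y)
```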

Detailed Action Components

| Index | Joint Name | Description | Control Type | Range (normalized) | Physical Range |
| --- | --- | --- | --- | --- | --- |
| **0-6** | **Arm** | | | | |
| 0 | shoulder_pan_joint | Shoulder pan (left/right) | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 1 | shoulder_lift_joint | Shoulder lift (up/down) | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 2 | upperarm_roll_joint | Upper arm roll | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 3 | elbow_flex_joint | Elbow flexion | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 4 | forearm_roll_joint | Forearm roll | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 5 | wrist_flex_joint | Wrist flexion | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 6 | wrist_roll_joint | Wrist roll | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| **7** | **Gripper** | | | | |
| 7 | l_gripper_finger_joint | Gripper open/close | Position (mimic) | [-1, 1] | [-0.01, 0.05] m |
| **8-10** | **Body** | | | | |
| 8 | head_pan_joint | Head pan (left/right) | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 9 | head_tilt_joint | Head tilt (up/down) | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 10 | torso_lift_joint | Torso lift | Delta position | [-1, 1] | [-0.1, 0.1] m |
| **11-12** | **Base** | | | | |
| 11 | root_x_axis_joint | Base translation X (left/right) | Velocity | [-1, 1] | [-1.0, 1.0] m/s |
| 12 | root_y_axis_joint | Base translation Y (forward/back) | Velocity | [-1, 1] | [-1.0, 1.0] m/s |

Note: The gripper uses a mimic controller where r_gripper_finger_joint automatically mirrors l_gripper_finger_joint.


Model Architecture Requirements

Input Processing

The model should accept:

  1. State observations (recommended):

    • Input shape: (batch_size, state_dim) where state_dim is:

      • 30 for ReplicaCAD_SceneManipulation-v1

      • 40-50 for PickCube-v1

      • 50-100 for other environments (varies)

    • Or: Dictionary with agent key containing state vector

    • Data type: float32

  2. RGBD observations (optional):

    • Input shape: (batch_size, H, W, C) for images

    • May require CNN backbone (ResNet, etc.)

    • Data type: uint8 for RGB, float32 for depth

Output Processing

The model must output:

  • Shape: (batch_size, 13) or (13,) for single inference

  • Data type: float32

  • Range: [-1.0, 1.0] (normalized actions)

  • Activation: tanh is commonly used for the final layer

Example Architectures

PyTorch MLP (Minimal)
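
A minimal sketch, assuming state_dim=30 and the 13-dim action head described above (class and layer sizes are examples):

```python
import torch
import torch.nn as nn

class FetchPolicy(nn.Module):
    """Minimal MLP matching the interface above (sizes are examples)."""

    def __init__(self, state_dim: int = 30, action_dim: int = 13, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
            nn.Tanh(),  # keeps actions in [-1, 1]
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        if obs.dim() == 1:  # accept both (state_dim,) and (batch, state_dim)
            obs = obs.unsqueeze(0)
        return self.net(obs)
```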

Keras/TensorFlow (Includes Architecture)
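
An equivalent sketch in Keras; model.save() bundles the architecture together with the weights (file name illustrative):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(30,)),                    # state_dim = 30 (example)
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(13, activation="tanh"),   # 13 actions in [-1, 1]
])
model.save("fetch_policy.h5")
```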

Advantage of Keras: The .h5 file contains both architecture and weights, so it can be loaded without providing the architecture definition:
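
For instance, assuming the model was saved as above:

```python
# Restores architecture and weights together; file name illustrative.
model = keras.models.load_model("fetch_policy.h5")
action = model.predict(obs[None, :])  # -> shape (1, 13)
```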


Model Weights Format

Supported Formats

Models should be saved in one of the following formats:

  1. PyTorch (.pt or .pth): state dict or full checkpoint (see the save/export sketch after this list)

  2. ONNX (.onnx):

    • Standardized format, framework-agnostic

    • Can be executed with ONNX Runtime and imported into many frameworks

  3. TensorFlow/Keras (.h5 or SavedModel):

    • Key Advantage: Keras models include architecture + weights in a single file

    • Can be loaded directly without defining architecture separately

    • Perfect for URL-based loading (like Keras model zoo)

    • Example: model = keras.models.load_model(path) - no architecture code needed (remote files can be fetched first, e.g. with keras.utils.get_file)

  4. HuggingFace Model Hub:

    • Models can be hosted and loaded via URL

    • Example: path = hf_hub_download(repo_id='user/repo', filename='model.pt') via the huggingface_hub library
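
Hedged save/export sketches for the first two formats; policy is the FetchPolicy module from the architecture section, and shapes/file names are examples:

```python
import torch

# PyTorch: a state dict requires the architecture class at load time.
torch.save(policy.state_dict(), "fetch_policy.pt")
policy.load_state_dict(torch.load("fetch_policy.pt"))

# ONNX: export a traced graph that other runtimes can execute.
dummy_obs = torch.zeros(1, 30)  # state_dim = 30 (example)
torch.onnx.export(policy, dummy_obs, "fetch_policy.onnx",
                  input_names=["obs"], output_names=["action"])
```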

Model Metadata

The model file or repository should include metadata for proper loading and usage:
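
A hypothetical config.json, written here from Python; the field names are suggestions rather than a fixed schema:

```python
import json

metadata = {
    "env_id": "ReplicaCAD_SceneManipulation-v1",
    "obs_mode": "state",
    "control_mode": "pd_joint_delta_pos",
    "state_dim": 30,
    "action_dim": 13,
    "obs_normalized": False,
}
with open("config.json", "w") as f:
    json.dump(metadata, f, indent=2)
```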

Metadata Location:

  • For PyTorch: Include in checkpoint dict or separate config.json

  • For Keras: Stored in model file or config.json

  • For HuggingFace: In repository root as config.json


Model Loading Interface

URL-Based Loading

Models should be loadable via URL or local path:
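
An interface sketch; load_policy is a hypothetical name, shown assuming a PyTorch state dict for the FetchPolicy MLP defined earlier:

```python
import torch

def load_policy(path_or_url: str, state_dim: int = 30) -> torch.nn.Module:
    """Return a policy ready for inference (hypothetical entry point)."""
    policy = FetchPolicy(state_dim=state_dim)
    policy.load_state_dict(torch.load(path_or_url, map_location="cpu"))
    policy.eval()
    return policy
```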

Example Usage
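
A hedged end-to-end sketch using the helpers defined above; CPU simulation returning numpy/CPU tensors is assumed:

```python
import torch

policy = load_policy("fetch_policy.pt")
obs, _ = env.reset(seed=0)
for _ in range(100):
    with torch.no_grad():
        action = policy(torch.as_tensor(preprocess_obs(obs)))
    obs, reward, terminated, truncated, info = env.step(action.squeeze(0).numpy())
    if terminated or truncated:
        break
```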


Environment Compatibility

Required Environment Settings
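
A sketch of the expected environment construction; kwargs follow gymnasium/ManiSkill conventions, and the env ID is one example:

```python
import gymnasium as gym
import mani_skill.envs  # registers ManiSkill envs (import path may vary by version)

env = gym.make(
    "ReplicaCAD_SceneManipulation-v1",
    obs_mode="state",                   # or "rgbd" for image-based models
    control_mode="pd_joint_delta_pos",  # required by this interface
)
```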

Observation Preprocessing

If the model was trained with normalized observations:
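
Apply the same statistics at inference time. A sketch where obs_mean/obs_std stand in for values saved during training:

```python
import numpy as np
import torch

# Placeholders for statistics saved during training.
obs_mean = np.zeros(30, dtype=np.float32)
obs_std = np.ones(30, dtype=np.float32)

obs_vec = preprocess_obs(obs)                       # helper defined above
obs_norm = (obs_vec - obs_mean) / (obs_std + 1e-8)  # same stats as training
action = policy(torch.as_tensor(obs_norm))
```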


Testing Model Compatibility

Validation Checklist

Before using a model, verify:

  • Output shape is (batch_size, 13) or (13,)

  • Output dtype is float32 and all values fall within [-1.0, 1.0]

  • The input layer matches the target environment's state dimension (e.g., 30 for ReplicaCAD_SceneManipulation-v1)

  • The environment is created with control_mode="pd_joint_delta_pos"

  • Observation normalization matches the training setup

Test Script
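
A minimal sketch of such a script, assuming the load_policy helper above and state_dim=30:

```python
import torch

policy = load_policy("fetch_policy.pt")
obs = torch.zeros(1, 30)                 # dummy state observation
with torch.no_grad():
    action = policy(obs)

assert action.shape == (1, 13), f"bad shape: {action.shape}"
assert action.dtype == torch.float32, f"bad dtype: {action.dtype}"
assert torch.all(action >= -1.0) and torch.all(action <= 1.0), "out of range"
print("model passes the interface checks")
```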


Example Model Repositories

Format for HuggingFace
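
One possible layout; repository and file names are placeholders:

```
user/fetch-policy            # hypothetical repository id
├── config.json              # metadata (see "Model Metadata" above)
├── fetch_policy.pt          # weights
└── README.md                # usage notes
```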

Format for Local Storage
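
Similarly, for local storage (paths are placeholders):

```
models/
├── fetch_policy.pt          # PyTorch weights
├── fetch_policy.onnx        # optional ONNX export
└── config.json              # metadata
```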


Model Loading from URL

Supported URL Formats

Models can be loaded from the following sources (illustrative path forms are shown after the list):

  1. HuggingFace Model Hub:

  2. Direct HTTP/HTTPS URLs:

  3. Local file paths:
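
Illustrative path forms for each source; all names are placeholders:

```python
MODEL_SOURCES = [
    "user/fetch-policy",                           # HuggingFace repo id
    "https://example.com/models/fetch_policy.pt",  # direct HTTP/HTTPS URL
    "/path/to/models/fetch_policy.pt",             # local file path
]
```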

Implementation Example
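
A hedged dispatch sketch over the three path forms; resolve_model_path is a hypothetical helper, and the HuggingFace branch assumes the repo stores a fetch_policy.pt file:

```python
import os
import urllib.request

def resolve_model_path(source: str) -> str:
    """Download (if needed) and return a local path to the weights."""
    if source.startswith(("http://", "https://")):
        local = os.path.basename(source)
        urllib.request.urlretrieve(source, local)   # direct download
        return local
    if not os.path.exists(source) and "/" in source:
        # Heuristic: treat "user/repo" as a HuggingFace repo id.
        from huggingface_hub import hf_hub_download
        return hf_hub_download(repo_id=source, filename="fetch_policy.pt")
    return source                                    # already a local path
```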

Keras/TensorFlow Models

For Keras models (which include architecture + weights in a single file):
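
A sketch; since keras.models.load_model expects a local path, remote files are fetched first (the URL is a placeholder):

```python
from tensorflow import keras

path = keras.utils.get_file(
    "fetch_policy.h5",
    origin="https://example.com/models/fetch_policy.h5",
)
model = keras.models.load_model(path)
```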

Key Advantage of Keras: The model file is self-contained - it includes:

  • Architecture definition (layers, connections)

  • Weights (trained parameters)

  • Optimizer state (optional)

  • Training configuration (optional)

This means you can load a Keras model with just:
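
```python
model = keras.models.load_model("fetch_policy.h5")  # path illustrative
```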

No need to define the architecture separately!


Summary

| Component | Specification |
| --- | --- |
| Input | Observation dict or flattened array, shape (batch_size, state_dim) or (state_dim,) |
| Input Dimension | Typically 30-100 (varies by environment) |
| Output | Action array, shape (13,), range [-1.0, 1.0], dtype float32 |
| Control Mode | pd_joint_delta_pos (required) |
| Robot | Fetch mobile manipulator |
| Action Components | 7 arm joints + 1 gripper + 3 body joints + 2 base velocities |
| Model Format | PyTorch (.pt), ONNX (.onnx), or Keras (.h5) |
| Loading | URL (HTTP/HTTPS/HuggingFace) or local path |
| Architecture | User-defined (MLP, CNN, Transformer, etc.) |


Action Vector Visualization

Action Vector (13 dimensions)

| Index | Component | Joint Name | Control Type | Description |
| --- | --- | --- | --- | --- |
| **0-6** | **Arm (7 DOF)** | | | |
| 0 | Shoulder Pan | shoulder_pan_joint | Δpos | Shoulder rotation (left/right) |
| 1 | Shoulder Lift | shoulder_lift_joint | Δpos | Shoulder elevation (up/down) |
| 2 | Upper Arm Roll | upperarm_roll_joint | Δpos | Upper arm rotation |
| 3 | Elbow Flex | elbow_flex_joint | Δpos | Elbow flexion |
| 4 | Forearm Roll | forearm_roll_joint | Δpos | Forearm rotation |
| 5 | Wrist Flex | wrist_flex_joint | Δpos | Wrist flexion |
| 6 | Wrist Roll | wrist_roll_joint | Δpos | Wrist rotation |
| **7** | **Gripper** | | | |
| 7 | Gripper | l_gripper_finger_joint | pos | Gripper open/close (mimic-controlled) |
| **8-10** | **Body (3 DOF)** | | | |
| 8 | Head Pan | head_pan_joint | Δpos | Head rotation (left/right) |
| 9 | Head Tilt | head_tilt_joint | Δpos | Head tilt (up/down) |
| 10 | Torso Lift | torso_lift_joint | Δpos | Torso vertical movement |
| **11-12** | **Base (2 DOF)** | | | |
| 11 | Base X | root_x_axis_joint | vel | Base translation X (left/right) |
| 12 | Base Y | root_y_axis_joint | vel | Base translation Y (forward/back) |

Legend:

  • Δpos = Delta position (incremental change from current position)

  • pos = Absolute position target

  • vel = Velocity control

