
AI Fetch Interface

This document describes the interface specification for neural network models that control the Fetch mobile manipulator robot in ManiSkill/MS-HAB environments. Models following this specification can be used as drop-in replacements for the default control policies.


Quick Reference

| Component | Specification |
|---|---|
| Input | Observation (state) |
| Input Shape | [batch_size, state_dim] or [state_dim,] |
| Input Data Type | float32 |
| Input Range | Typically [-inf, inf] or normalized [-1, 1] |
| Neural Network | |
| Architecture | MLP / CNN / Transformer / etc. |
| Weights Format | PyTorch (.pt) / ONNX (.onnx) / Keras (.h5) |
| Output | Action Vector |
| Output Shape | [batch_size, 13] or [13,] |
| Output Data Type | float32 |
| Output Range | [-1.0, 1.0] (normalized) |

Overview

The Fetch robot is a mobile manipulator with:

  • 7-DOF arm (shoulder, elbow, wrist joints)

  • 2-finger gripper (mimic-controlled)

  • 3-DOF body (head pan/tilt, torso lift)

  • Mobile base (2D translation)

The control interface uses the pd_joint_delta_pos control mode. For the arm and body joints, actions specify incremental changes to joint positions rather than absolute targets; the gripper takes an absolute position target and the base takes velocity commands.
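To make the delta semantics concrete, here is a minimal sketch for a single arm joint. It assumes the 0.1 rad per-step scale that this specification lists for the arm joints, and it assumes the delta is added to the current joint position; the real controller may instead add it to the previous target. The function name `next_target` is hypothetical.

```python
DELTA_SCALE = 0.1  # rad moved per step when |action| == 1 (arm value from this spec)


def next_target(current_pos: float, action: float) -> float:
    """Return the new joint target after applying one normalized delta action."""
    clipped = max(-1.0, min(1.0, action))  # actions outside [-1, 1] are clipped
    return current_pos + clipped * DELTA_SCALE


# Three consecutive actions accumulate instead of overwriting each other.
pos = 0.0
for a in [1.0, 1.0, -0.5]:
    pos = next_target(pos, a)
print(round(pos, 3))
```

Because each action is a small increment, a policy has to emit a sequence of actions to reach a distant joint configuration.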


Model Interface

Input: Observation Space

Format: Dict[str, Array] or flattened Array

Observation Mode: state (default) or rgbd

When obs_mode="state", the observation format depends on the environment:

Format 1: Flattened Tensor (most common, e.g., ReplicaCAD_SceneManipulation-v1)

Format 2: Dictionary (some environments)

State Dimensions by Environment:

  • ReplicaCAD_SceneManipulation-v1: 30 dimensions (flattened tensor)

  • PickCube-v1: Typically 40-50 dimensions

  • SequentialTask-v0: Typically 50-100 dimensions (varies by task)

Data Type: float32

Normalization: Observations may be normalized to [-1, 1] range depending on the training setup.

Model Input Handling: Models should handle both formats for maximum compatibility:
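One possible way to accept both formats is to flatten whatever arrives into a `(batch, state_dim)` float32 array before it reaches the network. The sketch below is an assumption about how that could be done, not the reference implementation: the helper name `to_state_batch` and the dictionary keys in the usage lines are hypothetical, and dictionary values are concatenated in insertion order, which must match the order used during training.

```python
import numpy as np


def to_state_batch(obs) -> np.ndarray:
    """Convert a flat array or a (possibly nested) dict of arrays to (batch, state_dim) float32."""
    if isinstance(obs, dict):
        # Flatten each entry, then join along the feature axis (insertion order).
        parts = [to_state_batch(v) for v in obs.values()]
        return np.concatenate(parts, axis=-1)
    arr = np.atleast_1d(np.asarray(obs, dtype=np.float32))
    if arr.ndim == 1:
        arr = arr[None, :]  # a 1-D input is treated as one unbatched observation
    return arr.reshape(arr.shape[0], -1)


flat = to_state_batch(np.zeros(30))
nested = to_state_batch({
    "agent": {"qpos": np.zeros(15), "qvel": np.zeros(15)},  # hypothetical keys
    "extra": {"tcp_pose": np.zeros(7)},                      # hypothetical key
})
print(flat.shape, nested.shape)
```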

RGBD Observation (Alternative)

When obs_mode="rgbd", the observation includes camera images:

Note: For model compatibility, state observations are recommended as they are more compact and environment-agnostic.


Output: Action Space

Format: Array[float32]

Shape: (13,) - Fixed dimension

Range: [-1.0, 1.0] (normalized)

Action Vector Breakdown:
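The breakdown can be expressed as index slices over the 13-dimensional vector. This is a small illustrative sketch using the index ranges given in this document; the names `ACTION_SLICES` and `split_action` are hypothetical.

```python
import numpy as np

# Index ranges taken from this specification.
ACTION_SLICES = {
    "arm": slice(0, 7),      # 7 arm joints
    "gripper": slice(7, 8),  # 1 gripper command
    "body": slice(8, 11),    # head pan, head tilt, torso lift
    "base": slice(11, 13),   # 2 base velocity commands
}


def split_action(action) -> dict:
    """Split a (13,) or (batch, 13) action into named components."""
    action = np.asarray(action, dtype=np.float32)
    if action.shape[-1] != 13:
        raise ValueError(f"expected last dimension 13, got {action.shape[-1]}")
    return {name: action[..., s] for name, s in ACTION_SLICES.items()}


parts = split_action(np.arange(13))
print({name: value.shape for name, value in parts.items()})
```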

Detailed Action Components

| Index | Joint Name | Description | Control Type | Range (normalized) | Physical Range |
|---|---|---|---|---|---|
| 0-6 | Arm | | | | |
| 0 | shoulder_pan_joint | Shoulder pan (left/right) | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 1 | shoulder_lift_joint | Shoulder lift (up/down) | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 2 | upperarm_roll_joint | Upper arm roll | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 3 | elbow_flex_joint | Elbow flexion | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 4 | forearm_roll_joint | Forearm roll | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 5 | wrist_flex_joint | Wrist flexion | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 6 | wrist_roll_joint | Wrist roll | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 7 | Gripper | | | | |
| 7 | l_gripper_finger_joint | Gripper open/close | Position (mimic) | [-1, 1] | [-0.01, 0.05] m |
| 8-10 | Body | | | | |
| 8 | head_pan_joint | Head pan (left/right) | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 9 | head_tilt_joint | Head tilt (up/down) | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 10 | torso_lift_joint | Torso lift | Delta position | [-1, 1] | [-0.1, 0.1] m |
| 11-12 | Base | | | | |
| 11 | root_x_axis_joint | Base translation X (left/right) | Velocity | [-1, 1] | [-1.0, 1.0] m/s |
| 12 | root_y_axis_joint | Base translation Y (forward/back) | Velocity | [-1, 1] | [-1.0, 1.0] m/s |

Note: The gripper uses a mimic controller where r_gripper_finger_joint automatically mirrors l_gripper_finger_joint.
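As a rough illustration of how normalized actions relate to the physical ranges listed above, the following sketch assumes a plain linear map from [-1, 1] onto each listed range. That mapping is an assumption for illustration; the simulator's controllers perform the actual conversion. The names `LOW`, `HIGH`, and `to_physical` are hypothetical.

```python
import numpy as np

# Per-dimension physical bounds copied from the action component table:
# 7 arm joints (rad), gripper (m), 3 body joints (rad, rad, m), 2 base velocities (m/s).
LOW = np.array([-0.1] * 7 + [-0.01] + [-0.1] * 3 + [-1.0] * 2, dtype=np.float32)
HIGH = np.array([0.1] * 7 + [0.05] + [0.1] * 3 + [1.0] * 2, dtype=np.float32)


def to_physical(action) -> np.ndarray:
    """Linearly map normalized actions in [-1, 1] onto [LOW, HIGH] per dimension."""
    a = np.clip(np.asarray(action, dtype=np.float32), -1.0, 1.0)
    return LOW + (a + 1.0) * 0.5 * (HIGH - LOW)


physical = to_physical(np.zeros(13))
print(physical[7])  # gripper command for a zero action
```

Under this assumption a zero action leaves the arm and body targets unchanged and stops the base, while the gripper lands at the midpoint of its asymmetric range rather than at zero.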


Model Architecture Requirements

Input Processing

The model should accept:

  1. State observations (recommended):

    • Input shape: (batch_size, state_dim) where state_dim is:

      • 30 for ReplicaCAD_SceneManipulation-v1

      • 40-50 for PickCube-v1

      • 50-100 for other environments (varies)

    • Or: Dictionary with agent key containing state vector

    • Data type: float32

  2. RGBD observations (optional):

    • Input shape: (batch_size, H, W, C) for images

    • May require CNN backbone (ResNet, etc.)

    • Data type: uint8 for RGB, float32 for depth
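For the RGBD case, one way to feed both image types to a CNN backbone is to scale them to a common range and stack them channel-wise. The sketch below assumes RGB arrives as `(B, H, W, 3)` uint8 and depth as `(B, H, W, 1)` float32 in meters; the function name `preprocess_rgbd` and the `max_depth` cutoff are hypothetical choices, not part of the specification.

```python
import numpy as np


def preprocess_rgbd(rgb, depth, max_depth: float = 10.0) -> np.ndarray:
    """Return a (B, H, W, 4) float32 array with all channels scaled to [0, 1]."""
    rgb = np.asarray(rgb)
    depth = np.asarray(depth, dtype=np.float32)
    if rgb.dtype != np.uint8:
        raise TypeError(f"expected uint8 RGB, got {rgb.dtype}")
    rgb_f = rgb.astype(np.float32) / 255.0          # uint8 -> [0, 1]
    depth_f = np.clip(depth / max_depth, 0.0, 1.0)  # assumed meters -> [0, 1]
    return np.concatenate([rgb_f, depth_f], axis=-1)


rgb = np.full((2, 8, 8, 3), 255, dtype=np.uint8)
depth = np.full((2, 8, 8, 1), 5.0, dtype=np.float32)
stacked = preprocess_rgbd(rgb, depth)
print(stacked.shape, stacked.dtype)
```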

Output Processing

The model must output:

  • Shape: (batch_size, 13) or (13,) for single inference

  • Data type: float32

  • Range: [-1.0, 1.0] (normalized actions)

  • Activation: tanh is commonly used for the final layer
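The output requirements above can be enforced in a small post-processing step. This sketch assumes the network produces raw, unbounded values and applies tanh outside the model; if the final layer already applies tanh, this step is unnecessary. The function name `finalize_action` is hypothetical.

```python
import numpy as np

ACTION_DIM = 13


def finalize_action(raw) -> np.ndarray:
    """Squash raw network outputs with tanh and return float32 actions in [-1, 1]."""
    raw = np.asarray(raw, dtype=np.float32)
    if raw.shape[-1] != ACTION_DIM:
        raise ValueError(f"expected last dimension {ACTION_DIM}, got {raw.shape[-1]}")
    return np.tanh(raw).astype(np.float32)


single = finalize_action(np.full(13, 100.0))      # shape (13,)
batched = finalize_action(np.zeros((4, 13)))      # shape (4, 13)
print(single.shape, batched.shape)
```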

Example Architectures

PyTorch MLP (Minimal)
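A minimal sketch of an MLP that satisfies the input and output requirements, assuming PyTorch is installed. The class name `FetchMLP`, the hidden width of 256, and the two hidden layers are illustrative choices rather than values prescribed by this specification.

```python
import torch
import torch.nn as nn


class FetchMLP(nn.Module):
    """State vector in, 13-dimensional normalized action out."""

    def __init__(self, state_dim: int = 30, action_dim: int = 13, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
            nn.Tanh(),  # bounds every action component to [-1, 1]
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


model = FetchMLP(state_dim=30)
with torch.no_grad():
    actions = model(torch.zeros(4, 30))  # batch of 4 observations
print(actions.shape)
```

The default `state_dim=30` matches the ReplicaCAD_SceneManipulation-v1 dimension listed earlier; other environments need a different value.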

Keras/TensorFlow (Includes Architecture)

Advantage of Keras: The .h5 file contains both architecture and weights, so it can be loaded without providing the architecture definition:


Model Weights Format

Supported Formats

Models should be saved in one of the following formats:

  1. PyTorch (.pt or .pth):

  2. ONNX (.onnx):

    • Standardized format, framework-agnostic

    • Can be exported from PyTorch, TensorFlow, etc. and run with a runtime such as ONNX Runtime

  3. TensorFlow/Keras (.h5 or SavedModel):

    • Key Advantage: Keras models include architecture + weights in a single file

    • Can be loaded directly without defining architecture separately

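    • Well suited to URL-based distribution: download the file, then load it from the local path

    • Example: model = keras.models.load_model(local_path), where local_path points to the downloaded file (load_model expects a file path, not an HTTP URL); no architecture code is needed

  4. HuggingFace Model Hub:

    • Models can be hosted in a Hub repository and downloaded by repository ID

    • Example: path = huggingface_hub.hf_hub_download(repo_id='user/repo', filename='model.pt'), then load the file at path with the matching framework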
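A loader typically needs to decide which framework to use before it opens a file. The sketch below maps the file extensions named in this section to a format label; the function name `detect_format` and the label strings are hypothetical, and SavedModel directories (which have no file extension) are not handled.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions named in this section, mapped to hypothetical format labels.
FORMATS = {".pt": "pytorch", ".pth": "pytorch", ".onnx": "onnx", ".h5": "keras"}


def detect_format(source: str) -> str:
    """Guess the model format from the extension of a local path or URL."""
    path = urlparse(source).path or source  # drops any query string from URLs
    suffix = PurePosixPath(path).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"unsupported model extension: {suffix!r}")
    return FORMATS[suffix]


print(detect_format("checkpoints/policy.pt"))
print(detect_format("https://example.com/models/policy.onnx?download=1"))
```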

Model Metadata

The model file or repository should include metadata for proper loading and usage:

Metadata Location:

  • For PyTorch: Include in checkpoint dict or separate config.json

  • For Keras: Stored in model file or config.json

  • For HuggingFace: In repository root as config.json
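As an illustration of what such metadata might contain, the sketch below builds a config.json-style dictionary and checks it against the fixed parts of this specification (13 action dimensions, pd_joint_delta_pos). All field names are hypothetical; the specification does not fix a schema here.

```python
import json

# Hypothetical field names; only the values 13 and "pd_joint_delta_pos" come from this spec.
REQUIRED_KEYS = {"robot", "control_mode", "obs_mode", "state_dim", "action_dim"}


def validate_metadata(meta: dict) -> dict:
    """Raise ValueError if the metadata is incomplete or contradicts the interface."""
    missing = REQUIRED_KEYS - meta.keys()
    if missing:
        raise ValueError(f"missing metadata keys: {sorted(missing)}")
    if meta["action_dim"] != 13:
        raise ValueError("action_dim must be 13 for the Fetch interface")
    if meta["control_mode"] != "pd_joint_delta_pos":
        raise ValueError("control_mode must be pd_joint_delta_pos")
    return meta


example = {
    "robot": "fetch",
    "control_mode": "pd_joint_delta_pos",
    "obs_mode": "state",
    "state_dim": 30,
    "action_dim": 13,
    "env_id": "ReplicaCAD_SceneManipulation-v1",
}
config_text = json.dumps(example, indent=2)  # what a config.json could contain
meta = validate_metadata(json.loads(config_text))
print(meta["state_dim"], meta["action_dim"])
```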


Model Loading Interface

URL-Based Loading

Models should be loadable via URL or local path:

Example Usage


Environment Compatibility

Required Environment Settings
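The settings this document requires can be collected in a keyword dictionary and checked before an environment is created. This is a sketch under assumptions: the `robot_uids` key and the `gym.make` call mentioned in the comment reflect how ManiSkill environments are commonly constructed and should be verified against the installed version; `check_env_kwargs` is a hypothetical helper.

```python
ENV_ID = "ReplicaCAD_SceneManipulation-v1"

# Assumed keyword names; verify them against your ManiSkill version.
ENV_KWARGS = {
    "obs_mode": "state",
    "control_mode": "pd_joint_delta_pos",
    "robot_uids": "fetch",
}


def check_env_kwargs(kwargs: dict) -> list:
    """Return a list of problems that would break compatibility with this interface."""
    problems = []
    if kwargs.get("control_mode") != "pd_joint_delta_pos":
        problems.append("control_mode must be 'pd_joint_delta_pos'")
    if kwargs.get("obs_mode") not in ("state", "rgbd"):
        problems.append("obs_mode must be 'state' or 'rgbd'")
    return problems


# With ManiSkill installed, the environment would typically be created with
# something like gym.make(ENV_ID, **ENV_KWARGS) after importing mani_skill.envs.
print(check_env_kwargs(ENV_KWARGS))
```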

Observation Preprocessing

If the model was trained with normalized observations:
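In that case the same statistics must be applied at inference time. The sketch below assumes per-dimension mean and standard deviation were saved during training and uses standard-score normalization with clipping; the function name `normalize_obs`, the clip value, and the example statistics are hypothetical, and a model trained with a different scheme (for example min-max scaling to [-1, 1]) needs the matching transform instead.

```python
import numpy as np


def normalize_obs(obs, mean, std, clip: float = 5.0, eps: float = 1e-8) -> np.ndarray:
    """Apply (obs - mean) / std per dimension, then clip to [-clip, clip]."""
    obs = np.asarray(obs, dtype=np.float32)
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    z = (obs - mean) / (std + eps)  # eps guards against zero variance
    return np.clip(z, -clip, clip).astype(np.float32)


# Hypothetical training statistics for a 30-dimensional state.
mean = np.full(30, 2.0)
std = np.full(30, 0.5)
normalized = normalize_obs(np.full((1, 30), 3.0), mean, std)
print(normalized[0, 0])
```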


Testing Model Compatibility

Validation Checklist

Before using a model, verify:

Test Script
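A compatibility test can run without a simulator by feeding random observations to the policy and checking the output against the interface. The following is a minimal sketch, assuming the policy is a callable that maps a NumPy array to a NumPy array; `check_policy` and `dummy_policy` are hypothetical names, and the dummy stands in for a real model.

```python
import numpy as np


def check_policy(policy, state_dim: int, batch_size: int = 4, seed: int = 0) -> list:
    """Return a list of interface violations (empty if the policy is compatible)."""
    rng = np.random.default_rng(seed)
    errors = []
    for shape in [(state_dim,), (batch_size, state_dim)]:
        obs = rng.standard_normal(shape).astype(np.float32)
        act = np.asarray(policy(obs))
        expected = shape[:-1] + (13,)
        if act.shape != expected:
            errors.append(f"shape {act.shape}, expected {expected}")
        if act.dtype != np.float32:
            errors.append(f"dtype {act.dtype}, expected float32")
        if not np.all(np.isfinite(act)):
            errors.append("output contains NaN or inf")
        elif np.abs(act).max() > 1.0:
            errors.append("output outside [-1, 1]")
    return errors


def dummy_policy(obs: np.ndarray) -> np.ndarray:
    """Stand-in model: a fixed linear layer followed by tanh."""
    weights = np.full((obs.shape[-1], 13), 0.01, dtype=np.float32)
    return np.tanh(obs @ weights)


print(check_policy(dummy_policy, state_dim=30))
```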


Example Model Repositories

Format for HuggingFace

Format for Local Storage


Model Loading from URL

Supported URL Formats

Models can be loaded from:

  1. HuggingFace Model Hub:

  2. Direct HTTP/HTTPS URLs:

  3. Local file paths:
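The three source types can be told apart by inspecting the URL scheme. This is an illustrative sketch only: the `hf://` prefix used to mark a HuggingFace repository is a hypothetical convention chosen for this example, as are the function name `classify_source` and the returned labels.

```python
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Classify a model source string as 'url', 'huggingface', or 'local'."""
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        return "url"
    if scheme == "hf":  # hypothetical prefix for HuggingFace repositories
        return "huggingface"
    return "local"  # includes relative paths and Windows drive letters


for src in ["https://example.com/policy.pt", "hf://user/repo", "./models/policy.pt"]:
    print(src, "->", classify_source(src))
```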

Implementation Example
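One way to implement URL loading is to download the file into a local cache once and hand the cached path to the framework-specific loader. The sketch below uses only the Python standard library; the function name `fetch_to_cache` and the cache naming scheme are hypothetical, and it does not verify checksums or handle authentication.

```python
import hashlib
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def fetch_to_cache(source: str, cache_dir: str) -> Path:
    """Return a local path for `source`, downloading it into cache_dir if it is a URL."""
    parsed = urlparse(source)
    if parsed.scheme.lower() not in ("http", "https", "file"):
        return Path(source)  # already a local path, nothing to download
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    name = Path(parsed.path).name or "model"
    # Prefix with a hash of the full URL so equal file names from different URLs do not collide.
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()[:12]
    target = cache / f"{digest}_{name}"
    if not target.exists():  # reuse the cached copy on later calls
        urllib.request.urlretrieve(source, target)
    return target
```

The returned path can then be passed to whichever loader matches the file format, for example `torch.load` for a PyTorch checkpoint.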

Keras/TensorFlow Models

For Keras models (which include architecture + weights in a single file):

Key Advantage of Keras: The model file is self-contained - it includes:

  • Architecture definition (layers, connections)

  • Weights (trained parameters)

  • Optimizer state (optional)

  • Training configuration (optional)

This means you can load a Keras model with just:

No need to define the architecture separately!


Summary

| Component | Specification |
|---|---|
| Input | Observation dict or flattened array, shape (batch_size, state_dim) or (state_dim,) |
| Input Dimension | Typically 30-100 (varies by environment) |
| Output | Action array, shape (13,), range [-1.0, 1.0], dtype float32 |
| Control Mode | pd_joint_delta_pos (required) |
| Robot | Fetch mobile manipulator |
| Action Components | 7 arm joints + 1 gripper + 3 body joints + 2 base velocities |
| Model Format | PyTorch (.pt), ONNX (.onnx), or Keras (.h5) |
| Loading | URL (HTTP/HTTPS/HuggingFace) or local path supported |
| Architecture | User-defined (MLP, CNN, Transformer, etc.) |


Action Vector Visualization

Action Vector (13 dimensions)

| Index | Component | Joint Name | Control Type | Description |
|---|---|---|---|---|
| 0-6 | Arm (7 DOF) | | | |
| 0 | Shoulder Pan | shoulder_pan_joint | Δpos | Shoulder rotation (left/right) |
| 1 | Shoulder Lift | shoulder_lift_joint | Δpos | Shoulder elevation (up/down) |
| 2 | Upper Arm Roll | upperarm_roll_joint | Δpos | Upper arm rotation |
| 3 | Elbow Flex | elbow_flex_joint | Δpos | Elbow flexion |
| 4 | Forearm Roll | forearm_roll_joint | Δpos | Forearm rotation |
| 5 | Wrist Flex | wrist_flex_joint | Δpos | Wrist flexion |
| 6 | Wrist Roll | wrist_roll_joint | Δpos | Wrist rotation |
| 7 | Gripper | | | |
| 7 | Gripper | l_gripper_finger_joint | pos | Gripper open/close (mimic-controlled) |
| 8-10 | Body (3 DOF) | | | |
| 8 | Head Pan | head_pan_joint | Δpos | Head rotation (left/right) |
| 9 | Head Tilt | head_tilt_joint | Δpos | Head tilt (up/down) |
| 10 | Torso Lift | torso_lift_joint | Δpos | Torso vertical movement |
| 11-12 | Base (2 DOF) | | | |
| 11 | Base X | root_x_axis_joint | vel | Base translation X (left/right) |
| 12 | Base Y | root_y_axis_joint | vel | Base translation Y (forward/back) |

Legend:

  • Δpos = Delta position (incremental change from current position)

  • pos = Absolute position target

  • vel = Velocity control

