🧠 AI Fetch Interface
This document describes the interface specification for neural network models that control the Fetch mobile manipulator robot in ManiSkill/MS-HAB environments. Models following this specification can be used as drop-in replacements for the default control policies.
Quick Reference
| Stage | Property | Value |
|---|---|---|
| Input (observation, state) | Shape | [batch_size, state_dim] or [state_dim,] |
| Input (observation, state) | Data type | float32 |
| Input (observation, state) | Range | Typically [-inf, inf] or normalized [-1, 1] |
| Neural network | Architecture | MLP / CNN / Transformer / etc. |
| Neural network | Weights format | PyTorch (.pt) / ONNX (.onnx) / Keras (.h5) |
| Output (action vector) | Shape | [batch_size, 13] or [13,] |
| Output (action vector) | Data type | float32 |
| Output (action vector) | Range | [-1.0, 1.0] (normalized) |
Overview
The Fetch robot is a mobile manipulator with:
- 7-DOF arm (shoulder, elbow, wrist joints)
- 2-finger gripper (mimic-controlled)
- 3-DOF body (head pan/tilt, torso lift)
- Mobile base (2D translation)
The control interface uses delta position control (pd_joint_delta_pos mode), where actions specify incremental changes to joint positions rather than absolute targets.
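Conceptually, each normalized joint action is scaled and added to the current joint target. The sketch below illustrates the idea; the helper name is illustrative rather than part of the ManiSkill API, and the 0.1 rad scale comes from the action table later in this document.

```python
import numpy as np

def apply_delta_action(current_qpos: np.ndarray, action: np.ndarray,
                       max_delta: float = 0.1) -> np.ndarray:
    """Illustrative only: map a normalized action in [-1, 1] to a new joint
    target by adding a scaled delta to the current joint positions."""
    delta = np.clip(action, -1.0, 1.0) * max_delta  # e.g. 0.1 rad per control step
    return current_qpos + delta
```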
Model Interface
Input: Observation Space
Format: Dict[str, Array] or flattened Array
Observation Mode: state (default) or rgbd
State Observation (Recommended)
When obs_mode="state", the observation format depends on the environment:
Format 1: Flattened Tensor (most common, e.g., ReplicaCAD_SceneManipulation-v1)
Format 2: Dictionary (some environments)
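For illustration, the two formats might look as follows; the dictionary keys and dimension split shown here are simplified assumptions, not the exact ManiSkill structure:

```python
import numpy as np

# Format 1: a single flattened state vector
obs_flat = np.zeros(30, dtype=np.float32)  # e.g. ReplicaCAD_SceneManipulation-v1

# Format 2: a dictionary whose entries together form the state vector
obs_dict = {
    "agent": np.zeros(25, dtype=np.float32),  # proprioceptive state (illustrative split)
    "extra": np.zeros(5, dtype=np.float32),   # task-specific state (illustrative split)
}
```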
State Dimensions by Environment:
- ReplicaCAD_SceneManipulation-v1: 30 dimensions (flattened tensor)
- PickCube-v1: typically 40-50 dimensions
- SequentialTask-v0: typically 50-100 dimensions (varies by task)
Data Type: float32
Normalization: Observations may be normalized to [-1, 1] range depending on the training setup.
Model Input Handling: Models should handle both formats for maximum compatibility:
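A minimal sketch of such handling, assuming dictionary observations can simply be flattened and concatenated:

```python
import numpy as np

def to_state_vector(obs) -> np.ndarray:
    """Flatten either observation format into a single float32 state vector."""
    if isinstance(obs, dict):
        parts = [np.asarray(v, dtype=np.float32).ravel() for v in obs.values()]
        return np.concatenate(parts)
    return np.asarray(obs, dtype=np.float32).ravel()
```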
RGBD Observation (Alternative)
When obs_mode="rgbd", the observation includes camera images:
Note: For model compatibility, state observations are recommended as they are more compact and environment-agnostic.
Output: Action Space
Format: Array[float32]
Shape: (13,) - Fixed dimension
Range: [-1.0, 1.0] (normalized)
Action Vector Breakdown:
Detailed Action Components
| Index | Group | Joint | Description | Control | Normalized Range | Actual Range |
|---|---|---|---|---|---|---|
| 0 | Arm | shoulder_pan_joint | Shoulder pan (left/right) | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 1 | Arm | shoulder_lift_joint | Shoulder lift (up/down) | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 2 | Arm | upperarm_roll_joint | Upper arm roll | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 3 | Arm | elbow_flex_joint | Elbow flexion | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 4 | Arm | forearm_roll_joint | Forearm roll | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 5 | Arm | wrist_flex_joint | Wrist flexion | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 6 | Arm | wrist_roll_joint | Wrist roll | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 7 | Gripper | l_gripper_finger_joint | Gripper open/close | Position (mimic) | [-1, 1] | [-0.01, 0.05] m |
| 8 | Body | head_pan_joint | Head pan (left/right) | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 9 | Body | head_tilt_joint | Head tilt (up/down) | Delta position | [-1, 1] | [-0.1, 0.1] rad |
| 10 | Body | torso_lift_joint | Torso lift | Delta position | [-1, 1] | [-0.1, 0.1] m |
| 11 | Base | root_x_axis_joint | Base translation X (left/right) | Velocity | [-1, 1] | [-1.0, 1.0] m/s |
| 12 | Base | root_y_axis_joint | Base translation Y (forward/back) | Velocity | [-1, 1] | [-1.0, 1.0] m/s |
Note: The gripper uses a mimic controller where r_gripper_finger_joint automatically mirrors l_gripper_finger_joint.
Model Architecture Requirements
Input Processing
The model should accept:
State observations (recommended):
- Input shape: (batch_size, state_dim), where state_dim is:
  - 30 for ReplicaCAD_SceneManipulation-v1
  - 40-50 for PickCube-v1
  - 50-100 for other environments (varies)
- Or: a dictionary with an agent key containing the state vector
- Data type: float32
RGBD observations (optional):
- Input shape: (batch_size, H, W, C) for images
- May require a CNN backbone (ResNet, etc.)
- Data type: uint8 for RGB, float32 for depth
Output Processing
The model must output:
- Shape: (batch_size, 13), or (13,) for single inference
- Data type: float32
- Range: [-1.0, 1.0] (normalized actions)
- Activation: tanh is commonly used for the final layer
Example Architectures
PyTorch MLP (Minimal)
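The original code block is not included in this export; below is a minimal sketch of what such a policy could look like (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class FetchMLPPolicy(nn.Module):
    """Minimal MLP policy: state vector in, 13-dim action in [-1, 1] out."""

    def __init__(self, state_dim: int = 30, action_dim: int = 13, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # tanh keeps actions in [-1, 1]
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)
```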
Keras/TensorFlow (Includes Architecture)
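Again as a sketch, an equivalent Keras model (layer sizes illustrative); saving it to .h5 bundles the architecture and weights into a single file:

```python
from tensorflow import keras

def build_fetch_policy(state_dim: int = 30, action_dim: int = 13) -> keras.Model:
    """Small MLP policy; the tanh output keeps actions in [-1, 1]."""
    return keras.Sequential([
        keras.Input(shape=(state_dim,)),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(action_dim, activation="tanh"),
    ])

model = build_fetch_policy()
model.save("fetch_policy.h5")  # single file: architecture + weights (file name illustrative)
```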
Advantage of Keras: the .h5 file contains both architecture and weights, so it can be loaded without providing the architecture definition (see "Keras/TensorFlow Models" under "Model Loading from URL" below).
Model Weights Format
Supported Formats
Models should be saved in one of the following formats:
1. PyTorch (.pt or .pth)
2. ONNX (.onnx)
   - Standardized format, framework-agnostic
   - Can be loaded by PyTorch, TensorFlow, etc.
3. TensorFlow/Keras (.h5 or SavedModel)
   - Key advantage: Keras models include architecture + weights in a single file
   - Can be loaded directly without defining the architecture separately
   - Well suited for URL-based loading (like the Keras model zoo)
   - Example: model = keras.models.load_model(url), no architecture code needed
4. HuggingFace Model Hub
   - Models can be hosted and loaded via URL
   - Example: model = torch.hub.load('user/repo', 'model')
Model Metadata
The model file or repository should include metadata for proper loading and usage:
Metadata Location:
- For PyTorch: include in the checkpoint dict or in a separate config.json
- For Keras: stored in the model file or in config.json
- For HuggingFace: in the repository root as config.json
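A sketch of what such a metadata file might contain; the field names below are illustrative, not a fixed schema:

```python
import json

metadata = {
    "robot": "fetch",
    "control_mode": "pd_joint_delta_pos",
    "obs_mode": "state",
    "state_dim": 30,                 # must match the target environment
    "action_dim": 13,
    "action_range": [-1.0, 1.0],
    "framework": "pytorch",          # or "keras" / "onnx"
}

with open("config.json", "w") as f:
    json.dump(metadata, f, indent=2)
```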
Model Loading Interface
URL-Based Loading
Models should be loadable via URL or local path; an implementation sketch is given under "Model Loading from URL" below.
Example Usage
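The original example is not included in this export; the following sketch assumes the FetchMLPPolicy class from the PyTorch example above, an illustrative checkpoint path, and a CPU-simulated environment that returns flattened state observations:

```python
import gymnasium as gym
import numpy as np
import torch

import mani_skill.envs  # noqa: F401  (registers ManiSkill environments)

policy = FetchMLPPolicy(state_dim=30)  # defined in the PyTorch sketch above
policy.load_state_dict(torch.load("fetch_policy.pt", map_location="cpu"))
policy.eval()

env = gym.make("ReplicaCAD_SceneManipulation-v1",
               obs_mode="state", control_mode="pd_joint_delta_pos")
obs, _ = env.reset(seed=0)

for _ in range(100):
    state = torch.as_tensor(np.asarray(obs), dtype=torch.float32).reshape(1, -1)
    with torch.no_grad():
        action = policy(state).squeeze(0).numpy()  # shape (13,), values in [-1, 1]
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
env.close()
```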
Environment Compatibility
Required Environment Settings
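The settings themselves are not listed in this export; based on the rest of this document, the environment should at minimum use state observations and the pd_joint_delta_pos control mode, for example:

```python
import gymnasium as gym
import mani_skill.envs  # noqa: F401  (registers ManiSkill environments)

env = gym.make(
    "ReplicaCAD_SceneManipulation-v1",   # any Fetch-based ManiSkill/MS-HAB environment
    obs_mode="state",                    # recommended observation mode
    control_mode="pd_joint_delta_pos",   # required control mode for this interface
)
```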
Observation Preprocessing
If the model was trained with normalized observations, the same normalization must be applied at inference time:
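A minimal sketch, assuming simple mean/std statistics saved from training (the statistics and clipping range are illustrative):

```python
import numpy as np

def normalize_obs(obs: np.ndarray, mean: np.ndarray, std: np.ndarray,
                  clip: float = 10.0) -> np.ndarray:
    """Apply the same normalization the policy saw during training."""
    obs = np.asarray(obs, dtype=np.float32)
    return np.clip((obs - mean) / (std + 1e-8), -clip, clip)
```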
Testing Model Compatibility
Validation Checklist
Before using a model, verify that:
- the expected input dimension matches the environment's state_dim
- the output has shape (13,) (or (batch_size, 13))
- the output dtype is float32
- every output value lies within [-1.0, 1.0]
- the model was trained for the pd_joint_delta_pos control mode
Test Script
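The script itself is not included in this export; a minimal sketch that covers the checklist above, assuming a PyTorch policy with the interface described in this document:

```python
import numpy as np
import torch

def check_policy(policy: torch.nn.Module, state_dim: int = 30) -> None:
    """Verify output shape, dtype, and range against the interface spec."""
    obs = torch.zeros(1, state_dim, dtype=torch.float32)
    with torch.no_grad():
        action = policy(obs).squeeze(0).numpy()
    assert action.shape == (13,), f"expected shape (13,), got {action.shape}"
    assert action.dtype == np.float32, f"expected float32, got {action.dtype}"
    assert np.all((action >= -1.0) & (action <= 1.0)), "actions outside [-1, 1]"
    print("Policy passed all interface checks.")
```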
Example Model Repositories
Format for HuggingFace
Format for Local Storage
Model Loading from URL
Supported URL Formats
Models can be loaded from:
- HuggingFace Model Hub
- Direct HTTP/HTTPS URLs
- Local file paths
Implementation Example
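The original implementation is not included in this export; the sketch below shows one possible approach for PyTorch checkpoints, reusing the FetchMLPPolicy sketch from earlier (the hf:// scheme and file names are illustrative, huggingface_hub is an optional dependency, and the checkpoint is assumed to store a plain state_dict):

```python
import urllib.request

import torch

def resolve_checkpoint(path_or_url: str) -> str:
    """Return a local file path for a checkpoint given a URL, repo id, or path."""
    if path_or_url.startswith(("http://", "https://")):
        local_path, _ = urllib.request.urlretrieve(path_or_url)  # download to a temp file
        return local_path
    if path_or_url.startswith("hf://"):  # illustrative scheme for HuggingFace repos
        from huggingface_hub import hf_hub_download
        return hf_hub_download(repo_id=path_or_url[len("hf://"):], filename="model.pt")
    return path_or_url  # assume a local file path

def load_fetch_policy(path_or_url: str, state_dim: int = 30) -> torch.nn.Module:
    """Load weights into the FetchMLPPolicy sketch defined earlier in this document."""
    checkpoint = torch.load(resolve_checkpoint(path_or_url), map_location="cpu")
    policy = FetchMLPPolicy(state_dim=state_dim)
    policy.load_state_dict(checkpoint)
    policy.eval()
    return policy
```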
Keras/TensorFlow Models
For Keras models (which include architecture + weights in a single file):
Key Advantage of Keras: the model file is self-contained; it includes:
- Architecture definition (layers, connections)
- Weights (trained parameters)
- Optimizer state (optional)
- Training configuration (optional)
This means you can load a Keras model with just:
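For example (file name illustrative):

```python
import numpy as np
from tensorflow import keras

model = keras.models.load_model("fetch_policy.h5")   # architecture + weights in one file
obs = np.zeros((1, 30), dtype=np.float32)            # dummy state vector
action = model.predict(obs)                          # shape (1, 13), values in [-1, 1]
```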
No need to define the architecture separately!
Summary
| Item | Value |
|---|---|
| Input | Observation dict or flattened array, shape (batch_size, state_dim) or (state_dim,) |
| Input dimension | Typically 30-100 (varies by environment) |
| Output | Action array, shape (13,), range [-1.0, 1.0], dtype float32 |
| Control mode | pd_joint_delta_pos (required) |
| Robot | Fetch mobile manipulator |
| Action components | 7 arm joints + 1 gripper + 3 body joints + 2 base velocities |
| Model format | PyTorch (.pt), ONNX (.onnx), or Keras (.h5) |
| Loading | URL (HTTP/HTTPS/HuggingFace) or local path |
| Architecture | User-defined (MLP, CNN, Transformer, etc.) |
Action Vector Visualization
Action Vector (13 dimensions)
| Index | Group | Name | Joint | Control | Description |
|---|---|---|---|---|---|
| 0 | Arm (7 DOF) | Shoulder Pan | shoulder_pan_joint | Δpos | Shoulder rotation (left/right) |
| 1 | Arm (7 DOF) | Shoulder Lift | shoulder_lift_joint | Δpos | Shoulder elevation (up/down) |
| 2 | Arm (7 DOF) | Upper Arm Roll | upperarm_roll_joint | Δpos | Upper arm rotation |
| 3 | Arm (7 DOF) | Elbow Flex | elbow_flex_joint | Δpos | Elbow flexion |
| 4 | Arm (7 DOF) | Forearm Roll | forearm_roll_joint | Δpos | Forearm rotation |
| 5 | Arm (7 DOF) | Wrist Flex | wrist_flex_joint | Δpos | Wrist flexion |
| 6 | Arm (7 DOF) | Wrist Roll | wrist_roll_joint | Δpos | Wrist rotation |
| 7 | Gripper | Gripper | l_gripper_finger_joint | pos | Gripper open/close (mimic-controlled) |
| 8 | Body (3 DOF) | Head Pan | head_pan_joint | Δpos | Head rotation (left/right) |
| 9 | Body (3 DOF) | Head Tilt | head_tilt_joint | Δpos | Head tilt (up/down) |
| 10 | Body (3 DOF) | Torso Lift | torso_lift_joint | Δpos | Torso vertical movement |
| 11 | Base (2 DOF) | Base X | root_x_axis_joint | vel | Base translation X (left/right) |
| 12 | Base (2 DOF) | Base Y | root_y_axis_joint | vel | Base translation Y (forward/back) |
Legend:
- Δpos = delta position (incremental change from the current position)
- pos = absolute position target
- vel = velocity control
References
- Fetch Robot URDF: included in ManiSkill assets
- Control Modes: see mani_skill/agents/robots/fetch/fetch.py