🦾OpenVLA

1. What is Vision-Language-Action (VLA) and what is OpenVLA?

A Vision-Language-Action (VLA) model is a large multimodal foundation model that takes image(s) of a scene and a text instruction and directly predicts low-level robot actions (e.g., end-effector deltas, gripper open/close, done flag).

Typical pipeline:

  1. A vision-language encoder (an LLM with a visual module) ingests camera image(s) plus a text command.

  2. The encoder produces a latent representation of the scene and instruction.

  3. An action decoder converts this representation into a sequence of discrete action tokens, which are then de-tokenized into a continuous action vector (dx, dy, dz, dθ, gripper, done, etc.).

OpenVLA is an open-source 7B-parameter VLA model from Stanford, Berkeley, and collaborators:

  • Website: https://openvla.github.io

  • Code and weights: GitHub + HuggingFace (e.g., openvla/openvla-7b-…).

Key properties:

  • Trained on ~970k manipulation demonstrations from multiple robots (WidowX, Franka, etc.).

  • Works in the format “camera images + text instruction → robot action” without explicit planning.

  • Ready-made weights fine-tuned on the LIBERO task suites (libero_10, libero_object, libero_spatial, etc.).

For KNX/Konnex:

  • Serves as the “manipulation brain”: turns a text task + image into an action sequence.

  • A strong base layer that can be adapted to your robots/sims/tasks.


2. Hardware and high-level launch diagram

This reflects what we actually ran on a headless server (e.g., vast.ai) using NVIDIA A10 / A6000 / A40:

Recommended minimum:

  • GPU: 1× A10 (24 GB) or similar. 16 GB is borderline but can work with 8-bit/4-bit loading.

  • RAM: 32 GB (64+ preferred).

  • Disk: 200+ GB for:

    • repository clones,

    • HuggingFace caches (HF_HOME),

    • LIBERO data and generated videos.

  • OS: Ubuntu 20.04 / 22.04 with NVIDIA drivers compatible with CUDA 12.1.

Suggested directory layout:
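One possible layout, keeping everything on the large data disk (directory names here are illustrative, not prescribed by OpenVLA):

```
/workspace
├── openvla/      # OpenVLA repository clone
├── LIBERO/       # LIBERO repository clone (task suites, environments)
├── hf_cache/     # HuggingFace cache; point HF_HOME here
└── outputs/      # rollout videos, logs, checkpoints
```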


3. Installation: conda env and dependencies

3.1. Conda environment
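A minimal environment sketch. The Python version is our choice here, not a hard requirement from upstream:

```shell
# Create and activate a dedicated env (python=3.10 is an assumption; check the repo's README)
conda create -n openvla python=3.10 -y
conda activate openvla
```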

3.2. PyTorch with CUDA 12.1

For A10, the official cu121 wheels are convenient:
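For example, installing from the official cu121 wheel index (pin exact versions to match the repo's requirements if needed):

```shell
# Install PyTorch built against CUDA 12.1 from the official wheel index
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
```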

Check:
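A quick sanity check that the install sees the GPU:

```shell
# Print torch version, CUDA build, and GPU visibility; expect "... 12.1 True" on a working box
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```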

3.3. Clone repositories
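A sketch of the clone-and-install step (the repo URL is from the project's GitHub; the editable install pulls in the Python dependencies):

```shell
# Clone OpenVLA and install it in editable mode into the active env
git clone https://github.com/openvla/openvla.git
cd openvla
pip install -e .
```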

This brings in the main dependencies: transformers, accelerate, einops, imageio, matplotlib, etc.

3.4. LIBERO and simulators

LIBERO provides:

  • task suites (bddl files),

  • environments via robosuite + mujoco (OffScreenRenderEnv).
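Installing LIBERO the same way (repo URL from the LIBERO project on GitHub):

```shell
# Clone LIBERO and install it into the same env in editable mode
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
cd LIBERO
pip install -e .
```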

Simulators:
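LIBERO's environments sit on top of robosuite and MuJoCo. A sketch, with versions left unpinned (pin to whatever LIBERO's requirements file specifies):

```shell
# Physics and environment backends for LIBERO (versions are assumptions)
pip install mujoco robosuite
```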

Additional deps commonly needed on first import:
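The exact list depends on your base image; these are the packages that surfaced as ImportErrors in our runs:

```shell
# Typical first-import gaps (adjust to whatever errors you actually see)
pip install "imageio[ffmpeg]" opencv-python matplotlib einops
```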

Common dependency conflict we’ve hit:

  • tensorflow==2.15.0 wants numpy<2.0.0

  • some opencv-python builds want numpy>=2

For OpenVLA (without TensorFlow), the simplest fix is pinning NumPy below 2:
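For example:

```shell
# Pin NumPy below 2.0 to stay compatible with the tensorflow-era constraints
pip install "numpy<2.0"
```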

If another package later upgrades NumPy to ≥ 2.0, reinstall with the pin in place.

3.5. Extra LIBERO packages

If you hit ModuleNotFoundError: No module named 'XXX', install it in the same env.
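For example (the package names below are common LIBERO dependencies, shown only as placeholders; substitute whatever module the error names):

```shell
# Activate the env first, then install the missing package into it
conda activate openvla
pip install easydict hydra-core
```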


4. Environment variables

On a headless-GPU server, the following helps:
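A sketch of the variables we set; the cache path is an example, adjust it to your disk layout:

```shell
# Render MuJoCo off-screen via EGL (no X server needed on a headless box)
export MUJOCO_GL=egl
export PYOPENGL_PLATFORM=egl
# Keep HuggingFace downloads on the large data disk (path is an example)
export HF_HOME=/workspace/hf_cache
```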

Then:
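A quick smoke test that LIBERO and its rendering stack import cleanly (the OffScreenRenderEnv class is the one mentioned above; the import path is from LIBERO's codebase):

```shell
# If this prints without errors, EGL rendering and the LIBERO install are wired up
python -c "from libero.libero.envs import OffScreenRenderEnv; print('LIBERO import OK')"
```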


5. Quick test: built-in LIBERO script

… (content unchanged from the original file) …

For the full, up‑to‑date scripts, see the upstream repository.
