
Zero-Shot Prompt to Action

Teddy Warner | Spring, 2025 | X-X minutes

Zero-shot prompt to action on a $160 3D printed robotic arm with π₀.

The future will undoubtedly consist of robots that “just work”: autonomous machines informed by semantics and senses in the same way we humans are.

In his 2022 paper, A Path Towards Autonomous Machine Intelligence, Dr. Yann LeCun proposes that for ML systems to truly mimic the human ability to act in novel situations and environments, they must master both the semantic and the physical domains.1 They must understand not only language and society, but also space and time.

While current Large Language Models (LLMs) and Vision Language Models (VLMs) grasp language and society quite accurately, they fall short in the latter, physical domain. To build machines that “just work”, we must fill this gap: we must equip ML systems with an understanding of time and space.

I work full-time on this very problem at Intempus, equipping agents with an understanding of time and space through the collection and analysis of physiological data. Yet that’s not what this piece is about; here we’ll be focusing on a different approach: generalist robotics policies.

In early February 2025, Physical Intelligence, a company developing foundational robotic control policies, open-sourced π₀, a state-of-the-art Vision Language Action (VLA) model. π₀ excels in a number of diverse tasks (folding laundry, cleaning a table, scooping coffee beans, etc.) while remaining general enough to control a variety of robot types (single arm, dual arms, mobile robots, etc.).


Traditional robotic control policies often rely on specialized data/programming for each task, limiting their versatility. Vision Language Action models like π₀ provide a generalist baseline for autonomous machines, allowing said robots to adapt and learn new tasks with minimal additional data.

Thus robots informed by generalist policies like π₀ can truly assist and interact in diverse environments, leveraging the semantic knowledge and visual understanding gained from ‘internet-scale’ pretraining to understand and execute complex tasks much like humans do.

π₀ is built on ‘internet-scale’ vision-language pretraining, giving the model incredibly robust visual understanding and semantic knowledge. The model extends into the physical realm by incorporating action and observation state tokens, enabling it to output continuous motor commands at high frequencies (50 Hz). Physical Intelligence notes that their model was initially trained on data from 8 distinct robots and later expanded to 7 robotic platforms with 68 unique tasks.

[PI0 robots photo]
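To make the “action tokens at 50 Hz” idea concrete, here’s a minimal conceptual sketch of how a VLA-style control loop works: camera frames and joint state go in alongside a language instruction, a short chunk of continuous joint commands comes out, and that chunk is replayed at the control frequency before the model is queried again. The function and array shapes below are illustrative placeholders, not the openpi or LeRobot API.

```python
import time
import numpy as np

CONTROL_HZ = 50   # pi0-style policies emit continuous commands at roughly 50 Hz
CHUNK_LEN = 50    # the model predicts a short chunk of future actions per forward pass
N_JOINTS = 6      # e.g. a single 6-DoF arm

def vla_forward(images: dict, state: np.ndarray, instruction: str) -> np.ndarray:
    """Hypothetical stand-in for the VLA forward pass.
    Returns a (CHUNK_LEN, N_JOINTS) chunk of joint targets."""
    return np.tile(state, (CHUNK_LEN, 1))  # placeholder: just hold the current pose

state = np.zeros(N_JOINTS)
instruction = "pick up the red block and place it in the bin"

while True:
    images = {
        "wrist": np.zeros((480, 640, 3), dtype=np.uint8),  # gripper cam frame
        "top": np.zeros((480, 640, 3), dtype=np.uint8),    # third-person cam frame
    }
    chunk = vla_forward(images, state, instruction)
    for action in chunk:
        # send `action` to the motor bus here, then read back the new joint state
        time.sleep(1.0 / CONTROL_HZ)
```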

The open-source release of π₀ is still considered an experiment: “π₀ was developed for [Physical Intelligence’s] own robots … and though [they] are optimistic that researchers and practitioners will be able to run creative new experiments by adapting π₀ to their own platforms, [they] do not expect every such attempt to be successful. All this is to say: π₀ may or may not work for you, but you are welcome to try it and see!”2

So Physical Intelligence has gifted us with this very cool model, which happens to work with many expensive robots. The folks at 🤗 Hugging Face open-sourced a PyTorch port of π₀ concurrently with Physical Intelligence’s release, and it felt only right to get π₀ up and running on a cheap and sturdy robotic arm: the $160 LeRobot So-100!

Set Up Your Environment

LeRobot is an awesome project from Hugging Face that provides open-source models, datasets, and tools for robotics, all written in PyTorch. They recently developed a new robotic arm, the So-100, which has been taking the robotics community by storm for its low cost and ease of use. This is the arm we’ll be using to run π₀, as its accessibility means that many will be able to replicate this work.

You can build a So-100 teleoperation setup for yourself by following this LeRobot guide, or you can purchase a prebuilt setup from Seeed Studio.

[So-100 images]

[INSERT lerobot install + WSL paragraph]

To provide our VLA with the vision it needs, we’ll be making an upgrade to the base So-100 build: a 160° FOV gripper cam. I mocked up a quick wrist mount for my camera in Fusion 360 [link to file], but also stumbled across @cmcgartoll’s redesigned So-100 wrist with an integrated camera mount - either option should work.

[gripper cam photo]

In addition to the gripper cam, you’ll need a third-person view of your environment. I used a desk-mounted tripod and my phone running the DroidCam app for this.

[third person cam angle]

Think of your setup as a clock: place your So-100 at 6 o’clock and your third-person camera at 12 o’clock.

[Diagram + Photos]

You can follow this guide to set up your cameras in LeRobot - it covers streaming your phone’s camera as well. For my implementation, this ended up being the most time-consuming part. WSL needs every USB device passed through to it from Windows and certainly does not get along well with video devices, so I spent a good chunk of time wrangling with my WSL kernel. Doing this natively on macOS or Windows shouldn’t be as tricky.
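If you hit similar camera headaches, it can help to confirm which video indices your OS (or WSL, after USB passthrough) actually exposes before blaming LeRobot. This little OpenCV probe is just a sanity check, not part of LeRobot itself:

```python
# Probe the first few /dev/video* indices and report which ones return frames.
import cv2

for index in range(6):
    cap = cv2.VideoCapture(index)
    ok, frame = cap.read()
    if ok:
        print(f"camera {index}: {frame.shape[1]}x{frame.shape[0]} frames OK")
    else:
        print(f"camera {index}: no frames")
    cap.release()
```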

Once you’ve built (or bought) your robots and set up your cameras in LeRobot, you’re ready to test with teleoperation!

[teleoperation script]
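For reference, the teleoperation entry point in recent LeRobot releases looks roughly like the call below; the exact script path and flag names change between versions, so treat this as a template and check it against the LeRobot docs for your install.

```python
# Hedged template: flag names follow LeRobot's control_robot.py CLI as of
# early 2025 and may differ in your version of the library.
import subprocess

subprocess.run(
    [
        "python", "lerobot/scripts/control_robot.py",
        "--robot.type=so100",
        "--control.type=teleoperate",
    ],
    check=True,
)
```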

If you see both camera streams pop up and can control the follower arm with the leader arm, you’re golden!

Running Zero-Shot VLA Inference

At this point we enter a bit of a choose-your-own-adventure. I’ve gone ahead and collected a bunch of manual teleoperation data from my So-100 setup and used it to fine-tune π₀ for my environment. Keep reading this section and we’ll cover how to use my fine-tuned π₀ weights to run inference with your own So-100 setup. While I’ve done my best to recount how my specific environment is set up, there is no guarantee that my fine-tuned weights will work with yours. With that in mind, if you’d like to fine-tune π₀ for yourself, skip to the fine-tuning section.

[ How to run pi0 inference]
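As a rough template (the CLI shifts between LeRobot releases, so verify the flags against your install), running a policy on the arm reuses the same record entry point with a policy checkpoint passed in. The repo ID and checkpoint path below are placeholders for wherever your (or my) fine-tuned weights live.

```python
# Hedged template for on-robot inference: replace the repo_id and policy path
# with your own, and double-check the flags against your LeRobot version.
import subprocess

subprocess.run(
    [
        "python", "lerobot/scripts/control_robot.py",
        "--robot.type=so100",
        "--control.type=record",
        "--control.fps=30",
        "--control.single_task=Pick up the block and place it in the bin.",
        "--control.repo_id=YOUR_HF_USER/eval_pi0_so100",        # placeholder
        "--control.num_episodes=1",
        "--control.policy.path=path/to/pi0_so100_checkpoint",   # placeholder
    ],
    check=True,
)
```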

[results + media]

Fine-Tuning π₀

As mentioned in the previous section, while I’ve done my best to recount how to recreate my exact setup for yourself, there’s no guarantee my fine-tuned weights will work perfectly for your environment. Fortunately, you can fine-tune π₀ for yourself! To get started you’ll need to collect some data.

Capture a Dataset

To capture a dataset with LeRobot, run the following script. In it, you’ll specify a description of the task you’ll be manually recording, how many times you’ll repeat that task (episodes - generally 5-15 per task), how long each episode should be (in seconds), and how long you’ll have between episodes to reset the environment (in seconds).

[Insert script] [grep explanation]
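As a hedged sketch of what that recording call looks like (again, exact flags depend on your LeRobot version), you pass a task description, an episode count, and per-episode record and reset durations, and the script streams everything into a dataset:

```python
# Hedged template for teleoperated data capture; adjust flags to your
# LeRobot version and swap in your own Hugging Face username.
import subprocess

subprocess.run(
    [
        "python", "lerobot/scripts/control_robot.py",
        "--robot.type=so100",
        "--control.type=record",
        "--control.fps=30",
        "--control.single_task=Pick up the block and place it in the bin.",
        "--control.repo_id=YOUR_HF_USER/so100_pick_place",  # placeholder
        "--control.num_episodes=10",      # roughly 5-15 episodes per task
        "--control.episode_time_s=30",    # seconds of recording per episode
        "--control.reset_time_s=15",      # seconds to reset the scene between episodes
    ],
    check=True,
)
```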

After you’ve collected your last episode, your new dataset will be saved to the “/data” directory in the root of your LeRobot folder.

I captured four example datasets for you to reference while capturing your own, each consisting of 5 episodes with episode duration varying from 25-45 seconds and reset duration varying from 10-25 seconds.

[Insert dataset demo slider - iframe hugging face viz dataset]

Augment Data

Great! You’ve now collected a bunch of manual teleoperation data from your So-100 environment. Now let’s augment that data to finetune π₀.

lerobot/scripts/augment_dataset.py will process the dataset in your /data directory, augment that data, and push it to Hugging Face. Data augmentation is pretty sweet: we’re essentially 4x’ing all that teleoperation data you manually captured simply by flipping the footage and reversing the polarity of the action and state values for the base motor on the So-100 (i.e. 42.73 becomes -42.73 and vice versa).3

[insert script]
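The gist linked in the footnotes has the full details, but the core transform is simple enough to sketch. This is an illustrative stand-in for the augmentation script, not its actual code: mirror each frame left-to-right and negate the base-rotation joint in both the state and action vectors.

```python
# Illustrative sketch of the mirroring augmentation (not the real
# lerobot/scripts/augment_dataset.py): flip frames horizontally and negate
# the base motor's value in both state and action.
import numpy as np

BASE_JOINT = 0  # assumption: the base/shoulder-pan motor is the first entry

def mirror_episode(frames: np.ndarray, states: np.ndarray, actions: np.ndarray):
    """frames: (T, H, W, 3) uint8, states/actions: (T, n_joints) float."""
    flipped_frames = frames[:, :, ::-1, :].copy()   # horizontal flip of every frame
    flipped_states = states.copy()
    flipped_actions = actions.copy()
    flipped_states[:, BASE_JOINT] *= -1             # e.g. 42.73 becomes -42.73
    flipped_actions[:, BASE_JOINT] *= -1
    return flipped_frames, flipped_states, flipped_actions
```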

Finetune

With this newly augmented data, we’re finally ready to move into finetuning π₀.

π₀ is quite a substantial model, and even fine-tuning takes ~80GB of VRAM. I certainly don’t have that on my laptop GPU, and I’m assuming very few people do, so we’ll need to outsource our compute. For this, I used an A6000 on Lambda Cloud, but I’ve also heard great things about SF Compute. It’s a good idea to run this script when you don’t need your computer, as this can take a while. My fine-tuning run on 100 episodes took a little over 32 hours.

If it’s your first time using Lambda Cloud, follow this getting started guide. Once you’re SSH’ed into your GPU instance, you’ll need to install Miniconda.

[insert scripts https://lambdalabs.com/blog/setting-up-a-anaconda-environment?srsltid=AfmBOora6_1ZJEaQv1quwbtTYA0xaI8RgvU-MlTLvPb6MSZEDesJCGks]
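Once Miniconda and LeRobot are installed on the instance, the fine-tuning run itself is launched through LeRobot’s training script with the π₀ policy and your augmented dataset. As with the earlier commands, treat this as a hedged template and confirm the flags against the LeRobot π₀ docs for your version:

```python
# Hedged template for launching a pi0 fine-tune on a cloud GPU; flag names
# follow LeRobot's train.py CLI as of early 2025 and may have changed since.
import subprocess

subprocess.run(
    [
        "python", "lerobot/scripts/train.py",
        "--policy.path=lerobot/pi0",                         # pretrained pi0 weights on the Hub
        "--dataset.repo_id=YOUR_HF_USER/so100_pick_place",   # placeholder: your augmented dataset
        "--output_dir=outputs/train/pi0_so100",
    ],
    check=True,
)
```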


  1. https://openreview.net/pdf?id=BZ5a1r-kVsf 

  2. https://www.physicalintelligence.company/blog/openpi 

  3. https://gist.github.com/shreyasgite/3de71719c1f03439ed7278b9ba85b14b 
