Big Tech Is Now Developing Powerful AI Brains for Real-World Robots

Curated from vice.com →

Researchers at Google and the Technical University of Berlin released an AI model called PaLM-E this week that combines language and vision capabilities to control robots, allowing them to complete tasks autonomously in the real world, from fetching a bag of chips from a kitchen to sorting blocks by color into the corners of a rectangle.

According to the researchers, this is the largest vision-language model (VLM) reported to date, with 562 billion parameters. The model has a “wide array of capabilities” that includes math reasoning, multi-image reasoning, and chain-of-thought reasoning. The researchers wrote in a paper that the model uses multi-task training to transfer skills across tasks, rather than being trained on each task individually. According to the paper, the model, when controlling robots, even displays “emergent capabilities like multimodal chain of thought reasoning, and the ability to reason over multiple images, despite being trained on only single-image prompts.”

PaLM-E is based on PaLM, Google’s previous large language model. The E in the name stands for “embodied” and refers to the model’s interaction with physical objects and robotic control. PaLM-E also builds on Google’s RT-1, a model that takes robot inputs such as camera images and task instructions and outputs actions in the form of motor commands. For vision, it uses ViT-22B, a vision transformer model that handles tasks such as image classification, object detection, and image captioning.

Using the model, the robot can generate its own plan of action in response to commands. When the robot was asked to “bring me the rice chips from the drawer,” PaLM-E guided it to go to the drawers, open the top drawer, take the rice chips out, bring them to the user, and put them down. The robot completed the task even when a human interfered: a researcher knocked the rice chips back into the drawer the first time the robot picked them up. PaLM-E can recover from this kind of disturbance because it analyzes data from the robot’s live camera feed.

“PaLM-E generates high-level instructions as text; in doing so, the model is able to naturally condition upon its own predictions and directly leverage the world knowledge embedded in its parameters,” the researchers wrote. “This enables not only embodied reasoning but also question answering, as demonstrated in our experiments.” 
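The closed-loop behavior described above (emit one text instruction, act, look again, then re-plan) can be illustrated with a minimal Python sketch. Nothing here is PaLM-E's actual code or API: `toy_planner`, `ToyWorld`, and `run_task` are hypothetical names, the "model" is a hand-written lookup table, and the "camera" is a one-line text description of the scene.

```python
from typing import Callable, List

# Hypothetical planner signature: (task, observation, steps so far) -> next step as text.
Planner = Callable[[str, str, List[str]], str]


def toy_planner(task: str, observation: str, history: List[str]) -> str:
    """Stand-in for a multimodal model: map the current observation to the next step.

    A real model would also condition on `task` and `history`; this toy lookup
    ignores them and only reacts to what it currently "sees".
    """
    rules = {
        "drawer closed": "open the top drawer",
        "chips in open drawer": "pick up the rice chips",
        "chips in hand": "bring the chips to the user",
        "chips delivered": "done",
    }
    return rules[observation]


class ToyWorld:
    """Toy environment in which the first pick-up attempt is undone by a human."""

    def __init__(self, knock_back_once: bool = True) -> None:
        self.state = "drawer closed"
        self._knock_back = knock_back_once

    def observe(self) -> str:
        # Stands in for analyzing the live camera image.
        return self.state

    def execute(self, step: str) -> None:
        if step == "open the top drawer" and self.state == "drawer closed":
            self.state = "chips in open drawer"
        elif step == "pick up the rice chips" and self.state == "chips in open drawer":
            if self._knock_back:
                self._knock_back = False  # chips knocked back: state is unchanged
            else:
                self.state = "chips in hand"
        elif step == "bring the chips to the user" and self.state == "chips in hand":
            self.state = "chips delivered"


def run_task(task: str, planner: Planner, world: ToyWorld, max_steps: int = 10) -> List[str]:
    """Re-plan after every action so that disturbances are noticed and corrected."""
    history: List[str] = []
    for _ in range(max_steps):
        step = planner(task, world.observe(), history)
        if step == "done":
            break
        world.execute(step)
        history.append(step)
    return history


world = ToyWorld()
steps = run_task("bring me the rice chips from the drawer", toy_planner, world)
print(steps)
```

Because the planner is consulted again after every action, the failed pick-up simply shows up as an unchanged observation, and the same instruction is issued a second time. No special error-handling path is needed.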

Using a large language model as the core of the robot’s control system has made the robot more autonomous, and it needs less training and fine-tuning than previous models.
