DeepMind's New AI Initiative: Granting Robots an Inner Voice for Enhanced Learning

Google DeepMind is working on a system designed to give AI agents an "inner voice," enhancing their capacity to learn tasks and ultimately become "smarter."

In its patent application, the lab outlines a method called "Intra-Agent Speech for Task Learning Facilitation," in which robots observe tasks through images or videos and then generate natural-language descriptions of them.

According to the researchers, this "internal monologue" helps establish connections between visual inputs and actions, enabling agents to understand and engage with objects they have never been trained on.

For instance, a robot may watch footage of someone lifting a cup while internally generating the phrase "a person lifts a cup." This description lets the agent "recall" the appropriate actions when it later encounters similar objects, so it can make more informed decisions and adapt more effectively in dynamic real-world environments.

This technique supports what is known as "zero-shot" generalization: the robot can handle tasks involving unfamiliar objects without any task-specific pre-training. DeepMind also emphasizes that the approach lowers the memory and computational requirements traditionally needed to train robotic systems.
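The patent does not disclose an implementation, but the core mechanism it describes, captioning an observation in language and matching it against captions of previously demonstrated behaviours, can be made concrete with a toy sketch. Everything below is hypothetical: the bag-of-words embedding, the skill_memory table, and the recall_action helper are placeholders chosen only to illustrate language-mediated recall.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Bag-of-words 'embedding'; a real system would use a learned encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Inner speech" memory: caption of a demonstrated behaviour -> motor routine.
skill_memory = {
    "a person lifts a cup": "grasp_and_lift",
    "a person opens a drawer": "pull_handle",
}

def recall_action(caption: str) -> str:
    """Return the stored skill whose caption best matches the new one."""
    query = embed(caption)
    best = max(skill_memory, key=lambda known: cosine(query, embed(known)))
    return skill_memory[best]

# A never-seen object ("mug") still maps onto the cup-lifting skill,
# because the shared language description bridges the visual gap.
print(recall_action("a person lifts a mug"))  # -> grasp_and_lift
```

The language layer is what makes the zero-shot step possible here: the mug was never demonstrated, but its caption lands close to one the agent already knows.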

The initiative is part of DeepMind's broader efforts in robotics. In June, the company unveiled "Gemini Robotics On-Device," a model designed to function without cloud access; Google says it is compact and efficient enough to run directly on robots.

Gemini Robotics On-Device is a variant of the Gemini Robotics vision-language-action (VLA) model, tailored for robots and offline operation. Built for latency-sensitive or autonomous environments, it runs locally, enabling robots to respond quickly to changing conditions while keeping data private.

The system can perform many tasks out of the box and adapt to new ones with just 50 to 100 demonstrations; Google's developers present it as a "foundation model" for robotics. Initially trained on Google's ALOHA robot, the model has since been adapted to other platforms, including Apptronik's humanoid Apollo and the Franka FR3, and can execute complex actions such as folding clothes or unzipping bags.
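Google has not published the fine-tuning interface, but adapting a pretrained policy from a few dozen demonstrations is conventionally framed as behaviour cloning: regress the policy's actions onto the expert's. The sketch below is that generic recipe, not the Gemini Robotics API; the dimensions, the tiny network, and the random stand-in demonstrations are all invented for illustration.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, N_DEMOS = 32, 7, 100  # placeholder sizes, incl. ~100 demos

# Stand-in for a pretrained policy head being adapted to a new task.
policy = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, ACT_DIM))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Stand-in demonstrations: (observation, expert action) pairs from teleoperation.
obs = torch.randn(N_DEMOS, OBS_DIM)
expert_actions = torch.randn(N_DEMOS, ACT_DIM)

for epoch in range(200):
    pred = policy(obs)
    loss = nn.functional.mse_loss(pred, expert_actions)  # imitate the expert
    opt.zero_grad()
    loss.backward()
    opt.step()
```

With only 50 to 100 real demonstrations, the heavy lifting is presumably done by the pretrained foundation model; the adaptation step only nudges it toward the new task.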

Developers can customize the model by teleoperating the robot through new tasks. The model can be tested in the MuJoCo physics simulator (Multi-Joint dynamics with Contact) and then deployed in physical environments. However, unlike its cloud-connected hybrid counterpart, the on-device version does not feature built-in semantic safety systems. Google advises developers to implement their own safety protocols and has, for now, restricted access to the model in order to assess real-world safety risks.
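MuJoCo itself is open source with a Python API, so the simulate-before-deploy step can be shown concretely. The example below is plain MuJoCo usage, not the Gemini Robotics SDK: it defines a one-box world in an inline MJCF string, steps the physics forward, and reads back the box's height.

```python
import mujoco  # pip install mujoco

WORLD = """
<mujoco>
  <worldbody>
    <geom type="plane" size="1 1 0.1"/>
    <body name="box" pos="0 0 1">
      <joint type="free"/>
      <geom type="box" size="0.1 0.1 0.1"/>
    </body>
  </worldbody>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(WORLD)
data = mujoco.MjData(model)

for _ in range(1000):            # ~2 seconds at the default 2 ms timestep
    mujoco.mj_step(model, data)  # advance the physics by one step

# Free-joint qpos is [x, y, z, qw, qx, qy, qz]; index 2 is height.
print("box height after settling:", data.qpos[2])  # ~0.1, resting on the plane
```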