Humanoid robot company 1X, backed by OpenAI, is focusing on task chaining with its robot Eve. The robot, which can already perform sequential tasks, is edging closer to full autonomy.
Much like the rapid progress in artificial intelligence that followed the release of ChatGPT in November 2022, a similar shift is now under way in humanoid robots. Early chatbots, including the first ChatGPT, worked only in a question-and-answer format; today we can assign them a task and have them carry out its steps. 1X, the humanoid robot company backed by ChatGPT developer OpenAI, is trying to bring exactly this capability to its robot Eve.
1X's goal is to provide physical labour through safe, intelligent androids, and the steps it is taking serve that purpose. A new video released by the company shows the humanoid robot Eve completing autonomous tasks one after another, though the company is careful to note that this is only the beginning of the journey.
The company had previously developed an autonomous model that combined many tasks into a single goal-conditioned neural network. However, when multi-task models are small (under 100M parameters), adding data to correct the behaviour of one task often degrades the behaviour of other tasks. The obvious fix is to increase the number of parameters, but larger models take longer to train, which slows down the feedback loop on what data should be collected next to improve the robot's behaviour.
So, how can you iterate quickly on data while still building a general robot that performs many tasks with a single neural network? 1X's answer is quite clever. The company says it separates the ability to rapidly improve performance on individual tasks from the process of combining multiple capabilities into a single neural network. To achieve this, it built a voice-controlled natural language interface that chains the short-horizon capabilities of many small models into longer-horizon behaviours.
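To make the idea concrete, here is a minimal sketch of what chaining short-horizon skills behind a natural language interface could look like. The class names, skill names and the success/failure interface are illustrative assumptions, not 1X's actual implementation.

```python
# Minimal sketch: route natural language commands to small single-task
# skill models and chain them into a longer-horizon sequence.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SkillPolicy:
    """A small single-task model that executes one short-horizon skill."""
    name: str
    run: Callable[[], bool]  # returns True if the skill finished successfully


class LanguageSkillRouter:
    """Maps a spoken or typed command to a skill and runs skills in sequence."""

    def __init__(self, skills: Dict[str, SkillPolicy]):
        self.skills = skills

    def execute_chain(self, commands: List[str]) -> bool:
        for command in commands:
            skill = self.skills.get(command)
            if skill is None:
                print(f"No skill registered for command: {command!r}")
                return False
            print(f"Running skill: {skill.name}")
            if not skill.run():
                # Each skill must cope with whatever state the previous one
                # left behind; a failure here ends the longer-horizon task.
                return False
        return True


# Usage: a human operator chains three short-horizon skills into one task.
router = LanguageSkillRouter({
    "pick up the cup": SkillPolicy("pick_cup", run=lambda: True),
    "place it on the shelf": SkillPolicy("place_on_shelf", run=lambda: True),
    "return to base": SkillPolicy("go_home", run=lambda: True),
})
router.execute_chain(["pick up the cup", "place it on the shelf", "return to base"])
```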
At this point, I recommend watching the new video below, which shows long-horizon behaviours being carried out with humans guiding the skill chaining.
While humans can easily handle long-horizon tasks, chaining multiple autonomous robot skills into a sequence is very difficult, because each subsequent skill must generalise to the outcomes of the one before it. The problem compounds with every step: the third skill has to cope with the variability in the results of the second, and so on.
Replicating this with robots means managing the complexity of these compounding variations. From the user's perspective, the robot simply performs many natural language tasks; the actual number of models controlling it is abstracted away. This abstraction allows single-task models to be consolidated into goal-conditioned models over time.
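The difference between the two kinds of model is easy to illustrate. Below is a toy sketch contrasting a single-task policy with a goal-conditioned one that takes an extra conditioning vector (for example, a language or goal embedding); the layer sizes and architecture are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class SingleTaskPolicy(nn.Module):
    """One small model per skill: observation -> action."""

    def __init__(self, obs_dim: int = 64, act_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, act_dim))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


class GoalConditionedPolicy(nn.Module):
    """One larger model for many skills: (observation, goal) -> action."""

    def __init__(self, obs_dim: int = 64, goal_dim: int = 32, act_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + goal_dim, 256), nn.ReLU(),
                                 nn.Linear(256, act_dim))

    def forward(self, obs: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        # The goal vector tells one network which of many behaviours to produce.
        return self.net(torch.cat([obs, goal], dim=-1))
```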
The single-task models also provide a solid baseline for shadow-mode evaluations, in which the predictions of a new model are compared against the current baseline during testing. When the goal-conditioned model agrees closely with the single-task models' predictions, 1X says it can switch to the more powerful, unified model without changing the user workflow.
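A rough sketch of what such a shadow-mode check might look like is below: the baseline single-task model drives the robot while a candidate goal-conditioned model predicts actions in parallel, and an agreement score is computed. The model interface and the tolerance value are assumptions for illustration, not 1X's actual pipeline.

```python
import numpy as np


def shadow_mode_agreement(baseline_model, candidate_model, observations,
                          tolerance: float = 0.05) -> float:
    """Fraction of timesteps where the candidate's predicted action stays
    within `tolerance` (per-dimension max error) of the baseline's action."""
    agreements = []
    for obs in observations:
        baseline_action = np.asarray(baseline_model.predict(obs))
        candidate_action = np.asarray(candidate_model.predict(obs))
        agreements.append(
            np.max(np.abs(baseline_action - candidate_action)) <= tolerance
        )
    return float(np.mean(agreements))


# If agreement on logged observations is high enough, the unified model could
# replace the baseline without the user noticing any change in workflow, e.g.:
# agreement = shadow_mode_agreement(single_task_model, goal_conditioned_model, logged_obs)
# if agreement > 0.95:
#     deploy(goal_conditioned_model)
```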
Using this high-level language interface to direct robots also opens a whole new door for data collection. Instead of using VR to control a single robot, one operator can direct multiple robots with natural language. Because this guidance is sent only infrequently, operators don't have to be near the robots; they can direct them remotely.
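A toy sketch of that idea follows: one remote operator fans out occasional text commands to several robots. The robot names and queue-based transport are illustrative assumptions; the point is that a short text string every so often is far cheaper to send than a continuous VR teleoperation stream for a single robot.

```python
import queue


class RemoteRobot:
    """A robot that receives infrequent natural language commands remotely."""

    def __init__(self, name: str):
        self.name = name
        self.commands: "queue.Queue[str]" = queue.Queue()

    def send_command(self, command: str) -> None:
        # A short text command, rather than a constant stream of controller poses.
        self.commands.put(command)

    def step(self) -> None:
        if not self.commands.empty():
            print(f"[{self.name}] executing: {self.commands.get()}")


fleet = [RemoteRobot("eve-1"), RemoteRobot("eve-2"), RemoteRobot("eve-3")]

# One operator directs different tasks to different robots.
fleet[0].send_command("sort the packages on the left table")
fleet[1].send_command("plug yourself in to charge")
fleet[2].send_command("open the lab door")

for robot in fleet:
    robot.step()
```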
Meanwhile, 1X notes that the robots in the video switch tasks based on human direction, so they are not fully autonomous. After building a dataset of vision and natural language command pairs, the next step is to automate the prediction of these high-level actions. 1X says this could be achieved with multimodal vision-language models such as GPT-4o, VILA and Gemini Vision.
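For a sense of how that last step might work, here is a rough sketch that asks GPT-4o to propose the next high-level command from a camera frame. The prompt, skill list and overall flow are illustrative assumptions rather than 1X's actual system; only the OpenAI chat completions call itself is a real API.

```python
import base64
from openai import OpenAI

client = OpenAI()

# Hypothetical set of short-horizon skills the robot already knows.
SKILLS = ["pick up the cup", "place it on the shelf", "plug in the charger", "wait"]


def predict_next_command(image_path: str, task: str) -> str:
    """Ask a vision-language model which known skill should run next."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": (f"The robot's overall task is: {task}. "
                          f"Choose the single best next command from: {SKILLS}. "
                          "Reply with the command only.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()


# Example (hypothetical file and task):
# next_cmd = predict_next_command("camera_frame.jpg", "tidy up the kitchen counter")
```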