New "AGI" Benchmark Designed for Dangerous AI

New "AGI" Benchmark Designed for Dangerous AI

Scientists have developed a new "AGI" (artificial general intelligence) benchmark: 75 challenging tests aimed at gauging whether future AI models could have "destructive impacts."

As artificial intelligence advances at a rapid pace, OpenAI scientists have built a new benchmark, "MLE-bench," consisting of 75 extremely difficult tests designed to evaluate whether future advanced AIs can modify their own code and improve themselves.

MLE-bench is a compilation of 75 Kaggle competitions, each designed to assess machine learning engineering skills. These tasks involve training AI models, preparing datasets, and running scientific experiments, with the aim of evaluating how well machine learning algorithms perform specific real-world tasks.
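
To make the task format concrete, here is a minimal sketch of the kind of Kaggle-style workflow an agent must automate: prepare a dataset, train a model, evaluate it, and write a submission file. This is purely illustrative Python using synthetic data, not code from MLE-bench itself.

```python
# Illustrative sketch of a Kaggle-style ML engineering task: prepare a
# dataset, train a model, evaluate it, and write a submission file.
# Synthetic data stands in for a real competition's train/test split;
# this is not code from MLE-bench itself.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Prepare the dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train a model -- the core engineering step an agent must automate.
model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)

# Score predictions the way a leaderboard would (AUC as an example metric).
preds = model.predict_proba(X_test)[:, 1]
print(f"AUC: {roc_auc_score(y_test, preds):.4f}")

# Write a Kaggle-style submission file.
pd.DataFrame({"id": np.arange(len(preds)), "target": preds}).to_csv(
    "submission.csv", index=False
)
```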

OpenAI designed MLE-bench to measure how well AI models perform autonomous machine learning engineering; its tasks are considered among the toughest challenges an AI can face.


Risks and Rewards Are High

Researchers highlight that if AI agents, autonomous intelligent systems that perform specific tasks without human intervention, could carry out machine learning research on their own, they could accelerate scientific advancement in fields like healthcare and climate science. However, if these capabilities evolve unchecked, the consequences could be catastrophic.

The researchers warn that if innovation in AI outpaces our ability to understand its effects, models with "destructive impacts" and potential for "misuse" could emerge. Any model capable of solving the majority of MLE-bench's challenges would likely be able to handle many open-ended machine learning tasks on its own, including improving itself.

Scientists tested OpenAI's most powerful AI model, o1, on MLE-bench. The model reached at least Kaggle bronze-medal level on 16.9% of the 75 tests, and its success rate rose when it was given more attempts per test. Earning a bronze medal means placing in the top 40% of human participants on a Kaggle leaderboard. On average, o1 earned the equivalent of seven gold medals, two more than the five a human needs to be considered a "Kaggle Grandmaster."
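
The medal thresholds above can be expressed as a small placement check. The function below is a hypothetical helper, not part of MLE-bench: it assumes simplified percentile cutoffs (real Kaggle thresholds also depend on how many teams entered a competition), with bronze at the top 40% as described above.

```python
# Hypothetical helper: determine which medal a score would earn on a
# leaderboard. Cutoffs are simplified assumptions; real Kaggle thresholds
# also depend on how many teams entered the competition.
def medal_level(agent_score: float, leaderboard: list[float],
                higher_is_better: bool = True) -> str:
    """Return the medal an agent's score would earn against human entries."""
    beaten = sum(
        agent_score >= s if higher_is_better else agent_score <= s
        for s in leaderboard
    )
    percentile = beaten / len(leaderboard)  # fraction of field matched or beaten
    if percentile >= 0.90:
        return "gold"
    if percentile >= 0.80:
        return "silver"
    if percentile >= 0.60:
        return "bronze"  # top 40% of participants, as cited above
    return "none"

# Example: a score of 0.87 against a ten-entry human leaderboard.
print(medal_level(0.87, [0.95, 0.91, 0.88, 0.85, 0.80,
                         0.70, 0.62, 0.50, 0.45, 0.30]))  # -> "bronze"
```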
