Artificial intelligence experts prepare to stage ‘Humanity’s Last Exam’ to stump the most powerful technologies

By Jeffrey Dastin and Katie Paul

(Reuters) – A team of technology experts launched a global call on Monday to identify the toughest questions to ask artificial intelligence systems, which are increasingly making popular benchmark tests seem child’s play.

Dubbed “Humanity’s Last Exam,” the project aims to determine when expert-level AI has arrived and to remain relevant even as capabilities advance in the years to come, according to the organizers, the nonprofit Center for AI Safety (CAIS) and the startup Scale AI.

The call comes days after the creator of ChatGPT unveiled a new model, known as OpenAI o1, that “destroyed the most popular reasoning benchmarks,” said Dan Hendrycks, executive director of CAIS and an adviser to Elon Musk’s startup xAI.

Hendrycks co-authored two 2021 papers that proposed now-widely used tests of AI systems, one assessing undergraduate-level knowledge of topics like U.S. history, the other probing the models’ ability to reason through competition-level math. The undergraduate-level test has been downloaded from the online AI hub Hugging Face more times than any other such dataset.

At the time these papers were published, AI was giving almost random answers to exam questions. “They are now being crushed,” Hendrycks told Reuters.

For example, the Claude models from the AI lab Anthropic went from scoring around 77% on the undergraduate-level test in 2023 to nearly 89% a year later, according to a prominent capabilities leaderboard.

As a result, these common benchmarks carry less meaning.

According to Stanford University’s AI Index report in April, AI still performs poorly on lesser-used tests involving planning and visual pattern-recognition puzzles. OpenAI o1 scored about 21% on one version of the ARC-AGI pattern-recognition test, for example, ARC organizers said on Friday.

Some AI researchers say results like these show that planning and abstract reasoning are better indicators of intelligence, though Hendrycks said ARC’s visual component makes it less suited to assessing language models. Humanity’s Last Exam will require abstract reasoning, he said.

Answers to common benchmark tests may also have been incorporated into the data used to train AI systems, industry observers said. Hendrycks said some questions on Humanity’s Last Exam will remain private to ensure AI systems’ answers don’t come from memorization.

The exam will include at least 1,000 crowdsourced questions, due by November 1, that are difficult for non-experts to answer. Submissions will undergo peer review, and authors of winning questions will be offered co-authorship and prizes of up to $5,000 sponsored by Scale AI.

“We desperately need more rigorous testing of expert-level models to measure the rapid progress of AI,” said Alexandr Wang, CEO of Scale.

One restriction: Organizers don’t want questions about weapons, which some say would be too dangerous for AI to study.

(Reporting by Jeffrey Dastin in San Francisco and Katie Paul in New York; Editing by Christina Fincher)