A team of researchers affiliated with Meta (its GenAI and FAIR groups), AutoGPT and HuggingFace has developed a benchmark for makers of AI assistants, particularly those building products on large language models, to test whether their applications qualify as potential artificial general intelligence (AGI) applications. They describe the tool, which they call GAIA, and how it can be used, in a paper published on the arXiv preprint server.
Over the past year, AI researchers have debated the capabilities of AI systems, both privately and on social media. Some have suggested that AI systems are very close to AGI, while others have suggested the opposite is much closer to the truth. All agree, however, that such systems will match and even surpass human intelligence at some point; the only question is when.
In this new effort, the research team notes that for any consensus to be reached when true AGI systems emerge, an evaluation system must be in place to measure their level of intelligence relative to one another and to humans. Such a system, they further emphasize, should start with a benchmark, and that is what they propose in their paper.
The benchmark created by the team consists of a series of questions posed to a candidate AI, whose answers are compared with those provided by a random set of humans. In creating the benchmark, the team ensured that the questions were not typical AI queries, on which AI systems tend to perform well.
Instead, the questions they ask tend to be fairly easy for a human to answer but difficult for a computer. In many cases, finding the answer to a question formulated by the researchers involves several stages of work and/or "reasoning." As an example, they might ask a specific question about something found on a specific website, such as: "How much higher or lower is the fat content of a given pint of ice cream than USDA standards require, according to the standards as reported by Wikipedia?"
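To make the evaluation pattern concrete, the following is a minimal sketch of how a benchmark of this kind can be scored: each question carries a short human-verified reference answer, and a model's free-form reply is counted correct only if it matches after light normalization. The questions, answers and helper names here are hypothetical stand-ins, not GAIA's actual data or scoring code.

```python
# Illustrative sketch only: the question, reference answer and function
# names below are hypothetical, chosen to show the general pattern of
# comparing a model's short answer against a human-verified reference.
import re

# Hypothetical benchmark item with a single short reference answer.
QUESTIONS = [
    {
        "question": (
            "By how many percentage points does a pint of ice cream with "
            "12% butterfat exceed the 10% minimum reported by Wikipedia?"
        ),
        "reference": "2",
    },
]

def normalize(answer: str) -> str:
    """Lower-case, trim and strip stray punctuation so near-identical answers match."""
    return re.sub(r"[^\w\s.%-]", "", answer).strip().lower()

def score(model_answer: str, reference: str) -> bool:
    """Quasi-exact match: only the single correct short answer earns credit."""
    return normalize(model_answer) == normalize(reference)

def evaluate(ask_model, questions=QUESTIONS) -> float:
    """Pass every question to `ask_model` (a callable returning a string)
    and report the fraction answered correctly."""
    correct = sum(score(ask_model(q["question"]), q["reference"]) for q in questions)
    return correct / len(questions)

if __name__ == "__main__":
    # A trivial stand-in "assistant" that always answers "2".
    print(f"accuracy: {evaluate(lambda q: '2'):.2%}")
```

The point of such scoring is that the intermediate web browsing, reading and arithmetic are left entirely to the assistant; only the final short answer is checked, which keeps grading unambiguous while still demanding multi-step work.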
The research team tested the AI products they work with and found that none of them came close to beating the benchmark, suggesting that the industry may not be as close to developing true AGI as some think.
More information:
Grégoire Mialon et al, GAIA: a benchmark for General AI Assistants, arXiv (2023). DOI: 10.48550/arxiv.2311.12983
© 2023 Science X Network