A team of AI researchers and computer scientists from Cornell University, the University of Washington, and the Allen Institute for Artificial Intelligence has developed a benchmarking tool called WILDHALLUCINATIONS to assess the factuality of several large language models (LLMs). The group describes how the tool was built and how the models fared in a paper posted to the arXiv preprint server.
LLMs such as ChatGPT have become popular: people use them to write letters, poems, songs, research papers, and other documents. But over time, their flaws have become apparent: LLMs often make inaccurate statements. Such fabricated or factually incorrect claims are known as hallucinations.
The research team notes that the main reason LLMs hallucinate is the quality of the data used to train them, typically massive amounts of text scraped from the internet. Models trained on carefully curated, highly accurate datasets are therefore much more likely to provide accurate information.
The team also points out that the makers of many LLMs have claimed that revised versions of their models hallucinate less often, implying greater accuracy, yet users have so far had no way to verify whether these claims are true. For this new study, the researchers built a tool to help the user community assess the accuracy of some of the most popular LLMs.
Dubbed WILDHALLUCINATIONS, the benchmark prompts LLMs to generate information about entities drawn from real user–chatbot conversations and then fact-checks the responses. Because many chatbot answers draw on material from Wikipedia, the team distinguished between entities that are covered by Wikipedia pages and those that are not.
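The general shape of such an evaluation pipeline can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: `generate_description`, the `REFERENCE` snippets, the naive sentence-based claim splitter, and the word-overlap check are all hypothetical stand-ins for a real LLM call, retrieved web or Wikipedia evidence, and a proper claim-verification model.

```python
# Illustrative sketch of an entity-based factuality check (not the
# WILDHALLUCINATIONS implementation). Entities would be mined from real
# user-chatbot conversations; here one is hard-coded for demonstration.

def generate_description(entity: str) -> str:
    # Placeholder: in practice this would query an actual LLM.
    canned = {
        "Ada Lovelace": "Ada Lovelace was a 19th-century mathematician. "
                        "She wrote the first published computer program.",
    }
    return canned.get(entity, f"{entity} is an entity the model knows little about.")

# Placeholder evidence store: real systems retrieve web or Wikipedia text.
REFERENCE = {
    "Ada Lovelace": "Ada Lovelace, a mathematician born in 1815, is credited "
                    "with publishing the first algorithm intended for a machine.",
}

def split_into_claims(text: str) -> list[str]:
    # Naive sentence split as a stand-in for an LLM-based claim decomposer.
    return [s.strip() for s in text.split(".") if s.strip()]

def claim_supported(claim: str, evidence: str) -> bool:
    # Toy check: word overlap with the evidence. A real verifier would use
    # an entailment model or an LLM judge.
    claim_words = {w.lower().strip(",") for w in claim.split()}
    evidence_words = {w.lower().strip(",") for w in evidence.split()}
    overlap = len(claim_words & evidence_words) / max(len(claim_words), 1)
    return overlap >= 0.3

def factuality_score(entity: str) -> float:
    # Fraction of the model's claims about the entity that the evidence supports.
    response = generate_description(entity)
    evidence = REFERENCE.get(entity, "")
    claims = split_into_claims(response)
    if not claims:
        return 0.0
    supported = sum(claim_supported(c, evidence) for c in claims)
    return supported / len(claims)

if __name__ == "__main__":
    print("Ada Lovelace:", factuality_score("Ada Lovelace"))
```

A benchmark of this kind reports the average of such per-entity scores across many entities, which is what allows different models, and different versions of the same model, to be compared on a common scale.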
To test their benchmarking tool, the researchers used it to evaluate several of the most popular LLMs, many of which had recently been updated. They found that LLM makers hadn’t made much progress in improving accuracy. Most were no more accurate than their previous versions.
The team also found that most models performed better on entities covered by one or more Wikipedia pages. LLMs also did better on some topics than others: for example, they struggled to provide reliable information about celebrities and financial matters, but were more reliable when asked certain types of scientific questions.
More information:
Wenting Zhao et al, WildHallucinations: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries, arXiv (2024). DOI: 10.48550/arXiv.2407.17468. arxiv.org/abs/2407.17468
© 2024 Science X Network