Testing the biological reasoning capabilities of large language models

Overall performance of five LLMs in the biological examination. Credit: Gong et al.

Large language models (LLMs) are advanced deep learning algorithms that can process written or spoken prompts and generate text in response. These models have recently become increasingly popular and now help many users summarize long documents, brainstorm brand names, find quick answers to simple queries, and generate various other types of text.

Researchers at the University of Georgia and the Mayo Clinic recently set out to assess the biological knowledge and reasoning skills of different LLMs. Their paper, posted to the arXiv preprint server, suggests that OpenAI’s GPT-4 model performs better than other prominent LLMs on the market on biology reasoning problems.

“Our recent publication demonstrates the significant impact of AI on biological research,” Zhengliang Liu, co-author of the paper, told Tech Xplore. “This study arose from the rapid adoption and evolution of LLMs, particularly following the notable introduction of ChatGPT in November 2022. These advances, seen as critical steps towards artificial general intelligence (AGI), have marked a shift from traditional biotechnology approaches to an AI-driven methodology in the field of biology.”

In their recent study, Liu and colleagues sought to better understand the potential value of LLMs as tools for conducting research in biology. While many previous studies have highlighted the usefulness of these models in a wide range of fields, their ability to reason about biological data and concepts has not yet been thoroughly evaluated.

“The main objectives of this article were to evaluate and compare the capabilities of leading LLMs, such as GPT-4, GPT-3.5, PaLM2, Claude2 and SenseNova, in their ability to understand and reason through questions related to biology,” Liu said. “This was meticulously assessed using a 108-question multiple-choice exam, covering diverse areas such as molecular biology, biological engineering, metabolic engineering and synthetic biology.”

Liu and his colleagues set out to determine how some of the most renowned LLMs available today process and analyze biological information, while assessing their ability to generate relevant biological hypotheses and tackle biology-related logical reasoning tasks. The researchers compared the performance of five different LLMs using multiple-choice tests.

“Multiple-choice tests are commonly used to assess LLMs because test results can be easily scored/evaluated/compared,” explained Jason Holmes, co-author of the paper. “For this study, biology experts designed a 108-question multiple-choice test with a few subcategories.”

Holmes and colleagues asked the LLMs each question on the test five times, changing the way the question was phrased each time.

“The purpose of asking the same question multiple times for each LLM was to determine both the average performance and the average variation in responses,” Holmes explained. “We varied the wording so as not to accidentally base our results on optimal or suboptimal wording of instructions that led to a change in performance. This approach also gives us an idea of how performance will vary in real-world use, where users won’t ask questions in the same way.”
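The repeated-trial scoring Holmes describes can be sketched in a few lines of Python. This is only an illustration: the answer key, the model responses, and the function names are hypothetical, and the sample standard deviation is assumed here as one reasonable measure of variation, since the article does not say which statistic the authors used.

```python
from statistics import mean, stdev

# Hypothetical answer key and model responses; NOT data from the study.
ANSWER_KEY = ["A", "C", "B", "D"]

# One list of answers per trial; each trial stands for a differently worded prompt.
TRIALS = [
    ["A", "C", "B", "D"],
    ["A", "C", "B", "A"],
    ["A", "B", "B", "D"],
    ["A", "C", "B", "D"],
    ["A", "C", "D", "D"],
]


def score_trial(answers, key):
    """Count how many answers in one trial match the answer key."""
    return sum(a == k for a, k in zip(answers, key))


def summarize(trials, key):
    """Return (mean score, sample standard deviation) across trials."""
    scores = [score_trial(t, key) for t in trials]
    return mean(scores), stdev(scores)


avg, spread = summarize(TRIALS, ANSWER_KEY)
print(f"mean score: {avg:.2f} / {len(ANSWER_KEY)}, std dev: {spread:.2f}")
```

A model with a high mean and a low spread answers well regardless of how the question is worded, which is the kind of consistency the researchers were looking for.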

The tests carried out by Liu, Holmes and their colleagues provided insight into the potential usefulness of different LLMs in helping biology researchers. Overall, their results suggest that LLMs answer a variety of biology-related questions well, while accurately linking concepts rooted in fundamental molecular biology, common molecular biology, metabolic engineering, and synthetic biology.

“GPT-4 notably demonstrated superior performance among the LLMs examined, achieving an average score of 90 on our multiple-choice tests in five trials using distinct prompts,” said Xinyu Gong, co-author of the paper.

“Beyond achieving the highest overall test score, GPT-4 also showed high consistency across trials, highlighting its reliability in biological reasoning compared to peer models. These results highlight the immense capacity of GPT-4 to aid research and teaching in biology.”

The study may soon inspire additional work exploring the usability of LLMs in the field of biology. The results collected so far suggest that LLMs could be useful tools for both research and education, for example by supporting the tutoring of biology students, the development of interactive learning tools and the generation of testable biological hypotheses.

“Essentially, our paper represents a pioneering effort in merging the capabilities of advanced AI, particularly LLMs, with the complex and rapidly evolving field of biology,” Liu said. “This marks a new chapter in biological research, positioning AI not only as a supporting tool, but as a central element for navigating and deciphering the vast and complex biological landscape.”

Future advancements in LLMs and their continued training on biological data could pave the way for important scientific discoveries, while also enabling the creation of more advanced educational tools. Liu, Holmes, Gong and their colleagues now plan to conduct further studies in this area.

In their next studies, they first plan to design strategies to overcome the computing demands and privacy-related issues associated with using GPT-4, the LLM that underpins ChatGPT. This could be achieved by developing open-source LLMs to automate tasks such as gene annotation and phenotype-genotype matching.

“We will employ knowledge distillation from GPT-4, creating instruction-following data to fine-tune local models such as the LLaMA foundation models,” Zihao Wu, co-author of the paper, told Tech Xplore.

“This strategy will leverage the capabilities of GPT-4 while addressing privacy and cost concerns, making advanced tools more accessible to the biology community. Additionally, thanks to the vision capabilities of GPT-4V, we will extend our research to multimodal analyses, focusing on natural drug molecules, such as anticancer agents or vaccine adjuvants, especially those with unknown biosynthetic pathways.

“We will study their chemical and biosynthetic pathways as well as their potential applications. GPT-4V’s ability to recognize molecular structures will improve our analysis of complex multimodal data, thereby advancing our understanding and application in drug discovery and development in synthetic biology.”
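The knowledge-distillation plan Wu describes, in which a stronger model's answers become training data for a smaller local one, might look roughly like the Python sketch below. Everything in it is an assumption made for illustration: `teacher_answer` is a hypothetical stub rather than a real API call, the example questions are invented, and the `instruction`/`output` record layout is just one common convention for instruction-following data, not necessarily the format the team will use.

```python
import json


def teacher_answer(question: str) -> str:
    """Hypothetical stand-in for querying a teacher model such as GPT-4.

    A real pipeline would call the model here; this stub returns a
    placeholder so the sketch runs offline.
    """
    return f"[teacher response to: {question}]"


def build_records(questions):
    """Pair each question with the teacher's answer as one training record."""
    return [{"instruction": q, "output": teacher_answer(q)} for q in questions]


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records)


# Invented example questions, for illustration only.
questions = [
    "Which enzyme unwinds the DNA double helix during replication?",
    "What does a promoter do in a synthetic gene circuit?",
]
records = build_records(questions)
jsonl = to_jsonl(records)
print(jsonl)
```

A file in this shape could then be fed to a fine-tuning routine for a local model, so that sensitive data never has to leave the lab's own machines.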

More information:
Xinyu Gong et al, Evaluating the Potential of Leading Large Language Models in Reasoning Biology Questions, arXiv (2023). DOI: 10.48550/arxiv.2311.07582

Journal information:
arXiv

Citation: Testing the biological reasoning capabilities of large language models (December 19, 2023), retrieved December 19, 2023

