Credit: Pixabay/CC0 Public domain
A team of computer scientists led by the University of Massachusetts Amherst recently announced a new method for automatically generating whole proofs that can be used to prevent software bugs and verify that the underlying code is correct.
This new method, called Baldur, harnesses the power of large language model (LLM) artificial intelligence and, combined with the state-of-the-art tool Thor, achieves an unprecedented efficacy of nearly 66%. The team recently received a Distinguished Paper Award at the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE).
“We’ve unfortunately come to expect our software to be buggy, even though it’s everywhere and we all use it every day,” says Yuriy Brun, a professor in the Manning College of Information and Computer Sciences at UMass Amherst and the paper’s senior author.
The effects of buggy software can range from annoying (faulty formatting or sudden crashes) to potentially catastrophic when it comes to security flaws or precision software used for space exploration or to control healthcare devices.
Of course, methods of verifying software have existed for as long as software itself. One popular method is also the simplest: have a human go through the code, line by line, manually checking for errors. Or you can run the code and compare what it does with what you expect it to do. If, for example, you expect your word-processing software to break the line every time you press the “return” key, but it instead inserts a question mark, then you know something in the code is wrong.
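The run-and-compare method the article describes is essentially what an automated unit test does. A minimal Python sketch, where `insert_newline` is a hypothetical stand-in for a word processor's line-break logic (not code from the paper):

```python
# Hypothetical stand-in for a word processor's line-break handler:
# pressing "return" should insert a newline at the cursor position.
def insert_newline(text: str, cursor: int) -> str:
    return text[:cursor] + "\n" + text[cursor:]

# Run the code and compare its output with what we expect it to do.
result = insert_newline("hello world", 5)
assert result == "hello\n world", f"unexpected output: {result!r}"
print("test passed")
```

A test like this catches a mismatch between expected and actual behavior, but only for the specific inputs we thought to try.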
The problem with both methods is that they are prone to human error, and checking for all possible problems is extremely time-consuming, expensive, and impractical for anything other than trivial systems.
A much more thorough, though more difficult, method is to generate a mathematical proof showing that the code does what it is supposed to do, and then use a theorem prover to make sure the proof itself is also correct. This approach is known as formal verification.
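To make the idea concrete, here are two toy machine-checked theorems, written as a sketch in Lean 4 syntax (an illustrative proof assistant; the paper itself targets Isabelle/HOL). The checker accepts a theorem only if every step of its proof is valid; a wrong proof is rejected rather than silently accepted.

```lean
-- Verified by direct computation: the checker evaluates both sides.
theorem two_plus_two : 2 + 2 = 4 := rfl

-- Verified by `omega`, a decision procedure for linear arithmetic.
theorem double_is_sum (n : Nat) : n + n = 2 * n := by omega
```

Proofs about real software follow the same pattern but can grow far larger than the code they describe.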
But manually writing these proofs is extremely time-consuming and requires extensive expertise. “These proofs can be many times longer than the software code itself,” says Emily First, the lead author of the paper, who conducted this research as part of her doctoral dissertation at UMass Amherst.
With the rise of LLMs, of which ChatGPT is the best-known example, one possible solution is to try to generate such proofs automatically. However, “one of the biggest challenges with LLMs is that they are not always correct,” says Brun. “Instead of crashing and letting you know something is wrong, they tend to ‘fail silently,’ producing an incorrect answer but presenting it as if it were correct. And, often, the worst thing you can do is fail silently.”
This is where Baldur comes in.
First, whose team did its work at Google, used Minerva, an LLM trained on a large corpus of natural-language text and then fine-tuned on 118GB of scientific papers and web pages containing mathematical expressions.
She then fine-tuned the LLM on Isabelle/HOL, a language in which mathematical proofs are written. Baldur then generated an entire proof and worked in tandem with the theorem prover to check its work. When the theorem prover caught an error, it fed the proof, along with the error information, back into the LLM, so that it could learn from its mistake and generate a new, hopefully error-free, proof.
This process yields a remarkable increase in accuracy. The previous state-of-the-art tool for automatically generating proofs, called Thor, can generate proofs 57% of the time. When Baldur (Thor’s brother in Norse mythology) is paired with Thor, the two can generate proofs 65.7% of the time.
While a significant margin of error remains, Baldur is by far the most effective and efficient means yet devised to verify software correctness, and as the capabilities of AI become more extensive and refined, Baldur’s effectiveness should grow as well.
The paper is published in the Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering.
More information:
Emily First et al, Baldur: Whole-Proof Generation and Repair with Large Language Models, Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (2023). DOI: 10.1145/3611643.3616243
Provided by University of Massachusetts Amherst
Citation: Researchers develop AI-based automatic verification method to check software code (January 4, 2024) retrieved January 4, 2024 from