Computer scientists at Nanyang Technological University Singapore (NTU Singapore) have successfully compromised several artificial intelligence (AI) chatbots, including ChatGPT, Google Bard and Microsoft Bing Chat, to produce content that violates their developers’ guidelines – a result known as a “jailbreak”. “.
“Jailbreaking” is a computer security term in which hackers find and exploit flaws in a system’s software to make it do something that its developers deliberately told it not to do.
Additionally, by training a large language model (LLM) on a database of prompts that had previously demonstrated the ability to successfully hack these chatbots, the researchers created an LLM chatbot capable of automatically generating other prompts for jailbreak other chatbots.
LLMs form the brains of AI chatbots, allowing them to process human input and generate text that is almost indistinguishable from that a human can create. This includes performing tasks such as planning a travel itinerary, telling a bedtime story, and developing computer code.
The work of NTU researchers now adds “jailbreak” to the list. Their findings can be critical in helping businesses realize the weaknesses and limitations of their LLM chatbots so they can take steps to harden them against hackers.
After running a series of proof-of-concept tests on LLMs to prove that their technique does indeed present a clear and present threat to them, the researchers immediately reported the issues to affected service providers, after launching successful jailbreak attacks .
Professor Liu Yang from NTU’s School of Computer Science and Engineering, who led the study, said: “Large language models (LLMs) have proliferated rapidly due to their exceptional ability to understand, generate and complete human-like text, LLM chatbots being very popular applications for daily use.
“Developers of such AI services have implemented safeguards to prevent the AI from generating violent, unethical or criminal content. But AI can be outsmarted, and we have now used AI against its own type to “jailbreak” LLMs to produce such content. “
Ph.D. NTU. Student Mr. Liu Yi, co-author of the paper, said: “The paper presents a new approach to automatically generate jailbreak prompts against hardened LLM chatbots. Training an LLM with jailbreak prompts helps automate the generation of these prompts, achieving a much higher success rate than existing methods. In effect, we attack chatbots by using them against themselves.
The researchers’ paper describes a dual method for “jailbreaking” LLMs, which they called “Masterkey.”
First, they reverse-engineered how LLMs detect and defend against malicious queries. Using this information, they taught an LLM to automatically learn and produce prompts that bypass the defenses of other LLMs. This process can be automated, creating a jailbreak LLM that can adapt and create new jailbreak prompts even after developers patch their LLM.
The researchers’ paper, which appears on the preprint server arXivhas been accepted for presentation at the Network and Distributed System Security Symposium, a leading security forum, in San Diego, USA, in February 2024.
Testing the limits of LLM ethics
AI chatbots receive prompts, or a series of instructions, from human users. All LLM developers establish guidelines to prevent chatbots from generating unethical, questionable, or illegal content. For example, asking an AI chatbot how to create malware to hack bank accounts often results in an outright refusal of an answer on the grounds of criminal activity.
Professor Liu said: “Despite their advantages, AI chatbots remain vulnerable to jailbreak attacks. They can be compromised by malicious actors who abuse their vulnerabilities to force chatbots to generate results that violate established rules.
NTU researchers studied ways to circumvent a chatbot by creating prompts that fly under the radar of its ethical guidelines so that the chatbot is tricked into responding. For example, AI developers rely on keyword censors that detect certain words that might signal potentially questionable activity and refuse to respond if such words are detected.
One strategy researchers used to circumvent keyword censors was to create a character providing prompts that simply contained spaces after each character. This bypasses LLM censors, who could operate from a list of banned words.
The researchers also asked the chatbot to respond in the guise of a persona that was “unqualified and devoid of moral constraints,” thereby increasing the chances of producing unethical content.
Researchers were able to infer the inner workings and defenses of LLMs by manually entering these prompts and observing how long each prompt took to pass or fail. They were then able to reverse engineer the LLMs’ hidden defense mechanisms, further identify their ineffectiveness, and create a dataset of prompts that successfully jailbreak the chatbot.
Intensification of the arms race between hackers and LLM developers
When vulnerabilities are discovered and revealed by hackers, AI chatbot developers respond by “fixing” the problem, in a never-ending cat-and-mouse cycle between hacker and developer.
With Masterkey, NTU computer scientists have upped the ante in this arms race, as an AI jailbreaking chatbot can produce a high volume of prompts and continuously learn what works and what doesn’t, allowing hackers to beat LLM developers at their own game with their own tools. .
The researchers first created a training dataset that included prompts they found effective during the previous phase of reverse engineering the jailbreak, as well as unsuccessful prompts, so Masterkey knew what not to do. The researchers fed this dataset into an LLM as a starting point, then performed ongoing pre-training and task tuning.
This exposes the model to a wide range of information and refines its capabilities by training it on tasks directly related to jailbreaking. The result is an LLM that can better predict how to manipulate text for jailbreaking, leading to more effective and universal prompts.
Researchers found that Masterkey-generated prompts were three times more effective than LLM-generated prompts in jailbreak LLMs. Masterkey was also able to learn from previous failed prompts and can be automated to constantly produce new, more effective prompts.
The researchers say their LLM can be used by developers themselves to strengthen their security.
Ph.D. NTU. Student Mr. Deng Gelei, co-author of the paper, said: “As LLMs continue to evolve and expand their capabilities, manual testing becomes both laborious and potentially inadequate to cover all possible vulnerabilities. An automated approach to generating jailbreak prompts can ensure comprehensive coverage, evaluating a wide range of possible misuse scenarios.
More information:
Gelei Deng et al, MasterKey: Automated jailbreak on several large language model chatbots, arXiv (2023). DOI: 10.48550/arxiv.2307.08715
arXiv
Provided by Nanyang Technological University
Quote: Researchers use AI chatbots against themselves to “jailbreak” themselves (December 28, 2023) retrieved December 28, 2023 from
This document is subject to copyright. Apart from fair use for private study or research purposes, no part may be reproduced without written permission. The content is provided for information only.