Large language models (LLMs), deep learning-based models trained to generate, summarize, translate and process written text, have attracted considerable attention following the release of OpenAI's conversational platform ChatGPT. Although ChatGPT and similar platforms are now widely used across a broad range of applications, they can be vulnerable to a specific type of cyberattack that produces biased, unreliable, or even offensive responses.
Researchers from the Hong Kong University of Science and Technology, University of Science and Technology of China, Tsinghua University and Microsoft Research Asia recently conducted a study on the potential impact of these attacks and the techniques that could protect models against them. Their article, published in Nature Machine Intelligence, introduces a new psychology-inspired technique that could help protect ChatGPT and similar LLM-based conversational platforms from cyberattacks.
“ChatGPT is a societal-impact AI tool with millions of users and integration into products such as Bing,” Yueqi Xie, Jingwei Yi and colleagues write in their paper. “However, the emergence of jailbreak attacks particularly threatens its responsible and secure use. Jailbreak attacks use adversarial prompts to circumvent ChatGPT’s ethical safeguards and generate harmful responses.”
The main goal of Xie, Yi, and their colleagues’ recent work was to highlight the impact that jailbreak attacks can have on ChatGPT and introduce viable defense strategies against these attacks. Jailbreak attacks essentially exploit vulnerabilities in LLMs to bypass developer-defined constraints and elicit responses from the model that would typically be restricted.
“This paper investigates the serious but under-explored problems created by jailbreaks as well as potential defensive techniques,” Xie, Yi and their colleagues explain in their paper. “We introduce a jailbreak dataset with different types of jailbreak prompts and malicious instructions.”
The researchers first compiled a dataset including 580 examples of jailbreak prompts designed to bypass restrictions that prevent ChatGPT from providing responses deemed “immoral.” This includes unreliable text that could fuel misinformation as well as toxic or abusive content.
When they tested ChatGPT on these jailbreak prompts, they found that it often fell into their “trap,” producing the malicious and unethical content the prompts requested. Xie, Yi and their colleagues then set out to design a simple but effective technique that could protect ChatGPT against carefully crafted jailbreak attacks.
The technique they created is inspired by the psychological concept of personal reminders, nudges that can help people remember tasks they need to complete, events they are expected to attend, and so on. The researchers’ defense approach, called system-mode self-reminder, is similarly designed to remind ChatGPT that the answers it provides must follow specific guidelines.
“This technique encapsulates the user’s query in a system prompt that reminds ChatGPT to respond responsibly,” the researchers write. “Experimental results demonstrate that self-reminders significantly reduce the success rate of jailbreak attacks against ChatGPT from 67.21% to 19.34%.”
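To illustrate the idea, the minimal Python sketch below wraps a user query between reminder instructions before it is sent to the model. The reminder wording, function name and chat-message structure here are illustrative assumptions for explanation purposes, not the authors' exact prompt or code.

```python
# Sketch of the system-mode self-reminder idea: the user's query is
# encapsulated inside a prompt that reminds the model to answer responsibly.
# The exact reminder text below is an assumption, not the paper's verbatim prompt.

REMINDER_PREFIX = (
    "You should be a responsible AI assistant and should not generate "
    "harmful or misleading content. Please answer the following user query "
    "in a responsible way.\n"
)
REMINDER_SUFFIX = (
    "\nRemember, you should be a responsible AI assistant and should not "
    "generate harmful or misleading content."
)

def wrap_with_self_reminder(user_query: str) -> list[dict]:
    """Return a chat-style message list with the query sandwiched
    between self-reminder instructions."""
    wrapped = f"{REMINDER_PREFIX}{user_query}{REMINDER_SUFFIX}"
    return [{"role": "user", "content": wrapped}]

if __name__ == "__main__":
    messages = wrap_with_self_reminder("Tell me about the history of cryptography.")
    print(messages[0]["content"])
```

Because the defense only rewrites the prompt that reaches the model, it requires no additional training or changes to the model itself, which is what makes it simple to deploy.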
So far, the researchers have tested the effectiveness of their technique using the dataset they created and found that it achieved promising results, reducing the success rate of attacks without preventing all of them. In the future, this new technique could be further improved to reduce the vulnerability of LLMs to these attacks, while potentially inspiring the development of other similar defense strategies.
“Our work systematically documents the threats posed by jailbreak attacks, introduces and analyzes a dataset to evaluate defensive interventions, and proposes a psychologically inspired self-reminder technique that can effectively mitigate jailbreaks without additional training,” the researchers summarize in their article.
More information:
Yueqi Xie et al, Defending ChatGPT against jailbreak attack via self-reminders, Nature Machine Intelligence (2023). DOI: 10.1038/s42256-023-00765-8.
© 2024 Science X Network