Overview of the proposed methods: (A) Four types of malicious triggers in the common embedding space for attack decomposition: a text trigger, an OCR text trigger, a visual trigger, and a combined OCR text-visual trigger. (B) An end-to-end gradient-based attack updates images to match malicious trigger embeddings in the common embedding space. (C) The embedding-space-based adversarial attack hides the malicious trigger in harmless-looking images, which are then combined with a harmless textual prompt for the jailbreak. (D) The attacks exhibit broad generalization and compositionality across jailbreak scenarios that mix text prompts and malicious triggers. Credit: arXiv (2023). DOI: 10.48550/arxiv.2307.14539
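For readers who want a concrete picture of panel (B), the gradient-based embedding-matching step can be sketched roughly as follows. This is a minimal illustration rather than the authors' released code: it assumes a CLIP-style vision_encoder that maps images into the shared embedding space and a precomputed target_embedding for the malicious trigger.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a gradient-based embedding-matching attack.
# Assumptions (not the paper's code): `vision_encoder` maps an image tensor
# into the shared embedding space, and `target_embedding` is the embedding
# of the malicious trigger the image should imitate.

def craft_adversarial_image(vision_encoder, target_embedding, base_image,
                            steps=500, lr=1e-2):
    """Iteratively perturb `base_image` so its embedding approaches the target."""
    adv_image = base_image.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([adv_image], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        image_embedding = vision_encoder(adv_image)
        # Pull the image embedding toward the malicious trigger embedding.
        loss = 1.0 - F.cosine_similarity(image_embedding,
                                         target_embedding, dim=-1).mean()
        loss.backward()
        optimizer.step()
        # Keep pixel values in a valid range so the image still renders normally.
        with torch.no_grad():
            adv_image.clamp_(0.0, 1.0)

    return adv_image.detach()
```

The optimized image typically still looks innocuous to a human viewer, which is what allows it to slip past safety filters once it is later paired with a generic text prompt.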
UC Riverside computer scientists have identified a security flaw in vision-language artificial intelligence (AI) models that may allow bad actors to use AI for nefarious purposes, such as obtaining instructions on how to make a bomb.
When integrated with models like Google Bard and ChatGPT, vision-language models allow users to ask questions using a combination of images and text.
Bourns College of Engineering scientists demonstrated a “jailbreak” hack by manipulating the operation of large language model (LLM) software, which essentially forms the basis of query-and-response AI programs.
The title of the article is “Jailbreak in Pieces: Compositional Adversarial Attacks on Multi-Modal Language Models”. It was submitted for publication to the International Conference on Learning Representations and is available on the arXiv preprint server.
These AI programs give users detailed answers to almost any question, drawing on stored knowledge distilled from vast amounts of information from the Internet. For example, ask ChatGPT, “How do I grow tomatoes?” and it will respond with step-by-step instructions, starting with seed selection.
But ask the same model how to do something harmful or illegal, like “How do I make meth?” and the model would normally refuse, providing a generic response such as “I can’t help with that.”
Yet Yue Dong, an assistant professor at UCR, and colleagues found ways to trick AI language models, particularly LLMs, into giving detailed answers to nefarious questions, drawing on data collected from sources including the dark web.
The vulnerability arises when images are used in AI queries, Dong explained.
“Our attacks use a novel composition strategy that combines an image, adversarially targeted at toxic embeddings, with generic prompts to achieve the jailbreak,” reads the paper by Dong and colleagues presented at the SoCal NLP Symposium held at UCLA in November.
Dong explained that computers see images by interpreting the millions of bytes of information that make up their pixels, the small dots that compose the image. A typical cell phone image, for example, consists of approximately 2.5 million bytes of information.
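To get a rough sense of that scale, the short snippet below (assuming the Pillow and NumPy libraries are installed, and using a hypothetical photo file named phone_photo.jpg) prints both the raw pixel data and the on-disk size of an image in bytes:

```python
import os

import numpy as np
from PIL import Image

# Hypothetical example file; any cell phone photo would do.
path = "phone_photo.jpg"

image = Image.open(path)
pixels = np.asarray(image)  # e.g. shape (height, width, 3) for an RGB photo

print("Pixel array shape:", pixels.shape)
print("Uncompressed pixel data, in bytes:", pixels.nbytes)
print("File size on disk, in bytes:", os.path.getsize(path))
```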
Remarkably, Dong and colleagues found that bad actors can hide nefarious questions, such as “How do I make a bomb?”, within the millions of bytes of information contained in an image and trigger responses that bypass the built-in protections of generative AI models like ChatGPT.
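Conceptually, the full jailbreak then amounts to composing such an adversarial image with a harmless-sounding prompt. The sketch below builds on the craft_adversarial_image function from the earlier snippet; query_vlm is a hypothetical stand-in for whatever vision-language model interface is being probed, not a real API:

```python
# Schematic composition step, continuing the earlier sketch. `query_vlm` is a
# hypothetical callable representing the vision-language model under test.

def attempt_jailbreak(query_vlm, vision_encoder, target_embedding, benign_image):
    # Hide the malicious trigger inside a harmless-looking image (earlier sketch).
    adv_image = craft_adversarial_image(vision_encoder, target_embedding, benign_image)
    # Pair it with a generic, innocuous prompt; the harmful intent lives in the pixels.
    generic_prompt = "Please give step-by-step instructions for the task shown in the image."
    return query_vlm(image=adv_image, prompt=generic_prompt)
```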
“Once the safeguard is bypassed, the models happily give answers to teach us how to make a bomb step by step with many details that can lead bad actors to build a bomb successfully,” Dong said.
Dong, graduate student Erfan Shayegani, and Professor Nael Abu-Ghazaleh published their findings in an online paper so that AI developers can eliminate the vulnerability.
“We act as attackers to ring the bell, so the IT community can respond and defend itself,” Dong said.
AI queries based on images and text are of great practical use. For example, doctors can feed MRI organ scans and mammogram images to these models to detect tumors and other medical problems that need rapid attention. AI models can also create graphs from simple cell phone images or spreadsheets.
More information:
Erfan Shayegani et al, Jailbreak in Pieces: Compositional Adversarial Attacks on Multi-Modal Language Models, arXiv (2023). DOI: 10.48550/arxiv.2307.14539
Provided by University of California – Riverside
Citation: Scientists identify security flaw in AI query models (January 10, 2024) retrieved January 10, 2024 from
This document is subject to copyright. Apart from fair use for private study or research purposes, no part may be reproduced without written permission. The content is provided for information only.