GPT-4 is taking a misaligned step by engaging in insider trading. Credit: Scheurer et al.
Artificial intelligence (AI) tools are now widely used around the world, helping both engineers and non-expert users with a wide range of tasks. Evaluating the security and reliability of these tools is therefore of the utmost importance, as this could, ultimately, help to better regulate their use.
Researchers at Apollo Research, an organization created with the aim of evaluating the security of AI systems, recently set out to evaluate the responses provided by large language models (LLMs) in a scenario where they are put under pressure. Their conclusions, published on the preprint server arXivsuggest that these models, the best known of which is OpenAI’s ChatGPT, could in some cases strategically mislead their users.
“At Apollo Research, we believe that some of the greatest risks come from advanced AI systems that can evade standard security assessments by engaging in strategic deception,” Jérémy Scheurer, co-author of the paper, told Tech Xplore. “Our goal is to fully understand AI systems to prevent the development and deployment of deceptive AI.
“However, so far there have been no demonstrations of AI acting in strategically deceptive ways without having been given explicit instructions to do so. We think it is important to have such compelling demonstrations to make this problem more salient and convince researchers, policymakers and the public that this is an important problem.”
By identifying scenarios in which specific AI tools can be strategically deceptive, Scheurer and his colleagues hope to inform further research evaluating their safety. Currently, there is very little empirical evidence highlighting the deceptive nature of AI and the contexts in which it can occur. The team therefore believes that there is a need for clear, experimentally validated examples of deceptive AI behavior.
“This research has been largely motivated by a desire to understand how and when AIs can become deceptive and we hope that this early work will be the start of more rigorous scientific treatments of AI deception,” Scheurer said.
Scheurer carried out this recent study in close collaboration with his colleague Mikita Balesni, who developed the concrete trading task on which they tested the LLMs. Their work was supervised by Marius Hobbhahn, principal investigator and CEO of Apollo Research.
“This was a model effort of Red-Teaming (a cybersecurity term commonly used in machine learning), meaning we attacked an LLM by adversarially feeding it text that will make people act in a misaligned (i.e. bad) way,” Scheurer said. “This contrasts with asking neutral questions (for example, ‘What is the capital of France?’) and seeing how she answers.”
As part of their study, the researchers specifically “attacked” GPT-4, the LLM developed by OpenAI that underpins the operation of ChatGPT. To do this, they designed text inputs (i.e., prompts) that instructed the model to act like an AI system making financial investments in a simulated environment.
“We tell the model that it has access to various tools to search for profitable trades, send emails and execute trades,” Scheurer said. “We then simulate normal interactions that the model might have with its “colleagues” and, most importantly, we introduce pressure on the model to perform well. For example, the CEO sends the AI agent an email saying the company is not doing well. and needs a good year. Another “colleague” mentions via a message that an economic slowdown is coming, etc. Finally, a fellow trader mentions that there is an upcoming merger announcement between two companies, which could be profitable, while also mentioning that management will be unhappy if they find out that one is trading based on inside information .
The researchers found that in this simulated, high-pressure environment, the GPT-4-based AI agent very often decides to act based on the inside information it received from another trader, buying shares of the merging company. Additionally, the model explicitly determines that he should not mention the insider tip, providing alternative explanations for the transaction made in a report prepared for his supervisor.
“Even when the manager explicitly asks if the model knew about the merger, the model usually doubles down on the lie,” Scheurer said. “I think it’s important to reiterate that this is a Red-Teaming effort, meaning we specifically looked for realistic scenarios in which this would happen. So our research is more of an existence proof that such behavior can occur, not an indication of whether such behavior exists or how likely it is to occur in nature.
This recent study by Scheurer and colleagues provides a clear and tangible example of scenarios in which LLMs could be strategically misleading. The researchers now plan to continue their research in this area, to identify other cases in which AI tools could be strategically deceptive as well as the possible implications of their deception.
“I think the biggest impact of our work is to make the problem of strategic AI deception (without explicit instructions to behave deceptively) very concrete and to show that it is not just about “a speculative story about the future, but that this type of behavior can happen today. with current models in certain circumstances,” Scheurer added. “I think this could lead people to take this issue more seriously, also opening the door for a lot of follow-up research from the community aimed at better understanding this behavior and ensuring it doesn’t happen again. “
More information:
Jérémy Scheurer et al, Technical report: Large language models can strategically mislead their users when put under pressure, arXiv (2023). DOI: 10.48550/arxiv.2311.07590
arXiv
© 2023 Science X Network
Quote: Study shows large language models can strategically mislead users when under pressure (December 12, 2023) retrieved December 12, 2023 from
This document is subject to copyright. Apart from fair use for private study or research purposes, no part may be reproduced without written permission. The content is provided for information only.