When artificial intelligence models examine hundreds of gigabytes of training data to learn the nuances of language, they also absorb the biases woven into those texts.
Computer science researchers at Dartmouth are exploring ways to pinpoint the parts of a model that encode these biases, paving the way for mitigating them, or even removing them altogether.
In a recent article published in the Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, co-authors Weicheng Ma, a Ph.D. candidate in computer science in the Guarini School of Graduate and Advanced Studies, and Soroush Vosoughi, assistant professor of computer science, examine how stereotypes are encoded in pre-trained large language models.
A large language model is a deep neural network designed to process, understand, and generate text and other content when trained on huge data sets.
Pre-trained models contain biases, like stereotypes, Vosoughi explains. These can be seemingly positive (suggesting, for example, that a particular group has certain skills) or negative (suggesting, for example, that a person has a certain profession based on their gender).
And machine learning models are poised to permeate daily life in a variety of ways. They can help hiring managers sift through piles of resumes, facilitate faster approvals or rejections of bank loans, and provide guidance during parole decisions.
But inherent stereotypes based on demographics can lead to unfair and undesirable results. To mitigate these effects, “we ask ourselves if we can do anything about stereotypes even after a model has been trained,” says Vosoughi.
The researchers started from the hypothesis that stereotypes, like other linguistic features and patterns, are encoded in specific parts of the neural network model known as “attention heads.” These are similar to a group of neurons; they allow a machine learning program to memorize multiple words given to it as input, among other functions, some of which are not yet fully understood.
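For readers who want a concrete picture, the snippet below is a minimal sketch, not code from the study, of how one can load a pre-trained model such as BERT and inspect the attention weights each head produces for a sentence; the Hugging Face transformers library and the example sentence are assumptions chosen purely for illustration.

```python
# Minimal sketch (not the authors' code) of inspecting attention heads in BERT
# using the Hugging Face `transformers` library; the model, library, and example
# sentence are illustrative assumptions.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The nurse said that she was tired.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# `outputs.attentions` is a tuple with one tensor per layer, each shaped
# (batch, num_heads, seq_len, seq_len): one attention map per head.
attentions = outputs.attentions
print(f"{len(attentions)} layers, {attentions[0].shape[1]} heads per layer")

# Example: how strongly each head in layer 0 attends from "she" to "nurse".
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
she_idx, nurse_idx = tokens.index("she"), tokens.index("nurse")
print(attentions[0][0, :, she_idx, nurse_idx])
```

Each layer of BERT-base contains 12 such heads, and every head produces its own token-to-token attention map, which is why individual heads can be examined, and potentially removed, in isolation.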
Ma, Vosoughi, and their collaborators created a stereotype-laden dataset and used it to fine-tune 60 different pre-trained large language models, including BERT and T5. By amplifying the models’ stereotypes, the dataset acted as a detector, highlighting the attention heads that did the heavy lifting in encoding these biases.
In their paper, the researchers show that pruning the worst offenders significantly reduces stereotypes in large language models without noticeably degrading their language abilities.
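As a rough illustration of what removing the worst offenders could look like in practice, the sketch below uses the prune_heads utility that Hugging Face transformers models expose; the specific layer and head indices are placeholders, not findings from the paper.

```python
# Hypothetical sketch of pruning attention heads flagged as stereotype-encoding.
# The layer/head indices below are placeholders, not results from the paper.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Map of {layer_index: [head_indices to remove]} -- illustrative values only.
heads_to_prune = {
    2: [4, 7],
    5: [1],
    9: [0, 3, 11],
}

# `prune_heads` deletes the selected heads' parameters from each layer's
# self-attention module; the rest of the model is left untouched.
model.prune_heads(heads_to_prune)

# Confirm how many heads remain in each layer after pruning.
for layer_idx, layer in enumerate(model.encoder.layer):
    print(layer_idx, layer.attention.self.num_attention_heads)
```

Because pruning removes only the parameters of the selected heads, the rest of the network stays intact, which is consistent with the trade-off the researchers report between reduced stereotyping and preserved language ability.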
“Our discovery upends the traditional view that advances in AI and natural language processing require extensive training or complex algorithmic interventions,” says Ma. Since the technique is not inherently language- or model-specific, it would be widely applicable, according to Ma.
What’s more, Vosoughi adds, the dataset can be modified to reveal certain stereotypes while leaving others intact: “It’s not a one-size-fits-all solution.”
Thus, a medical diagnostic model, in which differences based on age or gender may be important for patient assessment, would use a different version of the dataset than one used to remove bias from a model that screens potential candidates.
The technique only works when there is access to the fully trained model and will not apply to black-box models, such as OpenAI’s chatbot ChatGPT, whose inner workings are invisible to users and researchers.
Adapting the current approach to black-box models is the immediate next step, says Ma.
More information:
Weicheng Ma et al, Deciphering stereotypes in pre-trained language models, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (2023). DOI: 10.18653/v1/2023.emnlp-main.697
Provided by Dartmouth College