Toward the end of the last century, Bill Gates saw an opportunity to bring together citizens of nearly 200 countries, speaking more than 7,000 languages, in common dialogue through the burgeoning web community.
“The Internet is becoming the public square of tomorrow’s global village,” he said.
Since then, the Internet has certainly brought the world closer together and has greatly enriched global communications, commerce, research and entertainment.
But a recent report reminds us – as if we really needed reminding – that progress sometimes comes with problems.
Researchers at Amazon Web Services’ Artificial Intelligence Lab and the University of California, Santa Barbara say that after examining more than 6 billion sentences on the web, they found that more than half had been translated into two or more languages. The translations, they found, were often poor. And with each successive translation into another language, in some cases as many as eight or nine, the results got worse.
The report, “A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism,” was uploaded to the preprint server arXiv on January 11.
“The low quality of these translations indicates that they were likely created using machine translation,” the authors report. “Our work raises serious concerns about training models such as large multilingual language models on monolingual and bilingual data mined from the web.”
The researchers said that texts are not only translated by artificial intelligence but also created by it. They observed that rates of machine-generated translation were highest among low-resource languages, such as the African languages Wolof and Xhosa.
“We find that highly multidirectional parallel translations are of significantly lower quality than bidirectional parallel translations,” the authors continue.
This means that as billions of bits of data are ingested for AI training, regions underrepresented on the web, such as African countries and others with less widely used languages, will face greater challenges in building large, reliable and grammatical language models. With few native resources available, they have to rely heavily on the corrupted translations flooding the market.
Mehak Dhaliwal, a former applied science intern at Amazon Web Services, told Motherboard in an interview: “We actually became interested in this topic because several colleagues who work in machine learning and are native speakers of low-resource languages noted that much of the Internet in their native language appears to be generated by machine translation… Everyone should be aware that the content they view on the Web may have been generated by a machine.”
Amazon researchers discovered bias in the selection of content used for AI training.
They state: “Automatically generated multidirectional parallel translations not only dominate the total amount of translated content on the web in low-resource languages, but also constitute a large portion of the total web content in these languages.”
Such content, they suggest, tends to consist of simpler, lower-quality passages, “probably produced to generate advertising revenue.” Since fluency and accuracy are lower for machine-translated material, repeated translation will lead to even less accurate content and increase the chances of AI hallucinating.
Over the years, computer-generated translations have sometimes resulted in unintentionally humorous or embarrassing interpretations.
Google misinterpreted the phrase “Russia is a big country” and referred to Mordor, a fictional realm in J.R.R. Tolkien’s “The Lord of the Rings.” In 2020, Facebook’s translation software inadvertently referred to Chinese President Xi Jinping as “Mr. S***hole” multiple times in English text translated from Burmese. Facebook immediately apologized and blamed the incident on a “technical error.”
And a medical prescription translation tool for Armenian speakers gave unfortunate advice to a patient suffering from headaches.
English: “You can take over-the-counter ibuprofen if needed to relieve pain.”
Armenian translation: “You can take as many anti-tank missiles as you need to relieve the pain.”
More information:
Brian Thompson et al, A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism, arXiv (2024). DOI: 10.48550/arxiv.2401.05749
© 2024 Science X Network
Citation: Faulty machine translations litter the web (January 22, 2024), retrieved January 22, 2024 from