A team of artificial intelligence researchers at Amazon AGI has announced the development of what it describes as the largest text-to-speech model ever made — largest meaning it has the most parameters and was trained on the largest dataset. The team published a paper on the arXiv preprint server describing how the model was built and trained.
LLMs like ChatGPT have gained attention for their human-like ability to answer questions intelligently and produce polished writing. But AI continues to make its way into other consumer applications as well. In this new effort, the researchers attempted to improve the capabilities of a text-to-speech application by increasing its number of parameters and expanding its training data.
The new model, called Big Adaptive Streamable TTS with Emergent abilities (BASE TTS for short), has 980 million parameters and was trained on 100,000 hours of recorded speech (found on public sites), most of it in English. The team also gave the model examples of words and phrases spoken in other languages so that it could correctly pronounce foreign phrases when encountering them — "au contraire," for example, or "adios, amigo."
The Amazon team also tested the model on smaller datasets, hoping to learn at what point it develops what is now known in the AI field as an emergent ability — the point at which an AI application, whether an LLM or a text-to-speech system, suddenly seems to reach a higher level of intelligence. They found that for their application, the jump occurred at a medium-sized dataset, with 150 million parameters.
They also noted that this leap involved a multitude of linguistic attributes, such as the ability to use compound nouns, express emotions, use foreign words, apply paralinguistics and punctuation, and pose questions with emphasis on the right word in a sentence.
The team says BASE TTS will not be made public (they fear it could be used unethically), but they plan to use it as a learning tool. They hope to apply what they have learned so far to improve the sound quality of text-to-speech applications in general.
More information:
Mateusz Łajszczak et al, "BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100K hours of data," arXiv (2024). DOI: 10.48550/arxiv.2402.08093
© 2024 Science X Network
Citation: Amazon unveils the largest text-to-speech model ever made (February 17, 2024), retrieved February 17, 2024.
This document is subject to copyright. Apart from fair use for private study or research purposes, no part may be reproduced without written permission. The content is provided for information only.