Amazon Unveils Largest Text-to-Speech Model Ever

An overview of BASE TTS. The speech tokenizer (1) learns a discrete representation, which is modeled by an autoregressive model (2) conditioned on the text and reference speech. The speech code decoder (3) converts the predicted speech representations into a waveform. Credit: arXiv (2024). DOI: 10.48550/arxiv.2402.08093
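
The three stages in the figure can be pictured as a simple data flow. The Python sketch below is purely illustrative: the class names, the toy quantization scheme, and the placeholder "model" are all assumptions made for this example and do not reflect how Amazon implemented BASE TTS.

```python
from typing import List


class ToySpeechTokenizer:
    """Stage 1 (hypothetical): turn waveform samples in [-1, 1] into discrete codes."""

    def __init__(self, num_codes: int = 16) -> None:
        self.num_codes = num_codes

    def encode(self, waveform: List[float]) -> List[int]:
        codes = []
        for sample in waveform:
            clipped = max(-1.0, min(1.0, sample))
            # Scale [-1, 1] onto an integer bucket in [0, num_codes - 1].
            codes.append(round((clipped + 1.0) / 2.0 * (self.num_codes - 1)))
        return codes


class ToyAutoregressiveModel:
    """Stage 2 (hypothetical): emit one code at a time, conditioned on the
    text and on the codes of a reference recording. A real system would use
    a trained neural network here; this stand-in is a fixed arithmetic rule."""

    def __init__(self, num_codes: int = 16) -> None:
        self.num_codes = num_codes

    def generate(self, text: str, reference_codes: List[int]) -> List[int]:
        history = list(reference_codes)
        generated = []
        for char in text:
            previous = history[-1] if history else 0
            # Each new code depends on the previous code and the current character.
            next_code = (previous + ord(char)) % self.num_codes
            generated.append(next_code)
            history.append(next_code)
        return generated


class ToySpeechDecoder:
    """Stage 3 (hypothetical): map discrete codes back to waveform samples."""

    def __init__(self, num_codes: int = 16) -> None:
        self.num_codes = num_codes

    def decode(self, codes: List[int]) -> List[float]:
        return [code / (self.num_codes - 1) * 2.0 - 1.0 for code in codes]


def synthesize(text: str, reference_waveform: List[float], num_codes: int = 16) -> List[float]:
    """Chain the three stages: tokenize the reference, predict codes, decode."""
    tokenizer = ToySpeechTokenizer(num_codes)
    model = ToyAutoregressiveModel(num_codes)
    decoder = ToySpeechDecoder(num_codes)
    reference_codes = tokenizer.encode(reference_waveform)
    predicted_codes = model.generate(text, reference_codes)
    return decoder.decode(predicted_codes)
```

Calling `synthesize("hi", [0.0, 1.0])` returns one toy sample per input character. The output is not audible speech; the point is only the order of the stages, with discrete codes as the interface between them.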

A team of artificial intelligence researchers from Amazon AGI has announced the development of what it describes as the largest text-to-speech model ever made. By largest, they mean having the most parameters and using the largest training dataset. They have published a paper on the arXiv preprint server describing how the model was developed and trained.

LLMs like ChatGPT have gained attention for their human-like ability to answer questions intelligently and produce high-level documents. But AI continues to make its way into other consumer applications as well. In this new effort, the researchers attempted to improve the capability of a text-to-speech application by increasing its number of parameters and expanding its training data.

The new model, called Big Adaptive Streamable TTS with Emergent abilities (BASE TTS for short), has 980 million parameters and was trained using 100,000 hours of recorded speech (found on public sites), most of which was in English. The team also gave it examples of words and phrases spoken in other languages to allow the model to correctly pronounce well-known phrases when it encounters them, "au contraire," for example, or "adios, amigo."

The Amazon team also tested smaller versions of the model trained on smaller datasets, hoping to learn where it develops what is known in the AI field as an emergent quality, in which an AI application, whether an LLM or a text-to-speech application, suddenly seems to jump to a higher level of intelligence. They found that for their application, the jump occurred with the medium-sized dataset, at 150 million parameters.

They also noted that this leap involved a multitude of linguistic attributes, such as the ability to use compound nouns, express emotions, use foreign words, apply paralinguistics and punctuation, and pose questions with emphasis on the right word in a simple sentence.

The team says BASE TTS will not be released to the public because they fear it could be used unethically. Instead, they plan to use it as a learning application. They hope to apply what they have learned so far to improve the sound quality of text-to-speech applications in general.

More information:
Mateusz Łajszczak et al, BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data, arXiv (2024). DOI: 10.48550/arxiv.2402.08093

www.amazon.science/publication… n-100,000 hours-of-data

Journal information:
arXiv

Citation: Amazon unveils the largest text-to-speech model ever made (February 17, 2024), retrieved February 17, 2024

