Amazon unveils the largest text-to-speech model ever made

February 15, 2024

Researchers at Amazon have introduced the largest text-to-speech model to date, which is better able to articulate complex sentences.

The model, BASE TTS (short for Big Adaptive Streamable TTS with Emergent abilities), could lay the foundation for more human-like interactions.

According to the research, extensive training could improve the reliability and versatility of TTS models in the same way that it has for large language models (LLMs).

Amazon’s BASE TTS impresses researchers

The text-to-speech model was trained on 100,000 hours of public-domain speech data, which gives the tool “state-of-the-art naturalness.” The data is predominantly English, with some German, Dutch and Spanish also included.

Moreover, the researchers found that even training a TTS model on 10,000 hours of speech can result in an improved ability to articulate complex sentences more naturally.

At 980 million parameters, BASE-large has been recognized as the largest text-to-speech model ever made. To compare results, the team also trained smaller models with 400 million and 150 million parameters, on 10,000 and 1,000 hours of speech respectively.

Amazon’s team describes BASE TTS as a “high-fidelity model capable of mimicking speaker characteristics with just a few seconds of reference audio,” recognizing its potential while acknowledging the need for more research.

Some of the key areas the researchers focused on were compound nouns, emotions, foreign words, paralinguistics, punctuation, questions, and syntactic complexities. Examples can be found on a dedicated web page.

With revolutionary artificial intelligence headlining most of 2023, text-to-speech breakthroughs like this in 2024 could continue to bring once-futuristic technologies into the hands of the masses. The research team’s cautious approach, however, highlights the need for proper regulation amid security and privacy fears.
