A new AI system can create natural-sounding speech and music after being prompted with just a few seconds of audio.
AudioLM, developed by Google researchers, generates audio that matches the style of the prompt, including complex sounds like piano music or people talking, in a way that is almost indistinguishable from the original recording. The technique shows promise for speeding up the process of training AI to generate audio, and it could eventually be used to automatically generate music to accompany videos.
AI-generated audio is already common: voices on home assistants like Alexa use natural language processing, and AI music systems like OpenAI’s Jukebox have produced impressive results. But most existing techniques require people to prepare transcriptions and label text-based training data, which takes considerable time and human labor. Jukebox, for example, uses text-based data to generate song lyrics.
AudioLM, described in a non-peer-reviewed paper last month, is different: it requires no transcription or labeling. Instead, audio databases are fed into the program, and machine learning is used to compress the audio files into chunks of sound, called “tokens,” without losing too much information. This tokenized training data is then fed into a machine-learning model that uses natural language processing to learn the patterns of the sound.
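To make that tokenization step concrete, here is a toy sketch in Python. Everything in it is illustrative: the random waveform and codebook are stand-ins, and AudioLM’s actual tokenizers are learned neural networks (the paper pairs acoustic tokens from SoundStream with semantic tokens from w2v-BERT) rather than this simple nearest-neighbor lookup.

```python
import numpy as np

def tokenize_audio(waveform, codebook, frame_len=320):
    """Toy illustration of audio tokenization: slice the waveform into
    short frames and map each frame to the index of its nearest codebook
    vector (vector quantization). This is only a stand-in for AudioLM's
    learned neural tokenizers."""
    n_frames = len(waveform) // frame_len
    frames = waveform[: n_frames * frame_len].reshape(n_frames, frame_len)
    # Squared distance from every frame to every codebook entry.
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # one discrete token id per frame

# Hypothetical example: 1 second of audio at 16 kHz, a random 1024-entry codebook.
rng = np.random.default_rng(0)
waveform = rng.standard_normal(16_000).astype(np.float32)
codebook = rng.standard_normal((1024, 320)).astype(np.float32)
tokens = tokenize_audio(waveform, codebook)
print(tokens[:10])  # a short "sentence" of sound tokens
```

The point of the compression is that a continuous waveform becomes a sequence of discrete symbols, which is exactly the kind of input a text-style language model knows how to handle.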
To generate audio, a few seconds of sound are fed into AudioLM, which then predicts what comes next. The process is similar to the way language models such as GPT-3 predict which words and phrases typically follow one another.
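Below is a minimal sketch of that prediction loop, assuming the prompt audio has already been turned into token ids as above. The bigram count table is a deliberately tiny stand-in for the large Transformer language models AudioLM actually uses, and all names and numbers are hypothetical.

```python
import numpy as np

def continue_tokens(prompt, counts, n_new, rng):
    """Autoregressive continuation over token ids: repeatedly sample the
    next token from the distribution of what tended to follow the current
    token in the training data. A bigram table stands in here for the
    Transformer language models AudioLM uses for this step."""
    seq = list(prompt)
    for _ in range(n_new):
        probs = counts[seq[-1]]
        probs = probs / probs.sum()          # normalize counts into probabilities
        seq.append(rng.choice(len(probs), p=probs))
    return seq

# Hypothetical setup: bigram statistics gathered from tokenized training audio.
rng = np.random.default_rng(0)
vocab = 1024
counts = rng.random((vocab, vocab)) + 1e-6   # stand-in for real co-occurrence counts
prompt = [417, 205, 12]                      # token ids from a few seconds of audio
print(continue_tokens(prompt, counts, n_new=8, rng=rng))
```

In the real system the predicted tokens are then decoded back into a waveform; in this sketch they simply remain a longer sequence of symbols.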
The audio clips released by the team sound quite natural. In particular, piano music generated with AudioLM sounds smoother than piano music generated using existing AI techniques, which tends to sound chaotic.
Roger Dannenberg, who researches computer-generated music at Carnegie Mellon University, says AudioLM already produces much better sound quality than previous music-generation programs. In particular, he says, AudioLM is surprisingly good at recreating some of the repeating patterns inherent in human-made music. To generate realistic piano music, AudioLM has to capture many of the subtle vibrations contained in each note when piano keys are struck. The music must also sustain its rhythms and harmonies over time.
“This is really impressive, in part because it indicates that they are learning some kind of structure at multiple levels,” says Dannenberg.
AudioLM isn’t limited to music. Because it was trained on a library of recordings of humans speaking sentences, the system can also generate speech that continues in the accent and cadence of the original speaker, although at this point those sentences can still come across as non sequiturs that make no sense. AudioLM is trained to learn which types of sound fragments frequently occur together, and it uses that process in reverse to produce sentences. It also has the advantage of learning the pauses and exclamations that are inherent in spoken language but do not translate easily into text.
Rupal Patel, who researches information and speech science at Northeastern University, says earlier AI work on audio generation could capture those nuances only if they were explicitly annotated in the training data. AudioLM, by contrast, learns these features from the input data automatically, which adds to the realism.
“There’s a lot of what we might call paralinguistic information that’s not in the words you say, but it’s another way of communicating based on the way you say things, to express a specific intention or a specific emotion,” says Neil Zeghidour, a co-creator of AudioLM. For example, someone might laugh after saying something to indicate that it was a joke. “All of this makes speech natural,” he says.
Eventually, AI-generated music could be used to provide more natural background soundtracks for videos and slideshows. More natural-sounding speech-generation technology could also help improve internet accessibility tools and robots that work in healthcare settings, Patel says. The team also hopes to create more sophisticated sounds, such as a band with different instruments or sounds that mimic a recording of a rainforest.
However, the ethical implications of the technology need to be considered, Patel says. In particular, it will be important to determine whether the musicians whose clips are used as training data will receive attribution or copyright over the final product, an issue that has already arisen with text-to-image AI. AI-generated speech that is indistinguishable from the real thing could also become so convincing that it allows disinformation to spread more easily.
In the paper, the researchers write that they are already considering and working to mitigate these issues, for example by developing techniques to distinguish natural sounds from those produced with AudioLM. Patel also suggested including audio watermarks in AI-generated products to make them easier to distinguish from natural audio.