“There is a lot of what we could call linguistic information that is not in the words that you pronounce, but it’s another way of communicating based on the way you say things to express a specific intention or specific emotion,” says Neil Zeghidour, a co-creator of AudioLM. For example, someone may laugh after saying something to indicate that it was a joke. “All that makes speech natural,” he says.
Eventually, AI-generated music could be used to provide more natural-sounding background soundtracks for videos and slideshows. Speech generation technology that sounds more natural could help improve internet accessibility tools and bots that work in health care settings, says Patel. The team also hopes to create more sophisticated sounds, like a band with different instruments or sounds that mimic a recording of a tropical rainforest.
However, the technology’s ethical implications need to be considered, Patel says. In particular, it’s important to determine whether the musicians who produce the clips used as training data will get attribution or royalties from the end product, an issue that has cropped up with text-to-image AIs. AI-generated speech that’s indistinguishable from the real thing could also make it easier to spread misinformation.
In the paper, the researchers write that they are already considering and working to mitigate these issues—for example, by developing techniques to distinguish natural sounds from sounds produced using AudioLM. Patel also suggested including audio watermarks in AI-generated products to make them easier to distinguish from natural audio.