Microsoft researchers are working on a text-to-speech model that can mimic a person’s voice – complete with emotion and intonation – after hearing a mere three seconds of sample audio.
Some existing text-to-speech systems require clean voice data from a recording studio to capture high-quality speech.
So if the snippet of voice used as the acoustic prompt in the model is recorded over the telephone, the synthesized spoken text would also sound like it’s coming through the phone.
If the few seconds of recorded voice in the acoustic prompt convey anger, then the synthesized speech based on that voice will also sound angry.
A person’s voice could be captured and synthesized for use in a wide range of areas – from ads or spam calls to video games or chatbots.
Patrick Harr, CEO of anti-phishing firm SlashNext, told The Register that TTS could also become yet another tool for cybercriminals, who could use it for vishing campaigns – attacks using fraudulent phone calls or voice messages that appear to come from a contact the victim knows.