VALL-E AI can mimic a person’s voice from a three-second snippet

Microsoft researchers are working on a text-to-speech model that can mimic a person’s voice, complete with emotion and intonation, from a mere three-second audio sample.

Some text-to-speech models require clean voice data from a recording studio to capture high-quality speech.

The model also carries over the acoustic environment of the prompt: if the voice snippet used as the acoustic prompt was recorded over the telephone, the synthesized speech will also sound as if it is coming through a phone.

If the three seconds of recorded voice in the acoustic prompt express anger, the speech synthesized from that voice will also sound angry.

A person’s voice could be captured and synthesized for use in a wide range of areas – from ads or spam calls to video games or chatbots.

Patrick Harr, CEO of anti-phishing firm SlashNext, told The Register that text-to-speech technology could also become yet another tool for cybercriminals, who could use it for vishing campaigns: attacks that use fraudulent phone calls or voice messages made to sound as if they come from a contact the victim knows.
