Leonie Watson just shared an interesting audio snippet on Mastodon:
What sounds like her speaking about accessibility is actually not Leonie, but an AI-generated synthetic voice, a cloned version of Leonie’s voice based on audio training data.
I knew that speech synthesis has been getting much better and that machine learning plays a big role in making synthetic voices sound more realistic, for example by adding modulation, variation, and emotion. But when you listen to some of the voice samples on play.ht, it is astonishing to hear just how accurate and realistic those voices, and especially cloned voices, have become. It is now almost impossible to distinguish a human speaker from a synthetic voice. Mind-blowing and, at the same time, scary and full of potential for misuse.
First of all, this is an amazing thing for everyone who, like Leonie, uses assistive technology to browse the Web, read content, and interact with any other kind of user interface based on audio output. Listening to robotic, unemotional voices everywhere isn’t fun. So imagine how beautiful the Web suddenly sounds once you let a warm, gentle, human-sounding voice – or maybe even a familiar voice like your very own – read the content for you. Imagine a chat where each contact’s voice is synthesized, so that people listening to messages will, very literally, hear the other person’s voice in their head – without the distracting artificial charm of Siri. Or imagine someone visiting your website and, if they like, listening to an audio version of all the posts you ever wrote – read by you. Or you – no, everyone! – speaking every language fluently in business calls. You can already use tools like resemble.ai to clone a voice based on three minutes of audio and use an API to integrate that voice into whatever you’re building.
But I’m sure that by now, you’ve also thought about a bunch of not so desirable futures. Voice bots calling you in the middle of the night, trying to sell you insurance (okay, this is already happening). Scammers calling your grandma. People canceling your orders or placing new ones on your behalf. Artists getting their voices pirated and used to create an endless stream of sameness. Basically everyone could be stealing and remixing each other’s voices, because they’re up for grabs. Prank’d!
(Related: I’m now pretty sure that my descendants will have a fully interactive virtual version of me in some kind of app that they can open whenever they miss me…)
All of this will happen in some way or another. But as with all things related to “AI”, it’s on us to decide which future we want to create.