An experiment by a journalist showed just how advanced technologies for digitally replicating a person have become, and how they can be used to fool not only a person's relatives but also a bank's voice-identification system.
Joanna Stern, a columnist for the Wall Street Journal, set out to learn how natural a digital avatar of a person could be when created with advanced generative-AI algorithms. The result of the experiment proved unsettling: Joanna's digital clone managed to fool her relatives and bypass a bank's voice-identification system.
To create her virtual avatar, Joanna used Synthesia, a tool its developers position as a service for building talking digital avatars from video and audio recordings of real people. Once an avatar is created, the user can enter any text, which the virtual clone obediently repeats. As source material for training the algorithm, Joanna provided 30 minutes of video and about two hours of audio recordings of her voice.
The startup Synthesia charges customers $1,000 a year to create and maintain a virtual avatar, plus an additional monthly fee. After a few weeks, Joanna's digital clone was ready, and she began testing it.
Joanna generated a script for a TikTok video using ChatGPT and uploaded it to Synthesia, which produced the finished video with her avatar. When she saw it, Joanna was stunned: it was as if she were looking at her own reflection in a mirror.
At this stage, however, the technology is far from perfect. While the avatar sounds convincing enough when speaking short sentences, longer phrases give away that they were not spoken by a human. Not all TikTok viewers looked closely, but some noticed that the avatar-generated video seemed unnatural.
A similar problem arose when she tried using the digital avatar in Google Meet video calls. Besides its poor delivery of long sentences, the avatar held a perfect pose the entire time and barely moved.
Video avatars will undoubtedly grow more sophisticated in the near future, though. Synthesia already has several beta avatars that can nod, raise and lower their eyebrows, and perform a few other human movements.
After testing the video avatar, Joanna tried out a voice clone built with ElevenLabs' generative-AI algorithm. About 90 minutes of voice recordings were uploaded to the service, and in less than two minutes the voice clone was ready. The audio avatar can read any text aloud in the user's voice. ElevenLabs charges customers $5 per month to create a voice clone.
Compared with Synthesia's video avatar, the audio clone comes across as much more human. It adds intonation to the speech, and its delivery of the text is smoother.
Joanna first called her sister and used the voice clone to talk with her. Her sister didn't notice the ruse right away, but after a while she sensed that the voice on the line never paused to breathe. Joanna then called her father and asked him to remind her of her Social Security number. He, however, caught on to the trick because Joanna's voice sounded like a recording.
Joanna's virtual avatar also called Chase Bank customer support. The clone answered several questions in the bank's automated voice-identification process, and after a short conversation Joanna's avatar was put through to a bank representative, since the voice-recognition system had detected no problems.
A Chase spokesperson later said the bank uses voice recognition along with other tools to verify a customer's identity. The bank clarified that voice identification only lets customers reach a support employee; it cannot be used to authorize a transaction or any other operation.
The voice generated by the ElevenLabs service turned out to be remarkably close to Joanna's own, down to her intonation and other speech characteristics. To create such a voice clone, it is enough to upload a few audio recordings to the service and accept the platform's rules, under which the user undertakes not to use the algorithm for fraudulent purposes. In practice, this means anyone can easily generate the voice of a friend or a celebrity.
An ElevenLabs representative says the company allows only paid account holders to clone voices, and accounts that violate the policy are blocked. In addition, the developers plan to release a tool that checks whether a given audio clip was created with the ElevenLabs algorithm.
The company claims it can identify any content generated by its users, allowing it to filter such material or take other measures against violators, including cooperating with law enforcement.
Joanna, for her part, acknowledges that none of the algorithms she used can yet produce a copy indistinguishable from the original. ChatGPT generates text without drawing on the journalist's knowledge and experience. Synthesia creates an avatar that, while it looks like the person, cannot convey all of the user's characteristic traits. And ElevenLabs generates speech very close to the original, but it isn't perfect either.
AI technologies will keep advancing, and it will likely become increasingly difficult to tell a virtual avatar from a real person in conversation.