TECH NEWS – VALL-E 2 remains a research project because Microsoft says it could pose a significant risk of malicious use.
The Redmond-based tech giant said in a blog post that its latest neural codec language model for speech synthesis “achieves human parity for the first time,” meaning it has become so sophisticated that its generated speech is nearly impossible to distinguish from that of a real person, and it can do this from a very limited sample and text prompt. Given just a few seconds of speech, VALL-E 2 draws on a large training library to match the pronunciation, intonation, and vocal variation of the sample, producing synthesized speech that sounds absolutely convincing.
In the blog post, Microsoft presents several examples of how the zero-shot TTS process can produce remarkably high-quality speech from 3-10 seconds of material. The post's ethics statement also deserves attention. In it, Microsoft states that it has no plans to release VALL-E 2 to the public: “VALL-E 2 is a research project only. At this time, we have no plans to incorporate VALL-E 2 into a product or to release it to the public. There may be potential risks in misusing the model, such as spoofing voice identification or impersonating a particular speaker. We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker consents to the use of his or her voice and a synthesized speech recognition model.”
Microsoft previously made a similar decision regarding VASA-1, a technology that can take a still image and create a video in which the person in the image moves convincingly. What remains unclear is what the company intends to do with this technology. Having built it, Microsoft will presumably use it for something, but if the public cannot use it, who will?