Predictive Voice Masking for Privacy

Normally we want our devices to understand us, and they do a pretty good job of it. However, the proliferation of always-listening devices with internet-based Automatic Speech Recognition (ASR) capabilities also raises privacy concerns. The makers of the current market-leading devices all stress their privacy credentials, but what about other devices, bugs or malware?

A new paper from Columbia University seeks to address these concerns with what it describes as ‘real-time voice camouflage’.

To date, the main approaches to voice security have tried to confuse the ASR by playing white noise, random human crowd noise or speech similar to what was said recently. These work to an extent, but the lack of real-time contextual masking restricts their effectiveness.

To get round this, the Columbia researchers propose predictive speech masking. According to the paper:

We introduce predictive attacks, which are able to disrupt any word that automatic speech recognition models are trained to transcribe. Our approach achieves real-time performance by forecasting an attack on the future of the signal, conditioned on two seconds of input speech. Our attack is optimized to have a volume similar to normal background noise, allowing people in a room to converse naturally and without monitoring from an automatic speech recognition system.
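
To make the timing idea in that quote concrete, here is a minimal Python sketch of the streaming loop it implies: keep a rolling two-second window of what was just said, and use it to produce a quiet masking signal for the segment that has not been spoken yet. This is not the authors' implementation. The sample rate, the half-second segment length, the target volume and all function names are assumptions made for illustration, and `placeholder_predictor` simply emits scaled random noise where the paper uses a trained neural network.

```python
import math
import random
from collections import deque

SAMPLE_RATE = 16_000      # samples per second (assumed for this sketch)
CONTEXT_SECONDS = 2.0     # the paper conditions on two seconds of input speech
CHUNK_SECONDS = 0.5       # hypothetical length of each forecast segment
CONTEXT_LEN = int(SAMPLE_RATE * CONTEXT_SECONDS)
CHUNK_LEN = int(SAMPLE_RATE * CHUNK_SECONDS)


def rms(samples):
    """Root-mean-square amplitude, used here as a rough measure of volume."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def placeholder_predictor(context, rng, target_rms=0.01):
    """Hypothetical stand-in for the trained model.

    A real predictor would use `context` to forecast a mask tailored to the
    upcoming speech. This one ignores it and returns random noise for the
    next chunk, scaled to a quiet, background-like level (target_rms is an
    assumed value, not one taken from the paper).
    """
    noise = [rng.gauss(0.0, 1.0) for _ in range(CHUNK_LEN)]
    scale = target_rms / rms(noise)
    return [n * scale for n in noise]


def stream_masks(chunks, predictor, rng):
    """Yield (target_chunk_index, mask) pairs.

    After chunk i arrives, the mask is computed for chunk i + 1, i.e. for
    audio that has not been spoken yet, so it can be played in sync.
    """
    context = deque(maxlen=CONTEXT_LEN)  # rolling window of recent samples
    for i, chunk in enumerate(chunks):
        context.extend(chunk)
        if len(context) < CONTEXT_LEN:
            continue  # wait until two full seconds of speech have been heard
        yield i + 1, predictor(list(context), rng)
```

Feeding this loop three seconds of audio cut into half-second chunks yields nothing until the first two seconds have been heard, and after that one mask per incoming chunk, each aimed at the chunk that follows. Forecasting ahead in this way is what lets the mask be ready by the time the words are actually spoken, rather than arriving after them.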

The system isn’t perfect, especially with shorter words, but it masks spoken content better than other current systems. The authors acknowledge this limitation:

[A] potential adverse consequence is that our model is not 100% accurate, and people may rely on it when it might not be.

You can read the full paper here: Real-Time Neural Voice Camouflage.