Look Once to Hear - A Spy's Dream Come True

Written by Harry Fairhead

Sunday, 23 June 2024

Deep learning has triumphed again. You can don a pair of headphones, look at a person talking and from then on the system will track the person so you can hear them as they move away or become swamped in noise. It's the ultimate cocktail party effect.

A team from the Paul G. Allen Center for Computer Science & Engineering, University of Washington, has done something I personally would have assumed impossibly difficult. Past work proved that it was possible to track a speaker but, only if a clean high-quality recording of their voice was already available. Even this is a difficult task without AI. Signal processing algorithms aren't easy to implement and trying to extract the features necessary to identify a speaker is very difficult. But we don't have to - AI can do the job for us.

The new approach makes use of AI to both find the important features of a specified speaker and to track them. A beam forming microphone array is used to pick up audio in the direction that the user is looking. As the user is looking at the target there should be no time lag between each ear and this can be used to select the target signal. A pretrained neural network extracts the characteristics of the target speaker and this is then fed into a second neural network that tracks the target without the assumption that the user is lookng directly at them.

listen

This all sounds very computationally expensive, but the whole thing works in realtime running on an Orange Pi 5B - which is a very low-cost IoT device. The system takes 5.47ms to process an 8ms chunk of audio - which is remarkable and leaves space, or rather time, for extras. The speed was obtained by converting a PyTorch version to an ONNX model.

That it works is evident in this video:

This is a first step on an interesting road. As well as allowing communication in difficult situations and its potential to help hearing impaired people follow a conversation, it could be developed and integrated with larger systems. You could add a speech recognition network and produce a transcript. With some tweaking and improvement it would be a gift to any spook. What could be an easier way to bug a situation than to simply look at the person you want to eavesdrop on and then turn away and look completely disinterested.

If you are attracted by trying to implement any of these, and more ideas, the good news is that the code is open source and available on GitHub.

listenicon

More Information

Look Once to Hear: Target Speech Hearing with Noisy Examples

Bandhav Veluri, Malek Itani, Tuochao Chen and Takuya Yoshioka

The paper won Best Paper Honorable Mention at CHI 2024.

Whisper - Open Source Speech Recognition You Can Use

Speech2Face - Give Me The Voice And I Will Give You The Face

To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on Twitter, Facebook or Linkedin.

JetBrains Junie and AI Assistant Expand Reach
24/04/2025

All JetBrains AI tools, including the coding agent Junie and its improved AI Assistant are now available within its IDEs under a single subscription and come with a free tier.

+ Full Story

Understanding GPU Architecture With Cornell
11/04/2025

Find out everything there's to know about GPUs. This Cornell Virtual Workshop will be helpful for those who program in CUDA.

+ Full Story

More News

Comments

or email your comment to: comments@i-programmer.info

Last Updated ( Sunday, 23 June 2024 )

More Information

Related Articles

Comments