Deep learning turns mono recordings into immersive sound
- by 7wData
Listen to a bird singing in a nearby tree, and you can relatively quickly identify its approximate location without looking. Listen to the roar of a car engine as you cross the road, and you can usually tell immediately whether it is behind you.
The human ability to locate a sound in three-dimensional space is extraordinary. The phenomenon is well understood—it is the result of the asymmetric shape of our ears and the distance between them.
But while researchers have learned how to create 3D images that easily fool our visual systems, nobody has found a satisfactory way to create synthetic 3D sounds that convincingly fool our aural systems.
Today, that looks set to change, at least in part, thanks to the work of Ruohan Gao at the University of Texas at Austin and Kristen Grauman at Facebook AI Research. They have used a trick that humans also exploit to teach an AI system to convert ordinary mono sound into pretty good 3D sound. The researchers call the result 2.5D sound.
First some background. The brain uses a variety of clues to work out where a sound is coming from in 3D space. One important clue is the difference between a sound’s arrival times at each ear—the interaural time difference.
A sound produced on your left will obviously arrive at your left ear before the right. And although you are not conscious of this difference, the brain uses it to determine where the sound has come from.
Another clue is the difference in volume. This same sound will be louder in the left ear than in the right, and the brain uses this information as well to make its reckoning. This is called the interaural level difference.
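Both cues can be measured directly from a two-channel recording. The sketch below (an illustration, not part of the researchers' system) estimates the interaural time difference from the peak of the cross-correlation between the channels, and the level difference from the ratio of their RMS energies; the noise burst and the 20-sample, 6 dB offsets are made-up test values.

```python
import numpy as np

def estimate_itd_ild(left, right, sample_rate):
    """Estimate the interaural time difference (seconds) and the
    interaural level difference (dB) from a two-channel signal.
    Positive ITD means the sound reached the left ear first."""
    # ITD: lag of the cross-correlation peak between the two channels.
    corr = np.correlate(right, left, mode="full")
    lag = np.argmax(corr) - (len(left) - 1)
    itd = lag / sample_rate
    # ILD: ratio of the channels' RMS energies, in decibels.
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    ild = 20 * np.log10(rms(left) / rms(right))
    return itd, ild

# Synthetic check: a noise burst that arrives 20 samples earlier
# and 6 dB louder at the left ear, as if the source were on the left.
rng = np.random.default_rng(0)
mono = rng.standard_normal(4096)
left = np.concatenate([mono, np.zeros(20)])
right = np.concatenate([np.zeros(20), mono]) * 10 ** (-6 / 20)
itd, ild = estimate_itd_ild(left, right, sample_rate=44100)
```

At 44.1 kHz, the recovered 20-sample lag corresponds to roughly 0.45 ms, comfortably inside the sub-millisecond range of real interaural time differences.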
These differences depend on the distance between the ears. Ordinary stereo recordings do not reproduce them faithfully, because the separation between stereo microphones does not match the separation between human ears.
The way sound interacts with the ear flaps is also important. The flaps distort the sound in ways that depend on the direction it arrives from. For example, a sound from the front reaches the ear canal before hitting the ear flap. By contrast, the same sound coming from behind the head is distorted by the ear flap before it reaches the ear canal.
The brain can sense these differences too. In fact, the asymmetric shape of the ear is the reason we can tell when a sound is coming from above, for example, or from many other directions.
The trick to reproducing 3D sound artificially is to reproduce the effect that all this geometry has on sound. And that’s a tough problem.
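The two timing and loudness cues alone already go a long way. Below is a crude panner (a sketch for illustration, not the researchers' method) that imposes an ITD and an ILD on a mono signal. The head radius, the Woodworth spherical-head approximation of the ITD, and the 6 dB level difference at 90 degrees are all assumptions, and pinna filtering is ignored entirely, so the result conveys left/right but not front/back or elevation.

```python
import numpy as np

HEAD_RADIUS = 0.0875    # metres, a typical adult head (assumption)
SPEED_OF_SOUND = 343.0  # m/s in air at room temperature

def pan_mono(mono, azimuth_deg, sample_rate, ild_db_at_90=6.0):
    """Place a mono source at the given azimuth (0 = straight ahead,
    +90 = hard left) by delaying and attenuating the far-ear channel."""
    theta = np.radians(azimuth_deg)
    # Woodworth spherical-head approximation of the ITD.
    itd = HEAD_RADIUS / SPEED_OF_SOUND * (np.sin(abs(theta)) + abs(theta))
    delay = int(round(itd * sample_rate))
    # Level difference grows as the source moves off-centre.
    gain_far = 10 ** (-ild_db_at_90 * abs(np.sin(theta)) / 20)
    near = np.pad(mono, (0, delay))
    far = np.pad(mono, (delay, 0)) * gain_far
    if azimuth_deg >= 0:  # source on the left: left ear is nearer
        return np.stack([near, far], axis=1)
    return np.stack([far, near], axis=1)

# Demo: an impulse placed hard left at 44.1 kHz.
impulse = np.zeros(64)
impulse[0] = 1.0
stereo = pan_mono(impulse, azimuth_deg=90.0, sample_rate=44100)
```

For the hard-left impulse, the left channel fires immediately while the right channel gets a copy delayed by about 0.66 ms and 6 dB quieter, which is exactly the cue pair described above.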
One way to measure the distortion is with binaural recording. This is a recording made by placing a microphone inside each ear, which can pick up these tiny variations.
By analyzing the variations, researchers can then reproduce them using a mathematical function known as a head-related transfer function. That turns an ordinary pair of headphones into an extraordinary 3D sound machine.
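In code, applying a head-related transfer function comes down to convolving the mono signal with a pair of head-related impulse responses (the time-domain form of the HRTF), one per ear. The sketch below assumes `hrir_left` and `hrir_right` hold measured responses for the desired source direction; the single-spike responses at the end are toy stand-ins for illustration, not real measurements.

```python
import numpy as np

def binauralize(mono, hrir_left, hrir_right):
    """Render a mono signal binaurally by convolving it with a pair
    of head-related impulse responses, one for each ear."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    # The two responses may have different lengths; pad to match.
    n = max(len(left), len(right))
    left = np.pad(left, (0, n - len(left)))
    right = np.pad(right, (0, n - len(right)))
    return np.stack([left, right], axis=1)

# Toy stand-in HRIRs: the right ear hears a delayed, quieter copy,
# as it would for a source off to the left.
impulse = np.zeros(100)
impulse[0] = 1.0
hrir_l = np.array([1.0])
hrir_r = np.concatenate([np.zeros(4), [0.5]])
rendered = binauralize(impulse, hrir_l, hrir_r)
```

Real HRIRs are a few hundred samples long and encode the pinna's direction-dependent filtering as well as the timing and level cues, which is what lets headphone playback place sounds above, below, or behind the listener.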
But because everybody’s ears are shaped differently, everybody hears sound in a slightly different way, and a transfer function derived from one person’s ears never sounds quite right to another.