CAS Quarterly

Spring 2018

…microphone rustle, to RX. That's where we came to the limit of our explanatory skills. With some problems it's easy enough: clicks, for example, show up as vertical lines on the spectrogram, so you can write a program like De-Click. But if you look at a problem like lavaliere microphone rustle, you quickly conclude that it looks very much like speech on the spectrogram. The two occupy a similar frequency range and both often occur in bursts; visually, rustle often resembles speech sibilants. So, instead of trying to write a step-by-step rustle-removal algorithm, we decided to let the machine itself learn the differences between rustle and speech.

CAS: What is Machine Learning?

AL: Machine Learning is a technology that allows us to skip the task of explaining to the machine, step by step, how to solve a certain problem. All you need to do is show the machine a lot of examples of speech and rustle, and it begins to make connections on its own, identifying which shapes on the spectrogram look more like speech and which look more like rustle. Once it knows which shapes are which, it can attenuate only the rustle portions of the spectrogram.

CAS: How does a machine learn?

AL: To train the De-Rustle module, we built two databases. The first was clean speech: many hours of it, from different sources. We gathered as much clean speech as we could because we needed to show the machine a great variety of desirable results (what clean speech looks like). The second database was all sorts of lavaliere microphone rustle. This is where we got into trouble, because there is no known database of isolated lavaliere microphone rustle. So we had to do a lot of in-house recording, rubbing all sorts of microphones against different kinds of clothing. Our sound designer did that for several days.

Once those two databases were in place, we started training the neural network to recognize the patterns of speech and the patterns of rustle. The neural net is like an artificial brain, with neurons that receive, process, and output information. Training adjusts the connections between neurons to steadily improve the rate of recognition. Like a baby who learns to connect sounds into speech, the neural net, with enough training, starts to identify phonemes, words, and sentences. Our neural net has about 10 million neurons. That's a good-sized brain, for a frog!

We train the neural net by randomly combining segments from the speech database with segments from the rustle database in all possible combinations. This ensures that the neural net sees a great variety of real-life mixes during training and always knows the correct answer (the clean speech). Over time, and millions of examples, the net starts to see the differences between rustle and speech. After it's trained, the neural net can be deployed on the user's machine to run locally.
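[Editor's note: the mixing scheme described above can be sketched in a few lines of Python. This is a minimal illustration only; the function names, segment lengths, and SNR range below are our assumptions, not iZotope's actual training pipeline.]

```python
import numpy as np

def make_training_pair(speech, rustle, rng, snr_db_range=(-5.0, 20.0)):
    """Mix a clean-speech segment with a rustle segment at a random
    signal-to-noise ratio. Returns (noisy_input, clean_target); the
    net is trained to recover the clean target from the noisy input."""
    n = min(len(speech), len(rustle))      # trim to a common length
    speech, rustle = speech[:n], rustle[:n]

    # A random SNR per example exposes the net to everything from
    # heavy rustle to barely audible rustle.
    snr_db = rng.uniform(*snr_db_range)
    speech_power = np.mean(speech**2) + 1e-12
    rustle_power = np.mean(rustle**2) + 1e-12
    gain = np.sqrt(speech_power / (rustle_power * 10.0**(snr_db / 10.0)))

    return speech + gain * rustle, speech

# Random-noise arrays stand in for the two recorded databases.
rng = np.random.default_rng(0)
speech_db = [rng.standard_normal(16000) for _ in range(3)]
rustle_db = [rng.standard_normal(16000) for _ in range(3)]
noisy, clean = make_training_pair(
    speech_db[rng.integers(len(speech_db))],
    rustle_db[rng.integers(len(rustle_db))],
    rng,
)
```

Because segments and SNRs are drawn at random, the net effectively sees an endless stream of distinct (noisy, clean) pairs built from two finite databases.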
CAS: Which other RX modules incorporate Machine Learning?

AL: Currently, the only modules in RX 6 based on Machine Learning are De-Rustle and Dialogue Isolate. Dialogue Isolate removes non-stationary noises from speech. Such noises are often difficult to identify on a spectrogram. Although RX offers a lot of ways to identify problems and separate noises from speech, none of the traditional approaches works well when you have a non-stationary noise in a mono signal. This is where Machine Learning helps us, because the machine can learn what typical speech looks like.

Machine Learning incorporates the concept of generalization. It's impossible to show the neural net every possible combination of speech and noise it may see in real life. But after seeing thousands and millions of examples during training, the machine learns how to look at new material and determine what we want to keep and what we want to extract. While the RX neural network has never seen Japanese, Russian, or French during the training phase, it can generalize what it knows of the English language to isolate speech in those languages as well.

CAS: Once RX has utilized its neural net and determined what stays and what goes, then what?

AL: The neural net determines the probability of each point in the spectrogram being speech or rustle. This can be viewed as a Spectral Mask. Once we have the Spectral Mask, we apply it to attenuate the spectrogram according to those probabilities: points we know are speech are kept, and points we know are noise are attenuated.

CAS: When you have a Spectral Mask comprised of points to be attenuated or boosted, how many points are we talking about?

AL: Usually, one second of input speech has around 100,000 points. That's about as many pixels as you see in RX's spectrogram.
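[Editor's note: that point count is consistent with a typical short-time Fourier transform. For example, a 2,048-point analysis window with a 512-sample hop at 48 kHz gives 1,025 frequency bins at roughly 94 frames per second, or about 96,000 time-frequency points per second; these exact settings are our assumption, not RX's. The masking step itself is simple once the probabilities exist, as in this Python sketch, where `mask_fn` stands in for the trained network.]

```python
import numpy as np
from scipy.signal import stft, istft

def apply_spectral_mask(noisy, mask_fn, fs=48000, nperseg=2048, hop=512):
    """Attenuate a signal's spectrogram point by point with a mask of
    speech probabilities in [0, 1]: 1 keeps a point, 0 removes it."""
    noverlap = nperseg - hop
    _, _, Z = stft(noisy, fs=fs, nperseg=nperseg, noverlap=noverlap)

    # Z has ~1,025 bins x ~94 frames per second of audio, on the
    # order of the 100,000 points per second quoted above.
    mask = mask_fn(np.abs(Z))          # per-point speech probabilities

    # Speech-like points pass through; rustle-like points are attenuated.
    _, cleaned = istft(Z * mask, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return cleaned

# Toy stand-in for the net: keep points above each frame's median level.
def demo_mask(mag):
    return (mag > np.median(mag, axis=0, keepdims=True)).astype(float)

cleaned = apply_spectral_mask(
    np.random.default_rng(0).standard_normal(48000), demo_mask)
```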
CAS: That's a lot of computations!

AL: Luckily for RX users, the majority of the computation takes place during the training of the neural net. It trains for hundreds of hours, examining millions of examples of noise and clean speech. But once the neural net is trained, the processing stage runs much faster because it doesn't have to train further.