CAS Quarterly

Winter 2018


Once the network has appropriate weightings, it can start treating some inputs differently than others. It can make meaningful decisions. There's just one small problem: the system's programmers have no idea what those weightings should be. A newly designed network doesn't know anything.

School for neurons

Fortunately, the neural network can teach itself … with a little human help. You need a bunch of samples with known results. This could be hundreds of clips of production dialogue, each of which has had its spectrogram analyzed by a human operator to sort the actors' voices from the backgrounds.[5] The analysis is stored as an "answer key" with each sample.

Start by letting the network assign random initial values to its weightings. Feed it a test clip. Have the computer check how closely its result agrees with the answer key. If the result is wrong for any band (and it probably will be, since we're starting with random weights), the network decreases weightings on the path that fed that band. If a result is almost correct, it increases that path's weightings by a small amount and tries again. Eventually, the weights leading to a correct answer get optimized. It's similar to how human synapses get strengthened when we learn a physical skill. After lots of training passes and corrections, a neural network can get close enough to meet the design goals.[6] (A toy version of this loop is sketched at the end of this section.)

This training requires lots of samples and lots of repetition. Developers run the training passes through very fast computers, using their own custom server farms along with time on even bigger farms like Amazon Web Services. Building Audionamix's networks took as much as a week of initial training, followed by human design tweaks, followed by more training. iZotope created some 10 million individual weights during training; they represent about 30 megabytes of numeric data in the finished product.

Nature vs. Nurture

Algorithmic processing works as designed, every time. If it doesn't sound the way you expected, it's because you made a mistake in the design. Fix this part of the circuit, and you've solved the problem. Neural networks, on the other hand, can be only as good as their training. The sample set has to be valid for the processing goals and include enough samples of every possible condition. If you miss a condition while training, the network may behave unpredictably when it encounters that condition in the field. If the samples don't have enough variation (that is, if they all sound too similar), the network may fail with real-world inputs.

Basic speech recognition networks (like the one in your phone's digital assistant) get trained with immense libraries, compiled from countless phone calls in more than 100 languages by services like Google Voice and corrected by native speakers who used the service. Audio processing networks, like the ones in this article, have to contend with wide-bandwidth inputs, a much fuzzier definition of what's signal and what's noise, and an output that can't be defined as phonemes or words. These are different challenges. So while speech recognition is a fairly mature technology, film sound processing via neural networks is just getting off the ground.

Brains and personality

It takes more than just a well-trained network to make a successful processor. Real-world inputs have to be processed into columns of numbers the network can handle, and the output has to be turned into something the operator can use.
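For dialogue work, that front-end conversion is typically a spectrogram: a short-time Fourier transform that turns the clip into one column of band magnitudes per time slice. Here's a minimal sketch in Python using only numpy; it illustrates the idea, not any vendor's actual analysis chain.

```python
import numpy as np

def spectrogram(audio, frame_size=1024, hop=256):
    """Magnitude spectrogram: one column of band levels per time slice."""
    window = np.hanning(frame_size)
    frames = [audio[i:i + frame_size] * window
              for i in range(0, len(audio) - frame_size, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)).T

# One second of a fake 440 Hz tone at 16 kHz stands in for real dialogue.
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
clip = np.sin(2 * np.pi * 440.0 * t)
bands = spectrogram(clip)
print(bands.shape)   # (513 bands, 59 time slices)
```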
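And here is the toy training loop promised above. Commercial trainers use far more elaborate, calculus-based methods; this bare feed-check-nudge cycle just makes the idea concrete. Everything below (the tiny eight-band one-layer network, the synthetic "answer key") is a stand-in for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bands = 8

# Fake training set: 100 spectrogram slices with centered band levels,
# plus a human-style answer key (1.0 = dialogue in that band, 0.0 = not).
slices = rng.random((100, n_bands)) - 0.5
answers = (slices > 0.0).astype(float)

weights = rng.normal(size=(n_bands, n_bands))   # random initial values

def network(x, w):
    """One layer, squashed to the 0..1 range."""
    return 1.0 / (1.0 + np.exp(-(x @ w)))

for _ in range(200):                        # lots of training passes
    for x, key in zip(slices, answers):
        error = key - network(x, weights)   # how wrong is each band?
        # Nudge every weight feeding a wrong band, a little at a time.
        weights += 0.05 * np.outer(x, error)

score = np.mean((network(slices, weights) > 0.5) == answers)
print(f"agreement with the answer key: {score:.0%}")
```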
All of this input and output conversion is algorithmic processing, and it demands careful design on both sides of the network. At least in today's world, designing an audio neural network also involves compromises. You have to balance the size and construction of the network against available computer power, since the number of time-consuming math operations grows steeply as the matrix gets bigger. The input and output algorithms add to the computing load. The neural network products in this article are processor-intensive and, as yet, can't handle audio streams in real time.

Designers will consider other network topologies as well. The best choice for an application might not need all the connections we've drawn, or it might feed signals back to earlier layers. iZotope uses a relatively new recurrent architecture called Long Short-Term Memory[7] (a one-cell sketch appears after the footnotes).

So how well a neural network reaches its goals depends on the designers' experience and assumptions, lots of testing, and plenty of "secret sauce." iZotope's and Audionamix's implementations have similar goals. But the products are different.

[5] Or start with a big bunch of clean recordings, mix in your own noises, and consider the original unmixed dialogue to be the answer. This technique doesn't require as much skilled labor, and is often used in conjunction with standardized speech sample collections.

[6] If you kept training long enough and had a diverse enough sample library, the result could approach absolute perfection. In practical terms, however, training has to stop so a product can be brought to market. Think of any remaining confusion in the network as the logical equivalent of signal-to-noise.

[7] It's a confusing name, and its math is beyond me. You can see a discussion of the technique at www.wildml.com.
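The shortcut in footnote [5] is easy to script. A minimal sketch, again with synthetic stand-ins: in practice the clean dialogue and the noise would be loaded from real recordings, and the make_training_pair helper is a hypothetical name, not part of any product.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_training_pair(clean, noise):
    """Mix noise into clean dialogue at a random level; the unmixed
    original becomes the answer key for that sample."""
    noise = noise[:len(clean)]             # match lengths
    gain = rng.uniform(0.1, 0.8)           # vary the mix per sample
    noisy = clean + gain * noise
    peak = max(np.max(np.abs(noisy)), 1e-9)
    return noisy / peak, clean / peak      # keep levels consistent

# Stand-ins for, say, a dialogue track and a street-noise track loaded
# with an audio library; here, one second of synthetic audio at 16 kHz.
clean = 0.1 * rng.normal(size=16000)
noise = 0.3 * rng.normal(size=16000)
noisy_input, answer_key = make_training_pair(clean, noise)
```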
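And the Long Short-Term Memory of footnote [7], reduced to a single textbook cell: the gate structure below is the standard published recipe, but the sizes and random weights are placeholders, not iZotope's actual network.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_in, n_out = 513, 64    # e.g., spectrogram bands in, learned features out
W = 0.1 * rng.normal(size=(4, n_out, n_in + n_out))   # four gates' weights
b = np.zeros((4, n_out))

def lstm_step(x, h, c):
    """One time slice: the cell updates its long-term memory (c) and
    emits a new short-term output (h)."""
    z = np.concatenate([x, h])            # current input + last output
    f = sigmoid(W[0] @ z + b[0])          # forget gate: what to discard
    i = sigmoid(W[1] @ z + b[1])          # input gate: what to store
    g = np.tanh(W[2] @ z + b[2])          # candidate new memory
    o = sigmoid(W[3] @ z + b[3])          # output gate: what to reveal
    c = f * c + i * g                     # blend old and new memory
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(n_out), np.zeros(n_out)
for column in rng.random((59, n_in)):     # one spectrogram, slice by slice
    h, c = lstm_step(column, h, c)
```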
