Digital audio is like gravity; it works whether you understand it or not, but knowing about a few basic principles can make it easier to live with.
Writers explaining digital audio sampling often use an illustration of a sine wave with vertical bars drawn on it, appearing as a series of "steps". Digital audio has been part of everyday life for decades, yet that picture still causes major confusion and fuels one of the most prevalent myths: that a greater sample rate and a longer word length (bit depth) give better audio resolution.
First, let's take the 50,000 ft view of the stages of a recording made with a microphone:
· Digital audio is like film: you take lots of discrete "snapshots", and when you play them back you see a moving picture. The more frequently you take the snapshots, the smoother the picture will be.
· The more frequently you sample an audio source the smoother the playback will be.
No. No. No.
Firstly, digital audio is NOT like film. A digital audio system will measure a source (Analogue to Digital (AD) conversion), and on playback (DA conversion) it will RECONSTRUCT the source. You don't hear the samples, you hear the reconstruction of the source. With film, you actually see the snapshots, and because your mind is too slow (no offence) to process each image individually, you perceive a flowing sequence of events. More frames per second gives a smoother image. There is NO reconstruction; you construct the moving image yourself.
How does reconstruction work, and why doesn't a higher sample rate (more "snapshots") give a smoother, more accurate reconstruction? Let's completely ignore the real world for a moment: grab a piece of paper and a pen (mental paper and pen will do) and draw four dots at the corners of a square 20 or so cm apart (the dimensions don't really matter, and you can use inches if you want to). Take a straight edge and lay it across the top two dots, then add another bunch of dots along the straight line - as many as you want. All set? Join all the dots in the upper line, and then join the two dots in the lower line so that you have two horizontal parallel lines on a sheet of paper.

Notice how the upper line is smoother and higher resolution than the lower one? No? It must be - it's got more points on it?!? Of course it isn't - the lines are EXACTLY equally smooth and high definition, because you knew that each was going to be a straight line; you just needed to be told where it was. This is what reconstruction does, just with an additional several pages of maths, including healthy doses of Nyquist and the rather marvellous sinc function (check out Hugh and Dan), so that you can do it for complex audio waveforms. So long as certain conditions are met (the sampling frequency must be at least twice the highest frequency being sampled) and a bit of engineering gets done properly, you really do get a PERFECT reconstruction of the sampled wave. No steps!

There are benefits to higher sampling rates: filtering junk that we can't process out of the signal is easier, and there are signal to noise ratio benefits, but you won't get "higher resolution" audio below 20 kHz, and modern Delta-Sigma converters do a pretty fine job anyway.
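The dots-and-straight-edge idea can be played with numerically. Below is a minimal sketch (in Python; the names and values are mine, not anything the article prescribes) of Whittaker-Shannon reconstruction: each sample contributes one sinc pulse, and summing them recovers the waveform's value *between* the samples - no steps.

```python
import numpy as np

# Sample a 1 kHz sine at 48 kHz (comfortably above the 2 kHz Nyquist minimum).
fs = 48_000          # sample rate, Hz
f = 1_000            # signal frequency, Hz
n = np.arange(64)    # sample indices
samples = np.sin(2 * np.pi * f * n / fs)

def reconstruct(t, samples, fs):
    """Whittaker-Shannon interpolation: one sinc pulse per sample, summed."""
    idx = np.arange(len(samples))
    return np.sum(samples * np.sinc(fs * t - idx))

# Evaluate the reconstruction BETWEEN two samples.
t = 10.5 / fs                          # halfway between samples 10 and 11
true_value = np.sin(2 * np.pi * f * t)
err = abs(reconstruct(t, samples, fs) - true_value)
print(err)  # tiny - only finite-length edge effects, not "steps"
```

With an infinitely long sample stream the error would be exactly zero; the small residual here comes only from truncating the sinc sum to 64 samples.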
How frequently we sample is controlled by a clock, and so long as the clock is perfect the samples will be measured at exactly the right time (if it's not, you'll get all jittery, but that's another story). But what does "sampled" mean? It means measured. Imagine our infamous sine wave drawing. At time 0, measure the vertical height of the line; at time 1, do it again, and so on. So long as the samples are measured at exactly the right time, and the measurements are exact, we can reconstruct a perfect image of the source.
The problem is that we store those measurements as digital data, and we have a limited number of values available to us (OK - it's a big limit). Just in case anyone's not sure: a word in this sense is made of a number of bits (BInary digITs), and just like the decimal numbers we use daily, the digits on the left are more significant than those on the right. In the number 9,876,543, the 9 million is more significant than the 3 units; the 9 is the MSD (most significant digit) and the 3 is the LSD (least significant digit). Likewise, in the 8 bit binary word 10111111 the least significant bit (LSB) is the one on the right (and yes, I know, I just read that back - sorry). OK.
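To make bit significance concrete, here's a tiny illustration (Python), using the 8 bit word from the text:

```python
word = 0b10111111           # the 8-bit word from the text (decimal 191)
msb = (word >> 7) & 1       # most significant bit - worth 128
lsb = word & 1              # least significant bit - worth 1

# Flipping the LSB changes the value by 1; flipping the MSB changes it by 128.
print(word)                 # 191
print(word ^ 0b00000001)    # 190
print(word ^ 0b10000000)    # 63
```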
If we have a 16 bit word in which to store the value of that measurement, then we have 2 to the power of 16 (65,536) values available. This has to cover both positive and negative measurements, so let's say we have a little over 32,000 values for each side of the sine wave (above and below the "zero" line). These have to store the values of all signals between "none" - 0 Volts at the ADC - and the maximum that the converter can handle, and we simply choose the nearest one to the true value. A pure 24 bit system (in reality they're equivalent to about 20 bits, but let's not spoil a good story) has 2 to the power of 24 (over 16 million) values, so let's say over 8 million values per side of the sine wave to cover the same range. Again, we get the nearest fit to the true value and store that. Aha - so the longer word length produces higher resolution output?!?!
Nope. What you get is a perfectly reconstructed waveform which has errors in it. These errors are called quantisation errors, they manifest as quantisation distortion, and they are the difference between the real measured value and the nearest discrete value that we can store. The size of the errors is proportional to the difference between any two adjacent values that can be represented. The lowest and highest voltages produced are the same for either system, but the 24 bit system has more discrete values between zero and full-on, so the stored values will tend to be closer to the true values. Hence, the 24 bit system has a lower level of errors and a lower level of distortion.
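The effect is easy to show numerically. This sketch (the `quantise` helper is hypothetical - just rounding a full-scale value to the nearest representable level) compares the error for 16 and 24 bit words:

```python
def quantise(x, bits):
    """Round x (in the range -1.0..1.0 of full scale) to the nearest
    of 2**bits evenly spaced levels."""
    levels = 2 ** (bits - 1)      # levels per side of the zero line
    return round(x * levels) / levels

x = 0.300000123                   # an arbitrary "true" analogue value
err16 = abs(quantise(x, 16) - x)
err24 = abs(quantise(x, 24) - x)
print(err16, err24)
# The 24-bit error is roughly 256 times smaller: lower distortion,
# not a "higher resolution" reconstructed waveform.
```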
Except - and this is the bit that often gets, sort of, missed out because it's hard to get your head around - those errors are actually fixed by a process called "dither". Many articles that go into this stuff tell you that "dither adds low level noise" and that "this removes quantisation errors". Dither toggles the LSB(s) and so essentially moves the quantisation levels around at random, so that all errors become uncorrelated and thus noise-like. But there's still this slight nagging feeling that there's a logic diagram with a box that says "then a miracle happens". Let's have a go at explaining it. Again, to be clear, think of this as a mind experiment to illustrate a point, not a real engineering solution.
If we apply dither to an analogue signal at the AD stage, we add low level noise that causes the least significant bit(s) of every word, irrespective of its value, to toggle on and off (in reality analogue noise is sufficiently high that several bits will be affected anyway). Any instantaneous observation of the LSB could see it as 1 or 0, but over a suitable period of time it will average such that the mean value sits between the 0 and 1 values, with the probability of its value determined by the true analogue value. As an example, if a 0 in a 1 bit system represented 0 Volts, and a 1 represented 1 Volt, and the analogue value was 0.5 V, then the LSB would be a 1 for half the samples and a 0 for the other half; if the analogue value was 0.75 V, then the LSB would be a 1 for 75% of the samples and a 0 for 25%. This occurs for every level of signal being coded, from no signal to the highest the ADC will handle. When you reconstruct this example, you get a value of exactly 0.75 V out of a system that can store values equivalent to only 0 V or 1 V.
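The 1 bit mind experiment is easy to simulate. In this sketch (hypothetical code, not a real converter design), uniform noise spanning one LSB is added in front of a comparator, so the probability of reading a 1 equals the analogue value, and the long-run average recovers it:

```python
import random

def one_bit_adc(v, rng):
    """Dithered 1-bit 'converter': v is an analogue value in 0..1 V.
    Uniform noise spanning one LSB is added before the comparator, so
    P(output == 1) equals v itself."""
    return 1 if v + (rng.random() - 0.5) > 0.5 else 0

rng = random.Random(42)
v = 0.75                       # the 0.75 V example from the text
n = 200_000
mean = sum(one_bit_adc(v, rng) for _ in range(n)) / n
print(mean)   # very close to 0.75, from a system that stores only 0 or 1
```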
If word length increases, we add bits. The analogue values to be recreated by the DAC still have the same limits. The process of word length expansion rewrites individual sample values so that they represent the same "analogue" values; there are simply more values available between the ones actually used. This is no problem: the data will sit quite happily until something is changed - processed - at which time new values will be written as required.
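Word length expansion really is that benign: pad the word with low-order zeros and the sample still represents the same fraction of full scale. A tiny sketch (the values are arbitrary):

```python
s16 = 12345                 # a 16-bit sample value
s24 = s16 << 8              # the same sample rewritten as a 24-bit word

# Both words represent exactly the same fraction of full scale.
print(s16 / (1 << 15), s24 / (1 << 23))   # identical
```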
If we reduce word length, we lose data. The highest and lowest values must represent the same analogue values, so we have fewer discrete values between them, and so the quantisation errors will be bigger. This is truncation. When we dither in the digital domain, a noise-like signal is added to the LSBs, and that combination is then used to form the new LSB of the truncated sample word. As we are losing data - in the case of reduced word length - the probability of the LSB being toggled high or low is affected by the more precise information that we are about to lose. The "noise" is added to the data and is stored in the new file.
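Here's a sketch of the difference between bare truncation and dithered truncation when dropping 8 bits (24 bit to 16 bit). For brevity it uses flat (rectangular) dither spanning one new LSB; real mastering dither is usually TPDF, but the averaging effect it illustrates is the same:

```python
import random

def truncate(sample, drop_bits):
    """Plain truncation: simply discard the low-order bits."""
    return sample >> drop_bits

def dither_truncate(sample, drop_bits, rng):
    """Add flat noise spanning one NEW LSB before truncating, so the
    probability of the new LSB landing high reflects the bits we discard."""
    return (sample + rng.randrange(1 << drop_bits)) >> drop_bits

rng = random.Random(1)
drop = 8                        # 24-bit -> 16-bit
s = 100                         # a constant signal smaller than the new LSB (256)
n = 100_000

plain = truncate(s, drop)       # always 0: the signal vanishes entirely
avg = sum(dither_truncate(s, drop, rng) for _ in range(n)) / n
print(plain, avg * (1 << drop)) # dithered average recovers roughly 100
```

Truncation buries the low-level signal; dithered truncation turns the same error into noise whose average still carries the signal.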
Dither is what makes digital audio work. It's almost, but not quite, perfect. It fixes (it doesn't mask them - it really fixes them) quantisation errors, but it adds some noise. The noise added is uncorrelated with the audio and is far less offensive than correlated distortion; it's also low level, can be "noise shaped" away from the parts of the audio band that we are most sensitive to, and in higher sample rate systems can be pushed into a frequency range that we can't hear. Dither is what lives in the "And then a miracle happens" box.
And it's as simple as that!