On January 31, 2008, the US Patent & Trademark Office published Apple’s patent application titled Determining scale factor values in encoding audio data with AAC. Apple’s patent generally relates to digital audio processing and, more specifically, to rate-distortion control by optimizing the selection of scale factor values when encoding audio data.
Patent: Determining Scale Factor Values in Encoding Audio Data with AAC
Apple’s Patent Background
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it is not to be assumed that any of the approaches described in this section qualify as prior art, merely by virtue of their inclusion in this section.
Audio coding, or audio compression, algorithms are used to obtain compact digital representations of high-fidelity (i.e., wideband) audio signals for the purpose of efficient transmission and/or storage. A central objective in audio coding is to represent the signal with a minimum number of bits while achieving transparent signal reproduction, i.e., while generating output audio which cannot be humanly distinguished from the original input, even by a sensitive listener.
Advanced Audio Coding (“AAC”) is a wideband audio coding algorithm that exploits two primary coding strategies to dramatically reduce the amount of data needed to convey high-quality digital audio. Signal components that are “perceptually irrelevant” and can be discarded without a perceived loss of audio quality are removed. Further, redundancies in the coded audio signal are eliminated. Hence, efficient audio compression is achieved by a variety of perceptual audio coding and data compression tools, which are combined in the MPEG-4 AAC specification. The MPEG-4 AAC standard incorporates MPEG-2 AAC, forming the basis of the MPEG-4 audio compression technology for data rates above 32 kbps per channel. Additional tools increase the effectiveness of AAC at lower bit rates, and add scalability or error resilience characteristics. These additional tools extend AAC into its MPEG-4 incarnation (ISO/IEC 14496-3, Subpart 4).
AAC is referred to as a perceptual audio coder, or lossy coder, because it is based on a listener perceptual model, i.e., what a listener can actually hear, or perceive. A common problem in perceptual audio coding is bitrate control. According to the concept of Perceptual Entropy, the information content of an audio signal varies dependent on the signal properties. Thus, the required bitrate to encode this information generally varies over time. For some applications bitrate variations are not an issue. However, for many applications a firm control of the instantaneous and/or average bitrate is desired.
The three basic bitrate modes for audio coding are CBR (constant bitrate), ABR (average bitrate) and VBR (variable bitrate). CBR is important to bitrate-critical applications, such as audio streaming. Unlike CBR, in which bitrates are strictly constant at each instance, ABR allows a variation of bitrates for each instance while maintaining a certain average bitrate for the entire track, thereby resulting in a reasonably predictable size to the finished files. As the name indicates, VBR allows the bitrate to vary significantly; however, the sound quality is consistent.
A CBR codec is constant in bitrate along an audio time signal, but is typically variable in sound quality. For example, for stereo encoding at a bitrate of 96 kb/s, an encoded speech track, which is “easy” to encode due to its relatively narrow frequency bandwidth, sounds indistinguishable from the original source of the track. However, noticeable artifacts could be heard in similarly encoded complex classical music, which is “difficult” to encode due to a typically broad frequency bandwidth and, therefore, more data to encode.
Simultaneous Masking is a frequency domain phenomenon where a low level signal, e.g., a narrow-band noise (the maskee) can be made inaudible by a simultaneously occurring stronger signal (the masker). A masked threshold can be measured below which any signal will not be audible. The masked threshold depends on the sound pressure level (SPL) and the frequency of the masker, and on the characteristics of the masker and maskee. If the source signal consists of many simultaneous maskers, a global masked threshold can be computed that describes the threshold of just noticeable distortions as a function of frequency. The most common way of calculating the global masked threshold is based on the high resolution short term energy spectrum of the audio or speech signal.
Coding audio based on a psychoacoustic model encodes audio signals above a masked threshold block by block. Therefore, if distortion (typically referred to as quantization noise), which is inherent to an amplitude quantization process, is under the masked threshold, a typical human cannot hear the noise. A sound quality target is based on a subjective perceptual quality scale (e.g., from 0-5, with 5 being best quality). From an audio quality target on this perceptual quality scale, a noise profile, i.e., an offset from the applicable masked threshold, is determinable. This noise profile represents the level at which quantization noise can be masked, while achieving the desired quality target. From the noise profile, appropriate quantization step sizes are determinable. The quantization step sizes are a significant determining factor of the coding bitrate.
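As a rough illustration of the last two steps, the following Python sketch derives an allowed noise level from a masked threshold and a quality-dependent offset, then picks a step size for it. It assumes a plain uniform quantizer, whose noise is approximately step²/12 per coefficient; AAC itself uses a non-uniform quantizer, and the function names and the decibel offset are hypothetical, not taken from the patent.

```python
import math

def allowed_noise(masked_threshold, offset_db):
    """Noise energy permitted in a band: the masked threshold shifted by offset_db."""
    return masked_threshold * 10.0 ** (offset_db / 10.0)

def uniform_step_size(noise_energy, num_coeffs):
    """Step size whose uniform-quantizer noise (step**2 / 12 per coefficient)
    adds up to noise_energy over num_coeffs coefficients (an approximation)."""
    return math.sqrt(12.0 * noise_energy / num_coeffs)

# Example: keep the noise 10 dB below a masked threshold of 100 energy units
# in a band of 4 coefficients.
noise = allowed_noise(100.0, -10.0)
step = uniform_step_size(noise, 4)
```

A larger allowed noise yields a larger step size, and a larger step size needs fewer bits, which is why the step sizes drive the bitrate.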
The more bits allocated for encoding a block of audio, the less noise may be generated during the quantization process. However, current techniques for estimating how many bits to allocate are inefficient. For example, current techniques estimate audio quality based on an erroneous assumption of the noise-to-audio quality relationship. As another example, current techniques take into account all possible scale factor values at each scale factor band, which requires a significant number of calculations.
Based on the foregoing, there is room for improvement in estimating scale factor values when encoding audio data.
General Overview
Perceptual audio coding aims to achieve the best perceived audio quality for a given target bitrate; or, conversely, perceptual audio coding aims to achieve the lowest bitrate for a given audio quality target. The following encoder modules may be used to achieve these aims: a) a psychoacoustic model that estimates a masked threshold, b) a bit allocation module that controls which parameters and spectral coefficients are transmitted and at which resolution, and c) a multiplexer that forms a valid bitstream.
Conceptually, the masked threshold indicates the maximum spectral level of quantization distortions that will be just inaudible. Audio coders have a bit allocation module designed to shape the quantization noise such that the quantization noise just approaches the masked threshold. This noise shaping is achieved by modifying “scale factor values” (SFVs) which in turn determine the amount of quantization noise created in each “scale factor band” (SFB). As opposed to the traditional approach, this description introduces a new bit allocation approach that optimizes the SFVs, the number of bits used for encoding (e.g. MDCT) spectral coefficients, and the audio quality. Although this bit allocation process is applied to AAC, it is applicable to other coders, such as MP3, AC-3, and WMA.
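To make the link between an SFV and the noise in its band concrete, here is a minimal Python sketch. It assumes the power-law quantizer commonly described for AAC (step size 2^(SFV/4), a 3/4 exponent and a 0.4054 rounding offset) and leaves out the fixed offset that the real bitstream applies to scale factors. The function names and sample coefficients are illustrative and do not come from the patent.

```python
ROUNDING_OFFSET = 0.4054  # rounding constant commonly used with the AAC quantizer

def quantize(x, sfv):
    """Power-law quantizer: a larger sfv means a coarser step."""
    step = 2.0 ** (sfv / 4.0)
    q = int((abs(x) / step) ** 0.75 + ROUNDING_OFFSET)
    return q if x >= 0 else -q

def dequantize(q, sfv):
    """Inverse of quantize(), up to the rounding error."""
    step = 2.0 ** (sfv / 4.0)
    magnitude = abs(q) ** (4.0 / 3.0) * step
    return magnitude if q >= 0 else -magnitude

def band_noise(coeffs, sfv):
    """Sum of squared reconstruction errors over one scale factor band."""
    return sum((x - dequantize(quantize(x, sfv), sfv)) ** 2 for x in coeffs)

# Hypothetical spectral coefficients for a single SFB.
band = [100.0, -57.3, 23.8, 4.1]
fine, coarse = band_noise(band, 0), band_noise(band, 16)
```

For this sample band the noise at SFV 16 is much larger than at SFV 0. A bit allocation module would raise or lower each band's SFV until the noise sits just under that band's masked threshold.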
In one approach, when estimating the cost of using a particular SFV for a particular SFB, the amount of noise of using the SFV is determinable. One factor that the cost takes into account when choosing the SFV is the audio quality achieved. Audio quality acts as a “credit” whereas the number of bits (e.g., to encode the quantized spectral coefficients and SFVs) acts as a “debit.” Instead of assuming that a constant decrease in noise has a corresponding constant increase in audio quality, a more accurate modeling of audio quality based on noise is used. Such a model may be based on a non-linear function where, after a certain level of noise, a decrease in noise does not correspond to a proportional increase in audio quality.
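The credit and debit idea can be sketched as follows. The summary does not give the patent's actual quality function, so this Python example substitutes a logistic curve over the noise-to-mask ratio in decibels purely as an assumption; `quality_credit`, `band_cost` and `bit_weight` are hypothetical names.

```python
import math

def quality_credit(noise, threshold):
    """Hypothetical saturating quality model: close to 1 when the noise is far
    below the masked threshold, 0.5 at the threshold, close to 0 far above it."""
    nmr_db = 10.0 * math.log10(max(noise, 1e-12) / threshold)
    return 1.0 / (1.0 + math.exp(nmr_db / 3.0))

def band_cost(noise, threshold, bits, bit_weight=0.01):
    """Bits act as a debit, quality as a credit; a lower cost is better."""
    return bit_weight * bits - quality_credit(noise, threshold)

# Lowering the noise from 0 dB to -10 dB relative to the threshold buys far
# more quality than lowering it again from -10 dB to -20 dB.
gain_first = quality_credit(0.1, 1.0) - quality_credit(1.0, 1.0)
gain_second = quality_credit(0.01, 1.0) - quality_credit(0.1, 1.0)
```

Under a model like this, spending extra bits to push noise far below the masked threshold earns almost no credit, so the optimizer stops paying for reductions nobody can hear.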
In another approach, when estimating the cost of transitioning from one SFB to another SFB, instead of considering all possible SFVs, only a proper subset of the possible SFVs are considered, thus reducing the computational complexity. The subset is determined based on an initial SFV where a certain number of SFVs “above” the initial SFV are considered and a certain number of SFVs “below” the initial SFV are considered.
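A restricted search of this kind can be sketched as a Viterbi-style dynamic program over a lattice of candidate SFVs. In the Python example below, the window sizes and the two cost callbacks are placeholders chosen for illustration; the patent's real cost terms are not reproduced here.

```python
def candidate_sfvs(initial, below=2, above=2):
    """Candidate SFVs in a small window around the initial estimate."""
    return list(range(initial - below, initial + above + 1))

def optimize_sfvs(initial_sfvs, band_cost, transition_cost, below=2, above=2):
    """Find the cheapest SFV path across bands (expects at least one band).

    band_cost(b, v): hypothetical cost of using SFV v in band b.
    transition_cost(p, v): hypothetical cost of moving from SFV p to SFV v.
    Returns (total_cost, path).
    """
    candidates = [candidate_sfvs(s, below, above) for s in initial_sfvs]
    # best[v] = (cost of the cheapest path ending at v, that path)
    best = {v: (band_cost(0, v), [v]) for v in candidates[0]}
    for b in range(1, len(candidates)):
        nxt = {}
        for v in candidates[b]:
            prev = min(best, key=lambda p: best[p][0] + transition_cost(p, v))
            total = best[prev][0] + transition_cost(prev, v) + band_cost(b, v)
            nxt[v] = (total, best[prev][1] + [v])
        best = nxt
    return min(best.values(), key=lambda entry: entry[0])

# Example with made-up costs: only SFV changes between bands are penalized.
cost, path = optimize_sfvs([10, 14, 10],
                           band_cost=lambda b, v: 0,
                           transition_cost=lambda p, v: abs(v - p))
```

With a window of five candidates per band, each transition compares 25 pairs of values instead of every pair in the full SFV range, which is where the reduction in computation comes from.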
In another approach, the initial SFV is generated based on an efficient formula that considers the masked threshold intensity for the corresponding SFB and the band energy or sum of spectral coefficient magnitudes of the corresponding SFB without performing any computationally-expensive square root operations.
Apple’s patent FIG. 1, noted below, is a block diagram that illustrates an exemplary perceptual audio coder, according to an embodiment of the invention.

Apple’s patent FIGS. 2A-B, noted below, are graphs that illustrate exemplary uniform and non-uniform quantizers.

Apple’s patent FIG. 3, noted below, is a diagram that illustrates a range of scale factor values for optimization in a dynamic program, according to an embodiment of the invention.

Apple’s patent FIG. 4, noted below, is a diagram that illustrates a lattice and the contributions of partial costs to the cost of transitioning from one scale factor band to another scale factor band, according to an embodiment of the invention.

Apple lists Frank M. Baumgarte (Sunnyvale, CA) as the sole inventor of this patent.
A secondary AAC-related patent application by Apple’s Frank Baumgarte, titled Bitrate control for perceptual coding, was also published today. It generally relates to digital media processing and, more specifically, to controlling bitrate by accounting for human perception.
Apple’s Abstract reads as follows: Techniques for generating a target digital media item based on a source digital media item are described. A digital media item may be a song, a video clip, an album, or any length of audio or video. When adjusting the bit count for a portion of the target digital media item, instead of using the same set of parameter values used in a perceptual model for each portion of the source media item, the set of parameter values may be modified to encode the portion of the source digital media item. In this way, how audio or video is perceived is taken into account when adjusting a proposed bit count for a given portion of the target digital media item. Thus, while maintaining the same statistical bitrate as before, increased digital media quality is achieved.

Apple’s patent FIG. 2, noted above, is a block diagram that illustrates one type of bitrate control in a perceptual audio coder, according to an embodiment of the invention. For more information on this secondary AAC patent, see application 20080027732.
NOTICE: MacNN presents only a brief summary of patents with associated graphic(s) for journalistic news purposes as each such patent application and/or grant is revealed by the U.S. Patent & Trademark Office. Readers are cautioned that the full text of any patent application and/or grant should be read in its entirety for further details.
Written and researched by Neo.