Constrained-Energy Lapped Transform (CELT) Codec

Authors' addresses:
   Octasic Semiconductor, 4101, Molson Street, Suite 300, Montreal, Quebec H1Y 3L1, Canada. Email: jean-marc.valin@octasic.com
   Xiph.Org Foundation. Email: tterribe@xiph.org
   Juniper Networks, 2251 Corporate Park Drive, Suite 100, Herndon, VA 20171-1817, USA. Email: gmaxwell@juniper.net
   Xiph.Org Foundation. Email: xiphmont@xiph.org

Area: General
Working group: AVT Working Group
Keywords: audio codec, low delay, Internet-Draft, CELT
CELT is an open-source voice codec suitable for use in very low delay
Voice over IP (VoIP) type applications. This document describes the encoding
and decoding process.
This document describes the CELT codec, which is designed for transmitting full-bandwidth
audio with very low delay. It is suitable for encoding both
speech and music at rates starting at 32 kbit/s. It is primarily designed for transmission
over the Internet and protocols such as RTP, but also includes
a certain amount of robustness to bit errors, where this could be done at no significant
cost.
The novel aspect of CELT compared to most other codecs is its very low delay,
below 10 ms. There are two main advantages to having a very low delay audio link.
The lower delay itself is important for some interactions, such as playing music
remotely. Another advantage is its behavior in the presence of acoustic echo. When
the round-trip audio delay is sufficiently low, acoustic echo is no longer
perceived as a distinct repetition, but rather as extra reverberation. Applications
of CELT include:
   o  Collaborative network music performance
   o  High-quality teleconferencing
   o  Wireless audio equipment
   o  Low-delay links for broadcast applications
The source code for the reference implementation of the CELT codec is provided with this document. This source code is the normative specification
of the codec. The corresponding text description in this document is provided
for informative purposes.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119.
CELT stands for Constrained Energy Lapped Transform. This is
the fundamental principle of the codec: the quantization process is designed in such a way
as to preserve the energy in a certain number of bands. The theoretical aspects of the
codec are described in greater detail in two papers. Although these papers describe slightly older versions of
the codec (version 0.3.2 and 0.5.1, respectively), the principles remain the same.
CELT is a transform codec, based on the Modified Discrete Cosine Transform
(MDCT). The MDCT is derived from the DCT-IV by adding an overlap with time-domain
aliasing cancellation.
The main characteristics of CELT are as follows:
   o  Ultra-low algorithmic delay (scalable, typically 4 to 9 ms)
   o  Sampling rates from 32 kHz to 48 kHz and above (full audio bandwidth)
   o  Applicability to both speech and music
   o  Support for mono and stereo
   o  Adaptive bit-rate from 32 kbit/s to 128 kbit/s per channel and above
   o  Scalable complexity
   o  Robustness to packet loss (scalable trade-off between quality and loss-robustness)
   o  Open source implementation (floating-point and fixed-point)
   o  No known intellectual property issues
This document contains a detailed description of both the encoder and the decoder, along with a reference implementation. In most circumstances, and unless otherwise stated, the calculations
do not need to produce results that are bit-identical with the reference implementation, so alternate algorithms can sometimes be used. However, there are a few (clearly identified) cases, such as the bit allocation, where bit-exactness with the reference
implementation is required. An implementation is considered to be compatible if, for any valid bit-stream, the decoder's output is perceptually indistinguishable from the output produced by the reference decoder.
The CELT codec does not use a standard bit-packer,
but rather uses a range coder to pack both integers and entropy-coded symbols.
In mono mode, the bit-stream generated by the encoder contains the
following parameters (in order):
   o  Feature flags I, P, S, F (2-4 bits)
   o  if P=1
      -  Pitch period
   o  if S=1
      -  Transient scalefactor
      -  if scalefactor=(1 or 2) AND more than 2 short MDCTs
         *  ID of block before transient
      -  if scalefactor=3
         *  Transient time
   o  Coarse energy encoding (for each band)
   o  Fine energy encoding (for each band)
   o  For each band
      -  if P=1 and band is at the beginning of a pitch band
         *  Pitch gain bit
      -  PVQ indices
   o  More fine energy (using all remaining bits)
Note that due to the use of a range coder, all of the parameters have to be encoded and decoded in order.
The CELT bit-stream is "octet-based" in the sense that the encoder always produces an
integer number of octets when encoding a frame. Also, the bit-rate used by the CELT encoder can
only be determined by the number of octets produced.
In many cases (e.g. UDP/RTP), the transport layer already encodes the data length, so
no extra information is necessary to signal the bit-rate. In cases where this is not true,
or when there are multiple compressed frames per packet, the size of each compressed
frame MUST be signalled in some way.
The operation of both the encoder and decoder depends on the mode data. A mode
definition can be created by celt_create_mode() (modes.c)
based on three parameters:
   o  frame size (number of samples)
   o  sampling rate (samples per second)
   o  number of channels (1 or 2)
The frame size can be any even number of samples from 64 to 1024, inclusive.
The sampling rate must be between 32000 Hz and 96000 Hz. The mode data that is
created defines how the encoder and the decoder operate. More specifically, the
following information is contained in the mode object:
   o  Frame size
   o  Sampling rate
   o  Windowing overlap
   o  Number of channels
   o  Definition of the bands
   o  Definition of the pitch bands
   o  Decay coefficients of the Laplace distributions for coarse energy
   o  Bit allocation matrix
The windowing overlap is the amount of overlap between the frames. CELT uses a low-overlap window whose overlap is typically half of the frame size. For a frame size of 256 samples, the overlap is 128 samples, so the total algorithmic delay is 256+128=384 samples (8 ms at a 48 kHz sampling rate). CELT divides the audio into frequency bands, for which the energy is preserved. These bands are chosen to follow the ear's critical bands, with the exception that each band has to contain at least 3 frequency bins.
The energy bands are based on the Bark scale. The Bark band edges (in Hz) are defined as
[0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720, 2000, 2320,
2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, 15500, 20000]. The actual bands used by the codec
depend on the sampling rate and the frame size being used. The mapping from Hz to MDCT bins is done by
dividing by the bin spacing, sampling_rate/(2*frame_size) Hz, and rounding to the nearest value. An exception is made for
the lower frequencies to ensure that all bands contain at least 3 MDCT bins. The definition of the Bark
bands is computed in compute_ebands() (modes.c).
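As a minimal sketch of the Hz-to-bin mapping (not the code of compute_ebands()), the conversion can be done with rounded integer arithmetic. The name hz_to_bin is hypothetical, and the special handling that widens the lowest bands to at least 3 bins is left out. One MDCT bin is taken to span sampling_rate/(2*frame_size) Hz.

```c
#include <assert.h>

/* Hypothetical helper, not part of the reference implementation:
   map a frequency in Hz to the nearest MDCT bin index.
   One MDCT bin spans sampling_rate/(2*frame_size) Hz. */
static int hz_to_bin(int freq_hz, int sampling_rate, int frame_size)
{
    long num, den;
    assert(freq_hz >= 0 && sampling_rate > 0 && frame_size > 0);
    num = 2L * frame_size * freq_hz;   /* freq / (rate / (2*frame_size)) */
    den = sampling_rate;
    return (int)((num + den / 2) / den);   /* round to nearest */
}
```

For example, with 256-sample frames at 48000 Hz one bin covers 93.75 Hz, so the first few 100 Hz-wide Bark bands map to one bin each. This is why the text needs an exception for the lower frequencies.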
CELT includes a pitch predictor for which the gains are defined over
a set of pitch bands. The pitch bands are defined
(in Hz) as [0, 345, 689, 1034, 1378, 2067, 3273, 5340, 6374]. The Hz values
are mapped to MDCT bins in the same way as the energy bands. The pitch
band boundaries are aligned to energy band boundaries. The definition of the pitch
bands is computed in compute_pbands() (modes.c).
The mode contains a bit allocation table that is derived from a prototype allocation table,
specified in the band_allocation matrix (modes.c). Each row of the table is a single prototype allocation,
in bits per Bark band, and assuming 256-sample frames. These rows must be projected onto the actual band layout in use at the
current frame size and sample rate, using exact integer calculations. The reference
implementation
pre-computes these projections in compute_allocation_table() (modes.c) and any other implementation
MUST produce bit-identical allocation results.
Every entry in the allocation table is multiplied by the current number of channels and
the current frame size. Each prototype allocation is projected
independently using the following process: the upper band frequencies (in Hz) from the current Bark band and current CELT band are compared. (When the process begins, these will each be the first band, but will increment independently.) If the current Bark band's upper edge frequency
is less than the current CELT band's upper edge frequency, the entire value of the Bark band plus any carried remainder is assigned to the current CELT
band, and the process continues with the next
Bark band in sequence and zero remainder. If the current Bark band's upper edge frequency is equal to or greater than that of
the current CELT band, the CELT band will receive only part of this Bark band's allocation.
This portion allocated to the CELT band is then calculated by multiplying the Bark band's allocation by the
difference in Hz between the Bark band's upper frequency and the current
CELT band's lower frequency, adding the width of the current Bark band
divided by two, and then dividing this total by the width of the current Bark
band in Hz. The partial value plus any carried remainder is added to the current
CELT band, and the difference between the partial value and the Bark target is
taken as the new carried remainder. The process then begins again, starting at the
next CELT band and next Bark band. Once all bands in a prototype allocation have been considered, any
remainder is added to the last CELT band. All of the resulting values are
rescaled by adding 128 and dividing by 256.
The top-level function for encoding a CELT frame in the reference implementation is
celt_encode() (celt.c).
The basic block diagram of the CELT encoder is illustrated in .
The encoder contains most of the building blocks of the decoder and can,
with very little extra computation, compute the signal that would be decoded by the decoder.
CELT has three main quantizers denoted Q1, Q2 and Q3. These apply to band energies,
pitch gains,
and normalized MDCT bins, respectively.
The input audio first goes through a pre-emphasis filter
(just before the windowing), which attenuates the
spectral tilt. The filter has the transfer function A(z)=1-alpha_p*z^-1, with
alpha_p=0.8. The inverse of the pre-emphasis is applied at the decoder.
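The filter pair can be sketched as follows. This is an illustrative floating-point version, not the reference code: the function names and the single-sample memory argument are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define ALPHA_P 0.8f  /* pre-emphasis coefficient from the text */

/* Encoder side: y[n] = x[n] - alpha_p*x[n-1].
   *mem holds x[-1] on entry and the last input sample on return. */
static void preemphasis(const float *x, float *y, size_t n, float *mem)
{
    size_t i;
    assert(x && y && mem);
    for (i = 0; i < n; i++) {
        y[i] = x[i] - ALPHA_P * (*mem);
        *mem = x[i];
    }
}

/* Decoder side (inverse filter): x[n] = y[n] + alpha_p*x[n-1].
   *mem holds the previous output sample. */
static void deemphasis(const float *y, float *x, size_t n, float *mem)
{
    size_t i;
    assert(x && y && mem);
    for (i = 0; i < n; i++) {
        x[i] = y[i] + ALPHA_P * (*mem);
        *mem = x[i];
    }
}
```

Keeping the filter memory in the caller lets the same functions run frame after frame without a discontinuity at frame boundaries.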
CELT uses an entropy coder based upon range coding,
which is itself a rediscovery of an earlier FIFO arithmetic code.
It is very similar to arithmetic encoding, except that encoding is done with
digits in any base instead of with bits,
so it is faster when using larger bases (i.e., an octet). All of the
calculations in the range coder must use bit-exact integer arithmetic.
The range coder also acts as the bit-packer for CELT. It is
used in three different ways, to encode:
   o  entropy-coded symbols with a fixed probability model, using ec_encode() (rangeenc.c)
   o  integers from 0 to 2^M-1, using ec_enc_uint() or ec_enc_bits() (entenc.c)
   o  integers from 0 to N-1 (where N is not a power of two), using ec_enc_uint() (entenc.c)
The range encoder maintains an internal state vector composed of the
four-tuple (low,rng,rem,ext), representing the low end of the current
range, the size of the current range, a single buffered output octet,
and a count of additional carry-propagating output octets. Both rng
and low are 32-bit unsigned integer values, rem is an octet value or
the special value -1, and ext is an integer with at least 16 bits.
This state vector is initialized at the start of each frame to
the value (0,2^31,-1,0).
Each symbol is drawn from a finite alphabet and coded in a separate
context which describes the size of the alphabet and the relative
frequency of each symbol in that alphabet. CELT only uses static
contexts; they are not adapted to the statistics of the data that is
coded.
The main encoding function is ec_encode() (rangeenc.c),
which takes as an argument a three-tuple (fl,fh,ft)
describing the range of the symbol to be encoded in the current
context, with 0 <= fl < fh <= ft <= 65535. The values of this tuple
are derived from the probability model for the symbol. Let f(i) be
the frequency of the ith symbol in the current context. Then the
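three-tuple corresponding to the kth symbol is given by
fl = sum(f(i), i<k), fh = fl + f(k), ft = sum(f(i)), where the last sum runs over the whole alphabet.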
ec_encode() updates the state of the encoder as follows. If fl is
greater than zero, then low = low + rng - (rng/ft)*(ft-fl) and
rng = (rng/ft)*(fh-fl). Otherwise, low is unchanged and
rng = rng - (rng/ft)*(ft-fh). The divisions here are integer
divisions (rounding down). After this update, the range is normalized.
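The interval update can be sketched in isolation as below. The function name range_update is hypothetical, and normalization and carry handling are deliberately left out. In this sketch the fl = 0 branch shrinks the range by (rng/ft)*(ft-fh), which is what makes the sub-intervals of all symbols tile the current range with no gap or overlap.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the interval update only; normalization and carry
   handling are not included. */
static void range_update(uint32_t *low, uint32_t *rng,
                         uint32_t fl, uint32_t fh, uint32_t ft)
{
    uint32_t r;
    assert(low && rng);
    assert(fl < fh && fh <= ft && ft <= 65535);
    r = *rng / ft;                        /* integer division */
    if (fl > 0) {
        *low += *rng - r * (ft - fl);     /* skip the symbols below fl */
        *rng = r * (fh - fl);
    } else {
        *rng -= r * (ft - fh);            /* first symbol absorbs the rounding slack */
    }
}
```

Because rng/ft rounds down, a small part of the range is not covered by the scaled frequencies. Assigning it to the symbol with fl = 0 keeps every value of the range decodable.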
To normalize the range, the following process is repeated until
rng > 2^23. First, the top 9 bits of low, (low>>23), are placed into
a carry buffer. Then, low is set to (low<<8&0x7FFFFFFF) and rng is set to (rng<<8). This process is carried out by
ec_enc_normalize() (rangeenc.c).
The 9 bits produced in each iteration of the normalization loop
consist of 8 data bits and a carry flag. The final value of the
output bits is not determined until carry propagation is accounted
for. Therefore the reference implementation buffers a single
(non-propagating) output octet and keeps a count of additional
propagating (0xFF) output octets. An implementation MAY choose to use
any mathematically equivalent scheme to perform carry propagation.
The function ec_enc_carry_out() (rangeenc.c) performs
this buffering. It takes a 9-bit input value, c, from the normalization,
consisting of an 8-bit output and a carry bit. If c is 0xFF, then ext is incremented
and no octets are output. Otherwise, if rem is not the special value
-1, then the octet (rem+(c>>8)) is output. Then ext octets are output
with the value 0 if the carry bit is set, or 0xFF if it is not, and
rem is set to the lower 8 bits of c. After this, ext is set to zero.
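The buffering rules above, together with the normalization loop that feeds them, can be sketched as follows. The rc_sketch structure, its fixed-size output array and all function names are hypothetical. The loop assumes that each normalization step shifts both low and rng left by 8 bits and keeps low to 31 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified encoder state for illustration. */
typedef struct {
    uint32_t low, rng;
    int      rem;        /* buffered octet, or -1 if none */
    unsigned ext;        /* count of pending 0xFF octets */
    uint8_t  out[64];    /* fixed-size output, enough for a demo */
    size_t   nout;
} rc_sketch;

static void rc_init(rc_sketch *s)
{
    s->low = 0; s->rng = (uint32_t)1 << 31;
    s->rem = -1; s->ext = 0; s->nout = 0;
}

static void rc_put(rc_sketch *s, unsigned octet)
{
    assert(s->nout < sizeof(s->out));
    s->out[s->nout++] = (uint8_t)octet;
}

/* c is a 9-bit value: 8 data bits plus a carry flag in bit 8. */
static void rc_carry_out(rc_sketch *s, unsigned c)
{
    unsigned carry;
    if (c == 0xFF) {          /* may still be changed by a later carry */
        s->ext++;
        return;
    }
    carry = c >> 8;
    if (s->rem >= 0)
        rc_put(s, (unsigned)s->rem + carry);
    /* pending 0xFF octets become 0x00 if a carry rippled through */
    for (; s->ext > 0; s->ext--)
        rc_put(s, carry ? 0x00 : 0xFF);
    s->rem = (int)(c & 0xFF);
}

/* Normalization loop: shift out octets until rng > 2^23. */
static void rc_normalize(rc_sketch *s)
{
    while (s->rng <= ((uint32_t)1 << 23)) {
        rc_carry_out(s, s->low >> 23);
        s->low = (s->low << 8) & 0x7FFFFFFF;
        s->rng <<= 8;
    }
}
```

Holding back one octet plus a run count is enough because a carry can only ripple through a run of 0xFF octets and then increments the single octet before that run.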
In the reference implementation, a special version of ec_encode()
called ec_encode_bin() (rangeenc.c) is defined to
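take a two-tuple (fl,ftb), where 0 <= fl < 2^ftb and ftb < 16. It is mathematically equivalent to calling ec_encode() with the three-tuple (fl,fl+1,1<<ftb), but avoids using division.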
Functions ec_enc_uint() or ec_enc_bits() are based on ec_encode() and
encode one of N equiprobable symbols, each with a frequency of 1,
where N may be as large as 2^32-1. Because ec_encode() is limited to
a total frequency of 2^16-1, this is done by encoding a series of
symbols in smaller contexts.
ec_enc_bits() (entenc.c) is defined, like
ec_encode_bin(), to take a two-tuple (fl,ftb), with 0 <= fl < 2^ftb. While ftb is greater than 8, it encodes bits (ftb-8) to (ftb-1) of fl, i.e., (fl>>ftb-8&0xFF), using ec_encode_bin() and
subtracts 8 from ftb. Then, it encodes the remaining bits of fl, i.e.,
(fl&(1<<ftb)-1), again using ec_encode_bin().
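The chunking performed on raw bits can be illustrated without a range coder. The helper below is hypothetical: it only computes the (value, bit count) pairs, most significant chunk first and at most 8 bits each, that a caller would then hand to a binary-context coder such as ec_encode_bin().

```c
#include <assert.h>
#include <stdint.h>

/* Split the ftb-bit value fl into chunks of at most 8 bits, most
   significant first.  Returns the number of chunks written
   (at most 4 for ftb <= 32). */
static int split_bits(uint32_t fl, int ftb, unsigned vals[4], int nbits[4])
{
    int n = 0;
    assert(ftb >= 1 && ftb <= 32);
    while (ftb > 8) {
        ftb -= 8;
        vals[n] = (unsigned)(fl >> ftb) & 0xFF;   /* next 8 high bits */
        nbits[n++] = 8;
    }
    vals[n] = (unsigned)(fl & (((uint32_t)1 << ftb) - 1));  /* low bits */
    nbits[n++] = ftb;
    return n;
}
```

For a 12-bit value such as 0xABC this yields the 8-bit chunk 0xAB followed by the 4-bit chunk 0xC.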
ec_enc_uint() (entenc.c) takes a two-tuple (fl,ft),
where ft is not necessarily a power of two. Let ftb be the location
of the highest 1 bit in the two's-complement representation of
(ft-1), or -1 if no bits are set. If ftb>8, then the top 8 bits of fl
are encoded using ec_encode() with the three-tuple
(fl>>ftb-8,(fl>>ftb-8)+1,(ft-1>>ftb-8)+1), and the remaining bits
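are encoded with ec_enc_bits() using the two-tuple
(fl&((1<<(ftb-8))-1),ftb-8). Otherwise, fl is encoded with ec_encode() directly, using the three-tuple (fl,fl+1,ft).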
After all symbols are encoded, the stream must be finalized by
outputting a value inside the current range. Let end be the integer
in the interval [low,low+rng) with the largest number of trailing
zero bits. Then while end is not zero, the top 9 bits of end, i.e.,
(end>>23), are sent to the carry buffer, and end is replaced by
(end<<8&0x7FFFFFFF). Finally, if the value in the carry buffer, rem, is
neither zero nor the special value -1, or the carry count, ext, is
greater than zero, then 9 zero bits are sent to the carry buffer.
After the carry buffer is finished outputting octets, the rest of the
output buffer is padded with zero octets. Finally, rem is set to the
special value -1. This process is implemented by ec_enc_done()
(rangeenc.c).
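One straightforward way to pick the final value is sketched below. It is a plain search written for clarity, not the method of ec_enc_done(), and the name pick_end is hypothetical. It tries alignments from coarse to fine and returns the first multiple of 2^l that falls inside [low,low+rng).

```c
#include <assert.h>
#include <stdint.h>

/* Return the value in [low, low+rng) with the most trailing zero bits
   (0 itself counts as having the most).  64-bit arithmetic is used so
   that low+rng cannot wrap. */
static uint64_t pick_end(uint32_t low, uint32_t rng)
{
    int l;
    assert(rng > 0);
    for (l = 32; l > 0; l--) {
        uint64_t mask = ((uint64_t)1 << l) - 1;
        uint64_t end = ((uint64_t)low + mask) & ~mask;  /* round up */
        if (end - low < rng)
            return end;
    }
    return low;   /* l == 0: every integer qualifies */
}
```

Choosing the value with the most trailing zeros minimizes the number of non-zero octets that still have to be flushed, since the flush loop stops as soon as the shifted value becomes zero.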
The bit allocation routines in CELT need to be able to determine a
conservative upper bound on the number of bits that have been used
to encode the current frame thus far. This drives allocation
decisions and ensures that the range code will not overflow the
output buffer. This is computed in the reference implementation to
fractional bit precision by the function ec_enc_tell()
(rangeenc.c).
Like all operations in the range encoder, it must
be implemented in a bit-exact manner.
The CELT codec has several optional features that can be switched on or off in each frame, some of which are mutually exclusive. The four main flags are intra-frame energy (I), pitch (P), short blocks (S), and folding (F). Those are described in more detail below. There are eight valid combinations of these four features, and they are encoded into the stream first using the variable-length code shown in the table below. It is left to the implementor to choose when to enable each of the flags, with the only restriction that the combination of the four flags MUST correspond to a valid entry in that table.

Encoding of the feature flags

   I  P  S  F  Encoding
   0  0  0  1  00
   0  1  0  1  01
   1  0  0  1  110
   1  0  1  1  111
   0  0  0  0  1000
   0  0  1  1  1001
   0  1  0  0  1010
   1  0  0  0  1011
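A table-driven lookup is one way to apply this code. In the sketch below, the names flag_row, flag_table and find_flags are hypothetical. The rows are read from the table of flag encodings as four flag bits followed by a 2-4 bit code, and how the code bits are handed to the range coder is not shown.

```c
#include <assert.h>

/* One row per valid flag combination: flags, code value, code length. */
typedef struct { int i, p, s, f; unsigned code; int nbits; } flag_row;

static const flag_row flag_table[8] = {
    {0, 0, 0, 1, 0x0, 2},   /* 00   */
    {0, 1, 0, 1, 0x1, 2},   /* 01   */
    {1, 0, 0, 1, 0x6, 3},   /* 110  */
    {1, 0, 1, 1, 0x7, 3},   /* 111  */
    {0, 0, 0, 0, 0x8, 4},   /* 1000 */
    {0, 0, 1, 1, 0x9, 4},   /* 1001 */
    {0, 1, 0, 0, 0xA, 4},   /* 1010 */
    {1, 0, 0, 0, 0xB, 4}    /* 1011 */
};

/* Returns the table row for a flag combination, or -1 if the
   combination is not one of the eight valid ones. */
static int find_flags(int i, int p, int s, int f)
{
    int k;
    assert((i | p | s | f) == 0 || (i | p | s | f) == 1);  /* flags are 0 or 1 */
    for (k = 0; k < 8; k++)
        if (flag_table[k].i == i && flag_table[k].p == p &&
            flag_table[k].s == s && flag_table[k].f == f)
            return k;
    return -1;
}
```

An encoder could call find_flags() before writing a frame and treat a return value of -1 as a request to fall back to a valid combination. The code lengths satisfy the Kraft equality (2/4 + 2/8 + 4/16 = 1), so the eight codes form a complete prefix code.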
CELT uses prediction to encode the energy in each frequency band. In order to make frames independent, however, it is possible to disable the part of the prediction that depends on previous frames. This is called intra-frame energy and requires around 12 more bits per frame. It is enabled with the I bit (see the table of feature flag encodings). The use of intra energy is OPTIONAL and the decision method is left to the implementor. The reference code describes one way of deciding which frames would benefit most from having their energy encoded without prediction. The intra_decision() (quant_bands.c) function looks for frames where the log-spectral distance between consecutive frames is more than 9 dB. When such a difference is found between two frames, the next frame (not the one for which the difference is detected) is marked to be encoded with intra energy. The one-frame delay ensures that when a frame containing a transient is lost, the next frame will be decoded without accumulating error from the lost frame.
CELT can use a pitch predictor (also known as long-term predictor) to improve the voice quality at lower bit-rates. While the pitch period can be estimated in any way, it is RECOMMENDED for performance reasons to estimate it using a frequency-domain correlation between the current frame and the history buffer, as implemented in find_spectral_pitch() (pitch.c). When the P bit is set, the pitch period is encoded after the flag bits. The value encoded is an integer in the range [0, 1024-N-overlap-1].
To improve audio quality during transients, CELT can use a short block multiple-MDCT transform. Unlike other transform codecs, the multiple MDCTs are jointly quantized as if the coefficients were obtained from a single MDCT. For that reason, it is better to consider the short block case as using a different transform of the same length rather than as multiple independent MDCTs. In the reference implementation, the decision to use short blocks is made by transient_analysis() (celt.c) based on the pre-emphasized signal's peak values, but other methods can be used. When the S bit is set, a 2-bit transient scalefactor is encoded directly after the flag bits. If the scalefactor is 0, then the multiple-MDCT output is unmodified. If the scalefactor is 1 or 2, then the output of the MDCTs that follow the transient is scaled down by 2^scalefactor. If the scalefactor is equal to 3, then a time-domain pre-emphasis window is applied before computing the MDCTs and no further scaling is applied to the MDCTs output. The window value is 1 from the beginning of the frame to 16 samples before the transient time. It is a Hanning window from there to the transient time, and then the value is 1/8 up to the end of the frame. The Hanning window part is defined as:
static const float transientWindow[16] = {
0.0085135, 0.0337639, 0.0748914, 0.1304955,
0.1986827, 0.2771308, 0.3631685, 0.4538658,
0.5461342, 0.6368315, 0.7228692, 0.8013173,
0.8695045, 0.9251086, 0.9662361, 0.9914865};
When the scalefactor is 3, the transient time is the exact time of the transient
determined by the encoder and encoded as an integer number of samples with the range
[0, N+overlap-1] directly after the scalefactor.
In the case where the scalefactor is 1 or 2 and the mode is defined to use more than 2 MDCTs, the last MDCT to which the scaling is not applied is encoded using an integer in the range [0, B-2], where B is the number of short MDCTs used for the mode.
The last encoding feature in CELT is spectral folding. It is designed to prevent birdie artifacts caused by the sparse spectra often generated by low-bitrate transform codecs. When folding is enabled, a copy of the low-frequency spectrum is added to the higher-frequency bands (above ~6400 Hz). The folding operation is described in more detail later in this document.
The MDCT implementation has no special characteristics. The
input is a windowed signal (after pre-emphasis) of 2*N samples and the output is N
frequency-domain samples. A low-overlap window is used to reduce the algorithmic delay.
It is derived from a basic (full overlap) window that is the same as the one used in the Vorbis codec: W(n)=sin(pi/2*[sin(pi/2*(n+.5)/L)]^2). The low-overlap window is created by zero-padding the basic window and inserting ones in the middle, such that the resulting window still satisfies power complementarity, i.e., W(n)^2+W(L-1-n)^2=1 over the overlap region. The MDCT is computed in mdct_forward() (mdct.c), which includes the windowing operation and a scaling of 2/N.
The MDCT output is divided into bands that are designed to match the ear's critical bands,
with the exception that each band has to be at least 3 bins wide. For each band, the encoder
computes the energy that will later be encoded. Each band is then normalized by the
square root of the non-quantized energy, such that each band now forms a unit vector X.
The energy and the normalization are computed by compute_band_energies()
and normalise_bands() (bands.c), respectively.
It is important to quantize the energy with sufficient resolution because
any energy quantization error cannot be compensated for at a later
stage. Regardless of the resolution used for encoding the shape of a band,
it is perceptually important to preserve the energy in each band. CELT uses a
coarse-fine strategy for encoding the energy in the base-2 log domain,
as implemented in quant_bands.c.
The coarse quantization of the energy uses a fixed resolution of
6 dB and is the only place where entropy coding is used.
To minimize the bitrate, prediction is applied both in time (using the previous frame)
and in frequency (using the previous bands). The 2-D z-transform of
the prediction filter is: A(z_l, z_b)=(1-a*z_l^-1)*(1-z_b^-1)/(1-b*z_b^-1)
where the subscripts b and l denote the band index and the frame index, respectively (the subscript b is unrelated to the coefficient b). The prediction coefficients are
a=0.8 and b=0.7 when not using intra energy and a=b=0 when using intra energy.
The time-domain prediction is based on the final fine quantization of the previous
frame, while the frequency domain (within the current frame) prediction is based
on coarse quantization only (because the fine quantization has not been computed
yet). We approximate the ideal
probability distribution of the prediction error using a Laplace distribution. The
coarse energy quantization is performed by quant_coarse_energy()
(quant_bands.c).
The Laplace distribution for each band is defined by a 16-bit (Q15) decay parameter.
Thus, the value 0 has a frequency count of p[0]=2*(16384*(16384-decay)/(16384+decay)). The
values +/- i each have a frequency count p[i] = (p[i-1]*decay)>>14. The value of p[i] is always
rounded down (to avoid exceeding 32768 as the sum of all frequency counts), so it is possible
for the sum to be less than 32768. In that case additional values with a frequency count of 1 are encoded. The signed values corresponding to symbols 0, 1, 2, 3, 4, ...
are [0, +1, -1, +2, -2, ...]. The encoding of the Laplace-distributed values is
implemented in ec_laplace_encode() (laplace.c).
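The frequency counts and the symbol ordering can be sketched with integer arithmetic only. The function names below are hypothetical, the bound decay < 16384 is an assumption made so that p[0] stays positive, and the handling of the leftover counts of 1 is not shown.

```c
#include <assert.h>

/* Frequency count of the value 0 for a given decay parameter,
   using the integer formula quoted in the text.
   Assumption of this sketch: 0 < decay < 16384. */
static int laplace_p0(int decay)
{
    assert(decay > 0 && decay < 16384);
    return 2 * (16384 * (16384 - decay) / (16384 + decay));
}

/* Frequency count of the values +i and -i (i >= 0), obtained by
   applying p[i] = (p[i-1]*decay)>>14 repeatedly; rounds down. */
static int laplace_pi(int decay, int i)
{
    int p = laplace_p0(decay);
    assert(i >= 0);
    while (i-- > 0)
        p = (p * decay) >> 14;
    return p;
}

/* Symbol order 0, 1, 2, 3, 4, ... corresponds to 0, +1, -1, +2, -2, ... */
static int laplace_symbol_to_value(int symbol)
{
    assert(symbol >= 0);
    return (symbol & 1) ? (symbol + 1) / 2 : -(symbol / 2);
}
```

With decay = 8192 the counts halve at each step (10922, 5461, 2730, ...), and each rounding-down leaves a little of the 32768 total unassigned, which is the case the text covers with extra counts of 1.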
After the coarse energy quantization and encoding, the bit allocation is computed
and the number of bits to use for refining the
energy quantization is determined for each band. Let B_i be the number of fine energy bits
for band i; the refinement is an integer f in the range [0,2^B_i-1]. The mapping between f
and the correction applied to the coarse energy is equal to (f+1/2)/2^B_i - 1/2. Fine
energy quantization is implemented in quant_fine_energy()
(quant_bands.c).
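The mapping between f and the correction can be sketched as below. The decoder-side formula is taken from the text. The encoder-side index selection and both function names are assumptions of this sketch, and the correction is expressed relative to the coarse value in the same log-domain units.

```c
#include <assert.h>

/* Correction for refinement index f with nbits fine bits:
   (f + 1/2)/2^nbits - 1/2. */
static float fine_offset(int f, int nbits)
{
    assert(nbits >= 0 && nbits < 16 && f >= 0 && f < (1 << nbits));
    return ((float)f + 0.5f) / (float)(1 << nbits) - 0.5f;
}

/* Encoder-side counterpart (an assumption of this sketch): pick the
   index whose cell contains err, for err in [-0.5, 0.5). */
static int fine_index(float err, int nbits)
{
    int f;
    assert(nbits >= 0 && nbits < 16);
    f = (int)((err + 0.5f) * (float)(1 << nbits));  /* truncation = floor here */
    if (f < 0) f = 0;
    if (f > (1 << nbits) - 1) f = (1 << nbits) - 1;
    return f;
}
```

With one fine bit the two reconstruction points are -1/4 and +1/4 of a coarse step, so each extra bit halves the maximum error.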
If any bits are unused at the end of the encoding process, these bits are used to
increase the resolution of the fine energy encoding in some bands. Priority is given
to the bands for which the allocation was rounded
down. At the same level of priority, lower bands are encoded first. Refinement bits
are added until there are no unused bits. This is implemented in quant_energy_finalise()
(quant_bands.c).
Bit allocation is performed based only on information available to both
the encoder and decoder. The same calculations are performed in a bit-exact
manner in both the encoder and decoder to ensure that the result is always
exactly the same. Any mismatch would cause an error in the decoded output.
The allocation is computed by compute_allocation() (rate.c),
which is used in both the encoder and the decoder.
For a given band, the bit allocation is nearly constant across
frames that use the same number of bits for Q1, yielding a
pre-defined signal-to-mask ratio (SMR) for each band. Because the
bands each have a width of one Bark, this is equivalent to modeling the
masking occurring within each critical band, while ignoring inter-band
masking and tone-vs-noise characteristics. While this is not an
optimal bit allocation, it provides good results without requiring the
transmission of any allocation information.
For every encoded or decoded frame, a target allocation must be computed
using the projected allocation. In the reference implementation this is
performed by compute_allocation() (rate.c).
The target computation begins by calculating the available space as the
number of whole bits which can be fit in the frame after Q1 is stored according
to the range coder (ec_[enc/dec]_tell()), and iff the frame has pitch prediction,
subtracting the number of pitch bands and then multiplying by 16.
Then the two projected prototype allocations whose sums multiplied by 16 are nearest
to that value are determined. These two projected prototype allocations are then interpolated
by finding the highest integer interpolation coefficient in the range 0-16
such that the sum of the higher prototype times the coefficient, plus the
sum of the lower prototype multiplied by
the difference of 16 and the coefficient, is less than or equal to the
available sixteenth-bits.
The reference implementation performs this step using a binary search in
interp_bits2pulses() (rate.c). The target
allocation is the interpolation coefficient times the higher prototype, plus
the lower prototype multiplied by the difference of 16 and the coefficient,
for each of the CELT bands.
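The interpolation step can be sketched as follows. The reference implementation uses a binary search. This sketch swaps in a plain downward scan, which gives the same coefficient because the interpolated sum grows with the coefficient when the higher prototype is at least as large as the lower one. The function names are hypothetical.

```c
#include <assert.h>

/* lo_sum and hi_sum are the sums (in whole bits) of the two bracketing
   projected prototype allocations; avail16 is the available space in
   1/16 bit units.  Returns the largest c in [0,16] such that
   hi_sum*c + lo_sum*(16-c) <= avail16, or -1 if even c = 0 does not fit. */
static int interp_coef(long lo_sum, long hi_sum, long avail16)
{
    int c;
    assert(lo_sum >= 0 && hi_sum >= lo_sum);
    for (c = 16; c >= 0; c--)
        if (hi_sum * c + lo_sum * (16 - c) <= avail16)
            return c;
    return -1;
}

/* Target for one band, in 1/16 bit units, given that band's entries
   in the lower (lo) and higher (hi) prototype. */
static long band_target(long lo, long hi, int c)
{
    assert(c >= 0 && c <= 16);
    return hi * c + lo * (16 - c);
}
```

For example, with prototype sums of 100 and 200 bits and 2400 sixteenth-bits available, the coefficient is 8, halfway between the two prototypes.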
Because the computed target will sometimes be somewhat smaller than the
available space, the excess space is divided by the number of bands, and this amount
is added equally to each band. Any remaining space is added to the target one
sixteenth-bit at a time, starting from the first band. The new target now
matches the available space, in sixteenth-bits, exactly.
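The redistribution of the excess can be sketched as below. The function name and the in-place array interface are assumptions of this sketch, and all quantities are in sixteenth-bits.

```c
#include <assert.h>

/* Spread the unused space over the bands so that the targets sum to
   avail16 exactly.  All quantities are in 1/16 bit units. */
static void spread_excess(long *target, int nbands, long avail16)
{
    long sum = 0, excess, share;
    int j;
    assert(target && nbands > 0);
    for (j = 0; j < nbands; j++)
        sum += target[j];
    excess = avail16 - sum;
    assert(excess >= 0);
    share = excess / nbands;          /* equal part for every band */
    for (j = 0; j < nbands; j++)
        target[j] += share;
    excess -= share * nbands;         /* fewer than nbands units remain */
    for (j = 0; j < excess; j++)
        target[j] += 1;               /* one unit each, lowest bands first */
}
```

For targets {10, 20, 30} and 67 available units, every band first receives 2 units and the single leftover unit goes to the first band, giving {13, 22, 32}.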
The allocation target is separated into a portion used for fine energy
and a portion used for the Pyramid Vector Quantizer (PVQ). The fine energy
quantizer operates in whole-bit steps. For each band the number of bits per
channel used for fine energy is calculated by 50 minus the log2_frac(), with
1/16 bit precision, of the number of MDCT bins in the band. That result is multiplied
by the number of bins in the band and again by twice the number of
channels, and then the value is set to zero if it is less than zero. Added
to that result is 16 times the number of MDCT bins times the number of
channels, and it is finally divided by 32 times the number of MDCT bins times the
number of channels. If the result times the number of channels is greater than the
target divided by 16, the result is set to the target divided by the number of
channels divided by 16. Then if the value is greater than 7 it is reset to 7 because a
larger amount of fine energy resolution was determined not to make an improvement in
perceived quality. The resulting number of fine energy bits per channel is
then multiplied by the number of channels and then by 16, and subtracted
from the target allocation. This final target allocation is what is used for the
PVQ.
The pitch period T is computed in the frequency domain using a generalized
cross-correlation, as implemented in find_spectral_pitch()
(pitch.c). An MDCT is then computed on the
synthesis signal memory using the offset T.
If there is sufficient energy in this
part of the signal, the pitch gain for each pitch band
is computed as g_a = X^T*p, where X is the normalized (non-quantized) signal and
p is the normalized pitch MDCT.
The gain is computed by compute_pitch_gain() (bands.c),
and if a sufficient number of bands have a high enough gain, then the pitch bit is set.
Otherwise, no use of pitch is made.
For frequencies above the highest pitch band (~6374 Hz), the pitch prediction is replaced by
spectral folding if and only if the folding bit is set. Spectral folding is implemented in
intra_fold() (vq.c). If the folding bit is not set, then
the prediction is simply set to zero.
The folding prediction uses the quantized spectrum at lower frequencies with a gain that depends
both on the width of the band, N, and the number of pulses allocated, K:
g_a = N / (N + 2*K*(K+1)).
When the short block bit is not set, the spectral copy is performed starting with bin 0 (DC) and going up. When the short block bit is set, then the starting point is chosen between 0 and B-1 in such a way that the source and destination bins belong to the same MDCT (i.e., to prevent the folding from causing pre-echo). Before the folding operation, each band of the source spectrum is multiplied by sqrt(N) so that the expected value of the squared value for each bin is equal to 1. The copied spectrum is then renormalized to have norm (||p|| = g_a).
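The folding gain formula is simple enough to evaluate directly. The helper below is a hypothetical floating-point sketch of that one formula and does not perform the spectral copy itself.

```c
#include <assert.h>

/* Folding gain g_a = N / (N + 2*K*(K+1)) for a band of N bins that
   was allocated K pulses. */
static float fold_gain(int n, int k)
{
    assert(n > 0 && k >= 0);
    return (float)n / (float)(n + 2 * k * (k + 1));
}
```

With K = 0 the gain is 1, so the band is reconstructed entirely from the folded spectrum. As K grows the gain falls quickly, leaving most of the band's energy to the PVQ codeword.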
For stereo streams, the folding is performed independently for each channel.
CELT uses a Pyramid Vector Quantization (PVQ)
codebook for quantizing the details of the spectrum in each band that have not
been predicted by the pitch predictor. The PVQ codebook consists of all sums
of K signed pulses in a vector of N samples, where two pulses at the same position
are required to have the same sign. Thus the codebook includes
all integer codevectors y of N dimensions that satisfy sum(abs(y(j))) = K.
In bands where neither pitch nor folding is used, the PVQ is used to encode
the unit vector that results from the band normalization
directly. Given a PVQ codevector y,
the unit vector X is obtained as X = y/||y||, where ||.|| denotes the
L2 norm. In the case where a pitch
prediction or a folding vector p is used, the quantized unit vector X' becomes:
X' = p' + g_f * y,
where g_f = ( sqrt( (y^T*p')^2 + ||y||^2*(1-||p'||^2) ) - y^T*p' ) / ||y||^2, and p' = g_a * p.
The combination of the pitch with the PVQ codeword is described in
mix_pitch_and_residual() (vq.c) and is used in
both the encoder and the decoder.
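The expression for g_f follows from requiring the quantized vector X' to have unit norm. A short derivation, using only the symbols already defined and taking the positive root of the resulting quadratic in g_f:

```latex
\begin{align*}
\|X'\|^2 &= \|p' + g_f\,y\|^2
          = \|p'\|^2 + 2\,g_f\,y^T p' + g_f^2\,\|y\|^2 = 1 \\
\Rightarrow\quad
g_f &= \frac{-\,y^T p' + \sqrt{(y^T p')^2 + \|y\|^2\,\bigl(1 - \|p'\|^2\bigr)}}{\|y\|^2}
\end{align*}
```

When no prediction is used (p' = 0), this reduces to g_f = 1/||y||, which is the plain normalization X = y/||y||.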
Although the allocation is performed in 1/16 bit units, the quantization requires
an integer number of pulses K. To do this, the encoder searches for the value
of K that produces the number of bits that is the nearest to the allocated value
(rounding down if exactly half-way between two values), subject to not exceeding
the total number of bits available. The computation is performed in 1/16 of
bits using log2_frac() and ec_enc_tell(). The number of codebook entries can
be computed as explained in the description of the index encoding. The difference
between the number of bits allocated and the number of bits used is accumulated to a
balance (initialised to zero) that helps adjust the
allocation for the next bands. One third of the balance is subtracted from the
bit allocation of the next band to help achieve the target allocation. The only
exceptions are the band before the last and the last band, for which half the balance
and the whole balance are subtracted, respectively.
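The share of the balance applied to each band can be sketched as follows. This is a minimal illustration, not the reference code: the function name balance_share() and its parameters are hypothetical, the balance is assumed to be an integer in 1/16 bit units, and the rounding of the division in the reference implementation may differ.

```c
/* Hypothetical helper, not part of the reference implementation.
   'band' is the index of the band whose allocation is being adjusted and
   'nb_bands' is the total number of bands. Returns the share of the
   balance (in 1/16 bit units) that applies to this band. */
int balance_share(int balance, int band, int nb_bands)
{
    int remaining = nb_bands - band;              /* bands left, including this one */
    int divisor = remaining > 3 ? 3 : remaining;  /* 1/3, then 1/2, then all */
    if (divisor < 1)
        divisor = 1;                              /* guard against band >= nb_bands */
    return balance / divisor;                     /* C99 truncates towards zero */
}
```

With ten bands and a balance of 48, the sketch yields 16 for the first band, 24 for the band before the last, and 48 for the last band.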
The search for the best codevector y is performed by alg_quant()
(vq.c). There are several possible approaches to the
search with a tradeoff between quality and complexity. The method used in the reference
implementation computes an initial codeword y0 by projecting the residual signal
R = X - p' onto the codebook pyramid of K-1 pulses:
y0 = round_towards_zero( (K-1) * R / sum(abs(R)))
Depending on N, K and the input data, the initial codeword y0 may contain from
0 to K-1 non-zero values. All the remaining pulses, with the exception of the last one,
are found iteratively with a greedy search that maximizes the normalized correlation
between y and R, that is, minimizes the cost function:
J = -R^T*y / ||y||
The last pulse is the only one that takes the pitch into account; it minimizes the cost function:
J = -g_f * R^T*y + (g_f)^2 * ||y||^2
The search described above is considered to be a good trade-off between quality
and computational cost. However, there are other possible ways to search the PVQ
codebook and implementors MAY use any other search method.
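The projection and greedy steps described above can be sketched as follows. This is an illustration under stated assumptions, not the code of alg_quant(): the names pvq_project() and pvq_greedy() are hypothetical, double precision is used instead of the reference arithmetic, and the same correlation cost is applied to every added pulse (the pitch-aware cost of the last pulse is left out).

```c
/* Sketch only: names are hypothetical, not those used in vq.c. */

/* y = round_towards_zero((k-1) * r / sum(abs(r))); returns the pulses placed. */
int pvq_project(const double *r, int n, int k, int *y)
{
    double l1 = 0.0;
    int i, placed = 0;
    for (i = 0; i < n; i++)
        l1 += r[i] < 0 ? -r[i] : r[i];
    for (i = 0; i < n; i++) {
        /* The int cast truncates towards zero, as the projection requires. */
        y[i] = l1 > 0.0 ? (int)((k - 1) * r[i] / l1) : 0;
        placed += y[i] < 0 ? -y[i] : y[i];
    }
    return placed;
}

/* Adds pulses one at a time until 'target' pulses are placed, each time
   choosing the position that maximizes (r^T y)^2 / ||y||^2. Assumes y was
   produced by pvq_project(), so that r^T y is non-negative. */
void pvq_greedy(const double *r, int n, int placed, int target, int *y)
{
    double rty = 0.0, yy = 0.0;
    int i;
    if (n <= 0)
        return;
    for (i = 0; i < n; i++) {
        rty += r[i] * y[i];
        yy += (double)y[i] * y[i];
    }
    for (; placed < target; placed++) {
        int best = -1, best_sign = 1;
        double best_num = 0.0, best_den = 1.0;
        for (i = 0; i < n; i++) {
            int s = r[i] < 0 ? -1 : 1;              /* pulse follows the sign of r[i] */
            double num = rty + s * r[i];            /* updated r^T y */
            double den = yy + 2.0 * s * y[i] + 1.0; /* updated ||y||^2 */
            /* Compare num^2/den by cross-multiplying: no sqrt, no division. */
            if (best < 0 || num * num * best_den > best_num * best_num * den) {
                best = i;
                best_sign = s;
                best_num = num;
                best_den = den;
            }
        }
        y[best] += best_sign;
        rty = best_num;
        yy = best_den;
    }
}
```

For R = (0.6, -0.8, 0) and K = 4, the projection places two pulses, y0 = (1, -1, 0), and the greedy step then adds one pulse at each of the first two positions, giving y = (2, -2, 0).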
The best PVQ codeword is encoded as a uniformly-distributed integer value
by encode_pulses() (cwrs.c).
The codeword is converted to a unique index in the same way as specified in
. The indexing is based on the calculation of V(N,K) (denoted N(L,K) in ), which is the number of possible combinations of K pulses
in N samples. The number of combinations can be computed recursively as
V(N,K) = V(N-1,K) + V(N,K-1) + V(N-1,K-1), with V(N,0) = 1 and V(0,K) = 0, K != 0.
There are many different ways to compute V(N,K), including pre-computed tables and direct
use of the recursive formulation. The reference implementation applies the recursive
formulation one line (or column) at a time to save on memory use,
along with an alternate,
univariate recurrence to initialise an arbitrary line, and direct
polynomial solutions for small N. All of these methods are
equivalent, and have different trade-offs in speed, memory usage, and
code size. Implementations MAY use any methods they like, as long as
they are equivalent to the mathematical definition.
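As an illustration of the row-at-a-time approach, the sketch below evaluates the recurrence V(N,K) = V(N-1,K) + V(N,K-1) + V(N-1,K-1) while keeping a single row in memory. The name pvq_v() and the limit PVQ_MAX_K are assumptions made for this example; the sketch does not detect overflow of the 32-bit result and is not the code used in cwrs.c.

```c
#include <stdint.h>

#define PVQ_MAX_K 64  /* arbitrary limit chosen for this sketch */

/* Returns V(n,k), the number of PVQ codevectors with k pulses in n samples.
   Overflow of the 32-bit result is not detected. */
uint32_t pvq_v(int n, int k)
{
    uint32_t row[PVQ_MAX_K + 1];
    int i, j;
    if (n < 0 || k < 0 || k > PVQ_MAX_K)
        return 0;
    row[0] = 1;                      /* V(N,0) = 1 */
    for (j = 1; j <= k; j++)
        row[j] = 0;                  /* V(0,K) = 0 for K != 0 */
    for (i = 1; i <= n; i++) {
        uint32_t diag = row[0];      /* V(i-1, j-1) */
        for (j = 1; j <= k; j++) {
            uint32_t up = row[j];    /* V(i-1, j) */
            row[j] = up + row[j - 1] + diag;
            diag = up;
        }
    }
    return row[k];
}
```

For example, pvq_v(3, 2) evaluates to 18, the number of integer vectors of dimension 3 whose absolute values sum to 2.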
The indexing computations are performed using 32-bit unsigned integers. For large codebooks,
32-bit integers are not sufficient. Instead of using 64-bit integers (or more), the encoding
is made slightly sub-optimal by splitting each band into two equal (or near-equal) vectors of
size (N+1)/2 and N/2, respectively. The number of pulses in the first half, K1, is first encoded as an
integer in the range [0,K]. Then, two codebooks are encoded with V((N+1)/2, K1) and V(N/2, K-K1).
The split operation is performed recursively, in case one (or both) of the split vectors
still requires more than 32 bits. For compatibility reasons, the handling of codebooks of more
than 32 bits MUST be implemented with the splitting method, even if 64-bit arithmetic is available.
When encoding a stereo stream, some parameters are shared across the left and right channels, while others are transmitted separately for each channel, or jointly encoded. Only one copy of the flags for the features, transients and pitch (pitch period and gains) is transmitted. The coarse and fine energy parameters are transmitted separately for each channel. Both the coarse energy and fine energy (including the remaining fine bits at the end of the stream) have the left and right bands interleaved in the stream, with the left band encoded first.
The main difference between mono and stereo coding is the PVQ coding of the normalized vectors. In stereo mode, a normalized mid-side (M-S) encoding is used. Let L and R be the normalized vectors of a certain band for the left and right channels, respectively. The mid and side vectors are computed as M=L+R and S=L-R and no longer have unit norm.
From M and S, an angular parameter theta=2/pi*atan2(||S||, ||M||) is computed. The theta parameter is converted to a Q14 fixed-point parameter itheta, which is quantized on a scale from 0 to 1 with an interval of 2^-qb, where qb = (b-2*(N-1)*(40-log2_frac(N,4)))/(32*(N-1)), b is the number of bits allocated to the band, and log2_frac() is defined in cwrs.c. From here on, the value of itheta MUST be treated in a bit-exact manner since
both the encoder and decoder rely on it to infer the bit allocation.
Let m=M/||M|| and s=S/||S||; m and s are separately encoded with the PVQ encoder described above. The number of bits allocated to m and s depends on the value of itheta. The number of bits allocated to coding m is obtained by:
imid = bitexact_cos(itheta);
iside = bitexact_cos(16384-itheta);
delta = (N-1)*(log2_frac(iside,6)-log2_frac(imid,6))>>2;
qalloc = log2_frac((1<<qb)+1,4);
mbits = (b-qalloc/2-delta)/2;
where bitexact_cos() is a fixed-point cosine approximation that MUST be bit-exact with the reference implementation
in mathops.h. The spectral folding operation is performed independently for the mid and side vectors.
After all the quantization is completed, the quantized energy is used along with the
quantized normalized band data to resynthesize the MDCT spectrum. The inverse MDCT and the weighted overlap-add are applied and the signal is stored in the synthesis buffer so it can be used for pitch prediction.
The encoder MAY omit this step of the processing if it knows that it will not be using
the pitch predictor for the next few frames. If the de-emphasis filter is applied to this resynthesized
signal, then the output will be the same (within numerical precision) as the decoder's output.
Each CELT frame can be encoded in a different number of octets, making it possible to vary the bitrate at will. This property can be used to implement source-controlled variable bitrate (VBR). Support for VBR is OPTIONAL for the encoder, but a decoder MUST be prepared to decode a stream that changes its bit-rate dynamically. The method used to vary the bit-rate in VBR mode is left to the implementor, as long as each frame can be decoded by the reference decoder.
Like most audio codecs, the CELT decoder is less complex than the encoder, as can be
observed in the decoder block diagram. In
fact, most of the operations performed by the decoder are also performed by the
encoder.
The decoder extracts information from the range-coded bit-stream in the same order
as it was encoded by the encoder. In some circumstances, it is
possible for a decoded value to be out of range due to a very small amount of redundancy
in the encoding of large integers by the range coder.
In that case, the decoder should assume there has been an error in the coding,
decoding, or transmission and SHOULD take measures to conceal the error and/or report
to the application that a problem has occurred.
The range decoder extracts the symbols and integers encoded using the range encoder.
The range decoder maintains an internal
state vector composed of the two-tuple (dif,rng), representing the
difference between the high end of the current range and the actual
coded value, and the size of the current range, respectively. Both
dif and rng are 32-bit unsigned integer values. rng is initialized to
2^7. dif is initialized to rng minus the top 7 bits of the first
input octet. Then the range is immediately normalized, using the
procedure described in the following section.
Decoding symbols is a two-step process. The first step determines
a value fs that lies within the range of some symbol in the current
context. The second step updates the range decoder state with the
three-tuple (fl,fh,ft) corresponding to that symbol, as defined for the
range encoder.
The first step is implemented by ec_decode()
(rangedec.c),
and computes fs = ft-min((dif-1)/(rng/ft)+1,ft), where ft is
the sum of the frequency counts in the current context, as described
for the range encoder. The divisions here are integer divisions.
In the reference implementation, a special version of ec_decode()
called ec_decode_bin() (rangedec.c) is defined using
the parameter ftb instead of ft. It is mathematically equivalent to
calling ec_decode() with ft = (1<<ftb), but avoids one of the
divisions.
The decoder then identifies the symbol in the current context
corresponding to fs; i.e., the one whose three-tuple (fl,fh,ft)
satisfies fl <= fs < fh. This tuple is used to update the decoder
state according to dif = dif - (rng/ft)*(ft-fh), and if fl is greater
than zero, rng = (rng/ft)*(fh-fl), or otherwise rng = rng - (rng/ft)*(ft-fh). After this update, the range is normalized.
To normalize the range, the following process is repeated until
rng > 2^23. First, rng is set to (rng<<8)&0xFFFFFFFF. Then the next
8 bits of input are read into sym, using the remaining bit from the
previous input octet as the high bit of sym, and the top 7 bits of the
next octet for the remaining bits of sym. If no more input octets
remain, zero bits are used instead. Then, dif is set to
((dif<<8)-sym)&0xFFFFFFFF (i.e., using wrap-around if the subtraction
overflows a 32-bit register). Finally, if dif is larger than 2^31,
dif is then set to dif - 2^31. This process is carried out by
ec_dec_normalize() (rangedec.c).
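The initialization, symbol decoding, state update and normalization steps described above can be sketched together as follows. This is a minimal illustration written from the text, not the code of rangedec.c: the type rd_sketch and the rd_*() function names are hypothetical, and no claim of bit-exactness with the reference decoder is made.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder state for this sketch. */
typedef struct {
    const unsigned char *buf;
    size_t len, pos;
    unsigned rem;        /* last octet read; its low bit is still unused */
    uint32_t rng, dif;
} rd_sketch;

/* Returns the next input octet, or zero bits once the input is exhausted. */
unsigned rd_next_octet(rd_sketch *d)
{
    return d->pos < d->len ? d->buf[d->pos++] : 0u;
}

void rd_normalize(rd_sketch *d)
{
    while (d->rng <= ((uint32_t)1 << 23)) {  /* repeat until rng > 2^23 */
        unsigned sym;
        d->rng <<= 8;
        sym = (d->rem & 1u) << 7;            /* leftover bit becomes the high bit */
        d->rem = rd_next_octet(d);
        sym |= d->rem >> 1;                  /* top 7 bits of the next octet */
        d->dif = (d->dif << 8) - sym;        /* uint32_t wraps modulo 2^32 */
        if (d->dif > 0x80000000u)
            d->dif -= 0x80000000u;
    }
}

void rd_init(rd_sketch *d, const unsigned char *buf, size_t len)
{
    d->buf = buf;
    d->len = len;
    d->pos = 0;
    d->rem = rd_next_octet(d);
    d->rng = 1u << 7;
    d->dif = d->rng - (d->rem >> 1);         /* rng minus top 7 bits of octet 0 */
    rd_normalize(d);
}

/* Step 1: fs = ft - min((dif-1)/(rng/ft) + 1, ft). */
uint32_t rd_decode(const rd_sketch *d, uint32_t ft)
{
    uint32_t v = (d->dif - 1) / (d->rng / ft) + 1;
    return ft - (v < ft ? v : ft);
}

/* Step 2: update the state with the symbol's (fl,fh,ft), then normalize. */
void rd_update(rd_sketch *d, uint32_t fl, uint32_t fh, uint32_t ft)
{
    uint32_t r = d->rng / ft;
    uint32_t s = r * (ft - fh);
    d->dif -= s;
    d->rng = fl > 0 ? r * (fh - fl) : d->rng - s;
    rd_normalize(d);
}
```

A caller would invoke rd_init() once per frame, then alternate rd_decode() to obtain fs, look up the symbol whose range contains fs, and call rd_update() with that symbol's three-tuple.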
The functions ec_dec_uint() and ec_dec_bits() are based on ec_decode() and
decode one of N equiprobable symbols, each with a frequency of 1,
where N may be as large as 2^32-1. Because ec_decode() is limited to
a total frequency of 2^16-1, this is done by decoding a series of
symbols in smaller contexts.
ec_dec_bits() (entdec.c) is defined, like
ec_decode_bin(), to take a single parameter ftb, with ftb < 32,
and produces an ftb-bit decoded integer value, t,
initialized to zero. While ftb is greater than 8, it decodes the next
8 most significant bits of the integer, s = ec_decode_bin(8), updates
the decoder state with the three-tuple (s,s+1,256), adds those bits to
the current value of t, t = t<<8 | s, and subtracts 8 from ftb. Then
it decodes the remaining bits of the integer, s = ec_decode_bin(ftb),
updates the decoder state with the three-tuple (s,s+1,1<<ftb), and adds
those bits to the final value of t, t = t<<ftb | s.
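The way ec_dec_bits() assembles a value from chunks of at most 8 bits can be sketched as follows. To keep the example self-contained, the range decoder is replaced by a hypothetical callback, chunk_fn, which stands in for "ec_decode_bin(n) followed by the state update"; dec_bits_sketch(), chunk_replay and replay_next() are likewise names invented for this illustration.

```c
#include <stdint.h>

/* Hypothetical stand-in for ec_decode_bin(nbits) plus the state update:
   returns the next uniformly distributed nbits-bit value. */
typedef unsigned (*chunk_fn)(void *ctx, int nbits);

/* Assembles an ftb-bit value (ftb < 32), most significant chunks first. */
uint32_t dec_bits_sketch(chunk_fn next, void *ctx, int ftb)
{
    uint32_t t = 0;
    while (ftb > 8) {
        t = (t << 8) | next(ctx, 8);   /* next 8 most significant bits */
        ftb -= 8;
    }
    return (t << ftb) | next(ctx, ftb); /* remaining ftb bits */
}

/* Demo source that replays already-decoded chunks from an array. */
typedef struct {
    const unsigned *vals;
    int pos;
} chunk_replay;

unsigned replay_next(void *ctx, int nbits)
{
    chunk_replay *c = (chunk_replay *)ctx;
    (void)nbits;                        /* the replayed chunks are pre-sized */
    return c->vals[c->pos++];
}
```

With ftb = 20 and replayed chunks 0xAB, 0xCD and 0x5, the sketch reads two 8-bit chunks and one 4-bit chunk and returns 0xABCD5.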
ec_dec_uint() (entdec.c) takes a single parameter,
ft, which is not necessarily a power of two, and returns an integer,
t, with a value between 0 and ft-1, inclusive, which is initialized to zero. Let
ftb be the location of the highest 1 bit in the two's-complement
representation of (ft-1), or -1 if no bits are set. If ftb>8, then
the top 8 bits of t are decoded using t = ec_decode((ft-1>>ftb-8)+1),
the decoder state is updated with the three-tuple
(t,t+1,(ft-1>>ftb-8)+1), and the remaining bits are decoded with
t = t<<ftb-8|ec_dec_bits(ftb-8). If, at this point, t >= ft, then
the current frame is corrupt, and decoding should stop. If the
original value of ftb was not greater than 8, then t is decoded with
t = ec_decode(ft), and the decoder state is updated with the
three-tuple (t,t+1,ft).
The bit allocation routines in CELT need to be able to determine a
conservative upper bound on the number of bits that have been
decoded from the current frame thus far. This drives allocation
decisions which must match those made in the encoder. This bound is
computed in the reference implementation to fractional bit precision
by the function ec_dec_tell() (rangedec.c). Like all
operations in the range decoder, it must be implemented in a
bit-exact manner, and must produce exactly the same value returned by
ec_enc_tell() after encoding the same symbols.
The energy of each band is extracted from the bit-stream in two steps according
to the same coarse-fine strategy used in the encoder. First, the coarse energy is
decoded in unquant_coarse_energy() (quant_bands.c)
based on the probability of the Laplace model used by the encoder.
After the coarse energy is decoded, the same allocation function as used in the
encoder is called. This determines the number of
bits to decode for the fine energy quantization. The decoding of the fine energy bits
is performed by unquant_fine_energy() (quant_bands.c).
Finally, like the encoder, the remaining bits in the stream (that would otherwise go unused)
are decoded using unquant_energy_finalise() (quant_bands.c).
If the pitch bit is set, then the pitch period is extracted from the bit-stream. The pitch
gain bits are extracted within the PVQ decoding as encoded by the encoder. When the folding
bit is set, the folding prediction is computed in exactly the same way as the encoder,
with the same gain, by the function intra_fold() (vq.c).
In order to correctly decode the PVQ codewords, the decoder must perform exactly the same
bits to pulses conversion as the encoder.
The decoding of the codeword from the index is implemented in the function
decode_pulses() (cwrs.c).
The spherical codebook is decoded by alg_unquant() (vq.c).
The index of the PVQ entry is obtained from the range coder and converted to
a pulse vector by decode_pulses() (cwrs.c).
The decoded normalized vector for each band is equal to
X' = p' + g_f * y,
where
g_f = ( sqrt( (y^T*p')^2 + ||y||^2*(1-||p'||^2) ) - y^T*p' ) / ||y||^2,
and p' = g_a * p.
This operation is implemented in mix_pitch_and_residual() (vq.c),
which is the same function as used in the encoder.
Just like each band was normalized in the encoder, the last step of the decoder before
the inverse MDCT is to denormalize the bands. Each decoded normalized band is
multiplied by the square root of the decoded energy. This is done by denormalise_bands()
(bands.c).
The inverse MDCT implementation has no special characteristics. The
input is N frequency-domain samples and the output is 2*N time-domain
samples, while scaling by 1/2. The output is windowed using the same
low-overlap window
as the encoder. The IMDCT and windowing are performed by mdct_backward()
(mdct.c). If a time-domain pre-emphasis
window was applied in the encoder, the (inverse) time-domain de-emphasis window
is applied on the IMDCT result. After the overlap-add process,
the signal is de-emphasized using the inverse of the pre-emphasis filter
used in the encoder: 1/A(z)=1/(1-alpha_p*z^-1).
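In the time domain, the filter 1/A(z) = 1/(1-alpha_p*z^-1) corresponds to the recursion y[n] = x[n] + alpha_p*y[n-1]. The sketch below illustrates that recursion; the function name deemphasis_sketch() and the explicit state variable mem are assumptions made for this example, and the value of alpha_p is left as a parameter.

```c
/* Hypothetical sketch of y[n] = x[n] + alpha_p * y[n-1].
   'mem' carries y[n-1] from one frame to the next and should start at 0. */
void deemphasis_sketch(const float *x, float *y, int n,
                       float alpha_p, float *mem)
{
    int i;
    float prev = *mem;
    for (i = 0; i < n; i++) {
        prev = x[i] + alpha_p * prev;
        y[i] = prev;
    }
    *mem = prev;   /* saved for the first sample of the next frame */
}
```

With alpha_p = 0.5 and a unit impulse as input, the output decays as 1, 0.5, 0.25, and so on, which is the impulse response of the one-pole filter.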
Packet loss concealment (PLC) is an optional decoder-side feature which
SHOULD be included when transmitting over an unreliable channel. Because
PLC is not part of the bit-stream, there are several possible ways to
implement PLC with different complexity/quality trade-offs. The PLC in
the reference implementation finds a periodicity in the decoded
signal and repeats the windowed waveform using the pitch offset. The windowed
waveform is overlapped in such a way as to preserve the time-domain aliasing
cancellation with the previous frame and the next frame. This is implemented
in celt_decode_lost() (celt.c).
A potential denial-of-service threat exists for data encodings using
compression techniques that have non-uniform receiver-end
computational load. The attacker can inject pathological datagrams
into the stream which are complex to decode and cause the receiver to
become overloaded. However, this encoding does not exhibit any
significant non-uniformity.
With the exception of the first four bits, the bit-stream produced by
CELT for an unknown audio stream is not easily predictable, due to the
use of entropy coding. This should make CELT less vulnerable to attacks
based on plaintext guessing when encryption is used. Also, since almost
all possible bit combinations can be interpreted as a valid bit-stream,
it is likely more difficult to determine from the decrypted bit-stream
whether a guessed decryption key is valid.
When operating CELT in variable-bitrate (VBR) mode, some of the
properties described above no longer hold. More specifically, the size
of the packet leaks a very small, but non-zero, amount of information
about both the original signal and the bit-stream plaintext.
This document has no actions for IANA.
The authors would also like to thank the CELT users who contributed patches, bug reports, feature requests, suggestions or comments.
Key words for use in RFCs to Indicate Requirement Levels
RTP: A Transport Protocol for Real-Time Applications
A High-Quality Speech and Audio Codec With Less Than 10 ms Delay
A Full-Bandwidth Audio Codec with Low Complexity and Very Low Delay
The CELT ultra-low delay audio codec
Modified Discrete Cosine Transform
Range encoding: An algorithm for removing redundancy from a digitised message
Source coding algorithms for fast data compression
A Pyramid Vector Quantizer
This appendix contains the complete source code for a floating-point
reference implementation of the CELT codec written in C. This
implementation is derived from version 0.6.1 of the CELT implementation,
which can be compiled for
either floating-point or fixed-point architectures.
The implementation can be compiled with either a C89 or a C99
compiler. It is reasonably optimized for most platforms such that
only architecture-specific optimizations are likely to be useful.
The FFT used is a slightly modified version of the KISS-FFT package,
but it is easy to substitute any other FFT library.
The testcelt executable can be used to test the encoding and decoding
process:
[ [packet loss rate]]
where "rate" is the sampling rate in Hz, "channels" is the number of
channels (1 or 2), "frame size" is the number of samples in a frame
(64 to 1024) and "octets per packet" is the number of octets desired for each
compressed frame. The input and output files are assumed to be 16-bit
PCM files in the machine's native endianness. The optional "complexity" argument
selects the quality vs. complexity trade-off (0-10) and the "packet loss rate"
argument simulates random packet loss (the argument is in tenths of a percent).