Coding Tools for a Next Generation Video CodecMozilla Corporation331 E. Evelyn AvenueMountain ViewCA94041USA+1 650 903-0800tterribe@xiph.org
RAI
netvc
This document proposes a number of coding tools that could be incorporated into
a next-generation video codec.
One of the biggest contributing factors to the success of the Internet is that
the underlying protocols are implementable on a royalty-free basis.
This allows them to be implemented widely and easily distributed by application
developers, service operators, and end users, without asking for permission.
In order to produce a next-generation video codec that is competitive with the
best patent-encumbered standards, yet avoids patents which are not available
on an open-source compatible, royalty-free basis, we must use old coding tools
in new ways and develop new coding tools.
This draft documents some of the tools we have been working on for inclusion in
such a codec.
This is early work, and the performance of some of these tools (especially in
relation to other approaches) is not yet fully known.
Nevertheless, it still serves to outline some possibilities an eventual working
group, if formed, could consider.
The basic theory of entropy coding was well-established by the late
1970's .
Modern video codecs have focused on Huffman codes (or "Variable-Length
Codes"/VLCs) and binary arithmetic coding.
Huffman codes are limited in the amount of compression they can provide and the
design flexibility they allow, but as each code word consists of an integer
number of bits, their implementation complexity is very low, so they were
provided at least as an option in every video codec up through H.264.
Arithmetic coding, on the other hand, uses code words that can take up
fractional parts of a bit, and are more complex to implement.
However, the prevalence of cheap, H.264 High Profile hardware, which requires
support for arithmetic coding, shows that it is no longer so expensive that a
fallback VLC-based approach is required.
Having a single entropy-coding method simplifies both up-front design costs and
interoperability.
However, the primary limitation of arithmetic coding is that it is an
inherently serial operation.
A given symbol cannot be decoded until the previous symbol is decoded, because
the bits (if any) that are output depend on the exact state of the decoder at
the time it is decoded.
This means that a hardware implementation must run at a sufficiently high clock
rate to be able to decode all of the symbols in a frame.
Higher clock rates lead to increased power consumption, and in some cases the
entropy coding is actually becoming the limiting factor in these designs.
As fabrication processes improve, implementers are very willing to trade
increased gate count for lower clock speeds.
So far, most approaches to allowing parallel entropy coding have focused on
splitting the encoded symbols into multiple streams that can be decoded
independently.
This "independence" requirement has a non-negligible impact on compression,
parallelizability, or both.
For example, H.264 can split frames into "slices" which might cover only a
small subset of the blocks in the frame.
In order to allow decoding these slices independently, they cannot use context
information from blocks in other slices (harming compression).
Those contexts must adapt rapidly to account for the generally small number of
symbols available for learning probabilities (also harming compression).
In some cases the number of contexts must be reduced to ensure enough symbols
are coded in each context to usefully learn probabilities at all (once more,
harming compression).
Furthermore, an encoder must specially format the stream to use multiple slices
per frame to allow any parallel entropy decoding at all.
Encoders rarely have enough information to evaluate this "compression
efficiency" vs. "parallelizability" trade-off, since they don't generally know
the limitations of the decoders for which they are encoding.
That means there will be many files or streams which could have been decoded if
they were encoded with different options, but which a given decoder cannot
decode because of bad choices made by the encoder (at least from the
perspective of that decoder).
The same set of drawbacks apply to the DCT token partitions in
VP8 .
Instead, we propose a very different approach: use non-binary arithmetic
coding.
In binary arithmetic coding, each decoded symbol has one of two possible
values: 0 or 1.
The original arithmetic coding algorithms allow a symbol to take on any number
of possible values, and allow the size of that alphabet to change with each
symbol coded.
Reasonable values of N (for example, N <= 16) offer the potential
for a decent throughput increase for a reasonable increase in gate count for
hardware implementations.
Binary coding allows a number of computational simplifications.
For example, for each coded symbol, the set of valid code points is partitioned
in two, and the decoded value is determined by finding the partition in which
the actual code point that was received lies.
This can be determined by computing a single partition value (in both the
encoder and decoder) and (in the decoder) doing a single comparison.
A non-binary arithmetic coder partitions the set of valid code points
into multiple pieces (one for each possible value of the coded symbol).
This requires the encoder to compute two partition values, in general (for both
the upper and lower bound of the symbol to encode).
The decoder, on the other hand, must search the partitions for the one that
contains the received code point.
This requires computing at least O(log N) partition values.
However, coding a parameter with N possible values with a binary arithmetic
coder requires O(log N) symbols in the worst case (the only case
that matters for hardware design).
Hence, this does not represent any actual savings (indeed, it represents an
increase in the number of partition values computed by the encoder).
In addition, there are a number of overheads that are per-symbol, rather than
per-value.
For example, renormalization (which enlarges the set of valid code points after
partitioning has reduced it too much), carry propagation (to deal with the
case where the high and low ends of a partition straddle a bit boundary),
etc., are all performed on a symbol-by-symbol basis.
Since a non-binary arithmetic coder codes a given set of values with fewer
symbols than a binary one, it incurs these per-symbol overheads less often.
This suggests that a non-binary arithmetic coder can actually be more efficient
than a binary one.
The other aspect that binary coding simplifies is probability modeling.
In arithmetic coding, the size of the sets the code points are partitioned into
are (roughly) proportional to the probability of each possible symbol value.
Estimating these probabilities is part of the coding process, though it can be
cleanly separated from the task of actually producing the coded bits.
In a binary arithmetic coder, this requires estimating the probability of only
one of the two possible values (since the total probability is 1.0).
This is often done with a simple table lookup that maps the old probability and
the most recently decoded symbol to a new probability to use for the next
symbol in the current context.
The trade-off, of course, is that non-binary symbols must be "binarized" into
a series of bits, and a context (with an associated probability) chosen for
each one.
In a non-binary arithmetic coder, the decoder must compute at least
O(log N) cumulative probabilities (one for each partition value it
needs).
Because these probabilities are usually not estimated directly in "cumulative"
form, this can require computing (N - 1) non-cumulative probability
values.
Unless N is very small, these cannot be updated with a single table lookup.
The normal approach is to use "frequency counts".
Define the frequency of value k to be
where A and B are parameters (usually A=2 and B=1 for a traditional
Krichevsky-Trofimov estimator).
The resulting probability, p[k], is given by
When ft grows too large, the frequencies are rescaled (e.g., halved, rounding
up to prevent reduction of a probability to 0).
When ft is not a power of two, partitioning the code points requires actual
divisions (see Section 4.1 for one detailed
example of exactly how this is done).
These divisions are acceptable in an audio codec like
Opus , which only has to code a few hundreds of
these symbols per second.
But video requires hundreds of thousands of symbols per second, at a minimum,
and divisions are still very expensive to implement in hardware.
There are two possible approaches to this.
One is to come up with a replacement for frequency counts that produces
probabilities that sum to a power of two.
Some possibilities, which can be applied individually or in combination:
Use probabilities that are fixed for the duration of a frame.
This is the approach taken by VP8, for example, even though it uses a binary
arithmetic coder.
In fact, it is possible to convert many of VP8's existing binary-alphabet
probabilities into probabilities for non-binary alphabets, an approach that is
used in the experiment presented at the end of this section.
Use parametric distributions.
For example, DCT coefficient magnitudes usually have an approximately
exponential distribution.
This distribution can be characterized by a single parameter, e.g., the
expected value.
The expected value is trivial to update after decoding a coefficient.
For example
produces an exponential moving average with a decay factor of
(1 - C).
For a choice of C that is a negative power of two (e.g., 1/16 or 1/32 or
similar), this can be implemented with two adds and a shift.
Given this expected value, the actual distribution to use can be obtained from
a small set of pre-computed distributions via a lookup table.
Linear interpolation between these pre-computed values can improve accuracy, at
the cost of O(N) computations, but if N is kept small this is trivially
parallelizable, in SIMD or otherwise.
Change the frequency count update mechanism so that ft is constant.
For example, let
be the cumulative frequency of all symbol values less than k and
be the elementary change in the cumulative frequency count fl[k] caused by
adding 1 to f[i].
Then one possible update formula after decoding the value i is
where D is a negative power of two chosen such that
floor(D*ft) == (N + F) .
This ensures that ft == fl[N] == fl[N]' is a constant.
This requires O(N) operations, but the arithmetic is very simple (given the
freedom to choose D and F, and to some extent N), and trivially
parallelizable, in SIMD or otherwise.
The downside is the addition of the value k at each step.
This is necessary to ensure that the probability of an individual symbol
(fl[k+1] - fl[k]) is never reduced to zero.
However it is equivalent to mixing in a uniform distribution with counts that
are otherwise an exponential moving average.
That means that ft and F must be sufficiently large, or there will be an
adverse impact on coding efficiency.
The upside is that F*e[i] may be replaced by any monotonically non-decreasing
vector whose Nth element is F.
That is, instead of just incrementing the probability of symbol i, it can
increase the probability of values that are highly correlated with i.
E.g., this allows decoding value i to apply a small probability increase
to the neighboring values (i - 1) and (i + 1), in addition
to a large probability increase to the value i.
This may help, for example, in motion vector coding, and is much more sensible
than the approach taken with binary context modeling, which often does things
like "increase the probability of all even values when decoding a 6" because
the same context is always used to code the least significant bit.
Rather than changing the context modeling, the other approach is to change the
function used to partition the set of valid code points so that it does not
need a division, even when ft is not a power of two.
Let the range of valid code points in the current arithmetic coder state be
[L, L + R), where L is the lower bound of the range and R is
the number of valid code points.
Assume that ft <= R < 2*ft (this is easy to enforce
with the normal rescaling operations used with frequency counts).
Then one possible partition function is
so that the new range after coding symbol k is
[L + r[k], L + r[k+1]).
This is a variation of the partition function proposed
by .
The size of the new partition (r[k+1] - r[k]) is no longer truly
proportional to R*p[k].
This partition function counts values of fl[k] smaller than R - ft
double compared to values larger than R - ft.
This over-estimates the probability of symbols at the start of the alphabet
and underestimates the probability of symbols at the end of the alphabet.
The amount of the range allocated to a symbol can be off by up to a factor of
2 compared to its fair share, implying a peak error as large as one bit per
symbol.
However, if the probabilities are accurate and the symbols being coded are
independent, the average inefficiency introduced will be as low as
log2(log2(e)*2/e) ~= 0.0861 bits per symbol.
This error can, of course, be reduced by coding fewer symbols with larger
alphabets.
In practice the overhead is roughly equal to the overhead introduced by other
approximate arithmetic coders like H.264's CABAC.
However, probabilities near one-half tend to have the most overhead.
In fact, probabilities in the range of 40% to 60% for a binary symbol may not
be worth modeling, since the compression gains may be entirely countered
by the added overhead, making it cheaper and faster to code such values as
raw bits.
This problem is partially alleviated by using larger alphabets.
A slightly more complicated partition function can reduce the overhead while
still avoiding the division.
This is done by splitting things into two cases:
Case 1: ft - (R - ft)]]>
That is, we have more values that are double-counted than single-counted.
In this case, we still double-count the first 2*R - 3*ft values,
but after that we alternate between single-counting and double-counting
for the rest.
Case 2:
That is, we have more values that are single-counted than double-counted.
In this case, we alternate between single-counting and double-counting for
the first 2*(R - ft) values, and single-count the rest.
For two equiprobable symbols in different places in the alphabet, this
reduces the maximum ratio of over-estimation to under-estimation from 2:1
for the previous partition function to either 4:3 or 3:2 (for each of the
two cases above, respectively), assuming symbol probabilities significantly
greater than the minimum possible.
That reduces the worst-case per-symbol overhead from 1 bit to 0.58 bits.
The resulting reduced-overhead partition function is
Here, e is a value that is greater than 0 in case 1, and 0 in case 2.
This function is about three times as expensive to evaluate as the
high-overhead version, but still an order of magnitude cheaper than a
division, since it is composed of very simple operations.
In practice it reduces the overhead by about 0.3% of the total bitrate.
It also tends to produce R values with a more uniform distribution compared
to the high-overhead version, which tends to have peaks in the distribution
of R at specific values (see for a discussion of this
effect).
Overall, it makes it more likely that the compression gains from
probabilities near one-half are not eliminated by the approximation
overhead, increasing the number of symbols that can be usefully modeled.
It is an open question whether or not these benefits are worth the increase
in computational complexity.
As a simple experiment to validate the non-binary approach, we compared a
non-binary arithmetic coder to the VP8 (binary) entropy coder.
This was done by instrumenting vp8_treed_read() in libvpx to dump out the
symbol decoded and the associated probabilities used to decode it.
This data only includes macroblock mode and motion vector information, as the
DCT token data is decoded with custom inline functions, and not
vp8_treed_read().
This data is available at
.
It includes 1,019,670 values encode using 2,125,995 binary symbols
(or 2.08 symbols per value).
We expect that with a conscious effort to group symbols during the codec
design, this average could easily be increased.
We then implemented both the regular VP8 entropy decoder (in plain C, using all
of the optimizations available in libvpx at the time) and a multisymbol
entropy decoder (also in plain C, using similar optimizations), which encodes
each value with a single symbol.
For the decoder partition search in the non-binary decoder, we used a simple
for loop (O(N) worst-case), even though this could be made constant-time and
branchless with a few SIMD instructions such as (on x86) PCMPGTW, PACKUSWB,
and PMOVMASKB followed by BSR.
The source code for both implementations is available at
(compile with -DEC_BINARY for the binary version and -DEC_MULTISYM for the
non-binary version).
The test simply loads the tokens, and then loops 1024 times encoding them using
the probabilities provided, and then decoding them.
The loop was added to reduce the impact of the overhead of loading the data,
which is implemented very inefficiently.
The total runtime on a Core i7 from 2010 is 53.735 seconds for the binary
version, and 27.937 seconds for the non-binary version, or a 1.92x
improvement.
This is very nearly equal to the number of symbols per value in the binary
coder, suggesting that the per-symbol overheads account for the vast majority
of the computation time in this implementation.
Integer transforms in image and video coding date back to at least
1969 .
Although standards such as MPEG2 and MPEG4 Part 2 allow some flexibility
in the transform implementation, implementations were subject to drift and
error accumulation, and encoders had to impose special macroblock refresh
requirements to avoid these problems, not always successfully.
As transforms in modern codecs only account for on the order of 10% of the
total decoder complexity, and, with the use of weighted prediction with gains
greater than unity and intra prediction, are far more susceptible to drift and
error accumulation, it no longer makes sense to allow a non-exact transform
specification.
However, it is also possible to make such transforms "reversible", in the sense
that applying the inverse transform to the result of the forward transform
gives back the original input values, exactly.
This gives a lossy codec, which normally quantizes the coefficients before
feeding them into the inverse transform, the ability to scale all the way to
lossless compression without requiring any new coding tools.
This approach has been used successfully by JPEG XR, for
example .
Such reversible transforms can be constructed using "lifting steps", a series
of shear operations that can represent any set of plane rotations, and thus
any orthogonal transform.
This approach dates back to at least 1992 , which
used it to implement a four-point 1-D Discrete Cosine Transform (DCT).
Their implementation requires 6 multiplications, 10 additions,
2 shifts, and 2 negations, and produces output that is a factor of
sqrt(2) larger than the orthonormal version of the transform.
The expansion of the dynamic range directly translates into more bits to code
for lossless compression.
Because the least significant bits are usually very nearly random noise, this
scaling increases the coding cost by approximately half a bit per sample.
To demonstrate the idea of lifting steps, consider the two-point transform
This can be implemented up to scale via
and reversed via
Both y0 and y1 are too large by a factor of sqrt(2), however.
It is also possible to implement any rotation by an angle t, including the
orthonormal scale factor, by decomposing it into three steps:
By letting t=-pi/4, we get an implementation of the first transform that
includes the scaling factor.
To get an integer approximation of this transform, we need only replace the
transcendental constants by fixed-point approximations:
This approximation is still perfectly reversible:
Each of the three steps can be implemented using just two ARM instructions,
with constants that have up to 14 bits of precision (though using fewer
bits allows more efficient hardware implementations, at a small cost in coding
gain).
However, it is still much more complex than the first approach.
We can get a compromise with a slight modification:
This still only implements the original orthonormal transform up to scale.
The y0 coefficient is too large by a factor of sqrt(2) as before, but y1 is now
too small by a factor of sqrt(2).
If our goal is simply to (optionally quantize) and code the result, this is
good enough.
The different scale factors can be incorporated into the quantization matrix in
the lossy case, and the total expansion is roughly equivalent to that of the
orthonormal transform in the lossless case.
Plus, we can perform each step with just one ARM instruction.
However, if instead we want to apply additional transformations to the data, or
use the result to predict other data, it becomes much more convenient to have
uniformly scaled outputs.
For a two-point transform, there is little we can do to improve on the
three-multiplications approach above.
However, for a four-point transform, we can use the last approach and arrange
multiple transform stages such that the "too large" and "too small" scaling
factors cancel out, producing a result that has the true, uniform, orthonormal
scaling.
To do this, we need one more tool, which implements the following transform:
This takes unevenly scaled inputs, rescales them, and then rotates them.
Like an ordinary rotation, it can be reduced to three lifting steps:
As before, the transcendental constants may be replaced by fixed-point
approximations without harming the reversibility property.
Using the tools from the previous section, we can design a reversible integer
four-point DCT approximation with uniform, orthonormal scaling.
This requires 3 multiplies, 9 additions, and 2 shifts (not
counting the shift and rounding offset used in the fixed-point multiplies, as
these are built into the multiplier).
This is significantly cheaper than the approach, and
the output scaling is smaller by a factor of sqrt(2), saving half a bit per
sample in the lossless case.
By comparison, the four-point forward DCT approximation used in VP9, which is
not reversible, uses 6 multiplies, 6 additions, and 2 shifts
(counting shifts and rounding offsets which cannot be merged into a single
multiply instruction on ARM).
Four of its multipliers also require 28-bit accumulators, whereas this proposal
can use much smaller multipliers without giving up the reversibility property.
The total dynamic range expansion is 1 bit: inputs in the range [-256,255)
produce transformed values in the range [-512,510).
This is the smallest dynamic range expansion possible for any reversible
transform constructed from mostly-linear operations.
It is possible to make reversible orthogonal transforms with no dynamic range
expansion by using "piecewise-linear" rotations ,
but each step requires a large number of operations in a software
implementation.
Pseudo-code for the forward transform follows:
Even though there are three asymmetrically scaled rotations by pi/4, by careful
arrangement we can share one of the shift operations (to help software
implementations: shifts by a constant are basically free in hardware).
This technique can be used to even greater effect in larger transforms.
The inverse transform is constructed by simply undoing each step in turn:
Although the right shifts make this transform non-linear, we can compute
"basis functions" for it by sending a vector through it with a single value
set to a large constant (256 was used here), and the rest of the values set to
zero.
The true basis functions for a four-point DCT (up to five digits) are
The corresponding basis functions for our reversible, integer DCT, computed
using the approximation described above, are
The mean squared error (MSE) of the output, compared to a true DCT, can be
computed with some assumptions about the input signal.
Let G be the true DCT basis and G' be the basis for our integer approximation
(computed as described above).
Then the error in the transformed results is
where D = (G - G') .
The MSE is then
where Rxx is the autocorrelation matrix of the input signal.
Assuming the input is a zero-mean, first-order autoregressive (AR(1)) process
gives an autocorrelation matrix of
for some correlation coefficient rho.
A value of rho = 0.95 is typical for image compression applications.
Smaller values are more normal for motion-compensated frame differences, but
this makes surprisingly little difference in transform design.
Using the above procedure, the theoretical MSE of this approximation is
1.230E-6, which is below the level of the truncation error introduced by the
right shift operations.
This suggests the dynamic range of the input would have to be more than
20 bits before it became worthwhile to increase the precision of the
constants used in the multiplications to improve accuracy, though it may be
worth using more precision to reduce bias.
The same techniques can be applied to construct a reversible eight-point DCT
approximation with uniform, orthonormal scaling using 15 multiplies,
31 additions, and 5 shifts.
It is possible to reduce this to 11 multiplies and 29 additions,
which is the minimum number of multiplies possible for an eight-point DCT with
uniform scaling , by introducing a scaling factor
of sqrt(2), but this harms lossless performance.
The dynamic range expansion is 1.5 bits (again the smallest possible), and
the MSE is 1.592E-06.
By comparison, the eight-point transform in VP9 uses 12 multiplications,
32 additions, and 6 shifts.
Similarly, we have constructed a reversible sixteen-point DCT approximation
with uniform, orthonormal scaling using 33 multiplies, 83 additions,
and 16 shifts.
This is just 2 multiplies and 2 additions more than the
(non-reversible, non-integer, but uniformly scaled) factorization
in .
By comparison, the sixteen-point transform in VP9 uses 44 multiplies,
88 additions, and 18 shifts.
The dynamic range expansion is only 2 bits (again the smallest possible),
and the MSE is 1.495E-5.
We also have a reversible 32-point DCT approximation with uniform,
orthonormal scaling using 87 multiplies, 215 additions, and
38 shifts.
By comparison, the 32-point transform in VP9 uses 116 multiplies,
194 additions, and 66 shifts.
Our dynamic range expansion is still the minimal 2.5 bits, and the MSE is
8.006E-05
Code for all of these transforms is available in the development repository
listed in .
These techniques can also be applied to constructing Walsh-Hadamard
Transforms, another useful transform family that is cheaper to implement than
the DCT (since it requires no multiplications at all).
The WHT has many applications as a cheap way to approximately change the time
and frequency resolution of a set of data (either individual bands, as in the
Opus audio codec, or whole blocks).
VP9 uses it as a reversible transform with uniform, orthonormal scaling for
lossless coding in place of its DCT, which does not have these properties.
Applying a 2x2 WHT to a block of 2x2 inputs involves running a 2-point WHT on
the rows, and then another 2-point WHT on the columns.
The basis functions for the 2-point WHT are, up to scaling, [1, 1] and
[1, -1].
The four variations of a two-step lifer given in
are exactly the lifting steps needed to
implement a 2x2 WHT: two stages that produce asymmetrically scaled outputs
followed by two stages that consume asymmetrically scaled inputs.
By simply re-ordering the operations, we can see that there are two shifts that
may be shared between the two stages:
By eliminating the double-negation of x11 and re-ordering the additions to it,
we can see even more operations in common:
Simplifying further, the whole transform may be computed with just
7 additions and 1 shift:
This is a significant savings over other approaches described in the
literature, which require 8 additions, 2 shifts, and
1 negation (37.5% more operations), or
10 additions, 1 shift, and
2 negations (62.5% more operations).
The same operations can be applied to compute a 4-point WHT in one dimension.
This implementation is used in this way in VP9's lossless mode.
Since larger WHTs may be trivially factored into multiple smaller WHTs, the
same approach can implement a reversible, orthonormally scaled WHT of any size
(2**N)x(2**M), so long as (N + M) is even.
The tools presented here were developed as part of Xiph.Org's Daala project.
They are available, along with many others in greater and lesser states of
maturity, in the Daala git repository at
.
See for more information.
This document has no actions for IANA.
Thanks to Nathan Egge, Gregory Maxwell, and Jean-Marc Valin for their
assistance in the implementation and experimentation, and in preparing this
draft.
New Networks for Perfect Inversion and Perfect ReconstructionLossless 8-point Fast Discrete Cosine Transform Using Lossless
Hadamard TransformPractical Fast 1-D DCT Algorithms with 11 MultiplicationsSource Coding Algorithms for Fast Data CompressionHadamard Transform Image CodingOn Unitary Transform ApproximationsAn Improved N-Bit to N-Bit Reversible Haar-Like TransformPiecewise Integer Mapping for Arithmetic CodingLow-complexity Hierarchical Lapped Transform for Lossy-to-Lossless
Image Coding in JPEG XR/HD Photo