John McGowan's AVI Overview: Video Compression Technologies



There are several underlying technologies used by different Video for
Windows Codecs.  For example, Indeo 3.2 and Cinepak both use Vector
Quantization.  The international standards MPEG-1, MPEG-2, MPEG-4,
H.261, and H.263 all use a combination of the block Discrete Cosine
Transform (DCT) and motion estimation/compensation.  Several of the
New Wave codecs use wavelet transform based image compression (the
Discrete Wavelet Transform or DWT).  Other technologies include
Fractal Image Compression, represented by Iterated Systems.

Some general comments on image and video compression:

(1) Image compression may be lossless, where no information is lost
during the compression process.  The image produced by the
decompression (also known as decoding) process is identical bit for bit
to the original image.  The widely used GIF format is a lossless image
and video (GIF89a or animated GIF) compression format.

LOSSLESS COMPRESSION

(2) Image compression may be lossy, where information is lost during
the compression process.  These schemes exploit limitations of the
human visual system.  Some errors are undetectable by the human eye:
even though two images differ at the bit level, the human viewer cannot
distinguish them.  Some errors are detectable by the human eye but
acceptable.  Some errors are detectable and very annoying.  The widely
used JPEG image compression standard is a lossy compression scheme.

LOSSY COMPRESSION

(3) Within lossy image and video compression, a compression scheme may
be perceptually lossless, in which case the human viewer cannot
distinguish between the original image or video and the decompressed
version, even though errors have been introduced by the lossy
compression.  Most lossy image and video compression schemes have some
sort of quality factor or factors.  If the quality is good enough, the
image will be perceptually lossless.

PERCEPTUALLY LOSSLESS COMPRESSION

(4) JPEG, MPEG, and other lossy compression schemes often compress
images and video beyond the point of perceptual losslessness, but the
compressed images and video are still acceptable to the human viewer.
If the compression and decompression degrade the image in a way that is
very similar or identical to the natural degradation of images that
occurs in the world, then the human visual system will not object
greatly.

Loss of fine detail in an image is often acceptable because humans
perceive objects in the natural world with widely varying levels of
detail depending on how close the human viewer is to the object and
whether the human viewer is looking directly at the object or not.
The human viewer sees less detail if an object is further away.
When a human viewer looks directly at an object, the viewer uses a
small very high resolution part of the retina.  If an object is to one
side of the direction of view, the viewer is using lower resolution
parts of the retina.  Human beings are also used to certain natural
forms of degradation such as rain, snow, and fog.

Note that in all these natural viewing situations, the human viewer
will still perceive sharp edges and lines in an image regardless of the
level of detail.  The human viewer will usually perceive the objects as
the same object despite the variations in level of detail.  A horse is
a horse is a horse.

NATURALLY LOSSY COMPRESSION

(5) Sufficiently low quality lossy compression will introduce visual
artifacts that are highly annoying to the human viewer.  An example is
the blocking artifacts visible in highly compressed MPEG video and
other block Discrete Cosine Transform based image compression codecs.
At some point the lossy compression will introduce artifacts that are
very unnatural and are perceived as new objects in the scene or
spurious lines within the image.

The human visual system is very sensitive to lines and edges.  One of
its main functions appears to be to detect and characterize physical
objects such as other people, potential threats such as predators, food
plants, and other things.  Objects in the visual system are delineated
by edges.  Anything, such as a codec algorithm, that destroys or
creates an edge in an image is noticed, particularly if the edge is
interpreted as the border of an object by the human visual and
cognitive system.

UNNATURAL LOSSY COMPRESSION 

All of the widely used video codecs are lossy compression algorithms.
At sufficiently high compression most of them will have problems with the
edges in the image.  Vector quantization, block Discrete Cosine Transform,
and wavelet based image and video compression inherently do not 
mathematically represent the intuitive notion of an edge or line.

THE POINT:  In using and selecting video codecs, the author of an AVI
file (or of compressed digital video in general) needs to achieve
NATURALLY LOSSY COMPRESSION or better.  Once the compression introduces
noticeable AND unnatural artifacts, the video is of very limited use,
even in cases where some features and objects are recognizable.

A basic description of video compression technologies follows.  I have
tried to avoid the dense mathematics found in most of the technical
video and image compression literature.



Run Length Encoding

VIDEO CODECS THAT USE RUN LENGTH ENCODING

Microsoft RLE (MRLE)

Run length encoding is also used to encode the DCT coefficients in the
block Discrete Cosine Transform (DCT) based international standards
MPEG, H.261, H.263, and JPEG.

STRENGTHS AND WEAKNESSES

1. Works for bilevel or 8 bit graphic images such as cel animation.

2. Not good for high resolution natural images.

OVERVIEW

Run length encoding encodes a sequence or run of consecutive pixels of
the same color (such as black or white) as a single codeword.  For
example, the sequence of pixels

    77 77 77 77 77 77 77

could be coded as

    7 77    (for seven 77's)

Run length encoding can work well for bi-level images (e.g. black and
white text or graphics) and for 8 bit images, particularly images such
as cel animations which contain many runs of the same color.  Run
length encoding does not work well for 24 bit natural images in
general.  Runs of the same color are not that common.
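As a concrete illustration, here is a minimal Python sketch (not from
any actual codec; the function names are invented for this example) of
run length encoding and decoding a list of pixel values:

    def rle_encode(pixels):
        """Encode a list of pixel values as (count, value) pairs."""
        runs = []
        i = 0
        while i < len(pixels):
            value = pixels[i]
            count = 1
            while i + count < len(pixels) and pixels[i + count] == value:
                count += 1
            runs.append((count, value))
            i += count
        return runs

    def rle_decode(runs):
        """Expand (count, value) pairs back into the original pixel list."""
        pixels = []
        for count, value in runs:
            pixels.extend([value] * count)
        return pixels

    # The run of seven 77's from the example above becomes a single pair.
    print(rle_encode([77, 77, 77, 77, 77, 77, 77]))   # [(7, 77)]
    print(rle_decode([(7, 77)]))                      # the original seven 77's

Note that this sketch is lossless: decoding the runs reproduces the
original pixels exactly.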

Vector Quantization

VIDEO CODECS THAT USE VECTOR QUANTIZATION

Indeo 3.2
Cinepak

Indeo 3.2 and Cinepak both use vector quantization.  As with most
digital video, Indeo and Cinepak work in the YUV color space (not RGB,
for example).

STRENGTHS AND WEAKNESSES

1. The encoding process is computationally intensive.  It still cannot
be done in real time without dedicated hardware.

2. The decoding process is very fast.

3. Blocking artifacts at high compression.

4. Generally, block Discrete Cosine Transform and Discrete Wavelet
Transform based image compression methods can achieve higher
compression.

OVERVIEW

The basic idea of Vector Quantization based image compression is to
divide the image up into blocks (4x4 pixels in YUV space for Indeo and
Cinepak).  Typically, some blocks (hopefully many) are similar to other
blocks, although usually not identical.  The encoder identifies a class
of similar blocks and replaces these with a "generic" block
representative of the class of similar blocks.  The encoder encodes a
lookup table that maps short binary codes to the "generic" blocks.
Typically, the shortest binary codes represent the most common classes
of blocks in the image.

The Vector Quantization (VQ) decoder uses the lookup table to assemble
an approximate image composed of the "generic" blocks in the lookup
table.  Note that this is inherently a lossy compression process
because the actual blocks are replaced with a generic block that is a
"good enough" approximation to the original block.

The encoding process is slow and computationally intensive because the
encoder must accumulate statistics on the frequency of blocks and
calculate the similarity of blocks in order to build the lookup table.
The decoding process is very quick because it is lookup table based.

In Vector Quantization, the lookup table may be called a codebook.  The
binary codes that index into the table may be called codewords.  Higher
compression is achieved by making the lookup table smaller, with fewer
classes of similar blocks in the image.  The quality of the reproduced
approximate image degrades as the lookup table becomes smaller.  Vector
Quantization is prone to blocking artifacts as compression is
increased.

Vector Quantization is an entire sub-field in signal and image
processing.  It goes well beyond the brief description above and is
applied to uses other than video compression.  The standard reference
book on Vector Quantization is:

    Vector Quantization and Signal Compression
    A. Gersho and R. Gray
    Boston, MA: Kluwer, 1992

Like many image and signal processing books, this is heavy on abstract
math.

A Simple Example

Consider the following 4 by 4 blocks of pixels.  Each pixel has a value
in the range of 0-255.  This is a grayscale image for simplicity.

    (Block 1)

    128 128 128 128
    128 128 128 128
    128 128 128 128
    128 128 128 128

    (Block 2)

    128 127 128 128
    128 128 128 128
    128 128 127 128
    128 128 128 128

    (Block 3)

    128 127 126 128
    128 128 128 128
    127 128 128 128
    128 128 128 128

In practice, the blocks will look the same to a human viewer.  The
second and third blocks could be safely replaced by the first block.
By itself, this does not compress the image.  However, the replacement
block (Block 1) could be represented by a short index into a lookup
table of 4x4 blocks.  For example, the index, in this case, could be 1.

    Lookup Table[1] = 128 128 128 128
                      128 128 128 128
                      128 128 128 128
                      128 128 128 128

The original image could be converted into a lookup table and a series
of indexes into the lookup table, achieving substantial compression.
In video, the same lookup table could be used for many frames, not just
a single frame.
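The lookup table idea can be sketched in a few lines of Python.  This
is only a toy illustration with an assumed one-entry codebook and a
brute force nearest-block search; it is not how Indeo or Cinepak
actually build or search their codebooks:

    # Toy vector quantizer for 4x4 grayscale blocks stored as flat lists
    # of 16 pixel values.  Illustrative only.

    def block_distance(a, b):
        """Sum of squared differences between two blocks."""
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def vq_encode(blocks, codebook):
        """Replace each block with the index of the closest codebook entry."""
        indexes = []
        for block in blocks:
            best = min(range(len(codebook)),
                       key=lambda i: block_distance(block, codebook[i]))
            indexes.append(best)
        return indexes

    def vq_decode(indexes, codebook):
        """Rebuild an approximate image from indexes and the codebook."""
        return [codebook[i] for i in indexes]

    block1 = [128] * 16                  # Block 1 from the example above
    block2 = [128, 127] + [128] * 14     # nearly identical block
    codebook = [block1]                  # a one-entry lookup table
    print(vq_encode([block1, block2], codebook))   # [0, 0] - both map to Block 1

Decoding the two indexes reproduces Block 1 twice, which is the lossy
"good enough" approximation described above.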

Discrete Cosine Transform

VIDEO CODECS THAT USE DCT

Motion JPEG
Editable MPEG
MPEG-1
MPEG-2
MPEG-4
H.261
H.263
H.263+

STRENGTHS AND WEAKNESSES

1. Blocking artifacts at high compression.

2. Ringing at sharp edges.  Occasional blurring at sharp edges.

3. Computationally intensive.  Only recently has it become possible to
implement in real time on general purpose CPUs as opposed to
specialized chips.

OVERVIEW

The Discrete Cosine Transform (DCT) is a widely used transform in image
compression.  The JPEG still image compression standard, the H.261
(p*64) video-conferencing standard, the H.263 video-conferencing
standard, and the MPEG (MPEG-1, MPEG-2, and MPEG-4) digital video
standards use the DCT.  In these standards, a two-dimensional (2D) DCT
is applied to 8 by 8 blocks of pixels in the image being compressed.
The 64 (8x8 = 64) coefficients produced by the DCT are then quantized
to provide the actual compression.

In typical images, most DCT coefficients from a DCT on an 8 by 8 block
of pixels are small and become zero after quantization.  This property
of the DCT on real world images is critical to the compression schemes.
In addition, the human eye is less sensitive to the high frequency
components of the image represented by the higher DCT coefficients.  A
large quantization factor can be, and usually is, applied to these
higher frequency components.  The de facto standard quantization matrix
(a matrix of 64 quantization factors, one for each of the 64 DCT
coefficients) in the JPEG standard has higher quantization factors for
higher frequency DCT coefficients.

The quantized DCT coefficients are then run-length encoded as variable
length codes that indicate some number of zero coefficients followed by
a non-zero coefficient.  For example, a run-length code might indicate
4 zero coefficients followed by a non-zero coefficient of level 2.
Short variable length codes (e.g. 0110) are used for common
combinations of runs of zeros and levels of the non-zero coefficient.
Longer variable length codes (e.g. 0000001101) are used for less common
combinations.  In this way, substantial compression of the image is
possible.

The Discrete Cosine Transform itself is best explained as a one
dimensional (1D) DCT first.  The 2D DCT is equivalent to performing a
1D DCT on each row of a block of pixels followed by a 1D DCT on each
column of the block produced by the 1D DCTs on the rows.

The one dimensional Discrete Cosine Transform is applied to a block of
N samples (pixels in an image or sound pressure samples in an audio
file).  The Discrete Cosine Transform is an NxN matrix whose rows are
sampled cosine functions:

    DCT(m,n) = sqrt( (2 - delta(m,1)) / N ) * cos( (pi/N) * (n - 1/2) * (m-1) )

    where

    DCT(m,n) is the 1D DCT matrix
    m,n = 1,...,N
    pi = 3.14159265...
    N = number of samples in the block
    delta(m,1) = 1 if m is 1, 0 otherwise
    cos(x) = cosine of x (radians)
    * = multiply

Naively, performing a DCT on a block of N samples would require N*N
multiplies and adds.  However, the DCT matrix has a recursive structure
that allows implementation with order N log(N) multiplies and adds
(many fewer).  This makes the DCT practical for implementation on
current CPUs and DSPs.
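A short Python sketch of the 1D DCT matrix and its naive N*N
application, following the formula above, may help make the transform
concrete.  This is illustrative code only, not an implementation from
any of the standards, and it omits the fast N log(N) algorithms used in
practice:

    import math

    def dct_matrix(N):
        """Build the N x N 1D DCT matrix from the formula above (m, n = 1..N)."""
        matrix = []
        for m in range(1, N + 1):
            scale = math.sqrt((2.0 - (1.0 if m == 1 else 0.0)) / N)
            row = [scale * math.cos((math.pi / N) * (n - 0.5) * (m - 1))
                   for n in range(1, N + 1)]
            matrix.append(row)
        return matrix

    def dct_1d(samples):
        """Apply the 1D DCT to a block of samples (the slow N*N method)."""
        N = len(samples)
        D = dct_matrix(N)
        return [sum(D[m][n] * samples[n] for n in range(N)) for m in range(N)]

    # A flat block of pixels: all the energy ends up in the first (DC)
    # coefficient, and the remaining coefficients are essentially zero.
    print([round(c, 3) for c in dct_1d([128] * 8)])

Applying this 1D transform to every row of an 8x8 block, then to every
column of the result, gives the 2D DCT used in JPEG and MPEG.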

Discrete Wavelet Transform (DWT)

Video Codecs Using the Discrete Wavelet Transform:

VDONet's VDOWave
VxTreme
Intel Indeo 5.x
(possibly) Intel Indeo 4.x

STRENGTHS AND WEAKNESSES

1. Most video and image compression implemented using the Discrete
Wavelet Transform does not exhibit the blocking, also known as tiling,
artifacts seen with the block Discrete Cosine Transform.

2. DWT based video and image compression often outperforms block DCT
compression when evaluated using the Peak Signal to Noise Ratio (PSNR)
or Mean Squared Error (MSE) metric (these are mathematically
equivalent).

3. The subjective quality of video or images compressed with the DWT
can appear better than block DCT methods for the same compression
ratio.

4. Wavelet compression exhibits blurring and ringing artifacts at sharp
edges as compression increases.  This is an inherent weakness of the
method, shared with the Discrete Cosine Transform, which exhibits the
same type of artifacts.  The ringing artifacts are also known as
contouring, mosquito noise, and the Gibbs effect.

OVERVIEW

The Discrete Wavelet Transform essentially consists of passing a
signal, such as an image, through a pair of filters, a low pass filter
and a high pass filter.  The low pass filter yields a coarse or low
resolution version of the signal.  The high pass filter yields an added
detail or difference signal.  The outputs of the two filters are then
downsampled by two.  Thus, at this point the downsampled outputs
together have the same number of samples as the input signal.  The
parameters of the two filters are selected so that when the upsampled
output of the low pass filter is added to the upsampled output of the
high pass filter, the original signal is reproduced.  The output of the
low pass filter, the coarse approximation, may then be fed into another
pair of filters and the process repeated.

A simple example of the discrete wavelet transform is the Haar wavelet
transform.  The input signal is x[n], a series of samples indexed by n.

    Haar Low Pass Filter (the average of two successive samples)

        g[n] = 1/2 * ( x[n] + x[n+1] )

    Haar High Pass Filter (the difference of two successive samples)

        h[n] = 1/2 * ( x[n+1] - x[n] )

    Note that:

        x[n]   = g[n] - h[n]
        x[n+1] = g[n] + h[n]

The output sequences g[n] and h[n] contain redundant information.  One
can safely downsample by two, that is, omit the even or odd samples,
and still reproduce the original input signal x[n].  Usually the odd
samples are omitted.  The complete input signal x[n] can be reproduced
from

    g[0], g[2], g[4], ...
    h[0], h[2], h[4], ...

alone:

    x[0] = g[0] - h[0]
    x[1] = g[0] + h[0]
    x[2] = g[2] - h[2]
    x[3] = g[2] + h[2]

and so on.

The output of the low pass filter is a coarse approximation of the
original input signal.  When the input signal is an image, this means a
- sometimes blurry - low resolution version of the original image.  The
output of the high pass filter is an added detail or difference signal.
When combined with the coarse approximation, as described, the original
input signal is exactly reproduced.  The coarse approximation is
sometimes called a base layer and the added detail is sometimes called
an enhancement layer.  The output of the low pass filter, g[n], can be
fed into another pair of filters, repeating the process.  The Discrete
Wavelet Transforms used in wavelet image and video compression iterate
the process many times.

So far, no compression has occurred.  The transform produces the same
number of samples as the input signal.  The output values are called
transform coefficients or wavelet transform coefficients.

The Haar wavelet is used primarily for illustrative purposes.  More
complex filters are used in most real wavelet implementations.
Compression is typically achieved by applying some form of
quantization, scalar quantization in simple implementations and vector
quantization in more complex implementations, to the added detail
signals.  Some type of entropy coding may be applied to the quantized
transform coefficients.

For example, in the crude Haar wavelet example, the input signal x[n]
might be a series of 8 bit samples such as a raster scan of a gray
scale image.  One could continue to use 8 bits for the output of the
low pass filter, g[n], but use only 6 bits (divide by 4) for the output
of the high pass filter, h[n] - the detail signal.  This is scalar
quantization.  Further, the output of the high pass filter in the Haar
case will tend to be peaked toward zero, that is, small coefficients
will be more common than large coefficients.  Thus, some form of
entropy coding of the detail signal h[n] is possible.

In fact, for most natural images, the low pass signal g[n] will be
correlated with previous samples g[n-1] except at edges.  g[n] will
tend to be close to or the same as g[n-1].  Objects tend to have a
continuous surface material, such as skin, with a constant surface
reflectance, a texture.  The texture can be a simple constant color
level or a complex periodic or quasi-periodic pattern, for example
wallpaper or marble.  Thus the low pass signal g[n] could be encoded as
a difference from previous samples g[n-1] to achieve further
compression.

For image coding, the notion is that the human visual system is less
sensitive to fine details in the image than to the gross features.
Thus quantization can be applied more strongly to the detail signals.
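One level of the Haar analysis and synthesis described above can be
sketched in Python.  This is illustrative only; the function names are
invented here, and real codecs use longer filters and iterate the
decomposition on the coarse signal:

    def haar_analyze(x):
        """One level of the Haar transform, keeping only the even-indexed
        (downsampled) averages g and differences h."""
        averages = []
        details = []
        for n in range(0, len(x) - 1, 2):
            averages.append((x[n] + x[n + 1]) / 2.0)   # low pass output g[n]
            details.append((x[n + 1] - x[n]) / 2.0)    # high pass output h[n]
        return averages, details

    def haar_synthesize(averages, details):
        """Rebuild the original samples from the averages and differences."""
        x = []
        for g, h in zip(averages, details):
            x.append(g - h)    # x[n]   = g[n] - h[n]
            x.append(g + h)    # x[n+1] = g[n] + h[n]
        return x

    samples = [10, 12, 14, 18, 20, 20, 6, 4]
    g, h = haar_analyze(samples)
    print(g)                       # [11.0, 16.0, 20.0, 5.0]  (coarse signal)
    print(h)                       # [1.0, 2.0, 0.0, -1.0]    (detail signal)
    print(haar_synthesize(g, h))   # recovers the original samples exactly

Note how the detail values h are small and clustered near zero, which
is what makes quantization and entropy coding of the detail signal pay
off.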

Contour-Based Image Coding

Video Codecs Using Contour-Based Image Coding:

Crystal Net's Surface Fitting Method (SFM) may be an example of
Contour-Based Image Coding.  The emerging ISO standard MPEG-4
incorporates some ideas associated with Contour-Based Image Coding.

OVERVIEW

A contour is a line representing the outline of a figure, body, or
mass.  A texture is a representation of the structure of a surface.
Contour-based image coding represents images as contours bounding
textured regions.  Since contours frequently correspond to the
boundaries of objects in a scene, there is a close relationship between
contour-based image coding and object-based image coding.  Object-based
image coding represents an image as a collection of objects.

For example, once contours and textures have been extracted from an
image, the contours can be encoded as the control points of a spline -
a polynomial function used to represent curves - fitted to the contour.
The textures can be encoded as the transform coefficients from a
spatial frequency transform such as the Discrete Cosine Transform or
one of the many variants of the Discrete Wavelet Transform.
Compression can be achieved through entropy coding of scalar or vector
quantization of the control parameters of the spline and the transform
coefficients used for the texture.

Contour-Based Image Coding is a leading edge image coding technology
(May, 1999).  Extracting contours, also known as edge or line
detection, remains an unsolved problem in computer science.  Whereas
human viewers have an easy sense of what is and is not a contour or
line in a scene, computer algorithms - so far - miss some contours, as
defined by humans, and also find spurious contours, as defined by
humans.  Extracting contours is one of a number of pattern recognition
and reasoning tasks that seem almost effortless for humans but have
proven difficult - impossible so far - to emulate with computers.  A
number of edge detection and image segmentation algorithms exist that
could be applied to contour extraction in contour-based image coding.

In principle, contour based image coding could circumvent the problems
that transform coding methods such as the Discrete Cosine Transform
(JPEG, MPEG, H.261, H.263, DV, etc.) and the Discrete Wavelet Transform
(Intel Indeo, VDONet VDOWave, etc.) encounter at sharp edges, achieving
higher compression.

Frame Differencing

VIDEO CODECS THAT USE FRAME DIFFERENCING

Cinepak

STRENGTHS AND WEAKNESSES

1. Generally can achieve better compression than independent encoding
of individual frames.

2. Errors accumulate in successive frames after a key frame, eventually
requiring another key frame.  (see below)

OVERVIEW

Frame Differencing exploits the fact that little changes from frame to
frame in many video or animation sequences.  For example, a video might
show a ball flying through the air in front of a static background.
Most of the image, the background, does not change from frame to
subsequent frame in the scene.

In frame differencing, a still image compression method such as vector
quantization is applied to the difference between the frame and the
decoded previous frame.  Often, most of the difference consists of zero
or small values which can be heavily compressed.

Most often, frame differencing uses "key frames", which are frames
compressed without reference to a previous frame.  This limits
accumulated errors and enables seeking within the video stream.  In the
widely used Cinepak, a key frame is often set every 15 frames.  Cinepak
movies usually use frame differencing combined with vector
quantization.  If the compression scheme is lossy (vector quantization
is lossy), errors will accumulate from frame to frame.  Eventually
these errors will become visible.  This necessitates key frames!
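A minimal Python sketch of the idea follows, with lossless differences
and a key frame interval chosen to echo the Cinepak example above.  A
real codec applies a lossy still image method to the difference rather
than storing it directly; the names and structure here are invented for
illustration:

    # Frames are flat lists of pixel values.  Every KEY_FRAME_INTERVAL-th
    # frame is stored whole; the rest are stored as differences from the
    # previous frame.

    KEY_FRAME_INTERVAL = 15

    def encode(frames):
        encoded = []
        previous = None
        for i, frame in enumerate(frames):
            if previous is None or i % KEY_FRAME_INTERVAL == 0:
                encoded.append(("key", list(frame)))       # key frame
            else:
                diff = [c - p for c, p in zip(frame, previous)]
                encoded.append(("delta", diff))             # difference frame
            previous = frame
        return encoded

    def decode(encoded):
        frames = []
        previous = None
        for kind, data in encoded:
            if kind == "key":
                frame = list(data)
            else:
                frame = [p + d for p, d in zip(previous, data)]
            frames.append(frame)
            previous = frame
        return frames

    # A mostly static four-pixel "image": only one pixel changes, so the
    # difference frame is almost entirely zeros.
    clips = [[100, 100, 100, 100], [100, 100, 101, 100]]
    print(encode(clips))
    print(decode(encode(clips)))   # reproduces the original frames

In this lossless sketch nothing accumulates, but once the differences
are quantized (as in Cinepak), the errors described above build up
between key frames.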

Motion Compensation

VIDEO CODECS THAT USE MOTION COMPENSATION

ClearVideo (RealVideo) Fractal Video Codec from Iterated Systems
VDOWave from VDONet
VxTreme
MPEG-1, 2, and 4
H.261
H.263
H.263+

STRENGTHS AND WEAKNESSES

1. Motion compensation generally achieves higher video compression than
frame differencing.

2. The encoding phase of motion compensation (known as motion
estimation) is computationally intensive.  MPEG-1 with I, P, and B
frames cannot be encoded in real time without dedicated hardware, a
silicon implementation of motion estimation.

3. The motion compensation scheme used in the international standards
MPEG, H.261, and H.263 works best for scenes with limited motion such
as talking heads.  In general, video with heavy motion such as sports
video is hard to compress with motion compensation.

OVERVIEW

Motion Compensation codes for motion within a scene, such as a ball
moving across a background.  The block Discrete Cosine Transform (DCT)
based international video standards MPEG-1, MPEG-2, MPEG-4, H.261, and
H.263 use motion compensation.  Iterated Systems' ClearVideo
(RealVideo) fractal video codec, VDOWave from VDONet, and VxTreme's
video codec use forms of motion compensation.

Motion Compensation refers to a number of ideas and algorithms.  The
motion compensation method used in MPEG and related international
standards (H.261 and H.263) is described below.  This motion
compensation works for translational motion only.  It is suited for
objects moving across a background or panning of the camera.  It does
not work well for spinning objects, resizing objects, or camera zooms.
Alternative forms of motion compensation exist which handle rotational,
scaling, skewing, and other kinds of motion in a scene.

Recognizing objects such as a flying ball in a scene is an unsolved
problem in image processing and understanding.  A way to exploit the
motion of the ball to achieve image compression without recognizing
objects is to partition the image into blocks (16x16 pixels in MPEG-1)
and code a "motion vector" for each block which points to the 16x16
pixel block in a previous (or future) frame that most closely
approximates the block being coded.  In many cases this reference block
will be the same block (no motion).  In some cases this reference block
will be a different block (motion).  The encoder need not recognize the
presence of a ball or other object, only compare blocks of pixels in
the decoded and reference frames.  COMPRESSION IS ACHIEVED BY SENDING
OR STORING ONLY THE MOTION VECTOR (AND A POSSIBLE SMALL ERROR) INSTEAD
OF THE PIXEL VALUES FOR THE ENTIRE BLOCK.

Note that the reference block can be anywhere in the image.  The coded
or "predicted" blocks must form a partition or tiling of the image
(frame) being decoded.  A reference block can be any 16x16 pixel block
in the reference frame (image) that most closely approximates the coded
or "predicted" block.  The reference frame must be decoded prior to the
current frame being decoded.  However, the reference frame need not be
PRESENTED before the current frame being decoded.  In fact, the
reference frame could be a future frame!  MPEG allows for this through
so-called B (bi-directionally predicted) frames.

In the example of the ball in front of a static background, no motion
occurs in most of the blocks.  For these blocks, the motion vectors are
zero.  Motion compensation for these blocks is then equivalent to frame
differencing, where the difference between the block and the same block
in a previous (or future) frame is coded.
For the block or blocks containing the moving ball, the motion vectors
will be non-zero, pointing to a block in a previous (or future) frame
that contains the ball.  The displaced block is subtracted from the
current block.  In general, there will be some left over non-zero
values, which are then coded using the still image compression scheme
such as Vector Quantization, the block Discrete Cosine Transform, or
the Discrete Wavelet Transform.

MPEG style motion compensation does not require recognition of the
ball.  An encoder simply compares the block being coded with displaced
blocks in the reference frame (a previous or future frame).  The
comparison can use mean squared error or some other metric of the
difference between images.  The encoder selects the displaced block
with the smallest mean squared error difference.  At no point has the
encoder recognized an object in the image.  In MPEG, the motion vectors
are encoded as variable length codes for greater compression.

The encoding process is called Motion Estimation.  This finds the
motion vector (or vectors) for each block.  The decoding process is
called Motion Compensation.  Motion Compensation achieves greater
compression than simple Frame Differencing.

ILLUSTRATIVE EXAMPLE:

    Predicted Region               Reference Region
    (the current frame             (a previously decoded
     being decoded)                 frame)
     _________                      _________
    |    |    |                    |    |    |
    | *  |    |                    |    |  * |   (moving ball)
    |  4 | -4 |                    |    |    |
    |____|____|                    |____|____|
    |    |    |                    |    |    |
    |  0 |  0 |                    |    |    |   (no change)
    |    |    |                    |    |    |
    |____|____|                    |____|____|

The asterisk (*) represents a ball flying across the scene from right
to left.  There are four blocks with associated motion vectors (4, -4,
0, and 0).  The upper left block looks like the upper right block in
the reference region (where the ball was).  The upper right block looks
like the upper left block in the reference region.  The lower left and
lower right blocks are unchanged.  In this simple example, the vertical
displacement is zero and is ignored.

In this simple example, the region can be decoded using the motion
vectors alone.  In more general cases, there is an error between the
frame predicted using motion vectors alone and the actual frame.  This
error is coded using a still image compression scheme such as the block
Discrete Cosine Transform (DCT).

In this simple example, the previously decoded frame is also previous
in the presentation order.  The previously decoded or reference frame
precedes the current frame in time.  In general, keep in mind the
distinction between decode order and presentation order.  The reference
frame could be a future frame.
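The block matching idea can be sketched in one dimension in Python.
This is an invented toy example, not MPEG's actual search or syntax:
each block of the current frame is compared against displaced blocks in
the reference frame, and the displacement with the smallest sum of
squared differences becomes the motion vector, with the leftover
difference kept as the residual:

    # One-dimensional block-matching motion estimation (illustrative only).
    # Frames are lists of pixel values; blocks are BLOCK_SIZE samples wide.

    BLOCK_SIZE = 4
    SEARCH_RANGE = 4

    def sse(a, b):
        """Sum of squared differences between two blocks."""
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def estimate_motion(current, reference):
        """Return a (motion_vector, residual) pair for each block of current."""
        result = []
        for start in range(0, len(current), BLOCK_SIZE):
            block = current[start:start + BLOCK_SIZE]
            best_mv, best_err = 0, float("inf")
            for mv in range(-SEARCH_RANGE, SEARCH_RANGE + 1):
                ref_start = start + mv
                if ref_start < 0 or ref_start + BLOCK_SIZE > len(reference):
                    continue
                candidate = reference[ref_start:ref_start + BLOCK_SIZE]
                err = sse(block, candidate)
                if err < best_err:
                    best_mv, best_err = mv, err
            ref_block = reference[start + best_mv:start + best_mv + BLOCK_SIZE]
            residual = [c - r for c, r in zip(block, ref_block)]
            result.append((best_mv, residual))
        return result

    # The "ball" (value 255) moves four pixels to the left between frames,
    # echoing the illustrative example above: motion vectors 4 and -4 with
    # zero residuals.
    reference = [0, 0, 0, 0, 255, 0, 0, 0]
    current   = [255, 0, 0, 0, 0, 0, 0, 0]
    print(estimate_motion(current, reference))

Only the motion vectors and the (here all-zero) residuals would need to
be stored, instead of the pixel values for each block.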

© 2000 by John F. McGowan, Ph.D.

Disclaimer

jmcgowan11@earthlink.net