John McGowan's AVI Overview: Video Compression Technologies



There are several underlying technologies used by different Video for
Windows Codecs.  For example, Indeo 3.2 and Cinepak both use Vector
Quantization.  The international standards MPEG-1, MPEG-2, MPEG-4,
H.261, and H.263 all use a combination of the block Discrete Cosine
Transform (DCT) and motion estimation/compensation.  Several of the
New Wave codecs use wavelet transform based image compression (the
Discrete Wavelet Transform or DWT).  Other technologies include
Fractal Image Compression, represented by Iterated Systems.

Some general comments on image and video compression:

(1) Image compression may be lossless, where no information is lost
during the compression process.  The image produced by the
decompression (also known as decoding) process is identical bit for bit
to the original image.  The widely used GIF format is a lossless image
and video (GIF89a or animated GIF) compression format.

LOSSLESS COMPRESSION

(2) Image compression may be lossy, where information is lost during
the compression process.  These schemes exploit limitations of the
human visual system.  Some errors are undetectable by the human eye:
even though two images differ at the bit level, the human viewer cannot
distinguish them.  Some errors are detectable by the human eye but
acceptable.  Some errors are detectable and very annoying.  The widely
used JPEG image compression standard is a lossy compression scheme.

LOSSY COMPRESSION

(3) Within lossy image and video compression, a compression scheme may
be perceptually lossless, in which case the human viewer cannot
distinguish between the original image or video and the decompressed
version, even though errors have been introduced by the lossy
compression.  Most lossy image and video compression schemes have some
sort of quality factor or factors.  If the quality is good enough, the
image will be perceptually lossless.

PERCEPTUALLY LOSSLESS COMPRESSION

(4) JPEG, MPEG, and other lossy compression schemes often compress
images and video beyond the point of perceptual losslessness, but the
compressed images and video are still acceptable to the human viewer.
If the compression and decompression degrade the image in a way that is
very similar or identical to the natural degradation of images that
occurs in the world, then the human visual system will not object
greatly.

Loss of fine detail in an image is often acceptable because humans
perceive objects in the natural world with widely varying levels of
detail depending on how close the human viewer is to the object and
whether the human viewer is looking directly at the object or not.
The human viewer sees less detail if an object is further away.
When a human viewer looks directly at an object, the viewer uses a
small very high resolution part of the retina.  If an object is to one
side of the direction of view, the viewer is using lower resolution
parts of the retina.  Human beings are also used to certain natural
forms of degradation such as rain, snow, and fog.

Note that in all these natural viewing situations, the human viewer
will still perceive sharp edges and lines in an image regardless of the
level of detail.  The human viewer will usually perceive the objects as
the same object despite the variations in level of detail.  A horse is
a horse is a horse.

NATURALLY LOSSY COMPRESSION

(5) Sufficiently low quality lossy compression will introduce visual
artifacts that are highly annoying to the human viewer.  An example is
the blocking artifacts visible in highly compressed MPEG video and
other block Discrete Cosine Transform based image compression codecs.
At some point the lossy compression will introduce artifacts that are
very unnatural and are perceived as new objects in the scene or
spurious lines within the image.

The human visual system is very sensitive to lines and edges.  One of
its main functions appears to be to detect and characterize physical
objects such as other people, potential threats such as predators, food
plants, and other things.  Objects in the visual system are delineated
by edges.  Anything, such as a codec algorithm, that destroys or
creates an edge in an image is noticed, particularly if the edge is
interpreted as the border of an object by the human visual and
cognitive system.

UNNATURAL LOSSY COMPRESSION 

All of the widely used video codecs are lossy compression algorithms.
At sufficiently high compression most of them will have problems with the
edges in the image.  Vector quantization, block Discrete Cosine Transform,
and wavelet based image and video compression inherently do not 
mathematically represent the intuitive notion of an edge or line.

THE POINT:  In using and selecting video codecs, the author of an AVI
file (or of compressed digital video in general) needs to achieve
NATURALLY LOSSY COMPRESSION or better.  Once the compression introduces
noticeable AND unnatural artifacts, the video is of very limited use,
even in cases where some features and objects are recognizable.

A basic description of video compression technologies follows.  I have
tried to avoid the dense mathematics found in most of the technical
video and image compression literature.



Run Length Encoding

VIDEO CODECS THAT USE RUN LENGTH ENCODING

Microsoft RLE (MRLE)

Run length encoding is also used to encode the DCT coefficients in the
block Discrete Cosine Transform (DCT) based international standards
MPEG, H.261, H.263, and JPEG.

STRENGTHS AND WEAKNESSES

1. Works for bilevel or 8 bit graphic images such as cel animation.

2. Not good for high resolution natural images.

OVERVIEW

Run length encoding encodes a sequence or run of consecutive pixels of
the same color (such as black or white) as a single codeword.  For
example, the sequence of pixels

    77 77 77 77 77 77 77

could be coded as

    7 77    (for seven 77's)

Run length encoding can work well for bi-level images (e.g. black and
white text or graphics) and for 8 bit images, particularly images such
as cel animations which contain many runs of the same color.  Run
length encoding does not work well for 24 bit natural images in
general.  Runs of the same color are not that common.
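As a concrete illustration, here is a minimal Python sketch (not from
any actual codec; the function names are invented for this example) of
run length encoding and decoding a list of pixel values:

    def rle_encode(pixels):
        """Encode a list of pixel values as (count, value) pairs."""
        runs = []
        i = 0
        while i < len(pixels):
            value = pixels[i]
            count = 1
            while i + count < len(pixels) and pixels[i + count] == value:
                count += 1
            runs.append((count, value))
            i += count
        return runs

    def rle_decode(runs):
        """Expand (count, value) pairs back into the original pixel list."""
        pixels = []
        for count, value in runs:
            pixels.extend([value] * count)
        return pixels

    # The run of seven 77's from the example above becomes a single pair.
    print(rle_encode([77, 77, 77, 77, 77, 77, 77]))   # [(7, 77)]
    print(rle_decode([(7, 77)]))                      # the original seven 77's

Note that this sketch is lossless: decoding the runs reproduces the
original pixels exactly.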

Vector Quantization

VIDEO CODECS THAT USE VECTOR QUANTIZATION

Indeo 3.2
Cinepak

Indeo 3.2 and Cinepak both use vector quantization.  As with most
digital video, Indeo and Cinepak work in the YUV color space (not RGB,
for example).

STRENGTHS AND WEAKNESSES

1. The encoding process is computationally intensive.  It still cannot
be done in real time without dedicated hardware.

2. The decoding process is very fast.

3. Blocking artifacts at high compression.

4. Generally, block Discrete Cosine Transform and Discrete Wavelet
Transform based image compression methods can achieve higher
compression.

OVERVIEW

The basic idea of Vector Quantization based image compression is to
divide the image up into blocks (4x4 pixels in YUV space for Indeo and
Cinepak).  Typically, some blocks (hopefully many) are similar to other
blocks, although usually not identical.  The encoder identifies a class
of similar blocks and replaces these with a "generic" block
representative of the class of similar blocks.  The encoder encodes a
lookup table that maps short binary codes to the "generic" blocks.
Typically, the shortest binary codes represent the most common classes
of blocks in the image.

The Vector Quantization (VQ) decoder uses the lookup table to assemble
an approximate image composed of the "generic" blocks in the lookup
table.  Note that this is inherently a lossy compression process
because the actual blocks are replaced with a generic block that is a
"good enough" approximation to the original block.

The encoding process is slow and computationally intensive because the
encoder must accumulate statistics on the frequency of blocks and
calculate the similarity of blocks in order to build the lookup table.
The decoding process is very quick because it is lookup table based.

In Vector Quantization, the lookup table may be called a codebook.  The
binary codes that index into the table may be called codewords.  Higher
compression is achieved by making the lookup table smaller, with fewer
classes of similar blocks in the image.  The quality of the reproduced
approximate image degrades as the lookup table becomes smaller.  Vector
Quantization is prone to blocking artifacts as compression is
increased.

Vector Quantization is an entire sub-field in signal and image
processing.  It goes well beyond the brief description above and is
applied to uses other than video compression.  The standard reference
book on Vector Quantization is:

    Vector Quantization and Signal Compression
    A. Gersho and R. Gray
    Boston, MA: Kluwer, 1992

Like many image and signal processing books, this is heavy on abstract
math.

A Simple Example

Consider the following 4 by 4 blocks of pixels.  Each pixel has a value
in the range of 0-255.  This is a grayscale image for simplicity.

    (Block 1)

    128 128 128 128
    128 128 128 128
    128 128 128 128
    128 128 128 128

    (Block 2)

    128 127 128 128
    128 128 128 128
    128 128 127 128
    128 128 128 128

    (Block 3)

    128 127 126 128
    128 128 128 128
    127 128 128 128
    128 128 128 128

In practice, the blocks will look the same to a human viewer.  The
second and third blocks could be safely replaced by the first block.
By itself, this does not compress the image.  However, the replacement
block (Block 1) could be represented by a short index into a lookup
table of 4x4 blocks.  For example, the index, in this case, could be 1.

    Lookup Table[1] = 128 128 128 128
                      128 128 128 128
                      128 128 128 128
                      128 128 128 128

The original image could be converted into a lookup table and a series
of indexes into the lookup table, achieving substantial compression.
In video, the same lookup table could be used for many frames, not just
a single frame.
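The lookup table idea can be sketched in a few lines of Python.  This
is only a toy illustration with an assumed one-entry codebook and a
brute force nearest-block search; it is not how Indeo or Cinepak
actually build or search their codebooks:

    # Toy vector quantizer for 4x4 grayscale blocks stored as flat lists
    # of 16 pixel values.  Illustrative only.

    def block_distance(a, b):
        """Sum of squared differences between two blocks."""
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def vq_encode(blocks, codebook):
        """Replace each block with the index of the closest codebook entry."""
        indexes = []
        for block in blocks:
            best = min(range(len(codebook)),
                       key=lambda i: block_distance(block, codebook[i]))
            indexes.append(best)
        return indexes

    def vq_decode(indexes, codebook):
        """Rebuild an approximate image from indexes and the codebook."""
        return [codebook[i] for i in indexes]

    block1 = [128] * 16                  # Block 1 from the example above
    block2 = [128, 127] + [128] * 14     # nearly identical block
    codebook = [block1]                  # a one-entry lookup table
    print(vq_encode([block1, block2], codebook))   # [0, 0] - both map to Block 1

Decoding the two indexes reproduces Block 1 twice, which is the lossy
"good enough" approximation described above.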

Discrete Cosine Transform

VIDEO CODECS THAT USE DCT

Motion JPEG
Editable MPEG
MPEG-1
MPEG-2
MPEG-4
H.261
H.263
H.263+

STRENGTHS AND WEAKNESSES

1. Blocking artifacts at high compression.

2. Ringing at sharp edges.  Occasional blurring at sharp edges.

3. Computationally intensive.  Only recently has it become possible to
implement in real time on general purpose CPUs as opposed to
specialized chips.

OVERVIEW

The Discrete Cosine Transform (DCT) is a widely used transform in image
compression.  The JPEG still image compression standard, the H.261
(p*64) video-conferencing standard, the H.263 video-conferencing
standard, and the MPEG (MPEG-1, MPEG-2, and MPEG-4) digital video
standards use the DCT.  In these standards, a two-dimensional (2D) DCT
is applied to 8 by 8 blocks of pixels in the image being compressed.
The 64 (8x8 = 64) coefficients produced by the DCT are then quantized
to provide the actual compression.

In typical images, most DCT coefficients from a DCT on an 8 by 8 block
of pixels are small and become zero after quantization.  This property
of the DCT on real world images is critical to the compression schemes.
In addition, the human eye is less sensitive to the high frequency
components of the image represented by the higher DCT coefficients.  A
large quantization factor can be, and usually is, applied to these
higher frequency components.  The de facto standard quantization matrix
(a matrix of 64 quantization factors, one for each of the 64 DCT
coefficients) in the JPEG standard has higher quantization factors for
higher frequency DCT coefficients.

The quantized DCT coefficients are then run-length encoded as variable
length codes that indicate some number of zero coefficients followed by
a non-zero coefficient.  For example, a run-length code might indicate
4 zero coefficients followed by a non-zero coefficient of level 2.
Short variable length codes (e.g. 0110) are used for common
combinations of runs of zeros and levels of the non-zero coefficient.
Longer variable length codes (e.g. 0000001101) are used for less common
combinations.  In this way, substantial compression of the image is
possible.

The Discrete Cosine Transform itself is best explained as a one
dimensional (1D) DCT first.  The 2D DCT is equivalent to performing a
1D DCT on each row of a block of pixels followed by a 1D DCT on each
column of the block produced by the 1D DCTs on the rows.

The one dimensional Discrete Cosine Transform is applied to a block of
N samples (pixels in an image or sound pressure samples in an audio
file).  The Discrete Cosine Transform is an NxN matrix whose rows are
sampled cosine functions:

    DCT(m,n) = sqrt( (2 - delta(m,1)) / N ) * cos( (pi/N) * (n - 1/2) * (m-1) )

    where

    DCT(m,n) is the 1D DCT matrix
    m,n = 1,...,N
    pi = 3.14159265...
    N = number of samples in the block
    delta(m,1) = 1 if m is 1, 0 otherwise
    cos(x) = cosine of x (radians)
    * = multiply

Naively, performing a DCT on a block of N samples would require N*N
multiplies and adds.  However, the DCT matrix has a recursive structure
that allows implementation with order N log(N) multiplies and adds
(many fewer).  This makes the DCT practical for implementation on
current CPUs and DSPs.
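A short Python sketch of the 1D DCT matrix and its naive N*N
application, following the formula above, may help make the transform
concrete.  This is illustrative code only, not an implementation from
any of the standards, and it omits the fast N log(N) algorithms used in
practice:

    import math

    def dct_matrix(N):
        """Build the N x N 1D DCT matrix from the formula above (m, n = 1..N)."""
        matrix = []
        for m in range(1, N + 1):
            scale = math.sqrt((2.0 - (1.0 if m == 1 else 0.0)) / N)
            row = [scale * math.cos((math.pi / N) * (n - 0.5) * (m - 1))
                   for n in range(1, N + 1)]
            matrix.append(row)
        return matrix

    def dct_1d(samples):
        """Apply the 1D DCT to a block of samples (the slow N*N method)."""
        N = len(samples)
        D = dct_matrix(N)
        return [sum(D[m][n] * samples[n] for n in range(N)) for m in range(N)]

    # A flat block of pixels: all the energy ends up in the first (DC)
    # coefficient, and the remaining coefficients are essentially zero.
    print([round(c, 3) for c in dct_1d([128] * 8)])

Applying this 1D transform to every row of an 8x8 block, then to every
column of the result, gives the 2D DCT used in JPEG and MPEG.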

Discrete Wavelet Transform (DWT)

Video Codecs Using the Discrete Wavelet Transform:

VDONet's VDOWave
VxTreme
Intel Indeo 5.x
(possibly) Intel Indeo 4.x

STRENGTHS AND WEAKNESSES

1. Most video and image compression implemented using the Discrete
Wavelet Transform does not exhibit the blocking, also known as tiling,
artifacts seen with the block Discrete Cosine Transform.

2. DWT based video and image compression often outperforms block DCT
compression when evaluated using the Peak Signal to Noise Ratio (PSNR)
or Mean Squared Error (MSE) metric (these are mathematically
equivalent).

3. The subjective quality of video or images compressed with the DWT
can appear better than block DCT methods for the same compression
ratio.

4. Wavelet compression exhibits blurring and ringing artifacts at sharp
edges as compression increases.  This is an inherent weakness of the
method, shared with the Discrete Cosine Transform, which exhibits the
same type of artifacts.  The ringing artifacts are also known as
contouring, mosquito noise, and the Gibbs effect.

OVERVIEW

The Discrete Wavelet Transform essentially consists of passing a
signal, such as an image, through a pair of filters, a low pass filter
and a high pass filter.  The low pass filter yields a coarse or low
resolution version of the signal.  The high pass filter yields an added
detail or difference signal.  The outputs of the two filters are then
downsampled by two.  Thus, at this point the downsampled outputs
together have the same number of samples as the input signal.  The
parameters of the two filters are selected so that when the upsampled
output of the low pass filter is added to the upsampled output of the
high pass filter, the original signal is reproduced.  The output of the
low pass filter, the coarse approximation, may then be fed into another
pair of filters and the process repeated.

A simple example of the discrete wavelet transform is the Haar wavelet
transform.  The input signal is x[n], a series of samples indexed by n.

    Haar Low Pass Filter (the average of two successive samples)

        g[n] = 1/2 * ( x[n] + x[n+1] )

    Haar High Pass Filter (the difference of two successive samples)

        h[n] = 1/2 * ( x[n+1] - x[n] )

    Note that:

        x[n]   = g[n] - h[n]
        x[n+1] = g[n] + h[n]

The output sequences g[n] and h[n] contain redundant information.  One
can safely downsample by two, that is, omit the even or odd samples,
and still reproduce the original input signal x[n].  Usually the odd
samples are omitted.  The complete input signal x[n] can be reproduced
from

    g[0], g[2], g[4], ...
    h[0], h[2], h[4], ...

alone:

    x[0] = g[0] - h[0]
    x[1] = g[0] + h[0]
    x[2] = g[2] - h[2]
    x[3] = g[2] + h[2]

and so on.

The output of the low pass filter is a coarse approximation of the
original input signal.  When the input signal is an image, this means a
- sometimes blurry - low resolution version of the original image.  The
output of the high pass filter is an added detail or difference signal.
When combined with the coarse approximation, as described, the original
input signal is exactly reproduced.  The coarse approximation is
sometimes called a base layer and the added detail is sometimes called
an enhancement layer.  The output of the low pass filter, g[n], can be
fed into another pair of filters, repeating the process.  The Discrete
Wavelet Transforms used in wavelet image and video compression iterate
the process many times.

So far, no compression has occurred.  The transform produces the same
number of samples as the input signal.  The output values are called
transform coefficients or wavelet transform coefficients.

The Haar wavelet is used primarily for illustrative purposes.  More
complex filters are used in most real wavelet implementations.
Compression is typically achieved by applying some form of
quantization, scalar quantization in simple implementations and vector
quantization in more complex implementations, to the added detail
signals.  Some type of entropy coding may be applied to the quantized
transform coefficients.

For example, in the crude Haar wavelet example, the input signal x[n]
might be a series of 8 bit samples such as a raster scan of a gray
scale image.  One could continue to use 8 bits for the output of the
low pass filter, g[n], but use only 6 bits (divide by 4) for the output
of the high pass filter, h[n] - the detail signal.  This is scalar
quantization.  Further, the output of the high pass filter in the Haar
case will tend to be peaked toward zero, that is, small coefficients
will be more common than large coefficients.  Thus, some form of
entropy coding of the detail signal h[n] is possible.

In fact, for most natural images, the low pass signal g[n] will be
correlated with previous samples g[n-1] except at edges.  g[n] will
tend to be close to or the same as g[n-1].  Objects tend to have a
continuous surface material, such as skin, with a constant surface
reflectance, a texture.  The texture can be a simple constant color
level or a complex periodic or quasi-periodic pattern, for example
wallpaper or marble.  Thus the low pass signal g[n] could be encoded as
a difference from previous samples g[n-1] to achieve further
compression.

For image coding, the notion is that the human visual system is less
sensitive to fine details in the image than to the gross features.
Thus quantization can be applied more strongly to the detail signals.
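One level of the Haar analysis and synthesis described above can be
sketched in Python.  This is illustrative only; the function names are
invented here, and real codecs use longer filters and iterate the
decomposition on the coarse signal:

    def haar_analyze(x):
        """One level of the Haar transform, keeping only the even-indexed
        (downsampled) averages g and differences h."""
        averages = []
        details = []
        for n in range(0, len(x) - 1, 2):
            averages.append((x[n] + x[n + 1]) / 2.0)   # low pass output g[n]
            details.append((x[n + 1] - x[n]) / 2.0)    # high pass output h[n]
        return averages, details

    def haar_synthesize(averages, details):
        """Rebuild the original samples from the averages and differences."""
        x = []
        for g, h in zip(averages, details):
            x.append(g - h)    # x[n]   = g[n] - h[n]
            x.append(g + h)    # x[n+1] = g[n] + h[n]
        return x

    samples = [10, 12, 14, 18, 20, 20, 6, 4]
    g, h = haar_analyze(samples)
    print(g)                       # [11.0, 16.0, 20.0, 5.0]  (coarse signal)
    print(h)                       # [1.0, 2.0, 0.0, -1.0]    (detail signal)
    print(haar_synthesize(g, h))   # recovers the original samples exactly

Note how the detail values h are small and clustered near zero, which
is what makes quantization and entropy coding of the detail signal pay
off.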

Contour-Based Image Coding

Video Codecs Using Contour-Based Image Coding:

Crystal Net's Surface Fitting Method (SFM) may be an example of
Contour-Based Image Coding.  The emerging ISO standard MPEG-4
incorporates some ideas associated with Contour-Based Image Coding.

OVERVIEW

A contour is a line representing the outline of a figure, body, or
mass.  A texture is a representation of the structure of a surface.
Contour-based image coding represents images as contours bounding
textured regions.  Since contours frequently correspond to the
boundaries of objects in a scene, there is a close relationship between
contour-based image coding and object-based image coding.  Object-based
image coding represents an image as a collection of objects.

For example, once contours and textures have been extracted from an
image, the contours can be encoded as the control points of a spline -
a polynomial function used to represent curves - fitted to the contour.
The textures can be encoded as the transform coefficients from a
spatial frequency transform such as the Discrete Cosine Transform or
one of the many variants of the Discrete Wavelet Transform.
Compression can be achieved through entropy coding of scalar or vector
quantization of the control parameters of the spline and the transform
coefficients used for the texture.

Contour-Based Image Coding is a leading edge image coding technology
(May, 1999).  Extracting contours, also known as edge or line
detection, remains an unsolved problem in computer science.  Whereas
human viewers have an easy sense of what is and is not a contour or
line in a scene, computer algorithms - so far - miss some contours, as
defined by humans, and also find spurious contours, as defined by
humans.  Extracting contours is one of a number of pattern recognition
and reasoning tasks that seem almost effortless for humans but have
proven difficult - impossible so far - to emulate with computers.  A
number of edge detection and image segmentation algorithms exist that
could be applied to contour extraction in contour-based image coding.

In principle, contour based image coding could circumvent the problems
that transform coding methods such as the Discrete Cosine Transform
(JPEG, MPEG, H.261, H.263, DV, etc.) and the Discrete Wavelet Transform
(Intel Indeo, VDONet VDOWave, etc.) encounter at sharp edges, achieving
higher compression.

Frame Differencing

VIDEO CODECS THAT USE FRAME DIFFERENCING

Cinepak

STRENGTHS AND WEAKNESSES

1. Generally can achieve better compression than independent encoding
of individual frames.

2. Errors accumulate in successive frames after a key frame, eventually
requiring another key frame.  (see below)

OVERVIEW

Frame Differencing exploits the fact that little changes from frame to
frame in many video or animation sequences.  For example, a video might
show a ball flying through the air in front of a static background.
Most of the image, the background, does not change from frame to
subsequent frame in the scene.

In frame differencing, a still image compression method such as vector
quantization is applied to the difference between the frame and the
decoded previous frame.  Often, most of the difference consists of zero
or small values which can be heavily compressed.

Most often, frame differencing uses "key frames", which are frames
compressed without reference to a previous frame.  This limits
accumulated errors and enables seeking within the video stream.  In the
widely used Cinepak, a key frame is often set every 15 frames.  Cinepak
movies usually use frame differencing combined with vector
quantization.  If the compression scheme is lossy (vector quantization
is lossy), errors will accumulate from frame to frame.  Eventually
these errors will become visible.  This necessitates key frames!
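A minimal Python sketch of the idea follows, with lossless differences
and a key frame interval chosen to echo the Cinepak example above.  A
real codec applies a lossy still image method to the difference rather
than storing it directly; the names and structure here are invented for
illustration:

    # Frames are flat lists of pixel values.  Every KEY_FRAME_INTERVAL-th
    # frame is stored whole; the rest are stored as differences from the
    # previous frame.

    KEY_FRAME_INTERVAL = 15

    def encode(frames):
        encoded = []
        previous = None
        for i, frame in enumerate(frames):
            if previous is None or i % KEY_FRAME_INTERVAL == 0:
                encoded.append(("key", list(frame)))       # key frame
            else:
                diff = [c - p for c, p in zip(frame, previous)]
                encoded.append(("delta", diff))             # difference frame
            previous = frame
        return encoded

    def decode(encoded):
        frames = []
        previous = None
        for kind, data in encoded:
            if kind == "key":
                frame = list(data)
            else:
                frame = [p + d for p, d in zip(previous, data)]
            frames.append(frame)
            previous = frame
        return frames

    # A mostly static four-pixel "image": only one pixel changes, so the
    # difference frame is almost entirely zeros.
    clips = [[100, 100, 100, 100], [100, 100, 101, 100]]
    print(encode(clips))
    print(decode(encode(clips)))   # reproduces the original frames

In this lossless sketch nothing accumulates, but once the differences
are quantized (as in Cinepak), the errors described above build up
between key frames.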

Motion Compensation

VIDEO CODECS THAT USE MOTION COMPENSATION

ClearVideo (RealVideo) Fractal Video Codec from Iterated Systems
VDOWave from VDONet
VxTreme
MPEG-1, 2, and 4
H.261
H.263
H.263+

STRENGTHS AND WEAKNESSES

1. Motion compensation generally achieves higher video compression than
frame differencing.

2. The encoding phase of motion compensation (known as motion
estimation) is computationally intensive.  MPEG-1 with I, P, and B
frames cannot be encoded in real time without dedicated hardware, a
silicon implementation of motion estimation.

3. The motion compensation scheme used in the international standards
MPEG, H.261, and H.263 works best for scenes with limited motion such
as talking heads.  In general, video with heavy motion such as sports
video is hard to compress with motion compensation.

OVERVIEW

Motion Compensation codes for motion within a scene, such as a ball
moving across a background.  The block Discrete Cosine Transform (DCT)
based international video standards MPEG-1, MPEG-2, MPEG-4, H.261, and
H.263 use motion compensation.  Iterated Systems' ClearVideo
(RealVideo) fractal video codec, VDOWave from VDONet, and VxTreme's
video codec use forms of motion compensation.

Motion Compensation refers to a number of ideas and algorithms.  The
motion compensation method used in MPEG and related international
standards (H.261 and H.263) is described below.  This motion
compensation works for translational motion only.  It is suited for
objects moving across a background or panning of the camera.  It does
not work well for spinning objects, resizing objects, or camera zooms.
Alternative forms of motion compensation exist which handle rotational,
scaling, skewing, and other kinds of motion in a scene.

Recognizing objects such as a flying ball in a scene is an unsolved
problem in image processing and understanding.  A way to exploit the
motion of the ball to achieve image compression without recognizing
objects is to partition the image into blocks (16x16 pixels in MPEG-1)
and code a "motion vector" for each block which points to the 16x16
pixel block in a previous (or future) frame that most closely
approximates the block being coded.  In many cases this reference block
will be the same block (no motion).  In some cases this reference block
will be a different block (motion).  The encoder need not recognize the
presence of a ball or other object, only compare blocks of pixels in
the decoded and reference frames.  COMPRESSION IS ACHIEVED BY SENDING
OR STORING ONLY THE MOTION VECTOR (AND A POSSIBLE SMALL ERROR) INSTEAD
OF THE PIXEL VALUES FOR THE ENTIRE BLOCK.

Note that the reference block can be anywhere in the image.  The coded
or "predicted" blocks must form a partition or tiling of the image
(frame) being decoded.  A reference block can be any 16x16 pixel block
in the reference frame (image) that most closely approximates the coded
or "predicted" block.  The reference frame must be decoded prior to the
current frame being decoded.  However, the reference frame need not be
PRESENTED before the current frame being decoded.  In fact, the
reference frame could be a future frame!  MPEG allows for this through
so-called B (bi-directionally predicted) frames.

In the example of the ball in front of a static background, no motion
occurs in most of the blocks.  For these blocks, the motion vectors are
zero.  Motion compensation for these blocks is then equivalent to frame
differencing, where the difference between the block and the same block
in a previous (or future) frame is coded.
For the block or blocks containing the moving ball, the motion vectors
will be non-zero, pointing to a block in a previous (or future) frame
that contains the ball.  The displaced block is subtracted from the
current block.  In general, there will be some left over non-zero
values, which are then coded using the still image compression scheme
such as Vector Quantization, the block Discrete Cosine Transform, or
the Discrete Wavelet Transform.

MPEG style motion compensation does not require recognition of the
ball.  An encoder simply compares the block being coded with displaced
blocks in the reference frame (a previous or future frame).  The
comparison can use mean squared error or some other metric of the
difference between images.  The encoder selects the displaced block
with the smallest mean squared error difference.  At no point has the
encoder recognized an object in the image.  In MPEG, the motion vectors
are encoded as variable length codes for greater compression.

The encoding process is called Motion Estimation.  This finds the
motion vector (or vectors) for each block.  The decoding process is
called Motion Compensation.  Motion Compensation achieves greater
compression than simple Frame Differencing.

ILLUSTRATIVE EXAMPLE:

    Predicted Region               Reference Region
    (the current frame             (a previously decoded
     being decoded)                 frame)
     _________                      _________
    |    |    |                    |    |    |
    | *  |    |                    |    |  * |   (moving ball)
    |  4 | -4 |                    |    |    |
    |____|____|                    |____|____|
    |    |    |                    |    |    |
    |  0 |  0 |                    |    |    |   (no change)
    |    |    |                    |    |    |
    |____|____|                    |____|____|

The asterisk (*) represents a ball flying across the scene from right
to left.  There are four blocks with associated motion vectors (4, -4,
0, and 0).  The upper left block looks like the upper right block in
the reference region (where the ball was).  The upper right block looks
like the upper left block in the reference region.  The lower left and
lower right blocks are unchanged.  In this simple example, the vertical
displacement is zero and is ignored.

In this simple example, the region can be decoded using the motion
vectors alone.  In more general cases, there is an error between the
frame predicted using motion vectors alone and the actual frame.  This
error is coded using a still image compression scheme such as the block
Discrete Cosine Transform (DCT).

In this simple example, the previously decoded frame is also previous
in the presentation order.  The previously decoded or reference frame
precedes the current frame in time.  In general, keep in mind the
distinction between decode order and presentation order.  The reference
frame could be a future frame.
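The block matching idea can be sketched in one dimension in Python.
This is an invented toy example, not MPEG's actual search or syntax:
each block of the current frame is compared against displaced blocks in
the reference frame, and the displacement with the smallest sum of
squared differences becomes the motion vector, with the leftover
difference kept as the residual:

    # One-dimensional block-matching motion estimation (illustrative only).
    # Frames are lists of pixel values; blocks are BLOCK_SIZE samples wide.

    BLOCK_SIZE = 4
    SEARCH_RANGE = 4

    def sse(a, b):
        """Sum of squared differences between two blocks."""
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def estimate_motion(current, reference):
        """Return a (motion_vector, residual) pair for each block of current."""
        result = []
        for start in range(0, len(current), BLOCK_SIZE):
            block = current[start:start + BLOCK_SIZE]
            best_mv, best_err = 0, float("inf")
            for mv in range(-SEARCH_RANGE, SEARCH_RANGE + 1):
                ref_start = start + mv
                if ref_start < 0 or ref_start + BLOCK_SIZE > len(reference):
                    continue
                candidate = reference[ref_start:ref_start + BLOCK_SIZE]
                err = sse(block, candidate)
                if err < best_err:
                    best_mv, best_err = mv, err
            ref_block = reference[start + best_mv:start + best_mv + BLOCK_SIZE]
            residual = [c - r for c, r in zip(block, ref_block)]
            result.append((best_mv, residual))
        return result

    # The "ball" (value 255) moves four pixels to the left between frames,
    # echoing the illustrative example above: motion vectors 4 and -4 with
    # zero residuals.
    reference = [0, 0, 0, 0, 255, 0, 0, 0]
    current   = [255, 0, 0, 0, 0, 0, 0, 0]
    print(estimate_motion(current, reference))

Only the motion vectors and the (here all-zero) residuals would need to
be stored, instead of the pixel values for each block.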

© 2000 by John F. McGowan, Ph.D.

Disclaimer

jmcgowan11@earthlink.net