Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Zlib compressed data

Zlib compression is commonly used in file formats. The zlib compressed data format, as defined in RFC1950, allows for multiple techniques but only the Deflate compression method, a variation of LZ77, is used.

Overview

Zlib compressed data consist of:

  • data header
  • compressed data
  • Adler-32 checksum of the uncompressed data

Characteristics

CharacteristicsDescription
Byte orderbig-endian

Data header

The data header is 2 or 6 bytes in size and consist of:

OffsetSizeValueDescription
The bit values are stored a 8-bit values
0.04 bitsCompression method
0.44 bitsCompression information
Flags
1.05 bitsCheck bits
1.51 bitPreset dictionary flag
1.62 bitsCompression level. The compression level is used mainly for re-compression
If the dictionary identifier flag is set
24Preset dictionary identifier, which contains an Adler-32 used to identifier the preset dictionary
Common
......Compressed data
...4Checksum, which contains an Adler-32 of the compressed data

The check bits value must be such that when the first 2 bytes are represented as a 16-bit unsigned integer in big-endian byte order the value is a multiple of 31, such that:

((first * 256) + second) % 31 = 0

Compression method

ValueIdentifierDescription
8Deflate (RFC1951), with a maximum window size of 32 KiB
15Reserved for additional header data

Note that RFC1950 only defines 8 as a valid compression method.

Compression information

The value of the compression information is dependent on the compression method.

Compression information - compression method 8 (Deflate)

For compression method 8 (Deflate) the compression information contains the base-2 logarithm of the LZ77 window size minus 8.

OffsetSizeValueDescription
0.04 bitsWindow size, which consists of a base-2 logarithm (2n), with a maximum value of 7 (32 KiB)

To determine the corresponding window size:

1 << (7 + 8)

E.g. a compression information value of 7 indicates a 32768 bytes window size. Values larger than 7 are not allowed according to RFC1950 and thus the maximum window size is 32768 bytes.

Compression level

ValueIdentifierDescription
0Fastest
1Fast
2Default
3Slowest, maximum compression

Compressed data

Deflate compressed data

The deflate compressed data consists of one or more deflate compressed blocks. Each block consists of:

  • block header
  • block data

Note that a block can reference uncompressed data that is stored in a previous block.

Block header

The block header is 3 bits in size and consists of:

OffsetSizeValueDescription
01 bitLast block (in stream) marker, where 1 represents the last block and 0 otherwise
0.12 bitsBlock type

Block types

ValueIdentifierDescription
0Uncompressed (or stored) block
1Fixed Huffman compressed block
2Dynamic Huffman compressed block
3Reserved (not used)

Uncompressed block data

The uncompressed block data is of variable size and consists of:

OffsetSizeValueDescription
0.35 bitsEmpty values (not used)
12Uncompressed data size
32Copy of uncompressed data size, which contains a 1s complement of the uncompressed data size
5...Uncompressed data

The uncompressed data size can range between 0 and 65535 bytes.

Huffman compressed block data

The uncompressed block data is of variable size and consists of:

  • Optional dynamic Huffman table
  • Encoded bit-stream
  • End-of-stream (or end-of-block or end-of-data) marker
Dynamic Huffman table

The dynamic Huffman table consists of:

OffsetSizeValueDescription
0.35 bitsNumber of literal codes, which is value + 257. The number of literal codes must be smaller than 286
1.05 bitsNumber of distance codes, which is value + 1. The number of distance codes must be smaller than 30
1.54 bitsThe number of Huffman codes for the code sizes, which is value + 4
2.1...The code sizes
......Huffman encoded stream of the Huffman codes for the literals
......Huffman encoded stream of the Huffman codes for the distances

A single code size value is 3 bits of size. A value of 0 means the code size is not used in the Huffman encoding of the literal and distance codes.

The codes size values are stored in the following sequence:

16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15

The first value applies to a code size of 16, the second to 17, etc. Code sizes that are not stored default to 0.

The code size values are used to construct the code sizes Huffman table. This must be a complete Huffman table which is used to decode the literal and distance codes. The corresponding codes size Huffman encoding is defined as:

ValueIdentifierDescription
0 - 15Represents a code size of 0 - 15
16Copy the previous code size 3 - 6 times. The next 2 bits indicate repeat length (0 = 3, ... , 3 = 6), e.g. codes 8, 16 (+2 bits 11), 16 (+2 bits 10) will expand to 12 code lengths of 8 (1 + 6 + 5)
17Repeat a code length of 0 for 3 - 10 times (3 bits of length)
18Repeat a code length of 0 for 11 - 138 times (7 bits of length)

Both the literal and distance Huffman codes are stored Huffman encoded using the code sizes Huffman table. Code sizes that are not stored default to 0. The code size for the literal code 256 (end-of-block) should be set and thus not 0.

Encoded bit-stream

The encoded bit-stream is stored in 8-bit integers, where bit values are stored back-to-front. So that 3 least-significant bits (LSB) would represent a 3-bit value at the start of the -stream. Note that the LSB of the 3-bit value is the LSB of the byte value.

Deflate uses a Huffman tree of 288 Huffman codes (or symbols) where the values:

  • 0 - 255; represent the literal byte values: 0 - 255
  • 256: represents the end of (compressed) stream (or block)
  • 257 - 285 (combined with extra-bits): represent a (size, offset) tuple (or match length) of 3 - 258 bytes
  • 286, 287: are not used (reserved) and their use is considered illegal although the values are still part of the tree

This document refers to this Huffman tree as the literals Huffman tree.

The bits in the encoded bit-stream correspond to values in the literals Huffman tree. If a symbol is found that represents a compression size and offset tuple (or match length code) the bits following the literals symbol contains a distance (Huffman) code. The match length coedes might require additional (or extra) bits to store the length (or size).

The distances Huffman tree contains space for 32 symbols. See section Distance codes. The distance code might require additional (or extra) bits to store the distance.

Literal codes

The literal codes consist of:

ValueIdentifierDescription
0x00 – 0xffliteral byte values
0x100end-of-block marker
0 additional bits
0x101Size of 3
0x102Size of 4
0x103Size of 5
0x104Size of 6
0x105Size of 7
0x106Size of 8
0x107Size of 9
0x108Size of 10
1 additional bit
0x109Size of 11 to 12
0x10aSize of 13 to 14
0x10bSize of 15 to 16
0x10cSize of 17 to 18
2 additional bits
0x10dSize of 19 to 22
0x10eSize of 23 to 26
0x10fSize of 27 to 30
0x110Size of 31 to 34
3 additional bits
0x111Size of 35 to 42
0x112Size of 43 to 50
0x113Size of 51 to 58
0x114Size of 59 to 66
4 additional bits
0x115Size of 67 to 82
0x116Size of 83 to 98
0x117Size of 99 to 114
0x118Size of 115 to 130
5 additional bits
0x119Size of 131 to 162
0x11aSize of 163 to 194
0x11bSize of 195 to 226
0x11cSize of 227 to 257
0 additional bits
0x11dSize of 258
Distance codes

The distance codes consist of:

ValueIdentifierDescription
0distance of 1
1distance of 2
2distance of 3
3distance of 4
1 additional bit
4distance of 5 - 6
5distance of 7 - 8
2 additional bits
6distance of 9 - 12
7distance of 13 - 16
3 additional bits
8distance of 17 - 24
9distance of 25 - 32
4 additional bits
10distance of 33 - 48
11distance of 49 - 64
5 additional bits
12distance of 65 - 96
13distance of 97 - 128
6 additional bits
14distance of 129 - 192
15distance of 193 - 256
7 additional bits
16distance of 257 - 384
17distance of 385 - 512
8 additional bits
18distance of 513 - 768
19distance of 769 - 1024
9 additional bits
20distance of 1025 - 1536
21distance of 1537 - 2048
10 additional bits
22distance of 2049 - 3072
23distance of 3073 - 4096
11 additional bits
24distance of 4097 - 6144
25distance of 6145 - 8192
12 additional bits
26distance 8193 - 12288
27distance 12289 - 16384
13 additional bits
28distance 16385 - 24576
29distance 24577 - 32768
other
30-31not used, reserved and illegal but still part of the tree

TODO: complete this section

Additional bits

The additional bits are stored in big-endian (MSB first) and indicate the index into the corresponding array of size values (or base size + additional size).

ValueIdentifierDescription
0 additional bits
0Offset of 1
1Offset of 2
2Offset of 3
3Offset of 4
1 additional bit

TODO: complete this section

Decompression

The decompression in pseudo code:

if block_header.type == HUFFMANN_FIXED:
{
    initialize the fixed Huffman trees
}

do
{
    read block_header from input stream

    if( block_header.type == UNCOMPRESSED )
    {
        align with next byte
        read and check block_header.size and block_header.size_copy
        read data of block_header.size
    }
    else
    {
        if( block_header.type == HUFFMANN_DYNAMIC )
        {
            read the dynamic Huffman trees (see subsection below)
        }
        loop (until end of block code recognized)
        {
            decode literal/length value from input stream
            if( value < 256 )
            {
                copy value (literal byte) to output stream
            }
            else if value = end of block (256)
            {
                 break from loop
             }
             else (value = 257..285)
             {
                 decode distance from input stream

                 move backwards distance bytes in the output
                 stream, and copy length bytes from this
                 position to the output stream.
            }
        }
    }
}
while( block_header.last_block_flag == 0 );

Adler-32 checksum

Zlib provides a highly optimized version of the algorithm provided below.

uint32_t adler32(
          uint8_t *buffer,
          size_t buffer_size,
          uint32_t previous_key )
{
    size_t buffer_iterator = 0;
    uint32_t lower_word    = previous_key & 0xffff;
    uint32_t upper_word    = ( previous_key >> 16 ) & 0xffff;

    for( buffer_iterator = 0;
         buffer_iterator < buffer_size;
         buffer_iterator++ )
    {
        lower_word += buffer[ buffer_iterator ];
        upper_word += lower_word;

        if( ( buffer_iterator != 0 )
         && ( ( buffer_iterator % 0x15b0 == 0 )
          ||  ( buffer_iterator == buffer_size - 1 ) ) )
        {
            lower_word = lower_word % 0xfff1;
            upper_word = upper_word % 0xfff1;
        }
    }
    return( ( upper_word << 16 ) | lower_word );
}

References