The Cape Peninsula University of Technology (CPUT) Electronic Theses and Dissertations (ETD) repository holds full-text theses and dissertations submitted for higher degrees at the University (including submissions from former Cape Technikon and Peninsula Technikon). |

## Real-time loss-less data compression

##### Abstract

Data stored on disks generally contain significant redundancy. A mechanism or algorithm that
recodes the data to lessen the data size could possibly double or triple the effective data that
could be stored on the media. One mechanism of doing this is by data compression.
Many compression algorithms currently exist, but each one has its own advantages as well as
disadvantages. The objective of this study', to formulate a new compression algorithm that
could be implemented in a real-time mode in any file system. The new compression algorithm
should also execute as fast as possible, so as not to cause a lag in the file systems performance.
This study focuses on binary data of any type, whereas previous articles such as (Huftnlan.
1952:1098), (Ziv & Lempel, 1977:337: 1978:530), (Storer & Szymanski. 1982:928) and
(Welch, 1984:8) have placed particular emphasis on text compression in their discussions of
compression algorithms for computer data.
The resulting compression algorithm that is formulated by this study is Lempel-Ziv-Toutlc
(LZT). LZT is basically an LZ77 (Ziv & Lempel, 1977:337) encoder with a buffer size equal in
size to that of the data block of the file system in question. LZT does not make this distinction,
it discards the sliding buffer principle and uses each data block of the entire input stream. as
one big buffer on which compression can be performed. LZT also handles the encoding of a
match slightly different to that of LZ77. An LZT match is encoded by two bit streams, the first
specifying the position of the match and the other specifying the length of the match. This
combination is commonly referred to as a <position, length> pair.
To encode the position portion of the <position, length> pair, we make use of a sliding scale
method. The sliding scale method works as follows. Let the position in the input buffer, of the
current character to be compressed be held by inpos, where inpos is initially set to 3. It is then
only possible for a match to occur at position 1 or 2. Hence the position of a match will never
be greater than 2, and therefore the position portion can be encoded using only 1 bit. As "inpos"
is incremented as each character is encoded, the match position range increases and therefore
more bits will be required to encode the match position.
The reason why a decimal 2 can be encoded 'sing only I bit can be explained as follows. When
decimal values are converted to binary values, we get 010 = 02, 110 = 12, 210, = 102etc. As a
position of 0 will never be used, it is possible to develop a coding scheme where a decimal
value of 1 can be represented by a binary value of 0, and a decimal value of 2 can be
represented by binary value of 1. Only I bit is therefore needed to encode match position I and
match position 2. In general. any decimal value n ca:) be represented by the binary equivalent
for (n - 1). The number of bits needed to encode (n - 1), indicates the number of bits needed to
encode the match position.
The length portion of the <position, length> pair is encoded using a variable length coding
(vlc) approach. The vlc method performs its encoding by using binary blocks. The first binary
block is 3 bits long, where binary values 000 through 110 represent decimal values I through 7.