|The Cape Peninsula University of Technology (CPUT) Electronic Theses and Dissertations (ETD) repository holds full-text theses and dissertations submitted for higher degrees at the University (including submissions from former Cape Technikon and Peninsula Technikon).|
Real-time loss-less data compression
Toufie, Moegamat Zahir
MetadataShow full item record
Data stored on disks generally contain significant redundancy. A mechanism or algorithm that recodes the data to lessen the data size could possibly double or triple the effective data that could be stored on the media. One mechanism of doing this is by data compression. Many compression algorithms currently exist, but each one has its own advantages as well as disadvantages. The objective of this study', to formulate a new compression algorithm that could be implemented in a real-time mode in any file system. The new compression algorithm should also execute as fast as possible, so as not to cause a lag in the file systems performance. This study focuses on binary data of any type, whereas previous articles such as (Huftnlan. 1952:1098), (Ziv & Lempel, 1977:337: 1978:530), (Storer & Szymanski. 1982:928) and (Welch, 1984:8) have placed particular emphasis on text compression in their discussions of compression algorithms for computer data. The resulting compression algorithm that is formulated by this study is Lempel-Ziv-Toutlc (LZT). LZT is basically an LZ77 (Ziv & Lempel, 1977:337) encoder with a buffer size equal in size to that of the data block of the file system in question. LZT does not make this distinction, it discards the sliding buffer principle and uses each data block of the entire input stream. as one big buffer on which compression can be performed. LZT also handles the encoding of a match slightly different to that of LZ77. An LZT match is encoded by two bit streams, the first specifying the position of the match and the other specifying the length of the match. This combination is commonly referred to as a <position, length> pair. To encode the position portion of the <position, length> pair, we make use of a sliding scale method. The sliding scale method works as follows. Let the position in the input buffer, of the current character to be compressed be held by inpos, where inpos is initially set to 3. It is then only possible for a match to occur at position 1 or 2. Hence the position of a match will never be greater than 2, and therefore the position portion can be encoded using only 1 bit. As "inpos" is incremented as each character is encoded, the match position range increases and therefore more bits will be required to encode the match position. The reason why a decimal 2 can be encoded 'sing only I bit can be explained as follows. When decimal values are converted to binary values, we get 010 = 02, 110 = 12, 210, = 102etc. As a position of 0 will never be used, it is possible to develop a coding scheme where a decimal value of 1 can be represented by a binary value of 0, and a decimal value of 2 can be represented by binary value of 1. Only I bit is therefore needed to encode match position I and match position 2. In general. any decimal value n ca:) be represented by the binary equivalent for (n - 1). The number of bits needed to encode (n - 1), indicates the number of bits needed to encode the match position. The length portion of the <position, length> pair is encoded using a variable length coding (vlc) approach. The vlc method performs its encoding by using binary blocks. The first binary block is 3 bits long, where binary values 000 through 110 represent decimal values I through 7.