Thursday, August 5, 2010

BWT (Burrows Wheeler Transform) Encoding Algorithm

Encoding
Procedure for implementing the algorithm:
1. Select a block size to be used. Block size is directly related to effectiveness of encoding and inversely related to the time required. Hence a compromise has to be reached.
2. Convert the data byte stream to blocks of n bytes where n is the block size chosen.
3. The following example illustrates the procedure to be done next.
Let n=6 and the first string be “kerala”. By wrap around rotations of the string by one character each the following strings can be derived. i.e., first the string is rotated one character to the right and the result taken and then one more character and so on until when one more rotation results in recovery of the original string.
kerala
eralak
ralake
alaker
lakera
akeral
Then these strings are sorted and the following order is obtained:
akeral
alaker
eralak
kerala
lakera
ralake
The string formed by taking the last letters of the sorted array of string forms the encoded data. i.e., here the encoded data is “lrkaae” . But obviously we cannot get back the original string from this data alone. We have to have something called the primary_index. It is very easy to get. While rotating the original string, remember the string formed by rotating the original string once i.e.,”eralak”. It’s position in the array of sorted strings gives the primary index. It occurs third in the sorted array of strings and hence the primary_index=2 (counting from 0).
4. Do this on each block of data until the data is exhausted. It is advisable to write the encoded string followed by the primary index on to the output file. Here we have dealt with data as string but it is not possible in the actual implementation as the data stream will have in it non-alphabetic characters also. But the adapatation to be done is very trivial.
The strings should not be kept in memory unless we have a hell lot of memory. Just keep an array of indices each element of which points to a position in the block and sort them. So the first element should point to the first data string and so on.
Decoding
The data that we now have with us is the encoded version and the primary index. So we now have “lrkaae” and the primary index (2) with us. Now we have to prepare a vector of n elements and an array to hold the sorted version of the encoded data. Sort the encoded data encoded_data and store it in sorted_data.
So now, sorted_data=”aaeklr” and encoded_data=”lrkaae”.
No we move to prepare the vector. We give the pseudo code below.

for(int i=0;i
{
for(int j=0;j
{
if(encoded_data[j]==sorted_data[i]&& encoded_data[j] is not flagged)
{
vector[i]=j;
flag encoded_data[j];
break;
}
}
}

So here the vector is {3,4,5,2,0,1}. Now the following simple code from Mark Nelson’s article on BWT in the DDJ does the job.
index=primary_index;
for(int i=0;i
{
output(encoded_data[index]);
index=vector[index];
}

Thus we get back “kerala”, the original string.
Now the question is, is “lrkaae” more suitable for compression compared to the original string “kerala”. Here in the encoded string both a’s have come together. When the block size chosen is large it causes more such occurrences. MTF (Move to front) encoder converts this data to one having a high frequency of characters having ASCII values less near 0. Encoding such data by an entropy encoder results in very good performance. Huffman algorithm may be used as the entropy encoder.
Sincere thanks to Mark Nelson’s article on Data compression by the Burrows Wheeler Transform from which I came to know of this fantastic algorithm.
Now the performance table using BeWT and HuffPack compared with WinZip, the famed archiver:
File Full Size BeWT+HuffPack HuffPack WinZip
a.htm 221514 061743 145632 043472
t.pdf 056553 051808 055969 045332
l.exe 040960 013159 020349 009680
notepad.exe 053248 023033 034176 017769
The sizes are in bytes after compression. Although the sizes do not quite match those of the famed commercial archiver WinZip (WinZip Computing Inc.) , the improvements over using HuffPack alone is great and it illustrates how well the method works. A better algorithm such as a combination of Run Length Encoding and Arithmetic encoding if used instead of Huffman should improve the compression and exploit the full potential of BWT as illustrated by Mark Nelson in his article.

0 comments:

Post a Comment