Dror Baron -
Fast algorithms for lossless data compression
The goal of my research in universal lossless compression was to
develop algorithms that are universal with respect to the unknown
input statistics while remaining computationally fast.
Universality:
Our algorithms produce coding
lengths that asymptotically approach the entropy rate.
For length-n inputs, the redundancy above entropy
achieved by our methods is 0.5 log(n) + O(1) bits
per unknown parameter (conditional probability). This redundancy is within
O(1) bits per parameter of Rissanen's lower bounds
on the redundancy of universal coding methods.
Computational efficiency:
The total amount of computation performed by our algorithms
is close to linear in the length n of the data.
Despite this efficiency, our methods retain the near-optimal redundancy described above.
To achieve these improvements, we used the Burrows-Wheeler
transform (BWT), which permutes a block of input symbols in an
invertible manner. The interesting feature
of the BWT is that its output distribution is
approximately piecewise i.i.d.; intuitively, the BWT
removes memory from a discrete source,
in a manner analogous to how the Karhunen-Loève transform
(KLT) removes correlation between samples of continuous-valued sources.
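As an illustration, the rotation-sorting construction of the BWT and its inverse can be sketched in a few lines of Python; this is a toy sketch with an explicit end-of-block sentinel of our choosing, not a production implementation:

```python
def bwt(s):
    """Burrows-Wheeler transform via sorted cyclic rotations.
    A sentinel '\0' (assumed absent from s) marks the block end."""
    t = s + "\0"
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)

def inverse_bwt(last):
    """Invert the BWT: repeatedly prepend the last column and re-sort,
    rebuilding the rotation table one column per iteration."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(row for row in table if row.endswith("\0"))[:-1]
```

On "banana" the transform groups equal symbols together, which is what makes the transformed block look piecewise i.i.d. to a simple coder.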
Specific contributions follow.
Two-part codes for compression of i.i.d. sources
We studied a class of two-part codes for compression
of i.i.d. sequences, and provided a deterministic
upper bound on their worst-case redundancy.
This redundancy exceeds Rissanen's lower bound by 1.05 bits,
and can be achieved using a quantizer structure that is
available in closed form.
D. Baron,
Y. Bresler,
and M. K. Mihcak,
"Two-Part Codes with Low Worst-Case Redundancies for
Distributed Compression of Bernoulli Sequences,"
Proceedings of the
37th Annual Conference on Information Sciences and Systems
(CISS2003), Baltimore, MD, March 2003
(pdf).
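To give a feel for the two-part idea, the following Python sketch computes a two-part code length for a Bernoulli sequence. The uniform parameter quantizer here is a deliberate simplification of the paper's closed-form quantizer, and the function names are ours:

```python
import math

def two_part_code_length(bits, K=64):
    """Two-part code length in bits for a Bernoulli sequence:
    part 1 sends the index of a quantized parameter estimate
    (log2(K) bits); part 2 is the ideal code length of the data
    under that quantized parameter."""
    n = len(bits)
    ones = sum(bits)
    p_hat = ones / n
    # Quantize p_hat to the nearest of K uniformly spaced levels in
    # (0, 1).  (The paper's quantizer is non-uniform and given in
    # closed form; uniform spacing is a simplification here.)
    level = min(range(K), key=lambda k: abs((k + 0.5) / K - p_hat))
    q = (level + 0.5) / K
    part1 = math.log2(K)
    part2 = -(ones * math.log2(q) + (n - ones) * math.log2(1 - q))
    return part1 + part2
```

Part 1 costs log2(K) bits regardless of n; part 2 pays a small cross-entropy penalty for coding with the quantized parameter q instead of the empirical estimate.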
Parallel compression
We described a parallel compressor that uses B
processors to compress a length-n input x
in O(n/B) time, while achieving a universal
redundancy within O(1) bits per parameter
of Rissanen's bounds. The main steps are
(i) partition x into
B blocks, and accumulate statistics on all
blocks in parallel;
(ii) merge the B sets of statistics into
a single estimate of the underlying source;
(iii) quantize the source parameters using
our near-optimum two-part codes; and
(iv) compress the B blocks in parallel, based on the
quantized source. This research appears in the following
publications, and is described in greater detail
here.
N. Krishnan,
D. Baron,
and M. K. Mihcak,
"A Parallel Two-Pass MDL Context Tree Algorithm for Universal Source Coding,"
IEEE Int. Symp. Inf. Theory, Honolulu, HI, June 2014
(pdf,
arxiv).
N. Krishnan and
D. Baron,
"Performance of Parallel Two-Pass MDL
Context Tree Algorithm,"
Proc. IEEE Global Conf. Signal Inf. Process.,
Atlanta, GA, December 2014
(pdf,
poster).
N. Krishnan and
D. Baron,
"A Universal Parallel Two-Pass MDL Context Tree Compression Algorithm,"
to appear in IEEE J. Sel. Topics Signal Process.,
vol. 9, no. 4, pp. 1-8, June 2015
(pdf,
arxiv).
You may also want to view the following tutorial video about
this research by
Nikhil Krishnan.
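Steps (i)-(iv) can be sketched in Python for the simplest case of a memoryless (Bernoulli) source; the papers handle context-tree sources, and the thread pool below merely stands in for the B processors:

```python
import math
from concurrent.futures import ThreadPoolExecutor

def parallel_compress_length(bits, B=4, K=64):
    """Toy sketch of the four-step parallel scheme for a Bernoulli
    source; returns the resulting two-part code length in bits."""
    n = len(bits)
    blocks = [bits[i * n // B:(i + 1) * n // B] for i in range(B)]

    # (i) accumulate statistics on all blocks in parallel
    with ThreadPoolExecutor(max_workers=B) as pool:
        counts = list(pool.map(lambda b: (sum(b), len(b)), blocks))

    # (ii) merge the B sets of statistics into one source estimate
    ones = sum(o for o, _ in counts)
    total = sum(m for _, m in counts)

    # (iii) quantize the merged parameter estimate (uniform levels
    # stand in for the closed-form two-part quantizer)
    level = min(range(K), key=lambda k: abs((k + 0.5) / K - ones / total))
    q = (level + 0.5) / K

    # (iv) code each block with the shared quantized parameter,
    # again in parallel; we only sum the ideal code lengths here
    def block_bits(b):
        o = sum(b)
        return -(o * math.log2(q) + (len(b) - o) * math.log2(1 - q))

    with ThreadPoolExecutor(max_workers=B) as pool:
        return math.log2(K) + sum(pool.map(block_bits, blocks))
```

Because every block is coded with the same quantized parameter, the per-block code lengths add up to the same total a single-processor two-part code would produce, while steps (i) and (iv) run in O(n/B) time per processor.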
Fast suffix sorting
In order to compute the BWT, all suffixes of the input
sequence must be sorted. The algorithms that were previously
fastest in practice have quadratic
worst-case complexity. In contrast,
our algorithm's worst case is O(n sqrt(log(n))),
and it is at least as fast in practice as all existing
algorithms whose worst case is O(n log(n)) or lower.
This provides a useful tradeoff between worst-case and
average performance.
Our algorithms are relatively simple,
and amenable to VLSI implementation. This may lead to fast
universal compression in hardware.
D. Baron and
Y. Bresler,
"Anti-Sequential Suffix Sorting for BWT-Based Data Compression,"
IEEE Transactions on Computers,
vol. 54, no. 4, pp. 385-397, April 2005
(pdf).
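For context, the link between suffix sorting and the BWT can be shown with the naive comparison-based sort, the quadratic-worst-case baseline that faster suffix-sorting algorithms improve on; the sentinel convention is ours:

```python
def suffix_array(t):
    """Naive suffix sorting by direct string comparison --
    simple, but quadratic (or worse) in the worst case."""
    return sorted(range(len(t)), key=lambda i: t[i:])

def bwt_via_suffixes(s):
    """BWT of s + sentinel, read off the sorted suffixes:
    output symbol i is the character preceding suffix sa[i]."""
    t = s + "\0"
    sa = suffix_array(t)
    return "".join(t[i - 1] for i in sa)  # t[-1] is the sentinel
```

With a unique smallest sentinel appended, sorting suffixes and sorting cyclic rotations give the same order, so the suffix array yields the BWT directly.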
Linear complexity universal coding
The prior art in universal coding methods included several
algorithms whose complexity was O(n log(n)).
There was a misconception that growing a
context tree, a data structure that maintains statistics
for repeated sub-sequences, requires at least
O(n log(n)) time. We showed that it is possible to provide a linear
complexity solution by combining suffix sorting tools,
which were well-known to the computer science community,
with the non-sequential semi-predictive coding method,
which was well known to the source coding community.
These results are discussed in:
D. Baron and
Y. Bresler,
"An O(N) Semi-Predictive Universal Encoder via the BWT,"
IEEE Transactions on Information Theory,
vol. 50, no. 5, pp. 928-937, May 2004
(pdf).
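A rough Python sketch of the semi-predictive idea follows: a first pass gathers statistics for every context up to a depth bound and prunes the context tree by MDL; a second pass would then encode the data with the pruned tree. The dictionary-based counting below is for clarity only and is not the linear-time BWT-based construction; the depth bound, the one-bit tree description, and all names are ours:

```python
import math

def kt_bits(zeros, ones):
    """Krichevsky-Trofimov code length in bits for a binary
    symbol count pair (symbol order does not matter)."""
    total, z, o = 0.0, 0, 0
    for _ in range(zeros):
        total -= math.log2((z + 0.5) / (z + o + 1))
        z += 1
    for _ in range(ones):
        total -= math.log2((o + 0.5) / (z + o + 1))
        o += 1
    return total

def semi_predictive_cost(bits, depth=3):
    """First pass: count occurrences in every context up to `depth`,
    then prune the context tree bottom-up by MDL.  Returns the
    estimated code length (tree description + data) in bits."""
    counts = {}
    for i in range(depth, len(bits)):
        for d in range(depth + 1):
            ctx = "".join(map(str, bits[i - d:i]))
            counts.setdefault(ctx, [0, 0])[bits[i]] += 1

    def cost(ctx):
        z, o = counts.get(ctx, (0, 0))
        if len(ctx) == depth:
            return kt_bits(z, o)
        # one bit encodes the keep-vs-split decision at this node
        return 1 + min(kt_bits(z, o), cost("0" + ctx) + cost("1" + ctx))

    return cost("")
```

On strongly Markov data (e.g. an alternating binary sequence) the pruned tree's cost is a small fraction of the memoryless KT cost, which is what the MDL pruning buys.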
Sequential compression:
As a by-product of my expertise in low-complexity universal
coding, I identified a minor glitch in a paper by Rissanen.
Having worked out an alternative solution, we provided
an O(n log(log(n))) sequential
universal algorithm, which improves on previous
O(n log(n)) approaches.