Parallel BZIP2 (PBZIP2)

Data Compression Software

by Jeff Gilchrist
(major contributions by Yavor Nikolov)

PBZIP2 is a parallel implementation of the bzip2 block-sorting file compressor that uses pthreads and achieves near-linear speedup on SMP machines. The output of this version is fully compatible with bzip2 v1.0.2 or newer (i.e., anything compressed with pbzip2 can be decompressed with bzip2). PBZIP2 should work on any system that has a pthreads-compatible C++ compiler (such as gcc). It has been tested on: Linux, Windows (cygwin & MinGW), Solaris, Tru64/OSF1, HP-UX, OS/2, OSX, and Irix.

NOTE: If you are looking for a parallel BZIP2 that works on cluster machines, you should check out MPIBZIP2 which was designed for a distributed-memory message-passing architecture.

Screen Shot

PBZIP2 v1.1.8 Screen Shot


License/Disclaimer

This software is distributed under a BSD-style license. For details, see the file COPYING. Use at your own risk. I take no responsibility for anything that happens to your data or equipment. Always test (bzip2 -tv) a compressed file containing important data before deleting the original to verify the compression was successful.
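
For scripted workflows, the same "test before deleting" advice can be approximated without the command-line tools. The following is only a sketch using Python's standard bz2 module, not part of pbzip2; the function names are made up for illustration. It decompresses the archive and compares a hash of the result with a hash of the original file.

```python
import bz2
import hashlib


def sha1_of_stream(fileobj, chunk_size=1 << 20):
    """Hash a file-like object incrementally so large files need little RAM."""
    digest = hashlib.sha1()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def compressed_copy_matches(original_path, compressed_path):
    """Return True if decompressing compressed_path reproduces original_path."""
    with open(original_path, "rb") as original:
        expected = sha1_of_stream(original)
    try:
        # bz2.open reads files made of several concatenated bzip2 streams.
        with bz2.open(compressed_path, "rb") as decompressed:
            actual = sha1_of_stream(decompressed)
    except (OSError, EOFError):
        # Corrupt or truncated archive: treat it as a failed verification.
        return False
    return expected == actual
```

A caller would delete the original only when compressed_copy_matches(...) returns True. This requires the original to still be on disk, so it fits best with the -k switch.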

If you find this software useful or you are using it in a government/business/commercial environment, please consider making a donation to help support future improvements.


Download

Click to download the latest version:



Source Code: PBZIP2 v1.1.10 (46 KB) [SHA-1: 9c2dee2648176e7f66bbe10f3efb63f219895d39]
[MD5: 373e61985156eaa3af26cad289a36283]
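
The published digests can be checked after downloading. Below is a minimal sketch using Python's hashlib; the two digest strings are copied from the listing above, while the function names and the file name in the usage note are assumptions for illustration.

```python
import hashlib

# Digests published on this page for the PBZIP2 v1.1.10 source archive.
EXPECTED_SHA1 = "9c2dee2648176e7f66bbe10f3efb63f219895d39"
EXPECTED_MD5 = "373e61985156eaa3af26cad289a36283"


def file_digest(path, algorithm, chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digests(path):
    """True only if both the SHA-1 and the MD5 digest match the listing."""
    return (file_digest(path, "sha1") == EXPECTED_SHA1
            and file_digest(path, "md5") == EXPECTED_MD5)
```

For example, matches_published_digests("pbzip2-1.1.10.tar.gz") should return True for an intact download, assuming the archive was saved under that name.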




 





Source Code Repository & Bug Reports



Launch Pad:  https://launchpad.net/pbzip2



 





Pre-built Packages



Debian/Ubuntu:  'apt-get update; apt-get install pbzip2' or get the Deb package
FreeBSD:  'pkg_add -r pbzip2' or get the package
Gentoo:  get the Ebuild package
Mandriva:  'urpmi pbzip2'
NetBSD:  get the package
OS/2: get the package
OSX:  'fink install pbzip2' or get the package
OSX:  Automator action and workflow service package
RedHat:  'yum install pbzip2'
Slackware:  get the package
Solaris:  get the package from OpenCSW or from sunfreeware
Windows:  install cygwin and compile yourself or get the 32bit binary package




 





Previous Versions



Source Code: https://launchpad.net/pbzip2/+download











Recent History

v1.1.10 (Nov 23, 2014)
  • Makefile: remove explicit CXX configuration
  • Banner refinements
v1.1.9 (Apr 13, 2014)
  • Spec file refinement for rpm builds thanks to Ville Skytta
  • Makefile refinements
  • Close redirected stdout on finish for better AFS/NFS support (bug #1300876) thanks to Richard Brittain
  • Fix printf format vs actual type misalignments (bug #1236086)
v1.1.8 (Jun 10, 2012)
  • Fixed metadata not being preserved when compressing empty files (bug #1011021)
v1.1.7 (Jun 06, 2012)
  • Fixed refusal to write to stdout on -dc from stdin (bug #886628)
  • Fixed occasional failure on decompress with --ignore-trailing-garbage=1 with multiple bad blocks in the archive (bug #886625)
v1.1.6 (Oct 30, 2011)
  • Fixed bug - deadlock due to unsynchronized broadcasts (bug #876686)
  • Prevent deletion of input files on error (bug #874543)
  • Document how to compress/decompress from standard input (bug #820525)
  • Added more detailed kernel error messages (bug #874605)
  • Fixes for error handling in multi-file processing (bug #883782)
v1.1.5 (Jul 16, 2011)
  • Fixed excessive output permissions while compress/decompress is in progress (bug #807536)


Contributions

- Bryan Stillwell <bryan [at] bokeoa {dot} com> - code cleanup, RPM spec, and prep work for inclusion in Fedora Extras
- Dru Lemley [http://lemley.net/smp.html] - help with large file support
- Kir Kolyshkin <kir [at] sacred {dot} ru> - autodetection for # of CPUs
- Joergen Ramskov <joergen [at] ramskov {dot} org> - initial version of man page
- Peter Cordes <peter [at] cordes {dot} ca> - code cleanup
- Kurt Fitzner <kfitzner [at] excelcia {dot} org> - port to Windows compilers and decompression throttling
- Oliver Falk <oliver [at] linux-kernel {dot} at> - RPM spec update
- Jindrich Novy <jnovy [at] redhat {dot} com> - code cleanup and bug fixes
- Benjamin Reed <ranger [at] befunk {dot} com> - autodetection for # of CPUs in OSX and maintains OSX packages
- Chris Dearman <chris [at] mips {dot} com> - fixed pthreads race condition that led to pthread resources issues when processing large numbers of files and random segfaults
- Richard Russon <ntfs [at] flatcap {dot} org> - help fix decompression bug
- Paul Pluzhnikov <paul [at] parasoft {dot} com> - fixed minor memory leak
- Anibal Monsalve Salazar <anibal [at] debian {dot} org> - creates and maintains Debian packages
- Steve Christensen - creates and maintains Solaris packages (sunfreeware.com)
- Alessio Cervellin - created and maintained Solaris packages (blastwave.org)
- Andre Przywara - creates and maintains Slackware packages (linuxpackages.net)
- Ying-Chieh Liao - created the FreeBSD port
- Andrew Pantyukhin <sat [at] FreeBSD {dot} org> - maintains the FreeBSD port and is willing to resolve any FreeBSD-related problems
- Roland Illig - creates and maintains the NetBSD packages
- Matt Turner <mattst88 [at] gmail {dot} com> - code cleanup
- Alvaro Reguly <alvaro [at] reguly {dot} com> - RPM spec update to support SUSE Linux
- Ivan Voras <ivoras [at] freebsd {dot} org> - support for stdin and pipes during compression and CPU detect changes
- John Dalton <john [at] johndalton {dot} info> - code cleanup and bug fix for stdin support
- Rene Georgi <rene.georgi [at] online {dot} de> - code and Makefile cleanup, support for direct decompress and bzcat 
- Rene Rheaume & Jeroen Roovers <jer [at] xs4all {dot} nl> - patch to support uclibc's lack of a getloadavg function
- Reinhard Schiedermeier <rs [at] cs {dot} hm {dot} edu> - support for tar --use-compress-prog=pbzip2
- Elbert Pol - creates and maintains OS/2 packages
- Nico Vrouwe <nico [at] gojelly {dot} com> - support for CPU detection on Windows
- Eduardo Terol <EduardoTerol [at] gmx {dot} net> - creates and maintains Windows 32bit package
- Nikita Zhuk <nikita [at] zhuk {dot} fi> - creates and maintains Mac OS X Automator action and workflow/service
- Jari Aalto <jari.aalto [at] cante {dot} net> - added long options to man page and -h output, added --loadavg, --read long options
- Scott Emery <emery [at] sgi {dot} com> - ignore fwrite return and pass chown errors in writeFileMetaData if effective uid root
- Steven Chamberlain <steven [at] pyro {dot} eu {dot} org> - code to support throttling compression to prevent memory exhaustion with slow output pipe
- Benjamin von Mossner - creates and maintains Solaris packages (opencsw.org)
- Yavor Nikolov <nikolov.javor [at] gmail {dot} com> - added support for multi-threaded decompression using STDIN/pipes, code to support throttling compression to prevent memory exhaustion with slow output pipe, major improvements to protection of shared variables, error and signal handling, program termination, outputBuffer usage redesigned as fixed-size circular buffer, added -S switch for thread stack size customization, fixed infinite loop when fileWriter fails to create output file at start, fixed command line parsing bug for -b, -p, -m switches, many minor bug fixes and improvements (see AUTHORS or pbzip2.cpp for full details)
- Tanguy Fautre <tanguy [at] aristechnologies {dot} com> - created v2.0 development branch of pbzip2.  Source re-factored to be much more modular, now uses CMake, using Boost threading model instead of pthreads, created lipbz2 library to access multi-threaded capabilities as a library, error handling done via exceptions, standard I/O redone to be type safe.


Special Thanks for suggestions and testing to: Phillippe Welsh, Cassens Transport Co., James Terhune, Dru Lemley, Bryan Stillwell, George Chalissery, Kir Kolyshkin, Madhu Kangara, Mike Furr, Joergen Ramskov, Kurt Fitzner, Peter Cordes, Oliver Falk, Jindrich Novy, Benjamin Reed, Chris Dearman, Richard Russon, Anibal Monsalve Salazar, Jim Leonard, Paul Pluzhnikov, Robert Archard, Coran Fisher, Ken Takusagawa, David Pyke, Matt Turner, Damien Ancelin, Alvaro Reguly, Ivan Voras, John Dalton, Sami Liedes, Rene Georgi, Rene Rheaume, Jeroen Roovers, Reinhard Schiedermeier, Kari Pahula, Elbert Pol, Nico Vrouwe, Eduardo Terol, Samuel Thibault, Michael Fuereder, Jari Aalto, Scott Emery, Steven Chamberlain, Yavor Nikolov, Nikita Zhuk, Joao Seabra, Conn Clark, Mark A. Haun, Tim Bielawa, Michal Gorny, Mikolaj Habdank, Christian Kujau, Marc-Christian Petersen, Piero Ottuzzi, Ephraim Ofir, Laszlo Ersek, Benjamin von Mossner, Tanguy Fautre, Mihai Lazarescu, Knuth Posern, Dima Tisnek, Amit Belani, Andy Isaacs, David James, Mikolaj Izdebski, Assaf Gordon.



Benchmark Results

The following benchmark was performed using an SGI Altix 3700 Bx2 system with 128 1.6GHz Itanium2 Processors, 6MB cache, 256GB system memory running Linux Kernel 2.4.21-sgi306rp31 on the SHARCNET computing network.

Benchmark results for compressing 1.83GB of data on an Itanium2 1.6 GHz system.

The following benchmark was performed with various systems using a 900k block size.  The pbzip2 software was benchmarked with the Itanium2, Opteron, and Xeon processors using a Linux 2.6 64bit kernel while the Core2 used Windows Vista 64bit (cygwin).

Benchmark results for compressing 159MB of data with 900k block size on various machines.

For more benchmark information click here.


PBZIP2 Data Format

You should be able to compress files larger than 4GB with pbzip2.

Files that are compressed with pbzip2 are broken up into pieces and each individual piece is compressed.  This is how pbzip2 runs faster on multiple CPUs since the pieces can be compressed simultaneously. The final .bz2 file may be slightly larger than if it was compressed with the regular bzip2 program due to this file splitting (usually less than 0.2% larger).  Files that are compressed with pbzip2 will also gain considerable speedup when decompressed using pbzip2.

Files that were compressed using bzip2 will not see speedup since bzip2 packages the data into a single chunk that cannot be split between processors.  pbzip2 will still be able to decompress these files, but it will be slower than if the .bz2 file was created with pbzip2.

A file compressed with bzip2 will contain one compressed stream of data that looks like this:
[-----------------------------------------------------]

Data compressed with pbzip2 is broken into multiple streams and each stream is bzip2 compressed looking like this:
[-----|-----|-----|-----|-----|-----|-----|-----|-----]
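
The layout above can be imitated in a few lines to see why the result is still a valid .bz2 file. This is a sequential sketch using Python's bz2 module, not pbzip2's threaded implementation; the function name is hypothetical, and the 900,000-byte piece size assumes that "900k" means 900 x 1000 bytes.

```python
import bz2


def compress_in_pieces(data, piece_size=900_000, level=9):
    """Compress each piece as its own bzip2 stream and concatenate the streams."""
    # An empty input still produces one (empty) stream so the output is valid.
    pieces = [data[i:i + piece_size]
              for i in range(0, len(data), piece_size)] or [b""]
    # Each bz2.compress call yields a complete, independent stream.
    # A parallel compressor can hand these calls to separate threads.
    return b"".join(bz2.compress(piece, level) for piece in pieces)
```

Because every piece is a complete stream, bz2.decompress(compress_in_pieces(data)) returns the original data. Comparing len(compress_in_pieces(data)) with len(bz2.compress(data)) gives a feel for the size overhead of splitting on a particular input.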

If you are writing software with libbzip2 to decompress data created with pbzip2, you must take into account that the data contains multiple bzip2 streams. You will encounter an end-of-stream marker from libbzip2 after each stream and must look ahead to see whether there are any more streams to process before quitting.  The bzip2 program itself will automatically handle this condition.
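
The look-ahead logic is easiest to see in a short sketch. The one below uses Python's bz2.BZ2Decompressor rather than the libbzip2 C API, so it only illustrates the control flow: after each end-of-stream marker, any leftover input is fed to a fresh decompressor. The function name is made up, and trailing garbage is not handled.

```python
import bz2


def iter_decompressed(fileobj, chunk_size=64 * 1024):
    """Yield decompressed data from input holding one or more bzip2 streams."""
    decompressor = bz2.BZ2Decompressor()
    stream_started = False
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        while chunk:
            stream_started = True
            out = decompressor.decompress(chunk)
            if out:
                yield out
            if decompressor.eof:
                # End-of-stream marker reached: leftover bytes, if any,
                # belong to the next stream and need a new decompressor.
                chunk = decompressor.unused_data
                decompressor = bz2.BZ2Decompressor()
                stream_started = False
            else:
                chunk = b""
    if stream_started:
        raise EOFError("input ended in the middle of a bzip2 stream")
```

A single BZ2Decompressor stops at the first end-of-stream marker, which is the pitfall described above: without the inner loop, only the first piece of a pbzip2 file would be returned.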

Usage

Run pbzip2 for the help listing.

===================================================================

Usage: pbzip2 [-1 .. -9] [-b#cdfhklm#p#qrS#tvVz] <filename> <filename2> <filenameN>

-b# Where # is block size in 100k steps (default 9 = 900k)
-c, --stdout Output to standard out (stdout)
-d,--decompress Decompress file
-f,--force Force, overwrite existing output file
-h,--help Print this help message
-k,--keep Keep input file, do not delete
-l,--loadavg Load average determines max number processors to use
-m# Where # is max memory usage in 1MB steps (default 100 = 100MB)
-p# Where # is the number of processors (default: autodetect)
-q,--quiet Quiet mode (default)
-r,--read Read entire input file into RAM and split between processors
-S# Child thread stack size in 1KB steps (default stack size if unspecified)
-t,--test Test compressed file integrity
-v,--verbose Verbose mode
-V,--version Display version info for pbzip2 then exit
-z,--compress Compress file (default)
-1,--fast ... -9,--best Set BWT block size to 100k .. 900k (default 900k).
--ignore-trailing-garbage=# Ignore trailing garbage flag (1 - ignored; 0 - forbidden)

Example: pbzip2 -b15qk myfile.tar
Example: pbzip2 -p4 -r -5 myfile.tar second*.txt
Example: tar cf myfile.tar.bz2 --use-compress-prog=pbzip2 dir_to_compress/
Example: pbzip2 -d -m500 myfile.tar.bz2
Example: pbzip2 -dc myfile.tar.bz2 | tar x
Example: pbzip2 -c < myfile.txt > myfile.txt.bz2 

===================================================================

The pbzip2 program is a parallel version of bzip2 for use on shared memory machines. It provides near-linear speedup when used on true multi-processor machines and 5-10% speedup on Hyperthreaded machines. The output is fully compatible with the regular bzip2 data so any files created with pbzip2 can be uncompressed by bzip2 and vice-versa.

The default settings for pbzip2 will work well in most cases. The only switches you will likely need are -d to decompress files and -p to set the # of processors for pbzip2 to use if autodetect is not supported on your system or you want to use a specific # of CPUs.  Note that if you are using a large number of CPUs, you may wish to lower your default stack size setting (with the -S switch or ulimit) to reduce the amount of memory each thread uses.
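
When pbzip2 is driven from a script, these switches can be assembled programmatically. The helper below is a hypothetical sketch, not part of pbzip2: it only builds an argument list from switches documented in the usage text above (-d, -p#, -S#, -k) and leaves running the command to the caller.

```python
def pbzip2_command(files, processors=None, decompress=False,
                   stack_size_kb=None, keep=False):
    """Build a pbzip2 argument list from switches in the usage listing."""
    args = ["pbzip2"]
    if decompress:
        args.append("-d")                    # decompress instead of compress
    if processors is not None:
        args.append(f"-p{processors}")       # override CPU autodetection
    if stack_size_kb is not None:
        args.append(f"-S{stack_size_kb}")    # thread stack size in 1KB steps
    if keep:
        args.append("-k")                    # keep the input file
    args.extend(files)
    return args
```

The list can be passed to subprocess.run(..., check=True) on a system where pbzip2 is installed, which avoids shell quoting problems with unusual file names.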

Example 1:
pbzip2 -v myfile.tar

This example will compress the file "myfile.tar" into the compressed file "myfile.tar.bz2". It will use the autodetected # of processors (or 2 processors if autodetect is not supported) with the default file block size of 900k and default BWT block size of 900k.

The program would report something like:
===================================================================

Parallel BZIP2 v1.1.0 - by: Jeff Gilchrist [http://compression.ca]
[Mar. 13, 2010] (uses libbzip2 by Julian Seward)
Major contributions: Yavor Nikolov <nikolov.javor+pbzip2@gmail.com>

# CPUs: 2
BWT Block Size: 900k
File Block Size: 900k
Maximum Memory: 100 MB
-------------------------------------------
File #: 1 of 1
Input Name: myfile.tar
Output Name: myfile.tar.bz2

Input Size: 7428687 bytes
Compressing data...
Output Size: 3236549 bytes
-------------------------------------------

Wall Clock: 2.809000 seconds

===================================================================

Example 2:
pbzip2 -b15vk myfile.tar

This example will compress the file "myfile.tar" into the compressed file "myfile.tar.bz2". It will use the autodetected # of processors (or 2 processors if autodetect is not supported) with a file block size of 1500k and a BWT block size of 900k. The file "myfile.tar" will not be deleted after compression is finished.

The program would report something like:
===================================================================

Parallel BZIP2 v1.1.0 - by: Jeff Gilchrist [http://compression.ca]
[Mar. 13, 2010] (uses libbzip2 by Julian Seward)
Major contributions: Yavor Nikolov <nikolov.javor+pbzip2@gmail.com>

# CPUs: 2
BWT Block Size: 900k
File Block Size: 1500k
Maximum Memory: 100 MB
-------------------------------------------
File #: 1 of 1
Input Name: myfile.tar
Output Name: myfile.tar.bz2

Input Size: 7428687 bytes
Compressing data...
Output Size: 3236394 bytes
-------------------------------------------

Wall Clock: 3.059000 seconds

===================================================================

Example 3:
pbzip2 -p4 -r -5 -v myfile.tar second*.txt

This example will compress the file "myfile.tar" into the compressed file "myfile.tar.bz2". It will use 4 processors with a BWT block size of 500k. The file block size will be the size of "myfile.tar" divided by 4 (# of processors) so that the data will be split evenly among the processors. This requires that you have enough RAM for pbzip2 to read the entire file into memory for compression. pbzip2 will then use the same options to compress all other files that match the wildcard "second*.txt" in that directory.

The program would report something like:
===================================================================

Parallel BZIP2 v1.1.0 - by: Jeff Gilchrist [http://compression.ca]
[Mar. 13, 2010] (uses libbzip2 by Julian Seward)
Major contributions: Yavor Nikolov <nikolov.javor+pbzip2@gmail.com>

# CPUs: 4
BWT Block Size: 500k
File Block Size: 1857k
Maximum Memory: 100 MB
-------------------------------------------
File #: 1 of 3
Input Name: myfile.tar
Output Name: myfile.tar.bz2

Input Size: 7428687 bytes
Compressing data...
Output Size: 3237105 bytes
-------------------------------------------
File #: 2 of 3
Input Name: secondfile.txt
Output Name: secondfile.txt.bz2

Input Size: 5897 bytes
Compressing data...
Output Size: 3192 bytes
-------------------------------------------
File #: 3 of 3
Input Name: secondbreakfast.txt
Output Name: secondbreakfast.txt.bz2

Input Size: 83531 bytes
Compressing data...
Output Size: 11832 bytes
-------------------------------------------

Wall Clock: 5.127381 seconds

===================================================================
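
The block sizes in Examples 1 to 3 can be checked by hand. The sketch below assumes that "k" in the reports means 1000 bytes, which is consistent with the 1857k shown in Example 3 but is not stated on this page; the helper names are made up for illustration.

```python
import math

INPUT_SIZE = 7_428_687  # bytes, the "Input Size" of myfile.tar in the examples


def file_block_size_with_r(input_size, processors):
    """With -r the file is split evenly, one piece per processor."""
    return input_size // processors


def piece_count(input_size, block_size_bytes):
    """Number of pieces for a fixed file block size (last piece may be short)."""
    return math.ceil(input_size / block_size_bytes)


print(file_block_size_with_r(INPUT_SIZE, 4) // 1000)  # 1857, as in Example 3
print(piece_count(INPUT_SIZE, 900_000))               # 9 pieces at 900k
print(piece_count(INPUT_SIZE, 1_500_000))             # 5 pieces with -b15
```

Under the same assumption, the default 900k setting gives 9 pieces of work for this file and -b15 gives 5, so a smaller file block size leaves more pieces to spread across processors.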

Example 4:
tar cf myfile.tar.bz2 --use-compress-prog=pbzip2 dir_to_compress/
or, with an explicit pipe:
tar -c directory_to_compress/ | pbzip2 -vc > myfile.tar.bz2

This example will compress the data piped from TAR to pbzip2 into the compressed file "myfile.tar.bz2".  It will use the autodetected # of processors (or 2 processors if autodetect is not supported) with the default file block size of 900k and default BWT block size of 900k.  TAR is collecting all of the files from the "directory_to_compress/" directory and passing the data to pbzip2 as it works.

The program would report something like:
===================================================================

Parallel BZIP2 v1.1.0 - by: Jeff Gilchrist [http://compression.ca]
[Mar. 13, 2010] (uses libbzip2 by Julian Seward)
Major contributions: Yavor Nikolov <nikolov.javor+pbzip2@gmail.com>

# CPUs: 2
BWT Block Size: 900k
File Block Size: 900k
Maximum Memory: 100 MB
-------------------------------------------
File #: 1 of 1
Input Name: <stdin>
Output Name: <stdout>

Compressing data...
-------------------------------------------

Wall Clock: 0.176441 seconds

===================================================================

Example 5: pbzip2 -dv -m500 myfile.tar.bz2

This example will decompress the file "myfile.tar.bz2" into the decompressed file "myfile.tar". It will use the autodetected # of processors (or 2 processors if autodetect is not supported). It will use a maximum of 500MB of memory for decompression. The switches -b, -r, and -1..-9 are not valid for decompression.

The program would report something like:
===================================================================

Parallel BZIP2 v1.1.0 - by: Jeff Gilchrist [http://compression.ca]
[Mar. 13, 2010] (uses libbzip2 by Julian Seward)
Major contributions: Yavor Nikolov <nikolov.javor+pbzip2@gmail.com>

# CPUs: 2
Maximum Memory: 500 MB
-------------------------------------------
File #: 1 of 1
Input Name: myfile.tar.bz2
Output Name: myfile.tar

BWT Block Size: 900k
Input Size: 3236549 bytes
Decompressing data...
Output Size: 7428687 bytes
-------------------------------------------

Wall Clock: 1.154000 seconds

===================================================================

Example 6:
tar xf myfile.tar.bz2 --use-compress-prog=pbzip2
or, with an explicit pipe:
pbzip2 -dvc myfile.tar.bz2 | tar x

This example will decompress the file "myfile.tar.bz2" and pass the decompressed data stream to TAR via a pipe.  It will use the autodetected # of processors (or 2 processors if autodetect is not supported). The switches -b, -r, and -1..-9 are not valid for decompression. TAR will extract all the data from the tar archive into the current directory.

The program would report something like:
===================================================================

Parallel BZIP2 v1.1.3 - by: Jeff Gilchrist [http://compression.ca]
[Mar. 27, 2011] (uses libbzip2 by Julian Seward)
Major contributions: Yavor Nikolov <nikolov.javor+pbzip2@gmail.com>

# CPUs: 4
Maximum Memory: 100 MB
Ignore Trailing Garbage: off
-------------------------------------------
File #: 1 of 1
Input Name: myfile.tar.bz2
Output Name: <stdout>

BWT Block Size: 900k
Input Size: 265121 bytes
Decompressing data (no threads)...
-------------------------------------------

Wall Clock: 0.142000 seconds

===================================================================

Example 7: pbzip2 -cv < myfile.txt > myfile.txt.bz2

This example will read the file "myfile.txt" from standard input and compress it to standard output, which is redirected to the file "myfile.txt.bz2".

The program would report something like:
===================================================================

Parallel BZIP2 v1.1.6 - by: Jeff Gilchrist [http://compression.ca]
[Oct. 30, 2011] (uses libbzip2 by Julian Seward)
Major contributions: Yavor Nikolov <nikolov.javor+pbzip2@gmail.com>

# CPUs: 24
BWT Block Size: 900 KB
File Block Size: 900 KB
Maximum Memory: 100 MB
-------------------------------------------
File #: 1 of 1
Input Name: <stdin>
Output Name: <stdout>

Compressing data...
Output Size: 29897646 bytes
-------------------------------------------

Wall Clock: 2.212470 seconds

===================================================================

Bugs/Contact

If you would like to report any bugs, please create a bug entry on the PBZIP2 Bug Tracker on Launchpad (https://launchpad.net/pbzip2).  To contact me by e-mail, use the PBZIP2 Contact Address.


  • This web page is maintained by Jeff Gilchrist, Copyright (C) 2003-2014.