bzip2, bunzip2 - a block-sorting file compressor
bzcat - decompresses files to stdout
bzip2recover - recovers data from damaged bzip2 files
bzip2 compresses files using the Burrows-Wheeler block sorting text compression algorithm, and Huffman coding. Compression is generally considerably better than that achieved by more conventional LZ77/LZ78-based compressors, and approaches the performance of the PPM family of statistical compressors.
The command-line options are deliberately very similar to those of GNU gzip, but they are not identical.
Files named on the command line (or expanded by globbing) are replaced by a compressed version whose name is suffixed with .bz2.
Existing files are not overwritten; specify -f to force overwriting.
If no file names are given, bzip2 compresses from standard input to standard output.
bunzip2 (or bzip2 -d) decompresses the specified files. Files which were not created by bzip2 are detected and skipped, and a warning issued.
filename.bz2 becomes filename
If the file does not end in a recognised ending, .bz2, .bz, .tbz2 or .tbz, bzip2 warns that it cannot determine the name of the original file, and uses the original name with .out appended.
bunzip2 correctly decompresses a file which is the concatenation of two or more compressed files.
Integrity testing (-t) of concatenated compressed files is supported.
Files are decompressed to the standard output when -c is used.
bzip2 reads arguments from the environment variables BZIP2 and BZIP, in that order, and will process them before any arguments read from the command line. This gives a convenient way to supply default arguments.
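As a quick sketch of this behaviour (file paths under /tmp are arbitrary choices for the example), options placed in the BZIP2 variable are picked up before the command-line arguments:

```shell
# Default options from the environment: BZIP2 is read before argv,
# so "-1" here selects the 100k block size without appearing on the
# command line.
printf 'hello bzip2\n' > /tmp/demo.txt
BZIP2="-1" bzip2 -c /tmp/demo.txt > /tmp/demo.txt.bz2
bunzip2 -c /tmp/demo.txt.bz2 > /tmp/demo.out
cmp /tmp/demo.txt /tmp/demo.out && echo "round trip ok"
```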
bzip2 uses 32-bit CRCs to verify that the decompressed file is correct.
The compressed file may be slightly larger than the original. Files of less than one hundred bytes tend to get larger, since the compression mechanism has a constant overhead in the region of 50 bytes. Random data (including the output of most file compressors) is coded at about 8.05 bits per byte, giving an expansion of around 0.5%.
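The constant-overhead effect on tiny files is easy to observe (file names are arbitrary for the example):

```shell
# A tiny input grows under compression: bzip2's roughly 50-byte constant
# overhead dominates files of under about one hundred bytes.
printf 'tiny' > /tmp/tiny.txt                    # 4 bytes of input
bzip2 -c /tmp/tiny.txt > /tmp/tiny.txt.bz2
orig=$(wc -c < /tmp/tiny.txt)
comp=$(wc -c < /tmp/tiny.txt.bz2)
echo "original: $orig bytes, compressed: $comp bytes"
[ "$comp" -gt "$orig" ] && echo "compressed file is larger"
```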
MEMORY MANAGEMENT
bzip2 compresses files in blocks, and the block size affects both the compression ratio achieved and the amount of memory needed for compression and decompression. The flags -1 through -9 specify the block size to be 100,000 bytes through 900,000 bytes (the default).
At decompression time, the block size used for compression is read from the header of the compressed file.
Memory requirements can be estimated as:

    Compression:   400k + ( 8 x block size )
    Decompression: 100k + ( 4 x block size ), or
                   100k + ( 2.5 x block size ) with -s
Larger block sizes give rapidly diminishing marginal returns. Most of the compression comes from the first two or three hundred k of block size. It is important to consider that the decompression memory requirement is set at compression time by the choice of block size.
For files compressed with the default 900k block size, about 3700 kB will be required to decompress.
Use the largest block size memory constraints allow, since that maximises the compression achieved. Compression and decompression speed are virtually unaffected by block size.
Another significant point applies to files which fit in a single block -- that means most files you'd encounter using a large block size. For such files, the amount of real memory touched is proportional to the size of the file, since the file is smaller than a block.
This table summarises the maximum memory usage for different block sizes.
           Compress   Decompress   Decompress   Corpus
    Flag   usage      usage        -s usage     Size

     -1    1200k       500k         350k        914704
     -2    2000k       900k         600k        877703
     -3    2800k      1300k         850k        860338
     -4    3600k      1700k        1100k        846899
     -5    4400k      2100k        1350k        845160
     -6    5200k      2500k        1600k        838626
     -7    6100k      2900k        1850k        834096
     -8    6800k      3300k        2100k        828642
     -9    7600k      3700k        2350k        828642
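The block-size flags and the -s (small-memory) decompression mode can be exercised directly; the data and file names below are arbitrary for the example:

```shell
# Compress the same data with two block sizes. The block size used is
# recorded in the file header, so decompression needs no flag; -s merely
# trades speed for roughly half the decompression memory.
head -c 200000 /dev/zero > /tmp/blocks.dat
bzip2 -1 -c /tmp/blocks.dat > /tmp/small.bz2    # 100k blocks
bzip2 -9 -c /tmp/blocks.dat > /tmp/big.bz2      # 900k blocks (the default)
bunzip2 -s -c /tmp/small.bz2 | cmp - /tmp/blocks.dat && echo "-1 round trip ok"
bunzip2 -s -c /tmp/big.bz2   | cmp - /tmp/blocks.dat && echo "-9 round trip ok"
```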
RECOVERING DATA FROM DAMAGED FILES
The compressed representation of each block is delimited by a 48-bit pattern, which makes it possible to find the block boundaries with reasonable certainty. Each block also carries its own 32-bit CRC, so damaged blocks can be distinguished from undamaged ones.
bzip2recover is a simple program whose purpose is to search for blocks in .bz2 files, and write each block out into its own .bz2 file. You can then use bzip2 -t to test the integrity of the resulting files, and decompress those which are undamaged.
bzip2recover takes a single argument, the name of the damaged file, and writes a number of files "rec00001file.bz2", "rec00002file.bz2", etc, containing the extracted blocks. The output filenames are designed so that the use of wildcards in subsequent processing -- for example, "bzip2 -dc rec*file.bz2 > recovered_data" -- processes the files in the correct order.
bzip2recover should be of most use dealing with large .bz2 files, as these will contain many blocks. It is clearly futile to use it on damaged single-block files, since a damaged block cannot be recovered. If you wish to minimise any potential data loss through media or transmission errors, you might consider compressing with a smaller block size.
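The recovery workflow can be sketched as follows; this assumes bzip2recover is installed alongside bzip2, and the file names are placeholders (here the archive is undamaged, so every extracted block passes the test):

```shell
# Split an archive into per-block files, test each, and rejoin the
# survivors. 300000 bytes at -1 (100k blocks) spans several blocks.
rm -f /tmp/archive /tmp/archive.bz2 /tmp/rec*archive.bz2
head -c 300000 /dev/urandom > /tmp/archive
bzip2 -1 -k /tmp/archive                 # keep the original for comparison
cd /tmp && bzip2recover archive.bz2      # writes rec00001archive.bz2, ...
bzip2 -t rec*archive.bz2                 # test each extracted block
bzip2 -dc rec*archive.bz2 | cmp - /tmp/archive && echo "blocks rejoin cleanly"
```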
Decompression speed is unaffected by the choice of block size.
bzip2 allocates several megabytes of memory and accesses all over it in a fairly random fashion. This means that performance, both for compressing and decompressing, is largely determined by the speed at which your machine can service cache misses. Because of this, small changes to the code to reduce the miss rate have been observed to give disproportionately large performance improvements. I imagine bzip2 will perform best on machines with very large caches.
bzip2, bunzip2 and bzcat are really the same program; the decision about which actions to take is made on the basis of the name by which it is invoked.
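Because of this, bzcat and bzip2 -dc are interchangeable, as a quick check shows (file names arbitrary):

```shell
# bzcat and "bzip2 -dc" produce identical output: they are one binary
# dispatching on the name it was invoked under.
printf 'same program\n' > /tmp/p.txt
bzip2 -c /tmp/p.txt > /tmp/p.txt.bz2
bzcat /tmp/p.txt.bz2 > /tmp/via_bzcat
bzip2 -dc /tmp/p.txt.bz2 > /tmp/via_flag
cmp /tmp/via_bzcat /tmp/via_flag && echo "identical output"
```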
bzip2recover versions prior to 1.0.2 used 32-bit integers to represent bit positions in compressed files, so they could not handle compressed files more than 512 megabytes long.
AUTHOR
Julian Seward, jseward@bzip.org.
The ideas embodied in bzip2 are due to (at least) the following people: Michael Burrows and David Wheeler (for the block sorting transformation), David Wheeler (again, for the Huffman coder), Peter Fenwick (for the structured coding model in the original bzip, and many refinements), and Alistair Moffat, Radford Neal and Ian Witten (for the arithmetic coder in the original bzip). I am much indebted for their help, support and advice. See the manual in the source distribution for pointers to sources of documentation. Christian von Roques encouraged me to look for faster sorting algorithms, so as to speed up compression. Bela Lubkin encouraged me to improve the worst-case compression performance. Donna Robinson XMLised the documentation. The bz* scripts are derived from those of GNU gzip. Many people sent patches, helped with portability problems, lent machines, gave advice and were generally helpful.
This is the output of gzip -l *gz:

    compressed  uncompressed  ratio uncompressed_name
            20             0   0.0% smother.diske-
            20             0   0.0% smother.diskf-
            20             0   0.0% smother.diskg-
            20             0   0.0% smother.diskh-
     798830592    3596346423  77.8% smother_wd0e
     798830672    3596346423  77.8% (totals)

With gzip -lv *gz:

    method  crc      date   time   compressed  uncompressed  ratio uncompressed_name
    defla   00000000 Sep  1 15:00          20             0   0.0% smother.diske-
    defla   00000000 Sep  1 15:01          20             0   0.0% smother.diskf-
    defla   00000000 Sep  1 15:01          20             0   0.0% smother.diskg-
    defla   00000000 Sep  1 15:01          20             0   0.0% smother.diskh-
    defla   dbd673f2 Sep  1 16:09   798830592    3596346423  77.8% smother_wd0e
                                    798830672    3596346423  77.8% (totals)

The uncompressed size is given as -1 for files not in gzip format, such as compressed .Z files. To get the uncompressed size for such a file, you can use:

    zcat file.Z | wc -c
-c --stdout --to-stdout
    Write output on standard output; keep original files unchanged. If there are several input files, the output consists of a sequence of independently compressed members. To obtain better compression, concatenate all input files before compressing them.

-f --force
    Force compression or decompression even if the file has multiple links or the corresponding file already exists, or if the compressed data is read from or written to a terminal. If the input data is not in a format recognized by gzip, and if the option --stdout is also given, copy the input data without change to the standard output: let zcat behave as cat.

-h --help
    Display the gzip help text and quit.

-n --no-name
    When compressing, do not save the original file name and time stamp by default. (The original name is always saved if the name had to be truncated.) When decompressing, do not restore the original file name if present (remove only the gzip suffix from the compressed file name) and do not restore the original time stamp if present. This option is the default when decompressing.

-N --name
    When compressing, always save the original file name and time stamp; this is the default. When decompressing, restore the original file name and time stamp if present. This is useful on systems which have a limit on file name length or when the time stamp has been lost after a file transfer.

-q --quiet
    Suppress all warning messages.

-r --recursive
    Travel the directory structure recursively.

-S .suf --suffix .suf
    Use suffix .suf instead of .gz. Most useful for decompression. Any suffix can be given, but suffixes other than .z and .gz should be avoided to prevent confusion when files are transferred to other systems. A null suffix forces gunzip to try decompression on all given files regardless of suffix, as in:

        gunzip -S "" *       (*.* for MSDOS)

    Previous versions of gzip used the .z suffix. This was changed to avoid a conflict with pack.

-t --test
    Check the compressed file integrity.

-v --verbose
    Display the name and percentage reduction for each file compressed or decompressed.

-1 .. -9, --fast, --best
    Specify the speed/compression tradeoff: -1 (--fast) is the fastest method with less compression, and -9 (--best) is the slowest method with best compression.
The following command will find all gzip files in the current directory and subdirectories, and extract them in place without destroying the original:
find . -name '*.gz' -print | sed 's/^\(.*\)[.]gz$/gunzip < "&" > "\1"/' | sh
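The pipeline above can be exercised on a scratch directory (the directory and file names are arbitrary for the example):

```shell
# Run the in-place extraction pipeline: the .gz file is kept and an
# uncompressed copy appears beside it.
mkdir -p /tmp/gzdemo && cd /tmp/gzdemo
printf 'alpha\n' > a.txt && gzip -f a.txt      # leaves only a.txt.gz
find . -name '*.gz' -print | sed 's/^\(.*\)[.]gz$/gunzip < "&" > "\1"/' | sh
cat a.txt        # the extracted copy
ls a.txt.gz      # the original is still there
```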
If a compressed file contains several members, gunzip will extract all members at once. If one member is damaged, other members might still be recovered after removal of the damaged member.
This is an example of concatenating gzip files:

    gzip --to-stdout file1 > foo.gz
    gzip --to-stdout file2 >> foo.gz

Then

    gunzip --to-stdout foo

is equivalent to

    cat file1 file2
In case of damage to one member of a .gz file, other members can still be recovered (if the damaged member is removed).
Better compression is obtained by compressing all members at once:
cat file1 file2 | gzip > foo.gz
compresses better than
gzip --to-stdout file1 file2 > foo.gz
To recompress concatenated files to get better compression:
zcat old.gz | gzip > new.gz
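The size difference is easy to measure; the file contents and names below are arbitrary for the example:

```shell
# Two members compressed separately carry two headers and two fresh
# dictionaries; compressing the concatenation once is usually smaller.
yes "some repetitive line" | head -n 200 > /tmp/f1
cp /tmp/f1 /tmp/f2
gzip -c /tmp/f1 >  /tmp/two.gz
gzip -c /tmp/f2 >> /tmp/two.gz                  # two independent members
cat /tmp/f1 /tmp/f2 | gzip -c > /tmp/one.gz     # one member, shared context
wc -c /tmp/two.gz /tmp/one.gz
[ "$(wc -c < /tmp/one.gz)" -lt "$(wc -c < /tmp/two.gz)" ] && echo "joint wins"
```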
If a compressed file consists of several members, the uncompressed size and CRC reported by the --list option applies to the last member only.
To display the uncompressed size for all members, use:
zcat file.gz | wc -c
To create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip. GNU tar supports the -z option to invoke gzip transparently. gzip is designed as a complement to tar, not as a replacement.
The environment variable GZIP can hold a set of default options for gzip. These options are interpreted first and can be overwritten by explicit command line parameters. For example:

    for sh:    GZIP="-8v --name"; export GZIP
    for csh:   setenv GZIP "-8v --name"
    for MSDOS: set GZIP=-8v --name
On Vax/VMS, the name of the environment variable is GZIP_OPT, to avoid a conflict with the symbol set for invocation of the program.
When writing compressed data to a tape, it is generally necessary to pad
the output with zeroes up to a block boundary.
When the data is read and the whole block is passed to
gunzip for decompression,
gunzip detects that there is extra trailing garbage after the
compressed data and emits a warning by default. Use --quiet to suppress the warning.
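The padding scenario can be simulated without a tape (the payload and paths are arbitrary for the example):

```shell
# Zero padding after the compressed stream, as happens at tape block
# boundaries, counts as trailing garbage; with -q the warning is
# suppressed and the data still decompresses intact.
printf 'payload\n' | gzip -c > /tmp/padded.gz
head -c 512 /dev/zero >> /tmp/padded.gz          # simulate tape padding
gzip -dqc /tmp/padded.gz > /tmp/payload.out || true
cat /tmp/payload.out
```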
This option can be set in the
GZIP environment variable, as in:
    for sh:  GZIP="-q" tar -xfz --block-compress /dev/rst0
    for csh: (setenv GZIP "-q"; tar -xfz --block-compress /dev/rst0)
In the above example, gzip is invoked implicitly by the -z option of GNU tar. Make sure that the same block size (-b option of tar) is used for reading and writing compressed data on tapes. (This example assumes you are using the GNU version of tar.)
Report bugs to email@example.com. Include the version number, which you can find by running gzip -V. Also include in your message the hardware and operating system, the compiler used to compile gzip, a description of the bug behavior, and the input to gzip that triggered the bug.
gzip reduces the size of the named files using Lempel-Ziv coding (LZ77). Whenever possible, each file is replaced by one with the extension .gz, keeping the same ownership modes, access and modification times. (The default extension is z for MSDOS, OS/2 FAT and Atari.)
If a file name is "-", the standard input is compressed to the standard output.
gzip will only attempt to compress regular files; in particular, it will ignore symbolic links.
If the compressed file name is too long for its file system, gzip attempts to truncate only the parts of the file name longer than 3 characters. (A part is delimited by dots.) If the name consists of small parts only, the longest parts are truncated. For example, the name gzip.msdos.exe is compressed to gzi.msd.exe.
gzip keeps the original file name and timestamp in
the compressed file. These are used when decompressing the file with the
-N option. This is useful when the compressed file name was
truncated or when the time stamp was not preserved after a file transfer.
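The stored name survives renaming of the compressed file, which the following sketch shows (file names and content are arbitrary for the example):

```shell
# gzip saves the original name in the header by default; gunzip -N
# restores it even after the compressed file has been renamed.
cd /tmp && rm -f report.txt report.txt.gz mystery.gz
printf 'quarterly numbers\n' > report.txt
gzip -N report.txt                  # saves the name "report.txt" in the header
mv report.txt.gz mystery.gz
gunzip -N mystery.gz                # writes the output as report.txt
cat report.txt
```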
Compressed files can be restored to their original form using gzip -d, gunzip or zcat. If the original name saved in the compressed file is not suitable for its file system, a new name is constructed from the original one to make it legal.
gunzip takes a list of files on its command line and replaces each file whose name ends with .gz, -gz, .z, -z, or _z and which begins with the correct magic number with an uncompressed file without the original extension. gunzip also recognizes the special extensions .tgz and .taz as shorthands for .tar.gz and .tar.Z respectively. When compressing, gzip uses the .tgz extension if necessary instead of truncating a file with a .tar extension.
gunzip can currently decompress files created by gzip, zip, compress, compress -H or pack. The detection of the input format is automatic. When using the first two formats, gunzip checks a 32 bit CRC (cyclic redundancy check). For pack, gunzip checks the uncompressed length. The standard compress format was not designed to allow consistency checks. However gunzip is sometimes able to detect a bad .Z file. If you get an error when uncompressing a .Z file, do not assume that the .Z file is correct simply because the standard uncompress does not complain. This generally means that the standard uncompress does not check its input, and happily generates garbage output. The SCO compress -H format (lzh compression method) does not include a CRC but also allows some consistency checks.
Files created by zip can be uncompressed by gzip only if they have a single member compressed with the 'deflation' method. This feature is only intended to help conversion of tar.zip files to the tar.gz format. To extract zip files with several members, use unzip instead of gunzip. zcat is identical to gunzip -c. zcat uncompresses either a list of files on the command line or its standard input and writes the uncompressed data on standard output. zcat will uncompress files that have the correct magic number whether they have a .gz suffix or not.
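Detection by magic number rather than suffix can be checked as follows; gzip -dc is used here since on some systems the zcat command belongs to compress (file names are arbitrary for the example):

```shell
# A gzip stream stored under a name with no .gz suffix still
# decompresses to stdout: the format is recognized by magic number.
printf 'no suffix needed\n' | gzip -c > /tmp/blob   # note: no .gz extension
gzip -dc /tmp/blob
```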
gzip uses the Lempel-Ziv algorithm used in zip and PKZIP. The amount of compression obtained depends on the size of the input and the distribution of common substrings. Typically, text such as source code or English is reduced by 60-70%. Compression is generally much better than that achieved by LZW (as used in compress), Huffman coding (as used in pack), or adaptive Huffman coding (compact).
Compression is always performed, even if the result is slightly larger than the original. The worst case expansion is a few bytes for the gzip file header, plus 5 bytes every 32K block, or an expansion ratio of 0.015% for large files. Note that the actual number of used disk blocks almost never increases.
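The worst case is easy to provoke with incompressible input (file names are arbitrary for the example):

```shell
# Random bytes are incompressible, so the gzip output carries the header
# plus per-block overhead and ends up slightly larger than the input.
head -c 100000 /dev/urandom > /tmp/rand.bin
gzip -c /tmp/rand.bin > /tmp/rand.bin.gz
in=$(wc -c < /tmp/rand.bin); out=$(wc -c < /tmp/rand.bin.gz)
echo "in: $in bytes, out: $out bytes"
[ "$out" -gt "$in" ] && echo "slight expansion, as expected"
```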
gzip preserves the mode, ownership and timestamps of files when compressing or decompressing.
This is the output of the command gzip -h:

    gzip 1.2.4 (18 Aug 93)
    usage: gzip [-cdfhlLnNrtvV19] [-S suffix] [file ...]
     -c --stdout      write on standard output, keep original files unchanged
     -d --decompress  decompress
     -f --force       force overwrite of output file and compress links
     -h --help        give this help
     -l --list        list compressed file contents
     -L --license     display software license
     -n --no-name     do not save or restore the original name and time stamp
     -N --name        save or restore the original name and time stamp
     -q --quiet       suppress all warnings
     -r --recursive   operate recursively on directories
     -S .suf --suffix .suf  use suffix .suf on compressed files
     -t --test        test compressed file integrity
     -v --verbose     verbose mode
     -V --version     display version number
     -1 --fast        compress faster
     -9 --best        compress better
     file...          files to (de)compress. If none given, use standard input.

As of 05/16/10, this is the output of gzip -V:

    gzip 1.3.5 (2002-09-30)
    Copyright 2002 Free Software Foundation
    Copyright 1992-1993 Jean-loup Gailly
    This program comes with ABSOLUTELY NO WARRANTY.
    You may redistribute copies of this program under the terms of
    the GNU General Public License.
    For more information about these matters, see the file named COPYING.
    Compilation options:
    DIRENT UTIME STDC_HEADERS HAVE_UNISTD_H HAVE_MEMORY_H HAVE_STRING_H HAVE_LSTAT
    Written by Jean-loup Gailly.
This document was generated on 7 November 1998 using the texi2html translator version 1.52.