gzip The data compression program

Concept Index

  • concatenated files
  • Environment
  • overview
  • tapes
  • Taken from bzip2.txt 9/11/10 v1.06

    bzip2, bunzip2 - a block-sorting file compressor
    bzcat - decompresses files to stdout
    bzip2recover - recovers data from damaged bzip2 files

    bzip2 [ -cdfkqstvzVL123456789 ] [ filenames ... ]
    bunzip2 [ -fkvsVL ] [ filenames ... ]
    bzcat [ -s ] [ filenames ... ]
    bzip2recover filename

    bzip2 compresses files using the Burrows-Wheeler block sorting text compression algorithm, and Huffman coding. Compression is generally considerably better than that achieved by more conventional LZ77/LZ78-based compressors, and approaches the performance of the PPM family of statistical compressors.

    More at Wikipedia.

    The command-line options are deliberately very similar to those of GNU gzip, but they are not identical.

    Files named on the command line (or expanded by globbing) are replaced by a compressed version with the name suffixed by .bz2.
    Compressed files retain ownership, permissions, and modification date (access and change dates are not preserved).

    Existing files are not overwritten; specify --force to overwrite them.
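    For example (the file names here are hypothetical):

```shell
# Compress notes.txt in place; it is replaced by notes.txt.bz2
bzip2 notes.txt

# Keep the original alongside the compressed copy
bzip2 -k report.txt

# Overwrite an already-existing notes.txt.bz2
bzip2 --force notes.txt
```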

    If no file names are specified, bzip2 acts as a filter, reading from standard input and writing to standard output.

    bunzip2 decompresses the specified files; files that were not created by bzip2 are detected and ignored, and a warning issued.
    The filename for the decompressed file is deduced from that of the compressed file as follows:

    filename.bz2 becomes filename
    filename.bz becomes filename
    filename.tbz2 becomes filename.tar
    filename.tbz becomes filename.tar
    anyothername becomes anyothername.out

    If the file does not end in a recognised ending, .bz2, .bz, .tbz2 or .tbz, bzip2 warns that it cannot determine the name of the original file, and uses the original name with .out appended.
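    A hypothetical session illustrating the renaming rules:

```shell
bzip2 -d archive.tbz2    # produces archive.tar
bzip2 -d notes.bz2       # produces notes
bzip2 -d oddname         # cannot guess the name: warns, produces oddname.out
```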

    bunzip2 correctly decompresses a file which is the concatenation of two or more compressed files.
    The result is the concatenation of the corresponding uncompressed files.

    Integrity testing (-t) of concatenated compressed files is supported.
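    For example, a stream built from two separate compressions decompresses back to the concatenation of the originals (part1 and part2 are hypothetical files):

```shell
bzip2 -c part1  > both.bz2   # first compressed stream
bzip2 -c part2 >> both.bz2   # second stream appended
bzip2 -t both.bz2            # integrity test handles the concatenation
bunzip2 -c both.bz2          # outputs part1's data followed by part2's
```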

    Files are output to the standard output by using -c.
    Multiple files may be compressed and decompressed using this. The resulting outputs are fed sequentially to stdout. Compression of multiple files in this manner generates a stream containing multiple compressed file representations. Such a stream can be decompressed correctly only by bzip2 version 0.9.0 or later. Earlier versions of bzip2 will stop after decompressing the first file in the stream.

    bzcat (or bzip2 -dc) decompresses to the standard output.
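    This makes it convenient to inspect compressed files in a pipeline without creating a temporary uncompressed copy (access.log.bz2 is a hypothetical file):

```shell
bzcat access.log.bz2 | grep ERROR
```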

    bzip2 reads arguments from the environment variables BZIP2 and BZIP, in that order, and will process them before any arguments read from the command line. This gives a convenient way to supply default arguments.
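    For example, to make maximum compression and verbose reporting the defaults (sh syntax; bigfile is a hypothetical file):

```shell
BZIP2="-9 -v"; export BZIP2
bzip2 bigfile        # now behaves as: bzip2 -9 -v bigfile
```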

    bzip2 uses 32-bit CRCs to verify that the decompressed file is correct.
    This guards against corruption of the compressed data going undetected; the chance of a damaged file passing the check unnoticed is about one in four billion for each file processed.
    Note that the check occurs upon decompression, so it can only tell you that something is wrong; it cannot recover the original uncompressed data for you. bzip2recover may recover data from damaged files.

    Return values:
    0 normal exit,
    1 environmental problems (file not found, invalid flags, I/O errors, etc.),
    2 to indicate a corrupt compressed file,
    3 internal consistency error (eg, bug)
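    These exit codes can be tested in scripts; a minimal sketch (backup.bz2 is a hypothetical file):

```shell
if bzip2 -t backup.bz2; then
    echo "archive OK"
else
    # non-zero status: corrupt file, missing file, or I/O error
    echo "bzip2 -t failed" >&2
fi
```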

    OPTIONS
    -d
    --decompress
    -z
    --compress
    -t
    --test
    Test integrity of the file(s) by decompression without writing output
    -c
    --stdout
    Write to standard output. Useful for piping.
    -f
    --force
    1. Overwrite existing output files.
    2. Allow hard links to files to be severed.
    3. Pass files that do not appear to be compressed (i.e. are missing the magic header bytes) through unmodified, for GNU gzip compatibility.
    -k
    --keep
    Keep (don't delete) input files
    -s
    --small
    Reduce memory usage, for compression, decompression and testing.
    Files are decompressed and tested using an algorithm which only requires 2.5 bytes per block byte. This means any file can be decompressed in about 2300 kB of memory, albeit at about half the normal speed.
    During compression, -s selects a block size of 200k, which limits memory use to around the same figure, at the expense of compression ratio.
    Use -s on machines with 8 megabytes of memory or less. See Memory management below.
    -q
    --quiet
    Suppress non-essential warning messages.
    Messages pertaining to I/O errors and other critical events are not suppressed.
    -v
    --verbose
    Show the compression ratio for each file processed; further -v's increase the verbosity level.
    -L
    --license
    Display the license terms and conditions.
    -V
    --version
    Display the version number.
    -1 or --fast
     to
    -9 or --best
    Set the block size to 100 k, 200 k .. 900 k when compressing; has no effect when decompressing.
    See Memory management below.
    --fast and --best are aliases, primarily for GNU gzip compatibility:
    --fast doesn't make things significantly faster, and --best merely selects the default behaviour.
    --
    Treats all subsequent arguments as file names, even if they start with a dash.
    Example: bzip2 -- -myfilename
    --repetitive-fast
    --repetitive-best
    These flags are redundant in versions 0.9.5 and above. They provided some coarse control over the behaviour of the sorting algorithm in earlier versions, which was sometimes useful. 0.9.5 and above have an improved algorithm which renders these flags irrelevant.

    The compressed file may be slightly larger than the original. Files of less than one hundred bytes tend to get larger, since the compression mechanism has a constant overhead in the region of 50 bytes. Random data (including the output of most file compressors) is coded at about 8.05 bits per byte, giving an expansion of around 0.5%.

    Memory management

    bzip2 compresses files in blocks whose size affects both the compression ratio achieved, and the amount of memory needed for compression and decompression. The flags -1 through -9 specify the block size to be 100,000 bytes through 900,000 bytes (the default).

    At decompression time, the block size used for compression is read from the header of the compressed file.

    Memory requirements can be estimated as:
    Compression:   400k + ( 8 x block size )
    Decompression: 100k + ( 4 x block size ), or 100k + ( 2.5 x block size ) with -s
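    Plugging the default 900k block size into these formulas reproduces the -9 figures in the table below (all values in kilobytes; shell arithmetic is integer, so the 2.5x factor is scaled by ten):

```shell
echo $(( 400 + 8 * 900 ))             # compression:      7600
echo $(( 100 + 4 * 900 ))             # decompression:    3700
echo $(( (1000 + 25 * 900) / 10 ))    # -s decompression: 2350
```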

    Larger block sizes give rapidly diminishing marginal returns. Most of the compression comes from the first two or three hundred k of block size. It is important to consider that the decompression memory requirement is set at compression time by the choice of block size.

    For files compressed with the default 900k block size, about 3700 kB will be required to decompress.
    To support decompression of any file on a 4 megabyte machine, bunzip2 has an option to decompress using approximately half this amount of memory, about 2300 kbytes. Decompression speed is also halved, so you should use this option only where necessary. The relevant flag is -s.

    Use the largest block size your memory constraints allow, since that maximises the compression achieved. Compression and decompression speed are virtually unaffected by block size.

    Another significant point applies to files which fit in a single block -- that means most files you'd encounter using a large block size.
    The amount of real memory touched is proportional to the size of the file, since the file is smaller than a block.
    For example, compressing a 20,000-byte file with the flag -9 will cause the compressor to allocate around 7600 kB of memory, but only touch 400k + 20,000 * 8 = 560 kB of it.
    Similarly, the decompressor will allocate 3700 kB but only touch 100k + 20,000 * 4 = 180 kB.

    This table summarises the maximum memory usage for different block sizes.
    Also recorded is the total compressed size for 14 files of the Calgary Text Compression Corpus totalling 3,141,622 bytes. This column gives some feel for how compression varies with block size. These figures tend to understate the advantage of larger block sizes for larger files, since the Corpus is dominated by smaller files.

                      Compress   Decompress   Decompress   Corpus
               Flag     usage      usage       -s usage     Size
                -1      1200k       500k         350k      914704
                -2      2000k       900k         600k      877703
                -3      2800k      1300k         850k      860338
                -4      3600k      1700k        1100k      846899
                -5      4400k      2100k        1350k      845160
                -6      5200k      2500k        1600k      838626
                -7      6100k      2900k        1850k      834096
                -8      6800k      3300k        2100k      828642
                -9      7600k      3700k        2350k      828642
    

    RECOVERING DATA FROM DAMAGED FILES
    bzip2 compresses files in blocks, usually 900 kB long. Each block is handled independently. If a media or transmission error causes a multi-block compressed file to become damaged, it may be possible to recover data from the undamaged blocks in the file.

    The compressed representation of each block is delimited by a 48-bit pattern, which makes it possible to find the block boundaries with reasonable certainty. Each block also carries its own 32-bit CRC, so damaged blocks can be distinguished from undamaged ones.

    bzip2recover is a simple program whose purpose is to search for blocks in .bz2 files, and write each block out into its own .bz2 file. You can then use bzip2 -t to test the integrity of the resulting files, and decompress those which are undamaged.

    bzip2recover takes a single argument, the name of the damaged file, and writes a number of files "rec00001file.bz2", "rec00002file.bz2", etc, containing the extracted blocks. The output filenames are designed so that the use of wildcards in subsequent processing -- for example, "bzip2 -dc rec*file.bz2 > recovered_data" -- processes the files in the correct order.
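    A hypothetical recovery session might look like:

```shell
bzip2recover damaged.bz2                      # writes rec00001damaged.bz2, rec00002damaged.bz2, ...
bzip2 -t rec*damaged.bz2                      # see which blocks survived
bzip2 -dc rec*damaged.bz2 > recovered_data    # decompress the blocks in order
```

    If -t reports some blocks as damaged, delete those rec*.bz2 files before the final decompression step.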

    bzip2recover should be of most use dealing with large .bz2 files, as these will contain many blocks. It is clearly futile to use it on damaged single-block files, since a damaged block cannot be recovered. If you wish to minimise any potential data loss through media or transmission errors, you might consider compressing with a smaller block size.

    PERFORMANCE NOTES
    The sorting phase of compression gathers together similar strings in the file. Because of this, files containing very long runs of repeated symbols, like "aabaabaabaab ..." (repeated several hundred times) may compress more slowly than normal. Versions 0.9.5 and above fare much better than previous versions in this respect. The ratio between worst-case and average-case compression time is in the region of 10:1. For previous versions, this figure was more like 100:1. Use -vvvv to monitor progress in great detail.

    Decompression speed is unaffected.

    bzip2 allocates several megabytes of memory and accesses all over it in a fairly random fashion. This means that performance, both for compressing and decompressing, is largely determined by the speed at which your machine can service cache misses. Because of this, small changes to the code to reduce the miss rate have been observed to give disproportionately large performance improvements. I imagine bzip2 will perform best on machines with very large caches.

    bzip2, bunzip2 and bzcat are the same program; the decision about which actions to take is made on the basis of the name used to invoke it.

    CAVEATS
    This document pertains to version 1.0.6.
    Compressed data created by this version is forward and backward compatible with previous public releases, with one exception: versions 0.9.0 and above can correctly decompress multiple concatenated compressed files, while 0.1pl2 cannot; it will stop after decompressing just the first file in the stream.

    bzip2recover versions prior to 1.0.2 used 32-bit integers to represent bit positions in compressed files, so they could not handle compressed files more than 512 megabytes long.
    Versions 1.0.2 and above use 64-bit ints on some platforms which support them (GNU supported targets, and Windows). To establish whether or not bzip2recover was built with such a limitation, run it without arguments.
    Build an unlimited version by recompiling with MaybeUInt64 set to be an unsigned 64-bit integer.

    AUTHOR Julian Seward, jseward@bzip.org.

    http://www.bzip.org

    The ideas embodied in bzip2 are due to (at least) the following people: Michael Burrows and David Wheeler (for the block sorting transformation), David Wheeler (again, for the Huffman coder), Peter Fenwick (for the structured coding model in the original bzip, and many refinements), and Alistair Moffat, Radford Neal and Ian Witten (for the arithmetic coder in the original bzip). I am much indebted for their help, support and advice. See the manual in the source distribution for pointers to sources of documentation. Christian von Roques encouraged me to look for faster sorting algorithms, so as to speed up compression. Bela Lubkin encouraged me to improve the worst-case compression performance. Donna Robinson XMLised the documentation. The bz* scripts are derived from those of GNU gzip. Many people sent patches, helped with portability problems, lent machines, gave advice and were generally helpful.

    --list, -l

    gzip -l *gz
              compressed        uncompressed  ratio uncompressed_name
                     20                   0   0.0% smother.diske-
                     20                   0   0.0% smother.diskf-
                     20                   0   0.0% smother.diskg-
                     20                   0   0.0% smother.diskh-
              798830592          3596346423  77.8% smother_wd0e
              798830672          3596346423  77.8% (totals) 
    with --verbose
     gzip -lv *gz
    method  crc     date  time           compressed        uncompressed  ratio uncompressed_name
    defla 00000000 Sep  1 15:00                  20                   0   0.0% smother.diske-
    defla 00000000 Sep  1 15:01                  20                   0   0.0% smother.diskf-
    defla 00000000 Sep  1 15:01                  20                   0   0.0% smother.diskg-
    defla 00000000 Sep  1 15:01                  20                   0   0.0% smother.diskh-
    defla dbd673f2 Sep  1 16:09           798830592          3596346423  77.8% smother_wd0e
                                          798830672          3596346423  77.8% (totals) 
    The uncompressed size is given as -1 for files not in gzip format, such as compressed .Z files.
    To get the uncompressed size for such a file, use: zcat file.Z | wc -c
    The crc is given as ffffffff for a file not in gzip format.
    Title and totals lines are not displayed with --quiet.

    --decompress
    --uncompress
    -d
    -r
    --recursive
    Travel the directory structure recursively.
    gzip will descend into any directories specified and process all the files found there.

    -S suf
    --suffix suf
    Use suffix suf instead of the default .gz.
    Most useful for decompression.
    Any suffix can be given, but suffixes other than .z and .gz should be avoided to prevent confusion when files are transferred to other systems.
    A null suffix forces gunzip to try decompression on all given files regardless of suffix, as in:
     gunzip -S "" *        (*.* for MSDOS) 
    Previous versions of gzip used the .z suffix. This was changed to avoid a conflict with pack.

    -t
    --test
    Check the compressed file integrity.

    --quiet
    -q
    Suppress all warning messages.
    -v
    --verbose
    Display the name and percentage reduction for each file compressed.
    --no-name
    -n
    When compressing, do not save the original file name and time stamp by default. (The original name is always saved if the name had to be truncated.)
    When decompressing, do not restore the original file name if present (remove only the gzip suffix from the compressed file name) and do not restore the original time stamp if present (copy it from the compressed file).
    This option is the default when decompressing.
    --name
    -N
    When compressing, always save the original file name and time stamp; default.
    When decompressing, restore the original file name and time stamp if present, useful on systems which have a limit on file name length or when the time stamp has been lost after a file transfer.
    --stdout
    --to-stdout
    -c
    Write output on standard output; keep original files unchanged.
    If there are several input files, the output consists of a sequence of independently compressed members.
    To obtain better compression, concatenate all input files before compressing them.
    --fast
    --best
    -n
    Regulate the speed of compression using the specified digit n:
    --fast or -1 is fastest, with less compression;
    --best or -9 is slowest, with more compression.
    The default is -6 (biased towards high compression at the expense of speed).
    --force
    -f
    Force compression or decompression even if the file has multiple links or the corresponding file already exists, or if the compressed data is read from or written to a terminal. If the input data is not in a format recognized by gzip, and if --stdout is also given, copy the input data without change to the standard output: let
    zcat behave as cat. If --force is not given, and when not running in the background, gzip prompts to verify whether an existing file should be overwritten.
    --help
    -h
    Display a help screen and quit.
    --version
    -V
    Display the version number and compilation options, then quit.
    --license
    -L
    Display the gzip license, then quit.

    The following command will find all gzip files in the current directory and subdirectories, and extract them in place without destroying the original:

            find . -name '*.gz' -print | sed 's/^\(.*\)[.]gz$/gunzip < "&" > "\1"/' | sh 

    Advanced usage

    Multiple compressed files can be concatenated. In this case, gunzip will extract all members at once. If one member is damaged, other members might still be recovered after removal of the damaged member.
    Better compression can be usually obtained if all members are decompressed and then recompressed in a single step.

    This is an example of concatenating gzip files:

         gzip --to-stdout file1  > foo.gz
         gzip --to-stdout file2 >> foo.gz 

    In case of damage to one member of a .gz file, other members can still be recovered (if the damaged member is removed).
    Better compression is obtained by compressing all members at once:

        cat file1 file2 | gzip > foo.gz 

    compresses better than gzip --to-stdout file1 file2 > foo.gz

    To recompress concatenated files to get better compression: zcat old.gz | gzip > new.gz

    If a compressed file consists of several members, the uncompressed size and CRC reported by the --list option applies to the last member only.
    To display the uncompressed size for all members, use:

         zcat file.gz | wc -c 

    To create a single archive file with multiple members so that members can later be extracted independently, use an archiver such as tar or zip. GNU tar supports the -z option to invoke gzip transparently.
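    For instance, with GNU tar (project/ is a hypothetical directory):

```shell
tar -czf project.tar.gz project/   # create a gzip-compressed archive
tar -tzf project.tar.gz            # list the members
tar -xzf project.tar.gz            # extract them independently of gzip
```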

    gzip is designed as a complement to tar, not as a replacement.

    Environment

    The environment variable GZIP holds default options.
    These options are interpreted first and can be overwritten by explicit command line parameters. For example:
    for sh:    GZIP="-8v --name"; export GZIP
    for csh:   setenv GZIP "-8v --name"
    for MSDOS: set GZIP=-8v --name 

    On Vax/VMS, the name of the environment variable is GZIP_OPT, to avoid a conflict with the symbol set for invocation of the program.

    Using gzip on tapes

    When writing compressed data to a tape, it is generally necessary to pad the output with zeroes up to a block boundary.
    When the data is read back and the whole block is passed to gunzip for decompression, gunzip detects that there is extra trailing garbage after the compressed data and emits a warning by default; use --quiet to suppress the warning.
    This option can be set in the GZIP environment variable, as in:

    for sh:    GZIP="-q"  tar -xfz --block-compress /dev/rst0
    for csh:   (setenv GZIP "-q"; tar -xfz --block-compress /dev/rst0) 

    In the above example, gzip is invoked implicitly by the -z option of GNU tar. Make sure that the same block size (-b option of tar) is used for reading and writing compressed data on tapes. (This example assumes you are using the GNU version of tar.)

    Reporting Bugs

    Send email to jloup@chorus.fr or bug-gnu-utils@prep.ai.mit.edu. Include the version number, which you can find by running gzip -V. Also include in your message the hardware and operating system, the compiler used to compile gzip, a description of the bug behavior, and the input to gzip that triggered the bug.

    Overview

    gzip reduces the size of the named files using Lempel-Ziv coding (LZ77), replacing each file by one with the extension .gz and keeping the same ownership modes and access and modification times. (The default extension is -gz for VMS, and z for MSDOS, OS/2 FAT and Atari.)
    If no files are specified, or if a file name is -, the standard input is compressed to the standard output.
    gzip will only attempt to compress regular files (it will ignore symbolic links).
    If the output file name is too long for the file system, gzip attempts to truncate only the parts of the file name longer than 3 characters. (A part is delimited by dots.) If the name consists of small parts only, the longest parts are truncated.
    For example, if file names are limited to 14 characters, gzip.msdos.exe is compressed to gzi.msd.exe.gz.

    gzip keeps the original file name and timestamp in the compressed file. These are used when decompressing the file with the -N option. This is useful when the compressed file name was truncated or when the time stamp was not preserved after a file transfer.

    Compressed files can be restored to their original form using gzip -d or gunzip or zcat. If the original name saved in the compressed file is not suitable for its file system, a new name is constructed from the original one to make it legal.

    gunzip takes a list of files on its command line and replaces each file whose name ends with .gz, .z, .Z, -gz, -z or _z and which begins with the correct magic number with an uncompressed file without the original extension. gunzip also recognizes the special extensions .tgz and .taz as shorthands for .tar.gz and .tar.Z respectively. When compressing, gzip uses the .tgz extension if necessary instead of truncating a file with a .tar extension.

    gunzip can currently decompress files created by gzip, zip, compress or pack. The detection of the input format is automatic. When using the first two formats, gunzip checks a 32 bit CRC (cyclic redundancy check). For pack, gunzip checks the uncompressed length. The compress format was not designed to allow consistency checks. However gunzip is sometimes able to detect a bad .Z file. If you get an error when uncompressing a .Z file, do not assume that the .Z file is correct simply because the standard uncompress does not complain. This generally means that the standard uncompress does not check its input, and happily generates garbage output. The SCO compress -H format (lzh compression method) does not include a CRC but also allows some consistency checks.

    Files created by zip can be uncompressed by gzip only if they have a single member compressed with the 'deflation' method. This feature is only intended to help conversion of tar.zip files to the tar.gz format. To extract zip files with several members, use unzip instead of gunzip.

    zcat is identical to gunzip -c. zcat uncompresses either a list of files on the command line or its standard input and writes the uncompressed data on standard output. zcat will uncompress files that have the correct magic number whether they have a .gz suffix or not.
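    A quick round trip through a pipe, written with gunzip -c (which is exactly what zcat is):

```shell
echo "hello" | gzip -c | gunzip -c    # prints: hello
```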

    gzip uses the Lempel-Ziv algorithm used in zip and PKZIP. The amount of compression obtained depends on the size of the input and the distribution of common substrings. Typically, text such as source code or English is reduced by 60-70%. Compression is generally much better than that achieved by LZW (as used in compress), Huffman coding (as used in pack), or adaptive Huffman coding (compact).

    Compression is always performed, even if the result is slightly larger than the original. The worst case expansion is a few bytes for file header, plus 5 bytes every 32K block, or an expansion ratio of 0.015% for large files. Note that the actual number of used disk blocks almost never increases.
    gzip preserves the mode, ownership and timestamps of files when compressing or decompressing.

    This is the output of the command gzip -h:

    gzip 1.2.4 (18 Aug 93)
    usage: gzip [-cdfhlLnNrtvV19] [-S suffix] [file ...]
     -c --stdout      write on standard output, keep original files unchanged
     -d --decompress  decompress
     -f --force       force overwrite of output file and compress links
     -h --help        give this help
     -l --list        list compressed file contents
     -L --license     display software license
     -n --no-name     do not save or restore the original name and time stamp
     -N --name        save or restore the original name and time stamp
     -q --quiet       suppress all warnings
     -r --recursive   operate recursively on directories
     -S .suf  --suffix .suf     use suffix .suf on compressed files
     -t --test        test compressed file integrity
     -v --verbose     verbose mode
     -V --version     display version number
     -1 --fast        compress faster
     -9 --best        compress better
     file...          files to (de)compress. If none given, use standard input.
    
    As of 05/16/10, the output of gzip -V:
    gzip 1.3.5
    (2002-09-30)
    Copyright 2002 Free Software Foundation
    Copyright 1992-1993 Jean-loup Gailly
    This program comes with ABSOLUTELY NO WARRANTY.
    You may redistribute copies of this program
    under the terms of the GNU General Public License.
    For more information about these matters, see the file named COPYING.
    Compilation options:
    DIRENT UTIME STDC_HEADERS HAVE_UNISTD_H HAVE_MEMORY_H HAVE_STRING_H HAVE_LSTAT 
    Written by Jean-loup Gailly.


    This document was generated on 7 November 1998 using the texi2html translator version 1.52.

    www.bzip.org/1.0.5/bzip2-manual