Navigation
  • Home
  • Recent
  • Most Active
  • Popular
  • Blog
  • Credits
  • RSS
  •   Interaction
  • Register
  • Statistics
  •   Help
  • Suggestions
  • Contact Us
  • How to Edit
  • Help



  • [Edit]


    LZW (Lempel-Ziv-Welch) is a universal lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improved implementation of the LZ78 algorithm published by Lempel and Ziv in 1978. The algorithm is designed to be fast to implement but not necessarily optimal since it does not perform any analysis on the data.

        LZW
            Description of the algorithm
            Uses
            Example
                Encoding
                Decoding
            Python_programming_language|Python Example
            Patent issues
            Lempel-Ziv-Welch vs. Ziv-Lempel-Welch
            See also

    top

    Description of the algorithm
    The compressor algorithm builds a string translation table from the text being compressed. The string translation table maps fixed-length codes (usually 12-bit) to strings. The string table is initialized with all single-character strings (256 entries in the case of 8-bit characters). As the compressor character-serially examines the text, it stores every unique two-character string into the table as a code/character concatenation, with the code mapping to the corresponding first character. As each two-character string is stored, the first character is outputted. Whenever a previously-encountered string is read from the input, the longest such previously-encountered string is determined, and then the code for this string concatenated with the extension character (the next character in the input) is stored in the table. The code for this longest previously-encountered string is outputted and the extension character is used as the beginning of the next string.

    The decompressor algorithm only requires the compressed text as an input, since it can build an identical string table from the compressed text as it is recreating the original text. However, an abnormal case shows up whenever the sequence character/string/character/string/character (with the same character for each character and string for each string) is encountered in the input and character/string is already stored in the string table. When the decompressor reads the code for character/string/character in the input, it cannot resolve it because it has not yet stored this code in its table. This special case can be dealt with because the decompressor knows that the extension character is the previously-encountered character.

    top

    Uses
    The method became widely used in the program compress, which became a more or less standard utility in Unix systems circa 1986. (It has since disappeared from many for both legal and technical reasons.) Several other popular compression utilities also used the method, or closely related ones.

    It became very widely used after it became part of the GIF image format in 1987. It may also (optionally) be used in TIFF files.

    LZW compression provided a better compression ratio, in most applications, than any well-known method available up to that time. It became the first widely used universal data compression method on computers. It would typically compress large English texts to about half of their original sizes.

    Today, an implementation of the algorithm is contained within the popular Adobe Acrobat software program.

    top

    Example
    This example shows the LZW algorithm in action, showing the status of the output and the dictionary at every stage, both in encoding and decoding the message. In order to keep things clear, let us assume that we're dealing with a simple alphabet - capital letters only, and no punctuation or spaces. This example has been constructed to give reasonable compression on a very short message; when used on real data, repetition is generally less pronounced, and so the initial parts of a message will see little compression. As the message grows, however, the compression ratio tends asymptotically to the maximum. A message to be sent might then look like the following:

    TOBEORNOTTOBEORTOBEORNOT

    The
      is a marker used to show that the end of the message has been reached. Clearly, then, we have 27 symbols in our alphabet. A computer will render these as strings of bits; 5-bit strings are needed to give sufficient combinations to encompass the entire dictionary. As the dictionary grows, the strings will need to grow in length to accommodate the additional entries. A 5-bit string gives 25 = 32 possible combinations of bits, and so when the 33rd dictionary word is created, the algorithm will have to start using 6-bit strings. Note that since the all-zero string 00000 is used, and is labeled "0", the 33rd dictionary entry will be labeled 32. The initial dictionary, then, will consist of the following:

      = 00000
    A = 00001
    B = 00010
    C = 00011
    .
    .
    .
    Z = 11010

    top

    Encoding
    If we weren't using LZW, and just sent the message as it stands (25 symbols at 5 bits each), it would require 125 bits. We will be able to compare this figure to the LZW output later. We are now in a position to apply LZW to the message.

    Symbol: Bit Code: New Dictionary Entry:
    (= output)

    T 20 = 10100 28: TO
    O 15 = 01111 29: OB
    B 2 = 00010 30: BE
    E 5 = 00101 31: EO
    O 15 = 01111 32: OR <--- start using 6-bit strings
    R 18 = 010010 33: RN
    N 14 = 001110 34: NO
    O 15 = 001111 35: OT
    T 20 = 010100 36: TT
    TO 28 = 011100 37: TOB
    BE 30 = 011110 38: BEO
    OR 32 = 100000 39: ORT
    TOB 37 = 100101 40: TOBE
    EO 31 = 011111 41: EOR
    RN 33 = 100001 42: RNO
    OT 35 = 100011 43: OT
      0 = 000000

    Total Length = 5
      5 + 12
        6 = 97 bits.

    In using LZW we have made a saving of 28 bits out of 125 -- we have reduced the message by almost 22%. If the message were longer, then the dictionary words would begin to represent longer and longer sections of text, allowing repeated words to be sent very compactly.

    top

    Decoding
    Imagine now that we have received the message produced above, and wish to decode it. We need to know in advance the initial dictionary used, but we can reconstruct the additional entries as we go, since they are always simply concatenations of previous entries.

    Bits: Output: New Entry:
    Full: Partial:

    10100 = 20 T 28: T?
    01111 = 15 O 28: TO 29: O?
    00010 = 2 B 29: OB 30: B?
    00101 = 5 E 30: BE 31: E?
    01111 = 15 O 31: EO 32: O? <--- start using 6-bit strings
    010010 = 18 R 32: OR 33: R?
    001110 = 14 N 33: RN 34: N?
    001111 = 15 O 34: NO 35: O?
    010100 = 20 T 35: OT 36: T?
    011100 = 28 TO 36: TT 37: TO? <--- for 36, only add 1st element
    011110 = 30 BE 37: TOB 38: BE? of next dictionary word
    100000 = 32 OR 38: BEO 39: OR?
    100101 = 37 TOB 39: ORT 40: TOB?
    011111 = 31 EO 40: TOBE 41: EO?
    100001 = 33 RN 41: EOR 42: RN?
    100011 = 35 OT 42: RNO 43: OT?
    000000 = 0

    The only slight complication comes if the newly-created dictionary word is sent immediately. In the decoding example above, when the decoder receives the first symbol, T, it knows that symbol 28 begins with a T, but what does it end with? The problem is illustrated below. We are decoding part of a message that reads ABABA:

    Bits: Output: New Entry:
    Full: Partial:

    .
    .
    .
    011101 = 29 AB 46: (word) 47: AB?
    101111 = 47 AB? <--- what do we do here?

    At first glance, this may appear to be asking the impossible of the decoder. We know ahead of time that entry 47 should be ABA, but how can the decoder work this out? The critical step is to note that 47 is built out of 29 plus whatever comes next. 47, therefore, ends with "whatever comes next". But, since it was sent immediately, it must also start with "whatever comes next", and so must end with the same symbol it starts with, namely A. This trick allows the decoder to see that 47 must be ABA.

    More generally the situation occurs whenever the encoder encounters the input of the form cScSc, where c is a single character, S is a string and cS is already in the dictionary. The encoder outputs the symbol for cS putting new symbol for cSc in the dictionary. Next it sees the cSc in the input and sends the new symbol it just inserted into the dictionary. By the reasoning presented in the above example this is the only case where the newly-created symbol is sent immediately.

    top

    Python_programming_language|Python Example

      Lempel-Ziv-Welch compression algorithm
      Translated to python by Thomas Van Durme
      Last edited: September 20, 2006
    class LZW:
    """Returns the compressed string in utf8 format"""
    def compress(self,uncompressed):
    if isinstance(uncompressed, str):
    chars = int(256)
    mydict = dict()
    buffer = list()
    result = str()
    for i in range(chars):
    mydictstr(i) = i
    for i in uncompressed:
    if len(buffer) == 0:
    xstr = str(ord(i))
    else:
    xstr = self.__join(buffer,"-")+"-"+str(ord(i))
    if mydict.has_key(xstr):
    buffer.append(ord(i))
    else:
    result += unichr(mydictself.__join(buffer,"-")).encode('utf8')
    mydictxstr = chars
    chars += 1
    del buffer
    buffer = list()
    buffer.append(ord(i))
    if len(buffer) != 0:
    result += unichr(mydictself.__join(buffer,"-")).encode('utf8')
    return result
    else:
    raise TypeError

    """Returns the decompressed string, input is a utf8 compressed string"""
    def decompress(self,compressed):
    if isinstance(compressed, str):
    chars = int(256)
    mydict = dict()
    for i in range(chars):
    mydicti = unichr(i).encode('utf8')
    decoded = compressed.decode('string_escape').decode('utf8')
    buffer = str()
    chain = str()
    result = str()
    for i in decoded:
    code = ord(i)
    current = mydictcode
    if buffer == "":
    buffer = current
    result += current
    else:
    if code<=255:
    result += current
    chain = buffer+current
    mydictchars = chain
    chars += 1
    buffer = current
    else:
    if mydict.has_key(code):
    chain = mydictcode
    else:
    chain = buffer+buffer0
    result += chain
    mydictchars = buffer+chain0
    chars += 1
    buffer = chain
    return result
    else:
    raise TypeError

    def __join(self,mylist,delimiter):
    if isinstance(mylist,list) and isinstance(delimiter,str):
    result = str()
    for i in range(0,len(mylist)):
    try:
    if i+1 == len(mylist):
    result += str(mylisti)
    else:
    result+= str(mylisti)+delimiter
    except TypeError:
    pass
    return result
    else:
    raise TypeError

    How to use:

    lzw = LZW()
    a = lzw.compress("test")
    print a
    b = lzw.decompress(a)
    print b


    top

    Patent issues
    Various patents have been issued in the USA and other countries for LZW and similar algorithms. LZ78 was covered by by Lempel, Ziv, Cohn, and Eastman, assigned to Sperry Corporation, later Unisys Corporation, filed on August 10, 1981. Two US patents were issued for the LZW algorithm: by Victor S. Miller and Mark N. Wegman and assigned to IBM, originally filed on June 1, 1983, and by Welch, assigned to Sperry Corporation, later Unisys Corporation, filed on June 20, 1983. On June 20, 2003, this patent on the LZW algorithm expired *.

    US Patent 4,558,302 is the one that has caused the most controversy (See GIF#Unisys and LZW patent enforcement).

    top

    Lempel-Ziv-Welch vs. Ziv-Lempel-Welch
    Although the LZW acronym refers to the inventors as Lempel, Ziv and Welch, some people claim the intellectual property rights go to Ziv first, so the method must be called the Ziv-Lempel-Welch algorithm, and not the Lempel-Ziv-Welch algorithm. Others who like distinguishing between the algorithm and the code prefer calling the algorithm LZ and the code written by Welch as LZW.

    top

    See also
     
    Search more:
     

       
    Source Privacy License Download Contact Us Atlas
    Scientus.org Dictionary (Yet Another Wiki) RC : 1.39
    MIT OpenCourseWare
    This article is licensed under the GNU Free Documentation License [copyleft]. It uses material from the Wikipedia article "LZW". link