Explanation of Lemmings file compression

Started by Lemmingologist, August 05, 2005, 02:48:46 PM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

Lemmingologist

The levels, graphics, and sound files for PC Lemmings are all compressed using the .DAT compression format. A .DAT file contains one or more data items (for example, levels), and each item in the file is stored as a "block." The compressed data are read in bitwise, starting at the end of a block and working backwards. A few pieces of information, however, are stored in a header at the beginning of each block, and must be read in before jumping to the end.


Header

Byte 0: instructions for handling the tail end of the block. There are eight different possibilities, represented by the values 00 through 07. The instructions depend on the structure of the end of the file as well, which can be determined by looking at the last bit.
If the last bit in the file is 0, then the instructions are determined as follows:
00 - skip last 10 bits, read next 3 bits
01 - skip last 9 bits, read next 3 bits
02 - skip last 8 bits, read next 3 bits
03 - skip last 7 bits, read next 3 bits
04 - skip last 2 bits, read next 2 bits, skip next 4 bits, read next 1 bit
05 - skip last 2 bits, read next 3 bits, skip next 3 bits
06 - skip last 2 bits, read next 3 bits, skip next 2 bits
07 - skip last 2 bits, read next 3 bits, skip next 1 bit
If the last bit in the file is 1, the instructions are slightly different:
00 - skip last 11 bits, read next 8 bits
01 - skip last 10 bits, read next 8 bits
02 - skip last 9 bits, read next 8 bits
03 - skip last 8 bits, read next 8 bits
04 - skip last 3 bits, read next 4 bits, skip next 4 bits, read next 4 bits
05 - skip last 3 bits, read next 4 bits, skip next 3 bits, read next 4 bits
06 - skip last 3 bits, read next 4 bits, skip next 2 bits, read next 4 bits
07 - skip last 3 bits, read next 4 bits, skip next 1 bit, read next 4 bits

Byte 1: checksum. Don't know how this is calculated, but it doesn't appear to make any difference.
Bytes 2-5: four-byte big-endian integer indicating the size, in bytes, of the uncompressed block.
Bytes 6-9: four-byte big-endian integer indicating the size, in bytes, of the compressed block.


To be continued in reply, as there appears to be a limit on message length.

Lemmingologist

Data

Once the header has been read, skip to the end of the block and follow the instructions indicated by byte 0 of the header. Remember, the data are being read backwards, which means that "next" really means the bits that come before in the compressed file. Also, integers which are not in the header should be read in reverse bit order (for example, 110 = 3, and 01000000 = 2). The bits that were read, rather than skipped, form an integer specifying the first "length." If 8 bits were read instead of 3 (that is, the last bit of the file is 1), add 8 to this length. Now, do the following until all of the data have been read (reading backwards in the file from the current position, one bit at a time, and making sure not to treat the header as data):

(1) Read in a number of bytes equal to length + 1, and write them directly to the decompressed file, reversing the order of the bits in each byte. Remember that these bytes may start in the middle of a file byte - do not skip any bits. Proceed to (2).

(2) The next few bits indicate the nature of the bits that come after them. (Since bits are being read backwards one at a time, the actual order of these bits in the file will be reversed).
00 - read next 3 bits as an integer. This number is the next length. Repeat (1).
01 - there will be 2 repeated bytes. Read next 8 bits as an offset. Proceed to (3).
100 - there will be 3 repeated bytes. Read next 9 bits as an offset. Proceed to (3).
101 - there will be 4 repeated bytes. Read next 10 bits as an offset. Proceed to (3).
110 - read next 8 bits as an integer, "n". There will be n + 1 repeated bytes. Read the next 12 bits as an offset. Proceed to (3).
111 - read next 8 bits as an integer, "n". The next length is equal to n + 8. Repeat (1).

(3) Copy the indicated number of repeated bytes from the decompressed file, starting from the position equal to the number of bytes already decompressed, minus the offset, minus 1. Paste these bytes at the end of the decompressed file.
For example, if 50 bytes have been decompressed so far, and the offset is 20, copy bytes starting from position 50 - 20 - 1 = 29. If there are, for example, 5 repeated bytes, then copy 5 bytes starting from byte number 29, and paste them at the end of the file. Repeat (2).

After all of the data in the block have been read and the decompressed file has been written, the order of the bytes in the decompressed file will then need to be reversed, as the original file was read and compressed in reverse order. This must be done bytewise - read the last byte of the decompressed file, paste it in as the first byte of the new file; read the second to last byte and paste it in as the new second byte, and so on.


And that is all there is to the Lemmings compression format! I have included program code for compression and decompression as replies to this post. These programs should be able to be translated into any programming language which can read and write single bytes (built-in methods for reading and writing single bits are not necessary). Be aware, though, that the algorithms I am using are slow when handling large files, and could use some optimization. These programs are intended as a starting point for anyone who would like to write a Lemmings graphics editor. Which, I imagine, will inevitably happen sometime (see my other post - Instructions for editing graphics files).

Lemmingologist

dim byte, bytePos as integer
dim in, out as binaryStream

function ReadBits(count as integer) as integer
      dim i, sum as integer

      for i=count-1 downto 0
            sum=sum+pow(2,i)*(byte mod 2)
            byte=byte\2
            if bytePos < 7 then
                  bytePos=bytePos+1
            else
                  bytePos=0
                  byte=in.ReadByte
                  in.position=in.position-2
            end if
      next

      return sum
end function

dim i, bits, start, blockSize, compressedBlockSize, fileSize, itemCount, repeat, offset, length as integer
dim stop, endOfFile as boolean
dim file as folderItem

file=GetOpenFolderItem("type") //displays "Open File" dialog box
fileSize=0
itemCount=0
if file <> nil then
      while not endOfFile
            in=file.OpenAsBinaryFile(false)
            out=GetFolderItem("temp").CreateBinaryFile("type") //creates a binary file named temp
            in.position=fileSize
            start=in.ReadByte
            bits=in.ReadByte
            blockSize=in.ReadLong
            compressedBlockSize=in.ReadLong
            in.position=fileSize+compressedBlockSize-1
            byte=in.ReadByte
            bytePos=0
            in.position=in.position-2

            if start > 0 then
                  if ReadBits(1)=0 then
                        if start < 4 then
                              bits=ReadBits(9-start)
                              length=ReadBits(3)
                        elseif start=4 then
                              length=ReadBits(3)*2
                              bits=ReadBits(4)
                              length=length+ReadBits(1)
                        else
                              length=ReadBits(4)
                              bits=ReadBits(8-start)
                        end if
                  else
                        if start < 4 then
                              bits=ReadBits(10-start)
                              length=ReadBits(8)+8
                        else
                              bits=ReadBits(2)
                              length=ReadBits(4)*16
                              bits=ReadBits(8-start)
                              length=length+ReadBits(4)+8
                        end if
                  end if
            else
                  bits=ReadBits(8)
                  if ReadBits(1)=0 then
                        length=ReadBits(4)
                  else
                        bits=ReadBits(2)
                        length=ReadBits(8)+8
                  end if
            end if

            while in.position >= fileSize+5
                  for i=0 to length
                        bits=ReadBits(8)
                        out.WriteByte bits
                  next
                  stop=(in.position < fileSize+5)

                  while not stop
                        if ReadBits(1) = 0 then //0
                              if ReadBits(1) = 0 then //00
                                    length=ReadBits(3)
                                    stop=true
                              else //01
                                    repeat=1
                              end if
                        else //1
                              if ReadBits(1) = 0 then //10
                                    if ReadBits(1) = 0 then //100
                                          repeat=2
                                    else //101
                                          repeat=3
                                    end if
                              else //11
                                    if ReadBits(1) = 0 then //110
                                          repeat=ReadBits(8)
                                    else //111
                                          length=ReadBits(8)+8
                                          stop=true
                                    end if
                              end if
                        end if

                        if not stop then
                              if repeat <= 3 then
                                    offset=ReadBits(7+repeat)
                              else
                                    offset=ReadBits(12)
                              end if
                              for i=0 to repeat
                                    out.position=out.position-offset-1
                                    bits=out.ReadByte
                                    out.position=out.position+offset
                                    out.WriteByte bits
                              next
                        end if
                  wend
            wend

            in.close
            out.close
            in=GetFolderItem("temp").OpenAsBinaryFile(false)
            out=GetFolderItem(file.name+" Item "+str(itemCount)).CreateBinaryFile("type")
            in.position=in.length-1
            for i=1 to in.length-blockSize
                  bits=in.ReadByte
                  in.position=in.position-2
            next
            for i=1 to blockSize
                  out.WriteByte in.ReadByte
                  in.position=in.position-2
            next
            in.close
            out.close
            GetFolderItem("temp").delete

            fileSize=fileSize+compressedBlockSize
            if fileSize >= file.length then
                  endOfFile=true
            end if
            itemCount=itemCount+1
      wend
end if

Lemmingologist

function Binary(byte as integer) as string
      dim i, b as integer
      dim bin as string

      b=byte
      for i=7 downto 0
            bin=bin+str(b\pow(2,i))
            b=b mod pow(2,i)
      next

      return bin
end function

function FindRepeatedBytes(str1 as string, str2 as string, maxDistance as integer) as integer
      dim pos, curInStr, skip, length as integer
      dim source, find as string

      source=right(str1, maxDistance*8)
      find=str2

      do
            //inStr is a search function which returns 0 if the string to be found does not occur,
            //and n+1 if n characters precede its first occurrence
            curInStr=inStr(source, find)
            if curInStr > 0 then
                  if curInStr mod 8 = 1 then
                        pos=skip+curInStr\8+1
                  end if
                  skip=skip+ceil(curInStr/8)
                  source=right(source, len(source)-ceil(curInStr/8)*8)
            end if
      loop until curInStr=0

      return pos
end function

dim i, j, byte, position, count, offset, length, max as integer
dim newBytes, compressed, raw as string
dim in, out as binaryStream
dim file as folderItem
dim stop as boolean

file=GetOpenFolderItem("type") //displays "Open File" dialog box
if file <> nil then
      in=file.OpenAsBinaryFile(false)
      in.position=in.length-1

      while not stop
            count=0
            do
                  byte=in.ReadByte
                  if in.position=1 then
                        stop=true
                  else
                        in.position=in.position-2
                  end if
                  newBytes=newBytes+Binary(byte)
                  if len(raw) < max*8 then
                        offset=len(raw)\8-position
                  else
                        offset=max-position
                  end if
                  if len(newBytes) <= 32 then
                        max=pow(2, 6+len(newBytes)\8)
                  else
                        max=4096
                  end if
                  position=FindRepeatedBytes(raw, newBytes, max)
                  count=count+1
            loop until position=0 or count > 256 or stop

            if len(newBytes) > 16 then
                  if not stop then
                        in.position=in.position+1
                        count=count-1
                        newBytes=left(newBytes, len(newBytes)-8)
                  end if

                  if length > 0 then
                        if length <= 8 then
                              compressed=left(compressed, len(compressed)-length*8)+right(Binary(length-1), 5)+right(compressed, length*8)
                        else
                              compressed=left(compressed, len(compressed)-length*8)+"111"+Binary(length-9)+right(compressed, length*8)
                        end if
                        length=0
                  end if

                  select case count
                  case 2
                        compressed=compressed+"01"+Binary(offset)
                  case 3
                        compressed=compressed+"100"+right(Binary(offset\256), 1)+Binary(offset mod 256)
                  case 4
                        compressed=compressed+"101"+right(Binary(offset\256), 2)+Binary(offset mod 256)
                  else
                        compressed=compressed+"110"+Binary(count-1)+right(Binary(offset\256), 4)+Binary(offset mod 256)
                  end select
            else
                  if length+count <= 264 then
                        compressed=compressed+newBytes
                        length=length+count
                  else
                        compressed=left(compressed, len(compressed)-length*8)+"111"+Binary(length-9)+right(compressed, length*8)
                        compressed=compressed+newBytes
                        length=count
                  end if
            end if

            raw=raw+newBytes
            if len(raw) > 32768 then
                  raw=right(raw, 32768)
            end if
            newBytes=""
      wend

      if length > 0 then
            if length <= 8 then
                  compressed=left(compressed, len(compressed)-length*8)+right(Binary(length-1), 5)+right(compressed, length*8)
            else
                  compressed=left(compressed, len(compressed)-length*8)+"111"+Binary(length-9)+right(compressed, length*8)
            end if
            length=0
      end if

      out=GetFolderItem(file.name+" Compressed").CreateBinaryFile("type") //creates a binary file with the original file name plus "Compressed"
      j=len(compressed) mod 8
      out.WriteByte j

      if left(compressed, 1) = "0" then
            if j > 4 then
                  for i=7 downto j
                        newBytes=newBytes+"0"
                  next
                  compressed=left(compressed, 5)+newBytes+right(compressed, len(compressed)-5)
            elseif j=4 then
                  compressed=left(compressed, 4)+"0000"+right(compressed, len(compressed)-4)
            elseif j=0 then
                  compressed=compressed+"00000000"
            end if
      else
            for i=7 downto j
                  newBytes=newBytes+"0"
            next
            if j=0 then
                  compressed=newBytes+compressed
            elseif j < 3 then
                  compressed=left(compressed, j)+newBytes+right(compressed, len(compressed)-j)
            elseif j < 5 then
                  compressed=left(compressed, 3)+newBytes+right(compressed, len(compressed)-3)
            else
                  compressed=left(compressed, 7)+newBytes+right(compressed, len(compressed)-7)
            end if
      end if

      max=ceil(len(compressed)/8)
      out.WriteByte 0
      out.WriteLong in.length
      out.WriteLong max+10

      for i=1 to max
            byte=0
            for j=7 downto 0
                  byte=byte+val(right(compressed,1))*pow(2,j)
                  compressed=left(compressed, len(compressed)-1)
            next
            out.WriteByte byte
      next

      in.close
      out.close
end if

the guest

Can you give a semi-English explanation of what you do for compression?

Mindless

What language is this code in?  It's not any syntax I recognize... although it looks alot like BASIC.

Edit: After a quick search, it appears to be REALbasic...  :-/

the guest

Just curious:  "Lemmingologist", you don't happen to be the author of LemEdit, are you?

I'm guessing this for no reason other than the fact that your knowledge appears to somewhat match the features and limitations of LemEdit, including the significant fact that you don't have info on the VGASPECx.DAT files, just like the last version of LemEdit we get also don't have support for them yet.

The reason we (well, at least I) want to know is because, well, LemEdit could use some fixes to work better on modern Windows machines......

Mindless

Quote from: the guest  link=1123253327/0#7 date=1123290704Just curious:  "Lemmingologist", you don't happen to be the author of LemEdit, are you?

I'm guessing this for no reason other than the fact that your knowledge appears to somewhat match the features and limitations of LemEdit, including the significant fact that you don't have info on the VGASPECx.DAT files, just like the last version of LemEdit we get also don't have support for them yet.

The reason we (well, at least I) want to know is because, well, LemEdit could use some fixes to work better on modern Windows machines......
Sorry to break your theory, but (s)he can't be, since in order to compress .DATs (LemEdit does ;P) you MUST know how to create the correct checksum.

the guest

Good point.  I honestly haven't been really trying to thoroughly read whatever (s)he posted.  X_X

Lemmingologist

Start at the end of the uncompressed file, and repeat the following until reaching the beginning:

Read in bytes, one at a time, backwards toward the beginning, until a string (two bytes or longer) is found which has already been read. At the end of the compressed file, first indicate the number of non-repeating bytes that were read in before the repeated string, minus one (call this number n). This is done in the following manner:
      If n is less than 8, write two 0-bits, then write n as a three-bit integer.
      If n is greater than or equal to 8, write three 1-bits, then write n - 8 as an eight-bit integer.
For example, if 23 non-repeating bytes were read before the repeated string, then n = 23 - 1 = 22. 22 - 8 = 14, so the output to the compressed file would then be 111 followed by 14 in binary, which is 00001110.
Now append the non-repeating bytes to the compressed file.

Then, add two or three bits indicating the number of
repeating bytes, using the codes below:
 &#A0; &#A0; &#A0;01 - 2 repeated bytes
 &#A0; &#A0; &#A0;100 - 3 repeated bytes
 &#A0; &#A0; &#A0;101 - 4 repeated bytes
 &#A0; &#A0; &#A0;110 - anywhere from 2 to 256 repeated bytes. Follow this with the number of repeated bytes minus one, written as an eight-bit integer.

Next, write the offset of the repeated string, which is equal to the total number of bytes read so far from the uncompressed file, minus the number of bytes read before the previous occurrence of the repeated string.
The number of bits used to write the offset depends on the repeat indicator code used above:
 &#A0; &#A0;  01 - offset is an eight-bit integer
 &#A0; &#A0; &#A0;100 - offset is a nine-bit integer
 &#A0; &#A0; &#A0;101 - offset is a ten-bit integer
 &#A0; &#A0; &#A0;110 - offset is a twelve-bit integer
This means that the maximum offset is 255, 511, 1023, and 4095 for each of the above codes respectively. If the string does not occur within 4096 bytes of the current position in the file, it must not be counted as a repeated string.


After all of the data have been read, reverse the order of the bits (not bytes) in the compressed file. Then, at the beginning of the new compressed file, add the ten-byte header:
Byte 0: the remainder when the number of bits in the compressed file is divided by 8.
Byte 1: checksum. This is ignored by Lemmings, so a zero can be used here.
Bytes 2-5: four-byte big-endian integer indicating the size, in bytes, of the uncompressed file.
Bytes 6-9: four-byte big-endian integer indicating the size, in bytes, of the compressed file, including the ten-byte header.

Finally, some 0-bits need to be added near the end of the new compressed file. The placement of these bits depends on the value of byte 0 of the header:
      00 - add eight 0-bits at the end of the file.
      01 - if the last bit of the file is 1, add seven 0-bits before the last bit of the file. Otherwise, do nothing.
      02 - if the last bit of the file is 1, add six 0-bits before the last two bits of the file. Otherwise, do nothing.
      03 - if the last bit of the file is 1, add five 0-bits before the last three bits of the file. Otherwise, do nothing.
      04 - add four 0-bits before the last four bits of the file.
      05 - if the last bit of the file is 0, add three 0-bits before the last five bits of the file. If the last bit is 1, add three 0-bits before the last seven bits of the file.
      06 - if the last bit of the file is 0, add two 0-bits before the last five bits of the file. If the last bit is 1, add two 0-bits before the last seven bits of the file.
      07 - if the last bit of the file is 0, add one 0-bit before the last five bits of the file. If the last bit is 1, add one 0-bit before the last seven bits of the file.

ccexplore

Quote from: Lemmingologist  link=1123253327/0#10 date=1123931939Read in bytes, one at a time, backwards toward the beginning, until a string (two bytes or longer) is found which has already been read.
Ok, that's really the only part I cared about, since it was the part that affects tradeoffs between performance and effectiveness.

It sounds like your code searches for all possible substrings, correct?  So that in "abcdef", it will search "ab", "abc", "abcd" and "abcde"?

I'm curious what sorts of compression ratio you've gotten, and also how long it takes to compress data from the vgagr# files.

Mindless

Quote from: Lemmingologist  link=1123253327/0#10 date=1123931939Byte 1: checksum. This is ignored by Lemmings, so a zero can be used here.
Hmmm...
Crap... (s)he's right... of course, my Lemmings game is cracked, so they may have the checksum checking code removed...

Mindless

Quote from: Lemmingologist  link=1123253327/0#0 date=1123253326If the last bit in the file is 0, then the instructions are determined as follows:
...

If the last bit in the file is 1, the instructions are slightly different:

...
By this do you mean the last bit in the .DAT file or the last bit in the compressed data section?

ccexplore

Yeah, interestingly it works apparently.

Then again, I don't think I ever said that it wouldn't work either. &#A0;In the code it definitely passes back the result of the checksum comparison, but whether the result was used or not I can't tell since I'm not going to go through all the places where the function is called and see whether the caller bothers to check the return value.

Anyhow, it's just a dirt-simple checksum. &#A0;Probably the simplest aspect of the whole thing.

Now of course, whether it works for LemEdit or not is another question which I haven' tested (not that it matters for me).

Mindless

Quote from: ccexplore  link=1123253327/0#14 date=1123951925Now of course, whether it works for LemEdit or not is another question which I haven' tested (not that it matters for me).
No checksum mismatch warnings from LemEdit...


Edit: Oooo... wow, I crashed LemEdit.