Explanation of Lemmings file compression

Started by Lemmingologist, August 05, 2005, 02:48:46 PM

Previous topic - Next topic

0 Members and 2 Guests are viewing this topic.

covox

Great stuff. Unsure quite what language the code is (at first glance it looks distinctly VBish), however thankfully it's quite pseudocodey. I can't thank you enough, this is indeed a Christmas miracle.

Currently I'm writing a C port of your proceedure (tried starting from scratch using your instructions, my method invoked far too many dynamic memory hissy fits :/). Hopefully I'm aiming toward some sort of multitool to export graphics, sound and level metadata (XML, almost done) to more modern file formats/organized hierachy. If anyone's in the mood for more source code I'm happy to post what I have. Well, maybe once I've cleaned it up a bit ;)

Mindless

Quote from: covox  link=1123253327/15#16 date=1123979319Hopefully I'm aiming toward some sort of multitool to export graphics, sound and level metadata (XML, almost done) to more modern file formats/organized hierachy.
Could you elaborate on this a bit? I'm not sure what it is you're trying to do. XML?

ccexplore

Quote from: covox  link=1123253327/15#16 date=1123979319Currently I'm writing a C port of your proceedure (tried starting from scratch using your instructions, my method invoked far too many dynamic memory hissy fits :/).
You can skip this step if you want, I already have working C++ code for compression and decompression.  (But I hope you will credit me for it if you do decide to use it.)  E-mail me for more info.

covox

QuoteCould you elaborate on this a bit? I'm not sure what it is you're trying to do. XML?

Ooh-er, could've explained myself better there ;P At some point, I'm going to have a shot at making an open-source extendable Lemmings clone. It'll be done in C and SDL, which means (hopefully) that porting and drop-in improvements should be trivial. I'm developing this on a gcc/linux base, which should compile under Windows with mingw32. No idea if VS6/VS.NET/Microsoft Magic Mystery Compiler will touch it however ;/

The Pingus guys have done a great job, however I'm aiming toward something more lightweight and ultra-portable. Oh, and ClanLib is waaay too large to stick onto a PDA. Heaven help you if you change versions :/

To start with, I'm just trying to make the game data manageable. So all the terrain/animations will be converted to indexed PNG, and the level information converted to XML documents. XML is nice because both people and computers can read it :)

Each level will be stored in an XML file, and there should be some mechanism which refreshes the index (containing all the levels/different games) each time you run the program. For example, if you were to develop a level called "Taste the golden spray", and you want it positioned as Level 12 of Fun on a levelset called "My Generic Custom Game", you could add this XML to the appropriate file:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Games available are Classic, ONML, Your_game_name_here -->
<LemLevel game="My Generic Custom Game" difficulty="1" level="12">
  <title>Taste the golden spray!</title>
  <settings levelset="2" release="80" save="80" speed="20" time="180"/>
  <skills climber="0" floater="0" bomber="0" blocker="0" builder="0" basher="0" miner="0" digger="0"/>

  <!-- BEGIN level metadata -->
  <Interactives>
    <interactive piecenum="4" xoffset="0" yoffset="0" flag_ground="0" flag_nooverdraw="0" flag_upsidedown="1" />
    ...
  </Interactives>
  <Terrains>
    <terrain piecenum="12" xoffset="0" yoffset="0" flag_black="1" flag_nooverdraw="0" flag_upsidedown="0" />
    ...
  </Terrains>
  <Steels>
    <steel xoffset="0" yoffset="0" xwidth="16" ywidth="16" />
    ...
  </Steels>
</LemLevel>

It's very preliminary at the moment, any suggestions are more than welcome.  There's a good chance that the schema will be extended at some point to include a broader definition of interactive objects (think the player-controlled cannons in Lemmings 2).

QuoteYou can skip this step if you want, I already have working C++ code for compression and decompression.  (But I hope you will credit me for it if you do decide to use it.)  E-mail me for more info.

Thanks greatly for the offer, however I'd feel more confident if all the code is pure C. But then again, I haven't had time to compile test my attempt yet, so I may come crying back :)

ccexplore

Quote from: covox  link=1123253327/15#19 date=1123993068Thanks greatly for the offer, however I'd feel more confident if all the code is pure C. But then again, I haven't had time to compile test my attempt yet, so I may come crying back :)
For the decompression I actually have a C version.  I don't have and can't give you a C++ version for the compression though, because it uses <list>, and C, of course, lacks any sort of standardized library for collections.

Then again, since you mentioned "open source", it may be best anyway if you write the code yourself based on Lemmingologist's posts (mine might be a bit too close to what's in the original game itself to be safely open-sourced).

Mindless

Quote from: ccexplore (not logged in)  link=1123253327/15#20 date=1123995345I don't have and can't give you a C++ version for the compression though, because it uses <list>, and C, of course, lacks any sort of standardized library for collections.
That'd be why I haven't ported the compression code to FreeBasic (I gave up on QBX because of memory issues).  I have no idea what list is...

FreeBasic is awesome: the power of C, yet the syntax of QB (and they're not finished yet)

Lemmingologist

function Binary(byte as integer) as string
      dim i, b, power as integer
      dim bin as string

      b=byte
      for i=7 downto 0
            power=pow(2,i)
            bin=bin+str(b\power)
            b=b mod power
      next

      return bin
end function

function FindRepeatedBytes(str1 as string, str2 as string, maxDistance as integer) as integer
      dim pos, curInStr, skip as integer
      dim source, find as string

      source=str1
      find=str2
      do
            //inStr is a search function which returns 0 if the string to be found does not occur,
            //and n+1 if n characters precede its first occurrence
            curInStr=inStr(source, find)
            if curInStr > 0 then
                  if curInStr mod 8 = 1 then
                        pos=skip+curInStr\8+1
                  end if
                  skip=skip+ceil(curInStr/8)
                  source=right(source, len(source)-ceil(curInStr/8)*8)
            end if
      loop until pos > 0 or curInStr=0

      return pos
end function

function Reverse(bytes as string) as string
      dim i as integer
      dim b, n as string

      b=bytes
      for i=1 to len(bytes)\8
            n=n+right(b, 8)
            b=left(b, len(b)-8)
      next

      return n
end function

function XOR(byte1 as integer, byte2 as string) as integer
      dim i, b1, power, result as integer
      dim b2 as string

      b1=byte1
      b2=byte2
      for i=7 downto 0
            power=pow(2,i)
            if str(b1\power) <> left(b2, 1) then
                  result=result+power
            end if
            b1=b1 mod power
            b2=right(b2, i)
      next

      return result
end function

dim i, j, byte, position, readPosition, count, offset, length as integer
dim newBytes, compressed, raw as string
dim in, out as binaryStream
dim file as folderItem
dim stop as boolean

file=GetOpenFolderItem("type")
if file <> nil then
      in=file.OpenAsBinaryFile(false)
      readPosition=in.length-1

      while not stop
            ProgressBar1.value=in.position/in.length*1000\1
            count=0
            do
                  stop=readPosition < 0
                  if not stop then
                        in.position=readPosition
                        byte=in.ReadByte
                        readPosition=readPosition-1
                        newBytes=Binary(byte)+newBytes
                  end if
                  offset=position+count-2
                  if not stop then
                        position=FindRepeatedBytes(raw, newBytes)
                        count=count+1
                  end if
            loop until position=0 or count > 256 or stop
            stop=readPosition < 0

            if len(newBytes) > 16 then
                  if stop then
                        if position=0 then
                              count=count-1
                              newBytes=left(newBytes, 8)
                        end if
                  else
                        readPosition=readPosition+1
                        count=count-1
                        newBytes=right(newBytes, len(newBytes)-8)
                  end if

                  if length > 0 then
                        if length <= 8 then
                              compressed=left(compressed, len(compressed)-length*8)+right(Binary(length-1), 5)+right(compressed, length*8)
                        else
                              compressed=left(compressed, len(compressed)-length*8)+"111"+Binary(length-9)+right(compressed, length*8)
                        end if
                        length=0
                  end if

                  if count=2 and offset < 256 then
                        compressed=compressed+"01"+Binary(offset)
                  elseif count=3 and offset < 512 then
                        compressed=compressed+"100"+right(Binary(offset\256), 1)+Binary(offset mod 256)
                  elseif count=4 and offset < 1024 then
                        compressed=compressed+"101"+right(Binary(offset\256), 2)+Binary(offset mod 256)
                  else
                        compressed=compressed+"110"+Binary(count-1)+right(Binary(offset\256), 4)+Binary(offset mod 256)
                  end if

                  if stop and position=0 then
                        if length < 264 then
                              compressed=compressed+Reverse(newBytes)
                              length=length+1
                        else
                              compressed=left(compressed, len(compressed)-length*8)+"111"+Binary(length-9)+right(compressed, length*8)
                              compressed=compressed+Reverse(newBytes)
                              length=1
                        end if
                  end if
            else
                  if length+count <= 264 then
                        compressed=compressed+Reverse(newBytes)
                        length=length+count
                  else
                        compressed=left(compressed, len(compressed)-264*8)+"11111111111"+right(compressed, 264*8)
                        compressed=compressed+Reverse(newBytes)
                        length=count
                  end if
            end if

            raw=left(newBytes+raw, 32768)
            newBytes=""
      wend

      if length > 0 then
            if length <= 8 then
                  compressed=left(compressed, len(compressed)-length*8)+right(Binary(length-1), 5)+right(compressed, length*8)
            else
                  compressed=left(compressed, len(compressed)-length*8)+"111"+Binary(length-9)+right(compressed, length*8)
            end if
      end if

      out=GetFolderItem(file.name+" Compressed").CreateBinaryFile("type")
      j=len(compressed) mod 8
      out.WriteByte j

      for i=7 downto j
            newBytes=newBytes+"0"
      next
      if left(compressed, 1) = "0" then
            if j < 4 then
                  compressed=newBytes+compressed
            elseif j=4 then
                  compressed=left(compressed, 4)+newBytes+right(compressed, len(compressed)-4)
            else
                  compressed=left(compressed, 5)+newBytes+right(compressed, len(compressed)-5)
            end if
      else
            if j=0 then
                  compressed=newBytes+compressed
            elseif j <= 4 then
                  compressed=left(compressed, j)+newBytes+right(compressed, len(compressed)-j)
            else
                  compressed=left(compressed, 7)+newBytes+right(compressed, len(compressed)-7)
            end if
      end if

      length=len(compressed)\8
      for i=1 to length
            byte=XOR(byte, left(right(compressed, i*8), 8))
      next
      byte=byte\128+(byte\64 mod 2)*2+(byte\32 mod 2)*4+(byte\16 mod 2)*8+(byte\8 mod 2)*16+(byte\4 mod 2)*32+(byte\2 mod 2)*64+(byte mod 2)*128
      out.WriteByte byte
      out.WriteLong in.length
      out.WriteLong length+10

      for i=1 to length
            byte=0
            for j=7 downto 0
                  byte=byte+val(right(compressed,1))*pow(2,j)
                  compressed=left(compressed, len(compressed)-1)
            next
            out.WriteByte byte
      next

      in.close
      out.close
end if

Lemmingologist

Quote from: ccexplore  link=1123253327/0#10 date=1123941636Ok, that's really the only part I cared about, since it was the part that affects tradeoffs between performance and effectiveness.

It sounds like your code searches for all possible substrings, correct? &#A0;So that in "abcdef", it will search "ab", "abc", "abcd" and "abcde"?

I'm curious what sorts of compression ratio you've gotten, and also how long it takes to compress data from the vgagr# files.
Yes, if "abcdef" is a repeated string, it will search for all of those substrings until it reaches "abcdefg", which has not occurred before in the file. It will then back up one space in the uncompressed file, remove the "g", and add code for the repeating "abcdef". I think my FindRepeatingBytes function is highly inefficient, though, as it searches through the file from beginning to end, when it is really looking for the last occurrence of the substring. This is probably why it takes a fairly long time (several minutes) to compress the VGAGR files. The compressed files are much smaller than the ones generated by LemEdit, as LemEdit only looks for repeating strings which are 16 bytes or less, and my compressor takes advantage of every bit in the "repeat number byte" as well as every bit allocated to the offset. They are still longer than the original compressed files, however. The Lemmings compressor must have been more clever about where to begin and end repeated substrings, unlike my "greedy" algorithm, which simply looks for the longest possible string starting from wherever it happens to be in the file.

Lemmingologist

Quote from: Mindless  link=1123253327/0#12 date=1123951785By this do you mean the last bit in the .DAT file or the last bit in the compressed data section?
The compressed data section. I suppose I should have said "block", rather than "file".

ccexplore

Quote from: Lemmingologist  link=1123253327/15#22 date=1124706104the substring. This is probably why it takes a fairly long time (several minutes) to compress the VGAGR files.
Several minutes?!?  X_X  I guess that won't do for end-user purposes.  ;) (Yes, I understand it's never intended as such.)

I still like to know how yours compared with DMA's (ie. the original DAT files from the game).  Are yours only larger by a few bytes, or a lot more?

As a reference, the algorithm I have is generally a few bytes larger for level DAT files, about 20% larger for VGASPEC files, and unfortunately, nearly twice as large for VGAGR files.  On the other hand, it operates with no perceptible delay (ie. under a second or less).

(Though as a practical matter, you can probably get away with not doing any real compression at all, so I guess this is more an academic exercise than anything.  :-/)

Lemmingologist

Quote from: ccexplore (not logged in)  link=1123253327/15#24 date=1124708714Several minutes?!? &#A0;X_X &#A0;I guess that won't do for end-user purposes. &#A0;;) (Yes, I understand it's never intended as such.)

I still like to know how yours compared with DMA's (ie. the original DAT files from the game). &#A0;Are yours only larger by a few bytes, or a lot more?

As a reference, the algorithm I have is generally a few bytes larger for level DAT files, about 20% larger for VGASPEC files, and unfortunately, nearly twice as large for VGAGR files. &#A0;On the other hand, it operates with no perceptible delay (ie. under a second or less).

(Though as a practical matter, you can probably get away with not doing any real compression at all, so I guess this is more an academic exercise than anything. &#A0;:-/)
My files are (roughly) 1% larger for VGAGR, 4% larger for levels, and 7% larger for VGASPEC. Obviously, this is not worth all the extra time my program takes. The slow speed is probably due to the fact that I'm using a high-level framework and a general-purpose search function, though. You could probably afford to increase your level of compression without sacrificing too much speed.

Lemmingologist

dim byte, bytePos as integer
dim in, out as binaryStream

function ReadBits(count as integer) as integer
      dim i, sum as integer

      for i=count-1 downto 0
            sum=sum+pow(2,i)*(byte mod 2)
            byte=byte\2
            if bytePos < 7 then
                  bytePos=bytePos+1
            else
                  bytePos=0
                  byte=in.ReadByte
                  in.position=in.position-2
            end if
      next

      return sum
end function

dim i, bits, start, blockSize, compressedBlockSize, fileSize, itemCount, repeat, offset, length as integer
dim stop, endOfFile as boolean
dim file as folderItem

file=GetOpenFolderItem("type") //displays "Open File" dialog box
fileSize=0
itemCount=0
if file <> nil then
      while not endOfFile
            in=file.OpenAsBinaryFile(false)
            out=GetFolderItem("temp").CreateBinaryFile("type") //creates a binary file named temp
            in.position=fileSize
            start=in.ReadByte
            bits=in.ReadByte
            blockSize=in.ReadLong
            compressedBlockSize=in.ReadLong
            in.position=fileSize+compressedBlockSize-1
            byte=in.ReadByte
            bytePos=0
            in.position=in.position-2

            if start > 0 then
                  if ReadBits(1)=0 then
                        if start < 4 then
                              bits=ReadBits(9-start)
                              length=ReadBits(3)
                        elseif start=4 then
                              length=ReadBits(3)*2
                              bits=ReadBits(4)
                              length=length+ReadBits(1)
                        else
                              length=ReadBits(4)
                              bits=ReadBits(8-start)
                        end if
                  else
                        if start < 4 then
                              bits=ReadBits(10-start)
                              length=ReadBits(8)+8
                        elseif start=4 then
                              bits=ReadBits(2)
                              length=ReadBits(1)*128
                              bits=ReadBits(4)
                              length=length+ReadBits(7)+8
                        else
                              bits=ReadBits(2)
                              length=ReadBits(4)*16
                              bits=ReadBits(8-start)
                              length=length+ReadBits(4)+8
                        end if
                  end if
            else
                  bits=ReadBits(8)
                  if ReadBits(1)=0 then
                        length=ReadBits(4)
                  else
                        bits=ReadBits(2)
                        length=ReadBits(8)+8
                  end if
            end if

            while in.position >= fileSize+5
                  for i=0 to length
                        bits=ReadBits(8)
                        out.WriteByte bits
                  next
                  stop=(in.position < fileSize+5)

                  while not stop
                        if ReadBits(1) = 0 then //0
                              if ReadBits(1) = 0 then //00
                                    length=ReadBits(3)
                                    stop=true
                              else //01
                                    repeat=1
                                    offset=ReadBits(8)
                              end if
                        else //1
                              if ReadBits(1) = 0 then //10
                                    if ReadBits(1) = 0 then //100
                                          repeat=2
                                          offset=ReadBits(9)
                                    else //101
                                          repeat=3
                                          offset=ReadBits(10)
                                    end if
                              else //11
                                    if ReadBits(1) = 0 then //110
                                          repeat=ReadBits(8)
                                          offset=ReadBits(12)
                                    else //111
                                          length=ReadBits(8)+8
                                          stop=true
                                    end if
                              end if
                        end if

                        if not stop then
                              for i=0 to repeat
                                    out.position=out.position-offset-1
                                    bits=out.ReadByte
                                    out.position=out.position+offset
                                    out.WriteByte bits
                              next
                        end if
                  wend
            wend

            in.close
            out.close
            in=GetFolderItem("temp").OpenAsBinaryFile(false)
            out=GetFolderItem(file.name+" Item "+str(itemCount)).CreateBinaryFile("type")
            in.position=in.length-1
            for i=1 to in.length-blockSize
                  bits=in.ReadByte
                  in.position=in.position-2
            next
            for i=1 to blockSize
                  out.WriteByte in.ReadByte
                  in.position=in.position-2
            next
            in.close
            out.close
            GetFolderItem("temp").delete

            fileSize=fileSize+compressedBlockSize
            if fileSize >= file.length then
                  endOfFile=true
            end if
            itemCount=itemCount+1
      wend
end if

Lemmingologist

I have updated my compression and decompression code (above on this page). The changes include:
_ Checksum calculation (thanks to ccexplore)
_ Support for 12-bit offsets with less than 5 repeating bytes
_ Fixed a bug in compressor which sometimes caused the first byte of the file to be encoded incorrectly.
_ Increased compression speed by optimizing searches for repeated strings