Decompression question

Started by deuteragenie, October 31, 2024, 11:56:07 AM


deuteragenie

Hello,

I am trying to decompress an (original version) main.dat file, following the instructions at:
https://www.camanis.net/lemmings/files/docs/lemmings_dat_file_format.txt

The length of the first section is 8670 bytes (compressed), incl. 10 bytes for the header.

At position 8670 in main.dat, I find the following bits (in reverse order):

111 00000010 0000000 ...

According to the spec, "111" indicates that the following 8 bits represent an amount of bytes to read + 9.

So, we should have 2 + 9 = 11 bytes to read.
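Read at face value, the spec's "111" case can be sketched like this (a hypothetical helper for illustration, not taken from any real implementation):

```python
def raw_copy_length(bits):
    """bits: string of '0'/'1' characters following the '111' marker.
    Per the spec, the next 8 bits encode (number of bytes to read) - 9."""
    return int(bits[:8], 2) + 9

raw_copy_length("00000010")  # 2 + 9 = 11
```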

But ldecomp says 13, i.e. 4 + 9 bytes to read.

Am I overlooking something? Should the next 9 bits instead indicate the number of bytes to read, like this:

111 000000100 00000000 ...

Or is ldecomp wrong? Or is the spec wrong?

Any help appreciated.

Simon

Welcome to the forums!

In 2008, I implemented this decompression in C++ from that same spec document lemmings_dat_file_format.txt. See crunch.h and crunch.cpp on GitHub. The part with algo == 6 should be relevant.

I'll see if I find the time to take a detailed look myself and answer your question directly. I'll have to study the spec and my implementation again.

I believe that, in 2008, I didn't see such a bug in the spec. I only submitted to ccexplore/Mindless the "small fix by Simon" noted at the top of lemmings_dat_file_format. This fix was about which sections come when, not about how to interpret the bitstream.

Caveats with my C++ source:

  • Comments are in German. I was a beginner at open-source software development then.
  • I never decompressed main.dat with this. I used my implementation only on levels and tilesets. Maybe both crunch.cpp and the spec are wrong, and my usages merely never ran into such a bug.
  • It's been 16 years since I wrote this, and I haven't maintained the C++ repository for 10 years.

-- Simon

deuteragenie

Thanks,

It is quite useful to see another implementation.

It looks like in case 6 ("111"), you also took the next 8 bits to determine the length / number of bytes to read. This follows the spec, but does it produce the correct results?

Mindless

111 00000100 ... is what I see for the first operation in decompressing the first segment of MAIN.DAT (of Lemmings and Oh No! More Lemmings).

deuteragenie

Thanks,

I am looking at main.dat with a hex editor, at position 8670.
The first 2 bytes (reversed) are 11100000 01000000.
Broken up according to the spec, this is: 111 00000010

So I am still puzzled as to how one gets 4 + 9 for the number of bytes to read, as opposed to 2 + 9.

Mindless

The num_bits_in_first_byte field is 7, so you need to throw away the last bit of the first byte, which leaves you with 1110000 01000000, or 111 00000100 ....
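This adjustment can be sketched like this (function name and bit orientation are assumptions for illustration; each byte is shown already "reversed", as quoted above):

```python
def bitstream(data, num_bits_in_first_byte):
    """Return the section's bits as a string, keeping only the valid
    num_bits_in_first_byte bits of the first byte."""
    bits = "".join(f"{b:08b}" for b in data)
    # Throw away the unused trailing bit(s) of the first byte:
    return bits[:num_bits_in_first_byte] + bits[8:]

s = bitstream(bytes([0b11100000, 0b01000000]), 7)
s[:3], s[3:11]  # ('111', '00000100') -> 4 + 9 = 13 bytes to read
```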

deuteragenie

Yes, I think that might be it. Thank you !

I was puzzled by the source code of ldecomp, which doesn't seem to consider num_bits_in_first_byte, or maybe I overlooked it.

I will try again and report how it goes.



deuteragenie

Alright,

Taking into account num_bits_in_first_byte I can produce the same output as ldecomp for main.dat.

Many thanks for your help !

EricLang

I ported this decompression algorithm to 4 languages: Delphi, C#, Rust and Zig. But I never dived into what it really does :)
I know it makes a lot of sense to check the checksum as well, to be sure of a correct result.

deuteragenie

Hello,

To make it dead simple, at the cost of a bit of memory and CPU (which is irrelevant compared to the 1990s), I ended up doing this in Python:

1. Preparation
Convert the input bytestream to a bit array
Reverse the bit array
Suppress the extraneous bit(s) as determined by bits_in_first_byte
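A minimal sketch of these three preparation steps (assumed helper name, not the poster's actual code; reversing the whole stream puts the section's last byte first, LSB-first, which matches the "reversed" bytes quoted earlier in the thread):

```python
def prepare(section, bits_in_first_byte):
    """section: the compressed bytes of one section, header stripped."""
    bits = "".join(f"{b:08b}" for b in section)  # 1. bytes -> bit string
    bits = bits[::-1]                            # 2. reverse the whole stream
    # 3. suppress the extraneous bit(s): only bits_in_first_byte bits
    #    of the first byte processed are valid.
    return bits[:bits_in_first_byte] + bits[8:]

prepare(bytes([0x02, 0x07]), 7)  # '111000001000000' -> 111 00000100 0000
```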

2. Processing
Then the 6 cases outlined in the spec are straightforwardly implemented by reading the bit array "in order" until all decompressed bytes are produced and written to file.

To ensure correctness, I compared the output with that of ldecomp.

As for the checksum, it is simply all compressed bytes XOR'd together, so it may not be very useful for checking the correctness of a decompression implementation.
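For reference, a checksum like the one described here is just a running XOR over the section's compressed bytes (a sketch based on this thread's description; the function name is an assumption):

```python
def section_checksum(compressed):
    """XOR of all compressed bytes in a section."""
    checksum = 0
    for byte in compressed:
        checksum ^= byte  # running XOR over the raw compressed data
    return checksum

section_checksum(bytes([0x07, 0x02, 0x01]))  # 0x04
```

Since it is computed over the compressed bytes, it verifies the input file's integrity rather than the decompressor's output.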

Hope it helps.