Bit pushing in the Lix struct

Started by Simon, July 31, 2023, 03:28:05 AM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

Simon

Here's some loose rambling from pushing source code around. I don't have concrete plans to rework anything here, but I've re-examined the existing memory layout this weekend. Thus, why not write it down, maybe it's of interest.

A lix (the struct Lixxie that describes a single lix) is 56 bytes long, assuming we're on a 64-bit machine with 64-bit pointers. The layout is roughly like this:

Bytes 0-32: The lix's job, an emplaced subclass of Job. All subclasses fit into 32 bytes.
Bytes 32-36: Position as two 2-byte integers
Bytes 36-38: 2-byte bitset for all the bools such as facing direction, ability to climb, ...
Bytes 38-40: Queued fling speed as two 1-byte integers
Byte 40: Style (color, player affiliation)
Byte 41: Encounters (water, exits, ...) during the current physics update
Byte 42: Exploder timer in multiplayer
Bytes 43-48: unused (5 bytes)
Bytes 48-56: Pointer to the entire physical world. Kitchen sink object, only filled & used during updating.

The first 32 bytes of that are the emplaced Job; in detail, the Job looks like this:

Bytes 0-8: vtable pointer for the Job class object
Bytes 8-16: monitor pointer for the Job class, an unused (?) multithreading D object per class object
Byte 16: Activity (determines vertical frame in the spritesheet)
Byte 17: Horizontal frame in the spritesheet
Bytes 18-32: Subclass payload: Job subclasses define it, e.g., leftover builder bricks, miner anti-shock

Things in the Job subclass are properly aligned. If the first field wants to be 4-byte-aligned (e.g., the common 32-bit int, which has size 4 and alignment 4), it goes at byte 20 in the Job, with bytes 18 and 19 unused padding. In D as in C++, a single-inheriting subclass starts its data after the base class's data, and the base class's data comes after the hidden vtable pointer (and in D the monitor pointer).

+----------------------+
| 0-8: vpointer of Job |
+----------------------+
| 8-16: monitor of Job |
+--+--+----------------+
|Ac|fr|      18-32:    |
+--+--+     subclass   |
|           payload    |
+----------+-----+-----+
| 32-36:pos|flags|fling|
+--+--+--+-+-----+-----+
|st|en|tm|  5 unused   |
+--+--+--+-------------+
| 48-56: world pointer |
+----------------------+





Cache lines in most computers are 64 bytes. If you fit into a cache line, you're quick to load.

The lix is 56 bytes long, that fits. Still, I should try to pad it with 8 more unused bytes. The padded lix would then always be 64 bytes long, and it's possible that it'll be even more cache-friendly (even though it's longer!) than the 56-byte lix.

The alternative experiment would be to pack everything into 32 bytes. The bytes will then be as snugly packed as a pile of kittens. We'd have to pass the pointer to the outside world in the methods, or abuse the memory layout and cram it over the (hopefully unused) monitor pointer. I think I could reduce the Job subclass payload from 12 to 8 bytes, and cram the Job-independent fields from 16 bytes into 12. It would be an elaborate experiment.

The elephants in the room are really the pointers. These 64-bit pointers (8 bytes) are fat walruses. You can't herd too many of those in your bathtub at the same time.

-- Simon

Simon

#1
struct LixxieImpl {

becomes

align (64) struct LixxieImpl {

This makes the formerly 56-byte-long 8-byte-aligned struct 64-byte-long and 64-byte-aligned. No need to add unused fields.

I've measured the improvement during replay verification of the entire Lix proof collection. In vanilla Lix 0.10.12 (the most recent release), mass replay verification took 18.3 seconds on sil (my main Desktop). Now it takes around 18.1 seconds, and occasionally I see runtimes in the 17.xx that I've never seen before putting the align (64). Improving the mass replay verification by 0.2 seconds is a 1 % improvement.

The one-liner change of making the struct longer from 56 to 64 bytes and, what's really the heart of the idea, aligning it to 64-byte-long cache lines improves the mass replay verification by 1 %. This goes into the next release, Lix 0.10.13. We'll see in the long term how it runs on 32-bit machines -- if anybody still uses one.

-- Simon