Performance of Rewinding and of Mass Replay Verification

Simon · July 03, 2024, 01:24:28 AM

Hi,

after Mindless released Golems on the forums here, he posted his ideas about what to include in a savestate, what to compress, ... I got inspired to investigate Lix rewinding performance. Over several evenings, I've finished the following three improvements:

Savestate only the mutable half of the world
Choose graphical frames of hatches/exits/... only when drawing
Refactor away from overuse of shared pointer to world

1. Mutable Half: The world in Lix consists of terrain, players, their lix, hatches, exits, traps, ..., and I used to keep everything in savestates. But some parts of the world never change: Water never changes. Fire hazards never change. An exit has some state (e.g., ownership, the red player gets points) but this state never changes during play. The initial value of the overtime never changes.

I've separated the world into a mutable half (land, lix, triggered traps, ...) and an immutable half (hatches, exits, water, ...). I copy only the mutable half into savestates. When I load a savestate, I attach the loaded mutable half to the existing immutable half, and that's the new world. This makes the savestates smaller in memory and reduces allocations. Each gadget (hatch, exit, trap, ...) is a heap-allocated class object. To track one in a savestate, we must deep-copy it, and it helps when we only have to save some of the gadgets, not all.

The name "half" is euphemistic. The mutable half is still the lion's share with the uncompressed land bitmap, the physics map, ..., and I didn't look into compression.

2. Graphical Frames: Ostensibly, exits still change over time because they animate. This animation has no bearing on physics, which is a strong argument to keep the exit in the immutable half of the world. But if the exit never changes, it can't keep track of its current animation frame.

The solution: During drawing, I tell the exit the current tick (physics update number) and the exit computes its animation frame from first principles. The drawing method ferries the arguments the to the exit.

Instead of incrementing the frame each tick and resetting the frame after reaching the end, we compute the modulus tick % animationLength. Naively, taking the modulus of integers is slower than adding and comparing integers. But there are performance benefits: We track less state in the exit now, which makes the exit smaller in memory by itself, and it makes all savestates smaller because exits can now be in the immutable half of the world.

Most importantly, non-drawing physics updates don't have to compute animation frames for exits anymore. When we aren't going to draw the results of some physics updates, we've saved many virtual method calls: One virtual call per gadget per update.

After all, we have our direst need for performance during singleplayer rewinding, during networked games with many players, and during mass replay verification. All three cases compute several (or all) physics updates behind the scenes. E.g., rewinding loads a savestate that is older than the target tick, then updates physics forward from the savestate until we're at the target tick.

3. Ditch Shared Pointer: A shared pointer to the world means two levels of indirection:

Usercode holds a shared pointer by value.
Each shared pointer points to the single reference-counting storage.
The reference-counting storage points points to the payload.

Compare this with unique resource handling (one indirection) or putting the world struct by value inside your world-handling code (direct access for the holder, and still one indirection for other parts of the physics).

In 2016, when I wrote D Lix, I chose the shared pointer for all holdings of worlds (the current world and for the savestated worlds) because I wanted easy savestating. The same world could appear several times in the savestating cache. I used the same type (shared pointer to world struct) everywhere. The problem was in the current world, which was also a shared pointer to a world struct. All physics that referenced the world (instead of, e.g., only the lix at hand) had to dereference the shared world pointer many times over.

Solution: Now, I hold the current world by value. After all, the world struct is less than 300 bytes; there is a lot of indirection in the world by itself already.

Sidetracking: I've also removed the shared pointers from the savestating cache. That wasn't necessary to change; it wouldn't have mattered for performance here. For now, I've resorted to manual management of copying/disposing between the current world and the savestates. To do what I really wanted, I need move semantics. D supports C++03-style RAII, but not yet the value-type-based rule-of-five resource handling that became popular with C++11. I want to write my program in the mathematical union of all the programming languages.

Overall Benefit

Mass replay verification is 3.5 percent faster. The 1077 replays from the replay collection take 18.9 seconds on my Intel i5-6600 3.30 GHz. Before, in the stable Lix 0.10.24, they took 19.6 seconds.

It can't be from the smaller savestates because mass replay verification doesn't generate savestates. It must be either from holding the current world by value or from from skipping graphical frame calculations and avoiding their virtual calls. I can't tell which of the two improvements it is, I have only measured all three optimizations together.

During singleplayer, I've kept the full 60 fps more often during hard rewinds when Lix 0.10.24 would dent to 55 or 50 fps. This is hard to quantify because it depends on the level. It's also hard to guess which of the three optimizations played the biggest role here. The bottom line is: The benefit is at least measurable during graphical play.

We'll see if and how much it helps in practice, after I release this in Lix 0.10.25.

-- Simon

Simon · July 06, 2024, 12:25:31 AM

Today, I've improved replay verification runtime once more by 13 percent. This is unexpectedly good, I didn't want to believe it, and I spent 2 hours to isolate the reason.

The culprit was the code that tests whether lix should exit. This code runs for every lix alive, during every physics update. The old code was:

if the outside world generally allows lix to exit because we aren't nuking
and this lix has encountered an exit during movement
and this lix' activity allows to become exiter
then loop through all exits, give points ...

The new code changes the order of the questions to:

Have we encountered exits earlier during this physics update?
Does our activity allow us to become exiter?
Is it generally allowed to exit because we aren't nuking?

Asking for encountered exits first has two desirable qualities:

It's usually false. The logic can short-circuit, i.e., we can skip the remainig questions.
It's fast. The lix stores her encounters in a small bitfield. We test one bit inside the lix.

Asking the outside world for general exit allowance is the complete opposite: We're usually allowed to exit (rarely are we nuking) and asking the outside world is expensive here. This code runs once per lix per physics update, thus the hot memory is the lix's own memory. Asking the outside world follows a pointer to not-so-hot memory, and we compute the answer (are we nuking) from several properties of the player teams.

Reordering three lines brought a whopping 13 percent speed improvement during mass replay verification. This means 13 percent of the entire workload -- which includes file loading, terrain rendering, ... I'm sure it will be useful during graphical play, too, especially when we compute several physics updates behind the scenes during networking games or during singleplayer framestepping.

There is treasure everywhere!

Nonetheless, I don't want to believe it still. 13 percent sounds too good. I'll have to playtest this experimental build in a networking session first.

-- Simon

Ramon · July 06, 2024, 09:46:41 AM

That is in fact an impressive optimization. Every single improvement matters long term and I'm glad to see it confirmed once again that you are an enthusiastic code optimizer.

WillLem · July 08, 2024, 03:41:12 PM

Inspired.

Performance of Rewinding and of Mass Replay Verification

Simon

Simon

Ramon

WillLem