D/A5, crash in al_get_joystick_state, Windows

Simon · March 01, 2016, 05:08:16 PM

Icho and I have have looked into the joystick crash on Windows. This bug is very hard to trigger, we've hit it only once. Key ingredients, as Clam and Nepster have suggested, are turbo-fast-forward, then framestepping back.

Theory: I don't run the D garbage collection (GC) during the game, only after each level. Automatic GC would kick in when RAM is scarce, which doesn't happen during play. But maybe VRAM gets scarce? This doesn't trigger the automatic GC, because RAM is not yet full. Yet somehow, we might get bad VRAM bitmaps, even though A5 always returns meaningful pointers on bitmap creation.

Therefore: From 0.2.33 on, I'm forcing the GC to run during play, whenever old savestates aren't needed anymore. GC'ing frees the savestates' RAM and VRAM. Let's see whether you can repro the crash still. When you framestep back in large jumps, holding down the button, this can make the GC run every frame. Let's see how much of a performance hit you take.

Get 0.2.33. You don't have to test immediately. But please report the crash again, should it occur from now on.

-- Simon

ccexplore · March 01, 2016, 11:05:01 PM

Have you try searching A5's bug database and forum to see if others have reported similar-looking issue? Maybe it's a known bug in the library and/or have known workarounds.

[edit: this paragraph is no longer trusted since the reported call stack's accuracy has been called into question, per my later post below] If I read your bug report correctly, the access violation is actually happening inside the Windows system function RtlpUnWaitCriticalSection. I believe this indicates a corrupted/uninitialized critical section object (used for thread synchronization) happening in A5's library code (most likely due to the owning object being deleted/cleaned up prematurely, or having a dangling pointer to the owning object). Thus it feels like it could be a bug in the library itself, unless this is a case where your D code calling into A5 has violated some preconditions that can lead to undefined behavior.

The problem with forcing GC is that, in addition to being an obvious hack, if your theory isn't correct the change might merely shift the exact timing/circumstances of the issue, temporarily making it look like it might be fixed when in fact it isn't.

If the bug is difficult to trigger and it doesn't help to debug a core (ie. memory) dump of the program when it crashes (probably only effective if you can actually debug down to A5's source code), you can potentially write something to make it easier to trigger the bug for debugging purposes. For example by writing a script to inject input to Lix to repeatedly try to hit the triggering condition (turbo fast forward + rewind), or changing the Lix code itself to do a test mode that operates off of canned input to repeatedly try to hit the triggering condition.

ccexplore · March 01, 2016, 11:22:56 PM

On that note, I might want to download the previous version of D-Lix for Windows without the GC hack. I doubt I can debug it any better than you did, but I'd like to try. I ask for no GC hack in case having the hack decreases the frequency of triggering the issue without actually eliminating it.

ccexplore · March 01, 2016, 11:54:06 PM

I briefly skimmed through Allegro5's GitHub, and a search for al_get_joystick_state yields relatively few hits. It doesn't look like any of the bitmap functions in Allegro would call that joystick function; in fact, the result of the source search suggests that there are only external callers for that function, and you said Lix never called it.

So it's starting to look more like the reported call stack from the crash is just inaccurate and probably useless for debugging. I have a feeling you may know/guess this already. I doubt I can do any better in this regard (ie. getting more accurate data on where exactly the crash is occurring), but I'd like to try.

Simon · March 02, 2016, 07:34:28 AM

Thanks for the reality checks!

I had the bug on the backburner for a while. I haven't yet investigated any documentation or forums. >_>

Debugging build with and without GC-hack. 0.2.32 is the commit right before the hack, 0.2.33 has the hack. Built with debugging symbols in the D code. Regular C debuggers understand them, they read like mangled C++ symbols.

There are debugging DLLs of Allegro 5. I haven't built against them. If you consider them a good idea, I'll try building against the debugging DLLs. For exact info about what my build uses, see my build notes for Windows on github, section "Installing Allegro 5".

I haven't tried whether the existing Lix Windows executables still run when we swap the DLLs, maybe with renaming the DLLs, but without recompiling Lix. I don't know how DLLs are linked in -- by symbol name or by binary interface.

I'd assume by symbol name, so swapping DLLs might work.

-- Simon

ccexplore · March 02, 2016, 09:54:52 AM

Having played with both versions, I've actually end up agreeing with your theory, and forcing more frequent garbage collection might well be the simplest solution. (To be fair, the crash will probably still happen if a level manages to actually, truly use up more video memory than available, but at least much harder now that we're cleaning up more aggressively.)

I am able to get an assert (attached) pretty easily in the 0.2.32 version in above post, after repeated forward and rewind. OKing past the assert then led right to some sort of memory exception. It doesn't bring up a callstack with mentions of joystick, but presumably it was the same underlying issue. More tellingly, I had set Window's Task Manager to be "Always on Top" to monitor Lix's memory usage while it is running (I guess you could also probably just run Lix in windowed mode), and indeed it shows Lix rather quickly balloons to well over 1 GB memory usage in forward/rewind scenarios especially when I kept the rewind key held down.

With version 0.2.33's garbage collection, memory usage stays at around 360 MB with no rapid runaway memory consumption according to Task Manager, even with repeated forward and rewind, and accordingly I've yet to cause the issue to happen. (There is still some minor growth in memory usage over time, but far, far more gradual and may well be expected as part of managing savestates for rewinds.) I didn't notice much difference performance-wise; if anything I think 0.2.33 is slightly better perf-wise during rewind, probably due to keeping the memory usage more in check compared to v0.2.32.

Forcing GC may be a "hack" in one sense, but in this case, your theory seems correct based on evidence, and unless you rewrite code to get much more explicit in managing the lifetimes of objects owning Allegro bitmaps (sounds prone to mistakes creating other kinds of bugs), forcing GC is at least the safest way to handle this.

Simon · March 02, 2016, 06:32:20 PM

Excellent and detailed feedback, thanks!

I will keep the manual management in the back of my head. VRAM allocations are wrapped in GC-ed D classes. The VRAM is disposed on GC, or on manually calling the dispose method. Mistaken accesses to freed VRAM should fail asserts in my wrappers. This shouldn't be too error-prone to manage manually.

Good to see that frequent GC'ing doesn't affect the performance much.

-- Simon

D/A5, crash in al_get_joystick_state, Windows

Simon

ccexplore

ccexplore

ccexplore

Simon

ccexplore

Simon