[BUG][PLAYER][DATA LOSS] Replays are lost during Mass Replay Check when renaming

Started by WillLem, June 05, 2024, 02:44:18 PM

Previous topic - Next topic

0 Members and 2 Guests are viewing this topic.

WillLem

Found a particularly nasty bug today when running tests on the SLX Replay Manager (which is essentially a re-skinned version of the existing NL Replay Renamer with a couple of extra features), and discovered that it exists in the engine as far back as NL 12.12.5.

To say the MRC deletes the files is somewhat incorrect. It does delete them, but this behaviour is actually correct. The bug is that the newly-generated replays are being saved over each other iff they have the same filename.

Steps to reproduce:

1) Unzip the attached Replay Check Batch to a new folder in "levels"
2) Run MRC on the replays within the folder, making sure to choose "move to" or "copy to" (choose Position + Timestamp to make it more likely that the buggy behaviour will occur, but it can occur for any renaming string)
3) Some of the replays won't have been saved (well, actually, they have been saved, but they overwrite each other because they have the same filename)

This is most likely occurring because the timestamp used for renaming is that of the replay check time rather than the timestamp on the original replay.

Suggestions for fixes:

1) Add an incrementing alphabet character to the timestamp iff the current timestamp has already been used (so, for example, __2024-06-05_15-26-51.  __2024-06-05_15-26-51A,  __2024-06-05_15-26-51B, etc)

2) Retain the timestamp of the original replay and use that instead (arguably, this should be happening anyway)

3) Check to see if the current "NewName" has already been used in the current replay check. If it has, add an alphanumeric character to the end of the replay name (so "NewName := NewName + i; Inc(i); or generate a random short string of digits) - essentially the same as suggestion (1), but we're taking the full replay name into account rather than just the timestamp

I'll likely go ahead and apply whichever of these fixes proves to be easiest in SLX, happy to do a PR if it tests working.

WillLem

Fixed in SLX commit beccdf927 and sent to NL via this pull request.

The fix generates a random 3-digit hex in the event of identical filenames when saving. For now, it's good enough to prevent data loss whilst we figure out whether we'd prefer to have replay files saved with the original timestamp tag (and, in the case of SLX, whilst I figure out how to do that! ;P)

Simon

Swap lines 309 and 310, and you'll be able to remove the duplicated line 306.

Theoretically, your loop can be infinite. In practice, I doubt anybody will run into problems because nobody has found the original (serious) data loss behavior in years. It's up to you how much work you want to invest here.

It's good that the code (both yours and the original) first saves the replay, only then removes the replay. Reason: If the computer hangs in between the two, you haven't lost data.

It's surprisingly hard to make this 100 % robust. Here are some concerns and you probably don't want to fix them all these days. They're rabbit holes (compared to your fix candidate) and running into these bugs will be rare.
  • Avoid the infinite loop.
  • You can't expect to find a free filename for certain, regardless of how you generate the filenames. Keep the source replay file (instead of deleting it) after you couldn't find a free filename for renaming.
  • Saving the replay can fail: Disk is full, you're renaming to a write-only media, you're renaming to a removed physical drive, you're renaming to a network drive and the connection dies, ... Keep the source replay file if the saving fails.
  • You have a time-to-check-to-time-of-use bug: You test in NL that filename X is free, then another
    program creates file X, then NL saves the file to X, overwriting the other program's file. Or NL fails to save because the file is still open in the other program.
The time-to-check-to-time-of-use bug is unlikely (NL is not a security attack target) and I'm noting this mainly for completeness. I don't even know a good way to fix time-to-check-to-time-of-use in Delphi. In C, there is fopen("name.txt", "wx"), where the "wx" means that you want the file-opening call to succeed iff the file is nonexistant and writeable. The idea is that you both test and open in a single library call that relies on the operating system to ensure that nobody can do anything to the file in between.

-- Simon

WillLem

Quote from: Simon on June 05, 2024, 09:22:30 PM
Swap lines 309 and 310, and you'll be able to remove the duplicated line 306
...
Avoid the infinite loop.

Good shout!

Applied these fixes and updated the PR.

Quote from: Simon on June 05, 2024, 09:22:30 PM
You can't expect to find a free filename for certain, regardless of how you generate the filenames. Keep the source replay file (instead of deleting it) after you couldn't find a free filename for renaming.

Not so easy to see how to do this. If we exit the loop, does this also exit the main procedure? If so, then we can use the existing above fix, and exit if naming attempts have maxed out unsuccessfully (currently extremely unlikely). If not, I guess we could set a bool (tracked per-item) and then check & exit before deleting.

Quote from: Simon on June 05, 2024, 09:22:30 PM
Keep the source replay file if the saving fails.

Another good shout, but again struggling to see how to do this one.

Quote from: Simon on June 05, 2024, 09:22:30 PM
You have a time-to-check-to-time-of-use bug: You test in NL that filename X is free, then another
program creates file X, then NL saves the file to X, overwriting the other program's file

A) This is definitely pushing the effort I want to put into this, but also B) it shouldn't overwrite, because it will see the existing file and generate a new name, will it not? Besides, given normal circumstances this should only happen if performing an MRC on the same folder in 2 NL instances simultaneously... such use of the program needn't be supported anyway.

Simon

NewName := ChangeFileExt(NewName, '') + '_' + UniqueTag + '.nxrp';

Assuming we loop 4 times out of the maximum 1,000, this generates names such as:
blah_123_456_789_0AB.nxrp

Do you want this? Appending more and more random chars indeed has a higher chance of finding a free filename.

Probably, we'll stop before we run into Windows's limit of 255 wchars per filename without path. In particular, to run into OP's data loss and thereby into the loop here, you have to give a short prefix that didn't depend on replay names/level names/...

Quoteit shouldn't overwrite, because it will see the existing file and generate a new name, will it not?

Not if the OS takes processor time away from NL exactly after you've tested that X doesn't exist, exactly before you save to X.

I agree, skip it for this PR. The old code didn't even test for existence at all, you haven't made anything worse.

-- Simon

WillLem

Quote from: Simon on June 06, 2024, 04:13:23 PM
Assuming we loop 4 times out of the maximum 1,000, this generates names such as:
blah_123_456_789_0AB.nxrp

Do you want this? Appending more and more random chars indeed has a higher chance of finding a free filename.

Circumstances that would lead to this happening:

We have > 1 replay file to test. All files are tested at the same timestamp, so result in the following name:

blah_timestamp
blah_timestamp
blah_timestamp
etc...

The loop sees that the filename blah_timestamp already exists, so saves the next file as blah_timestamp_123. Then, it generates the exact same tag for the next file (a 0.02% chance, probably reduced to even less when accounting for the files being saved during the same 1s time period), so needs to save it as 123_456.

In practice, this would basically never happen. Here's the results of a test I did just now. Note that those replays which have the same timestamp have a single 3-digit hex tag:



I mean, I'm happy to update it again if you think it's worth it. But, I'd say it's fixed.

Simon

Right, I wouldn't worry much about running out of allowed filename length. As you correctly describe, the chance of having even 2 or 3 extra loop iterations is tiny.

It's a good PR now that fixes the original data loss in the vast majority of cases.

-- Simon

WillLem

Quote from: Simon on June 06, 2024, 06:45:18 PM
It's a good PR now that fixes the original data loss in the vast majority of cases.

Thanks for checking, you definitely picked up on some necessary improvements :thumbsup:

namida

This definitely falls under worth fixing, and should be easy enough to achieve.
My projects
2D Lemmings: NeoLemmix (engine) | Lemmings Plus Series (level packs) | Doomsday Lemmings (level pack)
3D Lemmings: Loap (engine) | L3DEdit (level / graphics editor) | L3DUtils (replay / etc utility) | Lemmings Plus 3D (level pack)
Non-Lemmings: Commander Keen: Galaxy Reimagined (a Commander Keen fangame)

namida

Fixed in commit 996b2e1.

NL will now only overwrite the replay file if the file name generated for the output is exactly the same as the input filename (as in this case, it can be presumed the user is intentionally overwriting their existing ones). Otherwise, if the file already exists, a number will be appended to the name (before the extension). It starts with adding -0001, -0002 etc. In the extremely unlikely case 9999 is reached, it just adds on an extra digit (ie: the next one will have -10000 appended to the name).

Side effects may include an infinite hang if you have roughly 4 billion files with the same output name. I consider this a very unlikely scenario in practice.
My projects
2D Lemmings: NeoLemmix (engine) | Lemmings Plus Series (level packs) | Doomsday Lemmings (level pack)
3D Lemmings: Loap (engine) | L3DEdit (level / graphics editor) | L3DUtils (replay / etc utility) | Lemmings Plus 3D (level pack)
Non-Lemmings: Commander Keen: Galaxy Reimagined (a Commander Keen fangame)