How hard is it to translate Lix into another language?

geoo · February 05, 2015, 09:15:44 AM

Quote from: Simon on February 05, 2015, 09:04:55 AM
0x20-0x7F includes some chars forbidden in Windows filenames.

That's why I put the caveat.

Quote from: Simon on February 05, 2015, 09:04:55 AM
Right, that's why I'm considering UTF-8 mangling with little endian, i.e. _123456 means UTF-8 char 0x56 34 12. Converting this number to the unicode codepoint should be straightforward.

Wait, but isn't it the first byte (0x56) that determines, in the most straightforward way, how many of the following bytes to read? I guess you could also keep reading bytes that start with the binary digits 10 until you encounter something else, to see which bytes belong to the current character... that seems a bit stranger, though.
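For reference, a minimal sketch of the two UTF-8 rules mentioned here: the lead byte announces the sequence length, and continuation bytes always start with the bits 10. The function names are made up for illustration and are not from the Lix source. Note that 0x56 from the example is itself plain ASCII, so as a real lead byte it would stand alone.

```cpp
#include <cstddef>

// Number of bytes in a UTF-8 sequence, judged from its lead byte alone.
// Returns 0 for a continuation byte or a byte with five or more leading
// 1 bits. This only inspects the bit pattern; it does not reject overlong
// or out-of-range encodings.
std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)           return 1; // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                            // 10xxxxxx or 11111xxx
}

// True for continuation bytes, which always look like 10xxxxxx.
bool is_utf8_continuation(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}
```

Either rule is enough to find character boundaries: read forward from a lead byte by its announced length, or keep consuming bytes while is_utf8_continuation holds.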

namida · February 05, 2015, 10:41:22 AM

Why not just save the filename as a hexadecimal representation (or a hash) of the name altogether? That should be simpler and avoid having to deal with what characters can and can't be in filenames on certain systems.
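A rough sketch of what the hexadecimal variant could look like, assuming the name is held as a byte string. The function name and the choice of lowercase digits are assumptions for illustration only.

```cpp
#include <string>

// Hypothetical helper: write every byte of the username as two hex digits.
// The result contains only 0-9 and a-f, so it is safe on any filesystem,
// but it is no longer readable as a name.
std::string hex_filename(const std::string& username)
{
    static const char hex[] = "0123456789abcdef";
    std::string out;
    out.reserve(username.size() * 2);
    for (unsigned char byte : username) {
        out += hex[byte >> 4];   // high nibble
        out += hex[byte & 0x0F]; // low nibble
    }
    return out;
}
```

With this, hex_filename("Simon") yields "53696d6f6e".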

Simon · February 05, 2015, 12:51:46 PM

Quote from: geoo
Quote from: Simon on February 05, 2015, 09:04:55 AM
Right, that's why I'm considering UTF-8 mangling with little endian, i.e. _123456 means UTF-8 char 0x56 34 12. Converting this number to the unicode codepoint should be straightforward.
Wait, but isn't the first byte (0x56) determining in the most straight-forward way how many of the next bytes to read?

We probably mean the same thing. Yes, I want to use one underscore with a variable number of hex digits. The first two hex digits, 0x12, are the first little-endian byte. That tells whether there is more data to interpret. In the example 0x56 is the last byte, i.e., the first big-endian byte.

(I don't have to interpret filenames btw, I will always start with the unicode string and mangle that into a filename, never the other way around. It's nice still to be able to.)
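The little-endian escape described above could be sketched like this for a single character. The function name and the lowercase hex digits are assumptions; only the layout (underscore, then the UTF-8 bytes in reverse order) comes from the thread.

```cpp
#include <string>

// Hypothetical helper: turn the UTF-8 bytes of ONE character into an escape
// sequence. Output is an underscore followed by two hex digits per byte,
// with the last UTF-8 byte written first (little endian).
std::string escape_utf8_char(const std::string& utf8_bytes)
{
    static const char hex[] = "0123456789abcdef";
    std::string out = "_";
    for (auto it = utf8_bytes.rbegin(); it != utf8_bytes.rend(); ++it) {
        const unsigned char byte = static_cast<unsigned char>(*it);
        out += hex[byte >> 4];
        out += hex[byte & 0x0F];
    }
    return out;
}
```

Passing the bytes 0x56 0x34 0x12 gives "_123456", matching the example. When reading such a sequence back, every byte before the last is a UTF-8 continuation byte, so a decoder can stop at the first byte that is not one.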

Quote from: namida
Why not just save the filename as a hexadecimal representation (or a hash) of the name altogether?

That breaks backward compatibility with the existing user files, and it won't be easy to tell the user from looking at the filename.

-- Simon

namida · February 05, 2015, 12:57:07 PM

It's a plain text file, right? Couldn't you just store the username on the first line or something?

As for backwards compatibility, just look for filenames that don't meet the naming convention (or don't have a username stored in them) and update them to the new format? Or, you could simply store the username as mentioned above (updating existing files that don't have one to take the username from the filename), without paying any regards to the filename (so the user can call the file whatever they want as long as it's in the right place)?

Simon · February 05, 2015, 01:11:07 PM

Quote from: namida on February 05, 2015, 12:57:07 PM
It's a plain text file, right? Couldn't you store the username on the first line or something?

Possible, but the logic of the program would change. Then, every user file must be inspected first to see whether its username matches the current user from the global config.

Quote
As for backwards compatibility, look for filenames that don't meet the naming convention (or don't have a username stored in them) and update them to the new format?

That bloats the code more than my proposed solution, but will consider it once you show why converting every char is superior.

Quote
Or, you could store the username as mentioned above (updating existing files that don't have one to take the username from the filename), without paying any regards to the filename (so the user can call the file whatever they want as long as it's in the right place)?

The user doesn't care about how that file is named. It's created when he first runs the program, and I want to ask for the bare minimum of input at that time. He must be able to play the game ASAP. The file should be named the exact same way in a fresh installation, so he can overwrite it. A deterministic algorithm for mangling usernames into filenames is best.

Why is mangling every char better than mangling things except A-Z a-z 0-9, space, dash, and non-threatening ascii chars?

-- Simon

namida · February 05, 2015, 01:14:03 PM

Well, my idea was that the filename doesn't matter; it just stores a username in the file (which can be any character). Thus, you just need to generate a default filename somehow. It doesn't really matter what this is either; it could even be random. Thus, a hex representation or a hash would just be the simplest way to generate one, without having to worry about which characters are used.

You could always store somewhere in the settings which username is active by filename (though I guess this could get complicated in itself).

Simon · February 05, 2015, 06:34:21 PM

Currently using A: filename is the key, file doesn't contain the key, only the associated values
You propose B: filename arbitrary, game searches all files in dir and looks for the key inside, then reads values from the same file

A opens fewer files than B. I still don't see a benefit of B over A.

Do you somehow feel better if the key is stored in the file? No matter what method is chosen, we need some hashing/mangling/randomizing to generate a filename.

-- Simon

ccexplore · February 05, 2015, 09:10:26 PM

Wow, didn't expect so much interest/passion in allowing Unicode usernames.

Two quick things to note:

1) It's nice to try to keep the filename as unmangled from the username as feasible. On the other hand, it's not that big a deal even if we don't do so, since it is rare for the user to need to manually edit the profile, and it's also rare for most users to have multiple profiles within the same Lix installation (thus the file would be located relatively easily in the rare times the user needs to locate it, even if it winds up with an ugly filename).

2) Compatibility can also mean the ability to look for the unmangled username as filename as a fallback (i.e., look for the name we would be using in the old version) when a filename with the "mangled username" cannot be found. In other words, we can recognize the filename we'd use in both old and new versions, while of course we always save using the new version's scheme. This would effectively rename the user's profile for them. It means we aren't necessarily constrained by compatibility to keep all existing ASCII-only usernames unmangled, at a cost of slightly more code (but probably hardly more than what we'd have to write to do name mangling or hashing anyway?). This point especially applies if we decide to perform mangling on some ASCII-only usernames in order to avoid other potentially problematic characters (which I do think is a nice idea).
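The fallback in point 2 might look roughly like the sketch below. The function name, its parameters, and the injected file_exists callback are all assumptions for illustration; the real code would use whatever file access Lix already has.

```cpp
#include <functional>
#include <string>

// Hypothetical lookup: prefer the filename produced by the new mangling
// scheme, and fall back to the name an old version would have used.
// Returns an empty string when neither file exists. Saving would always
// use mangled_name, which effectively renames an old profile.
std::string find_user_file(
    const std::string& mangled_name,
    const std::string& legacy_name,
    const std::function<bool(const std::string&)>& file_exists)
{
    if (file_exists(mangled_name)) return mangled_name;
    if (file_exists(legacy_name))  return legacy_name;
    return "";
}
```

Taking file_exists as a parameter keeps the sketch independent of any particular filesystem API.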

As a sidenote, I likely won't start working on this until the weekend at the earliest. Of course, this being open source, someone else can always beat me to the punch and implement something themselves.

Simon · February 06, 2015, 11:08:13 AM

ccx's and my consensus is to preserve ASCII in filenames, and escape where appropriate. Design decisions so far:

Non-escaped chars are A-Z a-z 0-9, and dash -, space, and the single-quote '. Reasoning: These chars may have come up in people's names so far. Counterarguments: single-quote is probably extremely rare in names, and has meaning in shells.
Escape character (prefixing each escape sequence) is underscore _. Reasoning: Seems better than percent, because percent might have meaning in the Windows shell. Counterarguments: Percent is less likely than underscore to appear in nicknames, and percent is already used to escape unicode in URLs.
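Put together, these decisions could be sketched as follows. This is not the code on the unicode branch; the function names, the lowercase hex digits, and the handling of malformed input are assumptions. The escape layout is the little-endian one discussed earlier in the thread.

```cpp
#include <cstddef>
#include <string>

// Characters kept as-is: A-Z a-z 0-9, dash, space, single-quote.
bool is_unescaped(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
        || (c >= '0' && c <= '9') || c == '-' || c == ' ' || c == '\'';
}

// Length of the UTF-8 sequence announced by a lead byte. Anything that is
// not a multi-byte lead counts as 1, so stray bytes are escaped one by one.
std::size_t sequence_length(unsigned char lead)
{
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

// Hypothetical mangler: copy safe characters, replace every other character
// by '_' plus the hex of its UTF-8 bytes, last byte first.
std::string mangle_username(const std::string& name)
{
    static const char hex[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < name.size(); ) {
        const unsigned char c = static_cast<unsigned char>(name[i]);
        if (is_unescaped(c)) {
            out += name[i];
            ++i;
            continue;
        }
        std::size_t len = sequence_length(c);
        if (i + len > name.size()) len = name.size() - i; // truncated input
        out += '_';
        for (std::size_t k = len; k-- > 0; ) {
            const unsigned char b = static_cast<unsigned char>(name[i + k]);
            out += hex[b >> 4];
            out += hex[b & 0x0F];
        }
        i += len;
    }
    return out;
}
```

Names that use only the non-escaped characters come out unchanged, which keeps existing ASCII profiles compatible. The underscore itself is escaped (as _5f), so it never appears unescaped in a mangled name.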

geoo · February 06, 2015, 11:56:40 AM

Just for the record, according to this, here are all the other characters that you don't strictly have to escape, and that you could use as escape characters instead of underscore:

   ^   Accent circumflex (caret)
   &   Ampersand
   '   Apostrophe (single quotation mark)
   @   At sign
   {   Brace left
   }   Brace right
   [   Bracket opening
   ]   Bracket closing
   ,   Comma
   $   Dollar sign
   =   Equal sign
   !   Exclamation point
   -   Hyphen
   #   Number sign
   (   Parenthesis opening
   )   Parenthesis closing
   %   Percent
   .   Period
   +   Plus
   ~   Tilde
   _   Underscore

GigaLem · February 15, 2015, 03:59:42 PM

I was thinking of putting the word "lix" into Google Translate. I started with Japanese, but it just comes out as LIX.

namida · February 15, 2015, 04:04:19 PM

Assuming you didn't want to go with a custom "Japanese-ized" name, the transliteration would be リックス "rikkusu". Of course, that's assuming "Lix" is the singular form (I remember there was a debate a while back about singular and plural forms of "Lix", but I don't remember what the consensus was).

If "Lix" is the singular form, but you wanted to invoke the idea of plurality in the name, you could (though by no means have to) add たち "tachi" at the end.

Simon · April 02, 2015, 12:50:14 AM

I've let this feature slip through the cracks. >_>; It's great contributed code by ccx that's been unreleased for two months now. It deserves use.

ccx, I've pushed a minor change to branch unicode, to not-replace in user-filenames: space, dash, apostrophe.

With that done, what's the big picture: What does our code do, what should yet be implemented before release? Should we do the dictionary for entire new languages? (e.g. rewrite language.cpp to use std::map <string, string>, and parse a user-supplied language file)

-- Simon

ccexplore · April 02, 2015, 01:49:27 AM

Quote from: Simon on April 02, 2015, 12:50:14 AM
With that done, what's the big picture: What does our code do, what should yet be implemented before release? Should we do the dictionary for entire new languages? (e.g. rewrite language.cpp to use std::map <string, string>, and parse a user-supplied language file)

It's been a while. If I recall, the one big thing left that hasn't been fully fleshed out design-wise is the handling of translation of things like tutorial level hints/instructions, and miscellaneous level-related things like _English.txt and so forth. I mean, technically we can do nothing for those and rely on the user manually replacing the files involved (scattered in various directories and subdirectories) in their own installation with their own set of translated files, though I don't think that counts as a solution (well, at least probably not a good solution).

The way I handled the translation for the game's own text, there is effectively already a std::map and a user-supplied language file, I just had a layer of macros so that the existing source code can continue to just reference global variables and still get the translations (hint: the map is actually <string, pointer to the global variables>). It is admittedly slightly hacky so in principle, a proper rewrite is probably "better". On the other hand if you're porting to D anyway at some point, I think we can live with slightly hacky for a while given the eventual demise of the current C++ port.
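A rough illustration of the map-of-pointers approach, for readers who haven't seen the patch. This is not the actual macro layer; the macro names, the two example strings, and apply_translation are invented for the sketch.

```cpp
#include <map>
#include <string>

// One list of translatable strings: variable name plus English default.
// (Hypothetical entries; the real list lives in the game's language code.)
#define TRANSLATABLE_STRINGS(X) \
    X(menu_play, "Play")        \
    X(menu_exit, "Exit")

// Expand the list into ordinary global variables, so existing code can
// keep referring to menu_play and menu_exit directly.
#define DEFINE_GLOBAL(name, english) std::string name = english;
TRANSLATABLE_STRINGS(DEFINE_GLOBAL)
#undef DEFINE_GLOBAL

// Expand the same list into a map from key to the global it controls.
std::map<std::string, std::string*>& translation_targets()
{
#define REGISTER(name, english) {#name, &name},
    static std::map<std::string, std::string*> targets = {
        TRANSLATABLE_STRINGS(REGISTER)
    };
#undef REGISTER
    return targets;
}

// Overwrite the global registered under key. Unknown keys are ignored;
// the return value says whether anything was changed.
bool apply_translation(const std::string& key, const std::string& text)
{
    const auto it = translation_targets().find(key);
    if (it == translation_targets().end()) return false;
    *it->second = text;
    return true;
}
```

A loader would call apply_translation once per entry of the language file, and the rest of the code keeps reading the globals unchanged.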

One other sidenote: I discovered to my dismay that the Windows port of A4 has some bugs that affect this effort. One in particular is that despite making the A4 APIs Unicode-aware, inexplicably in their internal code for keyboard handling, A4 wound up using an ASCII version of an underlying OS function even though Windows has a Unicode equivalent for that. The end result is that it looks like Unicode characters above codepoint value 255 cannot be typed in Lix on Windows even when you are using a keyboard layout in Windows that can generate such characters.

[And yes, some European languages do need that, like Hungarian, just not the "common" ones like Spanish/French/German etc.] I'm tempted to chalk this one up for now to "wait for the D port" hoping A5 doesn't have the same stupid bug in its Windows port.

Simon · April 02, 2015, 02:14:15 AM

Thanks for the quick reply!

i18n for level titles, hints, level dir descriptions: Yes, I didn't want to commit to a solution yet. This is a nagging item on the agenda, and it deserves a good solution. I'd be willing to release the implemented unicode features without a solution to it. We have (user name -> user filename) mangling, and translatable strings in the GUI with exactly one user language.

The diligent user can submit his language file, so I can add its contents to the code.

Hacky solution with #define and std::map <string, *string>: It's very much adequate. Especially as a patch, where easy merging is a nice to have. With the D/A5 port underway, it's fine to keep it like this.

Bug in A4: Yes, I'm okay with deferring it to the D/A5 port, even if that won't be ready for a rather long time.

-- Simon