Full featured UTF8 Library
Nexii Malthus
The utf8 library really should be full featured, with methods equivalent to those found in the string library, even where the string method already works with UTF-8 strings.
The point is that, as a scripter, you can trust these methods to work correctly with UTF-8 strings without having to tiptoe through the reference docs checking which ones might or might not carry caveats, such as string position indices being based on bytes.
The first implementation would be the easy low-hanging fruit: bytecode equivalents. These are methods in the string library that already work fine with UTF-8 strings and would simply generate the same bytecode.
Second would be looking at the existing ll* functions and porting their code to the SLua bindings, for example llToLower / llToUpper.
Third would be writing brand new methods.
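To make the third bucket concrete, here is a rough sketch of what a new case-mapping method could look like, built on the stock utf8 primitives. This is illustrative only: the mapping table below is a tiny hand-picked sample, whereas a real utf8_lower would need the full Unicode case tables.

```lua
-- Hypothetical sketch of a utf8-aware lowercasing method.
-- A real implementation needs the complete Unicode case-mapping
-- database; this sample table only covers a few Latin-1 codepoints.
local sample_lower = {
  [0xC0] = 0xE0, -- À -> à
  [0xC9] = 0xE9, -- É -> é
  [0xD6] = 0xF6, -- Ö -> ö
}

local function utf8_lower(s)
  local out = {}
  for _, cp in utf8.codes(s) do
    if cp >= 0x41 and cp <= 0x5A then      -- ASCII A-Z
      cp = cp + 32
    elseif sample_lower[cp] then           -- sample non-ASCII mappings
      cp = sample_lower[cp]
    end
    out[#out + 1] = utf8.char(cp)
  end
  return table.concat(out)
end

print(utf8_lower("HÉLLO"))  -- héllo
```

The shape is the interesting part: iterate codepoints, not bytes, and rebuild the string, so multibyte characters never get corrupted mid-sequence.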
The big question is whether this should be a larger coordinated effort upstream in Luau, given that Roblox has all the same problems. That also raises the question of whether these methods might be added even further upstream in Lua 5.6+ in the future, or whether Luau and even SLua will continue to diverge.
There are often threads in the Roblox community about UTF-8 compatibility issues caused by user input such as names and chat. SL has a large international presence, official language support, and an existing legacy of Unicode scripting support, which makes this problem significantly more critical for SLua.
Tapple Gao
I agree with you that Roblox needs a utf8.upper and utf8.lower. But I disagree in that I don't think it belongs in base Luau, because it would require a Unicode database. For SL, ll.ToLower isn't going anywhere.
Nexii Malthus
Tapple Gao Is it a bad thing for Luau to require a Unicode database when both Roblox and SL kind of, you know, require one for their international communities? Doesn't that make the endeavour to avoid it pointless?
Tapple Gao
Nexii Malthus Well, the proper place to propose it is https://github.com/luau-lang/rfcs, or I don't know where for base Lua. It would be an uphill battle getting it into Lua or Luau since, in most Lua embeddings other than SLua, it's basically free to add a third-party Unicode lib if you need it. It looks like the stdlib utf8 module was based on this formerly third-party module, with the parts requiring a Unicode database stripped out: https://github.com/starwing/luautf8
Tapple Gao
Doing further research, the earliest discussions of what became the stdlib utf8 library are in this thread in the mailing list, based on someone's keynote presentation:
These design goals were already present way back then:
- Not including a unicode database
- Sticking with UTF-8 byte offsets as the native address, rather than a new codepoint-oriented data structure (like python3 has)
- Providing only the primitives to build better unicode libraries outside the stdlib
Tapple Gao
It should really be a new library if it made such a big divergence as changing the meaning of indices. string and utf8 are currently fully consistent in always using byte offsets.
Tapple Gao
what do other languages' string libraries do?
- python3 uses codepoint indices (codepoints are 1 index wide), because strings are not stored in utf8, but in a constant-width encoding chosen by the highest codepoint present (1-4 bytes per codepoint). This was a huge backwards-compatibility break with python2, and took about 8 years to settle
- javascript uses UTF16 offsets (codepoints are 1-2 indices wide)
- rust uses utf8 byte offsets, just like luau (codepoints are 1-4 indices wide)
- luau uses utf8 byte offsets
- lsl uses codepoint indices. I don't know if it does this in constant time like python, or linear time by scanning (I assume the latter)
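For concreteness, this is what the byte-offset scheme looks like in Luau/Lua with the stock utf8 primitives; note how #s, utf8.len, and utf8.offset relate on a string containing a two-byte character:

```lua
local s = "héllo"          -- 'é' is two bytes in UTF-8
print(#s)                  -- 6 bytes
print(utf8.len(s))         -- 5 codepoints
print(utf8.offset(s, 3))   -- 4: the third codepoint ('l') starts at byte 4
-- Byte offsets from utf8.offset feed straight into string.sub:
print(s:sub(utf8.offset(s, 2), utf8.offset(s, 3) - 1))  -- é
```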
I'm curious if people hate byte offsets because
- lsl doesn't use them, or because
- their favorite language doesn't use them
I personally don't understand why people are so hung up on the indexing scheme. split/concat/replace don't use indices. Find does, but, as long as you are consistent, the details of the index are irrelevant.
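As a small illustration of that consistency point, string.find returns byte offsets on a UTF-8 string, and those offsets round-trip cleanly through string.sub:

```lua
local s = "héllo wörld"
-- find returns byte offsets; feeding them back into sub is always safe,
-- even though 'é' and 'ö' each occupy two bytes
local i, j = s:find("wörld", 1, true)  -- plain-text search
print(i, j)          -- 8  13
print(s:sub(i, j))   -- wörld
```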
The only application I can think of where indices matter is a COBOL-style fixed-field database. And I hope nobody made one of those in lsl. And even if you did, byte offsets are better than codepoints for that. (hello buffer and string.unpack)
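A fixed-field record in that style might look like the following sketch, using standard string.pack / string.unpack (the field layout here is invented for illustration):

```lua
-- A fixed 8-byte name field plus a 4-byte unsigned integer, COBOL style.
-- string.pack pads a short string with zero bytes up to the declared size.
local record = string.pack("c8I4", "Ana", 30)
print(#record)  -- 12

local name, age = string.unpack("c8I4", record)
name = name:gsub("%z+$", "")  -- strip the zero padding
print(name, age)  -- Ana  30
```

Because the format string declares field widths in bytes, byte offsets are the natural addressing unit here; codepoint indices would not line up with the record layout at all.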
Nexii Malthus
Tapple Gao The indexing scheme matters. For example, on Roblox there was a problem where someone had CJK characters or a smiley in their name, while the user had implemented a UI that enforced a max length for rendering the name in a tight space; this caused international names to be cut off in unfortunate ways.
Tapple Gao
Nexii Malthus Byte offsets do make mid-character truncation possible, but it's easy to avoid by truncating s to one of these:
- s:sub(1, 20) -- exactly 20 bytes, even if it cuts a codepoint in half
- s:sub(1, utf8.offset(s, 21) - 1) -- exactly the first 20 codepoints (utf8.offset returns nil when there are fewer than 20)
- s:sub(1, utf8.offset(s, 0, 21) - 1) -- however many full codepoints fit in 20 bytes (assuming #s >= 20)
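The byte-limited and codepoint-limited variants can be wrapped in small helpers that also handle strings already shorter than the limit; a sketch (the function names are mine):

```lua
-- Truncate s to at most max_bytes bytes without splitting a codepoint.
local function truncate_utf8(s, max_bytes)
  if #s <= max_bytes then return s end
  -- Start of the codepoint containing byte max_bytes + 1; everything
  -- before that boundary is a whole number of codepoints.
  local boundary = utf8.offset(s, 0, max_bytes + 1)
  return s:sub(1, boundary - 1)
end

-- Truncate s to at most max_chars codepoints.
local function truncate_chars(s, max_chars)
  local p = utf8.offset(s, max_chars + 1)  -- nil if already short enough
  return p and s:sub(1, p - 1) or s
end

print(truncate_utf8("héllo", 2))   -- h   (keeping byte 2 would split 'é')
print(truncate_chars("héllo", 3))  -- hél
```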