Full featured UTF8 Library
Nexii Malthus
The utf8 library really should be full featured, with methods equivalent to those found in the string library, even where the string method already works with UTF-8 strings.
The point is that, as a scripter, you can trust these methods to work correctly with UTF-8 strings without having to tiptoe through the reference docs checking which ones might or might not carry caveats, such as string position indices being based on bytes.
The first implementation would be the easy low-hanging fruit: bytecode equivalents. These are methods in the string library that already work fine with UTF-8 strings and would simply generate the same bytecode.
Second would be looking at the existing ll* functions and porting their code to the SLua bindings, for example llToLower / llToUpper.
Third would be writing brand new methods.
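To make the third bucket concrete, here is a rough sketch of what a new case-mapping method could look like, built on the stock utf8 primitives. This is illustrative only: the mapping table below is a tiny hand-picked sample, whereas a real utf8_lower would need the full Unicode case tables.

```lua
-- Hypothetical sketch of a utf8-aware lowercasing method.
-- A real implementation needs the complete Unicode case-mapping
-- database; this sample table only covers a few Latin-1 codepoints.
local sample_lower = {
  [0xC0] = 0xE0, -- À -> à
  [0xC9] = 0xE9, -- É -> é
  [0xD6] = 0xF6, -- Ö -> ö
}

local function utf8_lower(s)
  local out = {}
  for _, cp in utf8.codes(s) do
    if cp >= 0x41 and cp <= 0x5A then      -- ASCII A-Z
      cp = cp + 32
    elseif sample_lower[cp] then           -- sample non-ASCII mappings
      cp = sample_lower[cp]
    end
    out[#out + 1] = utf8.char(cp)
  end
  return table.concat(out)
end

print(utf8_lower("HÉLLO"))  -- héllo
```

The shape is the interesting part: iterate codepoints, not bytes, and rebuild the string, so multibyte characters never get corrupted mid-sequence.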
The big question is whether this should be a larger coordinated effort upstream in Luau, given that Roblox has all the same problems. That also raises the question of whether these methods might be added even further upstream in Lua 5.6+ in the future, or whether Luau and even SLua will continue to diverge.
There are often threads in the Roblox community about UTF-8 compatibility issues caused by user input such as names and chat. SL has a large international presence, official language support, and an existing legacy of Unicode scripting support, which makes this problem significantly more critical for SLua.
Tapple Gao
I agree with you that Roblox needs a utf8.upper and utf8.lower. But I disagree in that I don't think it belongs in base Luau, because it would require a Unicode database. For SL, ll.ToLower isn't going anywhere.
Nexii Malthus
Tapple Gao Is it a bad thing for Luau to require a Unicode database when both Roblox and SL kind of, you know, require one for their international communities? Doesn't that make the endeavour to avoid it pointless?
Tapple Gao
Nexii Malthus Well, the proper place to propose it is https://github.com/luau-lang/rfcs, or I don't know where for base Lua. It would be an uphill battle getting it into Lua or Luau since, in most Lua embeddings other than SLua, it's basically free to add a third-party Unicode lib if you need it. It looks like the stdlib utf8 module was based on this formerly third-party module, with the parts requiring a Unicode database stripped out: https://github.com/starwing/luautf8
Tapple Gao
Doing further research, the earliest discussions of what became the stdlib utf8 library are in this thread in the mailing list, based on someone's keynote presentation:
These design goals were already present way back then:
- Not including a unicode database
- Sticking with UTF-8 byte offsets as the native address, rather than a new codepoint-oriented data structure (like python3 has)
- Providing only the primitives to build better unicode libraries outside the stdlib
Tapple Gao
It should really be a new library if it made such a big divergence as changing the meaning of indices. string and utf8 are currently fully consistent in always using byte offsets.
Tapple Gao
what do other languages' string libraries do?
- python3 uses codepoint indices (codepoints are 1 index wide), because strings are not stored in utf8, but in a constant-width encoding chosen by the highest codepoint present (1-4 bytes per codepoint). This was a huge backwards-compatibility break with python2, and took about 8 years to settle
- javascript uses UTF16 offsets (codepoints are 1-2 indices wide)
- rust uses utf8 byte offsets, just like luau (codepoints are 1-4 indices wide)
- luau uses utf8 byte offsets
- lsl uses codepoint indices. I don't know if it does this in constant time like python, or linear time by scanning (I assume the latter)
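For concreteness, this is what the byte-offset scheme looks like in Luau/Lua with the stock utf8 primitives; note how #s, utf8.len, and utf8.offset relate on a string containing a two-byte character:

```lua
local s = "héllo"          -- 'é' is two bytes in UTF-8
print(#s)                  -- 6 bytes
print(utf8.len(s))         -- 5 codepoints
print(utf8.offset(s, 3))   -- 4: the third codepoint ('l') starts at byte 4
-- Byte offsets from utf8.offset feed straight into string.sub:
print(s:sub(utf8.offset(s, 2), utf8.offset(s, 3) - 1))  -- é
```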
I'm curious if people hate byte offsets because
- lsl doesn't use them, or because
- their favorite language doesn't use them
I personally don't understand why people are so hung up on the indexing scheme. split/concat/replace don't use indices. Find does, but, as long as you are consistent, the details of the index are irrelevant.
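As a small illustration of that consistency point, string.find returns byte offsets on a UTF-8 string, and those offsets round-trip cleanly through string.sub:

```lua
local s = "héllo wörld"
-- find returns byte offsets; feeding them back into sub is always safe,
-- even though 'é' and 'ö' each occupy two bytes
local i, j = s:find("wörld", 1, true)  -- plain-text search
print(i, j)          -- 8  13
print(s:sub(i, j))   -- wörld
```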
The only application I can think of where indices matter is a COBOL-style fixed-field database. And I hope nobody made one of those in lsl. And even if you did, byte offsets are better than codepoints for that. (hello buffer and string.unpack)
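A fixed-field record in that style might look like the following sketch, using standard string.pack / string.unpack (the field layout here is invented for illustration):

```lua
-- A fixed 8-byte name field plus a 4-byte unsigned integer, COBOL style.
-- string.pack pads a short string with zero bytes up to the declared size.
local record = string.pack("c8I4", "Ana", 30)
print(#record)  -- 12

local name, age = string.unpack("c8I4", record)
name = name:gsub("%z+$", "")  -- strip the zero padding
print(name, age)  -- Ana  30
```

Because the format string declares field widths in bytes, byte offsets are the natural addressing unit here; codepoint indices would not line up with the record layout at all.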
Nexii Malthus
Tapple Gao The indexing scheme matters. For example, on Roblox there was a problem where someone had CJK characters or a smiley in their name, while the user had implemented a UI that enforced a max length for rendering the name in a tight space; this caused international names to be cut off in unfortunate ways.
Tapple Gao
Nexii Malthus Byte offsets do make mid-character truncation possible, but it's easy to avoid by truncating s to one of these:
- s:sub(1, 20) -- exactly 20 bytes, even if it cuts a codepoint in half
- s:sub(1, utf8.offset(s, 21) - 1) -- exactly the first 20 codepoints (utf8.offset returns nil when there are fewer than 20)
- s:sub(1, utf8.offset(s, 0, 21) - 1) -- however many full codepoints fit in 20 bytes (assuming #s >= 20)
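The byte-limited and codepoint-limited variants can be wrapped in small helpers that also handle strings already shorter than the limit; a sketch (the function names are mine):

```lua
-- Truncate s to at most max_bytes bytes without splitting a codepoint.
local function truncate_utf8(s, max_bytes)
  if #s <= max_bytes then return s end
  -- Start of the codepoint containing byte max_bytes + 1; everything
  -- before that boundary is a whole number of codepoints.
  local boundary = utf8.offset(s, 0, max_bytes + 1)
  return s:sub(1, boundary - 1)
end

-- Truncate s to at most max_chars codepoints.
local function truncate_chars(s, max_chars)
  local p = utf8.offset(s, max_chars + 1)  -- nil if already short enough
  return p and s:sub(1, p - 1) or s
end

print(truncate_utf8("héllo", 2))   -- h   (keeping byte 2 would split 'é')
print(truncate_chars("héllo", 3))  -- hél
```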