Unicode Concerns

From Official Kodi Wiki
Revision as of 02:54, 4 January 2022 by Fbacher

This page addresses Unicode issues beyond simply enabling your addon or Kodi runtime for basic Unicode support. Some of the issues are:

Side effects of Kodi issue 19883: Turkish locale set via LANG / LC_CTYPE / LC_ALL or by SetGlobalLocale breaks skin loading on any Linux. (github.com/xbmc/xbmc/issues/19883)

Handling text from multiple locales (e.g. complications with foreign movies or actors)

How to handle internal programming keywords along with other text.

Quirks of some languages

Searching multilingual text

Unicode file names

Unicode in XML

Unicode file names in a zip file

JSON or XML processing of Unicode with keywords

Details:

Side effects of Kodi issue 19883

Kodi is unusable with the locale set to tr_TR (Turkish). A quick fix was made, but it has some undesirable side effects on Python addons:

The default locale cannot be read, making it impossible to determine the country code (for example, 'US').

The encoding for file names is ASCII instead of UTF-8, so trying to open a file with a non-ASCII name raises an exception.
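An addon can guard against both side effects defensively. The following is a minimal sketch, not Kodi's own API: the helper names (country_code_from_locale, current_country_code, fs_encoding_is_ascii) and the 'US' fallback are hypothetical. Only locale.getlocale and sys.getfilesystemencoding are standard library calls.

```python
import locale
import sys


def country_code_from_locale(locale_name, default="US"):
    """Extract the country part of a name such as 'tr_TR' or 'en_GB.UTF-8'.

    Returns `default` (an assumed fallback) when the name is missing or
    has no two-letter country part.
    """
    if not locale_name:
        return default
    # Drop an encoding suffix ('.UTF-8') and a modifier ('@euro').
    base = locale_name.split(".")[0].split("@")[0]
    parts = base.split("_")
    if len(parts) == 2 and len(parts[1]) == 2 and parts[1].isalpha():
        return parts[1].upper()
    return default


def current_country_code(default="US"):
    """Read the process locale, falling back to `default` on any failure."""
    try:
        name = locale.getlocale()[0]
    except (ValueError, TypeError):
        name = None
    return country_code_from_locale(name, default)


def fs_encoding_is_ascii():
    """True when Python reports an ASCII-only file name encoding."""
    encoding = (sys.getfilesystemencoding() or "").lower()
    return encoding in ("ascii", "us-ascii", "ansi_x3.4-1968", "646")
```

An addon could call fs_encoding_is_ascii() at startup and log a warning before it attempts to open files with non-ASCII names.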

On closer inspection, the Kodi C++ application has several underlying problems. As is common with many programs, keywords are case-insensitive and the code normalizes all keywords to lower case. There are two flaws in this approach:

ToLower works with most, but not all, locales and character sets, since in almost all languages ToUpper(ToLower(char)) == char for an upper-case character and the mapping is the same everywhere. Turkish, however, has several characters that do not obey this rule, and some of them are shared with English. The letter 'i' is one of them: under Turkish rules, upper-case 'I' lowers to dotless 'ı' (U+0131), and lower-case 'i' uppers to dotted 'İ' (U+0130).

Kodi used the same locale for processing external text as for internal keywords. Under a Turkish locale, any keyword containing an 'I' went unrecognized because ToLower changed the 'I' to the Turkish dotless lower-case 'ı'.
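The keyword failure can be reproduced in a few lines. Python 3's str.lower() does not consult the process locale, so this sketch imitates a Turkish-locale ToLower with a hypothetical helper, simulate_turkish_tolower. The keyword "VISIBLE" is only an illustrative example, not taken from the Kodi source.

```python
DOTLESS_SMALL_I = "\u0131"   # ı  LATIN SMALL LETTER DOTLESS I
DOTTED_CAPITAL_I = "\u0130"  # İ  LATIN CAPITAL LETTER I WITH DOT ABOVE


def simulate_turkish_tolower(text):
    """Imitate a Turkish-locale ToLower for the two letters discussed above.

    'İ' becomes plain 'i' and 'I' becomes dotless 'ı'; every other
    character uses Python's default (locale-independent) mapping.
    """
    text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()


keyword = "VISIBLE"  # hypothetical internal keyword

# Default rules: the keyword is recognized.
matches_default = keyword.lower() == "visible"

# Turkish rules: the result is 'vısıble', which no longer matches.
matches_turkish = simulate_turkish_tolower(keyword) == "visible"

# With default Unicode rules, dotless 'ı' does not survive a round trip:
# 'ı' -> 'I' -> 'i'.
round_trip_ok = DOTLESS_SMALL_I.upper().lower() == DOTLESS_SMALL_I
```

Here matches_default is True, while matches_turkish and round_trip_ok are both False.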

In addition, ToLower and ToUpper modify the passed string in place. However, Unicode does not guarantee that the number of bytes, or the number of Unicode characters, stays the same after case conversion. This means that junk can be left at the end of the string array, or memory past the end of the array can be overwritten.
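The size change is easy to observe. This short sketch uses Python's built-in case mapping to measure a string before and after conversion; the helper name sizes is hypothetical.

```python
def sizes(text):
    """Return (number of code points, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


sharp_s = "\u00df"           # ß  LATIN SMALL LETTER SHARP S
dotted_capital_i = "\u0130"  # İ  LATIN CAPITAL LETTER I WITH DOT ABOVE

# 'ß' upper-cases to 'SS': one code point becomes two.
sharp_s_before = sizes(sharp_s)          # (1, 2)
sharp_s_after = sizes(sharp_s.upper())   # (2, 2)

# 'İ' lower-cases to 'i' plus U+0307 (combining dot above):
# both the code point count and the byte count grow.
dotted_before = sizes(dotted_capital_i)         # (1, 2)
dotted_after = sizes(dotted_capital_i.lower())  # (2, 3)

# The Turkish pair 'I' / 'ı' differs in UTF-8 length as well.
capital_i = sizes("I")       # (1, 1)
dotless_i = sizes("\u0131")  # (1, 2)
```

An in-place conversion into a buffer sized for the original string cannot hold the results for 'ß' or 'İ' without truncation or overflow.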

The proper solution is to pass the locale to the ToLower/ToUpper methods. Callers handling keywords should use a locale that is known to be stable for all of the keywords (such as en_GB or en_US, since the code was originally developed using those locales). Callers processing ordinary text, for example Turkish text, should pass the locale of that text.
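The idea of an explicit locale argument can be sketched in Python. This is an illustration of the design, not Kodi's C++ implementation: to_lower and keyword_lower are hypothetical names, and only the Turkish and Azerbaijani dotted/dotless 'i' rules are special-cased, because the Python standard library has no locale-aware case mapping.

```python
import string

# ASCII-only mapping for internal keywords: independent of any locale.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

# Languages that use the dotted/dotless 'i' case rules.
_TURKIC_LANGUAGES = {"tr", "az"}


def to_lower(text, locale_name):
    """Lower-case ordinary text using the rules of the given locale.

    `locale_name` is a name such as 'tr_TR' or 'en_US'; the caller
    always states which rules apply.
    """
    language = locale_name.replace("-", "_").split("_")[0].lower()
    if language in _TURKIC_LANGUAGES:
        # 'İ' -> 'i' and 'I' -> 'ı' before the default mapping runs.
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()


def keyword_lower(keyword):
    """Lower-case an internal keyword, touching only ASCII A-Z."""
    return keyword.translate(_ASCII_LOWER)
```

With this split, keyword_lower("VISIBLE") yields "visible" regardless of the user's locale, while to_lower("IŞIK", "tr_TR") yields "ışık" as a Turkish reader would expect.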

Setting names in Settings.xml

Although not widely documented, setting names are considered case-insensitive and are all mapped to lower case, just as the keywords above are. The same problems exist, and addon programmers may be unaware of this. In addition to the Kodi back end treating setting names as keywords, addon code (Python or other) must do the same when comparing or manipulating these setting names: the addon must do any lower/upper conversion using a safe locale (e.g. en_GB).
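On the addon side, one conservative way to follow this advice is to fold setting names with an ASCII-only mapping before comparing them. The sketch below is an assumption about how an addon might do it; normalize_setting_id and find_setting are hypothetical helpers, and the setting names are made up. Python 3's str.lower() ignores the process locale but applies full Unicode case mapping, so the sketch restricts itself to A-Z.

```python
import string

_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def normalize_setting_id(setting_id):
    """Fold a setting name to lower case, changing only ASCII A-Z."""
    return setting_id.translate(_ASCII_LOWER)


def find_setting(settings, wanted_id, default=None):
    """Look up `wanted_id` in a dict of settings, ignoring ASCII case."""
    wanted = normalize_setting_id(wanted_id)
    for key, value in settings.items():
        if normalize_setting_id(key) == wanted:
            return value
    return default


# Hypothetical settings, as an addon might hold them in memory.
settings = {"Show_Trailers": "true", "cache_limit": "50"}
trailers = find_setting(settings, "show_trailers")
limit = find_setting(settings, "CACHE_LIMIT")
```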

Searching Text

Kodi provides the ability to search for a movie title within the current playlist by simply (quickly) typing the first few characters [in upper case] of the title. This won't work properly for several reasons:

1- The ToLower/ToUpper problem described above.

2- Character collation rules vary between locales. If the user is searching for a foreign-language movie title, it may not be processed or sorted as expected. [This problem may not exist]

There are several approaches to solving the second problem. One is to do a caseless, accentless comparison. It looks like SQLite and MySQL both support these. Extra columns or tables (holding caseless, accentless copies of the search data) may be needed to improve performance.
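The extra-column approach can be sketched with Python's built-in sqlite3 and unicodedata modules. This is an illustration under stated assumptions, not Kodi's database schema: the table movie, the column title_folded, and the helpers fold and search_prefix are all hypothetical. The folding is done in Python rather than with a database collation.

```python
import sqlite3
import unicodedata


def fold(text):
    """Return a caseless, accentless copy of `text` for comparisons."""
    # Decompose characters, drop combining marks, then case-fold.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def search_prefix(conn, typed):
    """Return the titles whose folded form starts with the folded input."""
    prefix = fold(typed)
    rows = conn.execute(
        "SELECT title FROM movie "
        "WHERE substr(title_folded, 1, ?) = ? "
        "ORDER BY title_folded",
        (len(prefix), prefix),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (title TEXT, title_folded TEXT)")
for title in ["Amélie", "Das Boot", "Éloge de l'amour", "İstanbul Kırmızısı"]:
    # Store the folded copy next to the original title.
    conn.execute("INSERT INTO movie VALUES (?, ?)", (title, fold(title)))

hits = search_prefix(conn, "EL")
```

After this runs, hits contains only "Éloge de l'amour", even though the user typed unaccented upper-case letters. Note that this folding removes accents but does not map dotless 'ı' to 'i', because 'ı' has no Unicode decomposition; a search feature that needs such equivalences would have to add them explicitly.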