Valentina Kernel Locale Settings
About ICU library
Valentina uses ICU (International Components for Unicode), the Unicode engine developed by IBM. It is one of the most widely used Unicode libraries. Thanks to ICU, Valentina is able to work with UTF-8, UTF-16 and 200+ other encodings. On the ICU home page, http://ibm.com/software/globalization/icu, you will find detailed information and online demos where you can experiment with different settings of the major ICU methods.
Note that Apple Inc. uses the ICU library inside Mac OS X.
Valentina Kernel Encodings
Internally, the Valentina engine always works with the UTF-16 format.
When talking about encodings, we should distinguish:
- Storage Encoding - specifies the encoding in which text is stored in the disk files.
- IO Encoding - specifies the encoding of the text that you pass to, or get back from, the Valentina engine through any method.
For example, one user gets and sets strings on a PC using Cyrillic Win, while another user gets and sets strings on a Mac using Cyrillic Mac. Valentina converts the strings into UTF-16 for internal processing and uses, e.g., UTF-16 to store the strings on disk.
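The conversion described above can be illustrated with a minimal Python sketch. It does not use the Valentina API: the function names are hypothetical, and Python's `cp1251` and `mac_cyrillic` codecs are assumed here to stand in for Cyrillic Win and Cyrillic Mac.

```python
# Illustrative model only: IO Encoding -> UTF-16 -> IO Encoding.
# This is not Valentina code; the function names are hypothetical.

def to_internal(raw: bytes, io_encoding: str) -> bytes:
    """Decode client bytes from their IO encoding and re-encode them as UTF-16."""
    return raw.decode(io_encoding).encode("utf-16-le")


def from_internal(internal: bytes, io_encoding: str) -> bytes:
    """Convert the internal UTF-16 form back into a client's IO encoding."""
    return internal.decode("utf-16-le").encode(io_encoding)


text = "Привет"
from_windows = text.encode("cp1251")      # bytes sent by the PC client
from_mac = text.encode("mac_cyrillic")    # bytes sent by the Mac client

stored_win = to_internal(from_windows, "cp1251")
stored_mac = to_internal(from_mac, "mac_cyrillic")

# The raw client bytes differ, but the internal UTF-16 form is the same,
# so either client can read back what the other one wrote.
print("client bytes differ:", from_windows != from_mac)
print("internal form equal:", stored_win == stored_mac)
print("PC data read on Mac:", from_internal(stored_win, "mac_cyrillic") == from_mac)
```

The point of the sketch is that each client only has to name its own encoding; the common internal form makes the data exchangeable between platforms.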
When working with Valentina Server, each client must specify on connection which IO Encoding should be used for strings.
Locale Properties
Each VDatabase, VTable and VField object in Valentina has 3 properties related to localization:
- StorageEncoding as String
- LocaleName as String
- CollationAttribute( EVColAttribute ) as EVColAttributeValue
Information about locale parameters for each object is stored in the system tables.
See the API description of each of these classes in the corresponding section of the API Reference.
Storage Encoding
By default, Valentina creates a new database in the UTF-16 encoding. This is the recommended option, although you can try to tune your database by taking the following into account.
If the database contains a MacWestern language, then UTF-8 can be the best choice, because in this case most letters use one byte on disk (plain ASCII letters take one byte; accented letters take two).
UTF-8 is not the best choice for e.g. Cyrillic, because Russian letters use 2 bytes per letter in UTF-8. For Cyrillic, the best choice in terms of size is Cyrillic Win or Cyrillic Mac, depending on the hosting platform.
Those who use languages such as Japanese, Chinese or Korean will prefer UTF-16 as the storage encoding (as well as for IO).
NOTE: UTF-8 as a storage encoding should be avoided for non-Western single-byte languages, at least for fixed-size strings, because UTF-8 may use 1, 2, 3 or even 4 bytes for one letter. This is less of a problem for VarChar (which can be up to 4088 bytes long) and no problem at all for TEXT fields.
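The size trade-offs above can be checked with a short Python sketch. It only measures encoded sizes and does not involve Valentina; the sample words are arbitrary, and `cp1251` is assumed to correspond to Cyrillic Win.

```python
# Illustrative only: compare how many bytes per letter different encodings need.

def bytes_per_letter(text, encoding):
    """Average encoded size per character, or None if the encoding cannot
    represent the text at all."""
    try:
        return len(text.encode(encoding)) / len(text)
    except UnicodeEncodeError:
        return None


samples = {
    "English": "letter",
    "Russian": "письмо",
    "Japanese": "手紙",
}

for language, text in samples.items():
    for encoding in ("utf-8", "utf-16-le", "cp1251"):
        print(f"{language:9} {encoding:10} {bytes_per_letter(text, encoding)}")
```

A result of `None` means the single-byte encoding has no code for those letters, which is one reason to prefer a universal encoding when a field may hold several languages.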
A database can contain Tables/Fields with different storage encodings. You may wish to do this if you store different languages, although it is probably better just to use the universal UTF-16 for such tasks.
CollationAttribute
The Collation Attribute affects how Valentina sorts and compares strings.
There are several collation attributes (you can find details about each in the ICU documentation):
- kFrenchCollation
- kAlternateHandling
- kCaseFirst
- kCaseLevel
- kNormalizationMode
- kStrength
- kHiraganaQuaternaryMode
- kNumericCollation
The most interesting attribute for developers is kStrength. It has the following values:
- kPrimary = 0 - ignore accents and case: role = Role = rôle
- kSecondary = 1 - ignore case but distinguish accents: role = Role < rôle
- kTertiary = 2 - distinguish case and accents: role < Role < rôle
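The three strength levels can be imitated with a rough Python sketch. This is not ICU and not the Valentina API: it is a simplified stand-in built on the standard `unicodedata` module, the function name `collation_key` is hypothetical, and it is only meant to reproduce the role / Role / rôle comparisons listed above.

```python
import unicodedata

# Numeric values taken from the list above.
K_PRIMARY, K_SECONDARY, K_TERTIARY = 0, 1, 2


def collation_key(text, strength):
    """Build a comparison key with one level per strength step.

    Level 1: base letters, case folded (always compared).
    Level 2: accents attached to each base letter.
    Level 3: upper/lower case of each base letter.
    """
    base, marks = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and base:
            marks[-1] += ch          # accent belongs to the previous letter
        else:
            base.append(ch)
            marks.append("")
    levels = (
        "".join(ch.lower() for ch in base),
        tuple(marks),
        tuple(ch.isupper() for ch in base),
    )
    return levels[: strength + 1]


words = ["rôle", "Role", "role"]
for strength in (K_PRIMARY, K_SECONDARY, K_TERTIARY):
    keys = [collation_key(w, strength) for w in words]
    print(strength, "distinct keys:", len(set(keys)),
          "sorted:", sorted(words, key=lambda w: collation_key(w, strength)))
```

Each higher strength adds one more tie-breaking level to the key, which is why words that compare equal at kPrimary become distinguishable at kSecondary or kTertiary.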
kNumericCollation can also be of interest. If you set it ON, digit sequences inside strings are compared by their numeric value, so strings containing numbers are sorted as numbers. By default it is OFF.
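The effect of numeric collation can be sketched in plain Python. This is an approximation of the idea, not the ICU algorithm; `numeric_key` is a hypothetical helper name.

```python
import re


def numeric_key(text):
    """Split text into digit and non-digit runs and compare digit runs as integers."""
    parts = re.split(r"(\d+)", text)
    # With a capturing group, odd positions hold the digit runs.
    return [
        (1, int(part), "") if index % 2 else (0, 0, part)
        for index, part in enumerate(parts)
    ]


names = ["item10", "item2", "item1"]
print("numeric collation OFF:", sorted(names))
print("numeric collation ON: ", sorted(names, key=numeric_key))
```

With the plain sort, "item10" is placed before "item2" because the strings are compared character by character; with the numeric key, 2 is compared with 10 as a number.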
Inheritance of Locale Parameters
It is important to know that the hierarchy
Database -> Table -> Field
provides inheritance of the Locale parameters (StorageEncoding, LocaleName, CollationAttribute). Two aspects of this behavior are shown in the picture.
a) Assume you have assigned the StorageEncoding UTF-8 to a database. If you now create a Table in the database, the Table inherits the value of StorageEncoding from the database. The same is true when a field is created. See the left part of the picture.
b) If you have existing objects and change a parameter of an object higher in the hierarchy, the change is propagated down the hierarchy. See the right part of the picture.
If an object lower in the hierarchy already has its own assigned value, it does not accept changes propagated from its parent. The next picture shows 6 steps that explain several cases:
- all objects have UTF-16 encoding;
- the f1 field is assigned UTF-8 encoding;
- the Database gets Latin1 encoding and propagates it to child objects, but field f1 does not accept this change;
- the Database gets UTF-8 encoding and propagates it to child objects, but field f1 does not accept this change (although now all objects have UTF-8);
- the Database gets UTF-16 encoding and propagates it to child objects, but field f1 does not accept this change;
- field f1 gets NULL encoding, i.e. it should forget its own encoding and start to use the parent encoding.
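The six steps above can be modelled with a small Python sketch. It is a conceptual model only, not Valentina's implementation or API: the class `LocaleNode`, the extra field `f2` and the encoding strings are all hypothetical. Instead of pushing values down the tree, the model resolves a missing value by looking up the parent chain, which gives the same visible behaviour.

```python
class LocaleNode:
    """One object in the Database -> Table -> Field hierarchy (hypothetical model)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.own_encoding = None     # None plays the role of NULL: "use the parent's value"

    @property
    def storage_encoding(self):
        """Own value if assigned, otherwise the nearest ancestor's value."""
        if self.own_encoding is not None:
            return self.own_encoding
        if self.parent is not None:
            return self.parent.storage_encoding
        return None


def run_steps():
    db = LocaleNode("db")
    table = LocaleNode("t1", parent=db)
    f1 = LocaleNode("f1", parent=table)
    f2 = LocaleNode("f2", parent=table)   # a sibling field that never gets its own value

    results = []

    def record():
        results.append((db.storage_encoding, f1.storage_encoding, f2.storage_encoding))

    db.own_encoding = "UTF-16"; record()   # step 1: everything is UTF-16
    f1.own_encoding = "UTF-8"; record()    # step 2: f1 gets its own encoding
    db.own_encoding = "Latin1"; record()   # step 3: f1 keeps UTF-8
    db.own_encoding = "UTF-8"; record()    # step 4: all UTF-8, f1 still by its own value
    db.own_encoding = "UTF-16"; record()   # step 5: f1 keeps UTF-8
    f1.own_encoding = None; record()       # step 6: f1 follows the parent again
    return results


for step, (db_enc, f1_enc, f2_enc) in enumerate(run_steps(), start=1):
    print(f"step {step}: database={db_enc}  f1={f1_enc}  f2={f2_enc}")
```

Step 4 is the interesting case: f1 shows the same encoding as the database, but only because its own value happens to match, so it does not follow the database in step 5.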