DBAzine.com: the online portal for database issues and solutions

Collations

by Peter Gulutzan

A collation is a set of rules for comparing character-string (CHAR/VARCHAR/CLOB) values. Other terms for the same thing are sort order and linguistic comparison, but "collation" is the official SQL:1999 word. Collations are especially important in search conditions or sorts, as in this example:

SELECT ...
WHERE char_column = 'char_literal'
ORDER BY char_column

In Europe and the Americas, collation rules are relatively simple: use the order in the 26-letter Latin alphabet. There are just two major complications: accents and cases.

Accents

The Latin alphabet has 26 letters, but we can extend it by adding marks above or below the letters, or by adding new letters. Technically the extensions aren't always accents but there's no better word. Here are examples that I've found in English dictionaries and references:

Name	Example
Circumflex	crêpe
Grave	à-la-mode
Acute	résumé
Diaeresis (marks new syllable)	Artaÿctes naïve Noël
Umlaut (changes vowel sound)	Führer
Tilde	Angeleño
Ligature	Æsop
Cedilla	garçon
Extra letter in Scandinavian alphabet	Bjørn Ångstrom
Extra letter in Old English / Icelandic alphabet	ðottir

To handle all Latin-based alphabets, the collating rules must include the following:

One character maps to another. For example, in English, the accented letter 'ê' can be considered equal to the unaccented letter 'e'.

Two characters map to one. For example in Czech, 'CH' is a separate character that comes between 'H' and 'I'.

One character maps to two. For example in the German "phone book" collation, 'Ü' is the same as 'UE'.

Accented characters don't map to unaccented equivalents. For example in Danish, 'Ü' maps not to 'U' but to 'Y'.

Accented characters don't map. For example in Turkish, 'Ü' does not map to 'U' (or to 'Y' for that matter). Instead it is a separate character between 'U' and 'V'.

Alphabetic order isn't as in English. For example in Swedish, 'W' and 'V' collate together.

Characters don't exist in Latin alphabet. For example in Icelandic, there are two letters: [Ðð eth] and [Þþ thorn] that dropped out of English before 1400. And that's minor compared to Cyrillic scripts, which won't map at all.

Accented should follow unaccented. For example in the German "dictionary" collation, 'Ü' maps to 'U' but should sort after 'U' if the whole string is equal in other respects; that is 'Mueller' < 'Müller' < 'Muellers'.

Case. The Latin alphabet, like the Greek, Cyrillic, and Armenian alphabets, allows two forms of letters: capital (uppercase) and small (lowercase). The rules must allow for the German "sharp s" character (which has no uppercase form) and the Turkish "dotless i" character (which has an uppercase form that looks like the Latin letter "I").

There is a vague vocabulary to describe some of these rules. The term accent insensitive means accented characters map to unaccented equivalents. The term case insensitive means lowercase letters map to uppercase equivalents and vice versa. The term unaccented preference means accented should follow unaccented. This vocabulary is inadequate for clarifying all rules in all situations, so experts have devised a more accurate method of description: weights.

Weights

A character's weight is an unsigned number that is assigned consistently whenever a specific letter or combination of letters is encountered. Since there is no controversy about how to compare two unsigned numbers, once you've assigned the right weights to a string, you've solved the collating problem.

The obvious and simplest way to assign a weight is to say the weight is the same as a character's code point. (The code point is the number of a character in its code page. For example the code point for 'A' in 7-bit ASCII or ISO-8859 is 65 and the code point for 'a' is 97.) But a collation using code points -- usually known as a binary collation -- handles exactly zero of the rules I've described.

An adequate solution involves assigning the same weight for two different letters (which is what mapping means), allowing gaps between the weights for Latin letters so that non-Latin letters or combinations can fit between them, and assigning different sets of weights for different purposes. The Unicode sorting algorithm meets these requirements. Despite its name, the algorithm is really a way to assign weights for any character string, regardless of whether you use the Unicode character set or not. The algorithm is simple, standard, and common, e.g. it's used by Java's string functions and in Microsoft products. For illustration I'll use the Windows implementation of the Unicode algorithm, as seen in the Windows-API function, LCMapString. Here are the respective weights that can be assigned for the letters 'ABCDE':

1402 1406 1410 1414 1418

Now, if you encounter a situation where an accented character must sort between 'C' and 'D' (for example a Spanish 'CH') you have space to fit the character along this number line -- you can assign 'CH' a weight of 1413.

In English, all forms of 'E', including 'E' or 'e' or 'É' or 'é', must be assigned the same weight: 1418. This rule is necessary because, if 'é' were assigned a higher weight, then 'Chrétien' would sort higher than 'Chretienne' -- an absurdity. Therefore weights for accents are assigned separately, in a way that ensures they have lower priority.

LCMapString generates a string consisting of three parts:

Character Weights, also known as "Primary" or "Strength-1"

Accent Weights, also known as "Secondary" or "Strength-2"

Case Weights, also known as "Tertiary" or "Strength-3"

It is possible to generate more parts than the three shown here, but this is sufficient for any Western alphabet.

Character weights are for characters. Every character has a 16-bit character weight (2 bytes), most significant byte first. The minimum weight is 2, for the null character. To save space, trailing 2s are stripped.

Accent weights are for accents. Every character has an 8-bit accent weight (1 byte) but the most common weight is 2 (for unaccented), and once again trailing 2s are stripped. For some languages, if an accented character has a different character weight, the accent weight is 2 despite the accent.

Case weights are for variant ways of writing the same letter. Generally that means capitalization (lowercase is 2 and uppercase is 18); in Greek it can also apply to the two ways that sigma is written. Trailing 2s are stripped.

The LCMapString output parts are not intermixed. First come all the character weights, then there is a separator value (1) followed by the accent weights, then there is another separator value followed by the case weights. For example, LCMapString(..."Chäu"...) produces these strings for five locales:

 
        C       h       ä       u                   :           UP 
English 14  10  14  44  14   2  14 159   1   2   2  19  14   1  18   1   
French  14  10  14  44  14   2  14 159   1  14  19           1  18   1   
German  14  10  14  44  14   2  14 159   1   2   2  19  14   1  18   1   
Spanish 14  13          14   2  14 159   1   2  19  14       1  18   1   
Danish  14  10  14  44  14 172  14 159   1   2   2  19  14   1  18   1   
        Character Weights                    Accent Weights     Case Weights

Character Weights Accent Weights Case Weights

Here are some things to note about LCMapString's result:

Output order is the same as input order, except accent weights are in reverse order for French. Supposedly there's a French rule that accents matter more if they come later in the word, though I've never seen a realistic example.

'Ch' causes a single character-weight output (14 13) for traditional Spanish. That's because 'Ch' is considered a single character between 'C' and 'D'.

'ä' causes a large character-weight output (14 172) for Danish. That's because 'ä' is considered a separate character that follows 'Z'.

The case weight is always the same (18) for the uppercase 'C'.

A separator value is 1, which is less than any possible weight. So longer strings will come after shorter ones.

The three parts of the result are concatenated so a single comparison is all that's necessary -- this is not a three-pass system requiring "tiebreaks" or "equivalency class" complications. However, it is easy to stop at the first separator value if only character weights are being compared, or stop at the second separator value if only character plus accent weights are being compared.

Search by Character Weights Only, Sort by All Weights

Consider this query:

SELECT column1 FROM Table1
WHERE column1 = 'Chretien'
ORDER BY column1

A typical searcher will want to find either 'Chretien' or 'Chrétien' since these are the same names in English. To resolve this WHERE clause, it's necessary to compare only the character weights -- 'chretien', 'Chretien', and 'Chrétien' all have the same character weights. The ORDER BY can then be handled with a sort that takes into account the other weights -- 'e' has a lower accent weight than 'é' and 'c' has a lower case weight than 'C'.

Now, let's compare two descriptions of the same thing:

"dictionary sort case insensitive accent insensitive, giving preference in collation to case" (this is taken from a help dialog with Microsoft SQL Server)

versus

"search by character weight, sort by character weight plus case weight"

The second description is clearer because it requires fewer special terms, and is less ambiguous. That's a good reason to use weights when describing collation rules, even if your DBMS doesn't support all three sets of weights.

Incidentally, dictionaries usually sort by character weight plus accent weight. For example I looked up the words 'resume' and 'résumé' in several dictionaries and they always appear in that order.

Since an index's primary purpose is to facilitate retrieval, an index will only be in order by character weights. But in that case, an index is not in order by (character weights plus accent weights plus case weights)! So sorting poses a dilemma. Should the DBMS (1) produce ORDER BY results according to an index that has character weights only? Or (2) ignore indexes for ORDER BY? Or (3) use separate indexes for searching and for ORDER BY? Option (1) is poor functionality, option (2) is slow, and option (3) is overkill. DBMS vendors and users have to compromise.

IBM DB2, Microsoft SQL Server, and Oracle

In the chart below, I show how three DBMSs sorted variations of the word 'naive' with an ORDER BY clause.

IBM	Microsoft	Oracle
naive	naive	NAIVE
naive	NAIVE	Naive
naive	naive	Naive
naive	Naive	naive
naive	naive	naive
naive	naive	naive
naive	naive	naive
naive	naïve	naive
Naive	Naïve	naive
Naive	NAIVF	NAIVF
Naive	naivf	Naivf
Naive	Naivf	Naivf
NAIVE	naïvf	naivf
NAIVE	Naïvf	naivf

In the IBM column, 'naïve' comes after 'naivf'. That is, the character weight of an accented character 'ï' is greater than the character weight of an unaccented character 'i'. IBM cannot distinguish character weights from accent weights. I cannot think of any locale where people would desire such behavior.

In the Microsoft column, occurrences of the word 'naive' don't all appear together. That is, the accent weight of 'ï' is the same as the accent weight of 'i'. This slight flaw happens with the default SQL collation that is a legacy from SQL Server version 7. There are other collations where Microsoft is more sensitive and gets the case weights right, and other collations where it doesn't handle accent weights correctly.

In the Oracle column, 'naïve' comes before 'naivf' and all occurrences of 'naive' appear together. Oracle delivers precisely the result that we would expect from an ideal DBMS that used all weights -- character, accent, and case -- for ORDER BY.

In the next chart, I show a syntax that each DBMS supports to force a non-default collation.

IBM:
Not Applicable

Microsoft:
SELECT column1 FROM Table1
WHERE column1 = 'Vide' COLLATE Finnish_Swedish_CI_AI
ORDER BY column1 COLLATE Finnish_Swedish_CI_AI

Oracle:
CREATE INDEX test_idx ON test(NLSSORT(col, 'NLS_SORT=FRENCH'));

IBM has no built-in capabilities. DB2 allows you to define your own collation, but that shifts the workload from the DBMS vendor to the customer. For example, to get accent-weight sorting, you'll have to write a user-defined function to produce weighted strings.

Microsoft supports a COLLATE clause that meets the ANSI/ISO SQL:1999 requirement. It's possible to override the default collation during both retrieval and sorting. This huge feature arrived with SQL Server 2000.

Oracle has about the same power as Microsoft, but instead of using ANSI/ISO syntax Oracle has vendor-specific terms like "National Language Support Sort" or NLS_SORT for short.

Both Microsoft and Oracle include many built-in collations, so you won't have to write your own very often. But Oracle does provide the capability.

Here's a comparison chart that sums up the main features:

Feature	DB2	SQL Server	Oracle
User-Defined	Yes	No	Yes
Order by Accent or Case Weights	No	Usually	Yes
Change Collation at Runtime	No	Yes	Yes

Conclusion: Oracle does collations best, but Microsoft SQL Server 2000 gets honorable mention for using nearly-standard SQL syntax.

European Collations

German, Spanish, and Scandinavian collations are tricky. You must understand their special rules or you will end up with an inappropriate sort order about half the time. I have prepared a resource page to explain details that are indispensable for applications that must work in German-speaking countries, in Spain and Latin America, or in the Nordic countries (Denmark, Finland, Iceland, Norway, Sweden).

The Parting Shot

Firm, fussy rules are needed so that searches and sorts will deliver consistent results. It's nice if results are correct too, which means you should ask: What do users want, as opposed to what the DBMS can deliver? Binary sorting is faster, but for correct results you need a variety of collations. When choosing or creating a collation, understand weights and beware of surprising rules.

Besides the resource page mentioned above, visit these Web pages:

"Unicode Technical Standard #10 -- Unicode Collation Algorithm"

"International Features in Microsoft SQL Server 2000"

"Oracle9i Globalization Support Guide"

"Posix locale descriptors"

"Help file for reading Posix locale descriptors"

"IBM Application Development Guide: National Language Support Considerations"

Peter Gulutzan is co-author of SQL-99 Complete Really (CMP Books 1999) and SQL Performance Tuning, which Addison-Wesley will publish in September 2002..

SEARCH
Can't find what you're looking for?
Try our search!

REGISTER
Register for free
e-mail updates
for DBAzine.com!

DISCUSSION
Got a question? Join us in our DBAzine Forum!

Sponsored Whitepapers

SEARCH

legal statement | contact us
Links to external sites are subject to change; dbazine.com and BMC Software do not control or endorse
the content of these external Web sites, and are not responsible for their content.
© 2002-2001 dbazine.com. All Rights Reserved.