A collation is a set of rules for comparing character-string (CHAR/VARCHAR/CLOB) values. Other terms for the same thing are sort order and linguistic comparison, but "collation" is the official SQL:1999 word. Collations are especially important in search conditions or sorts, as in this example:
In Europe and the Americas, collation rules are relatively simple: use the order in the 26-letter Latin alphabet. There are just two major complications: accents and cases.
The Latin alphabet has 26 letters, but we can extend it by adding marks above or below the letters, or by adding new letters. Technically the extensions aren't always accents but there's no better word. Here are examples that I've found in English dictionaries and references:
handle all Latin-based alphabets, the collating rules must include the
There is a vague vocabulary to describe some of these rules. The term accent insensitive means accented characters map to unaccented equivalents. The term case insensitive means lowercase letters map to uppercase equivalents and vice versa. The term unaccented preference means accented should follow unaccented. This vocabulary is inadequate for clarifying all rules in all situations, so experts have devised a more accurate method of description: weights.
A character's weight is an unsigned number that is assigned consistently whenever a specific letter or combination of letters is encountered. Since there is no controversy about how to compare two unsigned numbers, once you've assigned the right weights to a string, you've solved the collating problem.
The obvious and simplest way to assign a weight is to say the weight is the same as a character's code point. (The code point is the number of a character in its code page. For example the code point for 'A' in 7-bit ASCII or ISO-8859 is 65 and the code point for 'a' is 97.) But a collation using code points -- usually known as a binary collation -- handles exactly zero of the rules I've described.
An adequate solution involves assigning the same weight for two different letters (which is what mapping means), allowing gaps between the weights for Latin letters so that non-Latin letters or combinations can fit between them, and assigning different sets of weights for different purposes. The Unicode sorting algorithm meets these requirements. Despite its name, the algorithm is really a way to assign weights for any character string, regardless of whether you use the Unicode character set or not. The algorithm is simple, standard, and common, e.g. it's used by Java's string functions and in Microsoft products. For illustration I'll use the Windows implementation of the Unicode algorithm, as seen in the Windows-API function, LCMapString. Here are the respective weights that can be assigned for the letters 'ABCDE':
1402 1406 1410 1414 1418
Now, if you encounter a situation where an accented character must sort between 'C' and 'D' (for example a Spanish 'CH') you have space to fit the character along this number line -- you can assign 'CH' a weight of 1413.
In English, all forms of 'E', including 'E' or 'e' or 'É' or 'é', must be assigned the same weight: 1418. This rule is necessary because, if 'é' were assigned a higher weight, then 'Chrétien' would sort higher than 'Chretienne' -- an absurdity. Therefore weights for accents are assigned separately, in a way that ensures they have lower priority.
generates a string consisting of three parts:
It is possible to generate more parts than the three shown here, but this is sufficient for any Western alphabet.
Character weights are for characters. Every character has a 16-bit character weight (2 bytes), most significant byte first. The minimum weight is 2, for the null character. To save space, trailing 2s are stripped.
Accent weights are for accents. Every character has an 8-bit accent weight (1 byte) but the most common weight is 2 (for unaccented), and once again trailing 2s are stripped. For some languages, if an accented character has a different character weight, the accent weight is 2 despite the accent.
Case weights are for variant ways of writing the same letter. Generally that means capitalization (lowercase is 2 and uppercase is 18); in Greek it can also apply to the two ways that sigma is written. Trailing 2s are stripped.
The LCMapString output parts are not intermixed. First come all the character weights, then there is a separator value (1) followed by the accent weights, then there is another separator value followed by the case weights. For example, LCMapString(..."Chäu"...) produces these strings for five locales:
C h ä u : UP English 14 10 14 44 14 2 14 159 1 2 2 19 14 1 18 1 French 14 10 14 44 14 2 14 159 1 14 19 1 18 1 German 14 10 14 44 14 2 14 159 1 2 2 19 14 1 18 1 Spanish 14 13 14 2 14 159 1 2 19 14 1 18 1 Danish 14 10 14 44 14 172 14 159 1 2 2 19 14 1 18 1 Character Weights Accent Weights Case Weights
Character Weights Accent Weights Case Weights
some things to note about LCMapString's result:
The three parts of the result are concatenated so a single comparison is all that's necessary -- this is not a three-pass system requiring "tiebreaks" or "equivalency class" complications. However, it is easy to stop at the first separator value if only character weights are being compared, or stop at the second separator value if only character plus accent weights are being compared.
by Character Weights Only, Sort by All Weights
Consider this query:
SELECT column1 FROM
A typical searcher will want to find either 'Chretien' or 'Chrétien' since these are the same names in English. To resolve this WHERE clause, it's necessary to compare only the character weights -- 'chretien', 'Chretien', and 'Chrétien' all have the same character weights. The ORDER BY can then be handled with a sort that takes into account the other weights -- 'e' has a lower accent weight than 'é' and 'c' has a lower case weight than 'C'.
Now, let's compare two descriptions of the same thing:
"dictionary sort case insensitive accent insensitive, giving preference in collation to case" (this is taken from a help dialog with Microsoft SQL Server)
"search by character weight, sort by character weight plus case weight"
The second description is clearer because it requires fewer special terms, and is less ambiguous. That's a good reason to use weights when describing collation rules, even if your DBMS doesn't support all three sets of weights.
Incidentally, dictionaries usually sort by character weight plus accent weight. For example I looked up the words 'resume' and 'résumé' in several dictionaries and they always appear in that order.
Since an index's primary purpose is to facilitate retrieval, an index will only be in order by character weights. But in that case, an index is not in order by (character weights plus accent weights plus case weights)! So sorting poses a dilemma. Should the DBMS (1) produce ORDER BY results according to an index that has character weights only? Or (2) ignore indexes for ORDER BY? Or (3) use separate indexes for searching and for ORDER BY? Option (1) is poor functionality, option (2) is slow, and option (3) is overkill. DBMS vendors and users have to compromise.
Microsoft SQL Server, and Oracle
In the chart below, I show how three DBMSs sorted variations of the word 'naive' with an ORDER BY clause.
In the IBM column, 'naïve' comes after 'naivf'. That is, the character weight of an accented character 'ï' is greater than the character weight of an unaccented character 'i'. IBM cannot distinguish character weights from accent weights. I cannot think of any locale where people would desire such behavior.
In the Microsoft column, occurrences of the word 'naive' don't all appear together. That is, the accent weight of 'ï' is the same as the accent weight of 'i'. This slight flaw happens with the default SQL collation that is a legacy from SQL Server version 7. There are other collations where Microsoft is more sensitive and gets the case weights right, and other collations where it doesn't handle accent weights correctly.
In the Oracle column, 'naïve' comes before 'naivf' and all occurrences of 'naive' appear together. Oracle delivers precisely the result that we would expect from an ideal DBMS that used all weights -- character, accent, and case -- for ORDER BY.
In the next chart, I show a syntax that each DBMS supports to force a non-default collation.
IBM has no built-in capabilities. DB2 allows you to define your own collation, but that shifts the workload from the DBMS vendor to the customer. For example, to get accent-weight sorting, you'll have to write a user-defined function to produce weighted strings.
Microsoft supports a COLLATE clause that meets the ANSI/ISO SQL:1999 requirement. It's possible to override the default collation during both retrieval and sorting. This huge feature arrived with SQL Server 2000.
Oracle has about the same power as Microsoft, but instead of using ANSI/ISO syntax Oracle has vendor-specific terms like "National Language Support Sort" or NLS_SORT for short.
Both Microsoft and Oracle include many built-in collations, so you won't have to write your own very often. But Oracle does provide the capability.
Here's a comparison chart that sums up the main features:
Conclusion: Oracle does collations best, but Microsoft SQL Server 2000 gets honorable mention for using nearly-standard SQL syntax.
German, Spanish, and Scandinavian collations are tricky. You must understand their special rules or you will end up with an inappropriate sort order about half the time. I have prepared a resource page to explain details that are indispensable for applications that must work in German-speaking countries, in Spain and Latin America, or in the Nordic countries (Denmark, Finland, Iceland, Norway, Sweden).
Firm, fussy rules are needed so that searches and sorts will deliver consistent results. It's nice if results are correct too, which means you should ask: What do users want, as opposed to what the DBMS can deliver? Binary sorting is faster, but for correct results you need a variety of collations. When choosing or creating a collation, understand weights and beware of surprising rules.
Besides the resource page mentioned above, visit these Web pages:
Peter Gulutzan is co-author of SQL-99 Complete Really (CMP Books 1999) and SQL Performance Tuning, which Addison-Wesley will publish in September 2002..
Register for free