Unicode-correct title case in Java

Developer https://www.devze.com 2023-04-04 06:02 Source: Internet
I've been looking through the bazillion StackOverflow questions about capitalizing a word in Java, and none of them seem to care in the least about internationalization; as a matter of fact, none really seem to work in an international context. So here is my question.

I have a String in Java, which represents a word - all isLetter() characters, no whitespace. I want to make the first character upper case and the rest lower case. I do have the locale of my word at hand.

It's easy enough to call .substring(1).toLowerCase(Locale) for the last part of my string. I have no idea how to get the correct first character, though.

The first problem I have is with Dutch, where "ij", being a digraph, should be capitalized together. I could special-case this by hand, because I know about it; but there may be other languages with this kind of thing that I don't know about, and I'm sure Unicode will tell me if I ask nicely. But I don't know how to ask.

Even if the above problem is solved, I'm still stuck with no proper way to handle English, Turkish and Greek, because Character supports titlecase but no locale, and String supports locales but not titlecase.

If I take the code point and pass it to Character.toTitleCase(), this will fail because there is no way to pass the locale to this method. So if the word is Turkish and its first char is "i", I'll get "I" instead of "İ", and this is wrong. If instead I take a substring and use .toUpperCase(Locale), this will fail because it's upper case and not title case. So if the word is Greek, I'll still get the wrong character.

If anyone has useful pointers, I'd be happy to hear them.


Like you, I was unable to find a suitable method in the core Java API.

However, there does seem to be a locale-sensitive string-title-case method (UCharacter#toTitleCase) in the ICU library.


Looking at the source for the relevant ICU methods (UCharacter#toTitleCase and UCaseProps#toUpperOrTitle), there don't seem to be many locale-specific special cases for title-casing, so you might be able to get away with the following:

  1. Find the first cased character in the string.
  2. If it has a title-case form distinct from its upper-case form, use that.
  3. Otherwise, perform a locale-sensitive upper-case on that first character and its combining characters.
  4. Perform a locale-sensitive lower-case on the rest of the string.
  5. If the locale is Dutch and the first cased character is an "I" followed by a "j", upper-case the "j".
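The five steps above can be sketched in plain Java roughly as follows. This is only an illustration under some assumptions: the class and method names are made up, "cased" is approximated with Character.isUpperCase/isLowerCase/isTitleCase, "combining characters" is taken to mean the mark categories Mn, Me and Mc, and the result has not been checked against ICU's output.

```java
import java.util.Locale;

// Hypothetical helper; names are illustrative, not from the JDK or ICU.
final class TitleCaseSketch {

    // Approximation of "cased": the JDK has no direct isCased() test.
    static boolean isCased(int cp) {
        return Character.isUpperCase(cp) || Character.isLowerCase(cp)
                || Character.isTitleCase(cp);
    }

    static boolean isMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
                || type == Character.ENCLOSING_MARK
                || type == Character.COMBINING_SPACING_MARK;
    }

    static String titleCaseWord(String word, Locale locale) {
        // Step 1: find the first cased character.
        int start = 0;
        while (start < word.length() && !isCased(word.codePointAt(start))) {
            start += Character.charCount(word.codePointAt(start));
        }
        if (start == word.length()) {
            return word; // nothing to capitalize
        }
        int first = word.codePointAt(start);
        int end = start + Character.charCount(first);
        String head;
        int title = Character.toTitleCase(first);
        if (title != Character.toUpperCase(first)) {
            // Step 2: a distinct title-case form exists, so use it.
            head = new String(Character.toChars(title));
        } else {
            // Step 3: locale-sensitive upper-case of the first character
            // together with any combining marks that follow it.
            while (end < word.length() && isMark(word.codePointAt(end))) {
                end += Character.charCount(word.codePointAt(end));
            }
            head = word.substring(start, end).toUpperCase(locale);
        }
        // Step 4: locale-sensitive lower-case of the rest.
        String rest = word.substring(end).toLowerCase(locale);
        // Step 5: Dutch "ij" digraph, capitalized as a unit.
        if ("nl".equals(locale.getLanguage()) && head.equals("I")
                && rest.startsWith("j")) {
            rest = "J" + rest.substring(1);
        }
        return word.substring(0, start) + head + rest;
    }

    public static void main(String[] args) {
        System.out.println(titleCaseWord("ijsselmeer", Locale.forLanguageTag("nl")));
        System.out.println(titleCaseWord("istanbul", Locale.forLanguageTag("tr")));
        System.out.println(titleCaseWord("ISTANBUL", Locale.ENGLISH));
    }
}
```

The word is split before lower-casing, so any lower-case mapping that depends on the preceding character sees only the tail; if that matters for your data, the ICU method is the safer choice.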


The only two-character digraph in which both characters are capitalized together, and that you are likely to encounter in a real-life program, is the Dutch IJ. Just handle it if the locale is Dutch. In the improbable worst case, there will be one or two more cases to add later; you won't run into a new capitalization digraph every day, so generalizing here is not worth the effort.

Note that, in general, it is not possible to use character-to-character conversion to get either title or upper case for an arbitrary language. Some lower-case characters map to more than one upper-case character, so in the general case you have to work with String.
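A small illustration of that point, using the German sharp s (U+00DF) as the example character (my choice, not taken from the question). The class and method names are hypothetical; the JDK calls are the standard Character.toUpperCase(char) and String.toUpperCase(Locale).

```java
import java.util.Locale;

// Hypothetical demo class; only the JDK calls inside are real API.
final class OneToManyUpperCase {

    // char-to-char mapping: can only ever return a single char.
    static char upperViaChar(char c) {
        return Character.toUpperCase(c);
    }

    // String mapping: the result may be longer than the input.
    static String upperViaString(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        char sharpS = '\u00DF';
        // No single-char upper-case form exists, so the char comes back unchanged.
        System.out.println(upperViaChar(sharpS) == sharpS);
        // The String version expands it to two characters.
        System.out.println(upperViaString(String.valueOf(sharpS)));
    }
}
```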

But the missing locale parameter for title case is not a problem. There is probably a small misunderstanding about how the toTitleCase() method works: it converts any character to title case, including one that is already upper case.

For example, consider the dž character (U+01C6). Its upper case form is DŽ and its title case form is Dž:

System.out.println(Character.toUpperCase('\u01C6'));
DŽ

and

System.out.println(Character.toTitleCase('\u01C6'));
Dž

however, the following will also give title case

System.out.println(Character.toTitleCase(Character.toUpperCase('\u01C6')));
Dž

So if you first convert to upper case with the locale and then apply title case to the result, you get the correct code point, and this works for Turkish, etc.:

System.out.println(Character.toTitleCase("dž".toUpperCase().charAt(0)));
System.out.println(Character.toTitleCase("i".toUpperCase(Locale.forLanguageTag("tr")).charAt(0)));
Dž
İ

Note that just taking the title case of a single character when it differs from its upper case is not correct in the general case.

To summarize:

  • Handle the Dutch digraph (or other digraphs if you encounter them; I highly doubt you will, and at worst it will be one or two cases over the program's lifetime).
  • Convert the required characters as a String using the locale and toUpperCase().
  • Convert all characters of the toUpperCase() result using Character.toTitleCase().
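A minimal sketch of that summary, assuming the "required characters" are just the first code point of the word and that the rest is lower-cased with the same locale, as the question asks. The class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper following the three bullets above.
final class UpperThenTitle {

    static String capitalize(String word, Locale locale) {
        if (word.isEmpty()) {
            return word;
        }
        // Split off the first code point (not the first char).
        int headEnd = word.offsetByCodePoints(0, 1);
        // Locale-aware upper-casing must go through String.
        String upper = word.substring(0, headEnd).toUpperCase(locale);
        String tail = word.substring(headEnd).toLowerCase(locale);

        // Title-case every code point of the upper-cased head.
        StringBuilder head = new StringBuilder();
        for (int i = 0; i < upper.length(); ) {
            int cp = upper.codePointAt(i);
            head.appendCodePoint(Character.toTitleCase(cp));
            i += Character.charCount(cp);
        }

        // Dutch digraph: "ij" is capitalized as a unit.
        if ("nl".equals(locale.getLanguage())
                && head.toString().equals("I") && tail.startsWith("j")) {
            tail = "J" + tail.substring(1);
        }
        return head.append(tail).toString();
    }

    public static void main(String[] args) {
        System.out.println(capitalize("istanbul", Locale.forLanguageTag("tr")));
        System.out.println(capitalize("ijsselmeer", Locale.forLanguageTag("nl")));
    }
}
```

One consequence of following the bullets literally: when the upper-cased head expands to several letters (a leading sharp s becomes "SS"), every one of them stays upper case, because Character.toTitleCase('S') is still 'S'.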

Note that there are still some capitalization cases that are context-aware, like Irish prefixes, English "ff" names, etc., which require more than plain character/string processing, but I doubt you need to handle them for title generation in a program.


The problem is that the distinction between upper- and lower-case letters is very language specific. Many languages, maybe most, do not have one at all.

Anyway, there is a Unicode faq: http://www.unicode.org/faq/casemap_charprop.html

...and I guess there is a Unicode mapping table somewhere (something like ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt). So it's probably best to write your own conversion method.
