2012-07-11

The Policeman’s Horror: Default Locales, Default Charsets, and Default Timezones

Time for a tool to prevent any effects coming from them!

Did you ever try to run software downloaded from the net on a computer with the Turkish locale? Most of you probably never did. And if you ask Turkish IT specialists, they will tell you: “It is better to configure your computer using any other locale, but not tr_TR”. You have no clue what I am talking about? Maybe this article gives you a hint: “A Cellphone’s Missing Dot Kills Two People, Puts Three More in Jail”.

What you see in lots of software is so-called case-insensitive matching of keywords like parameter names or function names. In most cases this is implemented by lowercasing or uppercasing the input text and comparing it with a list of already lowercased/uppercased items. This works fine in most cases, anywhere in the world, except in Turkey! Because most programmers don’t care about running their software in Turkey, they do not test it under the Turkish locale.

But what happens with the case-insensitive matching if running in Turkey? Let’s take an example:

A user enters “BILLY” in the search field of your application. The application then uses the approach presented before: it lowercases “BILLY” and compares it to an internal table (e.g. our search index, parameter table, function table,...). So we search this table for “billy”. So far so good; this works perfectly in the USA, Germany, Kenya, almost everywhere, except Turkey. What happens in the Turkish locale when we lowercase “BILLY”? After reading the above article, you might expect it: the "BILLY".toLowerCase() statement in Java returns “bılly” (note the dotless i: 'ı' U+0131). You can try this out on your local machine without reconfiguring it to use the Turkish locale; just try the following Java code:
assertEquals("bılly", "BILLY".toLowerCase(new Locale("tr", "TR")));
The same happens vice versa: if you uppercase an ‘i’, you get an I with a dot (‘İ’ U+0130). This is really serious: millions of lines of code out there, in Java and other languages, ignore the fact that the String.toLowerCase() and String.toUpperCase() methods can optionally take a defined Locale (more about that later). Some examples from projects I am involved in:

  • Try to run an XSLT stylesheet using Apache XALAN-XSLTC (or Java 5’s internal XSLT interpreter) in the Turkish locale. It will fail with “unknown instruction”, because XALAN-XSLTC compiles the XSLT to Java Bytecode and somehow lowercases a virtual machine opcode before compiling it with BCEL (see XALANJ-2420, BCEL bug #38787).
  • The HTML SAX parser NekoHTML uses locale-less uppercasing/lowercasing to normalize charset names and element names. I opened a bug report (issue #3544334).
  • If you use PHP as your favourite scripting language, which is not case sensitive for class names and other language constructs, it will throw a compile error once you try to call a function with an “i” in it (see PHP bug #18556). Unfortunately it is unlikely that this serious bug will be fixed in PHP 5.3 or 5.4!
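The failure mode behind these bugs can be reproduced in a few lines. The following is only a minimal sketch: the class name, the keyword table and the helper method are made up for illustration and are not taken from any of the projects mentioned. The locale is passed in explicitly to mimic what String.toLowerCase() would do on a machine configured with that default locale.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical keyword table; all names here are assumptions for illustration.
final class TurkishLookupDemo {
  static final Set<String> KEYWORDS =
      new HashSet<String>(Arrays.asList("if", "include", "while"));

  // Mimics a lookup that calls toLowerCase() on a machine whose
  // default locale is the given one.
  static boolean isKeyword(String input, Locale machineLocale) {
    return KEYWORDS.contains(input.toLowerCase(machineLocale));
  }

  public static void main(String[] args) {
    Locale turkish = new Locale("tr", "TR");
    System.out.println(isKeyword("INCLUDE", Locale.ENGLISH)); // true
    // Under Turkish rules 'I' becomes the dotless i (U+0131), so the lookup misses.
    System.out.println(isKeyword("INCLUDE", turkish));        // false
    System.out.println(isKeyword("WHILE", turkish));          // false
  }
}
```

Keywords without an “i” are unaffected, which is why such bugs can stay hidden for a long time.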

The question is now: How to solve this?

The most correct way is not to lowercase at all! For case-insensitive comparison, Unicode defines “case folding”, a normalized form of text in which the distinction between upper and lower case is removed. Case-folded text is not meant for display (in most cases it is still readable, but that is not guaranteed); it only ensures that two case-folded strings can be compared with each other in a case-insensitive way. Unfortunately, Java does not offer a function to get this string, but ICU4J does (see UCharacter#foldCase).

Java does offer something very convenient, though: String.equalsIgnoreCase(String), which compares character by character using a locale-independent, simplified form of case folding. But in lots of cases you cannot use this fantastic method, because you want to look up such strings in a HashMap or other dictionary. Without modifying HashMap to use equalsIgnoreCase, this would never work. So we are back at lowercasing!

As mentioned before, you can pass a locale to String.toLowerCase(), so the naive approach would be to tell Java that we are in the US or using the English language: String.toLowerCase(Locale.US) or String.toLowerCase(Locale.ENGLISH). Both produce identical results, but neither is truly language-invariant. What happens if the US government decides to lowercase/uppercase like in Turkey? OK, don’t use Locale.US (it is also too US-centric). Locale.ENGLISH is fine and very generic, but languages also change over the years (who knows?), and we want the result to be language-invariant! If you are using Java 6 or later, there is a much better constant: Locale.ROOT. You should use this constant for our lowercase example: String.toLowerCase(Locale.ROOT).
You should start now and do a global search/replace on all your Java projects (if you do not rely on language specific presentation of text)! REALLY!
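As an illustration of the HashMap case, here is a minimal sketch of a lookup table that normalizes its keys with Locale.ROOT. The class and method names are hypothetical; the only point is that the result no longer depends on the machine's default locale, which the example sets to Turkish to simulate the problematic environment.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: a case-insensitive table keyed by Locale.ROOT-lowercased strings.
final class InvariantKeywordTable {
  private final Map<String, Integer> table = new HashMap<String, Integer>();

  private static String normalize(String key) {
    // Locale.ROOT keeps the result independent of the default locale.
    return key.toLowerCase(Locale.ROOT);
  }

  void put(String key, int value) {
    table.put(normalize(key), value);
  }

  Integer get(String key) {
    return table.get(normalize(key));
  }

  public static void main(String[] args) {
    Locale previous = Locale.getDefault();
    Locale.setDefault(new Locale("tr", "TR")); // simulate a Turkish machine
    try {
      InvariantKeywordTable t = new InvariantKeywordTable();
      t.put("billy", 42);
      System.out.println(t.get("BILLY"));                         // 42
      // The locale-less variant falls back to the Turkish default and misses:
      System.out.println("BILLY".toLowerCase().equals("billy"));  // false
    } finally {
      Locale.setDefault(previous);
    }
  }
}
```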
String.toLowerCase is not the only example of “automatic default locale usage” in the Java API. There are also things like transforming dates or numbers to strings. If you use the Formatter class and run it in another country, String.format("%f", 15.5f) may not always use a period (‘.’) as decimal separator; most Germans will know this. Passing a specific locale here helps in most cases. If you are writing a GUI in the English language, pass Locale.ENGLISH everywhere, otherwise the text output of numbers or dates may not match the language of your GUI! If you want Formatter to behave in an invariant way, use Locale.ROOT, too (then it will reliably format numbers with a period and without a comma for thousands, just like Float.toString(float) does).
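The decimal separator difference is easy to see when the locale is passed explicitly. This is a small sketch with a hypothetical class name; it uses the String.format(Locale, String, Object...) overload instead of relying on the default locale.

```java
import java.util.Locale;

// Hypothetical demo class; shows the same value formatted under two explicit locales.
final class FormatLocaleDemo {
  static String format(Locale locale, float value) {
    return String.format(locale, "%f", value);
  }

  public static void main(String[] args) {
    System.out.println(format(Locale.GERMANY, 15.5f)); // 15,500000
    System.out.println(format(Locale.ROOT, 15.5f));    // 15.500000
  }
}
```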

A second problem affecting lots of software comes from two other system-wide configurable default settings: the default charset/encoding and the default timezone. If you open a text file with FileReader or convert an InputStream to a Reader with InputStreamReader, Java automatically assumes that the input is in the default platform encoding. This may be fine if you want the text to be parsed using the defaults of the operating system. But if you ship a text file together with your software package (maybe as a resource in your JAR file) and then accidentally read it using the platform’s default charset... it’ll break your app! So my second recommendation:
Always pass a character set to any method converting bytes to strings (like InputStream <=> Reader, String.getBytes(),...). If you wrote the text file and ship it together with your app, only you know its encoding!
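A minimal sketch of this recommendation follows. It assumes the shipped file is UTF-8 and uses the StandardCharsets constants, which require Java 7 or later; the class and method names are hypothetical. The non-ASCII test string is written with Unicode escapes so that the example itself does not depend on the encoding of the source file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: every bytes <=> chars conversion names its charset.
final class ExplicitCharsetDemo {
  static byte[] encode(String text) {
    return text.getBytes(StandardCharsets.UTF_8);
  }

  // Reads the first line of a stream that is known to be UTF-8 encoded.
  static String readFirstLine(InputStream in) {
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      return reader.readLine();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    String text = "Gr\u00f6\u00dfe"; // German word with two non-ASCII letters
    byte[] utf8 = encode(text);
    System.out.println(utf8.length); // 7, both non-ASCII letters take two bytes
    System.out.println(text.equals(readFirstLine(new ByteArrayInputStream(utf8)))); // true
  }
}
```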
For timezones, similar examples can be found.
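One such example, sketched with a hypothetical class name: SimpleDateFormat silently uses the JVM's default timezone unless one is set, so the same instant prints differently depending on where the code runs. Passing both the locale and the timezone explicitly makes the output predictable.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical demo class: formats one instant under explicitly chosen timezones.
final class ExplicitTimeZoneDemo {
  static String format(Date date, TimeZone zone) {
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ROOT);
    fmt.setTimeZone(zone); // without this call the default timezone would be used
    return fmt.format(date);
  }

  public static void main(String[] args) {
    Date epoch = new Date(0L);
    System.out.println(format(epoch, TimeZone.getTimeZone("UTC")));           // 1970-01-01 00:00:00
    System.out.println(format(epoch, TimeZone.getTimeZone("Europe/Berlin"))); // 1970-01-01 01:00:00
  }
}
```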

How this affects Apache Lucene!

Apache Lucene is a full-text search engine and deals with text from different languages all the time; Apache Solr is an enterprise search server on top of Lucene and deals with input documents in lots of different charsets and languages. It is therefore essential for a search library like Lucene to be as independent from local machine settings as possible. A library must make explicit what input it wants to have. So we require charsets and locales in all public and private APIs (or we only take e.g. java.io.Reader instead of InputStream if we expect text coming in), so the user must take care of it.

Robert Muir and I reviewed the source code of Apache Lucene and Solr for the coming version 4.0 (an alpha version is already available on Lucene’s homepage, documentation is here). We have done this quite often, but whenever a new piece of code is committed to the source tree, it may happen that undefined locales, charsets, or similar things appear again. In most cases it is not the fault of the committer; it happens because auto-completion in the IDE automatically lists possible methods and parameters to the developer. Often you select the easiest variant (like String.toLowerCase()).

Using default locales, charsets and timezones is in my opinion a big design issue in programming languages like Java. If there are locale-sensitive methods, those methods should take a locale; if you convert a byte[] stream to a char[] stream, a charset must be given. Automatically falling back to defaults is a no-go in a server environment.
If a developer is interested in using the default locale of the user’s computer, he can always explicitly give the locale or charset. In our example this would be String.toLowerCase(Locale.getDefault()). This is more verbose, but it is obvious what the developer intends to do.

My proposal is to ban all those default charset and locale methods / classes in the Java API by deprecating them as soon as possible, so users stop using them implicitly!


Robert’s and my intention is to automatically fail the nightly builds (or compilation on the developer’s machine) when somebody uses one of the above methods in Lucene’s or Solr’s source code. We looked at different solutions like PMD or FindBugs, but both tools are too sloppy to handle that in a consistent way (PMD does not have any “default charset” method detection and FindBugs has only a very short list of method signatures). In addition, both PMD and FindBugs are very slow and often fail to correctly detect all problems. For Lucene builds we only need a tool that looks into the bytecode of all generated Java classes of Apache Lucene and Solr, and fails the build if any signature that violates our requirements is found.

A new Tool for the Policeman

I started to hack a tool as a custom ANT task using ASM 4.0 (Lightweight Java Bytecode Manipulation Framework). The idea was to provide a list of method signatures, field names and plain class names that should fail the build once bytecode accesses them in any way. A first version of this task was published in issue LUCENE-4199; later improvements added support for fields (LUCENE-4202) and a sophisticated signature expansion to also catch calls to subclasses of the given signatures (LUCENE-4206).

In the meantime, Robert worked on the list of “forbidden” APIs. This is what came out in the first version:
java.lang.String#<init>(byte[])
java.lang.String#<init>(byte[],int)
java.lang.String#<init>(byte[],int,int)
java.lang.String#<init>(byte[],int,int,int)
java.lang.String#getBytes()
java.lang.String#getBytes(int,int,byte[],int) 
java.lang.String#toLowerCase()
java.lang.String#toUpperCase()
java.lang.String#format(java.lang.String,java.lang.Object[])
java.io.FileReader
java.io.FileWriter
java.io.ByteArrayOutputStream#toString()
java.io.InputStreamReader#<init>(java.io.InputStream)
java.io.OutputStreamWriter#<init>(java.io.OutputStream)
java.io.PrintStream#<init>(java.io.File)
java.io.PrintStream#<init>(java.io.OutputStream)
java.io.PrintStream#<init>(java.io.OutputStream,boolean)
java.io.PrintStream#<init>(java.lang.String)
java.io.PrintWriter#<init>(java.io.File)
java.io.PrintWriter#<init>(java.io.OutputStream)
java.io.PrintWriter#<init>(java.io.OutputStream,boolean)
java.io.PrintWriter#<init>(java.lang.String)
java.io.PrintWriter#format(java.lang.String,java.lang.Object[])
java.io.PrintWriter#printf(java.lang.String,java.lang.Object[])
java.nio.charset.Charset#displayName()
java.text.BreakIterator#getCharacterInstance()
java.text.BreakIterator#getLineInstance()
java.text.BreakIterator#getSentenceInstance()
java.text.BreakIterator#getWordInstance()
java.text.Collator#getInstance()
java.text.DateFormat#getTimeInstance()
java.text.DateFormat#getTimeInstance(int)
java.text.DateFormat#getDateInstance()
java.text.DateFormat#getDateInstance(int)
java.text.DateFormat#getDateTimeInstance()
java.text.DateFormat#getDateTimeInstance(int,int)
java.text.DateFormat#getInstance()
java.text.DateFormatSymbols#<init>()
java.text.DateFormatSymbols#getInstance()
java.text.DecimalFormat#<init>()
java.text.DecimalFormat#<init>(java.lang.String)
java.text.DecimalFormatSymbols#<init>()
java.text.DecimalFormatSymbols#getInstance()
java.text.MessageFormat#<init>(java.lang.String)
java.text.NumberFormat#getInstance()
java.text.NumberFormat#getNumberInstance()
java.text.NumberFormat#getIntegerInstance()
java.text.NumberFormat#getCurrencyInstance()
java.text.NumberFormat#getPercentInstance()
java.text.SimpleDateFormat#<init>()
java.text.SimpleDateFormat#<init>(java.lang.String)
java.util.Calendar#<init>()
java.util.Calendar#getInstance()
java.util.Calendar#getInstance(java.util.Locale)
java.util.Calendar#getInstance(java.util.TimeZone)
java.util.Currency#getSymbol()
java.util.GregorianCalendar#<init>()
java.util.GregorianCalendar#<init>(int,int,int)
java.util.GregorianCalendar#<init>(int,int,int,int,int)
java.util.GregorianCalendar#<init>(int,int,int,int,int,int)
java.util.GregorianCalendar#<init>(java.util.Locale)
java.util.GregorianCalendar#<init>(java.util.TimeZone)
java.util.Scanner#<init>(java.io.InputStream)
java.util.Scanner#<init>(java.io.File)
java.util.Scanner#<init>(java.nio.channels.ReadableByteChannel)
java.util.Formatter#<init>()
java.util.Formatter#<init>(java.lang.Appendable)
java.util.Formatter#<init>(java.io.File)
java.util.Formatter#<init>(java.io.File,java.lang.String)
java.util.Formatter#<init>(java.io.OutputStream)
java.util.Formatter#<init>(java.io.OutputStream,java.lang.String)
java.util.Formatter#<init>(java.io.PrintStream)
java.util.Formatter#<init>(java.lang.String)
java.util.Formatter#<init>(java.lang.String,java.lang.String)
Using this easily extensible list, saved in a text file (UTF-8 encoded!), you can invoke my new ANT task (after registering it with <taskdef/>) very easily. The following is taken from Lucene/Solr’s build.xml:
<taskdef resource="lucene-solr.antlib.xml">
  <classpath>
    <pathelement location="${custom-tasks.dir}/build/classes/java" />
    <fileset dir="${custom-tasks.dir}/lib" includes="asm-debug-all-4.0.jar" />
  </classpath>
</taskdef>
<forbidden-apis>
  <classpath refid="additional.dependencies"/>
  <apiFileSet dir="${custom-tasks.dir}/forbiddenApis">
    <include name="jdk.txt" />
    <include name="jdk-deprecated.txt" />
    <include name="commons-io.txt" />
  </apiFileSet>
  <fileset dir="${basedir}/build" includes="**/*.class" />
</forbidden-apis>
The classpath given is used to look up the API signatures (provided as apiFileSet). The classpath is only needed if signatures come from 3rd party libraries. The inner fileset should list all class files to be checked. For running the task you also need asm-all-4.0.jar available in the task’s classpath.
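To make the format of the signature list more concrete, here is a rough sketch of how one line could be split into a class name and an optional member. This is purely illustrative and not the tool's actual parsing code; the class ApiSignature and its methods are hypothetical, and the real task additionally resolves the signatures against the classpath.

```java
// Hypothetical model of one line of the signature file; not the tool's real code.
final class ApiSignature {
  final String className;
  final String member; // null if the whole class is forbidden

  private ApiSignature(String className, String member) {
    this.className = className;
    this.member = member;
  }

  // Splits "pkg.Class#member(args)" at the '#'; a line without '#' names a whole class.
  static ApiSignature parse(String line) {
    String s = line.trim();
    int hash = s.indexOf('#');
    if (hash < 0) {
      return new ApiSignature(s, null);
    }
    return new ApiSignature(s.substring(0, hash), s.substring(hash + 1));
  }

  boolean isWholeClass() {
    return member == null;
  }

  boolean isMethod() {
    return member != null && member.endsWith(")");
  }

  public static void main(String[] args) {
    ApiSignature m = parse("java.lang.String#toLowerCase()");
    System.out.println(m.className + " / " + m.member); // java.lang.String / toLowerCase()
    System.out.println(parse("java.io.FileReader").isWholeClass()); // true
  }
}
```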

If you are interested, take the source code, it is open source and released as part of the tool set shipped with Apache Lucene & Solr: Source, API lists (revision number 1360240).

At the moment we are investigating other opportunities brought by that tool:
  • We want to ban System.out/err or things like horrible Eclipse-like try...catch...printStackTrace() auto-generated Exception stubs. We can just ban those fields from the java.lang.System class and of course, Throwable#printStackTrace().
  • Using optimized Lucene-provided replacements for JDK API calls. This can be enforced by failing on the JDK signatures.
  • Failing the build on deprecated calls to Java’s API. We can of course print warnings for deprecations, but failing the build is better. And: We use deprecation annotations in Lucene’s own library code, so javac-generated warnings don’t help. We can use the list of deprecated stuff from JDK Javadocs to trigger the failures.
I hope other projects take a similar approach to scan their binary/source code and free it from system-dependent API calls, whose behavior is not predictable on production systems in a server environment.

Thanks to Robert Muir and Dawid Weiss for help and suggestions!

EDIT (2015-03-14): On 2013-02-04, I released the plugin as Apache Ant, Apache Maven and CLI task on Google Code; later on 2015-03-14, it was migrated to Github. The project URL is: https://github.com/policeman-tools/forbidden-apis. The tool is available to your builds using Maven/Ivy through Maven Central and Sonatype repositories. Nightly snapshot builds are done by the Policeman Jenkins Server and can be downloaded from the Sonatype Snapshot repository.

49 comments:

  1. Stephen Colebourne also blogged about this: http://blog.joda.org/2012/12/annotating-jdk-default-data.html

  2. This looks just like an inverse of the animal-sniffer toolchain. Would be good to add this feature into animal sniffer @ mojo.codehaus.org

  3. I have a bit of a practical usage question. Our project has both GUI modules and non-GUI modules. If I run this tool, it flags all the usages of these sorts of things in the GUI, which really *should* be using the default locale, charset and time zone.

    Additionally, we have rules of our own which really do apply to *all* code. A good example is that java.text.MessageFormat is completely inadequate for plural rules of all languages so we are enforcing usage of ICU4J's alternative until the JRE corrects it. So we can't just turn off the checks on some modules.

    What are other people doing about this sort of situation? Is there a good, repetition-free way to have different sets of rules apply per-module? Or are people just sucking it up and putting explicit Charset.defaultCharset(), TimeZone.getDefault() and Locale.getDefault() calls into their GUI code?
