2011-07-31

The real story behind the Java 7 GA bugs affecting Apache Lucene / Solr

I started this blog about two months ago, but until now, I had no time to write any posts. Since Oracle's release of Java 7 on Thursday last week, a lot of myths around the problem with Porter stemmer, wrong calculations in pulsing codec, and corrupt indexes appeared on the web. Some guys already blamed the Apache Software Foundation because they may have used the hype around Java 7 to fight against Oracle (since they stepped out of JCP).

I want to start with a chronological summary about what happened since last weekend:

Last Saturday I woke up in the morning and had no real plans for the weekend. I checked my Lucene JIRA issues and after I helped my GSoC student, I read Heise newsticker. One news article stepped into my eye: "Gearing up for Java 7" (english version). I noticed that there is only one week left until Oracle plans to release Java 7. "Maybe we should add some Jenkins jobs to the Apache build server to test Apache Lucene/Solr trunk and 3.x branches?" As always, Robert and me had some quick Google Talk session and I pointed out my plans. Our first problem was that the Jenkins server of Lucene runs under FreeBSD, so the chances were high, that we get no recent release of OpenJDK 7 running. I already installed the 64bit Windows preview build from Oracle (b147) on my Thinkpad and checked out the Lucene Core tests, which ran fine.

A quick review showed that OpenJDK7 build 147 was already available in the FreeBSD ports collection. I logged onto the jenkins slave machine and started to update ports and built the openjdk7 package. In parallel I also rebuilt the openjdk6 package for the standard test runs (a fix for a IPv6 bug was already available, so we were able to use latest version). After approx 25 minutes, the build was done and the package installed. I cloned the Jenkins jobs for the half-hourly test builds (3.x and trunk), hacked the shell scripts a little bit and started the first build. Robert and me were watching the build output,... live.

During the tests of the new analyzers module suddenly we got a crash with SIGSEGV in TestPorterStemmer. As I was only running the core tests on my local machine, this was not yet discoverered. We were shocked, "this is just a stupid FreeBSD problem, I am sure... hmhmhm... but a few weeks ago a user on the Solr mailing list reported a crash in PorterStemmer, too" (see this post). I started the analyzer tests locally on my windows box, bumm - same issue. Important was that the issue only appeared when I raised the number of test iterations, with the default of 1, the test passed. That clearly shows, there was an hotspot optimization bug, as they are only triggered when a method is executed ten thousand times.

I stopped the 2-hourly builds with Java 7 on the Jenkins slave, as we did not want to spam the mailing list with failed build reports. Robert opened LUCENE-3335 and started to investigate the problem. Step for step, he disabled compilation of several porter stemmer methods until he reached PorterStemmer.ends(). Robert opened a bug report at Oracles bug tracker: #7070134

The rest of the day we also fixed another small issues in our test cases (TestWordDelimiterFilter failed because it used a character that changed properties in Java 7 - Robert committed a fix for this).

On Sunday morning I gave Jenkins another chance: To proceed with testing on Jenkins, I hacked the shell scripts again to pass the following args to the Lucene test suite (only for Java 7 builds): -Dargs="-XX:CompileCommand=exclude,org/apache/lucene/analysis/en/PorterStemmer,ends"

The Lucene builds then passed fine most of the time, but we had some additional failures in another test inside the new facetting module: LUCENE-3346 (see below). But when starting to test the Solr builds, the problems began again: Some tests sometimes failed with strange error messages, but only randomly. It took us two hours and a hint from Yonik to find the issue behind this (SOLR-2673):
"It looks like with Java7, that the test methods are not being run in the order they are declared in the file. That's probably the root cause of the other Solr test failures with Java7, too."
A quick test with Java 6, inserting a Collections.shuffle() into the customized Lucene TestRunner, produced the same bugs. Additionally, the JDK docs nowhere explicitely state that Class.getMethods() does return in a particular order:
"Returns an array containing Method objects reflecting all the public member methods of the class or interface represented by this Class object, including those declared by the class or interface and those inherited from superclasses and superinterfaces. Array classes return all the (public) member methods inherited from the Object class. The elements in the array returned are not sorted and are not in any particular order."
The next few hours were then spent in fixing all Solr tests that relied on order of method executions (which is a violation against testing principles). On the evening we had all tests running fine, except the facetting bug and a random failure in the ICU module: LUCENE-3344 (I will come back to that later).

Status on Sunday: Until now, no response from Oracle about the new bug...

Monday morning CET, still no news from Oracle!

In late afternoon we got the first response; Robert informed me:
"Bug is visible, but it's priority is LOW. I am sure, they will release Java 7 with PorterStemmer crashing. All Solr users with the default configuration will blame us!"
Dawid Weiss then contacted the hotspot-compiler-dev mailing list; I subscribed to it and started to ask them questions. The developer responsible for these bugs (Vladimir Kozlov) sent me in the evening links to patches that might fix the issue. I also got an explanation, why we had this issue, that also exists in Java 6, but is hidden there due to optimizations not enabled (please read the thread for more information).

On Tuesday morning CET I started to produce a FreeBSD ports patch, so I was able to compile our openjdk7 package on the Jenkins slave with the proposed fixes (see patch-0uwe.patch). I changed the Jenkins configuration again and suddenly all tests passed, even the facetting tests!

We then worked around the last Java7-related test bug (LUCENE-3344), which made ICU fail on classloading ULocale. The problem is caused by some new locales in Java 7, that lead to a chicken-and-egg problem in the static initializer of ULocale. It initializes its default locale from the JDK locale in a static ctor. Until the default ULocale instance is created, the default is not set in ULocale. But ULocale's ctor itsself needs the default locale to fetch some ressource bundles and throws NPE. We opened a bug report at ICU (#8734) and added a workaround into our LuceneTestCase's locale randomization.

Time goes on...

On Thursday, lunch time CET, Java 7 was released by Oracle. Still hoping for a wonder, I downloaded the Windows x64 build from the official download location. Clicking on install it reported: "this version of the JDK is already installed on your computer, do you want to reinstall?" I did this and I was able to verify: it's exactly the same version as released on June 27th (you can verify this, too: the timestamp on the signature of the Windows Installer EXE file is June 27th). So Oracle ignored all bugs (not only ours) in the preview release and simply released a one month old package! So what was the sense of the preview release? They could have released it one month before! It was for sure not intended for public review and bug hunting!

In the evening (CET) I then opened a new issue (LUCENE-3349) as we already discussed on IRC, that we don't want our users to crash their Solr installations and corrupt their indexes. Robert spent the last days on hunting the causes behind the CheckIndex failures in the facetting module (the so-called index corru(m)ption bug). He was able to produce some random seeds that triggered the bug. In fact, it is a reincarnation of the well known readVInt-bug (LUCENE-2975) we discovered before release of Lucene 3.1. It affects the most often used Lucene method during reading index contents from disk: DataInput.readVInt(). For optimization purposes we have several incarnations of this method dependent on the underlying data structures. Java 6 JDKs only triggered this bug in combination with MappedByteBuffer.getByte(), but now it was triggered almost everywhere! Especially when the new pulsing codec was used (which has a byte[] variant of this method), all was f*cked up! StandardPostingsReader was merging index segments, the wrong results of readVInt() were copied to the newly created index segments, and finally we produced corrupt indexes!

The warning mail was released...

Shortly before midnight CET I sent the warning mail, prepared in LUCENE-3349, to the mailing lists:
Hello Apache Lucene & Apache Solr users, Hello users of other Java-based Apache projects,

Oracle released Java 7 today. Unfortunately it contains hotspot compiler optimizations, which miscompile some loops. This can affect code of several Apache projects. Sometimes JVMs only crash, but in several cases, results calculated can be incorrect, leading to bugs in applications (see Hotspot bugs 7070134 [1], 7044738 [2], 7068051 [3]).

Apache Lucene Core and Apache Solr are two Apache projects, which are affected by these bugs, namely all versions released until today. Solr users with the default configuration will have Java crashing with SIGSEGV as soon as they start to index documents, as one affected part is the well-known Porter stemmer (see LUCENE-3335 [4]). Other loops in Lucene may be miscompiled, too, leading to index corruption (especially on Lucene trunk with pulsing codec; other loops may be affected, too - LUCENE-3346 [5]).

These problems were detected only 5 days before the official Java 7 release, so Oracle had no time to fix those bugs, affecting also many more applications. In response to our questions, they proposed to include the fixes into service release u2 (eventually into service release u1, see [6]). This means you cannot use Apache Lucene/Solr with Java 7 releases before Update 2! If you do, please don't open bug reports, it is not the committers' fault! At least disable loop optimizations using the -XX:-UseLoopPredicate JVM option to not risk index corruptions.

Please note: Also Java 6 users are affected, if they use one of those JVM options, which are not enabled by default: -XX:+OptimizeStringConcat or -XX:+AggressiveOpts

It is strongly recommended not to use any hotspot optimization switches in any Java version without extensive testing!

In case you upgrade to Java 7, remember that you may have to reindex, as the unicode version shipped with Java 7 changed and tokenization behaves differently (e.g. lowercasing). For more information, read JRE_VERSION_MIGRATION.txt in your distribution package!

On behalf of the Lucene project,
Uwe


What then happened is well-known:


Until now, California was still sleeping...


In late afternoon CET, California woke up:


On the evening before, Hoss already posted the following blog post on the Lucid Imagination website: "Don’t Use Java 7, For Anything".

  • After California woke up, the first person posted Hoss' blog post on Slashdot: "Java 7 Ships With Severe Bug" referring to Hoss' blog post.

And then the whole thing went crazy: On Twitter, new posts refering to Slashdot, InfoWorld, and, finally, Hoss' blog post on Lucid Imagination appeared every few seconds. There were more tweets stating "Don't use Java 7, For Anything" than tweets about "the first Java release since 4 years".

A little bit later I recognized the first pro-Oracle blog trying to explain some background information: "Don't Use Java 7? Are you kidding me?" (Markus Eisele). Thanks for posting this!

I went to sleep and on the following day, the original Oracle Bug report that caused this was upgraded to priority "HIGH" - yeah. So we will hopefully get a corrected Java 7 release quite soon in Update Pack 1!

Finally I wanted to say thank you to the other Lucene committers that helped during investigation: Robert Muir, Yonik Seeley, Dawid Weiss, and Mike McCandless. And of course Hoss for his funny caption on Lucid Imagination's blog!

Update: (2011-08-01)

On Saturday morning CET, Cay Horstmann, professor of computer science at San Jose State University, compared the JDK 7 bugs with the Pentium FDIV bug in 1994. In his blog article "Java 7 Unsafe at Any Speed?", he stated that SIGSEGVs are easy bugs; much more serious are hidden bugs, only appearing under certain conditions and then silently produce wrong computation results. On Sunday evening CET, a user asked on Stackoverflow: "How serious is the Java7 'Solr/Lucene' bug?" In his response, Robert Muir described his work how to track down Java's "pentium bug" and circumvent it, with no success (see above).

Luckily, Oracle raised the priority of the SIGSEGV bug (#7070134) to "HIGH", but the other two bugs are still on "MEDIUM". One of them (#7044738) is exactly such a bug that Cay Horstmann described in his blog post. We applied the patches for all three bugs to our JVM installation, the "HIGH" priority one only fixes the SIGSEGV. Since our wrong readVInt-calculations in the method's loop are fixed by the combination of all three patches, wouldn't it be a good idea for Oracle, to rate all three issues with priority "HIGH"? Otherwise it could happen that we get service pack 1 only with the SIGSEGV bug fixed, but still silently producing corrupt indexes?

On Monday, the German "JAXenter" / "IT Republic" published a nice article: "Wie gravierend sind die Bugs in JDK7 wirklich?", referring to the above posts. There is also an English version available: "Java 7 Causes Headaches for Lucene and Solr Users".

Update: (2011-08-03)

Yesterday I had some interviews with journalists/bloggers:

Update: (2011-08-04)

Today, Neil McAllister published "Oracle: Java's worst enemy" on the InfoWorld newsticker. When California woke up in the evening CET, someone posted this article on Slashdot with the title "Oracle's Java Policies Are Destroying the Community", resulting in a heated discussion and a high tweet rate again.

Update: (2011-08-10)

The last days, community and journalists were blogging about the backgrounds:
Two days ago, Oracle offered to some committers that they could get access to early builds of the Java SE before they are released. This would allow us to check compatibility of bugfix releases and service packs of the Java SE with Apache Lucene and Solr. All this is covered by their Java CAP (Compatibility and Performance Program).


Update: (2011-08-12)

Oracle published a FAQ about Java 7, that clarifies when and how the Lucene-related fixes in Java 7 will be released. They also mentioned another bug (#7077439), they investigated while fixing the others. Unfortunately the page is not accessible to the public, but the fix and an explanation is already reviewed and committed to OpenJDK. Does this mean, Oracle started to try and test Lucene builds with their JDK?

    14 comments:

    1. Thanks, nice summary to understand the Java7 bug history better. I'm curious how the Oracle story sounds. :-)

      ReplyDelete
    2. "Lucene/SolrをJava 7で使うときの注意(あるいはJava 6以前でもホットスポットのバグを踏む可能性あり)"

      My Japanese is crappy, but here it goes: "Warning when you use Lucene/Solr with Java 7 (probably there is a possibility for Hotspot Bugs appearing in versions up to Java 6 as well)"

      BTW, interesting post!

      ReplyDelete
    3. Regarding the ordering issue in Class.getMethods (and getDeclaredMethods), you may want to look at and vote on http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6807471

      ReplyDelete
    4. Andrew: We recognized this behaviour for the reflection layer in Java 7, but it is no bug at all, the javadocs explicitely state that the returned arrays have no explicit order. The bug was in fact on our side, as the JUnit tests of Solr expected the array ordered like the method declarations in the source file (which was always the case in previous Java versions, but there was no guarantee).

      ReplyDelete
    5. Unlike you, I don´t think there´s anything to compliment in Hoss´ "creative" headline. Yes, it might have succeeded in getting a fix done sooner, but at what cost?.

      "And then the whole thing went crazy" is quite an understatement. I´m sure the foes of Java, you know the ones that wanted to kill Java since the 1.0 days, will use the worst-titled stories quoted above to discredit Java as a platform and promote their alternative solutions.

      "Don´t use Java 7, at all" is plain FUD... it prevents end-users on desktop computers from trying the new release which based on my tests gives much better performance.

      I´ve been using JRE7 to run the kind of desktop Java apps that I use daily (Java Image Editor, Vuze, muCommander withoout a single crash -so far, knock on wood ;). And I bet there´s quite a bit of loops in that code.

      Just my $0.02

      ReplyDelete
    6. Oh "Don´t use Java 7, at all" and the following backlash only conveys the message that "Oracle releases buggy code" and implicitely that "they don´t know what they´re doing" and are not to be trusted.

      It´s ironic that we don´t see headlines like "Microsoft releases buggy software" given Microsoft´s track record and the sole existance of "Patch tuesday".

      FC

      ReplyDelete
    7. Hossman did an awesome job. He probably prevented more cases of index corruption with one blog post than I ever have hammering on Lucene's tests.

      ReplyDelete
    8. That's fascinating! When will the movie be coming out? I wonder who they will get to play you...

      ReplyDelete
    9. Thank you for a great post. The movie - what stops you from doing a YouTube trailer? In this style: Hamlet

      ReplyDelete
    10. Great story! I can't see any point in blaming Oracle developers as well as anybody else.
      Had they had the bug fixed before the release, it would be better, but who knows their internal matters? At least they've done the release :)

      ReplyDelete
    11. Hi, this blog is really instructive. I would like to know more about this.


      Up-right Cases

      ReplyDelete
    12. So, a huge amount of time has passed. Is it now fixed?

      ReplyDelete
    13. You can read in the later blog posts that the bugs are fixed since Oracle Java 7u1 and 6u29.

      ReplyDelete