2011-10-19

Java 7 Update 1 released - Does it fix the Lucene index corru(m)ption and SIGSEGV bugs?

After the serious issues with the Java 7 GA release, Oracle released Update 1 of Java 7 yesterday. The first thing, of course, was to check the release notes, download the package, and finally run our Lucene test cases. When inspecting the download page and release notes on the Oracle web site, I got confused: The new release is Java 7u1, in contrast the developer preview released one week ago named Java 7u2 Developer Preview. Both builds have the same build number (b08). More strange is: The release notes of the developer preview differ from the release notes on the Update 1 release:
  • The official Update 1 release notes only list one bug of the three ones originally reported to Oracle: #7068051 (SIGSEGV in PhaseIdealLoop::build_loop_late_post on T5440). The other ones are not listed, so we cannot be sure that they are really fixed.
  • The famous SIGSEGV bug in Porter Stemmer (#7070134) is not listed at all, it also disappeared from the Oracle issue tracker. It seems like hidden to the public (maybe they declared it as confidental because it's security related???).

No matter what release notes say - I had to download the offical release package! I took some free time on the Lucene Eurocon Conference in Barcelona, downloaded the packages and installed them in parallel on my Windows 64bit Thinkpad. First I tried to run the Porter Stemmer test with 100 iterations on Java 7 GA, and verified that it crashed. Robert Muir joined and we tested the new U1 release: Test passes! This means, Hotspot issue #7070134 was fixed, but Oracle missed to put it into the release notes.

The second part of the investigations were more complicated: The index corru(m)ption bugs are more complicated to reproduce, as the virtual machine does not crash and simply produces corrupt indexes after merging segments in one of the facetting tests. We checked out the Lucene Trunk revision of the time, when the bug was first discovered (issue LUCENE-3346) and used the random seed mentioned on the issue. We were able to verify the bug with Java 7 GA (the indexes are corrupt 90% of the time), and luckily after 20 iterations of the same test and random seed in Java 7u1, we have seen no corrupt index! It seemed to us that Oracle maybe fixed Java issues #7044738 and #7068051, but missed to put both of them into the release notes. Of course, without an additional statement from Oracle, we cannot be sure, that the issues are really fixed!

Oracle also released Java 6 Update 29 yesterday. The release notes on that version doesn't mention any relation to the Lucene bugs, so we were not sure if this version is completely different to the Java 6u29 developer preview, released one week ago, which listed those bugs (unfortunately, the package is no longer available on the net), or if they also missed to mention them. A quick review as done for Java 7 showed, that Porter Stemmer no longer crashes with -XX:+AggressiveOpts, so the bug seems to be also fixed here, too. We were not able to actually discover any index corru(m)ption.

Finally, we can somehow verify that the bugs seem to be fixed for both versions, but without an official statement from Oracle (in their release notes), we cannot recommend to use Java 7u1 (and Java 6u29 with aggressive opts) with Lucene and Solr.

Once I will be back in Germany, I will try to get an updated FreeBSD package of Java 7 and install it on our Apache Jenkins server.

Update (2011-10-26)

Last night, Oracle updated the release notes of Java 7u1 and Java 6u29, stating that they fixed the three Lucene-relevant bugs (plus another one related to that). Based on this confirmation, it's now safe to use Java 7 Update 1 (and later) with Apache Lucene and Apache Solr.

Of course, there is still the recommendation not to use -XX:+AggressiveOpts on any JVM in production!

We are still waiting for updated OpenJDK packages to install this release on our build servers.

2011-10-15

GotoCon Aarhus 2011

Last week I visited the Gotocon Conference 2011 in Aarhus. I was invited by the Java User Group (Javagruppen) of Denmark to have a talk at their event during this conference. Of course this talk was about the famous Java 7 Launch Bug, I posted earlier in this blog.

I started the trip on Monday morning and wanted to arrive in the afternoon but unfortunately the first train from Bremen to Hamburg was canceled by Deutsche Bahn. Because of this I missed the only direct train to Aarhus and had to use slow trains like RegionalExpress and arrived about 3 hours later at the conference. Unfortunately I was only able to listen to the final keynote on the first day, so I missed the talks I wanted to visit on the afternoon.

Day One

The "party keynote" was about "Cool Code", held by Kevlin Henney. He presented lots of nice code fragments "that are interesting because of historical significance, profound concepts, impressive technique, exemplary style or just sheer geekiness". Unfortunately the room was so big and the screen so small, that most code parts were unreadable. You were still able to follow, but the coolness of the shown code was not really visible. Alltogether, the talk was very interesting and a must to follow! After this keynote, still very hungry from the train ride, I attended the conference dinner, met lots of people. On my table were people from Siemens, one of them was Frank Buschmann, who hold the talk "Seven Secrets Every Architect Should Know" on Wednesday.

Day Two

The next day was packed with lot's of talks. Unfortunately, I still had to discuss with Robert Muir through Google Talk about Apache Lucene issue LUCENE-1536 (applying Lucene filters low, once it is committed, I will write about it!) and a bug in our FilteredQuery#Weight implementation, so I missed the first talk. I started with "Eigenharp: Experiencing Music Differently (and what programmers can learn from it)" which compared music instruments with user interfaces and what developers can learn from music instruments and their long history.

After lunch I attended the talk about node.js by Bert Belder, who had some problems to do his live demonstration on Apple computers (yes: "die, Apple, die"). It gave me some ideas, how to use a select-based webserver for Apache Solr. Maybe we will use the Java-only JBoss' Netty for Solr in the future!

The Java User Group: My talk about the Java 7 launch bug

On 15:50, the Java 7 event started in Kammermusiksalen. Martin Boel from Javagruppen did some introduction about the new features in Java 7 and introduced myself. He showed features, I did not know about, like an improved swing user interface. These are all features, I generally don't use in my code. One goodness is the new "try-with-resources" statement, that can make your code cleaner, as you don't have to take care of closing all resources when an exception occurs. Java 7 does this automatically (in fact, the Java compiler adds the correct statements in complicated finally blocks automatically to your code). We are using a similar infrastructure in our IOUtils class in Lucene. Recently I already upgraded it (LUCENE-3334) to make use of the new Throwable.addSuppressed() functionality to log supressed exceptions in the stack trace of the original exception. As Apache Lucene is still compatible with Java 5, I added a convenience reflection-based method to IOUtils, so we can log the supressed exception if the code runs on Java 7.

After the introduction, I started with my talk. In the following hour I showed a lot of slides (download them here as PDF) and explained the whole story of the Java 7 launch bug, as I did in my previous blog post. I also explained, for the first time, why Java 7 crashes on the Porter Stemmer code:
During the investigation of the bug in July, I had no idea, why Porter Stemmer's ends() method crashed with SIGSEGV - but almost identical code in the Java String class did not. When preparing my presentation, I analyzed the comments by hotspot's developers and was able to understand it for our case. The reason why ends() crashes in our case is also related to a loop unwinding bug. The difference to the String class is the underlying char[] array, which never changes in Java's final and unmodifiable String class. On the other hand, the Porter Stemmer code modifies the char[] array and leads to a bug because it used wrong assumptions. Java 7's string optimization routines make the hotspot compiler assume String.length() is constant (which is, of course, true). In the Porter Stemmer case, these lengths are generally very short (it checks for short strings only), and unwinds the loops in end(). Unfortunately the underlying char[] array used by Porter Stemmer is not unmodifiable, but length checks were removed leading to the bug. Not using string optimizations like Java 6 without -XX:+OptimizeStringConcat therefore prevented the bug (as the loop optimizations cannot be applied).
My talk was also followed by some Nokia guys from Berlin (I knew them from Berlin Buzzwords conference). After a short reception we went to Aarhus' nightlife and visited some bars.

Day Three

The first talk I visited was Hannes Kruppa's (he works at Nokia in Berlin and joined us the evening before) "Improving Search Ranking Through A/B Tests: A Case Study". After that I went to "Questions for an Enterprise Architect" by Erik Doernenburg, and finally had some lunch. I left the conference at about 14:00 to head back to germany. This time with no train problems.

All togther a very nice conference! Thanks to all who organized it!

2011-07-31

The real story behind the Java 7 GA bugs affecting Apache Lucene / Solr

I started this blog about two months ago, but until now, I had no time to write any posts. Since Oracle's release of Java 7 on Thursday last week, a lot of myths around the problem with Porter stemmer, wrong calculations in pulsing codec, and corrupt indexes appeared on the web. Some guys already blamed the Apache Software Foundation because they may have used the hype around Java 7 to fight against Oracle (since they stepped out of JCP).

I want to start with a chronological summary about what happened since last weekend:

Last Saturday I woke up in the morning and had no real plans for the weekend. I checked my Lucene JIRA issues and after I helped my GSoC student, I read Heise newsticker. One news article stepped into my eye: "Gearing up for Java 7" (english version). I noticed that there is only one week left until Oracle plans to release Java 7. "Maybe we should add some Jenkins jobs to the Apache build server to test Apache Lucene/Solr trunk and 3.x branches?" As always, Robert and me had some quick Google Talk session and I pointed out my plans. Our first problem was that the Jenkins server of Lucene runs under FreeBSD, so the chances were high, that we get no recent release of OpenJDK 7 running. I already installed the 64bit Windows preview build from Oracle (b147) on my Thinkpad and checked out the Lucene Core tests, which ran fine.

A quick review showed that OpenJDK7 build 147 was already available in the FreeBSD ports collection. I logged onto the jenkins slave machine and started to update ports and built the openjdk7 package. In parallel I also rebuilt the openjdk6 package for the standard test runs (a fix for a IPv6 bug was already available, so we were able to use latest version). After approx 25 minutes, the build was done and the package installed. I cloned the Jenkins jobs for the half-hourly test builds (3.x and trunk), hacked the shell scripts a little bit and started the first build. Robert and me were watching the build output,... live.

During the tests of the new analyzers module suddenly we got a crash with SIGSEGV in TestPorterStemmer. As I was only running the core tests on my local machine, this was not yet discoverered. We were shocked, "this is just a stupid FreeBSD problem, I am sure... hmhmhm... but a few weeks ago a user on the Solr mailing list reported a crash in PorterStemmer, too" (see this post). I started the analyzer tests locally on my windows box, bumm - same issue. Important was that the issue only appeared when I raised the number of test iterations, with the default of 1, the test passed. That clearly shows, there was an hotspot optimization bug, as they are only triggered when a method is executed ten thousand times.

I stopped the 2-hourly builds with Java 7 on the Jenkins slave, as we did not want to spam the mailing list with failed build reports. Robert opened LUCENE-3335 and started to investigate the problem. Step for step, he disabled compilation of several porter stemmer methods until he reached PorterStemmer.ends(). Robert opened a bug report at Oracles bug tracker: #7070134

The rest of the day we also fixed another small issues in our test cases (TestWordDelimiterFilter failed because it used a character that changed properties in Java 7 - Robert committed a fix for this).

On Sunday morning I gave Jenkins another chance: To proceed with testing on Jenkins, I hacked the shell scripts again to pass the following args to the Lucene test suite (only for Java 7 builds): -Dargs="-XX:CompileCommand=exclude,org/apache/lucene/analysis/en/PorterStemmer,ends"

The Lucene builds then passed fine most of the time, but we had some additional failures in another test inside the new facetting module: LUCENE-3346 (see below). But when starting to test the Solr builds, the problems began again: Some tests sometimes failed with strange error messages, but only randomly. It took us two hours and a hint from Yonik to find the issue behind this (SOLR-2673):
"It looks like with Java7, that the test methods are not being run in the order they are declared in the file. That's probably the root cause of the other Solr test failures with Java7, too."
A quick test with Java 6, inserting a Collections.shuffle() into the customized Lucene TestRunner, produced the same bugs. Additionally, the JDK docs nowhere explicitely state that Class.getMethods() does return in a particular order:
"Returns an array containing Method objects reflecting all the public member methods of the class or interface represented by this Class object, including those declared by the class or interface and those inherited from superclasses and superinterfaces. Array classes return all the (public) member methods inherited from the Object class. The elements in the array returned are not sorted and are not in any particular order."
The next few hours were then spent in fixing all Solr tests that relied on order of method executions (which is a violation against testing principles). On the evening we had all tests running fine, except the facetting bug and a random failure in the ICU module: LUCENE-3344 (I will come back to that later).

Status on Sunday: Until now, no response from Oracle about the new bug...

Monday morning CET, still no news from Oracle!

In late afternoon we got the first response; Robert informed me:
"Bug is visible, but it's priority is LOW. I am sure, they will release Java 7 with PorterStemmer crashing. All Solr users with the default configuration will blame us!"
Dawid Weiss then contacted the hotspot-compiler-dev mailing list; I subscribed to it and started to ask them questions. The developer responsible for these bugs (Vladimir Kozlov) sent me in the evening links to patches that might fix the issue. I also got an explanation, why we had this issue, that also exists in Java 6, but is hidden there due to optimizations not enabled (please read the thread for more information).

On Tuesday morning CET I started to produce a FreeBSD ports patch, so I was able to compile our openjdk7 package on the Jenkins slave with the proposed fixes (see patch-0uwe.patch). I changed the Jenkins configuration again and suddenly all tests passed, even the facetting tests!

We then worked around the last Java7-related test bug (LUCENE-3344), which made ICU fail on classloading ULocale. The problem is caused by some new locales in Java 7, that lead to a chicken-and-egg problem in the static initializer of ULocale. It initializes its default locale from the JDK locale in a static ctor. Until the default ULocale instance is created, the default is not set in ULocale. But ULocale's ctor itsself needs the default locale to fetch some ressource bundles and throws NPE. We opened a bug report at ICU (#8734) and added a workaround into our LuceneTestCase's locale randomization.

Time goes on...

On Thursday, lunch time CET, Java 7 was released by Oracle. Still hoping for a wonder, I downloaded the Windows x64 build from the official download location. Clicking on install it reported: "this version of the JDK is already installed on your computer, do you want to reinstall?" I did this and I was able to verify: it's exactly the same version as released on June 27th (you can verify this, too: the timestamp on the signature of the Windows Installer EXE file is June 27th). So Oracle ignored all bugs (not only ours) in the preview release and simply released a one month old package! So what was the sense of the preview release? They could have released it one month before! It was for sure not intended for public review and bug hunting!

In the evening (CET) I then opened a new issue (LUCENE-3349) as we already discussed on IRC, that we don't want our users to crash their Solr installations and corrupt their indexes. Robert spent the last days on hunting the causes behind the CheckIndex failures in the facetting module (the so-called index corru(m)ption bug). He was able to produce some random seeds that triggered the bug. In fact, it is a reincarnation of the well known readVInt-bug (LUCENE-2975) we discovered before release of Lucene 3.1. It affects the most often used Lucene method during reading index contents from disk: DataInput.readVInt(). For optimization purposes we have several incarnations of this method dependent on the underlying data structures. Java 6 JDKs only triggered this bug in combination with MappedByteBuffer.getByte(), but now it was triggered almost everywhere! Especially when the new pulsing codec was used (which has a byte[] variant of this method), all was f*cked up! StandardPostingsReader was merging index segments, the wrong results of readVInt() were copied to the newly created index segments, and finally we produced corrupt indexes!

The warning mail was released...

Shortly before midnight CET I sent the warning mail, prepared in LUCENE-3349, to the mailing lists:
Hello Apache Lucene & Apache Solr users, Hello users of other Java-based Apache projects,

Oracle released Java 7 today. Unfortunately it contains hotspot compiler optimizations, which miscompile some loops. This can affect code of several Apache projects. Sometimes JVMs only crash, but in several cases, results calculated can be incorrect, leading to bugs in applications (see Hotspot bugs 7070134 [1], 7044738 [2], 7068051 [3]).

Apache Lucene Core and Apache Solr are two Apache projects, which are affected by these bugs, namely all versions released until today. Solr users with the default configuration will have Java crashing with SIGSEGV as soon as they start to index documents, as one affected part is the well-known Porter stemmer (see LUCENE-3335 [4]). Other loops in Lucene may be miscompiled, too, leading to index corruption (especially on Lucene trunk with pulsing codec; other loops may be affected, too - LUCENE-3346 [5]).

These problems were detected only 5 days before the official Java 7 release, so Oracle had no time to fix those bugs, affecting also many more applications. In response to our questions, they proposed to include the fixes into service release u2 (eventually into service release u1, see [6]). This means you cannot use Apache Lucene/Solr with Java 7 releases before Update 2! If you do, please don't open bug reports, it is not the committers' fault! At least disable loop optimizations using the -XX:-UseLoopPredicate JVM option to not risk index corruptions.

Please note: Also Java 6 users are affected, if they use one of those JVM options, which are not enabled by default: -XX:+OptimizeStringConcat or -XX:+AggressiveOpts

It is strongly recommended not to use any hotspot optimization switches in any Java version without extensive testing!

In case you upgrade to Java 7, remember that you may have to reindex, as the unicode version shipped with Java 7 changed and tokenization behaves differently (e.g. lowercasing). For more information, read JRE_VERSION_MIGRATION.txt in your distribution package!

On behalf of the Lucene project,
Uwe


What then happened is well-known:


Until now, California was still sleeping...


In late afternoon CET, California woke up:


On the evening before, Hoss already posted the following blog post on the Lucid Imagination website: "Don’t Use Java 7, For Anything".

  • After California woke up, the first person posted Hoss' blog post on Slashdot: "Java 7 Ships With Severe Bug" referring to Hoss' blog post.

And then the whole thing went crazy: On Twitter, new posts refering to Slashdot, InfoWorld, and, finally, Hoss' blog post on Lucid Imagination appeared every few seconds. There were more tweets stating "Don't use Java 7, For Anything" than tweets about "the first Java release since 4 years".

A little bit later I recognized the first pro-Oracle blog trying to explain some background information: "Don't Use Java 7? Are you kidding me?" (Markus Eisele). Thanks for posting this!

I went to sleep and on the following day, the original Oracle Bug report that caused this was upgraded to priority "HIGH" - yeah. So we will hopefully get a corrected Java 7 release quite soon in Update Pack 1!

Finally I wanted to say thank you to the other Lucene committers that helped during investigation: Robert Muir, Yonik Seeley, Dawid Weiss, and Mike McCandless. And of course Hoss for his funny caption on Lucid Imagination's blog!

Update: (2011-08-01)

On Saturday morning CET, Cay Horstmann, professor of computer science at San Jose State University, compared the JDK 7 bugs with the Pentium FDIV bug in 1994. In his blog article "Java 7 Unsafe at Any Speed?", he stated that SIGSEGVs are easy bugs; much more serious are hidden bugs, only appearing under certain conditions and then silently produce wrong computation results. On Sunday evening CET, a user asked on Stackoverflow: "How serious is the Java7 'Solr/Lucene' bug?" In his response, Robert Muir described his work how to track down Java's "pentium bug" and circumvent it, with no success (see above).

Luckily, Oracle raised the priority of the SIGSEGV bug (#7070134) to "HIGH", but the other two bugs are still on "MEDIUM". One of them (#7044738) is exactly such a bug that Cay Horstmann described in his blog post. We applied the patches for all three bugs to our JVM installation, the "HIGH" priority one only fixes the SIGSEGV. Since our wrong readVInt-calculations in the method's loop are fixed by the combination of all three patches, wouldn't it be a good idea for Oracle, to rate all three issues with priority "HIGH"? Otherwise it could happen that we get service pack 1 only with the SIGSEGV bug fixed, but still silently producing corrupt indexes?

On Monday, the German "JAXenter" / "IT Republic" published a nice article: "Wie gravierend sind die Bugs in JDK7 wirklich?", referring to the above posts. There is also an English version available: "Java 7 Causes Headaches for Lucene and Solr Users".

Update: (2011-08-03)

Yesterday I had some interviews with journalists/bloggers:

Update: (2011-08-04)

Today, Neil McAllister published "Oracle: Java's worst enemy" on the InfoWorld newsticker. When California woke up in the evening CET, someone posted this article on Slashdot with the title "Oracle's Java Policies Are Destroying the Community", resulting in a heated discussion and a high tweet rate again.

Update: (2011-08-10)

The last days, community and journalists were blogging about the backgrounds:
Two days ago, Oracle offered to some committers that they could get access to early builds of the Java SE before they are released. This would allow us to check compatibility of bugfix releases and service packs of the Java SE with Apache Lucene and Solr. All this is covered by their Java CAP (Compatibility and Performance Program).


Update: (2011-08-12)

Oracle published a FAQ about Java 7, that clarifies when and how the Lucene-related fixes in Java 7 will be released. They also mentioned another bug (#7077439), they investigated while fixing the others. Unfortunately the page is not accessible to the public, but the fix and an explanation is already reviewed and committed to OpenJDK. Does this mean, Oracle started to try and test Lucene builds with their JDK?