I want to start with a chronological summary about what happened since last weekend:
Last Saturday I woke up in the morning and had no real plans for the weekend. I checked my Lucene JIRA issues and after I helped my GSoC student, I read Heise newsticker. One news article stepped into my eye: "Gearing up for Java 7" (english version). I noticed that there is only one week left until Oracle plans to release Java 7. "Maybe we should add some Jenkins jobs to the Apache build server to test Apache Lucene/Solr trunk and 3.x branches?" As always, Robert and me had some quick Google Talk session and I pointed out my plans. Our first problem was that the Jenkins server of Lucene runs under FreeBSD, so the chances were high, that we get no recent release of OpenJDK 7 running. I already installed the 64bit Windows preview build from Oracle (b147) on my Thinkpad and checked out the Lucene Core tests, which ran fine.
A quick review showed that OpenJDK7 build 147 was already available in the FreeBSD ports collection. I logged onto the jenkins slave machine and started to update ports and built the openjdk7 package. In parallel I also rebuilt the openjdk6 package for the standard test runs (a fix for a IPv6 bug was already available, so we were able to use latest version). After approx 25 minutes, the build was done and the package installed. I cloned the Jenkins jobs for the half-hourly test builds (3.x and trunk), hacked the shell scripts a little bit and started the first build. Robert and me were watching the build output,... live.
During the tests of the new analyzers module suddenly we got a crash with SIGSEGV in TestPorterStemmer. As I was only running the core tests on my local machine, this was not yet discoverered. We were shocked, "this is just a stupid FreeBSD problem, I am sure... hmhmhm... but a few weeks ago a user on the Solr mailing list reported a crash in PorterStemmer, too" (see this post). I started the analyzer tests locally on my windows box, bumm - same issue. Important was that the issue only appeared when I raised the number of test iterations, with the default of 1, the test passed. That clearly shows, there was an hotspot optimization bug, as they are only triggered when a method is executed ten thousand times.
I stopped the 2-hourly builds with Java 7 on the Jenkins slave, as we did not want to spam the mailing list with failed build reports. Robert opened LUCENE-3335 and started to investigate the problem. Step for step, he disabled compilation of several porter stemmer methods until he reached PorterStemmer.ends(). Robert opened a bug report at Oracles bug tracker: #7070134
The rest of the day we also fixed another small issues in our test cases (TestWordDelimiterFilter failed because it used a character that changed properties in Java 7 - Robert committed a fix for this).
On Sunday morning I gave Jenkins another chance: To proceed with testing on Jenkins, I hacked the shell scripts again to pass the following args to the Lucene test suite (only for Java 7 builds): -Dargs="-XX:CompileCommand=exclude,org/apache/lucene/analysis/en/PorterStemmer,ends"
The Lucene builds then passed fine most of the time, but we had some additional failures in another test inside the new facetting module: LUCENE-3346 (see below). But when starting to test the Solr builds, the problems began again: Some tests sometimes failed with strange error messages, but only randomly. It took us two hours and a hint from Yonik to find the issue behind this (SOLR-2673):
"It looks like with Java7, that the test methods are not being run in the order they are declared in the file. That's probably the root cause of the other Solr test failures with Java7, too."A quick test with Java 6, inserting a Collections.shuffle() into the customized Lucene TestRunner, produced the same bugs. Additionally, the JDK docs nowhere explicitely state that Class.getMethods() does return in a particular order:
"Returns an array containingThe next few hours were then spent in fixing all Solr tests that relied on order of method executions (which is a violation against testing principles). On the evening we had all tests running fine, except the facetting bug and a random failure in the ICU module: LUCENE-3344 (I will come back to that later).Method
objects reflecting all the public member methods of the class or interface represented by thisClass
object, including those declared by the class or interface and those inherited from superclasses and superinterfaces. Array classes return all the (public) member methods inherited from theObject
class. The elements in the array returned are not sorted and are not in any particular order."
Status on Sunday: Until now, no response from Oracle about the new bug...
Monday morning CET, still no news from Oracle!
In late afternoon we got the first response; Robert informed me:
"Bug is visible, but it's priority is LOW. I am sure, they will release Java 7 with PorterStemmer crashing. All Solr users with the default configuration will blame us!"Dawid Weiss then contacted the hotspot-compiler-dev mailing list; I subscribed to it and started to ask them questions. The developer responsible for these bugs (Vladimir Kozlov) sent me in the evening links to patches that might fix the issue. I also got an explanation, why we had this issue, that also exists in Java 6, but is hidden there due to optimizations not enabled (please read the thread for more information).
On Tuesday morning CET I started to produce a FreeBSD ports patch, so I was able to compile our openjdk7 package on the Jenkins slave with the proposed fixes (see patch-0uwe.patch). I changed the Jenkins configuration again and suddenly all tests passed, even the facetting tests!
We then worked around the last Java7-related test bug (LUCENE-3344), which made ICU fail on classloading ULocale. The problem is caused by some new locales in Java 7, that lead to a chicken-and-egg problem in the static initializer of ULocale. It initializes its default locale from the JDK locale in a static ctor. Until the default ULocale instance is created, the default is not set in ULocale. But ULocale's ctor itsself needs the default locale to fetch some ressource bundles and throws NPE. We opened a bug report at ICU (#8734) and added a workaround into our LuceneTestCase's locale randomization.
Time goes on...
On Thursday, lunch time CET, Java 7 was released by Oracle. Still hoping for a wonder, I downloaded the Windows x64 build from the official download location. Clicking on install it reported: "this version of the JDK is already installed on your computer, do you want to reinstall?" I did this and I was able to verify: it's exactly the same version as released on June 27th (you can verify this, too: the timestamp on the signature of the Windows Installer EXE file is June 27th). So Oracle ignored all bugs (not only ours) in the preview release and simply released a one month old package! So what was the sense of the preview release? They could have released it one month before! It was for sure not intended for public review and bug hunting!
In the evening (CET) I then opened a new issue (LUCENE-3349) as we already discussed on IRC, that we don't want our users to crash their Solr installations and corrupt their indexes. Robert spent the last days on hunting the causes behind the CheckIndex failures in the facetting module (the so-called index corru(m)ption bug). He was able to produce some random seeds that triggered the bug. In fact, it is a reincarnation of the well known readVInt-bug (LUCENE-2975) we discovered before release of Lucene 3.1. It affects the most often used Lucene method during reading index contents from disk: DataInput.readVInt(). For optimization purposes we have several incarnations of this method dependent on the underlying data structures. Java 6 JDKs only triggered this bug in combination with MappedByteBuffer.getByte(), but now it was triggered almost everywhere! Especially when the new pulsing codec was used (which has a byte[] variant of this method), all was f*cked up! StandardPostingsReader was merging index segments, the wrong results of readVInt() were copied to the newly created index segments, and finally we produced corrupt indexes!
The warning mail was released...
Shortly before midnight CET I sent the warning mail, prepared in LUCENE-3349, to the mailing lists:
Hello Apache Lucene & Apache Solr users, Hello users of other Java-based Apache projects,
Oracle released Java 7 today. Unfortunately it contains hotspot compiler optimizations, which miscompile some loops. This can affect code of several Apache projects. Sometimes JVMs only crash, but in several cases, results calculated can be incorrect, leading to bugs in applications (see Hotspot bugs 7070134 [1], 7044738 [2], 7068051 [3]).
Apache Lucene Core and Apache Solr are two Apache projects, which are affected by these bugs, namely all versions released until today. Solr users with the default configuration will have Java crashing with SIGSEGV as soon as they start to index documents, as one affected part is the well-known Porter stemmer (see LUCENE-3335 [4]). Other loops in Lucene may be miscompiled, too, leading to index corruption (especially on Lucene trunk with pulsing codec; other loops may be affected, too - LUCENE-3346 [5]).
These problems were detected only 5 days before the official Java 7 release, so Oracle had no time to fix those bugs, affecting also many more applications. In response to our questions, they proposed to include the fixes into service release u2 (eventually into service release u1, see [6]). This means you cannot use Apache Lucene/Solr with Java 7 releases before Update 2! If you do, please don't open bug reports, it is not the committers' fault! At least disable loop optimizations using the -XX:-UseLoopPredicate JVM option to not risk index corruptions.
Please note: Also Java 6 users are affected, if they use one of those JVM options, which are not enabled by default: -XX:+OptimizeStringConcat or -XX:+AggressiveOpts
It is strongly recommended not to use any hotspot optimization switches in any Java version without extensive testing!
In case you upgrade to Java 7, remember that you may have to reindex, as the unicode version shipped with Java 7 changed and tokenization behaves differently (e.g. lowercasing). For more information, read JRE_VERSION_MIGRATION.txt in your distribution package!
On behalf of the Lucene project,Uwe
What then happened is well-known:
- The first newsticker (heise online) released the message (because Europe is earlier awake than the U.S.): "Java 7 legt Lucene und Solr lahm" (they added some nice background information, so it was a really good article).
- Shortly after, the Austrians posted a report, interestingly they combined our warning with the official release article about Java 7 (Der Standard): "Oracle stellt Java 7 vor"
- Soon after, German "JAXenter" / "IT Republic" published the news, this time directly mentioning myself (I wrote some Lucene-related articles for their journal "Java Magazin" in the past): "Apache warnt vor Bugs in JDK 7"
- Some Japanese guys also posted this, I have no idea what they wrote, but it is the same message behind: "Lucene/SolrをJava 7で使うときの注意(あるいはJava 6以前でもホットスポットのバグを踏む可能性あり)"
- Now it was time for the British news, UK's "Heise Online" published the English variant of the first article mentioned above: Java 7 paralyses Lucene and Solr (Very nice title! In fact this is a translation of my warning message from "German-American" English to real British English).
- Shortly after that, JAXenter posted: "Java 7 Could Cause Bugs in Some Apache Projects"
- Followed by eWeek Europe: "Apache Developers: Java 7 Contains Bugs"
Until now, California was still sleeping...
- The first American newsticker was CNET: "Oracle releases 'buggy' Java SE7"
- Followed by i-Programmer: "Java SE7 out now but with a bug"
In late afternoon CET, California woke up:
- Now the biggest U.S. newsticker (InfoWorld) posted its article: "Apache and Oracle warn of serious Java 7 compiler bugs" (the funny thing here is that suddenly not only "Apache" warned on the Java 7 bugs, also "Oracle" - looks like "whisper down the lane")
On the evening before, Hoss already posted the following blog post on the Lucid Imagination website: "Don’t Use Java 7, For Anything".
- After California woke up, the first person posted Hoss' blog post on Slashdot: "Java 7 Ships With Severe Bug" referring to Hoss' blog post.
And then the whole thing went crazy: On Twitter, new posts refering to Slashdot, InfoWorld, and, finally, Hoss' blog post on Lucid Imagination appeared every few seconds. There were more tweets stating "Don't use Java 7, For Anything" than tweets about "the first Java release since 4 years".
A little bit later I recognized the first pro-Oracle blog trying to explain some background information: "Don't Use Java 7? Are you kidding me?" (Markus Eisele). Thanks for posting this!
I went to sleep and on the following day, the original Oracle Bug report that caused this was upgraded to priority "HIGH" - yeah. So we will hopefully get a corrected Java 7 release quite soon in Update Pack 1!
Finally I wanted to say thank you to the other Lucene committers that helped during investigation: Robert Muir, Yonik Seeley, Dawid Weiss, and Mike McCandless. And of course Hoss for his funny caption on Lucid Imagination's blog!
Update: (2011-08-01)
On Saturday morning CET, Cay Horstmann, professor of computer science at San Jose State University, compared the JDK 7 bugs with the Pentium FDIV bug in 1994. In his blog article "Java 7 Unsafe at Any Speed?", he stated that SIGSEGVs are easy bugs; much more serious are hidden bugs, only appearing under certain conditions and then silently produce wrong computation results. On Sunday evening CET, a user asked on Stackoverflow: "How serious is the Java7 'Solr/Lucene' bug?" In his response, Robert Muir described his work how to track down Java's "pentium bug" and circumvent it, with no success (see above).
Luckily, Oracle raised the priority of the SIGSEGV bug (#7070134) to "HIGH", but the other two bugs are still on "MEDIUM". One of them (#7044738) is exactly such a bug that Cay Horstmann described in his blog post. We applied the patches for all three bugs to our JVM installation, the "HIGH" priority one only fixes the SIGSEGV. Since our wrong readVInt-calculations in the method's loop are fixed by the combination of all three patches, wouldn't it be a good idea for Oracle, to rate all three issues with priority "HIGH"? Otherwise it could happen that we get service pack 1 only with the SIGSEGV bug fixed, but still silently producing corrupt indexes?
On Monday, the German "JAXenter" / "IT Republic" published a nice article: "Wie gravierend sind die Bugs in JDK7 wirklich?", referring to the above posts. There is also an English version available: "Java 7 Causes Headaches for Lucene and Solr Users".
I went to sleep and on the following day, the original Oracle Bug report that caused this was upgraded to priority "HIGH" - yeah. So we will hopefully get a corrected Java 7 release quite soon in Update Pack 1!
Finally I wanted to say thank you to the other Lucene committers that helped during investigation: Robert Muir, Yonik Seeley, Dawid Weiss, and Mike McCandless. And of course Hoss for his funny caption on Lucid Imagination's blog!
Update: (2011-08-01)
On Saturday morning CET, Cay Horstmann, professor of computer science at San Jose State University, compared the JDK 7 bugs with the Pentium FDIV bug in 1994. In his blog article "Java 7 Unsafe at Any Speed?", he stated that SIGSEGVs are easy bugs; much more serious are hidden bugs, only appearing under certain conditions and then silently produce wrong computation results. On Sunday evening CET, a user asked on Stackoverflow: "How serious is the Java7 'Solr/Lucene' bug?" In his response, Robert Muir described his work how to track down Java's "pentium bug" and circumvent it, with no success (see above).
Luckily, Oracle raised the priority of the SIGSEGV bug (#7070134) to "HIGH", but the other two bugs are still on "MEDIUM". One of them (#7044738) is exactly such a bug that Cay Horstmann described in his blog post. We applied the patches for all three bugs to our JVM installation, the "HIGH" priority one only fixes the SIGSEGV. Since our wrong readVInt-calculations in the method's loop are fixed by the combination of all three patches, wouldn't it be a good idea for Oracle, to rate all three issues with priority "HIGH"? Otherwise it could happen that we get service pack 1 only with the SIGSEGV bug fixed, but still silently producing corrupt indexes?
On Monday, the German "JAXenter" / "IT Republic" published a nice article: "Wie gravierend sind die Bugs in JDK7 wirklich?", referring to the above posts. There is also an English version available: "Java 7 Causes Headaches for Lucene and Solr Users".
Update: (2011-08-03)
Yesterday I had some interviews with journalists/bloggers:
- One of them was again on the German "JAXenter" / "IT Republic": "JDK7-Bugs im Fokus: 'Wenn die Fehler auftreten, sind sie schwerwiegend'". They also published the view of Markus Eisele: "JDK7-Bugs im Fokus: 'Ich gehe davon aus, dass aktuell nicht viele Anwender betroffen sein werden'"
- The Silicon Valley analyst, John K. Waters, published in his blog the following interview: "More on Java 7 Bug: Q&A with Uwe, Comments from Geir"
Update: (2011-08-04)
Today, Neil McAllister published "Oracle: Java's worst enemy" on the InfoWorld newsticker. When California woke up in the evening CET, someone posted this article on Slashdot with the title "Oracle's Java Policies Are Destroying the Community", resulting in a heated discussion and a high tweet rate again.
Update: (2011-08-10)
The last days, community and journalists were blogging about the backgrounds:
- Andrew Binstock (Chief editor, Dr. Dobbs Journal): "Sloppy Work at Oracle"
- Richard Mayhew (The Serverside): "Lucene should just shut up about Java 7"
- Ted Neward (DZone): "Of communities, companies, and bugs (Or, "Dr Dobbs Journal is a slut!")"
- A nice conclusion was published on German JAXenter: "Who is bad? Von Java 7, Bugs und bösen Buben"
- Jessica Thornsby (UK JAXenter) posted: "Java 7 Bugs: Should the Release Have Been Delayed?"
Update: (2011-08-12)
Oracle published a FAQ about Java 7, that clarifies when and how the Lucene-related fixes in Java 7 will be released. They also mentioned another bug (#7077439), they investigated while fixing the others. Unfortunately the page is not accessible to the public, but the fix and an explanation is already reviewed and committed to OpenJDK. Does this mean, Oracle started to try and test Lucene builds with their JDK?
Thanks, nice summary to understand the Java7 bug history better. I'm curious how the Oracle story sounds. :-)
ReplyDelete"Lucene/SolrをJava 7で使うときの注意(あるいはJava 6以前でもホットスポットのバグを踏む可能性あり)"
ReplyDeleteMy Japanese is crappy, but here it goes: "Warning when you use Lucene/Solr with Java 7 (probably there is a possibility for Hotspot Bugs appearing in versions up to Java 6 as well)"
BTW, interesting post!
Regarding the ordering issue in Class.getMethods (and getDeclaredMethods), you may want to look at and vote on http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6807471
ReplyDeleteAndrew: We recognized this behaviour for the reflection layer in Java 7, but it is no bug at all, the javadocs explicitely state that the returned arrays have no explicit order. The bug was in fact on our side, as the JUnit tests of Solr expected the array ordered like the method declarations in the source file (which was always the case in previous Java versions, but there was no guarantee).
ReplyDeletenice story, thanks.
ReplyDeleteUnlike you, I don´t think there´s anything to compliment in Hoss´ "creative" headline. Yes, it might have succeeded in getting a fix done sooner, but at what cost?.
ReplyDelete"And then the whole thing went crazy" is quite an understatement. I´m sure the foes of Java, you know the ones that wanted to kill Java since the 1.0 days, will use the worst-titled stories quoted above to discredit Java as a platform and promote their alternative solutions.
"Don´t use Java 7, at all" is plain FUD... it prevents end-users on desktop computers from trying the new release which based on my tests gives much better performance.
I´ve been using JRE7 to run the kind of desktop Java apps that I use daily (Java Image Editor, Vuze, muCommander withoout a single crash -so far, knock on wood ;). And I bet there´s quite a bit of loops in that code.
Just my $0.02
Oh "Don´t use Java 7, at all" and the following backlash only conveys the message that "Oracle releases buggy code" and implicitely that "they don´t know what they´re doing" and are not to be trusted.
ReplyDeleteIt´s ironic that we don´t see headlines like "Microsoft releases buggy software" given Microsoft´s track record and the sole existance of "Patch tuesday".
FC
Hossman did an awesome job. He probably prevented more cases of index corruption with one blog post than I ever have hammering on Lucene's tests.
ReplyDeleteThat's fascinating! When will the movie be coming out? I wonder who they will get to play you...
ReplyDeleteThank you for a great post. The movie - what stops you from doing a YouTube trailer? In this style: Hamlet
ReplyDeleteGreat story! I can't see any point in blaming Oracle developers as well as anybody else.
ReplyDeleteHad they had the bug fixed before the release, it would be better, but who knows their internal matters? At least they've done the release :)
So, a huge amount of time has passed. Is it now fixed?
ReplyDeleteYou can read in the later blog posts that the bugs are fixed since Oracle Java 7u1 and 6u29.
ReplyDeleteGreat post, and great website. Thanks for the information! ChoreRelief
ReplyDeletehttps://www.togelpedia.pro
ReplyDeleteWhat kinds of searches could direct another consumer for this content?. see Not every site allows you to build links that point to other sites.
ReplyDeletePrediksi sgp :
ReplyDeletehttp://45.32.105.226
http://45.32.105.226
http://45.32.105.226
http://45.32.105.226
http://45.32.105.226
http://45.32.105.226
http://45.32.105.226
I can’t imagine focusing long enough to research; much less write this kind of article. You’ve outdone yourself with this material. This is great content. House Cleaning
ReplyDeletesddsa
ReplyDeleteLearning how to do SEO can be a very time consuming task, and if you want to get up to a standard where you can claim to be an SEO "expert" you will have to put a lot of hours in. To be able to learn the discipline well, you will also need to have previous experience of web design, web content and marketing strategies, if you want to be able to understand some of the subtle nuances about what is required from you. You will need to be able to dedicate a lot of time to keeping up with the latest strategies, or else you could get left behind. Therefore becoming your own SEO expert requires more than just a one off time-cost. However, if you have enough time to do this, without detracting from your other business commitments, then you may be able to save yourself a little money on the costs of hiring professional SEO workers. seo company
ReplyDeleteGet this apk for awesome streaming service Hotstar Mod Apk
ReplyDeleteLive matches can also be seen here
dewacintaqq
ReplyDeletedewacintaqq
dewacintaqq
delimapoker
dominohalo
papadomino
Dewacintaqq
Dewacintaqq.com
Dewa cintaqq
Dewa cinta qq
daftar dewacintaqq
link alternatif dewacintaqq
dewacintaqq
dewacintaqq.com
dcpoker99
simbolcinta
gemarq
oppaiqq
procintaqq
Dominohalo
papadomino
Delimapoker
http://45.76.112.124/?ref=nisa2122
http://45.76.127.220/?ref=NISA21
Dewacintaqq
Dewacintaqq.com
Dewa cintaqq
Dewa cinta qq
훌륭한 포스트! 자세한 내용은 먹튀안전업체
ReplyDeleteVery interesting blog. A lot of blogs I see these days don't really provide anything that I'm interested in, but I'm most definitely interested in this one. 먹튀
ReplyDeleteI think this is an edifying post and it is useful and capable. I am always searching online for articles that can help me. I truly believe these site requirements additional consideration. I favor which a great deal. braces in tijuana
ReplyDelete