Showing posts with label Porter Stemmer. Show all posts
Showing posts with label Porter Stemmer. Show all posts

2011-12-21

JDK 7u2 released - How about Linux and other operating systems?

Last week, Oracle released Java 7 Update 2, another milestone. This release included, of course, all the fixes that were already in Update 1 (see also Oracle's page), especially those affecting Apache Lucene and Solr. Since my last post on this blog, I was investigating what changed and how other operating systems like Ubuntu/Redhat Linux and FreeBSD are supported (warning: sarcasm alert!)

Linux

First of all, you can of course download the official Linux packages from Oracle. But those are not automatically updated when a new release comes out. So most Linux users prefer to use the automatic update of their operating system. Unfortunately, at the beginning of this month, Ubuntu wrote in an announcement:
As of August 24th 2011, we no longer have permission to redistribute new Java packages as Oracle has retired the "Operating System Distributor License for Java".
Oracle has published an advisory about security issues in the version of Java we currently have in the partner archive. Some of these issues are currently being exploited in the wild. Due to the severity of the security risk, Canonical is immediately releasing a security update for the Sun JDK browser plugin which will disable the plugin on all machines. This will mitigate users' risk from malicious websites exploiting the vulnerable version of the Sun JDK.
In the near future (exact date TBD), Canonical will remove all Sun JDK packages from the Partner archive. This will be accomplished by pushing empty packages to the archive, so that the Sun JDK will be removed from all users machines when they do a software update. Users of these packages who have not migrated to an alternative solution will experience failures after the package updates have removed Oracle Java from the system.
If you are currently using the Oracle Java packages from the partner archive, you have two options: 
  • Install the OpenJDK packages that are provided in the main Ubuntu archive (openjdk-6-jdk or openjdk-6-jre for the virtual machine).
  • Manually install Oracle's Java software from their web site.
Unfortunately this means that we will never get an official Ubuntu package for Java 7! What are all these security bugs suddenly heavily exploited in the wild?

OK, the latest version of Ubuntu's JDK 6 was Update 26, so what security fixes came in Update 27, Update 29, and Update 30? I inspected the changelogs shipped with the openjdk6 and openjdk7 packages, which are now the "official Java support" for Ubuntu (and also Redhat) but there is something wrong: It's not even OpenJDK! OpenJDK is still on build 147 (as of their official download page) - which is the original Java 7 release that broke Apache Lucene and Apache Solr with index corru(m)ptions and SIGSEGVs. This means no Linux user can run our full text search engine on Linux, because it SIGSEGVs shortly after starting? But thats not what the Ubuntu package contains: What Ubuntu "sells" as OpenJDK is indeed a strange product named "IcedTea" - wtf is that?

IcedTea 2.0 was released on October 19, 2011 with a long ist of security fixes! But the ubuntu download still has the famous build number 147 in its version number: 7~b147-2.0-1ubuntu2 - how does this fit together? Redhat and Ubuntu both sell another product "IcedTea", but labeled as "OpenJDK"! As this is so widely used, this seems to lead to the fact that Oracle does not seem to update their original OpenJDK release anymore. The IcedCreamTea seems to be the "new" offical release? What about all non-Linux operating systems like FreeBSD (see below)? I think that's a bad idea, because it confuses users. Also, when reviewing fixed bugs in official Oracle releases you get an update number (current is Java 6u30 or Java 7u2), but with OpenJDK (sorry IcedTea) Linux packages you get version numbers that don't tell you any relation to Oracle's releases - useless!

In fact to come back to OpenJDK 6 package in Ubuntu: If you install this replacement package on your machine according to the howto on the Ubuntu webpage for the good sun-java6 package (which is u26) - you get an older hotspot version (hotspot version numbers are the only things that you can read and compare from "java -version" output)! Something around official Oracle JDK 6u24 - so in fact you get an older version - that's no upgrade, that's a downgrade! For OpenJDK 7 you get something like Oracle's JDK 7u0 but with thousands of patches applied.

To come back to the Lucene/Solr bugs: Yes, they are fixed in this mysterious OpenJDK/IcedTea 7 release, the long list of changes verifies that. If you download the wrong-named OpenJDK 7 package with the horrible build number 147 (openjdk-7-jdk 7~b147-2.0-1ubuntu2), you will not crash your JVM with Apache Solr and you can try it out with the new garbage collector (G1) and some performance improvements (indeed Lucene tests run faster with Java 7 on my box). It looks very stable.

The second shock on this day occurred when I was searching for the famous Lucene bugs in the list of fixed IcedTea issues. They appear there as one of the horrible security bugs with CVE numbers assigned (CVE-2011-3558and others)! This also explains why Oracle made the orginal porter stemmer bug report hidden! They also appear in the openjdk-6 Ubuntu packages - as horrible security bugs, too. So Ubuntu patched the antique 6u24 and older versions with patches for Java 6u29 [they also patched u20, where the bug was not existent, see here] - thats really strange. And again, confuses users!


And finally: The Lucene bugs seem to be one of the reasons to delete the sun-java6 packages from Ubuntu Partner repository in the future. How funny is this? Does anybody have an exploit except starting Apache Solr with the default configuration and -XX:+AggressiveOpts enabled? OK, it is really a security issue for users working on your Solr Search web frontend and suddenly produce corru(m)pt indexes on your machine! They might not find anything after this disaster.


FreeBSD

What about FreeBSD? It looks much worse: There is no new update for OpenJDK available until today, so you cannot use it to build a new Port. The Jenkins Server at Apache, running the Lucene tests, is still running the original OpenJDK7 b147 build that I patched during the summer to work around the Java 7 bugs. I think the problem is here, that Oracle no longer releases OpenJDK builds, because IcedTea is there. But IcedTea is Linux only!

Please note: This blog post is partially a little bit sarcastic, I just tell my feelings about the whole Linux-Ubuntu-OpenJDK-FreeBSD issue.

A short side note: PANGAEA now runs very stable and horrible fast for some operations with Lucene 3.5 (no Apache Solr) and official Oracle JDK 7u2 on Solaris x64 (MMapDirectory of course)! I wish you merry Xmas and a happy new year!

2011-10-19

Java 7 Update 1 released - Does it fix the Lucene index corru(m)ption and SIGSEGV bugs?

After the serious issues with the Java 7 GA release, Oracle released Update 1 of Java 7 yesterday. The first thing, of course, was to check the release notes, download the package, and finally run our Lucene test cases. When inspecting the download page and release notes on the Oracle web site, I got confused: The new release is Java 7u1, in contrast the developer preview released one week ago named Java 7u2 Developer Preview. Both builds have the same build number (b08). More strange is: The release notes of the developer preview differ from the release notes on the Update 1 release:
  • The official Update 1 release notes only list one bug of the three ones originally reported to Oracle: #7068051 (SIGSEGV in PhaseIdealLoop::build_loop_late_post on T5440). The other ones are not listed, so we cannot be sure that they are really fixed.
  • The famous SIGSEGV bug in Porter Stemmer (#7070134) is not listed at all, it also disappeared from the Oracle issue tracker. It seems like hidden to the public (maybe they declared it as confidental because it's security related???).

No matter what release notes say - I had to download the offical release package! I took some free time on the Lucene Eurocon Conference in Barcelona, downloaded the packages and installed them in parallel on my Windows 64bit Thinkpad. First I tried to run the Porter Stemmer test with 100 iterations on Java 7 GA, and verified that it crashed. Robert Muir joined and we tested the new U1 release: Test passes! This means, Hotspot issue #7070134 was fixed, but Oracle missed to put it into the release notes.

The second part of the investigations were more complicated: The index corru(m)ption bugs are more complicated to reproduce, as the virtual machine does not crash and simply produces corrupt indexes after merging segments in one of the facetting tests. We checked out the Lucene Trunk revision of the time, when the bug was first discovered (issue LUCENE-3346) and used the random seed mentioned on the issue. We were able to verify the bug with Java 7 GA (the indexes are corrupt 90% of the time), and luckily after 20 iterations of the same test and random seed in Java 7u1, we have seen no corrupt index! It seemed to us that Oracle maybe fixed Java issues #7044738 and #7068051, but missed to put both of them into the release notes. Of course, without an additional statement from Oracle, we cannot be sure, that the issues are really fixed!

Oracle also released Java 6 Update 29 yesterday. The release notes on that version doesn't mention any relation to the Lucene bugs, so we were not sure if this version is completely different to the Java 6u29 developer preview, released one week ago, which listed those bugs (unfortunately, the package is no longer available on the net), or if they also missed to mention them. A quick review as done for Java 7 showed, that Porter Stemmer no longer crashes with -XX:+AggressiveOpts, so the bug seems to be also fixed here, too. We were not able to actually discover any index corru(m)ption.

Finally, we can somehow verify that the bugs seem to be fixed for both versions, but without an official statement from Oracle (in their release notes), we cannot recommend to use Java 7u1 (and Java 6u29 with aggressive opts) with Lucene and Solr.

Once I will be back in Germany, I will try to get an updated FreeBSD package of Java 7 and install it on our Apache Jenkins server.

Update (2011-10-26)

Last night, Oracle updated the release notes of Java 7u1 and Java 6u29, stating that they fixed the three Lucene-relevant bugs (plus another one related to that). Based on this confirmation, it's now safe to use Java 7 Update 1 (and later) with Apache Lucene and Apache Solr.

Of course, there is still the recommendation not to use -XX:+AggressiveOpts on any JVM in production!

We are still waiting for updated OpenJDK packages to install this release on our build servers.

2011-07-31

The real story behind the Java 7 GA bugs affecting Apache Lucene / Solr

I started this blog about two months ago, but until now, I had no time to write any posts. Since Oracle's release of Java 7 on Thursday last week, a lot of myths around the problem with Porter stemmer, wrong calculations in pulsing codec, and corrupt indexes appeared on the web. Some guys already blamed the Apache Software Foundation because they may have used the hype around Java 7 to fight against Oracle (since they stepped out of JCP).

I want to start with a chronological summary about what happened since last weekend:

Last Saturday I woke up in the morning and had no real plans for the weekend. I checked my Lucene JIRA issues and after I helped my GSoC student, I read Heise newsticker. One news article stepped into my eye: "Gearing up for Java 7" (english version). I noticed that there is only one week left until Oracle plans to release Java 7. "Maybe we should add some Jenkins jobs to the Apache build server to test Apache Lucene/Solr trunk and 3.x branches?" As always, Robert and me had some quick Google Talk session and I pointed out my plans. Our first problem was that the Jenkins server of Lucene runs under FreeBSD, so the chances were high, that we get no recent release of OpenJDK 7 running. I already installed the 64bit Windows preview build from Oracle (b147) on my Thinkpad and checked out the Lucene Core tests, which ran fine.

A quick review showed that OpenJDK7 build 147 was already available in the FreeBSD ports collection. I logged onto the jenkins slave machine and started to update ports and built the openjdk7 package. In parallel I also rebuilt the openjdk6 package for the standard test runs (a fix for a IPv6 bug was already available, so we were able to use latest version). After approx 25 minutes, the build was done and the package installed. I cloned the Jenkins jobs for the half-hourly test builds (3.x and trunk), hacked the shell scripts a little bit and started the first build. Robert and me were watching the build output,... live.

During the tests of the new analyzers module suddenly we got a crash with SIGSEGV in TestPorterStemmer. As I was only running the core tests on my local machine, this was not yet discoverered. We were shocked, "this is just a stupid FreeBSD problem, I am sure... hmhmhm... but a few weeks ago a user on the Solr mailing list reported a crash in PorterStemmer, too" (see this post). I started the analyzer tests locally on my windows box, bumm - same issue. Important was that the issue only appeared when I raised the number of test iterations, with the default of 1, the test passed. That clearly shows, there was an hotspot optimization bug, as they are only triggered when a method is executed ten thousand times.

I stopped the 2-hourly builds with Java 7 on the Jenkins slave, as we did not want to spam the mailing list with failed build reports. Robert opened LUCENE-3335 and started to investigate the problem. Step for step, he disabled compilation of several porter stemmer methods until he reached PorterStemmer.ends(). Robert opened a bug report at Oracles bug tracker: #7070134

The rest of the day we also fixed another small issues in our test cases (TestWordDelimiterFilter failed because it used a character that changed properties in Java 7 - Robert committed a fix for this).

On Sunday morning I gave Jenkins another chance: To proceed with testing on Jenkins, I hacked the shell scripts again to pass the following args to the Lucene test suite (only for Java 7 builds): -Dargs="-XX:CompileCommand=exclude,org/apache/lucene/analysis/en/PorterStemmer,ends"

The Lucene builds then passed fine most of the time, but we had some additional failures in another test inside the new facetting module: LUCENE-3346 (see below). But when starting to test the Solr builds, the problems began again: Some tests sometimes failed with strange error messages, but only randomly. It took us two hours and a hint from Yonik to find the issue behind this (SOLR-2673):
"It looks like with Java7, that the test methods are not being run in the order they are declared in the file. That's probably the root cause of the other Solr test failures with Java7, too."
A quick test with Java 6, inserting a Collections.shuffle() into the customized Lucene TestRunner, produced the same bugs. Additionally, the JDK docs nowhere explicitely state that Class.getMethods() does return in a particular order:
"Returns an array containing Method objects reflecting all the public member methods of the class or interface represented by this Class object, including those declared by the class or interface and those inherited from superclasses and superinterfaces. Array classes return all the (public) member methods inherited from the Object class. The elements in the array returned are not sorted and are not in any particular order."
The next few hours were then spent in fixing all Solr tests that relied on order of method executions (which is a violation against testing principles). On the evening we had all tests running fine, except the facetting bug and a random failure in the ICU module: LUCENE-3344 (I will come back to that later).

Status on Sunday: Until now, no response from Oracle about the new bug...

Monday morning CET, still no news from Oracle!

In late afternoon we got the first response; Robert informed me:
"Bug is visible, but it's priority is LOW. I am sure, they will release Java 7 with PorterStemmer crashing. All Solr users with the default configuration will blame us!"
Dawid Weiss then contacted the hotspot-compiler-dev mailing list; I subscribed to it and started to ask them questions. The developer responsible for these bugs (Vladimir Kozlov) sent me in the evening links to patches that might fix the issue. I also got an explanation, why we had this issue, that also exists in Java 6, but is hidden there due to optimizations not enabled (please read the thread for more information).

On Tuesday morning CET I started to produce a FreeBSD ports patch, so I was able to compile our openjdk7 package on the Jenkins slave with the proposed fixes (see patch-0uwe.patch). I changed the Jenkins configuration again and suddenly all tests passed, even the facetting tests!

We then worked around the last Java7-related test bug (LUCENE-3344), which made ICU fail on classloading ULocale. The problem is caused by some new locales in Java 7, that lead to a chicken-and-egg problem in the static initializer of ULocale. It initializes its default locale from the JDK locale in a static ctor. Until the default ULocale instance is created, the default is not set in ULocale. But ULocale's ctor itsself needs the default locale to fetch some ressource bundles and throws NPE. We opened a bug report at ICU (#8734) and added a workaround into our LuceneTestCase's locale randomization.

Time goes on...

On Thursday, lunch time CET, Java 7 was released by Oracle. Still hoping for a wonder, I downloaded the Windows x64 build from the official download location. Clicking on install it reported: "this version of the JDK is already installed on your computer, do you want to reinstall?" I did this and I was able to verify: it's exactly the same version as released on June 27th (you can verify this, too: the timestamp on the signature of the Windows Installer EXE file is June 27th). So Oracle ignored all bugs (not only ours) in the preview release and simply released a one month old package! So what was the sense of the preview release? They could have released it one month before! It was for sure not intended for public review and bug hunting!

In the evening (CET) I then opened a new issue (LUCENE-3349) as we already discussed on IRC, that we don't want our users to crash their Solr installations and corrupt their indexes. Robert spent the last days on hunting the causes behind the CheckIndex failures in the facetting module (the so-called index corru(m)ption bug). He was able to produce some random seeds that triggered the bug. In fact, it is a reincarnation of the well known readVInt-bug (LUCENE-2975) we discovered before release of Lucene 3.1. It affects the most often used Lucene method during reading index contents from disk: DataInput.readVInt(). For optimization purposes we have several incarnations of this method dependent on the underlying data structures. Java 6 JDKs only triggered this bug in combination with MappedByteBuffer.getByte(), but now it was triggered almost everywhere! Especially when the new pulsing codec was used (which has a byte[] variant of this method), all was f*cked up! StandardPostingsReader was merging index segments, the wrong results of readVInt() were copied to the newly created index segments, and finally we produced corrupt indexes!

The warning mail was released...

Shortly before midnight CET I sent the warning mail, prepared in LUCENE-3349, to the mailing lists:
Hello Apache Lucene & Apache Solr users, Hello users of other Java-based Apache projects,

Oracle released Java 7 today. Unfortunately it contains hotspot compiler optimizations, which miscompile some loops. This can affect code of several Apache projects. Sometimes JVMs only crash, but in several cases, results calculated can be incorrect, leading to bugs in applications (see Hotspot bugs 7070134 [1], 7044738 [2], 7068051 [3]).

Apache Lucene Core and Apache Solr are two Apache projects, which are affected by these bugs, namely all versions released until today. Solr users with the default configuration will have Java crashing with SIGSEGV as soon as they start to index documents, as one affected part is the well-known Porter stemmer (see LUCENE-3335 [4]). Other loops in Lucene may be miscompiled, too, leading to index corruption (especially on Lucene trunk with pulsing codec; other loops may be affected, too - LUCENE-3346 [5]).

These problems were detected only 5 days before the official Java 7 release, so Oracle had no time to fix those bugs, affecting also many more applications. In response to our questions, they proposed to include the fixes into service release u2 (eventually into service release u1, see [6]). This means you cannot use Apache Lucene/Solr with Java 7 releases before Update 2! If you do, please don't open bug reports, it is not the committers' fault! At least disable loop optimizations using the -XX:-UseLoopPredicate JVM option to not risk index corruptions.

Please note: Also Java 6 users are affected, if they use one of those JVM options, which are not enabled by default: -XX:+OptimizeStringConcat or -XX:+AggressiveOpts

It is strongly recommended not to use any hotspot optimization switches in any Java version without extensive testing!

In case you upgrade to Java 7, remember that you may have to reindex, as the unicode version shipped with Java 7 changed and tokenization behaves differently (e.g. lowercasing). For more information, read JRE_VERSION_MIGRATION.txt in your distribution package!

On behalf of the Lucene project,
Uwe


What then happened is well-known:


Until now, California was still sleeping...


In late afternoon CET, California woke up:


On the evening before, Hoss already posted the following blog post on the Lucid Imagination website: "Don’t Use Java 7, For Anything".

  • After California woke up, the first person posted Hoss' blog post on Slashdot: "Java 7 Ships With Severe Bug" referring to Hoss' blog post.

And then the whole thing went crazy: On Twitter, new posts refering to Slashdot, InfoWorld, and, finally, Hoss' blog post on Lucid Imagination appeared every few seconds. There were more tweets stating "Don't use Java 7, For Anything" than tweets about "the first Java release since 4 years".

A little bit later I recognized the first pro-Oracle blog trying to explain some background information: "Don't Use Java 7? Are you kidding me?" (Markus Eisele). Thanks for posting this!

I went to sleep and on the following day, the original Oracle Bug report that caused this was upgraded to priority "HIGH" - yeah. So we will hopefully get a corrected Java 7 release quite soon in Update Pack 1!

Finally I wanted to say thank you to the other Lucene committers that helped during investigation: Robert Muir, Yonik Seeley, Dawid Weiss, and Mike McCandless. And of course Hoss for his funny caption on Lucid Imagination's blog!

Update: (2011-08-01)

On Saturday morning CET, Cay Horstmann, professor of computer science at San Jose State University, compared the JDK 7 bugs with the Pentium FDIV bug in 1994. In his blog article "Java 7 Unsafe at Any Speed?", he stated that SIGSEGVs are easy bugs; much more serious are hidden bugs, only appearing under certain conditions and then silently produce wrong computation results. On Sunday evening CET, a user asked on Stackoverflow: "How serious is the Java7 'Solr/Lucene' bug?" In his response, Robert Muir described his work how to track down Java's "pentium bug" and circumvent it, with no success (see above).

Luckily, Oracle raised the priority of the SIGSEGV bug (#7070134) to "HIGH", but the other two bugs are still on "MEDIUM". One of them (#7044738) is exactly such a bug that Cay Horstmann described in his blog post. We applied the patches for all three bugs to our JVM installation, the "HIGH" priority one only fixes the SIGSEGV. Since our wrong readVInt-calculations in the method's loop are fixed by the combination of all three patches, wouldn't it be a good idea for Oracle, to rate all three issues with priority "HIGH"? Otherwise it could happen that we get service pack 1 only with the SIGSEGV bug fixed, but still silently producing corrupt indexes?

On Monday, the German "JAXenter" / "IT Republic" published a nice article: "Wie gravierend sind die Bugs in JDK7 wirklich?", referring to the above posts. There is also an English version available: "Java 7 Causes Headaches for Lucene and Solr Users".

Update: (2011-08-03)

Yesterday I had some interviews with journalists/bloggers:

Update: (2011-08-04)

Today, Neil McAllister published "Oracle: Java's worst enemy" on the InfoWorld newsticker. When California woke up in the evening CET, someone posted this article on Slashdot with the title "Oracle's Java Policies Are Destroying the Community", resulting in a heated discussion and a high tweet rate again.

Update: (2011-08-10)

The last days, community and journalists were blogging about the backgrounds:
Two days ago, Oracle offered to some committers that they could get access to early builds of the Java SE before they are released. This would allow us to check compatibility of bugfix releases and service packs of the Java SE with Apache Lucene and Solr. All this is covered by their Java CAP (Compatibility and Performance Program).


Update: (2011-08-12)

Oracle published a FAQ about Java 7, that clarifies when and how the Lucene-related fixes in Java 7 will be released. They also mentioned another bug (#7077439), they investigated while fixing the others. Unfortunately the page is not accessible to the public, but the fix and an explanation is already reviewed and committed to OpenJDK. Does this mean, Oracle started to try and test Lucene builds with their JDK?