2012-07-18

Use Lucene’s MMapDirectory on 64bit platforms, please!

Don’t be afraid – Some clarification to common misunderstandings

Since version 3.1, Apache Lucene and Solr use MMapDirectory by default on 64bit Windows and Solaris systems; since version 3.3 also for 64bit Linux systems. This change lead to some confusion among Lucene and Solr users, because suddenly their systems started to behave differently than in previous versions. On the Lucene and Solr mailing lists a lot of posts arrived from users asking why their Java installation is suddenly consuming three times their physical memory or system administrators complaining about heavy resource usage. Also consultants were starting to tell people that they should not use MMapDirectory and change their solrconfig.xml to work instead with slow SimpleFSDirectory or NIOFSDirectory (which is much slower on Windows, caused by a JVM bug #6265734). From the point of view of the Lucene committers, who carefully decided that using MMapDirectory is the best for those platforms, this is rather annoying, because they know, that Lucene/Solr can work with much better performance than before. Common misinformation about the background of this change causes suboptimal installations of this great search engine everywhere.

In this blog post, I will try to explain the basic operating system facts regarding virtual memory handling in the kernel and how this can be used to largely improve performance of Lucene (“VIRTUAL MEMORY for DUMMIES”). It will also clarify why the blog and mailing list posts done by various people are wrong and contradict the purpose of MMapDirectory. In the second part I will show you some configuration details and settings you should take care of to prevent errors like “mmap failed” and suboptimal performance because of stupid Java heap allocation.

Virtual Memory[1]

Let’s start with your operating system’s kernel: The naive approach to do I/O in software is the way, you have done this since the 1970s – the pattern is simple: whenever you have to work with data on disk, you execute a syscall to your operating system kernel, passing a pointer to some buffer (e.g. a byte[] array in Java) and transfer some bytes from/to disk. After that you parse the buffer contents and do your program logic. If you don’t want to do too many syscalls (because those may cost a lot processing power), you generally use large buffers in your software, so synchronizing the data in the buffer with your disk needs to be done less often. This is one reason, why some people suggest to load the whole Lucene index into Java heap memory (e.g., by using RAMDirectory).

But all modern operating systems like Linux, Windows (NT+), MacOS X, or Solaris provide a much better approach to do this 1970s style of code by using their sophisticated file system caches and memory management features. A feature called “virtual memory” is a good alternative to handle very large and space intensive data structures like a Lucene index. Virtual memory is an integral part of a computer architecture; implementations require hardware support, typically in the form of a memory management unit (MMU) built into the CPU. The way how it works is very simple: Every process gets his own virtual address space where all libraries, heap and stack space is mapped into. This address space in most cases also start at offset zero, which simplifies loading the program code because no relocation of address pointers needs to be done. Every process sees a large unfragmented linear address space it can work on. It is called “virtual memory” because this address space has nothing to do with physical memory, it just looks like so to the process. Software can then access this large address space as if it were real memory without knowing that there are other processes also consuming memory and having their own virtual address space. The underlying operating system works together with the MMU (memory management unit) in the CPU to map those virtual addresses to real memory once they are accessed for the first time. This is done using so called page tables, which are backed by TLBs located in the MMU hardware (translation lookaside buffers, they cache frequently accessed pages). By this, the operating system is able to distribute all running processes’ memory requirements to the real available memory, completely transparent to the running programs.


Schematic drawing of virtual memory
(image from Wikipedia [1], http://en.wikipedia.org/wiki/File:Virtual_memory.svg, licensed by CC BY-SA 3.0)

By using this virtualization, there is one more thing, the operating system can do: If there is not enough physical memory, it can decide to “swap out” pages no longer used by the processes, freeing physical memory for other processes or caching more important file system operations. Once a process tries to access a virtual address, which was paged out, it is reloaded to main memory and made available to the process. The process does not have to do anything, it is completely transparent. This is a good thing to applications because they don’t need to know anything about the amount of memory available; but also leads to problems for very memory intensive applications like Lucene.

Lucene & Virtual Memory

Let’s take the example of loading the whole index or large parts of it into “memory” (we already know, it is only virtual memory). If we allocate a RAMDirectory and load all index files into it, we are working against the operating system: The operating system tries to optimize disk accesses, so it caches already all disk I/O in physical memory. We copy all these cache contents into our own virtual address space, consuming horrible amounts of physical memory (and we must wait for the copy operation to take place!). As physical memory is limited, the operating system may, of course, decide to swap out our large RAMDirectory and where does it land? – On disk again (in the OS swap file)! In fact, we are fighting against our O/S kernel who pages out all stuff we loaded from disk [2]. So RAMDirectory is not a good idea to optimize index loading times! Additionally, RAMDirectory has also more problems related to garbage collection and concurrency. Because the data residing in swap space, Java’s garbage collector has a hard job to free the memory in its own heap management. This leads to high disk I/O, slow index access times, and minute-long latency in your searching code caused by the garbage collector driving crazy.

On the other hand, if we don’t use RAMDirectory to buffer our index and use NIOFSDirectory or SimpleFSDirectory, we have to pay another price: Our code has to do a lot of syscalls to the O/S kernel to copy blocks of data between the disk or filesystem cache and our buffers residing in Java heap. This needs to be done on every search request, over and over again.

Memory Mapping Files

The solution to the above issues is MMapDirectory, which uses virtual memory and a kernel feature called “mmap” [3] to access the disk files.

In our previous approaches, we were relying on using a syscall to copy the data between the file system cache and our local Java heap. How about directly accessing the file system cache? This is what mmap does!

Basically mmap does the same like handling the Lucene index as a swap file. The mmap() syscall tells the O/S kernel to virtually map our whole index files into the previously described virtual address space, and make them look like RAM available to our Lucene process. We can then access our index file on disk just like it would be a large byte[] array (in Java this is encapsulated by a ByteBuffer interface to make it safe for use by Java code). If we access this virtual address space from the Lucene code we don’t need to do any syscalls, the processor’s MMU and TLB handles all the mapping for us. If the data is only on disk, the MMU will cause an interrupt and the O/S kernel will load the data into file system cache. If it is already in cache, MMU/TLB map it directly to the physical memory in file system cache. It is now just a native memory access, nothing more! We don’t have to take care of paging in/out of buffers, all this is managed by the O/S kernel. Furthermore, we have no concurrency issue, the only overhead over a standard byte[] array is some wrapping caused by Java’s ByteBuffer interface (it is still slower than a real byte[] array, but  that is the only way to use mmap from Java and is much faster than all other directory implementations shipped with Lucene). We also waste no physical memory, as we operate directly on the O/S cache, avoiding all Java GC issues described before.

What does this all mean to our Lucene/Solr application?
  • We should not work against the operating system anymore, so allocate as less as possible heap space (-Xmx Java option). Remember, our index accesses rely on passed directly to O/S cache! This is also very friendly to the Java garbage collector.
  • Free as much as possible physical memory to be available for the O/S kernel as file system cache. Remember, our Lucene code works directly on it, so reducing the number of paging/swapping between disk and memory. Allocating too much heap to our Lucene application hurts performance! Lucene does not require it with MMapDirectory.

Why does this only work as expected on operating systems and Java virtual machines with 64bit?

One limitation of 32bit platforms is the size of pointers, they can refer to any address within 0 and 232-1, which is 4 Gigabytes. Most operating systems limit that address space to 3 Gigabytes because the remaining address space is reserved for use by device hardware and similar things. This means the overall linear address space provided to any process is limited to 3 Gigabytes, so you cannot map any file larger than that into this “small” address space to be available as big byte[] array. And when you mapped that one large file, there is no virtual space (address like “house number”) available anymore. As physical memory sizes in current systems already have gone beyond that size, there is no address space available to make use for mapping files without wasting resources (in our case “address space”, not physical memory!).

On 64bit platforms this is different: 264-1 is a very large number, a number in excess of 18 quintillion bytes, so there is no real limit in address space. Unfortunately, most hardware (the MMU, CPU’s bus system) and operating systems are limiting this address space to 47 bits for user mode applications (Windows: 43 bits) [4]. But there is still much of addressing space available to map terabytes of data.

Common misunderstandings

If you have read carefully what I have told you about virtual memory, you can easily verify that the following is true:
  • MMapDirectory does not consume additional memory and the size of mapped index files is not limited by the physical memory available on your server. By mmap() files, we only reserve address space not memory! Remember, address space on 64bit platforms is for free!
  • MMapDirectory will not load the whole index into physical memory. Why should it do this? We just ask the operating system to map the file into address space for easy access, by no means we are requesting more. Java and the O/S optionally provide the option to try loading the whole file into RAM (if enough is available), but Lucene does not use that option (we may add this possibility in a later version).
  • MMapDirectory does not overload the server when “top” reports horrible amounts of memory. “top” (on Linux) has three columns related to memory: “VIRT”, “RES”, and “SHR”. The first one (VIRT, virtual) is reporting allocated virtual address space (and that one is for free on 64 bit platforms!). This number can be multiple times of your index size or physical memory when merges are running in IndexWriter. If you have only one IndexReader open it should be approximately equal to allocated heap space (-Xmx) plus index size. It does not show physical memory used by the process. The second column (RES, resident) memory shows how much (physical) memory the process allocated for operating and should be in the size of your Java heap space. The last column (SHR, shared) shows how much of the allocated virtual address space is shared with other processes. If you have several Java applications using MMapDirectory to access the same index, you will see this number going up. Generally, you will see the space needed by shared system libraries, JAR files, and the process executable itself (which are also mmapped).

How to configure my operating system and Java VM to make optimal use of MMapDirectory?

First of all, default settings in Linux distributions and Solaris/Windows are perfectly fine. But there are some paranoid system administrators around, that want to control everything (with lack of understanding). Those limit the maximum amount of virtual address space that can be allocated by applications. So please check that “ulimit -v” and “ulimit -m” both report “unlimited”, otherwise it may happen that MMapDirectory reports “mmap failed” while opening your index. If this error still happens on systems with lot’s of very large indexes, each of those with many segments, you may need to tune your kernel parameters in /etc/sysctl.conf: The default value of vm.max_map_count is 65530, you may need to raise it. I think, for Windows and Solaris systems there are similar settings available, but it is up to the reader to find out how to use them.

For configuring your Java VM, you should rethink your memory requirements: Give only the really needed amount of heap space and leave as much as possible to the O/S. As a rule of thumb: Don’t use more than ¼ of your physical memory as heap space for Java running Lucene/Solr, keep the remaining memory free for the operating system cache. If you have more applications running on your server, adjust accordingly. As usual the more physical memory the better, but you don’t need as much physical memory as your index size. The kernel does a good job in paging in frequently used pages from your index.

A good possibility to check that you have configured your system optimally is by looking at both "top" (and correctly interpreting it, see above) and the similar command "iotop" (can be installed, e.g., on Ubuntu Linux by "apt-get install iotop"). If your system does lots of swap in/swap out for the Lucene process, reduce heap size, you possibly used too much. If you see lot's of disk I/O, buy more RUM (Simon Willnauer) so mmapped files don't need to be paged in/out all the time, and finally: buy SSDs.

Happy mmapping!

Bibliography

2012-07-11

The Policeman’s Horror: Default Locales, Default Charsets, and Default Timezones

Time for a tool to prevent any effects coming from them!

Did you ever try to run software downloaded from the net on a computer with Turkish locale? I think most of you never did that. And if you ask Turkish IT specialists, they will tell you: “It is better to configure your computer using any other locale, but not tr_TR”. I think you have no clue what I am talking about? Maybe this article gives you a hint: “A Cellphone’s Missing Dot Kills Two People, Puts Three More in Jail”.

What you see in lots of software is a so-called case-insensitive matching of keywords like parameter names or function names. This is implemented in most cases by lowercasing or upper-casing the input text and compare it with a list of already lowercased/uppercased items. This works in most cases fine, if you are anywhere in the world, except Turkey! Because most programmers don’t care about running their software in Turkey, they do not test their software under the Turkish locale.

But what happens with the case-insensitive matching if running in Turkey? Let’s take an example:

User enters “BILLY” in the search field of you application. The application then uses the approach presented before and lower-cases “BILLY” and then compares it to an internal table (e.g. our search index, parameter table, function table,...). So we search in this table for “billy”. So far so good, works perfect in USA, Germany, Kenia, almost everywhere - except Turkey. What happens in the Turkish locale when we lowercase “BILLY”? After reading the above article, you might expect it: The “BILLY”.toLowerCase() statement in Java returns “bılly” (note the dot-less i: 'ı' U+0131). You can try this out on your local machine without reconfiguring it to use the Turkish locale, just try the following Java code:
assertEquals(“bılly”, “BILLY”.toLowerCase(new Locale(“tr”,“TR”)));
The same happens vice versa, if you uppercase a ‘i’, it gets I with dot (‘İ’ U+0130). This is really serious, million lines of code out there in Java and other languages don’t take care that the String.toLowerCase() and String.toUpperCase() methods can optionally take a defined Locale (more about that later). Some examples from projects I am involved in:

  • Try to run an XSLT stylesheet using Apache XALAN-XSLTC (or Java 5’s internal XSLT interpreter) in the Turkish locale. It will fail with “unknown instruction”, because XALAN-XSLTC compiles the XSLT to Java Bytecode and somehow lowercases a virtual machine opcode before compiling it with BCEL (see XALANJ-2420, BCEL bug #38787).
  • The HTML SAX parser NekoHTML uses locale-less uppercasing/lowercasing to normalize charset names and element names. I opened a bug report (issue #3544334).
  • If you use PHP as your favourite scripting language, which is not case sensitive for class names and other language constructs, it will throw a compile error once you try to call a function with an “i” in it (see PHP bug #18556). Unfortunately it is unlikely that this serious bug is fixed in PHP 5.3 or 5.4!

The question is now: How to solve this?

The most correct way to do this is to not lowercase at all! For comparing case insensitive, Unicode defines “case folding”, which is a so-called canonical form of text where all upper/lower case of any character is normalized away. Unfortunately this case folded text may no longer be readable text (this depends on the implementation, but in most cases it is). It just ensures, that case-folded text can be compared to each other in a case-insensitive way. Unfortunately Java does not offer you a function to get this string, but ICU-4J can do (see UCharacter#foldCase). But Java offers something much better: String.equalsIgnoreCase(String), which internally handles case folding! But in lots of cases you cannot use this fantastic method, because you want to lookup such strings in a HashMap or other dictionary. Without modifying HashMap to use equalsIgnoreCase, this would never work. So we are back at lower-casing! As mentioned before, you can pass a locale to String.toLowerCase(), so the naive approach would be to tell Java, that we are in the US or using the English language: String.toLowerCase(Locale.US) or String.toLowerCase(Locale.ENGLISH). This produces identical results but is still not consistent. What happens if the US government decides to lowercase/uppercase like in Turkey? -- OK, don’t use Locale.US (this is also too US-centric). Locale.ENGLISH is fine and very generic, but languages also change over the years (who knows?), but we want to have it language invariant! If you are using Java 6, there is a much better constant: Locale.ROOT -- You should use this constant for our lowercase example: String.toLowerCase(Locale.ROOT).
You should start now and do a global search/replace on all your Java projects (if you do not rely on language specific presentation of text)! REALLY!
String.toLowerCase is not the only example of “automatic default locale usage” in the Java API. There are also things like transforming dates or numbers to strings. If you use the Formatter class, and you run it somewhere in another country, String.format(“%f”, 15.5f) may not always use a period (‘.’) as decimal separator; most Germans will know this. Passing a specific locale here helps in most cases. If you are writing a GUI in English language, pass Locale.ENGLISH everywhere, otherwise text output of numbers or dates may not match the language of your GUI! If you want Formatter to behave in a invariant way, use Locale.ROOT, too (then it will for sure format numbers with period and no comma for thousands, just like Float.toString(float) does).

A second problem affecting lot’s of software are two other system-wide configurable default settings: default charset/encoding and timezone. If you open a text file with FileReader or convert an InputStream to a Reader with InputStreamReader, Java assumes automatically, that the input is in the default platform encoding. This may be fine, if you want the text to be parsed by the defaults of the operating system -- but if you pass a text file together with your software package (maybe as resource in your JAR file) and then accidentally read it using the platform’s default charset... it’ll break your app! So my second recommendation:
Always pass a character set to any method converting bytes to strings (like InputStream <=> Reader, String.getBytes(),...). If you wrote the text file and ship it together with your app, only you know its encoding!
For timezones, similar examples can be found.

How this affects Apache Lucene!

Apache Lucene is a full-text search engine and deals with text from different languages all the time; Apache Solr is a enterprise search server on top of Lucene and deals with input documents in lots of different charsets and languages. It is therefore essential for a search library like Lucene to be as most independent from local machine settings as possible. A library must make it explicit what input it wants to have. So we require charsets and locales in all public and private APIs (or we only take e.g. java.io.Reader instead of InputStream if we expect text coming in), so the user must take care.

Robert Muir and I reviewed the source code of Apache Lucene and Solr for the coming version 4.0 (an alpha version is already available on Lucene’s homepage, documentation is here). We did this quite often, but whenever a new piece of code is committed to the source tree, it may happen that undefined locales, charsets, or similar things appear again. In most cases it is not the fault of the committer, this happens because auto-complete in IDE automatically lists possible methods and parameters to the developer. Often you select the easiest variant (like String.toLowerCase()).

Using default locales, charsets and timezones are in my opinion a big design issue in programming languages like Java. If there are locale-sensitive methods, those methods should take a locale, if you convert a byte[] stream to a char[] stream, a charset must be given. Automatically falling back to defaults is a no-go in the server environment. 
If a developer is interested in using the default locale of the user’s computer, he can always explicitely give the locale or charset. In our example this would be String.toLowerCase(Locale.getDefault()). This is more verbose, but it is obvious what the developer intends to do.

My proposal is to ban all those default charset and locale methods / classes in the Java API by deprecating them as soon as possible, so users stop using them implicit!


Robert’s and my intention is to automatically fail the nightly builds (or compilation on the developer’s machine) when somebody uses one of the above methods in Lucene’s or Solr’s source code. We looked at different solutions like PMD or FindBugs, but both tools are too sloppy to handle that in a consistent way (PMD does not have any “default charset” method detection and Findbugs has only a very short list of method signatures). In addition, both PMD and FindBugs are very slow and often fail to correctly detect all problems. For Lucene builds we only need a tool, that looks into the byte code of all generated Java classes of Apache Lucene and Solr, and fails the build if any signature that violates our requirements is found.

A new Tool for the Policeman

I started to hack a tool as a custom ANT task using ASM 4.0 (Lightweight Java Bytecode Manipulation Framework). The idea was to provide a list of methods signatures, field names and plain class names that should fail the build, once bytecode accesses it in any way. A first version of this task was published in issue LUCENE-4199, later improvements was to add support for fields (LUCENE-4202) and a sophisticated signature expansion to also catch calls to subclasses of the given signatures (LUCENE-4206).

In the meantime, Robert worked on the list of “forbidden” APIs. This is what came out in the first version:
java.lang.String#<init>(byte[])
java.lang.String#<init>(byte[],int)
java.lang.String#<init>(byte[],int,int)
java.lang.String#<init>(byte[],int,int,int)
java.lang.String#getBytes()
java.lang.String#getBytes(int,int,byte[],int) 
java.lang.String#toLowerCase()
java.lang.String#toUpperCase()
java.lang.String#format(java.lang.String,java.lang.Object[])
java.io.FileReader
java.io.FileWriter
java.io.ByteArrayOutputStream#toString()
java.io.InputStreamReader#<init>(java.io.InputStream)
java.io.OutputStreamWriter#<init>(java.io.OutputStream)
java.io.PrintStream#<init>(java.io.File)
java.io.PrintStream#<init>(java.io.OutputStream)
java.io.PrintStream#<init>(java.io.OutputStream,boolean)
java.io.PrintStream#<init>(java.lang.String)
java.io.PrintWriter#<init>(java.io.File)
java.io.PrintWriter#<init>(java.io.OutputStream)
java.io.PrintWriter#<init>(java.io.OutputStream,boolean)
java.io.PrintWriter#<init>(java.lang.String)
java.io.PrintWriter#format(java.lang.String,java.lang.Object[])
java.io.PrintWriter#printf(java.lang.String,java.lang.Object[])
java.nio.charset.Charset#displayName()
java.text.BreakIterator#getCharacterInstance()
java.text.BreakIterator#getLineInstance()
java.text.BreakIterator#getSentenceInstance()
java.text.BreakIterator#getWordInstance()
java.text.Collator#getInstance()
java.text.DateFormat#getTimeInstance()
java.text.DateFormat#getTimeInstance(int)
java.text.DateFormat#getDateInstance()
java.text.DateFormat#getDateInstance(int)
java.text.DateFormat#getDateTimeInstance()
java.text.DateFormat#getDateTimeInstance(int,int)
java.text.DateFormat#getInstance()
java.text.DateFormatSymbols#<init>()
java.text.DateFormatSymbols#getInstance()
java.text.DecimalFormat#<init>()
java.text.DecimalFormat#<init>(java.lang.String)
java.text.DecimalFormatSymbols#<init>()
java.text.DecimalFormatSymbols#getInstance()
java.text.MessageFormat#<init>(java.lang.String)
java.text.NumberFormat#getInstance()
java.text.NumberFormat#getNumberInstance()
java.text.NumberFormat#getIntegerInstance()
java.text.NumberFormat#getCurrencyInstance()
java.text.NumberFormat#getPercentInstance()
java.text.SimpleDateFormat#<init>()
java.text.SimpleDateFormat#<init>(java.lang.String)
java.util.Calendar#<init>()
java.util.Calendar#getInstance()
java.util.Calendar#getInstance(java.util.Locale)
java.util.Calendar#getInstance(java.util.TimeZone)
java.util.Currency#getSymbol()
java.util.GregorianCalendar#<init>()
java.util.GregorianCalendar#<init>(int,int,int)
java.util.GregorianCalendar#<init>(int,int,int,int,int)
java.util.GregorianCalendar#<init>(int,int,int,int,int,int)
java.util.GregorianCalendar#<init>(java.util.Locale)
java.util.GregorianCalendar#<init>(java.util.TimeZone)
java.util.Scanner#<init>(java.io.InputStream)
java.util.Scanner#<init>(java.io.File)
java.util.Scanner#<init>(java.nio.channels.ReadableByteChannel)
java.util.Formatter#<init>()
java.util.Formatter#<init>(java.lang.Appendable)
java.util.Formatter#<init>(java.io.File)
java.util.Formatter#<init>(java.io.File,java.lang.String)
java.util.Formatter#<init>(java.io.OutputStream)
java.util.Formatter#<init>(java.io.OutputStream,java.lang.String)
java.util.Formatter#<init>(java.io.PrintStream)
java.util.Formatter#<init>(java.lang.String)
java.util.Formatter#<init>(java.lang.String,java.lang.String)
Using this easily extend-able list, saved in a text file (UTF-8 encoded!), you can invoke my new ANT task (after registering it with <taskdef/>) very easy -- taken from Lucene/Solr’s build.xml:
<taskdef resource="lucene-solr.antlib.xml">
  <classpath>
    <pathelement location="${custom-tasks.dir}/build/classes/java" />
    <fileset dir="${custom-tasks.dir}/lib" includes="asm-debug-all-4.0.jar" />
  </classpath>
</taskdef>
<forbidden-apis>
  <classpath refid="additional.dependencies"/>
  <apiFileSet dir="${custom-tasks.dir}/forbiddenApis">
    <include name="jdk.txt" />
    <include name="jdk-deprecated.txt" />
    <include name="commons-io.txt" />
  </apiFileSet>
  <fileset dir="${basedir}/build" includes="**/*.class" />
</forbidden-apis>
The classpath given is used to look up the API signatures (provided as apiFileSet). Classpath is only needed if signatures are coming from 3rd party libraries. The inner fileset should list all class files to be checked. For running the task you also need asm-all-4.0.jar available in the task’s classpath.

If you are interested, take the source code, it is open source and released as part of the tool set shipped with Apache Lucene & Solr: Source, API lists (revision number 1360240).

At the moment we are investigating other opportunities brought by that tool:
  • We want to ban System.out/err or things like horrible Eclipse-like try...catch...printStackTrace() auto-generated Exception stubs. We can just ban those fields from the java.lang.System class and of course, Throwable#printStackTrace().
  • Using optimized Lucene-provided replacements for JDK API calls. This can be enforced by failing on the JDK signatures.
  • Failing the build on deprecated calls to Java’s API. We can of course print warnings for deprecations, but failing the build is better. And: We use deprecation annotations in Lucene’s own library code, so javac-generated warnings don’t help. We can use the list of deprecated stuff from JDK Javadocs to trigger the failures.
I hope other projects take a similar approach to scan their binary/source code and free it from system dependent API calls, which are not predictable for production systems in the server environment.

Thanks to Robert Muir and Dawid Weiss for help and suggestions!

EDIT (2013-02-23): On 2013-02-04, I released the plugin as Apache Ant, Apache Maven and CLI task on Google Code. The project URL is: https://code.google.com/p/forbidden-apis/. The tool is available to your builds using Maven/Ivy through Maven Central and Sonatype repositories or can be downloaded on this web site. Nightly snapshot builds are done by the Policeman Jenkins Server and can be downloaded from the Sonatype Snapshot repository.

2012-02-06

Is your IndexReader atomic? - Major IndexReader refactoring in Lucene 4.0

Note: This blog post was originally posted on the SearchWorkings website.

Since Day 1 Lucene exposed the two fundamental concepts of reading and writing an index directly through IndexReader & IndexWriter. However, the API didn’t reflect reality; from the IndexWriter perspective this was desirable but when reading the index this caused several problems in the past. In reality a Lucene index isn’t a single index while logically treated as a such. The latest developments in Lucene trunk try to expose reality for type-safety and performance, but before I go into details about Composite, Atomic and DirectoryReaders let me go back in time a bit.

Since version 2.9 / 3.0 Lucene started to move away from executing searches directly on the top-level IndexReaders towards a per-segment orientation. As Simon Willnauer already explained in his blog entry, this lead to fact that optimizing an index is no longer needed to optimize searching performance. In fact, optimizing would slow your searches down, as after optimizing, all file system and Lucene-internal index caches get invalidated.

A standard Lucene index consists of several so-called segments, which are themselves fully-functional Lucene indexes. During indexing, Lucene writes new documents into separate segments and, once there are too many segments, they are merged. (see Mike McCandless’ blog: Visualizing Lucene's segment merges):



Prior to Lucene 2.9, despite consisting of multiple underlying segments, the segments were treated as though they were a single big index. Since then, Lucene has shifted towards a per-segment orientation. By now almost all structures and components in Lucene operate on a per-segment basis; among others this means that Lucene only loads actual changes on reopen, instead of the entire index. From a users perspective it might still look like one big logical index but under the hood everything works per-segment like this (simplified) IndexSearcher snippet shows:
 public void search(Weight weight, Collector collector) throws IOException {  
  // iterate through all segment readers & execute the search  
  for (int i = 0; i < subReaders.length; i++) {  
   // pass the reader to the collector  
   collector.setNextReader(subReaders[i], docStarts[i]);  
   final Scorer scorer = ...;  
   if (scorer != null) { // score documents on this segment  
    scorer.score(collector);  
   }  
  }  
 }  
However, the distinction between a logical index and a segment wasn’t consistently reflected in the code hierarchy. In Lucene 3.x, one could still execute searches on a top-level (logical) reader, without iterating over its subreaders. Doing so could slowdown your searches dramatically provided your index consisted of more than one segment. Among other reasons, this was why ancient versions of Lucene instructed users to optimize the index frequently.

Let me explain the problem in a little more detail. An IndexReader on top of a Directory is internally a MultiReader on all enclosing SegmentReaders. If you ask a MultiReader for a TermEnum or the Postings it executes an on-the-fly merge all of all subreader’s terms or postings data respectively. This merge process uses priority queues or related data structures leading to a serious slowdown depending on the number of subreaders.

Yet, even beyond these internal limitations using SegmentReaders in combination with MultiReaders can influence higher-level structures in Lucene. The FieldCache is used to uninvert the index to allow sorting of search results by indexed value or Document / Value lookups during search. Uninverting the top-level readers leads to duplication in the FieldCache and essentially multiple instances of the same cache.

Type-Safe IndexReaders in Lucene 4.0

From day one Lucene 4.0 was designed to not allow retrieving of terms and postings data from “composite” readers like MultiReader or DirectoryReader (which is the implementation that is returned for on-disk indexes, if you get a reader from IndexReader.open(Directory)). Initial versions of Lucene trunk simply threw an UnsupportedOperationException when you tried to get instances of Fields, TermsEnum, or DocsEnum from a non SegmentReader. Because of the missing type safety, one couldn’t rely on the ability to get postings from the IndexReader unless manually checking if it was composite or atomic.

LUCENE-2858 is one of the major API changes in Lucene 4.0, it completely changes the Lucene client code “perspective” on indexes and its segments. The abstract class IndexReader has been refactored to expose only essential methods to access stored fields during display of search results. It is no longer possible to retrieve terms or postings data from the underlying index, not even deletions are visible anymore. You can still pass IndexReader as constructor parameter to IndexSearcher and execute your searches; Lucene will automatically delegate procedures like query rewriting and document collection atomic subreaders.

If you want to dive deeper into the index and want to write own queries, take a closer look at the new abstract sub-classes AtomicReader and CompositeReader:

AtomicReader instances are now the only source of Terms, Postings, DocValues and FieldCache. Queries are forced to execute on a Atomic reader on a per-segment basis and FieldCaches are keyed by AtomicReaders. It’s counterpart CompositeReader exposes a utility method to retrieve its composites. But watch out, composites are not necessarily atomic. Next to the added type-safety we also removed the notion of index-commits and version numbers from the abstract IndexReader, the associations with IndexWriter were pulled into a specialized DirectoryReader. Here is an “example” executing a query in Lucene trunk:
 DirectoryReader reader = DirectoryReader.open(directory);  
 IndexSearcher searcher = new IndexSearcher(reader);  
 Query query = new QueryParser("fieldname", analyzer).parse(“text”);  
 TopDocs hits = searcher.search(query, 10);  
 ScoreDoc[] docs = hits.scoreDocs;  
 Document doc1 = searcher.doc(docs[0].doc);  
 // alternative:  
 Document doc2 = reader.document(docs[1].doc);  
Does that look familiar? Well, for the actual API user this major refactoring doesn’t bring much of a change. If you run into compile errors related to this change while upgrading you likely found a performance bottleneck.

Enforcing Per-Segment semantics in Filters

If you have more advanced code dealing with custom Filters, you might have noticed another new class hierarchy in Lucene (see LUCENE-2831): IndexReaderContext with corresponding Atomic-/CompositeReaderContext. This has been added quite a while ago but is closely related to atomic and composite readers.

The move towards per-segment search Lucene 2.9 exposed lots of custom Queries and Filters that couldn't handle it. For example, some Filter implementations expected the IndexReader passed in is identical to the IndexReader passed to IndexSearcher with all its advantages like absolute document IDs etc. Obviously this “paradigm-shift” broke lots of applications and especially those that utilized cross-segment data structures (like Apache Solr).

In Lucene 4.0, we introduce IndexReaderContexts “searcher-private” reader hierarchy. During Query or Filter execution Lucene no longer passes raw readers down Queries, Filters or Collectors; instead components are provided an AtomicReaderContext (essentially a hierarchy leaf) holding relative properties like the document-basis in relation to the top-level reader. This allows Queries & Filter to build up logic based on document IDs, albeit the per-segment orientation.

Can I still use top-level readers?

There are still valid use-cases where Top-Level readers ie. “atomic views” on the index are desirable. Let say you want to iterate all terms of a complete index for auto-completion or facetting, Lucene provides utility wrappers like SlowCompositeReaderWrapper emulating an AtomicReader. Note: using “atomicity emulators” can cause serious slowdowns due to the need to merge terms, postings, DocValues, and FieldCache, use them with care!
 Terms terms = SlowCompositeReaderWrapper.wrap(directoryReader).terms(“field”);  
Unfortunately, Apache Solr still uses this horrible code in a lot of places, leaving us with a major piece of work undone. Major parts of Solr’s facetting and filter caching need to be rewritten to work per atomic segment! For those implementing plugins or other components for Solr, SolrIndexSearcher exposes a “atomic view” of its underlying reader via SolrIndexSearcher.getAtomicReader().

If you want to write memory-effective and fast search applications (that do not need those useless large caches like Solr uses), I would recommend to not use Solr 4.0 and instead write your search application around the new Lucene components like the new facet module and SearcherManager!

2011-12-21

JDK 7u2 released - How about Linux and other operating systems?

Last week, Oracle released Java 7 Update 2, another milestone. This release included, of course, all the fixes that were already in Update 1 (see also Oracle's page), especially those affecting Apache Lucene and Solr. Since my last post on this blog, I was investigating what changed and how other operating systems like Ubuntu/Redhat Linux and FreeBSD are supported (warning: sarcasm alert!)

Linux

First of all, you can of course download the official Linux packages from Oracle. But those are not automatically updated when a new release comes out. So most Linux users prefer to use the automatic update of their operating system. Unfortunately, at the beginning of this month, Ubuntu wrote in an announcement:
As of August 24th 2011, we no longer have permission to redistribute new Java packages as Oracle has retired the "Operating System Distributor License for Java".
Oracle has published an advisory about security issues in the version of Java we currently have in the partner archive. Some of these issues are currently being exploited in the wild. Due to the severity of the security risk, Canonical is immediately releasing a security update for the Sun JDK browser plugin which will disable the plugin on all machines. This will mitigate users' risk from malicious websites exploiting the vulnerable version of the Sun JDK.
In the near future (exact date TBD), Canonical will remove all Sun JDK packages from the Partner archive. This will be accomplished by pushing empty packages to the archive, so that the Sun JDK will be removed from all users machines when they do a software update. Users of these packages who have not migrated to an alternative solution will experience failures after the package updates have removed Oracle Java from the system.
If you are currently using the Oracle Java packages from the partner archive, you have two options: 
  • Install the OpenJDK packages that are provided in the main Ubuntu archive (openjdk-6-jdk or openjdk-6-jre for the virtual machine).
  • Manually install Oracle's Java software from their web site.
Unfortunately this means that we will never get an official Ubuntu package for Java 7! What are all these security bugs suddenly heavily exploited in the wild?

OK, the latest version of Ubuntu's JDK 6 was Update 26, so what security fixes came in Update 27, Update 29, and Update 30? I inspected the changelogs shipped with the openjdk6 and openjdk7 packages, which are now the "official Java support" for Ubuntu (and also Redhat) but there is something wrong: It's not even OpenJDK! OpenJDK is still on build 147 (as of their official download page) - which is the original Java 7 release that broke Apache Lucene and Apache Solr with index corru(m)ptions and SIGSEGVs. This means no Linux user can run our full text search engine on Linux, because it SIGSEGVs shortly after starting? But thats not what the Ubuntu package contains: What Ubuntu "sells" as OpenJDK is indeed a strange product named "IcedTea" - wtf is that?

IcedTea 2.0 was released on October 19, 2011 with a long ist of security fixes! But the ubuntu download still has the famous build number 147 in its version number: 7~b147-2.0-1ubuntu2 - how does this fit together? Redhat and Ubuntu both sell another product "IcedTea", but labeled as "OpenJDK"! As this is so widely used, this seems to lead to the fact that Oracle does not seem to update their original OpenJDK release anymore. The IcedCreamTea seems to be the "new" offical release? What about all non-Linux operating systems like FreeBSD (see below)? I think that's a bad idea, because it confuses users. Also, when reviewing fixed bugs in official Oracle releases you get an update number (current is Java 6u30 or Java 7u2), but with OpenJDK (sorry IcedTea) Linux packages you get version numbers that don't tell you any relation to Oracle's releases - useless!

In fact to come back to OpenJDK 6 package in Ubuntu: If you install this replacement package on your machine according to the howto on the Ubuntu webpage for the good sun-java6 package (which is u26) - you get an older hotspot version (hotspot version numbers are the only things that you can read and compare from "java -version" output)! Something around official Oracle JDK 6u24 - so in fact you get an older version - that's no upgrade, that's a downgrade! For OpenJDK 7 you get something like Oracle's JDK 7u0 but with thousands of patches applied.

To come back to the Lucene/Solr bugs: Yes, they are fixed in this mysterious OpenJDK/IcedTea 7 release, the long list of changes verifies that. If you download the wrong-named OpenJDK 7 package with the horrible build number 147 (openjdk-7-jdk 7~b147-2.0-1ubuntu2), you will not crash your JVM with Apache Solr and you can try it out with the new garbage collector (G1) and some performance improvements (indeed Lucene tests run faster with Java 7 on my box). It looks very stable.

The second shock on this day occurred when I was searching for the famous Lucene bugs in the list of fixed IcedTea issues. They appear there as one of the horrible security bugs with CVE numbers assigned (CVE-2011-3558and others)! This also explains why Oracle made the orginal porter stemmer bug report hidden! They also appear in the openjdk-6 Ubuntu packages - as horrible security bugs, too. So Ubuntu patched the antique 6u24 and older versions with patches for Java 6u29 [they also patched u20, where the bug was not existent, see here] - thats really strange. And again, confuses users!


And finally: The Lucene bugs seem to be one of the reasons to delete the sun-java6 packages from Ubuntu Partner repository in the future. How funny is this? Does anybody have an exploit except starting Apache Solr with the default configuration and -XX:+AggressiveOpts enabled? OK, it is really a security issue for users working on your Solr Search web frontend and suddenly produce corru(m)pt indexes on your machine! They might not find anything after this disaster.


FreeBSD

What about FreeBSD? It looks much worse: There is no new update for OpenJDK available until today, so you cannot use it to build a new Port. The Jenkins Server at Apache, running the Lucene tests, is still running the original OpenJDK7 b147 build that I patched during the summer to work around the Java 7 bugs. I think the problem is here, that Oracle no longer releases OpenJDK builds, because IcedTea is there. But IcedTea is Linux only!

Please note: This blog post is partially a little bit sarcastic, I just tell my feelings about the whole Linux-Ubuntu-OpenJDK-FreeBSD issue.

A short side note: PANGAEA now runs very stable and horrible fast for some operations with Lucene 3.5 (no Apache Solr) and official Oracle JDK 7u2 on Solaris x64 (MMapDirectory of course)! I wish you merry Xmas and a happy new year!

2011-10-19

Java 7 Update 1 released - Does it fix the Lucene index corru(m)ption and SIGSEGV bugs?

After the serious issues with the Java 7 GA release, Oracle released Update 1 of Java 7 yesterday. The first thing, of course, was to check the release notes, download the package, and finally run our Lucene test cases. When inspecting the download page and release notes on the Oracle web site, I got confused: The new release is Java 7u1, in contrast the developer preview released one week ago named Java 7u2 Developer Preview. Both builds have the same build number (b08). More strange is: The release notes of the developer preview differ from the release notes on the Update 1 release:
  • The official Update 1 release notes only list one bug of the three ones originally reported to Oracle: #7068051 (SIGSEGV in PhaseIdealLoop::build_loop_late_post on T5440). The other ones are not listed, so we cannot be sure that they are really fixed.
  • The famous SIGSEGV bug in Porter Stemmer (#7070134) is not listed at all, it also disappeared from the Oracle issue tracker. It seems like hidden to the public (maybe they declared it as confidental because it's security related???).

No matter what release notes say - I had to download the offical release package! I took some free time on the Lucene Eurocon Conference in Barcelona, downloaded the packages and installed them in parallel on my Windows 64bit Thinkpad. First I tried to run the Porter Stemmer test with 100 iterations on Java 7 GA, and verified that it crashed. Robert Muir joined and we tested the new U1 release: Test passes! This means, Hotspot issue #7070134 was fixed, but Oracle missed to put it into the release notes.

The second part of the investigations were more complicated: The index corru(m)ption bugs are more complicated to reproduce, as the virtual machine does not crash and simply produces corrupt indexes after merging segments in one of the facetting tests. We checked out the Lucene Trunk revision of the time, when the bug was first discovered (issue LUCENE-3346) and used the random seed mentioned on the issue. We were able to verify the bug with Java 7 GA (the indexes are corrupt 90% of the time), and luckily after 20 iterations of the same test and random seed in Java 7u1, we have seen no corrupt index! It seemed to us that Oracle maybe fixed Java issues #7044738 and #7068051, but missed to put both of them into the release notes. Of course, without an additional statement from Oracle, we cannot be sure, that the issues are really fixed!

Oracle also released Java 6 Update 29 yesterday. The release notes on that version doesn't mention any relation to the Lucene bugs, so we were not sure if this version is completely different to the Java 6u29 developer preview, released one week ago, which listed those bugs (unfortunately, the package is no longer available on the net), or if they also missed to mention them. A quick review as done for Java 7 showed, that Porter Stemmer no longer crashes with -XX:+AggressiveOpts, so the bug seems to be also fixed here, too. We were not able to actually discover any index corru(m)ption.

Finally, we can somehow verify that the bugs seem to be fixed for both versions, but without an official statement from Oracle (in their release notes), we cannot recommend to use Java 7u1 (and Java 6u29 with aggressive opts) with Lucene and Solr.

Once I will be back in Germany, I will try to get an updated FreeBSD package of Java 7 and install it on our Apache Jenkins server.

Update (2011-10-26)

Last night, Oracle updated the release notes of Java 7u1 and Java 6u29, stating that they fixed the three Lucene-relevant bugs (plus another one related to that). Based on this confirmation, it's now safe to use Java 7 Update 1 (and later) with Apache Lucene and Apache Solr.

Of course, there is still the recommendation not to use -XX:+AggressiveOpts on any JVM in production!

We are still waiting for updated OpenJDK packages to install this release on our build servers.