2012-02-06

Is your IndexReader atomic? - Major IndexReader refactoring in Lucene 4.0

Note: This blog post was originally posted on the SearchWorkings website.

Since day one, Lucene has exposed the two fundamental concepts of reading and writing an index directly through IndexReader & IndexWriter. However, the API didn’t reflect reality: a Lucene index isn’t a single monolithic index, even though it is logically treated as one. From the IndexWriter perspective this abstraction was desirable, but on the reading side it caused several problems in the past. The latest developments in Lucene trunk expose this reality for type-safety and performance, but before I go into details about composite, atomic and directory readers, let me go back in time a bit.

Since version 2.9 / 3.0, Lucene started to move away from executing searches directly on the top-level IndexReader towards a per-segment orientation. As Simon Willnauer already explained in his blog entry, this led to the fact that optimizing an index is no longer needed to improve search performance. In fact, optimizing would slow your searches down, as after optimizing, all file-system and Lucene-internal index caches get invalidated.

A standard Lucene index consists of several so-called segments, which are themselves fully-functional Lucene indexes. During indexing, Lucene writes new documents into separate segments and, once there are too many segments, merges them (see Mike McCandless’ blog post: Visualizing Lucene's segment merges).

Prior to Lucene 2.9, an index was treated as one single big index despite consisting of multiple underlying segments. Since then, Lucene has shifted towards a per-segment orientation. By now almost all structures and components in Lucene operate on a per-segment basis; among other things, this means that on reopen Lucene only loads actual changes instead of the entire index. From a user’s perspective it might still look like one big logical index, but under the hood everything works per segment, as this (simplified) IndexSearcher snippet shows:
 public void search(Weight weight, Collector collector) throws IOException {
   // iterate over all segment readers and execute the search on each of them
   for (int i = 0; i < subReaders.length; i++) {
     // notify the collector of the new segment and its document ID offset
     collector.setNextReader(subReaders[i], docStarts[i]);
     final Scorer scorer = ...; // scorer construction elided
     if (scorer != null) { // score documents on this segment
       scorer.score(collector);
     }
   }
 }
However, the distinction between a logical index and a segment wasn’t consistently reflected in the code hierarchy. In Lucene 3.x, one could still execute searches on a top-level (logical) reader without iterating over its subreaders. Doing so could slow your searches down dramatically if your index consisted of more than one segment. Among other reasons, this was why ancient versions of Lucene instructed users to optimize the index frequently.

Let me explain the problem in a little more detail. An IndexReader opened on top of a Directory is internally a MultiReader over all enclosed SegmentReaders. If you ask a MultiReader for a TermEnum or for postings, it executes an on-the-fly merge of all subreaders’ terms or postings data, respectively. This merge process uses priority queues or related data structures, leading to a serious slowdown depending on the number of subreaders.
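
In Lucene trunk you can still get such a merged view, but you have to ask for it explicitly. Here is a minimal sketch using the MultiFields helper as of Lucene 4.0 (the field name "body" is just an example):
 // Explicitly request an on-the-fly merged terms view over all segments.
 // Every call performs the priority-queue merge described above, so this
 // is slow for indexes with many segments.
 Terms mergedTerms = MultiFields.getTerms(reader, "body");
 if (mergedTerms != null) { // null if no segment contains this field
   TermsEnum termsEnum = mergedTerms.iterator(null);
   BytesRef term;
   while ((term = termsEnum.next()) != null) {
     // each term appears exactly once, merged across all subreaders
   }
 }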

Yet, even beyond these internal limitations, using SegmentReaders in combination with MultiReaders can affect higher-level structures in Lucene. The FieldCache is used to uninvert the index to allow sorting of search results by indexed value, or document/value lookups during search. Uninverting the top-level reader in addition to its segments leads to duplication in the FieldCache: essentially multiple instances of the same cache, holding the same values twice.
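
As of Lucene 4.0, the way around this duplication is to uninvert each segment separately; a minimal sketch, assuming an integer field called "price":
 // Uninvert per segment: the FieldCache is keyed by each segment's
 // atomic reader, so after a reopen only new segments get uninverted.
 for (AtomicReaderContext ctx : reader.leaves()) {
   int[] prices = FieldCache.DEFAULT.getInts(ctx.reader(), "price", false);
   // the array is indexed by segment-local document ID
 }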

Type-Safe IndexReaders in Lucene 4.0

From day one, Lucene 4.0 was designed to not allow retrieving terms and postings data from “composite” readers like MultiReader or DirectoryReader (the implementation that is returned for on-disk indexes if you get a reader from IndexReader.open(Directory)). Initial versions of Lucene trunk simply threw an UnsupportedOperationException when you tried to get instances of Fields, TermsEnum, or DocsEnum from a non-SegmentReader. Because of the missing type safety, one couldn’t rely on the ability to get postings from an IndexReader without manually checking whether it was composite or atomic.

LUCENE-2858 is one of the major API changes in Lucene 4.0: it completely changes the perspective of Lucene client code on indexes and their segments. The abstract class IndexReader has been refactored to expose only essential methods, such as accessing stored fields while displaying search results. It is no longer possible to retrieve terms or postings data from it; not even deletions are visible anymore. You can still pass an IndexReader as constructor parameter to IndexSearcher and execute your searches; Lucene will automatically delegate procedures like query rewriting and document collection to the atomic subreaders.

If you want to dive deeper into the index and write your own queries, take a closer look at the two new abstract subclasses, AtomicReader and CompositeReader:

AtomicReader instances are now the only source of Terms, postings, DocValues and FieldCache. Queries are forced to execute against AtomicReaders on a per-segment basis, and FieldCaches are keyed by AtomicReader. Its counterpart, CompositeReader, exposes a utility method to retrieve its sub-readers; but watch out, those sub-readers are not necessarily atomic, they may be composite themselves. Next to the added type-safety, we also removed the notion of index commits and version numbers from the abstract IndexReader; the associations with IndexWriter were pulled into a specialized DirectoryReader. Here is an example executing a query in Lucene trunk:
 DirectoryReader reader = DirectoryReader.open(directory);
 IndexSearcher searcher = new IndexSearcher(reader);
 Query query = new QueryParser(Version.LUCENE_40, "fieldname", analyzer).parse("text");
 TopDocs hits = searcher.search(query, 10);
 ScoreDoc[] docs = hits.scoreDocs;
 Document doc1 = searcher.doc(docs[0].doc);
 // alternative: fetch the stored document directly from the reader
 Document doc2 = reader.document(docs[1].doc);
Does that look familiar? Well, for the actual API user this major refactoring doesn’t bring much of a change. And if you run into compile errors related to this change while upgrading, you have likely found a performance bottleneck.
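
If the compiler does flag such a spot because your code read terms or postings from the top-level reader, the per-segment replacement looks roughly like the following sketch (the field name "body" is illustrative):
 // Walk the atomic leaves of the composite reader and inspect postings
 // per segment; no on-the-fly merging is involved.
 for (AtomicReaderContext ctx : reader.leaves()) {
   AtomicReader atomic = ctx.reader();
   Terms terms = atomic.terms("body"); // only AtomicReader exposes this
   if (terms != null) {
     // work with this single segment's terms and postings
   }
 }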

Enforcing Per-Segment semantics in Filters

If you have more advanced code dealing with custom Filters, you might have noticed another new class hierarchy in Lucene (see LUCENE-2831): IndexReaderContext with its corresponding AtomicReaderContext / CompositeReaderContext. This was added quite a while ago, but it is closely related to atomic and composite readers.

The move towards per-segment search in Lucene 2.9 exposed lots of custom Queries and Filters that couldn't handle it. For example, some Filter implementations expected that the IndexReader passed in was identical to the IndexReader passed to IndexSearcher, with all its advantages like absolute document IDs. Obviously this paradigm shift broke lots of applications, especially those that utilized cross-segment data structures (like Apache Solr).

In Lucene 4.0, we introduced IndexReaderContext, a “searcher-private” reader hierarchy. During Query or Filter execution, Lucene no longer passes raw readers down to Queries, Filters or Collectors; instead, components are provided with an AtomicReaderContext (essentially a hierarchy leaf) holding relative properties like the document base in relation to the top-level reader. This allows Queries & Filters to build up logic based on document IDs despite the per-segment orientation.
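
To illustrate, here is a minimal Collector sketch that rebases segment-local document IDs to top-level IDs using the context; everything besides the Lucene 4.0 Collector API is made up for the example:
 final List<Integer> topLevelIds = new ArrayList<Integer>();
 searcher.search(query, new Collector() {
   private int docBase; // document base of the current segment
   @Override
   public void setNextReader(AtomicReaderContext context) {
     docBase = context.docBase; // offset in relation to the top-level reader
   }
   @Override
   public void collect(int doc) {
     topLevelIds.add(docBase + doc); // segment-local ID -> top-level ID
   }
   @Override
   public void setScorer(Scorer scorer) { /* scores not needed here */ }
   @Override
   public boolean acceptsDocsOutOfOrder() {
     return true; // collection order does not matter for this example
   }
 });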

Can I still use top-level readers?

There are still valid use cases where top-level “atomic views” on the index are desirable. Let’s say you want to iterate all terms of a complete index for auto-completion or faceting; for this, Lucene provides utility wrappers like SlowCompositeReaderWrapper that emulate an AtomicReader. Note: using such “atomicity emulators” can cause serious slowdowns, because terms, postings, DocValues, and FieldCache entries have to be merged on the fly; use them with care!
 Terms terms = SlowCompositeReaderWrapper.wrap(directoryReader).terms("field");
Unfortunately, Apache Solr still uses this horrible code in a lot of places, leaving us with a major piece of work undone. Major parts of Solr’s faceting and filter caching need to be rewritten to work per atomic segment! For those implementing plugins or other components for Solr, SolrIndexSearcher exposes an “atomic view” of its underlying reader via SolrIndexSearcher.getAtomicReader().

If you want to write memory-efficient and fast search applications (that do not need those useless large caches like Solr uses), I would recommend not using Solr 4.0 and instead building your search application around the new Lucene components, like the new facet module and SearcherManager!

3 comments:

  1. Thank you for this piece of vital and totally undocumented knowledge!

    I still don't quite understand what else I can use rather than SlowCompositeReaderWrapper in case I want to retrieve a field for all the documents in the index via FieldCache.
    By the way, I didn't force Lucene to build several segments, the entire index was static from the very first indexing. Still, there are multiple segments in it.

  2. "Thank you for this piece of vital...knowledge!"

    Absolutely. Thank you so much. I'm upgrading some legacy code from 2.0.0 to 4.2.0, and it is non-trivial to find this kind of stuff out without hitting the bottle in the process. This has saved me a major hangover.

  3. Hi Uwe,

    Do you have any insight for MultiReader, which is a CompositeReader?

    I am using it not just for search but also for the MoreLikeThis constructor and HighFreqTerms.getHighFreqTerms(IndexReader reader, int numTerms, String field). Is it safe to assume the statistics returned are from across all subreaders in the MultiReader?

    Also I have a challenge.

    We are upgrading from Lucene 2.9.4 to 4.3. We used to use a ParallelMultiSearcher with an array of Searchables. These Searchables were heterogeneous: some were IndexSearchers made from local directories, some were remote searchers via web services, and some were Solr searcher clients using solrj.jar. We did use and do not plan to use the Solr server at all. Since the Searchable interface is deprecated, what will be our workaround to accommodate our heterogeneous searchers?

    Thanks!
