2012-07-18

Use Lucene’s MMapDirectory on 64bit platforms, please!

Don’t be afraid – Some clarification to common misunderstandings

Since version 3.1, Apache Lucene and Solr use MMapDirectory by default on 64bit Windows and Solaris systems; since version 3.3 also for 64bit Linux systems. This change lead to some confusion among Lucene and Solr users, because suddenly their systems started to behave differently than in previous versions. On the Lucene and Solr mailing lists a lot of posts arrived from users asking why their Java installation is suddenly consuming three times their physical memory or system administrators complaining about heavy resource usage. Also consultants were starting to tell people that they should not use MMapDirectory and change their solrconfig.xml to work instead with slow SimpleFSDirectory or NIOFSDirectory (which is much slower on Windows, caused by a JVM bug #6265734). From the point of view of the Lucene committers, who carefully decided that using MMapDirectory is the best for those platforms, this is rather annoying, because they know, that Lucene/Solr can work with much better performance than before. Common misinformation about the background of this change causes suboptimal installations of this great search engine everywhere.

In this blog post, I will try to explain the basic operating system facts regarding virtual memory handling in the kernel and how this can be used to largely improve performance of Lucene (“VIRTUAL MEMORY for DUMMIES”). It will also clarify why the blog and mailing list posts done by various people are wrong and contradict the purpose of MMapDirectory. In the second part I will show you some configuration details and settings you should take care of to prevent errors like “mmap failed” and suboptimal performance because of stupid Java heap allocation.

Virtual Memory[1]

Let’s start with your operating system’s kernel: The naive approach to do I/O in software is the way, you have done this since the 1970s – the pattern is simple: whenever you have to work with data on disk, you execute a syscall to your operating system kernel, passing a pointer to some buffer (e.g. a byte[] array in Java) and transfer some bytes from/to disk. After that you parse the buffer contents and do your program logic. If you don’t want to do too many syscalls (because those may cost a lot processing power), you generally use large buffers in your software, so synchronizing the data in the buffer with your disk needs to be done less often. This is one reason, why some people suggest to load the whole Lucene index into Java heap memory (e.g., by using RAMDirectory).

But all modern operating systems like Linux, Windows (NT+), MacOS X, or Solaris provide a much better approach to do this 1970s style of code by using their sophisticated file system caches and memory management features. A feature called “virtual memory” is a good alternative to handle very large and space intensive data structures like a Lucene index. Virtual memory is an integral part of a computer architecture; implementations require hardware support, typically in the form of a memory management unit (MMU) built into the CPU. The way how it works is very simple: Every process gets his own virtual address space where all libraries, heap and stack space is mapped into. This address space in most cases also start at offset zero, which simplifies loading the program code because no relocation of address pointers needs to be done. Every process sees a large unfragmented linear address space it can work on. It is called “virtual memory” because this address space has nothing to do with physical memory, it just looks like so to the process. Software can then access this large address space as if it were real memory without knowing that there are other processes also consuming memory and having their own virtual address space. The underlying operating system works together with the MMU (memory management unit) in the CPU to map those virtual addresses to real memory once they are accessed for the first time. This is done using so called page tables, which are backed by TLBs located in the MMU hardware (translation lookaside buffers, they cache frequently accessed pages). By this, the operating system is able to distribute all running processes’ memory requirements to the real available memory, completely transparent to the running programs.


Schematic drawing of virtual memory
(image from Wikipedia [1], http://en.wikipedia.org/wiki/File:Virtual_memory.svg, licensed by CC BY-SA 3.0)

By using this virtualization, there is one more thing, the operating system can do: If there is not enough physical memory, it can decide to “swap out” pages no longer used by the processes, freeing physical memory for other processes or caching more important file system operations. Once a process tries to access a virtual address, which was paged out, it is reloaded to main memory and made available to the process. The process does not have to do anything, it is completely transparent. This is a good thing to applications because they don’t need to know anything about the amount of memory available; but also leads to problems for very memory intensive applications like Lucene.

Lucene & Virtual Memory

Let’s take the example of loading the whole index or large parts of it into “memory” (we already know, it is only virtual memory). If we allocate a RAMDirectory and load all index files into it, we are working against the operating system: The operating system tries to optimize disk accesses, so it caches already all disk I/O in physical memory. We copy all these cache contents into our own virtual address space, consuming horrible amounts of physical memory (and we must wait for the copy operation to take place!). As physical memory is limited, the operating system may, of course, decide to swap out our large RAMDirectory and where does it land? – On disk again (in the OS swap file)! In fact, we are fighting against our O/S kernel who pages out all stuff we loaded from disk [2]. So RAMDirectory is not a good idea to optimize index loading times! Additionally, RAMDirectory has also more problems related to garbage collection and concurrency. Because the data residing in swap space, Java’s garbage collector has a hard job to free the memory in its own heap management. This leads to high disk I/O, slow index access times, and minute-long latency in your searching code caused by the garbage collector driving crazy.

On the other hand, if we don’t use RAMDirectory to buffer our index and use NIOFSDirectory or SimpleFSDirectory, we have to pay another price: Our code has to do a lot of syscalls to the O/S kernel to copy blocks of data between the disk or filesystem cache and our buffers residing in Java heap. This needs to be done on every search request, over and over again.

Memory Mapping Files

The solution to the above issues is MMapDirectory, which uses virtual memory and a kernel feature called “mmap” [3] to access the disk files.

In our previous approaches, we were relying on using a syscall to copy the data between the file system cache and our local Java heap. How about directly accessing the file system cache? This is what mmap does!

Basically mmap does the same like handling the Lucene index as a swap file. The mmap() syscall tells the O/S kernel to virtually map our whole index files into the previously described virtual address space, and make them look like RAM available to our Lucene process. We can then access our index file on disk just like it would be a large byte[] array (in Java this is encapsulated by a ByteBuffer interface to make it safe for use by Java code). If we access this virtual address space from the Lucene code we don’t need to do any syscalls, the processor’s MMU and TLB handles all the mapping for us. If the data is only on disk, the MMU will cause an interrupt and the O/S kernel will load the data into file system cache. If it is already in cache, MMU/TLB map it directly to the physical memory in file system cache. It is now just a native memory access, nothing more! We don’t have to take care of paging in/out of buffers, all this is managed by the O/S kernel. Furthermore, we have no concurrency issue, the only overhead over a standard byte[] array is some wrapping caused by Java’s ByteBuffer interface (it is still slower than a real byte[] array, but  that is the only way to use mmap from Java and is much faster than all other directory implementations shipped with Lucene). We also waste no physical memory, as we operate directly on the O/S cache, avoiding all Java GC issues described before.

What does this all mean to our Lucene/Solr application?
  • We should not work against the operating system anymore, so allocate as less as possible heap space (-Xmx Java option). Remember, our index accesses rely on passed directly to O/S cache! This is also very friendly to the Java garbage collector.
  • Free as much as possible physical memory to be available for the O/S kernel as file system cache. Remember, our Lucene code works directly on it, so reducing the number of paging/swapping between disk and memory. Allocating too much heap to our Lucene application hurts performance! Lucene does not require it with MMapDirectory.

Why does this only work as expected on operating systems and Java virtual machines with 64bit?

One limitation of 32bit platforms is the size of pointers, they can refer to any address within 0 and 232-1, which is 4 Gigabytes. Most operating systems limit that address space to 3 Gigabytes because the remaining address space is reserved for use by device hardware and similar things. This means the overall linear address space provided to any process is limited to 3 Gigabytes, so you cannot map any file larger than that into this “small” address space to be available as big byte[] array. And when you mapped that one large file, there is no virtual space (address like “house number”) available anymore. As physical memory sizes in current systems already have gone beyond that size, there is no address space available to make use for mapping files without wasting resources (in our case “address space”, not physical memory!).

On 64bit platforms this is different: 264-1 is a very large number, a number in excess of 18 quintillion bytes, so there is no real limit in address space. Unfortunately, most hardware (the MMU, CPU’s bus system) and operating systems are limiting this address space to 47 bits for user mode applications (Windows: 43 bits) [4]. But there is still much of addressing space available to map terabytes of data.

Common misunderstandings

If you have read carefully what I have told you about virtual memory, you can easily verify that the following is true:
  • MMapDirectory does not consume additional memory and the size of mapped index files is not limited by the physical memory available on your server. By mmap() files, we only reserve address space not memory! Remember, address space on 64bit platforms is for free!
  • MMapDirectory will not load the whole index into physical memory. Why should it do this? We just ask the operating system to map the file into address space for easy access, by no means we are requesting more. Java and the O/S optionally provide the option to try loading the whole file into RAM (if enough is available), but Lucene does not use that option (we may add this possibility in a later version).
  • MMapDirectory does not overload the server when “top” reports horrible amounts of memory. “top” (on Linux) has three columns related to memory: “VIRT”, “RES”, and “SHR”. The first one (VIRT, virtual) is reporting allocated virtual address space (and that one is for free on 64 bit platforms!). This number can be multiple times of your index size or physical memory when merges are running in IndexWriter. If you have only one IndexReader open it should be approximately equal to allocated heap space (-Xmx) plus index size. It does not show physical memory used by the process. The second column (RES, resident) memory shows how much (physical) memory the process allocated for operating and should be in the size of your Java heap space. The last column (SHR, shared) shows how much of the allocated virtual address space is shared with other processes. If you have several Java applications using MMapDirectory to access the same index, you will see this number going up. Generally, you will see the space needed by shared system libraries, JAR files, and the process executable itself (which are also mmapped).

How to configure my operating system and Java VM to make optimal use of MMapDirectory?

First of all, default settings in Linux distributions and Solaris/Windows are perfectly fine. But there are some paranoid system administrators around, that want to control everything (with lack of understanding). Those limit the maximum amount of virtual address space that can be allocated by applications. So please check that “ulimit -v” and “ulimit -m” both report “unlimited”, otherwise it may happen that MMapDirectory reports “mmap failed” while opening your index. If this error still happens on systems with lot’s of very large indexes, each of those with many segments, you may need to tune your kernel parameters in /etc/sysctl.conf: The default value of vm.max_map_count is 65530, you may need to raise it. I think, for Windows and Solaris systems there are similar settings available, but it is up to the reader to find out how to use them.

For configuring your Java VM, you should rethink your memory requirements: Give only the really needed amount of heap space and leave as much as possible to the O/S. As a rule of thumb: Don’t use more than ¼ of your physical memory as heap space for Java running Lucene/Solr, keep the remaining memory free for the operating system cache. If you have more applications running on your server, adjust accordingly. As usual the more physical memory the better, but you don’t need as much physical memory as your index size. The kernel does a good job in paging in frequently used pages from your index.

A good possibility to check that you have configured your system optimally is by looking at both "top" (and correctly interpreting it, see above) and the similar command "iotop" (can be installed, e.g., on Ubuntu Linux by "apt-get install iotop"). If your system does lots of swap in/swap out for the Lucene process, reduce heap size, you possibly used too much. If you see lot's of disk I/O, buy more RUM (Simon Willnauer) so mmapped files don't need to be paged in/out all the time, and finally: buy SSDs.

Happy mmapping!

Bibliography

38 comments:

  1. And just the other day I was using mmapdirectory on a 64bit jvm, trying hard to set the highest heap size possible, reducing available ram for the disk cache. Stupid me.
    Thanks for this real eye opener!

    Would also be interest to learn about any potential mmap cons. I assume ACID is fully kept for write OPs? Thanks.

    ReplyDelete
    Replies
    1. Gill, MMapDirectory in lucene uses MMaps only for reading not for writing so there are no consequences along those lines. Additionally Lucene never modifies committed files so you won't have any corruptions due to the fact you are using MMap.

      @Uwe: you really quoted me on the RUM thing didn't you! :) - good blog and needed I don't need to write answers to this anymore on the list but just point folks here!

      Delete
  2. One additional huge problem I had with mmap on a 32bit system (within a different app, not lucene) was that the system gots slow on strange parts of the code when the available memory got lower. debugging that was horrible :)

    Questions:

    What if my index completely fits into RAM (say only 1 or 2 GB) and I need 100% performance. Wouldn't it then better to use a RAMDir?

    And to avoid concurrency problems with RAMDirectory - wouldn't it be possible to implement a different one via off-heap direct-memory (not mmap)?

    ReplyDelete
    Replies
    1. There is an issue for that. RAMDirectory has the other bug, that it allocates millions of byte[1024], because thats the block size. RAMDirectory is made solely for tests, not for production. I have a patch for RAMDirectory to use larger block sizes in Lucene 4.0, not yet committed (needs some love). It uses IOContext to estimate file size and changes the blocksize. Also for RAMDirectoryies that clones a FSDirectory, it allocates the whole file as one byte[] (or multiple 2 GB blocks, if larger). See https://issues.apache.org/jira/browse/LUCENE-3659

      Delete
  3. Hi Uwe,

    I tried changing the JVM -Xmx setting from 15GB to 6GB for a 21GB footprint index. The perf numbers are identical pretty much. It's much more satisfying to see RES of the process at 4.8GB instead of the 18GB it was before, that and the OS cache is at 20GB instead of 8GB. That's all well and good, but I was hoping for some other measureable indication that this is in fact "Better". E.g. less CPU for the same throughput or something. Any suggestions on where to look to see the tangible benefits?

    ReplyDelete
  4. Hi Uwe,

    I was curious what this means for the solr documentCache? If this is all essentially loaded into memory upon retrieval is there much point in needing it -- other than perhaps desiring autowarming? The other caches seem more relevant for continued use, but documentCache seems redundant. Is that right/sort of right?

    ReplyDelete
  5. Hi Surfdog,

    indeed the Solr documentCache is somehow obsolete with MMapDirectory. The problem with MMapDirectory here is, that the *decoding* of the stored fields file still needs to be done, which takes some time, on the other hand, stored fields are only loaded for search results, which is in most cases only 20 or like that. Just like Lucene TopDocs, which has no cache at all and is very fast in displaying something like 20 documents, I would disable this cache. I generally recommend that to my customers for Lucene code to not cache Document objects, unless they do some result-post-processing fetching thousands of search result document objects.

    ReplyDelete
  6. in addition: The Solr document cache is very old and was in my opinion mainly implemented to work around thread synchronization in FieldsReader (this means, in earlier Lucene versions, IndexReader.document() was synchronized), so the bottleneck was not IO, but synchronization. This is no longer the case (since 2.9, as far as I know).

    ReplyDelete
  7. Hi Uwe,

    Thanks for this interesting article. Do you think that like MongoDB or MySQL, Solr can affected by the NUMA like explained on these articles : http://www.mongodb.org/display/DOCS/NUMA and http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/ ?

    Regards,

    ReplyDelete
  8. Hi Laurent,
    The Oracle JDK has some improvements in its garbage collector and memory allocation for NUMA architectures. If you recognize problems with NUMA, use -XX:+UseNUMA -XX:+UseParallelGC as JVM parameters. This also improves the garbage collector (so it distributes the allocated objects correctly to CPU local memory). But in practise, we have not seen MySQL-like problems. The reason for this is: In JAVA you never allocate a large block with tons of memory (> the node-local memory). The reason for this is, that the maximum size of memory, Java can allocate for a single object is 2 GiB (new byte[Integer.MAX_VALUE]). Also the JDK will use libnuma, to distribute the allocated blocks accoring to their requirements.
    In addition, as described in the blog post: Never allocate too much heap memory using -Xmx to your java process, as this slows down your index, because the operating system has no space for caching files which is the most important thing for Lucene/Solr.
    About use of MMapDirectory: MMap is outside the JVM memory and loading/releasing memory mapped files is handled different in terms of NUMA. MMapped files are swapped in/out by the Linux kernel and the cache space is maintained by the OS kernel, so it is not assigned to a NUMA node at all.

    ReplyDelete
  9. Hi Uwe,
    Like you have mentioned that it's best to disable documentCache.
    What about filterCache, fieldValueCache and queryResultCache. Wouldn't these get effected too if we reduce heap space?

    Also during indexing if the heap space is low could it give me heap space errors during a merge?

    ReplyDelete
  10. @varunthaker: filterCache is cheap to maintain (it is only bitsets) and it helps for fq queries e.g. with common facets. In general a filterCache used for numeric ranges is mostly useless, so i would recommend to turn it on for fq's with facets, but when you do filtering on geographic or date/time ranges, pass the no-cache query parser option to fq. FieldCache is needed for sorting and cannot be turned off, but may get obsolete with the introduction of Lucene-4.0-DocValues fields to Solr's schema (not yet fully implemented). QueryResultCache is like FilterCache caching results, but for queries it also caches the score values, so a simple bitset is not enough. If you have very expensive queries that repeat quite often (like Dismax), you should use it.

    ReplyDelete
  11. Hi Uwe,
    Thanks for explaining the various caches.

    Indeed once DocValues get integrated into the schema using them for sorting and faceting.

    One this I'm still not clear about it how to balance between allocating heap space for running search. I know you can't really put a formula to it but as far as my understanding goes, Lucene benefits from having more free memory with the os but Solr being an application benefits from having more heap space.

    Why I'm so keen on this balance is I want to make sure I don't run out of memory when committing the index or if I have a query with multiple facets and sort on fields while maintaining good search performance.

    ReplyDelete
  12. Hi Uwe,
    I'm linux support for some developers trying to implement lucene and make use of mmapdirectory. I'm trying to figure out exactly what I need to do to facilitate this for them. Are there any kernel tweaks or anything that I need to do on the system? It's a RHEL 5.6 platform. If you could point me to any documentation from a systems, rather than a developers perspective, it would be greatly appreciated.

    Vielen Dank,
    Jay Herbig

    ReplyDelete
  13. Hi Jay,
    unfortunately, I am also on the developer perspective :-). But I can answer parts of your questions, as I am also involved in managing Lucene-based servers:
    - There are no special kernel-settings needed, the default kernels in all Linux distributions should support mmap, as this is required by POSIX and the ld.so loader is also using it to map .so files into address space and share them between processes (in linux, all libraries are mmaped into the processes).

    The problem are (as noted before) some restrictive settings with ulimit:

    - Lucene in general needs a large number of open files, so you should in all cases raise the maximum open file count per process. On servers only running lucene apps and nothing more (which we recommend), the upper limit is just useless and should be raised to some maximum like 32768 open files (depends on kernel). If you don't want to raise it too high, use at a minimum 8192 - this is not really specific to mmap, applies all Lucene storage implementions. This setting is "ulimit -n"

    - The important settings to raise to unlimited/maximum are: "ulimit -v" (virtual memory, must be unlimited, otherwise you cannot map the index. As my article says, this has nothing to do with physical RAM, its only the size of address space the app can occupy. The setting of virtual mem is only important on 32 bit systems, where one process could easily allocate the whole virtual 32 bit off address space, bringing the server down). On 64 bit this is unlikely to happen :-) Max memory size ("ulimit -m") should also be unlimited. Both settings are unlimited in most linux distribs like Ubuntu, Debian. I have no idea about Redhat or SUSE, at least I know that SUSE by default has set some limits on server platforms.

    There may be other settings in ulimit (like maximum file size, but I don't think you would have any limits by default).

    Finally: If yur indexes are *really* large (terabytes), you may need to raise the sysctl "vm.max_map_count", but that's all (see my explanation above).

    In any case, if I get link to an article about "sys admin's settings", I will post it here.

    ReplyDelete
  14. Thanks Uwe, for sharing such valuable information.
    As you explained about JVM allocation size, what about CPU?
    Solr/Lucene is CPU intensive. We see contineous CPU spike on Solr Slaves. Do you have any recommendation on the # of CPUs too?

    ReplyDelete
  15. Hi Uwe, thanks for sharing thoughts on mmap impl for linux 64 bits.
    Is there any additional recommendation when using different hosts of appservers sharing indexes stored in NFS ? Only one appserver instance writes to index, the other instances only read.

    ReplyDelete
  16. Great article! But what if I really don't want to have any disk i/o in my application is the preferred way to solve this by using ramfs? And if I do, what is the preferred Directory to use? If I use MMapDirectory I will duplicate data in memory I guess, would it be better to use NIOFSDirectory in this case?

    ReplyDelete
  17. Johan: There are currently plans to create a new RAMDirectory like implementation that uses ByteBuffers, which can be on Java heap (ByteBuffer.allocate()) or off-heap (allocateDirect()). The code would be similar to MMapDirectory (which already uses a generic ByteBuffer interface). Only code for writing is missing in Lucene. If I have time, I will work on this. The problems with writing (and also RAMDirectory in general) are the fact that in older Lucene versions, we don't really know how big the files are. Because of This, while writing RAMDirectory allocates in blocks of 1024 byte... So Garbage collector has to handle millions of small byte[1024] if you have a large index in RAMDirectory. In Lucene 4.0, with IOContext, we can somehow estimate the size of the merged segments, so the new ByteBuffer-based RAMDirectory (can allocate bigger blocks instead of defaul 1024 bytes).

    The whole thing only makes sense for Indexes that are *never* written to disk. Otherwise, if it is on disk, MMapDirectory is the way to go. Copying from disk to RAM explicitly is stupid, see the blog post, because MMapDirectory does not waste additional RAM.

    ReplyDelete
  18. While your article is helpful in some aspects, it is trivially simplistic in others. Nothing is for "free". Regardless of how much virtual address space is available, the physical address space is limited.

    Your many points about the inefficiency of the java heap are generally correct. Yes, it's true that memory-mapped files are more efficient for loading an index into physical memory. However, in other ways your praise of memory mapping the virtual address space is trivial and simplistic.

    If the index is so large that there are very large amounts of virtual memory from the memory mapped files being loaded into physical memory, then this crowds out the physical memory in use by the app and all other apps on the system. In such cases, large amounts will then be frequently swapped out to the OS swap file. There is only so much physical memory available. Your trivial comment to "buy more RAM" in such cases is silly. Nobody wants to buy more RAM because one process is a memory hog. That silly comment applies just as well to using RAMDirectory. Why would anyone memory-map files at all if they could just "buy more RAM" and load the whole index into RAM. If RAM is unlimited then the memory map is pointless. The reason is that we don't want to buy more RAM whenever a process is a memory hog.

    We are seeing a lucene app with 50 gb of virtual memory. The O/S is running out of swap space, killing performance, because the vast virtual memory in use by Lucene is hogging the physical address space. We're actually hitting the limits on the swap file, which means we are swapping out way too much memory because of the lucene memory mapping.

    It's true that using a different lucene directory mapping would be worse. But it's also true that the MMapDirectory is no panacea, and it has its limits. Virtual address space is not "for free". Unlimited use of virtual memory is not for free.

    ReplyDelete
  19. scf: If your machine is aggressively swapping out process memory in order to make room for memory mapping, you need to adjust your system. If you are using Linux, the keyword is "swappiness". Turning this way down should nullify everything you are ranting about.

    ReplyDelete
  20. Great post! Thanks! I've always reserved a lot of memory for Solr!

    ReplyDelete
  21. Hi, I have a issue here. My server had a crash ans now the solr is not getting started. It is giving this error : org.apache.lucene.index.IndexNotFoundException: no segments* file found in org.apache.lucene.store.MMapDirectory@C:\Program Files\Apache Software Foundation\Tomcat 7.0\india\data\index lockFactory=org.apache.lucene.store.NativeFSLockFactory@72ba007e

    Casically it has mapped the index directory in the viryual memory and accesing it from there. How can i delete this virtual memory casche so that it does not look at the file path mentioned.

    ReplyDelete
  22. Hi rocky,
    this has nothing to do with virtual memory. You index was just corrupted by the crash. The virtual memory is freed asap when the process dies.

    ReplyDelete
  23. Hi Uwe,
    really interesting blog !
    But I have one thing to ask you ...
    Now in Solr the default directory is the "NRTCachingDirectory" that relies on the "StandardDirectoryFactory" .
    Do you know any detail ?
    What are the differences between the MMap approach and this new one ?
    I suppose the new directory implementation is better ?

    ReplyDelete
  24. StandardDirectoryFactory by default uses MMapDirectory, if a 64 bit system is detected. NRTCachingDirectory therefore uses MMapDirectory as the backing perm store. The reason is FSDirectory.open() which returns MMapDirectory on those platforms.

    In gerenal this should be configureable, means NRTCachingDirectory should be configureable to take another DirectoryFactory as delegate.

    ReplyDelete
  25. About the difference: NRTCachingDirectory is just a wrapper around another directoy (in your case MMap). The idea here is to improve the NRT case, where you don't always commit your indexed data but reopen the IndexReader quite often to see changes asap. Once you commit, the data is written to disk. This dir just caches the small amounts of data before they are written to disk on commit. This directory is not really faster than MMap on reading (a little bit it is - see above in blog post, because the Java overhead ByteBuffer vs. byte[]), but makes IO on the filesystem much faster, because the indexed data is not written to disk asap (only on commit or if caching buffer is full).

    ReplyDelete
  26. Uwe , thank you very much for the quick and precise answer :)
    Only few things :
    " This dir just caches the small amounts of data before they are written to disk on commit " -> in the NRT scenario we are talking about the fact that this implementation caches the documents produced by a soft commit until an Hard Commit happens?
    "(only on commit or if caching buffer is full)" -> also in this phrase you are referring to the docs between a Soft and an hard Commit ?

    If i understood well the Real Time Scenario it works like this :
    We have the RamBuffer memory that accumulates documents, as soon as a SoftCommit is triggered this RamBuffer became free, a new Searcher is opened and the docs are moved to the Disk Cache of the NRTCachingDirectory. As soon as an hard commit is triggered the Disk Cache of the NRTCachingDirectory became free and the index is persisted on disk.
    Am I Correct ?

    Thank you in advance

    ReplyDelete
  27. "in the NRT scenario we are talking about the fact that this implementation caches the documents produced by a soft commit until an Hard Commit happens?" -> yes!

    Of course if the buffer is too small, it will write to disk before the hard commit. But a commit still does not happen. The whole idea is just to delay writes as long as possible.

    "If i understood well the Real Time Scenario it works like this" -> yes!

    ReplyDelete
    Replies
    1. Thank you very much Uwe !
      Perfectly clear

      Delete
    2. Hi Uwe, I was going in deep more, because I want to have Clear at most the architecture :)
      I'm going to try to be as clear as possible, because in my opinion in my previous posts I made a mistake :

      Near Real Time Scenario
      First of all we have the Ram Buffer.
      The Ram Buffer will ingest documents until it is full or an hard commit happens. In that case the Ram Buffer is freed.
      When the RamBuffer is full, the index is flushed to the directory but no open Searcher happens .

      When a Soft commit happens we :
      - open a new Searcher
      - invalidate all high level caches ( eventually no per segment caches as for sorted fields or function queries)
      - NO flush to the Solr Directory( this is the point, the RAMBuffer remains ? where is located the memory index for the Soft Commit ? In my opinion, in the RAMBuffer and we don't send anything to the Solr Directory )

      When an Hard Commit Happens:
      - the index segment in memory is flushed to the Solr Directory.
      - the RAMBuffer is freed
      - eventually based on the openSearcher, we open a new Searcher

      After we have flushed our Index to our Directory we close a segment file and we are ready to proceed dependending on the Solr Directory implementation:

      NrtCachingDirectory host small files in cache until is possibile and then finally fsync the files in cache to the Disk.


      Sorry to bother you a lot, but I want to perfectly understand all the process :)

      Cheers

      Delete
  28. Hi Uwe , If i understand correctly during lucene segment merges, it should read few segments and merges as one segment. In this case a read operation happening on index right ? According to lucene docs , MMapDirectory uses memory-mapped IO when reading. If so lucene segment merges will happen on jvm heap memory or out-side jvm memory ?

    ReplyDelete
  29. Hi Anantha,
    merges will use both types of memory: Because merges is not just copying data, while merging, the term dictionary and postings lists have to be rebuilt. E.g., Lucene will build a new FST for the term dictionaryand much more. So merging will need JVM heap and cpu resources, too.

    ReplyDelete
  30. Hi Uwe. When the end of merging occurs at the end of my full import of a 12G index I see 12G+ of actual RES in use, which I assume is for writing the final (single) *.fdt file. The bigger the index gets, the more RES is being used for the merge (to the detriment of other processes on the server which are being swapped out). At some point it seems there won't be enough. Is this unavoidable - do I have to buy as much RAM as the size of my index ?

    (this is solr 3.4 on a jvm with 1g heap on 64-bit linux)

    ReplyDelete
  31. Hi cos inman,
    are you sure that you use MMapDirectory. This looks more like NIOFSDirectory:
    Earlier Lucene versions had a bug that it used quite large I/O buffers when reading stuff like norms. Norms were read as one large chunk. When using NIOFSDirectory, this caused that the JVM internally allocated a direct byte buffer of the requested size (which was pooled, but one you requested a large read, the pool had this large buffer). SimpleFSDirectory had a similar issue, but it just malloced the read buffer and freed it afterwards.
    This was fixed in Lucene 4.5: https://issues.apache.org/jira/browse/LUCENE-5164

    ReplyDelete
  32. Hi Uwe, Thank for the informative article. I am running lucene queries against two identical indexes one stored on Network File storage (SAN) and other on local disk. I am measuing the search respone times by running tests against local index followed by index on NFS.Surpriseingly I am getting better response times with NFS compared to local storage (which supposed to be faster). Could this be due to mmap done during test 1 (with NFS index)? How can I clear the mmap (caching) created by test1 so that I can start run fresh test against local index.

    BTW I am running this against Solaris 10 and 64-bit Java 1.7 update 25 , Lucene 4.2.1

    ReplyDelete
  33. Hi Vijay,
    is this NFS file system or just a SAN server providing a virtual disk like iSCSI?

    ReplyDelete
  34. Hi Uwe, this is NFS file system .

    ReplyDelete