Sphinx in Action: Good and bad in Sphinx real time indexes

This post is part of the "Sphinx in Action" series.

Sphinx has supported real time indexes since version 1.10. Since then they have become more stable and robust, and they are now ready for production use. Many people like them because they are simple to understand (no separate indexing runs, cron tasks, main + delta schemes, and so on), but anyone who wants to use them should also be aware of their drawbacks compared with traditional monolithic indexes.

A real time index consists of 2 parts: one stored in memory and one stored on disk. When an insert/update query is sent to a real time index, the in-memory part is updated. This is very fast, but memory isn't unlimited. Once the in-memory part exceeds rt_mem_limit, it is converted into a disk index chunk, and the process repeats. If you have a 10 GB index and rt_mem_limit is set to 1 GB, you will end up with 10 disk chunks. If you set rt_mem_limit to 10 GB, you will have only 1 disk chunk.

Here's the dilemma: the more disk chunks you have, the worse your search performance will be, but the less memory is needed to support the real time index. Conversely, if you set rt_mem_limit to a high value you will have good search performance because there are fewer disk chunks, but the index will take up more memory on your server. Unfortunately, a Sphinx real time index requires much more memory than a traditional index. This is because a traditional index keeps only attributes and wordlists in memory, while a real time index keeps everything else in memory as well until it is converted into a disk chunk.
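The chunk arithmetic above can be sketched in a few lines of Python. This is only a rough estimate that follows the article's own reasoning (total index size divided by rt_mem_limit, rounded up); the function name and the megabyte units are assumptions made for illustration, and real chunk sizes on disk will differ somewhat.

```python
def estimate_disk_chunks(index_size_mb, rt_mem_limit_mb):
    """Rough number of disk chunks for a real time index.

    Hypothetical helper: assumes each disk chunk holds about
    rt_mem_limit worth of data, as in the 10 GB / 1 GB example.
    """
    if index_size_mb < 0 or rt_mem_limit_mb <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division: a partly filled last chunk still counts.
    return -(-index_size_mb // rt_mem_limit_mb)


if __name__ == "__main__":
    index_size_mb = 10 * 1024  # the 10 GB index from the example
    for limit_mb in (1024, 2048, 5120, 10240):
        chunks = estimate_disk_chunks(index_size_mb, limit_mb)
        print(f"rt_mem_limit={limit_mb} MB -> about {chunks} disk chunk(s)")
```

Fewer chunks means faster searches but a larger in-memory part, which is exactly the trade-off described above.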

Here is what the index files look like for the same data (1M documents) when rt_mem_limit is high enough that the whole real time index stays in memory:

Traditional index:

[root@SE01 snikolaev]# ls -sh idx.sp*
12M idx.spa
186M idx.spd
8.0K idx.sph
11M idx.spi
4.0K idx.spk
4.0K idx.spl
4.0K idx.spm
103M idx.spp
8.0K idx.sps

Real time index:

[root@SE01 snikolaev]# ls -sh idx_rt.*
8.0K idx_rt.kill
4.0K idx_rt.lock
8.0K idx_rt.meta
442M idx_rt.ram

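The files that are kept in memory are idx.spa (attributes) and idx.spi (wordlist) for the traditional index, and idx_rt.ram for the real time index. As you can see, the traditional index requires 23 MB (12 MB + 11 MB) while the real time index needs more than 400 MB of memory.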
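To make the comparison reproducible, here is a small Python sketch that sums the memory-resident files from the two listings above. The helper names are hypothetical, and the choice of which files count as "in memory" (.spa and .spi for the traditional index, .ram for the real time one) is taken from this article rather than measured from a running searchd.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

TRADITIONAL = """
12M idx.spa
186M idx.spd
8.0K idx.sph
11M idx.spi
103M idx.spp
"""

REAL_TIME = """
8.0K idx_rt.kill
4.0K idx_rt.lock
8.0K idx_rt.meta
442M idx_rt.ram
"""


def parse_size(text):
    """Convert an `ls -sh` size such as '8.0K' or '442M' to bytes."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def resident_bytes(listing, suffixes):
    """Sum the sizes of the files whose names end with one of `suffixes`."""
    total = 0.0
    for line in listing.strip().splitlines():
        size, name = line.split()
        if name.endswith(suffixes):
            total += parse_size(size)
    return total


if __name__ == "__main__":
    mb = 1024 ** 2
    traditional = resident_bytes(TRADITIONAL, (".spa", ".spi"))
    real_time = resident_bytes(REAL_TIME, (".ram",))
    print(f"traditional: {traditional / mb:.0f} MB")
    print(f"real time:   {real_time / mb:.0f} MB")
    print(f"ratio:       {real_time / traditional:.1f}x")
```

For this data set the real time index needs roughly 19 times more memory than the traditional one.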

In practice, we usually recommend real time indexes in 2 cases:

1. When the data volume is really small and is not going to grow quickly. It doesn't really matter whether you spend 5 MB or 100 MB if you have a few gigabytes of memory. Real time indexes make sense here because you avoid having an indexing routine and can keep your database and your search index synchronized at the level of individual inserts.
2. When the data volume is large and growing, but you want to reduce indexing latency (i.e. you want your data to appear in the index in real time). And here comes the best part of using real time indexes: they can be combined with traditional indexes using a Sphinx distributed index. In this case you can still use the main + delta scheme, but make the delta a real time index. Once the main part is rebuilt, you clean the real time delta index. Since the delta is real time, new data becomes searchable immediately, and because it's only the delta, it doesn't require a lot of memory. The only routine left is to flush it periodically into the main part. To empty a real time index, the latest Sphinx builds provide the "TRUNCATE RTINDEX" SphinxQL statement.
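The periodic flush from the second case can be sketched as follows. This is a minimal illustration under stated assumptions, not a production routine: the index names main and delta_rt are hypothetical, the cursor is assumed to come from any MySQL-protocol client connected to searchd's SphinxQL listener, and the sketch ignores timing details such as waiting for the rotation to finish or documents inserted while the main index is being rebuilt.

```python
import re

# Conservative pattern for index names, so a name can be placed
# into a statement safely (an assumption for this sketch).
_INDEX_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _check_name(name):
    if not _INDEX_NAME.fullmatch(name):
        raise ValueError(f"unexpected index name: {name!r}")
    return name


def rebuild_main_command(main_index):
    """Argument list that rebuilds the main index and rotates it in."""
    return ["indexer", "--rotate", _check_name(main_index)]


def truncate_statement(rt_index):
    """SphinxQL statement that empties the real time delta."""
    return "TRUNCATE RTINDEX " + _check_name(rt_index)


def flush_delta(run_command, cursor, main_index="main", rt_index="delta_rt"):
    """Rebuild the main part, then clean the real time delta.

    run_command: callable taking an argument list,
                 for example subprocess.check_call.
    cursor:      object with an execute(sql) method.
    """
    run_command(rebuild_main_command(main_index))
    cursor.execute(truncate_statement(rt_index))
```

In a cron job this could be called as flush_delta(subprocess.check_call, connection.cursor()), where connection is whatever MySQL client connection you use to talk to searchd.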