Symas OpenLDAP Knowledge Base

Index Tuning

Index configuration

There are two places where you may configure indexes :

  • in the database section
  • in the global section

Global section

In this section, we can only configure the following parameters

  • olcIndexIntLen : This is used for indexed integer values. Those values may be big, and the associated key will be limited by this parameter (the default is 4 bytes, which means that indexing long values may lead to some collisions). 

  • olcIndexSubstrIfMaxlen : defines how many chars will be used to index an initial or final substring. The defaut is 4 chars. That means, for instance, in a String like “abcdefghijkl”, the initial index will use “abcd” and the final index will use “ijkl”.

  • olcIndexSubstrIfMinlen : defines the minimal number of chars that will be used to index the initial or final substring. the default is 2 chars. That means, for instance, that a String like “a” will not be indexed.

  • olcIndexSubstrAnyLen : defines the number of characters to use in a any substring. The default is 4 chars. For instance, a String like “abcdefgh” will add the following keys in the index : “abcd”, “bcde”, “cdef”, “defg”, “efgh”.

  • olcIndexSubstrAnyStep : defines the number of characters we skip when creating the any inde. The default is 2. For instance, a String like “abcdefgh” will add the following keys in the index : “abcd”, “cdef”, “efgh”.

Usually, you won’t need to change any of those values. If, and only if, the logs demonstrate that many substring filters are used, and after having proved (ie, measured) that changing those parameters MAY provide better performance. 

Database section

Not all the databases allows you to define indexes. Currently, you can do so for mdb

The possible index are :

  • eq : equality
  • sub : substrings (covers the three subinitial, subany and subfinal part)
  • pres : presence

there are a few ‘special’ and rare type of index that can be declared :

  • approx : approximate
  • subinitial : substring initial part
  • subany : substring any part
  • subfinal : substring final part
  • nolang : the values will be indexed regardless of any lang option being present (for instancecn;lang-de and cn;lang-fr will be indexed as if they were part of the cn attribute)
  • nosybtypes : the subtypes of the indexed attribute will not be indexed (for instance, ‘index name eq, nosubtypes’ will index the ‘name’ attribute values, but not the ‘cn’ or any of the ‘name’ subtypes).

Considerations regarding the index type usage

The pres index should rarely be used, because it’s expensive. This is all about the way the indexes are used during a search operation. Basically, if the attribute is used in a lot of entries, it’s certainly a bad idea to add a pres index on this attribute (see later for some rational about this).

The sub index is quite wide. If the search being done are many using the initial filter (ie, something like (cn=abc) ), then prefer using subinitial. The created index will be smaller, and the entry addition will be faster.

The approx index is equivalent to the eq index (AFAICT). It does not make a lot of sense to define one though.

The filters using an inequality might be way less performant when the attribute is indexed, especially when many entries contain this attribute. Typically, a filter like (modifyTimestamp>=20150101010101) will be way slower when the modifyTimestamp is indexed. Side note : a future version of OpenLDAP might ignore the index when such filter is used.

Other important considerations

The ObjectClass attribute should always be indexed.

The AccessLog database should declare the following indexes : entryCSN, objectClass, reqEnd, reqResult, reqStart

How are indexes used in filters ?

It’s important to understand how indexes are used.

When a search is received, it always has a filter. This filter might be complex, with connectors like ANDOR or NOT. In any case, each filter element will be evaluated separately, and this evaluation will create a set of candidates (actually, a set of entry’s ID), and those set will be combined using the following rules :

  • A & B : the set of candidates is the intersection of the set of candidate for A and the set of candidate for B.
  • A | B : the set of candidates is the union of the set of candidate for A and the set of candidate for B.
  • ! A : the set of candidate is the whole database. Actually, it does not define any set of candidates
  • a filter element using a non indexed attribute will not generate a set of candidate

Once we have all the set of candidates evaluated, the search engine will fetch every candidate and check the entry against each of the remaining filters. Either the entry matches, and it is returned, or it does not match and it is discarded.

We see that the process is a two phase process :

  • candidate selection
  • entry filtering

We can also see how this process impact the result, typically for inequality filters on indexed attributes : it may take a while to construct a set of candidates when many entries contain this attribute (we will see later why it’s even slower than expected).

For the same reason, the pres index is generally not a good idea : if many entries have the attribute, then the set of candidate will be big. Constructing this set will take time and will require a lot of fetches from the index.

How do we store index ?

Index are built using the hashed value of the key, and not the key itself. The rational is that the hashed value has a fixed size (a Long), something that is convenient when it comes to store a known number of elements in a page. The consequence is that keys are not anymore ordered in the index. This is why using an inequality index is inefficient when there are a lot of values : the server will have to read all the index to create the list of candidates, when it would be trivial if the index keys were ordered.

Note that this is a technical choice : having fixed size pages for the nodes instead of non-fixed pages (with the chance that a node has to use multiple pages to store all the keys) speedup the access to the index’ values. And considering that inequality indexes are quite unfrequent, this is quite a rational choice.

Possible improvements

The OpenLDAP server could probably be improved in this area. Typically, when a complex filter contains a filter element that will select a small set of candidates, and considering we know that beforehand, it would be useful not to create the set of candiates. The typical use case is when you have such a filter : (&(&(|(eduPersonAffiliation=member)(eduPersonAffiliation=affiliate))(uidNumber>=1000))(uidNumber=ps851). Here, assuming the uidNumber attribute is indexed, there is absolutely no need to compute any of the other set of candidates, because we will only have one entry matching (uidNumber=ps851).

The simplest optimisation would be to check the filter beforhand to see if a equality filter is in use on an attribute which value would be unique across the database in a AND connector at the top level.

A more complex approach would be to evaluate the number of candidates each filter elements will return before computing the set of candidates. This is possible when we have multiple value if we keep the number of values associated to this key. We then could qualify each filter with a number of candidate, and evaluate globally the filter, using the filter element with the smallest number of candidates as the set of candidate, ignoring the other sets.

Index on fixed size attributeTypes, like those having the GeneralizedTime Syntax, could probably use the value as a key instead of hashing them. There are typically indexes used to find values above or below a value, rarely equals to a specific value.

Last, not least, the index don’t care if the candidate are within a scope and if they are below the search base DN. Considering that the server keep a track of the number of subordinates, it would be possible to use this information to limit the set of candidates using the baseDN and the scope.

Now, these approaches might not work well in all the use cases. It will depend on the filter being used, the index being set and the number of potential candidates. At this point, there is no other way than experimenting with real data and real filters. A tool evaluating the various approaches might help the LDAP administrator to define the best possible heuristic (and we can also imagine having more than one search engine), like what RDBMS have (execution plan).

 

Related articles