Symas OpenLDAP Knowledge Base

Tracing Replication Issues in slapd Logs

THIS HAS TO BE BROKEN UP INTO MANY LINKED PAGES. PROBABLY GROSSLY TOO DETAILED

A few items to note:

  • Log files with synchronization messages can be extremely large and difficult to work with.
    • Familiarity with text processing utilities like grep, sed, sort, etc. can help filter out log messages unrelated to replication.
    • Terminal-based text editors like vim or emacs tend to work better than GUI-based editors when working with extremely large log files.
  • Most syncrepl client connections last for several days and log hundreds of thousands of messages. If a syncrepl client connection spans several days, log rotation will cause the connection messages to span several archive files. This is normal.
    • To recombine archived and compressed slapd logs, use zcat file1.gz file2.gz [file...] > newlog.log

Effective use of RIDs

Some syncrepl log messages are not recorded with an associated connection ID, but, they record the RID (Replica ID) used by the syncrepl client. RIDs may be an integer between 000 and 999.

In MMR environments a DIT being replicated may have one or more syncrepl client configurations (one for each provider). Each syncrepl client for that DIT must have a unique RID. Consumers do not share their RIDs with other consumers. This means consumers can use the same RIDs for their syncrepl clients as the other consumers. In an example MMR environment:

  • There are two providers (P1 and P2) and two consumers (C1 and C2)
  • C1 uses RIDs 001 for P1 and uses 002 for P2
  • C2 uses RIDs 001 for P1 and uses 002 for P2
  • P1 uses RID 001 for P2, and P2 uses RID 001 for P1

Having identical RIDs on each consumer is commonly used and does not create a conflict. However, using identical RIDs for a provider on several consumers can make troubleshooting difficult as identical RIDs make it difficult to tell from the provider side which consumer is associated with a replication issue. It is recommended that a numbering scheme is used to help differentiate syncrepl clients. For example:

  • C1 uses RID 011 for P1 and uses 012 for P2
  • C2 uses RID 021 for P1 and uses 022 for P2
  • P1 uses RID 302 for P2, and P2 uses RID 301 for P1

Log Entry Format

When slapd is run in regular mode (no -d at startup), log messages are formatted and saved by the syslog daemon.

The general pattern for a slapd log entry is as follows:

timestamp<space>hostname<space>slapd[process-id]:<space>message

Note: Timestamp formats vary depending on the OS and sylog configuration.

When running slapd in debug mode (-d ), log messages are written directly to stderr and/or the debug log file (configured as logfile/olcLogfile) instead of being formatted and saved by the syslog daemon. This results in a slightly different slapd log entry pattern:

timestamp(epoch in hexadecimal)<space>message

This pattern prints the timestamp as a less than human readable Unix epoch timestamp in hexadecimal format. Debug log messages also exclude the hostname, process name and process ID number. If a human-readable timestamp is needed from a debug log file, the following perl one-liner will convert the timestamp:

perl -pi -e 'use POSIX "strftime"; s/^([[:xdigit:]]{8})/POSIX::strftime("%F %T", gmtime(hex $1))/ge' <infile>.log

Identifying and Following Replication Sessions

Identifying Replication Connections

When consumers are connected to providers, the normally use the sync type “refreshAndPersist” (only in very specific cases should consumers use “refreshOnly”). The “refreshAndPersist” sync type maintains a continual ldapsearch on the provider to monitor for and process changes from the provider. In the provider logs, the messages for these connections look like any other ldapsearch log messages. There are a couple of ways to find syncrepl connections in the log files.

  • Look for connections with large “op=NNN” values. The “op=NNN” is the sequence number of the operations performed during an individual connection. (Replication sessions typically last forever while other sessions are short-time lived)
  • Look for connections that begin with ‘SRCH base="<searchbase>" scope=2 deref=0 filter="(objectClass=\*)"’ and are followed by messages with ‘SRCH attr=* +’ This is the exact search filter used by syncrepl, most of the time, unless the syncrepl configuration specified a different filter - and in this case, try to match it in the logs).

Following a Replication Session: Single-Master

The following entry and log examples follow an LDAP replication session in a single-master environment from the start of a syncrepl connection to the completion of one modification. In the example, the provider is “P1” and the consumer is “C1” using RID=001.

Pre-update contextCSN and entryCSN:

– P1 –

dn: dc=example,dc=com
contextCSN: 20161117195107.062105Z#000000#000#000000
dn: cn=foo,dc=example,dc=com
entryCSN: 20161117193250.972569Z#000000#000#000000

– C1 –

dn: dc=example,dc=com
contextCSN: 20161117195107.062105Z#000000#000#000000
dn: cn=foo,dc=example,dc=com
entryCSN: 20161117193250.972569Z#000000#000#000000

Provider Log:

Consumer connects to provider and starts persistent search:

01: Nov 17 11:56:33 P1 slapd[3011]: conn=1000 fd=10 ACCEPT from IP=127.0.0.1:54514 (IP=0.0.0.0:389)
02: Nov 17 11:56:33 P1 slapd[3011]: conn=1000 op=0 BIND dn="cn=replicator,dc=example,dc=com" method=128
03: Nov 17 11:56:33 P1 slapd[3011]: conn=1000 op=0 BIND dn="cn=replicator,dc=example,dc=com" mech=SIMPLE ssf=0
04: Nov 17 11:56:33 P1 slapd[3011]: conn=1000 op=0 RESULT tag=97 err=0 duration=0.229ms text=
05: Nov 17 11:56:33 P1 slapd[3011]: conn=1000 op=1 SRCH base="dc=example,dc=com" scope=2 deref=0 filter="(objectClass=*)"
06: Nov 17 11:56:33 P1 slapd[3011]: conn=1000 op=1 SRCH attr=* +

Client connects to provider and requests a MOD operation:

07: Nov 17 11:56:41 P1 slapd[3011]: conn=1001 fd=25 ACCEPT from IP=127.0.0.1:59724 (IP=0.0.0.0:389)
08: Nov 17 11:56:41 P1 slapd[3011]: conn=1001 op=0 BIND dn="dc=example,dc=com" method=128
09: Nov 17 11:56:41 P1 slapd[3011]: conn=1001 op=0 BIND dn="dc=example,dc=com" mech=SIMPLE ssf=0
10: Nov 17 11:56:41 P1 slapd[3011]: conn=1001 op=0 RESULT tag=97 err=0 duration=0.068ms text=
11: Nov 17 11:56:45 P1 slapd[3011]: conn=1001 op=1 MOD dn="cn=foo,dc=example,dc=com"
12: Nov 17 11:56:45 P1 slapd[3011]: conn=1001 op=1 MOD attr=sn

Provider queues a new entryCSN for the entry being modified:

13: Nov 17 11:56:45 P1 slapd[3011]: slap_queue_csn: queueing 0x7fa558106980 20161117195645.269065Z#000000#000#000000

Modification is written successfully:

14: Nov 17 11:56:45 P1 slapd[3011]: conn=1001 op=1 RESULT tag=103 err=0 duration=0.513ms text=

Provider commits the new entryCSN:

15: Nov 17 11:56:45 P1 slapd[3011]: slap_graduate_commit_csn: removing 0x7fa558106980 20161117195645.269065Z#000000#000#000000

Provider notifies the consumer of new contextCSN:

16: Nov 17 11:56:45 P1 slapd[3011]: syncprov_sendresp: cookie=rid=001,csn=20161117195645.269065Z#000000#000#000000

Client disconnects:

17: Nov 17 11:56:48 P1 slapd[3011]: conn=1001 op=2 UNBIND
18: Nov 17 11:56:48 P1 slapd[3011]: conn=1001 fd=25 closed

The persistent search from the consumer continues to listen for changes until the connection is severed or slapd is shut down


Consumer Log: 

Consumer has connected with provider and is waiting for changes:

01: Nov 17 11:56:33 C1 slapd[4031]: do_syncrep2: rid=001 LDAP_RES_INTERMEDIATE - REFRESH_DELETE

Provider has notified (“sent sync cookie”) to consumer with the new contextCSN:

02: Nov 17 11:56:45 C1 slapd[4031]: do_syncrep2: rid=001 cookie=rid=001,csn=20161117195645.269065Z#000000#000#000000

Syncrepl has a message to modify the cn=coo,dc=example,dc=com entry:

03: Nov 17 11:56:45 C1 slapd[4031]: syncrepl_message_to_entry: rid=001 DN: cn=foo,dc=example,dc=com, UUID: f9c248aa-3a99-1036-9792-57a50c7ece6f

Syncrepl searches for the entry to be modified:

04: Nov 17 11:56:45 C1 slapd[4031]: syncrepl_entry: rid=001 LDAP_RES_SEARCH_ENTRY(LDAP_SYNC_MODIFY)

Entry found (“(0)” = Success):

05: Nov 17 11:56:45 C1 slapd[4031]: syncrepl_entry: rid=001 be_search (0) 06: Nov 17 11:56:45 C1 slapd[4031]: syncrepl_entry: rid=001 cn=foo,dc=example,dc=com

Slapd queues and commits the new CSN on the entry:

07: Nov 17 11:56:45 C1 slapd[4031]: slap_queue_csn: queueing 0x7f51f4109690 20161117195645.269065Z#000000#000#000000 08: Nov 17 11:56:45 C1 slapd[4031]: slap_graduate_commit_csn: removing 0x7f51f4109690 20161117195645.269065Z#000000#000#000000

Syncrepl successfully performs entry modification:

09: Nov 17 11:56:45 C1 slapd[4031]: syncrepl_entry: rid=001 be_modify cn=foo,dc=example,dc=com (0)

Slapd queues and commits the new contextCSN for the replica:

10: Nov 17 11:56:45 C1 slapd[4031]: slap_queue_csn: queueing 0x7f51f410cd80 20161117195645.269065Z#000000#000#000000
11: Nov 17 11:56:45 C1 slapd[4031]: slap_graduate_commit_csn: removing 0x7f51f410cd80 20161117195645.269065Z#000000#000#000000

Post-update contextCSN and entryCSN:

  • On P1
dn: dc=example,dc=com
contextCSN: 20161117195645.269065Z#000000#000#000000
dn: cn=foo,dc=example,dc=com
entryCSN: 20161117195645.269065Z#000000#000#000000
  • on C1
dn: dc=example,dc=com
contextCSN: 20161117195645.269065Z#000000#000#000000
dn: cn=foo,dc=example,dc=com
entryCSN: 20161117195645.269065Z#000000#000#000000

Following a Replication Session: Multi-Master

IN PROGRESS

Demo Server Environment
  • MMR environment using two providers and two consumers
  • Each provider replicates from the opposite provider
  • Each consumer replicates from both providers
Name Type serverID RID Provider
p1 Provider 1 111 ldap://p2/
p2 Provider 2 121 ldap://p1/
c1 Consumer N/A 211 ldap://p1/
212 ldap://p2/
c2 Consumer N/A 221 ldap://p1/
222 ldap://p2/

In this demonstration, ldapmodify applies the following update:

dn: cn=te1,dc=example,dc=com"
changetype: modify
replace: sn
sn: Flenderson

Pre-update Replication State: All contextCSNs for the dc=example,dc=com and cn=accesslog backends are synchronized. The “0s” values indicate the time difference between CSNs.

ContextCSN values for: dc=example,dc=com
| RID|          p1:389|        p2:10389|        c1:20389|        c2:30389|
| 000|  20161122083007|  20161122083007|  20161122083007|  20161122083007|
| 001|  20161122104905|              0s|              0s|              0s|
| 002|  20161122105116|              0s|              0s|              0s|

ContextCSN values for: cn=accesslog
| RID|          p1:389|        p2:10389|
| 000|                |  20161122083007|
| 001|  20161122104905|              0s|
| 002|  20161122105116|              0s|

Provider 1:

# Client binds to provider, makes MOD operation to change the sn attribute of "cn=te1,dc=example,dc=com".
00: Nov 22 03:00:08 p1 slapd[342]: conn=1005 fd=10 ACCEPT from IP=127.0.0.1:54514 (IP=0.0.0.0:389)
01: Nov 22 03:00:08 p1 slapd[342]: conn=1005 op=1 MOD dn="cn=te1,dc=example,dc=com"
02: Nov 22 03:00:08 p1 slapd[342]: conn=1005 op=1 MOD attr=sn
# Provider queues new CSNs for the entry and context
03: Nov 22 03:00:08 p1 slapd[342]: slap_queue_csn: queueing 0x7fed6810f650 20161122110008.590508Z#000000#001#000000
04: Nov 22 03:00:08 p1 slapd[342]: slap_queue_csn: queueing 0x7fed6810e4f0 20161122110008.590508Z#000000#001#000000
# MOD is successful, new CSN is applied to the entry and context
05: Nov 22 03:00:08 p1 slapd[342]: slap_graduate_commit_csn: removing 0x7fed6810e4f0 20161122110008.590508Z#000000#001#000000
07: Nov 22 03:00:08 p1 slapd[342]: slap_graduate_commit_csn: removing 0x7fed6810f650 20161122110008.590508Z#000000#001#000000
# Provider returns LDAP_SUCCESS message to client
06: Nov 22 03:00:08 p1 slapd[342]: conn=1005 op=1 RESULT tag=103 err=0 duration=0.805ms text=
# Syncprov sends sync cookie to p2 (serverID=2). Cookie is for replica ID 121 (on p2) from serverID=1 with new contextCSN
08: Nov 22 03:00:08 p1 slapd[342]: syncprov_sendresp: to=002, cookie=rid=121,sid=001,csn=20161122110008.590508Z#000000#001#000000
# Syncprov sends sync cookies to consumers c1 (rid=211) and c2 (rid=221) with new contextCSN
09: Nov 22 03:00:08 p1 slapd[342]: syncprov_sendresp: cookie=rid=211,sid=001,csn=20161122110008.590508Z#000000#001#000000
10: Nov 22 03:00:08 p1 slapd[342]: syncprov_sendresp: cookie=rid=221,sid=001,csn=20161122110008.590508Z#000000#001#000000
# Client unbinds from provider
11: Nov 22 03:00:10 p1 slapd[342]: conn=1005 op=2 UNBIND
<sync complete on provider>

Provider 2:

# Provider p2 receives the sync cookie from p1 with new contextCSN
01: Nov 22 03:00:08 p2 slapd[442]: do_syncrep2: rid=121 cookie=rid=121,sid=001,csn=20161122110008.590508Z#000000#001#000000
# Provider p2 queues the new CSN for the entry and context
02: Nov 22 03:00:08 p2 slapd[442]: slap_queue_csn: queueing 0x7f03e0101950 20161122110008.590508Z#000000#001#000000
03: Nov 22 03:00:08 p2 slapd[442]: slap_queue_csn: queueing 0x7f03e0105bd0 20161122110008.590508Z#000000#001#000000
# Syncprov skips sending updates to p1 (serverID=1), as it is the source of the MOD op
04: Nov 22 03:00:08 p2 slapd[442]: syncprov_matchops: skipping original sid 001
# Update from p1 is good, p2 commits the new CSN to the entry and context
05: Nov 22 03:00:08 p2 slapd[442]: slap_graduate_commit_csn: removing 0x7f03e0105bd0 20161122110008.590508Z#000000#001#000000
06: Nov 22 03:00:08 p2 slapd[442]: slap_graduate_commit_csn: removing 0x7f03e0101950 20161122110008.590508Z#000000#001#000000
# Indicates a successful MOD operation on the entry
07: Nov 22 03:00:08 p2 slapd[442]: syncrepl_message_to_op: rid=121 be_modify cn=te1,dc=example,dc=com (0)
08: Nov 22 03:00:08 p2 slapd[442]: slap_queue_csn: queueing 0x7f03e01019a0 20161122110008.590508Z#000000#001#000000
09: Nov 22 03:00:08 p2 slapd[442]: slap_queue_csn: queueing 0x7f03e010af80 20161122110008.590508Z#000000#001#000000
10: Nov 22 03:00:08 p2 slapd[442]: slap_graduate_commit_csn: removing 0x7f03e010af80 20161122110008.590508Z#000000#001#000000
11: Nov 22 03:00:08 p2 slapd[442]: slap_graduate_commit_csn: removing 0x7f03e01019a0 20161122110008.590508Z#000000#001#000000
# Syncprov sends sync cookies to c1 (rid=212) and c2 (rid=222)
12: Nov 22 03:00:08 p2 slapd[442]: syncprov_sendresp: cookie=rid=212,sid=002,csn=20161122110008.590508Z#000000#001#000000
13: Nov 22 03:00:08 p2 slapd[442]: syncprov_sendresp: cookie=rid=222,sid=002,csn=20161122110008.590508Z#000000#001#000000
<sync complete on provider>

Consumer 1:

# Syncrepl consumer rid=211 gets sync cookie for MOD from p1 (serverID=1) and queues CSN
01: Nov 22 03:00:08 c1 slapd[482]: do_syncrep2: rid=211 cookie=rid=211,sid=001,csn=20161122110008.590508Z#000000#001#000000
02: Nov 22 03:00:08 c1 slapd[482]: slap_queue_csn: queueing 0x7f68dc10cc40 20161122110008.590508Z#000000#001#000000
# Syncrepl consumer rid=212 gets sync cookie for MOD from p2 (serverID=2)
03: Nov 22 03:00:08 c1 slapd[482]: do_syncrep2: rid=212 cookie=rid=212,sid=002,csn=20161122110008.590508Z#000000#001#000000
# Update from p1 is written to c1 and queued CSN is committed
04: Nov 22 03:00:08 c1 slapd[482]: slap_graduate_commit_csn: removing 0x7f68dc10cc40 20161122110008.590508Z#000000#001#000000
05: Nov 22 03:00:08 c1 slapd[482]: syncrepl_message_to_op: rid=211 be_modify cn=te1,dc=example,dc=com (0)
06: Nov 22 03:00:08 c1 slapd[482]: slap_queue_csn: queueing 0x7f68dc10cee0 20161122110008.590508Z#000000#001#000000
07: Nov 22 03:00:08 c1 slapd[482]: slap_graduate_commit_csn: removing 0x7f68dc10cee0 20161122110008.590508Z#000000#001#000000
# Syncrepl consumer ignores update from rid=212, as it has already been saved via rid=211
08: Nov 22 03:00:08 c1 slapd[482]: do_syncrep2: rid=212 CSN pending, ignoring 20161122110008.590508Z#000000#001#000000 (reqStart=20161122110008.000001Z,cn=accesslog)
<sync complete on consumer>

Consumer 2:

# Syncrepl consumer rid=222 gets sync cookie for MOD from p2 (serverID=2), queues CSN
01: Nov 22 03:00:08 c2 slapd[593]: do_syncrep2: rid=222 cookie=rid=222,sid=002,csn=20161122110008.590508Z#000000#001#000000
02: Nov 22 03:00:08 c2 slapd[593]: slap_queue_csn: queueing 0x7fba68109960 20161122110008.590508Z#000000#001#000000
# Entry is updated, CSN is committed and success message is applied
03: Nov 22 03:00:08 c2 slapd[593]: slap_graduate_commit_csn: removing 0x7fba68109960 20161122110008.590508Z#000000#001#000000
04: Nov 22 03:00:08 c2 slapd[593]: syncrepl_message_to_op: rid=222 be_modify cn=te1,dc=example,dc=com (0)
05: Nov 22 03:00:08 c2 slapd[593]: slap_queue_csn: queueing 0x7fba6810ca50 20161122110008.590508Z#000000#001#000000
06: Nov 22 03:00:08 c2 slapd[593]: slap_graduate_commit_csn: removing 0x7fba6810ca50 20161122110008.590508Z#000000#001#000000
# Update form rid=221 is ignored because the CSN is less than or equal to the current CSN
07: Nov 22 03:00:08 c2 slapd[593]: do_syncrep2: rid=221 cookie=rid=221,sid=001,csn=20161122110008.590508Z#000000#001#000000
08: Nov 22 03:00:08 c2 slapd[593]: do_syncrep2: rid=221 CSN too old, ignoring 20161122110008.590508Z#000000#001#000000 (reqStart=20161122110008.000000Z,cn=accesslog)
<sync complete on consumer>

Post-Update Sync State:

ContextCSN values for: dc=example,dc=com
| RID|          p1:389|        p2:10389|        c1:20389|        c2:30389|
| 000|  20161122083007|  20161122083007|  20161122083007|  20161122083007|
| 001|  20161122110008|              0s|              0s|              0s|
| 002|  20161122105116|              0s|              0s|              0s|

ContextCSN values for: cn=accesslog
| RID|          p1:389|        p2:10389|
| 000|                |  20161122083007|
| 001|  20161122110008|              0s|
| 002|  20161122105116|              0s|