Semaphores and Semaphore Timeouts

Technote (FAQ)


Question

What are semaphores and semaphore timeouts?

Answer

What is a semaphore and a semaphore timeout?


    In a multitasking environment there is often a requirement to synchronize the execution of various tasks or ensure one process has been completed before another begins. This requirement is facilitated by the use of a software switch known as a Semaphore or a Flag. The function of this is to work in much the same way a railway signal would; only allowing one train on the track at a time. A semaphore timeout is where the railway signal has been set in one state too long, maybe because the train has broken down.

Example of a semaphore timeout in Notes/Domino
    An example of this in Notes/Domino is when the indexer needs to completely rebuild an index, it locks a semaphore so that other tasks cannot use the index until it is rebuilt. If a user task now tries to open that index while it is being rebuilt, it will have to wait for the indexer to finish the rebuild and then unlock the semaphore. As a result, the user task is stuck until that semaphore is unlocked. While it is stuck waiting for the semaphore, it keeps track of how long it has been waiting. If it is stuck for more than 30 seconds, this is considered a semaphore timeout and in debug mode a message will be logged to the console. The task will continue to wait for the semaphore, timing out every 30 seconds, until the semaphore is unlocked or the task is ended. For most operations, a task might only wait a few microseconds and hence not time-out. With a complicated view on a large database, the task may have to wait several minutes for the index semaphore.

    If an important semaphore is locked by a task and is never unlocked, all tasks can be stopped waiting for that semaphore. This can happen in several different ways. The most common is where a task locks the semaphore and then crashes. This can also happen if a task locks the semaphore and then goes into an endless loop or it gets an error and forgets to unlock it.

    Semaphore deadlock can occur when two tasks try to lock two different semaphores in a different order. For example, Task A locks Semaphore 1 and then tries to lock Semaphore 2. In the meantime, Task B has already locked Semaphore 2 and is now trying to lock Semaphore 1. Task A is stuck waiting for Semaphore 2 and Task B is waiting for Semaphore 1 -- deadlock.

Reasons for semaphore timeouts
    When you receive semaphore timeout messages, the messages are usually the result of one of the following:
      1. A heavy load on the server is causing processes to be delayed from releasing semaphores.

      2. A process has crashed while holding a semaphore, causing other processes to block when trying to acquire the semaphore.

      3. Deadly embrace, semaphore contention where two tasks are waiting on each other and neither task is able to break the loop. In the simplest case, thread A is trying to get a semaphore which is owned by B, while B is trying to get a different semaphore which is owned by A. More complex combinations are also possible: A wants a semaphore owned by B, who wants a semaphore owned by C, who wants a semaphore owned by A, etc.

      4. If a process was to fail to set a semaphore during execution, another process dependent on the semaphore will be blocked awaiting the semaphore.

Determining if a semaphore timeout has occurred
    If an issue is a semaphore issue, it will be reported in two ways:

    1. Sem.Timeouts
      At the server console, you will see the Sem.Timeouts statistic. To view this statistic, type the following:

      sh stat sem.timeouts

      If the problem has occurred, you will see something similar to the following, depending upon the nature of the semaphore timeout

      Sem.Timeouts = 430D:58 0A13:42 030B:28 0116:26 0A12:21

      NOTE: The statistic Sem.Timeouts will not appear in the Statrep if the customer is not experiencing semaphore timeouts.

      The first number is the semaphore ID. The ID tells us what the semaphore is used for. For example, 0x030B is the collection semaphore used by NIF. The second number is a decimal number which shows the number of times the semaphore timeout occurred. It is not unusual in a heavily used server to get semaphore timeouts.

      The following table lists some of the semaphores you may encounter:

      Semaphore Function
      0244 NSF per-database semaphore
      410F OS File system semaphore
      0266 NSF per-database full-text semaphore
      4245 NSF database opening semaphore
      030B NIF collection semaphore
      430D NAMELookup semaphore
      0A13 Log commit semaphore
      0116 console semaphore
      0A12 Buffered log package semaphore
      030B Collection semaphore
      4117 Handle table free chain consistency semaphore
      0A0B Session table semaphore
      4113 Handle table movement semaphore
      33d5 New to Domino 4.5, AdminP's semaphore for ACL modification
      4253 internal control semaphore
      5708 BSAFE semaphore (RSA encryption stuff)
      0255 NSF B-Tree semaphore
      0294 Directory Manager Queue semaphore
      1120 Transfer queue lock semaphore
      0931 Task sync semaphore
      0256 NSF buffer pool container
      03A8 DBCONT page zero semaphore
    2. Error: "Session semaphore held for [n] seconds"
      The second way of determining if there is a semaphore issue, is when you see (at the server console) the following message:

      "Session semaphore held for [n] seconds."

      Note: This error does not print to the log file.


Troubleshooting Semaphore Timeouts
    The following parameters can be added to the NOTES.INI file:

    NOTE: Before enabling any debug parameters, discuss them with IBM Support beforehand. There may be issues surrounding their use, or special precautions that must be considered. These debug parameters might require a large amount of disk space, dependent upon when the server encounters problems; the longer the server stays up, the larger the debug files will be. These files can grow large enough to cause disk space shortages.

    Debug_Capture_Timeout=1
    Debug_Show_Timeout=1

    See the following technote for additional information: "How to Turn on Semaphore Debugging Parameters in the NOTES.INI for Domino" ( #1089976)

    After adding these parameters, if semaphores are generated, they will be captured in a file called SEMDEBUG.TXT in the Domino program directory:

    Output of SEMDEBUG.TXT:
      THREAD [02AB:0123] WAITING FOR FRWSEM 0x030B Collection semaphore (@0BE65A20)
      (R=0,W=1,WRITER=0080:015E,1STREADER=0000:0000) FOR 30000 ms

      THREAD [01FE:0220] WAITING FOR FRWSEM 0x0244 open database semaphore (@00ECBDD2)
      (R=0,W=1,WRITER=0080:015E,1STREADER=0000:0000) FOR 30000 ms

    Output of SEMDEBUG.TXT for Unix:
      THREAD [01676:00001] WAITING FOR RWSEM 0x412C (@EE100210) (R=0,W=1,WRITER=05067:00001,1STREADER=05067:00001) FOR 30000 ms

      THREAD [01684:00001] WAITING FOR RWSEM 0x412C (@EE100210) (R=0,W=1,WRITER=05067:00001,1STREADER=05067:00001) FOR 30000 ms

    What does the output mean?
      0x030B - Indicates the type of semaphore

      [01FE:0220]

      The first number (01FE) indicates the process ID. This number is in hex and must be converted to decimal. This number
      will be completely different when the server is rebooted.

      The second number (0220) is the thread ID.

    What does the output mean (in Unix example):
      0x412C - Indicates the type of semaphore

      [01684:00001]

      The first number (01684) indicates the process ID. This number is the PID as indicated in the NSD output and will be completely different when the server is rebooted.

      The second number (00001) is the thread ID.
        NSD output:
        [1] 1684: /opt/lotus/notes/latest/sunspa/tmmscan <--- fatal thread
        ###### thread 1/4 :: tmmscan, pid=1684, lwp=1 ######
        [1] eed396a0 lwp_sema_p (228a30)
        [2] eed396a0 __lwp_sema_wait (228a30, 1d670, 0, 0, 0, 0) + 8
        [3] eefc779c _park (228990, 228a30, 0, 1, eefe6240, 0) + a0
        [4] eefc7554 _swtch (2289a0, 228b90, 228a10, 228a0c, 228a08, 228a04) + 2cc
        [5] eefc5f8c _cond_timedwait_cancel (efffdd90, efffdd78, efffdd70, 228990, eefe52b0, 0) + 1e4
        [6] eefd3420 _ti_sleep (1e, eefe52b0, eefe52b0, effff2e3, effff154, f8000600) + 100
        [7] ef0776e4 fatal_error (b, ef663dd4, ef657598, efffde50, 228a04, 2289e4) + 2c0
        [8] eefd2f20 __libthread_segvhdlr (b, efffe4b0, efffe1f8, efffe138, eefe52b0, 2289e4) + e0
        [9] eefd2334 sigacthandler (b, efffe4b0, 228990, eefe52b0, efffe1f8, eefd2e40) + 6e0
        [10] 000201d0 KC_ScanStart (efffeec0, 3ae18, efffeec0, efffeec4, efffeec8, efffeecc) + 13a0
        [11] 000157b0 AddInMain (22, 1, effff144, ef7ed2b8, ef7ec9c0, 0) + d00
        [12] 0002fb40 NotesMain (1, effff144, effff144, 0, 0, ef7c16e1) + 40
        [13] 0002fa60 notes_main (0, 0, 0, 1, effff144, 0) + a8
        [14] 000147ec _start (0, 0, 0, 0, 0, 0) + dc

        From the Process Tree section:
        username status pid program
        notes06 R 1682 /opt/lotus/notes/latest/sunspa/http
        notes06 R 1684 /opt/lotus/notes/latest/sunspa/tmmscan


    What other information should be collected?

    For Windows NT:
      1. A screenshot of the Windows NTs Task Manager, or the output of the PSTAT utility in the NT resource kit, redirected to a file. (PSTAT > somefile.txt)
        It is important to collect a screen capture of Task Manager or the PSTAT output immediately when the semaphores begin to scroll on the Domino server console. When matching a process ID from SEMDEBUG.TXT to Task Manager, it is important to take the ID from the first semaphore error that is generated in the file. This is usually the offending process.

      2. Issue a "Show Task Debug" command as soon as the semphores begin to scroll on the Domino server console.

      Example: sh task debug>debug.txt

    For Unix:
      1. Run NSD under root. Make sure "KillProcess=1" is removed from the NOTES.INI to prevent the server from killing the process when the server crashes, thus preventing mapping of PID and semdebug.txt.
        It is important to collect an NSD immediately when the semaphores begin to scroll on the Domino server console. When matching a process ID from SEMDEBUG.TXT to NSD - Process Tree, it is important to take the ID from the first semaphore error that is generated in the file. This is usually the offending process.

      2. Issue a "Show Task Debug" command as soon as the semphores begin to scroll on the Domino server console.
        Example: sh task debug>debug.txt
        Please contact Notes Technical Support if you need further assistance.


How to utilize SEMDEBUG.TXT and Task Manager
    Once the SEMDEBUG.TXT file is collected, it is the FIRST line that is the most important in determining the problem. This line usually contains the offending process. Specifically we want to look at the OWNER process.

    Caveat: There is no time/date stamp in the file (unless the Debug_Capture_Timeout=10 parameter is used for a Domino 5.0.9 or later server).

    The first number in brackets is the Process ID (PID). The second number (after the colon) is the thread ID. The Process ID must be converted from hex to decimal.

    In the excerpt below the OWNER is PID 009A. PID 009A converts to a decimal number of 154 (Process 0053 is waiting for process 009A to release the semaphore). When matched up to Figures 1 and 2 below (screen shots of NT Task Manager), it is apparent that Agent Manager is the offending process.

    Sample SEMDEBUG.TXT

    THREAD [0053:002F] WAITING FOR SEM 0x042D LSIWRAP shared semaphore (@01342C68) ( OWNER= 009A:0099) FOR 120 ms

    THREAD [0045:0058] WAITING FOR SEM 0x042D LSIWRAP shared semaphore (@01342C68) (OWNER=009A:0099) FOR 120

    THREATHREAD [00A4:00A3] WAITING FOR SEM 0x042D LSIWRAP shared semaphore (@01342C68) (OWNER=009A:0099) FOR 120 ms

    THREAD [009E:009D] WAITING FOR SEM 0x042D LSIWRAP shared semaphore (@01342C68) (OWNER=0000:0000) FOR 120 ms

    THREAD [0045:0058] WAITING FOR SEM 0x042D LSIWRAP shared semaphore (@01342C68) (OWNER=009E:009D) FOR 120 ms

    THREAD [00A0:009F] WAITING FOR SEM 0x042D LSIWRAP shared semaphore (@01342C68) (OWNER=009E:009D) FOR 120 ms

    THREAD [00A2:00A1] WAITING FOR SEM 0x042D LSIWRAP shared semaphore (@01

Figure 1 (Task Manager):
    Note: If you are using a browser or client that is not displaying Figures 1 and 2 below, please reference the document titled "How To Utilize SEMDEBUG.TXT and Task Manager" (#175384), where you should be able to view the figures.



Figure 2 (Task Manager):

Related information

How to Turn on Semaphore Debugging Parameters
How To Utilize SEMDEBUG.TXT and Task Manager
A simplified Chinese translation is available

Cross reference information
Segment Product Component Platform Version Edition
Messaging Applications IBM Notes 8.5, 8.0, 7.0

Historical Number

112710

Rate this page:

(0 users)Average rating

Add comments

Document information


More support for:

IBM Domino

Software version:

7.0, 8.0, 8.5

Operating system(s):

AIX, Linux, Solaris, Windows, i5/OS, z/OS

Reference #:

1094630

Modified date:

2006-04-19

Translate my page

Machine Translation

Content navigation