Open Mic Q&A: Troubleshooting an IBM Lotus Domino Server Crash - 31 August 2011
IBM hosted an Open Mic on the topic of "Troubleshooting an IBM Lotus Domino Server Crash" on August 31, 2011. The presentation, recording and transcript are provided within.
For more information about our Open Mic webcasts, visit the IBM Collaboration Solutions Support Open Mics page.
Questions and Answers:
Q: How is NSD configured to run automatically on Domino servers? We get faults and they recover and the NSD is created automatically. From time to time we get a fault report "warning crash information not extracted, make sure NSD is configured to run on the server document". Why do we get that?
A: This error message is reported by senddiag, which drives Automatic Diagnostic Data Collection (ADC) after a crash. This message is displayed if the Domino version is not populated. This information is parsed from the NSD file, and could indicate that there is no NSD file associated with the ADC.
Q: Is there a time limit on how long an NSD should run before we decide to kill it?
A: How long NSD takes to complete varies with the amount of data being collected. For example, it can depend on how many processes are running on the box, and with Sametime this number can get rather high. For each process, NSD typically has to dump the call stack of every thread and report Domino memory statistics. If each process is consuming a large amount of virtual memory, reporting the Domino memory statistics can take longer. While NSD collects this data, it suspends the processes it is collecting from, like a debugger, so those processes no longer consume CPU.
A common time sink for NSD is the Directory Listings section. This recursively lists every file and directory under the data directory, and also lists the IBM_TECHNICAL_SUPPORT directory, the program directory, and the transaction logs directory. This can take a long time depending on the directory structure under the data directory. XPages has also introduced a large number of files, which take time to list. If you determine that NSD is taking a long time because of the Directory Listings section, the NSD.INI parameter nofs=1 (UNIX) or nodirlist=1 (Windows) will prevent this section from being printed.
The -runtime parameter specifies the maximum amount of time that NSD is to run in seconds. Typically you will see -runtime 300 passed into NSDs run as the result of a crash or panic. If NSD takes longer than this amount of time, then it will time out and exit.
If you're seeing long NSD run times (greater than 10 minutes), open the NSD file and see where the time is being spent. Timestamps are recorded next to every section header, so it is easy to determine which section is slow.
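The timestamp check described above can be sketched in Python. This is only a rough illustration: the header layout it looks for, a line containing "(Time HH:MM:SS)", is an assumption, and the sample lines are made up rather than real NSD output. Adjust the pattern to match the section headers in your own NSD file.

```python
import re
from datetime import datetime, timedelta

# Assumed header shape: any line containing "(Time HH:MM:SS)".
# Adjust this pattern to match the headers in your own NSD output.
HEADER_TIME = re.compile(r"\(Time (\d{2}:\d{2}:\d{2})\)")


def section_durations(lines):
    """Return (header_line, seconds_until_next_header) pairs, slowest first.

    The last header has no following header, so it gets no duration.
    """
    stamped = []
    for line in lines:
        match = HEADER_TIME.search(line)
        if match:
            when = datetime.strptime(match.group(1), "%H:%M:%S")
            stamped.append((line.strip(), when))

    durations = []
    for (header, start), (_, end) in zip(stamped, stamped[1:]):
        delta = end - start
        if delta < timedelta(0):
            delta += timedelta(days=1)  # the run crossed midnight
        durations.append((header, int(delta.total_seconds())))
    return sorted(durations, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample lines, not real NSD output.
    sample = [
        "Section A (Time 10:00:00)",
        "some data",
        "Section B (Time 10:00:05)",
        "Section C (Time 10:04:05)",
    ]
    for header, seconds in section_durations(sample):
        print(f"{seconds:5d}s  {header}")
```

With the sample lines, Section B is listed first because four minutes pass before the next header appears.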
Q: What is the recommendation when the server is hung? I tried to run a manual NSD, but it runs and runs and never finishes. If I look at Task Manager, I see several NSD.EXE processes running. To restart the server I have to manually kill all processes.
A: On the Windows platform, it is completely expected to see multiple processes named NSD.EXE in Task Manager. There is a single NSD.EXE that performs multiple functions:
1. As the Lotus Domino Diagnostics Service
2. As the stack collection and platform data collection piece of NSD
3. As memcheck, which collects information regarding the Domino Memory Manager
With the service enabled, you will always see an NSD.EXE running as SYSTEM in Task Manager. If you run a manual NSD with the service enabled, the following actions will occur:
1. The manually executed NSD will pass its arguments to the service
2. The service will take those arguments and fork a child process to do the work
3. If you are collecting memcheck data, which happens by default, the NSD spawned to do the work in step 2 will fork another child process to run as memcheck.
So in all you would have 4 processes with the name NSD.EXE: the service, the NSD the user ran, the NSD child that the service spawned to do the work, and the child process spawned to run memcheck. Similar actions also occur if NSD is triggered by a crash or panic, the difference being that the user is not executing an NSD process manually.
Q: When I enable debug settings, the live console is unreadable because of all the data being generated. Is there a setting to keep the debug tools off the live console?
A: Many debug settings support different levels of verbosity. You could check for a technote on the debug parameter to find its settings, or Support could tell you the appropriate level to set based on the problem being investigated. If the debug is for a specific process, one thing you can do is start that process first in its own command prompt/console and then start the Domino server. This way all the extraneous debug output will go to the console of the process that is running stand-alone.
Q: After our server crashed, I very frequently see the FILERET.EXE process doing something with the semaphores and it takes a very long time before the server restarts. Is this expected?
A: The FILERET process starts on server startup. It scans the diagnostic directory for files with the pattern "_<machine name>_". It is looking for files that exceed the configured number of days to keep diagnostic files (default of 365 days) and removes them. It also runs an NSD -info and exports a DXL copy of the server document. You should drill down on what semaphore is being reported so that this problem can be properly investigated.
FILERET also does the Domino initialization of many of the subsystems, and during this initialization some system databases may require FIXUP to run because the server did not shut down cleanly or crashed. FIXUP may need to run to completion on LOG.NSF, names.nsf, EVENTS4.NSF, etc. On servers that have Transaction Logging enabled this FIXUP delay can be minimized, but if you don't have transaction logging enabled, FIXUP on these databases can be time consuming.
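To preview which diagnostic files fall outside the retention window described above, a minimal Python sketch follows. It is not how FILERET is implemented. It assumes file age is judged by modification time, which the answer above does not state, and the function name and parameters are hypothetical. It only lists files and removes nothing.

```python
import time
from pathlib import Path


def expired_diagnostic_files(diag_dir, machine_name, keep_days=365, now=None):
    """List files named with _<machine_name>_ that are older than keep_days.

    Assumption: age is measured from the file's modification time.
    """
    now = time.time() if now is None else now
    cutoff = now - keep_days * 86400  # seconds per day
    marker = f"_{machine_name}_"
    return sorted(
        path
        for path in Path(diag_dir).iterdir()
        if path.is_file()
        and marker in path.name
        and path.stat().st_mtime < cutoff
    )
```

Calling it with the path of the diagnostic directory and the machine name returns the matching files in sorted order, which can be reviewed before any manual cleanup.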
Q: Has anybody seen issues with Sametime 8.5.2 running on Domino 8.5.2 FP3 hanging where STMMP.EXE takes 100% CPU?
A: We do not know of any known issues with STMMP.EXE on 8.5.2. STMMP.EXE is a process belonging to the classic meeting server (it is a multimedia server component for audio/video). It is on by default on all Sametime Community servers, but can be disabled if you do not use instant meetings, or if you have a standard meeting server outside of the Classic one. Changing to the embedded client would have no effect on STMMP.EXE. If you do not need classic meetings, you can simply disable the Sametime Meeting Server service (from the control panel/services) and restart the server. You should open a PMR if you see this on a classic/legacy meeting server to determine why. NSD won't help diagnose in this situation.
Q: We are interested in what the Automatic Corrupt Database Collector tool does and the value of it. I understand it captures current database information, but then what does it do with that information? What happens to the database?
A: When the Domino server detects that a database has corruption, it will automatically store the corrupted database as a separate file (with a .cor extension) in a separate directory. This is not something you run all the time, but that you run when you suspect a problem. The tool captures information that can be analyzed to determine what is causing the crash. You can also run maintenance to try to fix the NSF database to fix the problem. For more information, see Detecting corrupted databases (Technote #1429891)
Q: Will IBM Support Assistant (ISA) Lite come packaged with Domino 8.5.3?
A: ISA Lite is installed separately from Domino. For more information, see Collecting Data: Read first for Lotus Notes & Domino (Technote #1415777)
Q: Is it useful to search on the word "Panic" inside the NSD file? I have done that in the past, but I am not sure how useful that is.
A: "Panic" is only useful sometimes. Inside the Domino code there is a function implemented called Panic(). This function invokes NSD and cleans up the Notes/Domino instance if an unexpected, invalid, and unrecoverable state has been reached. By calling panic, a crash has not technically happened, but we've invoked a function called Panic() so that NSD is invoked and cleanup occurs. This forces the server back into a good state by tearing it down completely.
You will be able to determine the thread which encountered this condition by searching for "fatal". On Windows, the fatal thread is tagged with a FATAL label, and under UNIX you will see a fatal_error function in the crashing call stack. In this case Panic() would not be present. "Segments" would generally be found in the same situation where you would find "fatal". "Child_died" is useful in a specific situation where NSD is run if a Domino server process exited abnormally, which then invoked NSD.
Q: When stopping the server on iSeries, we crash due to "Process SERVER (0xA26E) is waiting for Subprocess HTTP (0xA280) to terminate. Please wait...". After that we need to kill it. Is there any choice to solve this issue?
A: If an HTTP thread is hung attempting to process a request, the server will not shut down until the thread has finished its work. You would need to investigate what this "hung" HTTP thread is working on. Issue the command "tell http show thread state" prior to shutdown, and take a manual NSD when the server fails to shut down. Then work with Support to analyze the server console log and the manual NSD.
Q: Is there any tool available to read memory dump files?
A: Check out this wiki article: Using the LND tool to analyze IBM Lotus Notes and Domino hangs and crashes
Q: Is there a description of the memory blocks listed in the NSD? For example BLK_LOCAL or BLK_PCB?
A: There is not one document that lists the various types, but there are technotes that contain details about a block pertaining to a specific issue.
Q: We have eleven 8.5.2 FP1 Domino servers and every time we "quit" them they generate an NSD. I would like for this not to happen.
A: NSD will be run when the server shutdown exceeds a specific threshold. There are cases when a task just takes time to quit normally and the shutdown process can take several minutes, especially if there are a large number of processes. If server shutdown is taking much longer than normal and an NSD is being generated, the most important thing you can do is look at the stacks captured in the NSD and identify what the processes were doing at the time of the shutdown. We recommend trying to understand what the normal shutdown time is. Also, in the Server Document -> Basics Tab there is a section that includes Server Shutdown Timeout. Try bumping this setting up.
Q: Does Domino 8.5.2 and later run on VMware?
A: Yes, see IBM Lotus Domino support for virtualization platforms (Technote #1427414)
Q: What is the basic difference between a PANIC and a Fatal type crash?
A: PANIC is when the code detects that something is wrong. Fatal is when an access violation/exception is thrown and the code was not written to handle it.
Q: Is an add-in program needed to call an NSD?
A: No. When a problem occurs, Domino either panics or there is an access violation and an NSD should get generated. However, you may want to have a Program Document execute NSD on a scheduled basis if you are trying to collect data over time, as in the case where you are trying to track memory usage over time.
Q: Does the Domino Diagnostic Probe only work with Domino 8.5.2 or higher?
A: There is a version that works on Domino 8.5.1. See Monitoring slow or unresponsive servers with the Domino Diagnostic Probe (Technote #1429892)
Q: How do I get started analyzing NSD files?
A: To start, up at the top of the NSD where it shows you the command line, it will show you the process that may have caused the crash. We try to put the crashing process and thread at the top of the entry. Sometimes in the call stack that we dump in there, you may see a database name. If it is consistently the same database, it may give you an indication that you should run FIXUP on that database. Many times you would want to collect this information and call Support.
Download the Lotus Notes Diagnostic Utility. You can run that on your local Notes client. It will automatically process that NSD file for you and give you a nice way of viewing that information. It will also search for technotes to see if there is already related information published about this crash.
Q: We experience more server hangs than crashes. This is happening at odd times and we don't notice it until we get into the office. What things are being worked on to correct this or what troubleshooting steps would you recommend?
A: Run a full NSD, wait 2-5 minutes, and then run another NSD with the -nomemcheck switch. You can compare the NSDs to see if the call stacks differ. This shows whether the server is moving along or is stuck in the same part of the code, which tells you whether you have a true hang or a slowdown. Also check your SEMDEBUG.TXT, and make sure you have DEBUG_CAPTURE_TIMEOUT=1 and DEBUG_SHOW_TIMEOUT=1 set in your notes.ini file. This will show you a lot of information about what is possibly causing the server to appear to hang. If you are using Transaction Logging, look for Long Held Lock messages in your console log; these show different processes waiting on a specific database for a certain amount of time.
In 8.5.3 these settings are enabled by default to make it easier to debug the issue. The fact that there is data logged to SEMDEBUG.TXT and there are Long Held Lock entries in the console log does not necessarily indicate a slowdown or hang unless they continue for a long period of time. There are legitimate reasons for getting these messages. These are bits of data which, along with NSD snapshots taken a few minutes apart, aid in understanding whether you are experiencing a true hang or a slowdown, or otherwise what Domino is doing at the time.
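The comparison of two NSD snapshots can be sketched as follows. This is a minimal illustration that assumes the call stacks have already been extracted from each NSD into a mapping of thread identifier to frame names; the extraction step is not shown, and the thread identifiers and frame names in the example are hypothetical.

```python
def unchanged_threads(first, second):
    """Return thread IDs whose call stacks are identical in both snapshots.

    Each snapshot maps a thread identifier to a list of frame names,
    top of stack first. A thread missing from either snapshot is skipped.
    """
    return sorted(
        thread
        for thread, frames in first.items()
        if thread in second and second[thread] == frames
    )


if __name__ == "__main__":
    # Hypothetical thread IDs and frame names, for illustration only.
    snap1 = {
        "server:0001": ["WaitForLock", "OpenDatabase"],
        "server:0002": ["ReadNote", "ProcessRequest"],
    }
    snap2 = {
        "server:0001": ["WaitForLock", "OpenDatabase"],
        "server:0002": ["SendReply", "ProcessRequest"],
    }
    print(unchanged_threads(snap1, snap2))
```

In this example only server:0001 is reported, since its stack is the same in both snapshots. A thread that appears here across snapshots taken minutes apart is a candidate for a true hang, while threads whose stacks changed are still making progress.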
Q: Maybe I'm running it wrong and I've tried running the NSD while the server hangs, when I run it the first time it just crashes the server.
A: When you run NSD, it attaches to the Domino server processes and suspends them. Users won't be able to connect to the server while NSD has these processes suspended. If you close NSD or NSD terminates mid-processing, then any processes it is attached to will be terminated.
In 8.5.2, there is a tool called the Domino Diagnostic Probe that you can configure on your server. It makes a loopback call to see if it can connect to the server. When you run into a hang, your users cannot get in but the server has not really crashed, so the tool will take the NSDs for you. We are looking into future functionality that would allow you to configure it to shut down the server or run fault recovery after the server has been unresponsive for a specified amount of time.
Q: Is ISA Lite supported on IBM i?
A: We are working on ISA Lite to make it work on IBM i, but it is currently not available.
Q: Is it possible to find servers being crashed because of log file corruption? The server is enabled with archive logging.
A: For starters, you should save a copy of the transaction log directory (all of the .TXNs & the nlogctrl.lfh file) and send in the NSD, console and logasio_*.log files. That may be enough to determine the issue. If not, then a copy of the transaction logs (that you saved previously) would be needed for further problem determination.
Q: We have Windows 2008 R2 64-bit and a Domino 32-bit server with three partitions. Because Domino is only 32-bit, will it utilize the full partition memory?
A: Each partition will use up to its 4 GB capacity. We would encourage you to move toward 64-bit Domino, which takes somewhat better advantage of the memory.
Q: In analyzing the NSD we see Amgr is crashing because of one database. We have identified the database, but we cannot figure out which agent or document is causing the crash. Can we do further analyzing of the NSD to see exactly which agent or document is causing the issue?
A: When you have an Amgr crash and you follow the crash stack down to the process ID through the NSD, it can bring you to the database. Often you can see the agent (class 0200) along with the NoteID for the agent. You can also look at class 001, which will give you a document. If you convert that information from decimal to hex you may be able to isolate that document in the database. You can also enable debug_amgr=* while you are troubleshooting this, to match up what is in the console with what is in the NSD and see what agent was running.
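The decimal to hex conversion mentioned above is a one-liner in Python. The helper names and the sample value 2302 are made up for illustration; substitute the NoteID you read from the NSD.

```python
def note_id_to_hex(decimal_note_id):
    """Convert a decimal NoteID to upper-case hex, without a 0x prefix."""
    return format(int(decimal_note_id), "X")


def note_id_to_decimal(hex_note_id):
    """Convert a hex NoteID (with or without a 0x prefix) back to decimal."""
    return int(hex_note_id, 16)


if __name__ == "__main__":
    print(note_id_to_hex(2302))       # prints 8FE
    print(note_id_to_decimal("8FE"))  # prints 2302
```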
More helpful information:
What is the Automatic Diagnostic Data Collection tool? (Technote #1085850)
Best practices for Implementing Lotus Domino in a Storage Area Network (SAN) Environment (Technote #7002613)
Notes/Domino 8.5.x Upgrade Cookbook
Notes/Domino Fixlist Database for upcoming release information
See this forum entry for questions and answers posted before or during the call.