When reporting a problem to us, you can help to ensure the quickest possible solution by including enough information for us to get started immediately, instead of having to ask certain basic questions that we just always have to ask. Including the correct information also helps us preserve our somewhat tenuous grip on sanity...
The executive summary is simply this: we need a detailed description of your environment, and we need to know exactly what problem you're seeing. Special emphasis should be put on "detailed" and "exactly". Vagueness is bad; precision is good.
In general, the information we need in a problem report is:
- Description of the problem itself, including exact error messages and timing. For example:
Attempted to do "fts release fooBar -wait"; immediately received error messages saying "RPC error: phase of moon not correct (dce / rpc)". Problem occurred about 7:15 PM EDT on Monday 7/10.
Or, for problems with DCE applications, we need enough context to enable us to understand what underlying DCE function is failing. For example:
Our PUserver calls the function sec_rgy_ensure_world_peace(), which returns "RPC error: this is worse than the BeeGees (dce / rpc)". This occurs every time we call the function.
- Description of machines involved. For example:
Read-write copy of fooBar is on machine xy123, which is located in Boston and has IP address 188.8.131.52 and is running AIX 4.3.2 and DCE 2.2 PTF set 6.
Read-only copies are on aa101 (in Chicago, 184.108.40.206, AIX 4.3.3, DCE 2.2 PTF 8) and on bb292 (in Detroit, 220.127.116.11, AIX 4.3.3, DCE 2.2 with no PTF).
Command was issued from machine uu333, a DFS client in Raleigh, IP address 18.104.22.168, running AIX 4.3.2 and DCE 2.2 PTF set 8.
- What troubleshooting / diagnostic / corrective steps were done, and what were the results, like this:
We tried the "fts release" command on a few other clients, with the same results. We also tried on the RW fileserver, still the same error message. Next we stopped and restarted DFS on the read-only machines, aa101 and bb292. This caused the error to change to "RPC error: RW has a headache (dce / rpc)". Finally, at about 8 PM, we rebooted the read-write fileserver (xy123), and the problem disappeared.
We have copies of RepLog from all three fileservers from 7:45, after trying the fts command on the RW fileserver and before stopping and restarting DFS.
Or, for the application problem with sec_rgy_ensure_world_peace():
We call sec_rgy_ensure_world_peace() on the CDS entry we've created, /.:/milky-way/earth. A 'dcecp -c object show /.:/milky-way/earth' fails with the same error message, although a 'dcecp -c dir list /.:/milky-way' works.
- Any additional information about the context of the problem, like:
The problem occurred two hours after we had upgraded all machines to PTF 8.
Sometimes, new problems occur as the first problem is being investigated. For example, during one "fts release" problem, a customer attempted to stop and restart DFS on a machine, but the repserver failed to restart -- the "bos" command reported authorization failures. In such cases, we'll need the same level of detail on the new sub-problem. For example, above we have:
Next we stopped and restarted DFS on the read-only machines, aa101 and bb292.
During a time of very bad luck, we might get this instead:
Next, we attempted to stop and restart DFS on the read-only machines, aa101 and bb292. aa101 restarted fine, but the attempt to restart DFS on bb292 failed, saying "bos: cannot start repserver: not authorized, I want a doughnut (dce / sec)". We had to reboot the machine in order to restart the repserver on bb292. This occurred about 7:30 PM; we have a BosLog from bb292, copied just after the failure to restart.
Sample problem-reporting template
Here is an example problem-reporting template that could be used to encapsulate all the basics:
- OPERATING SYSTEM (NAME and VERSION):
- DCE/DFS VERSION and PATCH or PTF LEVEL:
- DETAILED PROBLEM DESCRIPTION or SYMPTOMS:
- EXACT ERROR MESSAGE or CODE:
- IS the PROBLEM REPRODUCIBLE or REPEATABLE:
- PROBLEM MACHINE'S ROLE in the CELL:
- NETWORK RESTRICTIONS (i.e.: TCP only, restricted ports, etc....):
- DESCRIPTION of OTHER MACHINES in the CELL:
- USAGE of the CELL (development, production, test, etc):
- SEVERITY of the PROBLEM:
- AGE of CELL:
- RECENT CHANGES MADE in the CELL:
- DIAGNOSTIC or TROUBLESHOOTING STEPS ALREADY PERFORMED:
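If you file reports regularly, it may help to fill in the template from a script so that no field gets forgotten. The following is only a sketch: the field names are copied from the template above, but the helper names (render_report, missing_fields) are made up for illustration and are not part of any IBM tool.

```python
# Field names are taken from the template above; everything else is hypothetical.
TEMPLATE_FIELDS = [
    "OPERATING SYSTEM (NAME and VERSION)",
    "DCE/DFS VERSION and PATCH or PTF LEVEL",
    "DETAILED PROBLEM DESCRIPTION or SYMPTOMS",
    "EXACT ERROR MESSAGE or CODE",
    "IS the PROBLEM REPRODUCIBLE or REPEATABLE",
    "PROBLEM MACHINE'S ROLE in the CELL",
    "NETWORK RESTRICTIONS (i.e.: TCP only, restricted ports, etc....)",
    "DESCRIPTION of OTHER MACHINES in the CELL",
    "USAGE of the CELL (development, production, test, etc)",
    "SEVERITY of the PROBLEM",
    "AGE of CELL",
    "RECENT CHANGES MADE in the CELL",
    "DIAGNOSTIC or TROUBLESHOOTING STEPS ALREADY PERFORMED",
]


def missing_fields(answers):
    """Return the template fields that have no (or only blank) answer."""
    return [f for f in TEMPLATE_FIELDS if not answers.get(f, "").strip()]


def render_report(answers):
    """Render the template as text, one '- FIELD: value' line per field."""
    lines = []
    for field in TEMPLATE_FIELDS:
        value = answers.get(field, "").strip()
        lines.append(f"- {field}: {value if value else '(not provided)'}")
    return "\n".join(lines)
```

Calling missing_fields() before sending the report gives a quick list of the basics that still need to be filled in.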
In general, it's very important to send us the exact error messages, or exact copies of problematic command output. We need to know software versions and PTF levels (on AIX) or patch levels (on Solaris) so we look in the right source-code trees, and we need exact dates and times and machine names and IP addresses because all these things appear in the logs and trace files. And we need to know the severity of the problem, which leads to...
A Note about Problem Severity
When you report a problem, you'll be asked what the severity of the problem is. We rate severity from sev-1 (highest severity, meaning the most serious problems) to sev-4 (lowest severity, meaning the least important problems). It's important that you be realistic when reporting the severity of an issue, so we can prioritize it properly. General guidelines:
- Severity 1 (sev-1): Production system down, critical business impact, unable to use the product in a production environment, no workaround is available.
- Severity 2 (sev-2): Serious problem that has a significant business impact; use of the product is severely limited, but no production system is continuously down. Sev-2 problems include situations where customers are forced to restart processes frequently, and performance problems that cause significant degradation of service but do not render the product totally unusable. In general, a very serious problem for which there is an unattractive but functional workaround would be sev-2, not sev-1.
- Severity 3 (sev-3): Problems that cause some business impact but that can be reasonably circumvented; situations where there is a problem but the product is still usable. For example, short-lived problems or problems with components that have failed and then recovered and are back in normal operation at the time the problem is being reported. The default severity of new problem reports should be sev-3.
- Severity 4 (sev-4): Minor problems that have minimal business impact.
There is a more detailed description of what to expect when contacting IBM Support, including more on severity, at IBM's tech support web site.
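The guidelines above can be read as a rough decision rule. The sketch below is a deliberate simplification for illustration only: the function name, its parameters, and the four impact labels are assumptions, and real severity decisions involve judgment that a few lines of code cannot capture.

```python
def suggest_severity(production_down=False, workaround=False, impact="some"):
    """Suggest a severity (1-4) from a simplified reading of the guidelines.

    impact is one of "critical", "significant", "some", "minimal"
    (labels invented for this sketch).
    """
    if impact not in ("critical", "significant", "some", "minimal"):
        raise ValueError(f"unknown impact level: {impact!r}")
    # sev-1: production is down and there is no workaround.
    if production_down and not workaround:
        return 1
    # sev-2: serious impact, or a very serious problem with a workaround.
    if production_down or impact in ("critical", "significant"):
        return 2
    # sev-4: minor problems with minimal business impact.
    if impact == "minimal":
        return 4
    # sev-3: the default for new problem reports.
    return 3
```

For instance, a production outage with an unattractive but functional workaround comes out as sev-2 rather than sev-1, matching the guideline text.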
Gathering more detailed troubleshooting information
The basic information as listed above is needed for every problem report. Certain types of problems will require additional diagnostic data that depends on the nature of the problem.
First of all, we'll almost always ask for output from our show_conf script when you report a problem. The show_conf script gathers all sorts of information about a machine: OS version and configuration, DCE/DFS version and PTF info, logfiles, and so on. We usually want you to run show_conf on your DCE/DFS server machines and perhaps on a few selected client machines. For example, if you have a problem that occurs on some clients but not on others, we may ask you to run show_conf on one or two of the "bad" clients and also on one or two of the "good" clients.
The output of show_conf will be 20 to 40 pages of text for each machine. You can send the output to us via anonymous FTP to IBM's "testcase" site, at testcase.software.ibm.com. Just run FTP, then cd /aix/toibm and send your data in a file named like 12345.b678.data.tar.gz, for a compressed tarfile related to PMR 12345, B678. Be sure to use unique names (embed a date or a sequential number or something) if you send multiple files for the same PMR.
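As an illustration of the naming convention, here is a small sketch that builds an upload filename from the two parts of the PMR identifier and optionally embeds a date and a sequence number to keep names unique. The function name, the parameter names (including calling the "B678" part branch), and the exact position of the date and sequence number within the name are assumptions; only the basic 12345.b678.data.tar.gz pattern comes from the text above.

```python
import datetime
import re


def upload_name(pmr, branch, when=None, seq=None):
    """Build a name like 12345.b678.data.tar.gz, optionally made unique.

    when: a datetime.date to embed as YYYYMMDD (placement is an assumption).
    seq:  an integer sequence number to embed (placement is an assumption).
    """
    for part in (pmr, branch):
        if not re.fullmatch(r"[0-9A-Za-z]+", part):
            raise ValueError(f"expected an alphanumeric identifier, got {part!r}")
    parts = [pmr.lower(), branch.lower()]
    if when is not None:
        parts.append(when.strftime("%Y%m%d"))
    if seq is not None:
        parts.append(f"{seq:02d}")
    parts.append("data.tar.gz")
    return ".".join(parts)
```

With no date or sequence number, upload_name("12345", "B678") reproduces the example name from the text; passing either optional argument yields a distinct name for each additional file sent for the same PMR.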
- Core files. If a process related to DCE/DFS drops core, you will need to send us the corefile. But the core alone is not enough, since it will depend on shared libraries on your system that may differ from the libraries on our systems. The solution is to use one of our debug tools to package up the core and all the other binaries that it depends on. If the core occurred on Solaris, then you would use grab_core; if on AIX, you would use senddata.
Note that on AIX, the default limit on core file size may be relatively small, and that may cause the core to be truncated. A truncated core will probably be useless for debugging; if your AIX limits cause your core to be truncated, you will have to raise the limit and wait for another core before we can attempt to diagnose your problem. Also, note that senddata requires Perl on the system where it is run.
You may not want to send the entire core when first reporting a problem. The tools described in the next bullet (showProcInfo and dumpthreads) can be run against a core file on your system, and will yield a set of thread stacks from the core. When first reporting a core, you may want to just run one of these tools and send us the output; but please be sure to save the core in case it turns out that we need the whole thing.
- Process hangs or spins. If a process is hanging and appears to be unresponsive, or if it is spinning madly and consuming large amounts of CPU time, then we need to see what it's doing in order to figure out what's wrong. We have a couple of tools that allow us to see stack traces for each thread in the process; by running one of these tools a few times (waiting a minute or so between runs), we can see which threads are moving and which are stuck. You can use showProcInfo on either Solaris or AIX; or you can use dumpthreads on AIX.
Both of these tools require dbx on the system where they are run, and showProcInfo also requires Perl. If you don't have the prerequisite tools (dbx and/or Perl), then you could force the process to drop core, using gcore on Solaris (which does not kill the process), or kill -6 on Solaris or AIX (but this will kill the process). You would then have to package the core(s) using one of the tools mentioned in the previous bullet. Another alternative, if you want a set of stack traces from a Solaris process but can't run showProcInfo because you lack dbx and/or Perl on the system, is to use /usr/proc/bin/pstack.
- Tracing. DCE has a bunch of trace facilities. If your support rep wants DCE tracing for a particular problem, he or she will give you precise details regarding how to do it. You should probably not attempt to gather trace information on your own unless you are very experienced with DCE.
- Background monitoring. Sometimes a problem will occur sporadically, without warning, and it's hard to catch it "in the act". In cases like this, it may be helpful to run the watcher script in the background. This script wakes up once every 10 minutes and writes some general machine status information to a set of rolling logfiles; by default there are 5 files of size 1 MB, so we don't risk filling up your disk. The idea is that you can leave the script running, and then if something terrible happens, we can go back and look at the output from the time of the problem, to see if some particular situation (like a full disk partition or something like that) may have triggered your problem. The wakeup time (default 10 minutes) and number and size of logfiles can be modified easily if necessary in specific cases.
There's a second watcher-type script called watchNet that looks specifically for network problems; it is useful in situations where we suspect that some occasional network glitch may be causing problems with DCE.
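To make the rolling-logfile idea behind the watcher script concrete, here is a minimal sketch of the same scheme. It is not the watcher script itself: the function names are made up, and the only status it records is disk usage for the paths you give it (chosen because a full disk partition is the example trigger mentioned above). The defaults mirror the ones described: a wakeup every 10 minutes and 5 files of about 1 MB each.

```python
import logging
import logging.handlers
import os
import shutil
import time


def make_rolling_logger(path, max_bytes=1_000_000, files=5):
    """Log to `path`, keeping at most `files` files of about `max_bytes` each."""
    logger = logging.getLogger("watcher-sketch:" + os.path.abspath(path))
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        # backupCount counts the rotated copies, so the total is files.
        handler = logging.handlers.RotatingFileHandler(
            path, maxBytes=max_bytes, backupCount=files - 1)
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger


def snapshot(paths=("/",)):
    """Return a one-line status string with disk usage for each path."""
    parts = []
    for p in paths:
        usage = shutil.disk_usage(p)
        parts.append(f"{p}: used={usage.used} free={usage.free} bytes")
    return "; ".join(parts)


def watch(logger, interval_seconds=600, iterations=None, paths=("/",)):
    """Write a snapshot every interval; run forever if iterations is None."""
    count = 0
    while iterations is None or count < iterations:
        logger.info(snapshot(paths))
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval_seconds)
```

Because the oldest file is discarded once the limit is reached, total disk use stays bounded at roughly the number of files times the file size, no matter how long the loop runs.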