IBM Support

IBM InfoSphere Information Server Enterprise Search related pods are in a failure state, or an upgrade fails, and an error indicates DiskPressure

Troubleshooting


Problem

Pods used by Information Server Enterprise Search 11.7.x.x are in an ImagePullBackOff or CrashLoopBackOff failure state.

These failures may occur while upgrading Information Server to 11.7.0.2, or they may occur in any Information Server Enterprise Search 11.7.x.x installation.

Symptom

One or more pods are not running. Often, the failed pods are in the ImagePullBackOff state. Error messages in the log indicate issues related to DiskPressure.
This Technote shows an example where the s4i-gremlin-console-configMap.yaml pod was not running.

Cause

A lack of disk space can trigger Kubernetes garbage collection to run, which in turn could trigger such failures.

Environment

Information Server Enterprise Search versions 11.7.0.0, 11.7.0.1, 11.7.0.2, 11.7.1.0.

Diagnosing The Problem

Note down the pods that are not running and examine their logs to determine the cause of their failure.
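One way to list the pods that are not running is to filter the output of kubectl get pods. The following is a minimal sketch, not an IBM-supplied tool. It assumes that the pods may be in any namespace and that the STATUS value is the fourth column, as it is when --all-namespaces and --no-headers are used. The helper name list_failed_pods is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get pods --all-namespaces --no-headers`
# output on stdin and prints "namespace/name status" for every pod whose
# STATUS column (field 4) is neither Running nor Completed.
list_failed_pods() {
    awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Usage on a node where kubectl is configured:
#   kubectl get pods --all-namespaces --no-headers | list_failed_pods
# Then inspect a failed pod (substitute the namespace and pod name):
#   kubectl describe pod -n <namespace> <pod-name>
#   kubectl logs -n <namespace> <pod-name>
```

A pod that reports Running but whose containers are not ready is not caught by this filter, so also review the READY column when in doubt.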

For example, check if pod s4i-gremlin-console-configMap.yaml is already running. If not, you may get an error as follows:
    > kubectl apply -f s4i-gremlin-console-configMap
    error: error when retrieving current configuration of:
    &{0xc4208218c0 0xc4201c23f0 default shop4info-gremlin-console-config /opt/IBM/UGinstall/selfextract.rsniZw.2018_09_27_12_40_18/upgradetools/manifests/s4i-gremlin-console-configMap.yaml 0xc42000d718 0xc42000d718 false}
    from server for: "/opt/IBM/UGinstall/selfextract.rsniZw.2018_09_27_12_40_18/upgradetools/manifests/s4i-gremlin-console-configMap.yaml": Get https://9.20.91.182:6443/api/v1/namespaces/default/configmaps/shop4info-gremlin-console-config: unexpected EOF
    2018-09-27 12:58:24.583 UTC -- ERROR: kubectl set image for s4i-gremlin-console-configMap.yaml failed

In this example, kubectl apply -f depends on the kube-apiserver pod, and the kubelet log on the node contained the error:
    Sep 27 14:57:36 CO9020091182 kubelet: W0927 14:57:36.204160   13593 eviction_manager.go:142] Failed to admit pod kube-proxy-djsx3_kube-system(e81c5638-c254-11e8-b444-005056941056) - node has conditions: [DiskPressure]

In such a case, check disk consumption of the root partition or, more importantly, of the file system where /var is mounted.

Resolving The Problem

Prior to installation, ensure that there is enough free disk space in the root partition, especially /var.
Refer to Information Server Enterprise Search node disk usage for details of disk space usage.

While the system is in use, Kubernetes monitors disk consumption. When file system usage reaches the maximum threshold (default 90%), garbage collection is initiated, which may delete pods, containers, images, and so on until usage drops to the minimum threshold (default 80%). If this cleanup evicts pods that you subsequently attempt to use, you will see error messages indicating that they could not be found.

The /var partition is where kubelet and all the Enterprise Search system and user data are stored. If your system has only about 100 GB allocated, you are likely to encounter such issues. We recommend a minimum of 200 GB for the /var partition, or more depending on your usage pattern.
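The size recommendation can be checked from the command line. The following is a minimal sketch that assumes the POSIX output format of df -Pk (one line per file system, sizes in 1024-byte blocks). The helper names fs_size_gb and meets_minimum are hypothetical, and the size that df reports can be slightly smaller than the size allocated to the partition.

```shell
# Hypothetical helper: reads `df -Pk <path>` output on stdin and prints the
# total size of the file system in whole GB (1 GB = 1024 * 1024 KB blocks).
fs_size_gb() {
    awk 'NR == 2 { printf "%d\n", $2 / (1024 * 1024) }'
}

# Hypothetical helper: succeeds when the size in GB ($1) meets the minimum
# in GB ($2, default 200 as recommended in this technote).
meets_minimum() {
    [ "$1" -ge "${2:-200}" ]
}

# Usage:
#   size=$(df -Pk /var | fs_size_gb)
#   meets_minimum "$size" || echo "/var is only ${size} GB; 200 GB or more is recommended"
```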

To check the disk space available on the / partition:
    > df -h
    Filesystem             Size  Used Avail Use% Mounted on
    /dev/mapper/rhel-root  240G   48G  193G  20% /

In this case, 193 GB (about 80%) of the allocated 240 GB is available.
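To compare the current usage with the garbage collection thresholds quoted earlier, the Use% column can be extracted and classified. The following is a minimal sketch under the assumption that the 90% and 80% defaults stated in this technote apply to your installation. The helper names fs_used_pct and classify_usage are hypothetical.

```shell
# Hypothetical helper: reads `df -P <path>` output on stdin and prints the
# Use% value of the file system as a bare integer (for example 20).
fs_used_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: classifies a usage percentage ($1) against the
# thresholds quoted in this technote: high ($2, default 90) and low ($3, default 80).
classify_usage() {
    pct=$1
    high=${2:-90}
    low=${3:-80}
    if [ "$pct" -ge "$high" ]; then
        echo "critical"   # at or above the maximum threshold
    elif [ "$pct" -ge "$low" ]; then
        echo "warning"    # between the minimum and maximum thresholds
    else
        echo "ok"         # below the minimum threshold
    fi
}

# Usage:
#   classify_usage "$(df -P /var | fs_used_pct)"
```

If your kubelet is configured with different thresholds, pass them as the second and third arguments.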

If the disk usage rises to 80%, the likelihood of triggering garbage collection increases. You can check whether this is happening by examining the kubelet logs as follows:
    > grep DiskPressure /var/log/messages

Messages of the form shown below are an indication of disk pressure:
    Sep 27 14:57:36 CO9020091182 kubelet: W0927 14:57:36.204160   13593 eviction_manager.go:142] Failed to admit pod kube-proxy-djsx3_kube-system(e81c5638-c254-11e8-b444-005056941056) - node has conditions: [DiskPressure]

Messages indicating occurrences of garbage collection can be identified with the string "imageGCManager".

If you see an occurrence of disk pressure, urgent cleanup of the root partition, or addition of disk space, is needed.
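Both checks can be combined into a quick summary of how often each message appears. The following is a minimal sketch that assumes the kubelet messages are written to /var/log/messages, as in the example above. The helper name count_pressure_events is hypothetical.

```shell
# Hypothetical helper: prints how many lines of a log file ($1, default
# /var/log/messages) mention DiskPressure and how many mention imageGCManager.
count_pressure_events() {
    log=${1:-/var/log/messages}
    echo "DiskPressure: $(grep -c 'DiskPressure' "$log")"
    echo "imageGCManager: $(grep -c 'imageGCManager' "$log")"
}

# Usage:
#   count_pressure_events
#   count_pressure_events /path/to/another/logfile
```

A nonzero DiskPressure count means that the node reported disk pressure at least once in the period covered by the log.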


Document Information

Modified date:
01 April 2019

UID

ibm10733777