Troubleshooting
Problem
Pods used by Information Server Enterprise Search 11.7.x.x are in an ImagePullBackOff or CrashLoopBackOff failure state.
These failures may occur while upgrading Information Server to 11.7.0.2, or they may occur in any Information Server Enterprise Search 11.7.x.x installation.
Symptom
One or more pods are not running; often the failed pods are in the ImagePullBackOff state. Error messages in the logs indicate issues related to DiskPressure.
This Technote shows an example where the ConfigMap defined in s4i-gremlin-console-configMap.yaml could not be applied.
Cause
A lack of disk space can trigger Kubernetes garbage collection to run, which in turn could trigger such failures.
Environment
Information Server Enterprise Search versions 11.7.0.0, 11.7.0.1, 11.7.0.2, 11.7.1.0.
Diagnosing The Problem
Note down the pods that are not running and examine their logs to determine the cause of the failure.
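The step above can be sketched as a small shell filter over captured kubectl get pods output. The pod names below are hypothetical, and the output is written to a file only so the filter can be shown on its own; on a live cluster you would pipe kubectl directly.

```shell
# Sample output as captured from `kubectl get pods` (names are hypothetical).
cat <<'EOF' > /tmp/pods.txt
NAME                        READY   STATUS             RESTARTS   AGE
shop4info-zookeeper-0       1/1     Running            0          2d
shop4info-gremlin-server-0  0/1     ImagePullBackOff   0          2d
shop4info-solr-0            0/1     CrashLoopBackOff   12         2d
EOF

# List only the pods that are not Running; follow up on each with
# `kubectl describe pod <name>` and `kubectl logs <name>`.
awk 'NR > 1 && $3 != "Running" { print $1, $3 }' /tmp/pods.txt
```

On a live cluster the equivalent is `kubectl get pods --all-namespaces | awk '...'` with the same filter.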
For example, check whether s4i-gremlin-console-configMap.yaml can be applied successfully. If not, you may get an error as follows:
> kubectl apply -f s4i-gremlin-console-configMap
error: error when retrieving current configuration of:
&{0xc4208218c0 0xc4201c23f0 default shop4info-gremlin-console-config /opt/IBM/UGinstall/selfextract.rsniZw.2018_09_27_12_40_18/upgradetools/manifests/s4i-gremlin-console-configMap.yaml 0xc42000d718 0xc42000d718 false}
from server for: "/opt/IBM/UGinstall/selfextract.rsniZw.2018_09_27_12_40_18/upgradetools/manifests/s4i-gremlin-console-configMap.yaml": Get https://9.20.91.182:6443/api/v1/namespaces/default/configmaps/shop4info-gremlin-console-config: unexpected EOF
2018-09-27 12:58:24.583 UTC -- ERROR: kubectl set image for s4i-gremlin-console-configMap.yaml failed
In this example, kubectl apply -f sends its request to the kube-apiserver; the kubelet log on that node contained the error:
Sep 27 14:57:36 CO9020091182 kubelet: W0927 14:57:36.204160 13593 eviction_manager.go:142] Failed to admit pod kube-proxy-djsx3_kube-system(e81c5638-c254-11e8-b444-005056941056) - node has conditions: [DiskPressure]
In such a case, check the disk consumption of the root partition and, more importantly, of whichever file system /var is mounted on.
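A quick way to see which file system backs /var and how full it is (the exact output differs per system; `df --output` assumes GNU coreutils):

```shell
# Show the file system that holds /var and its usage; on many systems
# /var lives on the root partition, so this matches `df -h /`.
df -h /var

# Extract just the usage percentage, e.g. for scripting or alerting.
df --output=pcent /var | tail -1 | tr -d ' %'
```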
Resolving The Problem
Prior to installation, ensure that there is enough free disk space in the root partition, especially /var.
Refer to Information Server Enterprise Search node disk usage for details of disk space usage.
While the system is in use, Kubernetes monitors disk consumption. When the file system reaches the maximum threshold (90% by default), garbage collection is initiated, which may delete pods, containers, images, and so on until the minimum threshold (80% by default) is met. If this cleanup evicts pods that you subsequently attempt to use, you will see error messages indicating that they could not be found.
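The two thresholds above correspond to the kubelet's image garbage-collection settings. A minimal sketch of pinning them in a KubeletConfiguration file (the field names are the real kubelet.config.k8s.io/v1beta1 fields; the file path is illustrative, and you would point the kubelet at it with --config):

```shell
# Write a KubeletConfiguration fragment that sets the image GC
# thresholds to the defaults described above (path is illustrative).
cat <<'EOF' > /tmp/kubelet-gc-config.yaml
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
imageGCHighThresholdPercent: 90   # start GC once disk usage exceeds 90%
imageGCLowThresholdPercent: 80    # free space until usage drops to 80%
EOF
grep ThresholdPercent /tmp/kubelet-gc-config.yaml
```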
The /var partition is where the kubelet and all the Enterprise Search system/user data are stored. If your /var partition has only around 100 GB allocated, you are likely to encounter such issues. We recommend a minimum of 200 GB for the /var partition, or more depending on your usage pattern.
To check the disk space available on the / partition:
> df -h
[root@iisga1-node-2 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/rhel-root 240G 48G 193G 20% /
In this case, 80% (193 GB) of the allocated 240 GB is available.
If disk usage rises above 80%, the likelihood of triggering garbage collection increases. You can check whether this is happening by examining the kubelet logs as follows:
> cat /var/log/messages | grep DiskPressure
Messages of the form shown below indicate disk pressure:
Sep 27 14:57:36 CO9020091182 kubelet: W0927 14:57:36.204160 13593 eviction_manager.go:142] Failed to admit pod kube-proxy-djsx3_kube-system(e81c5638-c254-11e8-b444-005056941056) - node has conditions: [DiskPressure]
Messages indicating occurrences of Garbage collection can be identified with the string "imageGCManager".
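Both checks can be combined into one pass over the log. In the sketch below, the first sample line is the DiskPressure message quoted in this Technote; the imageGCManager line is representative, not verbatim, and the sample file stands in for /var/log/messages on a live node:

```shell
# Illustrative log excerpt: first line is from this Technote, second is
# a representative (not verbatim) imageGCManager message.
cat <<'EOF' > /tmp/messages.sample
Sep 27 14:57:36 CO9020091182 kubelet: W0927 14:57:36.204160 13593 eviction_manager.go:142] Failed to admit pod kube-proxy-djsx3_kube-system(e81c5638-c254-11e8-b444-005056941056) - node has conditions: [DiskPressure]
Sep 27 14:57:40 CO9020091182 kubelet: I0927 14:57:40.112233 13593 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is over the high threshold, trying to free space
EOF

# Count occurrences of each condition; on a live node replace the
# sample file with /var/log/messages.
grep -c DiskPressure   /tmp/messages.sample
grep -c imageGCManager /tmp/messages.sample
```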
If you see disk-pressure messages, you need to urgently free space on the root partition or add disk space.
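To decide where to reclaim space, a hedged starting point (assuming GNU du and sort) is to rank the top-level consumers under /var:

```shell
# Identify the largest consumers of space under /var. -x keeps du on
# one file system; errors from unreadable directories are suppressed.
du -x --max-depth=1 /var 2>/dev/null | sort -rn | head -10
```

Typical candidates for cleanup surfaced this way include old logs and unused container images; review each before deleting anything.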
Document Information
Modified date:
01 April 2019
UID
ibm10733777