Troubleshooting
Problem
From time to time, the pod monitoring-prometheus-xxxxxxxxxx shows CrashLoopBackOff.
Deleting the pod restarts it successfully, but some time later it returns to the CrashLoopBackOff state.
Cause
This is a common scenario when the amount of memory assigned to the container is not enough to cope with the memory demands of the prometheus process.
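To compare the pod's actual memory consumption with its limit, you can use kubectl top (a sketch, assuming resource metrics are available in the cluster):
# pod name is a placeholder; on IBM Cloud Private the monitoring pods typically run in kube-system
kubectl top pod monitoring-prometheus-xxxxxxxxxx -n kube-system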
In this case, a kubectl describe for the pod will show:
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
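For reference, the describe command takes this form:
# substitute the real pod name and namespace
kubectl describe pod monitoring-prometheus-xxxxxxxxxx -n kube-system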
Diagnosing The Problem
This means that the previous instance of the Prometheus container was killed by the kernel OOM killer.
The system log also shows that out-of-memory conditions occur periodically for the prometheus pod.
The system log shows many of these:
[1022228.045284] Memory cgroup out of memory: Kill process 35859 (prometheus) score 1997 or sacrifice child
[1022228.045444] Killed process 35743 (prometheus) total-vm:2152980kB, anon-rss:2095672kB, file-rss:10792kB, shmem-rss:0kB
This error keeps occurring only for prometheus, so it is specific to the prometheus execution context.
The OOM is occurring at the memory cgroup level, so the problem is not a physical memory shortage on the node.
It simply means that the memory limit specified for the Prometheus pod is too small.
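To confirm these kernel messages on the node where the pod runs, you can check the kernel ring buffer (a minimal sketch; the exact message text can vary by kernel version):
# run on the node hosting the prometheus pod
dmesg | grep -i "out of memory"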
Resolving The Problem
To avoid this, you must increase the memory limit for Prometheus.
This can be done by editing the deployment called "monitoring-prometheus":
kubectl edit deployment monitoring-prometheus
Find the limits under "resources:" and change the memory limit from 2Gi to 4Gi.
The section looks like:
resources:
limits:
cpu: 500m
memory: 2Gi
requests:
cpu: 100m
memory: 128Mi
Change it to:
resources:
limits:
cpu: 500m
memory: 4Gi
requests:
cpu: 100m
memory: 128Mi
Save and close the editor.
The pod monitoring-prometheus-xxxxxx will be automatically restarted.
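Alternatively, the same change can be made non-interactively with kubectl set resources (a sketch; the container name "prometheus" is an assumption, verify it in the deployment spec first):
# adjust -c to the actual container name from the deployment spec
kubectl set resources deployment monitoring-prometheus -c prometheus --limits=memory=4Gi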
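You can watch the restart complete with:
kubectl rollout status deployment monitoring-prometheus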
Verify with kubectl describe pod monitoring-prometheus-xxxxxx (substituting the real pod name) that it now shows 4Gi as the memory limit.
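For example, to show just the limits section of the describe output:
# substitute the real pod name
kubectl describe pod monitoring-prometheus-xxxxxx | grep -A 2 Limits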
This action prevents further occurrences of the error.
Document Location
Worldwide
[{"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Product":{"code":"SSBS6K","label":"IBM Cloud Private"},"Component":"","Platform":[{"code":"PF016","label":"Linux"}],"Version":"All Versions","Edition":"","Line of Business":{"code":"LOB45","label":"Automation"}}]
Product Synonym
IBM Cloud Private; ICP
Document Information
Modified date:
23 April 2019
UID
ibm10882172