
Prometheus pod monitoring-prometheus-xxxxxxxxxx repeatedly fails with CrashLoopBackOff status.

Troubleshooting


Problem

From time to time, the pod monitoring-prometheus-xxxxxxxxxx shows CrashLoopBackOff status.
Deleting the pod restarts it successfully, but some time later it returns to the CrashLoopBackOff state.
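The failing pod can be spotted with a standard pod listing. This is a minimal sketch that assumes the monitoring stack runs in the kube-system namespace; adjust the namespace to match your installation:
kubectl -n kube-system get pods | grep monitoring-prometheus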

Cause

This is a common scenario: the amount of memory assigned to the container is not enough for the Prometheus process to do its work.
In this case, kubectl describe for the pod shows:
   State:         Waiting
     Reason:      CrashLoopBackOff
   Last State:    Terminated
     Reason:      OOMKilled
     Exit Code:   137
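The same information can also be read directly with a JSONPath query. This is a sketch that assumes the pod runs in the kube-system namespace; if the container was killed for memory, it prints OOMKilled:
kubectl -n kube-system get pod monitoring-prometheus-xxxxxxxxxx -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'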

Diagnosing The Problem

This means that the previous instance of the Prometheus container was killed by the kernel OOM killer.
The system log on the worker node also shows that out-of-memory conditions occur periodically for the Prometheus pod.
The system log contains many entries like these:
[1022228.045284] Memory cgroup out of memory: Kill process 35859 (prometheus) score 1997 or sacrifice child
[1022228.045444] Killed process 35743 (prometheus) total-vm:2152980kB, anon-rss:2095672kB, file-rss:10792kB, shmem-rss:0kB
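These messages come from the kernel log of the node where the pod was scheduled. As a sketch, they can be retrieved by logging in to that node and running one of the following, depending on how logging is set up there:
dmesg -T | grep -i "memory cgroup out of memory"
journalctl -k | grep -i "out of memory"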

 
This error occurs only for Prometheus, so it is specific to the Prometheus execution context.
The OOM kill is triggered by the memory cgroup, so the problem is not a shortage of physical memory on the node.
It simply means that the memory limit specified for the Prometheus pod is too small.
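To help choose a suitable new limit, it can be useful to check how much memory the pod actually consumes. This is a sketch that assumes the metrics API (metrics-server or Heapster) is available and that the pod runs in the kube-system namespace:
kubectl -n kube-system top pod monitoring-prometheus-xxxxxxxxxx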

Resolving The Problem

To prevent this, increase the memory limit for Prometheus.
This can be done by editing the deployment called "monitoring-prometheus":
kubectl edit deployment monitoring-prometheus
Find the limits under "resources:" and change the memory limit from 2Gi to 4Gi.
The section looks like:
        resources:
          limits:
            cpu: 500m
            memory: 2Gi
          requests:
            cpu: 100m
            memory: 128Mi
Change it to:
        resources:
          limits:
            cpu: 500m
            memory: 4Gi
          requests:
            cpu: 100m
            memory: 128Mi

            
Save and close the editor.
The pod monitoring-prometheus-xxxxxx is restarted automatically.
Verify with kubectl describe pod monitoring-prometheus-xxxxxx (use the real pod name) that it now shows 4Gi as the memory limit.
This prevents further occurrences of the error.
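As an alternative to the interactive edit, the same change can be applied non-interactively with kubectl patch. This is a sketch that assumes the Prometheus container is the first container in the deployment's pod template; check the container position or name before using it:
kubectl patch deployment monitoring-prometheus --type=json -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"4Gi"}]'
The new limit can then be confirmed at the deployment level with:
kubectl get deployment monitoring-prometheus -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}'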

Document Location

Worldwide

[{"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Product":{"code":"SSBS6K","label":"IBM Cloud Private"},"Component":"","Platform":[{"code":"PF016","label":"Linux"}],"Version":"All Versions","Edition":"","Line of Business":{"code":"LOB45","label":"Automation"}}]

Product Synonym

IBM Cloud Private; ICP

Document Information

Modified date:
23 April 2019

UID

ibm10882172