Philip Flohr | 9 min

Monitoring Complex Docker-Based Application Setups

Docker—especially the Compose plugin—simplifies shipping complex application setups. However, many legacy tools lack health checks and useful metrics, and cAdvisor focuses on performance rather than the core question: “Is my Docker-based setup actually healthy?” This post outlines an approach to answer that question.

The Docker Compose plugin (docker compose) has become the de facto standard for deploying multi-container applications outside Kubernetes. Whether you’re spinning up a small web service or orchestrating a complex software stack with web frontends, databases, cron jobs, and mail components, Docker Compose makes deployment almost trivial. A single YAML file, one command, and your entire environment is online.

Monitoring such environments is surprisingly challenging, even though deployment with docker compose is simple. This is especially true if you want to monitor many different Compose-based setups in a single Prometheus instance using a single set of alerting rules.

In this post, we explore why monitoring Docker Compose–based setups with Prometheus is harder than expected, where common tools fall short, and how a lightweight exporter can fill the gap.

Complex Setups With Compose: Easy to Ship

docker compose abstracts away the orchestration complexity typically associated with multi-service systems. A realistic production stack may include:

  • A web application
  • A reverse proxy
  • A database server
  • A database backup service
  • A Postfix SMTP relay
  • A cron container for scheduled tasks

And yet, deploying all of these components often boils down to a single file such as:

services:
  webapp:
    build: ./app
    depends_on:
      - db
    ports:
      - "8080:8080"

  nginx:
    image: nginx:stable
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - webapp
    ports:
      - "80:80"

  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example
    volumes:
      - dbdata:/var/lib/postgresql/data

  dbbackup:
    image: postgres:16
    entrypoint: ["sh", "-c", "pg_dump ..."]
    depends_on:
      - db

  postfix:
    image: postfix:latest

  cron:
    build: ./cron
    depends_on:
      - webapp
      - db

volumes:
  dbdata:

Compose hides much of the underlying orchestration complexity. That’s the good part.

Controlling Compose With Systemd

A common drawback is that docker compose does not act as a process supervisor. It does not monitor running containers, detect failures, restart dependent services based on runtime health, or integrate deeply with the OS. As a result, production setups often wrap docker compose calls inside Systemd units for basic supervision.

A common pattern is:

[Unit]
Description=My Compose Stack
Requires=docker.service
After=docker.service

[Service]
Type=simple
WorkingDirectory=/opt/mystack
ExecStart=/usr/bin/docker compose up
ExecReload=/usr/bin/docker compose up -d
ExecStop=/usr/bin/docker compose down

[Install]
WantedBy=multi-user.target

This improves startup behavior, logging, and restart semantics, but interactions with Docker’s error handling can lead to ambiguous states. Another downside is that Systemd is a poor fit for monitoring Docker Compose applications: it can only detect failures at the unit level, so failures of individual containers go unnoticed by default.

Error Handling and Error Propagation

The setup described above offers different, mutually exclusive ways of handling and propagating errors.

Handling of Container Failures in Docker Compose

Services in the compose.yml file can have a restart policy assigned. Depending on the restart policy, containers can either:

  • Stay in their exited or failed state (no)
  • Be restarted on failure, optionally giving up and staying failed after a configured number of attempts (on-failure[:max-retries])
  • Be restarted regardless of their exit code (always)
  • Be restarted unless explicitly stopped (unless-stopped). This also restarts containers on system reboots, and therefore it’s a bad fit for Systemd-wrapped Compose stacks.

All these possibilities have in common that container exits and errors are not propagated to the Systemd unit. The common approach of monitoring these services using Prometheus node_exporter (or an equivalent) to watch the state of the unit is therefore not enough.
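As a sketch, restart policies are set per service in the compose.yml; the service names and policies below are illustrative:

```yaml
services:
  webapp:
    image: my_image:latest
    restart: on-failure:3   # retry on non-zero exit, give up after three attempts
  dbbackup:
    image: postgres:16
    restart: "no"           # one-shot job; quote "no" so YAML does not parse it as boolean false
```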

Handling of Container Failures in Systemd

The docker compose plugin provides two possibilities to propagate container failures to Systemd:

  • --abort-on-container-exit stops all containers if any container stops.
  • --abort-on-container-failure stops all containers if any container exits with failure.

While these options can be used to detect container failures, they do not provide a comprehensive view of the entire stack. On top of that, they tear down the entire stack if a single container fails, which is rarely desirable in production. They also clash with containers that are expected to exit, such as one-shot jobs modeled after Kubernetes init containers.
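For containers that are expected to exit, Compose can model the init-container pattern natively: a dependent service can wait for a one-shot service to finish successfully via the service_completed_successfully condition. A sketch with illustrative service names:

```yaml
services:
  db-migrate:
    image: my_image:latest
    command: ["./migrate.sh"]   # one-shot task, exits when done
    restart: "no"

  webapp:
    image: my_image:latest
    depends_on:
      db-migrate:
        condition: service_completed_successfully   # start only after a clean exit
```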

Complex Setups With Compose: Hard to Monitor

We assume that the above setup is running in production and that it is monitored by Prometheus. For increased overall availability, container failures are not propagated to Systemd; instead, Prometheus is used to collect information about the health of the entire stack and alert on failures.

Our goal: Monitor many different Compose-based setups using a unified set of alerting rules and minimal changes to the existing setups.

cAdvisor Is Powerful but Not Sufficient

Google’s cAdvisor delivers excellent performance metrics (CPU, memory, I/O, filesystem), but it is not suitable for operational state monitoring. What’s missing?

  • Container running/stopped state
  • Exit codes
  • Restart counters
  • Health state transitions

While it’s possible to derive these metrics from the cAdvisor metrics, it is not straightforward and requires a lot of in-depth knowledge about your application stack. This contradicts our goal of monitoring many different Compose-based setups with simple, and therefore less error-prone, alerting rules.

Source of Required Information: The Docker Socket

All relevant lifecycle information is exposed by the Docker API via the Docker socket (/var/run/docker.sock).

Only there can you reliably retrieve:

  • Container status
  • Exit codes
  • Restart counts
  • Health states
  • Runtime transitions

Thus, to properly monitor Docker containers, a Prometheus exporter must talk to Docker directly.
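To illustrate, the inspect endpoint (GET /containers/&lt;id&gt;/json) returns a State object containing everything listed above. A minimal Python sketch that extracts the relevant fields from such a payload; in a real exporter the JSON would be fetched over /var/run/docker.sock (for example via the docker-py library), while the trimmed sample payload here is hand-written:

```python
import json

# Trimmed, hand-written sample of what `GET /containers/<id>/json` returns.
SAMPLE_INSPECT = json.loads("""
{
  "Name": "/dbbackup",
  "RestartCount": 2,
  "State": {
    "Status": "exited",
    "Running": false,
    "ExitCode": 1,
    "Health": {"Status": "unhealthy"}
  }
}
""")

def lifecycle_fields(inspect: dict) -> dict:
    """Pull out the lifecycle data a status exporter needs."""
    state = inspect.get("State", {})
    health = state.get("Health") or {}  # absent if no HEALTHCHECK is defined
    return {
        "name": inspect.get("Name", "").lstrip("/"),
        "status": state.get("Status"),            # created/running/exited/...
        "exit_code": state.get("ExitCode"),
        "restart_count": inspect.get("RestartCount", 0),
        "health": health.get("Status"),           # starting/healthy/unhealthy or None
    }

fields = lifecycle_fields(SAMPLE_INSPECT)
print(fields)
```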

A Simple Approach: A Docker Container Status Exporter

To close the monitoring gap, we provide a docker-container-status-exporter built specifically for container lifecycle monitoring. It exposes what cAdvisor doesn’t: the operational health of your Compose environment.

What to Monitor?

Base idea: Monitor all available containers.

Docker distinguishes between images and containers. Images are immutable blueprints for containers. They do not run, have no state, and are almost certainly not interesting to monitor. Containers, on the other hand, are instances of images. They run, have state, and—if they work as expected—provide some service. Every time you run a container using docker run or docker compose up, Docker either creates a new container instance from the specified image or starts an existing one.

In most cases, the following statement is true, and therefore it is the base assumption for our exporter: If a container exists, it should be running.

Exported Metrics

The exporter converts Docker lifecycle information strings into numeric Prometheus metrics. For every existing container, four metrics are exported:

docker_container_state, docker_container_exitcode, docker_container_health_status, and docker_container_restart_count.

Some of those metrics can only contain useful information if preconditions are met. For example, docker_container_exitcode is only meaningful if the container is in the exited state. All metrics that depend on an unmet precondition are therefore set to -1.

This allows precise alerting on:

  • Containers not running
  • Expected and unexpected exits
  • Restart loops
  • Health status changes
  • Containers stuck in transitional states

These metrics finally bring operational visibility into Compose-based systems.

Usage

The simplest way to use our exporter is to deploy it alongside your existing Compose stack.

services:
  docker-container-exporter:
    image: awesomeit/docker-container-status-exporter:latest
    restart: "no"
    environment:
      - EXCLUDE_LABEL_KEY=monitoring.disabled
      - EXCLUDE_LABEL_VALUE=true
      - EXPORTER_PORT=8000
    ports:
      - "127.0.0.1:8000:8000"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
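
On the Prometheus side, a matching scrape configuration might look like this; the job name is illustrative, and the target should point to wherever port 8000 is reachable from Prometheus:

```yaml
scrape_configs:
  - job_name: "docker-container-status"
    scrape_interval: 30s
    static_configs:
      - targets: ["127.0.0.1:8000"]
```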

Disable Monitoring of a Container

Our base assumption is not always valid. For this reason, a mechanism is provided to exclude containers from monitoring. Add the label defined above to your container definition, and the exporter will ignore it.

services:
  my_service:
    image: my_image:latest
    labels:
      monitoring.disabled: "true"

What About Security?

Allowing an external application to talk to your Docker socket is a security risk. It is effectively equivalent to granting root access to your entire system. For that reason, running the exporter in combination with additional security measures is recommended. A Compose example is provided for use with docker-socket-proxy:

services:
  docker-socket-proxy:
    image: zoeyvid/docker-socket-proxy:latest
    restart: "no"
    environment:
      # Minimal required permissions for listing/inspecting containers
      - CONTAINERS=1
      - INFO=1
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - monitor

  docker-container-exporter:
    image: awesomeit/docker-container-status-exporter:latest
    restart: "no"
    environment:
      - DOCKER_HOST=tcp://docker-socket-proxy:2375
      - EXCLUDE_LABEL_KEY=monitoring.disabled
      - EXCLUDE_LABEL_VALUE=true
      - EXPORTER_PORT=8000
    ports:
      - "127.0.0.1:8000:8000"
    networks:
      - monitor

networks:
  monitor:
    driver: bridge

Prometheus Alerting Rules

These alerts cover the common failure states. Depending on your use case, you might want to remove the ContainerExited alert.

groups:
- name: docker-container-state
  rules:
  - alert: ContainerRestarting
    annotations:
      description: Container {{ $labels.container_name }} on {{ $labels.instance }} is restarting.
      summary: The container {{ $labels.container_name }} on {{ $labels.instance }} is restarting.
    expr: docker_container_state == 1
  - alert: ContainerPaused
    annotations:
      description: Container {{ $labels.container_name }} on {{ $labels.instance }} is paused.
      summary: The container {{ $labels.container_name }} on {{ $labels.instance }} is paused.
    expr: docker_container_state == 4
  - alert: ContainerExited
    annotations:
      description: Container {{ $labels.container_name }} on {{ $labels.instance }} has exited. If this is intentional, remove the container with docker rm.
      summary: The container {{ $labels.container_name }} on {{ $labels.instance }} has exited.
    expr: docker_container_state == 5
  - alert: ContainerDead
    annotations:
      description: Container {{ $labels.container_name }} on {{ $labels.instance }} is dead. Try to remove the container with docker rm.
      summary: The container {{ $labels.container_name }} on {{ $labels.instance }} is dead.
    expr: docker_container_state == 6
- name: docker-container-exitcode
  rules:
  - alert: ContainerExited
    annotations:
      description: Container {{ $labels.container_name }} on {{ $labels.instance }} exited. Exit status was 0, but this is still likely an issue.
      summary: The container {{ $labels.container_name }} on {{ $labels.instance }} exited.
    expr: docker_container_exitcode == 0
  - alert: ContainerExitedWithFailure
    annotations:
      description: Container {{ $labels.container_name }} on {{ $labels.instance }} exited with exit code {{ $value }}.
      summary: The container {{ $labels.container_name }} on {{ $labels.instance }} exited with exit code {{ $value }}.
    expr: docker_container_exitcode > 0
- name: docker-container-health
  rules:
  - alert: ContainerUnhealthy
    annotations:
      description: Container {{ $labels.container_name }} on {{ $labels.instance }} is unhealthy.
      summary: The container {{ $labels.container_name }} on {{ $labels.instance }} is unhealthy.
    expr: docker_container_health_status == 2
- name: docker-container-restart-rate
  rules:
  - alert: HighContainerRestartRate
    annotations:
      description: Container {{ $labels.container_name }} on {{ $labels.instance }} restarted {{ $value }} times during the last 10 minutes.
      summary: The container {{ $labels.container_name }} on {{ $labels.instance }} restarted {{ $value }} times during the last 10 minutes.
    expr: increase(docker_container_restart_count[10m]) > 0

Conclusion

docker compose greatly simplifies deployments, but its monitoring model lacks essential operational insight. cAdvisor alone is not enough for production-grade visibility because it does not expose container lifecycle information.

Our docker-container-status-exporter fills this gap by reading directly from the Docker socket and exposing metrics that reflect the operational state of your Compose environment.


[Architecture diagram: Systemd manages the Compose stack; the Docker Engine runs the containers and exposes their lifecycle via the Docker socket; the Docker Container Status Exporter turns this into Prometheus metrics (running state, exit code, restart count, health status); Prometheus scrapes the exporter; Alertmanager sends alerts to Ops via notification channels such as Email, Slack, PagerDuty, and webhooks.]

Systemd manages the Compose stack, while the Docker Engine provides the container lifecycle data. The docker-container-status-exporter reads directly from the Docker socket and exposes metrics that reflect the operational state of your Compose environment (running state, exit code, restart count, health status). Prometheus scrapes these metrics, and combined with the alerting rules shown above, many different Compose-based setups can be monitored in a single Prometheus instance, with alerts routed to Email, Slack, PagerDuty, webhooks, and similar channels.

We've published the code and examples under the BSD open-source license.

Find the source code and more information on GitHub.