Using Traefik as your load-balancer in Docker Swarm with Let's Encrypt

I have written about Traefik before (Docker on Azure, how to build your own Swarm cluster); this time I would like to explain how I built my Traefik cluster on Docker Swarm.

Infrastructure


The infrastructure is a Docker Swarm cluster with manager and worker nodes.

Traefik runs on the manager nodes. In front of them sits a load balancer, either from our favorite cloud provider or in-house. The load balancer's job is to route TCP traffic on ports 80, 443 and 8080 to Traefik. A TCP check on port 8080 is configured to ensure Traefik is up and running on each node.

Container without configuration file

I chose not to use a configuration file and not to build a custom image that embeds the configuration. Traefik can be configured entirely through command-line parameters.

Some options are not available as flags, such as protecting the Web UI with basic auth, but that is the only missing flag I have found.

Without a custom image, applying Traefik updates takes much less time: you just change the SHA of the tag in the docker-compose file and redeploy.

Example of how to run Traefik with this method:

version: "3.4"
services:
  traefik:
    image: traefik:1.4
    command:
      - "--web"
      - "--entrypoints=Name:http Address::80 Redirect.EntryPoint:https"
      - "--entrypoints=Name:https Address::443 TLS"
      - "--defaultentrypoints=http,https"
      - "--acme"
      - "--acme.storage=/etc/traefik/acme/acme.json"
      - "--acme.entryPoint=https"
      - "--acme.OnHostRule=true"
      - "--acme.onDemand=false"
      - "--acme.email=contact@jmaitrehenry.ca"
      - "--docker"
      - "--docker.swarmmode"
      - "--docker.domain=jmaitrehenry.ca"
      - "--docker.watch"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    networks:
      - webgateway
      - traefik
    ports:
      - target: 80
        published: 80
        mode: host
      - target: 443
        published: 443
        mode: host
      - target: 8080
        published: 8080
        mode: host
    deploy:
      mode: global
      placement:
        constraints:
          - node.role == manager
      update_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: on-failure
networks:
  webgateway:
    driver: overlay
    external: true
  traefik:
    driver: overlay
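
To illustrate what these flags mean for an application, here is a minimal sketch of a service that this Traefik setup should pick up. The whoami service name, the emilevauge/whoami image and the host name are assumptions for illustration and are not part of my stack. The labels are the Traefik 1.x Docker labels, placed under deploy because Traefik reads service labels in Swarm mode.

```yaml
version: "3.4"
services:
  whoami:                       # hypothetical demo service
    image: emilevauge/whoami    # assumed image, listens on port 80
    networks:
      - webgateway              # same external network Traefik is attached to
    deploy:
      labels:
        # Port of the container Traefik should forward to
        - "traefik.port=80"
        # Host rule; with --acme.OnHostRule=true a certificate is requested for it
        - "traefik.frontend.rule=Host:whoami.jmaitrehenry.ca"
        # Network Traefik should use to reach the container
        - "traefik.docker.network=webgateway"
networks:
  webgateway:
    external: true
```

Without the traefik.frontend.rule label, Traefik 1.x falls back to a host name built from the service name and the --docker.domain value.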

Why do I place Traefik on manager nodes and not on worker nodes?

Traefik needs to listen to Docker cluster events and pull information about Swarm services, both of which are only accessible on the manager nodes.

There are ways to expose these events to worker nodes, such as running a reverse proxy or socat on the manager nodes.

The other reason is that if you want to know the user's IP address, you need to publish Traefik's ports in host mode. In that mode, only the nodes running Traefik can respond to traffic, not every node in the swarm.
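
A minimal sketch of the socat approach, which I do not use in production: a small service on a manager node forwards a TCP port to the Docker socket. The docker_proxy service name, the alpine/socat image and port 2375 are assumptions for illustration.

```yaml
services:
  docker_proxy:                 # hypothetical service name
    image: alpine/socat         # assumed image, its entrypoint is socat
    # Forward TCP port 2375 to the local Docker socket
    command: TCP-LISTEN:2375,fork,reuseaddr UNIX-CONNECT:/var/run/docker.sock
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    networks:
      - traefik                 # private overlay network shared with Traefik
    deploy:
      placement:
        constraints:
          - node.role == manager
```

Traefik would then be started with --docker.endpoint=tcp://docker_proxy:2375 instead of mounting the socket. The Docker API is exposed without authentication this way, so the port should stay on a private overlay network and never be published.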

In my case, I use my cloud provider's load balancer (ELB for AWS, Azure Load Balancer for Azure, etc.), and I need to know which nodes the load balancer should send traffic to. I see two solutions:

  • Add all worker nodes to the load balancer and let its health check send traffic only to the nodes where Traefik is running. In my opinion, it's not the best solution.
  • Run Traefik on the manager nodes in global mode. As the number of managers is fixed, it's easier to configure in the load balancer.

For the health-check, I usually use a TCP check on port 8080 which is exposed in host mode too.

If you have a better solution, I will be happy to hear it. I use the second solution in production: as Traefik doesn't use a lot of CPU, running it on the managers is not a problem for now.

Traefik Cluster/HA

Why do we need Traefik in cluster mode? Shouldn't running multiple instances work out of the box?

If you want to use Let's Encrypt with Traefik while sharing configuration or TLS certificates between several Traefik instances, you need Traefik cluster/HA.

OK, could I mount a shared volume used by all my instances? Yes, you can, but it will not work. With Let's Encrypt, certificates are not the only thing that has to be shared. When Traefik requests a new certificate, it sets up a challenge, and Let's Encrypt then verifies ownership of the domain by calling back to that challenge. If the callback lands on a Traefik instance that does not know about the challenge, the validation fails.

For more information about challenges, see: Automatic Certificate Management Environment (ACME)

What does the cluster do?

When you run Traefik in cluster mode, the configuration is stored in, and read from, a key-value store. The cluster elects a leader, and the leader is responsible for writing configuration changes to the key-value store. The other nodes listen for those changes.

How to initialize the cluster?

The best way I found is to have an initializer service. This service will push the config to Consul via the storeconfig sub-command.

This service is restarted until it finishes without error, because Consul may not be ready yet when the service first tries to push the configuration.

In a docker-compose file, the initializer looks like this:

  traefik_init:
    image: traefik:1.4
    command:
      - "storeconfig"
      - "--web"
      [...]
      - "--consul"
      - "--consul.endpoint=consul:8500"
      - "--consul.prefix=traefik"
    networks:
      - traefik
    deploy:
      restart_policy:
        condition: on-failure
    depends_on:
      - consul

The Traefik service itself now only needs the Consul configuration and acme.storage, because Traefik does not seem to read the latter from Consul.

  traefik:
    image: traefik:1.4
    depends_on:
      - traefik_init
      - consul
    command:
      - "--consul"
      - "--consul.endpoint=consul:8500"
      - "--consul.prefix=traefik"
      - "--acme.storage=traefik/acme/account"
    [...]

To change the configuration, update the initializer service and redeploy it. The new configuration will be stored in Consul, and you then need to restart the Traefik nodes: docker service update --force traefik_traefik.

Complete stack file

version: "3.4"
services:
  traefik_init:
    image: traefik:1.4@sha256:9c299d9613cb01564c8219f4bc56ecc55f30d8f06d35cf3ecf83a85426c13225
    command:
      - "storeconfig"
      - "--web"
      - "--entrypoints=Name:http Address::80 Redirect.EntryPoint:https"
      - "--entrypoints=Name:https Address::443 TLS"
      - "--defaultentrypoints=http,https"
      - "--acme"
      - "--acme.storage=traefik/acme/account"
      - "--acme.entryPoint=https"
      - "--acme.OnHostRule=true"
      - "--acme.onDemand=false"
      - "--acme.email=contact@jmaitrehenry.ca"
      - "--docker"
      - "--docker.swarmmode"
      - "--docker.domain=jmaitrehenry.ca"
      - "--docker.watch"
      - "--consul"
      - "--consul.endpoint=consul:8500"
      - "--consul.prefix=traefik"
    networks:
      - traefik
    deploy:
      restart_policy:
        condition: on-failure
    depends_on:
      - consul
  traefik:
    image: traefik:1.4@sha256:9c299d9613cb01564c8219f4bc56ecc55f30d8f06d35cf3ecf83a85426c13225
    depends_on:
      - traefik_init
      - consul
    command:
      - "--consul"
      - "--consul.endpoint=consul:8500"
      - "--consul.prefix=traefik"
      - "--acme.storage=traefik/acme/account"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    networks:
      - webgateway
      - traefik
    ports:
      - target: 80
        published: 80
        mode: host
      - target: 443
        published: 443
        mode: host
      - target: 8080
        published: 8080
        mode: host
    deploy:
      mode: global
      placement:
        constraints:
          - node.role == manager
      update_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: on-failure
  consul:
    image: consul
    command: agent -server -bootstrap-expect=1
    volumes:
      - consul-data:/consul/data
    environment:
      - CONSUL_LOCAL_CONFIG={"datacenter":"us_east2","server":true}
      - CONSUL_BIND_INTERFACE=eth0
      - CONSUL_CLIENT_INTERFACE=eth0
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == manager
      restart_policy:
        condition: on-failure
    networks:
      - traefik

networks:
  webgateway:
    driver: overlay
    external: true
  traefik:
    driver: overlay

volumes:
  consul-data:
    driver: [not local]

Thanks to the French Docker community for their feedback!
Do not hesitate to join the Docker Slack community or the Traefik Slack community.
If you find a typo or run into a problem when trying what you find in this article, please contact me!