Saturday, August 29, 2020

Healthchecks and Sitecore containers

I just figured something out and thought I'd share it. I tried spinning up Sitecore 10 on Docker using the XM1 configuration and got this error:

ERROR: for traefik  Container "5530f939e540" is unhealthy.


At first I thought this was an error from the "traefik" container (which routes HTTPS traffic to the CM, CD, and ID servers; see Rey Rehadian's post and the Container Installation Guide). But when I ran "docker ps", I saw that the health issue was actually with my CM container (and my CD container as well).

Looking at the docker-compose.yaml file, you can see that healthy CM and CD containers are a prerequisite for bringing up the traefik container.
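To make the gating rule concrete, here is a minimal Python sketch of how a dependency condition blocks a service from starting. It assumes the compose file declares traefik's dependencies with "condition: service_healthy", which is how Compose says "wait until this container is healthy". The DEPENDS_ON table and both function names are hypothetical and are not taken from the Sitecore files.

```python
from __future__ import annotations

# Hypothetical model of traefik's depends_on section. Only the
# "service_healthy" condition is modeled; any other condition is
# treated as already satisfied in this sketch.
DEPENDS_ON = {
    "traefik": {"cm": "service_healthy", "cd": "service_healthy"},
}


def blocking_dependencies(service: str, health: dict[str, str]) -> list[str]:
    """List the dependencies that stop `service` from starting.

    `health` maps a service name to its Docker health status:
    "starting", "healthy", or "unhealthy".
    """
    return [
        dep
        for dep, condition in DEPENDS_ON.get(service, {}).items()
        if condition == "service_healthy" and health.get(dep) != "healthy"
    ]


def can_start(service: str, health: dict[str, str]) -> bool:
    """True when no service_healthy dependency is blocking `service`."""
    return not blocking_dependencies(service, health)
```

With CM reporting "unhealthy" and CD "healthy", blocking_dependencies("traefik", ...) returns ["cm"]. That matches the situation above: traefik is never started.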

Okay, that makes sense. Before Docker Compose brings up the traefik container, it checks the health of the containers traefik depends on. CM was unhealthy, so traefik couldn't be brought up, and I got an error.

So I started looking into what can make a container unhealthy. It turns out this is a Dockerfile feature. A HEALTHCHECK instruction tells Docker how to test a running container: if the command exits with status 0, the container is healthy, and exit status 1 means unhealthy. There's a useful write-up here, and the Dockerfile docs are here. The problem is that, as far as I can tell, the Dockerfiles for the Sitecore 10 images are not public, and I couldn't see what the healthcheck was from "docker image inspect". However, "docker container inspect 553" (553 is the beginning of my CM container ID) showed the issue.
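For illustration, here is a minimal Python sketch of what a HEALTHCHECK command does: request a readiness URL and turn the result into an exit code. This is not Sitecore's actual check, since the Sitecore images use their own healthcheck.ps1 script. The default URL comes from this post, and the function name probe is hypothetical.

```python
import http.client
import urllib.request

# Readiness URL from the post; inside the container the site listens on port 80.
DEFAULT_URL = "http://localhost:80/healthz/ready"


def probe(url: str = DEFAULT_URL, timeout: float = 5.0) -> int:
    """Return a HEALTHCHECK-style exit code: 0 = healthy, 1 = unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # urlopen raises HTTPError for non-2xx final responses,
            # so this explicit range check is only defensive.
            return 0 if 200 <= resp.status < 300 else 1
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError (4xx/5xx), refused connections, and timeouts
        # are all OSError subclasses; HTTPException covers malformed replies.
        return 1
```

As a script, this would end by calling sys.exit(probe()), and a Dockerfile would reference it after HEALTHCHECK CMD. Docker marks the container unhealthy only after several consecutive non-zero exits (the --retries option, three by default).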

Since traefik wasn't up, I couldn't browse to https://xm1cm.localhost/ directly. Instead, I looked up the container's IP address in the "docker container inspect" output and used that. (Alternatively, you can run "docker exec -it <container id> powershell" to get a command line inside the container. From there, Invoke-WebRequest -UseBasicParsing -Uri http://localhost:80/healthz/ready shows the same output.)
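Rather than reading the raw "docker container inspect" JSON by eye, you can pull out the fields used here with a small Python sketch:

- the health status;
- the output of the most recent health-check run (State.Health.Log);
- the container's IP address on each network (NetworkSettings.Networks).

The field names are Docker's. The helper names inspect_container and summarize are hypothetical.

```python
import json
import subprocess


def inspect_container(container_id: str) -> dict:
    """Run `docker container inspect` and return the single result object."""
    out = subprocess.run(
        ["docker", "container", "inspect", container_id],
        check=True, capture_output=True, text=True,
    ).stdout
    # The command prints a JSON array with one entry per container.
    return json.loads(out)[0]


def summarize(info: dict) -> dict:
    """Extract health status, last health-check result, and IP addresses."""
    health = info.get("State", {}).get("Health") or {}
    log = health.get("Log") or []  # recent runs, most recent last
    networks = info.get("NetworkSettings", {}).get("Networks") or {}
    return {
        "status": health.get("Status"),
        "last_exit_code": log[-1]["ExitCode"] if log else None,
        "last_output": log[-1]["Output"].strip() if log else None,
        "ips": {name: net.get("IPAddress") for name, net in networks.items()},
    }
```

For example, summarize(inspect_container("553")) would return the status, last check output, and IP addresses for the CM container. Docker accepts any unique prefix of a container ID.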

Requesting /healthz/ready at that IP address showed my issue: an invalid license file value.

/healthz is a standard naming convention for health-check endpoints. It originally came from Google and has been built into Sitecore since 9.3, as these articles by Neil Killen and Vitalii Tylyk explain. (I did a fair amount of "healthz" Googling today.)

I'm still not sure exactly what was wrong with my SITECORE_LICENSE setting. I removed the variable from my .env file (which sets environment variables for docker-compose) and used the Set-LicenseEnvironmentVariable script from the Docker-Images repo instead. That worked. Then I removed that variable, moved the value back to the .env file, and the issue stayed gone. My best guess is that I had an invalid value in my OS-level environment variable, which takes precedence over the .env file. I confirmed this by setting that environment variable to "BAD" at the OS level. Precedence is discussed here:

When you set the same environment variable in multiple files, here’s the priority used by Compose to choose which value to use:

  1. Compose file
  2. Shell environment variables
  3. Environment file
  4. Dockerfile
  5. Variable is not defined

From: https://docs.docker.com/compose/environment-variables/
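The priority list above boils down to "the first source that defines the variable wins." A minimal Python sketch of that rule shows why an OS-level SITECORE_LICENSE of "BAD" masks a correct value in the .env file. The source labels and the resolve function are hypothetical. The sketch is also simplified: real Compose behavior depends on whether a variable is substituted into the compose file or passed through to the container.

```python
from __future__ import annotations

from typing import Mapping, Optional

# Sources in the priority order quoted above, highest priority first.
PRIORITY = ("compose_file", "shell", "env_file", "dockerfile")


def resolve(
    name: str, sources: Mapping[str, Mapping[str, str]]
) -> tuple[Optional[str], Optional[str]]:
    """Return (value, source) for `name`, or (None, None) if undefined."""
    for source in PRIORITY:
        values = sources.get(source, {})
        if name in values:
            return values[name], source
    # Priority 5 in the list: the variable is not defined anywhere.
    return None, None
```

In this sketch, a variable set to an empty string still counts as defined and wins over lower-priority sources.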

One final note: the healthcheck.ps1 script checks for a lock file at $env:LOCALAPPDATA\Healthcheck\readylock. If the file is present, the script checks /healthz/live; otherwise, it checks /healthz/ready. The file was missing while I had my license issue but is now present on the running system, so I assume it is created once Sitecore comes online. Per Vitalii Tylyk's Docker post, /ready is the more thorough check, since it verifies Solr and the presence of the core, master, web, and security databases in SQL Server. Presumably the script stops using /ready after the lock file is written to reduce load: once those dependencies have been confirmed, healthcheck.ps1 switches to the simpler "live" check.
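The lock-file behavior described above can be sketched in Python as follows. This is a reconstruction from the description, not the contents of healthcheck.ps1. The check callback stands in for an HTTP request to the given path. Writing the lock file after the first successful /healthz/ready check is an assumption, since all I know is that the file appears once Sitecore is up. The function names are hypothetical.

```python
from __future__ import annotations

import os
from pathlib import Path
from typing import Callable


def default_lock_path() -> Path:
    # Mirrors $env:LOCALAPPDATA\Healthcheck\readylock; falls back to the
    # current directory when LOCALAPPDATA is unset (e.g. outside Windows).
    return Path(os.environ.get("LOCALAPPDATA", ".")) / "Healthcheck" / "readylock"


def healthcheck(check: Callable[[str], bool], lock: Path) -> int:
    """Return 0 (healthy) or 1 (unhealthy), choosing the endpoint by lock file."""
    if lock.exists():
        # Readiness was already confirmed once: use the cheaper liveness check.
        return 0 if check("/healthz/live") else 1
    if check("/healthz/ready"):
        # Assumption: the lock is written after the first successful ready check.
        lock.parent.mkdir(parents=True, exist_ok=True)
        lock.touch()
        return 0
    return 1
```

Passing the check in as a function keeps the endpoint-switching logic separate from the HTTP details, so it can be exercised without a running Sitecore instance.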
