When a health check makes its host unhealthy
Leading up to this post, I had been fiddling with my HushDNS project for quite some time. In the words of the project, it's all about:
The DNS queries that your system, browser and whatever other software you use execute leave you exposed to tracking by prying outsiders, leaving you with little privacy in regards to e.g. which websites you visit. This is because the data (requests and responses) involved when using DNS travels in the clear. Meaning: everybody can determine the websites and other Internet related services you use. Furthermore, a DNS query you execute can be meddled with, so you could end up on a website controlled by ill-intentioned people rather than the one YOU intended. Luckily, it's possible to seriously combat the issues that DNS has.
Early April 2020 I finally got everything to work, from a functionality point of view. However, after a couple of days, DNS queries came to a halt: either taking a very long time to complete, or not being answered at all. Worse still, the host on which dnscrypt-proxy and Pi-hole run came to a halt as well. So I put on my superman debug cape and pursued the issue.
Come with me. Follow my debug tracks to see what I found.
Initial brute-force debugging
So yes, I must admit, the first action I took was restarting the box hosting the HushDNS containers. It did help, however only temporarily. Not really a surprise though. Excuse my laziness; that's just how it is at times.
I quickly came to the conclusion that restarting the host was not going to be a permanent solution. I had to go deeper, way deeper.
Spearhead debugging
The next time it happened, I was lucky enough to already be connected to the container host over SSH. So I did my cmd fu, calling upon top, ps and the like. To my frustration the commands failed right in my face, with a very worrying fault: fork: Resource temporarily unavailable. This usually means that a resource limit on Linux has been reached.
As I couldn't really interact with the host because of its rotten state, I had to restart it again. Hopefully for the last time (related to this issue).
With the host up again, I SSH'ed on to it and decided to do some monitoring to determine what could be hogging resources in such an aggressive way. Furthermore, I went over the recent changes I had made to the host. They were:
- Updating the OS of my QNAP
- Setting up HushDNS related containers
- Pi-hole had been running for a long time already
- the dnscrypt-server container is running on a host outside my LAN
- dnscrypt-proxy was rather new and had its configuration changed several times
It could be some bug in the newest QNAP OS release, or something in the newly set up containers. I categorized the OS bug as the less likely of the two, and therefore went with looking at the dnscrypt-proxy container. But how could that container be exhausting the host's resources?
Determining resource limit values
On Linux we do that with the ulimit cmd. E.g. execute:
- ulimit -a to eye-gaze the values of all the different types of resource limits that can be configured
- ulimit -u to determine the value of the max user processes setting
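To put concrete numbers on the limits in play here, a quick check could look like the sketch below (output formatting differs a bit between shells and distributions, so treat it as illustrative):
# Show all resource limits for the current shell/user
ulimit -a
# The two limits relevant to this story
ulimit -u    # max user processes
ulimit -n    # open files
# And their hard limits, i.e. the ceiling the soft limits can be raised to
ulimit -Hu
ulimit -Hn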
As the fork: Resource temporarily unavailable error is often caused by a user having too many open files, or by a user hitting the maximum number of running processes, I had a look at the values of the max user processes and open files settings. max user processes was at 31335 and open files at 1024.
The brute-force way of fixing this would be to set these two to unlimited. But that would be a peasant solution and would open us up to a fork bomb attack.
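Purely for illustration, that peasant route would look roughly like the sketch below, run as root (or made persistent via /etc/security/limits.conf). I explicitly did not go down this path:
# NOT the chosen fix: lift the limits for the current shell session only
ulimit -u unlimited   # max user processes; needs the hard limit to allow it
ulimit -n 65535       # open files; 'unlimited' is typically refused, hence a large number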
Determining the cause of the resource exhaustion
Next up was to determine what was currently eating into the max user processes and open files limits. What to do? Hmmmm, let's pull out the top cmd again and fire it off on the host.
Don't tell me that you would honestly have any better idea at this point. top is as warm and fuzzy as they come. So let's hold on to it and squeeze it tightly.
Sooooo, what did top tell us?
.....
PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
67 1 root Z 0 0% 0 0% [drill]
60 1 root Z 0 0% 2 0% [drill]
74 1 root Z 0 0% 3 0% [drill]
89 1 root Z 0 0% 3 0% [drill]
97 1 root Z 0 0% 1 0% [drill]
104 1 root Z 0 0% 3 0% [drill]
110 1 root Z 0 0% 3 0% [drill]
117 1 root Z 0 0% 3 0% [drill]
47 1 root Z 0 0% 3 0% [drill]
131 1 root Z 0 0% 2 0% [drill]
138 1 root Z 0 0% 2 0% [drill]
145 1 root Z 0 0% 0 0% [drill]
152 1 root Z 0 0% 1 0% [drill]
159 1 root Z 0 0% 0 0% [drill]
166 1 root Z 0 0% 0 0% [drill]
173 1 root Z 0 0% 2 0% [drill]
180 1 root Z 0 0% 2 0% [drill]
189 1 root Z 0 0% 3 0% [drill]
32 1 root Z 0 0% 1 0% [drill]
203 1 root Z 0 0% 2 0% [drill]
209 1 root Z 0 0% 1 0% [drill]
216 1 root Z 0 0% 0 0% [drill]
223 1 root Z 0 0% 1 0% [drill]
230 1 root Z 0 0% 0 0% [drill]
237 1 root Z 0 0% 2 0% [drill]
244 1 root Z 0 0% 0 0% [drill]
251 1 root Z 0 0% 3 0% [drill]
258 1 root Z 0 0% 0 0% [drill]
265 1 root Z 0 0% 1 0% [drill]
272 1 root Z 0 0% 1 0% [drill]
279 1 root Z 0 0% 2 0% [drill]
286 1 root Z 0 0% 0 0% [drill]
292 1 root Z 0 0% 0 0% [drill]
299 1 root Z 0 0% 2 0% [drill]
40 1 root Z 0 0% 3 0% [drill]
313 1 root Z 0 0% 1 0% [drill]
321 1 root Z 0 0% 3 0% [drill]
328 1 root Z 0 0% 1 0% [drill]
335 1 root Z 0 0% 1 0% [drill]
342 1 root Z 0 0% 3 0% [drill]
349 1 root Z 0 0% 0 0% [drill]
.....
N.B. The above output is a snippet of the output from top. And for good measure I changed the username in the output to the generic value of root.
Oh wow! Holy hamster. That looks bad, bad, not good! Something is firing off the drill command again and again for some reason. We could live with that, if drill was actually killed off afterwards. But it's not, and therefore it just keeps hanging. This is likely to be the culprit of the resource exhaustion state that the host ends up in.
Let's go after the many hanging drill commands, and save looking into whether this is also an open files limit issue for later.
If we take a look at the output of the top command again, focusing specifically on the stat column, we see that it has a value of Z for all of the drill commands. This stands for zombie and in essence means that the process is dead; it has done its job and what's left is the structure describing it in the Linux kernel. It is in the zombie state because it awaits its parent process to pick up its exit code.
Based on that we can conclude that the exit code of the drill command is not picked up. Or it is, but the process is for whatever reason not removed from the process table.
Duly note that a zombie process has released its resources. However, it still remains in the system process table and therefore counts against the max user processes limit.
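If you want to gauge how close the host actually is to that limit, counting the zombies is a quick sanity check. A minimal sketch (procps-style ps; busybox ps lays the columns out slightly differently):
# List every process whose state starts with Z (zombie)
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
# Just count them
ps -eo stat,comm | awk '$1 ~ /^Z/' | wc -l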
So what is invoking all those drill commands? Let's use another of my favorite old timer Linux commands, namely ps.
Execute: ps -efl -o pid,ppid,user,comm | grep drill
Here is the output (a sample):
.....
5453 24102 root drill
7012 24102 root drill
16851 24102 root drill
24217 24102 root drill
.....
With the -o parameter we defined the columns that the ps command should output. The column to take note of is the second one, namely the ppid column. ppid stands for parent process id. So there you have it: the pid of the process that invoked the drill command.
Next step: gathering info on the parent process. We simply do that by executing cat /proc/{pid}/cmdline, in this case replacing {pid} with 24102. The result: dnscrypt-proxy-config/config/dnscrypt-proxy.toml. At first sight all whitespace seems to have been stripped from the output. The reason is that the arguments in the cmdline file are not separated by spaces, but by a special separator character. If we execute vi /proc/{pid}/cmdline we see that the content of the cmdline file is dnscrypt-proxy^@-config^@/config/dnscrypt-proxy.toml^@. The ^@ sequence is how vi displays the NUL character that separates the arguments. With this knowledge obtained we can conclude that the actual cmdline of the parent process is: dnscrypt-proxy -config /config/dnscrypt-proxy.toml. So … there you have it. It's dnscrypt-proxy that invokes all these drill commands.
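As a side note, you can get the readable command line straight away by translating the NUL separators to spaces. A small sketch, using the parent pid from above:
# Print /proc/<pid>/cmdline with the NUL argument separators shown as spaces
tr '\0' ' ' < /proc/24102/cmdline; echo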
Solving the cause of the resource exhaustion
dnscrypt-proxy runs in a container on my QNAP. In order to determine what is calling the drill command I had a look at its dockerfile. At the time of writing the dockerfile looked like this:
FROM alpine:edge
ENV DNSCRYPT_LISTEN_PORT=5053 \
DNSCRYPT_SERVER_NAMES="" \
PUID=1000 \
PGID=100
COPY entrypoint.sh /
RUN apk add --no-cache dnscrypt-proxy drill \
&& chmod +x /entrypoint.sh
HEALTHCHECK --interval=5s --timeout=3s --start-period=10s \
CMD drill -p $DNSCRYPT_LISTEN_PORT cloudflare.com @127.0.0.1 || exit 1
ENTRYPOINT ["/entrypoint.sh"]
Here's a link to the dnscrypt-proxy dockerfile, as it looked at the time of writing.
By eye-gazing the file we can determine that the drill command is used as part of the Docker container HEALTHCHECK. Oh shoot! It is actually the HEALTHCHECK making the host “sick”. This is sooooo …
But why is this happening? Again looking at the dockerfile we can see that the drill cmd queries a DNS server on port 5053. Okay, fine. But wait, I have this volume mapping in my dnscrypt-proxy docker-compose YML file:
volumes:
- "./conf/dnscrypt-proxy.toml:/config/dnscrypt-proxy.toml"
I had a look in the dnscrypt-proxy.toml file and found that I had configured the listening port of dnscrypt-proxy to a different port than the one the HEALTHCHECK queries.
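To make the mismatch concrete, it boils down to something like the following in dnscrypt-proxy.toml (the 5300 port here is purely illustrative, not necessarily the exact value I had configured):
# dnscrypt-proxy.toml (illustrative snippet)
# dnscrypt-proxy listens here ...
listen_addresses = ['127.0.0.1:5300']
# ... while the HEALTHCHECK in the dockerfile queries port 5053:
#   drill -p 5053 cloudflare.com @127.0.0.1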
dnscrypt-proxy still works even though there is a mismatch between the HEALTHCHECK and the configured listening port in the dnscrypt-proxy.toml file. But the drill command will of course fail and fail and fail and fail … So why isn't the HEALTHCHECK just returning via the || exit 1 part, with the drill command being killed?
Let's go into that.
For King Canute, let's peep at that dockerfile once more. Notice that --timeout=3s? Yes you do, you see it. After spotting that, I suspected this to be an issue of the Docker HEALTHCHECK not leaving enough time for the drill cmd to time out in the case that the DNS server is not answering. Reading the documentation on the Docker HEALTHCHECK reveals that if the cmd used in the HEALTHCHECK takes longer than, in this case, three seconds, that run of the check is considered to have failed. More here. After executing the cmd used in the HEALTHCHECK inside the dnscrypt-proxy container, I could conclude that the three second timeout would certainly not be enough to allow the drill command to naturally time out.
I used these commands:
- time drill -p 5354 cloudflare.com @127.0.0.1 < query on a port that the dnscrypt-proxy is not listening on. Result: 15.01s
- time drill -p 5053 cloudflare.com @127.0.0.1 < query on the port that the dnscrypt-proxy is listening on. Result: 0.00s. The speed of the answer is because it was served from the DNS cache
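By the way, you can also watch the failing probes from the host side, without waiting for zombies to pile up, by asking Docker for the container's health history. A sketch (substitute your actual container name for dnscrypt-proxy):
# Current health status: starting, healthy or unhealthy
docker inspect --format '{{.State.Health.Status}}' dnscrypt-proxy
# Full health log, including exit codes and output of the recent probes
docker inspect --format '{{json .State.Health}}' dnscrypt-proxy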
Conclusion
When the Docker HEALTHCHECK times out after three seconds, the exit code of the drill cmd is not caught; the process just hangs around as a zombie and counts against the max user processes limit. Then, when the max is reached, disaster strikes as the host experiences resource exhaustion.
The fix
Initially I fixed this by simply making sure that the listening port configured in the dnscrypt-proxy.toml file matched the one the Docker HEALTHCHECK uses.
However, after running the container for a couple of days I still saw zombie drill cmds. Though a very limited number, compared to before.
But of course, the wise one thinks: if for some reason the drill command cannot reach the local DNS server, because the domain does not exist or a network issue occurs, drill still does not get the time it needs to time out naturally. Leaving us with hanging zombie drill processes.
The real fix
As part of my troubleshooting I created this GitHub issue on the repository for the dnscrypt-proxy container image I use. For better or worse, I was referred to the klutchell dnscrypt-proxy container image by the image creator, as he considers that image better maintained. I therefore had a look at that project's dockerfile and what do I see? No HEALTHCHECK. Hmmmm, interesting.
Further investigation, this time into the commit history of the klutchell image. In there I found a commit with this revealing title: remove drill and internal healthcheck. But the commit had no reasoning on the why. So I created a GitHub issue on the klutchell image repository, where I simply asked for the reason behind removing the HEALTHCHECK.
The creator told me that he did so in order to simplify the image and to give users control over whether or not they want a HEALTHCHECK. That knowledge made me decide to continue with the image I was already using, however with the change that I would control its HEALTHCHECK from the docker-compose file I'm using for the dnscrypt-proxy container.
Like so:
version: '3.4'
....
....
....
    healthcheck:
      test: ["CMD", "drill", "-p", "5053", "cloudflare.com", "@127.0.0.1", "||", "exit", "1"]
      interval: 20s
      timeout: 20s
      retries: 3
      start_period: 5s
....
N.B. the above is a snippet of the docker-compose file. Ensure to specify the docker-compose file version to be v3.4 or above if you want to use the start_period parameter.
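One detail to be aware of: with the exec ("CMD") form used above, the || exit 1 tokens are handed to drill as literal arguments instead of being interpreted by a shell. It matters little in practice, since any non-zero exit code from the health check command already marks the probe as failed, but if you want the shell-style fallback, the CMD-SHELL form expresses it, roughly like this sketch:
    healthcheck:
      # CMD-SHELL runs the string through /bin/sh -c, so || is honoured
      test: ["CMD-SHELL", "drill -p 5053 cloudflare.com @127.0.0.1 || exit 1"]
      interval: 20s
      timeout: 20s
      retries: 3
      start_period: 5s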
Overwriting the HEALTHCHECK gives the following benefits:
- Controlling the value of the timeout HEALTHCHECK parameter (the actual fix)
- And, if I want to, the freedom to set the dnscrypt-proxy listening port as I see fit
Completing that, I then manually monitored the dnscrypt-proxy container for the next day or so. It ran without any issues and, even better, there were no longer any hanging zombie processes. There you have it, job done.
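If you want to run a similar check on your own setup, a minimal sketch could be (the container name and the drill comm value are specific to this setup):
# Does Docker report the container as healthy?
docker ps --filter name=dnscrypt-proxy --format '{{.Names}}: {{.Status}}'
# Any zombie drill processes left on the host?
ps -eo stat,comm | awk '$1 ~ /^Z/ && $2 == "drill"'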
Thank you for reading along. I hope you enjoyed it and that you found some useful knowledge in it.
Over and out and stay