When a health check makes its host unhealthy
Leading up to this post, I had been fiddling with my HushDNS project for quite some time. In the words of the project, it's all about:
The DNS queries that your system, browser and whatever other software you use execute leave you exposed to tracking by prying outsiders, leaving you with little privacy in regards to e.g. which websites you visit. This is because the data (requests and responses) involved when using DNS travels in the clear. Meaning: everybody can determine the websites and other Internet related services you use. Furthermore, a DNS query you execute can be meddled with, so you could end up on a website controlled by ill-intentioned people rather than the one YOU intended. Luckily, it's possible to seriously combat the issues that DNS has.
Early April 2020 I finally got everything to work, from a functionality point of view. However, after a couple of days, DNS queries came to a halt: either taking a very long time to complete, or not being answered at all. Worse still, the host on which dnscrypt-proxy and Pi-hole run came to a halt as well. So I put on my superman debug cape and pursued the issue.
Come with me. Follow my debug tracks to see what I found.
Initial brute-force debugging
So yes, I must admit, the first action I took was restarting the box hosting the HushDNS containers. It did help, however only temporarily. Not really a surprise though. Excuse my laziness; that's just how it is at times.
I quickly came to the conclusion that restarting the host was not going to be a permanent solution. I had to go deeper, way deeper.
Spearhead debugging
The next time it happened, I was lucky enough to already be connected to the container host over SSH. So I did my cmd fu, calling upon top, ps and the like. To my frustration the commands failed right in my face, with a very worrying fault: fork: Resource temporarily unavailable. This usually means that a resource limit on Linux has been reached.
As I couldn't really interact with the host because of its rotten state, I had to restart it again. Hopefully for the last time (related to this issue).
With the host up again, I SSH'ed on to it and decided to do some monitoring to determine what could be hogging resources in such an aggressive way. Furthermore, I went over the recent changes I had made to the host. They were:
- Updating the OS of my QNAP
- Setting up HushDNS related containers
- Pi-hole had been running for a long time already
- the dnscrypt-server container is running on a host outside my LAN
- dnscrypt-proxy was rather new and had its configuration changed several times
It could be some bug in the newest QNAP OS release, or something in the newly set up containers. I categorized the OS bug as the less likely of the two, and therefore went with looking at the dnscrypt-proxy container. But how could that container be exhausting the host's resources?
Determining resource limit values
On Linux we do that with the ulimit cmd. E.g. execute:
- ulimit -a to eye-gaze the values of all the different types of resource limits that can be configured
- ulimit -u to determine the value of the max user processes setting
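To put concrete numbers on the limits in play here, a quick check could look like the sketch below (output formatting differs a bit between shells and distributions, so treat it as illustrative):
# Show all resource limits for the current shell/user
ulimit -a
# The two limits relevant to this story
ulimit -u    # max user processes
ulimit -n    # open files
# And their hard limits, i.e. the ceiling the soft limits can be raised to
ulimit -Hu
ulimit -Hn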
As the fork: Resource temporarily unavailable error is often caused by a user having too many open files, or by a user hitting the maximum number of running processes, I had a look at the values of the max user processes and open files settings. max user processes was at 31335 and open files at 1024.
The brute-force way of fixing this would be to set these two to unlimited. But that would be a peasant solution and would open us up to a fork bomb attack.
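Purely for illustration, that peasant route would look roughly like the sketch below, run as root (or made persistent via /etc/security/limits.conf). I explicitly did not go down this path:
# NOT the chosen fix: lift the limits for the current shell session only
ulimit -u unlimited   # max user processes; needs the hard limit to allow it
ulimit -n 65535       # open files; 'unlimited' is typically refused, hence a large number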
Determining the cause of the resource exhaustion
Next up was to determine what was currently eating into the max user processes and open files limits. What to do? Hmmmm, let's pull out the top cmd again and fire it off on the host.
Don't tell me that you would honestly have any better idea at this point. top is as warm and fuzzy as they come. So let's hold on to it and squeeze it tightly.
Sooooo, what did top tell us?
.....
PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
67 1 root Z 0 0% 0 0% [drill]
60 1 root Z 0 0% 2 0% [drill]
74 1 root Z 0 0% 3 0% [drill]
89 1 root Z 0 0% 3 0% [drill]
97 1 root Z 0 0% 1 0% [drill]
104 1 root Z 0 0% 3 0% [drill]
110 1 root Z 0 0% 3 0% [drill]
117 1 root Z 0 0% 3 0% [drill]
47 1 root Z 0 0% 3 0% [drill]
131 1 root Z 0 0% 2 0% [drill]
138 1 root Z 0 0% 2 0% [drill]
145 1 root Z 0 0% 0 0% [drill]
152 1 root Z 0 0% 1 0% [drill]
159 1 root Z 0 0% 0 0% [drill]
166 1 root Z 0 0% 0 0% [drill]
173 1 root Z 0 0% 2 0% [drill]
180 1 root Z 0 0% 2 0% [drill]
189 1 root Z 0 0% 3 0% [drill]
32 1 root Z 0 0% 1 0% [drill]
203 1 root Z 0 0% 2 0% [drill]
209 1 root Z 0 0% 1 0% [drill]
216 1 root Z 0 0% 0 0% [drill]
223 1 root Z 0 0% 1 0% [drill]
230 1 root Z 0 0% 0 0% [drill]
237 1 root Z 0 0% 2 0% [drill]
244 1 root Z 0 0% 0 0% [drill]
251 1 root Z 0 0% 3 0% [drill]
258 1 root Z 0 0% 0 0% [drill]
265 1 root Z 0 0% 1 0% [drill]
272 1 root Z 0 0% 1 0% [drill]
279 1 root Z 0 0% 2 0% [drill]
286 1 root Z 0 0% 0 0% [drill]
292 1 root Z 0 0% 0 0% [drill]
299 1 root Z 0 0% 2 0% [drill]
40 1 root Z 0 0% 3 0% [drill]
313 1 root Z 0 0% 1 0% [drill]
321 1 root Z 0 0% 3 0% [drill]
328 1 root Z 0 0% 1 0% [drill]
335 1 root Z 0 0% 1 0% [drill]
342 1 root Z 0 0% 3 0% [drill]
349 1 root Z 0 0% 0 0% [drill]
.....
N.B. The above output is a snippet of the output from top. And for good measure I changed the username in the output to the generic value of root.
Oh wow! Holy hamster. That looks bad, bad, not good! Something is firing off the drill command again and again for some reason. We could live with that, if drill was actually killed off afterwards. But it's not, and therefore it just keeps hanging. This is likely to be the culprit of the resource exhaustion state that the host ends up in.
Let's go after the many hanging drill commands, and save looking into whether this is also an open files limit issue for later.
If we take a look at the output of the top command again, focusing specifically on the stat column, we see that it has a value of Z for all of the drill commands. This stands for zombie and in essence means that the process is dead; it has done its job and what's left is the structure describing it in the Linux kernel. It is in the zombie state because it awaits its parent process to pick up its exit code.
Based on that we can conclude that the exit code of the drill command is not picked up. Or it is, but the process is for whatever reason not removed from the process table.
Duly note that a zombie process has released its resources. However, it still remains in the system process table and therefore counts against the max user processes limit.
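If you want to gauge how close the host actually is to that limit, counting the zombies is a quick sanity check. A minimal sketch (procps-style ps; busybox ps lays the columns out slightly differently):
# List every process whose state starts with Z (zombie)
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
# Just count them
ps -eo stat,comm | awk '$1 ~ /^Z/' | wc -l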
So what is invoking all those drill commands? Let's use another of my favorite old timer Linux commands, namely ps.
Execute: ps -efl -o pid,ppid,user,comm | grep drill
Here is the output (a sample):
.....
5453 24102 root drill
7012 24102 root drill
16851 24102 root drill
24217 24102 root drill
.....
With the -o parameter we defined the columns that the ps command should output. The column to take note of is the second one, namely the ppid column. ppid stands for parent process id. So there you have it: the pid of the process that invoked the drill command.
Next step: gathering info on the parent process. We simply do that by executing cat /proc/{pid}/cmdline, in this case replacing {pid} with 24102. The result: dnscrypt-proxy-config/config/dnscrypt-proxy.toml. At first sight all whitespace seems to have been stripped from the output. The reason is that the arguments in the cmdline file are not separated by spaces, but by a special separator character. If we execute vi /proc/{pid}/cmdline we see that the content of the cmdline file is dnscrypt-proxy^@-config^@/config/dnscrypt-proxy.toml^@. The ^@ sequence is how vi displays the NUL character that separates the arguments. With this knowledge obtained we can conclude that the actual cmdline of the parent process is: dnscrypt-proxy -config /config/dnscrypt-proxy.toml. So … there you have it. It's dnscrypt-proxy that invokes all these drill commands.
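As a side note, you can get the readable command line straight away by translating the NUL separators to spaces. A small sketch, using the parent pid from above:
# Print /proc/<pid>/cmdline with the NUL argument separators shown as spaces
tr '\0' ' ' < /proc/24102/cmdline; echo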
Solving the cause of the resource exhaustion
dnscrypt-proxy runs in a container on my QNAP. In order to determine what is calling the drill command I had a look at its dockerfile. At the time of writing the dockerfile looked like this:
FROM alpine:edge
ENV DNSCRYPT_LISTEN_PORT=5053 \
DNSCRYPT_SERVER_NAMES="" \
PUID=1000 \
PGID=100
COPY entrypoint.sh /
RUN apk add --no-cache dnscrypt-proxy drill \
&& chmod +x /entrypoint.sh
HEALTHCHECK --interval=5s --timeout=3s --start-period=10s \
CMD drill -p $DNSCRYPT_LISTEN_PORT cloudflare.com @127.0.0.1 || exit 1
ENTRYPOINT ["/entrypoint.sh"]
Here's a link to the dnscrypt-proxy dockerfile, as it looked at the time of writing.
By eye-gazing the file we can determine that the drill command is used as part of the Docker container HEALTHCHECK. Oh shoot! It is actually the HEALTHCHECK making the host “sick”. This is sooooo …
But why is this happening? Again looking at the dockerfile we can see that the drill cmd queries a DNS server on port 5053. Okay, fine. But wait, I have this volume mapping in my dnscrypt-proxy docker-compose YML file:
volumes:
- "./conf/dnscrypt-proxy.toml:/config/dnscrypt-proxy.toml"
I had a look in the dnscrypt-proxy.toml file and found that I had configured the listening port of dnscrypt-proxy to a different port than the one the HEALTHCHECK queries.
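To make the mismatch concrete, it boils down to something like the following in dnscrypt-proxy.toml (the 5300 port here is purely illustrative, not necessarily the exact value I had configured):
# dnscrypt-proxy.toml (illustrative snippet)
# dnscrypt-proxy listens here ...
listen_addresses = ['127.0.0.1:5300']
# ... while the HEALTHCHECK in the dockerfile queries port 5053:
#   drill -p 5053 cloudflare.com @127.0.0.1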
dnscrypt-proxy still works even though there is a mismatch between the HEALTHCHECK and the configured listening port in the dnscrypt-proxy.toml file. But the drill command will of course fail and fail and fail and fail … So why isn't the HEALTHCHECK just returning via the || exit 1 part, with the drill command being killed?
Let's go into that.
For King Canute, let's peep at that dockerfile once more. Notice that --timeout=3s? Yes you do, you see it. After spotting that, I suspected this to be an issue of the Docker HEALTHCHECK not leaving enough time for the drill cmd to time out in the case that the DNS server is not answering. Reading the documentation on the Docker HEALTHCHECK reveals that if the cmd used in the HEALTHCHECK takes longer than, in this case, three seconds, that run of the check is considered to have failed. More here. After executing the cmd used in the HEALTHCHECK inside the dnscrypt-proxy container, I could conclude that the three second timeout would certainly not be enough to allow the drill command to naturally time out.
I used these commands:
- time drill -p 5354 cloudflare.com @127.0.0.1 < query on a port that the dnscrypt-proxy is not listening on. Result: 15.01s
- time drill -p 5053 cloudflare.com @127.0.0.1 < query on the port that the dnscrypt-proxy is listening on. Result: 0.00s. The speed of the answer is because it was served from the DNS cache
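By the way, you can also watch the failing probes from the host side, without waiting for zombies to pile up, by asking Docker for the container's health history. A sketch (substitute your actual container name for dnscrypt-proxy):
# Current health status: starting, healthy or unhealthy
docker inspect --format '{{.State.Health.Status}}' dnscrypt-proxy
# Full health log, including exit codes and output of the recent probes
docker inspect --format '{{json .State.Health}}' dnscrypt-proxy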
Conclusion
When the Docker HEALTHCHECK times out after three seconds, the exit code of the drill cmd is not caught; the process just hangs around as a zombie and counts against the max user processes limit. Then, when the max is reached, disaster strikes as the host experiences resource exhaustion.
The fix
Initially I fixed this by simply making sure that the listening port configured in the dnscrypt-proxy.toml file matched the one the Docker HEALTHCHECK uses.
However, after running the container for a couple of days I still saw zombie drill cmds. Though a very limited number, compared to before.
But of course, the wise one thinks: if for some reason the drill command cannot reach the local DNS server, because the domain does not exist or a network issue occurs, drill still does not get the time it needs to time out naturally. Leaving us with hanging zombie drill processes.
The real fix
As part of my troubleshooting I created this GitHub issue on the repository for the dnscrypt-proxy container image I use. For better or worse, I was referred to the klutchell dnscrypt-proxy container image by the image creator, as he considers that image better maintained. I therefore had a look at that project's dockerfile and what do I see? No HEALTHCHECK. Hmmmm, interesting.
Further investigation, this time into the commit history of the klutchell image. In there I found a commit with this revealing title: remove drill and internal healthcheck. But the commit had no reasoning on the why. So I created a GitHub issue on the klutchell image repository, where I simply asked for the reason behind removing the HEALTHCHECK.
The creator told me that he did so in order to simplify the image and to give users control over whether or not they want a HEALTHCHECK. That knowledge made me decide to continue with the image I was already using, however with the change that I would control its HEALTHCHECK from the docker-compose file I'm using for the dnscrypt-proxy container.
Like so:
version: '3.4'
....
....
....
    healthcheck:
      test: ["CMD", "drill", "-p", "5053", "cloudflare.com", "@127.0.0.1", "||", "exit", "1"]
      interval: 20s
      timeout: 20s
      retries: 3
      start_period: 5s
....
N.B. the above is a snippet of the docker-compose file. Ensure to specify the docker-compose file version to be v3.4 or above if you want to use the start_period parameter.
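One detail to be aware of: with the exec ("CMD") form used above, the || exit 1 tokens are handed to drill as literal arguments instead of being interpreted by a shell. It matters little in practice, since any non-zero exit code from the health check command already marks the probe as failed, but if you want the shell-style fallback, the CMD-SHELL form expresses it, roughly like this sketch:
    healthcheck:
      # CMD-SHELL runs the string through /bin/sh -c, so || is honoured
      test: ["CMD-SHELL", "drill -p 5053 cloudflare.com @127.0.0.1 || exit 1"]
      interval: 20s
      timeout: 20s
      retries: 3
      start_period: 5s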
Overwriting the HEALTHCHECK gives the following benefits:
- Controlling the value of the timeout HEALTHCHECK parameter (the actual fix)
- And, if I want to, the freedom to set the dnscrypt-proxy listening port as I see fit
Completing that, I then manually monitored the dnscrypt-proxy container for the next day or so. It ran without any issues and, even better, there were no longer any hanging zombie processes. There you have it, job done.
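If you want to run a similar check on your own setup, a minimal sketch could be (the container name and the drill comm value are specific to this setup):
# Does Docker report the container as healthy?
docker ps --filter name=dnscrypt-proxy --format '{{.Names}}: {{.Status}}'
# Any zombie drill processes left on the host?
ps -eo stat,comm | awk '$1 ~ /^Z/ && $2 == "drill"'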
Thank you for reading along. I hope you enjoyed it and that you found some useful knowledge in it.
Over and out and stay