A few days ago, one of our (Gartner Innovation Center) production servers died as a result of a log file that wasn’t properly rotated. This might sound like an easy problem to figure out and fix, but the situation was a bit more complex!
This blog post walks through all the steps we took to investigate and fix the issue. I find it extremely interesting and hope you will too!
Investigating the issue
We connected to the server and ran a few “checklist” commands -
df -alh
As you can see, /var/lib/docker got filled up. But which part?
du -ahd 1 /var/lib/docker
Snap! We can’t get statistics on /var/lib/docker because we don’t have the right privileges. We don’t have root either. Ideas?
IT messed up and we can run docker in privileged mode:
“When the operator executes docker run --privileged, Docker will enable access to all devices on the host as well as set some configuration in AppArmor or SELinux to allow the container nearly all the same access to the host as processes running outside containers on the host” - Docker docs
So we fired up docker, mounting /var to /host_var:
docker run -t -i --privileged -v /var:/host_var alpine sh
Interesting. du says /var/lib/docker/containers is full.
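The exact invocation isn’t shown here, but inside the container it would have been something along these lines (note the /host_var prefix, because of the bind mount):
du -hd 1 /host_var/lib/docker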
Let’s try and find out which container is the problematic one:
docker run -t -i --privileged -v /var:/host_var alpine sh
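Again, the exact commands aren’t reproduced here; a rough sketch of how to spot the biggest offender from inside the container would be:
du -d 1 /host_var/lib/docker/containers | sort -n | tail -n 5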
Back to the host, we ran:
docker ps -a | grep 11bcb8ab547d177
and found out who was the troublemaker. But that doesn’t solve anything!
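If the grep ever comes up short, docker inspect can also map a containers-directory name back to a readable container name (the ID is the one from above):
docker inspect --format '{{.Name}}' 11bcb8ab547d177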
Back to the docker container:
du -hs /host_var/lib/docker/containers/11bcb8ab547d177/*
We suspected that the problem was logging, because our docker log driver rotates logs after they reach a certain size. We should have seen the file at the following location, but didn’t: /var/lib/docker/containers/11bcb8ab547d177/11bcb8ab547d177.json
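For context, size-based rotation in the stock json-file driver is configured through log-opts; our setup would have looked roughly like this (the image name and the values are illustrative, not our actual ones):
docker run --log-driver json-file --log-opt max-size=50m --log-opt max-file=5 <image>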
What if a log wasn’t rotated properly? What if the container was holding a file descriptor to a log file that had already been deleted?
It’s really easy to find out! Let’s fire up lsof and search for deleted files.
lsof
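To actually filter for them, we’d pipe through grep, something like (lsof marks unlinked-but-open files as deleted):
lsof | grep '(deleted)'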
Snap! We’re not running as root, which means we can only see output for our own processes. Some information about a process, such as its current directory, its root directory, the location of its executable, and its file descriptors, can only be viewed by the user running the process (or root).
You know the drill, right? We can mount /proc and run lsof. Well, not exactly: lsof will list the open files inside the container, not the files on the host.
We can, on the other hand, search for deleted files manually:
docker run -t -i --privileged -v /proc:/host_proc alpine sh
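Here’s a sketch of what that manual search looks like, assuming the host’s /proc is mounted at /host_proc as above - every entry under /host_proc/<pid>/fd is a symlink, and links to unlinked files carry a "(deleted)" suffix (the pid and fd in the last line are placeholders):
# list file descriptors whose target file no longer exists
for fd in /host_proc/[0-9]*/fd/*; do
  target=$(readlink "$fd" 2>/dev/null) || continue
  echo "$target" | grep -q '(deleted)' && echo "$fd -> $target"
done
# then check how big a suspicious one really is
ls -lhL /host_proc/<pid>/fd/<n>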
Awesome. We found that pid 991 holds a reference to a ~102GB log file. That means the file wasn’t rotated properly and filled up the disk. But why is the file descriptor pointing to /var/log? We’ll discuss that later.
Fixing the issue
That’s easy - save the log and free up the space by truncating the file the descriptor points to!
docker run -t -i --privileged -v /proc:/host_proc \
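The rest of the command isn’t shown above, but the idea is simple: from the privileged container, read the still-open file through its /host_proc/<pid>/fd/<n> symlink to save a copy, then truncate it through the same symlink to give the space back. A sketch, where the fd number and the backup path are assumptions:
# save the log first (destination must be on a volume with free space)
cat /host_proc/991/fd/4 > /host_backup/huge-container.log
# truncate the deleted-but-open file in place - the disk space is freed immediately
: > /host_proc/991/fd/4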
And voila!
df -alh /var/lib/docker
Problem fixed.
Permanent fix
Why did the issue occur in the first place? We use docker’s stock JSON File logging driver, which also rotates the files. Is that a bug?
It looks like our IT department messed up again, and set up logrotate in parallel to json logging:
$ cat /etc/logrotate.d/docker
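The actual file isn’t reproduced here, but a typical logrotate snippet for docker containers looks something like this (illustrative only, not our exact config):
/var/lib/docker/containers/*/*.log {
  daily
  rotate 7
  compress
  missingok
  notifempty
}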
So there was a race between docker’s log rotation and logrotate. Still, at worst that would mean files get rotated twice - not deleted.
The problem? logrotate moved the old log file, instead of copying it and truncating the original. That caused the docker daemon to keep writing to a file descriptor that points to a file that doesn’t exist anymore!
The fix? Disable either logrotate or the stock logging driver.
We chose to disable the stock driver, but the problem wasn’t fixed just yet - the daemon can still leak file descriptors. That’s where copytruncate comes in:
Truncate the original log file to zero size in place after creating a copy, instead of moving the old log file and optionally creating a new one. It can be used when some program cannot be told to close its logfile and thus might continue writing (appending) to the previous log file forever. Note that there is a very small time slice between copying the file and truncating it, so some logging data might be lost. When this option is used, the create option will have no effect, as the old log file stays in place.
All we had to do was add a copytruncate line to our configuration, and the issue was resolved. Tada!
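For completeness, the resulting configuration looks roughly like the illustrative snippet above with one extra line:
/var/lib/docker/containers/*/*.log {
  daily
  rotate 7
  compress
  missingok
  notifempty
  copytruncate
}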