A Mystery Explained
Peter McCurdy, 29 June 2008
Back in the day, I worked on a Linux system that ran a version 2.4 kernel and was commonly run on VMware. We had to pay fairly close attention to the system time, and were continually confounded by the fact that time would occasionally go backwards, a situation that occurred much more frequently on VMware. These weren’t small jumps either; they were sometimes 200ms or more. At one point we even had to turn off the log messages that reported this problem, as they were simply logged too frequently on VMware to be useful.
I recently ran across an article from VMware titled Timekeeping in VMware Virtual Machines, which explains the cause of the problem. Apparently it came down to the 2.4 Linux kernel using a poor algorithm for interpolating the system time in between timer interrupts, which would occasionally overestimate the elapsed time, and then reset itself backwards when a timer interrupt arrived. Life is somewhat better on Linux 2.6, though the paper only covers up to version 2.6.8.1. It would be nice to have an updated version of the paper, as Linux timer handling has changed fairly drastically in later releases of 2.6.
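To make the failure mode concrete, here is a toy model in Python. It is not the kernel's actual code: the names `TICK_MS` and `reported_time` and all the numbers are invented for illustration. It only assumes what the paragraph above describes, namely that the reported time is the value recorded at the last timer interrupt plus an interpolated offset, and that the offset can overshoot.

```python
TICK_MS = 10  # assumed timer period in milliseconds, for illustration only


def reported_time(base_ms: int, since_tick_estimate_ms: int) -> int:
    """Time a caller sees: the value at the last tick plus an interpolated offset."""
    return base_ms + since_tick_estimate_ms


base_ms = 1000                        # time recorded at the last timer interrupt
before = reported_time(base_ms, 13)   # the estimate overshoots the 10 ms tick by 3 ms
base_ms += TICK_MS                    # the interrupt arrives and the base advances one tick
after = reported_time(base_ms, 0)     # the interpolated offset restarts from zero
backward_step_ms = before - after     # positive means the clock appeared to go backward
```

In this sketch a caller that read the clock just before and just after the interrupt would see the second reading come out earlier than the first.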
It’s nice to know that we weren’t going crazy when we were seeing these things, and that there was no sensible way for us to fix the root cause. The solution we came up with actually worked quite well: we stopped calling gettimeofday(2) directly in our application and instead called a library function that returned a cached value. We updated this cached value fairly frequently, but checked that it never jumped backward. This ended up being a superior solution anyway, as it gave us better control over clock values in the application and avoided making tons of redundant system calls.
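For the curious, the idea can be sketched roughly as follows. This is a minimal Python illustration, not the original library: the class name `CachedClock`, its methods, and the `rejected` counter are all made up here, and `time.time` merely stands in for gettimeofday(2). The clock source is injectable so the backward jump can be simulated.

```python
import time
from typing import Callable


class CachedClock:
    """Caches a wall-clock reading and never lets the cached value move backward."""

    def __init__(self, source: Callable[[], float] = time.time) -> None:
        self._source = source     # underlying clock; time.time stands in for gettimeofday(2)
        self._cached = source()   # last accepted reading, in seconds
        self.rejected = 0         # how many backward readings were ignored

    def refresh(self) -> float:
        """Re-read the underlying clock, keeping the old value if time went backward."""
        reading = self._source()
        if reading >= self._cached:
            self._cached = reading
        else:
            self.rejected += 1
        return self._cached

    def now(self) -> float:
        """Return the cached time without touching the underlying clock."""
        return self._cached


# Simulated readings in seconds; the third one jumps back by 200 ms.
readings = iter([100.0, 100.1, 99.9, 100.2])
clock = CachedClock(source=lambda: next(readings))  # the constructor consumes 100.0
seen = [clock.refresh() for _ in range(3)]          # consumes 100.1, 99.9 and 100.2
```

Application code would call `now()` wherever it previously read the system clock, while something like a main loop or timer calls `refresh()` at whatever rate suits the application. Holding the old value during a backward jump is only one possible policy; the cached time simply stalls until the underlying clock catches up.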