Timers: getrusage weird result in 3.1 (regression)
akovalenko
Status: Curious
Joined: 11 Dec 2010
Posts: 5
Hi, Liquorix kernel maintainers!

A problem with timers appeared in upstream kernel 3.1, and a patch for it is pending for 3.2: [link]. The issue is hard to see from userland unless you use 64-bit SBCL, which crashes when getrusage returns a bogus tv_sec in ru_stime. That is what happens on 3.1 when the 64-bit representation of an internal counter is "corrected" into a negative value, resulting in an overflow.

Another manifestation of the same problem shows up with the test program below: if you start it under strace (or attach strace to it while it's running), it starts detecting bogus ru_stime.tv_sec values on 3.1 kernels.

:: Code ::

#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(void)
{
  struct rusage usage;
  int ret;

  while (1) {
    /* Poison the struct so that fields the kernel fails to fill stand out. */
    memset(&usage, 0xFF, sizeof(usage));
    ret = getrusage(RUSAGE_SELF, &usage);
    /* Report a failed call or any time outside 0 <= sec < 2^31, 0 <= usec < 10^6. */
    if ((ret < 0) ||
        (usage.ru_utime.tv_sec < 0) ||
        (usage.ru_stime.tv_sec < 0) ||
        (usage.ru_utime.tv_sec >= (1UL << 31)) ||
        (usage.ru_stime.tv_sec >= (1UL << 31)) ||
        (usage.ru_utime.tv_usec < 0) ||
        (usage.ru_stime.tv_usec < 0) ||
        (usage.ru_utime.tv_usec >= 1000000) ||
        (usage.ru_stime.tv_usec >= 1000000)) {
      /* time_t and suseconds_t have no printf specifier of their own: cast to long. */
      printf("Unexpected rusage: %ld.%ld and %ld.%ld => %d\n",
             (long)usage.ru_utime.tv_sec,
             (long)usage.ru_utime.tv_usec,
             (long)usage.ru_stime.tv_sec,
             (long)usage.ru_stime.tv_usec,
             ret);
    }
  }
  return 0;
}


I've applied the patch to recent Liquorix kernel sources, rebuilt, and verified that the problem does disappear.

As I understand it, the patch will eventually get into the upstream kernel and, consequently, into the Liquorix kernel without extra attention from you. It's not an urgent issue for me anymore, as I have already built a patched kernel :-) However, if there are any Liquorix updates before the patch gets upstream, I'd find it beneficial to see it applied "ahead of time".
damentz
Status: Assistant
Joined: 09 Sep 2008
Posts: 1122
You mentioned this is hard to detect from userland. If I were to look for an anomaly with the timers, what kinds of applications would it affect most?
akovalenko
Status: Curious
Joined: 11 Dec 2010
Posts: 5
It's basically everything that uses SBCL (an implementation of Common Lisp) on a multicore 64-bit system (btw, one possible workaround is pinning the process to a particular CPU by setting its CPU affinity).
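For anyone who wants to try the affinity workaround from inside a program, here is a minimal sketch using the Linux-specific sched_getaffinity/sched_setaffinity calls. The function name pin_to_first_allowed_cpu is made up for this example, and I'm only assuming that restricting a process to one CPU sidesteps the problem as described above; I haven't verified that beyond my own use.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Hypothetical helper: pin the calling thread to the first CPU it is
   currently allowed to run on. Returns that CPU number, or -1 on error. */
static int pin_to_first_allowed_cpu(void)
{
  cpu_set_t allowed, one;
  int cpu;

  /* pid 0 means "the calling thread". */
  if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
    return -1;

  for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
    if (CPU_ISSET(cpu, &allowed)) {
      CPU_ZERO(&one);
      CPU_SET(cpu, &one);
      if (sched_setaffinity(0, sizeof(one), &one) != 0)
        return -1;
      return cpu;
    }
  }
  return -1; /* empty affinity mask: should not happen */
}
```

Call it early in main(), before any threads are created, so that threads started later inherit the mask. For an unmodified binary, starting it through taskset (for example `taskset -c 0 sbcl`) should have the same effect.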

The following facts make it hard to reproduce: (1) Debian is not that great at providing prepackaged Common Lisp software, (2) CL programmers are not that great at supporting system-wide installations and communicating with distro maintainers, (3) the problem itself is not something that happens reliably, and it requires some GC-intensive and CPU-intensive activity.

Due to (1) and (2), it would be hard for anyone outside the CL world to be sure that they are observing the kernel problem and not some random result of broken packaging & broken code. It was very noticeable on my desktop only because I'm constantly working with SBCL-based stuff, so I'm probably the only user of the Liquorix kernel who has run into it.

The interaction of strace (ptrace call, actually) with the test program I posted is what happens reliably, and it highlights a difference between 3.0 and 3.1 kernels. However, this example doesn't show how it can be harmful (and it's not "unavoidably harmful" in theory: if SBCL didn't check type/range of *rusage field values, it wouldn't cause crashes, just some negative times reported during profiling).
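To illustrate the "not unavoidably harmful" point, here is a rough sketch of how a consumer of getrusage could range-check the values and skip a bad sample instead of crashing. This is not SBCL's code; the helper names timeval_is_sane and system_time_usec are invented for the example, and the 2^31 upper bound is simply the same limit my test program uses.

```c
#include <sys/time.h>
#include <sys/resource.h>

/* Hypothetical helper: 1 if tv looks like a plausible CPU time, else 0. */
static int timeval_is_sane(const struct timeval *tv)
{
  return tv->tv_sec >= 0 &&
         (long long)tv->tv_sec < (1LL << 31) &&
         tv->tv_usec >= 0 &&
         tv->tv_usec < 1000000;
}

/* Hypothetical helper: system CPU time of this process in microseconds,
   or -1 if getrusage failed or returned an out-of-range ru_stime. */
static long long system_time_usec(void)
{
  struct rusage usage;

  if (getrusage(RUSAGE_SELF, &usage) != 0 ||
      !timeval_is_sane(&usage.ru_stime))
    return -1;

  return (long long)usage.ru_stime.tv_sec * 1000000LL +
         (long long)usage.ru_stime.tv_usec;
}
```

A profiler built this way would just drop the samples where system_time_usec() returns -1, which hides the kernel bug rather than fixing it.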

A side note: I posted here a year ago about a problem with FPU contexts (and you applied the patch that was available at the time, too). The environment where I discovered that bug was even more unusual, perhaps by an order of magnitude: an experimental version of Wine with my experimental fork of the SBCL codebase running on it, plus a very experimental patch of mine directly related to FPU exception handling. Thus I had a very hard time believing that I was running into a real problem in the kernel: I didn't dare to report it until I had reproduced it in a screenful of C code running directly on Linux. We have roughly the same situation here, but fortunately no Wine (and nothing experimental) is involved now. OTOH, the C program I came up with this time doesn't make the problem obvious; it only demonstrates the difference.
damentz
Status: Assistant
Joined: 09 Sep 2008
Posts: 1122
Ok, I committed it to zen sources - you'll see this patch integrated in the next package.

You can also look here if you're curious: git.zen-kernel.org/zen-stable/commit/?id=7feb298e78335929498a94bccaff39ba69b1542f