Why DTrace Makes Leopard a Must-Have Upgrade

Posted by Nick Sieger Tue, 05 Feb 2008 21:12:00 GMT

I feel like I’m actually a relative late-comer to Leopard, at least in my social circle. A lot of the folks in the Ruby community already had it installed the week after it was out, and were showing it off at RubyConf back in November. I just didn’t have a compelling reason to upgrade and disrupt my workflow at the time. Plus, mixed reports were coming out about data loss, UI nits, and other instabilities.

By the time I went out to purchase, 10.5.1 was already the version boxed in the stores, and in retrospect, it seemed worth the wait. I haven’t had a single complaint or major issue with the upgrade so far, and have been enjoying the noticeable zippiness of a freshly-installed system.

Time Machine has been a widely-publicized feature, and has been touted as one of the top reasons to upgrade. So I bought a small portable drive with some leftover holiday gift cards and set out to try it. Initially it seemed promising, except after a day or two of backups the process would stall out during the “preparing” stage. Eventually I noticed that the TM background process, backupd, was eating up 0.5GB of memory and up to 100% of one of the CPUs.

If I wasn’t a nerd making my living having my way with computers, I probably would have given up on Time Machine at this point, after a couple of hours scouring Google and the Apple discussion boards searching for similar problems. But I knew that backupd had to be doing something pathological, and I was compelled to find out what.

On Solaris systems, truss is usually the order of the day for problems like this. It literally vomits an endless listing of system calls invoked by a process into your terminal window. Except there’s no truss on OS X. Is there a replacement? Google mentioned ktrace, present on Tiger systems and earlier, but it’s gone in Leopard. Replaced by? DTrace.

Ahhh, DTrace! Another geeky Leopard-only feature. Certainly DTrace will be able to trace system calls in the same manner as truss. But being a complete DTrace newb, I had no idea where to start. So, like any lazy programmer does, I started shopping around for examples to get me started. Looking around, this article on MacTech looked promising, but didn’t have what I needed. Eventually, I ended up finding the DTrace Toolkit on the OpenSolaris site.

The DTrace Toolkit appears to be your one-stop shop for all things DTrace. If you need a kick-start reason to take a look at DTrace and get you going, this is it. In my case, lo and behold, one of the scripts included in the toolkit is called dtruss!

Many of the scripts in the toolkit are tailored towards a Solaris system, and dtruss is a prime example. It won’t quite work out of the box on Leopard, because a few of the system calls mentioned in the script are non-existent there. Changing the shebang line at the top of the script to #!/bin/bash, and repeatedly running it a few times with sudo ./dtruss -p <pid> will give you an idea of which ones; I simply commented these out until I was successfully able to trace a process.

Now, finally we can pop the stack back to my original problem with Time Machine and backupd. I launched a backup run and waited for the process to start consuming large amounts of CPU and memory. I located the PID of the process in Activity Monitor, and started tracing it with my modified dtruss script. And, sure enough, I saw the following output scrolling by endlessly in my terminal:

mmap(0x0, 0x5000, 0x3)     = 958464 0
getdirentriesattr(..., ..., ...)    = ....
munmap(0xEA000, 0x5000)    = 0 0
getxattr(..., ..., 0x0)     = -1 Err#93
getxattr(..., ..., 0x0)     = -1 Err#93
getxattr(..., ..., 0x0)     = -1 Err#93
getxattr(..., ..., 0x0)     = -1 Err#93
getxattr(..., ..., 0x0)     = -1 Err#93
getxattr(..., ..., 0x0)     = -1 Err#93
getxattr(..., ..., 0x0)     = -1 Err#93
getxattr(..., ..., 0x0)     = -1 Err#93
getxattr(..., ..., 0x0)     = -1 Err#93
getxattr(..., ..., 0x0)     = -1 Err#93
getxattr(..., ..., 0x0)     = -1 Err#93
getxattr(..., ..., 0x0)     = -1 Err#93
getxattr(..., ..., 0x0)     = -1 Err#93

(The ellipsis were actual memory addresses, I didn’t save the output.) What was interesting is that the same chunk of memory (the first argument to getxattr) was floating by repeatedly. Looking at the man page for getxattr, the signature is:

ssize_t
getxattr(const char *path, const char *name, void *value,
         size_t size, u_int32_t position, int options);

So, the first argument contains the path. Now, how can I get the contents of that memory address? The answer is inside dtruss. Closer to the top of the script is this DTrace code:

/* print 3 args, arg0 as a string */
syscall::stat*:return, 
syscall::lstat*:return, 
syscall::open*:return
/* not on leopard -- syscall::resolvepath:return */
/self->start/
{
    /* calculate elapsed time */
    this->elapsed = timestamp - self->start;
    self->start = 0;
    this->cpu = vtimestamp - self->vstart;
    self->vstart = 0;
    self->code = errno == 0 ? "" : "Err#";

    /* print optional fields */
    OPT_printid  ? printf("%6d/%d:  ", pid, tid) : 1;
    OPT_relative ? printf("%8d ", vtimestamp/1000) : 1;
    OPT_elapsed  ? printf("%7d ", this->elapsed/1000) : 1;
    OPT_cpu      ? printf("%6d ", this->cpu/1000) : 1;

    /* print main data */
    printf("%s(\"%S\", 0x%X, 0x%X)\t\t = %d %s%d\n", probefunc,
        copyinstr(self->arg0), self->arg1, self->arg2, (int)arg0,
        self->code, (int)errno);
    OPT_stack ? ustack()    : 1;
    OPT_stack ? trace("\n") : 1;
    self->arg0 = 0;
    self->arg1 = 0;
    self->arg2 = 0;
}

I only had to add syscall::getxattr:return to the list of matched probes, and now I could finally inspect the path argument to getxattr:

munmap(0xEA000, 0x5000)      = 0 0
getxattr("/Users/nicksieger/Library/Mail/IMAP-nicksieger@gmail.com
@imap.gmail.com/[Gmail]/All Mail.imapmbox/Messages/1002680.emlx\0",
0x967288D4, 0x0)         = -1 Err#93
getxattr("/Users/nicksieger/Library/Mail/IMAP-nicksieger@gmail.com
@imap.gmail.com/[Gmail]/All Mail.imapmbox/Messages/1002681.emlx\0",
0x967288D4, 0x0)         = -1 Err#93
getxattr("/Users/nicksieger/Library/Mail/IMAP-nicksieger@gmail.com
@imap.gmail.com/[Gmail]/All Mail.imapmbox/Messages/1002682.emlx\0",
0x967288D4, 0x0)         = -1 Err#93
getxattr("/Users/nicksieger/Library/Mail/IMAP-nicksieger@gmail.com
@imap.gmail.com/[Gmail]/All Mail.imapmbox/Messages/1002683.emlx\0",
0x967288D4, 0x0)         = -1 Err#93
getxattr("/Users/nicksieger/Library/Mail/IMAP-nicksieger@gmail.com
@imap.gmail.com/[Gmail]/All Mail.imapmbox/Messages/1002684.emlx\0",
0x967288D4, 0x0)         = -1 Err#93
getxattr("/Users/nicksieger/Library/Mail/IMAP-nicksieger@gmail.com
@imap.gmail.com/[Gmail]/All Mail.imapmbox/Messages/1002685.emlx\0",
0x967288D4, 0x0)         = -1 Err#93
getxattr("/Users/nicksieger/Library/Mail/IMAP-nicksieger@gmail.com
@imap.gmail.com/[Gmail]/All Mail.imapmbox/Messages/1002686.emlx\0",
0x967288D4, 0x0)         = -1 Err#93
getxattr("/Users/nicksieger/Library/Mail/IMAP-nicksieger@gmail.com
@imap.gmail.com/[Gmail]/All Mail.imapmbox/Messages/1002687.emlx\0",
0x967288D4, 0x0)         = -1 Err#93

D’oh! GMail! That directory has so many files, it took over two minutes just for ls to list all 487356 of them. Hundreds of thousands of email messages, all being re-inspected by Time Machine every time some new messages are added to the directory. I’ll leave it to someone else to point fingers at what the actual problem is here (a side-effect of TM’s usage of hard links? Mail.app inefficiently storing too many messages in a single directory?), but after all this I just decided that I didn’t want a backup of my GMail messages since they’re stored on the server. So I added the directory to the list of excluded directories in TM, wiped my backup, and started over. (TM had similar problems trying to complete an incremental backup with the existing backed-up copy of my mail on the backup disk, so I decided to wipe it and start fresh.) I won’t declare the problem completely solved yet, but if it happens again, I’ll just repeat this process to find the new culprit. Hopefully I don’t end up excluding my entire home directory!

This whole process was a revelation to me -- the fact that I could pinpoint the exact problem in a piece of system software despite having few notions of the internals of that software. The next time I have a nagging issue in Leopard, DTrace will be my tool of choice in tracking it down. Let’s just hope it’s not a problem with iTunes!

Tags , ,  | 9 comments

Comments

  1. Avatar Luke Francl said about 1 hour later:

    Wow, that’s cool.

    I’m Time Machine has been great for me at work (I don’t store my mail on my computer ;) and I’m looking forward to upgrading at home.

  2. Avatar pete4120@muchomail.com said about 5 hours later:

    Did you really say that DTrace is a Leopard-only feature? That’s equivalent to saying that snow skiing can only be done in Dubai. Hint: other places have had ski slopes a long time before the desert.

  3. Avatar Lyle said about 17 hours later:

    Thanks for the intro to DTrace---it’s one of the things I’m looking forward to trying out.

    I think I’m finally going to be able to upgrade to Leopard this weekend, now that SuperDuper! (my backup program of choice) is finally ready for Leopard. The more I hear about how Time Machine works (or doesn’t work), the less enthused I am about trying it. Best case, it would be a complementary solution to cloning the drive with SuperDuper!

  4. Avatar vasi said 2 days later:

    Just in case you haven’t noticed, Leopard comes with a /usr/bin/dtruss. It also has a number of other dtrace tools included, you can get an idea of what’s available by looking in /usr/share/examples/DTTk.

  5. Avatar Mike said 2 days later:

    There is also an application that comes with Leopard called Instruments which is a GUI wrapper around DTrace.

  6. Avatar mikey said 3 days later:

    The way Time Machine works is to ask for filesystem changes from the FSEvents framework, which only records the parent directory when something in it changes, and relies on apps to figure out what in that directory changed. So yeah, every time an email is added to that dir as a new file, an FSEvents-using program like Time Machine needs to rescan the whole directory.

    It’s actually a reasonable idea for FSEvents, for efficiency’s sake, to drop all but the parent dirs from the list of changed files...but it can really hurt when there’s a pathological case like this; and Apple encourages applications to try to store files in a tree (so it would be /whatever/gmail-account/2008/January/03/ ... in email especially, after about a month, a directory probably never changes again). Maybe this is already sorted out for new Mail.app accounts, and they kept the old format for pre-10.5 installs?

  7. Avatar Reinder Verlinde said 3 days later:

    “I’ll leave it to someone else to point fingers at what the actual problem is here.”

    The problem probably is that Time Machine uses FSEvents http://developer.apple.com/documentation/MacOSX/Conceptual/OSX_Technology_Overview/SystemTechnology/chapter_3_section_5.html#//apple_ref/doc/uid/TP40001067-CH207-SW4 to determine what to backup. That mechanism only gives it information about directories that changed. Time Machine then must scan those directories to check for changed/added/deleted files. With thousands of entries in a directory, that takes time.

  8. Avatar AC said 3 days later:

    Duh, this could have been done with strace without any hassle

  9. Avatar Fred said 3 days later:

    I don’t recommend upgrading to Leopard.

    I tried Leopard for 1 month, and I had to roll back to Tiger...

    Lots of problems with me. (macbook core2duo)

    There are a lot of bugs, many of them, including Mail.app and Finder.

    I dont’ use Time Machine. I just rsync, a lot faster.

    Wait for 10.5.2 to come out this month.