|
Free Software programmer
This blog existed before my current employment, and obviously reflects my own opinions and not theirs. This work is licensed under a Creative Commons Attribution 2.1 Australia License.
|
Fri, 27 Jun 2008

Linux Foundation's Device Driver Statement

Someone noted that I didn't sign the LF "proprietary modules are bad" statement. This is entirely due to my slackness and not any lack of support. As kernel module maintainer I feel obliged to maintain the status quo with proprietary modules, but I have noticed many colleagues becoming more annoyed about them.

[/tech] permanent link

Mon, 16 Jun 2008

Selling the Farm

With Alli's high-risk pregnancy, we're selling the farm and moving back to an apartment in Adelaide (where both our families are). She's moved across already; I'm staying to take care of the farm until it sells. Unfortunately farms do not sell quickly, but it gives me time to have everyone visit (again). And now we're selling, there's less chance you'll be asked to do random tree planting or similar chores. When Richard Guy Briggs visited last year he took some great photos, and now seems a good chance to link to them.

[/self] permanent link

Thu, 12 Jun 2008

stop_machine latency

Kathy Staples and I wrote a little program to measure the latency on every CPU on a machine. It sets CPU affinity and high priority (SCHED_FIFO, prio 50) for each thread, then spins doing gettimeofday() for a given duration. The maximum gap in gettimeofday() is reported for each CPU.

I tested this on an old 18-way Power4 box sitting around the lab: CPU 0 is used for the parent process, and the latency is measured on the other CPUs. This was run 100 times. Then a variant which did an insmod system call on CPU 0 was used (this calls stop_machine, which is what we were trying to measure).

The results are interesting and a little surprising. Normal max latency is around 35 usec, with stop_machine increasing it into the 100 usec range. There's obviously something running periodically on CPU 2: for both runs I had to remove one horrific 150ms latency result (1000 times average!) but there's still a noticeable spike there. I suspect CPU 1 is low because CPU 0 is mainly idle (same core).

But more concerning is that latency seems to go up with higher CPU numbers, whereas I expected it to be worst on lower CPUs. We launch stop_machine threads in CPU order, so I expected the lower CPUs to wait the longest.

We're running modprobe on CPU 0, which means the stop_machine control thread runs there, too. It loops through creating 17 other threads: as CPU 0 is busy, each new thread gets scheduled on a different idle CPU.
The first thing each thread does is try to move itself to its proper CPU. I suspect what is happening is that we're creating the 17 threads fast enough that they all end up queued on the migration queue for CPU 0 at once: this queueing uses "list_add" not "list_add_tail", so they are in fact deployed by the migration thread in reverse-CPU order.

My simplified version of stop_machine is more intelligent: it moves all the threads to their correct CPUs before waking them all up. This should solve this problem as well as reducing overall latency.

[/tech] permanent link

Fri, 16 May 2008

Tuning VirtIO and virtio_net: part I

One premise of virtio is that we should be as fast as reasonably possible. While there's nothing which should make us slow, that's not the same as actually being fast. So this week, I've been doing some simple benchmarks on my patch queue, which includes major changes to accelerate the tap device and allow async packet sends.

I've been using lguest rather than kvm because it's far more hackable, and my test has been a 1GB (1024x1024x1024 byte) TCP send using netcat. And host->guest results were awful: instead of the current 12 seconds it was taking 70 seconds to receive 1GB. So I started breaking that down.

The first thing I found was that simply allocating large receive buffers (of which only 1500 bytes are used) is expensive. Just this change alone takes the time from 12 seconds to 29, and there are two reasons for this so far.

The first is that each 1500 byte packet takes two descriptors (we have a header containing metadata), whereas a fully populated paged skb takes 2 + 65536/PAGE_SIZE + 2 == 20 descriptors. That means we only fit 6 large packets in lguest's 128-descriptor ring, vs 64 for the small packet case. Increasing lguest's rings to 1024 drops the time from 29 to 25: not as much as you'd expect. Increasing it further has marginal effect (logically, we should see equivalence at 1280 descriptors, but it has to be a power of 2).
The second reason is that alloc_page is quite slow. A simple cache of allocated pages drops the time from 25 to 19 seconds.

But we're still 50% slower than allocating 1500-byte receive buffers, and today's task is to figure out why. It seems unlikely that the increased overhead of skb_to_sgvec, get_buf and add_buf would account for it. Cache effects also seem unlikely: 1024 descriptors are still only 8k. It's unfortunate that oprofile doesn't work inside lguest guests, so this will be old school.

If the overhead really is inherent in large descriptors, we have several options. The obvious one is to add a separate "large buffer" queue, or allow mixing buffer sizes and expect the other end to try to forage for the minimal sized one. Both require a change to the server side. We can add a feature bit for backwards-compat, but that's always a last resort. Another option is to try for multi-page allocations for our skbs: as they're physically contiguous they'll use fewer descriptors.

[/tech] permanent link

Tue, 22 Apr 2008

Austin, TX

Arrived for the virtualization mini-summit (alongside the Linux Foundation Collaboration Summit) the week before last, and stayed around because much of IBM's kvm work is done here. Much hacking, but I should have blogged about my travel plans sooner. I leave on Friday for San Jose (on the "Nerd bird" I'm told) for the weekend before I fly back home, but if anyone wants to catch up, send mail...

[/self] permanent link

Mon, 07 Apr 2008

C inline functions not in headers

I just appreciated an interesting side-effect of slapping "inline" on static functions within .c files. You don't get a warning when they become unused. This breaks my normal method for code cleanup (in this case, the tun driver).

So unless you have evidence otherwise, please trust the compiler to inline static functions appropriately and don't label them inline. (And remember: inline is the register keyword for the 21st century.)
[/tech] permanent link

Sat, 05 Apr 2008

Hard To Misuse Commentary

Since my blogfu doesn't extend to comments, I recommend the thoughtful
comments found on my recent 'Hard to Misuse' posts at LWN: firstly the
'How Do I Make This Hard to Misuse?' commentary and then the
'What If I Don't Actually Like My Users?' commentary.
[/tech] permanent link |