diary

I've moved!

06/11/2012, [ misc / ] [ link ]

All of the stuff from this site is now at jk.ozlabs.org.

If you're looking for a new RSS feed, you can find that at jk.ozlabs.org/feeds/rss/.

See you on the other side!

global namespace

26/05/2009, [ tech / linux / ] [ link ]
[jk@pingu ~]$ mkdir ~/sshfs
[jk@pingu ~]$ afuse -o mount_template="sshfs %r:/ %m" -o unmount_template="fusermount -u -z %m" ~/sshfs/
[jk@pingu ~]$ cat ~/sshfs/ozlabs.org/etc/hostname
ozlabs
[jk@pingu ~]$ 

hiprofile - HTML interactive profiler, version 1.0

20/04/2009, [ tech / linux / ] [ link ]

I've recently been doing some performance analysis work on Linux, which requires profiling various applications to see how they perform under load. One of the handy tools for this is oprofile, but its output can be a little difficult to interpret.

So, I've developed hiprofile, the HTML interactive profiler.

Hiprofile is a small python script to generate HTML reports for oprofile data. Output is broken down into per-program statistics:

Report summary

The page for each program lists the top 20 (by default) symbols within that program:

Breakdown of functions in the postgres binary

Each of the most-hit functions is shown with per-instruction profiling data, and each instruction is coloured according to how 'hot' it is. If the source code for the function is available, the output is annotated with the corresponding code:

Source & assembly listing

More info and downloads are on the hiprofile project page.

linux.conf.au hackfest: the solution, part four

08/01/2009, [ tech / cell / ] [ link ]

In the previous article in this series, we finished with a parallel SPE-based fractal renderer. This article will cover some of the major optimisations we can do to make use of the Cell Broadband Engine.

Our last benchmark gave us the following results:

Performance of multi-threaded, interleaved SPE fractal renderer
SPE threads    Running time (sec)
1              40.72
2              20.75
4              10.78
6               7.44
8               5.81

SPE vector instructions

The registers on the SPEs are actually 128 bits wide. Rather than using the entire register for one value, most of the SPE instruction set consists of vector instructions.

For example, say we wanted to add 7 to the value in register one, and place the result in register two. To do this, we'd use the ai ("add immediate") instruction:

        ai r2,r1,7

This instruction operates on the four 32-bit 'slots' of the 128-bit register (4 x 32 = 128) in parallel:

Diagram of vector addition instruction

By using these vector instructions, we can do a lot of the fractal rendering work in parallel.

Vector instructions aren't specific to the Cell Broadband Engine - Altivec (on PowerPC processors) and SSE (on Pentium processors) offer similar capabilities for parallel processing of data. In fact, the PPE on the Cell Broadband Engine supports Altivec extensions.

vectorising the fractal renderer

The hackfest task description contains a little information on how to vectorise code:

The SPEs of the Cell/B.E. have an instruction set almost entirely composed of SIMD instructions - so you can do most computations on four operands at a time.

To take advantage of these SIMD capabilities, you can use "vector" types in your code.

For example, consider a simple function to multiply a number by three:

float multiply_by_three(float f)
{
        return f * 3;
}

By vectorising this function, we can do four operations in the same amount of time:

vector float multiply_by_three(vector float f)
{
        return f * (vector float){3, 3, 3, 3};
}

The vector float type is similar to an array of 4 floats, except that we can operate on the vector as a whole.
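For example (a minimal illustration, using the same GCC vector extensions as the code above), we can initialise a vector and operate on all four slots with a single expression:

        vector float v = (vector float){1.0f, 2.0f, 3.0f, 4.0f};

        /* one add updates all four slots: {2, 4, 6, 8} */
        vector float doubled = v + v;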

If we look at the instructions generated for the scalar and vector versions of the multiply_by_three function, they're exactly the same. The following is the compiled assembly of both versions; the function argument is passed in register 3, and the return value is placed in register 3.

multiply_by_three:
        ilhu   $2,16448    ; load {3, 3, 3, 3} into register 2
        fm     $3,$3,$2    ; register 3 = (register 3) * (register 2)
        bi     $lr         ; return to caller

So, using scalar operations, we get one multiply in three instructions; using the vector equivalent, we get four multiplies in the same three instructions.

So, we should look at the hot paths of the current fractal renderer, and try to vectorise where possible.

Currently, this loop of our render_fractal function is where the SPE is doing most of the work, iterating over each pixel to find the point where it diverges to infinity:

        for (x = 0; x < params->cols; x++) {
                cr = x_min + x * params->delta;

                zr = 0;
                zi = 0;

                for (i = 0; i < params->i_max; i++)  {
                        /* z = z^2 + c */
                        tmp = zr*zr - zi*zi + cr;
                        zi =  2.0 * zr * zi + ci;
                        zr = tmp;

                        /* if abs(z) > 2.0 */
                        if (unlikely(zr*zr + zi*zi > 4.0))
                                break;
                }

                colour_map(&params->imgbuf[r * params->cols + x],
                                i, params->i_max);
        }

The renderer will execute the body of the inner loop up to rows * columns * i_max times (each pixel iterates between 1 and i_max times before diverging). Seeing as this is the most executed path in our renderer code, it is a good place to start our vectorisation efforts.

By vectorising, we can perform the contents of the loop on four pixels at a time (rather than just one), enabling us to reduce the number of iterations to around one quarter.

The new vectorised fractal generator is available in fractal.5.tar.gz. All we have changed here is the SPE-side code (spe-fractal.c). The most significant change is to the render_fractal() function, where we have replaced the above loop with:

        for (x = 0; x < params->cols; x += 4) {
                escaped_i = (vector unsigned int){0, 0, 0, 0};
                cr = x_min + spu_splats(params->delta * x);

                zr = (vector float){0, 0, 0, 0};
                zi = (vector float){0, 0, 0, 0};

                for (i = 0; i < params->i_max; i++)  {

                        /* z = z^2 + c */
                        tmp = zr*zr - zi*zi + cr;
                        zi =  (vector float){2.0f, 2.0f, 2.0f, 2.0f}
                                * zr * zi + ci;
                        zr = tmp;

                        /* escaped = abs(z) > 2 */
                        escaped = spu_cmpgt(zr * zr + zi * zi, limit);

                        /* escaped_i = escaped ? escaped_i : i */
                        escaped_i = spu_sel(spu_splats(i), escaped_i,
                                        escaped);

                        /* if all four words in escaped are non-zero,
                         * we're done */
                        if (!spu_extract(spu_orx(~escaped), 0))
                                break;
                }

                colour_map(&params->imgbuf[r * params->cols + x],
                                escaped_i, params->i_max);
        }

This new code processes four columns at each iteration, and increments x by four each time. We set up a vector of pixels (one per column) at each iteration, and do the divergence tests on a vector at a time.

Most of the actual arithmetic is easy to convert; the standard multiply, add and subtract operators work just fine with vector types.

However, the if-test to see if the number has diverged is more complex - since we're dealing with a vector of four pixels, we can't just break out of the inner loop when one pixel has diverged. Instead, we need to keep processing until all vector elements have reached the divergence point, and remember the value of i where each divergence occurs.

Basically, the above code keeps two extra variables: escaped and escaped_i. escaped is a vector of four flags - if the nth word of escaped is non-zero, we have detected that the nth pixel in our vector has diverged to infinity. escaped_i is a similar vector holding the value of i at which we first detected this divergence, which is used as the 'value' for each point in the fractal.

We do this with a few SPE intrinsics, which allow us to use some of the lower-level SPE vector functionality:

SPE intrinsics used in vectorised fractal code

spu_cmpgt(a,b) - compare greater than
    Returns a vector with all-one bits in each word slot where the
    corresponding slot of a is greater than that of b, and all-zero bits
    otherwise.

spu_sel(a,b,c) - select bits
    For each bit in c: if the bit is zero, return the corresponding bit
    from a, otherwise return the corresponding bit from b.

spu_orx(a) - or across
    Logical-or each word in a, and return the result in the preferred
    slot (ie, the first word). The remaining three slots are set to zero.

spu_extract(a,n) - extract vector element
    Return a scalar value from the nth element of vector a.

spu_splats(a) - splat scalar
    Return a vector with each word equal to a.

The first three intrinsics will result in just one instruction in the compiled code. Since slot 0 is the preferred slot (and therefore where the compiler would normally store a scalar value in a register), the call to spu_extract(x, 0) will not result in any instructions at all. The spu_splats intrinsic may require a variable number of instructions, depending on the value to splat.

More detail on the SPE intrinsics is available in the PPU & SPU C/C++ Language Extension Specification and SPU Instruction Set Architecture documents on the IBM developerWorks Cell Broadband Engine Resource Center.

We also need to change the colour_map function to take a vector of i values (rather than just a single value), and calculate the colours of four pixels at a time. For now, we can just do this one element at a time, using spu_extract, but this could be improved later.
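As a rough sketch of that element-at-a-time approach (assuming 32-bit pixel values, and a hypothetical colour_map_scalar helper holding the original scalar logic):

        static void colour_map(unsigned int *pixels,
                        vector unsigned int i_vec, unsigned int i_max)
        {
                int n;

                /* pull each i value out of the vector, and reuse the
                 * scalar colour calculation for the corresponding pixel */
                for (n = 0; n < 4; n++)
                        pixels[n] = colour_map_scalar(spu_extract(i_vec, n),
                                        i_max);
        }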

benchmarking

After vectorising the SPE fractal code, we can benchmark to see how the vectorisation affects run time. The following table shows run times for the new renderer, along with the ratio of each run time to that of the previous scalar implementation.

Performance of multi-threaded, interleaved, vectorised SPE fractal renderer
SPE threads    Running time (sec)    Vector / Scalar
1              10.70                 0.262
2               5.72                 0.372
4               3.29                 0.305
6               2.46                 0.330
8               2.06                 0.354

Not too bad! We're doing much better than our original PPE-based renderer, which took 55.7 seconds to render the same fractal.

Ideally, we'd see a constant vector / scalar figure of 0.25, meaning that we're doing four times the amount of work in the same run time. However, there are still a lot of operations that we haven't optimised yet.

More optimisations in the next article!

SPE gang scheduling policies

18/11/2008, [ tech / cell / spufs / ] [ link ]

In my previous post here, I mentioned that:

We also need to allow contexts to be loaded outside of spu_run to implement gang scheduling correctly and efficiently.

I think that may require a little explanation, so here we go:

gang scheduling policies

The idea behind SPE gang scheduling is to allow a set of related SPE contexts to be scheduled together, to allow interactions between contexts to be performed in a timely manner. For example, consider two contexts (A and B) that send mailbox messages to each other. If context A is running while context B is scheduled out, then A will spend its time-slice waiting for a message from B. If they are scheduled as a single entity, neither context will have to spend its timeslice blocked, waiting for the other context to be run.
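To illustrate the kind of interaction involved, SPE-side code along these lines (a minimal sketch, using the mailbox intrinsics from spu_mfcio.h) will simply stall if its peer is scheduled out:

        #include <spu_mfcio.h>

        /* stalls until a mailbox message arrives from the peer */
        unsigned int msg = spu_read_in_mbox();

        /* send a reply */
        spu_write_out_mbox(msg + 1);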

So, we have to come up with a policy to define the behaviour of the gang scheduler. When does the gang become schedulable? Under which conditions should the gang be descheduled?

I can see four possible approaches:

policy 1: the gang is only schedulable when all contexts are runnable

In this case, the gang is only ever scheduled when all of the gang's contexts are runnable (ie, they are being run by the spu_run system call).

Although the simplest, this approach can leave a gang unable to complete - consider the following:

  1. Context A becomes runnable
  2. Context B becomes runnable
  3. The gang is now schedulable, so both contexts are scheduled
  4. Because it has slightly less work to do, context A finishes before context B
  5. Because only one of the two contexts is runnable, the gang is no longer schedulable. Context B is never re-scheduled, so cannot complete the rest of its task

So, this policy isn't much use; perhaps we can solve this with a new approach:

policy 2: the gang is scheduled when all contexts are runnable, and descheduled when no contexts are runnable

This will solve the previous non-termination problem, in that context B will be able to terminate - the context isn't immediately descheduled when A finishes.

However, now we have a new, slightly more complex non-termination case:

  1. Context A becomes runnable
  2. Context B becomes runnable
  3. The gang is now schedulable, so both contexts are scheduled
  4. Because it has slightly less work to do, context A finishes before context B
  5. At the same time, context B does a PPE-assisted callback, which requires a stop-and-signal (and so leaves spu_run for just a moment)
  6. Because neither context is currently runnable, the gang is descheduled
  7. Context B finishes its callback, so re-enters spu_run to be re-scheduled. However, the policy does not allow context B to be re-scheduled, as only one of the two contexts is runnable.

Although this may sound like a rare occurrence, it's not a restriction we can pass on to the programmer. Imagine the following SPE code:

int main(void)
{
        do_work();
        printf("work done!\n");
        return 0;
}

Here we're doing a PPE-assisted callback (the call to printf is implemented as a callback) before finishing. If this callback were to occur when the other context has already completed, we would hit the non-termination condition above.

This means that the last-running context of a gang can never do a PPE-assisted callback. In fact, to be completely safe against this non-termination, a programmer would have to avoid callbacks after the point where any context may have finished, since a callback from any of the remaining contexts risks the gang never being re-scheduled.

So, it looks like we need to be a little more permissive when deciding if the gang is schedulable.

policy 3: the gang is scheduled when any context is runnable, and descheduled when no contexts are runnable

This is another fairly simple approach - the gang is scheduled whenever there is any work to do. We no longer have any non-termination conditions, as 'having work to do' will result in 'doing work'.

The tricky part is that it will require us to change one of the fundamental assumptions in spufs: currently, we don't schedule any context unless it is runnable. Because we schedule the entire gang when one of its contexts becomes runnable, we now have to schedule a number of non-runnable contexts.
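To make the policy concrete, here's a minimal sketch of the two scheduling decisions (nrunnable is a hypothetical per-gang count of contexts currently being run through spu_run):

        /* schedule the gang as soon as there is any work to do... */
        static int gang_schedulable(struct spu_gang *gang)
        {
                return gang->nrunnable > 0;
        }

        /* ...and deschedule it only when there is none */
        static int gang_deschedulable(struct spu_gang *gang)
        {
                return gang->nrunnable == 0;
        }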

The good news is that I've already done a little experimental work to overcome this general restriction in spufs.

The last approach is a little more complex, but works around this restriction:

policy 4: schedule the runnable contexts of a gang, and reserve SPEs for the non-runnable contexts

This is just like policy 3, but instead of actually scheduling the non-runnable contexts, we reserve an SPE for them.

This way, a non-runnable context does not need to be loaded, but can be quickly scheduled when it becomes runnable. The downside is that we're only half-implementing gang scheduling; there may still be interactions with a non-runnable context (eg, accesses to its problem-state mapping from a running context in the same gang) that will cause running contexts to block.

So, which policy is best for spufs?

Policies 1 and 2 have significant flaws in their approach. It's quite possible that either will lead to non-termination conditions in fairly simple user programs. I don't think we can 'work around' this with a restriction on the programmer.

Policy 4 will require a mechanism for reserving SPEs for a particular context; I'm not convinced the extra complexity is worth the effort, especially as this doesn't allow us to implement gang scheduling properly.

Currently, Luke Browning and André Detsch have a work-in-progress patch series for gang scheduling, based on policy 3.

external context scheduling in spufs

14/11/2008, [ tech / cell / spufs / ] [ link ]

At present, the spufs code has the invariant that a context is only ever loaded to an SPE when it is being run; ie, a thread is calling the spu_run syscall on the context.

However, there are situations where we may want to load the context without it being run. For example, using the SPU's DMA engine from the PPE requires the PPE thread to write to registers in the SPU's problem-state mapping (psmap). Faults on the psmap area can only be serviced while the context is loaded, so they will block until someone runs the context. Ideally, we could allow such accesses to the psmap without the spu_run call. We also need to allow contexts to be loaded outside of spu_run to implement gang scheduling correctly and efficiently.

So, I've been working on some experimental changes to allow "external scheduling" for SPE contexts. The "external" refers to a thread external to the SPE's usual method of scheduling (ie, its owning thread calling spu_run). In the example above, the external schedule would be caused by the fault handler for the problem-state mapping.

Although a context may be scheduled to an SPE, we still can't always guarantee forward progress. For example, in the "use the psmap to access the DMA engine" scenario, a DMA may cause a major page fault, which needs a controlling thread to service. In this case, the only way to ensure forward progress is through calling spu_run. However, I have some ideas on how we can remove this restriction later.

the interface

First up, we need to tell the spufs scheduler that we want a context to be loaded:

/*
 * Request an 'external' schedule for this context.
 *
 * The context will be either loaded to an SPU, or added to the run queue,
 * depending on SPU availability.
 *
 * Should be called with the context's state mutex locked, and the context
 * in SPU_STATE_SAVED state.
 */
int spu_request_external_schedule(struct spu_context *ctx);

After loading the context with spu_request_external_schedule, we need a way to tell the scheduler that the context can be de-scheduled:

/*
 * The context should be unscheduled at the end of its timeslice
 */
void spu_cancel_external_schedule(struct spu_context *ctx);

These functions are implemented by incrementing or decrementing a count of "external schedulers" on the context. If multiple threads are requesting an external schedule, then the first will activate the context. When the last thread calls the cancel method, the context can be descheduled.
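A minimal sketch of how that count might work (ext_sched_count is a hypothetical field on struct spu_context; locking and error handling are omitted):

        int spu_request_external_schedule(struct spu_context *ctx)
        {
                int ret = 0;

                /* the first external scheduler activates the context:
                 * load it to an SPU, or add it to the run queue */
                if (ctx->ext_sched_count++ == 0)
                        ret = spu_activate(ctx, 0);

                return ret;
        }

        void spu_cancel_external_schedule(struct spu_context *ctx)
        {
                /* once the last external scheduler is done, the context
                 * may be descheduled at the end of its timeslice */
                ctx->ext_sched_count--;
        }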

usage

We can use these two functions to allow the problem-state mapping fault handler to proceed outside of spu_run:

--- a/arch/powerpc/platforms/cell/spufs/file.c
+++ b/arch/powerpc/platforms/cell/spufs/file.c
@@ -413,9 +413,11 @@ static int spufs_ps_fault(struct vm_area_struct *vma,

        if (ctx->state == SPU_STATE_SAVED) {
                up_read(&current->mm->mmap_sem);
+               spu_request_external_schedule(ctx);
                spu_context_nospu_trace(spufs_ps_fault__sleep, ctx);
                ret = spufs_wait(ctx->run_wq, ctx->state == SPU_STATE_LOADED);
                spu_context_trace(spufs_ps_fault__wake, ctx, ctx->spu);
+               spu_cancel_external_schedule(ctx);
                down_read(&current->mm->mmap_sem);
        } else {
                area = ctx->spu->problem_phys + ps_offs;

Note that the spu_cancel_external_schedule function doesn't unload the context right away; if it did, the refault would fail too, and we'd end up in an infinite loop of faults. Instead, it keeps the context scheduled for the rest of its timeslice. This gives the faulting thread time to access the mapping after the fault handler has been invoked.

We also need to do a bit of trickery with the priorities of contexts during external schedule operations. If a high-priority thread accesses the problem-state mapping of a low-priority context, we want the context to temporarily inherit the higher priority. To do this, we raise the priority when spu_request_external_schedule is called, and drop it back after the context has finished its timeslice on the SPU.
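Roughly, the boost could look like the following sketch (saved_prio is a hypothetical field; the actual changes may differ):

        static void spu_boost_priority(struct spu_context *ctx)
        {
                /* temporarily inherit the faulting thread's priority,
                 * if it is higher (ie, a lower prio value) than the
                 * context's own */
                if (current->prio < ctx->prio) {
                        ctx->saved_prio = ctx->prio;
                        ctx->prio = current->prio;
                }
        }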

the code

I've created a development branch in the spufs repository for these changes, which is available:

  • via git: git://git.kernel.org/pub/scm/linux/kernel/git/jk/spufs.git, in the ext-sched branch; or
  • on the browsable gitweb interface.

Note that this is an experimental codebase, expect breakages!

new patchwork beta

28/10/2008, [ tech / software / ] [ link ]
patchwork screengrab

We've had a new version of patchwork - the web-based patch-tracking system - online for a few weeks now at patchwork.ozlabs.org.

After Paul's presentation on patchwork at the 2008 Kernel Summit, there has been wider interest in patchwork setups for other projects. Patchwork originally hosted the Linux on Power and Linux on Cell lists; we've since added netdev, linux-mtd and linux-ext4. I've also added the main Linux Kernel mailing list (lkml), just to see how the new patchwork handles the load; all has been okay so far.

If you're interested in installing the new patchwork at your own site, you can grab the source from the patchwork project page. Installation can be a little tricky, so feel free to mail me if you need a hand.

2.6.26 on a Lenovo x61 thinkpad

02/09/2008, [ tech / linux / ] [ link ]

It looks like the iwl driver is slightly broken in the 2.6.26 release - connections will drop out after 10 seconds or so.

The workaround for this is to enable the config option CONFIG_IWL4965_HT.

asynchronous spu contexts, initial designs

16/07/2008, [ tech / cell / spufs / ] [ link ]

I've recently been working on some changes to the spufs code, and thought I'd write up some of the details.

At present, the spu_run syscall (used to run a SPU context) blocks until the SPU program has exited (or some other event has happened, such as a non-serviceable fault). This means that to take advantage of the SPUs, you really need to start a new thread for each SPU context that you create, otherwise your application will be sitting around waiting for each SPU context to complete.

In fact, we have an invariant in the spufs code at the moment that only contexts that are currently being spu_run will ever be runnable (and, at the moment, schedulable).

Ben H and I have been chatting about some ideas about asynchronous spu contexts. This means that the userspace app can start the context, then later retrieve the status of the SPU context (to see if it has stopped, faulted, or whatever). We can then use standard POSIX semantics like poll() to see if a context is still running or has generated any "events", then handle these events when they become available.

In effect, this is similar to spu_run: currently, the spu_run syscall runs the SPU, then blocks until an event happens, which is then returned to userspace as the return value of spu_run. The main difference is that we don't block in the kernel while the SPU is running.

So, I've been coding up an experimental change to spufs. Firstly, we have to explicitly tell the kernel that we want a context to operate in asynchronous mode, so I've added a new flag to the spu_create syscall: SPU_CREATE_ASYNC.

I've opted for a file-based interface to these asynchronous contexts - SPU events are retrieved by reading from a file. Contexts that are created with the SPU_CREATE_ASYNC flag have an extra file present (called something like "event") in their context directory in the spufs mount. Reading from this file allows applications to retrieve events that the SPU program has raised.

We need to define a format for the data read from this events file, so here's something to get started with:

struct spu_event {
	uint32_t event;
	uint32_t status;
	uint32_t npc;
};

where the event member specifies which event happened (a stop-and-signal, for example).

The status and npc members return the status of the SPU and the next program counter register, respectively. While not strictly necessary (this information is available from other files in spufs), it's very likely that the application will need these values in order to handle the event.

So, users of this interface may look something like this:

uint32_t npc = 0;
struct context {
        int fd;
        int event_fd;
} context;

/* create the context */
context.fd = spu_create("/spu/ctx", SPU_CREATE_ASYNC, 0755);

/* open the events file */
context.event_fd = openat(context.fd, "event", O_RDWR);

/* start the context running. unlike the spu_run syscall,
 * this function does not block for the duration of the
 * spu program */
run_context(&context, npc);

for (;;) {
        struct spu_event event;

        /* get the next event caused by the SPU */
        read(context.event_fd, &event, sizeof(event));

        if (event.event == SPU_EVENT_STOP)
                break;

        /* handle other event ... */
}

Note that the userspace examples here are not what we'd present to Cell application developers. They're more low-level examples of how the new asynchronous kernel interface works. In fact, the changes could be completely transparent to applications which use the libSPE interface.

This isn't far from the API provided by the current spu_run syscall, except that we're not waiting in the kernel while the SPU is running.
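Because events arrive through a file descriptor, the poll() semantics mentioned earlier fall out naturally. Here's a sketch of waiting on two contexts at once (assuming contexts ctx_a and ctx_b set up as above, and that the event file flags POLLIN when an event is pending):

        struct pollfd fds[2] = {
                { .fd = ctx_a.event_fd, .events = POLLIN },
                { .fd = ctx_b.event_fd, .events = POLLIN },
        };

        /* block until either context raises an event */
        poll(fds, 2, -1);

        if (fds[0].revents & POLLIN) {
                /* read and handle an event from context A */
        }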

Also, we're going to need to control the SPU somehow - for example, we need to implement the run_context function in the pseudocode above. Rather than overloading the spu_run syscall, I've opted to use the same event file - writes to this file will allow userspace to control the SPU. I'm still working out the exact format of these writes, but the way I've implemented it at the moment is that the application can write structures of this layout to the file:

struct spu_control {
	uint32_t op;
	char data[];
};

The contents of the data member depends on the operation requested (specified by the op member). For example, a 'start spu' operation would have four extra bytes - a uint32_t containing the NPC to start the SPU execution from. A 'stop spu' operation doesn't require any extra parameters, so the data member would be 0 bytes long.

This would allow us to implement the run_context function as follows:

void run_context(struct context *context, uint32_t npc)
{
        uint32_t buf[2];

        buf[0] = SPU_CONTROL_START_SPU;
        buf[1] = npc;

        write(context->event_fd, buf, sizeof(buf));
}
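And a matching sketch for the 'stop spu' operation described above (SPU_CONTROL_STOP_SPU being the hypothetical op value), which needs no extra data:

void stop_context(struct context *context)
{
        uint32_t op = SPU_CONTROL_STOP_SPU;

        /* an op with no data: just the uint32_t op code */
        write(context->event_fd, &op, sizeof(op));
}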

There are plenty of other issues to deal with (like signals, and debugging), but I have a basic prototype working at the moment. More to come!

debian on a qs22 cell blade

30/06/2008, [ tech / cell / ] [ link ]

Seeing as the QS22 blades are out, here's a short guide to getting debian installed.

[jk@qs22 ~]$ grep -m1 ^cpu /proc/cpuinfo
cpu             : Cell Broadband Engine, altivec supported
[jk@qs22 ~]$ lsb_release -d
Description:    Debian GNU/Linux unstable (sid)

kernel

You'll need a kernel that has support for the IBM Cell blades. If you configure your kernel with the 'cell_defconfig' target, you should have all the necessary options:

[jk@pingu linux-2.6.25]$ make cell_defconfig

Specifically, you need:

  • CONFIG_PPC_IBM_CELL_BLADE;
  • CONFIG_SERIAL_OF_PLATFORM;
  • CONFIG_FUSION_SAS;
  • CONFIG_ROOT_NFS;
  • CONFIG_IP_PNP_DHCP and
  • CONFIG_SPU_FS.

root filesystem

The QS22s have no internal disk (they're compute nodes, right?), so you'll have to either:

  • use a remote root filesystem, like NFS; or
  • add a LSI SAS adaptor to the blade, and use an external SAS disk for the root filesystem.

The installation process will be different depending on which you choose, so just skip to the appropriate section here.

NFS root

For the first option, there's a number of NFS-root howtos around. First up, we need to build the actual debian filesystem, using debootstrap. For example:

[jk@pingu ~]$ sudo debootstrap --arch=powerpc --foreign sid /srv/nfs/qs22/

This will create an entire debian filesystem in /srv/nfs/qs22. We need to make a few modifications though:

  1. add the following line to /etc/inittab:
    T0:23:respawn:/sbin/getty -L ttyS0 19200 vt100
  2. add a couple of extra device nodes:
    [jk@pingu ~]$ cd /srv/nfs/qs22/dev
    [jk@pingu dev]$ sudo mknod console c 5 1
    [jk@pingu dev]$ sudo mknod ttyS0 c 4 64

Once this is done, we need to complete the bootstrap on the QS22. Set up your NFS server, and export the appropriate directory. Boot the QS22 with the nfs root kernel options, plus "rw init=/bin/sh" (eg root=/dev/nfs nfsroot=server_ip:/srv/nfs/qs22 ip=dhcp rw init=/bin/sh). Then, once the machine has booted:

sh-3.2# PATH=/:/bin:/usr/bin:/sbin:/usr/sbin /debootstrap/debootstrap --second-stage

This should finish the bootstrap. After it has completed (it should finish with "I: Base system installed successfully"), reboot the machine with the same kernel command line, minus the rw init=/bin/sh arguments. Once it boots, you should have the debian login prompt. Login as root (there will be no password, but don't forget to set one) and away you go.

SAS disk

If you're using SAS, the install is much more straightforward, as you can just use the standard debian installer. However, you may need to use a custom kernel which supports the QS22s. This is a matter of building your own kernel, using the powerpc64 debian installer image as the initramfs:

[jk@pingu linux-2.6.25]$ wget http://ftp.us.debian.org/debian/dists/testing/main/installer-powerpc/current/images/powerpc64/netboot/initrd.gz
[jk@pingu linux-2.6.25]$ gunzip -c < initrd.gz > initrd
[jk@pingu linux-2.6.25]$ make cell_defconfig
[jk@pingu linux-2.6.25]$ sed -ie 's,^CONFIG_INITRAMFS_SOURCE=".*",CONFIG_INITRAMFS_SOURCE="'$PWD'/initrd",' .config
[jk@pingu linux-2.6.25]$ make

Then, just boot the kernel in arch/powerpc/boot/zImage.pseries. The debian installer should start, and guide you through the rest of the installation. Since you're netbooting, you can ignore any messages about not having a bootstrap partition, or not being able to install a kernel or yaboot.

software

Entirely optional, but you'll probably get the most out of your QS22 with a few extra packages:

[jk@qs22 ~]$ sudo apt-get install openssh-server libspe2-dev spu-gcc build-essential

linux.conf.au hackfest: the solution, part three

23/06/2008, [ tech / cell / ] [ link ]

In part two of this series, we had just ported a fractal renderer to the SPEs on a Cell Broadband Engine machine. Now we're going to start the optimisation...

Our baseline performance is 40.7 seconds to generate the fractal, using the sample parameters.

The initial SPE-based fractal renderer used only one SPE. However, we have more available:

Number of SPEs available in current CBEA machines
Machine                    SPEs available
Sony Playstation 3         6
IBM QS21 / QS22 blades     16 (8 per CPU)

So, we should be able to get better performance by distributing the render work between the SPEs. We can do this by dividing the fractal into a set of n strips, where n is the number of SPEs available. Then, each SPE renders its own strip of the fractal.

The following image shows the standard fractal, as would be rendered by 8 SPEs, with shading to show the division of the work amongst the SPEs.

SPE fractal divisions

In order to split up the work, we first need to tell each SPE which part of the fractal it is to render. We'll add two variables to the spe_args structure (which is passed to the SPE during program setup), to provide the total number of threads (n_threads) and the index of this SPE thread (thread_idx).

struct spe_args {
        struct fractal_params fractal;
        int n_threads;
        int thread_idx;
};

spe changes

On the SPE side, we use these parameters to alter the invocations of render_fractal. We set up another couple of convenience variables:

        rows_per_spe = args.fractal.rows / args.n_threads;
        start_row = rows_per_spe * args.thread_idx;

And just alter our for-loop to only render rows_per_spe rows, rather than the entire fractal:

        for (row = 0; row < rows_per_spe; row += rows_per_dma) {

                render_fractal(&args.fractal, start_row + row,
                                rows_per_dma);

                mfc_put(buf, ppe_buf + (start_row + row) * bytes_per_row,
                                bytes_per_row * rows_per_dma,
                                0, 0, 0);

                /* Wait for the DMA to complete */
                mfc_write_tag_mask(1 << 0);
                mfc_read_tag_status_all();
        }

ppe changes

The changes to the PPE code are fairly simple - instead of just creating one thread, create n threads.

First though, let's add a '-n' argument to the program to specify the number of threads to start:

        while ((opt = getopt(argc, argv, "p:o:n:")) != -1) {
                switch (opt) {

                /* other options omitted */

                case 'n':
                        n_threads = atoi(optarg);
                        break;

Rather than just creating one SPE thread, we create n_threads of them. Also, we have to set the thread-specific arguments for each thread:

        for (i = 0; i < n_threads; i++) {
                /* copy the fractal data into this thread's args */
                memcpy(&threads[i].args.fractal, fractal, sizeof(*fractal));

                /* set thread-specific arguments */
                threads[i].args.n_threads = n_threads;
                threads[i].args.thread_idx = i;

                threads[i].ctx = spe_context_create(0, NULL);

                spe_program_load(threads[i].ctx, &spe_fractal);

                pthread_create(&threads[i].pthread, NULL,
                                spethread_fn, &threads[i]);
        }

Now, the SPEs should be running, and generating their own slice of the fractal. We just have to wait for them all to finish:

        /* wait for the threads to finish */
        for (i = 0; i < n_threads; i++)
                pthread_join(threads[i].pthread, NULL);

If you're after the source code for the multi-threaded SPE fractal renderer, it's available in fractal.3.tar.gz.

That's it! Now we have a multi-threaded SPE application to do the fractal rendering. In theory, an n-threaded program will take 1/n of the time of a single-threaded implementation; let's see how that fares against reality...

performance

Let's compare invocations of our multi-threaded fractal renderer, with different values for the n_threads parameter.

Performance of multi-threaded SPE fractal renderer
SPE threads    Running time (sec)
1              40.72
2              30.14
4              18.84
6              12.72
8              10.89

Not too bad, but we're definitely not seeing linear scalability here; we could expect the 8 SPE case to take around 5 seconds, rather than 11.

what's slowing us down?

A little investigation into the fractal generator will show that some SPE threads are finishing long before others. This is due to the variability in calculation time between pixels. To determine that a point (ie, pixel) in the fractal does not diverge to infinity (and so gets coloured blue), we need to run the full i_max iterations in render_fractal (i_max is 10,000 in our sample fractal). Other pixels may diverge much more quickly (possibly in under 10 iterations), and so are orders of magnitude faster to calculate.

We end up with some slices that are very quick to calculate, and others that take much longer. Of course, we have to wait for the longest-running SPE thread to complete, so our overall runtime will be that of the slowest thread.

So, let's take another approach to distributing the workload. Rather than dividing the fractal into contiguous slices, we can interleave the SPE work. For example, if we were to use 2 SPE threads, then thread 0 would render all the even chunks, and thread 1 would render all the odd chunks (where a "chunk" is a set of rows that fit into a single DMA). This should even out the work between SPE threads.

Interleaved SPE fractal divisions

This is just a matter of changing the SPE for-loop a little. Rather than the current code, which divides the work into n_threads contiguous chunks:

        for (row = 0; row < rows_per_spe; row += rows_per_dma) {

                render_fractal(&args.fractal, start_row + row,
                                rows_per_dma);

                mfc_put(buf, ppe_buf + (start_row + row) * bytes_per_row,
                                bytes_per_row * rows_per_dma,
                                0, 0, 0);

                /* Wait for the DMA to complete */
                mfc_write_tag_mask(1 << 0);
                mfc_read_tag_status_all();
        }

We change this to render every n_threads'th chunk, starting from chunk thread_idx, which gives us the interleaved pattern:

        for (row = rows_per_dma * args.thread_idx;
                        row < args.fractal.rows;
                        row += rows_per_dma * args.n_threads) {

                render_fractal(&args.fractal, row,
                                rows_per_dma);

                mfc_put(buf, ppe_buf + row * bytes_per_row,
                                bytes_per_row * rows_per_dma,
                                0, 0, 0);

                /* Wait for the DMA to complete */
                mfc_write_tag_mask(1 << 0);
                mfc_read_tag_status_all();
        }

An updated renderer is available in fractal.4.tar.gz.

Making this small change gives some better performance figures:

Performance of multi-threaded, interleaved SPE fractal renderer
SPE threads    Running time (sec)
1              40.72
2              20.75
4              10.78
6               7.44
8               5.81

We're doing much better now, but we're still nowhere near the theoretical maximum performance of the SPEs. More optimisations in the next article...