Wednesday, June 22, 2011

Critical Sections and Efficiency

Oops, I made an error when I wrote this post. The mathematical formula I'm using to make all them pretty graphs doesn't actually apply to critical sections.

Here's a followup posting with the correct graphs for critical sections!

It only applies to code that must run in parallel and then must run sequentially, where one step must finish completely before the other can begin. In our moving example below, it would be like this: step 1 is moving the furniture into the van (highly parallelizable), step 2 is driving the single van to the new location, and step 3 is moving the furniture into the new house. Steps 1 and 3 are parallel and step 2 is sequential (only one van and one driver; everyone else is just riding along in the van or taking their own cars or something. Maybe they teleported.) So, in conclusion, I'm still trying to understand the math for critical sections and I'll write something on that later.

Modern computers tend to have more than one processor core. Each one of those processor cores is like an individual computer brain. This has led to something of a crisis in computing circles because figuring out how to make use of all those cores is hard. A big chunk of the problem is splitting up a computing task into pieces that can be done independently. For example, moving house is easy to parallelize because every person who shows up to help can grab a different box. Walking through doors is harder to parallelize because if all your moving buddies walked through the same door at the same time they'd bump into each other. That would be funny but not very productive.

In the computer world, whenever you parallelize a task you end up with a bit of that task that must be done sequentially (one at a time). If people are helping you move they can each carry a box, but they can't all go through the door at the same time. In the computing world, things like doors are called critical sections. A critical section is a piece of a task that cannot be done in parallel.
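If you wanted to see what a critical section looks like in actual code, here's a minimal Python sketch. The box-moving names are just made up for this post; the lock plays the role of the doorway:

```python
import threading

# A minimal sketch of a critical section: many threads each do
# independent work (their own "box"), but one shared step (the "door")
# is protected by a lock so only one thread passes through at a time.

boxes_moved = 0
door = threading.Lock()  # the critical section's doorway

def move_box():
    global boxes_moved
    # ...parallel part: each thread carries its own box...
    with door:           # critical section: one thread at a time
        boxes_moved += 1

threads = [threading.Thread(target=move_box) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(boxes_moved)  # 8 — the lock kept the increments from colliding
```

Without the lock, two threads could read the same old value of `boxes_moved` and one increment would get lost; the lock is what makes that little update sequential.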

Ever since working on a project to improve the performance of InteleViewer's image decompression on multiple processor machines, I've been wondering how critical sections affect performance. InteleViewer will often open a stack of something like 500 images and each image needs to be decompressed before it can be displayed. This is an easy task to split up since you can just give each core a different image to decompress. The thing is, there's a very tiny amount of task co-ordination that needs to be done in a critical section. When we ran the code on a two core computer we noticed that the cores were only being used at 98% efficiency.

First of all, that's an incredible loss in efficiency given that the critical section was doing practically nothing, and secondly, how much will the CPUs be idle on computers with something like 12 or 16 cores? In our moving day analogy it's roughly analogous to wondering how much time will be wasted while people wait for each other to walk through the door if a large number of people are helping you move.

Luckily I recently stumbled across Amdahl's law. Amdahl's law isn't a new courtroom drama series premiering this fall on Friday nights at 8pm (9 central). It's a mathematical model that deals with the efficiency of parallel tasks running in the presence of critical sections. I took the formula and produced a graph to see what was going on.
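The formula itself is simple enough to sketch in a few lines of Python. If a fraction p of the task can run in parallel, Amdahl's law says the best possible speedup on n cores is 1 / ((1 - p) + p / n). The core counts below are just sample points I picked:

```python
# Amdahl's law: if fraction p of a task can run in parallel,
# the best possible speedup on n cores is 1 / ((1 - p) + p / n).

def speedup(p, n):
    """Maximum speedup on n cores when fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)

for p in (1.00, 0.99, 0.50):  # fully parallel, 99% parallel, 50% parallel
    row = [round(speedup(p, n), 2) for n in (2, 4, 8, 16)]
    print(f"{p:.0%} parallel: {row}")
```

The 100% parallel case scales perfectly (2x, 4x, 8x, 16x), while the 50% parallel case creeps toward a 2x ceiling no matter how many cores you add.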

This chart has the number of cores along the X axis and the factor speedup along the Y axis. The lines represent how much of the task can run in parallel. The blue line represents the ideal of a task where 100% of it can run in parallel. The orange line represents a task where 99% of it can run in parallel and the light blue line at the bottom represents a task where only 50% of it can run in parallel.

(To go back to our moving analogy, the orange line is a house with a door and the light blue line is a house with a door and a very long, narrow hallway that you have to pass through first. The dark blue line is a house with no door... which would be kinda weird in real life but perfectly safe in a world populated only by math.)

(Click to enlarge, higher is faster)

Notice how the light blue line at the bottom never quite reaches the 2X speedup even when you put up to 16 cores on the problem.

I found it surprising how inefficient things get even when 90% of the task is able to run in parallel. At around six cores there's almost no point in adding more cores. They don't really speed up things that much.

I also have a graph of the % efficiency. This graph tells you how busy each processor core would be if you gave it some task and some number of cores.
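The efficiency numbers come from dividing the Amdahl speedup by the number of cores. Here's a small Python sketch (the core counts are arbitrary sample points):

```python
# Per-core efficiency under Amdahl's law: speedup divided by the number
# of cores, i.e. how busy each core is on average.

def efficiency(p, n):
    """Average utilization of each of n cores when fraction p is parallel."""
    speedup = 1.0 / ((1.0 - p) + p / n)
    return speedup / n

# Even a 99%-parallel task leaves cores mostly idle at high core counts:
for n in (2, 16, 100):
    print(f"{n:>3} cores: {efficiency(0.99, n):.1%} busy")
```

At 100 cores the 99%-parallel task comes out to roughly 50% efficiency, which is the depressing number the 100-processor graph below shows.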

(Click to enlarge)

Here's what the graph looks like up to 100 processors:

(Click to enlarge)

Notice how even a task that runs in parallel 99% of the time only makes use of half the available computing power at 100 CPUs. We don't have computers with 100 cores yet, but with the number of cores on a processor doubling every 18 months or so we could have that in less than ten years. The main problem is that InteleViewer's image decompression example represents close to a best case scenario, and even there we have code that's not going to scale up to that many cores. Making programs run efficiently on multiple processor cores is hard. It's all just a tad depressing.

The people who make processors have noticed this problem too. In the latest batch of chips they are opting to not add more cores but instead to put a video processor (GPU) and other task specific hardware on the processor die... but that's another blog post.

The InteleViewer example actually does have a happy ending. We decided to switch from using critical sections to using a magical technique called compare-and-swap. It's not always possible or wise to use this technique but for InteleViewer it worked great. We ended up making our 1% overhead disappear. Well, as far as we can measure on a 12-core machine (6 cores + 2X hyperthreading), anyway.

There's no equivalent analogy I can think of for compare-and-swap in the moving example. The closest I can get is if all your moving buddies decided to run through the door regardless of who else was trying to get through. So long as collisions don't have a large time penalty, and so long as they are infrequent enough, it might be faster. I don't recommend trying it for doorways but for InteleViewer it worked great.
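For the curious, here's a rough Python sketch of the compare-and-swap pattern. Real compare-and-swap is a single hardware instruction; the lock below only simulates that atomicity so the retry loop is runnable, and the shared counter is just a made-up stand-in for the kind of "which image do I decompress next?" bookkeeping InteleViewer does:

```python
import threading

# Sketch of the compare-and-swap (CAS) retry-loop pattern. Hardware CAS
# is a single atomic instruction; the lock here merely simulates that
# atomicity so the example runs as plain Python.

class SimulatedAtomicInt:
    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def compare_and_swap(self, expected, new):
        """Set value to `new` only if it still equals `expected`.
        Returns True on success, False if another thread got there first."""
        with self._lock:
            if self.value == expected:
                self.value = new
                return True
            return False

def claim_next_image(counter):
    """Retry loop: run through the door, and if you collided with
    someone (the CAS failed), just back up and try again."""
    while True:
        current = counter.value
        if counter.compare_and_swap(current, current + 1):
            return current  # we claimed image number `current`

counter = SimulatedAtomicInt()
claimed = [claim_next_image(counter) for _ in range(3)]
print(claimed)  # [0, 1, 2] — each call claims a distinct image index
```

The trick is that there's no waiting at the door: a thread only loops again if another thread actually modified the counter between its read and its write, and when that's rare the overhead is close to zero.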

So, in conclusion, buy InteleViewer.