Sunday, January 8, 2012

Amdahl's Law and Critical Sections

Many moons ago I wrote a blog post on critical sections and efficiency. The blog post was a discussion of Amdahl's law. Aater Suleman over at future chips pointed out that Amdahl's law doesn't model critical sections. I used the wrong mathematical model! Oops.

Well, the good news is that my main point about critical sections hurting performance when you scale up to multiple processors is still valid... but all my graphs are wrong. *heavy sigh*. Since posting those graphs, I've been trying to find a moment to write a follow-up posting. What really drove home the importance of this was I noticed that just about every concurrency related session I attended at Java One this year mentioned Amdahl's law in relation to critical sections. Doh! It looks like I am in good company about miss-applying the law.

Amdahl's law applies to systems that can be split up into multiple tasks. Each one of these tasks is dependent on the completion of the previous task. Some of these tasks are 100% parallelizable others are not.
If we took this system and ran it on a machine with 3 processors, the amount of time it would take would be the amount of time for part 1 plus the amount of time for part 2 divided by the three processors:

In this example, on a single processor system, part 2 is three times longer to compute than part 1. So if we had three processors part 2 will run 3 times faster. Part 1 would stay the same speed because it can't be parallelized. The net result is we spend half of our time in part 1 and half of our time in part 2. It also means that overall, with 3 CPUs, this task will take half the time it would if we ran it on one CPU.

A real world example of this sort of situation would be painting a house. The first step would be going to get the paint from the store. It doesn't matter how many people you have it always takes the same amount of time to get the paint. The second step would be painting the house. Painting the house is something that can be split up amongst many people and so is parallelizable. The key thing here is you still need to wait until you have the paint before painting can begin and driving out to get the paint doesn't benefit from multiple people and so is not parallelizable.

Critical sections, on the other hand, are parts of a program that can only have one thread executing them at a time. A real world example would be having your friends help you move. Having multiple people can speed up moving house by every person carrying a different box out to the moving van. The thing is, two or more people can't fit through the front door at the same time. The door acts like a critical section; only allowing one thread (or person) to use it at a time.

This task contains a tiny piece that can't be done by multiple processes at the same time. In this example that same task will be executed 3 times on three different processors:

In the diagram the tasks on CPU2 and 3 started a bit late for some unimportant reason. However, when the process running on CPU 2 gets to the critical section it doesn't need to wait because CPU 1 is already finished with the critical section. Unfortunately, the processor running on CPU 3 does have to wait for CPU 2 to finish using the critical section. That's bad luck for process running on CPU3.

Critical sections are different from the situation described by Amdalh's law because the system might run completely in parallel; two processes that wait for each other is an unlucky occurrence. This means that, for a single run, we have no idea how much time the program will take because it depends on how often two or more threads both hit the same critical section.

While searching for Amdahl's law and critical sections I found a paper that provided a mathematical model for critical sections. Stijn Eyerman, one of the co-authors of that paper, created an web page that generates the kind of graphs (factor speedup vs number of cores for critical sections) I give below. The only downside is that the page is completely unfathomable unless you've read the paper. The paper is no picnic either. So what do the graphs look like when I use the right mathematical model?

Below is a graph of the factor speedup vs number of cores. This is is for critical sections:

Notice how the impact of the critical section is small up until the point when it suddenly starts to drastically limited the amount of speedup. Conceptually speaking this is the point where there's almost always a thread in the critical section... Or as the paper puts it the execution time "is determined by the serialized execution of all the contending critical sections".

For comparison, this is the graph for Amdalh's law:

In contrast to critical sections, the system starts to show inefficiencies much sooner but degrades much more gradually.

Here's the same data as above but in terms of % efficiency as the number of cores increase. First critical sections:

Now for Amdahl's law:

Here's what the graphs looks like for up to 100 cores. First critical sections:

Now for Amdahl's law:

Here's a direct comparison between Amdalh's law and critical sections at 95% parallelizable up to 100%. The red line represents critical sections.

(If you're using the Modeling critical sections in Amdahl's law web page the parameters I'm using are sequential code: 0 and Contention Probability: 1 )

Well, I hope this has been educational.


Anonymous said...

Nice post Andrew. Assuming that Amdahl's Law applies to critical sections is basically like saying that no other threads can run when one thread is in the critical section, which is obviously false. As your data suggests, the net performance is much better than that.

I've written a similar post called Locks Aren't Slow; Lock Contention Is. I didn't use any mathematical model -- instead I took performance measurements from a real, running application. It's limited to 4 cores, since that was all I had available. You can download the source, though it's written in C++.

Taking a quick look at Eyerman's graph generator, his model seems to differ from my results for two reasons. One is that he asks you to specify the contention probability, whereas in my test suite, the threads are modeled using a Poisson process, making the contention probability implicit. The other is that I'm note sure his model takes the cost of the lock itself into account. (Haven't read the paper.)

I quite like your analogies about painting the house and moving day. Very apt!

Andrew said...

Seeing data from an real application is fascinating. I'll have to generate the same graph as you did to see if they match up.
The microbench worries me, though. Especially since I work mostly in Java where big blocks of code and get removed by the JIT.

The "contention probability" factor in Eyerman's graph generator is to model multiple different locks. It's a lock granularity factor. In a system with a single lock it's always 1. The probability of contending for a lock is modelled in the formula implicitly.

Praveen said...

Agree with Preshing, Amdahl's model is more pessimistic in approach