From the Spring 1998 Issue of “IT Age” - DECUS
A version of this article first appeared in "Visual C++ Professional"
If a program compiles and links without error, all we can say is that the easy problems have been fixed. What remains, of course, are those subtle bugs that rise up to bite us somewhere down the track. In fact, a program could run forever without error on some system yet not perform correctly on some other system; that’s simply a factor we need to take into account when we are porting.
Consider the following scenario: A process has multiple threads executing in parallel with each thread having write access to some global integer variable. Each thread simply increments that variable by 1 as follows:
++total;
Well, that looks harmless enough; after all, this looks like an atomic operation. And on an Intel x86 or Pentium processor, it is. But what happens if we compile this code for Windows NT running on an Alpha? In fact, what happens if we run this code on Windows NT running on a multiprocessor Intel machine? The simple answer is that the behavior is now undefined; it might work in certain cases or it might not. To see why, let’s look at a fragment of some test code:
#include <windows.h>

unsigned long int total = 0;
int interlocked;                                  /* nonzero: use InterlockedIncrement */

DWORD Startup(DWORD *pnumber_loops)
{
    unsigned long int i;

    for (i = 0; i < *pnumber_loops; ++i) {
        if (interlocked) {
            InterlockedIncrement((LONG *)&total); /* InterlockedIncrement expects a signed LONG * */
        }
        else {
            ++total;
        }
    }
    return 0;
}
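The article does not show the harness that creates the threads, so here is a minimal sketch of one way the Startup routine above might be driven. The StartupThunk wrapper, the hard-coded loop count, and main itself are illustrative additions, not part of the original program:

#include <stdio.h>

/* Illustrative harness only -- not from the original article. */
static DWORD WINAPI StartupThunk(LPVOID arg)
{
    return Startup((DWORD *)arg);             /* forward to the test routine above */
}

int main(void)
{
    DWORD  loops = 1000;                      /* each thread performs 1,000 increments */
    HANDLE threads[2];
    int    t;

    interlocked = 1;                          /* 1: use InterlockedIncrement, 0: plain ++total */

    for (t = 0; t < 2; ++t)
        threads[t] = CreateThread(NULL, 0, StartupThunk, &loops, 0, NULL);

    WaitForMultipleObjects(2, threads, TRUE, INFINITE);
    CloseHandle(threads[0]);
    CloseHandle(threads[1]);

    printf("total = %lu (expected %lu)\n", total, 2UL * loops);
    return 0;
}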
The program was run using two threads and with loop counts of 1 thousand through 1 billion, going up in powers of 10. For example, if two threads each loop and update variable “total” 1,000 times, at the end, “total” should have the value 2,000, assuming it started at zero.
The following table shows the results when this program was run on a single-CPU 90 MHz Pentium. Whether or not thread synchronization was selected, the actual value of “total” exactly matched the theoretical value, as the Actual and % columns show. That is, the synchronization really was unnecessary. In fact, with synchronization enabled, the wall-clock elapsed time (shown in seconds) was nearly a factor of 3 longer!
Pentium           No Synchronization               With Synchronization
Theoretical       Actual        %     Time (s)     Actual        %     Time (s)
2000              2000          100.0     0.00     2000          100.0     0.00
20000             20000         100.0     0.00     20000         100.0     0.00
200000            200000        100.0     0.00     200000        100.0     0.00
2000000           2000000       100.0     0.00     2000000       100.0     1.00
20000000          20000000      100.0     1.00     20000000      100.0     4.00
200000000         200000000     100.0    16.00     200000000     100.0    44.00
2000000000        2000000000    100.0   162.00     2000000000    100.0   439.00
So if synchronization isn’t necessary and it slows the program down, why bother doing it? That’s a very good question, and to answer it, let’s look at the results from conducting the same experiment on an Alpha:
Alpha             No Synchronization               With Synchronization
Theoretical       Actual        %     Time (s)     Actual        %     Time (s)
2000              2000          100.0     0.00     2000          100.0     0.00
20000             20000         100.0     0.00     20000         100.0     0.00
200000            200000        100.0     0.00     200000        100.0     0.00
2000000           1807928       90.4      0.00     2000000       100.0     2.00
20000000          16149658      80.7      2.00     20000000      100.0    23.00
200000000         163335404     81.7     18.00     200000000     100.0   230.00
2000000000        1623777865    81.2    174.00     2000000000    100.0  2283.00
Now when synchronization isn’t used, things don’t look so good. For example, when 2 million updates were requested and actually done, the final value of “total” came up 9.6% short! And when 2 billion updates were requested and actually done, the final value of “total” came up 18.8% short! So the question now becomes: when might ++total not actually cause “total” to be incremented? One answer is “On a RISC machine.” And surprise, the Alpha is a RISC machine.
Let’s take a look at the code generated by the statement “++total”:
Pentium:
    inc    DWORD PTR _total

Alpha:
    ldah   t0, h^.bss(zero)
    ldl    t1, l^.bss(t0)
    addl   t1, 1, t1
    stl    t1, l^.bss(t0)
The important difference is that on the Pentium, the ++ takes only one instruction and that instruction is atomic. That is, only one thread on that processor is permitted to increment that variable at a time; the hardware takes care of that. Being a RISC machine, except for load and store, the Alpha cannot perform operations directly on memory; it must first load the operand into a register. As a result, the Alpha requires four instructions to implement ++. Consider the case in which one thread loads the value of “total” (0) into a register. Then, before it can update it and write back the value 1, another thread loads that same value (0). The first thread then runs, storing 1 back in “total”, and the second thread DOES EXACTLY THE SAME. That is, two ++ operators were executed, but “total” shows an increment of only 1. Clearly, if n threads were concurrently attempting ++total, “total” might be incremented by anywhere from 1 to n. So that’s where our counts ran off the rails. Interestingly enough, when two threads each looped 1,000, 10,000, and 100,000 times on the Alpha, the value of “total” was not corrupted. The problem only manifested itself when the same number of threads ran longer. Of course, the problem may also manifest itself with fewer iterations if more threads are run.
Based on the Alpha output, we see that synchronization makes a big difference. This synchronization is achieved by calling the 32-bit API function InterlockedIncrement. This function increments by one the value of its signed 32-bit integer argument which is passed by address. (There is a corresponding decrement by one function called InterlockedDecrement.) This function prevents more than one thread, on the same or different processors, from using the InterlockedDecrement or InterlockedIncrement function to access the same variable simultaneously.
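As a minimal sketch of how these calls look in practice, assuming the counter is declared with the signed LONG type the functions expect (the variable and function names here are illustrative, not from the test program):

#include <windows.h>

LONG safe_total = 0;                      /* signed 32-bit, matching the functions' parameter type */

void increment_total(void)
{
    InterlockedIncrement(&safe_total);    /* atomic ++safe_total, even across threads and CPUs */
}

void decrement_total(void)
{
    InterlockedDecrement(&safe_total);    /* atomic --safe_total */
}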
Why did the Pentium take nearly 3 times as long to perform a synchronized increment? On Visual C++/Intel, a call to InterlockedIncrement results in a call to the library function _imp_InterlockedIncrement@4 and that’s where the program spends its time.
Okay, but what about the factor of 12 on the Alpha? Isn’t that outrageously expensive? Just what is InterlockedIncrement doing all that time? To answer that question, let’s look at pseudo-code for the code really generated by Visual C++/Alpha:
        mb
Retry:  ldl_l   R1, total     ; Step 1
        addl    R1, 1, R1     ; Step 2
        stl_c   R1, total     ; Step 3
        beq     R1, Retry     ; Step 4
Instead of the compiler generating code to call a system function like on the Pentium, the compiler implemented the API function inline using the pseudo-code shown above.
As an optimization, compilers targeting an Alpha processor can rearrange certain load and store instructions to improve throughput. However, if load/store synchronization is needed, it can be forced by the Memory Barrier (mb) instruction. But according to the Alpha hardware description, this instruction is only required on a multiprocessor system. Unfortunately, the compiler doesn’t seem to know or care if the target has more than one processor since it generates the protection anyway. So that contributes a little (but actually an insignificant amount of) overhead. The big delay results from the ldl_l (load locked) and stl_c (store conditional) instructions.
The Alpha has no way of stopping multiple threads (running on the same or different processors) from accessing the same memory location for write. Instead, it provides the following machinery whose steps correspond to the pseudo-code above:
1. Declare that you need exclusive access to some memory location when you load its contents.
2. Perform your operation(s) on that value.
3. Attempt to store the result back in memory and check if the store succeeded.
4. If another thread wrote to that memory location since step 1, go back to step 1 and try again.
Consider the case of the test program creating 2 threads, each of which loops n times. In the worst case, each time thread 1 gets to step 3, it finds that thread 2 has modified “total”, so thread 1 has to retry the sequence of steps. And of course, the threads could get in each other’s way. And based on the timings in the Alpha table above, it appears they did, and quite often. What isn’t shown in that table, and what would be interesting, is the elapsed wall-clock time of each thread, since that could vary considerably.
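To make the retry pattern concrete at the C level, here is a sketch of the same load/compute/conditional-store loop written with InterlockedCompareExchange rather than the raw ldl_l/stl_c instructions. This is an illustration of the pattern only, not the code the compiler generates, and it assumes the LONG-based InterlockedCompareExchange found in later Win32 SDKs; the names are illustrative:

#include <windows.h>

LONG shared_count = 0;                            /* illustrative shared counter */

void add_one_with_retry(void)
{
    LONG old_value, new_value;

    do {
        old_value = shared_count;                 /* step 1: read the current value    */
        new_value = old_value + 1;                /* step 2: compute the updated value */
        /* steps 3 and 4: store only if no other thread changed shared_count
           in the meantime; otherwise loop and try again. */
    } while (InterlockedCompareExchange(&shared_count, new_value, old_value)
             != old_value);
}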
The following program suffers from the same problem except that in this case, the problem is not so obvious since it is hidden inside a class. Basically, the class’s constructor increments a static count member each time an object of that type is constructed, and the destructor decrements that count. When the program terminates normally, the count should be zero, showing that everything we set up, we cleaned up. (In reality, the program has the destructor incrementing a separate static count member so the two counts should be the same.):
#include <windows.h>

class X
{
public:
    static unsigned long int construct_count;
    static unsigned long int destruct_count;

    X()  { ++construct_count; }
    ~X() { ++destruct_count; }
};

unsigned long int X::construct_count = 0;
unsigned long int X::destruct_count = 0;

DWORD StartUp(DWORD *pnumber_loops)
{
    unsigned long int i;

    for (i = 0; i < *pnumber_loops; ++i) {
        X a;    // construct and destruct one object per iteration
    }
    return 0;
}
When the program was run with 10 threads each looping 1 million times on an Alpha, the lowest count was 1,097,368 and the highest was 3,686,304, when both should have been 10 million. Clearly, the effects of most of the ++ and -- operations were lost. When the program was run on a single-CPU Intel system, there were no problems, as we should expect based on the discussion earlier.
There is an added complication not shown by this example (in fact, it was avoided specifically). When this kind of thing is done in a class, the constructor and destructor really use the same counter. However, without synchronization, the number of decrements actually reflected can exceed the number of increments reflected, resulting in a negative count value. However, if the count has an unsigned type, that negative value will appear to be positive. When the program was run 20 times on an Alpha, about 50% of the time the destructor count was higher than that for the constructor.
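Here is a sketch of how the single-shared-counter version of such a class might be made safe, assuming the count is changed to the signed LONG type the interlocked functions require (the class and member names are illustrative):

#include <windows.h>

class Counted
{
public:
    static LONG live_count;                            // signed, as the interlocked functions require

    Counted()  { InterlockedIncrement(&live_count); }  // atomic ++live_count
    ~Counted() { InterlockedDecrement(&live_count); }  // atomic --live_count
};

LONG Counted::live_count = 0;                          // should be back to 0 at normal termination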
Code containing multiple threads or asynchronous signal handlers that works properly on a system with a single CISC (Intel) CPU need not necessarily work correctly on one containing a single RISC CPU.
Code that uses shared memory and that works properly on a single CPU system need not necessarily work correctly on one containing multiple CPUs.
Even if you recognize the need to perform synchronization using the interlocked increment and decrement functions, you have to make ABSOLUTELY SURE that your code never modifies the shared object in ANY OTHER WAY. No assigning directly to it. No incrementing, decrementing, or assigning indirectly via a pointer to it. (To perform a synchronized assignment, use InterlockedExchange.) And if you need more synchronization than these simple interlocked functions provide, you’ll have to use more elaborate machinery such as critical sections or events. One such situation is the need to update a shared object of type double, or even to increment or decrement an integer by more than 1.
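For instance, a minimal sketch of protecting a shared double with a critical section might look like this (the names are illustrative, and InitializeCriticalSection must be called once before any thread touches the lock):

#include <windows.h>

double           shared_balance = 0.0;    /* cannot be updated with the interlocked functions */
CRITICAL_SECTION balance_lock;            /* initialize once with InitializeCriticalSection(&balance_lock) */

void add_to_balance(double amount)
{
    EnterCriticalSection(&balance_lock);  /* only one thread at a time past this point */
    shared_balance += amount;
    LeaveCriticalSection(&balance_lock);
}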
My guess is that the vast majority of the programs being run under NT have not been tested extensively (if at all) on either a RISC or multi-CPU system. Yes, the operating system can easily handle the addition of extra processors and it runs fine with a variety of processors. However, can your code, or that of the applications you use, do the same?
Rex Jaeschke is a consultant, seminar leader and author specializing in C, C++, Java, and Win32. He is also chair of the ANSI C and ANSI Java standards committees. Rex can be reached at: [email protected]