Hi,
Here we’ll discuss 11 likely problems in your multithreaded code, along with some patterns for achieving safety.
The 11 likely problems are as follows:
· Data Races
· Forgotten Synchronization
· Incorrect Granularity
· Read and Write Tearing
· Lock-Free Reordering
· Reentrancy
· Deadlocks
· Lock Convoys
· Stampeding
· Two-Step Dance
· Priority Inversion
· Patterns for Achieving Safety
· Immutability
· Purity
· Isolation
Data Races
A data race—or race condition—occurs when data is accessed concurrently from multiple threads. Specifically, it happens when one or more threads are writing a piece of data while one or more other threads are also reading or writing that same piece of data. Consider this canonical example:
static class Counter
{
    internal static int s_curr = 0;

    internal static int GetNext()
    {
        return s_curr++;
    }
}
Here, if two threads call GetNext simultaneously, both might be given the same number. The reason is that s_curr++ compiles into three separate steps:
- Read the current value from the shared s_curr variable into a processor register.
- Increment that register.
- Write the register value back to the shared s_curr variable.
Two threads executing this same sequence can both read the same value from s_curr locally (say, 42), increment it (to, say, 43), and publish the same resulting value.
Note - Although the simple statement s_curr++ appears to be atomic, this couldn't be farther from the truth.
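To see the race in action, a minimal test harness (hypothetical; only the Counter class above is from the article) can hammer GetNext from two threads and count how often the same value is handed out more than once:

using System;
using System.Collections.Generic;
using System.Threading;

static class RaceDemo
{
    static void Main()
    {
        var seen = new Dictionary<int, int>();   // value -> times handed out
        var seenLock = new object();              // protects 'seen' itself

        ThreadStart work = delegate
        {
            for (int i = 0; i < 1000000; i++)
            {
                int n = Counter.GetNext();        // the racy increment
                lock (seenLock)
                {
                    int count;
                    seen.TryGetValue(n, out count);
                    seen[n] = count + 1;
                }
            }
        };

        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.Start(); t2.Start();
        t1.Join(); t2.Join();

        int duplicates = 0;
        foreach (KeyValuePair<int, int> kv in seen)
            if (kv.Value > 1) duplicates++;

        // With the unsynchronized Counter, this is very likely nonzero.
        Console.WriteLine("Values handed out more than once: " + duplicates);
    }
}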
Forgotten Synchronization
This is the simplest kind of data race: Synchronization is forgotten altogether.
Problems like this aren't always so obvious. For example, an object may be part of some large, complicated object graph that happens to be reachable from a static variable, or that became shared by being passed as part of a closure when creating a new thread or queuing work to a thread pool.
It is imperative to pay attention to when an object (graph) Transitions From Private To Shared. This is called Publication. The reverse is called Privatization—that is, when an object (graph) goes from being shared to private again.
The solution is to add proper synchronization to the above example. There are two options, and which to use depends on the shape of the update.

Option 1 – Interlocked, when the update is confined to a single memory location:

static class Counter
{
    internal static volatile int s_curr = 0;

    internal static int GetNext()
    {
        return Interlocked.Increment(ref s_curr);
    }
}

(Passing a volatile field by ref produces compiler warning CS0420; it is benign here because Interlocked supplies the necessary fences.) Also, volatility is relevant only to primitive integral (and unsafe pointer) types – other types are not cached in CPU registers and cannot be declared with the volatile keyword.

Option 2 – a lock, for anything more involved:

static class Counter
{
    internal static int s_curr = 0;
    private static object s_currLock = new object();

    internal static int GetNext()
    {
        lock (s_currLock)
        {
            return s_curr++;
        }
    }
}

Using Interlocked is generally more efficient than obtaining a lock, because it can never block and suffer the overhead of its thread being temporarily descheduled. Interlocked is also valid across multiple processes – in contrast to the lock statement, which is effective only across threads in the current process. An example of where this might be useful is in reading and writing into shared memory.
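When the update is still a single memory location but more than a simple increment, a compare-and-swap retry loop with Interlocked.CompareExchange can still avoid the lock. A hedged sketch (CappedCounter and its wrap-at-max policy are made up for illustration):

using System.Threading;

static class CappedCounter
{
    private static int s_curr = 0;

    // Wraps back to 0 after 'max'; lock-free via a CAS retry loop.
    internal static int GetNext(int max)
    {
        while (true)
        {
            int observed = s_curr;
            int desired = observed >= max ? 0 : observed + 1;
            // CompareExchange writes 'desired' only if s_curr still equals
            // 'observed'; otherwise another thread won the race and we retry.
            if (Interlocked.CompareExchange(ref s_curr, desired, observed) == observed)
                return desired;
        }
    }
}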
Incorrect Granularity
Even if access to shared state occurs with proper synchronization, the resulting behavior may still be incorrect. The granularity must be sufficiently large that the operations that must be seen as atomic are encapsulated within the synchronized region. For example, take a look at the bank account abstraction shown below. In isolation, all is well: the object's two methods, Deposit and Withdraw, appear to be concurrency-safe. The problem shows up as soon as they are composed.
class BankAccount
{
    private decimal m_balance = 0.0M;
    private object m_balanceLock = new object();

    internal void Deposit(decimal delta)
    {
        lock (m_balanceLock)
        {
            m_balance += delta;
        }
    }

    internal void Withdraw(decimal delta)
    {
        lock (m_balanceLock)
        {
            if (m_balance < delta)
                throw new Exception("Insufficient funds");
            m_balance -= delta;
        }
    }
}

What if, however, you'd like to add a Transfer method? A naive (and incorrect) attempt would assume that, because Deposit and Withdraw are safe in isolation, they can be combined easily:

class BankAccount
{
    // ... fields, Deposit, and Withdraw as above ...

    internal static void Transfer(BankAccount a, BankAccount b, decimal delta)
    {
        a.Withdraw(delta);
        b.Deposit(delta);
    }
}

A banking application might use this without worrying that balances will become corrupt due to concurrent access. That is incorrect: there is actually a period of time between the Withdraw and Deposit calls where the money is missing completely.

A correct implementation would need to acquire the locks on both a and b up front and then make the method calls:

class BankAccount
{
    // ... fields, Deposit, and Withdraw as above ...

    internal static void Transfer(BankAccount a, BankAccount b, decimal delta)
    {
        lock (a.m_balanceLock)
        {
            lock (b.m_balanceLock)
            {
                a.Withdraw(delta);
                b.Deposit(delta);
            }
        }
    }
}

It turns out that this approach, while solving the granularity problem, is prone to deadlock: two concurrent calls, Transfer(a, b, ...) and Transfer(b, a, ...), can each acquire one of the two locks and then wait forever for the other.
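One standard fix is to always acquire the two locks in a consistent global order, so that Transfer(a, b, ...) and Transfer(b, a, ...) can never deadlock each other. A hedged sketch, assuming each account carries a unique, immutable id (m_id below is hypothetical):

class BankAccount
{
    // ... fields, Deposit, and Withdraw as above ...
    private readonly int m_id;   // hypothetical unique account number, assigned at construction (omitted)

    internal static void Transfer(BankAccount a, BankAccount b, decimal delta)
    {
        // Lock the account with the smaller id first, regardless of the
        // direction of the transfer; every thread then agrees on the order.
        BankAccount first = a.m_id < b.m_id ? a : b;
        BankAccount second = first == a ? b : a;
        lock (first.m_balanceLock)
        {
            lock (second.m_balanceLock)
            {
                a.Withdraw(delta);   // the inner lock re-acquisitions are
                b.Deposit(delta);    // fine: Monitor locks are reentrant
            }
        }
    }
}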
Read and Write Tearing
Reads and writes of aligned, naturally sized words—for example, pointer-sized values that are 32 bits (4 bytes) on 32-bit processors and 64 bits (8 bytes) on 64-bit processors—are atomic. If a thread is just reading a single variable that some other thread will write, and there aren't any complex invariants involved, you can sometimes Skip The Synchronization altogether thanks to this guarantee.
(This does not contradict the ‘Note’ under ‘Data Races’ above: a single aligned read or write is atomic, but a read-modify-write such as s_curr++ is three operations, and is not.)
But beware. If you try this on a misaligned memory location, or a location that isn't naturally sized—e.g., a 64-bit value on a 32-bit platform—you can encounter read or write tearing. Tearing occurs because reading or writing such locations actually involves multiple physical memory operations. Concurrent updates can happen in between these, potentially causing the resultant value to be some blend of the before and after values.
As an example, imagine ThreadA sits in a loop writing only 0x0L and 0xaaaabbbbccccddddL to a 64-bit variable s_x, while ThreadB sits in a loop reading it. (Note that a long field cannot be declared volatile in C#.)

internal static long s_x;

void ThreadA()
{
    int i = 0;
    while (true)
    {
        s_x = (i & 1) == 0 ? 0x0L : 0xaaaabbbbccccddddL;
        i++;
    }
}

void ThreadB()
{
    while (true)
    {
        long x = s_x;
        Debug.Assert(x == 0x0L || x == 0xaaaabbbbccccddddL);
    }
}

You may be surprised to discover that ThreadB's assert might fire on a 32-bit machine. The reason is that ThreadA's write consists of two parts, the high 32 bits and the low 32 bits, in some order depending on the compiler. The same goes for ThreadB's read. Thus ThreadB could witness torn values such as 0xaaaabbbb00000000L or 0x00000000ccccddddL.
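If you must share a 64-bit value on a 32-bit platform, one fix is to funnel every access through the Interlocked APIs, which move all 8 bytes atomically. A sketch (NoTearing is an illustrative name; the APIs are real):

using System.Diagnostics;
using System.Threading;

static class NoTearing
{
    internal static long s_x;

    static void ThreadA()
    {
        int i = 0;
        while (true)
        {
            // Atomic 64-bit write, even on 32-bit processors.
            Interlocked.Exchange(ref s_x, (i & 1) == 0 ? 0x0L : 0xaaaabbbbccccddddL);
            i++;
        }
    }

    static void ThreadB()
    {
        while (true)
        {
            long x = Interlocked.Read(ref s_x);   // atomic 64-bit read
            Debug.Assert(x == 0x0L || x == 0xaaaabbbbccccddddL);  // never fires now
        }
    }
}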
Lock-Free Reordering
Sometimes it's tempting to write lock-free code as a way of achieving better scalability and reliability. Doing so requires a deep understanding of the memory model of your target platform.
Failure to understand and heed these rules can lead to memory reordering bugs. These happen because compilers and processors are free to reorder memory operations in the course of making optimizations.
Reference: See Vance Morrison's article "Memory Models: Understand the Impact of Low-Lock Techniques in Multithreaded Apps" at msdn.microsoft.com/magazine/cc163715 for details.
As an example, assume s_x and s_y are both initialized to the value 0, as you see here:
internal static volatile int s_x = 0;
internal static volatile int s_xa = 0;
internal static volatile int s_y = 0;
internal static volatile int s_ya = 0;

void ThreadA()
{
    s_x = 1;
    s_ya = s_y;
}

void ThreadB()
{
    s_y = 1;
    s_xa = s_x;
}
Is it possible that, after ThreadA and ThreadB both run to completion, both s_ya and s_xa will contain the value 0? This seems ridiculous on its face. Either s_x = 1 or s_y = 1 will happen first, in which case the other thread should witness that update when it gets around to its own read.

Unfortunately, processors are Free To Reorder This Code so that the loads effectively happen before the writes. Hence it can happen that both s_ya and s_xa contain the value 0.

You can get around this with an explicit memory barrier:

void ThreadA()
{
    s_x = 1;
    Thread.MemoryBarrier();
    s_ya = s_y;
}
The .NET Framework offers this particular API, and C++ offers _MemoryBarrier and similar macros. But the moral of the story is not that you should insert memory barriers all over the place.
The Moral is that you should steer clear of lock-free code until you've mastered memory models, and even then tread with care.
Lock Convoys & Stampede
When the arrival rate at a lock is consistently high compared to its lock acquisition rate, a lock convoy may result. In the extreme, there are more threads waiting at a lock than can be serviced, leading to a catastrophe. This is more common on server-side programs where certain locks protecting data structures needed by most clients can get unusually hot.
For example, imagine this scenario: On average, eight requests arrive per 100 milliseconds. We use eight threads to service requests (because we're on an 8-CPU machine). Each of those eight threads must acquire a lock and hold it for 20 milliseconds before it can do meaningful work. Unfortunately, access to this lock must be serialized, so it takes 160 milliseconds for all eight threads to enter and leave the lock. After the first exits, there will be 140 milliseconds of time before a ninth thread can access the lock. This scheme inherently will not scale, and there will be a continuously growing backup of requests. Over time, if the arrival rate does not decrease, client requests are apt to begin timing out, and a disaster will ensue.
Fairness in locks is known to contribute to Convoys. The reason is that periods of time where the lock could have been made available are artificially blocked out, so that threads arriving must wait until the chosen lock owner thread can wake up, context switch, and acquire and release the lock.
The only real solution to the fundamental convoy problem is to reduce lock hold times and to factor your system so that there are very few (if any) hot locks. This is easier said than done, but is crucial for scalability.
A Stampede is a situation where many threads are awakened, such that they all fight for attention simultaneously from the Windows thread scheduler. If you have 100 threads blocked on a single manual-reset event, for example, and you set that event … well, you're apt to have a real mess on your hands, particularly if a large portion of those threads will have to wait again.
One way to implement a blocking queue is to use a manual-reset event that gets unsignaled when the queue is empty, and becomes signaled when it is non-empty. Unfortunately, when there are a large number of waiting threads during the transition from zero elements to one element, a stampede can occur. That's because only one thread will take the single element, which transitions the queue back to empty, and necessarily involves resetting the event. If you had 100 threads waiting, then 99 of them will wake up, context switch (and incur all those cache misses), just to realize they have to wait again.
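One way around the stampede is to wake exactly one waiter per element. A hedged sketch (this BlockingQueue is illustrative, not a framework type) backs the queue with a semaphore whose count mirrors the number of elements:

using System.Collections.Generic;
using System.Threading;

class BlockingQueue<T>
{
    private readonly Queue<T> m_queue = new Queue<T>();
    private readonly Semaphore m_items = new Semaphore(0, int.MaxValue);

    public void Enqueue(T item)
    {
        lock (m_queue) { m_queue.Enqueue(item); }
        m_items.Release();   // wakes exactly one waiting thread, not all 100
    }

    public T Dequeue()
    {
        m_items.WaitOne();   // blocks until an element is available
        lock (m_queue) { return m_queue.Dequeue(); }
    }
}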
Two-Step Dance
Sometimes you need to signal an event while holding a lock. This can be unfortunate if the waking thread needs to acquire the lock held, because it will be awakened only to find out that it must wait again. This is wasteful and increases the number of overall context switches. This situation is called the Two-Step Dance, and can extend far beyond just two steps if many locks and events are involved.
Both Win32's and the CLR's condition variable support inherently suffer from the two-step dance problem. It is often unavoidable, or prohibitively difficult to work around.
The two-step dance issue is even worse on a single-processor machine. When events are involved, the kernel will apply a priority boost to the awakening thread. This is nearly guaranteed to preempt the thread setting the event before it has a chance to release the lock.
This is the two-step dance in the extreme: the setting ThreadA is context switched out so that the awakening ThreadB can attempt to acquire the lock; it of course cannot, so it context switches out so that ThreadA can run again; eventually ThreadA releases the lock, which again priority boosts ThreadB, which preempts ThreadA so that it can run. As you can see, there's a lot of wasteful context switching going on.
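Where the design permits, the dance can be avoided by signaling outside the lock. A hedged sketch (the names are made up):

using System.Threading;

class TwoStepFree
{
    private readonly object m_gate = new object();
    private readonly AutoResetEvent m_ready = new AutoResetEvent(false);
    private int m_shared;

    void Publish(int value)
    {
        lock (m_gate)
        {
            m_shared = value;   // mutate shared state under the lock
        }
        // The lock is already released here, so the thread woken by Set
        // will not immediately block trying to acquire m_gate.
        m_ready.Set();
    }
}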
Priority Inversion
Modifying thread priorities is usually asking for trouble. The moral of this story is to simply avoid changing thread priorities wherever possible.
Priority inversion can arise when many threads at different priorities share access to the same locks and resources, whereby a lower-priority thread indefinitely holds up the progress of a higher-priority thread.
Here is an extreme example of priority inversion. Imagine ThreadA at Low priority acquires some lock L. Then comes along ThreadB, which is at High priority. It tries to acquire L, but cannot because ThreadA holds it. This is the "Inversion" part: it's as if ThreadA is artificially and temporarily given a higher priority than ThreadB, simply because it holds a lock that ThreadB needs.
The situation will eventually resolve itself when ThreadA releases the lock.
Unfortunately, imagine what happens if ThreadC at Medium priority comes into the picture. Although ThreadC doesn't need lock L, its mere presence may prevent ThreadA from running at all, which indirectly prevents the High-priority ThreadB from running. Eventually the Windows Balance Set Manager thread will notice the situation: even if ThreadC remains runnable forever, ThreadA will eventually (after four seconds) receive a temporary priority boost from the OS. Hopefully this is enough to allow it to run to completion and release the lock. But the delay here (four seconds) is huge, and if there is any user interface involved, the application's user is certainly going to notice the problem.
Immutability
An immutable data structure is one that doesn't change after construction. This is a wonderful property for concurrent programs because if data isn't being mutated, then there is no risk of conflict if many threads access it at once. That means Synchronization Isn't A Concern.
Immutability has some support in C++ by way of const, and in C# by way of the readonly modifier. For example, a .NET type that has only read-only fields is Shallow Immutable. Going a step further, if each of those fields itself refers to another type whose fields are all read-only (and which only refers to deeply immutable types), then the type is Deeply Immutable. This results in an entire object graph that is guaranteed not to change out from underneath you, which is very useful indeed.
All of this describes immutability as a Static Property.
It's also possible for objects to be immutable by convention, meaning that there is some guarantee that state won't change during certain periods of time. This is a Dynamic Property.
The Windows Presentation Foundation (WPF) freezable feature implements precisely this, and it also allows concurrent access without synchronization.
Immutability does have some downsides. Namely, whenever something has to change, you will need to make a copy of the original object and apply the changes along the way. Additionally, cycles in an object graph are generally not possible (except with dynamic immutability).
e.g. An immutable stack, created using readonly fields:

public class ImmutableStack<T>
{
    private readonly T m_value;
    private readonly ImmutableStack<T> m_next;
    private readonly bool m_empty;

    public ImmutableStack()
    {
        m_empty = true;
    }

    internal ImmutableStack(T value, ImmutableStack<T> next)
    {
        m_value = value;
        m_next = next;
        m_empty = false;
    }

    public ImmutableStack<T> Push(T value)
    {
        return new ImmutableStack<T>(value, this);
    }

    public ImmutableStack<T> Pop(out T value)
    {
        if (m_empty)
            throw new Exception("Empty.");
        value = m_value;
        return m_next;
    }
}
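Hypothetical usage – every operation returns a new stack, so threads that share s1 can never observe it changing:

var s0 = new ImmutableStack<int>();
var s1 = s0.Push(10).Push(20);   // s0 is untouched; s1 has two elements

int top;
var s2 = s1.Pop(out top);        // top == 20; s1 still holds both elements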
Purity
Even with immutable data types, most operations performed by a program are method calls. And method calls can have side-effects, which are problematic in concurrent code because a side-effect implies mutation of some form. Most often this will be a simple write to shared memory, but it can also be a physically mutating operation like a database transaction, Web service invocation, or file system operation. In many circumstances, I'd like to be able to invoke some method without fear that it will lead to concurrency hazards. Good examples of this are simple methods like GetHashCode and ToString on System.Object. Most people wouldn't expect them to entail side-effects.
A pure method can always be run in a concurrent setting without added synchronization. Although there is no common language support for purity, you can define a pure method very simply:
- It only reads from shared memory, and it only reads immutable or constant state.
- It can, of course, write to local variables.
- It may only call other pure methods.
The set of things possible within a pure method, thus, is extremely limited. But when combined with immutable types, purity can be made possible and convenient. Some functional languages assume purity by default, most notably Haskell where everything is pure. Anything that must perform a side-effect has to be wrapped in a special thing called a monad. Most of us don't use Haskell, however, so we'll have to get by with purity-by-convention.
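A sketch of what the convention looks like in practice (Average is made up for illustration, and it assumes the array passed in is not being mutated concurrently):

// Pure by convention: reads only its argument, writes only locals,
// and calls nothing with side-effects, so it needs no synchronization.
internal static decimal Average(decimal[] values)
{
    if (values.Length == 0)
        return 0.0M;
    decimal sum = 0.0M;            // local state only
    foreach (decimal v in values)
        sum += v;
    return sum / values.Length;
}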
Isolation
Publication and privatization were mentioned in passing earlier, but they strike at the heart of a very important issue. Synchronization is necessary—and immutability and purity are interesting—because state is ordinarily shared among multiple threads. But if state is confined within a single thread, then there is no need for synchronization. This leads to more naturally scalable software.
Indeed, if state is isolated, mutation can happen freely. This is convenient because mutation is a fundamental built-in aspect of most C-style languages. Programmers are accustomed to it. It takes discipline to program in a mostly functional style, and is quite difficult for most developers. So give it a try, but don't deceive yourself into thinking the world will become functional overnight.
Ownership is a difficult thing to track down. When does an object become shared? While it is being initialized, it is typically held by a single thread and is not yet reachable from other threads. Once a reference to it is stored in a static variable, in someplace that is shared at thread creation or work queuing time, or in a field of an object transitively reachable from one of those places, the object becomes shared. It is imperative that developers watch specifically for these transitions between private and shared, and treat all shared state with care.
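A sketch of that transition (the names are illustrative) – the list is private, and freely mutable, right up until the assignment that publishes it:

using System.Collections.Generic;

static class Catalog
{
    static IList<string> s_shared;   // reachable by every thread

    static void Initialize()
    {
        List<string> list = new List<string>();   // private: only this thread sees it
        list.Add("a");                             // mutation is safe here
        list.Add("b");
        s_shared = list;   // publication: from this point on, treat as shared
    }
}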
Patterns for Achieving Safety
Now that I've chased down issue after issue, the good news is that there are design patterns you can follow to make the occurrence of many of the above issues (particularly the correctness hazards) far less frequent. You can take a cue from functional programming languages like Haskell, LISP, Scheme, ML, and even F# (a new .NET-compliant language), by adopting immutability, purity, and isolation as first class design concepts.
Reference: "Solve 11 Likely Problems in Your Multithreaded Code," MSDN Magazine, October 2008.
Thanks & Regards,
Arun Manglick || Senior Tech Lead