Friday, February 13, 2015

A Few Notes on Locking and Shared Memory Access

I was reading the Disruptor paper and came across some mind-boggling performance numbers. The Disruptor is a concurrency design pattern from LMAX for passing messages between threads with high throughput and low latency.
These are some general notes I collected from the paper and from Martin Thompson's blog. They are worth keeping in mind while implementing concurrent programs.

Cost of Locks:
Locks provide mutual exclusion and ensure that the visibility of change occurs in an orderly manner. Locks are expensive because they require arbitration when contended. The arbitration is achieved by a context switch to the operating system kernel which will suspend threads waiting on a lock until it is released.
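
As a minimal sketch of the pattern (the class and method names here are illustrative, not from any particular library), here is a counter guarded by an explicit lock in Java. Every increment under contention can force the losing threads through kernel arbitration:

import java.util.concurrent.locks.ReentrantLock;

// A counter guarded by an explicit lock. Under contention, threads that
// fail to acquire the lock may be suspended by the kernel, paying the
// cost of a context switch.
public class LockedCounter {
    private final ReentrantLock lock = new ReentrantLock();
    private long value = 0;

    public void increment() {
        lock.lock();          // may block: arbitration by the OS if contended
        try {
            value++;
        } finally {
            lock.unlock();    // wakes a waiting thread, if any
        }
    }

    public long get() {
        lock.lock();
        try {
            return value;
        } finally {
            lock.unlock();
        }
    }
}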

Cost of CAS:
CAS is a special machine-code instruction that allows a word in memory to be conditionally set as an atomic operation. The expected old value and the new value are provided as parameters to the instruction; it succeeds only if the word still contains the expected old value, in which case the word is replaced with the new value. The CAS approach is significantly more efficient than locks because it does not require a context switch to the kernel for arbitration. However, CAS is not free of cost: the processor must lock its instruction pipeline to ensure atomicity and employ a memory barrier to make the change visible to other threads.
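
A minimal sketch of the CAS approach using java.util.concurrent.atomic.AtomicLong, whose compareAndSet maps down to the hardware CAS instruction. The retry loop is the standard idiom: if the CAS fails, another thread changed the value between our read and our write, so we re-read and try again:

import java.util.concurrent.atomic.AtomicLong;

// An increment implemented as a CAS retry loop. compareAndSet(expected,
// updated) succeeds only if the current value still equals 'expected';
// otherwise another thread won the race and we retry -- no kernel
// arbitration involved.
public class CasCounter {
    private final AtomicLong value = new AtomicLong(0);

    public long increment() {
        long current;
        long next;
        do {
            current = value.get();   // read the current value
            next = current + 1;
        } while (!value.compareAndSet(current, next)); // retry on contention
        return next;
    }
}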

Cache Lines:
Modern CPUs are much faster than memory systems, so fetching data from main memory is comparatively very slow. To bridge this gap, multiple layers of hardware cache sit between the CPU and main memory. When the CPU accesses data, it is pulled in fixed-size blocks called cache lines (typically 64 bytes) and kept in the cache for future accesses.
Memory is also pre-fetched into the cache when the access pattern is predictable. When iterating over an array the stride is predictable, so upcoming cache lines are pre-fetched before they are needed. For linked lists and trees, the nodes are scattered across the heap, so there is no predictable stride of access. The lack of a consistent pattern constrains the system's ability to pre-fetch cache lines, resulting in main memory accesses that can be more than two orders of magnitude slower.
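
A rough illustration of the difference (a toy timing loop, not a rigorous benchmark; a harness like JMH would be needed for trustworthy numbers): summing a long[] walks memory with a predictable stride, while walking a LinkedList chases pointers scattered across the heap:

import java.util.LinkedList;
import java.util.List;

// Summing an array walks memory with a predictable stride, so the hardware
// prefetcher can keep cache lines arriving ahead of use. Walking a linked
// list chases pointers scattered across the heap, defeating prefetching.
public class StrideDemo {
    public static void main(String[] args) {
        int n = 1_000_000;
        long[] array = new long[n];
        List<Long> list = new LinkedList<>();
        for (int i = 0; i < n; i++) {
            array[i] = i;
            list.add((long) i);
        }

        long t0 = System.nanoTime();
        long sumA = 0;
        for (int i = 0; i < n; i++) sumA += array[i];   // sequential stride
        long t1 = System.nanoTime();

        long sumL = 0;
        for (long v : list) sumL += v;                  // pointer chasing
        long t2 = System.nanoTime();

        System.out.printf("array: %d ms, linked list: %d ms (sums %d/%d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sumA, sumL);
    }
}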

False Sharing: 
Even if two unrelated variables are accessed by different threads, sharing can still happen if they sit next to each other in main memory: whenever one of them is loaded, the other is pulled into the cache too, because both live on the same cache line. A write to either variable then invalidates the whole line on every other core, so the threads contend even though they never touch each other's data.
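
One classic workaround, popularized by the Disruptor, is to pad the hot field with unused long fields so that two independently written variables cannot land on the same 64-byte cache line. A minimal sketch (note that the JVM is in principle free to rearrange or strip unused fields, so this is a heuristic rather than a guarantee; Java 8 also offers @sun.misc.Contended with -XX:-RestrictContended):

// 7 longs of padding (56 bytes) on each side of the hot field, so that a
// neighbouring variable written by another thread cannot share its
// 64-byte cache line.
public class PaddedCounter {
    private long p1, p2, p3, p4, p5, p6, p7;   // padding, never read
    private volatile long value = 0;
    private long q1, q2, q3, q4, q5, q6, q7;   // padding, never read

    // Safe only when called from a single writer thread
    // (volatile ++ is not atomic across multiple writers).
    public void increment() { value++; }

    public long get() { return value; }
}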

Busy Spin Wait Strategy:
Busy spinning may give better performance than using locks in many cases, and it can be combined with a small sleep or park interval as a backoff. As discussed above, locks carry the overhead of context switching. A context switch may cost the equivalent of spinning a few hundred or a few thousand times, so if a lock can be acquired by burning a few cycles spinning, that may well be more efficient overall. Also, for real-time applications it may not be acceptable to block and wait for the scheduler to come back to the thread at some distant point in the future. This Stack Overflow thread on spinlock versus semaphore, http://stackoverflow.com/questions/195853/spinlock-versus-semaphore, is worth reading for the details.
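
A minimal sketch of a busy-spin wait on a flag, with a tiny park as a backoff so a long wait does not burn a whole core (the names and the 10,000-spin threshold are illustrative choices, not from the Disruptor API):

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

// Spin on a flag; after a while, fall back to a very short park so a
// long wait does not consume a full core.
public class SpinWait {
    private final AtomicBoolean ready = new AtomicBoolean(false);

    public void awaitReady() {
        int spins = 0;
        while (!ready.get()) {            // burn a few cycles per iteration
            if (++spins > 10_000) {       // back off after spinning a while
                LockSupport.parkNanos(1); // far cheaper than a full sleep
            }
        }
    }

    public void signal() {
        ready.set(true);
    }
}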

Avoid Contention:
The most costly operation in any concurrent environment is a contended write. The ideal is to follow the single-writer principle: let exactly one thread write a given piece of state, and let any number of threads read it.
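
A minimal sketch of the single-writer idea, loosely mirroring the sequence counters in the Disruptor: because only one thread ever writes, no lock or CAS is needed, and AtomicLong.lazySet publishes the value with a cheap ordered store instead of a full fence:

import java.util.concurrent.atomic.AtomicLong;

// Exactly one thread writes the sequence, so there is no write contention;
// lazySet is an ordered store, not a read-modify-write, and any number of
// reader threads can poll the value with a plain volatile read.
public class SingleWriterSequence {
    private final AtomicLong sequence = new AtomicLong(0);

    // Called only from the single writer thread.
    public void publish(long next) {
        sequence.lazySet(next);
    }

    // Safe from any number of reader threads.
    public long current() {
        return sequence.get();
    }
}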