[asterisk-dev] Locking, coding guidelines addition

Fri Jun 27 13:17:53 CDT 2008

After doing some code review and responding to changes to this list,  
and then the following discussion on the #asterisk-dev channel, it has  
occurred to me that the fundamental locking rules for the boundary  
between channel drivers and the Asterisk core have not sufficiently  
been documented.  Thus, I propose the following addition to the coding  
guidelines document.

I welcome any feedback.  If anything is not clear, I will do my best  
to clarify it.  If anything seems wrong, then we have an even bigger  
problem, because this describes how the code is written today.  This  
is essentially an attempt to document some of what has come out of the  
evolution of the Asterisk code base, not what I would consider ideal.

---------------------------------------------------------
--- Locking
---------------------------------------------------------

A) Locking Fundamentals

Asterisk is a heavily multithreaded application.  It makes extensive  
use of locking to ensure safe access to shared resources between  
different threads.

When more that one lock is involved in a given code path, there is the  
potential for deadlocks.  A deadlock occurs when a thread is stuck  
waiting for a resource that it will never acquire.  Here is a classic  
example of a deadlock:

    Thread 1                           Thread 2
    ------------                           ------------
    Holds Lock A                   Holds Lock B
    Waiting for Lock B          Waiting for Lock A

In this case, there is a deadlock between threads 1 and 2.  However,  
this deadlock would have been avoided if both threads 1 and 2 had  
agreed that one must acquire Lock A before Lock B, or the other way  
around.

We have now arrived at the fundamental rule for dealing with multiple  
locks.

     Thou must not violate the rules of locking order.

The agreed upon locking order depends on the code, but when using  
multiple locks, an order _must_ be established, and once established,  
must be respected.

B) Minding the boundary between channel drivers and the Asterisk core

The #1 cause of deadlocks in Asterisk is by not properly following the  
locking rules that exist at the boundary between channel drivers and  
the Asterisk core.  The Asterisk core allocates an ast_channel, and  
channel drivers allocate technology specific private data that is  
associated with an ast_channel.  Typically, both the ast_channel and  
technology pvt (private) data have their own lock.  There are _many_  
code paths that require both objects to be locked.

The locking order in this situation is the following:

     1) ast_channel
     2) technology pvt data

Channel drivers implement the ast_channel_tech interface to provide a  
channel implementation for Asterisk.  Most of the channel_tech  
interface callbacks are called with the associated ast_channel  
locked.  When accessing technology pvt data, the pvt can be locked  
directly.

The complication arises in the opposite code path.  Consider the  
following situation:

     1) A message comes in over the "network"
     2) The CD (channel driver) monitor thread receives the message
     3) The CD associates the message with a pvt struct and locks the  
pvt
     4) While processing the message, the CD must do something that  
requires
          locking the ast_channel

This is the point that must be handled carefully.  Given the specified  
locking order, it seems as if the following psuedo-code would be  
sufficient:

       unlock(pvt);
       lock(ast_channel);
       lock(pvt);

Unfortunately, that's not good enough in Asterisk.  All locks (in the  
case of mutexes, not rwlocks) are configured as recursive locks.  That  
means that the same thread can lock the same lock multiple times,  
which essentially increments a counter.  So, an unlock() may not  
actually make the thread release a lock.

So, instead, we rely on a different construct to ensure that both  
locks are acquired safely.

       while (trylock(ast_channel) == FAILURE) {
           unlock(pvt);
           usleep(1); /* yield to other threads */
           lock(pvt);
       }

Technically, we are not following the specified locking order.  We  
will acquire the pvt lock before the ast_channel lock.  However, by  
looping in this fashion, we are safe from deadlocks.  We never block  
waiting on the ast_channel lock.  If we fail to acquire the  
ast_channel, we give other threads a chance to grab the pvt lock.   
This is to allow alternate code paths in other threads that have the  
ast_channel lock and call into the channel driver to safely acquire  
the pvt.

C) Locking multiple channels.

The next situation to consider is what to do when you need a lock on  
multiple ast_channels.  If we did not use recursive mutexes, the  
following construct would be sufficient:

          lock(MIN(chan1, chan2));
          lock(MAX(chan1, chan2));

That type of code would follow an established locking order of always  
locking the channel that has a lower address first.  Again, since we  
use recursive mutexes, this isn't safe.  We use the same construct as  
described at the end of section B, instead.

         lock(chan1);
         while (trylock(chan2) == FAILURE) {
              unlock(chan1);
              usleep(1);
              lock(chan1);
          }

D) Documenting locking rules.

Finally, it is required that locking order rules are documented for  
every lock introduced into Asterisk.  This is done almost nowhere in  
the existing code.  However, it will be expected to be there for newly  
introduced code.  Over time, this information should be added for all  
of the existing lock usage.

---------------------------------------------------------
---------------------------------------------------------

--
Russell Bryant
Senior Software Engineer
Open Source Team Lead
Digium, Inc.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.digium.com/pipermail/asterisk-dev/attachments/20080627/bb5ce5ee/attachment.htm