[asterisk-dev] [Code Review] Avoid deadlocks in chan_sip - 1.6.0

Russell Bryant russell at digium.com
Fri Nov 20 10:34:46 CST 2009



> On 2009-11-19 15:22:38, Russell Bryant wrote:
> > I would be very interested in seeing the "core show locks" output for the deadlocks that you encountered.  Then, I would be more than happy to help with determining the safest changes to make.
> > 
> > There is a "Locking in Asterisk" section of the CODING-GUIDELINES document that may help, as well.
> 
> Olle E Johansson wrote:
>     I'll see what we can do about the "core show locks".  There are some clear paths that will deadlock, but there might be other locks that we can change to let these code pieces do their work. I just could not find them. 
>     
>     Thanks for the quick feedback, Russell!

You're quite welcome.  The exact "core show locks" output isn't required, but the same information is probably needed.  Let me elaborate using examples from the changes we are discussing here.  

In the last 2 of the 3 parts of this patch, the changes are in sip_read() or sip_write().  Instead of directly locking the sip_pvt, the changes it to a trylock loop with releasing the channel lock temporarily, very similar to many other "deadlock avoidance" constructs in the code base.  Now, let's discuss the problem that led to this change.

Presumably there was one thread executing this function that got stuck waiting on the sip_pvt lock.  The critical piece of other information is what other thread(s) were involved in the deadlock.  In the simplest (and most likely) case, there is another thread stuck attempting to acquire the locks in the opposite order, meaning that thread 2 holds the sip_pvt lock and is stuck doing an ast_channel_lock().

At this point you have two routes to follow for making the deadlock go away.  One of the code paths has to give.  To make the decision of which code path to modify, we must consult our locking rules.  A deadlock results from a violation of locking order.  The locking order for this situation is 1) ast_channel, and 2) sip_pvt.

Based on those locking rules, the code in sip_read and sip_write is actually correct.  Those functions are called with the ast_channel locked, so locking the pvt directly is a safe operation (or is supposed to be).  So, it is the other code path that must be changed.

The problem with releasing the channel lock here is that there are a lot of assumptions in the code that this will not happen.  By releasing the channel lock, it is possible that a masquerade operation on this channel be executed in another thread.  If that happens at this point in the code, it will most likely cause a bizarre and tough to debug crash.

<sarcasm>Oh, the joys of heavily multithreaded programming with fine grained locking ...</sarcasm>


- Russell


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/433/#review1262
-----------------------------------------------------------


On 2009-11-19 11:02:38, Olle E Johansson wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviewboard.asterisk.org/r/433/
> -----------------------------------------------------------
> 
> (Updated 2009-11-19 11:02:38)
> 
> 
> Review request for Asterisk Developers.
> 
> 
> Summary
> -------
> 
> While running 1.6.0 in production, we have stumbled over a few deadlocks that we have proven to avoid with these changes.
> 
> - CLI "sip show channels" can cause serious damage before this change
> - Retrans_pkt had some issues with locking the Ast_channel
> - Both sip_write and sip_read was hanging waiting for a better time.
> 
> We debugged with GDB to see where each thread was hanging and found all these issues. Tested with and without patches and the issues disappeared on the patched systems.
> 
> Locking is a part of Asterisk coding that I'm far away from being a master in - so please give feedback. I'm sure there's a lot of room for improvement.
> 
> (And yes, I noticed the red box in the patch. Will remove.)
> 
> 
> Diffs
> -----
> 
>   /team/oej/moretrylock-1.6.0/channels/chan_sip.c 230467 
> 
> Diff: https://reviewboard.asterisk.org/r/433/diff
> 
> 
> Testing
> -------
> 
> The patches has been running in production on a couple of PBXs for a couple of weeks without causing any harm that we're aware of.
> 
> 
> Thanks,
> 
> Olle E
> 
>




More information about the asterisk-dev mailing list