[asterisk-dev] [Code Review] Avoid deadlocks in chan_sip - 1.6.0
Olle E. Johansson
oej at edvina.net
Sat Nov 21 04:02:16 CST 2009
20 nov 2009 kl. 17.34 skrev Russell Bryant:
>
>
>> On 2009-11-19 15:22:38, Russell Bryant wrote:
>>> I would be very interested in seeing the "core show locks" output for the deadlocks that you encountered. Then, I would be more than happy to help with determining the safest changes to make.
>>>
>>> There is a "Locking in Asterisk" section of the CODING-GUIDELINES document that may help, as well.
>>
>> Olle E Johansson wrote:
>> I'll see what we can do about the "core show locks". There are some clear paths that will deadlock, but there might be other locks that we can change to let these code pieces do their work. I just could not find them.
>>
>> Thanks for the quick feedback, Russell!
>
> You're quite welcome. The exact "core show locks" output isn't required, but the same information is probably needed. Let me elaborate using examples from the changes we are discussing here.
>
> In the last 2 of the 3 parts of this patch, the changes are in sip_read() or sip_write(). Instead of directly locking the sip_pvt, the changes it to a trylock loop with releasing the channel lock temporarily, very similar to many other "deadlock avoidance" constructs in the code base. Now, let's discuss the problem that led to this change.
>
> Presumably there was one thread executing this function that got stuck waiting on the sip_pvt lock. The critical piece of other information is what other thread(s) were involved in the deadlock. In the simplest (and most likely) case, there is another thread stuck attempting to acquire the locks in the opposite order, meaning that thread 2 holds the sip_pvt lock and is stuck doing an ast_channel_lock().
>
> At this point you have two routes to follow for making the deadlock go away. One of the code paths has to give. To make the decision of which code path to modify, we must consult our locking rules. A deadlock results from a violation of locking order. The locking order for this situation is 1) ast_channel, and 2) sip_pvt.
>
> Based on those locking rules, the code in sip_read and sip_write is actually correct. Those functions are called with the ast_channel locked, so locking the pvt directly is a safe operation (or is supposed to be). So, it is the other code path that must be changed.
>
> The problem with releasing the channel lock here is that there are a lot of assumptions in the code that this will not happen. By releasing the channel lock, it is possible that a masquerade operation on this channel be executed in another thread. If that happens at this point in the code, it will most likely cause a bizarre and tough to debug crash.
>
> <sarcasm>Oh, the joys of heavily multithreaded programming with fine grained locking ...</sarcasm>
Thanks for the explanation. I don't remember what the respective threads where doing, so I need to get back there and run tests on the platform. The systems had quite a lot of traffic including a log of subscriptions.
We noticed the other day a lot of new issues with the autodestruct code, that is executing without locks, it actually tries to send a final notify in the case of subscriptions. There's obviously NO ast_channel involved here at all - so there are cases where we have to lock sip_pvt's without the existance of channels. That propably what breaks our locking architecture totally. We have a few SIP operations on the PVT that does NOT involve an AST_CHANNEL and in those cases we propably need the trylocks.
The autodestruct code caused a few crashes so we added locks around it as well, to protect the channel while we where trying to transmit the final notify. I think that part of the code needs a change, we can't destruct the pvt since we have an outbound SIP message that needs retransmit support for up to 32 secs.
Trying to figure out a conclusion in all this mess, I think the major issue here is as I wrote before that the SIP transactions that involve no ast_channel are causing these deadlocks, since the locking architecture in many cases depend on the ast_channel to be locked.
We might have to figure out a way for these codepaths to behave differently in the case of these types of transactions compared with others. Of course, releasing a non-existing ast_channel won't help either... We also need to make sure that anything that executes in the monitor thread does not hang waiting for locks. All scheduled items (retransmissions, subs/regs, destructions, timeouts), reading the udp socket and handling those incoming requests is part of that.
I am amazed over the number of issues like this that we have found in the 1.6.0 code on this installation. I haven't seen anything similar in 1.4 installs, so I need to go back and try to figure out what causes this. There must be some pretty large architecture change that makes it happen. Even though my patches are wrong (which I kind of suspected) they do make the systems much more stable, so I'm convinced we found the right sickness even though my medicine might cause seriously bad effects in the future...
Regards,
/O
More information about the asterisk-dev
mailing list