[asterisk-dev] [Code Review] 3069: Fix deadlock experienced during multi-party PJSIP transfer through masquerade rework

Tue Dec 17 16:54:21 CST 2013

> On Dec. 16, 2013, 8:26 p.m., rmudgett wrote:
> > /branches/12/main/channel.c, lines 2687-2704
> > <https://reviewboard.asterisk.org/r/3069/diff/1/?file=49555#file49555line2687>
> >
> >     It is probably simpler now to just make ast_hangup() wait for any masquerade in progress to complete.
> >     
> >     while (ast_channel_masq() || ast_channel_masqr()) {
> >       unlock(chan);
> >       usleep(1);
> >       lock(chan);
> >     }
> 
> Mark Michelson wrote:
>     I'm actually beginning to rethink my original change a bit.
>     
>     In channel_do_masquerade() (yes, I renamed it to that), the masq field on the original is cleared quite earlier than I expected. The masq is cleared from the original, then the original is unlocked, and then further operations are performed on the original afterward, to include indicating a visible indication and indicating SRCCHANGE frames. When the original becomes unlocked at this point in the masquerade, the ast_hangup() operation will be able to grab the channel lock, will see that ast_channel_masq() evaluates NULL, and will continue with the hangup, thus hanging up the channel at the same time that frames are being indicated on it. I think the masq field on the channel needs to be cleared at the very end of the masquerade in order to prevent this from happening. Do you agree?
>     
>     As far as the masqr field is concerned, if I make the suggested change here, then I should also remove the clone_was_zombie logic from channel_do_masquerade() since that situation would no longer be possible. If the channel is marked as a masqr and then ast_hangup() is called, then the hangup will wait for the masquerade to complete before marking the channel as a zombie. If ast_hangup() is called first, then the channel cannot be marked as a masqr since it will already be marked as a zombie. Thus, when channel_do_masquerade() is called, it will be impossible for the clonechan to be marked as a zombie. Does this logic seem sound?

Yes.  Clearing the masq and masqr pointers at the end should work fine.  There cannot be another masquerade started anyway because of the masquerade lock around ast_channel_move().

Yes.  Remove the clone_was_zombie is true code.  It is dead code with the proposed ast_hangup() change.  If ast_hangup() is called first then ast_channel_move() will fail before channel_do_masquerade() is called because ast_hangup() has marked it as a zombie.

> On Dec. 16, 2013, 8:26 p.m., rmudgett wrote:
> > /branches/12/main/channel.c, lines 10399-10400
> > <https://reviewboard.asterisk.org/r/3069/diff/1/?file=49555#file49555line10399>
> >
> >     One thing that ast_channel_masquerade() did that is not being done anymore is queuing null frames to wake up threads before the masquerade takes place.
> >     
> >     Maybe do_channel_masquerade() needs to poke both channels to wake up their respective caretaker threads instead.  Those threads could be in the ast_read/ast_write functions.
> 
> Mark Michelson wrote:
>     The old ast_channel_masquerade() would queue null frames, but I interpreted that as being done so that when the thread servicing the channel woke up, the subsequent ast_read() call would end up performing the masquerade. Since we set up and perform the masquerade all at once now, queuing a null frame at the time of marking the channels as being involved in a masquerade seems like the wrong move.
>     
>     I *think* that signaling blocking threads is actually taken care of already.
>     
>     The clonechan has a null frame queued onto it when it gets marked as a zombie.
>     The original channel has a SIGURG sent to its blocking thread shortly after.
>     
>     Am I correct that this should result in any blocking threads being awoken correctly?

I was hoping you would know with more certainty than I since the channel fds are swapped around. :)

- rmudgett

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/3069/#review10423
-----------------------------------------------------------

On Dec. 12, 2013, 9:59 p.m., Mark Michelson wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviewboard.asterisk.org/r/3069/
> -----------------------------------------------------------
> 
> (Updated Dec. 12, 2013, 9:59 p.m.)
> 
> 
> Review request for Asterisk Developers, Joshua Colp, Jonathan Rose, Matt Jordan, and rmudgett.
> 
> 
> Bugs: ASTERISK-22936
>     https://issues.asterisk.org/jira/browse/ASTERISK-22936
> 
> 
> Repository: Asterisk
> 
> 
> Description
> -------
> 
> This is more-or-less a take 2 of https://reviewboard.asterisk.org/r/3042
> 
> Initially, it was discovered that performing an attended transfer of a multiparty bridge with a PJSIP channel would cause a deadlock. A PBX thread started a masquerade and reached the point where it was calling the fixup() callback on the "original" channel. For chan_pjsip, this involves pushing a synchronous task to the session's serializer. The problem was that a task ahead of the fixup task was also attempting to perform a channel masquerade. However, since masquerades are designed in a way to only allow for one to occur at a time, the task ahead of the fixup could not continue until the masquerade already in progress had completed. And of course, the masquerade in progress could not complete until the task ahead of the fixup task had completed. Deadlock.
> 
> The initial fix was to change the fixup task to be asynchronous. While this prevented the deadlock from occurring, it had the frightful side effect of potentially allowing for tasks in the session's serializer to operate on a zombie channel.
> 
> Taking a step back from this particular deadlock, it became clear that the problem was not really this one particular issue but that masquerades themselves needed to be addressed. A PJSIP attended transfer operation calls ast_channel_move(), which attempts to both set up and execute a masquerade. The problem was that after it had set up the masquerade, the PBX thread had swooped in and tried to actually perform the masquerade. Looking at changes that had been made to Asterisk 12, it became clear that there never is any time now that anyone ever wants to set up a masquerade and allow for the channel thread to actually perform the masquerade. Everyone always is calling ast_channel_move(), performs the masquerade itself before returning.
> 
> In this patch, I have removed all blocks of code from channel.c that will attempt to perform a masquerade if ast_channel_masq() returns true. Now, there is no distinction between setting up a masquerade and performing the masquerade. It is one operation. The only remaining checks for ast_channel_masq() and ast_channel_masqr() are in ast_hangup() since we do not want to interrupt a masquerade by hanging up the channel. Instead, now ast_hangup() will wait for a masquerade to complete before moving forward with its operation.
> 
> The ast_channel_move() function has been modified to basically in-line the logic that used to be in ast_channel_masquerade(). ast_channel_masquerade() has been killed off for real. ast_channel_move() now has a lock associated with it that is used to prevent any simultaneous moves from occurring at once. This means there is no need to make sure that ast_channel_masq() or ast_channel_masqr() are already set on a channel when ast_channel_move() is called. It also means the channel container lock is not pulling double duty by both keeping the container locked and preventing multiple masquerades from occurring simultaneously.
> 
> The ast_do_masquerade() function has been renamed to do_channel_masquerade() and is now internal to channel.c. The function now takes explicit arguments of which channels are involved in the masquerade instead of a single channel. While it probably is possible to do some further refactoring of this method, I feel that I would be treading dangerously. Instead, all I did was change some comments that no longer are true after this changeset.
> 
> The other more minor change introduced in this patch is to res_pjsip.c to make ast_sip_push_task_synchronous() run the task in-place if we are already a SIP servant thread. This is related to this patch because even when we isolate the channel masquerade to only running in the SIP servant thread, we would still deadlock when the fixup() callback is reached since we would essentially be waiting forever for ourselves to finish before actually running the fixup. This makes it so the fixup is run without having to push a task into a serializer at all.
> 
> 
> Diffs
> -----
> 
>   /branches/12/res/res_pjsip.c 403349 
>   /branches/12/main/channel.c 403343 
>   /branches/12/include/asterisk/channel.h 403343 
> 
> Diff: https://reviewboard.asterisk.org/r/3069/diff/
> 
> 
> Testing
> -------
> 
> Jonathan ran the same test he had originally run in order to make the deadlock occur, and with this patch it no longer occurs. I also ran a bevy of testsuite tests to see if there were any that did not pass with this patch but did pass without the patch. I found none on my machine, though I'll admit I did not run every test.
> 
> 
> Thanks,
> 
> Mark Michelson
> 
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.digium.com/pipermail/asterisk-dev/attachments/20131217/bf6d7375/attachment-0001.html>