[asterisk-dev] [Code Review] 3930: PJSIP: Resolve race condition regarding media handling when receiving 200 OK

Mark Michelson reviewboard at asterisk.org
Mon Aug 25 08:53:09 CDT 2014



> On Aug. 25, 2014, 1:12 p.m., Joshua Colp wrote:
> > Review your doxygen because I think you are repeating yourself, and also run through all the PJSIP tests. Do that and this is good to go.

I don't see the doxygen problem. One states that on transaction state changes, it is called before SDP negotiation and the other states that on transaction state changes it is called after SDP negotiation.


- Mark


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/3930/#review13164
-----------------------------------------------------------


On Aug. 24, 2014, 11:37 p.m., Mark Michelson wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviewboard.asterisk.org/r/3930/
> -----------------------------------------------------------
> 
> (Updated Aug. 24, 2014, 11:37 p.m.)
> 
> 
> Review request for Asterisk Developers.
> 
> 
> Bugs: ASTERISK-24212
>     https://issues.asterisk.org/jira/browse/ASTERISK-24212
> 
> 
> Repository: Asterisk
> 
> 
> Description
> -------
> 
> In /r/3927, I introduced a changeset that fixes a scheduler race condition. I noted on that issue that even though I was no longer seeing the same crashes when running tests, I was still seeing occasional test failures due to some media-related race condition.
> 
> I have determined the cause of this race condition. PJSIP's inv_session API provides two callbacks, on_state_changed() and on_tsx_state_changed(). Both of these are called into when the state of a transaction changes, and the documentation is not particularly clear about what the difference is between these. Through some investigation, I've found that what happens is:
> 
> 1) Transaction layer detects a state change (such as a request/response being sent/received, or a timeout)
> 2) Transaction layer informs the inv_session layer.
> 3) inv_session layer sets the new state and calls the on_state_changed() callback.
> 4) inv_session layer performs its processing of the transaction state change (including potential calls to the on_media_update() callback).
> 5) inv_session layer calls the on_tsx_state_changed() callback.
> 
> res_pjsip_session is hooked into step (3) such that when a new SIP request or response is sent or received, we call into session supplements to let them react to the new message. One of these session supplements is in chan_pjsip and queues an AST_CONTROL_ANSWER frame when a 200 OK response is received.
> 
> In the test where I am seeing the race condition occur, a call is originated to a PJSIP channel (Alice). Once Alice answers, her channel is placed into the dialplan to execute the TalkDetect() application. The problem here is that at step (3), the AST_CONTROL_ANSWER frame is queued onto Alice's channel, and the PBX thread moves the channel into the TalkDetect() application before step (4) is executed. Step (4) is important for setting formats on Alice's channel and setting appropriate translation paths as well. Since step (4) has not been completed yet, the attempt to play a file as part of TalkDetect() fails, and therefore the test fails. In most runs, however, step (4) does get completed before the PBX thread moves Alice's channel into TalkDetect().
> 
> The fix presented here is pretty simple. For transaction state changes, we are now hooked into step (5) instead of step (3) to call session supplements. This means that we do not call into session supplements until media has been negotiated on the channel. This eliminates the race condition entirely.
> 
> 
> Diffs
> -----
> 
>   /branches/12/res/res_pjsip_session.c 421883 
> 
> Diff: https://reviewboard.asterisk.org/r/3930/diff/
> 
> 
> Testing
> -------
> 
> I ran the tests/channels/pjsip/basic_calls/two_parties/nominal/alice_initiated/alice_hangs_up test in a loop on my console both before and after applying this patch. Before applying the patch, I would usually encounter a test failure within an hour or two. After applying this patch, I left the test running in a loop for over 24 hours and never had a test failure.
> 
> 
> Thanks,
> 
> Mark Michelson
> 
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.digium.com/pipermail/asterisk-dev/attachments/20140825/7eb17d7d/attachment.html>


More information about the asterisk-dev mailing list