[asterisk-dev] [Code Review] 3930: PJSIP: Resolve race condition regarding media handling when receiving 200 OK

Mark Michelson reviewboard at asterisk.org
Tue Sep 2 13:17:09 CDT 2014


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/3930/
-----------------------------------------------------------

(Updated Sept. 2, 2014, 1:17 p.m.)


Status
------

This change has been marked as submitted.


Review request for Asterisk Developers.


Changes
-------

Committed in revision 422536


Bugs: ASTERISK-24212
    https://issues.asterisk.org/jira/browse/ASTERISK-24212


Repository: Asterisk


Description
-------

In /r/3927, I introduced a changeset that fixes a scheduler race condition. I noted on that issue that even though I was no longer seeing the same crashes when running tests, I was still seeing occasional test failures due to some media-related race condition.

I have determined the cause of this race condition. PJSIP's inv_session API provides two callbacks, on_state_changed() and on_tsx_state_changed(). Both of these are called into when the state of a transaction changes, and the documentation is not particularly clear about what the difference is between these. Through some investigation, I've found that what happens is:

1) Transaction layer detects a state change (such as a request/response being sent/received, or a timeout)
2) Transaction layer informs the inv_session layer.
3) inv_session layer sets the new state and calls the on_state_changed() callback.
4) inv_session layer performs its processing of the transaction state change (including potential calls to the on_media_update() callback).
5) inv_session layer calls the on_tsx_state_changed() callback.

res_pjsip_session is hooked into step (3) such that when a new SIP request or response is sent or received, we call into session supplements to let them react to the new message. One of these session supplements is in chan_pjsip and queues an AST_CONTROL_ANSWER frame when a 200 OK response is received.

In the test where I am seeing the race condition occur, a call is originated to a PJSIP channel (Alice). Once Alice answers, her channel is placed into the dialplan to execute the TalkDetect() application. The problem here is that at step (3), the AST_CONTROL_ANSWER frame is queued onto Alice's channel, and the PBX thread moves the channel into the TalkDetect() application before step (4) is executed. Step (4) is important for setting formats on Alice's channel and setting appropriate translation paths as well. Since step (4) has not been completed yet, the attempt to play a file as part of TalkDetect() fails, and therefore the test fails. In most runs, however, step (4) does get completed before the PBX thread moves Alice's channel into TalkDetect().

The fix presented here is pretty simple. For transaction state changes, we are now hooked into step (5) instead of step (3) to call session supplements. This means that we do not call into session supplements until media has been negotiated on the channel. This eliminates the race condition entirely.


***EDIT 26 Aug, 2014***
After running all PJSIP tests in the testsuite, I found that things weren't quite so cut and dry as they were initially presented here. Here are some additional findings:
* For incoming 3XX responses, it's actually at step (2) that we get told about the redirection. There are some session supplements that need to be called back at this time.
* Not all session supplements can safely wait until after media handling to be called into. For instance, the NAT handling code that rewrites contacts needs to be called into before media handling on an inbound 200 OK.
* In something pretty much completely unrelated, I found that there was an undiscovered bug in res_pjsip_session.c that prevented incoming reinvites from reaching session supplements.

With regards to the first two bullet points, I have added a "response_priority" field to session supplements. This allows for a session supplement to tell res_pjsip_session at  what point during response handling it should be called into. All but two session supplements are called into at the same point that they were prior to writing this patch. However, two supplements now specify when they should be called into:
1) res_pjsip_diversion's supplement specifies to be called into before processing redirections.
2) One of chan_pjsip's session supplements specifies to be called into after media negotiation.

I discovered the bug in the third bullet by accidentally fixing it. Because of this change, there are some session supplements in chan_pjsip that now will check INVITE state and stop processing if dealing with a reinvite.


Diffs
-----

  /branches/12/res/res_pjsip_session.c 421883 
  /branches/12/res/res_pjsip_diversion.c 421883 
  /branches/12/include/asterisk/res_pjsip_session.h 421883 
  /branches/12/channels/chan_pjsip.c 421883 

Diff: https://reviewboard.asterisk.org/r/3930/diff/


Testing
-------

I ran the tests/channels/pjsip/basic_calls/two_parties/nominal/alice_initiated/alice_hangs_up test in a loop on my console both before and after applying this patch. Before applying the patch, I would usually encounter a test failure within an hour or two. After applying this patch, I left the test running in a loop for over 24 hours and never had a test failure.


Thanks,

Mark Michelson

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.digium.com/pipermail/asterisk-dev/attachments/20140902/13217669/attachment.html>


More information about the asterisk-dev mailing list