<html>
<body>
<div style="font-family: Verdana, Arial, Helvetica, Sans-Serif;">
<table bgcolor="#f9f3c9" width="100%" cellpadding="8" style="border: 1px #c9c399 solid;">
<tr>
<td>
This is an automatically generated e-mail. To reply, visit:
<a href="https://reviewboard.asterisk.org/r/3930/">https://reviewboard.asterisk.org/r/3930/</a>
</td>
</tr>
</table>
<br />
<table bgcolor="#fefadf" width="100%" cellspacing="0" cellpadding="8" style="background-image: url('https://reviewboard.asterisk.org/static/rb/images/review_request_box_top_bg.ab6f3b1072c9.png'); background-position: left top; background-repeat: repeat-x; border: 1px black solid;">
<tr>
<td>
<div>Review request for Asterisk Developers.</div>
<div>By Mark Michelson.</div>
<p style="color: grey;"><i>Updated Aug. 24, 2014, 11:37 p.m.</i></p>
<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Changes</h1>
<table width="100%" bgcolor="#ffffff" cellspacing="0" cellpadding="10" style="border: 1px solid #b8b5a0">
<tr>
<td>
<pre style="margin: 0; padding: 0; white-space: pre-wrap; white-space: -moz-pre-wrap; white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;">Added doxygen for the two callbacks in res_pjsip_session.c</pre>
</td>
</tr>
</table>
<div style="margin-top: 1.5em;">
<b style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Bugs: </b>
<a href="https://issues.asterisk.org/jira/browse/ASTERISK-24212">ASTERISK-24212</a>
</div>
<div style="margin-top: 1.5em;">
<b style="color: #575012; font-size: 10pt;">Repository: </b>
Asterisk
</div>
<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Description </h1>
<table width="100%" bgcolor="#ffffff" cellspacing="0" cellpadding="10" style="border: 1px solid #b8b5a0">
<tr>
<td>
<pre style="margin: 0; padding: 0; white-space: pre-wrap; white-space: -moz-pre-wrap; white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;">In /r/3927, I introduced a changeset that fixes a scheduler race condition. I noted on that issue that even though I was no longer seeing the same crashes when running tests, I was still seeing occasional test failures due to some media-related race condition.
I have determined the cause of this race condition. PJSIP's inv_session API provides two callbacks, on_state_changed() and on_tsx_state_changed(). Both of these are called into when the state of a transaction changes, and the documentation is not particularly clear about what the difference is between these. Through some investigation, I've found that what happens is:
1) Transaction layer detects a state change (such as a request/response being sent/received, or a timeout)
2) Transaction layer informs the inv_session layer.
3) inv_session layer sets the new state and calls the on_state_changed() callback.
4) inv_session layer performs its processing of the transaction state change (including potential calls to the on_media_update() callback).
5) inv_session layer calls the on_tsx_state_changed() callback.
res_pjsip_session is hooked into step (3) such that when a new SIP request or response is sent or received, we call into session supplements to let them react to the new message. One of these session supplements is in chan_pjsip and queues an AST_CONTROL_ANSWER frame when a 200 OK response is received.
In the test where I am seeing the race condition occur, a call is originated to a PJSIP channel (Alice). Once Alice answers, her channel is placed into the dialplan to execute the TalkDetect() application. The problem here is that at step (3), the AST_CONTROL_ANSWER frame is queued onto Alice's channel, and the PBX thread moves the channel into the TalkDetect() application before step (4) is executed. Step (4) is important for setting formats on Alice's channel and setting appropriate translation paths as well. Since step (4) has not been completed yet, the attempt to play a file as part of TalkDetect() fails, and therefore the test fails. In most runs, however, step (4) does get completed before the PBX thread moves Alice's channel into TalkDetect().
The fix presented here is pretty simple. For transaction state changes, we are now hooked into step (5) instead of step (3) to call session supplements. This means that we do not call into session supplements until media has been negotiated on the channel. This eliminates the race condition entirely.</pre>
</td>
</tr>
</table>
<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Testing </h1>
<table width="100%" bgcolor="#ffffff" cellspacing="0" cellpadding="10" style="border: 1px solid #b8b5a0">
<tr>
<td>
<pre style="margin: 0; padding: 0; white-space: pre-wrap; white-space: -moz-pre-wrap; white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;">I ran the tests/channels/pjsip/basic_calls/two_parties/nominal/alice_initiated/alice_hangs_up test in a loop on my console both before and after applying this patch. Before applying the patch, I would usually encounter a test failure within an hour or two. After applying this patch, I left the test running in a loop for over 24 hours and never had a test failure.</pre>
</td>
</tr>
</table>
<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Diffs</b> (updated)</h1>
<ul style="margin-left: 3em; padding-left: 0;">
<li>/branches/12/res/res_pjsip_session.c <span style="color: grey">(421883)</span></li>
</ul>
<p><a href="https://reviewboard.asterisk.org/r/3930/diff/" style="margin-left: 3em;">View Diff</a></p>
</td>
</tr>
</table>
</div>
</body>
</html>