[asterisk-dev] [Code Review] Fix broken reinvite glare scenario
elguero
reviewboard at asterisk.org
Fri May 11 15:51:51 CDT 2012
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/1911/#review6202
-----------------------------------------------------------
Ship it!
Not sure if I am an expert yet on chan_sip but your very nice explanation and the code changes seem sane.
- elguero
On May 11, 2012, 12:40 p.m., Mark Michelson wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviewboard.asterisk.org/r/1911/
> -----------------------------------------------------------
>
> (Updated May 11, 2012, 12:40 p.m.)
>
>
> Review request for Asterisk Developers and opticron.
>
>
> Summary
> -------
>
> Strap in. This one's rough.
>
> I was going to test reinvite glare when setting up direct media in Asterisk 1.8 to see what sort of delay this causes when setting up media. Unfortunately, I found that reinvite glare is actually broken in current 1.8. Specifically, if two Asterisk boxes are in a reinvite glare situation with one another, then the two will continuously send 491 responses and ACKs to one-another until both transactions time out (by default 32 seconds). When this happens, the call is dropped. Also worth noting is that because direct media is only partially set up, there is one-way audio on the call until it is dropped.
>
> After looking at SIP debugs, the issue is obvious. The ACK sent by each Asterisk for the 491 has the To and From headers reversed. This results in tag mismatches when attempting to find the corresponding dialog.
>
> I traced the issue to a commit made in March of this year (revision 358115, specifically). The offending patch added the following to several error cases in handle_request_invite:
>
> p->pendinginvite = seqno;
> check_via(p, req);
> copy_request(&p->initreq, req);
>
> One such case is when Asterisk responds to an INVITE with a 491. The issue is caused specifically by that last line. Assume there are two Asterisk boxes (Asterisk 1 and Asterisk 2) that are in a reinvite glare scenario. Here is a step-by-step illustration of what occurs from the perspective of Asterisk 1.
>
> 1. Asterisk 1 sends a reinvite to Asterisk 2. Doing so causes the sip_pvt's initreq to be set to this outbound reinvite request.
> 2. Asterisk 1 receives a reinvite from Asterisk 2. The sip_pvt's initreq is set to this inbound reinvite request.
> 3. Asterisk 1 responds to the reinvite from Asterisk 2 with a 491 since there is a pending INVITE transaction on this dialog.
> 4. Asterisk 1 receives a 491 response from Asterisk 2 since Asterisk 2 also has a pending INVITE transaction on this dialog.
>
> Asterisk 1 wants to send an ACK to Asterisk 2. When constructing the ACK, Asterisk 1 refers to the sip_pvt's initreq and the SIP_OUTBOUND flag to determine how to construct the From and To headers. In this case, since the SIP_OUTBOUND flag is set, this means that the From and To headers are copied as is to the outbound request from the initreq. The issue here is that the initreq is not the outbound reinvite; it's the inbound reinvite from Asterisk 2. This means that the From and To headers are reversed from what they should be.
>
> (I'll note here that determining what the From and To headers should be based on the initial request is not the ideal way to do this, but that's a change for another day)
>
> So, why was this change added in the first place? The issue being fixed was ASTERISK-19303. In that case, an INVITE with Replaces was being sent to Asterisk, and its Replaces header referred to a non-existent dialog. Asterisk would respond with a 481, but then would not process the corresponding inbound ACK. The reason for this was based on some code in handle_incoming() that tried to be sure not to handle out-of-dialog requests that have a to-tag. Since initreq was not being initialized in the INVITE with Replaces error scenario, ACKs, which contain a to-tag, would be dropped since they were seen as an out-of-dialog request with a to-tag.
>
> First off, the code in handle_incoming() is superfluous. In find_call(), we reject the incoming request if it has a to-tag and we have not yet allocated a local tag for the dialog. The code in handle_incoming() can be removed with no negative affects.
>
> With knowledge that this code can be removed, I considered simply undoing the change in revision 358115 entirely. Looking into how a sip_pvt's initreq is used, I found the following
>
> 1. It is used in handle_incoming in several places (including the code that can be removed) in order to determine if the incoming request is out-of-dialog.
> 2. It is used as the basis for certain headers when sending requests. This is only useful if the initreq refers to an outbound initial request.
> 3. It is used as the basis for certain headers when sending responses to asynchronous requests. By "asynchronous requests" I mean requests that do not receive an immediate response from Asterisk. A good example of this would be the 200 OK for an initial SIP INVITE. The code has to enter the dialplan and possibly make outbound calls before the 200 OK can be sent.
>
> So what can we conclude from this with regards to inbound INVITEs? If an out-of-dialog INVITE arrives, we absolutely MUST set the initreq to the inbound INVITE. By not doing this, the code in handle_incoming() will assume subsequent in-dialog messages are out-of-dialog. The end result is the possible premature ending of the dialog if an erroneously constructed in-dialog request arrives later. An in-dialog inbound INVITE also MUST have the initreq set to the inbound INVITE if the INVITE is to be responded to asynchronously.
>
> What this means, then, is that in the case where Asterisk is going to send a 491, it is not imperative to set the initreq to the inbound reinvite since
> 1. handle_incoming will recognize subsequent requests as being in-dialog since the initreq was set by the initial out-of-dialog INVITE
> 2. the 491 response is sent immediately.
>
> There are three sets of changes in this diff:
>
> 1. A large block in handle_incoming() has been removed because it is superfluous. find_call() already performs the necessary check for us.
> 2. When sending a 491 response to an INVITE, we do not copy the request into the sip_pvt's initreq.
> 3. I have removed a bunch of superfluous settings of p->pendinginvite from handle_request_invite(). This gets set by calls to __sip_reliable_xmit() (via transmit_response_reliable()).
>
> Note that this change will get the job done, but there are better alternatives for doing this. Unfortunately, the better alternatives are much more invasive.
>
> The real correct change is to give chan_sip a proper transaction layer.
>
> The more realistic change is to store the To and From headers for a dialog on the sip_pvt as "local" and "remote" headers. Then, on requests, always place the local header in the From and remote in the To. On responses, place the local in the To and the remote in the From. This way, we do not need to rely on the so-called "initial request" of a transaction in order to determine proper header placement.
>
>
> Diffs
> -----
>
> /branches/1.8/channels/chan_sip.c 366100
>
> Diff: https://reviewboard.asterisk.org/r/1911/diff
>
>
> Testing
> -------
>
> I have tested reinvite glares with this patch in place and now have two-way audio and no dropped call.
>
>
> Thanks,
>
> Mark
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.digium.com/pipermail/asterisk-dev/attachments/20120511/36617321/attachment.htm>
More information about the asterisk-dev
mailing list