[asterisk-dev] [Code Review]: DIALOG_INFO_XML timeout notification stops BLF's from working.

Alec Davis reviewboard at asterisk.org
Thu Mar 29 03:17:16 CDT 2012



> On March 22, 2012, 6:14 p.m., Mark Michelson wrote:
> > No.
> > 
> > From RFC 4235 Section 4.1: "Versions start at 0, and increment by one for each new document sent to a subscriber. Versions are scoped within a subscription."
> > 
> > When the phone reSUBSCRIBEs with a new call-id, that is a new subscription, and so the version starts back at 0 again. The fact that the phone you are using is trying to compare the version of a new subscription against the version of a previous subscription is just completely wrong.
> > 
> > Now, in Asterisk, we try to work around some vendor's improper behavior, and it can probably be done here as well, just not in the way you have done it. What you have done is to make dialog-info subscriptions never expire. This means that you are purposely adding a sip_pvt reference leak into the code in the case that a subscription times out. If I understand things correctly, if you perform multiple reboots on your phones, then the number of active subscriptions in Asterisk will stack more and more. I'm willing to bet that if you run 'sip show subscriptions' in your setup after a few reboots of those phones, you'd see a LOT more subscriptions than there should be. In addition, if you look at the SIP traffic after a few reboots, there are probably way more NOTIFYs than necessary going to the phones on every state change.
> > 
> > I honestly have no idea how the code change you've presented causes things to work. When your phone reboots, I can see that a new subscription is created based on the fact that the XML has a new call-id. With your change, the old subscription that normally would have timed out does not now. What this means is that Asterisk is maintaining two subscriptions for the phone. Asterisk, on every dialog state change, will be sending two NOTIFYs to the phone. One will be for the old subscription, with a version the phone likes. The other NOTIFY will be for the new subscription, with a version starting at 0 and incrementing on each new NOTIFY. Now, here's where things get weird. The phone is presumably honoring the NOTIFYs for the old subscription, but not the NOTIFYs for the new subscription. Why? Does the phone still have knowledge of the old subscription despite the reboot? Does the phone just accept any unsolicited dialog-info NOTIFY? If the phone knows about the old subscription, and it also knows that it has created a new subscription, then why doesn't the phone unsubscribe from the old one? Why is it trying to apply a per-dialog version to multiple dialogs? Seriously, WTF?
> > 
> > I have a possible solution for this that is safer, but I don't know the specific mechanics behind the brokenness of these phones. Currently, the version reported in dialog-info is obtained from the sip_pvt, meaning each subscription maintains its own independent version counter. We could create an option so that for phones like yours, we store the version in the sip_peer instead. This way, if the peer resubscribes, then the version reported in the new subscription will continue from the same point it was in the old subscription. This would allow for the old subscription to still time out and have its resources released properly. If implementing this also causes BLF to be reported properly to your phones, then I'd be willing to accept such a change into Asterisk. Just expect some strongly worded vitriol in the sample config file about how such an option should never have to exist in the first place :)
> 
> Alec Davis wrote:
>     Working with the senario that a phone has restarted due to loss of power for some reason.
>     Trusting a phone to unsubscribe to a previous subscription is just wrong, it won't know what it's previous call-id was.
>     
>     The cleaner removed the power. Then plugged it back in... so many senarios :)
>     So the phone has just finished booting. 
>     
>        -- Registered SIP 'GXP0001' at 192.168.5.141:5066
>     [Mar 23 23:54:15] NOTICE[23637]: chan_sip.c:13138 state_notify_build_xml: ALEC <dialog id="8601GXP0001" call-id="e607e6a8e79400dd at 192.168.5.141"> version=0 terminated
>     astrid-test*CLI> sip show subscriptions
>     Peer             User             Call ID          Extension        Last state     Type            Mailbox    Expiry
>     192.168.5.141    GXP0001          e607e6a8e79400d  8601GXP0001 at tru  Idle           dialog-info+xml <none>     000300
>     192.168.5.141    GXP0001          5a1f3df6ce2e4fc  8601GXP0001 at tru  Idle           dialog-info+xml <none>     000300
>     192.168.5.141    GXP0001          052f0371d160f58  --               <none>         mwi             8612       000300
>     3 active SIP subscriptions
>     
>     [Mar 23 23:56:45] NOTICE[23637]: chan_sip.c:13138 state_notify_build_xml: ALEC <dialog id="8601GXP0001" call-id="e607e6a8e79400dd at 192.168.5.141"> version=1 terminated
>         -- Registered SIP 'GXP0001' at 192.168.5.141:5064
>     [Mar 23 23:58:25] NOTICE[23637]: chan_sip.c:13133 state_notify_build_xml: ALEC <dialog id="8601GXP0001" call-id="5a1f3df6ce2e4fcf at 192.168.5.141" direction="recipient"> version=2 terminated
>     [Mar 23 23:58:25] NOTICE[23637]: chan_sip.c:3868 __sip_autodestruct: ALEC TIMEOUT of SIP subscription 5a1f3df6ce2e4fcf at 192.168.5.141
>     
>     astrid-test*CLI> sip show subscriptions
>     Peer             User             Call ID          Extension        Last state     Type            Mailbox    Expiry
>     192.168.5.141    GXP0001          e607e6a8e79400d  8601GXP0001 at tru  Idle           dialog-info+xml <none>     000300
>     192.168.5.141    GXP0001          052f0371d160f58  --               <none>         mwi             8612       000300
>     2 active SIP subscriptions
>     
>     From this, I take it as call-id of '5a1f3df6ce2e4fc' has been killed off, and with my change the phone wasn't sent the timeout to the device.
>     
>     I'm wondering why we didn't finish up with 2 MWI subscriptions, there may be a clue there, did the code kill off the previous mwi subscription when another for the same device arrived, or did the phone unsubscribe.
>
> 
> Mark Michelson wrote:
>         "Working with the senario that a phone has restarted due to loss of power for some reason.
>          Trusting a phone to unsubscribe to a previous subscription is just wrong, it won't know what it's previous call-id was.
>     
>          The cleaner removed the power. Then plugged it back in... so many senarios :)"
>     
>     Yes, this is why there's a timeout in place :)
>     
>     Was this output taken from your patched system or from an unpatched Asterisk?
>     
>     Looking at your change again, I missed something. It looks like old subscriptions will time out in Asterisk, it's just that Asterisk won't bother sending a NOTIFY to let subscribers know that they have timed out. What I still don't understand though is how this change actually fixes the problem you originally had. I think I'd need to see a SIP debug of the post-phone-reboot packets both before and after your patch to fully understand what's up here.
>     
>     Despite not having all the information in front of me, I have a bit of an idea of what might be happening.
>     
>     1. The phone is subscribed to Asterisk.
>     2. Asterisk sends some NOTIFYs to the phone. Let's say it sends 5 of them. This means that the last version number Asterisk has sent to the phone is 4 (since version starts at 0)
>     3. Phone reboots. It loses all memory of the old subscription and the old version number.
>     4. Phone starts a new subscription with Asterisk. Asterisk sends an initial NOTIFY for this new subscription with version 0.
>     5. Original subscription times out. Asterisk sends a NOTIFY to end the old subscription. The version number in this NOTIFY is 5.
>     6. The phone, for some reason, even though it should have not knowledge of this old subscription, stores 5 as the current version number (How does the phone respond to this NOTIFY? 200? 481?)
>     7. Asterisk sends NOTIFYs for the new subscription, incrementing version every time. Since the phone has 5 as its current version, it will ignore all NOTIFYs until version 6 arrives.
>     
>     If this is what's happening, then the problem isn't that the phone is remembering version numbers across reboots. The problem is actually that the phone will honor version numbers for NOTIFYs for subscriptions it does not, or at least should not, know about (which seems pretty exploitable if you ask me). The solution I proposed before may work, although it might not. At step 4 above, Asterisk will send an initial NOTIFY with version 0. The phone may be programmed not to accept NOTIFYs on a subscription that do not start with version 0. If that's the case, my solution will not work. And if my solution does not work, then I don't know of a good one off the top of my head.
>     
>     I still do not agree with this patch though. Most devices, including the newer generation phones you are using, properly handle a NOTIFY that ends a timed-out subscription. If anything, the solution for this will have to be implemented as a new option because there is no way we will unconditionally never send a subscription-ending NOTIFY for dialog-info events.
> 
> Alec Davis wrote:
>     
>     Yes this is unpatched, expected for the 3 DEBUG prints.
>     
>     [code}
>     Index: channels/chan_sip.c
>     ===================================================================
>     --- channels/chan_sip.c (revision 359029)
>     +++ channels/chan_sip.c (working copy)
>     @@ -3865,6 +3865,7 @@
>                     p->subscribed = NONE;
>                     append_history(p, "Subscribestatus", "timeout");
>                     ast_debug(3, "Re-scheduled destruction of SIP subscription %s\n", p->callid ? p->callid : "<unknown>");
>     +               ast_log(LOG_NOTICE, "ALEC TIMEOUT of SIP subscription %s\n", p->callid ? p->callid : "<unknown>");
>                     return 10000;   /* Reschedule this destruction so that we know that it's gone */
>             }
>     
>     @@ -13127,10 +13130,12 @@
>                                                     local_display, local_target, local_target, mto, mto);
>                             } else {
>                                     ast_str_append(tmp, 0, "<dialog id=\"%s\" direction=\"recipient\">\n", exten);
>     +                               ast_log(LOG_NOTICE, "ALEC <dialog id=\"%s\" call-id=\"%s\" direction=\"recipient\"> version=%u %s\n", exten, p->callid, p->dialogver, statestring);
>                             }
>     
>                     } else {
>                             ast_str_append(tmp, 0, "<dialog id=\"%s\">\n", exten);
>     +                       ast_log(LOG_NOTICE, "ALEC <dialog id=\"%s\" call-id=\"%s\"> version=%u %s\n", exten, p->callid, p->dialogver, statestring);
>                     }
>                     ast_str_append(tmp, 0, "<state>%s</state>\n", statestring);
>                     if (state == AST_EXTENSION_ONHOLD) {
>     {code}
>     
>     Regarding steps 1-7, you have it exactly.
>     answer for step 6: the phone obviously isn't matching call-id.
>     
>     I agree, it's only a GXP20XX issue, Grandstream have fixed it, with the GXP21XX series.
>     I have access to a SNOM, and it works fine.
>     
>     I have seen on one of the other open PBX forums, the same reported issue, there fix was to remove any previous subscriptions that match peer and entity. I will try to refind it. 
>

Seems 1.2 had code to remove old duplicate subscriptions:

{code}
                        /* remove any old subscription from this peer for the same exten/context,
                           as the peer has obviously forgotten about it and it's wasteful to wait
                           for it to expire and send NOTIFY messages to the peer only to have them
                           ignored (or generate errors)
                        */
                        ast_mutex_lock(&iflock);
                        for (p_old = iflist; p_old; p_old = p_old->next) {
                                if (p_old == p)
                                        continue;
                                if (p_old->initreq.method != SIP_SUBSCRIBE)
                                        continue;
                                if (p_old->subscribed == NONE)
                                        continue;
                                ast_mutex_lock(&p_old->lock);
                                if (!strcmp(p_old->username, p->username)) {
                                        if (!strcmp(p_old->exten, p->exten) &&
                                            !strcmp(p_old->context, p->context)) {
                                                ast_set_flag(p_old, SIP_NEEDDESTROY);
                                                ast_mutex_unlock(&p_old->lock);
                                                break;
                                        }
                                }
                                ast_mutex_unlock(&p_old->lock);
                        }
                        ast_mutex_unlock(&iflock);

{code}


- Alec


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/1813/#review5867
-----------------------------------------------------------


On March 13, 2012, 5:35 p.m., Alec Davis wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviewboard.asterisk.org/r/1813/
> -----------------------------------------------------------
> 
> (Updated March 13, 2012, 5:35 p.m.)
> 
> 
> Review request for Asterisk Developers.
> 
> 
> Summary
> -------
> 
> Some phones (Grandstream GXP20XX series 1.2.5.3 firmware) if rebooted, loose the BLF lights due to the new subscription's "version" being less than the previous subscription's "version".
> This happens the instant that asterisk timeouts the previous subscription - which has the much larger "version" sequence number.
> 
> The only way for the lights to start working again, is for the current subscription's "version" number to increment past the previous old "version" number.
> That period of time could be huge, if the phone isn't rebooted for days.
> 
> The workaround has been to reboot the phone twice, within a few minutes of each reboot.
>   1st time wait for the BLF's to fail (then reboot again) - which is the timeout period - a few minutes.
>   2nd time the lights will work, then fail, then after a few minutes start working again.
> 
> This fix prevents the old subscription timeout from updating the phones "version" number. 
> 
> This issue doesn't affter the GXP21XX series.
> 
> Attempt to show issue byway of example below with some debug output showing call-id and version number.
> 
> astrid-test*CLI> core show hints
> 
>     -= Registered Asterisk Dial Plan Hints =-
>             8601GXP0001 at trusted             : Custom:q8601_a8612    State:InUse           Watchers  1
> Debug output:
> 
> ===Set the BLF state to InUse
> - Executing [s at macro-custom-agent-inout:13] Set("SIP/GXP0001-00000005", "DEVICE_STATE(Custom:q8601_a8612)=INUSE") in new stack
> ...
> [Mar 14 11:01:39] NOTICE[10819]: chan_sip.c:13136 state_notify_build_xml: ALEC <dialog id="8601GXP0001" call-id="3c65aed02ae1d5df at 192.168.5.141"> version=5
> 
> =============
> Reboot the phone, it gets a new BLF subscription.
> =============
> [Mar 14 11:02:30] NOTICE[10826]: chan_sip.c:13136 state_notify_build_xml: ALEC <dialog id="8601GXP0001" call-id="79956c0168125400 at 192.168.5.141"> version=0
> ===The BLF has correct state of InUse.
> 
> ...
> [Mar 14 11:05:00] NOTICE[10826]: chan_sip.c:13136 state_notify_build_xml: ALEC <dialog id="8601GXP0001" call-id="79956c0168125400 at 192.168.5.141"> version=1
> ===The BLF still has correct state of InUse.
> 
> ...
> The timeout 'last notification'
> [Mar 14 11:05:09] NOTICE[10826]: chan_sip.c:13131 state_notify_build_xml: ALEC <dialog id="8601GXP0001" call-id="3c65aed02ae1d5df at 192.168.5.141" direction="recipient"> version=6
> [Mar 14 11:05:09] NOTICE[10826]: chan_sip.c:3868 __sip_autodestruct: ALEC TIMEOUT of SIP subscription 3c65aed02ae1d5df at 192.168.5.141
> ===The BLF has incorrectly gone Idle.
> 
> ...
> [Mar 14 11:15:01] NOTICE[10826]: chan_sip.c:13136 state_notify_build_xml: ALEC <dialog id="8601GXP0001" call-id="79956c0168125400 at 192.168.5.141"> version=5
> ===The BLF is still incorrect at Idle.
> 
>     -- Registered SIP 'GXP0001' at 192.168.5.141:5066
> [Mar 14 11:17:31] NOTICE[10826]: chan_sip.c:13136 state_notify_build_xml: ALEC <dialog id="8601GXP0001" call-id="79956c0168125400 at 192.168.5.141"> version=6
> ===The BLF now has correct state of InUse.
> 
> 
> Diffs
> -----
> 
>   trunk/channels/chan_sip.c 358608 
> 
> Diff: https://reviewboard.asterisk.org/r/1813/diff
> 
> 
> Testing
> -------
> 
> In place on a 1.8 deployment for months, our production box with ~80 GXP phones, half of which are the older GXP20XX series.
> 
> 
> Thanks,
> 
> Alec
> 
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.digium.com/pipermail/asterisk-dev/attachments/20120329/e38308ce/attachment-0001.htm>


More information about the asterisk-dev mailing list