<html>
<body>
<div style="font-family: Verdana, Arial, Helvetica, Sans-Serif;">
<table bgcolor="#f9f3c9" width="100%" cellpadding="8" style="border: 1px #c9c399 solid;">
<tr>
<td>
This is an automatically generated e-mail. To reply, visit:
<a href="https://reviewboard.asterisk.org/r/1813/">https://reviewboard.asterisk.org/r/1813/</a>
</td>
</tr>
</table>
<br />
<blockquote style="margin-left: 1em; border-left: 2px solid #d0d0d0; padding-left: 10px;">
<p style="margin-top: 0;">On March 22nd, 2012, 6:14 p.m., <b>Mark Michelson</b> wrote:</p>
<blockquote style="margin-left: 1em; border-left: 2px solid #d0d0d0; padding-left: 10px;">
<pre style="white-space: pre-wrap; white-space: -moz-pre-wrap; white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;">No.
>From RFC 4235 Section 4.1: "Versions start at 0, and increment by one for each new document sent to a subscriber. Versions are scoped within a subscription."
When the phone reSUBSCRIBEs with a new call-id, that is a new subscription, and so the version starts back at 0 again. The fact that the phone you are using is trying to compare the version of a new subscription against the version of a previous subscription is just completely wrong.
Now, in Asterisk, we try to work around some vendor's improper behavior, and it can probably be done here as well, just not in the way you have done it. What you have done is to make dialog-info subscriptions never expire. This means that you are purposely adding a sip_pvt reference leak into the code in the case that a subscription times out. If I understand things correctly, if you perform multiple reboots on your phones, then the number of active subscriptions in Asterisk will stack more and more. I'm willing to bet that if you run 'sip show subscriptions' in your setup after a few reboots of those phones, you'd see a LOT more subscriptions than there should be. In addition, if you look at the SIP traffic after a few reboots, there are probably way more NOTIFYs than necessary going to the phones on every state change.
I honestly have no idea how the code change you've presented causes things to work. When your phone reboots, I can see that a new subscription is created based on the fact that the XML has a new call-id. With your change, the old subscription that normally would have timed out does not now. What this means is that Asterisk is maintaining two subscriptions for the phone. Asterisk, on every dialog state change, will be sending two NOTIFYs to the phone. One will be for the old subscription, with a version the phone likes. The other NOTIFY will be for the new subscription, with a version starting at 0 and incrementing on each new NOTIFY. Now, here's where things get weird. The phone is presumably honoring the NOTIFYs for the old subscription, but not the NOTIFYs for the new subscription. Why? Does the phone still have knowledge of the old subscription despite the reboot? Does the phone just accept any unsolicited dialog-info NOTIFY? If the phone knows about the old subscription, and it also knows that it has created a new subscription, then why doesn't the phone unsubscribe from the old one? Why is it trying to apply a per-dialog version to multiple dialogs? Seriously, WTF?
I have a possible solution for this that is safer, but I don't know the specific mechanics behind the brokenness of these phones. Currently, the version reported in dialog-info is obtained from the sip_pvt, meaning each subscription maintains its own independent version counter. We could create an option so that for phones like yours, we store the version in the sip_peer instead. This way, if the peer resubscribes, then the version reported in the new subscription will continue from the same point it was in the old subscription. This would allow for the old subscription to still time out and have its resources released properly. If implementing this also causes BLF to be reported properly to your phones, then I'd be willing to accept such a change into Asterisk. Just expect some strongly worded vitriol in the sample config file about how such an option should never have to exist in the first place :)</pre>
</blockquote>
<p>On March 23rd, 2012, 6:23 a.m., <b>Alec Davis</b> wrote:</p>
<blockquote style="margin-left: 1em; border-left: 2px solid #d0d0d0; padding-left: 10px;">
<pre style="white-space: pre-wrap; white-space: -moz-pre-wrap; white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;">Working with the senario that a phone has restarted due to loss of power for some reason.
Trusting a phone to unsubscribe to a previous subscription is just wrong, it won't know what it's previous call-id was.
The cleaner removed the power. Then plugged it back in... so many senarios :)
So the phone has just finished booting.
-- Registered SIP 'GXP0001' at 192.168.5.141:5066
[Mar 23 23:54:15] NOTICE[23637]: chan_sip.c:13138 state_notify_build_xml: ALEC <dialog id="8601GXP0001" call-id="e607e6a8e79400dd@192.168.5.141"> version=0 terminated
astrid-test*CLI> sip show subscriptions
Peer User Call ID Extension Last state Type Mailbox Expiry
192.168.5.141 GXP0001 e607e6a8e79400d 8601GXP0001@tru Idle dialog-info+xml <none> 000300
192.168.5.141 GXP0001 5a1f3df6ce2e4fc 8601GXP0001@tru Idle dialog-info+xml <none> 000300
192.168.5.141 GXP0001 052f0371d160f58 -- <none> mwi 8612 000300
3 active SIP subscriptions
[Mar 23 23:56:45] NOTICE[23637]: chan_sip.c:13138 state_notify_build_xml: ALEC <dialog id="8601GXP0001" call-id="e607e6a8e79400dd@192.168.5.141"> version=1 terminated
-- Registered SIP 'GXP0001' at 192.168.5.141:5064
[Mar 23 23:58:25] NOTICE[23637]: chan_sip.c:13133 state_notify_build_xml: ALEC <dialog id="8601GXP0001" call-id="5a1f3df6ce2e4fcf@192.168.5.141" direction="recipient"> version=2 terminated
[Mar 23 23:58:25] NOTICE[23637]: chan_sip.c:3868 __sip_autodestruct: ALEC TIMEOUT of SIP subscription 5a1f3df6ce2e4fcf@192.168.5.141
astrid-test*CLI> sip show subscriptions
Peer User Call ID Extension Last state Type Mailbox Expiry
192.168.5.141 GXP0001 e607e6a8e79400d 8601GXP0001@tru Idle dialog-info+xml <none> 000300
192.168.5.141 GXP0001 052f0371d160f58 -- <none> mwi 8612 000300
2 active SIP subscriptions
>From this, I take it as call-id of '5a1f3df6ce2e4fc' has been killed off, and with my change the phone wasn't sent the timeout to the device.
I'm wondering why we didn't finish up with 2 MWI subscriptions, there may be a clue there, did the code kill off the previous mwi subscription when another for the same device arrived, or did the phone unsubscribe.
</pre>
</blockquote>
<p>On March 23rd, 2012, 9:44 a.m., <b>Mark Michelson</b> wrote:</p>
<blockquote style="margin-left: 1em; border-left: 2px solid #d0d0d0; padding-left: 10px;">
<pre style="white-space: pre-wrap; white-space: -moz-pre-wrap; white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;"> "Working with the senario that a phone has restarted due to loss of power for some reason.
Trusting a phone to unsubscribe to a previous subscription is just wrong, it won't know what it's previous call-id was.
The cleaner removed the power. Then plugged it back in... so many senarios :)"
Yes, this is why there's a timeout in place :)
Was this output taken from your patched system or from an unpatched Asterisk?
Looking at your change again, I missed something. It looks like old subscriptions will time out in Asterisk, it's just that Asterisk won't bother sending a NOTIFY to let subscribers know that they have timed out. What I still don't understand though is how this change actually fixes the problem you originally had. I think I'd need to see a SIP debug of the post-phone-reboot packets both before and after your patch to fully understand what's up here.
Despite not having all the information in front of me, I have a bit of an idea of what might be happening.
1. The phone is subscribed to Asterisk.
2. Asterisk sends some NOTIFYs to the phone. Let's say it sends 5 of them. This means that the last version number Asterisk has sent to the phone is 4 (since version starts at 0)
3. Phone reboots. It loses all memory of the old subscription and the old version number.
4. Phone starts a new subscription with Asterisk. Asterisk sends an initial NOTIFY for this new subscription with version 0.
5. Original subscription times out. Asterisk sends a NOTIFY to end the old subscription. The version number in this NOTIFY is 5.
6. The phone, for some reason, even though it should have not knowledge of this old subscription, stores 5 as the current version number (How does the phone respond to this NOTIFY? 200? 481?)
7. Asterisk sends NOTIFYs for the new subscription, incrementing version every time. Since the phone has 5 as its current version, it will ignore all NOTIFYs until version 6 arrives.
If this is what's happening, then the problem isn't that the phone is remembering version numbers across reboots. The problem is actually that the phone will honor version numbers for NOTIFYs for subscriptions it does not, or at least should not, know about (which seems pretty exploitable if you ask me). The solution I proposed before may work, although it might not. At step 4 above, Asterisk will send an initial NOTIFY with version 0. The phone may be programmed not to accept NOTIFYs on a subscription that do not start with version 0. If that's the case, my solution will not work. And if my solution does not work, then I don't know of a good one off the top of my head.
I still do not agree with this patch though. Most devices, including the newer generation phones you are using, properly handle a NOTIFY that ends a timed-out subscription. If anything, the solution for this will have to be implemented as a new option because there is no way we will unconditionally never send a subscription-ending NOTIFY for dialog-info events.</pre>
</blockquote>
<p>On March 23rd, 2012, 3:54 p.m., <b>Alec Davis</b> wrote:</p>
<blockquote style="margin-left: 1em; border-left: 2px solid #d0d0d0; padding-left: 10px;">
<pre style="white-space: pre-wrap; white-space: -moz-pre-wrap; white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;">
Yes this is unpatched, expected for the 3 DEBUG prints.
[code}
Index: channels/chan_sip.c
===================================================================
--- channels/chan_sip.c (revision 359029)
+++ channels/chan_sip.c (working copy)
@@ -3865,6 +3865,7 @@
p->subscribed = NONE;
append_history(p, "Subscribestatus", "timeout");
ast_debug(3, "Re-scheduled destruction of SIP subscription %s\n", p->callid ? p->callid : "<unknown>");
+ ast_log(LOG_NOTICE, "ALEC TIMEOUT of SIP subscription %s\n", p->callid ? p->callid : "<unknown>");
return 10000; /* Reschedule this destruction so that we know that it's gone */
}
@@ -13127,10 +13130,12 @@
local_display, local_target, local_target, mto, mto);
} else {
ast_str_append(tmp, 0, "<dialog id=\"%s\" direction=\"recipient\">\n", exten);
+ ast_log(LOG_NOTICE, "ALEC <dialog id=\"%s\" call-id=\"%s\" direction=\"recipient\"> version=%u %s\n", exten, p->callid, p->dialogver, statestring);
}
} else {
ast_str_append(tmp, 0, "<dialog id=\"%s\">\n", exten);
+ ast_log(LOG_NOTICE, "ALEC <dialog id=\"%s\" call-id=\"%s\"> version=%u %s\n", exten, p->callid, p->dialogver, statestring);
}
ast_str_append(tmp, 0, "<state>%s</state>\n", statestring);
if (state == AST_EXTENSION_ONHOLD) {
{code}
Regarding steps 1-7, you have it exactly.
answer for step 6: the phone obviously isn't matching call-id.
I agree, it's only a GXP20XX issue, Grandstream have fixed it, with the GXP21XX series.
I have access to a SNOM, and it works fine.
I have seen on one of the other open PBX forums, the same reported issue, there fix was to remove any previous subscriptions that match peer and entity. I will try to refind it.
</pre>
</blockquote>
</blockquote>
<pre style="white-space: pre-wrap; white-space: -moz-pre-wrap; white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;">Seems 1.2 had code to remove old duplicate subscriptions:
{code}
/* remove any old subscription from this peer for the same exten/context,
as the peer has obviously forgotten about it and it's wasteful to wait
for it to expire and send NOTIFY messages to the peer only to have them
ignored (or generate errors)
*/
ast_mutex_lock(&iflock);
for (p_old = iflist; p_old; p_old = p_old->next) {
if (p_old == p)
continue;
if (p_old->initreq.method != SIP_SUBSCRIBE)
continue;
if (p_old->subscribed == NONE)
continue;
ast_mutex_lock(&p_old->lock);
if (!strcmp(p_old->username, p->username)) {
if (!strcmp(p_old->exten, p->exten) &&
!strcmp(p_old->context, p->context)) {
ast_set_flag(p_old, SIP_NEEDDESTROY);
ast_mutex_unlock(&p_old->lock);
break;
}
}
ast_mutex_unlock(&p_old->lock);
}
ast_mutex_unlock(&iflock);
{code}</pre>
<br />
<p>- Alec</p>
<br />
<p>On March 13th, 2012, 5:35 p.m., Alec Davis wrote:</p>
<table bgcolor="#fefadf" width="100%" cellspacing="0" cellpadding="8" style="background-image: url('https://reviewboard.asterisk.org/media/rb/images/review_request_box_top_bg.png'); background-position: left top; background-repeat: repeat-x; border: 1px black solid;">
<tr>
<td>
<div>Review request for Asterisk Developers.</div>
<div>By Alec Davis.</div>
<p style="color: grey;"><i>Updated March 13, 2012, 5:35 p.m.</i></p>
<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Description </h1>
<table width="100%" bgcolor="#ffffff" cellspacing="0" cellpadding="10" style="border: 1px solid #b8b5a0">
<tr>
<td>
<pre style="margin: 0; padding: 0; white-space: pre-wrap; white-space: -moz-pre-wrap; white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;">Some phones (Grandstream GXP20XX series 1.2.5.3 firmware) if rebooted, loose the BLF lights due to the new subscription's "version" being less than the previous subscription's "version".
This happens the instant that asterisk timeouts the previous subscription - which has the much larger "version" sequence number.
The only way for the lights to start working again, is for the current subscription's "version" number to increment past the previous old "version" number.
That period of time could be huge, if the phone isn't rebooted for days.
The workaround has been to reboot the phone twice, within a few minutes of each reboot.
1st time wait for the BLF's to fail (then reboot again) - which is the timeout period - a few minutes.
2nd time the lights will work, then fail, then after a few minutes start working again.
This fix prevents the old subscription timeout from updating the phones "version" number.
This issue doesn't affter the GXP21XX series.
Attempt to show issue byway of example below with some debug output showing call-id and version number.
astrid-test*CLI> core show hints
-= Registered Asterisk Dial Plan Hints =-
8601GXP0001@trusted : Custom:q8601_a8612 State:InUse Watchers 1
Debug output:
===Set the BLF state to InUse
- Executing [s@macro-custom-agent-inout:13] Set("SIP/GXP0001-00000005", "DEVICE_STATE(Custom:q8601_a8612)=INUSE") in new stack
...
[Mar 14 11:01:39] NOTICE[10819]: chan_sip.c:13136 state_notify_build_xml: ALEC <dialog id="8601GXP0001" call-id="3c65aed02ae1d5df@192.168.5.141"> version=5
=============
Reboot the phone, it gets a new BLF subscription.
=============
[Mar 14 11:02:30] NOTICE[10826]: chan_sip.c:13136 state_notify_build_xml: ALEC <dialog id="8601GXP0001" call-id="79956c0168125400@192.168.5.141"> version=0
===The BLF has correct state of InUse.
...
[Mar 14 11:05:00] NOTICE[10826]: chan_sip.c:13136 state_notify_build_xml: ALEC <dialog id="8601GXP0001" call-id="79956c0168125400@192.168.5.141"> version=1
===The BLF still has correct state of InUse.
...
The timeout 'last notification'
[Mar 14 11:05:09] NOTICE[10826]: chan_sip.c:13131 state_notify_build_xml: ALEC <dialog id="8601GXP0001" call-id="3c65aed02ae1d5df@192.168.5.141" direction="recipient"> version=6
[Mar 14 11:05:09] NOTICE[10826]: chan_sip.c:3868 __sip_autodestruct: ALEC TIMEOUT of SIP subscription 3c65aed02ae1d5df@192.168.5.141
===The BLF has incorrectly gone Idle.
...
[Mar 14 11:15:01] NOTICE[10826]: chan_sip.c:13136 state_notify_build_xml: ALEC <dialog id="8601GXP0001" call-id="79956c0168125400@192.168.5.141"> version=5
===The BLF is still incorrect at Idle.
-- Registered SIP 'GXP0001' at 192.168.5.141:5066
[Mar 14 11:17:31] NOTICE[10826]: chan_sip.c:13136 state_notify_build_xml: ALEC <dialog id="8601GXP0001" call-id="79956c0168125400@192.168.5.141"> version=6
===The BLF now has correct state of InUse.
</pre>
</td>
</tr>
</table>
<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Testing </h1>
<table width="100%" bgcolor="#ffffff" cellspacing="0" cellpadding="10" style="border: 1px solid #b8b5a0">
<tr>
<td>
<pre style="margin: 0; padding: 0; white-space: pre-wrap; white-space: -moz-pre-wrap; white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;">In place on a 1.8 deployment for months, our production box with ~80 GXP phones, half of which are the older GXP20XX series.
</pre>
</td>
</tr>
</table>
<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Diffs</b> </h1>
<ul style="margin-left: 3em; padding-left: 0;">
<li>trunk/channels/chan_sip.c <span style="color: grey">(358608)</span></li>
</ul>
<p><a href="https://reviewboard.asterisk.org/r/1813/diff/" style="margin-left: 3em;">View Diff</a></p>
</td>
</tr>
</table>
</div>
</body>
</html>