[asterisk-bugs] [JIRA] (ASTERISK-24374) "sip qualify peer" CLI command stops periodic pokes for the peer forever, if the peer is unreachable

Tue Oct 7 09:25:29 CDT 2014

     [ https://issues.asterisk.org/jira/browse/ASTERISK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Jordan updated ASTERISK-24374:
-----------------------------------

    Description: 
Periodic pokes for a realtime peer are started by the first call to {{sip_poke_peer}}. {{sip_poke_peer}} cancels the current {{pokeexpire}} timer and transmits an {{OPTIONS}} request to the peer. If the transmission fails, {{sip_poke_noanswer}} is called immediately; otherwise, the {{pokeexpire}} timer is set to call {{sip_poke_noanswer}} after {{maxms * 2}} milliseconds (the maximum allowed round-trip time).

If the peer answers the {{OPTIONS}} request, {{handle_response_peerpoke}} is called, which updates peer reachability and reschedules the {{pokeexpire}} timer to call {{sip_poke_peer_s}} in {{qualifyfreq}} milliseconds (if the peer is reachable) or in 10 seconds (if it is not).

If the peer fails to answer, the scheduler calls {{sip_poke_noanswer}}, which marks the peer as unreachable and schedules the {{pokeexpire}} timer to call {{sip_poke_peer_s}} in 10 seconds.

{{sip_poke_peer_s}} is little more than a wrapper around {{sip_poke_peer}}.

To sum up, the periodic poke loop is: {{sip_poke_peer}} → {{sip_poke_noanswer}}/{{handle_response_peerpoke}} → {{sip_poke_peer_s}} → repeat.

However, when {{sip_poke_peer}} is called by the CLI/manager command {{sip qualify peer}}, {{force}} is set, so it does not schedule a call to {{sip_poke_noanswer}}:

{code}
	} else if (!force) {
		AST_SCHED_REPLACE_UNREF(peer->pokeexpire, sched, peer->maxms * 2, sip_poke_noanswer, peer,
				unref_peer(_data, "removing poke peer ref"),
				unref_peer(peer, "removing poke peer ref"),
				ref_peer(peer, "adding poke peer ref"));
	}
{code}

If the peer is unreachable, {{handle_response_peerpoke}} is never called. This deadlock is supposed to be broken by {{sip_poke_noanswer}}, but it is never scheduled (and not all network errors can be detected synchronously by {{transmit_invite}}, so {{sip_poke_noanswer}} may never be called directly, either). Nothing is left to schedule a call to {{sip_poke_peer_s}}: the periodic poke state machine is dead, and there is no way to restart it except by pruning the peer from the realtime table.

The easiest way to fix the issue is probably to change the above code into:

\[Edit\]: *Inline patch removed by mjordan*

  was:
Periodic pokes for a realtime peer are started by the first call to {{sip_poke_peer}}. {{sip_poke_peer}} cancels the current {{pokeexpire}} timer and transmits an {{OPTIONS}} request to the peer. If the transmission fails, {{sip_poke_noanswer}} is called immediately; otherwise, the {{pokeexpire}} timer is set to call {{sip_poke_noanswer}} after {{maxms * 2}} milliseconds (the maximum allowed round-trip time).

If the peer answers the {{OPTIONS}} request, {{handle_response_peerpoke}} is called, which updates peer reachability and reschedules the {{pokeexpire}} timer to call {{sip_poke_peer_s}} in {{qualifyfreq}} milliseconds (if the peer is reachable) or in 10 seconds (if it is not).

If the peer fails to answer, the scheduler calls {{sip_poke_noanswer}}, which marks the peer as unreachable and schedules the {{pokeexpire}} timer to call {{sip_poke_peer_s}} in 10 seconds.

{{sip_poke_peer_s}} is little more than a wrapper around {{sip_poke_peer}}.

To sum up, the periodic poke loop is: {{sip_poke_peer}} → {{sip_poke_noanswer}}/{{handle_response_peerpoke}} → {{sip_poke_peer_s}} → repeat.

However, when {{sip_poke_peer}} is called by the CLI/manager command {{sip qualify peer}}, {{force}} is set, so it does not schedule a call to {{sip_poke_noanswer}}:

{code}
	} else if (!force) {
		AST_SCHED_REPLACE_UNREF(peer->pokeexpire, sched, peer->maxms * 2, sip_poke_noanswer, peer,
				unref_peer(_data, "removing poke peer ref"),
				unref_peer(peer, "removing poke peer ref"),
				ref_peer(peer, "adding poke peer ref"));
	}
{code}

If the peer is unreachable, {{handle_response_peerpoke}} is never called. This deadlock is supposed to be broken by {{sip_poke_noanswer}}, but it is never scheduled (and not all network errors can be detected synchronously by {{transmit_invite}}, so {{sip_poke_noanswer}} may never be called directly, either). Nothing is left to schedule a call to {{sip_poke_peer_s}}: the periodic poke state machine is dead, and there is no way to restart it except by pruning the peer from the realtime table.

The easiest way to fix the issue is probably to change the above code into:

{code}
	} else {
		AST_SCHED_REPLACE_UNREF(peer->pokeexpire, sched, peer->maxms * 2, sip_poke_noanswer, peer,
				unref_peer(_data, "removing poke peer ref"),
				unref_peer(peer, "removing poke peer ref"),
				ref_peer(peer, "adding poke peer ref"));
	}
{code}

> "sip qualify peer" CLI command stops periodic pokes for the peer forever, if the peer is unreachable
> ----------------------------------------------------------------------------------------------------
>
>                 Key: ASTERISK-24374
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-24374
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Channels/chan_sip/General
>    Affects Versions: 1.8.31.0
>            Reporter: Michele Cicciotti (PrivateWave SpA)
>            Assignee: Michele Cicciotti (PrivateWave SpA)
>            Severity: Minor

--
This message was sent by Atlassian JIRA
(v6.2#6252)