[asterisk-bugs] [JIRA] (ASTERISK-28282) AST_SCHED_REPLACE_UNREF causes wait-on-self deadlocks (in chan_sip)
Walter Doekes (JIRA)
noreply at issues.asterisk.org
Mon Feb 11 09:19:47 CST 2019
Walter Doekes created ASTERISK-28282:
----------------------------------------
Summary: AST_SCHED_REPLACE_UNREF causes wait-on-self deadlocks (in chan_sip)
Key: ASTERISK-28282
URL: https://issues.asterisk.org/jira/browse/ASTERISK-28282
Project: Asterisk
Issue Type: Bug
Security Level: None
Components: Channels/chan_sip/General
Affects Versions: 13.24.1
Reporter: Walter Doekes
Hi,
since ASTERISK-24212, a chan_sip bug has moved its appearance from apparent memory corruption to deadlocks.
In the ticket above, the scheduler is fixed so it waits for a scheduled callback to complete before allowing it to be cancelled: _that way the callback is removed before running OR it runs completely before the caller of ast_sched_del can remove objects that the callback may depend on_.
Example 1:
- thread 1: add sip_poke_peer_now schedule
- thread 2: del sip_poke_peer_now schedule
Example 2:
- thread 1: add sip_poke_peer_now schedule
- thread 2: run sip_poke_peer_now
- thread 3: attempt to delete sip_poke_peer_now schedule, wait
- thread 2: callback completes
- thread 3: the delete finishes
However, under certain conditions of heavy load, the following situation can occur:
- thread 1: add sip_poke_peer_now schedule
- thread 2: run sip_poke_peer_now
- also thread 2: attempt to delete sip_poke_peer_now schedule, but wait for callback to complete (which will never succeed)
This can happen because of this macro:
{code}
#define AST_SCHED_REPLACE_VARIABLE_UNREF(id, sched, when, callback, data, variable, unrefcall, addfailcall, refcall) \
...
id = ast_sched_add_variable(sched, when, callback, data, variable); \
...
{code}
Which is used like this:
{code}
AST_SCHED_REPLACE_UNREF(peer->pokeexpire, sched,
0, /* Poke the peer ASAP */
sip_poke_peer_now, peer,
sip_unref_peer(_data, "removing poke peer ref"),
sip_unref_peer(peer, "removing poke peer ref"),
sip_ref_peer(peer, "adding poke peer ref"));
{code}
The scheduler gets tasked with this:
{code}
static int sip_poke_peer_now(const void *data)
{
struct sip_peer *peer = (struct sip_peer *) data;
peer->pokeexpire = -1;
sip_poke_peer(peer, 0);
{code}
Which ends up here:
{code}
static int sip_poke_peer(struct sip_peer *peer, int force)
...
AST_SCHED_DEL_UNREF(sched, peer->pokeexpire,
sip_unref_peer(peer, "removing poke peer ref"));
{code}
On a loaded system, the {{ast_sched_add_variable(sched, when, callback, data, variable);}} can complete and then the {{sip_poke_peer_now}} function can get called _before_ {{peer->pokeexpire}} is set.
I.e.:
- thread 1: {{ast_sched_add_variable(sched, when, callback, data, variable)}} returns ID123
- thread 2: sip_poke_peer_now: {{peer->pokeexpire = -1;}}
- thread 1: {{peer->pokeexpire = ID123}} (from macro)
- thread 2: sip_poke_peer: {{ast_sched_del(...ID123)}}
This results in a deadlock here:
{code}
/* Wait for executing task to complete so that caller of ast_sched_del() does not
* free memory out from under the task.
*/
while (con->currently_executing && (id == con->currently_executing->sched_id->id)) {
ast_cond_wait(&s->cond, &con->lock);
}
{code}
We're executing callback ID123, and we wait for that to complete before we can ast_sched_del it: infinite wait.
*Fixes:*
- Best fix is to not do any schedule-changes from the schedule-callbacks. But refactoring chan_sip is beyond the scope here.
- An alternate fix that makes the target ID (in this case {{peer->pokeexpire}}) get set by {{ast_sched_add_variable}} _before_ the scheduled item is added to the heap would be rather invasive.
But there should be other fixes we can do..
_Observe that this problem may exist in other places than just chan_sip and {{sip_poke_peer_now}}._
--
This message was sent by Atlassian JIRA
(v6.2#6252)
More information about the asterisk-bugs
mailing list