[asterisk-bugs] [JIRA] (ASTERISK-28282) AST_SCHED_REPLACE_UNREF causes wait-on-self deadlocks (in chan_sip)
Friendly Automation (JIRA)
noreply at issues.asterisk.org
Fri Jul 19 08:44:48 CDT 2019
[ https://issues.asterisk.org/jira/browse/ASTERISK-28282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Friendly Automation closed ASTERISK-28282.
------------------------------------------
Resolution: Fixed
> AST_SCHED_REPLACE_UNREF causes wait-on-self deadlocks (in chan_sip)
> -------------------------------------------------------------------
>
> Key: ASTERISK-28282
> URL: https://issues.asterisk.org/jira/browse/ASTERISK-28282
> Project: Asterisk
> Issue Type: Bug
> Security Level: None
> Components: Channels/chan_sip/General
> Affects Versions: 13.24.1
> Reporter: Walter Doekes
> Assignee: Walter Doekes
> Labels: patch
> Attachments: ASTERISK-28282_improving_sip_show_sched.patch, ASTERISK-28282_no_sip_poke_peer_now_if_no_qualify.patch, ASTERISK-28282_only_unref_if_ast_sched_del_succeeded.patch, ASTERISK-28282_report_deadlock_but_continue.patch, ASTERISK-28282_reproducing_sched_unref_self_deadlock.patch, ASTERISK-28282_suggest_locking_less_in_mixmonitor_b.patch
>
>
> Hi,
> since ASTERISK-24212, a chan_sip bug has moved its appearance from apparent memory corruption to deadlocks.
> In the ticket above, the scheduler is fixed so it waits for a scheduled callback to complete before allowing it to be cancelled: _that way the callback is removed before running OR it runs completely before the caller of ast_sched_del can remove objects that the callback may depend on_.
> Example 1:
> - thread 1: add sip_poke_peer_now schedule
> - thread 2: del sip_poke_peer_now schedule
> Example 2:
> - thread 1: add sip_poke_peer_now schedule
> - thread 2: run sip_poke_peer_now
> - thread 3: attempt to delete sip_poke_peer_now schedule, wait
> - thread 2: callback completes
> - thread 3: the delete finishes
> However, under certain conditions of heavy load, the following situation can occur:
> - thread 1: add sip_poke_peer_now schedule
> - thread 2: run sip_poke_peer_now
> - also thread 2: attempt to delete sip_poke_peer_now schedule, but wait for callback to complete (which will never succeed)
> This can happen because of this macro:
> {code}
> #define AST_SCHED_REPLACE_VARIABLE_UNREF(id, sched, when, callback, data, variable, unrefcall, addfailcall, refcall) \
> ...
> id = ast_sched_add_variable(sched, when, callback, data, variable); \
> ...
> {code}
> Which is used like this:
> {code}
> AST_SCHED_REPLACE_UNREF(peer->pokeexpire, sched,
> 0, /* Poke the peer ASAP */
> sip_poke_peer_now, peer,
> sip_unref_peer(_data, "removing poke peer ref"),
> sip_unref_peer(peer, "removing poke peer ref"),
> sip_ref_peer(peer, "adding poke peer ref"));
> {code}
> The scheduler gets tasked with this:
> {code}
> static int sip_poke_peer_now(const void *data)
> {
> struct sip_peer *peer = (struct sip_peer *) data;
> peer->pokeexpire = -1;
> sip_poke_peer(peer, 0);
> {code}
> Which ends up here:
> {code}
> static int sip_poke_peer(struct sip_peer *peer, int force)
> ...
> AST_SCHED_DEL_UNREF(sched, peer->pokeexpire,
> sip_unref_peer(peer, "removing poke peer ref"));
> {code}
> On a loaded system, the {{ast_sched_add_variable(sched, when, callback, data, variable);}} can complete and then the {{sip_poke_peer_now}} function can get called _before_ {{peer->pokeexpire}} is set.
> I.e.:
> - thread 1: {{ast_sched_add_variable(sched, when, callback, data, variable)}} returns ID123
> - thread 2: sip_poke_peer_now: {{peer->pokeexpire = -1;}}
> - thread 1: {{peer->pokeexpire = ID123}} (from macro)
> - thread 2: sip_poke_peer: {{ast_sched_del(...ID123)}}
> This results in a deadlock here:
> {code}
> /* Wait for executing task to complete so that caller of ast_sched_del() does not
> * free memory out from under the task.
> */
> while (con->currently_executing && (id == con->currently_executing->sched_id->id)) {
> ast_cond_wait(&s->cond, &con->lock);
> }
> {code}
> We're executing callback ID123, and we wait for that to complete before we can ast_sched_del it: infinite wait.
> *Fixes:*
> - Best fix is to not do any schedule-changes from the schedule-callbacks. But refactoring chan_sip is beyond the scope here.
> - An alternate fix that makes the target ID (in this case {{peer->pokeexpire}}) get set by {{ast_sched_add_variable}} _before_ the scheduled item is added to the heap would be rather invasive.
> But there should be other fixes we can do..
> _Observe that this problem may exist in other places than just chan_sip and {{sip_poke_peer_now}}._
--
This message was sent by Atlassian JIRA
(v6.2#6252)
More information about the asterisk-bugs
mailing list