[asterisk-bugs] [JIRA] (ASTERISK-28282) AST_SCHED_REPLACE_UNREF causes wait-on-self deadlocks (in chan_sip)

Asterisk Team (JIRA) noreply at issues.asterisk.org
Mon Feb 11 09:19:47 CST 2019


    [ https://issues.asterisk.org/jira/browse/ASTERISK-28282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=246193#comment-246193 ] 

Asterisk Team commented on ASTERISK-28282:
------------------------------------------

Thanks for creating a report! The issue has entered the triage process. That means the issue will wait in this status until a Bug Marshal has an opportunity to review the issue. Once the issue has been reviewed you will receive comments regarding the next steps towards resolution.

A good first step is for you to review the [Asterisk Issue Guidelines|https://wiki.asterisk.org/wiki/display/AST/Asterisk+Issue+Guidelines] if you haven't already. The guidelines detail what is expected from an Asterisk issue report.

Then, if you are submitting a patch, please review the [Patch Contribution Process|https://wiki.asterisk.org/wiki/display/AST/Patch+Contribution+Process].

Please note that once your issue enters an open state it has been accepted. As Asterisk is an open source project there is no guarantee or timeframe on when your issue will be looked into. If you need expedient resolution you will need to find and pay a suitable developer. Asking for an update on your issue will not yield any progress on it and will not result in a response. All updates are posted to the issue when they occur.

> AST_SCHED_REPLACE_UNREF causes wait-on-self deadlocks (in chan_sip)
> -------------------------------------------------------------------
>
>                 Key: ASTERISK-28282
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-28282
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Channels/chan_sip/General
>    Affects Versions: 13.24.1
>            Reporter: Walter Doekes
>
> Hi,
> since ASTERISK-24212, a chan_sip bug has moved its appearance from apparent memory corruption to deadlocks.
> In the ticket above, the scheduler is fixed so it waits for a scheduled callback to complete before allowing it to be cancelled: _that way the callback is removed before running OR it runs completely before the caller of ast_sched_del can remove objects that the callback may depend on_.
> Example 1:
> - thread 1: add sip_poke_peer_now schedule
> - thread 2: del sip_poke_peer_now schedule
> Example 2:
> - thread 1: add sip_poke_peer_now schedule
> - thread 2: run sip_poke_peer_now
> - thread 3: attempt to delete sip_poke_peer_now schedule, wait
> - thread 2: callback completes
> - thread 3: the delete finishes
> However, under certain conditions of heavy load, the following situation can occur:
> - thread 1: add sip_poke_peer_now schedule
> - thread 2: run sip_poke_peer_now
> - also thread 2: attempt to delete sip_poke_peer_now schedule, but wait for callback to complete (which will never succeed)
> This can happen because of this macro:
> {code}
> #define AST_SCHED_REPLACE_VARIABLE_UNREF(id, sched, when, callback, data, variable, unrefcall, addfailcall, refcall) \
> ...
>                 id = ast_sched_add_variable(sched, when, callback, data, variable); \
> ...
> {code}
> Which is used like this:
> {code}
>                         AST_SCHED_REPLACE_UNREF(peer->pokeexpire, sched,
>                                 0, /* Poke the peer ASAP */
>                                 sip_poke_peer_now, peer,
>                                 sip_unref_peer(_data, "removing poke peer ref"),
>                                 sip_unref_peer(peer, "removing poke peer ref"),
>                                 sip_ref_peer(peer, "adding poke peer ref"));
> {code}
> The scheduler gets tasked with this:
> {code}
> static int sip_poke_peer_now(const void *data)
> {
>         struct sip_peer *peer = (struct sip_peer *) data;
>         peer->pokeexpire = -1;
>         sip_poke_peer(peer, 0);
> {code}
> Which ends up here:
> {code}
> static int sip_poke_peer(struct sip_peer *peer, int force)
> ...
>                 AST_SCHED_DEL_UNREF(sched, peer->pokeexpire,
>                                 sip_unref_peer(peer, "removing poke peer ref"));
> {code}
> On a loaded system, the {{ast_sched_add_variable(sched, when, callback, data, variable);}} can complete and then the {{sip_poke_peer_now}} function can get called _before_ {{peer->pokeexpire}} is set.
> I.e.:
> - thread 1: {{ast_sched_add_variable(sched, when, callback, data, variable)}} returns ID123
> - thread 2: sip_poke_peer_now: {{peer->pokeexpire = -1;}}
> - thread 1: {{peer->pokeexpire = ID123}} (from macro)
> - thread 2: sip_poke_peer: {{ast_sched_del(...ID123)}}
> This results in a deadlock here:
> {code}
>                 /* Wait for executing task to complete so that caller of ast_sched_del() does not
>                  * free memory out from under the task.
>                  */
>                 while (con->currently_executing && (id == con->currently_executing->sched_id->id)) {
>                         ast_cond_wait(&s->cond, &con->lock);
>                 }
> {code}
> We're executing callback ID123, and we wait for that to complete before we can ast_sched_del it: infinite wait.
> *Fixes:*
> - Best fix is to not do any schedule-changes from the schedule-callbacks. But refactoring chan_sip is beyond the scope here.
> - An alternate fix that makes the target ID (in this case {{peer->pokeexpire}}) get set by {{ast_sched_add_variable}} _before_ the scheduled item is added to the heap would be rather invasive.
> But there should be other fixes we can do..
> _Observe that this problem may exist in other places than just chan_sip and {{sip_poke_peer_now}}._



--
This message was sent by Atlassian JIRA
(v6.2#6252)



More information about the asterisk-bugs mailing list