[asterisk-bugs] [JIRA] (ASTERISK-28282) AST_SCHED_REPLACE_UNREF causes wait-on-self deadlocks (in chan_sip)

Walter Doekes (JIRA) noreply at issues.asterisk.org
Mon Feb 11 09:19:47 CST 2019


Walter Doekes created ASTERISK-28282:
----------------------------------------

             Summary: AST_SCHED_REPLACE_UNREF causes wait-on-self deadlocks (in chan_sip)
                 Key: ASTERISK-28282
                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-28282
             Project: Asterisk
          Issue Type: Bug
      Security Level: None
          Components: Channels/chan_sip/General
    Affects Versions: 13.24.1
            Reporter: Walter Doekes


Hi,

since ASTERISK-24212, a chan_sip bug has moved its appearance from apparent memory corruption to deadlocks.

In the ticket above, the scheduler is fixed so it waits for a scheduled callback to complete before allowing it to be cancelled: _that way the callback is removed before running OR it runs completely before the caller of ast_sched_del can remove objects that the callback may depend on_.

Example 1:

- thread 1: add sip_poke_peer_now schedule
- thread 2: del sip_poke_peer_now schedule

Example 2:

- thread 1: add sip_poke_peer_now schedule
- thread 2: run sip_poke_peer_now
- thread 3: attempt to delete sip_poke_peer_now schedule, wait
- thread 2: callback completes
- thread 3: the delete finishes

However, under certain conditions of heavy load, the following situation can occur:

- thread 1: add sip_poke_peer_now schedule
- thread 2: run sip_poke_peer_now
- also thread 2: attempt to delete sip_poke_peer_now schedule, but wait for callback to complete (which will never succeed)

This can happen because of this macro:

{code}
#define AST_SCHED_REPLACE_VARIABLE_UNREF(id, sched, when, callback, data, variable, unrefcall, addfailcall, refcall) \
...
                id = ast_sched_add_variable(sched, when, callback, data, variable); \
...
{code}
Which is used like this:
{code}
                        AST_SCHED_REPLACE_UNREF(peer->pokeexpire, sched,
                                0, /* Poke the peer ASAP */
                                sip_poke_peer_now, peer,
                                sip_unref_peer(_data, "removing poke peer ref"),
                                sip_unref_peer(peer, "removing poke peer ref"),
                                sip_ref_peer(peer, "adding poke peer ref"));
{code}
The scheduler gets tasked with this:
{code}
static int sip_poke_peer_now(const void *data)
{
        struct sip_peer *peer = (struct sip_peer *) data;

        peer->pokeexpire = -1;
        sip_poke_peer(peer, 0);
{code}
Which ends up here:
{code}
static int sip_poke_peer(struct sip_peer *peer, int force)
...
                AST_SCHED_DEL_UNREF(sched, peer->pokeexpire,
                                sip_unref_peer(peer, "removing poke peer ref"));
{code}
On a loaded system, the {{ast_sched_add_variable(sched, when, callback, data, variable);}} can complete and then the {{sip_poke_peer_now}} function can get called _before_ {{peer->pokeexpire}} is set.

I.e.:

- thread 1: {{ast_sched_add_variable(sched, when, callback, data, variable)}} returns ID123
- thread 2: sip_poke_peer_now: {{peer->pokeexpire = -1;}}
- thread 1: {{peer->pokeexpire = ID123}} (from macro)
- thread 2: sip_poke_peer: {{ast_sched_del(...ID123)}}

This results in a deadlock here:

{code}
                /* Wait for executing task to complete so that caller of ast_sched_del() does not
                 * free memory out from under the task.
                 */
                while (con->currently_executing && (id == con->currently_executing->sched_id->id)) {
                        ast_cond_wait(&s->cond, &con->lock);
                }
{code}

We're executing callback ID123, and we wait for that to complete before we can ast_sched_del it: infinite wait.

*Fixes:*

- Best fix is to not do any schedule-changes from the schedule-callbacks. But refactoring chan_sip is beyond the scope here.
- An alternate fix that makes the target ID (in this case {{peer->pokeexpire}}) get set by {{ast_sched_add_variable}} _before_ the scheduled item is added to the heap would be rather invasive.

But there should be other fixes we can do..

_Observe that this problem may exist in other places than just chan_sip and {{sip_poke_peer_now}}._



--
This message was sent by Atlassian JIRA
(v6.2#6252)



More information about the asterisk-bugs mailing list