[asterisk-bugs] [JIRA] (ASTERISK-28282) AST_SCHED_REPLACE_UNREF causes wait-on-self deadlocks (in chan_sip)

Walter Doekes (JIRA) noreply at issues.asterisk.org
Mon Feb 11 09:31:47 CST 2019


     [ https://issues.asterisk.org/jira/browse/ASTERISK-28282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Walter Doekes updated ASTERISK-28282:
-------------------------------------

    Attachment: ASTERISK-28282_report_deadlock_but_continue.patch

> AST_SCHED_REPLACE_UNREF causes wait-on-self deadlocks (in chan_sip)
> -------------------------------------------------------------------
>
>                 Key: ASTERISK-28282
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-28282
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Channels/chan_sip/General
>    Affects Versions: 13.24.1
>            Reporter: Walter Doekes
>              Labels: patch
>         Attachments: ASTERISK-28282_improving_sip_show_sched.patch, ASTERISK-28282_report_deadlock_but_continue.patch, ASTERISK-28282_reproducing_sched_unref_self_deadlock.patch
>
>
> Hi,
> since ASTERISK-24212, a chan_sip bug has moved its appearance from apparent memory corruption to deadlocks.
> In the ticket above, the scheduler is fixed so it waits for a scheduled callback to complete before allowing it to be cancelled: _that way the callback is removed before running OR it runs completely before the caller of ast_sched_del can remove objects that the callback may depend on_.
> Example 1:
> - thread 1: add sip_poke_peer_now schedule
> - thread 2: del sip_poke_peer_now schedule
> Example 2:
> - thread 1: add sip_poke_peer_now schedule
> - thread 2: run sip_poke_peer_now
> - thread 3: attempt to delete sip_poke_peer_now schedule, wait
> - thread 2: callback completes
> - thread 3: the delete finishes
> However, under certain conditions of heavy load, the following situation can occur:
> - thread 1: add sip_poke_peer_now schedule
> - thread 2: run sip_poke_peer_now
> - also thread 2: attempt to delete sip_poke_peer_now schedule, but wait for callback to complete (which will never succeed)
> This can happen because of this macro:
> {code}
> #define AST_SCHED_REPLACE_VARIABLE_UNREF(id, sched, when, callback, data, variable, unrefcall, addfailcall, refcall) \
> ...
>                 id = ast_sched_add_variable(sched, when, callback, data, variable); \
> ...
> {code}
> Which is used like this:
> {code}
>                         AST_SCHED_REPLACE_UNREF(peer->pokeexpire, sched,
>                                 0, /* Poke the peer ASAP */
>                                 sip_poke_peer_now, peer,
>                                 sip_unref_peer(_data, "removing poke peer ref"),
>                                 sip_unref_peer(peer, "removing poke peer ref"),
>                                 sip_ref_peer(peer, "adding poke peer ref"));
> {code}
> The scheduler gets tasked with this:
> {code}
> static int sip_poke_peer_now(const void *data)
> {
>         struct sip_peer *peer = (struct sip_peer *) data;
>         peer->pokeexpire = -1;
>         sip_poke_peer(peer, 0);
> {code}
> Which ends up here:
> {code}
> static int sip_poke_peer(struct sip_peer *peer, int force)
> ...
>                 AST_SCHED_DEL_UNREF(sched, peer->pokeexpire,
>                                 sip_unref_peer(peer, "removing poke peer ref"));
> {code}
> On a loaded system, the {{ast_sched_add_variable(sched, when, callback, data, variable);}} can complete and then the {{sip_poke_peer_now}} function can get called _before_ {{peer->pokeexpire}} is set.
> I.e.:
> - thread 1: {{ast_sched_add_variable(sched, when, callback, data, variable)}} returns ID123
> - thread 2: sip_poke_peer_now: {{peer->pokeexpire = -1;}}
> - thread 1: {{peer->pokeexpire = ID123}} (from macro)
> - thread 2: sip_poke_peer: {{ast_sched_del(...ID123)}}
> This results in a deadlock here:
> {code}
>                 /* Wait for executing task to complete so that caller of ast_sched_del() does not
>                  * free memory out from under the task.
>                  */
>                 while (con->currently_executing && (id == con->currently_executing->sched_id->id)) {
>                         ast_cond_wait(&s->cond, &con->lock);
>                 }
> {code}
> We're executing callback ID123, and we wait for that to complete before we can ast_sched_del it: infinite wait.
> *Fixes:*
> - Best fix is to not do any schedule-changes from the schedule-callbacks. But refactoring chan_sip is beyond the scope here.
> - An alternate fix that makes the target ID (in this case {{peer->pokeexpire}}) get set by {{ast_sched_add_variable}} _before_ the scheduled item is added to the heap would be rather invasive.
> But there should be other fixes we can do..
> _Observe that this problem may exist in other places than just chan_sip and {{sip_poke_peer_now}}._



--
This message was sent by Atlassian JIRA
(v6.2#6252)



More information about the asterisk-bugs mailing list