[asterisk-bugs] [JIRA] (ASTERISK-25023) Deadlock in chan_sip in update_provisional_keepalive

Tue Nov 24 10:57:34 CST 2015

    [ https://issues.asterisk.org/jira/browse/ASTERISK-25023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=228451#comment-228451 ] 

Matt Jordan commented on ASTERISK-25023:
----------------------------------------

The inversion is as [~hexanol] pointed out: when a scheduled task is executing, that task isn't holding any locks. However, anyone attempting to delete the running task may block that task if they themselves are currently holding any locks.

This appears to have been caused by a combination of two commits:

{noformat}
commit 7c4ed8cc897ace58dd6b2534589fb5902eebdb14
Author: Mark Michelson <mmichelson at digium.com>
Date:   Tue Aug 26 22:13:57 2014 +0000

    Fix race condition in the scheduler when deleting a running entry.

    When scheduled tasks run, they are removed from the heap (or hashtab).
    When a scheduled task is deleted, if the task can't be found in the
    heap (or hashtab), an assertion is triggered. If DO_CRASH is enabled,
    this assertion causes a crash.

    The problem is, sometimes it just so happens that someone attempts
    to delete a scheduled task at the time that it is running, leading
    to a crash. This change corrects the issue by tracking which task
    is currently running. If that task is attempted to be deleted,
    then we mark the task, and then wait for the task to complete.
    This way, we can be sure to coordinate task deletion and memory
    freeing.

    ASTERISK-24212
    Reported by Matt Jordan

    Review: https://reviewboard.asterisk.org/r/3927
    ........

    Merged revisions 422070 from http://svn.asterisk.org/svn/asterisk/branches/12

    git-svn-id: https://origsvn.digium.com/svn/asterisk/branches/13@422071 65c4cc65-6c06-0410-ace0-fbb531ad65f3
{noformat}

And:

{noformat}
commit 85e1cb51b21c2d194647e16b81b5a1344d2ff911
Author: Joshua Colp <jcolp at digium.com>
Date:   Fri Aug 28 21:57:14 2015 -0300

    sched: ast_sched_del may return prematurely due to spurious wakeup

    When deleting a scheduled item if the item in question is currently
    executing the ast_sched_del function waits until it has completed.
    This is accomplished using ast_cond_wait. Unfortunately the
    ast_cond_wait function can suffer from spurious wakeups so the
    predicate needs to be checked after it returns to make sure it has
    really woken up as a result of being signaled.

    This change adds a loop around the ast_cond_wait to make sure that
    it only exits when the executing task has really completed.

    ASTERISK-25355 #close

    Change-Id: I51198270eb0b637c956c61aa409f46283432be61
{noformat}

In the first, we introduced the notion of waiting until a running task completes before attempting to delete/remove them from the scheduler. That didn't actually cause the deadlock, however, as {{ast_cond_wait}} wasn't being guarded with a predicate. As such, weird slow downs may have occurred, but eventually {{ast_cond_wait}} would *probably* have returned, resulting in things flowing along (although the delete would have probably been in a precarious position, as the running task would still be ... running.)

The second commit causes the deadlock by adding the predicate. Since the predicate will never be true - as the deleter of the scheduled task is holding a lock that task is waiting on - we're deadlocked.

There's at least two ways to solve this:
* Use a time-bounded wait condition instead of {{ast_cond_wait}}. I don't really love this option, however, as it will result in deleting a task being in an unknown state if the time limit passes - plus, we're liable to block the entire SIP stack while we wait for that timer to expire.
* Fix the priority inversions. Generally, that means callers of {{ast_sched_del}} *CANNOT* *HOLD* *ANY* *LOCKS*. That patch is mostly tedious to write, and should probably result in a macro that deals with the fun of ref bump/unlock/re-lock that will be necessary. Callers will also have to be aware that they may hold the only reference to the pvt/peer when it completes, which can result in some fun crashiness if we aren't careful.

> Deadlock in chan_sip in update_provisional_keepalive
> ----------------------------------------------------
>
>                 Key: ASTERISK-25023
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-25023
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Channels/chan_sip/General
>    Affects Versions: 13.3.2
>         Environment: centos 6 / 64Bit
>            Reporter: Arnd Schmitter
>         Attachments: backtrace.txt, core-show-locks.txt, trace.txt
>
>
> There is a race condition / deadlock when update_provisional_keepalive is called.
> If there is already a scheduler run plant for calling send_provisional_keepalive_full.
> the func update_provisional_keepalive gets called with a locked sip_pvt struct and the first thing it does is delete the plant scheduler.
> If the plant scheduler is started after the sip_pvt lock and before the AST_SCHED_DEL_UNREF call in update_provisional_keepalive is executed, the scheduler job is blocked in send_provisional_keepalive_full, waiting to get a lock on the sip_pvt struct.
> The call to AST_SCHED_DEL_UNREF is waiting for a condistion signal, that the running scheduler finish.

--
This message was sent by Atlassian JIRA
(v6.2#6252)