[asterisk-bugs] [JIRA] (ASTERISK-21207) [patch] - Deadlock on fax extension calling ast_async_goto() with locked channel

Masahide Yamamoto (JIRA) noreply at issues.asterisk.org
Sun Jul 6 07:31:57 CDT 2014


    [ https://issues.asterisk.org/jira/browse/ASTERISK-21207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=220282#comment-220282 ] 

Masahide Yamamoto edited comment on ASTERISK-21207 at 7/6/14 7:30 AM:
----------------------------------------------------------------------

We also have been encountering this deadlock issue so far in 1.8 branch.
Unfortunately this issue does not seem to have been fixed in the latest 1.8 branch either.

bq. The ast_channel_unlock in process_sdp was just cut&pasted into place, and is pointless. It only unlocks the recursive lock from a couple lines up, not the big lock held by the caller of process_sdp – ultimately handle_request_do.

According to the above comment from Ashley, the lock in process_sdp seems to need to be disabled so the following ast_exists_extension and ast_async_goto will work without deadlocking like:

--------------------------------------------------------------------
Modified on 06/Jul/14:

{code}

/* In global scope of main/astobj2.c, */

int ast_get_mutex_locked_count(void *user_data)
{
        int n_locked_count = 0;
        struct astobj2 *p = INTERNAL_OBJ(user_data);
        if (p == NULL)
                return -1;
        n_locked_count = p->priv_data.lock.mutex.__data.__count;
        return n_locked_count;
}


/* In function process_sdp of channels/chan_sip.c, */

if (ast_test_flag(&p->flags[1], SIP_PAGE2_FAX_DETECT_T38)) {

        /* Commented out */
        /* ast_channel_lock(p->owner); */

        /* Need to check on how many recursive locks are currently being taken by this thread because we are not sure if the channel is always locked whenever it comes here.. */
        /* If the buggy case which we're concerning here, it will be more than zero and in that case we need to unlock the mutex once. */
        int n_locked_count = ast_get_mutex_locked_count(p->owner);

        if (strcmp(p->owner->exten, "fax")) {
                const char *target_context = S_OR(p->owner->macrocontext, p->owner->context);

                /* Commented out */
                /* ast_channel_unlock(p->owner); */

                /* Unlock if the channel was locked by this thread for the following ast_exists_extension and ast_async_goto to work without causing deadlock at a later time.
                   We will try only once even if n_locked_count is more than one though.. */
                if(n_locked_count > 0) ast_channel_unlock(p->owner);

                if (ast_exists_extension(p->owner, target_context, "fax", 1,
                        S_COR(p->owner->caller.id.number.valid, p->owner->caller.id.number.str, NULL))) {
                        ast_verbose(VERBOSE_PREFIX_2 "Redirecting '%s' to fax extension due to peer T.38 re-INVITE\n", p->owner->name);

                        /* Assume the channel is not locked here as long as it hasn't been locked in ast_exists_extension in normal case */
                        ast_channel_lock(p->owner); /* Added */
                        pbx_builtin_setvar_helper(p->owner, "FAXEXTEN", p->owner->exten);
                        ast_channel_unlock(p->owner); /* Added */

                        if (ast_async_goto(p->owner, target_context, "fax", 1)) {
                                ast_log(LOG_NOTICE, "Failed to async goto '%s' into fax of '%s'\n", p->owner->name, target_context);
                        }
                } else {
                        ast_log(LOG_NOTICE, "T.38 re-INVITE detected but no fax extension\n");
                }

                /* Re-lock if the channel was locked */
                if(n_locked_count > 0) ast_channel_lock(p->owner);
        }

        /* Commented out */
        /*  } else {
                ast_channel_unlock(p->owner);
        } */

}
{code}

-FYI: Like the above code snippet, we can use ast_channel_trylock / ast_mutex_trylock for non-blocking lock attempt and checking.-

--------------------------------------------------------------------
Added on 06/Jul/14:

I read the following piece of code about PTHREAD_MUTEX_RECURSIVE_NP, which is the mutex's attribute that Asterisk uses.

In Linux and glibc,

{code}
 int __pthread_mutex_trylock (pthread_mutex_t *mutex)
 {
     int oldval;
     pid_t id = THREAD_GETMEM (THREAD_SELF, tid);
   
     switch (__builtin_expect (PTHREAD_MUTEX_TYPE_ELISION (mutex),
                   PTHREAD_MUTEX_TIMED_NP))
     {
       /* Recursive mutex.  */
       case PTHREAD_MUTEX_RECURSIVE_NP|PTHREAD_MUTEX_ELISION_NP:
       case PTHREAD_MUTEX_RECURSIVE_NP:
       /* Check whether we already hold the mutex.  */
       if (mutex->__data.__owner == id)
       {
         /* Just bump the counter.  */
         if (__builtin_expect (mutex->__data.__count + 1 == 0, 0))
           /* Overflow of the counter.  */
           return EAGAIN;
    
         ++mutex->__data.__count;
         return 0;
       }
    
       if (lll_trylock (mutex->__data.__lock) == 0)
       {
         /* Record the ownership.  */
         mutex->__data.__owner = id;
         mutex->__data.__count = 1;
         ++mutex->__data.__nusers;
         return 0;
       }
       break;

       /* omitted */

     }


     /* omitted */

     return EBUSY;
 }
{code}

So I realized I made a big misunderstanding about the semantics of Asterisk's trylock function which uses PTHREAD_MUTEX_RECURSIVE_NP, so I rectified my previously mentioned code accordingly.
* I thought ast_channel_trylock always returns EBUSY when a channel given is locked by any threads including the calling thread itself.
* Note that my approach is not portable since I used architecture-specific way (I think it will work at least in Linux/glibc).

Also I put the following code in __ao2_lock, __ao2_unlock and __ao2_trylock as follows to see if everything works as expected:

{code}
int __ao2_lock(void *user_data, const char *file, const char *func, int line, const char *var)
{
/* omitted */

        ast_log(LOG_DEBUG, "QLOOG: __ao2_lock: lock info: %s | %d | %s | %s | %d\n", file, line, func, var, ast_get_mutex_locked_count(user_data));

/* omitted */
}

int __ao2_unlock(void *user_data, const char *file, const char *func, int line, const char *var)
{
/* omitted */

        ast_log(LOG_DEBUG, "QLOOG: __ao2_unlock: lock info: %s | %d | %s | %s | %d\n", file, line, func, var, ast_get_mutex_locked_count(user_data));

/* omitted */
}

int __ao2_trylock(void *user_data, const char *file, const char *func, int line, const char *var)
{
/* omitted */

        ast_log(LOG_DEBUG, "QLOOG: __ao2_trylock: lock info: %s | %d | %s | %s | %d\n", file, line, func, var, ast_get_mutex_locked_count(user_data));

/* omitted */
}
{code}

I'm still testing them, so far so good.


was (Author: m.yamamoto):
We also have been encountering this deadlock issue so far in 1.8 branch.
Unfortunately this issue does not seem to have been fixed in the latest 1.8 branch either.

bq. The ast_channel_unlock in process_sdp was just cut&pasted into place, and is pointless. It only unlocks the recursive lock from a couple lines up, not the big lock held by the caller of process_sdp – ultimately handle_request_do.

According to the above comment from Ashley, the lock in process_sdp seems to need to be disabled so the following ast_exists_extension and ast_async_goto will work without deadlocking like:

--------------------------------------------------------------------
Modified on 06/Jul/14:

{code}

/* In global scope of main/astobj2.c, */

AST_MUTEX_DEFINE_STATIC(ast_get_mutex_locked_count_m);
int ast_get_mutex_locked_count(void *user_data)
{
        int n_locked_count = 0;
        struct astobj2 *p = INTERNAL_OBJ(user_data);
        if (p == NULL)
                return -1;
        ast_mutex_lock(&ast_get_mutex_locked_count_m);
        n_locked_count = p->priv_data.lock.mutex.__data.__count;
        ast_mutex_unlock(&ast_get_mutex_locked_count_m);
        return n_locked_count;
}


/* In function process_sdp of channels/chan_sip.c, */

if (ast_test_flag(&p->flags[1], SIP_PAGE2_FAX_DETECT_T38)) {

        /* Commented out */
        /* ast_channel_lock(p->owner); */

        /* Need to check on how many recursive locks are currently being taken by this thread because we are not sure if the channel is always locked whenever it comes here.. */
        /* If the buggy case which we're concerning here, it will be more than zero and in that case we need to unlock the mutex once. */
        int n_locked_count = ast_get_mutex_locked_count(p->owner);

        if (strcmp(p->owner->exten, "fax")) {
                const char *target_context = S_OR(p->owner->macrocontext, p->owner->context);

                /* Commented out */
                /* ast_channel_unlock(p->owner); */

                /* Unlock if the channel was locked by this thread for the following ast_exists_extension and ast_async_goto to work without causing deadlock at a later time.
                   We will try only once even if n_locked_count is more than one though.. */
                if(n_locked_count > 0) ast_channel_unlock(p->owner);

                if (ast_exists_extension(p->owner, target_context, "fax", 1,
                        S_COR(p->owner->caller.id.number.valid, p->owner->caller.id.number.str, NULL))) {
                        ast_verbose(VERBOSE_PREFIX_2 "Redirecting '%s' to fax extension due to peer T.38 re-INVITE\n", p->owner->name);

                        /* Assume the channel is not locked here as long as it hasn't been locked in ast_exists_extension in normal case */
                        ast_channel_lock(p->owner); /* Added */
                        pbx_builtin_setvar_helper(p->owner, "FAXEXTEN", p->owner->exten);
                        ast_channel_unlock(p->owner); /* Added */

                        if (ast_async_goto(p->owner, target_context, "fax", 1)) {
                                ast_log(LOG_NOTICE, "Failed to async goto '%s' into fax of '%s'\n", p->owner->name, target_context);
                        }
                } else {
                        ast_log(LOG_NOTICE, "T.38 re-INVITE detected but no fax extension\n");
                }

                /* Re-lock if the channel was locked */
                if(n_locked_count > 0) ast_channel_lock(p->owner);
        }

        /* Commented out */
        /*  } else {
                ast_channel_unlock(p->owner);
        } */

}
{code}

-FYI: Like the above code snippet, we can use ast_channel_trylock / ast_mutex_trylock for non-blocking lock attempt and checking.-

--------------------------------------------------------------------
Added on 06/Jul/14:

I read the following piece of code about PTHREAD_MUTEX_RECURSIVE_NP, which is the mutex's attribute that Asterisk uses.

In Linux and glibc,

{code}
 int __pthread_mutex_trylock (pthread_mutex_t *mutex)
 {
     int oldval;
     pid_t id = THREAD_GETMEM (THREAD_SELF, tid);
   
     switch (__builtin_expect (PTHREAD_MUTEX_TYPE_ELISION (mutex),
                   PTHREAD_MUTEX_TIMED_NP))
     {
       /* Recursive mutex.  */
       case PTHREAD_MUTEX_RECURSIVE_NP|PTHREAD_MUTEX_ELISION_NP:
       case PTHREAD_MUTEX_RECURSIVE_NP:
       /* Check whether we already hold the mutex.  */
       if (mutex->__data.__owner == id)
       {
         /* Just bump the counter.  */
         if (__builtin_expect (mutex->__data.__count + 1 == 0, 0))
           /* Overflow of the counter.  */
           return EAGAIN;
    
         ++mutex->__data.__count;
         return 0;
       }
    
       if (lll_trylock (mutex->__data.__lock) == 0)
       {
         /* Record the ownership.  */
         mutex->__data.__owner = id;
         mutex->__data.__count = 1;
         ++mutex->__data.__nusers;
         return 0;
       }
       break;

       /* omitted */

     }


     /* omitted */

     return EBUSY;
 }
{code}

So I realized I made a big misunderstanding about the semantics of Asterisk's trylock function which uses PTHREAD_MUTEX_RECURSIVE_NP, so I rectified my previously mentioned code accordingly.
* I thought ast_channel_trylock always returns EBUSY when a channel given is locked by any threads including the calling thread itself.
* Note that my approach is not portable since I used architecture-specific way (I think it will work at least in Linux/glibc).

Also I put the following code in __ao2_lock, __ao2_unlock and __ao2_trylock as follows to see if everything works as expected:

{code}
int __ao2_lock(void *user_data, const char *file, const char *func, int line, const char *var)
{
/* omitted */

        ast_log(LOG_DEBUG, "QLOOG: __ao2_lock: lock info: %s | %d | %s | %s | %d\n", file, line, func, var, ast_get_mutex_locked_count(user_data));

/* omitted */
}

int __ao2_unlock(void *user_data, const char *file, const char *func, int line, const char *var)
{
/* omitted */

        ast_log(LOG_DEBUG, "QLOOG: __ao2_unlock: lock info: %s | %d | %s | %s | %d\n", file, line, func, var, ast_get_mutex_locked_count(user_data));

/* omitted */
}

int __ao2_trylock(void *user_data, const char *file, const char *func, int line, const char *var)
{
/* omitted */

        ast_log(LOG_DEBUG, "QLOOG: __ao2_trylock: lock info: %s | %d | %s | %s | %d\n", file, line, func, var, ast_get_mutex_locked_count(user_data));

/* omitted */
}
{code}

I'm still testing them, so far so good.

> [patch] - Deadlock on fax extension calling ast_async_goto() with locked channel
> --------------------------------------------------------------------------------
>
>                 Key: ASTERISK-21207
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-21207
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Channels/chan_dahdi, Channels/chan_sip/General
>    Affects Versions: 10.7.1, 11.2.1
>         Environment: CentOS 6.3 x86_64
>            Reporter: Ashley Winters
>            Severity: Critical
>         Attachments: backtrace-threads.txt, core-show-locks.txt, fax-deadlock.patch, fax-deadlock-v2.patch, fax-deadlock-v2.patch-11.3.0, gdb-fax-deadlock.txt, issue_log
>
>
> On an asterisk system with heavy use of AGI and inbound CNG-detected faxing, occasionally all channel activity will freeze. Running 'core show channels' returns nothing, but the logs continue running with anything except channel activity. Running with 'sip set debug on' shows that chan_sip.c doesn't even claim to be reading packets anymore.
> This deadlock was triggered several times daily across our array of asterisk servers, which process hundreds of faxes and tens of thousands of calls daily.



--
This message was sent by Atlassian JIRA
(v6.2#6252)



More information about the asterisk-bugs mailing list