[asterisk-bugs] [JIRA] (ASTERISK-19463) Asterisk deadlocks during startup with mutex errors

Matt Jordan (JIRA) noreply at issues.asterisk.org
Thu Nov 15 19:19:45 CST 2012


     [ https://issues.asterisk.org/jira/browse/ASTERISK-19463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Jordan updated ASTERISK-19463:
-----------------------------------

    Description: 
[Edit] 11/15/12

This particular issue has morphed slightly as time has gone on.  Once we solved the original logger thread problem, we quickly ran into another apparent deadlock.  This one has proven much more difficult to resolve.

On initial inspection (see ast_19239_bt), it appears as if something failed to unlock the recursive mutex that protects the lock tracking information in front of the thread registration list's read/write lock.  A deep inspection of the DEBUG_THREADS code surrounding the various read/write lock routines proved this wasn't the case.

Although the deadlock/lock failures happen only sporadically on the 64-bit test system, they happen much more frequently on a 32-bit system, which made the problem easier to reproduce.  Sporadic as the failures are, one thing is consistent: they happen only on startup - once a test got past the thread registration step, the failure never occurred on any other rwlock.

This implied that something was wrong with the thread registration.  This also proved not to be the case - it's incredibly simple code with no possibility of a recursive rwlock/wrlock call.

On to plan C.  We put the following debug code into the ast_reentrancy_lock/unlock:

{noformat}
static inline int ast_reentrancy_lock(struct ast_lock_track *lt)
{
	int res;
	res = pthread_mutex_lock(&lt->reentr_mutex);
	if (res) {
		fprintf(stderr, "Failed to lock reentrancy mutex: %s[%d]\n", strerror(res), res);
		abort();
	}
	return res;
}

static inline int ast_reentrancy_unlock(struct ast_lock_track *lt)
{
	int res;
	res = pthread_mutex_unlock(&lt->reentr_mutex);
	if (res) {
		fprintf(stderr, "Failed to unlock reentrancy mutex: %s[%d]\n", strerror(res), res);
		abort();
	}
	return res;
}
{noformat}

And then we saw the following:
{noformat}
Running ['tests/apps/voicemail/check_voicemail_new_user/run-test'] ...
[Nov 09 19:00:13] WARNING[30837]: asterisk.asterisk:127 errReceived: Asterisk 127.0.0.1 received error: Failed to unlock reentrancy mutex: Operation not permitted[1]
{noformat}

[Rut roh|http://bamboo.asterisk.org/browse/ASTTEAM-MJORDAN-MJORDAN18-80]

For a recursive pthread mutex, a failure to unlock implies that the thread attempting the unlock does not own the mutex.  Except in this case, the call to unlock the mutex happens 5 lines of code after we lock it.

The only explanation I have for this is that the thread ID the mutex recorded when it was locked is no longer the thread ID that is seen when we try to unlock it.

For reference: [pthread_mutex_unlock|http://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_mutex_unlock.c;h=f9fe10b0f2b6f880de046df476082ecb9ffe3adf;hb=HEAD]
{noformat}
  55   else if (__builtin_expect (type == PTHREAD_MUTEX_RECURSIVE_NP, 1))
  56     {
  57       /* Recursive mutex.  */
  58       if (mutex->__data.__owner != THREAD_GETMEM (THREAD_SELF, tid))
  59         return EPERM;
  60 
  61       if (--mutex->__data.__count != 0)
  62         /* We still hold the mutex.  */
  63         return 0;
  64       go
{noformat}

SO... why would this happen?

Well, on startup, the Asterisk Test Suite executes Asterisk by sending it into the background.  When started with the appropriate option, Asterisk calls [daemon|http://www.kernel.org/doc/man-pages/online/pages/man3/daemon.3.html], which forks the currently running process.  Forking does not carry the other threads over into the child - only the forking thread survives - and anything that interacts with state those threads held (locks included) can run into undefined behavior.
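
As a standalone illustration (a minimal sketch, not Asterisk code), a recursive mutex that is locked and then carried across a fork() will refuse to unlock in the child, because the owner TID recorded in the mutex belongs to the parent process:

{code}
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pthread_mutexattr_t attr;
	pthread_mutex_t mtx;

	/* Build a recursive mutex, like the DEBUG_THREADS reentrancy mutex. */
	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
	pthread_mutex_init(&mtx, &attr);

	/* Lock it before forking; glibc records this thread's kernel TID as the owner. */
	pthread_mutex_lock(&mtx);

	pid_t pid = fork();
	if (pid == 0) {
		/* Child: the copied mutex still names the parent's TID as its owner,
		 * so this unlock fails with EPERM ("Operation not permitted"). */
		int res = pthread_mutex_unlock(&mtx);
		fprintf(stderr, "child unlock: %s[%d]\n", strerror(res), res);
		_exit(0);
	}

	waitpid(pid, NULL, 0);
	pthread_mutex_unlock(&mtx);	/* the parent really does own it; this succeeds */
	pthread_mutex_destroy(&mtx);
	pthread_mutexattr_destroy(&attr);
	return 0;
}
{code}

The EPERM it prints is the same "Operation not permitted[1]" that the reentrancy check reported above.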

To test this, we moved the call to daemon higher in the startup sequence and deliberately removed attempts to call into the logger before the daemon call happened.  This immediately resolved the locking issues - whereas [before|http://bamboo.asterisk.org/browse/ASTTEAM-MJORDAN-MJORDAN18-80] we'd see anywhere from one to many failures per run, we now had [none|http://bamboo.asterisk.org/browse/ASTTEAM-MJORDAN-MJORDAN18-94] (at least none related to this problem).

A number of test runs were conducted without any recurrences of the problem.

Moral of the story: combining fork with threads == bad.

[Edit] 03/05/12

Uploaded two new backtraces, one from an instance of Asterisk that hung (ast_19239_bt), and one from the instance of Asterisk that was executing a remote command (ast_19240_bt).  Interestingly enough, in the backtrace ast_19239_bt, Asterisk was not blocked in the logger locks, but on a rwlock that protects registering threads.

[Edit] 03/02/12

Originally, this issue appeared to occur during startup, when a remote Asterisk instance executed a "core wait fullybooted" against an Asterisk instance that was starting up.

Now, however, the same behaviour can occur when a remote Asterisk instance executes "core stop gracefully" - so the fact that this happens while Asterisk is starting up does not seem to be what matters, so much as a remote Asterisk instance sending a command.

Note that in both instances, the logger thread is waiting on the condition variable to be signalled, and a remote console command is being handled.

Additional notes:
1) No log statements were ever logged for the instance of Asterisk corresponding to bt_full_shutdown.
2) In the backtrace, both the logger_thread and the listener thread have the same thread data object passed to their thread startup routines:

Thread 44 (Thread 0xb74ffb70 (LWP 26037)):
...
#10 0x0807b698 in listener (unused=0x0) at asterisk.c:1387
#11 0x081b0344 in dummy_start (data=0x9258038) at utils.c:1010
#12 0x0012ba09 in start_thread () from /lib/libpthread.so.0
#13 0x0059e15e in clone () from /lib/libc.so.6

Thread 43 (Thread 0xb74c3b70 (LWP 26038)):
...
#7  0x0812dba9 in logger_thread (data=0x0) at logger.c:1053
#8  0x081b0344 in dummy_start (data=0x9258038) at utils.c:1010
#9  0x0012ba09 in start_thread () from /lib/libpthread.so.0
#10 0x0059e15e in clone () from /lib/libc.so.6

Since this should have been allocated in utils.c just prior to creating the thread, this is ... odd.  Note that this was the case in the other backtrace as well.

*Note:* This is a red herring.  We free the thread argument as the first action in dummy_start, so the memory is available for reuse.
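
For illustration only, here is a minimal sketch of that pattern (hypothetical names, not the actual utils.c code): the wrapper copies the heap-allocated argument to its stack and frees it immediately, so the allocator is free to hand the same address to the very next thread creation - which is why two threads can show the same data pointer in a backtrace.

{code}
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-thread startup argument. */
struct thr_arg {
	void *(*start_routine)(void *);
	void *data;
};

static void *wrapper_start(void *data)
{
	struct thr_arg a = *(struct thr_arg *) data;	/* copy to the stack */
	free(data);	/* freed immediately, so the same heap address can be
			 * recycled for the argument of the next thread created */
	return a.start_routine(a.data);
}

static int create_thread(pthread_t *thread, void *(*start_routine)(void *), void *data)
{
	struct thr_arg *a = malloc(sizeof(*a));

	if (!a) {
		return -1;
	}
	a->start_routine = start_routine;
	a->data = data;
	return pthread_create(thread, NULL, wrapper_start, a);
}
{code}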

[Edit] - original issue description below. This is somewhat misleading, as the issue appeared to only occur on startup.

It is possible for Asterisk to enter into a deadlock if there are no log messages available to be logged.  This typically happens on startup, before logger.conf has been processed.  In the logger thread, if the list of log messages is empty, the logger thread waits on the condition logcond after it has acquired the list mutex:

{code}
AST_LIST_LOCK(&logmsgs);
if (AST_LIST_EMPTY(&logmsgs)) {
	if (close_logger_thread) {
		break;
	} else {
		ast_cond_wait(&logcond, &logmsgs.lock);
	}
}
next = AST_LIST_FIRST(&logmsgs);
AST_LIST_HEAD_INIT_NOLOCK(&logmsgs);
AST_LIST_UNLOCK(&logmsgs);
{code}

Unfortunately, if close_logger_thread is true, we break out of the loop without unlocking the logmsgs list mutex.  Any thread that subsequently attempts to log a message then deadlocks on that lock.
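
One possible way to restructure the exit path so the list lock is always released (a sketch of the idea, not necessarily the committed fix):

{code}
AST_LIST_LOCK(&logmsgs);
if (AST_LIST_EMPTY(&logmsgs)) {
	if (close_logger_thread) {
		/* Drop the list lock before leaving the loop, so threads
		 * calling ast_log() are not left blocked on it forever. */
		AST_LIST_UNLOCK(&logmsgs);
		break;
	} else {
		ast_cond_wait(&logcond, &logmsgs.lock);
	}
}
next = AST_LIST_FIRST(&logmsgs);
AST_LIST_HEAD_INIT_NOLOCK(&logmsgs);
AST_LIST_UNLOCK(&logmsgs);
{code}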

Backtrace attached.  Note that a 'core show locks' could not be obtained as the deadlock occurs in the logger and fairly early in Asterisk startup; however, the backtrace readily demonstrates the problem.

Thread 3: the logger thread holding the lock
Thread 4: blocked waiting on Thread 3
Thread 1: blocked waiting on Thread 3

  was:
[Edit] 03/05/12

Uploaded two new backtraces, one from an instance of Asterisk that hung (ast_19239_bt), and one from the instance of Asterisk that was executing a remote command (ast_19240_bt).  Interestingly enough, in the backtrace ast_19239_bt, Asterisk was not in the logger locks, but in a rwlock that protects registering threads.

The common thread between all of these hangs seems to be the remote instance of Asterisk attempting to send a CLI command.

[Edit] 03/02/12

Originally, this issue appeared to occur during startup, when a remote Asterisk instance executed a "core wait fullybooted" against an Asterisk instance that was starting up.

Now, however, the same behaviour can occur when a remote Asterisk instance executes "core stop gracefully" - so when this occurs when Asterisk is starting up does not seem to be related to the problem, so much as a remote Asterisk instance sending a command.

Note that in both instances, the logger thread is waiting on the condition variable to be signalled, and a remote console command is being handled.

Additional note:
1) No log statements were ever logged for the instance of Asterisk corresponding to bt_full_shutdown
2) In the backtrace, both the logger_thread and the listener thread have the same thread data object passed to their thread startup routines:

Thread 44 (Thread 0xb74ffb70 (LWP 26037)):
...
#10 0x0807b698 in listener (unused=0x0) at asterisk.c:1387
#11 0x081b0344 in dummy_start (data=0x9258038) at utils.c:1010
#12 0x0012ba09 in start_thread () from /lib/libpthread.so.0
#13 0x0059e15e in clone () from /lib/libc.so.6

Thread 43 (Thread 0xb74c3b70 (LWP 26038)):
...
#7  0x0812dba9 in logger_thread (data=0x0) at logger.c:1053
#8  0x081b0344 in dummy_start (data=0x9258038) at utils.c:1010
#9  0x0012ba09 in start_thread () from /lib/libpthread.so.0
#10 0x0059e15e in clone () from /lib/libc.so.6

Since this should have been allocated in utils.c just prior to creating the thread, this is ... odd.  Note that this was the case in the other backtrace as well.

*Note:* This is a red herring.  We free the thread argument as the first action in dummy_start, so the memory is available for usage again.

[Edit] - original issue description below. This is somewhat misleading, as the issue appeared to only occur on startup.

It is possible for Asterisk to enter into a deadlock if there are no log messages available to be logged.  This typically happens on startup, before logger.conf has been processed.  In the logger thread, if the list of log messages is empty, the logger thread waits on the condition logcond after it has acquired the list mutex:

{code}
AST_LIST_LOCK(&logmsgs);
if (AST_LIST_EMPTY(&logmsgs)) {
	if (close_logger_thread) {
		break;
	} else {
		ast_cond_wait(&logcond, &logmsgs.lock);
	}
}
next = AST_LIST_FIRST(&logmsgs);
AST_LIST_HEAD_INIT_NOLOCK(&logmsgs);
AST_LIST_UNLOCK(&logmsgs);
{code}

This should release logmsgs and block on the condition variable.  In ast_log, we should be able to grab the mutex and signal the condition variable, as shown below in ast_log:

{code}
AST_LIST_LOCK(&logmsgs);
AST_LIST_INSERT_TAIL(&logmsgs, logmsg, list);
ast_cond_signal(&logcond);
AST_LIST_UNLOCK(&logmsgs);
{code}

Unfortunately, on startup, this does not always appear to happen.  The code looks to be correct, based on the defined behaviour of pthread_cond_wait - we should be releasing the mutex when we call ast_cond_wait (which under the hood calls the pthread implementation).  Based on the backtrace, it appears as if that didn't happen.

Backtrace attached.  Note that a 'core show locks' could not be obtained as the deadlock occurs in the logger and fairly early in Asterisk startup; however, the backtrace readily demonstrates the problem.

Thread 3: the logger thread holding the lock
Thread 4: blocked waiting on Thread 3
Thread 1: blocked waiting on Thread 3

        Summary: Asterisk deadlocks during startup with mutex errors  (was: Asterisk deadlocks when remote Asterisk connects and sends a command)
    
> Asterisk deadlocks during startup with mutex errors
> ---------------------------------------------------
>
>                 Key: ASTERISK-19463
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-19463
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Tests/testsuite
>    Affects Versions: 11, 1.8.9.3, 10.1.3
>         Environment: CentOS 6 32-bit (2.6.32-220.el6.i686), 64-bit (2.6.32-220.el6.x86_64)
> Asterisk (1.8, 10, and trunk)
>            Reporter: Matt Jordan
>            Assignee: Matt Jordan
>         Attachments: ast_19239_bt.txt, ast_19240_bt.txt, backtrace.txt, bt_full_shutdown.txt, logger_deadlock.txt
>
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


