[asterisk-bugs] [Asterisk 0014112]: [patch] Thread deadlock causes Asterisk to stop routing calls to agents, agents unable to change status

Asterisk Bug Tracker noreply at bugs.digium.com
Thu Jun 18 08:17:13 CDT 2009


A NOTE has been added to this issue. 
====================================================================== 
https://issues.asterisk.org/view.php?id=14112 
====================================================================== 
Reported By:                sroberts
Assigned To:                
====================================================================== 
Project:                    Asterisk
Issue ID:                   14112
Category:                   Channels/chan_agent
Reproducibility:            random
Severity:                   major
Priority:                   normal
Status:                     ready for review
Target Version:             1.4.27
Asterisk Version:           1.4.19 
Regression:                 No 
SVN Branch (only for SVN checkouts, not tarball releases): N/A 
SVN Revision (number only!):  
Request Review:              
====================================================================== 
Date Submitted:             2008-12-19 07:16 CST
Last Modified:              2009-06-18 08:17 CDT
====================================================================== 
Summary:                    [patch] Thread deadlock causes Asterisk to stop
routing calls to agents, agents unable to change status
Description: 
We run several call centres on Asterisk. Our queue servers are using
Asterisk 1.4.19 on CentOS 4.6. The busier sites can take anywhere between
5000 and 10000 calls per day.

We are experiencing a problem whereby Asterisk occasionally stops routing
calls to agents. If one opens the console and issues a "show channels"
command, the channel list does not finish displaying; only a portion of
the channels is shown, and the console then becomes unresponsive. Calls
continue to build up in the queue, and only restarting Asterisk fixes it.

I have attached a dump of "core show locks". I could not get any further
info this time, as Asterisk generally has to be restarted immediately when
this happens to keep the call centre operating. If you look for thread id
3006892960 you will see that this thread has locked the list of agents
while a whole host of other threads are waiting for that mutex to be
released. I ran "core show locks" four times, about two seconds apart,
before restarting, and in each case that lock was still being held with
many other threads waiting for it.

The lock in question is acquired in agent_logoff() in chan_agent.c. The
function locks the list of agent channels, matches the agent channel it is
looking for, and then falls into one of the following two loops:

while (p->owner && ast_channel_trylock(p->owner)) {
    ast_mutex_unlock(&p->lock);
    usleep(1);
    ast_mutex_lock(&p->lock);
}

or

while (p->chan && ast_channel_trylock(p->chan)) {
    ast_mutex_unlock(&p->lock);
    usleep(1);
    ast_mutex_lock(&p->lock);
}


It seems as though it is unable to obtain the lock on p->owner or p->chan,
so it keeps looping trying to acquire it. Note that these loops release
only p->lock between attempts; the agent list stays locked the whole time.
The problem is that the number of threads waiting for the agent list to be
unlocked grows and grows as more and more calls pour into the queue, and
because the list of agents is locked, calls can't be routed.
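
A minimal standalone sketch in plain pthreads may make the failure mode
concrete (this is illustrative code, not Asterisk source; list_lock stands
in for the agents list lock and chan_lock for a channel lock). If the
holder of the channel lock is itself waiting on the list lock, the trylock
loop spins forever and every other thread piles up behind the list lock:

/* Illustrative only: not Asterisk source. Mimics agent_logoff()
 * spinning on a channel lock while holding the agents list lock. */
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t chan_lock = PTHREAD_MUTEX_INITIALIZER;

static void *logoff_thread(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&list_lock);      /* like locking the agents list */
    while (pthread_mutex_trylock(&chan_lock)) {
        /* chan_agent.c releases only p->lock here, never the list
         * lock, so everything needing the agent list queues up. */
        usleep(1);
    }
    pthread_mutex_unlock(&chan_lock);
    pthread_mutex_unlock(&list_lock);
    return NULL;
}

static void *channel_thread(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&chan_lock);      /* holds the channel lock... */
    pthread_mutex_lock(&list_lock);      /* ...and waits on the list: deadlock */
    pthread_mutex_unlock(&list_lock);
    pthread_mutex_unlock(&chan_lock);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, logoff_thread, NULL);
    pthread_create(&b, NULL, channel_thread, NULL);
    /* Depending on scheduling, the threads usually end up holding one
     * lock each, after which neither join ever returns. */
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}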

This problem seems to happen every 2 to 3 weeks and generally occurs only
when the box is receiving a lot of calls.

Any suggestions/advice would be greatly appreciated.




======================================================================
Relationships       ID      Summary
----------------------------------------------------------------------
related to          0013676 channel get stuck on ast_queue_frame wh...
====================================================================== 

---------------------------------------------------------------------- 
 (0106619) sroberts (reporter) - 2009-06-18 08:17
 https://issues.asterisk.org/view.php?id=14112#c106619 
---------------------------------------------------------------------- 
I've recently encountered a new deadlock where a call file being hung up
causes a deadlock in local_hangup (line 515 of chan_local.c).

We change agents' states using call files (creating local channels which
execute dialplan logic). What seems to be happening is that when a call
file finishes executing, local_hangup tries to grab the lock on the pvt
structure of the channel; however, it appears that this channel has
already been destroyed, so the thread sits waiting for a lock which no
longer exists. This causes a deadlock and sometimes even a segfault. The
biggest issue with this problem is that when the deadlock occurs, running
"core show locks" causes Asterisk's CLI to freeze up completely, stopping
all routing of calls.
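
For reference, the mechanism looks like this (the context, queue and
member names below are hypothetical, not our actual configuration). A call
file dropped into /var/spool/asterisk/outgoing/ originates a Local
channel; the Local channel's dialplan leg changes the member's state, and
the originating leg hangs up afterwards, which is exactly the local_hangup
path in question:

Channel: Local/pause@agent-admin
MaxRetries: 0
WaitTime: 15
Application: Wait
Data: 1

with dialplan logic along these lines (1.4 syntax):

[agent-admin]
exten => pause,1,PauseQueueMember(support|Agent/4001)
exten => pause,n,Hangup()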

I have managed to create a setup in our lab environment in which I can
recreate this bug. What I'm doing is creating 1000 queue members and
assigning them to a queue. I then start pumping calls into the queue at a
rate of about one per second. I've also created a script which logs in all
1000 agents, waits 60s, then unpauses them all, waits 60s, then pauses
them all, waits another 60s, then unpauses them all, and finally, after
another 60s, logs them all out and starts over again. The machine with the
queue on it then routes the calls sitting in the queue to another box,
which answers and plays tt-monkeys a couple of times before hanging up.
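
The call-file-dropping half of such a script can be sketched like this
(paths and names are hypothetical; the detail that matters is writing each
file outside outgoing/ and rename()ing it in, so pbx_spool never reads a
half-written file):

/* Hypothetical sketch: queue one call file per agent. The temp
 * directory must be on the same filesystem as the spool so that
 * rename() is atomic. */
#include <stdio.h>

int main(void)
{
    char tmp[128], dst[128];
    int i;

    for (i = 1; i <= 1000; i++) {
        FILE *f;
        snprintf(tmp, sizeof(tmp), "/var/spool/asterisk/tmp/pause-%04d.call", i);
        snprintf(dst, sizeof(dst), "/var/spool/asterisk/outgoing/pause-%04d.call", i);
        if (!(f = fopen(tmp, "w")))
            return 1;
        fprintf(f, "Channel: Local/pause@agent-admin\n"
                   "MaxRetries: 0\n"
                   "WaitTime: 15\n"
                   "Application: Wait\n"
                   "Data: 1\n");
        fclose(f);
        if (rename(tmp, dst))   /* atomic hand-off to pbx_spool */
            return 1;
    }
    return 0;
}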

Under a sustained load of about 200-300 concurrent calls this causes
Asterisk to deadlock in under 5 minutes. Now you might think that this is
a bit extreme, but our call centres are pushing 10k calls per day and we
are getting one or two of these lockups per week; the test setup just
accelerates the occurrence of the deadlock.

Upon further investigation, I realised where the problem was; looking at
the GDB stack trace I noticed the following:

#9  0x008e400e in __ast_pthread_mutex_lock (filename=0xb1f2bf80
"chan_local.c", lineno=515, func=0xb1f2bf30 "local_hangup",
mutex_name=0xb1f2bee0 "&p->lock", t=0x92db838) at
/usr/src/asterisk-1.4.22.2-VITEL2-090611/include/asterisk/lock.h:406
	res = -1309491552
	canlog = 0
	__PRETTY_FUNCTION__ = "__ast_pthread_mutex_lock"
#10 0x008e5e67 in local_hangup (ast=0x9535868) at chan_local.c:549
	__func = "local_hangup", '\0' <repeats 48 times>,
"5?S\000\001\000\000\000\024`B\000?????@\021\b"
	__filename =
"chan_local.c\000?S\000\001\000\000\000?-\t????@\021\b0?t\t?U\021\b???Q[\021\b??M\t?\216S\000???\001\000\000\000?-\t?\216S\000????hXS\t"
	__mutex_name = "&p->lock", '\0' <repeats 71 times>
	__lineno = 515
	__res = 0
	p = (struct local_pvt *) 0x92db838
	isoutbound = 0
	f = {frametype = AST_FRAME_CONTROL, subclass = 1, datalen = 0, samples =
0, mallocd = 0, mallocd_hdr_len = 0, offset = 0, src = 0x0, data = 0x0,
delivery = {tv_sec = 0, tv_usec = 0}, frame_list = {next = 0x0}, flags = 0,
ts = 0, len = 0, seqno = 0}
	ochan = (struct ast_channel *) 0x0
	glaredetect = 0
	res = 0
	__PRETTY_FUNCTION__ = "local_hangup"


What I noticed here was that it was attempting to take p->lock at line 515
AND at line 549 of local_hangup. Upon examining the function, I noticed
the following.

At the beginning of local_hangup, it acquires a lock on p->lock at line
515:

while (ast_mutex_trylock(&p->lock)) {
    ast_channel_unlock(ast);
    usleep(1);
    ast_channel_lock(ast);
}

It then enters the if (isoutbound) {} else {} section. The thing to note
here is that the if (isoutbound) branch unlocks p->lock at line 527 and
then re-locks it at line 535. This is fine; however, the else block is
where I see the problem. The following spin lock attempts to re-lock
p->lock:

while (p->chan && ast_channel_trylock(p->chan)) {
    DEADLOCK_AVOIDANCE(&p->lock);
}

However p->lock is already held at this point, so this is the second time
it is being locked!
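
For reference, DEADLOCK_AVOIDANCE(&p->lock) in the 1.4 headers expands to
essentially the following (paraphrased from include/asterisk/lock.h); each
failed trylock drops and re-takes the lock passed in:

#define DEADLOCK_AVOIDANCE(lock) \
    do { \
        ast_mutex_unlock(lock); \
        usleep(1); \
        ast_mutex_lock(lock); \
    } while (0)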

Looking further down, you can see that in the if (!p->owner && !p->chan)
block, p->lock is released and the function returns. If execution does not
enter this block, we have the following:

if (!res)
    ast_mutex_unlock(&p->lock);

What I'm getting at here is that under certain circumstances p->lock is
being locked twice and only unlocked once! My guess is that it gets locked
twice; then, when another thread needs that lock, it blocks waiting for it
while the channel gets destroyed, leaving the thread sitting there forever
waiting for a lock that no longer exists.
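
As a standalone illustration of that failure mode (plain pthreads rather
than Asterisk's wrappers; ast_mutex_t is a recursive mutex on Linux, so a
double lock does not self-deadlock, it just leaves the recursion count
unbalanced):

/* Illustrative only: a recursive mutex locked twice but unlocked once
 * stays held, so any other thread waits on it forever. Destroying a
 * still-held mutex is undefined behaviour, which would also account
 * for the occasional segfault. */
#include <pthread.h>
#include <stdio.h>
#include <errno.h>

static pthread_mutex_t lock;

static void *waiter(void *arg)
{
    int rc;
    (void) arg;
    rc = pthread_mutex_trylock(&lock);
    printf("trylock from second thread: %s\n",
           rc == EBUSY ? "EBUSY (still held)" : "acquired");
    return NULL;
}

int main(void)
{
    pthread_mutexattr_t attr;
    pthread_t t;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&lock, &attr);

    pthread_mutex_lock(&lock);   /* first lock (line 515 in the trace) */
    pthread_mutex_lock(&lock);   /* second lock: recursion count is 2 */
    pthread_mutex_unlock(&lock); /* single unlock: count is still 1 */

    pthread_create(&t, NULL, waiter, NULL);
    pthread_join(t, NULL);       /* prints EBUSY: the lock is never freed */
    return 0;
}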

I have fixed this by commenting out these lines:

while (p->chan && ast_channel_trylock(p->chan)) {
    DEADLOCK_AVOIDANCE(&p->lock);
}

I don't see the need for this loop, as the lock is already guaranteed to
be held: the code executed almost immediately before it is itself a spin
lock which grabs the same mutex.

With this code removed, I have now been running the same simulation for
over 3 hours with no lockups whatsoever at a sustained load of 300
concurrent calls. Bear in mind that with that code in, Asterisk deadlocks
in less than 5 minutes.

Now I've created a patch for this which removes that code (I'm running
1.4.22.2, but I see the code is exactly the same in 1.4.25). I don't see
any harm in removing it, and my testing shows that it works fine; however,
I would like to know whether I will be breaking anything by doing this. As
I said, I don't see how, but then again that code must surely be there for
a reason? If it does need to be there, is there a better way to do this?
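
For clarity, the patch amounts to a removal along these lines (the hunk
header is schematic; exact line numbers differ between 1.4.22.2 and
1.4.25):

--- channels/chan_local.c
+++ channels/chan_local.c
@@ local_hangup, else branch @@
-	while (p->chan && ast_channel_trylock(p->chan)) {
-		DEADLOCK_AVOIDANCE(&p->lock);
-	}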

I'm going to apply this patch to one of our quieter servers tonight, while
leaving the heavy load test running overnight to see how things work out.
I'd appreciate some feedback on this, as I would like to roll it out to
our busier servers as soon as possible; these deadlocks are a major
headache.

Thanks! 

Issue History 
Date Modified    Username       Field                    Change               
====================================================================== 
2009-06-18 08:17 sroberts       Note Added: 0106619                          
======================================================================



