[asterisk-bugs] [Asterisk 0014112]: Thread deadlock causes Asterisk to stop routing calls to agents, agents unable to change status

Asterisk Bug Tracker noreply at bugs.digium.com
Mon Mar 2 17:07:47 CST 2009


A NOTE has been added to this issue. 
====================================================================== 
http://bugs.digium.com/view.php?id=14112 
====================================================================== 
Reported By:                sroberts
Assigned To:                
====================================================================== 
Project:                    Asterisk
Issue ID:                   14112
Category:                   Channels/chan_agent
Reproducibility:            random
Severity:                   major
Priority:                   normal
Status:                     feedback
Asterisk Version:           1.4.19 
Regression:                 No 
SVN Branch (only for SVN checkouts, not tarball releases): N/A 
SVN Revision (number only!):  
Request Review:              
====================================================================== 
Date Submitted:             2008-12-19 07:16 CST
Last Modified:              2009-03-02 17:07 CST
====================================================================== 
Summary:                    Thread deadlock causes Asterisk to stop routing
calls to agents, agents unable to change status
Description: 
We run several call centres on Asterisk. Our queue servers are using
Asterisk 1.4.19 on CentOS 4.6. The busier sites can take anywhere between
5000 and 10000 calls per day.

We are experiencing a problem whereby Asterisk occasionally stops
routing calls to agents. If one opens the console and issues a "show
channels" command, the channel list does not finish displaying; it only
shows a portion of the channels. The console then becomes unresponsive.
Calls continue to build up in the queue, and only restarting Asterisk
fixes it.

I have attached a dump of "core show locks". I could not get any further
info this time, as Asterisk must generally be restarted immediately when
this happens to keep the call centre operating. If you look for Thread id
3006892960 you will see that this thread has locked the list of agents
while a whole host of other threads are waiting for this mutex to be
released. I ran "core show locks" 4 times about 2 seconds apart before
restarting and in each case that lock was still being held with many other
threads waiting for it.

The lock in question is in agent_logoff in chan_agent.c. It seems that the
function locks the list of agent channels, finds the agent channel it is
looking for, and then falls into one of the following two loops:

while (p->owner && ast_channel_trylock(p->owner)) {
    ast_mutex_unlock(&p->lock);
    usleep(1);
    ast_mutex_lock(&p->lock);
}

or

while (p->chan && ast_channel_trylock(p->chan)) {
    ast_mutex_unlock(&p->lock);
    usleep(1);
    ast_mutex_lock(&p->lock);
}


It seems as though it is unable to obtain the lock for p->owner or
p->chan, so it keeps looping trying to get the lock. Each pass releases
p->lock but never the agent list lock, so the number of threads waiting for
the agent list to be unlocked keeps growing as more calls pour into the
queue. Because the list of agents is locked, calls cannot be routed.
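
For illustration only, here is a minimal sketch of a bounded version of the
retry loops above, written against plain pthreads rather than the Asterisk
locking API. The names demo_agent and demo_lock_channel_bounded and the
max_tries parameter are hypothetical and do not exist in chan_agent.c. The
idea is that a caller that gets -1 back could release the agent list lock and
retry later instead of spinning while the list stays locked.

```c
#define _DEFAULT_SOURCE /* for usleep() on glibc */
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical stand-in for an agent: `lock` plays the role of p->lock and
 * `chan_lock` points at the lock of p->owner or p->chan (NULL if none). */
struct demo_agent {
    pthread_mutex_t lock;
    pthread_mutex_t *chan_lock;
};

/*
 * Call with a->lock held. Tries to take the channel lock as well, dropping
 * a->lock between attempts like the loops above, but gives up after
 * max_tries attempts instead of looping forever.
 *
 * Returns 0 when the channel lock was taken (or there is no channel),
 * -1 when every attempt failed. a->lock is held on return in both cases.
 */
static int demo_lock_channel_bounded(struct demo_agent *a, int max_tries)
{
    int i;

    assert(max_tries > 0);
    for (i = 0; i < max_tries; i++) {
        if (!a->chan_lock || pthread_mutex_trylock(a->chan_lock) == 0) {
            return 0;
        }
        /* Back off briefly so the current holder can make progress. */
        pthread_mutex_unlock(&a->lock);
        usleep(1);
        pthread_mutex_lock(&a->lock);
    }
    return -1;
}
```

Whether giving up is acceptable depends on the caller; for a logoff request
it may be reasonable to report a failure and let the request be retried, but
that is an assumption and not something the Asterisk code does.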

This problem seems to happen every 2 to 3 weeks and generally occurs only
when the box is receiving a lot of calls.

Any suggestions/advice would be greatly appreciated.




====================================================================== 

---------------------------------------------------------------------- 
 (0101040) sroberts (reporter) - 2009-03-02 17:07
 http://bugs.digium.com/view.php?id=14112#c101040 
---------------------------------------------------------------------- 
At the end of the day, what is really killing me here is the fact that
agent_devicestate_cb is holding the lock on the list of agents while it
waits. This prevents calls being routed to any other agents, which leaves
around 300 call centre agents sitting around twiddling their thumbs. I was
therefore wondering whether I could modify agent_devicestate_cb so
that at least it won't hold the agent list lock indefinitely. In other
words, if it cannot get the lock on an agent channel, it should just carry
on iterating through the list. Once it is done traversing the list of
agents, if the agent in question has been updated, we can finish;
otherwise, unlock the list of agents, sleep for a while and then try again.
This will at least unlock the agent list for other threads to use, meaning
that should one agent become locked or waiting, at least the other agents
can have calls routed to them. My idea for changing agent_devicestate_cb
would be something as follows:

static int agent_devicestate_cb(const char *dev, int state, void *data)
{
	int res, i;
	int stateUpdated = 0;
	struct agent_pvt *p;
	char basename[AST_CHANNEL_NAME], *tmp;

	/* Skip Agent status */
	if (!strncasecmp(dev, "Agent/", 6)) {
		return 0;
	}

	while (!stateUpdated) {
		/* Try to be safe, but don't deadlock */
		for (i = 0; i < 10; i++) {
			if ((res = AST_LIST_TRYLOCK(&agents)) == 0) {
				break;
			}
		}
		if (res) {
			return -1;
		}

		AST_LIST_TRAVERSE(&agents, p, list) {
			res = ast_mutex_trylock(&p->lock);
			if (!res) {
				if (p->chan) {
					ast_copy_string(basename, p->chan->name, sizeof(basename));
					if ((tmp = strrchr(basename, '-'))) {
						*tmp = '\0';
					}
					if (strcasecmp(p->chan->name, dev) == 0 || strcasecmp(basename, dev) == 0) {
						p->inherited_devicestate = state;
						ast_device_state_changed("Agent/%s", p->agent);
						stateUpdated = 1;
					}
				}
				ast_mutex_unlock(&p->lock);
			}
		}
		AST_LIST_UNLOCK(&agents);
		usleep(200);
	}
	return 0;
}
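
One thing to watch in the loop above: it exits only once stateUpdated is set,
so a device that matches no agent would keep it retrying indefinitely. A
possible way around that, sketched here with plain pthreads and hypothetical
names (demo_entry, demo_update_state, max_passes), is to retry only when an
agent had to be skipped because its lock was busy, and to cap the number of
passes. This is an illustration of the skip-and-retry idea under those
assumptions, not a tested patch for chan_agent.c.

```c
#define _DEFAULT_SOURCE /* for usleep() on glibc */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for struct agent_pvt. */
struct demo_entry {
    pthread_mutex_t lock;
    const char *device; /* plays the role of p->chan->name */
    int state;          /* plays the role of p->inherited_devicestate */
};

/*
 * Walk the list under list_lock, skipping entries whose lock is busy.
 * The list lock is released after every pass, so other threads can use
 * the list while this one waits to retry.
 *
 * Returns 1 if a pass completed with nothing skipped and at least one
 * entry matched, 0 if it completed with no match, and -1 if some entry
 * was still locked after max_passes passes.
 */
static int demo_update_state(pthread_mutex_t *list_lock,
                             struct demo_entry *entries, size_t n,
                             const char *dev, int state, int max_passes)
{
    int pass, found = 0;
    size_t i;

    assert(max_passes > 0);
    for (pass = 0; pass < max_passes; pass++) {
        int skipped = 0;

        pthread_mutex_lock(list_lock);
        for (i = 0; i < n; i++) {
            if (pthread_mutex_trylock(&entries[i].lock) != 0) {
                /* Busy: it may or may not be the entry we want. */
                skipped = 1;
                continue;
            }
            if (strcmp(entries[i].device, dev) == 0) {
                entries[i].state = state;
                found = 1;
            }
            pthread_mutex_unlock(&entries[i].lock);
        }
        pthread_mutex_unlock(list_lock);

        if (!skipped) {
            return found; /* every entry was inspected */
        }
        usleep(200); /* give the lock holder time before the next pass */
    }
    return -1;
}
```

Because a skipped entry cannot be inspected without its lock, the sketch
cannot tell whether it was the one that needed updating; that is why any
skip triggers another pass rather than only a missing match.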

Any feedback on this would be greatly appreciated, as I'm really quite
desperate... 

Issue History 
Date Modified    Username       Field                    Change               
====================================================================== 
2009-03-02 17:07 sroberts       Note Added: 0101040                          
======================================================================



