[asterisk-bugs] [Asterisk 0009788]: Deadlock problem with agents, queues and libpri (stop accepting incoming calls in PRI lines)

Fri Sep 14 08:40:00 CDT 2007

A NOTE has been added to this issue. 
====================================================================== 
http://bugs.digium.com/view.php?id=9788 
====================================================================== 
Reported By:                Ted Brown
Assigned To:                
====================================================================== 
Project:                    Asterisk
Issue ID:                   9788
Category:                   Addons/General
Reproducibility:            sometimes
Severity:                   crash
Priority:                   normal
Status:                     new
Asterisk Version:           1.4.10.1  
SVN Branch (only for SVN checkouts, not tarball releases): N/A  
SVN Revision (number only!):  
Disclaimer on File?:        No 
Request Review:              
====================================================================== 
Date Submitted:             05-23-2007 18:18 CDT
Last Modified:              09-14-2007 08:39 CDT
====================================================================== 
Summary:                    Deadlock problem with agents, queues and libpri
(stop accepting incoming calls in PRI lines)
Description: 
I have an Asterisk-based call center deployment with around 40 SIP users,
attending incoming calls from two PRI lines (2xE1) using agents and
queues.

The problem is that Asterisk stops accepting new incoming calls on the PRI
lines for no apparent reason. There should be free channels to make room
for new incoming calls, but Asterisk thinks these channels are in use.
SIP calls can be placed without problems between internal users.

The PRI lines shouldn't be the origin of the problem, as an old legacy PBX
works perfectly with the same lines, so the problem seems to be related
to agents or queues.

After the crash, running "zap show channels" shows that all channels
are busy, and calls appear to have been queued for a long time in
different queues (they are not really there - users don't usually wait
90 minutes to be attended while listening to the music on hold).

There are no other services running on the server, CDR is being stored to
disk and we are not using any kind of AGIs or reporting tools. Currently
the only solution is to reboot the machine, as restarting Asterisk is not
enough. Using any command on the CLI results in no output at all.

The crash is not easily reproducible, as it doesn't follow a clear
pattern. Asterisk just seems to get blocked when it manages around 30-40
calls in the queues. During the last week, we had 2-3 crashes each day.

Based on posts to the users mailing list, it seems that other users have
had a similar problem in the same scenario, at least with 1.2.x. More
precisely, we have observed the same problem in bug ID 0006147, but it was
closed without a clear answer.

Hardware and software specs:

 Platform: Suse Linux Enterprise Server 10
 Machine: IBM xSeries 226, 1 GB RAM, Intel CPU
 PRI card: Digium TE212 with echo cancellation module
 Asterisk version: 1.2.18

Here is a list of the most relevant messages before and after the crash:

DEBUG[28519] chan_sip.c: Stopping retransmission on
'NzNmZWM0ZDc0OTYyNWI5YWM2ZTBhZjY3NDM4N2RjNmQ.' of Response 12: Match Found 
(lots of messages like that)

DEBUG[28511] chan_zap.c: Ring requested on channel 0/13 already in use or
previously requested on span 1.  Attempting to renegotiating channel.

DEBUG[28511] chan_zap.c: Found empty available channel 0/9

DEBUG[29939] app_dial.c: Exiting with DIALSTATUS=CONGESTION.

I would very much appreciate any help on this. I can provide a backtrace
if needed.

Best regards,
====================================================================== 

---------------------------------------------------------------------- 
 Ted Brown - 09-14-07 08:39  
---------------------------------------------------------------------- 
Hi everyone,

after a three-day troubleshooting process, I think we've located the
problem, and we can force it to appear as many times as we want.

The scenario is as follows:

- A call ("Call_1") arrives at a queue named 'Queue_A'. An agent
'Agent_A', registered in that queue, takes that call 'Call_1'. 

- After that, the same agent 'Agent_A' initiates another call ("Call_2",
using a free line in his/her softphone and placing "Call_1" on hold) to
another queue 'Queue_B'.

- An agent 'B' ('Agent_B') registered in that queue ('Queue_B') takes the
call, so 'Agent_A' transfers the first call 'Call_1' to agent 'B'. 

- Both agents use softphones (Eyebeam).

After this simple transfer, Asterisk will crash in one of these
situations: 

 a) If you run a 'show channels' or 'core show channels' command on the
CLI
 b) If another new call is placed, no matter where it comes from.

In the first case, Asterisk doesn't always crash the first time, but
it will eventually crash if you run 'show channels' several times in a row
(15-20 in our tests). 

In the second case, Asterisk crashes as soon as it receives an INVITE
request (we've noticed in the coredump that this request is sent from agent
'B' to agent 'A' some time after the transfer, which we think makes no
sense, as the new call has nothing to do with either of them).

In the troubleshooting process, we first tried with incoming calls from a
PRI line. Asterisk crashed. So we tried it from a SIP line, as we thought 
it could be a PRI line misconfiguration, but it also crashed. So the issue
is not related at all to the PRI configuration (I've tried to change the
summary of this issue with no luck; should I create a new one?).

Looking at the core generated by the crash, we've found this:

#0  0x000000000044a03b in ast_bridged_channel (chan=0x2aaab5227b24) at
channel.c:3900
3900  if (bridged && bridged->tech && bridged->tech->bridged_channel)

In fact, the original line is "if (bridged &&
bridged->tech->bridged_channel)". We assumed bridged->tech was null and
that dereferencing it caused the crash, but the change we made (to include
"&& bridged->tech") didn't solve the issue. 

Now, the value of bridged->tech is:

(gdb) print bridged->tech
$1 = (const struct ast_channel_tech *) 0x2aaadeadbeef
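The value printed above is not NULL, which would be consistent with the
added "&& bridged->tech" check not helping: a NULL test only catches a
pointer that is exactly zero, not one that holds a stale or garbage
address. A minimal sketch of that point follows. The names demo_tech,
demo_chan and null_guard_passes are hypothetical stand-ins, not the
Asterisk definitions, and the sketch never dereferences the bad pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the real Asterisk structures. */
struct demo_tech { void *bridged_channel; };
struct demo_chan { const struct demo_tech *tech; };

/* Mirrors the shape of the guard "bridged && bridged->tech".
 * Returns 1 if the guard lets the pointer through, 0 otherwise.
 * It deliberately does not dereference tech. */
static int null_guard_passes(const struct demo_chan *bridged)
{
    return bridged != NULL && bridged->tech != NULL;
}

/* Usage: a NULL tech is rejected, a garbage (non-NULL) tech is not. */
static void demo_check(void)
{
    struct demo_chan clean = { NULL };
    struct demo_chan stale = {
        /* Address copied from the gdb output; used only as a bit pattern. */
        (const struct demo_tech *)(uintptr_t)0x2aaadeadbeefULL
    };

    assert(null_guard_passes(&clean) == 0);
    assert(null_guard_passes(&stale) == 1);
}
```

In other words, the guard would go on to read bridged->tech->bridged_channel
through the bad address. The low bytes of that address spell 0xdeadbeef, a
pattern often used to mark invalid memory; whether anything in this build
writes that pattern is an assumption, not something the core dump confirms.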

We can't go much further than this, as this variable is referenced from a
lot of functions. The big question is why bridged->tech is corrupted.
Could this be an incorrect memory free? Or maybe a semaphore/monitor
problem when accessing this variable?
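To make the first hypothesis concrete: if a channel were freed during the
transfer while another channel still pointed at it as its bridged peer, the
remaining pointer would dangle. The sketch below shows one generic way to
rule that out, by reference-counting the peer. It is an illustration only;
demo_chan and its functions are hypothetical, and this is not how Asterisk
manages channels.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical channel with a reference count; not the Asterisk layout. */
struct demo_chan {
    int refs;                /* number of holders, including the owner */
    struct demo_chan *peer;  /* bridged peer, or NULL */
};

static struct demo_chan *demo_chan_new(void)
{
    struct demo_chan *c = calloc(1, sizeof(*c));
    if (c != NULL)
        c->refs = 1;         /* the creator's reference */
    return c;
}

/* Drop one reference; free only when nobody holds the channel.
 * Returns 1 if the memory was released, 0 otherwise. */
static int demo_chan_unref(struct demo_chan *c)
{
    assert(c != NULL && c->refs > 0);
    if (--c->refs > 0)
        return 0;
    free(c);
    return 1;
}

/* Bridge a to b: a keeps b alive for as long as it points at it. */
static void demo_bridge(struct demo_chan *a, struct demo_chan *b)
{
    b->refs++;
    a->peer = b;
}

/* Clear the pointer before giving up the reference, so a->peer
 * never refers to freed memory. */
static void demo_unbridge(struct demo_chan *a)
{
    struct demo_chan *b = a->peer;
    a->peer = NULL;
    if (b != NULL)
        demo_chan_unref(b);
}
```

With this scheme, a hangup on the peer (demo_chan_unref) while the bridge is
still in place only lowers the count; the memory is released when the last
holder calls demo_unbridge. In a multithreaded server the count and the peer
pointer would also need to be protected by a lock, which this
single-threaded sketch leaves out.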

We have also recompiled Asterisk with the DONT_OPTIMIZE option and other
debug flags, and we disabled every unnecessary IRQ in the BIOS. CPU and
RAM aren't an issue, as we are monitoring them and at no point did CPU
usage rise above 40% or RAM usage above 10% (both peak values).

As we can easily reproduce the crash, do not hesitate to contact us for
more tests. 

Thank you! 

Issue History 
Date Modified   Username       Field                    Change               
====================================================================== 
09-14-07 08:39  Ted Brown      Note Added: 0070526                          
======================================================================