[asterisk-dev] Advice debugging chan_agent device state caching under Asterisk 11.25

Alex Villacís Lasso a_villacis at palosanto.com
Fri Oct 13 12:48:37 CDT 2017


First of all, I know that 11.25.x is in security fix mode, if not already EOL.

I am providing support for a client that runs an Asterisk 11.25.0 installation as a callcenter. Updating to any later branch is not possible right now because the client is very cautious about the stability of the software, and because the call center 
software (written by me) uses chan_agent extensively and has not (yet) been ported to the new method of proxy agents under Asterisk 13 and later.

The callcenter currently works by issuing an AMI Originate between Local/XXXXX@context (where XXXXX is the outbound number to dial) and an extension that drops the call into a Queue. Several agents (logged in using chan_agent as Agent/YYYYY) are 
members of that queue and are available to take calls. The callcenter supervisors frequently (several times per day) move queue members in and out of the queues as required by the ongoing workload. The callcenter agents become available in the 
queues by logging into the Agent channel; they cannot choose to log into the queue itself.

Most of the time, this setup works correctly. However, once or twice per week, a situation arises where one or more agents are logged in and idle (according to reality and to the output of "agent show"), but the queues where they are 
members show them as Busy. An (anonymized) example is shown below:

[root@CALLCENTERSERVER ~]# asterisk -rnx 'queue show 108' | grep Busy ; asterisk -rnx 'agent show'  | grep 1562
       Agent/4248 (ringinuse enabled) (Busy) has taken no calls yet
       Agent/4275 (ringinuse enabled) (Busy) has taken no calls yet
       Agent/1562 (ringinuse enabled) (Busy) has taken no calls yet
       Agent/4259 (ringinuse enabled) (Busy) has taken no calls yet
       Agent/4286 (ringinuse enabled) (Busy) has taken no calls yet
1562         (XXXXXX XXXXXXXX XXXXXXXX XXXXX) logged in on SIP/9028-0003f3e6 is idle (musiconhold is 'none')
[root@CALLCENTERSERVER ~]#

Here, agent channel Agent/1562 is logged in and free according to "agent show". However, queue "108" shows this agent (as well as a few others) as Busy. In this particular instance, at least Agent/4248 is not even logged in, again 
according to "agent show". From experimentation, no amount of moving agents between queues clears this situation for the affected agents, and the only fix that works so far is a full Asterisk restart.
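The manual cross-check above could be automated along these lines. This is only a sketch: the function names are hypothetical, and the line patterns are derived solely from the sample output shown in this message, so they may need adjusting for other output variants.

```python
import re


def busy_queue_members(queue_show_output):
    """Agent numbers that 'queue show' lists with a (Busy) status."""
    found = []
    for line in queue_show_output.splitlines():
        # Pattern assumed from the sample: "Agent/1562 (...) (Busy) ..."
        match = re.match(r"\s*Agent/(\d+)\s.*\(Busy\)", line)
        if match:
            found.append(match.group(1))
    return found


def idle_agents(agent_show_output):
    """Agent numbers that 'agent show' reports as logged in and idle."""
    found = set()
    for line in agent_show_output.splitlines():
        # Pattern assumed from the sample: "1562 (...) logged in on X is idle"
        match = re.match(r"(\d+)\s", line)
        if match and "logged in on" in line and " is idle" in line:
            found.add(match.group(1))
    return found


def stale_busy_agents(queue_show_output, agent_show_output):
    """Agents the queue calls Busy although 'agent show' says they are idle."""
    idle = idle_agents(agent_show_output)
    return [a for a in busy_queue_members(queue_show_output) if a in idle]
```

The two inputs would be the captured output of asterisk -rnx 'queue show 108' and asterisk -rnx 'agent show'. Agents that are Busy in the queue but not logged in at all (such as Agent/4248 here) are deliberately not reported by this sketch and would need a separate check.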

I have a copy of the git repository and am able to recompile at will. By studying the relevant code, I have found that the device state is cached in at least two levels:

 1. The queue member has a "status" field in the struct representing it in app_queue. This field is supposed to be updated by a subscription to the device change event supplied by the Asterisk core.
 2. The Asterisk core maintains a cache of the device state for all devices, including Agent/XXXXX from chan_agent.

Because of this, I had to check which level is the one that is out of sync with reality. On wiki.asterisk.org I learned of the DEVICE_STATE() function, which returns the current (cached) device state of its parameter. Therefore, for the above example, I ran the 
following using telnet:

[root@CALLCENTERSERVER ~]# telnet 127.0.0.1 5038
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
Asterisk Call Manager/1.3
Action: Login
Username: admin
Secret: *********

Response: Success
Message: Authentication accepted

Event: FullyBooted
Privilege: system,all
Status: Fully Booted

Action: Events
EventMask: off

Response: Success
Events: Off

Action: GetVar
Variable: DEVICE_STATE(Agent/1562)

Response: Success
Variable: DEVICE_STATE(Agent/1562)
Value: BUSY

Action: Logoff

Response: Goodbye
Message: Thanks for all the fish.

Connection closed by foreign host.
[root@CALLCENTERSERVER ~]#

I see that DEVICE_STATE() also believes that the agent channel is Busy when it is actually idle. So I conclude that the issue lies somewhere in chan_agent and how the Asterisk core gets its cached value from it.
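For repeated checks, the telnet session above could be scripted roughly as follows. This is a minimal sketch, not a tested tool: the function names are hypothetical, it skips the "Events" action from the manual session, and it instead tags each action with an ActionID header and ignores any packet (such as the FullyBooted event) that does not carry the same ActionID.

```python
import socket


def build_action(fields):
    """Serialize an AMI action: 'Key: Value' lines ended by a blank line."""
    return "".join(f"{k}: {v}\r\n" for k, v in fields.items()) + "\r\n"


def parse_packet(text):
    """Parse one AMI packet (response or event) into a dict."""
    packet = {}
    for line in text.split("\r\n"):
        key, sep, value = line.partition(":")
        if sep:
            packet[key.strip()] = value.strip()
    return packet


def send_action(stream, fields):
    """Send one action; return the packet that echoes its ActionID."""
    stream.write(build_action(fields).encode())
    stream.flush()
    while True:
        lines = []
        while True:
            raw = stream.readline()
            if not raw:
                raise ConnectionError("AMI connection closed")
            line = raw.decode(errors="replace").rstrip("\r\n")
            if line == "":
                break  # a blank line terminates the packet
            lines.append(line)
        packet = parse_packet("\r\n".join(lines))
        if packet.get("ActionID") == fields["ActionID"]:
            return packet  # unrelated events are skipped


def query_device_state(host, port, username, secret, device):
    """Log in, read DEVICE_STATE(device) through GetVar, then log off."""
    with socket.create_connection((host, port), timeout=5) as sock:
        stream = sock.makefile("rwb")
        stream.readline()  # banner line, e.g. "Asterisk Call Manager/1.3"
        login = send_action(stream, {
            "Action": "Login", "ActionID": "login-1",
            "Username": username, "Secret": secret,
        })
        if login.get("Response") != "Success":
            raise PermissionError(login.get("Message", "login failed"))
        reply = send_action(stream, {
            "Action": "GetVar", "ActionID": "getvar-1",
            "Variable": f"DEVICE_STATE({device})",
        })
        stream.write(build_action({"Action": "Logoff"}).encode())
        stream.flush()
        return reply.get("Value")
```

With the values from the session above, query_device_state("127.0.0.1", 5038, "admin", secret, "Agent/1562") would be expected to return the same cached value that telnet showed.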

Do you have any advice (other than just "update Asterisk") on where to go from here? I see that the chan_agent source code only sets device state to Unknown, and not to other states. Is there a way to force Asterisk to flush or refresh the device state 
cache, either globally or per device? If none exists, what do you think of implementing such a command as a workaround while searching for the true solution?