[asterisk-bugs] [JIRA] (ASTERISK-29578) app_queue: Custom device states or hints become stale and fail to update
Joshua C. Colp (JIRA)
noreply at issues.asterisk.org
Sun Aug 15 12:19:33 CDT 2021
[ https://issues.asterisk.org/jira/browse/ASTERISK-29578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=255830#comment-255830 ]
Joshua C. Colp commented on ASTERISK-29578:
-------------------------------------------
We require additional debug to continue with triage of your issue. Please follow the instructions on the wiki [1] for how to collect debugging information from Asterisk. For expediency, where possible, attach the debug with a '.txt' file extension so that the debug will be usable for further analysis.
Thanks!
[1] https://wiki.asterisk.org/wiki/display/AST/Collecting+Debug+Information
Additionally, this issue is extremely confusing because you're mixing the output of different hints with other hints and extensions. Please simplify this down to a description with a single member and a single hint showing it, alongside the complete configuration, not just bits and pieces.
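As a rough illustration of the level of reduction being asked for, a configuration with one queue, one member, and one hint might look like the sketch below. The endpoint and the "member =>" field order are taken from the report; the queue name, context names, extension numbers, member name, and the Dial() target are placeholders chosen for illustration, not the reporter's actual configuration.

```ini
; queues.conf (sketch): one queue with a single member
[test]
strategy = ringall
; fields: interface, penalty, member name, state interface
member => Local/115@queue-ring,0,Agent231,hint:2127@hints-all

; extensions.conf (sketch): the hint the member's state is read from
[hints-all]
exten => 2127,hint,PJSIP/ATAxLA2

; extensions.conf (sketch): the context the Local channel dials into
[queue-ring]
exten => 115,1,Dial(PJSIP/ATAxLA2,20)
```

With a configuration this small, the "core show hints" line for 2127 and the single member line from "queue show" can be compared directly each time the endpoint changes state.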
> app_queue: Custom device states or hints become stale and fail to update
> ------------------------------------------------------------------------
>
> Key: ASTERISK-29578
> URL: https://issues.asterisk.org/jira/browse/ASTERISK-29578
> Project: Asterisk
> Issue Type: Bug
> Security Level: None
> Components: Applications/app_queue
> Affects Versions: 18.4.0
> Environment: Debian 10
> Tested on Asterisk 18.4 with chan_pjsip and Asterisk 18.3 with chan_sip, stock queues.conf configuration. Bug appears in both.
> Reporter: N A
>
> I have been encountering an issue for several weeks now where the device state that app_queue has is consistently outdated.
> I am using statically configured queue members in queues.conf - the queue members themselves are not realtime, dynamic, etc.
> Things will work fine initially, but eventually - and especially if a device was temporarily unavailable and is back online later - the device state that app_queue has will become outdated and inaccurate. Here is a recent call, for instance, where I manually use ${DEVICE_STATE()} to check the state of a member in the queue:
> [2021-08-12 23:03:00] -- Executing [s@queue:2] NoOp("IAX2/8142", "NOT_INUSE") in new stack
> [2021-08-12 23:03:00] -- Executing [s@queue:3] Set("IAX2/8142", "__queuechannel=IAX2/8142") in new stack
> [2021-08-12 23:03:00] -- Executing [s@queue:4] Set("IAX2/8142", "QUEUENAME=test") in new stack
> [2021-08-12 23:03:00] WARNING[6505][C-000002bc]: app_queue.c:8356 queue_exec: Unable to join queue 'test'
> If I run "queue show", all the queues with that member show "Unavailable".
> Running "queue reload all" alone will not fix the issue, because app_queue is smart enough to realize the config file has not changed.
> Running "touch queues.conf" and *then* "queue reload members" will fix the issue, forcing app_queue to purge its stale device status and then things are finally updated... for the moment. Before long, everything is stale again.
> The problem seems to be that app_queue is consistently wrong and stale to the point where the application is unusable, and unless I call System(touch queues.conf; asterisk -rx "queue reload members") before any caller enters a queue, there is currently no way around this.
> This all was using PJSIP + Asterisk 18.4. To try to replicate this, I also tested on an Asterisk 18.3 system using SIP.
> If I add devices directly as queue members, this issue does not seem to come up. However, if I use hints for device state, things start to become less reliable there:
> [Aug 12 20:09:40] -- Executing [223@local:1] NoOp("IAX2/14159", "NOT_INUSE") in new stack
> PBX*CLI> core show hints
> -= Registered Asterisk Dial Plan Hints =-
> 231@test-hints : SIP/ATAxOffice1 State:Unavailable Presence:not_set Watchers 1
> So, again DEVICE_STATE = NOT_INUSE, but hint state is Unavailable.
> This is the hint: exten => 2127,hint,PJSIP/ATAxLA2
> So perhaps this isn't an issue with queues, really, but an issue with the hints that app_queue is using not aligning with the actual device state. That's what it's starting to look like to me.
> 2127@hints-cen: PJSIP/ATAxLA2 State:Idle Presence:not_set Watchers 6
> test has 0 calls (max unlimited) in 'ringall' strategy (0s holdtime, 0s talktime), W:0, C:0, A:0, SL:0.0%, SL2:0.0% within 12s
> Members:
> PJSIP/ATAxLA2 (ringinuse disabled) (Not in use) has taken no calls yet
> 231 (Local/115@phreaknet-queue-ring from hint:2127@hints-all) with penalty 5 (ringinuse disabled) (Unavailable) has taken no calls yet
> No Callers
> Okay, so now let's add a member to the queue and do "queue reload members":
> test has 0 calls (max unlimited) in 'ringall' strategy (0s holdtime, 0s talktime), W:0, C:0, A:0, SL:0.0%, SL2:0.0% within 12s
> Members:
> PJSIP/ATAxLA2 (ringinuse disabled) (Not in use) has taken no calls yet
> 231 (Local/116@queue-ring from PJSIP/ATAxLA2) with penalty 5 (ringinuse disabled) (Not in use) has taken no calls yet
> 231 (Local/115@queue-ring from hint:2127@hints-all) with penalty 5 (ringinuse disabled) (Not in use) has taken no calls yet
> No Callers
> Aha! So you see, when app_queue reloads members, the stale hints are refreshed. In this case, I've added a local channel whose state interface points to a device rather than a hint, so we can now compare all 3 cases. Many hours later:
> 2127@hints-cen: PJSIP/ATAxLA2 State:Idle Presence:not_set Watchers 6
> test has 0 calls (max unlimited) in 'ringall' strategy (0s holdtime, 0s talktime), W:0, C:0, A:0, SL:0.0%, SL2:0.0% within 12s
> Members:
> PJSIP/ATAxLA2 (ringinuse disabled) (Not in use) has taken no calls yet
> 231 (Local/116@queue-ring from PJSIP/ATAxLA2) with penalty 5 (ringinuse disabled) (Unavailable) has taken no calls yet
> 231 (Local/115@queue-ring from hint:2127@hints-all) with penalty 5 (ringinuse disabled) (Unavailable) has taken no calls yet
> No Callers
> The device is currently available. I have added the same endpoint 3 different ways. app_queue thinks one of them is available (specified with PJSIP as the direct tech), and the other two, which are local channels using a custom hint or device state, are unavailable.
> So, it doesn't matter whether a device is specified directly or a hint extension is used. Both suffer from the same problem. All three queue members above correspond to literally the same endpoint. The end device state should be equivalent, but in fact, the device state is only accurate if a real channel type is used for the queue member. If a hint or device is specified, it becomes stale and inaccurate.
> So as can be seen, this doesn't affect all types of queue members and it isn't a bug with hints themselves. Rather, it seems to be an issue specifically with the way that app_queue figures out device state from hints. Initially, it's right, but over time, it becomes stale and basically screws up everything if that is how you are determining if queue members are available or not.
> I tried this on another server using Asterisk 18.3 and chan_sip with a stock queues.conf config (except for the members):
> [test-queue]
> member => SIP/ATAxOffice1
> member => Local/419@public,0,Conference,SIP/ATAxOffice1
> member => Local/418@public,0,Conference,hint:231@test-hints
> I do notice some of the same issues. For instance, this after a peer becomes unavailable:
> PBX*CLI> queue show
> msr has 0 calls (max unlimited) in 'ringall' strategy (0s holdtime, 0s talktime), W:0, C:0, A:0, SL:0.0%, SL2:0.0% within 0s
> No Members
> No Callers
> test-queue has 0 calls (max unlimited) in 'ringall' strategy (0s holdtime, 0s talktime), W:0, C:0, A:0, SL:0.0%, SL2:0.0% within 0s
> Members:
> SIP/ATAxOffice1 (ringinuse enabled) (Not in use) has taken no calls yet
> Conference (Local/419@public from SIP/ATAxOffice1) (ringinuse enabled) (Not in use) has taken no calls yet
> Conference (Local/418@public from hint:231@test-hints) (ringinuse enabled) (Not in use) has taken no calls yet
> No Callers
> [Aug 15 12:11:39] NOTICE[805]: chan_sip.c:30516 sip_poke_noanswer: Peer 'ATAxOffice1' is now UNREACHABLE! Last qualify: 12
> [Aug 15 12:11:39] NOTICE[805]: chan_sip.c:30516 sip_poke_noanswer: Peer 'ATAxOffice2' is now UNREACHABLE! Last qualify: 10
> PBX*CLI> queue show
> msr has 0 calls (max unlimited) in 'ringall' strategy (0s holdtime, 0s talktime), W:0, C:0, A:0, SL:0.0%, SL2:0.0% within 0s
> No Members
> No Callers
> test-queue has 0 calls (max unlimited) in 'ringall' strategy (0s holdtime, 0s talktime), W:0, C:0, A:0, SL:0.0%, SL2:0.0% within 0s
> Members:
> SIP/ATAxOffice1 (ringinuse enabled) (Unavailable) has taken no calls yet
> Conference (Local/419@public from SIP/ATAxOffice1) (ringinuse enabled) (Not in use) has taken no calls yet
> Conference (Local/418@public from hint:231@test-hints) (ringinuse enabled) (Unavailable) has taken no calls yet
> No Callers
> PBX*CLI> queue show
> msr has 0 calls (max unlimited) in 'ringall' strategy (0s holdtime, 0s talktime), W:0, C:0, A:0, SL:0.0%, SL2:0.0% within 0s
> No Members
> No Callers
> test-queue has 0 calls (max unlimited) in 'ringall' strategy (0s holdtime, 0s talktime), W:0, C:0, A:0, SL:0.0%, SL2:0.0% within 0s
> Members:
> SIP/ATAxOffice1 (ringinuse enabled) (Unavailable) has taken no calls yet
> Conference (Local/419@public from SIP/ATAxOffice1) (ringinuse enabled) (Not in use) has taken no calls yet
> Conference (Local/418@public from hint:231@test-hints) (ringinuse enabled) (Unavailable) has taken no calls yet
> No Callers
> All three of these agents have the same device state, but right after one becomes unavailable, 2 of them correctly show "Unavailable" while one of them is still "Not in use" (which is, actually, the opposite problem of the above). If I touch the file, then do "queue reload all", the middle one then correctly changes to "Unavailable". But watch what happens once the peers come back online:
> test-queue has 0 calls (max unlimited) in 'ringall' strategy (0s holdtime, 0s talktime), W:0, C:0, A:0, SL:0.0%, SL2:0.0% within 0s
> Members:
> SIP/ATAxOffice1 (ringinuse enabled) (Unavailable) has taken no calls yet
> Conference (Local/419@public from SIP/ATAxOffice1) (ringinuse enabled) (Unavailable) has taken no calls yet
> Conference (Local/418@public from hint:231@test-hints) (ringinuse enabled) (Unavailable) has taken no calls yet
> No Callers
> [Aug 15 12:17:40] NOTICE[805]: chan_sip.c:24984 handle_response_peerpoke: Peer 'ATAxOffice1' is now Reachable. (12ms / 2000ms)
> [Aug 15 12:17:40] NOTICE[805]: chan_sip.c:24984 handle_response_peerpoke: Peer 'ATAxOffice2' is now Reachable. (11ms / 2000ms)
> PBX*CLI> queue show
> msr has 0 calls (max unlimited) in 'ringall' strategy (0s holdtime, 0s talktime), W:0, C:0, A:0, SL:0.0%, SL2:0.0% within 0s
> No Members
> No Callers
> test-queue has 0 calls (max unlimited) in 'ringall' strategy (0s holdtime, 0s talktime), W:0, C:0, A:0, SL:0.0%, SL2:0.0% within 0s
> Members:
> SIP/ATAxOffice1 (ringinuse enabled) (Not in use) has taken no calls yet
> Conference (Local/419@public from SIP/ATAxOffice1) (ringinuse enabled) (Unavailable) has taken no calls yet
> Conference (Local/418@public from hint:231@test-hints) (ringinuse enabled) (Not in use) has taken no calls yet
> No Callers
> So on this particular Asterisk, it seems that the middle agent is consistently getting "stuck" in a stale state. The channel driver does not seem to matter - all that it takes is specifying custom device states. Hence, it seems there are general problems in app_queue itself when using custom device states/hints for availability.
> This kind of thing has now happened dozens of times in the past month. Since I need to use local channels for my queue members, this effectively makes app_queue completely unusable for me.