[asterisk-users] Exceptionally long queue length queuing

Wed Jul 22 19:05:40 CDT 2020

Hello list,
Hope you are all doing well!

Sorry for the long email but I tried to explain all I've seen regarding
this issue.... I am going to open a ticket for this issue but I found it
useful to already explain it here.

I've also recently faced this "Exceptionally long queue length queuing"
error in some servers running 16.8.0 and after A LOT of investigation, I've
discovered what is causing it in my case (not sure if is the same case as
initially reported by Dovid)
So, what I discovered is that in case the diaplan has a Wait() and during
this wait period many "deferrable frames" are received, the final piece of
the Wait function (ast_safe_sleep_conditional) can throw the "Exceptionally
long queue length queuing" error message.
It is not really common to get a lot of "deferrable frames" (see
ast_is_deferrable_frame) but we can have that by simply putting on hold and
off hold while the channel is on "wait".
My production case is a little more complex and happens while on Wait and
with the AST_CONTROL_SRCCHANGE event. This happens when an old Asterisk
13.13.0 receives a hold REINVITE and starts a new RTP stream (with
different SSRC) towards some other victim Asterisk with the Music on Hold
while still sending the original caller RTP (so at this point 2 RTP streams
with different SSRCs are sent to the destination that happens to be a
channel currently on Wait). I could not find any ticket for such bug, but
version 16.8.0 does not have this problem and instead seems to inject the
Music on Hold in the existing RTP stream so there aren't 2 streams at the
same time to the same destination. The problem with 2 streams is that each
RTP package of the new stream generates a source change frame
(AST_CONTROL_SRCCHANGE), and at 50pps this builds up a lot of events fast
and so if the Wait is high the issue pops up (I agree having a channel on
Wait for too long is not a good practice but noone is perfect...). Also if
multiple channels are facing this situation then the problem just escalates
very bad.
When the "Exceptionally long queue length queuing" is happening all AMI
commands seem to fail, BYE messages seem to be missed/delayed, internal
timeouts expire and all sort of weird things happen to all channels in the
server, apparently the whole Asterisk process gets locked in this loop
affecting everything else... and this is what makes this issue really bad
because Asterisk can become completely unresponsive while the error message
is happening... if it only could throw the error and keep working fine I
guess it would be reasonable leaving it up to the dialplan logic to avoid a
long Wait()....
Anyway, to replicate the issue is enough a simple dialplan like this:
exten => 8888,1,Answer()
same => n,Wait(30)
same => n,NoOp(After wait)
same => n,Playback(goodbye)
same => n,Hangup()
After calling in and while in the Wait(30), repeatedly press Hold/Unhold in
the telephone (with Linphone just press the hold once and then press and
hold the spacebar which will repeatedly do the hold and unhold several
times). After Wait is finished the "Exceptionally long queue length
queuing" will show up but gets resolved very fast just because I think
there weren't enough frames queued to really freeze Asterisk for too long.

Thank you,
Kind regards,
Patrick Wakano

On Fri, 3 Jul 2020 at 19:29, Joshua C. Colp <jcolp at sangoma.com> wrote:

> On Fri, Jul 3, 2020 at 3:32 AM Dovid Bender <dovid at telecurve.com> wrote:
>
>>
>>
>> On Mon, Jun 29, 2020 at 6:46 AM Joshua C. Colp <jcolp at sangoma.com> wrote:
>>
>>> On Sun, Jun 28, 2020 at 2:26 PM Dovid Bender <dovid at telecurve.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> We have a box up and we are starting to see a lot of "Exceptionally
>>>> long queue length queuing" in the logs. From all the research so far it
>>>> seems like this leads to their systems crashing and being unreachable. In
>>>> our case the box remains up and takes calls. We are running Asterisk
>>>> 16.6.1. We are using MusicOnHold to play online music streams via ffmpeg.
>>>> Any idea on how to troubleshoot this further to see why this is happening?
>>>>
>>>
>>> The message occurs when either a channel is deadlocked or the processing
>>> time to handle what a channel is doing is substantial, and things build up.
>>> Getting a backtrace[1] at the time it occurs will show what the system is
>>> actually doing and can provide insight into what is causing it.
>>>
>>> [1]
>>> https://wiki.asterisk.org/wiki/display/AST/Getting+a+Backtrace#GettingaBacktrace-Runningast_coredumperfordeadlocks,taskprocessorbackups,etc
>>> .
>>>
>>> --
>>>
>> I ended up re-writing my code to use less Local channels and most of the
>> errors went away. I also noticed that my load average and CPU usage is way
>> down. I will open a ticket since it seems  that it may be a bug that others
>> are experiencing.
>>
>
> A backtrace would still be needed for the issue, otherwise there is no way
> to know what exactly is going on our where things are getting held up.
>
> --
> Joshua C. Colp
> Asterisk Technical Lead
> Sangoma Technologies
> Check us out at www.sangoma.com and www.asterisk.org
> --
> _____________________________________________________________________
> -- Bandwidth and Colocation Provided by http://www.api-digital.com --
>
> Check out the new Asterisk community forum at:
> https://community.asterisk.org/
>
> New to Asterisk? Start here:
>       https://wiki.asterisk.org/wiki/display/AST/Getting+Started
>
> asterisk-users mailing list
> To UNSUBSCRIBE or update options visit:
>    http://lists.digium.com/mailman/listinfo/asterisk-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.digium.com/pipermail/asterisk-users/attachments/20200723/df45cc05/attachment.html>