[asterisk-bugs] [JIRA] (ASTERISK-27987) AGI() call causes RTP to briefly freeze under certain circumstances

xrobau (JIRA) noreply at issues.asterisk.org
Thu Jul 26 23:19:54 CDT 2018


    [ https://issues.asterisk.org/jira/browse/ASTERISK-27987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=244294#comment-244294 ] 

xrobau edited comment on ASTERISK-27987 at 7/26/18 11:19 PM:
-------------------------------------------------------------

I may have found another common factor.   When things go bad, I find this in the logs:

{code}
var/log/asterisk/full-[2018-07-26 12:43:39] WARNING[10389] taskprocessor.c: The 'app_voicemail' task processor queue reached 500 scheduled tasks again.
/var/log/asterisk/full-[2018-07-26 12:43:39] ERROR[23418] astobj2.c: Excessive refcount 100000 reached on ao2 object 0x28946d0
/var/log/asterisk/full:[2018-07-26 12:43:39] ERROR[23418] astobj2.c: FRACK!, Failed assertion Excessive refcount 100000 reached on ao2 object 0x28946d0 (0)
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: Got 29 backtrace records
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #0: [0x5fc445] asterisk __ast_assert_failed() (0x5fc3c1+84)
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #1: [0x45e00f] asterisk <unknown>()
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #2: [0x45e368] asterisk __ao2_ref() (0x45e33a+2E)
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #3: [0x5ceb86] asterisk <unknown>()
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #4: [0x5ced09] asterisk <unknown>()
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #5: [0x5c635c] asterisk <unknown>()
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #6: [0x5c704b] asterisk <unknown>()
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #7: [0x5c7333] asterisk <unknown>()
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #8: [0x5c73a2] asterisk stasis_publish() (0x5c737a+28)
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #9: [0x5c7e26] asterisk <unknown>()
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #10: [0x5c6607] asterisk <unknown>()
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #11: [0x5c669a] asterisk stasis_subscribe_pool() (0x5c6667+33)
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #12: [0x7fd077c67f19] res_pjsip_mwi.so <unknown>()
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #13: [0x7fd077c69476] res_pjsip_mwi.so <unknown>()
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #14: [0x45f120] asterisk <unknown>()
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #15: [0x45f43d] asterisk __ao2_callback() (0x45f3e1+5C)
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #16: [0x7fd077c69cf8] res_pjsip_mwi.so <unknown>()
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #17: [0x7fd077c69d52] res_pjsip_mwi.so <unknown>()
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #18: [0x536694] asterisk ast_module_reload() (0x53632c+368)
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #19: [0x4cfbc1] asterisk <unknown>()
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #20: [0x4d7cbd] asterisk ast_cli_command_full() (0x4d7a4c+271)
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #21: [0x4d7e00] asterisk ast_cli_command_multiple_full() (0x4d7d64+9C)
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #22: [0x454b71] asterisk <unknown>()
/var/log/asterisk/full-[2018-07-26 12:43:39] VERBOSE[23418] logger.c: #23: [0x5f93f8] asterisk <unknown>()
/var/log/asterisk/full-[2018-07-26 12:43:39] ERROR[23418] astobj2.c: Excessive refcount 100000 reached on ao2 object 0x28946d0
/var/log/asterisk/full:[2018-07-26 12:43:39] ERROR[23418] astobj2.c: FRACK!, Failed assertion Excessive refcount 100000 reached on ao2 object 0x28946d0 (0)
 ... (another identical stack trace) ...
{code}

That was on the client's machine, which is in normal production, with no load tests running on it.

When I run the load test locally to try to give you a decent reproduction, I do - eventually - get FRACK errors about excessive refcounts on ao2 objects, which almost ALWAYS signify that the machine is now broken.

The FRACK error I can reliably reproduce requires CEL to be enabled and a test trigger similar to this:

{code}
  same => n,Dial(SIP/8675309@127.0.0.1:5160,1,g)
{code}

To reproduce this yourself, deploy a standard FreePBX 14 machine, enable anonymous inbound SIP, and run the loop from the original description with that Dial() as the trigger. Wait for it to FRACK a few times, then replace the trigger with AGI(/bin/true) and you will see the problem.
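For clarity, here is the reproduction loop from the original description with the Dial() trigger substituted in at the testtrigger label. This is an illustrative sketch, not a tested dialplan; the extension number, loop logic, and Dial() target are taken verbatim from above:

{code}
exten => 999,1,Set(COUNT=0)
 same => n(loop),GotoIf($[ ${COUNT} > 1000 ]?toomany)
 same => n,Gosub(testtrigger)
 same => n,Set(COUNT=$[ ${COUNT} + 1 ])
 same => n,Goto(loop)
 same => n(toomany),Hangup
 ; Run with this Dial() trigger until it FRACKs a few times, then
 ; swap this line back to: same => n(testtrigger),AGI(/bin/true)
 same => n(testtrigger),Dial(SIP/8675309@127.0.0.1:5160,1,g)
 same => n,Return
{code}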

If it would be more convenient for you, I can provide access to the actual hardware that reliably reproduces this, which is already configured and ready to go.



> AGI() call causes RTP to briefly freeze under certain circumstances
> -------------------------------------------------------------------
>
>                 Key: ASTERISK-27987
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-27987
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Core/Bridging
>    Affects Versions: 15.5.0
>            Reporter: xrobau
>            Assignee: Unassigned
>              Labels: pjsip
>         Attachments: broken.pcap, debug-server.txt, fdlist.txt, pcap1.png, pcap2.png, timer_list.txt
>
>
> "Something" causes Asterisk to get into a state where launching an AGI causes the entire process to freeze briefly.
> The symptoms are identical to ASTERISK-26257, but that is inaccurate as to the number of services that are impacted.
> Currently I am unable to create an artificial reproduction of the CAUSE, but the symptoms are extremely obvious.
> When Asterisk gets into this state (however it may occur) anything that is reading audio from the filesystem (MoH, Playback()/Background(), or Voicemail) OR confbridge conferences (not meetme) will briefly pause while an AGI is launched.
> We have numerous Wireshark captures that show this happening (which we can attach if required), and the timestamps generated by Asterisk are correct for when the packet WAS sent, not for when it SHOULD have been sent.
> This is triggered by AGI calls in dialplan.  Our test dialplan to determine if Asterisk is in this state is as follows:
> {code}
> exten => 998,1,Answer
>  same => n,Dial(Local/999@from-internal-custom/n,300,gm(default))
>  same => n,Playback(beep)
>  same => n,Goto(1)
> exten => 999,1,Set(COUNT=0)
>  same => n(loop),GotoIf($[ ${COUNT} > 1000 ]?toomany)
>  same => n,Gosub(testtrigger)
>  same => n,Set(COUNT=$[ ${COUNT} + 1 ])
>  same => n,Goto(loop)
>  same => n(toomany),Hangup
>  same => n(testtrigger),AGI(/bin/true)
>  same => n,Return
> {code}
> If Asterisk is in this broken state, the MoH will be extremely choppy.  After a restart of Asterisk, the MoH is not interrupted.
> We've also noticed that the loop runs MUCH slower when Asterisk is in this state. 
> My random, and probably incorrect, hypothesis is that there is a core lock somewhere being grabbed by res_agi, which is appending to a gradually growing list, or SOMETHING, that takes a longer and longer time, until it holds the lock for more than 20msec and blocks anything that is reading from files (and whatever confbridge is doing).
> I ran that dialplan on a test machine and was unable to get Asterisk into that state after 250,000 AGI calls. However, our (Sangoma's) main PBX does randomly and sporadically experience this problem, and we've got into the habit of restarting Asterisk whenever we notice it. It has never happened below 30,000 calls (as reported by 'core show calls'), and the machine is almost certain to be in that state above 50,000 calls.
> The only common factor visible so far is that this only happens on hardware, not in VMs (which may be a red herring), and that all the affected machines have a lot of listeners for AMI events.
> Unfortunately, without any way to reproduce this, I'm aware that it's hard to proceed further. However, if there's any more information I can provide when a machine IS in this state, as to what may be causing it, please feel free to ask. 
> Fixing it, as above, is as simple as 'core restart now'. The AGI loop runs fast, and the audio glitch vanishes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


