[asterisk-bugs] [JIRA] (ASTERISK-27321) Asterisk Crashing with FRACK Errors and Serious Network Trouble

Fri Oct 6 11:33:38 CDT 2017

     [ https://issues.asterisk.org/jira/browse/ASTERISK-27321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Sedory updated ASTERISK-27321:
-------------------------------------

    Description: 
Running FreePBX 13.0.192.16 and Asterisk 13.17.0

I have previously posted about this issue in the FreePBX and Asterisk forums. Here are those links:

https://community.asterisk.org/t/asterisk-freepbx-crashing-and-frack-errors/72159
https://community.freepbx.org/t/consistent-asterisk-freepbx-crash-issue/43682/1

Host: Dell R720 with 2x Xeon E5-2620 2.00GHz (6 Core) and 64GB DDR3 ECC RAM, local PERC storage.

Hypervisor: Proxmox 4.4-1.

Network: using onboard Quad NIC. Bridge “vmbr0” points to “bond0” as the bridge port, and bond0 has eth0 and eth1 in it in “active-backup” mode, each going to one of our two core switches. Using Cisco 3560G. Switch ports are in trunk mode, with native vlan set to our management vlan. VMs are tagged to our public facing vlan, for direct internet access.

VMs are running the FreePBX/Asterisk versions mentioned above. Each has 4GB RAM fixed with ballooning disabled, 4 cores (2 sockets, 2 cores; have tried with NUMA enabled and disabled) with type “Default (kvm64)”, NIC using E1000 model, and a 300G vdisk presented as ide0 as a raw image on a local LVM-Thin volume.

Endpoints: All endpoints are NAT’d. We use TCP for SIP with an obscure port (not 5060 or near that). RTP traffic on our VSP’s required port range is allowed as well. All other traffic is dropped per the FreePBX firewall.

In summary, what is happening is that we get a bunch of errors like this:

[2017-09-28 02:05:18] ERROR[7061] astobj2.c: FRACK!, Failed assertion bad magic number 0x0 for object 0x3e7c690 (0)
[2017-09-28 02:05:24] ERROR[6934] astobj2.c: FRACK!, Failed assertion bad magic number 0x0 for object 0x3e7c690 (0)
[2017-09-28 02:05:28] ERROR[7107] astobj2.c: FRACK!, Failed assertion bad magic number 0x0 for object 0x3e7c690 (0)

and right before and after, we have most of our peers go unreachable. Sometimes Asterisk will crash afterwards, sometimes not.

The issue happens intermittently, but seems to happen more frequently on the VMs that have more peers/endpoints (100+). I don’t think we’ve had it happen on any VM that had fewer than 100 peers/endpoints.
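
For anyone triaging similar logs, here is a minimal Python sketch that tallies the FRACK! lines by object address and reporting thread. The regular expression is only an assumption based on the three lines quoted above, and the helper names (parse_frack, summarize) are hypothetical, not part of Asterisk or FreePBX.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern inferred from the three log lines quoted in this report (assumption).
FRACK_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] "
    r"ERROR\[(?P<tid>\d+)\](?:\[[^\]]*\])? astobj2\.c: FRACK!, "
    r"Failed assertion bad magic number (?P<magic>0x[0-9a-fA-F]+) "
    r"for object (?P<obj>0x[0-9a-fA-F]+)"
)

SAMPLE_LINES = [
    "[2017-09-28 02:05:18] ERROR[7061] astobj2.c: FRACK!, Failed assertion bad magic number 0x0 for object 0x3e7c690 (0)",
    "[2017-09-28 02:05:24] ERROR[6934] astobj2.c: FRACK!, Failed assertion bad magic number 0x0 for object 0x3e7c690 (0)",
    "[2017-09-28 02:05:28] ERROR[7107] astobj2.c: FRACK!, Failed assertion bad magic number 0x0 for object 0x3e7c690 (0)",
]


def parse_frack(line):
    """Return (timestamp, thread_id, object_address), or None if not a FRACK! line."""
    m = FRACK_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
    return ts, int(m.group("tid")), m.group("obj")


def summarize(lines):
    """Count FRACK! events per object address and measure the time they span."""
    events = [e for e in map(parse_frack, lines) if e is not None]
    stamps = [ts for ts, _, _ in events]
    return {
        "count": len(events),
        "by_object": Counter(obj for _, _, obj in events),
        "threads": len({tid for _, tid, _ in events}),
        "span_seconds": (max(stamps) - min(stamps)).total_seconds() if stamps else None,
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_LINES))
```

On the three sample lines, every error names the same object address (0x3e7c690) but comes from a different thread. Running the same tally over a full log would show whether that holds for a whole burst.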

We recently split a server that had about 130 endpoints into two, of 110 and 20. More precisely, we moved 110 endpoints off server A to server B, leaving 20 on server A. Before that move, we were seeing FRACK! errors every day (anywhere from 20 to 300, usually all within a 20 minute window or so). Since the 110 were moved to server B, server A has not had any FRACK! errors or Asterisk crashes. Server B, however, is having them now, just much less often than when all 130 endpoints were on server A. I assume that is because of the slightly lower endpoint total on that VM.
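
To check the "all within a 20 minute window" pattern against a log, the error timestamps can be grouped into bursts. This is a rough Python sketch: the 20 minute gap threshold is just taken from the estimate above, the function name group_bursts is hypothetical, and the sample timestamps are the four FRACK! timestamps quoted elsewhere in this report.

```python
from datetime import datetime, timedelta


def group_bursts(timestamps, max_gap=timedelta(minutes=20)):
    """Split timestamps into bursts; a gap longer than max_gap starts a new burst."""
    bursts = []
    for ts in sorted(timestamps):
        if bursts and ts - bursts[-1][-1] <= max_gap:
            bursts[-1].append(ts)  # close enough to the previous event: same burst
        else:
            bursts.append([ts])    # first event, or the gap was too long: new burst
    return bursts


# FRACK! timestamps quoted in this report (one from August, three from September).
STAMPS = [
    datetime(2017, 9, 28, 2, 5, 24),
    datetime(2017, 8, 23, 13, 42, 49),
    datetime(2017, 9, 28, 2, 5, 18),
    datetime(2017, 9, 28, 2, 5, 28),
]

if __name__ == "__main__":
    for burst in group_bursts(STAMPS):
        print(burst[0], "->", burst[-1], f"({len(burst)} errors)")
```

Comparing burst sizes and start times across servers A and B would give a firmer basis for the endpoint-count correlation than the daily totals alone.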

This morning was one of those instances. We had 193 errors, identical to the three I posted above (minus the ERROR[number] being different). AND, we had a crash afterwards. Here is the backtrace: http://pastebin.freepbx.org/view/8cccc15f2

So I come to you, the asterisk community, for help. I first posted on the FreePBX forum, and was directed here.

I understand this may point to a memory issue, but what is strange is that the Dell iDRAC log doesn’t show any memory errors. Perhaps there are errors that iDRAC just isn’t seeing and so can’t report. I’m hoping someone out there can parse through the backtrace and give me a clear answer as to what the problem is. Thanks in advance.

  was:
Recently used the "warm spare" method to move to a new server (new VM on KVM/proxmox)

The server has about 120 remote extensions, and had no real problems before.

I posted about this crash issue yesterday here, but my hypothesis was off: https://community.freepbx.org/t/media-index-c-failed-to-stat/43645

Today we had a ton of users call and say their phones weren't working. Funny thing is, they show OK with IP address in peers list when running "sip show peers" in cli.

So we did a fwconsole restart, and things started working again.

This crash has happened three times this week already. Sunday morning, yesterday morning, and today.

I started digging through the logs, and these are the errors that may or may not be the cause. I'm hoping someone can give me some insight. Here are some error examples:

These ones show all over the logs, way before, way after, and right around the crash time:

[2017-08-22 09:30:59] ERROR[32499][C-00000028] pbx_functions.c: Function PJSIP_HEADER not registered

These ones appeared shortly before yesterday's crash, but there were none before today's crash:

[2017-08-22 09:38:08] ERROR[1488] netsock2.c: getaddrinfo("2605:e000:6045:3a00:20b:82ff:feac:c151:13312", "(null)", ...): Name or service not known
[2017-08-22 09:38:08] WARNING[1488] chan_sip.c: Could not resolve socket address for '2605:e000:6045:3a00:20b:82ff:feac:c151:13312'
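
One observation about the string in that getaddrinfo error: it has nine colon-separated groups, while an IPv6 literal has at most eight. A possible reading, and it is only an assumption, is that the last group (13312) is a port appended to an unbracketed IPv6 address. The Python sketch below checks that reading with the standard ipaddress module; the helper names are hypothetical and this says nothing about how chan_sip itself builds the string.

```python
import ipaddress

# Host string exactly as it appears in the netsock2.c error above.
LOGGED = "2605:e000:6045:3a00:20b:82ff:feac:c151:13312"


def is_plain_ip(text):
    """True if text parses as a bare IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False


def split_unbracketed_v6(hostport):
    """Treat the last colon-separated group as a port (assumption).

    Returns (IPv6Address, port) if the remainder is a valid IPv6 address,
    otherwise None.
    """
    host, sep, port = hostport.rpartition(":")
    if not sep or not port.isdigit():
        return None
    try:
        return ipaddress.IPv6Address(host), int(port)
    except ValueError:
        return None


def bracketed(hostport):
    """Rewrite 'addr:port' in the unambiguous '[addr]:port' form, if it parses."""
    parsed = split_unbracketed_v6(hostport)
    if parsed is None:
        return None
    addr, port = parsed
    return f"[{addr}]:{port}"


if __name__ == "__main__":
    print(is_plain_ip(LOGGED))
    print(split_unbracketed_v6(LOGGED))
    print(bracketed(LOGGED))
```

If the reading is right, the resolver was handed an address and port fused into one string, which would explain "Name or service not known" without implying any DNS fault.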

These appeared in all three instances:

Line 33144: [2017-08-20 10:34:38] ERROR[30555] chan_sip.c: Serious Network Trouble; __sip_xmit returns error for pkt data

And finally, these ones look like the most likely culprit, but didn't show up in yesterday's crash (just today's and Sunday's):
*Note that the difference between today's and Sunday's crashes and yesterday's is that in the former all endpoints showed "OK", though they truly weren't, while in the latter only about half of them showed "OK".

[2017-08-23 13:42:49] ERROR[26004] astobj2.c: FRACK!, Failed assertion bad magic number 0x0 for object 0x3de7430 (0)

So all that to say, I hope someone can help us find the root cause of all this.

Again, this server was a fresh v13 FreePBX server that we just "warm spare" copied to from an existing server. The existing was running on an ESXi host, fully updated to ....66-21. We fully updated the fresh VM to 66-21 as well before running the backup/restore. The new server is a VM on KVM/proxmox.

> Asterisk Crashing with FRACK Errors and Serious Network Trouble
> ---------------------------------------------------------------
>
>                 Key: ASTERISK-27321
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-27321
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Channels/chan_sip/General
>    Affects Versions: 13.17.0
>         Environment: FreePBX 13.0.192.16 and Asterisk 13.17.0, proxmox 4.4 on Dell R720, local RAID volume. Using TCP and obscure port for SIP. UDP 5060 still open/enabled, but firewalled to only allow Anveo Direct servers.
>            Reporter: Steven Sedory
>            Severity: Critical

--
This message was sent by Atlassian JIRA
(v6.2#6252)