[asterisk-bugs] [JIRA] (ASTERISK-28705) chan_sip: Phones lose ability to work, core restarting asterisk fixes issue
Sean Bright (JIRA)
noreply at issues.asterisk.org
Tue Feb 11 09:10:25 CST 2020
[ https://issues.asterisk.org/jira/browse/ASTERISK-28705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=249665#comment-249665 ]
Sean Bright edited comment on ASTERISK-28705 at 2/11/20 9:09 AM:
-----------------------------------------------------------------
I've worked with Red Hat on this issue; I'm pasting in the notes from their ticket.
In mpstat we don't see any high CPU usage; these samples are pretty typical for the problem duration:
{noformat}
===== 2020-02-04 14:44:17.183137666-06:00 (CST) =====
Linux 3.10.0-1062.9.1.el7.x86_64 (vmctel22.multiservice.com) 02/04/2020 _x86_64_ (2 CPU)
02:44:17 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
02:44:17 PM all 3.78 0.04 2.21 0.03 0.00 0.22 0.00 0.00 0.00 93.73
02:44:17 PM 0 4.07 0.04 2.25 0.03 0.00 0.22 0.00 0.00 0.00 93.40
02:44:17 PM 1 3.49 0.04 2.16 0.03 0.00 0.22 0.00 0.00 0.00 94.07
--
===== 2020-02-04 14:44:27.689970043-06:00 (CST) =====
Linux 3.10.0-1062.9.1.el7.x86_64 (vmctel22.multiservice.com) 02/04/2020 _x86_64_ (2 CPU)
02:44:27 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
02:44:27 PM all 3.78 0.04 2.21 0.03 0.00 0.22 0.00 0.00 0.00 93.73
02:44:27 PM 0 4.07 0.04 2.25 0.03 0.00 0.22 0.00 0.00 0.00 93.40
02:44:27 PM 1 3.49 0.04 2.16 0.03 0.00 0.22 0.00 0.00 0.00 94.07
....
===== 2020-02-04 14:51:49.046385591-06:00 (CST) =====
Linux 3.10.0-1062.9.1.el7.x86_64 (vmctel22.multiservice.com) 02/04/2020 _x86_64_ (2 CPU)
02:51:49 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
02:51:49 PM all 3.78 0.04 2.21 0.03 0.00 0.22 0.00 0.00 0.00 93.73
02:51:49 PM 0 4.07 0.04 2.25 0.03 0.00 0.22 0.00 0.00 0.00 93.40
02:51:49 PM 1 3.49 0.04 2.16 0.03 0.00 0.22 0.00 0.00 0.00 94.07
--
===== 2020-02-04 14:51:59.576578097-06:00 (CST) =====
Linux 3.10.0-1062.9.1.el7.x86_64 (vmctel22.multiservice.com) 02/04/2020 _x86_64_ (2 CPU)
02:51:59 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
02:51:59 PM all 3.78 0.04 2.21 0.03 0.00 0.22 0.00 0.00 0.00 93.73
02:51:59 PM 0 4.07 0.04 2.25 0.03 0.00 0.22 0.00 0.00 0.00 93.40
02:51:59 PM 1 3.49 0.04 2.16 0.03 0.00 0.22 0.00 0.00 0.00 94.07
{noformat}
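For reference, a minimal Python sketch of how periodic samples like the ones above could be collected (hypothetical; the actual collection script used on the Red Hat case is not shown in these notes):
{noformat}
#!/usr/bin/env python3
"""Hypothetical sampler: periodically capture mpstat/netstat/ss output with
'===== <timestamp> =====' separators like the excerpts in this comment."""
import datetime
import subprocess
import time

# Commands chosen to match the kinds of output pasted above.
COMMANDS = [
    ["mpstat", "-P", "ALL"],   # per-CPU utilisation
    ["netstat", "-su"],        # UDP counters, incl. "receive buffer errors"
    ["ss", "-napu"],           # UDP sockets: Recv-Q plus owning PID and fd
    ["ss", "-napt"],           # TCP sockets: ESTAB, CLOSE-WAIT, ...
]

def sample(out):
    stamp = datetime.datetime.now().astimezone().isoformat()
    for cmd in COMMANDS:
        out.write(f"===== {stamp} ({' '.join(cmd)}) =====\n")
        out.write(subprocess.run(cmd, capture_output=True, text=True).stdout)
    out.flush()

if __name__ == "__main__":
    with open("asterisk-samples.log", "a") as out:
        while True:
            sample(out)
            time.sleep(10)   # the excerpts above are roughly 10 seconds apart
{noformat}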
Looking in netstat, we see the UDP receive buffer errors start to grow at this time:
{noformat}
===== 2020-02-04 14:44:06.543609549-06:00 (CST) =====
498235 packet receive errors
498235 receive buffer errors
===== 2020-02-04 14:44:17.097980110-06:00 (CST) =====
498235 packet receive errors
498235 receive buffer errors
===== 2020-02-04 14:44:27.612813403-06:00 (CST) =====
498288 packet receive errors
498288 receive buffer errors
===== 2020-02-04 14:44:38.117369314-06:00 (CST) =====
498737 packet receive errors
498737 receive buffer errors
...
===== 2020-02-04 14:51:48.952477580-06:00 (CST) =====
539529 packet receive errors
539529 receive buffer errors
===== 2020-02-04 14:51:59.475876814-06:00 (CST) =====
540605 packet receive errors
540605 receive buffer errors
===== 2020-02-04 14:52:09.984118121-06:00 (CST) =====
541476 packet receive errors
541476 receive buffer errors
===== 2020-02-04 14:52:20.524792598-06:00 (CST) =====
541476 packet receive errors
541476 receive buffer errors
{noformat}
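Those two counters come straight from the kernel's UDP statistics; a small sketch (Linux-specific, reads /proc/net/snmp) of pulling the same numbers that netstat -su reports:
{noformat}
#!/usr/bin/env python3
"""Read the kernel UDP counters behind `netstat -su`: InErrors is reported
as "packet receive errors" and RcvbufErrors as "receive buffer errors"."""

def udp_counters(path="/proc/net/snmp"):
    with open(path) as f:
        rows = [line.split() for line in f if line.startswith("Udp:")]
    header, values = rows[0][1:], rows[1][1:]
    return dict(zip(header, map(int, values)))

if __name__ == "__main__":
    stats = udp_counters()
    print("packet receive errors:", stats["InErrors"])
    print("receive buffer errors:", stats["RcvbufErrors"])
{noformat}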
Looking in ss, we can correlate the "receive buffer errors" with the asterisk PID's socket receive queues (Recv-Q) growing large on the same, and steadily growing, set of file descriptors:
{noformat}
===== 2020-02-04 14:43:45.551605758-06:00 (CST) =====
udp UNCONN 169728 0 *:5060 *:* users:(("asterisk",pid=60277,fd=25))
udp UNCONN 8704 0 *:13900 *:* users:(("asterisk",pid=60277,fd=448))
udp UNCONN 2176 0 *:13901 *:* users:(("asterisk",pid=60277,fd=451))
===== 2020-02-04 14:43:56.047824912-06:00 (CST) =====
udp UNCONN 578816 0 *:5060 *:* users:(("asterisk",pid=60277,fd=25))
udp UNCONN 8704 0 *:13900 *:* users:(("asterisk",pid=60277,fd=448))
udp UNCONN 2176 0 *:13901 *:* users:(("asterisk",pid=60277,fd=451))
udp UNCONN 2176 0 *:18306 *:* users:(("asterisk",pid=60277,fd=322))
===== 2020-02-04 14:44:06.551862862-06:00 (CST) =====
udp UNCONN 404736 0 *:15844 *:* users:(("asterisk",pid=60277,fd=389))
udp UNCONN 2176 0 *:15845 *:* users:(("asterisk",pid=60277,fd=452))
udp UNCONN 1129344 0 *:5060 *:* users:(("asterisk",pid=60277,fd=25))
udp UNCONN 8704 0 *:13900 *:* users:(("asterisk",pid=60277,fd=448))
udp UNCONN 2176 0 *:13901 *:* users:(("asterisk",pid=60277,fd=451))
{noformat}
Mike's notes showed these queues growing to their maximum size, which is the point at which traffic is dropped and the "receive buffer errors" stat grows.
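To make that mechanism concrete, a self-contained sketch (loopback only, hypothetical port 15999) of a UDP socket whose owner never reads: its Recv-Q climbs to the SO_RCVBUF limit, after which the kernel drops datagrams and the "receive buffer errors" counter grows, the same signature as in the samples above:
{noformat}
#!/usr/bin/env python3
"""Reproduce the signature above in miniature: a UDP receiver that never
calls recv() stands in for a stalled Asterisk thread. Once its receive
queue hits SO_RCVBUF, the kernel drops further datagrams and bumps the
"receive buffer errors" counter."""
import socket

PORT = 15999  # hypothetical test port, not one of the ports in the logs

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", PORT))
print("SO_RCVBUF limit:", rx.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
payload = b"x" * 1024
for _ in range(2000):                 # far more than the buffer can hold
    tx.sendto(payload, ("127.0.0.1", PORT))

# At this point `ss -napu` shows Recv-Q for this socket pinned near the
# SO_RCVBUF limit and `netstat -su` shows "receive buffer errors" having
# increased -- the datagrams that did not fit were silently dropped.
input("Check with `ss -napu` and `netstat -su`, then press Enter to exit.")
{noformat}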
We also have a number of samples where asterisk's TCP sockets are sitting in CLOSE_WAIT, indicating that the remote end has sent a FIN (which the kernel has ACKed) and we are waiting for the socket-owning process (asterisk) to realise that has happened and to close its end as well:
{noformat}
===== 2020-02-04 14:45:41.206873257-06:00 (CST) =====
tcp CLOSE-WAIT 1 0 172.25.220.22:43002 172.25.220.22:3306 users:(("asterisk",pid=60277,fd=10))
tcp CLOSE-WAIT 1 0 172.25.220.18:5060 10.119.220.220:63729 users:(("asterisk",pid=60277,fd=167))
tcp ESTAB 757 0 172.25.220.18:5060 10.119.102.102:54234 users:(("asterisk",pid=60277,fd=589))
tcp CLOSE-WAIT 1 0 172.25.220.18:5060 10.119.220.166:60249 users:(("asterisk",pid=60277,fd=189))
tcp CLOSE-WAIT 1 0 172.25.220.22:34332 172.25.220.22:3306 users:(("asterisk",pid=60277,fd=11))
tcp CLOSE-WAIT 1 0 172.25.220.18:5060 10.119.221.63:51439 users:(("asterisk",pid=60277,fd=64))
tcp CLOSE-WAIT 1 0 172.25.220.18:5060 10.119.220.237:48159 users:(("asterisk",pid=60277,fd=146))
{noformat}
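For illustration, a minimal sketch (loopback only, hypothetical port 15998) of how a socket ends up parked in CLOSE-WAIT like this: the peer closes and the kernel ACKs its FIN, but nothing changes until the owning process calls close() on its end, which for Asterisk here only happens on the core restart:
{noformat}
#!/usr/bin/env python3
"""Show how a TCP socket gets stuck in CLOSE-WAIT: the peer sends FIN (the
kernel ACKs it for us), but the owning process never close()s its end."""
import socket
import subprocess
import time

PORT = 15998  # hypothetical test port

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("127.0.0.1", PORT))
server.listen(1)

client = socket.create_connection(("127.0.0.1", PORT))
conn, _ = server.accept()

client.close()          # peer sends FIN; the kernel ACKs it on our behalf
time.sleep(0.2)         # give the FIN a moment to be processed

# 'conn' is never close()d, so it now sits in CLOSE-WAIT until this process
# closes it or exits -- the equivalent of Asterisk's core restart.
subprocess.run(["ss", "-ntp", "state", "close-wait"])
input("Socket parked in CLOSE-WAIT; press Enter to exit (which clears it).")
{noformat}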
Taken together, this is strong evidence that there is no system-level bottleneck; the root cause is that the Asterisk process is busy doing something else instead of servicing its network sockets.
The next step is to debug Asterisk to find out what it is doing at the problem times, and why, with the aim of getting it back to the work it should be doing.
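One way to capture that is sketched below, under the assumption that gdb is installed (Asterisk debug symbols make the backtraces far more useful) and that it is run as root while the phones are showing the problem:
{noformat}
#!/usr/bin/env python3
"""Sketch: snapshot what every Asterisk thread is doing at problem time.
Assumes gdb is installed and this runs as root; attaching briefly pauses
the asterisk process while each backtrace is taken."""
import subprocess
import time

pid = subprocess.run(["pidof", "asterisk"],
                     capture_output=True, text=True).stdout.split()[0]

# Take a few snapshots some seconds apart; a thread sitting in the same
# frame across snapshots is the one keeping Asterisk away from its sockets.
for n in range(3):
    with open(f"asterisk-backtrace-{n}.txt", "w") as out:
        subprocess.run(["gdb", "-p", pid, "-batch",
                        "-ex", "thread apply all bt"],
                       stdout=out, stderr=subprocess.STDOUT)
    time.sleep(5)
{noformat}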
As Asterisk is not software which Red Hat supply or support, please refer this issue to your technical support for Asterisk.
> chan_sip: Phones lose ability to work, core restarting asterisk fixes issue
> ----------------------------------------------------------------------------
>
> Key: ASTERISK-28705
> URL: https://issues.asterisk.org/jira/browse/ASTERISK-28705
> Project: Asterisk
> Issue Type: Bug
> Security Level: None
> Components: Channels/chan_sip/General
> Affects Versions: 16.7.0
> Environment: I have a Red Hat HA cluster running with 2 RHEL 7 nodes with Asterisk 16.7.0
> Reporter: Dan
> Assignee: Unassigned
>
> Incoming calls on chan_sip still come into the system. Calls in flight stay running.
> Endpoints get an "x" on them; these are Polycom VVX 310 and 311 and SoundPoint 430 and 450 phones, plus Zoiper softphones.
> There are no messages about phones losing registration.
> A "core restart now" on asterisk fixes the issue.
> I think the problem started around Asterisk 16.4.0.
--
This message was sent by Atlassian JIRA
(v6.2#6252)