[asterisk-bugs] [JIRA] (ASTERISK-28705) chan_sip: Phones loose abiltiy to work, core restarting asterisk fixes issue

Wed Feb 5 07:28:25 CST 2020

    [ https://issues.asterisk.org/jira/browse/ASTERISK-28705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=249665#comment-249665 ] 

Dan  commented on ASTERISK-28705:
---------------------------------

I've worked with Redhat on the issue,   I'm pasting in notes from the ticket,   

In mpstat we don't have any high CPU usage, these samples are pretty typical for the problem duration:

--------------------------------------------------------------------------------
===== 2020-02-04 14:44:17.183137666-06:00 (CST) =====
Linux 3.10.0-1062.9.1.el7.x86_64 (vmctel22.multiservice.com)    02/04/2020      _x86_64_        (2 CPU)

02:44:17 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
02:44:17 PM  all    3.78    0.04    2.21    0.03    0.00    0.22    0.00    0.00    0.00   93.73
02:44:17 PM    0    4.07    0.04    2.25    0.03    0.00    0.22    0.00    0.00    0.00   93.40
02:44:17 PM    1    3.49    0.04    2.16    0.03    0.00    0.22    0.00    0.00    0.00   94.07
--
===== 2020-02-04 14:44:27.689970043-06:00 (CST) =====
Linux 3.10.0-1062.9.1.el7.x86_64 (vmctel22.multiservice.com)    02/04/2020      _x86_64_        (2 CPU)

02:44:27 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
02:44:27 PM  all    3.78    0.04    2.21    0.03    0.00    0.22    0.00    0.00    0.00   93.73
02:44:27 PM    0    4.07    0.04    2.25    0.03    0.00    0.22    0.00    0.00    0.00   93.40
02:44:27 PM    1    3.49    0.04    2.16    0.03    0.00    0.22    0.00    0.00    0.00   94.07

....

===== 2020-02-04 14:51:49.046385591-06:00 (CST) =====
Linux 3.10.0-1062.9.1.el7.x86_64 (vmctel22.multiservice.com)    02/04/2020      _x86_64_        (2 CPU)

02:51:49 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
02:51:49 PM  all    3.78    0.04    2.21    0.03    0.00    0.22    0.00    0.00    0.00   93.73
02:51:49 PM    0    4.07    0.04    2.25    0.03    0.00    0.22    0.00    0.00    0.00   93.40
02:51:49 PM    1    3.49    0.04    2.16    0.03    0.00    0.22    0.00    0.00    0.00   94.07
--
===== 2020-02-04 14:51:59.576578097-06:00 (CST) =====
Linux 3.10.0-1062.9.1.el7.x86_64 (vmctel22.multiservice.com)    02/04/2020      _x86_64_        (2 CPU)

02:51:59 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
02:51:59 PM  all    3.78    0.04    2.21    0.03    0.00    0.22    0.00    0.00    0.00   93.73
02:51:59 PM    0    4.07    0.04    2.25    0.03    0.00    0.22    0.00    0.00    0.00   93.40
02:51:59 PM    1    3.49    0.04    2.16    0.03    0.00    0.22    0.00    0.00    0.00   94.07
--------------------------------------------------------------------------------
Looking in netstat, we see the UDP receive buffer errors start to grow at this time:

--------------------------------------------------------------------------------
===== 2020-02-04 14:44:06.543609549-06:00 (CST) =====
    498235 packet receive errors
    498235 receive buffer errors
===== 2020-02-04 14:44:17.097980110-06:00 (CST) =====
    498235 packet receive errors
    498235 receive buffer errors
===== 2020-02-04 14:44:27.612813403-06:00 (CST) =====
    498288 packet receive errors
    498288 receive buffer errors
===== 2020-02-04 14:44:38.117369314-06:00 (CST) =====
    498737 packet receive errors
    498737 receive buffer errors
...
===== 2020-02-04 14:51:48.952477580-06:00 (CST) =====
    539529 packet receive errors
    539529 receive buffer errors
===== 2020-02-04 14:51:59.475876814-06:00 (CST) =====
    540605 packet receive errors
    540605 receive buffer errors
===== 2020-02-04 14:52:09.984118121-06:00 (CST) =====
    541476 packet receive errors
    541476 receive buffer errors
===== 2020-02-04 14:52:20.524792598-06:00 (CST) =====
    541476 packet receive errors
    541476 receive buffer errors
--------------------------------------------------------------------------------

Looking in ss, we can correlate the "receive buffer errors" with the asterisk PID's receive buffers growing large on the same, and growing, set of file descriptors:

--------------------------------------------------------------------------------
===== 2020-02-04 14:43:45.551605758-06:00 (CST) =====
udp    UNCONN     169728 0         *:5060                  *:*                   users:(("asterisk",pid=60277,fd=25))
udp    UNCONN     8704   0         *:13900                 *:*                   users:(("asterisk",pid=60277,fd=448))
udp    UNCONN     2176   0         *:13901                 *:*                   users:(("asterisk",pid=60277,fd=451))
===== 2020-02-04 14:43:56.047824912-06:00 (CST) =====
udp    UNCONN     578816 0         *:5060                  *:*                   users:(("asterisk",pid=60277,fd=25))
udp    UNCONN     8704   0         *:13900                 *:*                   users:(("asterisk",pid=60277,fd=448))
udp    UNCONN     2176   0         *:13901                 *:*                   users:(("asterisk",pid=60277,fd=451))
udp    UNCONN     2176   0         *:18306                 *:*                   users:(("asterisk",pid=60277,fd=322))
===== 2020-02-04 14:44:06.551862862-06:00 (CST) =====
udp    UNCONN     404736 0         *:15844                 *:*                   users:(("asterisk",pid=60277,fd=389))
udp    UNCONN     2176   0         *:15845                 *:*                   users:(("asterisk",pid=60277,fd=452))
udp    UNCONN     1129344 0         *:5060                  *:*                   users:(("asterisk",pid=60277,fd=25))
udp    UNCONN     8704   0         *:13900                 *:*                   users:(("asterisk",pid=60277,fd=448))
udp    UNCONN     2176   0         *:13901                 *:*                   users:(("asterisk",pid=60277,fd=451))
--------------------------------------------------------------------------------

Mike's notes showed these growing to their maximum size, which is when traffic will drop and the "receive buffer error" stat will grow.

We also have a number of samples where asterisk's TCP sockets are sitting in CLOSE_WAIT, indicating the remote end has sent a FIN (which the kernel has ACKed) and we're waiting for the socket owner process (asterisk) to realise that's happened and to close its socket as well:

--------------------------------------------------------------------------------
===== 2020-02-04 14:45:41.206873257-06:00 (CST) =====
tcp    CLOSE-WAIT 1      0      172.25.220.22:43002              172.25.220.22:3306                users:(("asterisk",pid=60277,fd=10))
tcp    CLOSE-WAIT 1      0      172.25.220.18:5060               10.119.220.220:63729               users:(("asterisk",pid=60277,fd=167))
tcp    ESTAB      757    0      172.25.220.18:5060               10.119.102.102:54234               users:(("asterisk",pid=60277,fd=589))
tcp    CLOSE-WAIT 1      0      172.25.220.18:5060               10.119.220.166:60249               users:(("asterisk",pid=60277,fd=189))
tcp    CLOSE-WAIT 1      0      172.25.220.22:34332              172.25.220.22:3306                users:(("asterisk",pid=60277,fd=11))
tcp    CLOSE-WAIT 1      0      172.25.220.18:5060               10.119.221.63:51439               users:(("asterisk",pid=60277,fd=64))
tcp    CLOSE-WAIT 1      0      172.25.220.18:5060               10.119.220.237:48159               users:(("asterisk",pid=60277,fd=146))
--------------------------------------------------------------------------------

All taken together, this is strong evidence that there is no system bottleneck, but the root cause is that the Asterisk process is doing something else instead of handling its network sockets.

The next step is to debug Asterisk to find out what it's doing at the problem times and why it's doing that, with the aim to get Asterisk back to doing the work it should be doing instead.

As Asterisk is not software which Red Hat supply or support, please refer this issue to your technical support for Asterisk.

> chan_sip: Phones loose abiltiy to work,  core restarting asterisk fixes issue
> -----------------------------------------------------------------------------
>
>                 Key: ASTERISK-28705
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-28705
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Channels/chan_sip/General
>    Affects Versions: 16.7.0
>         Environment: I have a Redhat HA cluster running with 2 Rhel 7 nodes with asterisk 16.7.0
>            Reporter: Dan 
>            Assignee: Unassigned
>
> Incoming calls on Chan sip still come into the sytem.    Calls in flight stay running.  
> Endpoints get a "x" on them being ploycom phone model vvx310, 311, soundpoint 430, 450's and  zoipe soft phones. 
> THere are no messages about phones loosing registration. 
> A core restart now on asterisk fixes the issue. 
> I think the problem started around asterisk 16.4.0     

--
This message was sent by Atlassian JIRA
(v6.2#6252)