[asterisk-bugs] [JIRA] (ASTERISK-28674) Asterisk becomes unstable after SS7 signalling link restarts

Gregory Massel (JIRA) noreply at issues.asterisk.org
Thu Jan 2 15:25:25 CST 2020


    [ https://issues.asterisk.org/jira/browse/ASTERISK-28674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=249221#comment-249221 ] 

Gregory Massel commented on ASTERISK-28674:
-------------------------------------------

In addition to the core dump, I've realised that it is trivial to reproduce this.

Before reproducing:

samsbc2*CLI> ss7 show linkset 1
SS7 flags: 0x0
SS7 linkset 1 status: Up
SS7 calling nai: -1
SS7 called nai: -1
SS7 nationalprefix: 0
SS7 internationalprefix: +
SS7 unknownprefix:
SS7 networkroutedprefix:
SS7 subscriberprefix:
Switch type: ITU
Our point code: 776
SLS shift: 0
numlinks: 3
numsps: 1
  ---------------------------------
  Adjacent SP PC: 2200 STATE: UP
  TRA:  GOT SENT    T19: not running T21: not running
  Routes:
    DPC       State        T6       T10
3856          TFP         -   running
976          TFP         -   running
968          TFP         -   running
  Link ADJ_PC:SLC: 2200:0 NetMngSLS: 0
    State:      INSERVICE,  UP
    STD Test:  passed
    Got, sent :
    Inhibit:
    Changeover: NO
    Tx buffer:  0
    Tx queue:   0
    Retrans pos 0
    CO buffer:  0
    CB buffer:  0
    Last FSN:   108
    MTP3timers:
  Link ADJ_PC:SLC: 2200:1 NetMngSLS: 1
    State:      INSERVICE,  UP
    STD Test:  passed
    Got, sent :
    Inhibit:
    Changeover: NO
    Tx buffer:  0
    Tx queue:   0
    Retrans pos 0
    CO buffer:  0
    CB buffer:  0
    Last FSN:   79
    MTP3timers:
  Link ADJ_PC:SLC: 2200:2 NetMngSLS: 2
    State:      INSERVICE,  UP
    STD Test:  passed
    Got, sent :
    Inhibit:
    Changeover: NO
    Tx buffer:  0
    Tx queue:   0
    Retrans pos 0
    CO buffer:  0
    CB buffer:  0
    Last FSN:   92
    MTP3timers:

Then, I used dahd_maint to loop one of the E1's, causing the associated signalling link to fail.
The following are largely irrelevent but just demonstrate the loop:


[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 63: Loopback
[Jan  2 23:11:33] ERROR[24690]: chan_dahdi.c:3104 my_handle_link_exception: SS7 got event: Alarm(4) on span 1/2
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 64: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 65: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 66: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 67: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 68: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 69: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 70: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 71: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 72: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 73: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 74: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 75: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 76: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 77: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 79: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 80: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 81: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 82: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 83: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 84: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 85: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 86: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 87: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 88: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 89: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 90: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 91: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 92: Loopback
[Jan  2 23:11:33] WARNING[24691]: chan_dahdi.c:7376 handle_alarms: Detected alarm on channel 93: Loopback

And after that:

[Jan  2 23:11:33] ERROR[24690]: chan_dahdi.c:13851 dahdi_ss7_error: [1] Received message for slc 0x2, but we are 0x0.  Dropping
[Jan  2 23:11:33] ERROR[24690]: chan_dahdi.c:13851 dahdi_ss7_error: [1] Received message for slc 0x2, but we are 0x0.  Dropping
[1] MTP3 T2 timer expired on link SLC: 2 ADJPC: 2200 changeover completed
[Jan  2 23:11:35] ERROR[24690]: chan_dahdi.c:13851 dahdi_ss7_error: [1] T7 expired on link SLC: 0 ADJPC: 2200
[Jan  2 23:11:35] WARNING[24690]: sig_ss7.c:1496 ss7_linkset: MTP2 link down (SLC 153)
[Jan  2 23:11:35] ERROR[24690]: chan_dahdi.c:13851 dahdi_ss7_error: [1] Received message for slc 0x0, but we are 0x1.  Dropping
 MTP2 link up (SLC 158)
[1] MTP3 T2 timer expired on link SLC: 0 ADJPC: 2200 changeover completed

This shows the issue with the SLC number.
Immediately thereafter:

samsbc2*CLI> ss7 show linkset 1
SS7 flags: 0x0
SS7 linkset 1 status: Up
SS7 calling nai: -1
SS7 called nai: -1
SS7 nationalprefix: 0
SS7 internationalprefix: +
SS7 unknownprefix:
SS7 networkroutedprefix:
SS7 subscriberprefix:
Switch type: ITU
Our point code: 776
SLS shift: 0
numlinks: 3
numsps: 1
  ---------------------------------
  Adjacent SP PC: 2200 STATE: UP
  TRA:  GOT SENT    T19: not running T21: not running
  Routes:
    DPC       State        T6       T10
3856          TFP         -   running
976          TFP         -   running
968          TFP         -   running
  Link ADJ_PC:SLC: 2200:0 NetMngSLS: 0
    State:      INSERVICE,  UP
    STD Test:  passed
    Got, sent : sentCBD
    Inhibit:
    Changeover: CHANGEBACK INITIATED

Note how now only one signalling link shows; both the failed link (SLC 1) and the one after it (SLC 2) just vanished.

After that, I took the core dump on the running Asterisk process.

> Asterisk becomes unstable after SS7 signalling link restarts
> ------------------------------------------------------------
>
>                 Key: ASTERISK-28674
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-28674
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Channels/chan_dahdi/SS7
>    Affects Versions: 13.29.2, 16.7.0
>         Environment: Asterisk 16.7.0, LibSS7 2.0.0, DAHDI 3.1.0, Digium TE820 + TE420 (5th gen), Ubuntu 18.04.3 LTS, Kernel 5.0.0-37-generic.
> Same issue also ocurrs using Asterisk 13.29.2.
>            Reporter: Gregory Massel
>            Assignee: Unassigned
>            Severity: Minor
>         Attachments: chan_dahdi.conf.txt, core-asterisk-running-2020-01-02T23-14-27+0200-brief.txt, core-asterisk-running-2020-01-02T23-14-27+0200-full.txt, core-asterisk-running-2020-01-02T23-14-27+0200-locks.txt, core-asterisk-running-2020-01-02T23-14-27+0200-thread1.txt, ss7.timers.txt
>
>
> When one of the SS7 signalling links goes down briefly and then restores, the SS7 subsystem goes into a corrupt state. This manifests in a numbers of ways:
> 1. The system starts using consistently 100% CPU on exactly one core, resulting in a load average of just over 1.0. This will continue indefinitely.
> 2. All outbound DAHDI calls thereafter hang during initiation reflecting a "Ring" state; a "core show channels" after a few hours shows thousands of hung channels. The corresponding PJSIP channels close and log "res_pjsip_sdp_rtp.c: Disconnecting channel 'PJSIP/xxx' for lack of RTP activity in 60 seconds" but the call and DAHDI channel remain hung.
> 3. No inbound calls are accepted via the DAHDI spans.
> 4. A "ss7 show linkset 1" seems to be missing signalling links that were there before, as if they were never configured.
> The log entries before everything goes pear-shaped look like this:
> [Jan  1 03:14:08] WARNING[24375] sig_ss7.c: MTP2 link down (SLC 33)
> [Jan  1 03:14:08] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x0.  Dropping
> [Jan  1 03:14:08] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x0.  Dropping
> [Jan  1 03:14:08] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x0.  Dropping
> [Jan  1 03:14:08] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x0.  Dropping
> [Jan  1 03:14:09] VERBOSE[24375] sig_ss7.c: MTP2 link up (SLC 34)
> [Jan  1 03:14:09] VERBOSE[24375] chan_dahdi.c: [1] MTP3 T2 timer expired on link SLC: 3 ADJPC: 2200 changeover completed
> [Jan  1 03:14:09] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x1.  Dropping
> [Jan  1 03:14:09] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x0.  Dropping
> [Jan  1 03:14:09] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x2.  Dropping
> [Jan  1 03:14:09] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x1.  Dropping
> [Jan  1 03:14:09] ERROR[24375] chan_dahdi.c: [1] T7 expired on link SLC: 0 ADJPC: 2200
> [Jan  1 03:14:09] WARNING[24375] sig_ss7.c: MTP2 link down (SLC 34)
> [Jan  1 03:14:09] ERROR[24375] chan_dahdi.c: [1] T7 expired on link SLC: 1 ADJPC: 2200
> [Jan  1 03:14:09] WARNING[24375] sig_ss7.c: MTP2 link down (SLC 34)
> This highlights a related issue: I have SLC numbers 0 to 3 only. The system is actually logging an incorrect SLC number (34) as having failed and/or restored. It appears that this goes beyond just the log entry. The entire system seems to get horribly confused because it thinks that an event has ocurred relating to an invalid SLC number.
> Note that the "Received message for slc 0x3, but we are 0x2.  Dropping" messages are NOT the source of the issue; I get these when Asterisk starts up cleanly as well, because chan_dahdi/libss7 wrongly filters out STD Test responses received via a different SLC from the one they're sent on. This is fine because the applicable timer expires and it stops waiting for the STD Test response, however, it highlights another unrelated bug (i.e. that Asterisk shouldn't be filtering STD_TEST replies received on a different signalling link within a single linkset and should treat the received message as proper acknowledgement).
> When Asterisk starts up for the first time, it always shows SLC 0 as up for all signalling links:
> [Jan  1 18:51:03] VERBOSE[24690] sig_ss7.c: MTP2 link up (SLC 0)
> [Jan  1 18:51:03] VERBOSE[24690] sig_ss7.c: MTP2 link up (SLC 0)
> [Jan  1 18:51:03] VERBOSE[24690] sig_ss7.c: MTP2 link up (SLC 0)
> This is despite these being SLC 0, 1 and 2 (all part of one linkset).
> The SLC numbers only reflect 33 or 34 (invalid) once one of the signalling links fails and immediately restores.
> The configuration appears to be 100% because the system can run for days - even weeks - perfectly, processing tens of thousands of calls. Things only go awry when the SLC fails and immediately restores and they go awry every single time this happens.
> My suspicion is that the issues here is that Asterisk is mishandling the situation where the signalling link fails by misindentifying which SLC number has failed. Thereafter, it seems to corrupt its own structures relating to the signalling links. Even when the signalling link restores, things are already corrupted beyond any hope of functioning.
> This has happened countless times over many months, however, is becoming more frequent and difficult to manage. What has aggravated it has been enabling 4 signalling links (previously I had 2) within the linkset, as there is a higher probability of failure. 
> All drops have been for a second or less; I'm not sure if this has any bearing on the situation (e.g. if two timers are running concurrently, one relating to link failure and the other to link restoration).
> Killing Asterisk completely and restarting immediately fixes the issue. It's not necessary to restart DAHDI.
> The 100% CPU on a single core does seem to indicate a deadlocked thread, however, I cannot readily identify which thread.
> The config is ITU SS7, 4x E1 circuits are connected to the TE820 card. The TE420 card in the system is unused at this stage (the intention was to provide for 12x E1 in the future). I've experienced similar issues on a system with 2x TE820 and, similarly, 4x E1 on one of the TE820's.



--
This message was sent by Atlassian JIRA
(v6.2#6252)



More information about the asterisk-bugs mailing list