[asterisk-bugs] [JIRA] (ASTERISK-28674) Asterisk becomes unstable after SS7 signalling link restarts

Tue Jan 7 10:01:26 CST 2020

    [ https://issues.asterisk.org/jira/browse/ASTERISK-28674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=249259#comment-249259 ] 

Gregory Massel commented on ASTERISK-28674:
-------------------------------------------

Please keep in mind that the problem ocurrs BEFORE doing a "ss7 show linkset 1". In fact, the CLI command is largely irrelevent (I would have had no need to even look at the linkset status if the system hadn't failed already) and I included it merely because the output is a very clear and visual demonstration that the data structures relating to the signalling links have been corrupted.

The issues ocurrs from the very moment that the signalling link goes down and tries to come up again. It will happen even if you don't issue any CLI commands at all (or never even log into the CLI).

Looking at the above code snippet, I see no safe-check to test whether cur->next is NULL or res is irrationally high. That, in itself, is not a problem if the underlying 'buf' is valid. Putting in place some sort of limit or check may *MITIGATE* against an infinite loop and more gracefully deal with broken code elsewhere, but it wouldn't solve the underlying problem (that is more likely caused by the data in *buf being corrupt).

I'm no expert at reading a backtrace, however, my inclination was to look at channels/sig_ss7.c, particularly ss7_linkset, row 1493 and 1496, because the logged data is WRONG.

                        case MTP2_LINK_UP:
                                ast_verb(1, "MTP2 link up (SLC %d)\n", e->gen.data);
                                break;
                        case MTP2_LINK_DOWN:
                                ast_log(LOG_WARNING, "MTP2 link down (SLC %d)\n", e->gen.data);
                                break;

Note that it logged:
Linkset 1: Processing event: MTP2_LINK_DOWN
[Jan  3 21:42:38] WARNING[17089]: sig_ss7.c:1496 ss7_linkset: MTP2 link down (SLC 36)

But we know that we only have SLC numbers 0 to 3 on this system, so numbers as high as 36 are wrong.

This means that e->gen.data is probably NOT the correct value of the SLC.

That being said, I'm trying to follow the code from there and, as far as I can see, that issue applies solely to the logging and e->gen.data isn't used thereafter. So, notwithstanding that the logging is buggy (e->gen.data is NOT the correct SLC number), it's not clear that this is the same bug.

I've also followed this code through to libss7-2.0.0/mtp3.c, row 1869, and the following code is also imperfect:

1860:        if (ss7->pc != rl.dpc) {
1861:                ss7_error(ss7, "Received message destined for point code 0x%x, but we are 0x%x.  Dropping\n", rl.dpc, ss7->pc);
1862:                return -1;
1863:        } else if (userpart == SIG_STD_TEST || userpart == SIG_SPEC_TEST || userpart == SIG_NET_MNG) {
1864:                if (link->dpc != rl.opc) {
1865:                        ss7_error(ss7, "Received message from point code 0x%x but we are connected to DPC 0x%x.  Dropping\n", rl.opc, link->dpc);
1866:                        return -1;
1867:                }
1868:
1869:                if (link->slc != rl.sls) {
1870:                        ss7_error(ss7, "Received message for slc 0x%x, but we are 0x%x.  Dropping\n", rl.sls, link->slc);
1871:                        return -1;
1872:                }
1873:        }

I believe that lines 1869 to 1872 are incorrect and should be ommitted. They are basically saying that, in respect of the applicable messages (SIG_STD_TEST, SIG_SPEC_TEST, SIG_NET_MNG), that these *MUST* be received on the same signalling link (link->slc) to which they relate (rl.sls). However, that is simply not correct. This code is having the effect of filtering valid messages. The message must be in respect of the same linkset and, in fact, this is tested in lines 1864-1867, but 1869-1872 go beyond what is reasonable, hence I'm logging lots of messages like:

[Jan  3 21:42:38] ERROR[17089]: chan_dahdi.c:13851 dahdi_ss7_error: [1] Received message for slc 0x0, but we are 0x1.  Dropping

Looking at the logs, this filtering caused it to think that signalling links 0 and 2 (slc 0 and slc 2) failed (when, in fact, slc1 failed):

[1] MTP3 T2 timer expired on link SLC: 2 ADJPC: 2200 changeover completed
[Jan  3 21:42:38] ERROR[17089]: chan_dahdi.c:13851 dahdi_ss7_error: [1] T7 expired on link SLC: 0 ADJPC: 2200
[1] Link state change: INSERVICE -> IDLE
[1] Link state change: IDLE -> NOTALIGNED
[1] MTP3 T2 timer started on link SLC: 0 ADJPC: 2200
Linkset 1: Processing event: MTP2_LINK_DOWN
[Jan  3 21:42:38] WARNING[17089]: sig_ss7.c:1496 ss7_linkset: MTP2 link down (SLC 36)

This has happened because it filtered out NET_MNG messages incorrectly (libss7-2.0.0/mtp3.c lines 1869-1872)

When timers expire, SLC 0 seems to come up, however, SLC 2 doesn't (maybe because NET_MNG for 2 was received on SLC 1).

So, at this point, SLC 0 should be regarded as up and, in theory, signalling should be fine (as at least one signalling link is up). However, it very clearly isn't. Something is locked up in a loop. Something related to SLC 1 and/or SLC 2 (or both) appears to be left in a corrupt state.

When issuing a "ss7 show linkset 1" on the CLI, it's not that that command is breaking something; it's more likely that it's trying to access a data structure relating to SLC 1 or SLC 2 that, somewhere in the above, got corrupted.

> Asterisk becomes unstable after SS7 signalling link restarts
> ------------------------------------------------------------
>
>                 Key: ASTERISK-28674
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-28674
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Channels/chan_dahdi/SS7
>    Affects Versions: 13.29.2, 16.7.0
>         Environment: Asterisk 16.7.0, LibSS7 2.0.0, DAHDI 3.1.0, Digium TE820 + TE420 (5th gen), Ubuntu 18.04.3 LTS, Kernel 5.0.0-37-generic.
> Same issue also ocurrs using Asterisk 13.29.2.
>            Reporter: Gregory Massel
>            Assignee: Unassigned
>            Severity: Minor
>         Attachments: chan_dahdi.conf.txt, core-asterisk-running-2020-01-02T23-14-27+0200-brief.txt, core-asterisk-running-2020-01-02T23-14-27+0200-full.txt, core-asterisk-running-2020-01-02T23-14-27+0200-locks.txt, core-asterisk-running-2020-01-02T23-14-27+0200-thread1.txt, core-asterisk-running-2020-01-03T21-43-37+0200-brief.txt, core-asterisk-running-2020-01-03T21-43-37+0200-full.txt, core-asterisk-running-2020-01-03T21-43-37+0200-locks.txt, core-asterisk-running-2020-01-03T21-43-37+0200-thread1.txt, core-asterisk-running-2020-01-03T21-44-45+0200-brief.txt, core-asterisk-running-2020-01-03T21-44-45+0200-full.txt, core-asterisk-running-2020-01-03T21-44-45+0200-locks.txt, core-asterisk-running-2020-01-03T21-44-45+0200-thread1.txt, ss7-debug.txt, ss7.timers.txt
>
>
> When one of the SS7 signalling links goes down briefly and then restores, the SS7 subsystem goes into a corrupt state. This manifests in a numbers of ways:
> 1. The system starts using consistently 100% CPU on exactly one core, resulting in a load average of just over 1.0. This will continue indefinitely.
> 2. All outbound DAHDI calls thereafter hang during initiation reflecting a "Ring" state; a "core show channels" after a few hours shows thousands of hung channels. The corresponding PJSIP channels close and log "res_pjsip_sdp_rtp.c: Disconnecting channel 'PJSIP/xxx' for lack of RTP activity in 60 seconds" but the call and DAHDI channel remain hung.
> 3. No inbound calls are accepted via the DAHDI spans.
> 4. A "ss7 show linkset 1" seems to be missing signalling links that were there before, as if they were never configured.
> The log entries before everything goes pear-shaped look like this:
> {noformat}
> [Jan  1 03:14:08] WARNING[24375] sig_ss7.c: MTP2 link down (SLC 33)
> [Jan  1 03:14:08] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x0.  Dropping
> [Jan  1 03:14:08] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x0.  Dropping
> [Jan  1 03:14:08] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x0.  Dropping
> [Jan  1 03:14:08] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x0.  Dropping
> [Jan  1 03:14:09] VERBOSE[24375] sig_ss7.c: MTP2 link up (SLC 34)
> [Jan  1 03:14:09] VERBOSE[24375] chan_dahdi.c: [1] MTP3 T2 timer expired on link SLC: 3 ADJPC: 2200 changeover completed
> [Jan  1 03:14:09] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x1.  Dropping
> [Jan  1 03:14:09] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x0.  Dropping
> [Jan  1 03:14:09] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x2.  Dropping
> [Jan  1 03:14:09] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x1.  Dropping
> [Jan  1 03:14:09] ERROR[24375] chan_dahdi.c: [1] T7 expired on link SLC: 0 ADJPC: 2200
> [Jan  1 03:14:09] WARNING[24375] sig_ss7.c: MTP2 link down (SLC 34)
> [Jan  1 03:14:09] ERROR[24375] chan_dahdi.c: [1] T7 expired on link SLC: 1 ADJPC: 2200
> [Jan  1 03:14:09] WARNING[24375] sig_ss7.c: MTP2 link down (SLC 34)
> {noformat}
> This highlights a related issue: I have SLC numbers 0 to 3 only. The system is actually logging an incorrect SLC number (34) as having failed and/or restored. It appears that this goes beyond just the log entry. The entire system seems to get horribly confused because it thinks that an event has ocurred relating to an invalid SLC number.
> Note that the "Received message for slc 0x3, but we are 0x2.  Dropping" messages are NOT the source of the issue; I get these when Asterisk starts up cleanly as well, because chan_dahdi/libss7 wrongly filters out STD Test responses received via a different SLC from the one they're sent on. This is fine because the applicable timer expires and it stops waiting for the STD Test response, however, it highlights another unrelated bug (i.e. that Asterisk shouldn't be filtering STD_TEST replies received on a different signalling link within a single linkset and should treat the received message as proper acknowledgement).
> When Asterisk starts up for the first time, it always shows SLC 0 as up for all signalling links:
> {noformat}
> [Jan  1 18:51:03] VERBOSE[24690] sig_ss7.c: MTP2 link up (SLC 0)
> [Jan  1 18:51:03] VERBOSE[24690] sig_ss7.c: MTP2 link up (SLC 0)
> [Jan  1 18:51:03] VERBOSE[24690] sig_ss7.c: MTP2 link up (SLC 0)
> {noformat}
> This is despite these being SLC 0, 1 and 2 (all part of one linkset).
> The SLC numbers only reflect 33 or 34 (invalid) once one of the signalling links fails and immediately restores.
> The configuration appears to be 100% because the system can run for days - even weeks - perfectly, processing tens of thousands of calls. Things only go awry when the SLC fails and immediately restores and they go awry every single time this happens.
> My suspicion is that the issues here is that Asterisk is mishandling the situation where the signalling link fails by misindentifying which SLC number has failed. Thereafter, it seems to corrupt its own structures relating to the signalling links. Even when the signalling link restores, things are already corrupted beyond any hope of functioning.
> This has happened countless times over many months, however, is becoming more frequent and difficult to manage. What has aggravated it has been enabling 4 signalling links (previously I had 2) within the linkset, as there is a higher probability of failure. 
> All drops have been for a second or less; I'm not sure if this has any bearing on the situation (e.g. if two timers are running concurrently, one relating to link failure and the other to link restoration).
> Killing Asterisk completely and restarting immediately fixes the issue. It's not necessary to restart DAHDI.
> The 100% CPU on a single core does seem to indicate a deadlocked thread, however, I cannot readily identify which thread.
> The config is ITU SS7, 4x E1 circuits are connected to the TE820 card. The TE420 card in the system is unused at this stage (the intention was to provide for 12x E1 in the future). I've experienced similar issues on a system with 2x TE820 and, similarly, 4x E1 on one of the TE820's.

--
This message was sent by Atlassian JIRA
(v6.2#6252)