[asterisk-bugs] [JIRA] (ASTERISK-28674) Asterisk becomes unstable after SS7 signalling link restarts
Gregory Massel (JIRA)
noreply at issues.asterisk.org
Thu Jan 9 11:08:26 CST 2020
[ https://issues.asterisk.org/jira/browse/ASTERISK-28674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=249304#comment-249304 ]
Gregory Massel commented on ASTERISK-28674:
-------------------------------------------
More insight from an actual crash early this morning (not one that reproduced).
Four E1's with 4 signalling links (SLC 0 to 3) and the fourth E1 went into alarm.
Immediately it logs:
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] MTP3 T2 timer started on link SLC: 3 ADJPC: 2200
Then correctly sends COO via SLC's 0, 1 and 2 in respect of 3:
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Len = 14 [ 57 b7 0b c0 98 08 c2 30 11 0d 00 00 00 00 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] FSN: 55 FIB 1
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] BSN: 87 BIB 0
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] >[2200:0] MSU
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ 57 b7 0b ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Network Indicator: 3 Priority: 0 User Part: NET_MNG (0)
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ c0 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] OPC 776 DPC 2200 SLS 3
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ 98 08 c2 30 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] H0: 1 H1: 1
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Message type: COO
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ 11 0d 00 00 00 00 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Len = 14 [ e9 fd 0b c0 98 08 c2 30 11 0d 00 00 00 00 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] FSN: 125 FIB 1
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] BSN: 105 BIB 1
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] >[2200:1] MSU
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ e9 fd 0b ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Network Indicator: 3 Priority: 0 User Part: NET_MNG (0)
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ c0 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] OPC 776 DPC 2200 SLS 3
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ 98 08 c2 30 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] H0: 1 H1: 1
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Message type: COO
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ 11 0d 00 00 00 00 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Len = 14 [ 7a 8f 0b c0 98 08 c2 30 11 0d 00 00 00 00 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] FSN: 15 FIB 1
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] BSN: 122 BIB 0
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] >[2200:2] MSU
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ 7a 8f 0b ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Network Indicator: 3 Priority: 0 User Part: NET_MNG (0)
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ c0 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] OPC 776 DPC 2200 SLS 3
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ 98 08 c2 30 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] H0: 1 H1: 1
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Message type: COO
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ 11 0d 00 00 00 00 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Message type: COO
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ 11 0d 00 00 00 00 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Len = 4 [ 85 8d 01 03 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] FSN: 13 FIB 1
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] BSN: 5 BIB 1
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] <[2200:3] LSSU SIOS
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]
Then it receives COA via SLC 0 in respect of SLC3:
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Len = 10 [ b7 58 07 c0 08 03 26 32 21 05 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] FSN: 88 FIB 0
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] BSN: 55 BIB 1
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] <[2200:0] MSU
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ b7 58 07 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Network Indicator: 3 Priority: 0 User Part: NET_MNG (0)
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ c0 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] OPC 2200 DPC 776 SLS 3
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ 08 03 26 32 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] H0: 1 H1: 2
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Message type: COA
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ 21 05 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] MTP3 T2 timer stopped on link SLC: 3 ADJPC: 2200
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Changeover completed on link SLC: 3 PC: 2200 FSN: 5
The above looks good, however, things go bad from here.
We receive another COA in respect of our second COO, however, it's received on SLC 0 and, owing to logic error in libss7, it's expecting it to only be received via the SLC that it was originally sent out of (SLC's 1 and 2):
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Len = 10 [ b7 59 07 c0 08 03 26 32 21 05 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] FSN: 89 FIB 0
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] BSN: 55 BIB 1
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] <[2200:0] MSU
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ b7 59 07 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Network Indicator: 3 Priority: 0 User Part: NET_MNG (0)
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ c0 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] OPC 2200 DPC 776 SLS 3
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ 08 03 26 32 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] H0: 1 H1: 2
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Message type: COA
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] [ 21 05 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]
[Jan 9 04:08:47] ERROR[22441] chan_dahdi.c: [1] Got COA on SLC 3 ADJPC 2200 but we haven't sent COO or ECO
Note the error logged!
But here is where it gets even worse:
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] Len = 4 [ ff ff 01 00 ]
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] FSN: 127 FIB 1
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] BSN: 127 BIB 1
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1] <[2200:3] LSSU SIO
[Jan 9 04:08:47] VERBOSE[22441] chan_dahdi.c: [1]
[Jan 9 04:08:49] ERROR[22441] chan_dahdi.c: [1] T7 expired on link SLC: 0 ADJPC: 2200
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] Link state change: INSERVICE -> IDLE
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] Link state change: IDLE -> NOTALIGNED
[Jan 9 04:08:49] ERROR[22441] chan_dahdi.c: [1] T7 expired on link SLC: 1 ADJPC: 2200
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] Link state change: INSERVICE -> IDLE
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] Link state change: IDLE -> NOTALIGNED
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] MTP3 T2 timer started on link SLC: 0 ADJPC: 2200
[Jan 9 04:08:49] VERBOSE[22441] sig_ss7.c: Linkset 1: Processing event: MTP2_LINK_DOWN
[Jan 9 04:08:49] WARNING[22441] sig_ss7.c: MTP2 link down (SLC 0)
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] MTP3 T2 timer started on link SLC: 1 ADJPC: 2200
[Jan 9 04:08:49] VERBOSE[22441] sig_ss7.c: Linkset 1: Processing event: MTP2_LINK_DOWN
[Jan 9 04:08:49] WARNING[22441] sig_ss7.c: MTP2 link down (SLC 1)
libss7 thinks that SLC 0 and SLC 1 are down because of a timer T7 expiry; i.e. it thinks that there was excessive delay in it receiving a COA response to its COO message, however, it received the COA's but disregarded them because it erroneously thought they were not valid if not received on the same SLC on which they were sent!
Now it tries to send COO via SLC 2 in respect of SLC 0 and SLC 1 (which are actually up):
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] Len = 4 [ ff ff 01 00 ]
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] FSN: 127 FIB 1
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] BSN: 127 BIB 1
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] >[2200:0] LSSU SIO
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1]
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] Len = 4 [ ff ff 01 00 ]
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] FSN: 127 FIB 1
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] BSN: 127 BIB 1
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] >[2200:1] LSSU SIO
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1]
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] Len = 12 [ 7a 90 09 c0 98 08 c2 00 11 59 00 00 ]
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] FSN: 16 FIB 1
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] BSN: 122 BIB 0
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] >[2200:2] MSU
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] [ 7a 90 09 ]
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] Network Indicator: 3 Priority: 0 User Part: NET_MNG (0)
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] [ c0 ]
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] OPC 776 DPC 2200 SLS 0
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] [ 98 08 c2 00 ]
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] H0: 1 H1: 1
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] Message type: COO
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] [ 11 59 00 00 ]
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1]
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] Len = 10 [ 7a 91 07 c0 98 08 c2 10 11 69 ]
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] FSN: 17 FIB 1
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] BSN: 122 BIB 0
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] >[2200:2] MSU
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] [ 7a 91 07 ]
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] Network Indicator: 3 Priority: 0 User Part: NET_MNG (0)
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] [ c0 ]
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] OPC 776 DPC 2200 SLS 1
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] [ 98 08 c2 10 ]
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] H0: 1 H1: 1
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] Message type: COO
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] [ 11 69 ]
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1]
At this stage, only SLC 2 is thought to be up and, as a message arrives on SLC 2:
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] <[2200:2] MSU
[Jan 9 04:08:49] VERBOSE[22441] chan_dahdi.c: [1] [ 91 7b 07 ]
Asterisk then crashes.
>From the core dump, we can see that we received a COA (as expected) and tried to process changeover:
#0 __GI_raise (sig=sig at entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007fb1cd25d801 in __GI_abort () at abort.c:79
#2 0x00007fb1cd2a6897 in __libc_message (action=action at entry=do_abort, fmt=fmt at entry=0x7fb1cd3d3b9a "%sn") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007fb1cd2ad90a in malloc_printerr (str=str at entry=0x7fb1cd3d5870 "double free or corruption (out)") at malloc.c:5350
#4 0x00007fb1cd2b4e75 in _int_free (have_lock=0, p=0x7fb1040058b0, av=0x7fb1cd608c40 <main_arena>) at malloc.c:4278
#5 __GI___libc_free (mem=mem at entry=0x7fb1040058c0) at malloc.c:3124
#6 0x00007fb17b480077 in mtp3_transmit (ss7=ss7 at entry=0x560a784062e0, userpart=<optimized out>, rl=..., priority=<optimized out>, m=0x7fb1040058c0, link=0x0) at mtp3.c:1796
#7 0x00007fb17b480103 in mtp3_transmit_buffer (ss7=0x560a784062e0, buf=buf at entry=0x7fb12e474640) at mtp3.c:640
#8 0x00007fb17b482a33 in mtp3_changeover (link=link at entry=0x560a783f59c0, fsn=<optimized out>) at mtp3.c:881
#9 0x00007fb17b4835ee in net_mng_receive (rl=0x7fb12e474670, rl=0x7fb12e474670, rl=0x7fb12e474670, len=776423125, buf=<optimized out>, mtp2=0x560a78468da0, ss7=0x560a784062e0) at mtp3.c:1207
#10 mtp3_receive (ss7=0x560a784062e0, link=0x560a78468da0, msg=msg at entry=0x7fb12e4746db, len=<optimized out>) at mtp3.c:1891
#11 0x00007fb17b47f67e in msu_rx (len=<optimized out>, h=<optimized out>, link=<optimized out>) at mtp2.c:765
#12 mtp2_receive (link=<optimized out>, buf=buf at entry=0x7fb12e4746e0 "221{a300b003&022!}024377", len=<optimized out>) at mtp2.c:956
#13 0x00007fb17b48486e in ss7_read (ss7=<optimized out>, fd=<optimized out>) at ss7.c:495
#14 0x00007fb17b920713 in ss7_linkset (data=0x7fb17bb3ff60 <linksets>) at sig_ss7.c:1452
#15 0x0000560a75090e21 in dummy_start (data=0x560a783fb830) at utils.c:1249
#16 0x00007fb1ce1556db in start_thread (arg=0x7fb12e475700) at pthread_create.c:463
#17 0x00007fb1cd33e88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
It's basically doing something that is causing it to try and remove its last remaining signalling link and then trying to transmit a message out the signalling link that it's just removed.
So two MORE bugs can be isolated:
1. It should never try and transmit a message out a link structure that it's just free()'ed;
2. It is erroneously disregarding COA messages received if they're received via a different SLC to the one the COO message was transmitted out of, notwithstanding that the message was correct and valid in respect of the SLC specified (SLS parameter).
> Asterisk becomes unstable after SS7 signalling link restarts
> ------------------------------------------------------------
>
> Key: ASTERISK-28674
> URL: https://issues.asterisk.org/jira/browse/ASTERISK-28674
> Project: Asterisk
> Issue Type: Bug
> Security Level: None
> Components: Channels/chan_dahdi/SS7
> Affects Versions: 13.29.2, 16.7.0
> Environment: Asterisk 16.7.0, LibSS7 2.0.0, DAHDI 3.1.0, Digium TE820 + TE420 (5th gen), Ubuntu 18.04.3 LTS, Kernel 5.0.0-37-generic.
> Same issue also ocurrs using Asterisk 13.29.2.
> Reporter: Gregory Massel
> Assignee: Unassigned
> Severity: Minor
> Attachments: chan_dahdi.conf.txt, core-asterisk-running-2020-01-02T23-14-27+0200-brief.txt, core-asterisk-running-2020-01-02T23-14-27+0200-full.txt, core-asterisk-running-2020-01-02T23-14-27+0200-locks.txt, core-asterisk-running-2020-01-02T23-14-27+0200-thread1.txt, core-asterisk-running-2020-01-03T21-43-37+0200-brief.txt, core-asterisk-running-2020-01-03T21-43-37+0200-full.txt, core-asterisk-running-2020-01-03T21-43-37+0200-locks.txt, core-asterisk-running-2020-01-03T21-43-37+0200-thread1.txt, core-asterisk-running-2020-01-03T21-44-45+0200-brief.txt, core-asterisk-running-2020-01-03T21-44-45+0200-full.txt, core-asterisk-running-2020-01-03T21-44-45+0200-locks.txt, core-asterisk-running-2020-01-03T21-44-45+0200-thread1.txt, core-brief.txt, core-full.txt, core-locks.txt, core-thread1.txt, ss7-debug.txt, ss7.timers.txt
>
>
> When one of the SS7 signalling links goes down briefly and then restores, the SS7 subsystem goes into a corrupt state. This manifests in a numbers of ways:
> 1. The system starts using consistently 100% CPU on exactly one core, resulting in a load average of just over 1.0. This will continue indefinitely.
> 2. All outbound DAHDI calls thereafter hang during initiation reflecting a "Ring" state; a "core show channels" after a few hours shows thousands of hung channels. The corresponding PJSIP channels close and log "res_pjsip_sdp_rtp.c: Disconnecting channel 'PJSIP/xxx' for lack of RTP activity in 60 seconds" but the call and DAHDI channel remain hung.
> 3. No inbound calls are accepted via the DAHDI spans.
> 4. A "ss7 show linkset 1" seems to be missing signalling links that were there before, as if they were never configured.
> The log entries before everything goes pear-shaped look like this:
> {noformat}
> [Jan 1 03:14:08] WARNING[24375] sig_ss7.c: MTP2 link down (SLC 33)
> [Jan 1 03:14:08] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x0. Dropping
> [Jan 1 03:14:08] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x0. Dropping
> [Jan 1 03:14:08] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x0. Dropping
> [Jan 1 03:14:08] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x0. Dropping
> [Jan 1 03:14:09] VERBOSE[24375] sig_ss7.c: MTP2 link up (SLC 34)
> [Jan 1 03:14:09] VERBOSE[24375] chan_dahdi.c: [1] MTP3 T2 timer expired on link SLC: 3 ADJPC: 2200 changeover completed
> [Jan 1 03:14:09] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x1. Dropping
> [Jan 1 03:14:09] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x0. Dropping
> [Jan 1 03:14:09] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x2. Dropping
> [Jan 1 03:14:09] ERROR[24375] chan_dahdi.c: [1] Received message for slc 0x3, but we are 0x1. Dropping
> [Jan 1 03:14:09] ERROR[24375] chan_dahdi.c: [1] T7 expired on link SLC: 0 ADJPC: 2200
> [Jan 1 03:14:09] WARNING[24375] sig_ss7.c: MTP2 link down (SLC 34)
> [Jan 1 03:14:09] ERROR[24375] chan_dahdi.c: [1] T7 expired on link SLC: 1 ADJPC: 2200
> [Jan 1 03:14:09] WARNING[24375] sig_ss7.c: MTP2 link down (SLC 34)
> {noformat}
> This highlights a related issue: I have SLC numbers 0 to 3 only. The system is actually logging an incorrect SLC number (34) as having failed and/or restored. It appears that this goes beyond just the log entry. The entire system seems to get horribly confused because it thinks that an event has ocurred relating to an invalid SLC number.
> Note that the "Received message for slc 0x3, but we are 0x2. Dropping" messages are NOT the source of the issue; I get these when Asterisk starts up cleanly as well, because chan_dahdi/libss7 wrongly filters out STD Test responses received via a different SLC from the one they're sent on. This is fine because the applicable timer expires and it stops waiting for the STD Test response, however, it highlights another unrelated bug (i.e. that Asterisk shouldn't be filtering STD_TEST replies received on a different signalling link within a single linkset and should treat the received message as proper acknowledgement).
> When Asterisk starts up for the first time, it always shows SLC 0 as up for all signalling links:
> {noformat}
> [Jan 1 18:51:03] VERBOSE[24690] sig_ss7.c: MTP2 link up (SLC 0)
> [Jan 1 18:51:03] VERBOSE[24690] sig_ss7.c: MTP2 link up (SLC 0)
> [Jan 1 18:51:03] VERBOSE[24690] sig_ss7.c: MTP2 link up (SLC 0)
> {noformat}
> This is despite these being SLC 0, 1 and 2 (all part of one linkset).
> The SLC numbers only reflect 33 or 34 (invalid) once one of the signalling links fails and immediately restores.
> The configuration appears to be 100% because the system can run for days - even weeks - perfectly, processing tens of thousands of calls. Things only go awry when the SLC fails and immediately restores and they go awry every single time this happens.
> My suspicion is that the issues here is that Asterisk is mishandling the situation where the signalling link fails by misindentifying which SLC number has failed. Thereafter, it seems to corrupt its own structures relating to the signalling links. Even when the signalling link restores, things are already corrupted beyond any hope of functioning.
> This has happened countless times over many months, however, is becoming more frequent and difficult to manage. What has aggravated it has been enabling 4 signalling links (previously I had 2) within the linkset, as there is a higher probability of failure.
> All drops have been for a second or less; I'm not sure if this has any bearing on the situation (e.g. if two timers are running concurrently, one relating to link failure and the other to link restoration).
> Killing Asterisk completely and restarting immediately fixes the issue. It's not necessary to restart DAHDI.
> The 100% CPU on a single core does seem to indicate a deadlocked thread, however, I cannot readily identify which thread.
> The config is ITU SS7, 4x E1 circuits are connected to the TE820 card. The TE420 card in the system is unused at this stage (the intention was to provide for 12x E1 in the future). I've experienced similar issues on a system with 2x TE820 and, similarly, 4x E1 on one of the TE820's.
--
This message was sent by Atlassian JIRA
(v6.2#6252)
More information about the asterisk-bugs
mailing list