[asterisk-bugs] [JIRA] (ASTERISK-25127) DTLS crashes following "Unable to cancel schedule ID" in dtls_srtp_check_pending
Dade Brandon (JIRA)
noreply at issues.asterisk.org
Mon May 25 18:43:32 CDT 2015
[ https://issues.asterisk.org/jira/browse/ASTERISK-25127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dade Brandon updated ASTERISK-25127:
------------------------------------
Attachment: core-2.txt
Second core dump today:
[May 25 09:57:18] WARNING[16886][C-000000e9] res_rtp_asterisk.c: Unable to cancel schedule ID 4383. This is probably a bug (res_rtp_asterisk.c: dtls_srtp_check_pending, line 1836).
followed by:
#2 0x00007f47c872138f in OpenSSLDie () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
Interestingly, in this one rtp->engine is not NULL (in the backtrace I added "print instance->engine" from frame 7), but the crash is still preceded by the same "Unable to cancel schedule ID" message. This is another crash location we have been seeing since we enabled WebRTC back in 11.15, but I had never connected it to this message before today.
If anybody knows how to build OpenSSL with debug symbols, I'd be happy to do so.
Attached as core-2.txt [^core-2.txt]
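
For readers unfamiliar with the warning, here is a minimal, self-contained sketch of a cancel-with-retry helper that produces this kind of message. It is a toy model, not the Asterisk macro: toy_sched, toy_sched_del and toy_cancel_with_retry are hypothetical names, and the single-slot scheduler and caller-supplied retry limit are assumptions made for illustration. In this model the warning appears when the ID being cancelled is no longer known to the scheduler, for example because the entry has already fired or was already removed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical scheduler with a single slot: the pending timer ID, or -1. */
struct toy_sched {
    int pending_id;
};

/* Remove a pending entry. Returns 0 on success, -1 if the ID is not pending. */
static int toy_sched_del(struct toy_sched *s, int id)
{
    assert(s != NULL);
    if (id < 0 || s->pending_id != id) {
        return -1;
    }
    s->pending_id = -1;
    return 0;
}

/*
 * Try to cancel *id up to max_tries times. If every attempt fails, log a
 * warning. The caller's ID is reset to -1 in all cases.
 * Returns true if the warning was emitted.
 */
static bool toy_cancel_with_retry(struct toy_sched *s, int *id, int max_tries)
{
    int tries = 0;
    bool warned = false;

    assert(id != NULL && max_tries > 0);
    while (*id > -1 && toy_sched_del(s, *id) != 0 && ++tries < max_tries) {
        /* A real implementation would typically sleep briefly between attempts. */
    }
    if (tries == max_tries) {
        fprintf(stderr, "Unable to cancel schedule ID %d. This is probably a bug.\n", *id);
        warned = true;
    }
    *id = -1;
    return warned;
}
```

Calling toy_cancel_with_retry with an ID that is still pending cancels it silently; calling it with a stale ID exhausts the retries and logs the warning.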
> DTLS crashes following "Unable to cancel schedule ID" in dtls_srtp_check_pending
> --------------------------------------------------------------------------------
>
> Key: ASTERISK-25127
> URL: https://issues.asterisk.org/jira/browse/ASTERISK-25127
> Project: Asterisk
> Issue Type: Bug
> Security Level: None
> Components: Resources/res_rtp_asterisk
> Affects Versions: 11.18.0
> Environment: Linux kernel "3.13.0-24-generic"
> Ubuntu 14.04,
> Asterisk 11.18.0-rc1,
> Compiler flags: DONT_OPTIMIZE, LOADABLE_MODULES, BETTER_BACKTRACES, BUILD_NATIVE, G711_NEW_ALGORITHM
> Openssl: 1.0.1f-1ubuntu2.11
> libuuid1: 2.20.1-5.1ubuntu20.4
> SIP Realtime: Module loaded & not in use
> Timer: res_timing_timerfd (res_timing_pthread also loaded)
> See attached 'environment.txt' for output of 'core show settings' and 'module show'
> All calls are:
> [Peer] <->ch1<-> Asterisk <->ch2<-> [Misc ITSPs]
> ch1 transport is always SIP over WSS using sipjs on Chrome (stable/M42/M43 and/or Canary M45) with ulaw codec. Peer is almost always NATd, Asterisk is never NATd,
> ch2 transport is always Plain Old SIP (5060, no TLS) with RTP (no [d]TLS)
> Reporter: Dade Brandon
> Severity: Critical
> Attachments: core-1.txt, core-2.txt, environment.txt
>
>
> h2. Preface
> First, I want to say that I am very familiar with the other DTLS crash issues on Jira. If this is related, I believe it is probably a precursor to the conditions that cause those later segfaults: since upgrading to trunk I haven't had a core dump, but have continued to experience crashes (asterisk restarting via safe_asterisk, with an unknown signal since there's no core). This issue is likely better suited as a parent of some of the other DTLS crash issues.
> We get about 5-10 crashes per production day (across 35 servers) and I only did the latest update Friday evening, so I will probably know whether there are useful core dumps by the end of Tuesday, due to the US holiday on Monday. We service businesses, so the volume is extremely low right now due to the long weekend in the US market.
> Also, we notice that the crashes seem to target certain servers, and that there appears to be a correlation between the affected servers and the latency of the peers connected to them.
> h2. Details:
> Preceding an asterisk crash, we receive "Unable to cancel schedule ID nnnnnn. This is probably a bug (res_rtp_asterisk.c: dtls_srtp_check_pending, line NNNN)"
> (The line number is 1811 on trunk; we have other patches applied above it, mostly logging related, which make our line numbers less relevant. The line is "AST_SCHED_DEL_UNREF(rtp->sched, rtp->dtlstimerid, ao2_ref(instance, -1));".)
> Asterisk does not die immediately afterwards. In the messages log, anywhere from 2 seconds to a full minute passes between the warning and each crash. Note the timing in the example debug logs below.
> On servers where we've added extra logging, each of these incidents is followed by output from an ast_debug call we inserted in main/rtp_engine.c, for the same thread and call, indicating that the RTP instance->engine is NULL:
> h3. Example 1
> {noformat}
> [14:55:16] WARNING[11973][C-0000e496] res_rtp_asterisk.c: Unable to cancel schedule ID 539829. This is probably a bug (res_rtp_asterisk.c: dtls_srtp_check_pending, line 1834).
> [14:55:55] DEBUG[11973][C-0000e496] rtp_engine.c: XWSDEBUG4.2 ast_rtp_instance_set_remote_address-- NULL INSTANCE ENGINE for RTP instance '0x7f7c50076da8'
> - this debug message is placed in ast_rtp_instance_set_remote_address, after ast_sockaddr_copy, and is called if (!instance->engine).
> {noformat}
> h3. Example 2
> {noformat}
> [12:18:54] WARNING[10203][C-0000a76c] res_rtp_asterisk.c: Unable to cancel schedule ID 436177. This is probably a bug (res_rtp_asterisk.c: dtls_srtp_check_pending, line 1834).
> [12:18:59] DEBUG[23236][C-0000a76c] rtp_engine.c: XWSDEBUG4.2 ast_rtp_instance_set_remote_address-- NULL INSTANCE ENGINE for RTP instance '0x7f6492f03a98'
> - this debug message is placed in ast_rtp_instance_set_remote_address, after ast_sockaddr_copy, and is called if (!instance->engine).
> [12:18:59] DEBUG[10203][C-0000a76c] rtp_engine.c: XWSDEBUG1.2 ast_rtp_instance_write-- NULL INSTANCE ENGINE for RTP instance '0x7f6492f03a98'
> - this debug line asserts (instance && !instance->engine) before instance->engine->write(instance, frame)
> [12:18:59] VERBOSE[11061][C-0000a76c] app_verbose.c: Caller party hung up MinDur running on SIP/Pir1-0000fadf -- Answered time :atime 1431112732:talk 7 -- Route 7503 -- MinDur request 0
> [12:18:59] DEBUG[10203][C-0000a76c] rtp_engine.c: XWSDEBUG29.2 -- NULL INSTANCE ENGINE for Instance '0x7f6492f03a98'
> - this asserts (instance && !instance->engine) before instance->engine->stop(instance)
> {noformat}
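
The NULL-engine checks described in the examples above can be sketched as follows. This is a minimal, self-contained illustration, not the actual code in main/rtp_engine.c: toy_engine, toy_instance, toy_instance_write and toy_ok_write are hypothetical names, and the struct layout is an assumption. Unlike the debug lines in the report, which only log, this sketch also returns early instead of dereferencing the NULL engine pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct toy_instance;

/* Hypothetical engine: a table of callbacks, reduced here to write(). */
struct toy_engine {
    int (*write)(struct toy_instance *instance, const char *frame);
};

/* Hypothetical RTP instance; engine is NULL once it has been torn down. */
struct toy_instance {
    struct toy_engine *engine;
};

/*
 * Guarded write: log and fail if the instance or its engine is missing,
 * otherwise forward the frame to the engine's write callback.
 */
static int toy_instance_write(struct toy_instance *instance, const char *frame)
{
    assert(frame != NULL);
    if (instance == NULL || instance->engine == NULL) {
        fprintf(stderr, "NULL INSTANCE ENGINE for RTP instance '%p'\n",
                (void *)instance);
        return -1;
    }
    return instance->engine->write(instance, frame);
}

/* Trivial engine callback used to exercise the guarded path. */
static int toy_ok_write(struct toy_instance *instance, const char *frame)
{
    (void)instance;
    (void)frame;
    return 0;
}
```

With an engine attached, toy_instance_write forwards to the callback; with engine set to NULL it logs a message similar to the XWSDEBUG lines and returns -1.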