[asterisk-bugs] [JIRA] (ASTERISK-25127) DTLS crashes following "Unable to cancel schedule ID" in dtls_srtp_check_pending

Mon Jul 6 17:32:33 CDT 2015

    [ https://issues.asterisk.org/jira/browse/ASTERISK-25127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=226804#comment-226804 ] 

Dade Brandon commented on ASTERISK-25127:
-----------------------------------------

Adding to the last backtrace: if I do 'frame 1; print dtls->ssl', the result is (SSL*)0x0.


> DTLS crashes following "Unable to cancel schedule ID" in dtls_srtp_check_pending
> --------------------------------------------------------------------------------
>
>                 Key: ASTERISK-25127
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-25127
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Resources/res_rtp_asterisk
>    Affects Versions: 11.18.0
>         Environment: Linux kernel "3.13.0-24-generic"
> Ubuntu 14.04,
> Asterisk 11.18.0-rc1, 
> Compiler flags: DONT_OPTIMIZE, LOADABLE_MODULES, BETTER_BACKTRACES, BUILD_NATIVE, G711_NEW_ALGORITHM
> Openssl: 1.0.1f-1ubuntu2.11
> libuuid1: 2.20.1-5.1ubuntu20.4
> SIP Realtime: Module loaded & not in use
> Timer: res_timing_timerfd    (res_timing_pthread also loaded)
> See attached 'environment.txt' for output of 'core show settings' and 'module show'
> All calls are:
>    [Peer] <->ch1<-> Asterisk <->ch2<-> [Misc ITSPs]
> ch1 transport is always SIP over WSS using sipjs on Chrome (stable/M42/M43 and/or Canary M45) with ulaw codec.  Peer is almost always NATd, Asterisk is never NATd,
> ch2 transport is always Plain Old SIP (5060, no TLS) with RTP (no [d]TLS)
>            Reporter: Dade Brandon
>            Assignee: Joshua Colp
>            Severity: Critical
>         Attachments: backtrace latest patch.txt, core-1.txt, core-2.txt, core-all_thread_bt_jun10.txt, core-fullbt-jun22#14.txt, core-fullbt-jun24#15, core-jun10#2.txt, debug.21.txt, debug-crash_jun10.txt, debug-jun10#2.txt, debug-jun22#14.txt, debug-jun24#15, environment.txt
>
>
> h2. Preface
> First, I want to say that I am very familiar with the other DTLS crash issues on Jira. I believe that if this is related, it is probably a precursor to the crashes that create the later segfaults: since upgrading to trunk I haven't had a core dump, but I have continued to experience crashes (asterisk restarting via safe_asterisk, with an unknown signal since there's no core).  This is likely better suited as a parent issue to some of the other DTLS crash issues.
> We get about 5-10 crashes per production day (across 35 servers). I only did the latest update Friday evening, and because of the US holiday on Monday I will probably not know whether there are useful core dumps until the end of Tuesday.  We service businesses, so the volume is extremely low right now due to the long weekend in the US market.
> Also, we notice that the crashes seem to target certain servers, and that there appears to be a correlation between the affected servers and the latency of the peers connected to those servers.
> h2. Details:
> Preceding an asterisk crash, we receive "Unable to cancel schedule ID nnnnnn.   This is probably a bug (res_rtp_asterisk.c: dtls_srtp_check_pending, line NNNN)"
> (The line number is 1811 on trunk. We have other patches applied above it, mostly logging related, which make our line numbers less relevant.  The line is "AST_SCHED_DEL_UNREF(rtp->sched, rtp->dtlstimerid, ao2_ref(instance, -1));")
> Asterisk does not die immediately after.  In messages, there is anywhere from 2 seconds to a full minute remaining before each crash.  Note the timing in the example debug logs below.
> On servers where we've added extra logging, we find that for each of these issues there is output from an ast_debug call we inserted in main/rtp_engine.c, for the same thread and call, indicating that the RTP instance->engine is NULL:
> h3. Example 1
> {noformat}
> [14:55:16] WARNING[11973][C-0000e496] res_rtp_asterisk.c: Unable to cancel schedule ID 539829.  This is probably a bug (res_rtp_asterisk.c: dtls_srtp_check_pending, line 1834).
> [14:55:55] DEBUG[11973][C-0000e496] rtp_engine.c: XWSDEBUG4.2 ast_rtp_instance_set_remote_address-- NULL INSTANCE ENGINE for RTP instance '0x7f7c50076da8'
>     - this debug message is placed in ast_rtp_instance_set_remote_address, after ast_sockaddr_copy, and is called if (!instance->engine).
> {noformat}
> h3. Example 2
> {noformat}
> [12:18:54] WARNING[10203][C-0000a76c] res_rtp_asterisk.c: Unable to cancel schedule ID 436177.  This is probably a bug (res_rtp_asterisk.c: dtls_srtp_check_pending, line 1834).
> [12:18:59] DEBUG[23236][C-0000a76c] rtp_engine.c: XWSDEBUG4.2 ast_rtp_instance_set_remote_address-- NULL INSTANCE ENGINE for RTP instance '0x7f6492f03a98'
>     - this debug message is placed in ast_rtp_instance_set_remote_address, after ast_sockaddr_copy, and is called if (!instance->engine).
> [12:18:59] DEBUG[10203][C-0000a76c] rtp_engine.c: XWSDEBUG1.2 ast_rtp_instance_write-- NULL INSTANCE ENGINE for RTP instance '0x7f6492f03a98'
>    - this debug line asserts (instance && !instance->engine) before instance->engine->write(instance, frame) 
> [12:18:59] VERBOSE[11061][C-0000a76c] app_verbose.c: Caller party hung up MinDur running on SIP/Pir1-0000fadf -- Answered time :atime 1431112732:talk 7 -- Route 7503 -- MinDur request 0
> [12:18:59] DEBUG[10203][C-0000a76c] rtp_engine.c: XWSDEBUG29.2 -- NULL INSTANCE ENGINE for Instance '0x7f6492f03a98'
>   - this asserts (instance && !instance->engine) before instance->engine->stop(instance)
> {noformat}
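
The NULL-engine checks described in the examples above can be sketched roughly as follows. This is a minimal, self-contained illustration and not Asterisk source: `struct rtp_engine_sketch`, `struct rtp_instance_sketch`, `guarded_write`, and `sketch_write_ok` are hypothetical names, and the sketch assumes only what the report states, namely that the instance pointer is valid while its engine pointer is NULL at the time of the call.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration only; not the Asterisk structs. */
struct rtp_instance_sketch;

struct rtp_engine_sketch {
    /* Engine callback, analogous in shape to an engine "write" hook. */
    int (*write)(struct rtp_instance_sketch *instance, const char *frame);
};

struct rtp_instance_sketch {
    /* NULL in the state the report's debug lines describe. */
    struct rtp_engine_sketch *engine;
};

/*
 * Check the engine pointer before dispatching through it. Without such a
 * check, instance->engine->write(...) would dereference a NULL pointer.
 * Returns -1 when the call cannot be dispatched.
 */
int guarded_write(struct rtp_instance_sketch *instance, const char *frame)
{
    if (!instance) {
        return -1;
    }
    if (!instance->engine) {
        fprintf(stderr, "NULL INSTANCE ENGINE for RTP instance '%p'\n",
                (void *)instance);
        return -1;
    }
    return instance->engine->write(instance, frame);
}

/* Trivial callback used to exercise the non-NULL path. */
int sketch_write_ok(struct rtp_instance_sketch *instance, const char *frame)
{
    (void)instance;
    (void)frame;
    return 0;
}
```

A guard like this only avoids the immediate NULL dereference; it does not explain why the engine pointer is already NULL while other threads still use the instance, which is the question the report raises.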

