[asterisk-bugs] [JIRA] (ASTERISK-25127) DTLS crashes following "Unable to cancel schedule ID" in dtls_srtp_check_pending

Dade Brandon (JIRA) noreply at issues.asterisk.org
Mon Jul 6 17:38:33 CDT 2015


    [ https://issues.asterisk.org/jira/browse/ASTERISK-25127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=226806#comment-226806 ] 

Dade Brandon commented on ASTERISK-25127:
-----------------------------------------

Deadlocked threads, in case it's relevant. Multiple threads show the following backtrace. This is after adding "if (!dtls->ssl) return". Line numbers here may be off, as this layer of testing includes our usual patches (which have a lot of added debug statements and some skips in SDP parsing for private IP ranges).
Thread 37 (Thread 0x7fa983721700 (LWP 17717)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fa9f26df657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fa9f26df480 in __GI___pthread_mutex_lock (mutex=0x7fa9540644f0) at ../nptl/pthread_mutex_lock.c:79
#3  0x00000000005033f1 in __ast_pthread_mutex_lock (filename=0x7fa99a7432ab "res_rtp_asterisk.c", lineno=1862, func=0x7fa99a745f30 <__PRETTY_FUNCTION__.29879> "dtls_srtp_stop_timeout_timer", mutex_name=0x7fa99a74350c "&dtls->lock",
    t=0x7fa9540644f0) at lock.c:313
#4  0x00007fa99a701516 in dtls_srtp_stop_timeout_timer (instance=0x7fa954070978, rtp=0x7fa954075b80, rtcp=1) at res_rtp_asterisk.c:1862
#5  0x00007fa99a7004db in ast_rtp_dtls_stop (instance=0x7fa954070978) at res_rtp_asterisk.c:1402
#6  0x00007fa99a70396b in ast_rtp_destroy (instance=0x7fa954070978) at res_rtp_asterisk.c:2554
#7  0x00000000005505fd in instance_destructor (obj=0x7fa954070978) at rtp_engine.c:217
#8  0x000000000044d1e5 in internal_ao2_ref (user_data=0x7fa954070978, delta=-1, file=0x5be3db "astobj2.c", line=551, func=0x5be691 <__FUNCTION__.8503> "__ao2_ref") at astobj2.c:469
#9  0x000000000044d4ed in __ao2_ref (user_data=0x7fa954070978, delta=-1) at astobj2.c:551
#10 0x00007fa99a7015d2 in dtls_srtp_stop_timeout_timer (instance=0x7fa954070978, rtp=0x7fa954075b80, rtcp=1) at res_rtp_asterisk.c:1863
#11 0x00007fa99a70f4e5 in ast_rtp_stop (instance=0x7fa954070978) at res_rtp_asterisk.c:4854
#12 0x00000000005538f4 in ast_rtp_instance_stop (instance=0x7fa954070978) at rtp_engine.c:1182
#13 0x00007fa9ab08bbc0 in stop_media_flows (p=0x7fa954065748) at chan_sip.c:23926
#14 0x00007fa9ab03492d in sip_hangup (ast=0x7fa954055cf8) at chan_sip.c:6942
#15 0x0000000000479500 in ast_hangup (chan=0x7fa954055cf8) at channel.c:2842
#16 0x000000000053b3a8 in __ast_pbx_run (c=0x7fa954055cf8, args=0x0) at pbx.c:6813
#17 0x000000000053b74c in pbx_thread (data=0x7fa954055cf8) at pbx.c:6905
#18 0x000000000059b45b in dummy_start (data=0x7fa9540656d0) at utils.c:1223
#19 0x00007fa9f26dd182 in start_thread (arg=0x7fa983721700) at pthread_create.c:312
#20 0x00007fa9f384a47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
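
Reading frames #10 down to #4 of this trace: dtls_srtp_stop_timeout_timer (line 1863) drops a reference via __ao2_ref, that runs instance_destructor, and ast_rtp_destroy -> ast_rtp_dtls_stop ends up back in dtls_srtp_stop_timeout_timer trying to take &dtls->lock (line 1862). The sketch below is a toy model of that re-entry, not Asterisk code: every name in it (model_instance, model_stop_timer, and so on) is hypothetical, the "lock" is only a depth counter, and it assumes the outer frame still holds the lock when the reference is dropped, which the trace alone does not prove. It only illustrates why dropping the reference after unlocking avoids re-taking the lock while it is held.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an RTP instance; NOT an Asterisk structure. */
struct model_instance {
    int refs;                 /* reference count */
    int timer_id;             /* -1 when no timer is scheduled */
    int lock_depth;           /* how many times the toy "lock" is currently taken */
    bool relocked_while_held; /* set if the lock was taken while already held */
    bool unref_after_unlock;  /* which ordering the model should use */
};

static void model_stop_timer(struct model_instance *m);

static void model_lock(struct model_instance *m)
{
    if (m->lock_depth > 0) {
        /* With a non-recursive mutex this is where a thread would block on itself. */
        m->relocked_while_held = true;
    }
    m->lock_depth++;
}

static void model_unlock(struct model_instance *m)
{
    assert(m->lock_depth > 0);
    m->lock_depth--;
}

/* Dropping the last reference runs the "destructor", which stops the timer again. */
static void model_unref(struct model_instance *m)
{
    assert(m->refs > 0);
    if (--m->refs == 0) {
        model_stop_timer(m);
    }
}

static void model_stop_timer(struct model_instance *m)
{
    bool drop_ref = false;

    model_lock(m);
    if (m->timer_id > -1) {
        m->timer_id = -1;
        if (m->unref_after_unlock) {
            drop_ref = true;   /* remember to unref once the lock is released */
        } else {
            model_unref(m);    /* unref while the lock is still held */
        }
    }
    model_unlock(m);

    if (drop_ref) {
        model_unref(m);
    }
}

/* Returns true if the lock was re-taken while it was already held. */
static bool model_run(bool unref_after_unlock)
{
    struct model_instance m = {
        .refs = 1,
        .timer_id = 7,
        .lock_depth = 0,
        .relocked_while_held = false,
        .unref_after_unlock = unref_after_unlock,
    };

    model_stop_timer(&m);
    assert(m.refs == 0 && m.lock_depth == 0);
    return m.relocked_while_held;
}
```

model_run(false) reports the re-lock and model_run(true) does not. Whether that reordering is the right change for res_rtp_asterisk.c is for someone who knows the locking there to judge.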



> DTLS crashes following "Unable to cancel schedule ID" in dtls_srtp_check_pending
> --------------------------------------------------------------------------------
>
>                 Key: ASTERISK-25127
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-25127
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Resources/res_rtp_asterisk
>    Affects Versions: 11.18.0
>         Environment: Linux kernel "3.13.0-24-generic"
> Ubuntu 14.04,
> Asterisk 11.18.0-rc1, 
> Compiler flags: DONT_OPTIMIZE, LOADABLE_MODULES, BETTER_BACKTRACES, BUILD_NATIVE, G711_NEW_ALGORITHM
> Openssl: 1.0.1f-1ubuntu2.11
> libuuid1: 2.20.1-5.1ubuntu20.4
> SIP Realtime: Module loaded & not in use
> Timer: res_timing_timerfd    (res_timing_pthread also loaded)
> See attached 'environment.txt' for output of 'core show settings' and 'module show'
> All calls are:
>    [Peer] <->ch1<-> Asterisk <->ch2<-> [Misc ITSPs]
> ch1 transport is always SIP over WSS using sipjs on Chrome (stable/M42/M43 and/or Canary M45) with ulaw codec.  Peer is almost always NATd, Asterisk is never NATd,
> ch2 transport is always Plain Old SIP (5060, no TLS) with RTP (no [d]TLS)
>            Reporter: Dade Brandon
>            Assignee: Joshua Colp
>            Severity: Critical
>         Attachments: backtrace latest patch.txt, core-1.txt, core-2.txt, core-all_thread_bt_jun10.txt, core-fullbt-jun22#14.txt, core-fullbt-jun24#15, core-jun10#2.txt, debug.21.txt, debug-crash_jun10.txt, debug-jun10#2.txt, debug-jun22#14.txt, debug-jun24#15, environment.txt
>
>
> h2. Preface
> First, I am very familiar with the other DTLS crash issues on Jira. I believe that if this is related, it's probably a precursor to the crashes that produce the later segfaults: since upgrading to trunk I haven't had a core dump, but have continued to experience crashes (asterisk restarting via safe_asterisk, signal unknown since there's no core).  This is likely better suited as a parent issue to some of the other DTLS crash issues.
> We get about 5-10 crashes per production day (across 35 servers) and I only did the latest update Friday evening, so I will probably know whether there are useful core dumps by the end of Tuesday, due to the US holiday on Monday.  We service businesses, so the volume is extremely low right now due to the long weekend in the US market.
> Also, the crashes seem to target certain servers, and there appears to be a correlation between the affected servers and the latency of the peers connected to them.
> h2. Details:
> Preceding an asterisk crash, we receive "Unable to cancel schedule ID nnnnnn.   This is probably a bug (res_rtp_asterisk.c: dtls_srtp_check_pending, line NNNN)"
> (The line number is 1811 on trunk; we have other patches applied above it, mostly logging related, which make our line numbers less relevant.)  The line is "AST_SCHED_DEL_UNREF(rtp->sched, rtp->dtlstimerid, ao2_ref(instance, -1));"
> Asterisk does not die immediately after.  In the messages log, anywhere from 2 seconds to a full minute passes before each crash.  Note the timing in the example debug logs below.
> On servers where we've added extra logging, we find, for each of these occurrences, output from an ast_debug call we inserted in main/rtp_engine.c, for the same thread and call, indicating that the RTP instance->engine is NULL:
> h3. Example 1
> {noformat}
> [14:55:16] WARNING[11973][C-0000e496] res_rtp_asterisk.c: Unable to cancel schedule ID 539829.  This is probably a bug (res_rtp_asterisk.c: dtls_srtp_check_pending, line 1834).
> [14:55:55] DEBUG[11973][C-0000e496] rtp_engine.c: XWSDEBUG4.2 ast_rtp_instance_set_remote_address-- NULL INSTANCE ENGINE for RTP instance '0x7f7c50076da8'
>     - this debug message is placed in ast_rtp_instance_set_remote_address, after ast_sockaddr_copy, and is called if (!instance->engine).
> {noformat}
> h3. Example 2
> {noformat}
> [12:18:54] WARNING[10203][C-0000a76c] res_rtp_asterisk.c: Unable to cancel schedule ID 436177.  This is probably a bug (res_rtp_asterisk.c: dtls_srtp_check_pending, line 1834).
> [12:18:59] DEBUG[23236][C-0000a76c] rtp_engine.c: XWSDEBUG4.2 ast_rtp_instance_set_remote_address-- NULL INSTANCE ENGINE for RTP instance '0x7f6492f03a98'
>     - this debug message is placed in ast_rtp_instance_set_remote_address, after ast_sockaddr_copy, and is called if (!instance->engine).
> [12:18:59] DEBUG[10203][C-0000a76c] rtp_engine.c: XWSDEBUG1.2 ast_rtp_instance_write-- NULL INSTANCE ENGINE for RTP instance '0x7f6492f03a98'
>    - this debug line asserts (instance && !instance->engine) before instance->engine->write(instance, frame) 
> [12:18:59] VERBOSE[11061][C-0000a76c] app_verbose.c: Caller party hung up MinDur running on SIP/Pir1-0000fadf -- Answered time :atime 1431112732:talk 7 -- Route 7503 -- MinDur request 0
> [12:18:59] DEBUG[10203][C-0000a76c] rtp_engine.c: XWSDEBUG29.2 -- NULL INSTANCE ENGINE for Instance '0x7f6492f03a98'
>   - this asserts (instance && !instance->engine) before instance->engine->stop(instance)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


