[asterisk-bugs] [JIRA] (ASTERISK-24832) DTLS-crashes within openssl

Stefan Engström (JIRA) noreply at issues.asterisk.org
Fri Feb 27 15:03:34 CST 2015


     [ https://issues.asterisk.org/jira/browse/ASTERISK-24832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stefan Engström updated ASTERISK-24832:
---------------------------------------

    Description: 
I'm using four chan_sip peers with transport WSS. They all use Chrome with SIPml5 WebRTC. Two of them call a queue and the other two answer. Every 100-1000 calls or so, Asterisk crashes with a segmentation fault or an abort signal inside OpenSSL.

Since it's load-related, it's hard to provide enough information, but I'll try to add more as I go.

The first thing I noticed was that dtls_perform_handshake was called too many times, but that was fixed with ASTERISK-24830.

I have no prior experience with OpenSSL and little experience with Asterisk and C, so debugging is challenging.

From code inspection and log tracing, it looks like the crashes mostly occur for dtls->ssl instances where Asterisk has the server role (SSL_set_accept_state(dtls->ssl) has been called).

I'm not sure how to debug further, other than somehow logging all calls to libssl and checking whether any are made out of order just before a crash.

EDIT - the last core dump, crash3, seems to prove a concurrency issue:

Thread 5, leaving Asterisk code at dtls_perform_handshake, is performing ssl3_clear on the same SSL struct that thread 1 passes to ssl_read from __rtp_recvfrom.
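
One common way to rule out this kind of overlap is to take a per-session lock around every call that touches the shared SSL object. The following is only a minimal sketch of that idea, not Asterisk's code: struct dtls_session and the session_* functions are hypothetical names, and plain counters stand in for the real OpenSSL calls.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-stream DTLS state. In Asterisk the real
 * structure holds an OpenSSL SSL pointer; here plain counters model it. */
struct dtls_session {
    pthread_mutex_t lock; /* guards every field below */
    int busy;             /* nonzero while a caller is inside the session */
    int overlaps;         /* entries made while another caller was inside */
    int handshakes;       /* completed handshake-path calls */
    int reads;            /* completed read-path calls */
};

void session_init(struct dtls_session *s)
{
    pthread_mutex_init(&s->lock, NULL);
    s->busy = 0;
    s->overlaps = 0;
    s->handshakes = 0;
    s->reads = 0;
}

/* Mark the session as in use, recording any overlap with another caller. */
static void session_enter(struct dtls_session *s)
{
    if (s->busy) {
        s->overlaps++;
    }
    s->busy = 1;
}

/* Mark the session as free again. */
static void session_leave(struct dtls_session *s)
{
    assert(s->busy);
    s->busy = 0;
}

/* Models the handshake path (dtls_perform_handshake in the report). */
void session_handshake(struct dtls_session *s)
{
    pthread_mutex_lock(&s->lock);
    session_enter(s);
    s->handshakes++;      /* the handshake call on the SSL object would go here */
    session_leave(s);
    pthread_mutex_unlock(&s->lock);
}

/* Models the receive path (__rtp_recvfrom in the report). */
void session_read(struct dtls_session *s)
{
    pthread_mutex_lock(&s->lock);
    session_enter(s);
    s->reads++;           /* the read call on the SSL object would go here */
    session_leave(s);
    pthread_mutex_unlock(&s->lock);
}
```

Because both paths take the same lock, the overlaps counter can only stay at zero; without the lock, two threads could be inside the session at once, which is the situation crash3 appears to show.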


EDIT2 - I have about 20 other, older core dumps from before I compiled with malloc_debug. Each is slightly different, so there may be several independent issues causing crashes, but one common failure is this assertion:
OpenSSLDie (file=file@entry=0x7f19a8915db8 "d1_both.c", line=line@entry=1210, assertion=assertion@entry=0x7f19a8915ee0 "s->d1->w_msg_hdr.msg_len + DTLS1_HM_HEADER_LENGTH == (unsigned int)s->init_num") at cryptlib.c:919
This looks to me like a timing issue, i.e. some other thread has written to s->d1 or s->init_num for reasons unknown.

EDIT3: I'm curious about the behavior of ast_rtp_on_ice_complete() {
...
        dtls_perform_handshake(instance, &rtp->dtls, 0);
        if (rtp->rtcp) {
                dtls_perform_handshake(instance, &rtp->rtcp->dtls, 1);
        }
...
}
chan_sip seems to call process_sdp -> process_sdp_a_dtls -> res_rtp_asterisk::dtls_set_setup, which ultimately calls SSL_set_connect_state(ssl) or SSL_set_accept_state(ssl) on both SSL sessions (RTP and RTCP). But this races with the call to dtls_perform_handshake(instance, &rtp->dtls, 0) from ast_rtp_on_ice_complete. I'm not sure whether this is a problem, but in my latest crash, crash4, ast_rtp_on_ice_complete fired before dtls_set_setup, which I have never seen during calls that did not crash.
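
If the ordering of these two events does matter, one possible way to make it irrelevant is to start the handshake only after both have happened, whichever comes first. This is a sketch under that assumption and not what Asterisk does: struct handshake_gate and the gate_* functions are hypothetical, and a counter stands in for the actual handshake call.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-stream flags; none of these names come from Asterisk. */
struct handshake_gate {
    bool ice_complete;  /* ICE negotiation has finished */
    bool role_set;      /* connect/accept state has been chosen */
    int starts;         /* number of times the handshake was started */
};

/* Start the handshake once both preconditions hold, and only once. */
static void gate_try_start(struct handshake_gate *g)
{
    if (g->ice_complete && g->role_set && g->starts == 0) {
        g->starts++;    /* the handshake call would go here */
        assert(g->starts == 1);
    }
}

/* Called when ICE completes (ast_rtp_on_ice_complete in the report). */
void gate_on_ice_complete(struct handshake_gate *g)
{
    g->ice_complete = true;
    gate_try_start(g);
}

/* Called when the role is chosen (dtls_set_setup in the report). */
void gate_on_role_set(struct handshake_gate *g)
{
    g->role_set = true;
    gate_try_start(g);
}
```

In a multithreaded caller the two event functions would also need to run under a shared lock; the sketch leaves that out to keep the ordering logic visible.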


EDIT4:
Why is SSL_do_handshake(dtls->ssl) called at all if we are passive?
I added a debug check in dtls_perform_handshake() so that SSL_do_handshake is called only if we are not passive (dtls->dtls_setup != AST_RTP_DTLS_SETUP_PASSIVE), and it seems to do no harm.
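
The shape of that debug check can be sketched as follows. This is an illustration only, not the attached patch: the enum, should_start_handshake and perform_handshake are local stand-ins, and a counter replaces the real SSL_do_handshake call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-in for the DTLS setup roles; Asterisk's own enum is not
 * reproduced here. */
enum dtls_setup_role {
    SETUP_ACTIVE,
    SETUP_PASSIVE,
    SETUP_ACTPASS
};

/* Return true when this side should initiate the handshake. */
bool should_start_handshake(enum dtls_setup_role role)
{
    return role != SETUP_PASSIVE;
}

/* Returns 1 if a handshake was started, 0 if it was skipped. */
int perform_handshake(enum dtls_setup_role role, int *handshake_calls)
{
    assert(handshake_calls != NULL);
    if (!should_start_handshake(role)) {
        return 0;          /* passive side waits for the peer to begin */
    }
    (*handshake_calls)++;  /* the SSL_do_handshake() call would go here */
    return 1;
}
```

With this guard, a passive session never initiates the handshake itself and only responds once the peer's first message arrives.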



Possibly related to ASTERISK-24651
Requires patch from ASTERISK-24711













  was:
I'm using 4 chan sip peers with transport WSS. They all use Chrome SIPml5 webrtc. 2 of them call a queue and the other 2 answer. Every 100-1000 calls or so, asterisk gets a crash due to segmentation fault or abort signal within openssl.

Since it's load-related it's hard to provide enough information but ill try add more continuously.

First thing i noticed was  that dtls_perform_handshake was called too many times but that was fixed with ASTERISK-24830 

I have no prior experience of using openssl and little experience of asterisk and C, so debugging is challenging.

By code inspection and tracing logs; it looks like the crashes only occur for dtls->ssl instances where asterisk has role: server, (SSL_set_accept_state(dtls->ssl) has been called.) 

I'm not sure how to debug further other than trying to somehow log all calls to libssl and see if any calls are out of order just before crash?

EDIT - the last coredump crash3 seems to prove a concurrency issue:

thread 5 leaving asterisk code at dtls_perform_handshake is performing ssl3_clear on the same ssl struct as that which is sent to ssl_read from __rtp_recvfrom in thread 1


EDIT2 - I have about 20 other old core dumps from before i compiled with malloc_debug - each is slightly different, so possibly there are many independent issues resulting in crashes,  but one common failure is the assertion:
OpenSSLDie (file=file@entry=0x7f19a8915db8 "d1_both.c", line=line@entry=1210, assertion=assertion@entry=0x7f19a8915ee0 "s->d1->w_msg_hdr.msg_len + DTLS1_HM_HEADER_LENGTH == (unsigned int)s->init_num") at cryptlib.c:919, this looks to me like a timing thing, i.e. that some other thread has written to s->d1 or s->init_num for reasons unknown... 

EDIT3: Im curious about the behavior of ast_rtp_on_ice_complete() {
...
        dtls_perform_handshake(instance, &rtp->dtls, 0);
        if (rtp->rtcp) {
                dtls_perform_handshake(instance, &rtp->rtcp->dtls, 1);
        }
...
}
chan_sip seems to call process_sdp -> process_sdp_a_dtls -> res_asterisk_rtp::dtls_set_setup which ulimately sets SSL_set_connect_state(ssl) OR SSL_set_accept_state(ssl) on both (RTP+RTCP) ssl sessions. But this races with the firing of  dtls_perform_handshake(instance, &rtp->dtls, 0); from ast_rtp_on_ice_complete. I'm not sure if this is a problem but in my last crash crash4 the ast_ice_on_ice_complete fired before dtls_set_setup which i have never noticed during non-crash-calls,


EDIT4: 
why is SSL_do_handshake(dtls->ssl) called at all if we are passive? 
i added a debug-check dtls_perform_handshake() {...} to only SSL_do_handshake if we are not passive, and it seems to do no harm. (dtls->dtls_setup != AST_RTP_DTLS_SETUP_PASSIVE)



Possibly related to ASTERISK-24651
Requires patch from ASTERISK-24711














> DTLS-crashes within openssl 
> ----------------------------
>
>                 Key: ASTERISK-24832
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-24832
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Resources/res_rtp_asterisk
>    Affects Versions: 13.1.0
>         Environment: Fedora 20 x86_64, openssl-1.0.1e-41.fc20.x86_64, Asterisk 13.1.0, Chrome SIPML5 chan_sip peers with transport WSS
>            Reporter: Stefan Engström
>            Assignee: Stefan Engström
>         Attachments: crash1.txt, crash2.txt, crash3.txt, crash4.txt, CUSTOMERRORDEBUGLOG, SIPCONF.txt, TESTDTLS.patch.workingcopy



--
This message was sent by Atlassian JIRA
(v6.2#6252)


