[asterisk-bugs] [JIRA] (ASTERISK-24832) DTLS-crashes within openssl

Fri Feb 27 09:35:34 CST 2015

     [ https://issues.asterisk.org/jira/browse/ASTERISK-24832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stefan Engström updated ASTERISK-24832:
---------------------------------------

    Description: 
I'm using four chan_sip peers with transport WSS. They all use Chrome with SIPml5 WebRTC. Two of them call a queue and the other two answer. Every 100-1000 calls or so, Asterisk crashes with a segmentation fault or an abort signal inside OpenSSL.

Since it's load-related, it's hard to provide enough information, but I'll try to add more as I go.

The first thing I noticed was that dtls_perform_handshake was called too many times, but that was fixed with https://issues.asterisk.org/jira/browse/ASTERISK-24830

I have no prior experience with OpenSSL and little experience with Asterisk and C, so debugging is challenging.

From code inspection and tracing logs, it looks like the crashes only occur for dtls->ssl instances where Asterisk has the server role (that is, SSL_set_accept_state(dtls->ssl) has been called).

I'm not sure how to debug further, other than somehow logging all calls to libssl and checking whether any calls are out of order just before the crash.
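One way the call logging mentioned above might be approached is a small in-memory ring buffer that records which libssl entry point was called, on which SSL pointer, and from which thread. This is only a sketch: trace_call, trace_dump, TRACED and the ring size are hypothetical names made up for illustration and are not part of Asterisk or OpenSSL, and printing pthread_t as an unsigned long assumes Linux/glibc.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

#define TRACE_SLOTS 64 /* hypothetical ring size */

/* One recorded call: wrapper name, SSL pointer, calling thread. */
struct trace_entry {
    const char *func;
    const void *ssl;
    pthread_t thread;
};

static struct trace_entry trace_ring[TRACE_SLOTS];
static unsigned long trace_count;
static pthread_mutex_t trace_lock = PTHREAD_MUTEX_INITIALIZER;

/* Record a call; the oldest entry is overwritten once the ring is full. */
void trace_call(const char *func, const void *ssl)
{
    assert(func != NULL);
    pthread_mutex_lock(&trace_lock);
    struct trace_entry *e = &trace_ring[trace_count % TRACE_SLOTS];
    e->func = func;
    e->ssl = ssl;
    e->thread = pthread_self();
    trace_count++;
    pthread_mutex_unlock(&trace_lock);
}

/* Print the surviving entries, oldest first. */
void trace_dump(FILE *out)
{
    pthread_mutex_lock(&trace_lock);
    unsigned long start = trace_count > TRACE_SLOTS ? trace_count - TRACE_SLOTS : 0;
    for (unsigned long i = start; i < trace_count; i++) {
        const struct trace_entry *e = &trace_ring[i % TRACE_SLOTS];
        /* Assumption: pthread_t is an integer type, as on Linux/glibc. */
        fprintf(out, "%lu %s ssl=%p thread=%lu\n", i, e->func,
                (void *) e->ssl, (unsigned long) e->thread);
    }
    pthread_mutex_unlock(&trace_lock);
}

/* Wrap a call site: records the call, then evaluates the real expression. */
#define TRACED(name, ssl, expr) (trace_call(#name, (ssl)), (expr))
```

A call site would then look something like TRACED(SSL_read, ssl, SSL_read(ssl, buf, len)). Two different thread ids interleaved on the same ssl pointer in the dump would be a strong hint of unsynchronized access.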

EDIT - the latest core dump, crash3, seems to prove a concurrency issue:

Thread 5, having left Asterisk code at dtls_perform_handshake, is running ssl3_clear on the same ssl struct that thread 1 passes to ssl_read from __rtp_recvfrom.
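To illustrate the kind of serialization that would rule out this race, here is a minimal sketch in which a handshake-side thread and a receive-side thread share one session object and both take a per-session lock. Everything in it is a stand-in: fake_session, HEADER_LEN and the thread functions are hypothetical and do not use the real OpenSSL structures. It is not the actual fix, only the pattern.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define ITERATIONS 100000
#define HEADER_LEN 12 /* stand-in constant for the handshake header length */

/* Stand-in for the state behind dtls->ssl. */
struct fake_session {
    pthread_mutex_t lock;
    long msg_len;
    long init_num;   /* invariant: init_num == msg_len + HEADER_LEN */
    long violations; /* times the reader saw a torn update */
};

/* Handshake path: rewrites both fields together while holding the lock. */
static void *handshake_thread(void *arg)
{
    struct fake_session *s = arg;
    for (long i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&s->lock);
        s->msg_len = i;
        s->init_num = i + HEADER_LEN;
        pthread_mutex_unlock(&s->lock);
    }
    return NULL;
}

/* Receive path: checks the invariant while holding the same lock. */
static void *receive_thread(void *arg)
{
    struct fake_session *s = arg;
    for (long i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&s->lock);
        if (s->init_num != s->msg_len + HEADER_LEN) {
            s->violations++;
        }
        pthread_mutex_unlock(&s->lock);
    }
    return NULL;
}

/* Runs both threads against one session; returns the violation count. */
long run_demo(void)
{
    struct fake_session s = { .msg_len = 0, .init_num = HEADER_LEN, .violations = 0 };
    pthread_t a, b;
    int rc;

    rc = pthread_mutex_init(&s.lock, NULL);
    assert(rc == 0);
    rc = pthread_create(&a, NULL, handshake_thread, &s);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, receive_thread, &s);
    assert(rc == 0);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    pthread_mutex_destroy(&s.lock);
    (void) rc;
    return s.violations;
}
```

With the lock in place the reader never observes a half-updated pair. Without it, the reader could observe init_num and msg_len from different iterations.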

EDIT2 - I have about 20 older core dumps from before I compiled with malloc_debug. Each is slightly different, so there may be several independent issues causing crashes, but one common failure is the assertion:
OpenSSLDie (file=file@entry=0x7f19a8915db8 "d1_both.c", line=line@entry=1210, assertion=assertion@entry=0x7f19a8915ee0 "s->d1->w_msg_hdr.msg_len + DTLS1_HM_HEADER_LENGTH == (unsigned int)s->init_num") at cryptlib.c:919
This looks to me like a timing issue, i.e. some other thread has written to s->d1 or s->init_num for reasons unknown.

EDIT3: I'm curious about the behavior of ast_rtp_on_ice_complete() {
...
        //This function seems to fire at unpredictable times, sometimes before chan_sip has called dtls_set_setup (dangerous?)
        dtls_perform_handshake(instance, &rtp->dtls, 0);
        if (rtp->rtcp) {
                dtls_perform_handshake(instance, &rtp->rtcp->dtls, 1);
        }
...
}
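The ordering concern could be handled by starting the handshake only once both conditions are known, whichever arrives last. A minimal sketch follows, assuming a simplified state struct. fake_dtls, setup_state, on_ice_complete and on_setup_known are hypothetical names for illustration, not the real res_rtp_asterisk code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical DTLS role state; SETUP_UNKNOWN means the role is not set yet. */
enum setup_state { SETUP_UNKNOWN, SETUP_ACTIVE, SETUP_PASSIVE };

struct fake_dtls {
    enum setup_state setup;
    bool ice_done;
    int handshakes_started; /* counts how often the handshake would begin */
};

/* Start the handshake only when ICE is done AND the role is known, and only once. */
static void maybe_start_handshake(struct fake_dtls *d)
{
    if (d->ice_done && d->setup != SETUP_UNKNOWN && d->handshakes_started == 0) {
        d->handshakes_started++; /* the real handshake call would go here */
    }
}

/* Called when ICE negotiation completes, possibly before the role is known. */
void on_ice_complete(struct fake_dtls *d)
{
    d->ice_done = true;
    maybe_start_handshake(d);
}

/* Called when signalling has decided the DTLS role. */
void on_setup_known(struct fake_dtls *d, enum setup_state s)
{
    assert(s != SETUP_UNKNOWN);
    d->setup = s;
    maybe_start_handshake(d);
}
```

Either callback may run first, and the handshake begins exactly once after the second one. This sketch is not thread-safe by itself, so in a multi-threaded setting both callbacks would need to run under the same lock.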

Possibly related to ASTERISK-24651
Requires patch from ASTERISK-24711

  was:
I'm using four chan_sip peers with transport WSS. They all use Chrome with SIPml5 WebRTC. Two of them call a queue and the other two answer. Every 100-1000 calls or so, Asterisk crashes with a segmentation fault or an abort signal inside OpenSSL.

Since it's load-related, it's hard to provide enough information, but I'll try to add more as I go.

The first thing I noticed was that dtls_perform_handshake was called too many times, but that was fixed with https://issues.asterisk.org/jira/browse/ASTERISK-24830

I have no prior experience with OpenSSL and little experience with Asterisk and C, so debugging is challenging.

From code inspection and tracing logs, it looks like the crashes only occur for dtls->ssl instances where Asterisk has the server role (that is, SSL_set_accept_state(dtls->ssl) has been called).

I'm not sure how to debug further, other than somehow logging all calls to libssl and checking whether any calls are out of order just before the crash.

EDIT - the latest core dump, crash3, seems to prove a concurrency issue:

Thread 5, having left Asterisk code at dtls_perform_handshake, is running ssl3_clear on the same ssl struct that thread 1 passes to ssl_read from __rtp_recvfrom.

EDIT2 - I have about 20 older core dumps from before I compiled with malloc_debug. Each is slightly different, so there may be several independent issues causing crashes, but one common failure is the assertion:
OpenSSLDie (file=file@entry=0x7f19a8915db8 "d1_both.c", line=line@entry=1210, assertion=assertion@entry=0x7f19a8915ee0 "s->d1->w_msg_hdr.msg_len + DTLS1_HM_HEADER_LENGTH == (unsigned int)s->init_num") at cryptlib.c:919
This looks to me like a timing issue, i.e. some other thread has written to s->d1 or s->init_num for reasons unknown.

EDIT3: I'm curious about the behavior of ast_rtp_on_ice_complete() {
...
       //When this function fires, we may
        dtls_perform_handshake(instance, &rtp->dtls, 0);
        if (rtp->rtcp) {
                dtls_perform_handshake(instance, &rtp->rtcp->dtls, 1);
        }
...
}

Possibly related to ASTERISK-24651
Requires patch from ASTERISK-24711

> DTLS-crashes within openssl 
> ----------------------------
>
>                 Key: ASTERISK-24832
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-24832
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Resources/res_rtp_asterisk
>    Affects Versions: 13.1.0
>         Environment: Fedora 20 x86_64, openssl-1.0.1e-41.fc20.x86_64, Asterisk 13.1.0, Chrome SIPML5 chan_sip peers with transport WSS
>            Reporter: Stefan Engström
>            Assignee: Stefan Engström
>         Attachments: crash1.txt, crash2.txt, crash3.txt, CUSTOMERRORDEBUGLOG, SIPCONF.txt, TESTDTLS.patch.workingcopy
