[asterisk-bugs] [JIRA] (ASTERISK-24832) DTLS-crashes within openssl

Sat Feb 28 14:59:34 CST 2015

     [ https://issues.asterisk.org/jira/browse/ASTERISK-24832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stefan Engström updated ASTERISK-24832:
---------------------------------------

    Description: 
I'm using 4 chan_sip peers with transport WSS. They all use Chrome with SIPml5 WebRTC. Two of them call a queue and the other two answer. Every 100-1000 calls or so, Asterisk crashes with a segmentation fault or an abort signal inside OpenSSL.

Since it's load-related, it's hard to provide enough information, but I'll keep adding more as I find it.

ISSUE-0
The first thing I noticed was that dtls_perform_handshake was called too many times, but that was fixed by compensating for ASTERISK-24830.

From code inspection and tracing logs, it looks like the crashes mostly occur for dtls->ssl instances where Asterisk has the server role (SSL_set_accept_state(dtls->ssl) has been called).

EDIT -- This JIRA issue is getting a little bigger. There seem to be many sub-problems, all related to DTLS. Not all sub-issues below may be real issues; some are just me asking questions about the code. I'd be happy if a developer took a look and answered the questions or discussed some of the issues and possible fixes.

ISSUE-1 - crash3 seems to prove a concurrency issue:

Thread 5, having left Asterisk code at dtls_perform_handshake, is performing ssl3_clear on the same ssl struct that is passed to ssl_read from __rtp_recvfrom in thread 1.
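
To illustrate what ISSUE-1 is about, here is a minimal sketch of serializing a reset path and a read path on one shared session object with a mutex. Everything named mock_* is hypothetical and made up for this example; it is not the OpenSSL SSL type and not the actual res_rtp_asterisk code. The two counters only stand in for internal state that must never be seen half-updated.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for one DTLS session (NOT the OpenSSL SSL type).
 * epoch_a and epoch_b model internal fields that must always match. */
struct mock_session {
    pthread_mutex_t lock;
    unsigned int epoch_a;
    unsigned int epoch_b;
};

void mock_session_init(struct mock_session *s)
{
    assert(s != NULL);
    pthread_mutex_init(&s->lock, NULL);
    s->epoch_a = 0;
    s->epoch_b = 0;
}

/* Models the reset done from the handshake path (thread 5 in crash3). */
void mock_session_reset(struct mock_session *s)
{
    assert(s != NULL);
    pthread_mutex_lock(&s->lock);
    s->epoch_a++;   /* without the lock, a reader could run between */
    s->epoch_b++;   /* these two updates and see mismatched state   */
    pthread_mutex_unlock(&s->lock);
}

/* Models the read path (thread 1 in crash3).
 * Returns 1 if the state it observed was consistent, 0 otherwise. */
int mock_session_read(struct mock_session *s)
{
    int consistent;

    assert(s != NULL);
    pthread_mutex_lock(&s->lock);
    consistent = (s->epoch_a == s->epoch_b);
    pthread_mutex_unlock(&s->lock);
    return consistent;
}

void mock_session_destroy(struct mock_session *s)
{
    assert(s != NULL);
    pthread_mutex_destroy(&s->lock);
}
```

If both threads only ever touch the session through functions like these, the reset and the read cannot overlap. Whether the real code paths already hold a common lock at these points is exactly what would need checking.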

ISSUE-2: I'm curious about the behavior of ast_rtp_on_ice_complete() {
...
        dtls_perform_handshake(instance, &rtp->dtls, 0);
        if (rtp->rtcp) {
                dtls_perform_handshake(instance, &rtp->rtcp->dtls, 1);
        }
...
}
chan_sip seems to call process_sdp -> process_sdp_a_dtls -> res_rtp_asterisk::dtls_set_setup, which ultimately calls SSL_set_connect_state(ssl) or SSL_set_accept_state(ssl) on both (RTP and RTCP) ssl sessions. But this races with the call to dtls_perform_handshake(instance, &rtp->dtls, 0) from ast_rtp_on_ice_complete. I'm not sure if this is a problem, but in my last crash (crash4) ast_rtp_on_ice_complete fired before dtls_set_setup, which I have never seen during calls that did not crash.
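
One way to make the ordering in ISSUE-2 irrelevant would be to start the handshake from whichever of the two events arrives last. The sketch below shows only that idea; the mock_* names, the struct and the enum are hypothetical and do not exist in Asterisk or OpenSSL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical role values; MOCK_ROLE_UNSET means SDP has not been processed. */
enum mock_role { MOCK_ROLE_UNSET, MOCK_ROLE_CLIENT, MOCK_ROLE_SERVER };

/* Hypothetical per-session state, not the real Asterisk struct. */
struct mock_dtls {
    enum mock_role role;
    bool ice_complete;
    bool handshake_started;
};

/* Starts the handshake only when both preconditions hold, and only once.
 * Returns true if this call was the one that started it. */
bool mock_maybe_start_handshake(struct mock_dtls *d)
{
    assert(d != NULL);
    if (d->handshake_started || !d->ice_complete || d->role == MOCK_ROLE_UNSET) {
        return false;
    }
    d->handshake_started = true;
    return true;
}

/* Event: ICE negotiation finished. */
bool mock_on_ice_complete(struct mock_dtls *d)
{
    assert(d != NULL);
    d->ice_complete = true;
    return mock_maybe_start_handshake(d);
}

/* Event: the setup attribute from the SDP has been applied. */
bool mock_on_setup_negotiated(struct mock_dtls *d, enum mock_role role)
{
    assert(d != NULL);
    d->role = role;
    return mock_maybe_start_handshake(d);
}
```

With this shape, the crash4 ordering (ICE completes first) and the usual ordering both end with exactly one handshake start, after the role is known. In real code the two events come from different threads, so the state would also need a lock around it.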

ISSUE-3 
Why is SSL_do_handshake(dtls->ssl) called at all if we are passive?
I added a debug check in dtls_perform_handshake() {...} to call SSL_do_handshake only if we are not passive (dtls->dtls_setup != AST_RTP_DTLS_SETUP_PASSIVE), and it seems to do no harm.
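
The debug check from ISSUE-3 boils down to a predicate like the one below. This is a sketch with a hypothetical enum and function name (mock_*), mirroring only the condition quoted above; it is not the code from TESTDTLS.patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the setup values; names are made up for this sketch. */
enum mock_dtls_setup {
    MOCK_SETUP_ACTIVE,   /* we initiate the handshake */
    MOCK_SETUP_PASSIVE,  /* we wait for the peer to initiate */
    MOCK_SETUP_ACTPASS,  /* either side may initiate */
    MOCK_SETUP_HOLDCONN  /* keep the existing connection */
};

/* Returns true when our side would actively drive the handshake.
 * Only the passive case is excluded, as in the debug check. */
bool mock_should_initiate_handshake(enum mock_dtls_setup setup)
{
    assert(setup >= MOCK_SETUP_ACTIVE && setup <= MOCK_SETUP_HOLDCONN);
    return setup != MOCK_SETUP_PASSIVE;
}
```

The caller would wrap the handshake call in this test and otherwise just wait for the peer's first packet.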

ISSUE-4
Continuing from ISSUE-2: it's possible for SSL_is_init_finished(rtp->dtls.ssl) to be false when dtls_perform_handshake(instance, &rtp->dtls, 0) is called and, an instant later when dtls_perform_handshake(instance, &rtp->rtcp->dtls, 1) is called, for SSL_is_init_finished(rtp->rtcp->dtls.ssl) to be true. That causes a clear on rtp->rtcp->dtls.ssl but not on rtp->dtls.ssl. Could this lead to problems?

ISSUE-5
See crash5.txt. In __rtp_recvfrom (thread 4) we call dtls_srtp_setup if SSL_is_init_finished(dtls->ssl) is true for rtp->dtls; we don't seem to care whether SSL_is_init_finished(rtp->rtcp->dtls.ssl) is true too. This results in a call to ast_srtp_create, causing thread 1 to execute ast_srtp_protect, which fails for some reason. Q: Why is this? I attached a special log, crash5.extralog, showing some function calls up to 30 seconds before the crash; it was generated by the debug patch TESTDTLS.patch. According to the log, the call to dtls_srtp_setup started at 01:10:09 and had still not returned ~25 seconds later, at the time of the crash (the relevant instance was 0x7fdc00dc6958).
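
If the missing RTCP check in ISSUE-5 turns out to matter, the readiness test could look roughly like this. It is only a sketch of the condition, using hypothetical mock_* types in place of the real structs; I have not verified that this is the right fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one DTLS session and its handshake state. */
struct mock_session {
    bool init_finished;
};

/* Hypothetical stand-in for the RTP instance data. */
struct mock_rtp {
    struct mock_session rtp;    /* session for the RTP stream */
    struct mock_session *rtcp;  /* session for RTCP, NULL when RTCP is unused */
};

/* Returns true only when every session that exists has finished its
 * handshake, so that SRTP setup would not run against a half-open
 * RTCP session. */
bool mock_ready_for_srtp(const struct mock_rtp *r)
{
    assert(r != NULL);
    if (!r->rtp.init_finished) {
        return false;
    }
    if (r->rtcp != NULL && !r->rtcp->init_finished) {
        return false;
    }
    return true;
}
```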

Possibly related to ASTERISK-24651
Requires patch from ASTERISK-24711
Requires patch from ASTERISK-24830 (the obvious fix of replacing USE_PJPROJECT with HAVE_PJPROJECT...)

  was:
I'm using 4 chan_sip peers with transport WSS. They all use Chrome with SIPml5 WebRTC. Two of them call a queue and the other two answer. Every 100-1000 calls or so, Asterisk crashes with a segmentation fault or an abort signal inside OpenSSL.

Since it's load-related, it's hard to provide enough information, but I'll keep adding more as I find it.

ISSUE-0
The first thing I noticed was that dtls_perform_handshake was called too many times, but that was fixed by compensating for ASTERISK-24830.

From code inspection and tracing logs, it looks like the crashes mostly occur for dtls->ssl instances where Asterisk has the server role (SSL_set_accept_state(dtls->ssl) has been called).

EDIT -- This JIRA issue is getting a little bigger. There seem to be many sub-problems, all related to DTLS. Not all sub-issues below may be real issues; some are just me asking questions about the code. I'd be happy if a developer took a look and answered the questions or discussed some of the issues and possible fixes.

ISSUE-1 - crash3 seems to prove a concurrency issue:

Thread 5, having left Asterisk code at dtls_perform_handshake, is performing ssl3_clear on the same ssl struct that is passed to ssl_read from __rtp_recvfrom in thread 1.

ISSUE-2: I'm curious about the behavior of ast_rtp_on_ice_complete() {
...
        dtls_perform_handshake(instance, &rtp->dtls, 0);
        if (rtp->rtcp) {
                dtls_perform_handshake(instance, &rtp->rtcp->dtls, 1);
        }
...
}
chan_sip seems to call process_sdp -> process_sdp_a_dtls -> res_rtp_asterisk::dtls_set_setup, which ultimately calls SSL_set_connect_state(ssl) or SSL_set_accept_state(ssl) on both (RTP and RTCP) ssl sessions. But this races with the call to dtls_perform_handshake(instance, &rtp->dtls, 0) from ast_rtp_on_ice_complete. I'm not sure if this is a problem, but in my last crash (crash4) ast_rtp_on_ice_complete fired before dtls_set_setup, which I have never seen during calls that did not crash.

ISSUE-3 
Why is SSL_do_handshake(dtls->ssl) called at all if we are passive?
I added a debug check in dtls_perform_handshake() {...} to call SSL_do_handshake only if we are not passive (dtls->dtls_setup != AST_RTP_DTLS_SETUP_PASSIVE), and it seems to do no harm.

ISSUE-4
Continuing from ISSUE-2: it's possible for SSL_is_init_finished(rtp->dtls.ssl) to be false when dtls_perform_handshake(instance, &rtp->dtls, 0) is called and, an instant later when dtls_perform_handshake(instance, &rtp->rtcp->dtls, 1) is called, for SSL_is_init_finished(rtp->rtcp->dtls.ssl) to be true. That causes a clear on rtp->rtcp->dtls.ssl but not on rtp->dtls.ssl. Could this lead to problems?

ISSUE-5
See crash5.txt. In __rtp_recvfrom (thread 4) we call dtls_srtp_setup if SSL_is_init_finished(dtls->ssl) is true for rtp->dtls; we don't seem to care whether SSL_is_init_finished(rtp->rtcp->dtls.ssl) is true too. This results in a call to ast_srtp_create, causing thread 1 to execute ast_srtp_protect. Q: Why is this? Was it a race issue, or is it because the handshake on rtp->rtcp->dtls.ssl wasn't finished?

Possibly related to ASTERISK-24651
Requires patch from ASTERISK-24711
Requires patch from ASTERISK-24830 (the obvious fix of replacing USE_PJPROJECT with HAVE_PJPROJECT...)

> DTLS-crashes within openssl 
> ----------------------------
>
>                 Key: ASTERISK-24832
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-24832
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Resources/res_rtp_asterisk
>    Affects Versions: 13.1.0
>         Environment: Fedora 20 x86_64, openssl-1.0.1e-41.fc20.x86_64, Asterisk 13.1.0, Chrome SIPML5 chan_sip peers with transport WSS
>            Reporter: Stefan Engström
>            Assignee: Rusty Newton
>         Attachments: crash1.txt, crash2.txt, crash3.txt, crash4.txt, crash5.extralog, crash5.txt, CUSTOMERRORDEBUGLOG, SIPCONF.txt, TESTDTLS.patch, TESTDTLS.patch.workingcopy
>
