[asterisk-bugs] [JIRA] (ASTERISK-25614) DTLS negotiation delays

Mon Dec 14 14:27:33 CST 2015

     [ https://issues.asterisk.org/jira/browse/ASTERISK-25614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Asterisk Team updated ASTERISK-25614:
-------------------------------------

    Assignee: Asterisk Team  (was: Dade Brandon)
      Status: Triage  (was: Waiting for Feedback)

> DTLS negotiation delays
> -----------------------
>
>                 Key: ASTERISK-25614
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-25614
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>    Affects Versions: 11.20.0
>            Reporter: Dade Brandon
>            Assignee: Asterisk Team
>         Attachments: ASTERISK-25614 patch.txt
>
>
> Chrome M47 and higher experience a *consistent* 0.9s delay bridging audio when acting as a WebRTC peer with Asterisk.
> Chrome M47 was released on Dec 1st, 2015.
> We asked the Chrome team back when M47 was in beta, and they suggested that they had improved their SSL negotiation code in M47 and that Asterisk was likely not yet listening when they sent the handshake.  [External link to Chrome issue 536209|https://code.google.com/p/chromium/issues/detail?id=536209&thanks=536209&ts=1443219585]
> We noticed that before Gerrit commit 1ad827, this issue wasn't present.
> The particular change in 1ad827 which, when reverted, corrects the issue is in __rtp_recvfrom: a call to dtls_srtp_check_pending, which previously happened on every call to __rtp_recvfrom, was removed.
> Re-adding a constant dtls_srtp_check_pending call resolved the delay issue permanently.  Now that M47 is live, I noticed that this has greatly increased our crashes per day (up to 40 crashes today across 50 servers, having processed only 20 million WebRTC calls).
> Digging into this, I found that the added dtls_srtp_check_pending call resolved the issue because, when we receive the handshake (*in == 22), no remote_address is set yet: the ast_sockaddr_isnull check in dtls_srtp_check_pending causes an early return, leaving the response to the handshake in the write BIO.  Chrome no longer sends unnecessary repeated handshakes (their negotiation code before M47 was quite unorganized).  This means that dtls_srtp_check_pending isn't called again until 0.9 seconds later, when Chrome times out the first handshake and sends another.  The response Asterisk gives Chrome at that point is actually the response to the first handshake!
> remote_addr is set shortly after the handshake is received, when ICE candidates are processed by a separate thread.
> I've created a patch which calls dtls_srtp_check_pending from ast_rtp_remote_address_set, so that any pending response is sent as soon as the remote address is set.
> This patch causes the response to be sent from a thread other than the one running __rtp_recvfrom.  I've added locking of the DTLS mutex within __rtp_recvfrom, as I believe that unlocked access to the BIO and DTLS structures was causing extremely inconsistent crashes: ICE candidate negotiation can happen concurrently with dtls_srtp_check_pending sending from a filled write BIO.  (The crashes also seemed to be more significant for customers with higher jitter.)
> Further, in ast_rtp_remote_address_set, I took care to perform the triggered check only if the DTLS structure is set to passive.  I confirmed through added debug logs that, when this issue occurs, the code in __rtp_recvfrom under the comment "If we don't yet know if we are active or passive and we receive a packet... we are obviously passive" has definitely executed.
> I'm listing this as a regression since we identified that the problem was introduced in a particular commit; however, I believe that this solution is much better than a revert, based on the testing I described.
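
The sequence described in the report can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the attached patch and not Asterisk source: every name below (toy_rtp, toy_recv_handshake, toy_set_remote, and so on) is hypothetical, a plain byte buffer stands in for the DTLS write BIO, and a boolean stands in for the ast_sockaddr_isnull test. It shows a handshake reply being queued while no remote address is known, then being flushed under the same mutex once the address is set, and only when the role is passive.

```c
/* Toy model of the deferred DTLS flush. NOT Asterisk code; all names are hypothetical. */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum toy_dtls_role { TOY_ROLE_UNKNOWN, TOY_ROLE_ACTIVE, TOY_ROLE_PASSIVE };

struct toy_rtp {
    pthread_mutex_t dtls_lock;  /* guards every field below */
    char pending[256];          /* stands in for the DTLS write BIO */
    size_t pending_len;         /* number of queued reply bytes */
    bool have_remote;           /* stands in for !ast_sockaddr_isnull(remote) */
    enum toy_dtls_role role;
    unsigned sent_packets;      /* flushes that actually reached the "wire" */
};

static void toy_init(struct toy_rtp *rtp)
{
    memset(rtp, 0, sizeof(*rtp));           /* role starts as TOY_ROLE_UNKNOWN */
    pthread_mutex_init(&rtp->dtls_lock, NULL);
}

/* Caller must hold dtls_lock. Returns true only if queued data was sent. */
static bool toy_check_pending_locked(struct toy_rtp *rtp)
{
    if (!rtp->have_remote || rtp->pending_len == 0) {
        return false;                       /* early return: reply stays queued */
    }
    rtp->pending_len = 0;                   /* pretend the bytes were written out */
    rtp->sent_packets++;
    return true;
}

/* Receive path: a handshake arrives and its reply is queued. */
static bool toy_recv_handshake(struct toy_rtp *rtp, const char *reply)
{
    size_t n = strlen(reply);
    bool sent;

    assert(n <= sizeof(rtp->pending));
    pthread_mutex_lock(&rtp->dtls_lock);
    if (rtp->role == TOY_ROLE_UNKNOWN) {
        rtp->role = TOY_ROLE_PASSIVE;       /* we received first, so we are passive */
    }
    memcpy(rtp->pending, reply, n);
    rtp->pending_len = n;
    sent = toy_check_pending_locked(rtp);   /* no-op while the remote is unknown */
    pthread_mutex_unlock(&rtp->dtls_lock);
    return sent;
}

/* Address-set path (possibly another thread): flush what the receive path left behind. */
static bool toy_set_remote(struct toy_rtp *rtp)
{
    bool sent = false;

    pthread_mutex_lock(&rtp->dtls_lock);
    rtp->have_remote = true;
    if (rtp->role == TOY_ROLE_PASSIVE) {    /* triggered check only when passive */
        sent = toy_check_pending_locked(rtp);
    }
    pthread_mutex_unlock(&rtp->dtls_lock);
    return sent;
}
```

In this model, calling toy_recv_handshake before toy_set_remote returns false and leaves the reply queued, which corresponds to the stall described above. The later toy_set_remote call returns true and sends the reply immediately, without waiting for the peer to retransmit. Both paths take the same lock, so the flush cannot interleave with the receive path touching the buffer.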


--
This message was sent by Atlassian JIRA
(v6.2#6252)