[asterisk-bugs] [JIRA] (ASTERISK-25645) res_rtp_asterisk: Lock inversion

Dade Brandon (JIRA) noreply at issues.asterisk.org
Wed Dec 23 13:38:33 CST 2015


    [ https://issues.asterisk.org/jira/browse/ASTERISK-25645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=228729#comment-228729 ] 

Dade Brandon edited comment on ASTERISK-25645 at 12/23/15 1:36 PM:
-------------------------------------------------------------------

This should be unrelated to the patch, although more info like a full backtrace of threads would help to show otherwise.

We also use early media, and JsSIP (0.6.4 due to heavy modifications, but the version should be irrelevant for an Asterisk deadlock).  We ran this patch successfully in a large production environment from Dec 10 to 17, and have been running the updated patch (6dc4aacaed, merged yesterday) since Dec 17th.

I'd be curious to see the 'thread apply all bt full' output to check whether there's a relation to ASTERISK-25275, since the workaround we've been using for that solved all of our previous deadlock issues.

You can't tell from the mentioned code paths what is holding the lock in pj_ice_sess_send_data.  I'm willing to bet on it being a concurrent code path looking something like:
{noformat}
    res_rtp_asterisk.c      __rtp_recvfrom
    pjnath/ice_session.c    pj_ice_sess_on_rx_pkt
            - locks  ((pj_ice_sess*)ice)->mutex
            - calls pj_stun_session_on_rx_pkt
    pjnath/stun_session.c   pj_stun_session_on_rx_pkt
           - locks  ((pj_stun_sess*)sess)->lock
           - calls check_cached_response
    pjnath/stun_session.c   check_cached_response
           - stuck in the while loop due to t never being == sess->cached_response_list
{noformat}

sess->cached_response_list is corrupted by the pj_timer_heap_schedule call in pj_stun_session_send_msg.  Usually this results in a SIGSEGV, either from a null dereference in check_cached_response (per ASTERISK-25275) or from a null dereference in pj_stun_session_destroy.  However, if the cache eviction timer executes while check_cached_response is in its loop, the loop will never find its starting point again, causing an infinite loop.  From our experience, that happens roughly 1 out of 10 times (about 10 SIGSEGVs per infinite loop).
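
To make the infinite-loop case concrete, here is a minimal standalone sketch of a sentinel-terminated walk over a circular doubly linked list.  This is not the pjnath code: struct node, list_init, list_push_back, list_erase and walk_until_head are hypothetical names, and the sketch assumes that erasing a node leaves it pointing at itself.  The max_steps bound exists only so the sketch terminates; the loop described above has no such bound.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a circular doubly linked list node. */
struct node {
    struct node *prev;
    struct node *next;
};

/* An empty list (or a detached node) points at itself. */
static void list_init(struct node *n)
{
    n->prev = n;
    n->next = n;
}

/* Append n just before the sentinel head. */
static void list_push_back(struct node *head, struct node *n)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Unlink n and re-initialise it to point at itself (assumed behaviour). */
static void list_erase(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    list_init(n);
}

/*
 * Follow next pointers from start until the sentinel head is reached.
 * Returns the number of steps taken, or -1 if head was not reached
 * within max_steps (i.e. the walk would otherwise spin forever).
 */
static int walk_until_head(struct node *head, struct node *start, int max_steps)
{
    struct node *t = start;
    int steps = 0;

    while (t != head) {
        if (steps++ >= max_steps)
            return -1;
        t = t->next;
    }
    return steps;
}
```

If a walker is paused on a node and another code path erases that node, the walker's next pointer leads back to the same node, so it can never reach the head again.  With three nodes a, b, c on the list, walk_until_head from the first node reaches the head, but after list_erase(&b) a walk that resumes from &b only stops because of the bound.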

As another indicator of whether I'm describing the same problem: if you are using Chrome with JsSIP, you would have noticed this issue spike significantly after the Chrome M47 release on Dec 1st (or after your browser upgraded to it).  They adjusted their DTLS code (an improvement), which somehow managed to expose this issue, causing it to occur roughly 50x more often than before.




> res_rtp_asterisk: Lock inversion
> --------------------------------
>
>                 Key: ASTERISK-25645
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-25645
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Resources/res_rtp_asterisk
>            Reporter: Joshua Colp
>
> Reported by Steve Davies on asterisk-dev:
> commit 5e6b1476a087407a052f007d326c504cfeefebe7
> ASTERISK-25614
> 2 code paths which approximate the following will cause a lock-inversion deadlock:
> approximate call orders are:
> a)
> pj_timer_heap_poll (PJ_LOCK)
> ast_rtp_on_ice_complete
> ast_rtp_instance_set_remote_address
> remote_address_set
> ast_rtp_remote_address_set
> (DTLS_LOCK)
> ...
> b)
> ast_pbx...
> app_dial
> bridge...
> read
> rtp_read
> ...
> __rtp_recvfrom
> (DTLS_LOCK)
> dtls_srtp_check_pending
> __rtp_sendto
> pj_ice_sess_send_data
> (PJ_LOCK)
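
The inversion in the two call orders above can be sketched as a tiny lock-order check.  This is only an illustration, not Asterisk or pjproject code: struct lock_graph and record_path are hypothetical, and the sketch assumes each path still holds its first lock when it takes the second, as the (PJ_LOCK) ... (DTLS_LOCK) annotations suggest.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* The two locks named in the report; the enum itself is illustrative. */
enum { PJ_LOCK, DTLS_LOCK, NUM_LOCKS };

/* order[a][b] is true once some path has taken b while holding a. */
struct lock_graph {
    bool order[NUM_LOCKS][NUM_LOCKS];
};

/*
 * Record the sequence of locks one code path acquires (all held at once).
 * Returns true if the sequence takes two locks in the opposite order to
 * a path recorded earlier, which is the precondition for a deadlock.
 */
static bool record_path(struct lock_graph *g, const int *seq, int n)
{
    bool inverted = false;

    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            if (g->order[seq[j]][seq[i]])
                inverted = true;
            g->order[seq[i]][seq[j]] = true;
        }
    }
    return inverted;
}
```

Feeding it path a) as {PJ_LOCK, DTLS_LOCK} records the order without complaint; feeding it path b) as {DTLS_LOCK, PJ_LOCK} afterwards is flagged as an inversion.  Whether a real deadlock occurs still depends on both paths running concurrently on the same session.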


