[asterisk-bugs] [JIRA] (ASTERISK-25645) res_rtp_asterisk: Lock inversion

Dade Brandon (JIRA) noreply at issues.asterisk.org
Thu Dec 24 21:39:33 CST 2015


    [ https://issues.asterisk.org/jira/browse/ASTERISK-25645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=228746#comment-228746 ] 

Dade Brandon edited comment on ASTERISK-25645 at 12/24/15 9:38 PM:
-------------------------------------------------------------------

I'm sorry if my response reads like a flame, but I see too many holes in the explanations here, without any real evidence.  Holding back the backtraces doesn't help if you can't demonstrate a problem that's evident in the branch you're referring to (11.21-rc1).

While I wish I could take your suggestion to reproduce this reliably with your test setup, that would require your UI code in addition to the modded pjsip (our UI code, for example, doesn't care whether there are multiple channels open in pjsip).  Even then, it would likely require identical network conditions - we've seen and resolved many DTLS/STUN issues that are only reproducible by our customers within a specific RTT from our servers, or at a specific jitter level.  That doesn't even count the Asterisk and pjproject modifications you're referencing.  So going through the hassle of using your modded pjsip on my end to reproduce and debug isn't going to happen, unfortunately.

Also note, I don't work for Digium; I'm just the guy who submitted the patch that introduced the problem you're referencing.  So when I say "we" I don't mean the Asterisk developer community, I mean my company, which has run several hundred million production WebRTC calls, bridged to PSTN and to other WebRTC peers, using the mentioned patch.  I'm not closed to the idea of there being unforeseen issues with it.  However, I would like to help ensure that any issues in that commit don't make it past the release candidate.

I've personally made the mistake of opening several issues on JIRA without first checking whether they were related to our internal changes.  I don't advise it, since it complicates matters - unless those internal changes have fixed something, in which case I suggest submitting them back first.  Opening issues without first backtracking to the stock release, free of any customizations, just diverges your afflicted setup from the setup we can test with.


>I will provide the backtraces of the deadlocked threads to demonstrate the issue.

What you attached still does nothing to show which threads are holding the lock, unless this is a true inversion, which it also doesn't show.  The lock acquired in pj_timer_heap_poll is not the ice->lock that pj_ice_sess_send_data is attempting to acquire; it's ice->stun_cfg.timer_heap->lock, if that exists (pjlib/src/pj/timer.c:103).  Moreover, that lock is released before the timer's callback (ast_rtp_on_ice_complete) is invoked - so unless you've modified pjproject in Asterisk-11.21-rc1, what you've sent doesn't show the problem in the path you've described.
It's far more likely that the lock is held by a thread that isn't itself blocked - which is exactly what a backtrace filtered to waiting threads would hide.
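
To spell out the polling pattern I'm describing, here is a minimal self-contained stand-in for what pjlib/src/pj/timer.c does, written from memory with illustrative names rather than pjlib's actual types:

    /* The heap lock is dropped around every callback, so a callback such
     * as ast_rtp_on_ice_complete() never runs with the timer heap lock
     * held.  Sketch only; not the literal pjlib code. */
    #include <pthread.h>
    #include <stddef.h>

    typedef void timer_cb(void *user_data);

    struct timer_entry {
        timer_cb *cb;
        void *user_data;
        struct timer_entry *next;
    };

    struct timer_heap {
        pthread_mutex_t lock;
        struct timer_entry *due;    /* already-expired entries */
    };

    unsigned timer_heap_poll(struct timer_heap *ht)
    {
        unsigned count = 0;

        pthread_mutex_lock(&ht->lock);
        while (ht->due != NULL) {
            struct timer_entry *node = ht->due;
            ht->due = node->next;
            ++count;

            pthread_mutex_unlock(&ht->lock);  /* released BEFORE the cb */
            if (node->cb)
                node->cb(node->user_data);
            pthread_mutex_lock(&ht->lock);
        }
        pthread_mutex_unlock(&ht->lock);
        return count;
    }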

I'd still like to see your "thread apply all bt full" - not filtered to waiting threads.

Preferably, change res/Makefile on line 78 to use "-fPIC -g3" rather than just "-fPIC", so that there are debug symbols for pjproject.


> Is the attached patch 'experimental_anti_deadlock' utterly stupid?

It reverts some locking I added, which I wasn't sure would be necessary, useful, or problematic.  It definitely wasn't directly related to the delay issue that the rest of the commit was intended to resolve.  I specifically requested that someone familiar with the dtls lock review that portion on Gerrit, and it received a go-ahead, but things can slip through the cracks even with the review process.  I still think proper evidence that the lock is related would be more appropriate - or else take the referenced commit out of 11.21 before release and hold it until 22.  Note, however, that the revert doesn't add back the dtls lock I'd removed lower down in the same control structure, which the new/expanded lock subsumed.

Given that locking already occurred inside the same control structure in __rtp_recvfrom, and still would even if it were inside dtls_srtp_check_pending, moving it doesn't seem to serve any positive effect, except perhaps holding the lock for a shorter duration (while increasing the number of times it is acquired).

I also wouldn't add the lock inside dtls_srtp_check_pending - at least not where you added it, since you added it after the BIO_read.  If it's going to be useful anywhere, it's going to be before that point: the aim is to prevent two threads from touching the SSL BIO buffers at the same time (they're also referenced in __rtp_recvfrom) and, through that, hopefully resolve some of the SSL asserts we've seen in our core dumps (which we haven't seen since the patch).  Taking the lock directly in dtls_srtp_check_pending will deadlock when it's called from dtls_perform_handshake (it sounds like you're running in passive mode, so this wouldn't affect you, but it would affect others), from dtls_srtp_handle_timeout, and from other places.  I considered that before this patch, and determined that changing all of those call sites was too significant a change for my understanding of res_rtp_asterisk and its DTLS integration.
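
To illustrate the re-entrancy problem, here is a generic sketch, assuming a plain non-recursive mutex - this is not the actual res_rtp_asterisk code:

    #include <pthread.h>

    static pthread_mutex_t dtls_lock = PTHREAD_MUTEX_INITIALIZER;

    /* If dtls_srtp_check_pending() took the lock itself... */
    static void check_pending(void)
    {
        pthread_mutex_lock(&dtls_lock);
        /* ... flush pending DTLS packets ... */
        pthread_mutex_unlock(&dtls_lock);
    }

    /* ...then any caller that already holds it, the way
     * dtls_perform_handshake() or dtls_srtp_handle_timeout() would,
     * hangs on the second acquisition. */
    static void perform_handshake(void)
    {
        pthread_mutex_lock(&dtls_lock);
        check_pending();    /* deadlock: dtls_lock is already held */
        pthread_mutex_unlock(&dtls_lock);
    }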

If moving the locks in __rtp_recvfrom resolves the issue in your test setup, I think it's most likely just moving the problem - possibly to a place where you'd never see it again, but more likely to one that causes SSL aborts from within the library's debug asserts.

>I am familiar with the SEGV / 100% CPU infinite loop bug, and this is not that bug. I added locks to on_cache_timeout() in pjproject as referenced in ASTERISK-25275 to resolve this, but none of those locks are held when this issue deadlocks.

Funny, because my company held our information about that privately until I posted the update to it at the same time as replying to you yesterday, and "familiarity" would be difficult to establish over a 24-hour period.  If you mean you have your own experience with it, did you resolve it a different way?  I haven't resolved this by adding locks, and I'd be more inclined to believe that doing so could be the cause of your unacquirable ice session lock, as the pjnath architecture is both messy and complex.  I don't know it well enough myself to make that change with any level of comfort.

I mentioned that adding locks to on_cache_timeout was how more up-to-date versions of pjproject solved the problem, but in actuality, the only evidence I have that this is resolved in newer pjproject is hearing that these problems don't exist in Asterisk 13 when using chan_pjsip with WebRTC (which is not our environment), plus trying to backtrack through pjproject's changelogs to locate a relevant change in that area of the code.
I am fairly sure you've read ASTERISK-25275 out of context, since you also think it suggests disabling pjproject's group locks.  My suggested workaround there, although brute-force, is essentially to ensure that the cached_response_list is never used.  Since I implemented that workaround for my company, I've consulted with several other companies who've confirmed that it resolved the deadlocks they began to experience at higher frequency post-Chrome-M47.  Note that I'm also the original opener of that issue months ago; the other person recently responding there is someone who re-opened the thread yesterday asking for any related solutions.
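
As a sketch of the general idea - assuming pjnath's pj_stun_config response-cache knob is the relevant mechanism; the actual workaround on ASTERISK-25275 may differ:

    #include <pjnath.h>

    /* Zeroing the STUN response-cache duration should keep
     * cached_response_list empty, so on_cache_timeout() never has
     * work to do.  Treat this as an inference, not a quote from
     * that issue. */
    static void init_stun_cfg_without_response_cache(pj_stun_config *cfg,
                                                     pj_pool_factory *pf,
                                                     pj_ioqueue_t *ioqueue,
                                                     pj_timer_heap_t *heap)
    {
        pj_stun_config_init(cfg, pf, 0, ioqueue, heap);
        cfg->res_cache_msec = 0;    /* responses are never cached */
    }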


> Thread 36 is the PJ callback on ICE complete, which holds the PJ group lock, and wants the DTLS lock.

There is no "group lock" in Asterisk-11's embedded pjproject.


> Threads 34 and 3 are Asterisk RTP threads which call dtls_srtp_check_pending() with the DTLS lock, and IF this results in a write via pj_ice_sess_send_data(), will subsequently require the PJ group lock.

"" and still nothing to suggest it's locked, unless your pjproject actually is modified with group locks + is locking the group lock somehow from within timer.c in the timer poll, which seems architecturally questionable.

It's also not relevant if the required pjproject lock(s) aren't deadlocked.
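
To be explicit about what a true inversion would have to show: paths (a) and (b) from the report reduce to the classic ABBA pattern, sketched generically below with pthread mutexes standing in for PJ_LOCK and DTLS_LOCK (not Asterisk code):

    #include <pthread.h>

    static pthread_mutex_t pj_lock   = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t dtls_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Path (a): timer poll -> ast_rtp_on_ice_complete -> address set */
    static void *path_a(void *arg)
    {
        pthread_mutex_lock(&pj_lock);     /* PJ_LOCK */
        pthread_mutex_lock(&dtls_lock);   /* DTLS_LOCK: blocks if (b) holds it */
        pthread_mutex_unlock(&dtls_lock);
        pthread_mutex_unlock(&pj_lock);
        return arg;
    }

    /* Path (b): rtp_read -> __rtp_recvfrom -> dtls_srtp_check_pending -> send */
    static void *path_b(void *arg)
    {
        pthread_mutex_lock(&dtls_lock);   /* DTLS_LOCK */
        pthread_mutex_lock(&pj_lock);     /* PJ_LOCK: blocks if (a) holds it */
        pthread_mutex_unlock(&pj_lock);
        pthread_mutex_unlock(&dtls_lock);
        return arg;
    }

    /* If one thread is between the two locks in path_a() while another is
     * between the two locks in path_b(), both block forever.  Demonstrating
     * that requires showing BOTH holders, which filtered backtraces don't. */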

> ASTERISK-25275 suggests that disabling the group lock might be a thing, but that seems to fly in the face of everything 

The one person experiencing the problem who recently re-opened that thread mentioned he'd tried upgrading to a group-lock-patched version of pjproject within Asterisk 11.  He mentioned setting a group lock parameter to NULL in a call to pj_ice_sess_create, which is not the same as disabling the group lock.  That is a person attempting to work out the problem, who will hopefully get back to that issue with feedback on the workaround I suggested.  It is not a suggested solution; it is something he said did NOT work.
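
For clarity, in group-lock-era pjnath, passing NULL there simply makes the session allocate its own group lock - paraphrased here from memory of pjnath's ice_session.c, so treat the exact lines as approximate:

    /* Inside pj_ice_sess_create(): a NULL grp_lock argument does not
     * disable group locking; the session creates one for itself. */
    if (grp_lock == NULL) {
        status = pj_grp_lock_create(pool, NULL, &grp_lock);
        if (status != PJ_SUCCESS)
            return status;
    }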

(Restricted to JIRA Users group)

> res_rtp_asterisk: Lock inversion
> --------------------------------
>
>                 Key: ASTERISK-25645
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-25645
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Resources/res_rtp_asterisk
>            Reporter: Joshua Colp
>         Attachments: deadlocked_threads.txt, experimental_anti_deadlock
>
>
> Reported by Steve Davies on asterisk-dev:
> commit 5e6b1476a087407a052f007d326c504cfeefebe7
> ASTERISK-25614
> 2 code paths which approximate the following will cause a lock-inversion deadlock:
> approximate call orders are:
> a)
> pj_timer_heap_poll (PJ_LOCK)
> ast_rtp_on_ice_complete
> ast_rtp_instance_set_remote_address
> remote_address_set
> ast_rtp_remote_address_set
> (DTLS_LOCK)
> ...
> b)
> ast_pbx...
> app_dial
> bridge...
> read
> rtp_read
> ...
> __rtp_recvfrom
> (DTLS_LOCK)
> dtls_srtp_check_pending
> __rtp_sendto
> pj_ice_sess_send_data
> (PJ_LOCK)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


