[Asterisk-code-review] pbx: Deadlock between contexts container and context merge l... (asterisk[11])

Mon Dec 28 09:43:26 CST 2015

Matt Jordan has posted comments on this change.

Change subject: pbx: Deadlock between contexts container and context_merge locks
......................................................................

Patch Set 1: Code-Review-2

> Patch Set 1: Code-Review-2

Posting our conversation from #asterisk-dev:

<mjordan> file: I think I'm a bit confused about review https://gerrit.asterisk.org/#/c/1855 still. What shared state is being mutated both in the device state callback, as well as in the ast_change_hint call?
<mjordan> Clearly the hint is, but that is locked at the hint level
<file> there is no shared state, however there is nothing to stop both the device state callback and ast_change_hint from using/invoking device state callbacks at the same time
<file> and the device state callback doesn't keep the hint lock for long, so both could give different state
<mjordan> any reason the device state callback can't just hold onto the hint lock?
<file> I'm sure there's a lock inversion waiting there between channel drivers and the PBX core
<mjordan> maybe, but kevin's solution feels worse than the problem we're solving
<file> I'm not sure if I have an alternative, besides letting that race condition... exist
<mjordan> At the very least, state_callbacks_lock feels completely redundant with just locking the entire hints container
<mjordan> I'm not sure this race condition was that harmful, compared to the alternatives
<file> which race condition? the device state callback/ast_change_hint one?
<mjordan> yes
<file> it's a slim chance, the end result shouldn't be a crash - the device state shown on a phone may just not reflect reality, until another device state change occurs
<mjordan> it's not good, don't get me wrong, but if we tank Asterisk's ability to just do a state update in normal operation outside of startup, that's bad
<mjordan> Really, the hint should be locked during the callback
<mjordan> that feels like the real crux of the problem: we unlock it and let something else slip in and mutate the state
<file> well, nothing else previously slipped in and mutated the state to this degree
<file> it's a new condition
<mjordan> what changed?
<file> fixing ast_change_hint to update and propogate state if the underlying devices change
<mjordan> actually.
<mjordan> This may work.
<mjordan> execute_state_callback actually relocks the hint, copies the data off, then calls the callback
<file> yes
<mjordan> really, it appears as if we're just letting go of the lock on the hint too soon
<mjordan> we should let execute_state_callback do that for us
<mjordan> either that, or pull the data off prior to calling that, and let it notify the callbacks
<file> you'd still potentially run into the race condition above, unless you held the hint while invoking the callback - it's just lessened
<file> hum, what would be really nice is if ast_change_hint could actually queue up a device state callback which would get handled normally...
<mjordan> How would you run into the race condition? You're holding onto the hint lock the entire time.
<file> where, and hint or hints?
<mjordan> The only way we'd have a race is if we are trying to execute a state callback and we release the hint lock prior to caching off the information
* mjordan does not feel like this problem is well understood
<file> that is what currently happens, are you proposing changing that?
<mjordan> no, that isn't what happens
<mjordan> ast_change_hint releases the hint lock
<mjordan> you then go into execute_state_callback, and re-grab the lock
<mjordan> between that period of time, something else can change the state of the hint
<mjordan> such as the device state callback
<mjordan> I presumed that was the actual race condition
<file> would stasis allow multiple simultaneous invocations of the device state callback from separate threads?
<mjordan> yes
* mjordan sighs
<mjordan> also, we have this warning comment:
<mjordan>   /*
<mjordan>    * Get device state for this hint.
<mjordan>    *
<mjordan>    * NOTE: We cannot hold any locks while determining the hint
<mjordan>    * device state or notifying the watchers without causing a
<mjordan>    * deadlock.  (conlock, hints, and hint)
<mjordan>    */
<mjordan> what a freaking mess.
<file> I give up on it.
<file> according to the stasis documentation the device_state_callbacks would be serialized
<mjordan> hrm.
* mjordan is unsure of why
<file> unless we changed that and never updated the documentation
<mjordan> is it a dedicated subscription?
<file> one subscription to the device state all topic using stasis_subscribe
<mjordan> that is, it doesn't get served on the the threadpool?
<file> from a PBX core perspective, no
* file ponders
<file> mjordan, I have... a thought.
<file> mjordan, a new message type that ast_change_hint can use to publish a hint update message, which is handled in the device and presence state callbacks
<file> that would return things to the exact locking they were previously
<mjordan> I like that.
<mjordan> I'd rather serialize things in a queue then try to keep doing this insane lock the world crap
<file> yeah

tl;dr: use Stasis to serialize the updates, which precludes having to use yet another global lock.

-- 
To view, visit https://gerrit.asterisk.org/1855
To unsubscribe, visit https://gerrit.asterisk.org/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7f67ce4b4c94b731c23932cc2ba9e0a4a720813d
Gerrit-PatchSet: 1
Gerrit-Project: asterisk
Gerrit-Branch: 11
Gerrit-Owner: Kevin Harwell <kharwell at digium.com>
Gerrit-Reviewer: Anonymous Coward #1000019
Gerrit-Reviewer: Joshua Colp <jcolp at digium.com>
Gerrit-Reviewer: Matt Jordan <mjordan at digium.com>
Gerrit-HasComments: No