[Asterisk-code-review] Work around thread safety issues in the PBX core. (asterisk[11])

Corey Farrell asteriskteam at digium.com
Wed May 20 17:08:49 CDT 2015


Corey Farrell has posted comments on this change.

Change subject: Work around thread safety issues in the PBX core.
......................................................................


Patch Set 1:

> > > Since this is the PBX core and this changes some fundamental
 > > usage
 > > > of it across the tree I'm not sure I'm comfortable with this
 > > going
 > > > into 11. It may not be correct there, but I can't think of any
 > > > issues filed as a result of it. We also don't have tests that
 > > cover
 > > > the areas impacted. I'll let others join in on their thoughts.
 > >
 > > Are you concerned about:
 > > 1) Creation of the nolock variants or changes made to existing
 > > functions to allow this?
 > > or
 > > 2) The mass code update to stop using the unsafe variants of the
 > > functions?
 > >
 > > If it's #1 then I'd like to wait for more opinions.  If it's #2 I
 > > can reduce amount of changes, but ultimately I believe they are
 > all
 > > correct fixes to race conditions that do exist in code.  I don't
 > > think we can confidently say that this hasn't been experienced
 > and
 > > reported, just that it hasn't been successfully triaged.  Race
 > > conditions can be very difficult to trace.
 > 
 > The first point is really the more relevant of the two points, and
 > the second is moot if we don't agree on the API changes. So.
 > 
 > I'm wary about having public 'nolock' variants of these functions.
 > An ast_context struct is opaque, and exposing the locking semantics
 > of an opaque struct feels like we're solving this problem in the
 > wrong way. Users of the PBX core should not have to worry about
 > locking or unlocking an opaque structure; they should be able to
 > request a context structure and manipulate as needed. If they need
 > to alter a property on the structure, the API presented by the PBX
 > core should be responsible for managing the thread safety.
 > 
 > Really, an ast_context should be an ao2 object. When retrieving it,
 > its reference should be bumped. Destroying it should drop the pbx
 > core's reference that it holds for the context, and remove the
 > context from the dialplan. If the ast_context struct should be
 > mutated, that safety should be provided by the PBX core - but that
 > isn't really the issue here: the issue is lifetime, and that should
 > be handled by reference counting the object.
 > 
 > The race condition you've noticed is really a fundamental problem
 > with how things are structured in the PBX core, and is at the root
 > of several other problems as well - most notably the
 > pbx_realtime/autoservice/recursive locking deadlock shenaniganry.
 > 
 > While this is a good attempt at fixing the problem, it kind of just
 > perpetuates the poor architecture we have in the PBX core. If we're
 > going to go down the road of fixing this for a theoretical race
 > condition problem, we might as well fix it correctly in master, and
 > then decide if it is worth backporting.

About the _nolock naming, the issue is locking of the global contexts container (ast_wrlock_contexts), not locking of the actual ast_context.  It is only safe to use a pointer to an ast_context during the contexts lock that was active when it was acquired, as it can be freed the moment you release the contexts lock.  Most existing functions that accept an ast_context as a parameter expect the caller to have already locked the global contexts.  These functions that I've created 'nolock' variants for are the oddballs.  Really the original functions should just be changed to expect the caller to handle locking of the global contexts, but that would be an unacceptable API change for 11/13.

I'm not sure AO2 would solve this problem.  Consider if pbx_ael registers 100 at mycontext, pbx_config registers 101 at mycontext.  Then extensions.conf is modified to add 102 at mycontext and a reload is issued.   If this simply added 102 to the existing mycontext object, it would be impossible for 'dialplan reload' to be an atomic operation when multiple contexts are modified.

-- 
To view, visit https://gerrit.asterisk.org/481
To unsubscribe, visit https://gerrit.asterisk.org/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1866b6787730c9c4f3f836b6133ffe9c820734fa
Gerrit-PatchSet: 1
Gerrit-Project: asterisk
Gerrit-Branch: 11
Gerrit-Owner: Corey Farrell <git at cfware.com>
Gerrit-Reviewer: Corey Farrell <git at cfware.com>
Gerrit-Reviewer: Joshua Colp <jcolp at digium.com>
Gerrit-Reviewer: Matt Jordan <mjordan at digium.com>
Gerrit-HasComments: No



More information about the asterisk-code-review mailing list