[asterisk-dev] [Code Review] 3752: astobj2: Fix race condition in ref_debug log

Sat Jul 12 12:23:31 CDT 2014

> On July 11, 2014, 10:01 p.m., Corey Farrell wrote:
> > /team/group/media_formats-reviewed-trunk/main/astobj2.c, lines 503-507
> > <https://reviewboard.asterisk.org/r/3752/diff/1/?file=62839#file62839line503>
> >
> >     This reorder will prevent ref_log from recording a double-free.
> >     
> >     Example:
> >     obj = ao2_alloc(..);
> >     ao2_ref(obj, -1);
> >     ao2_ref(obj, -1);
> >     
> >     Currently this block of code would produce an error in the ref_log.
> >     
> >     Also I know it sounds weird but I've seen:
> >     
> >     Thread 1: ao2_ref(obj, -1) == 2
> >     Thread 2: ao2_ref(obj, -1) == 1
> >     
> >     And for thread 2 (which runs the destructor) actually finishing first (and the last 2 entries to ref_log are reversed.
> 
> Matt Jordan wrote:
>     I'm not sure either of those are worth not fixing this:
>     
>     (1) A double free is highly likely to crash immediately, in which case, the ref debug log is only so useful. Plus, in a double free, you typically have the memory address that is now hostile, which can be correlated back to a ref debug log.
>     
>     (2) The out of order printing of destruction is possible, but is handled by the refcounter script since it looks for the destructor message (and will now actually get it)
>     
>     With the current code, we can get an admittedly large number of false positives, which isn't super useful.
> 
> Corey Farrell wrote:
>     The out of order destruction printing is actually worse with this update.  It causes the destructor to be 2nd to last, then the last entry is unref from 2.  This last entry appears as a skew and leak, separate from the rest of the history for that object.  So we will still get the same false positive, but in this case we'll be able to see the history of refs to know that it's a false positive.

I do think that this could cause the destructor message to get printed earlier slightly more often, although the testing I've done so far hasn't shown that we get considerably more 'skewed' results than we did of incorrectly reported leaked objects.

Regardless, I still think this is, in the long run, better.

(1) Generally, ref count debugging is used to find ref leaks. Having an erroneously report of a leak is more detrimental to the primary purpose of ref_debug than an incorrectly reported skewed object.

(2) The script *could* be modified to work around the out of order destruction message. The script could be made to look at the reported ref count of the object (since that value will now be accurate) and build out the full history of the object. Only when the object's history is fully built, *and* we get a destruction message, would it then print and dispose of the object. This would fix the falsely reported skewed objects in the script. Contrary to this, without this patch, it would be very difficult to fix inaccurate reporting of leaked objects in the script. (a) You don't get an indication that the object was destroyed, and (b) The reference count reported in the messages is false.

(3) Skewed objects are generally a reflection of a coding error in astobj2 and nothing else. For example, a framework that uses astobj2 should not be able to create an object with an initial reference count of 3. If an object is skewed, it's because it 'showed up' in the log with no constructor and with an unexpected reference count; generally, that would mean that something managed to bypass astobj's ref_debug printing. That typically only occurs when internal nodes in astobj2 incorrectly manipulate the ref functions themselves. As such, having misreported skewed objects is generally less harmful (and can also be suppressed) than misreported leaked objects.

Finally, we have a race condition (one that I admittedly caused!)  Disregarding the printing of the destruction message, we have an atomic integer value that is wrongly reported without this patch. We could make ref_debug more complex here to truly "solve" the race between threads destroying an object and the reporting; however, there is limited bang for the buck.
 - A global lock would cripple the system and/or render ref_debug useless
 - A thread safe queue that enqueues messages to another thread for processing is generally equivalent to a global lock

I'd prefer to make the post-processing 'smarter', and make the ref counts reported accurate.

- Matt

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/3752/#review12579
-----------------------------------------------------------

On July 11, 2014, 9:45 p.m., Matt Jordan wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviewboard.asterisk.org/r/3752/
> -----------------------------------------------------------
> 
> (Updated July 11, 2014, 9:45 p.m.)
> 
> 
> Review request for Asterisk Developers and Corey Farrell.
> 
> 
> Repository: Asterisk
> 
> 
> Description
> -------
> 
> As Corey noticed, when an object's reference is changed rapidly the ref debug log will report the wrong reference. This is due to accessing the reference (which is not thread safe) prior to the reference count being changed. Since the change in reference count is returned by internal_ao2_ref, it can be used instead to determine what the current state of the object is.
> 
> 
> Diffs
> -----
> 
>   /team/group/media_formats-reviewed-trunk/main/astobj2.c 418435 
> 
> Diff: https://reviewboard.asterisk.org/r/3752/diff/
> 
> 
> Testing
> -------
> 
> Objects that were incorrectly reported as leaked no longer are.
> 
> 
> Thanks,
> 
> Matt Jordan
> 
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.digium.com/pipermail/asterisk-dev/attachments/20140712/4e17aaa3/attachment.html>