[asterisk-dev] [Code Review] MALLOC_HOLD for debugging memory

Mon Jul 13 17:42:08 CDT 2009

> On 2009-07-13 15:51:48, Mark Michelson wrote:
> > I question the usefulness of this change mainly because you can't see where in the code the illegal write to memory occurred. This also does not allow for the detection of illegal reads of freed memory.
> > 
> > Another potential problem (although this would never occur if Asterisk were perfect) is that the method used here is not thread-safe. You iterate over each byte of memory and see if the memory is non-NULL. The problem is that a section you already read could be written to after you pass over it. In order for this to work correctly, every write to memory would have to grab the reglock. That simply is not feasible.
> 
> Mark Michelson wrote:
>     Another problem is that this change masks any double frees which occur. A nice upside to this is that with a small bit of modification, you can actually detect and report double frees pretty easily with this code change.
>     
>     (I would have mentioned this earlier, but review board was adamantly refusing to let me edit my review further)
> 
> Tilghman Lesher wrote:
>     1) This is already threadsafe, as reglock is held during the entire time that the memory check is occurring.
>     2) Memory should not be written to after the memory is zeroed, because as far as the process is concerned, the memory is technically already freed (though not yet released back into the pool).  Therefore, any memory written to after the memory is zeroed is a possible memory corruption issue.
>     3) I understand that we cannot see what code wrote to the memory, but finding out where the memory was originally allocated is generally sufficient to track down that error.  This isn't Valgrind, after all, but merely a tool to help us when Valgrind cannot be deployed due to production concerns.
>     4) This does NOT mask double-frees.  We already check for that condition, and that check is not eliminated here.
> 
> Mark Michelson wrote:
>     Yes, the code is thread-safe with regards to allocations and frees, but it is not thread-safe if a thread is freeing a region and another thread is writing to an already-freed region which has been already-freed for 120 seconds or more. The thread freeing the region holds the reglock, but the thread writing to the already-freed region does even attempt to grab the reglock before writing to the already-freed memory. Now, obviously, the thread doing the writing should not even be trying to write to freed memory, but inevitably this happens. It would require some rather precise timing to occur, I'll admit, but it is possible.
>     
>     If you ask me, seeing where the actual write occurred is just as important as seeing where the memory was originally freed. I don't see how you're supposed to be able to find the problematic area in the code if you don't see where the fault occurred.
>     
>     You're going to have to show me how this doesn't mask double frees. I know the original code will print a "Freeing unused memory" message if a double free occurs. The problem here is that since the region is not removed from the list, you can free the same region twice and find it in the list of allocated regions both times. Therefore the "Freeing unused memory" message will not print. It is still possible to catch double frees here, but it would require the two occurrences to be at least 120 seconds apart and have at least one other region to be freed in the interval. In fact, taking a closer look now, I see that the "freeing unused memory" message cannot even print in MALLOC_HOLD mode since it is #ifdef'd out.
> 
> Tilghman Lesher wrote:
>     The thread writing to the already-freed region still possesses the reglock until it exits the loop.  Even if it didn't, you're suggesting that a thread writing zeroes to a memory segment could consume 120 seconds of wallclock time?  That sounds ridiculous.
>     
>     I'll look into the double free code.
> 
> Mark Michelson wrote:
>     I'm not suggesting any such thing.
>     
>     Let's say that threads A and B have pointers to a region of malloced memory. Thread A frees the memory, but thread B is still pointing there, unaware that the region has been freed. Now, more than 120 seconds later, thread C frees some completely different region of memory. Now, the free triggered by thread C is going to cause the MALLOC_HOLD code to see that the region freed by thread A is older than 120 seconds, and so it will begin to iterate, byte-by-byte, over the region of memory, checking for anything non-zero. During this check, it is still possible for thread B to write to the memory as it is being checked, since there is no requirement that thread B holds the reglock when writing to heap-allocated memory. As a result, if thread B writes to memory that the MALLOC_HOLD code has already inspected, then the invalid write will go undetected. Again, such a situation is highly unlikely, but it is possible to occur.
> 
> Mark Michelson wrote:
>     Hmm, after re-reading that scenario I came up with, I definitely am not seeing the forest for the trees. The more pressing problem in that scenario is that even 120 seconds after thread A has freed the region, thread B still thinks it can write to it. The larger problem here is that thread B is likely to write to the region after the MALLOC_HOLD has completely freed memory, causing a crash. This is a limitation that will have to be accepted, though, I guess since we can't hold on to the region forever, and 120 seconds is more than a reasonable amount of time to hold onto the memory.

Okay, while I agree with that, I chose the 120 second threshold specifically because if any memory corruption is likely to occur, it should occur well before that point in time.  I could certainly choose a longer amount of time that the memory chunk remains in limbo, but I'd need a clear justification for however long that should be.

- Tilghman

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/309/#review966
-----------------------------------------------------------

On 2009-07-13 17:04:05, Tilghman Lesher wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviewboard.asterisk.org/r/309/
> -----------------------------------------------------------
> 
> (Updated 2009-07-13 17:04:05)
> 
> 
> Review request for Asterisk Developers.
> 
> 
> Summary
> -------
> 
> MALLOC_HOLD is an optional compile-time definition used to detect memory corruption.  It does this in a similar way to valgrind, although it does not slow down Asterisk appreciably (although the memory will grow somewhat larger than normal).
> 
> 
> Diffs
> -----
> 
>   /branches/1.4/build_tools/cflags.xml 206144 
>   /branches/1.4/main/astmm.c 206144 
> 
> Diff: https://reviewboard.asterisk.org/r/309/diff
> 
> 
> Testing
> -------
> 
> Extensive testing with an idle system.
> 
> 
> Thanks,
> 
> Tilghman
> 
>