[asterisk-dev] [Code Review]: Reorder Asterisk startup sequence to prevent mutex errors caused by daemon fork

Fri Nov 16 12:27:32 CST 2012

> On Nov. 16, 2012, 10:39 a.m., David Lee wrote:
> > /trunk/main/asterisk.c, line 3977
> > <https://reviewboard.asterisk.org/r/2197/diff/1/?file=32166#file32166line3977>
> >
> >     Shouldn't we put an assertion in ast_pthread_create_* that fails if we haven't hit this point in main yet? That would help prevent these sorts of problems from sneaking in down the road.
> 
> Matt Jordan wrote:
>     That will help, but it won't uniquely solve the issue.
>     
>     I'm actually not convinced that it was thread creation itself that was causing the problem.  One of the solutions I tried that 'fixed' the thread registration problem was to replace the recursive mutex surrounding the read/write locks with a non-recursive variant (which works because a recursive call to a read/write lock isn't going to work in the first place).  That stopped the deadlock that's documented on the Asterisk issue, only to cause a new one to crop up on the logmsgs list lock.  That lock is a 'normal' Asterisk lock, which has to be recursive - and may have been accessed by the main process thread prior to the logmsgs thread being created (or at the same time).  It was this moving target that made me restructure the startup sequence in the first place.
>     
>     If its the main process thread interacting with a lock that causes this problem, then an assert on thread creation will certainly help, but won't completely prevent this situation from occurring.  A huge problem with this issue is that it is inherently timing dependent, and getting an exact picture of the players involved is difficult - even aborting immediately on detecting the pthread mutex failure was still too late, as the unlock occurs after the fork has apparently executed.

By your analysis, thread creation wasn't the problem. But holding a lock while a daemon() happens does, and that can only happen if there are other threads running (or someone puts some _really_ bad code in main).

In general, I think we can declare that starting threads before daemon() is a Bad Think(TM) and assert for it.

I agree it's not a total solution, but it should cover enough cases to be helpful.

- David

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/2197/#review7406
-----------------------------------------------------------

On Nov. 16, 2012, 9:38 a.m., Matt Jordan wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviewboard.asterisk.org/r/2197/
> -----------------------------------------------------------
> 
> (Updated Nov. 16, 2012, 9:38 a.m.)
> 
> 
> Review request for Asterisk Developers.
> 
> 
> Summary
> -------
> 
> This is a very convoluted issue - for a full explanation, see ASTERISK-19463.
> 
> For a long time now, the Asterisk Test Suite has run into a problem where Asterisk appears to hit a deadlock during startup.  This only occurs if DEBUG_THREADS is enabled; only occurs very rarely (on the 64-bit test system about 1 out of every 500 or so tests; on a 32-bit test system about 3-5 out of every 150 or so tests); occurs in the lock tracking object's recursive mutex (see the backtrace on ASTERISK-19463 for an example of what it looks like); and *only* occurs during startup.  Once the Asterisk system is up and running, this error hasn't been seen.
> 
> Initially, we suspected that there was some unbalance in the lock/unlock of said mutex; however, inspection and enough testing verified that to not be the case.  When we actually put in debug code to check the return value of the recursive mutex, we found that an unlock operation (a call to ast_reentrancy_unlock) was failing with an error code of EPERM, implying that the thread that was performing the unlock operation did not own the mutex.  Note that the code where this occurs is very simple:
> 
> 		ast_reentrancy_lock(lt);
> 		if (lt->reentrancy < AST_MAX_REENTRANCY) {
> 			lt->file[lt->reentrancy] = filename;
> 			lt->lineno[lt->reentrancy] = line;
> 			lt->func[lt->reentrancy] = func;
> 			lt->thread[lt->reentrancy] = pthread_self();
> 			lt->reentrancy++;
> 		}
> 		ast_reentrancy_unlock(lt);
> 
> In a recursive pthread mutexes, the thread that owns the mutex is tracked to determine if the lock/unlock call is recursive.  There's very few ways that the thread identifier could change in a protected section of the code.  One way in which it can happen is if the process is forked.
> 
> When Asterisk starts up and is told to run in the background, it makes a call to daemon.  In order to send the process into the background, the process is forked and the forked process is sent to the background, while the parent process exits.  While we were unable to determine what code path was causing the thread identifier in the recursive mutex to change or be mangled, we were able to identify that it occured in a call to logger.
> 
> The patch attached does the following:
> 
> 1) During startup, prior to calling daemon all output methods use fprintf to send error messages to stderr.  This prevents ever having to go into the Asterisk logger subsystem, thereby potentially accessing one of it's mutexes.
> 
> 2) The call to daemon has been moved up earlier in the startup sequence.  Things in Asterisk that can be moved later have been moved further down in the start up sequence to minimize the risk that they call into logger and access a recursive mutex before the fork.
> 
> Additionally, this patch includes one additional fix.  When DEBUG_THREADS is enabled threadstorage is initialized used to store lock information.  This must occur before the thread is registered, as the thread registration itself will attempt to use the threadstorage to store the lock information.  This was not causing the problem, but initializing a mutex after its been used is bad.
> 
> 
> This addresses bug ASTERISK-19463.
>     https://issues.asterisk.org/jira/browse/ASTERISK-19463
> 
> 
> Diffs
> -----
> 
>   /trunk/main/asterisk.c 376374 
>   /trunk/main/utils.c 376374 
> 
> Diff: https://reviewboard.asterisk.org/r/2197/diff
> 
> 
> Testing
> -------
> 
> A lot.
> 
> A bamboo test agent (32-bit) was used to test the failures, as it would routinely reproduce the behavior (approximately 3 failures out of 150 or so tests) in every run.  After this patch was applied, the tests were run an additional 13 times with no failures (more are being run currently).
> 
> Evidence for the patch resolving the failures can be seen here:
> http://bamboo.asterisk.org/browse/ASTTEAM-MJORDAN-94
> 
> Note that this issue has caused numerous false test failures in the continuous integration tests, and resolving it will go a long ways towards making that process more stable.
> 
> 
> Thanks,
> 
> Matt
> 
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.digium.com/pipermail/asterisk-dev/attachments/20121116/901bffcb/attachment.htm>