[asterisk-dev] [Code Review] Fix a deadlock relating to autoservice. Issue 14057

Thu Dec 11 11:10:20 CST 2008

> On 2008-12-11 10:33:02, Russell Bryant wrote:
> > /branches/1.4/main/autoservice.c, lines 124-136
> > <http://reviewboard.digium.com/r/83/diff/2/?file=1838#file1838line124>
> >
> >     As you suspected in your review request, I believe you do have a race condition here.  Just because you were able to lock and unlock here does not mean that you are safe from this deadlock.  So, I don't think this approach is going to work.
> >     
> >     I'll keep thinking about it but I'm afraid that the fix will have to be making a rule that ast_autoservice_stop() is not called with the channel locked.
> 
>  wrote:
>     That's why I suggested not unlocking the channels until after the call to ast_waitfor_n. After that, the only attempt to lock the channel is done in ast_read, and that already has deadlock avoidance built-in. Is there another reason you suspect that testing for lockability does not preclude the code from deadlocking?
> 
>  wrote:
>     Well, that would prevent the deadlock, but I think we should avoid going that route.  That would mean you're locking a whole bunch of channels and then having the thread go to sleep waiting for input on the channel.  That can cause other deadlock issues, as some channel drivers (IAX2 for example) must acquire the channel lock to provide the input that you're asleep waiting for.

Ah, I see. that makes perfect sense. I'll see if I can find some sensible way to make sure channels are not locked prior to stopping autoservice. I don't imagine this will be easy at all :(

- Mark

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviewboard.digium.com/r/83/#review190
-----------------------------------------------------------

On 2008-12-11 09:36:21, Mark Michelson wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://reviewboard.digium.com/r/83/
> -----------------------------------------------------------
> 
> (Updated 2008-12-11 09:36:21)
> 
> 
> Review request for Asterisk Developers.
> 
> 
> Summary
> -------
> 
> This is a potential fix for issue 14057 on Mantis (http://bugs.digium.com/view.php?id=14057)
> 
> The problem presented is that a thread which holds a lock on a channel is attempting to stop autoservice on that channel. Meanwhile, the autoservice thread is attempting to lock the channel held by the thread attempting to stop autoservice. Unfortunately the thread attempting to stop autoservice cannot complete since it must wait for the current iteration of the autoservice loop to finish so that the as_chan_list_state can change.
> 
> I have added a check prior to the call to ast_waitfor_nandfds which loops through the channels currently in autoservice and attempts to lock them. If the lock is unsuccessful, then we remove the channel from the array of channels to poll for the current iteration of autoservice.
> 
> There are three pieces of feedback I am looking for from this review request (although I would welcome other forms of feedback, too)
> 1. This sort of fix makes perfect sense for a channel stopping autoservice whose autoservice use count will be reduced to 0 once the call to ast_autoservice_stop completes. However, the possibility also exists that ast_autoservice_stop will only be decrementing the use count of the channel and not removing it from the aslist for the time being. In this scenario, I'm wondering if my fix is actually counterproductive since it could potentially mean that a queue of frames will build on the channel as long as the channel's lock is held by another thread, essentially meaning that autoservice is doing nothing.
> 
> 2. Is there still a race condition here? When the ast_channel_trylock() call succeeds, my current action is to immediately unlock the channel. I'm wondering if the lock should actually be held until after ast_wait_for_n returns, since it is possible that the channel's lock may be claimed by another thread in the small time margin between testing for lockability and the actual call to ast_waitfor_n.
> 
> 3. As a sanity check, make sure that the memmove call is correct. If it is, I may be tempted to add some sort of ARRAY_REMOVE macro which could be used here and in other places where we are using memmove to essentially remove an element from an array.
> 
> 
> This addresses bug 14057.
>     http://bugs.digium.com/view.php?id=14057
> 
> 
> Diffs
> -----
> 
>   /branches/1.4/main/autoservice.c 162886 
> 
> Diff: http://reviewboard.digium.com/r/83/diff
> 
> 
> Testing
> -------
> 
> Since this particular deadlock is not one I know how to reproduce, I just ran some calls through some known code that uses autoservice. In my tests, things worked correctly, but I'm fairly certain I never actually hit the case where the memmove was necessary.
> 
> 
> Thanks,
> 
> Mark
> 
>