[asterisk-dev] [Code Review]: Prevent race condition that can cause ast_read to "miss" important frames

Mark Michelson reviewboard at asterisk.org
Thu Mar 1 18:50:19 CST 2012



> On March 1, 2012, 11:09 a.m., rmudgett wrote:
> > I think only this second diff should go in.  The first was clearing something not directly related to the function's purpose.
> 
> Mark Michelson wrote:
>     I keep going back and forth on this. On one hand, I think you're incorrect. Setting a channel's timing rate to 0 and setting its timingfunc NULL is a way of communicating that the timing fd is irrelevant at this point and so any pending reads on it should not happen. On the other hand, I can how such a change might seem out of place and might be better suited in ast_read() itself in the part near the beginning in the ast_check_hangup() block. I need a bit of time to deliberate on this and I'll see what I come up with.

So after deliberating some more, I think my analysis is indeed correct. Setting the timing rate to 0 and the timingfunc NULL invalidates the fact that the channel supposedly has data to be read due to a timingfd trigger. I'm going to commit both diff 1 and 2. If you can come up with a scenario where diff 1 is harmful, please don't hesitate to revert and explain the situation where it would be problematic.


- Mark


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/1779/#review5703
-----------------------------------------------------------


On March 1, 2012, 10:28 a.m., Mark Michelson wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviewboard.asterisk.org/r/1779/
> -----------------------------------------------------------
> 
> (Updated March 1, 2012, 10:28 a.m.)
> 
> 
> Review request for Asterisk Developers.
> 
> 
> Summary
> -------
> 
> There appears to be a potential race condition that can occur when a timingfd is in use on a channel. I'll use the referenced issue to illustrate what is occurring.
> 
> 1. A SIP channel with progressinband=yes places a call through Asterisk to another phone.
> 2. The destination phone starts ringing, resulting in Asterisk needing to generate a ringing tone on the inbound channel.
> 3. The caller hangs up his phone.
> 4. The SIP channel driver receives a CANCEL from the calling channel, and so it queues a hangup frame.
> 5. The timer for the tone generator happens to fire shortly after.
> 6. ast_waitfor_n() is called by app_dial.
> 7. ast_waitfor_n() says the timingfd for the caller channel has data to be read.
> 8. Whoever called ast_waitfor_n() then calls ast_read() on the winning channel.
> 9. ast_read() notices the channel is being hung up and deactivates the tone generator.
> 10. ast_read() sees that the channel's fdno is the AST_TIMING_FD.
> 11. Since the generator has been deactivated, there is no timing function to call, so the channel is unlocked and an ast_null_frame is returned.
> 
> This is where the problem happens. Since the calling party has hung up, it will never be the winner of any further ast_waitfor_n() calls. Furthermore, that hangup that was queued has been "lost" since there happened to be two sources of data ready at the same time. As far as I can tell, this problem will only occur if a timing fd is in use because otherwise the alertpipe will properly be read even if multiple sources trigger a read at the same time. Note that while SIP was used here, and the situation was canceling a ringing call, I suspect that any time that a timer is being actively used a hangup could be missed. This means that hanging up during file playback or hanging up an analog channel while ringing could have the same effect.
> 
> My solution to this problem is at step 9. In ast_settimeout(), if the rate is 0, the timingfunc is NULL, and the current chan->fdno is AST_TIMING_FD, then we set chan->fdno to -1. This allows us to bypass steps 10 and 11 above and move further into ast_read(), which will properly return NULL since the channel has been hung up.
> 
> I'm putting this on review board since I need to know for sure that this is safe to do. Hanging up is not the only thing that can deactivate a generator or cause ast_settimeout() to get called with a 0 rate and NULL timingfunc. As far as I know though, by setting the timeout to 0 like this, you are essentially attempting to say "I'm done with whatever timing-related task I was doing, so don't allow the AST_TIMING_FD to trigger reads any more." By clearing the channel's fdno like this, you are accomplishing this goal.
> 
> Please let me know if there are any potential pitfalls to this approach.
> 
> 
> This addresses bug ASTERISK-19223.
>     https://issues.asterisk.org/jira/browse/ASTERISK-19223
> 
> 
> Diffs
> -----
> 
>   /branches/1.8/main/channel.c 357621 
> 
> Diff: https://reviewboard.asterisk.org/r/1779/diff
> 
> 
> Testing
> -------
> 
> I inserted some debug into the path that leads to the error occurring. I determined that without my patch, seeing that debug message print meant a missed hangup. With my patch, I saw the debug message print, but the hangup was not missed.
> 
> 
> Thanks,
> 
> Mark
> 
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.digium.com/pipermail/asterisk-dev/attachments/20120302/dec667c3/attachment-0001.htm>


More information about the asterisk-dev mailing list