[asterisk-dev] [Code Review] Prevent race condition that can cause ast_read to "miss" important frames

Wed Feb 29 13:48:20 CST 2012

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/1779/
-----------------------------------------------------------

Review request for Asterisk Developers.

Summary
-------

There appears to be a potential race condition that can occur when a timingfd is in use on a channel. I'll use the referenced issue to illustrate what is occurring.

1. A SIP channel with progressinband=yes places a call through Asterisk to another phone.
2. The destination phone starts ringing, resulting in Asterisk needing to generate a ringing tone on the inbound channel.
3. The caller hangs up his phone.
4. The SIP channel driver receives a CANCEL from the calling channel, and so it queues a hangup frame.
5. The timer for the tone generator happens to fire shortly after.
6. ast_waitfor_n() is called by app_dial.
7. ast_waitfor_n() says the timingfd for the caller channel has data to be read.
8. Whoever called ast_waitfor_n() then calls ast_read() on the winning channel.
9. ast_read() notices the channel is being hung up and deactivates the tone generator.
10. ast_read() sees that the channel's fdno is the AST_TIMING_FD.
11. Since the generator has been deactivated, there is no timing function to call, so the channel is unlocked and an ast_null_frame is returned.

This is where the problem happens. Since the calling party has hung up, it will never be the winner of any further ast_waitfor_n() calls. Furthermore, that hangup that was queued has been "lost" since there happened to be two sources of data ready at the same time. As far as I can tell, this problem will only occur if a timing fd is in use because otherwise the alertpipe will properly be read even if multiple sources trigger a read at the same time. Note that while SIP was used here, and the situation was canceling a ringing call, I suspect that any time that a timer is being actively used a hangup could be missed. This means that hanging up during file playback or hanging up an analog channel while ringing could have the same effect.

My solution to this problem is at step 9. In ast_settimeout(), if the rate is 0, the timingfunc is NULL, and the current chan->fdno is AST_TIMING_FD, then we set chan->fdno to -1. This allows us to bypass steps 10 and 11 above and move further into ast_read(), which will properly return NULL since the channel has been hung up.

I'm putting this on review board since I need to know for sure that this is safe to do. Hanging up is not the only thing that can deactivate a generator or cause ast_settimeout() to get called with a 0 rate and NULL timingfunc. As far as I know though, by setting the timeout to 0 like this, you are essentially attempting to say "I'm done with whatever timing-related task I was doing, so don't allow the AST_TIMING_FD to trigger reads any more." By clearing the channel's fdno like this, you are accomplishing this goal.

Please let me know if there are any potential pitfalls to this approach.

This addresses bug ASTERISK-19223.
    https://issues.asterisk.org/jira/browse/ASTERISK-19223

Diffs
-----

  /tags/1.8.8.1/channels/chan_sip.c 351448 
  /tags/1.8.8.1/main/channel.c 351448 

Diff: https://reviewboard.asterisk.org/r/1779/diff

Testing
-------

I inserted some debug into the path that leads to the error occurring. I determined that without my patch, seeing that debug message print meant a missed hangup. With my patch, I saw the debug message print, but the hangup was not missed.

Thanks,

Mark

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.digium.com/pipermail/asterisk-dev/attachments/20120229/9e5542d4/attachment.htm>