[asterisk-dev] [Code Review] Prevent race condition that can cause ast_read to "miss" important frames

Thu Mar 1 10:28:46 CST 2012

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/1779/
-----------------------------------------------------------

(Updated March 1, 2012, 10:28 a.m.)

Review request for Asterisk Developers.

Changes
-------

I've taken a completely different approach on this thanks to some collaboration with Richard Mudgett. For some reason I had it in my head that if a channel was using a timingfd, then that meant it had no alertpipe. This is not the case!

The problem here is actually a bit simpler than I had made it out to be. When the call is canceled in step 4 of the situation described, an AST_CONTROL_HANGUP is queued. When this hangup is queued, the SIP monitor thread writes to the channel's alertpipe to indicate that there's a frame to be pulled off the queue. The ringback generator trips at around the same time.

So in app_dial, ast_waitfor_n() says the calling channel has data to read. This is because we both have a frame to read from the queue, plus we have the timer that has fired to say that there's audio to write.

Now ast_read() is called. In ast_read(), the generator gets deactivated because the incoming channel's softhangup is set. We then read from the alertpipe, then see that the timingfd is what got triggered. Within the timingfd code block, we return an ast_null_frame.

Now, do you notice anything strange about that? Why did we read from the alertpipe at all if the timingfd is what triggered the read? By reading from the alertpipe but not grabbing a frame off the channel's readq, we have essentially thrown away the hangup frame. Since the alertpipe has been cleared and the generator has been deactivated, it means that further calls to ast_waitfor_n() will not ever result in the incoming channel winning.

In this patch, I have simply moved the read of the alertpipe to be adjacent to the part where we pull a frame from the frame queue. This seems correct to me because the alertpipe and readq are very closely related; reading from one should always mean reading from the other as well.

I have confirmed with this change that the reported issue does not occur, and I have confirmed that calls in general are still working.

Summary
-------

There appears to be a potential race condition that can occur when a timingfd is in use on a channel. I'll use the referenced issue to illustrate what is occurring.

1. A SIP channel with progressinband=yes places a call through Asterisk to another phone.
2. The destination phone starts ringing, resulting in Asterisk needing to generate a ringing tone on the inbound channel.
3. The caller hangs up his phone.
4. The SIP channel driver receives a CANCEL from the calling channel, and so it queues a hangup frame.
5. The timer for the tone generator happens to fire shortly after.
6. ast_waitfor_n() is called by app_dial.
7. ast_waitfor_n() says the timingfd for the caller channel has data to be read.
8. Whoever called ast_waitfor_n() then calls ast_read() on the winning channel.
9. ast_read() notices the channel is being hung up and deactivates the tone generator.
10. ast_read() sees that the channel's fdno is the AST_TIMING_FD.
11. Since the generator has been deactivated, there is no timing function to call, so the channel is unlocked and an ast_null_frame is returned.

This is where the problem happens. Since the calling party has hung up, it will never be the winner of any further ast_waitfor_n() calls. Furthermore, that hangup that was queued has been "lost" since there happened to be two sources of data ready at the same time. As far as I can tell, this problem will only occur if a timing fd is in use because otherwise the alertpipe will properly be read even if multiple sources trigger a read at the same time. Note that while SIP was used here, and the situation was canceling a ringing call, I suspect that any time that a timer is being actively used a hangup could be missed. This means that hanging up during file playback or hanging up an analog channel while ringing could have the same effect.

My solution to this problem is at step 9. In ast_settimeout(), if the rate is 0, the timingfunc is NULL, and the current chan->fdno is AST_TIMING_FD, then we set chan->fdno to -1. This allows us to bypass steps 10 and 11 above and move further into ast_read(), which will properly return NULL since the channel has been hung up.

I'm putting this on review board since I need to know for sure that this is safe to do. Hanging up is not the only thing that can deactivate a generator or cause ast_settimeout() to get called with a 0 rate and NULL timingfunc. As far as I know though, by setting the timeout to 0 like this, you are essentially attempting to say "I'm done with whatever timing-related task I was doing, so don't allow the AST_TIMING_FD to trigger reads any more." By clearing the channel's fdno like this, you are accomplishing this goal.

Please let me know if there are any potential pitfalls to this approach.

This addresses bug ASTERISK-19223.
    https://issues.asterisk.org/jira/browse/ASTERISK-19223

Diffs (updated)
-----

  /branches/1.8/main/channel.c 357621 

Diff: https://reviewboard.asterisk.org/r/1779/diff

Testing
-------

I inserted some debug into the path that leads to the error occurring. I determined that without my patch, seeing that debug message print meant a missed hangup. With my patch, I saw the debug message print, but the hangup was not missed.

Thanks,

Mark

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.digium.com/pipermail/asterisk-dev/attachments/20120301/328b4a53/attachment.htm>