[asterisk-dev] [Code Review] 4135: Resolve race condition that can result in ARI apps not being notified of transfers

Tue Nov 4 15:44:44 CST 2014

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/4135/
-----------------------------------------------------------

(Updated Nov. 4, 2014, 9:44 p.m.)

Review request for Asterisk Developers.

Changes
-------

Because of what Richard pointed out in the previous review, the approach I have taken was not workable due to the race condition not being resolved in parking code.

Going by the approaches listed in the original description, I have shifted from approach 4 to approach 2. This means there is an API change in the way that transfers are published. Before this change, the call to publish a transfer was a single thing that created the transfer message and published. Now this has been separated into functions to create the transfer message and to publish it. For blind transfers, we take the approach of creating a minimal transfer message and fill in the details as we get them. Then we publish when we're ready to do so. For attended transfers, we create a minimal transfer message as well, but we have helper functions to complete the transfer message depending on how the transfer ends up progressing. With this approach, snapshots of bridges are taken early in the transfer process so that when it comes time to publish the transfer, we publish details as they existed when the transfer was initiated.

Repository: Asterisk

Description (updated)
-------

During blind transfer testing, it was noticed that tests were failing occasionally because the ARI blind transfer event was not being sent. After investigating, I detected a race condition in the blind transfer code. When blind transferring a single channel, the actual transfer operation (i.e. removing the transferee from the bridge and directing them to the proper dialplan location) is queued onto the transferee bridge channel. After queuing the transfer operation, the blind transfer Stasis message is published. At the time of publication, snapshots of the channels and bridge involved are created. The ARI subscriber to the blind transfer Stasis message then attempts to determine if the bridge or any of the involved channels are subscribed to by ARI applications. If so, then the blind transfer message is sent to the applications. The way that the ARI blind transfer message handler works is to first see if the transferer channel is subscribed to. If not, then iterate over all the channel IDs in the bridge snapshot and determine if any of those are subscribed to. In the test we were running, the lone transferee channel was subscribed to, so an ARI event should have been sent to our application. Occasionally, though, the bridge snapshot did not have any channels IDs on it at all. Why?

The problem is that since the blind transfer operation is handled by a separate thread, it is possible that the transfer will have completed and the channels removed from the bridge before we publish the blind transfer Stasis message. Since the blind transfer has completed, the bridge on which the transfer occurred no longer has any channels on it, so the resulting bridge snapshot has no channels on it. Through investigation of the code, I found that attended transfers can have this issue too for the case where a transferee is transferred to an application.

To fix this problem, I thought of four possible fixes:
1) Let the thread that actually performs the blind transfer publish the Stasis message.
2) Get the bridge snapshot before queuing the blind transfer operation.
3) Publish the blind transfer Stasis message before queuing the blind transfer operation.
4) Hold the bridge lock while queuing the blind transfer operation and publishing the blind transfer Stasis message.

Option 1 is clearly the "best" option, but it also is made nearly impossible due to the way that bridge channel operations are queued. Bridge channel operations use frames, which require that their payload be copyable using memcpy(). Including any sort of pointer is a no-no. So I would be forced to come up with some inane method of representing multiple channels and bridges in a frame in order to do this.

Option 2 is slightly more workable. Currently, there are functions to publish blind and attended transfers that require bridges and channels, not snapshots. Changing the API to accommodate snapshots is possible, but it is more widespread than I would like, and it changes the API. It also runs the slight risk of publishing "stale" data, though that is not likely.

Option 3 is easiest to implement, but it runs the (very slight) risk that we could publish that a blind transfer happened successfully when in fact the attempt to queue the blind transfer operation failed. I almost went with this, but I felt like the possibility of lying in our events was a bad thing to do.

Option 4 is my least favorite, but within a release felt like the most viable method. The bridge is already locked while publishing the transfer messages, and I didn't see any harm in having the bridge locked while queuing the transfer operation either. By holding the bridge lock the entire time, we prevent the bridge state from changing out from under us while we publish our Stasis message. This means that the bridge will stay stable until we finally unlock after publishing the Stasis message.

EDIT NOV 4: Richard pointed out a problem with Option 4, so from revision 3 onwards, the approach now taken is option 2. See the notes on revision 3 for more details about the changes.

Diffs (updated)
-----

  /branches/12/res/stasis/app.c 426802 
  /branches/12/main/stasis_bridges.c 426802 
  /branches/12/main/cel.c 426802 
  /branches/12/main/bridge_basic.c 426802 
  /branches/12/main/bridge.c 426802 
  /branches/12/include/asterisk/stasis_bridges.h 426802 
  /branches/12/apps/app_queue.c 426802 

Diff: https://reviewboard.asterisk.org/r/4135/diff/

Testing
-------

I have re-run the blind transfer tests that had occasionally failed, and I do not see failures. More importantly, the tests still function properly and the change to locking has not introduced any noticeable bad effects.

Thanks,

Mark Michelson

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.digium.com/pipermail/asterisk-dev/attachments/20141104/3b30bf9d/attachment-0001.html>