[asterisk-dev] [Code Review] 4135: Resolve race condition that can result in ARI apps not being notified of transfers

Fri Nov 7 12:31:20 CST 2014

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/4135/
-----------------------------------------------------------

(Updated Nov. 7, 2014, 6:31 p.m.)

Review request for Asterisk Developers.

Changes
-------

Address opticron's feedback.

Repository: Asterisk

Description
-------

During blind transfer testing, we noticed that tests were failing occasionally because the ARI blind transfer event was not being sent. After investigating, I found a race condition in the blind transfer code.

When blind transferring a single channel, the actual transfer operation (i.e. removing the transferee from the bridge and directing them to the proper dialplan location) is queued onto the transferee bridge channel. After queuing the transfer operation, the blind transfer Stasis message is published. At the time of publication, snapshots of the channels and bridge involved are created.

The ARI subscriber to the blind transfer Stasis message then attempts to determine whether ARI applications are subscribed to the bridge or to any of the involved channels. If so, the blind transfer message is sent to those applications. The ARI blind transfer message handler first checks whether the transferer channel is subscribed to. If not, it iterates over all the channel IDs in the bridge snapshot and checks whether any of those are subscribed to. In the test we were running, the lone transferee channel was subscribed to, so an ARI event should have been sent to our application. Occasionally, though, the bridge snapshot did not have any channel IDs on it at all. Why?
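To make the handler logic described above concrete, here is a minimal sketch in C. It is not the code in res/stasis/app.c; the struct and function names are hypothetical stand-ins, and the real snapshot and subscription types are considerably richer. It only illustrates why an empty bridge snapshot leads to no notification when the transferer itself is not subscribed to.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified bridge snapshot: just a list of channel IDs. */
struct bridge_snapshot_sketch {
    const char *channel_ids[8];
    size_t num_channels;
};

/* Hypothetical table of channel IDs that some ARI application subscribed to. */
struct subscriptions_sketch {
    const char *channel_ids[8];
    size_t count;
};

bool is_subscribed(const struct subscriptions_sketch *subs, const char *id)
{
    for (size_t i = 0; i < subs->count; i++) {
        if (strcmp(subs->channel_ids[i], id) == 0) {
            return true;
        }
    }
    return false;
}

/* Check the transferer first, then every channel ID in the bridge snapshot. */
bool should_notify(const struct subscriptions_sketch *subs,
                   const char *transferer_id,
                   const struct bridge_snapshot_sketch *snap)
{
    if (is_subscribed(subs, transferer_id)) {
        return true;
    }
    for (size_t i = 0; i < snap->num_channels; i++) {
        if (is_subscribed(subs, snap->channel_ids[i])) {
            return true;
        }
    }
    /* An empty snapshot falls through to here: nobody is notified. */
    return false;
}
```

With a subscription on the transferee only, a snapshot that still lists the transferee yields a notification, while an empty snapshot yields none.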

The problem is that since the blind transfer operation is handled by a separate thread, it is possible that the transfer will have completed and the channels will have been removed from the bridge before we publish the blind transfer Stasis message. Since the blind transfer has completed, the bridge on which the transfer occurred no longer has any channels on it, so the resulting bridge snapshot has no channels on it. Through investigation of the code, I found that attended transfers can have this issue too in the case where a transferee is transferred to an application.
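The two possible orderings can be sketched deterministically, without real threads. Everything below is a hypothetical model (a bridge reduced to a channel count, with the interleaving chosen by a flag), not Asterisk code; it only shows how the snapshot contents depend on whether the bridge channel thread runs before or after the snapshot is taken.

```c
#include <stddef.h>

/* Hypothetical model: a bridge is reduced to the number of channels in it. */
struct bridge_sketch {
    size_t num_channels;
};

struct snapshot_sketch {
    size_t num_channels;
};

struct snapshot_sketch take_snapshot(const struct bridge_sketch *bridge)
{
    struct snapshot_sketch snap = { bridge->num_channels };
    return snap;
}

/* What the bridge channel thread does once it handles the queued transfer. */
void run_queued_transfer(struct bridge_sketch *bridge)
{
    if (bridge->num_channels > 0) {
        bridge->num_channels--;
    }
}

/*
 * Models "queue the transfer, then snapshot and publish".
 * worker_wins selects which thread gets to run first after the queue.
 */
struct snapshot_sketch publish_after_queue(struct bridge_sketch *bridge,
                                           int worker_wins)
{
    struct snapshot_sketch snap;

    /* The transfer operation has been queued at this point. */
    if (worker_wins) {
        run_queued_transfer(bridge); /* transferee leaves before the snapshot */
    }
    snap = take_snapshot(bridge);
    if (!worker_wins) {
        run_queued_transfer(bridge); /* transferee leaves after the snapshot */
    }
    return snap;
}
```

When the worker wins, the returned snapshot has zero channels, which is the failing case seen in the tests.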

I considered four possible fixes for this problem:
1) Let the thread that actually performs the blind transfer publish the Stasis message.
2) Get the bridge snapshot before queuing the blind transfer operation.
3) Publish the blind transfer Stasis message before queuing the blind transfer operation.
4) Hold the bridge lock while queuing the blind transfer operation and publishing the blind transfer Stasis message.

Option 1 is clearly the "best" option, but it is nearly impossible because of the way that bridge channel operations are queued. Bridge channel operations use frames, which require that their payload be copyable using memcpy(). Including any sort of pointer is a no-no. So I would be forced to come up with some inane method of representing multiple channels and bridges in a frame in order to do this.
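The memcpy() constraint can be illustrated with a small sketch. The struct, the size limit, and the queue_payload_copy() helper below are hypothetical and are not the frame API; the point is only that a payload made of fixed-size buffers stays valid after a byte-for-byte copy, whereas a payload holding pointers to channels or bridges would not carry the objects (or references to them) along with it.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAX_NAME 80 /* hypothetical size limit for this example */

/* A "flat" payload: no pointers, so a raw byte copy is a complete copy. */
struct flat_transfer_payload {
    char exten[SKETCH_MAX_NAME];
    char context[SKETCH_MAX_NAME];
};

/* Hypothetical stand-in for queuing a frame: the payload is duplicated
 * byte for byte, as memcpy() would do. The caller owns the returned copy. */
void *queue_payload_copy(const void *payload, size_t len)
{
    void *copy = malloc(len);
    if (copy) {
        memcpy(copy, payload, len);
    }
    return copy;
}

struct flat_transfer_payload *make_queued_transfer(const char *exten,
                                                   const char *context)
{
    struct flat_transfer_payload local;

    memset(&local, 0, sizeof(local));
    snprintf(local.exten, sizeof(local.exten), "%s", exten);
    snprintf(local.context, sizeof(local.context), "%s", context);

    /* 'local' goes out of scope on return. The copy stays valid only
     * because the struct contains the data itself, not pointers to it. */
    return queue_payload_copy(&local, sizeof(local));
}
```

A payload that needed to carry several channels and bridges would have to be flattened in a similar way, which is what makes option 1 awkward.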

Option 2 is slightly more workable. Currently, there are functions to publish blind and attended transfers that require bridges and channels, not snapshots. Changing the API to accommodate snapshots is possible, but it is more widespread than I would like, and it changes the API. It also runs the slight risk of publishing "stale" data, though that is not likely.
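As a rough sketch of option 2, the snapshot is taken before the transfer operation is queued, so it no longer matters how quickly the bridge channel thread acts. The types and functions below are a hypothetical model (a bridge reduced to a channel count) and do not reflect the actual publishing functions in stasis_bridges.h.

```c
#include <stddef.h>

/* Hypothetical model: a bridge is reduced to the number of channels in it. */
struct bridge_sketch {
    size_t num_channels;
};

struct snapshot_sketch {
    size_t num_channels;
};

struct snapshot_sketch take_snapshot(const struct bridge_sketch *bridge)
{
    struct snapshot_sketch snap = { bridge->num_channels };
    return snap;
}

/* What the bridge channel thread does once it handles the queued transfer. */
void run_queued_transfer(struct bridge_sketch *bridge)
{
    if (bridge->num_channels > 0) {
        bridge->num_channels--;
    }
}

/* Returns the snapshot that would be attached to the published message. */
struct snapshot_sketch transfer_with_early_snapshot(struct bridge_sketch *bridge)
{
    /* 1. Capture the bridge state while the transferee is still in it. */
    struct snapshot_sketch snap = take_snapshot(bridge);

    /* 2. Queue the transfer. Worst case modeled here: the worker thread
     *    completes the transfer immediately. */
    run_queued_transfer(bridge);

    /* 3. Publish using the earlier snapshot, not the current bridge. */
    return snap;
}
```

The trade-off is the one noted above: the published snapshot describes the bridge as it was just before the transfer, which could in principle be slightly stale.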

Option 3 is easiest to implement, but it runs the (very slight) risk that we could publish that a blind transfer happened successfully when in fact the attempt to queue the blind transfer operation failed. I almost went with this, but I did not like the possibility of lying in our events.

Option 4 is my least favorite, but within a release it felt like the most viable method. The bridge is already locked while publishing the transfer messages, and I didn't see any harm in having the bridge locked while queuing the transfer operation either. By holding the bridge lock the entire time, we prevent the bridge state from changing out from under us until we unlock after publishing the Stasis message.

EDIT NOV 4: Richard pointed out a problem with Option 4, so from revision 3 onwards, the approach now taken is option 2. See the notes on revision 3 for more details about the changes.

Diffs (updated)
-----

  /branches/12/res/stasis/app.c 426802 
  /branches/12/main/stasis_bridges.c 426802 
  /branches/12/main/cel.c 426802 
  /branches/12/main/bridge_basic.c 426802 
  /branches/12/main/bridge.c 426802 
  /branches/12/include/asterisk/stasis_bridges.h 426802 
  /branches/12/apps/app_queue.c 426802 

Diff: https://reviewboard.asterisk.org/r/4135/diff/

Testing
-------

I have re-run the blind transfer tests that had occasionally failed, and I do not see failures. More importantly, the tests still function properly and the change to locking has not introduced any noticeable bad effects.

Thanks,

Mark Michelson
