[Asterisk-code-review] Resolve race conditions involving Stasis bridges. (asterisk[certified/13.1])
Mark Michelson
asteriskteam at digium.com
Tue Sep 22 12:33:59 CDT 2015
Mark Michelson has uploaded a new change for review.
https://gerrit.asterisk.org/1298
Change subject: Resolve race conditions involving Stasis bridges.
......................................................................
Resolve race conditions involving Stasis bridges.
This resolves two observed race conditions.
First, a bit of background on what the Stasis application does:
1a Creates a stasis_app_control structure. This structure is linked into
a global container and can be looked up using a channel's unique ID.
2a Puts the channel in an event loop. The event loop can exit either
because the stasis_app_control structure has been marked done, or
because of some other factor, such as a hangup. In the event loop, the
stasis_app_control determines if any specific ARI commands need to be
run on the channel and will run them from this thread.
3a Checks if the channel is bridged. If the channel is bridged, then
ast_bridge_depart() is called since channels that are added to Stasis
bridges are always imparted as departable.
4a Unlink the stasis_app_control from the container.
When an ARI command is received by Asterisk, the following occurs
1b A thread is spawned to handle the HTTP request
2b The stasis_app_control(s) that corresponds to the channel(s) in the
request is/are retrieved. If the stasis_app_control cannot be
retrieved, then it is assumed that the channel in question has exited
the Stasis app or perhaps was never in Stasis in the first place.
3b A command is queued onto the stasis_app_control, and the channel's
event loop thread is signaled to run the command.
4b While most ARI commands do nothing further, some, such as adding or
removing channels from a bridge, will block until the command they
issued has been completed by the channel's event loop.
The first race condition that is solved by this patch involves a crash
that can occur due to faulty detection of the channel's bridged status
in step 3a. What can happen is that in step 2a, the event loop may run
the ast_bridge_impart() function to asynchronously place the channel
into a bridge, then immediately exit the event loop because the channel
has hung up. In step 3a, we would detect that the channel was not
bridged and would not call ast_bridge_depart(). The reason that the
channel did not appear to be bridged was that the depart_thread that is
spawned by ast_bridge_impart() had not yet started. That is the thread
where the channel is marked as being bridged. Since we did not call
ast_bridge_depart(), the Stasis application would exit, and then the
channel would be destroyed Then the depart_thread would start up and
try to manipulate the destroyed channel, causing a crash.
The fix for this is to switch from using ast_channel_is_bridged() to
checking the NULLity of ast_channel_internal_bridge_channel() to
determine if ast_bridge_depart() needs to be called. The channel's
internal bridge_channel is set when ast_bridge_impart() is called and
is NULLed by the call to ast_bridge_depart(). If the channel's internal
bridge_channel is non-NULL, then the channel must have been imparted
into the bridge and needs to be departed, even if the actual bridging
operation has not yet started. By departing the channel when necessary,
the thread that is running the Stasis application will block until the
bridge gives the okay that the depart_thread has exited.
The second race condition that is solved by this patch involves a leak
of HTTP handler threads. The problem was that step 2b would successfully
retrieve a stasis_app_control structure. Then step 2a would exit the
channel from the event loop due to a hangup. Steps 3a and 4a would
execute, and then finally steps 3b and 4b would. The problem is that at
step 4b, when attempting to add a channel to a bridge, the thread would
block forever since the channel would never execute the queued command
since it was finished with the event loop. This meant that the HTTP
handling thread would be leaked, along with any references that thread
may have owned (in my case, I was seeing bridges leaked).
The fix for this is to hone in better on when the channel has exited the
event loop. The stasis_app_control structure has an is_done field that
is now set at each point where the channel may exit the event loop. If
step 2b retrieves a valid stasis_app_control structure but the control
is marked as done, then the attempted operation exits immediately since
there will be nothing to service the attempted command.
ASTERISK-25091 #close
Reported by Ilya Trikoz
Change-Id: If66265b73b4c9f8f58599124d777fedc54576628
---
M res/res_stasis.c
M res/stasis/control.c
2 files changed, 7 insertions(+), 3 deletions(-)
git pull ssh://gerrit.asterisk.org:29418/asterisk refs/changes/98/1298/1
diff --git a/res/res_stasis.c b/res/res_stasis.c
index 2da15f1..180dd23 100644
--- a/res/res_stasis.c
+++ b/res/res_stasis.c
@@ -1280,6 +1280,7 @@
/* Check to see if a bridge absorbed our hangup frame */
if (ast_check_hangup_locked(chan)) {
+ control_mark_done(control);
break;
}
@@ -1303,6 +1304,7 @@
if (r < 0) {
ast_debug(3, "%s: Poll error\n",
ast_channel_uniqueid(chan));
+ control_mark_done(control);
break;
}
@@ -1323,6 +1325,7 @@
/* Continue on in the dialplan */
ast_debug(3, "%s: Hangup (no more frames)\n",
ast_channel_uniqueid(chan));
+ control_mark_done(control);
break;
}
@@ -1331,13 +1334,14 @@
/* Continue on in the dialplan */
ast_debug(3, "%s: Hangup\n",
ast_channel_uniqueid(chan));
+ control_mark_done(control);
break;
}
}
}
ast_channel_lock(chan);
- needs_depart = ast_channel_is_bridged(chan);
+ needs_depart = (ast_channel_internal_bridge_channel(chan) != NULL);
ast_channel_unlock(chan);
if (needs_depart) {
ast_bridge_depart(chan);
diff --git a/res/stasis/control.c b/res/stasis/control.c
index 7dc1220..0d5a5f1 100644
--- a/res/stasis/control.c
+++ b/res/stasis/control.c
@@ -713,7 +713,7 @@
{
RAII_VAR(struct stasis_app_command *, command, NULL, ao2_cleanup);
- if (control == NULL) {
+ if (control == NULL || control->is_done) {
return -1;
}
@@ -738,7 +738,7 @@
{
RAII_VAR(struct stasis_app_command *, command, NULL, ao2_cleanup);
- if (control == NULL) {
+ if (control == NULL || control->is_done) {
return -1;
}
--
To view, visit https://gerrit.asterisk.org/1298
To unsubscribe, visit https://gerrit.asterisk.org/settings
Gerrit-MessageType: newchange
Gerrit-Change-Id: If66265b73b4c9f8f58599124d777fedc54576628
Gerrit-PatchSet: 1
Gerrit-Project: asterisk
Gerrit-Branch: certified/13.1
Gerrit-Owner: Mark Michelson <mmichelson at digium.com>
More information about the asterisk-code-review
mailing list