[asterisk-bugs] [JIRA] (ASTERISK-30081) app_confbridge: Channel can join wrong bridge due to race condition
N A (JIRA)
noreply at issues.asterisk.org
Wed May 25 20:01:49 CDT 2022
[ https://issues.asterisk.org/jira/browse/ASTERISK-30081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
N A updated ASTERISK-30081:
---------------------------
Description:
This doesn't happen all the time, but it happens often enough. The actual timing seems to be random - possible race condition.
This reproduces very frequently - not every time, but about half the time or more - with this scenario:
- A and B are in a confbridge
- All in a split second, but in this order:
- A leaves the bridge
- B gets channel redirected out of the bridge
- (confbridge is now empty, ceases to exist momentarily)
- A re-enters the bridge.
- C enters the bridge.
C ends up in the bridge correctly, but A ends up in the wrong bridge.
Two channels are supposed to get into the same ConfBridge, but they each end up by themselves, obviously not bridged together.
It works correctly when the timing is different, and there ISN'T a brief moment where the bridge is "EMPTY" in the middle. Therefore, you ONLY see this in a trace for a bad, broken bridge:
{noformat}
[2022-05-25 15:32:49] DEBUG[23368][C-0000002a]: app_confbridge.c:1570 alloc_playback_chan: Created announcer channel 'CBAnn/oe7-0000007a;1' to conference bridge 'oe7'
[2022-05-25 15:32:49] -- Channel CBAnn/oe7-0000007a;2 joined 'softmix' base-bridge <66c254c2-ef39-4252-9cc4-405fb58d1bbc>
{noformat}
I have several call debugs of "good" and "bad" calls that broke, and only in the bad/broken calls can I find "EMPTY" in the app_confbridge debug messages.
The race condition can be detected by checking if the bridge exists later - note in this annotated example just how exactly right the threads need to interleave to trigger this. If the timing is just right, this will reproduce frequently. Otherwise, it is unlikely to occur.
{noformat}
CLI output with core set debug 1 app_confbridge:
Thread exiting bridge:
[2022-05-26 00:53:42] DEBUG[28878][C-0000000b]: confbridge/conf_state.c:82 conf_change_state: Changing conference 'oe7' state from SINGLE to EMPTY
Thread joining bridge:
[2022-05-26 00:53:42] DEBUG[28852][C-0000000b]: app_confbridge.c:1751 join_conference_bridge: Trying to find conference bridge 'oe7'
[2022-05-26 00:53:42] DEBUG[28852][C-0000000b]: app_confbridge.c:1902 join_conference_bridge: Conference 'oe7' already exists in container.
Thread exiting bridge:
[2022-05-26 00:53:42] DEBUG[28878][C-0000000b]: app_confbridge.c:1528 conf_ended: CONF END
(this is the conf_ended function)
Thread joining bridge:
[2022-05-26 00:53:42] DEBUG[28852][C-0000000b]: confbridge/conf_state.c:82 conf_change_state: Changing conference 'oe7' state from EMPTY to SINGLE
[2022-05-26 00:53:42] DEBUG[28852][C-0000000b]: app_confbridge.c:1352 conf_update_user_mute: User PJSIP/ATAxLB2-00000003 is unmuted: user:0 system:0.
[2022-05-26 00:53:42] WARNING[28852][C-0000000b]: app_confbridge.c:1943 join_conference_bridge: Due to a race condition, conf oe7 disappeared before we could enter it
[2022-05-26 00:53:42] DEBUG[28852][C-0000000b]: confbridge/conf_state.c:82 conf_change_state: Changing conference 'oe7' state from SINGLE to EMPTY
^ Here, we successfully detect the race condition and abort.
{noformat}
More details:
core show channels shows them both executing the ConfBridge application on the same ConfBridge, but only one channel is in the bridge according to confbridge list and confbridge show.
A backtrace confirms the other channel has somehow entered a *different* bridge by itself.
On occasion, channels joining the same ConfBridge at nearly the same time appear to get into different bridges as opposed to the same bridge, even though they have been requested to go into the same ConfBridge.
A backtrace confirms that for channels with the same confbridge name, confbridge->bridge is different for them here (they should be the same):
https://github.com/asterisk/asterisk/blob/master/apps/app_confbridge.c#L2848
I am marking this "major" since such a race condition can completely break the usage of ConfBridge, and suggests that perhaps the locking here is insufficient: https://github.com/asterisk/asterisk/blob/master/apps/app_confbridge.c#L1736
CLI trace showing bizarre confbridge/bridge/channel disonnance:
https://paste.interlinked.us/zq10e1w9qo.txt
Output of bridge show:
{noformat}
*CLI> bridge show
17378691-4657-41df-a2ba-ff871b5cda79 2916a9dc-9bad-4aea-8fae-fc4ace5ea748 3022d1db-2659-4dad-988f-e8e6994641bb all de65981b-7f2d-420c-8ce4-e2c6120953dd
[2022-05-25 14:24:28] -- Remote UNIX connection
[2022-05-25 14:24:28] -- Remote UNIX connection disconnected
*CLI> bridge show 17378691-4657-41df-a2ba-ff871b5cda79
Id: 17378691-4657-41df-a2ba-ff871b5cda79
Type: base
Technology: softmix
Subclass: base
Creator: ConfBridge
Name: oe6
Video-Mode: none
Video-Source-Id:
Num-Channels: 3
Num-Active: 3
Duration: 00:31:02
Channel: PJSIP/ATAxLB1-00000016
Channel: CBAnn/oe6-00000048;2
Channel: Local/2129 at originate-local-00000049;1
*CLI> bridge show 2916a9dc-9bad-4aea-8fae-fc4ace5ea748
Id: 2916a9dc-9bad-4aea-8fae-fc4ace5ea748
Type: base
Technology: softmix
Subclass: base
Creator: ConfBridge
Name: oe7
Video-Mode: none
Video-Source-Id:
Num-Channels: 2
Num-Active: 2
Duration: 00:30:49
Channel: Local/4002702129 at lines-pre-0000004a;2
Channel: CBAnn/oe7-0000004d;2
*CLI> bridge show 3022d1db-2659-4dad-988f-e8e6994641bb
Id: 3022d1db-2659-4dad-988f-e8e6994641bb
Type: basic
Technology: simple_bridge
Subclass: basic
Creator:
Name:
Video-Mode: none
Video-Source-Id:
Num-Channels: 2
Num-Active: 2
Duration: 00:30:51
Channel: Local/2129 at originate-local-00000049;2
Channel: Local/4002702129 at lines-pre-0000004a;1
*CLI> bridge show de65981b-7f2d-420c-8ce4-e2c6120953dd
Id: de65981b-7f2d-420c-8ce4-e2c6120953dd
Type: base
Technology: softmix
Subclass: base
Creator: ConfBridge
Name: oe7
Video-Mode: none
Video-Source-Id:
Num-Channels: 2
Num-Active: 2
Duration: 00:30:59
Channel: PJSIP/ATAxLB2-00000015
Channel: CBAnn/oe7-0000004c;2
{noformat}
was:
This doesn't happen all the time, but it happens often enough. The actual timing seems to be random - possible race condition.
This reproduces very frequently - not every time, but about half the time or more - with this scenario:
- A and B are in a confbridge
- All in a split second, but in this order:
- A leaves the bridge
- B gets channel redirected out of the bridge
- (confbridge is now empty, ceases to exist momentarily)
- A re-enters the bridge.
- C enters the bridge.
C ends up in the bridge correctly, but A ends up in the wrong bridge.
Two channels are supposed to get into the same ConfBridge, but they each end up by themselves, obviously not bridged together.
It works correctly when the timing is different, and there ISN'T a brief moment where the bridge is "EMPTY" in the middle. Therefore, you ONLY see this in a trace for a bad, broken bridge:
{noformat}
[2022-05-25 15:32:49] DEBUG[23368][C-0000002a]: app_confbridge.c:1570 alloc_playback_chan: Created announcer channel 'CBAnn/oe7-0000007a;1' to conference bridge 'oe7'
[2022-05-25 15:32:49] -- Channel CBAnn/oe7-0000007a;2 joined 'softmix' base-bridge <66c254c2-ef39-4252-9cc4-405fb58d1bbc>
{noformat}
I have several call debugs of "good" and "bad" calls that broke, and only in the bad/broken calls can I find "EMPTY" in the app_confbridge debug messages.
The race condition can be detected by checking if the bridge exists later:
{noformat}
CLI output with core set debug 1 app_confbridge:
Thread exiting bridge:
[2022-05-26 00:53:42] DEBUG[28878][C-0000000b]: confbridge/conf_state.c:82 conf_change_state: Changing conference 'oe7' state from SINGLE to EMPTY
Thread joining bridge:
[2022-05-26 00:53:42] DEBUG[28852][C-0000000b]: app_confbridge.c:1751 join_conference_bridge: Trying to find conference bridge 'oe7'
[2022-05-26 00:53:42] DEBUG[28852][C-0000000b]: app_confbridge.c:1902 join_conference_bridge: Conference 'oe7' already exists in container.
Thread exiting bridge:
[2022-05-26 00:53:42] DEBUG[28878][C-0000000b]: app_confbridge.c:1528 conf_ended: CONF END
(this is the conf_ended function)
Thread joining bridge:
[2022-05-26 00:53:42] DEBUG[28852][C-0000000b]: confbridge/conf_state.c:82 conf_change_state: Changing conference 'oe7' state from EMPTY to SINGLE
[2022-05-26 00:53:42] DEBUG[28852][C-0000000b]: app_confbridge.c:1352 conf_update_user_mute: User PJSIP/ATAxLB2-00000003 is unmuted: user:0 system:0.
[2022-05-26 00:53:42] WARNING[28852][C-0000000b]: app_confbridge.c:1943 join_conference_bridge: Due to a race condition, conf oe7 disappeared before we could enter it
[2022-05-26 00:53:42] DEBUG[28852][C-0000000b]: confbridge/conf_state.c:82 conf_change_state: Changing conference 'oe7' state from SINGLE to EMPTY
^ Here, we successfully detect the race condition and abort.
{noformat}
More details:
core show channels shows them both executing the ConfBridge application on the same ConfBridge, but only one channel is in the bridge according to confbridge list and confbridge show.
A backtrace confirms the other channel has somehow entered a *different* bridge by itself.
On occasion, channels joining the same ConfBridge at nearly the same time appear to get into different bridges as opposed to the same bridge, even though they have been requested to go into the same ConfBridge.
A backtrace confirms that for channels with the same confbridge name, confbridge->bridge is different for them here (they should be the same):
https://github.com/asterisk/asterisk/blob/master/apps/app_confbridge.c#L2848
I am marking this "major" since such a race condition can completely break the usage of ConfBridge, and suggests that perhaps the locking here is insufficient: https://github.com/asterisk/asterisk/blob/master/apps/app_confbridge.c#L1736
CLI trace showing bizarre confbridge/bridge/channel disonnance:
https://paste.interlinked.us/zq10e1w9qo.txt
Output of bridge show:
{noformat}
*CLI> bridge show
17378691-4657-41df-a2ba-ff871b5cda79 2916a9dc-9bad-4aea-8fae-fc4ace5ea748 3022d1db-2659-4dad-988f-e8e6994641bb all de65981b-7f2d-420c-8ce4-e2c6120953dd
[2022-05-25 14:24:28] -- Remote UNIX connection
[2022-05-25 14:24:28] -- Remote UNIX connection disconnected
*CLI> bridge show 17378691-4657-41df-a2ba-ff871b5cda79
Id: 17378691-4657-41df-a2ba-ff871b5cda79
Type: base
Technology: softmix
Subclass: base
Creator: ConfBridge
Name: oe6
Video-Mode: none
Video-Source-Id:
Num-Channels: 3
Num-Active: 3
Duration: 00:31:02
Channel: PJSIP/ATAxLB1-00000016
Channel: CBAnn/oe6-00000048;2
Channel: Local/2129 at originate-local-00000049;1
*CLI> bridge show 2916a9dc-9bad-4aea-8fae-fc4ace5ea748
Id: 2916a9dc-9bad-4aea-8fae-fc4ace5ea748
Type: base
Technology: softmix
Subclass: base
Creator: ConfBridge
Name: oe7
Video-Mode: none
Video-Source-Id:
Num-Channels: 2
Num-Active: 2
Duration: 00:30:49
Channel: Local/4002702129 at lines-pre-0000004a;2
Channel: CBAnn/oe7-0000004d;2
*CLI> bridge show 3022d1db-2659-4dad-988f-e8e6994641bb
Id: 3022d1db-2659-4dad-988f-e8e6994641bb
Type: basic
Technology: simple_bridge
Subclass: basic
Creator:
Name:
Video-Mode: none
Video-Source-Id:
Num-Channels: 2
Num-Active: 2
Duration: 00:30:51
Channel: Local/2129 at originate-local-00000049;2
Channel: Local/4002702129 at lines-pre-0000004a;1
*CLI> bridge show de65981b-7f2d-420c-8ce4-e2c6120953dd
Id: de65981b-7f2d-420c-8ce4-e2c6120953dd
Type: base
Technology: softmix
Subclass: base
Creator: ConfBridge
Name: oe7
Video-Mode: none
Video-Source-Id:
Num-Channels: 2
Num-Active: 2
Duration: 00:30:59
Channel: PJSIP/ATAxLB2-00000015
Channel: CBAnn/oe7-0000004c;2
{noformat}
> app_confbridge: Channel can join wrong bridge due to race condition
> -------------------------------------------------------------------
>
> Key: ASTERISK-30081
> URL: https://issues.asterisk.org/jira/browse/ASTERISK-30081
> Project: Asterisk
> Issue Type: Bug
> Security Level: None
> Components: Applications/app_confbridge
> Affects Versions: 18.9.0
> Reporter: N A
> Severity: Major
> Attachments: app_confbridge.c, wyeed8jbjr.txt, zq10e1w9qo.txt
>
>
> This doesn't happen all the time, but it happens often enough. The actual timing seems to be random - possible race condition.
> This reproduces very frequently - not every time, but about half the time or more - with this scenario:
> - A and B are in a confbridge
> - All in a split second, but in this order:
> - A leaves the bridge
> - B gets channel redirected out of the bridge
> - (confbridge is now empty, ceases to exist momentarily)
> - A re-enters the bridge.
> - C enters the bridge.
> C ends up in the bridge correctly, but A ends up in the wrong bridge.
> Two channels are supposed to get into the same ConfBridge, but they each end up by themselves, obviously not bridged together.
> It works correctly when the timing is different, and there ISN'T a brief moment where the bridge is "EMPTY" in the middle. Therefore, you ONLY see this in a trace for a bad, broken bridge:
> {noformat}
> [2022-05-25 15:32:49] DEBUG[23368][C-0000002a]: app_confbridge.c:1570 alloc_playback_chan: Created announcer channel 'CBAnn/oe7-0000007a;1' to conference bridge 'oe7'
> [2022-05-25 15:32:49] -- Channel CBAnn/oe7-0000007a;2 joined 'softmix' base-bridge <66c254c2-ef39-4252-9cc4-405fb58d1bbc>
> {noformat}
> I have several call debugs of "good" and "bad" calls that broke, and only in the bad/broken calls can I find "EMPTY" in the app_confbridge debug messages.
> The race condition can be detected by checking if the bridge exists later - note in this annotated example just how exactly right the threads need to interleave to trigger this. If the timing is just right, this will reproduce frequently. Otherwise, it is unlikely to occur.
> {noformat}
> CLI output with core set debug 1 app_confbridge:
> Thread exiting bridge:
> [2022-05-26 00:53:42] DEBUG[28878][C-0000000b]: confbridge/conf_state.c:82 conf_change_state: Changing conference 'oe7' state from SINGLE to EMPTY
> Thread joining bridge:
> [2022-05-26 00:53:42] DEBUG[28852][C-0000000b]: app_confbridge.c:1751 join_conference_bridge: Trying to find conference bridge 'oe7'
> [2022-05-26 00:53:42] DEBUG[28852][C-0000000b]: app_confbridge.c:1902 join_conference_bridge: Conference 'oe7' already exists in container.
> Thread exiting bridge:
> [2022-05-26 00:53:42] DEBUG[28878][C-0000000b]: app_confbridge.c:1528 conf_ended: CONF END
> (this is the conf_ended function)
> Thread joining bridge:
> [2022-05-26 00:53:42] DEBUG[28852][C-0000000b]: confbridge/conf_state.c:82 conf_change_state: Changing conference 'oe7' state from EMPTY to SINGLE
> [2022-05-26 00:53:42] DEBUG[28852][C-0000000b]: app_confbridge.c:1352 conf_update_user_mute: User PJSIP/ATAxLB2-00000003 is unmuted: user:0 system:0.
> [2022-05-26 00:53:42] WARNING[28852][C-0000000b]: app_confbridge.c:1943 join_conference_bridge: Due to a race condition, conf oe7 disappeared before we could enter it
> [2022-05-26 00:53:42] DEBUG[28852][C-0000000b]: confbridge/conf_state.c:82 conf_change_state: Changing conference 'oe7' state from SINGLE to EMPTY
> ^ Here, we successfully detect the race condition and abort.
> {noformat}
> More details:
> core show channels shows them both executing the ConfBridge application on the same ConfBridge, but only one channel is in the bridge according to confbridge list and confbridge show.
> A backtrace confirms the other channel has somehow entered a *different* bridge by itself.
> On occasion, channels joining the same ConfBridge at nearly the same time appear to get into different bridges as opposed to the same bridge, even though they have been requested to go into the same ConfBridge.
> A backtrace confirms that for channels with the same confbridge name, confbridge->bridge is different for them here (they should be the same):
> https://github.com/asterisk/asterisk/blob/master/apps/app_confbridge.c#L2848
> I am marking this "major" since such a race condition can completely break the usage of ConfBridge, and suggests that perhaps the locking here is insufficient: https://github.com/asterisk/asterisk/blob/master/apps/app_confbridge.c#L1736
> CLI trace showing bizarre confbridge/bridge/channel disonnance:
> https://paste.interlinked.us/zq10e1w9qo.txt
> Output of bridge show:
> {noformat}
> *CLI> bridge show
> 17378691-4657-41df-a2ba-ff871b5cda79 2916a9dc-9bad-4aea-8fae-fc4ace5ea748 3022d1db-2659-4dad-988f-e8e6994641bb all de65981b-7f2d-420c-8ce4-e2c6120953dd
> [2022-05-25 14:24:28] -- Remote UNIX connection
> [2022-05-25 14:24:28] -- Remote UNIX connection disconnected
> *CLI> bridge show 17378691-4657-41df-a2ba-ff871b5cda79
> Id: 17378691-4657-41df-a2ba-ff871b5cda79
> Type: base
> Technology: softmix
> Subclass: base
> Creator: ConfBridge
> Name: oe6
> Video-Mode: none
> Video-Source-Id:
> Num-Channels: 3
> Num-Active: 3
> Duration: 00:31:02
> Channel: PJSIP/ATAxLB1-00000016
> Channel: CBAnn/oe6-00000048;2
> Channel: Local/2129 at originate-local-00000049;1
> *CLI> bridge show 2916a9dc-9bad-4aea-8fae-fc4ace5ea748
> Id: 2916a9dc-9bad-4aea-8fae-fc4ace5ea748
> Type: base
> Technology: softmix
> Subclass: base
> Creator: ConfBridge
> Name: oe7
> Video-Mode: none
> Video-Source-Id:
> Num-Channels: 2
> Num-Active: 2
> Duration: 00:30:49
> Channel: Local/4002702129 at lines-pre-0000004a;2
> Channel: CBAnn/oe7-0000004d;2
> *CLI> bridge show 3022d1db-2659-4dad-988f-e8e6994641bb
> Id: 3022d1db-2659-4dad-988f-e8e6994641bb
> Type: basic
> Technology: simple_bridge
> Subclass: basic
> Creator:
> Name:
> Video-Mode: none
> Video-Source-Id:
> Num-Channels: 2
> Num-Active: 2
> Duration: 00:30:51
> Channel: Local/2129 at originate-local-00000049;2
> Channel: Local/4002702129 at lines-pre-0000004a;1
> *CLI> bridge show de65981b-7f2d-420c-8ce4-e2c6120953dd
> Id: de65981b-7f2d-420c-8ce4-e2c6120953dd
> Type: base
> Technology: softmix
> Subclass: base
> Creator: ConfBridge
> Name: oe7
> Video-Mode: none
> Video-Source-Id:
> Num-Channels: 2
> Num-Active: 2
> Duration: 00:30:59
> Channel: PJSIP/ATAxLB2-00000015
> Channel: CBAnn/oe7-0000004c;2
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.2#6252)
More information about the asterisk-bugs
mailing list