[asterisk-dev] Bridges, T.38, and other good times

Thu Dec 10 13:52:40 CST 2015

On Sun, Dec 6, 2015 at 7:57 PM, Matthew Jordan <mjordan at digium.com> wrote:

> Hello all -
>
> One of the efforts that a number of developers in the community here at
> Digium have been working on is cleaning up test failures exposed by
> Jenkins [1]. One of these, in particular, has been rather difficult to
> resolve - namely, fax/pjsip/directmedia_reinvite_t38 [2]. This e-mail goes
> over what has been accomplished, and asks some questions on how we might
> try and fix Asterisk under this scenario.
>
> The directmedia_reinvite_t38 test attempts to do the following:
>  (1) UAC1 calls UAC2 through Asterisk, with audio as the media. The dial
> is performed using the 'g' flag, such that UAC2 will continue on if UAC1
> hangs up.
>  (2) UAC1 and UAC2 are configured for direct media. Asterisk sends a
> re-INVITE to UAC1 and UAC2 to initiate direct media.
>  (3) After responding with a 200 OK to the direct media requests, UAC1
> sends a re-INVITE offering T.38.
>  (4) Asterisk sends an INVITE with T.38 to UAC2.
>  (5) UAC2 sends back a 200 OK for T.38; Asterisk sends that to UAC1.
> Asterisk switches out of a direct media bridge to a core bridge.
>  (6) UAC1 hangs up. Asterisk sends a re-INVITE to UAC2 for audio back to
> Asterisk. UAC2 responds with a 200 OK for the audio.
>  (7) Asterisk ejects UAC2 back to the dialplan.
>
> It's important to note that this test never should have passed - an update
> to the test suite "fixed" the test erroneously passing, which led to us
> investigating why the scenario was failing. This test was copied over from
> an identical chan_sip test, which passes.
>
> The PJSIP stack has two issues which make life difficult for it in this
> scenario:
> (1) The T.38 logic is implemented in res_pjsip_t38. While that is _mostly_
> a very good thing - as it keeps all the fax state logic outside of the
> channel driver - we are also a layer removed from interactions that occur
> in the channel driver. That makes it challenging to influence direct media
> checks and other Asterisk/channel interactions.
> (2) Being very asynchronous, requests may be serviced that influence T.38
> state while other interactions are occurring in the core. Informing the
> core of what has occurred can have more race conditions than what occurs in
> chan_sip, which is single threaded.
>
> The first bug discovered when the test was investigated was an issue in
> step (2). We never actually initiated a direct media re-INVITE. This was
> due to res_pjsip_t38 using a frame hook, and not implementing the
> .consume_cb callback. That callback allows a framehook to inform the core
> (and also the bridging framework) of the types of frames that a framehook
> wants to consume. If a framehook needs audio, a direct media bridge will be
> explicitly denied, and - by default - the bridging framework assumes that
> framehooks will want all frames. Another bug that was discovered occurred
> in step (6). When UAC1 sends a BYE request, nothing informed UAC2 that the
> fax had ended - instead, it was merely ejected from the bridge. This meant
> that it kept its T.38 session going, and Asterisk never sent a re-INVITE to
> UAC2. Both of these bugs were fixed by 726ee873a6.
>
> Except, unfortunately, the second bug wasn't really fixed.
>
> 726ee873a6 did the "right" thing by intercepting the BYE request sent by
> UAC1, and queueing up a control frame of type AST_CONTROL_T38_PARAMETERS
> with a new state of AST_T38_TERMINATED. This is supposed to be passed on to
> UAC2, informing it that the T.38 fax has ended, and that it should have its
> media re-negotiated back to the last known state (audio) but also back to
> Asterisk (since we aren't going to be in a bridge any longer).
> Unfortunately, this code was insufficient.
>
> A race condition exists in this case. On the one hand, we've just queued
> up a frame on UAC1's channel to be passed into the bridge, which should get
> tossed onto UAC2's channel. On the other hand, we've just told the bridging
> framework to kill UAC1's channel with extreme prejudice, thereby also
> terminating the bridge and ejecting UAC2 off into the dialplan. In the
> first case, this is an asynchronous, message passing mechanism; in the
> second case, the bridging framework inspects the channel to see if it
> should be hung up on *every frame* and *immediately* starts the
> hangup/shutdown procedure if it knows the channel should die. This is not
> asynchronous in any way. As a result, UAC1 may be hung up and the bridge
> dissolved before UAC2 ever gets its control frame from UAC1.
>
> There were a couple of solutions to this problem that were tried:
> (1) First, I tried to make sure that enqueued control frames were flushed
> out of a channel and passed over the bridge when a hangup was detected. In
> practice, this was incredibly cumbersome - some control frames should get
> tossed, others need to be preserved. What was worse was the sheer number of
> places the bridge dissolution can be triggered. While it wasn't hard to
> make sure we flushed frames off an ejected channel into a bridge, it was
> nigh impossible to ensure that this occurred every single time before the
> other channels were ejected. Again, the bridging framework is ridiculously
> - perhaps ludicrously - aggressive in tossing channels out of a bridge once
> it has decided the bridge should be dissolved.
> (2) Second, I tried to make the bridge ejection process asynchronous. This
> was done by enqueuing another control frame onto the channel being ejected;
> when it leaves, it flushes its control frames into the bridge. When the
> 'ejection' control frame gets passed into the bridging core, that causes
> the bridge to dissolve. This worked well in some scenarios, and it also
> guaranteed that the T.38 control frame would be delivered. Unfortunately,
> in other cases, it caused all of the channels to hang out in the bridge ...
> permanently. Again, there are a lot of edge cases in the bridging code that
> deal with channels being kicked out of a bridge, and the bridge
> dissolving... and it was more than I could chew on.
>
> The long and short of it is: while Asterisk 12+ has a nice bridging
> framework that hides or eliminates a lot of the horrendous
> masquerade/transfer code, as well as the 'triple infinite loop' in
> features/channel that existed in Asterisk 11-, it is still ridiculously
> complex and prone to breaking spectacularly in subtle ways. Not to mention
> both (1) and (2) end up being massive changes to the design that are risky
> in an LTS (no one likes it when a channel can't be hung up.)
>
> So those ideas were scratched.
>
> The next solution was to try a bridge mixing technology that specifically
> managed the T.38 state. This worked ... really well. Incredibly well, in
> fact. It avoided all of the previous problems because, unlike external
> modules or even certain places in the bridging core, a bridge technology is
> guaranteed by the core to be called in a synchronized fashion when any of
> the following occurs:
> (1) When a bridge technology is chosen
> (2) When that technology is started
> (3) When that bridge has a channel added
> (4) When that bridge has a channel removed
> (5) When that technology is stopped
> All of which covers the necessary places to know when a channel has hung
> up, and gives us a place where we can safely inform the other channels
> before the bridging framework starts doing mean things. bridge_t38 was the
> result [3]. It managed a bit of T.38 state for the two channels in a core
> bridge that were in a T.38 fax, and, when one of them leaves, it informed
> the other channel that it should end its T.38 fax.
>
> Problem solved.
>
> \o/
>
> Not quite.
>
> After merging [3] in f42d22d3a1, we noticed that the masquerade test [4]
> started to fail. That's a really, really bad sign. The masquerade 'super
> test' was originally written to stress test masquerades in Asterisk 1.8 and
> 11. It constructs a chain of 300 Local channels, then optimizes them all
> down to a single pair of 'real' channels. In Asterisk 12+, masquerades were
> eliminated in this scenario, but we instead have a series of incredibly
> complex Local channel optimization-caused bridge/swaps/merges that kick off
> as the Local channels collapse and merge their bridges down to one. It's a
> great "canary in the coal mine" test, as when it fails, it almost certainly
> means you've introduced a deadlock into one of the more complex operations
> in Asterisk - regardless of the versions.
>
> And lo and behold, we had.
>
> Local channels are weird. One of the 'fun things' they do is 'help' T.38
> along by passing along a channel query option for T.38 state. This lets us
> do ridiculous things like make sure a T.38 fax works across a Local channel
> chain (and is covered by the fax/sip/local_channel_t38_queryoption test).
> Unfortunately, the bridge_t38 module had to query for T.38 state in its
> compatible callback - this allowed it to determine the current state of
> T.38 on the channels in the bridge to see if it needed to be activated.
> Unfortunately, in a 300 Local channel chain, that means reaching across 300
> bridges - simultaneously - locking bridges, bridge_channels, channels,
> PVTs, and the entire world in the process. Since the bridge lock was
> already held in the compatible callback, this caused a locking inversion
> (no surprise there), deadlocking the whole thing.
>
> This is not a trivial locking situation to resolve. Even if we unlock the
> bridge, we're still liable to deadlock merely by trying to lock 300 bridges
> simultaneously. (There may even be another bug in here, but it is hardly
> worth trying to find or fix at this point.) And we can't remove the query
> option code in chan_local, as T.38 faxes will no longer work across Local
> channels.
>
> As an aside, if there's a lesson in all this, it is that synchronous code
> in a heavily multi-threaded environment is bad. Message passing may be
> harder to write, but it is far easier to maintain.
>
> Anyway, as a result, I've reverted the bridge_t38 module in 75c800eb28.
>
> So what do we do now?
>
> The crux of this problem is that the bridging framework does not have a
> standard way of informing a channel when it has joined or - more
> importantly - left a bridge. Direct media has its own mechanism managed by
> the RTP engine - so it works around this. However, we have a number of
> scenarios where "things happen" in a bridge that involve state on a
> channel and - right now - we don't have a unified way of handling it. In
> addition to T.38, we also have channels being put on hold, DTMF traversing
> a channel, and more. Often, the channel driver has this state - but
> instead, we have a lot of 'clean up' logic being added to the bridging core
> to handle these situations.
>
> As I see it, we really only have two options here:
> (1) Add code to the bridging framework to clean up T.38 on a channel when
> it leaves. This is kind of annoying, as it will happen on every channel
> when it leaves, regardless of whether or not the channel even supports T.38.
> (2) Add a new channel technology callback that a bridge can use to inform
> a channel driver that it is being ejected from a bridge. This would give us
> a single place to put cleanup logic that has to happen in a channel driver
> when it is no longer bridged.
>
> I'm not sure those two options will work, exactly, but they're the best
> options that I can think of after exhausting lots of other code changes in
> the bridging core. If someone has other suggestions, I'd be more than happy
> to entertain them.
>
> Matt
>
>
> [1] https://jenkins.asterisk.org/
> [2]
> https://jenkins.asterisk.org/jenkins/job/periodic-asterisk-master/75/testReport/junit/%28root%29/AsteriskTestSuite/tests_fax_pjsip_directmedia_reinvite_t38/
> [3] https://gerrit.asterisk.org/#/c/1761/
> [4]
> https://jenkins.asterisk.org/jenkins/job/periodic-asterisk-master/80/testReport/junit/%28root%29/AsteriskTestSuite/tests_masquerade/
>

I think option 1 is what needs to be done because it doesn't introduce an
API/ABI change to v13.

Like DTMF begin/end and Hold/Unhold, the T.38 state has begin/end events
that must be completed in a similar manner when the channel leaves the
bridging system.  Also like DTMF and hold, T.38 would need to be completed
if someone masqueraded the channel out of an application like
ReceiveFAX/SendFAX using AMI Redirect.  Thus the channel would need to keep
track of its T.38 state just like it currently keeps track of DTMF and hold.
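
A minimal sketch of that bookkeeping, assuming made-up names throughout
(struct sketch_channel, the SKETCH_END_* flags and sketch_complete_pending()
are illustrative only and are not existing Asterisk types or API). It only
shows the idea of remembering which begin events are outstanding and
simulating the matching end events when the channel leaves:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real Asterisk types. */
enum sketch_end_event {
    SKETCH_END_DTMF   = 1 << 0,  /* simulated DTMF end */
    SKETCH_END_UNHOLD = 1 << 1,  /* simulated unhold */
    SKETCH_END_T38    = 1 << 2,  /* simulated T.38 termination */
};

struct sketch_channel {
    bool dtmf_in_progress;  /* DTMF begin seen, end not yet seen */
    bool on_hold;           /* hold seen, unhold not yet seen */
    bool t38_active;        /* T.38 negotiated, not yet terminated */
};

/*
 * Called when the channel leaves the bridging system. Returns a bitmask of
 * the end events that had to be simulated and clears the tracked state, so
 * calling it a second time simulates nothing.
 */
unsigned int sketch_complete_pending(struct sketch_channel *chan)
{
    unsigned int simulated = 0;

    assert(chan != NULL);

    if (chan->dtmf_in_progress) {
        simulated |= SKETCH_END_DTMF;
        chan->dtmf_in_progress = false;
    }
    if (chan->on_hold) {
        simulated |= SKETCH_END_UNHOLD;
        chan->on_hold = false;
    }
    if (chan->t38_active) {
        simulated |= SKETCH_END_T38;
        chan->t38_active = false;
    }
    return simulated;
}
```

A channel that has no T.38 session simply never sets t38_active, so the
extra check costs very little for channels that do not support T.38.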

I cannot think of a generic way for the bridging framework or masquerade
to end these three things without getting cumbersome with something like
datastores being added and removed whenever there is a pending end event.

Trying to get the two locations where the end events need to be simulated
refactored into a single routine would be difficult because both locations
have very strict requirements to operate correctly.  A compromise could be
made by extracting the respective code into their own co-located routines.
Then it might be easier to remember to keep both versions in sync.  Either
that or place dire warning comments to keep both places synced.
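
The co-located layout could look roughly like the sketch below. Everything
in it is hypothetical: the demo_* names are invented, and the difference
between the two routines (one records ends sent toward the bridge peer, the
other records ends queued on the channel itself) is only an assumed
placeholder for the real, stricter requirements of each location.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flags for begin events that still lack a matching end. */
#define DEMO_PENDING_DTMF (1u << 0)
#define DEMO_PENDING_HOLD (1u << 1)
#define DEMO_PENDING_T38  (1u << 2)
#define DEMO_PENDING_ALL \
    (DEMO_PENDING_DTMF | DEMO_PENDING_HOLD | DEMO_PENDING_T38)

struct demo_channel {
    unsigned int pending;       /* begin events awaiting an end */
    unsigned int sent_to_peer;  /* ends written toward the bridge peer */
    unsigned int queued_local;  /* ends queued on the channel itself */
};

/*
 * Bridge-leave version.
 * KEEP IN SYNC with demo_masquerade_end_pending() directly below: any
 * event handled here must also be handled there.
 */
unsigned int demo_bridge_leave_end_pending(struct demo_channel *chan)
{
    unsigned int handled;

    assert(chan != NULL);
    handled = chan->pending & DEMO_PENDING_ALL;
    chan->sent_to_peer |= handled;   /* assumed: ends go toward the peer */
    chan->pending &= ~handled;
    return handled;
}

/*
 * Masquerade version.
 * KEEP IN SYNC with demo_bridge_leave_end_pending() directly above.
 */
unsigned int demo_masquerade_end_pending(struct demo_channel *chan)
{
    unsigned int handled;

    assert(chan != NULL);
    handled = chan->pending & DEMO_PENDING_ALL;
    chan->queued_local |= handled;   /* assumed: ends are queued locally */
    chan->pending &= ~handled;
    return handled;
}
```

With both routines next to each other, a unit test that feeds the same
pending mask to each and compares the returned masks would make any drift
between the two visible.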

Richard