[asterisk-bugs] [JIRA] (ASTERISK-28831) Leaking stasis subscriptions can linger indefinitely and brick Asterisk

Tue Apr 14 10:44:25 CDT 2020

lvl created ASTERISK-28831:
------------------------------

             Summary: Leaking stasis subscriptions can linger indefinitely and brick Asterisk
                 Key: ASTERISK-28831
                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-28831
             Project: Asterisk
          Issue Type: Bug
      Security Level: None
          Components: Core/Stasis
    Affects Versions: 16.9.0
            Reporter: lvl

Split off from ASTERISK-28829. I thought this deserves a separate ticket for clarity.

A scenario such as the one in ASTERISK-28829 causes subscriptions to {{ast_channel_topic_all()}} to linger around indefinitely. These subscriptions come with a dedicated taskprocessor per subscriber and will receive *all* events from *all* active channels.

After running an affected Asterisk for a while (depending on how frequently your scenario occurs), you'll end up with hundreds of these lingering subscriptions. Running "core show taskprocessors" will show hundreds of "stasis/p:channel:all" taskprocessors, each with millions of processed events.

At some point, Asterisk will be so busy delivering events to all these lingering subscribers that CPU usage will increase and regular call processing will start to fail.
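
To put a rough number on that: since every subscriber to the all-channels topic receives every event, delivery work grows linearly with the number of lingering subscriptions. A minimal sketch of the arithmetic follows; the function name is hypothetical and this is not Asterisk code.

```c
#include <stdint.h>

/* Hypothetical helper, not part of Asterisk: number of deliveries caused
 * by publishing `events` messages to a topic that has `subscribers`
 * subscriptions, each of which receives every message. */
uint64_t fanout_deliveries(uint64_t events, uint64_t subscribers)
{
    return events * subscribers;
}
```

With, say, 300 lingering subscribers, every 1000 channel events turn into 300000 deliveries that nobody consumes.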

It's currently pretty hard to discover what's going on. If you have chan_pjsip configured to reject calls when its taskprocessor is overloaded, you would see "Taskprocessor overload alert", but only at the debug level. The generic "Taskprocessor '%s' triggered the high water alert." message will also trigger, but again only at the debug level.

Unless you know exactly where to look, your Asterisk will shortly become completely unresponsive to everything that depends on stasis/task processors (pretty much everything), without any warnings at all.

I propose that, at the very least, we add more noticeable warning messages. For example:

* When a task processor has processed more than X (millions of) items
* When there are more than X (hundreds of) task processors
* When the high water alert is reached (for a sustained period)
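
The three checks above could be sketched roughly as follows. This is only an illustration under assumptions: {{struct tps_snapshot}}, {{struct tps_limits}}, {{tps_check()}} and all threshold values are hypothetical and do not correspond to the real taskprocessor structures or API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of one task processor's statistics. */
struct tps_snapshot {
    const char *name;         /* e.g. a "stasis/p:channel:all" processor */
    uint64_t processed;       /* total items processed so far */
    unsigned int queue_depth; /* items currently queued */
    unsigned int high_water;  /* high water mark for the queue */
};

/* Hypothetical, configurable thresholds (the "X" values in the list). */
struct tps_limits {
    uint64_t max_processed;       /* warn above this many processed items */
    size_t max_processors;        /* warn above this many task processors */
    unsigned int sustained_ticks; /* consecutive checks at/above high water */
};

enum tps_warning {
    TPS_WARN_NONE       = 0,
    TPS_WARN_PROCESSED  = 1 << 0,
    TPS_WARN_COUNT      = 1 << 1,
    TPS_WARN_HIGH_WATER = 1 << 2
};

/* Run one periodic check over all snapshots and return a bitmask of
 * warnings. ticks_over[i] carries state between calls: the number of
 * consecutive checks in which processor i was at or above high water. */
unsigned int tps_check(const struct tps_snapshot *snaps, size_t count,
                       const struct tps_limits *limits,
                       unsigned int *ticks_over)
{
    unsigned int warnings = TPS_WARN_NONE;
    size_t i;

    if (count > limits->max_processors) {
        warnings |= TPS_WARN_COUNT;
    }

    for (i = 0; i < count; i++) {
        if (snaps[i].processed > limits->max_processed) {
            warnings |= TPS_WARN_PROCESSED;
        }
        if (snaps[i].queue_depth >= snaps[i].high_water) {
            ticks_over[i]++;
        } else {
            ticks_over[i] = 0; /* alert cleared, restart the count */
        }
        if (ticks_over[i] >= limits->sustained_ticks) {
            warnings |= TPS_WARN_HIGH_WATER;
        }
    }
    return warnings;
}
```

A caller would run this from a periodic task and log at WARNING level for each bit set in the result, so the condition is visible without enabling debug output.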

Ideally, we would also prevent this scenario altogether, because even if the root cause of ASTERISK-28829 is found and fixed, there might be more scenarios like it. For example:

* Have a stasis subscription automatically detect that no one is really listening anymore

.. but I am unsure how hard this would be.
