[asterisk-bugs] [JIRA] (ASTERISK-28625) Playback of local files impacted by large media cache

Kevin Reeves (JIRA) noreply at issues.asterisk.org
Fri Nov 22 10:37:32 CST 2019


    [ https://issues.asterisk.org/jira/browse/ASTERISK-28625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=248849#comment-248849 ] 

Kevin Reeves commented on ASTERISK-28625:
-----------------------------------------

Sean - 

To clarify - we have 24 cores in each server.  The 1 minute load average was hovering at 20-30 but at times would spike 200+.  That load average spike didn't seem to correlate to the delayed playback times.  These servers are dedicated, non virtualized hardware.  Meaning they are not sharing their CPU with any other processes.

My team was actively debugging during this impacted time, so I went back to the logs from the incident and here's an example of what we were seeing.
[2019-11-19 14:45:24] VERBOSE[14196][C-000228fa] pbx.c: Executing [s at nx-dial-session:20] Read("SIP/aaa-002f6456", "__DS_PIN,pb/ac-enter-pin") in new stack
[2019-11-19 14:45:31] VERBOSE[14196][C-000228fa] file.c: <SIP/aaa-002f6456> Playing 'pb/ac-enter-pin.slin' (language 'en')

Nothing else was being done on that channel in-between those 2 lines.

Notice the timing of these 2 lines.  First line shows the Read command from the dialplan. The 2nd line shows the time that the audio started playing.  Notice how there's a 7-second delay between them.  The experience to the caller is dead air during that 7-seconds.  That file pb/ac-enter-pin is on a local disk - not NFS.

Here's another example on the same server later in the day:
[2019-11-19 16:20:02] VERBOSE[16688][C-000002dc] pbx.c: Executing [s at nx-dial-session:20] Read("SIP/aaa-00001905", "__DS_PIN,pb/ac-enter-pin") in new stack
[2019-11-19 16:20:11] VERBOSE[16688][C-000002dc] file.c: <SIP/aaa-00001905> Playing 'pb/ac-enter-pin.slin' (language 'en')

You see a similar delay.

I have lots of examples like this.

At the time, we had probably 1,500+ entries in the media cache.   Not nearly as massive a list as what you entered in your development environment, but it was definitely impacting our servers.

Just to complete the picture, here's an example of the same file being played instantaneously - this is good and what we want to see:
[2019-11-19 16:24:24] VERBOSE[23354][C-000002f7] pbx.c: Executing [s at nx-dial-session-ac:10] Read("SIP/aaa-00001b05", "AC_PIN,pb/ac-enter-pin") in new stack
[2019-11-19 16:24:24] VERBOSE[23354][C-000002f7] file.c: <SIP/aaa-00001b05> Playing 'pb/ac-enter-pin.slin' (language 'en')

It happened during the same timeframe as the others, which highlights the randomness of the issue.  It took us a LOT of brainstorming and debugging before we identified the media cache as the culprit.

It's worth noting that we weren't using the media cache until 4 or 5 days prior to this incident.  So we never had ANYTHING in the media cache until then.  When we started using the media cache we didn't see any issues for a few days.  It wasn't until we ran for a few days and accumulated a media cache of this size that we started to feel the impact.

Once we started to suspect the media cache, we wrote a script to clear it out using the CLI command. But, the command was incredibly slow.  That surprised us too.  

Each time that we called the command to remove an entry it took a few seconds to finish.   We were using: 

media cache delete <URL of the media>

That was an interesting revelation because it made us think the Asterisk internals were probably having to go through the list linearly and perhaps that was very slow.

So I started reading through the source code, and noticed that every audio file reading operation was going through the same code and therefore was scanning through the media cache.

So that’s ultimately how we landed on suspecting this as the issue.

I hope this helps.  Let me know if you have other questions.

> Playback of local files impacted by large media cache
> -----------------------------------------------------
>
>                 Key: ASTERISK-28625
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-28625
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Core/Streams
>    Affects Versions: 16.6.1
>         Environment: CentOS Linux release 7.6.1810
>            Reporter: Kevin Reeves
>            Assignee: Kevin Reeves
>              Labels: patch
>         Attachments: ASTERISK-28625.patch
>
>
> Playback of local files is impacted by media cache.  As the cache grows, performance of common Read and Playback (and presumably other) operations for local files is impacted.
> For example, Playback command of a local file like this:
> exten => s,n,Playback(thankyou)
> Even though it's playing a local file, those operations are searching the media cache. As the cache size grows, the performance of that type of call is impacted.  In fact, we've seen a multi-second delay from the time the Playback command is executed to when the caller hears the audio.
> The delay grows significantly over time as the cache grows bigger.  With thousands of files in the media cache while the server is under heavy load, the delay has been upwards of 15-30 seconds.  Without files in the media cache there is zero delay.
> Here's a debug output snippet that shows the issue.  The Read application is playing a local file named 'pb/ac-enter-pin'.  Notice that media_cache.c line 194 fails to locate the media.
>     -- Executing [s at nx-dial-session:20] Read("SIP/vitel-inbound2-00000002", "__DS_PIN,/var/lib/asterisk/sounds/pb/ac-enter-pin") in new stack
> [2019-11-20 15:22:06] DEBUG[30329][C-00000003]: media_cache.c:194 ast_media_cache_retrieve: Failed to obtain media at '/var/lib/asterisk/sounds/pb/ac-enter-pin'
> [2019-11-20 15:22:06] DEBUG[30329][C-00000003]: media_cache.c:194 ast_media_cache_retrieve: Failed to obtain media at '/var/lib/asterisk/sounds/pb/ac-enter-pin'
> This section of the code should not be used in cases where the file is a local file.



--
This message was sent by Atlassian JIRA
(v6.2#6252)



More information about the asterisk-bugs mailing list