[asterisk-dev] Pain Points For Large Scale Instance Provisioning

Joshua Elson joshelson at gmail.com
Wed Oct 21 13:01:09 CDT 2020


Missed you all at DevCon this year! Happy to chime in on this one, having
had some experience with relatively large fleets of Asterisk servers for
various purposes. Certainly seen a few themes...

* Dialplan versioning and deployment. I have relied on static
dialplans with HTTP or func_odbc calls for a majority of larger
deployments. Most of this is managed via git, and tagged for release via
one of a few pipeline processes. The process of actually releasing it and
ensuring the proper version is running can be tricky. We often want to test
new dialplan on a small subset of servers in a fleet, so managing
multiple versions in production is important. We've built companion apps,
made Asterisk servers git deployment targets with post-deployment hooks for
reloads, used rsync, etc... to work around this, but all of it feels a
little inelegant.
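
As a rough illustration of the shape that hook tends to take (the hostnames,
paths and rsync/ssh layout here are made up, not a description of any
particular production setup):

#!/usr/bin/env python3
"""Push a tagged dialplan checkout to a canary subset of nodes and reload.
Illustrative sketch only; adjust paths and fleet inventory to taste."""
import subprocess

CANARY_NODES = ["ast-01.example.net", "ast-02.example.net"]   # hypothetical
DIALPLAN_SRC = "./dialplan/"                # git checkout of extensions*.conf
DIALPLAN_DST = "/etc/asterisk/dialplan/"    # #include'd from extensions.conf

def deploy(node: str) -> None:
    # Push only the dialplan include directory, then reload just the dialplan
    # so live calls and other modules are left alone.
    subprocess.run(["rsync", "-az", "--delete", DIALPLAN_SRC,
                    f"{node}:{DIALPLAN_DST}"], check=True)
    subprocess.run(["ssh", node, "asterisk", "-rx", "dialplan reload"],
                   check=True)

if __name__ == "__main__":
    for node in CANARY_NODES:
        deploy(node)
        print(f"deployed and reloaded {node}")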

* Docker/k8s official support. Most of our Asterisk builds are customized
in one fashion or another. In the past, that meant building our own
packages and deploying with Puppet/Terraform/ugly bash scripts, but
increasingly, we've been relying on Docker to deploy, tag, and release,
even if that has just meant running the Docker daemon on top of a "standard"
Linux install. This is vastly easier to orchestrate and manage than
building and deploying custom packages or distributing source around and
recompiling. It'd be nice to have this available out of the box a little
more easily, and targeting a base Docker image that is much, much smaller
than CentOS or Debian would be ideal. Our Docker image runs to several
gigabytes just for a base Asterisk build; something we could build on Alpine
that would dramatically shrink the footprint would be desirable. Kubernetes is a
bit newer to most stacks, but it's increasingly being asked for. The
ephemeral nature of networking and IP addressing, at least without
additional orchestration tooling, is a barrier to running effectively in
k8s.
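
To make that concrete: the workaround for the addressing problem is usually a
small entrypoint that discovers the pod's address at start-up and renders it
into a pjsip include before Asterisk comes up. Something along these lines
(the env var, file paths, and the assumption that pjsip.conf #include's the
generated file are all illustrative):

#!/usr/bin/env python3
"""Container entrypoint sketch: template the pod IP into a pjsip transport
include, then exec Asterisk in the foreground so orchestrator signals
(SIGTERM on pod shutdown, etc.) reach it directly."""
import os
import socket

# Pod IP injected via the Kubernetes downward API (assumed env var), with a
# fallback to whatever the hostname resolves to inside the pod.
pod_ip = os.environ.get("POD_IP") or socket.gethostbyname(socket.gethostname())

TEMPLATE = """\
[transport-udp]
type=transport
protocol=udp
bind=0.0.0.0
external_media_address={ip}
external_signaling_address={ip}
"""

with open("/etc/asterisk/pjsip_transport.conf", "w") as f:   # hypothetical path
    f.write(TEMPLATE.format(ip=pod_ip))

os.execvp("asterisk", ["asterisk", "-f", "-vvv"])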

* Cluster awareness of peer instances/endpoints. In cases where there is no
external SIP registrar, it is desirable to be able to round-robin
registrations or otherwise distribute endpoints between cluster nodes.
Finding those endpoints from the other nodes for anything beyond basic call
features can be quite difficult and usually requires external application code. Using a shared
realtime database with ps_contacts, for instance, doesn't work the way one
might expect in this use case, so we end up writing lots of fairly ugly
dialplan to account for intra-cluster communications. Similar considerations
apply to dynamic insertion/removal of peer Asterisk instances.
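
As an example of what that application code ends up doing: when a call lands
on the "wrong" node, look up which node actually holds the registration and
forward the call there. The ps_contacts column names below follow the stock
realtime schema as I understand it (reg_server is populated when a contact is
written), but the DSN, endpoint name, and dial-string convention are invented
for illustration.

#!/usr/bin/env python3
"""Sketch: find which cluster node holds an endpoint's registration by
reading the shared ps_contacts table, so intra-cluster routing doesn't have
to live entirely in dialplan."""
from typing import Optional

import psycopg2  # assumption: the shared realtime DB is PostgreSQL

def node_for_endpoint(endpoint: str) -> Optional[str]:
    """Return the systemname of the node that last registered `endpoint`."""
    with psycopg2.connect("dbname=asterisk") as conn:        # hypothetical DSN
        with conn.cursor() as cur:
            cur.execute(
                "SELECT reg_server FROM ps_contacts "
                "WHERE endpoint = %s ORDER BY expiration_time DESC LIMIT 1",
                (endpoint,),
            )
            row = cur.fetchone()
    return row[0] if row else None

if __name__ == "__main__":
    node = node_for_endpoint("1001")
    if node:
        # Hand back something the dialplan (or an ARI app) can dial, e.g. a
        # trunk named after the peer node -- the naming convention is made up.
        print(f"PJSIP/1001@trunk-to-{node}")
    else:
        print("no active registration for 1001")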

* AMI/ARI aggregation. We've spent a lot of dev cycles providing a unified
view to our application layers that doesn't require app code to have
knowledge of node placement inside Asterisk clusters. Often, this has
required recreating and aggregating things like device state, notifications
for call events, and mechanisms for invoking third-party call control. The
AMI firehose coming off of 100 busy instances can overwhelm almost any
application without some quite impressive-looking EventFilter gyrations,
even if you're scoped to the minimum event classes.
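
For flavour, the aggregation layer boils down to something like this (hosts
and credentials are placeholders; a real version also needs reconnect logic,
TLS, and server-side EventFilter entries in manager.conf to keep the volume
sane):

#!/usr/bin/env python3
"""Sketch of an AMI aggregator: log in to every node, tag each event with the
node it came from, and funnel everything into one stream so application code
never needs to know about node placement."""
import queue
import socket
import threading

NODES = ["10.0.0.11", "10.0.0.12"]        # hypothetical cluster members
AMI_PORT = 5038
USER, SECRET = "aggregator", "changeme"   # placeholders

events: "queue.Queue[dict]" = queue.Queue()

def pump(host: str) -> None:
    sock = socket.create_connection((host, AMI_PORT))
    f = sock.makefile("rwb")
    f.readline()                                     # banner line
    f.write((f"Action: Login\r\nUsername: {USER}\r\n"
             f"Secret: {SECRET}\r\nEvents: call\r\n\r\n").encode())
    f.flush()
    frame: dict = {}
    for raw in f:                                    # frames are CRLF "Key: Value" lines
        line = raw.decode(errors="replace").strip()
        if line:
            key, _, value = line.partition(": ")
            frame[key] = value
        elif frame:                                  # blank line ends a frame
            frame["SourceNode"] = host               # tag with originating node
            events.put(frame)
            frame = {}

for node in NODES:
    threading.Thread(target=pump, args=(node,), daemon=True).start()

while True:                                          # one unified event stream
    ev = events.get()
    print(ev.get("SourceNode"), ev.get("Event"), ev.get("Uniqueid", ""))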

* Database dependencies/performance. We've more or less had to acknowledge
that direct ODBC links from Asterisk for things like CEL/CDR don't scale
well at all. Having dozens to hundreds of nodes writing to databases
directly, even when DBs are properly sized, tends to just cause unexpected
issues. Clustered databases add another layer of complexity for Asterisk,
especially when you want to do maintenance or switch active write nodes.
Asterisk lockups due to failures in DB instances are also still a problem.
Early solutions tended to be to run haproxy locally on the Asterisk instances
and control database connectivity from there, but more recent solutions have
been to use AMI and a messaging pipeline to push CDR, CEL, etc... to
ZMQ/Kafka/Kinesis for upstream processing.
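
The messaging approach looks roughly like this on the producing side (this
assumes cdr_manager is enabled so CDRs show up as AMI "Cdr" events, and uses
kafka-python; the broker address and topic name are placeholders):

#!/usr/bin/env python3
"""Sketch: turn AMI Cdr events into JSON messages on Kafka so the database
loader runs downstream and a slow or failed DB never blocks the Asterisk
nodes themselves."""
import json

from kafka import KafkaProducer   # assumption: kafka-python is installed

producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092"],               # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode(),
)

def handle_ami_event(event: dict) -> None:
    """Call this for every AMI frame coming out of the aggregator above."""
    if event.get("Event") != "Cdr":
        return
    record = {
        "node": event.get("SourceNode"),
        "src": event.get("Source"),
        "dst": event.get("Destination"),
        "duration": event.get("Duration"),
        "billsec": event.get("BillableSeconds"),
        "disposition": event.get("Disposition"),
        "uniqueid": event.get("UniqueID"),
    }
    producer.send("asterisk.cdr", record)             # hypothetical topic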

Well.. that's probably enough for one email. :) Certainly not complaining
about the state of things nor implying these are Asterisk issues or bugs.
These are just major areas of development effort that we expected to
undertake, and did end up undertaking, to make systems work at a scale of
hundreds of millions of minutes a month.

J.

On Wed, Oct 21, 2020 at 2:44 AM Jaco Kroon <jaco at uls.co.za> wrote:

> Hi,
>
> * in the asterisk cli, display the name of the instance to which it is attached;
> in other words, if the instance being controlled has "systemname => bar" on a
> host with hostname "foo", then instead of the prompt being:
>
> foo*CLI>
>
> Either:
>
> foo*bar*CLI>
>
> or simply:
>
> bar*CLI>
>
> Would be great.  This is actually something I should be able to attend to.
>
> Kind Regards,
> Jaco
>
> On 2020/10/21 10:35, Jaco Kroon wrote:
>
> > Hi,
> >
> > On 2020/10/20 23:32, Michael Cargile wrote:
> >
> >> Towards the end of DevCon, Matt asked if there were any pain points
> >> for provisioning large numbers of Asterisk instances and I mentioned I
> >> would talk to my colleague who handles such things. He provided this
> >> list:
> >>
> >> * Sanity checks within Asterisk at startup and module reload. These
> >> include:
> >>      -- Asterisk making sure it has the proper file permissions for all
> >> directories that it is configured to read from / write to
> > If this is implemented, this would need to be configurable.  We had a
> > check in our init script on Gentoo.  This was switched off.  For most
> > deploys it was not an issue ... but I think the record we clocked was ~6
> > hours of startup time just checking /var/spool/asterisk and sub-paths.
> > Yes, the script could have been improved to not check recursively and
> > not descend into sub folders for recordings ...
> >
> > Still, very good and valid suggestion.
> >
> >>      -- verification that things like audio files called from
> >> Background / Playback are actually there
> >>   If these checks fail, throw an error at startup / reload rather than
> >> when the resource is actually accessed, so these problems can be
> >> addressed sooner
> > Some of these names are determined dynamically, especially when multiple
> > formats are involved.  One thing that would be nice is at least syntax
> > validation, e.g. for missing or extraneous brackets and the like.  For
> > example ... Set(foo=${bar) <-- obviously invalid, missing }.  A
> > nice-to-have in my opinion, though.
> >
> >> * asterisk.conf directory variables for things like audio files are
> >> not always honored, requiring symlinks as a workaround (though this
> >> might be the OpenSuSE build of Asterisk causing issues)
> > Never encountered this.  And we make heavy use of this (e.g. running
> > multiple, generally < 100, instances on the same physical host and using
> > astspooldir => /var/spool/asterisk.uls).  If this was not being honoured
> > we'd have issues that we'd only be able to describe as insanely critical.
> >
> >> * Reliable module reloading without core restarts
> >>      Example: Client lets their SSL certificate lapse on an Asterisk
> >> server and they only figure this out when their
> >>      agents attempt to log in using WebRTC clients. They have
> >> dozens or even hundreds of customer calls in queue,
> >>      but their agents cannot log in. On Asterisk 13 we cannot fix the
> >> SSL certs without a full restart of Asterisk
> >>      which drops these calls. A reload of the http module does not fix
> >> this.
> > Sean mentioned this is fixed; looking at the diff, an http module reload
> > will now be adequate.  And PJSIP, from what I can tell, doesn't suffer
> > this issue.  chan_sip loads certificates at accept() time.
> >
> > While trying to confirm chan_sip I did find that the setting of sip_reloading
> > = FALSE happens in an odd place ... will check that out a bit later.
> >
> > And then I'd like to also add:
> >
> > * reduction of idle-instance CPU usage (which generally seems to run at
> > ~0.7% of a core when asterisk is doing "nothing" - obviously
> > variable based on CPU clock speed).  Not a major problem when running
> > one or two instances, but does create an artificial upper limit, and
> > there are measurable power implications when running hundreds of
> > instances in the same rack.
> >
> > Kind Regards,
> > Jaco
> >
> >
> >>
>

