[asterisk-dev] Proper use of locking.
Michael Cargile
mcargile at explido.us
Wed Jul 18 15:52:21 CDT 2007
For a while now I have been trying to hunt down the reason why the
processor load on our Vicidial systems spikes frequently. It normally
stays at around 2 on a 4 core server. It can however jump up as high as
45 which is not good for call quality. Upon investigation I figured out
that it is the Asterisk process (version 1.2.21.1) that is doing this.
For those who do not know what Vicidial is or how it works, Vicidial
is a call center dialing application that uses Asterisk. Basically a
bunch of call center agents sit in their own respective meetme waiting
for calls. Vicidial dials a number of lines per available agent. When a
call picks up, Vicidial determines which agent to hand the call to, and
drops the call into the meetme for that agent. When an agent wants to
transfer a call to another agent, the new agent is ripped from his
meetme and placed in the other agent's so that they can both talk to the
person on the phone. Once the first agent is done talking to the second
he is pulled from that meetme and placed into a free one. In total there
are about 150 meetmes per box. Generally I have only about 30-40 agents
on these boxes.
To try and hunt down what was causing my load spikes I compiled
Asterisk with full debugging and ran Vicidial as normal. I found that
there was a correlation between lots of calls picking up and the load
spikes. Also during this period of time app_meetme complains of
deadlocks. Thinking that I had a deadlocking issue on my hands I poured
through the code and was less than thrilled with what I found. The
problem has nothing to do with deadlocks (We actually found one but it
could not actually happen during normal Vicidial operation because it
only occurs when dynamically creating new conferences and Vicidial does
not dynamically create them). Now I have looked somewhat into the
version of app_meetme in 1.4 as well as the svn trunk and some of these
problems have been fix but some others have not. I have also seen these
problems in a number of other places through out Asterisk. The problems
are as follows:
1. There is only one lock in all of app_meetme (version 1.2) and that is
on the linked list that holds the configurations for for each
conference. When something needs to be changed in a single conference,
such as adding another channel, all of the conferences get locked. This
is not that big of a deal when you have one 1 or 2 conferences but when
you have 10 to 15 it becomes a somewhat of a problem and when you have
150 with channels being added and removed all the time it is a big
problem. Sure using a single lock does a great job of helping to prevent
deadlocks, but it would be far better to use thread safe programming
techniques which do not need a lock. For example uning a thread safe
linked list that allows reading and traversing the list without locking
and only requires locks for insertion and deletion, and a lock for each
one of the conferences when something needs to be change on it. Not to
mention that it would be even better to convert the linked list to a
thread safe hash that would allow much faster access on a higher number
of meetmes.
2. There are locks around file I/O, mainly writing to the log file. File
I/O is very slow. Locking and unlocking should happen as quickly as
possible. You do not want to keep a lock while things are happening that
potentially have to wait on anything, especially not on something that
could be based on another lock, like concurrent access to a file!
Locking, performing file I/O with the locked data, then unlocking should
be avoided whenever possible. It is far better to lock, make a copy of
the data, unlock, then perform file I/O with the copy of the data,
especially when it is something like writing to a log file. Making a
copy in ram is far faster than file I/O.
This gets even worse when doing things like throwing manager events or
writing to the cli. Here is where we have to start talking about
deadlocks. This appears even in the svn version.
3. Last we checked Asterisk locks were non-reentrant, meaning a single
thread cannot lock on the same lock twice. This is a problem that seems
to be currently being worked on but I am not aware if it has been fixed
as of yet. This means we need to be careful when locking. A deadlock we
fixed was caused by a function locking the one lock, then calling a
function that locks on the same lock.
Now I consider much of this to be pretty simple to understand, but
based on the pervasiveness of these problems throughout Asterisk, I
think that a good locking practices section should be added to the
Asterisk coding guidelines. It should address not only these issues but
other ones as well like:
1. Using finer granular locks rather then monster locks that block all
functionality of a module for each tiny operation.
2. Correct ordering of locking and unlocking (lock a,b,c then unlock
c,b,a) to avoid cross-locks.
thread 1 locking a -> ? -> b
thread 2 locking b -> ? -> a
can cause both threads to wait on each other with no chance of releasing
their locks.
3. Make sure locking order is kept to avoid deadlocks even when it seems
pointless to do so. For instance it is wise to not only lock a and c but
b as well even though b does not need to be locked. Not doing this can
sometimes lead to deadlocks. In general chains of dependent locks have
to always be used in the same order. If partial chains are used in some
places you have to make sure that the calling order does not lead to
crosses over multiple functions that use the same locks (circular
dependency locks).
thread 1 locking a -> ? -> b
thread 2 locking b -> ? -> c
thread 3 locking c -> ? -> a
4. Good ways of thinking through the locking that is going on. Sometimes
this can be very difficult especially when dealing with many many locks
and different types of locks. Drawing nesting diagrams can help greatly
here.
5. Above all stressing that Asterisk is a very large, multi-threaded
application, and that poorly locked code can have far reaching effects
in other areas of Asterisk where you would not expect it. This can make
debugging problems extremely difficult and aggravating. It can also have
a massive impact on the systems performance such as the case with
app_meetme.
I am more than willing to help write such a section, but it will have
to wait till I fix the issues with the 1.2, 1.4, and svn versions of
app_meetme.
Michael Cargile
Explido Software USA Inc.
www.explido.us
More information about the asterisk-dev
mailing list