[Asterisk-Dev] HowTo Debug a DeadLock in * was [DEBUG_THREADS in * makefile DeadLocks]

TC trclark at shaw.ca
Sat Nov 29 10:26:05 MST 2003


OK I guess I am talking to myself here :)
This is just for the archives unless anyone wants to point out any mistakes

1) In the asterisk Makefile, uncomment the thread debugging flags
   (remove the '#' in front of the -D flags you want) and rebuild
# Optional debugging parameters
DEBUG_THREADS = #-DDEBUG_THREADS #-DDO_CRASH

2) Apply this patch
   http://bugs.digium.com/bug_view_page.php?bug_id=0000599
   This will allow you to log ast_verbose msgs to the logs so we can
   see what the bt threads are doing in time sequence order
   - this also helps to re-create the core or deadlock situation

3) When you deadlock, don't restart the box or restart *. Instead,
    take the 5 mins while everyone is freaking out to
    attach gdb to the running * process:
    gdb /usr/sbin/asterisk <pid of main * process>
    (you can get the pid from asterisk -r -> "currently running on blah (pid = 9075)",
     or look for the lowest pid after doing ps ax | grep asterisk)

4) After gdb loads do
    info thread
    thread apply all bt
    At the very least, save that bt output to a file
    and post it to bugs.digium.com

5) Identify deadlocked threads by this pattern.
   Note the "__pthread_wait_for_restart_signal" frame:
   that means we are in a wait loop wanting the mutex lock.
  Thread 23 (Thread 3576854 (LWP 2910)):
   #0 0x400c787e in sigsuspend () from /lib/libc.so.6
   #1 0x40022879 in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0
   #2 0x40024a36 in __pthread_alt_lock () from /lib/libpthread.so.0
   #3 0x40020fd2 in pthread_mutex_lock () from /lib/libpthread.so.0
   * Note: apparently not all systems implement
   "__pthread_wait_for_restart_signal",
   so I guess you might just want to scan for at least "pthread_mutex_lock".
   You will usually find more than one of these patterns, because once a
   thread is deadlocked on a mutex lock, other threads that want the same
   lock will pile up quickly.
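
If you saved the bt output to a file, the scan for "pthread_mutex_lock" can be
automated. Below is a minimal sketch in C, not something that ships with
Asterisk or gdb: find_lock_waiters() is a hypothetical helper name, and it
assumes the gdb layout shown above (a "Thread N (" header line followed by
"#n" frame lines). A thread it reports is only a candidate: a thread can sit
in pthread_mutex_lock for a moment without being deadlocked.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Asterisk or gdb): scan the text of a
 * saved "thread apply all bt" and collect the gdb sequence numbers of the
 * threads that have a pthread_mutex_lock frame on their stack.
 * Returns how many such threads were found; at most max_ids are stored.
 */
int find_lock_waiters(const char *bt, int *ids, int max_ids)
{
    int found = 0;
    int current = -1;   /* sequence number of the thread being scanned */
    int counted = 0;    /* has the current thread been recorded already? */
    const char *line = bt;

    assert(bt != NULL && ids != NULL && max_ids >= 0);

    while (*line != '\0') {
        const char *end = strchr(line, '\n');
        size_t len = end ? (size_t)(end - line) : strlen(line);
        char buf[512];
        size_t n = len < sizeof(buf) - 1 ? len : sizeof(buf) - 1;
        const char *p;
        int id;

        /* Copy one line (truncated if very long) so it can be parsed. */
        memcpy(buf, line, n);
        buf[n] = '\0';
        p = buf + strspn(buf, " \t");   /* skip leading indentation */

        if (sscanf(p, "Thread %d (", &id) == 1) {
            /* New thread header, e.g. "Thread 23 (Thread 3576854 ..." */
            current = id;
            counted = 0;
        } else if (p[0] == '#' && current >= 0 && !counted &&
                   strstr(p, "pthread_mutex_lock") != NULL) {
            /* Frame line of the current thread that mentions the lock call. */
            if (found < max_ids)
                ids[found] = current;
            found++;
            counted = 1;
        }
        line = end ? end + 1 : line + len;
    }
    return found;
}
```

Read the saved file into a string, call find_lock_waiters() on it, and then
look at each reported sequence number by hand in gdb.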


6) Try to identify the first thread that deadlocked.
The gdb sequence number of the threads is not relevant, because threads are re-used.

Look in your log files at the time stamps and try to correlate the
bt THREAD number, e.g. (Thread 23 (Thread 3576854 (LWP 2910))),
to the earliest entry in the log file with that same
THREAD number, e.g. (VERBOSE[3575828]).
Note the FRAME number of the code that called "pthread_mutex_lock ()",
i.e. the frame listed right below it in the bt
(frame numbers are the #0, #1, #2 ... at the start of each line under the bt THREAD header).

Log files are usually in /var/log/asterisk/[messages][debug]

7) Now that we have our potential guilty party as the
   first in line for the lock, do
thread <sequence number> (for the THREAD of interest)
frame <frame number> (for the frame # that called pthread_mutex_lock ())
This should now put us in our * sources right where we call ast_mutex_lock().
Record the name of the lock it was trying to get,
e.g. ast_mutex_lock(&agentlock);

8) Now if we have properly turned on thread debugging we are going to
be able to see into the include/asterisk/lock.h ast_mutex_t struct,
which looks like this
pthread_mutex_t mutex;
char *file;
int lineno;
char *func;
pthread_t thread;

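So now that we have our lock we can see who has it & what we are waiting on.
Do the following gdb cmds
(use "." instead of "->" if the lock is a plain struct rather than a pointer,
e.g. p agentlock.thread)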
p somelockIjustFound->thread
p somelockIjustFound->file
p somelockIjustFound->func
p somelockIjustFound->lineno

This is the guilty code that is holding the lock that we want
to look at
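
To make it clearer why those four fields point at the holder of the lock, here
is a minimal sketch in C of a mutex wrapper that records its owner each time
the lock is taken. This is NOT the code from include/asterisk/lock.h: the
dbg_mutex_* names are made up for illustration, and only the field layout
follows the struct shown above.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Illustrative only: same fields as the struct above, hypothetical names. */
typedef struct dbg_mutex {
    pthread_mutex_t mutex;
    const char *file;    /* source file of the current holder */
    int lineno;          /* line where the holder took the lock */
    const char *func;    /* function where the holder took the lock */
    pthread_t thread;    /* thread id of the current holder */
} dbg_mutex_t;

/* Remaining fields are zero-initialized by the designated initializer. */
#define DBG_MUTEX_INIT { .mutex = PTHREAD_MUTEX_INITIALIZER }

static int dbg_mutex_lock_at(dbg_mutex_t *m, const char *file,
                             int lineno, const char *func)
{
    int res;

    assert(m != NULL);
    res = pthread_mutex_lock(&m->mutex);
    if (res == 0) {
        /* We own the lock now, so it is safe to record who we are. */
        m->file = file;
        m->lineno = lineno;
        m->func = func;
        m->thread = pthread_self();
    }
    return res;
}

static int dbg_mutex_unlock(dbg_mutex_t *m)
{
    assert(m != NULL);
    /* Clear the owner info while we still hold the lock. */
    m->file = NULL;
    m->lineno = 0;
    m->func = NULL;
    return pthread_mutex_unlock(&m->mutex);
}

/* Capture the call site automatically, so callers just write dbg_mutex_lock(&l). */
#define dbg_mutex_lock(m) dbg_mutex_lock_at((m), __FILE__, __LINE__, __func__)
```

With a wrapper like this, a thread stuck waiting on the lock never overwrites
the fields, so whatever gdb prints for file, func, lineno and thread belongs
to the thread that currently holds it.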


9) Now comes the hard part ...
Why is this code in that thread, file, function, lineno
not releasing the fscking lock!
We now have to scour the code, checking all places where that lock is
set & released, looking for
hmm ...
1) places where there is a lock hierarchy and
locks are set and released in a different order
(i.e. the same rule as for locking rows in a sql db: lock & release in the same
order everywhere)
2) places where a lock is held toooo long:
   across for/while loops, before longer running function calls etc
    (this is the same rule as sql db transactions: get in, get out quick)
3) Do not mutex lock in a critical section where we might receive O/S signals
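
For point 1) above, one common way to enforce "same order everywhere" when two
locks are needed at once is to pick the order from something fixed, such as
the lock addresses. Below is a minimal sketch in C using plain pthread mutexes.
lock_pair(), unlock_pair(), struct pair_job and pair_worker() are hypothetical
names, and this is a general technique, not something taken from the *
sources.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Take two different mutexes in a fixed global order (lowest address first),
 * no matter in which order the caller names them. */
static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    assert(a != b);
    if ((uintptr_t)a > (uintptr_t)b) {
        pthread_mutex_t *tmp = a;
        a = b;
        b = tmp;
    }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}

/* Unlock order does not matter for deadlock avoidance. */
static void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    pthread_mutex_unlock(b);
}

/* Small demo job: bump a shared counter while holding both locks. */
struct pair_job {
    pthread_mutex_t *first;
    pthread_mutex_t *second;
    long *counter;
    int iterations;
};

static void *pair_worker(void *arg)
{
    struct pair_job *job = arg;
    int i;

    for (i = 0; i < job->iterations; i++) {
        lock_pair(job->first, job->second);
        (*job->counter)++;          /* keep the locked region short */
        unlock_pair(job->first, job->second);
    }
    return NULL;
}
```

Two threads running pair_worker() with the locks named in opposite order would
be a classic deadlock candidate if each simply locked "first" then "second".
Because lock_pair() sorts them, both threads end up taking the locks in the
same order.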

...OK I am out of steam.
If some veteran * developer could now pick this up
& tell us what else to look for specific to * mutex lock areas





