[asterisk-bugs] [JIRA] (ASTERISK-21162) Deadlock in cdr.c: cdr_batch_lock vs cdr_pending_lock

Chase Venters (JIRA) noreply at issues.asterisk.org
Mon Feb 25 12:21:18 CST 2013


Chase Venters created ASTERISK-21162:
----------------------------------------

             Summary: Deadlock in cdr.c: cdr_batch_lock vs cdr_pending_lock
                 Key: ASTERISK-21162
                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-21162
             Project: Asterisk
          Issue Type: Bug
      Security Level: None
          Components: CDR/General
    Affects Versions: 1.8.19.0
         Environment: CentOS release 6.3 (Final)
Linux ivr01.XXXXXXXX 2.6.32-279.22.1.el6.i686 #1 SMP Wed Feb 6 00:31:03 UTC 2013 i686 i686 i386 GNU/Linux
Asterisk 1.8.19.0 from EPEL
            Reporter: Chase Venters
            Severity: Critical


I've run into a deadlock in Asterisk cdr.c. I have hundreds of threads blocked like this:

Thread 3 (Thread 0xab1f9b70 (LWP 8232)):
#0  0x0027d416 in __kernel_vsyscall ()
#1  0x0023f019 in __lll_lock_wait () from /lib/libpthread.so.0
#2  0x0023a43e in _L_lock_731 () from /lib/libpthread.so.0
#3  0x0023a34a in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0x00370436 in pthread_mutex_lock () from /lib/libc.so.6
#5  0x08096841 in ast_cdr_detach (cdr=0xb5718bb0) at cdr.c:1354
#6  0x08124b5c in ast_pbx_outgoing_cdr_failed () at pbx.c:8797
#7  0x08134552 in ast_pbx_outgoing_exten (type=0x973ec64 "Local", format=64, data=0x973ec6c, timeout=180000, context=0x973ec82 "pams-out-agi", 

exten=0x973ec92 "s", priority=1, reason=0xab1f9278, synchronous=1, cid_num=0x0, cid_name=0x0, vars=0x973eb58, account=0x8214506 "", 

channel=0xab1f9274) at pbx.c:8877
#8  0x08119143 in fast_originate (data=0x973ebd8) at manager.c:3757
#9  0x08174c97 in dummy_start (data=0x967c770) at utils.c:1012
#10 0x00238a49 in start_thread () from /lib/libpthread.so.0
#11 0x0036063e in clone () from /lib/libc.so.6

ast_cdr_detach() is blocked trying to lock cdr_batch_lock. If it can get cdr_batch_lock, it calls submit_unscheduled_batch(). That function adds 

a 1ms scheduler timeout to submit_scheduled_batch(), then locks cdr_pending_lock and fires a pthread condition.

Separately, the do_cdr() thread is locking cdr_pending_lock and when the condition fires, it calls ast_sched_runq() to start dispatching the 

calls to submit_scheduled_batch(). The problem is that submit_scheduled_batch() calls ast_cdr_submit_batch() with cdr_pending_lock held, and it 

then ast_cdr_submit_batch() tries to lock cdr_batch_lock. A classic lock ordering conflict ensues. 

You can see that do_cdr() is holding cdr_pending_lock and is blocked trying to grab cdr_batch_lock on this thread:

Thread 771 (Thread 0xb744bb70 (LWP 2640)):
#0  0x0027d416 in __kernel_vsyscall ()
#1  0x0023f019 in __lll_lock_wait () from /lib/libpthread.so.0
#2  0x0023a43e in _L_lock_731 () from /lib/libpthread.so.0
#3  0x0023a34a in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0x00370436 in pthread_mutex_lock () from /lib/libc.so.6
#5  0x08096a63 in ast_cdr_submit_batch (do_shutdown=0) at cdr.c:1279
#6  0x08096c43 in submit_scheduled_batch (data=0x0) at cdr.c:1301
#7  0x0815d5be in ast_sched_runq (con=0x95be6b8) at sched.c:640
#8  0x080954dc in do_cdr (data=0x0) at cdr.c:1394
#9  0x08174c97 in dummy_start (data=0x9380038) at utils.c:1012
#10 0x00238a49 in start_thread () from /lib/libpthread.so.0
#11 0x0036063e in clone () from /lib/libc.so.6

Last but not least, here's a thread that is holding cdr_batch_lock but wants cdr_pending_lock:

Thread 759 (Thread 0xb5d9ab70 (LWP 6632)):
#0  0x0027d416 in __kernel_vsyscall ()
#1  0x0023f019 in __lll_lock_wait () from /lib/libpthread.so.0
#2  0x0023a43e in _L_lock_731 () from /lib/libpthread.so.0
#3  0x0023a34a in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0x00370436 in pthread_mutex_lock () from /lib/libc.so.6
#5  0x0809529b in submit_unscheduled_batch () at cdr.c:1315
#6  0x0809698d in ast_cdr_detach (cdr=0xb5031d08) at cdr.c:1370
#7  0x080ae257 in ast_hangup (chan=0xb503ada8) at channel.c:2910
#8  0x080b294a in __ast_request_and_dial (type=0x9689e2e "Local", format=64, requestor=0x0, data=0x9689e36, timeout=<value optimized out>, 

outstate=0xb5d9a278, cid_num=<value optimized out>, cid_name=<value optimized out>, oh=0xb5d9a1b8) at channel.c:5569
#9  0x08134486 in ast_pbx_outgoing_exten (type=0x9689e2e "Local", format=64, data=0x9689e36, timeout=180000, context=0x9689e4c "pams-out-agi", 

exten=0x9689e5c "s", priority=1, reason=0xb5d9a278, synchronous=1, cid_num=0x0, cid_name=0x0, vars=0x9689438, account=0x8214506 "", 

channel=0xb5d9a274) at pbx.c:8821
#10 0x08119143 in fast_originate (data=0x9679fe0) at manager.c:3757
#11 0x08174c97 in dummy_start (data=0x9690610) at utils.c:1012
#12 0x00238a49 in start_thread () from /lib/libpthread.so.0
#13 0x0036063e in clone () from /lib/libc.so.6

I think from looking at the 1.8.21.0-rc1 code that this issue still exists there.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira



More information about the asterisk-bugs mailing list