<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Jul 12, 2018 at 5:01 PM, Salahuddin Ahmed <span dir="ltr"><<a href="mailto:txrubel@gmail.com" target="_blank">txrubel@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><span style="font-family:garamond,serif">Hello,</span></div><div><span style="font-family:garamond,serif"><br></span></div><div><span style="font-family:garamond,serif">Recently I have experienced a deadlock situation in Asterisk Version 13.20.0. According to our gdb analysis there have 3 suspicious thread(3, 11 and 44). Thread 44 push a task to taskpool, thread 11 execute that task and the thread 3 is a
timer event. In thread 11 during a response going to transmit a group
lock has been acquired. and this thread was locked on this point. In
Thread 3 while retransmission timer triggered same group lock again
acquired and try to retransmit response. In this situation asterisk was blocked. <br></span></div><div><span style="font-family:garamond,serif"><br></span></div><div><span style="font-family:garamond,serif">Let me explain according to GDB log,</span></div><div><span style="font-family:garamond,serif"><br></span></div><div><span style="font-family:garamond,serif"><div>Thread-03:</div><div>#4 0x00007f00ba6d9db0 in tsx_send_msg (tsx=0x7eff8bdd4568, tdata=0x7eff892ae668) at ../src/pjsip/sip_transaction.<wbr>c:2122<br></div><div><br></div><div>Thread-11:</div><div>#6 0x00007f00ba6dc4ac in pjsip_tsx_send_msg (tsx=tsx@entry=0x7eff8bdd4568, tdata=tdata@entry=<wbr>0x7eff892ae668) at ../src/pjsip/sip_transaction.<wbr>c:1783</div><div>(gdb) p tsx->transport_flag<br>$2 = 1(TSX_HAS_PENDING_TRANSPORT)<br></div><div><br></div><div>Here
Thread-03 is execute first and try to retransmit a message before acquire a tsx group lock
in Thread-11, because in thread 3 tsx->transport_flag |=
TSX_HAS_PENDING_TRANSPORT has been set, Now if we read this flag from thread 11 in frame 04 then we have found this flag was set by someone before acquiring tsx group lock. That means during this acquiring lock in Thread3, Thread 11
trying to send data on transport. So this might be the case of
deadlock, somehow timer thread 3 execute before thread-11.</div><div> </div></span></div><div><span style="font-family:garamond,serif"><span>Here I enclose Thread-3 and Thread-11 gdb backtrace.</span><br></span></div><div><span style="font-family:garamond,serif"><br></span></div><div><span style="font-family:garamond,serif">Could anyone please help me on this?</span></div><div><span style="font-family:garamond,serif"><br></span></div><div><span style="font-family:garamond,serif"><br></span></div><div><span style="font-family:garamond,serif"> </span>Thread-11:</div><div>#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/<wbr>linux/x86_64/lowlevellock.S:<wbr>135<br>#1 0x00007f00f62bf494 in _L_lock_952 () from /lib/x86_64-linux-gnu/<wbr>libpthread.so.0<br>#2 0x00007f00f62bf2f6 in __GI___pthread_mutex_lock (mutex=0x7eff41197c58) at ../nptl/pthread_mutex_lock.c:<wbr>114<br>#3 0x00007f00b97e49a5 in pj_mutex_lock (mutex=0x7eff41197c58) at ../src/pj/os_core_unix.c:1265<br>#4 0x00007f00b97eace1 in grp_lock_acquire (p=0x7eff41197b48) at ../src/pj/lock.c:290<br>#5 0x00007f00b97ead22 in pj_grp_lock_acquire (grp_lock=<optimized out>) at ../src/pj/lock.c:478<br>#6 0x00007f00ba6dc4ac in pjsip_tsx_send_msg (tsx=tsx@entry=0x7eff8bdd4568, tdata=tdata@entry=<wbr>0x7eff892ae668) at ../src/pjsip/sip_transaction.<wbr>c:1783<br>#7 0x00007f00ba6de952 in pjsip_dlg_send_response (dlg=<optimized out>, tsx=0x7eff8bdd4568, tdata=tdata@entry=<wbr>0x7eff892ae668) at ../src/pjsip/sip_dialog.c:1531<br>#8 0x00007f00bab1c09a in pjsip_inv_send_msg (inv=0x7eff88bdf9b8, tdata=tdata@entry=<wbr>0x7eff892ae668) at ../src/pjsip-ua/sip_inv.c:3231<br>#9 0x00007f0092874b4c in ast_sip_session_send_response (session=session@entry=<wbr>0x7eff89367700, tdata=0x7eff892ae668) at res_pjsip_session.c:983<br>#10 0x00007f002a431841 in answer (data=0x7eff89367700) at chan_pjsip.c:591<br>#11 0x00007f0094b7a490 in sync_task (data=0x7efdbcf8b810) at res_pjsip.c:4268<br>#12 0x00000000005d52ce in ast_taskprocessor_execute (tps=tps@entry=0x267f9b0) at taskprocessor.c:971<br>#13 0x00000000005dcc80 in execute_tasks (data=0x267f9b0) at threadpool.c:1322<br>#14 0x00000000005d52ce in ast_taskprocessor_execute (tps=0x23e0580) at taskprocessor.c:971<br>#15 0x00000000005dd558 in threadpool_execute (pool=<optimized out>) at threadpool.c:351<br>#16 worker_active (worker=<optimized out>) at threadpool.c:1105<br>#17 worker_start (arg=0x7f0080001680) at threadpool.c:1024<br>#18 0x00000000005e6dfa in dummy_start (data=<optimized out>) at utils.c:1238<br>#19 0x00007f00f62bd064 in start_thread (arg=0x7eff67c9b700) at pthread_create.c:309<br>#20 0x00007f00f55a562d in clone () at ../sysdeps/unix/sysv/linux/<wbr>x86_64/clone.S:111<br><br></div><div><br></div><div>Thread-3:</div><div>#0 sanitize_tdata (tdata=<optimized out>) at res_pjsip/pjsip_message_<wbr>filter.c:211<br>#1 filter_on_tx_message (tdata=0x7eff892ae668) at res_pjsip/pjsip_message_<wbr>filter.c:233<br>#2 0x00007f00ba6c5b0e in endpt_on_tx_msg (endpt=<optimized out>, tdata=0x7eff892ae668) at ../src/pjsip/sip_endpoint.c:<wbr>1088<br>#3 0x00007f00ba6cc5c0 in pjsip_transport_send (tr=0x2677718, tdata=tdata@entry=<wbr>0x7eff892ae668, addr=addr@entry=<wbr>0x7eff8bdd4644, addr_len=16, token=token@entry=<wbr>0x7eff8bdd4568, <br> cb=cb@entry=0x7f00ba6d91b0 <transport_callback>) at ../src/pjsip/sip_transport.c:<wbr>833<br>#4 0x00007f00ba6d9db0 in tsx_send_msg (tsx=0x7eff8bdd4568, tdata=0x7eff892ae668) at ../src/pjsip/sip_transaction.<wbr>c:2122<br>#5 0x00007f00ba6da8da in tsx_on_state_proceeding_uas (tsx=0x7eff8bdd4568, event=<optimized out>) at ../src/pjsip/sip_transaction.<wbr>c:2882<br>#6 0x00007f00ba6d9b86 in tsx_timer_callback (theap=<optimized out>, entry=0x7eff8bdd46e8) at ../src/pjsip/sip_transaction.<wbr>c:1189<br>#7 0x00007f00b97f6757 in pj_timer_heap_poll (ht=0x23e53a0, next_delay=next_delay@entry=<wbr>0x7f0090649e50) at ../src/pj/timer.c:644<br>#8 0x00007f00ba6c65c4 in pjsip_endpt_handle_events2 (endpt=0x23e50b8, max_timeout=max_timeout@entry=<wbr>0x7f0090649ea0, p_count=p_count@entry=0x0) at ../src/pjsip/sip_endpoint.c:<wbr>715<br>#9 0x00007f00ba6c66d7 in pjsip_endpt_handle_events (endpt=<optimized out>, max_timeout=max_timeout@entry=<wbr>0x7f0090649ea0) at ../src/pjsip/sip_endpoint.c:<wbr>776<br>#10 0x00007f0094b7a118 in monitor_thread_exec (endpt=<optimized out>) at res_pjsip.c:4345<br>#11 0x00007f00b97e4ca0 in thread_main (param=0x23e3c68) at ../src/pj/os_core_unix.c:541<br>#12 0x00007f00f62bd064 in start_thread (arg=0x7f009064a700) at pthread_create.c:309<br>#13 0x00007f00f55a562d in clone () at ../sysdeps/unix/sysv/linux/<wbr>x86_64/clone.S:111<br><br></div></div></blockquote><div><br></div><div>* The attached backtrace does not match the threads you are highlighting in your email. In fact</div><div>the attached backtrace was done with a non-bundled pjproject without any symbols.</div><div><br></div><div>* Thread 3 above should have acquired the transaction group lock in tsx_timer_callback() before</div><div> calling the state handler routine. From there the pjproject line numbers do not match the bundled</div><div> pjproject version for 13.20.0. Also the thread is not blocked so I don't see how it as participating</div><div> in a deadlock unless you think it is in an endless loop.</div><div><br></div><div>You <b>really</b> need to be using bundled pjproject [1] for the reasons listed on that page.<br></div><div><br></div><div>Richard</div></div><div class="gmail_quote"><br></div><div class="gmail_quote">[1] <a href="https://wiki.asterisk.org/wiki/display/AST/PJSIP-pjproject">https://wiki.asterisk.org/wiki/display/AST/PJSIP-pjproject</a><br></div></div></div>