[asterisk-bugs] [JIRA] (ASTERISK-27966) pjsip: Race condition in 183 re transmission can result in a deadlock
Torrey Searle (JIRA)
noreply at issues.asterisk.org
Thu Jul 19 10:40:55 CDT 2018
[ https://issues.asterisk.org/jira/browse/ASTERISK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=244187#comment-244187 ]
Torrey Searle commented on ASTERISK-27966:
------------------------------------------
callflow is basically this
A -> INV -> Asterisk -> INV - > B
A <- 183 <- Asterisk <- 183 <- B
*60 seconds later*
A <- 183 <- Asterisk <- 183 <- B
*Deadlock*
after 60 seconds pjsip is trying to internally retransmit the 183 using the tdata reference tsx->last_tx
while asterisk is also trying to send a 183, which is first trying to remove the contact header in inv->last_answer
and tsx->last_tx points to the same tdata object as inv->last_answer
modifying the data object why the heap timer is trying to retransmit the 183 results in an infinite loop while holding the tsx group lock
> pjsip: Race condition in 183 re transmission can result in a deadlock
> ---------------------------------------------------------------------
>
> Key: ASTERISK-27966
> URL: https://issues.asterisk.org/jira/browse/ASTERISK-27966
> Project: Asterisk
> Issue Type: Bug
> Security Level: None
> Components: Third-Party/pjproject
> Affects Versions: 13.20.0
> Reporter: Torrey Searle
> Assignee: Torrey Searle
> Labels: pjsip
>
> Deadlock between two threads.
> The first thread is stuck in an infinite loop on a header list size 1 containing just a contact header
> {noformat}
> Thread 4 (Thread 0x7f34e6391700 (LWP 14984)):
> #0 pjsip_get_uri (uri=0x7f3205e5c2c0) at ../src/pjsip/sip_uri.c:221
> #1 0x00007f34e7bc4a82 in pjsip_uri_get_uri (uri=<optimized out>) at /usr/include/pjsip/sip_uri.h:272
> #2 sanitize_tdata (tdata=<optimized out>) at res_pjsip/pjsip_message_filter.c:212
> #3 filter_on_tx_message (tdata=0x7f3206374e38) at res_pjsip/pjsip_message_filter.c:233
> #4 0x00007f352ec3ab0e in endpt_on_tx_msg (endpt=<optimized out>, tdata=0x7f3206374e38) at ../src/pjsip/sip_endpoint.c:1088
> #5 0x00007f352ec415c0 in pjsip_transport_send (tr=0x1aa67b8, tdata=tdata at entry=0x7f3206374e38, addr=addr at entry=0x7f321858ff34, addr_len=16, token=token at entry=0x7f321858fe58, cb=cb at entry=0x7f352ec4e1b0 <transport_callback>) at ../src/pjsip/sip_transport.c:833
> #6 0x00007f352ec4edb0 in tsx_send_msg (tsx=0x7f321858fe58, tdata=0x7f3206374e38) at ../src/pjsip/sip_transaction.c:2122
> #7 0x00007f352ec4f8da in tsx_on_state_proceeding_uas (tsx=0x7f321858fe58, event=<optimized out>) at ../src/pjsip/sip_transaction.c:2882
> #8 0x00007f352ec4eb86 in tsx_timer_callback (theap=<optimized out>, entry=0x7f321858ffd8) at ../src/pjsip/sip_transaction.c:1189
> #9 0x00007f352dd6b757 in pj_timer_heap_poll (ht=0x1814600, next_delay=next_delay at entry=0x7f34e6390e50) at ../src/pj/timer.c:644
> #10 0x00007f352ec3b5c4 in pjsip_endpt_handle_events2 (endpt=0x1814318, max_timeout=max_timeout at entry=0x7f34e6390ea0, p_count=p_count at entry=0x0) at ../src/pjsip/sip_endpoint.c:715
> #11 0x00007f352ec3b6d7 in pjsip_endpt_handle_events (endpt=<optimized out>, max_timeout=max_timeout at entry=0x7f34e6390ea0) at ../src/pjsip/sip_endpoint.c:776
> #12 0x00007f34e7bb8118 in monitor_thread_exec (endpt=<optimized out>) at res_pjsip.c:4345
> #13 0x00007f352dd59ca0 in thread_main (param=0x18139b8) at ../src/pj/os_core_unix.c:541
> #14 0x00007f354837a064 in start_thread (arg=0x7f34e6391700) at pthread_create.c:309
> #15 0x00007f354766262d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> {noformat}
> The second thread is stuck trying to get the tsx group lock heald by the first thread
> {noformat}
> Thread 16 (Thread 0x7f32ab2eb700 (LWP 6953)):
> #0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> #1 0x00007f354837c494 in _L_lock_952 () from /lib/x86_64-linux-gnu/libpthread.so.0
> #2 0x00007f354837c2f6 in _GI__pthread_mutex_lock (mutex=0x7f321a996058) at ../nptl/pthread_mutex_lock.c:114
> #3 0x00007f352dd599a5 in pj_mutex_lock (mutex=0x7f321a996058) at ../src/pj/os_core_unix.c:1265
> #4 0x00007f352dd5fce1 in grp_lock_acquire (p=0x7f321a995f48) at ../src/pj/lock.c:290
> #5 0x00007f352dd5fd22 in pj_grp_lock_acquire (grp_lock=<optimized out>) at ../src/pj/lock.c:478
> #6 0x00007f352ec514ac in pjsip_tsx_send_msg (tsx=tsx at entry=0x7f321858fe58, tdata=tdata at entry=0x7f3206374e38) at ../src/pjsip/sip_transaction.c:1783
> #7 0x00007f352ec53952 in pjsip_dlg_send_response (dlg=<optimized out>, tsx=0x7f321858fe58, tdata=tdata at entry=0x7f3206374e38) at ../src/pjsip/sip_dialog.c:1531
> #8 0x00007f352f09109a in pjsip_inv_send_msg (inv=0x7f321b7f68b8, tdata=tdata at entry=0x7f3206374e38) at ../src/pjsip-ua/sip_inv.c:3231
> #9 0x00007f35257fcb4c in ast_sip_session_send_response (session=session at entry=0x7f3219a9c7c0, tdata=0x7f3206374e38) at res_pjsip_session.c:983
> #10 0x00007f347ff2c1f5 in indicate (data=0x7f321b5ed260) at chan_pjsip.c:1246
> #11 0x00000000005d52ce in ast_taskprocessor_execute (tps=tps at entry=0x1aaceb0) at taskprocessor.c:971
> #12 0x00000000005dcc80 in execute_tasks (data=0x1aaceb0) at threadpool.c:1322
> #13 0x00000000005d52ce in ast_taskprocessor_execute (tps=0x180fdf0) at taskprocessor.c:971
> #14 0x00000000005dd558 in threadpool_execute (pool=<optimized out>) at threadpool.c:351
> #15 worker_active (worker=<optimized out>) at threadpool.c:1105
> #16 worker_start (arg=0x7f34d4001ba0) at threadpool.c:1024
> #17 0x00000000005e6dfa in dummy_start (data=<optimized out>) at utils.c:1238
> #18 0x00007f354837a064 in start_thread (arg=0x7f32ab2eb700) at pthread_create.c:309
> #19 0x00007f354766262d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> {noformat}
> The first thread is an infinite loop because it's iterating a circular list of size one of type contact header
> Infinite loop in sanitize_tdata when iterating the headers. This is the root cause of the outage
> $85 = (pjsip_hdr *) 0x7f3205e5c218
> (gdb) print hdr->next
> $86 = (struct pjsip_hdr *) 0x7f3205e5c218
> From analysis, I've found that when asterisk is trying to forward a 183 it invokes indicate (as seen in the 2nd thread). This thread has a reference to the tdata of the last 183 send (which is the same tdata as the internal heap timer for retransmission has). Indicate will call the pjsip method pjsip_dlg_modify_response which intern removes the contact header from the tdata, causing a chance for the first thread to end up in the above infinite loop.
> As since this tdata is being shared by both a transaction and a dialog, I feel the best way to avoid conflict is to have the pjsip_dlg_modify_response to get the tsx group lock to avoid access by the by the heap timer while it's modifying a possibly shared object.
--
This message was sent by Atlassian JIRA
(v6.2#6252)
More information about the asterisk-bugs
mailing list