[asterisk-bugs] [JIRA] (ASTERISK-27966) race condition in 183 re transmission can result in a deadlock

Torrey Searle (JIRA) noreply at issues.asterisk.org
Tue Jul 17 07:07:54 CDT 2018


     [ https://issues.asterisk.org/jira/browse/ASTERISK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Torrey Searle updated ASTERISK-27966:
-------------------------------------

    Description: 
Deadlock between two threads.

The first thread is stuck in an infinite loop on a header list size 1 containing just a contact header

{pre}
Thread 5 (Thread 0x7f3208352700 (LWP 29224)):
#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f354837c494 in _L_lock_952 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f354837c2f6 in _GI__pthread_mutex_lock (mutex=0x7f3219e74ef0) at ../nptl/pthread_mutex_lock.c:114
#3 0x000000000045ae87 in __ao2_lock (user_data=0x7f3219e74f40, lock_how=(unknown: 128), lock_how at entry=AO2_LOCK_REQ_RDLOCK, file=0x7f3219e74ef0 "", file at entry=0x7f3219e74f40 "", func=0x0, func at entry=0x60af31 "astobj2_container.c", line=514, var=0x60af2c "self") at astobj2.c:164
#4 0x000000000045dffd in internal_ao2_traverse (self=0x7f3219e74f40, flags=flags at entry=OBJ_SEARCH_KEY, cb_fn=0x7f35257f9dc0 <session_media_cmp>, arg=arg at entry=0x7f3484638360, tag=tag at entry=0x0, file=file at entry=0x0, line=0, func=0x0, type=AO2_CALLBACK_DEFAULT, data=0x0) at astobj2_container.c:336
#5 0x000000000045e6f3 in __ao2_callback (arg=0x7f3484638360, cb_fn=<optimized out>, flags=OBJ_SEARCH_KEY, c=<optimized out>) at astobj2_container.c:455
#6 __ao2_find (c=<optimized out>, arg=arg at entry=0x7f3484638360, flags=flags at entry=OBJ_SEARCH_KEY) at astobj2_container.c:496
#7 0x00007f3484635aec in t38_framehook_read (session=0x7f3219791ec0, session=0x7f3219791ec0, f=0x0, chan=0x7f3218015810) at res_pjsip_t38.c:448
#8 t38_framehook (chan=0x7f3218015810, f=0x0, event=<optimized out>, data=<optimized out>) at res_pjsip_t38.c:466
#9 0x000000000051cc2b in framehook_list_push_event (framehooks=0x7f3218311b80, frame=0x0, event=AST_FRAMEHOOK_EVENT_READ) at framehook.c:118
#10 0x00000000004bdd45 in __ast_read (chan=0x7f3218015810, dropaudio=0) at channel.c:3973
#11 0x00000000004827c6 in bridge_handle_trip (bridge_channel=<optimized out>) at bridge_channel.c:2416
#12 bridge_channel_wait (bridge_channel=<optimized out>) at bridge_channel.c:2586
#13 bridge_channel_internal_join (bridge_channel=0x7f321a966260) at bridge_channel.c:2732
#14 0x000000000046c600 in bridge_channel_ind_thread (data=data at entry=0x7f321a966260) at bridge.c:1782
#15 0x00000000005e6dfa in dummy_start (data=<optimized out>) at utils.c:1238
#16 0x00007f354837a064 in start_thread (arg=0x7f3208352700) at pthread_create.c:309
#17 0x00007f354766262d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
{pre}

The second thread is stuck trying to get the tsx group lock heald by the first thread
{pre}
Thread 16 (Thread 0x7f32ab2eb700 (LWP 6953)):
#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f354837c494 in _L_lock_952 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f354837c2f6 in _GI__pthread_mutex_lock (mutex=0x7f321a996058) at ../nptl/pthread_mutex_lock.c:114
#3 0x00007f352dd599a5 in pj_mutex_lock (mutex=0x7f321a996058) at ../src/pj/os_core_unix.c:1265
#4 0x00007f352dd5fce1 in grp_lock_acquire (p=0x7f321a995f48) at ../src/pj/lock.c:290
#5 0x00007f352dd5fd22 in pj_grp_lock_acquire (grp_lock=<optimized out>) at ../src/pj/lock.c:478
#6 0x00007f352ec514ac in pjsip_tsx_send_msg (tsx=tsx at entry=0x7f321858fe58, tdata=tdata at entry=0x7f3206374e38) at ../src/pjsip/sip_transaction.c:1783
#7 0x00007f352ec53952 in pjsip_dlg_send_response (dlg=<optimized out>, tsx=0x7f321858fe58, tdata=tdata at entry=0x7f3206374e38) at ../src/pjsip/sip_dialog.c:1531
#8 0x00007f352f09109a in pjsip_inv_send_msg (inv=0x7f321b7f68b8, tdata=tdata at entry=0x7f3206374e38) at ../src/pjsip-ua/sip_inv.c:3231
#9 0x00007f35257fcb4c in ast_sip_session_send_response (session=session at entry=0x7f3219a9c7c0, tdata=0x7f3206374e38) at res_pjsip_session.c:983
#10 0x00007f347ff2c1f5 in indicate (data=0x7f321b5ed260) at chan_pjsip.c:1246
#11 0x00000000005d52ce in ast_taskprocessor_execute (tps=tps at entry=0x1aaceb0) at taskprocessor.c:971
#12 0x00000000005dcc80 in execute_tasks (data=0x1aaceb0) at threadpool.c:1322
#13 0x00000000005d52ce in ast_taskprocessor_execute (tps=0x180fdf0) at taskprocessor.c:971
#14 0x00000000005dd558 in threadpool_execute (pool=<optimized out>) at threadpool.c:351
#15 worker_active (worker=<optimized out>) at threadpool.c:1105
#16 worker_start (arg=0x7f34d4001ba0) at threadpool.c:1024
#17 0x00000000005e6dfa in dummy_start (data=<optimized out>) at utils.c:1238
#18 0x00007f354837a064 in start_thread (arg=0x7f32ab2eb700) at pthread_create.c:309
#19 0x00007f354766262d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
{pre}

The first thread is an infinite loop because it's iterating a circular list of size one of type contact header

Infinite loop in sanitize_tdata when iterating the headers. This is the root cause of the outage

$85 = (pjsip_hdr *) 0x7f3205e5c218
(gdb) print hdr->next
$86 = (struct pjsip_hdr *) 0x7f3205e5c218

>From analysis, I've found that when asterisk is trying to forward a 183 it invokes indicate (as seen in the 2nd thread).  This thread has a reference to the tdata of the last 183 send (which is the same tdata as the internal heap timer for retransmission has).  Indicate will call the pjsip method pjsip_dlg_modify_response which intern removes the contact header from the tdata, causing a chance for the first thread to end up in the above infinite loop.  

As since this tdata is being shared by both a transaction and a dialog, I feel the best way to avoid conflict is to have the  pjsip_dlg_modify_response to get the tsx group lock to avoid access by the by the heap timer while it's modifying a possibly shared object.


  was:
Deadlock between two threads.

The first thread is stuck in an infinite loop on a header list size 1 containing just a contact header


Thread 5 (Thread 0x7f3208352700 (LWP 29224)):
#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f354837c494 in _L_lock_952 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f354837c2f6 in _GI__pthread_mutex_lock (mutex=0x7f3219e74ef0) at ../nptl/pthread_mutex_lock.c:114
#3 0x000000000045ae87 in __ao2_lock (user_data=0x7f3219e74f40, lock_how=(unknown: 128), lock_how at entry=AO2_LOCK_REQ_RDLOCK, file=0x7f3219e74ef0 "", file at entry=0x7f3219e74f40 "", func=0x0, func at entry=0x60af31 "astobj2_container.c", line=514, var=0x60af2c "self") at astobj2.c:164
#4 0x000000000045dffd in internal_ao2_traverse (self=0x7f3219e74f40, flags=flags at entry=OBJ_SEARCH_KEY, cb_fn=0x7f35257f9dc0 <session_media_cmp>, arg=arg at entry=0x7f3484638360, tag=tag at entry=0x0, file=file at entry=0x0, line=0, func=0x0, type=AO2_CALLBACK_DEFAULT, data=0x0) at astobj2_container.c:336
#5 0x000000000045e6f3 in __ao2_callback (arg=0x7f3484638360, cb_fn=<optimized out>, flags=OBJ_SEARCH_KEY, c=<optimized out>) at astobj2_container.c:455
#6 __ao2_find (c=<optimized out>, arg=arg at entry=0x7f3484638360, flags=flags at entry=OBJ_SEARCH_KEY) at astobj2_container.c:496
#7 0x00007f3484635aec in t38_framehook_read (session=0x7f3219791ec0, session=0x7f3219791ec0, f=0x0, chan=0x7f3218015810) at res_pjsip_t38.c:448
#8 t38_framehook (chan=0x7f3218015810, f=0x0, event=<optimized out>, data=<optimized out>) at res_pjsip_t38.c:466
#9 0x000000000051cc2b in framehook_list_push_event (framehooks=0x7f3218311b80, frame=0x0, event=AST_FRAMEHOOK_EVENT_READ) at framehook.c:118
#10 0x00000000004bdd45 in __ast_read (chan=0x7f3218015810, dropaudio=0) at channel.c:3973
#11 0x00000000004827c6 in bridge_handle_trip (bridge_channel=<optimized out>) at bridge_channel.c:2416
#12 bridge_channel_wait (bridge_channel=<optimized out>) at bridge_channel.c:2586
#13 bridge_channel_internal_join (bridge_channel=0x7f321a966260) at bridge_channel.c:2732
#14 0x000000000046c600 in bridge_channel_ind_thread (data=data at entry=0x7f321a966260) at bridge.c:1782
#15 0x00000000005e6dfa in dummy_start (data=<optimized out>) at utils.c:1238
#16 0x00007f354837a064 in start_thread (arg=0x7f3208352700) at pthread_create.c:309
#17 0x00007f354766262d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

The second thread is stuck trying to get the tsx group lock heald by the first thread

Thread 5 (Thread 0x7f3208352700 (LWP 29224)):
#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f354837c494 in _L_lock_952 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f354837c2f6 in _GI__pthread_mutex_lock (mutex=0x7f3219e74ef0) at ../nptl/pthread_mutex_lock.c:114
#3 0x000000000045ae87 in __ao2_lock (user_data=0x7f3219e74f40, lock_how=(unknown: 128), lock_how at entry=AO2_LOCK_REQ_RDLOCK, file=0x7f3219e74ef0 "", file at entry=0x7f3219e74f40 "", func=0x0, func at entry=0x60af31 "astobj2_container.c", line=514, var=0x60af2c "self") at astobj2.c:164
#4 0x000000000045dffd in internal_ao2_traverse (self=0x7f3219e74f40, flags=flags at entry=OBJ_SEARCH_KEY, cb_fn=0x7f35257f9dc0 <session_media_cmp>, arg=arg at entry=0x7f3484638360, tag=tag at entry=0x0, file=file at entry=0x0, line=0, func=0x0, type=AO2_CALLBACK_DEFAULT, data=0x0) at astobj2_container.c:336
#5 0x000000000045e6f3 in __ao2_callback (arg=0x7f3484638360, cb_fn=<optimized out>, flags=OBJ_SEARCH_KEY, c=<optimized out>) at astobj2_container.c:455
#6 __ao2_find (c=<optimized out>, arg=arg at entry=0x7f3484638360, flags=flags at entry=OBJ_SEARCH_KEY) at astobj2_container.c:496
#7 0x00007f3484635aec in t38_framehook_read (session=0x7f3219791ec0, session=0x7f3219791ec0, f=0x0, chan=0x7f3218015810) at res_pjsip_t38.c:448
#8 t38_framehook (chan=0x7f3218015810, f=0x0, event=<optimized out>, data=<optimized out>) at res_pjsip_t38.c:466
#9 0x000000000051cc2b in framehook_list_push_event (framehooks=0x7f3218311b80, frame=0x0, event=AST_FRAMEHOOK_EVENT_READ) at framehook.c:118
#10 0x00000000004bdd45 in __ast_read (chan=0x7f3218015810, dropaudio=0) at channel.c:3973
#11 0x00000000004827c6 in bridge_handle_trip (bridge_channel=<optimized out>) at bridge_channel.c:2416
#12 bridge_channel_wait (bridge_channel=<optimized out>) at bridge_channel.c:2586
#13 bridge_channel_internal_join (bridge_channel=0x7f321a966260) at bridge_channel.c:2732
#14 0x000000000046c600 in bridge_channel_ind_thread (data=data at entry=0x7f321a966260) at bridge.c:1782
#15 0x00000000005e6dfa in dummy_start (data=<optimized out>) at utils.c:1238
#16 0x00007f354837a064 in start_thread (arg=0x7f3208352700) at pthread_create.c:309
#17 0x00007f354766262d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111


The first thread is an infinite loop because it's iterating a circular list of size one of type contact header

Infinite loop in sanitize_tdata when iterating the headers. This is the root cause of the outage

$85 = (pjsip_hdr *) 0x7f3205e5c218
(gdb) print hdr->next
$86 = (struct pjsip_hdr *) 0x7f3205e5c218

>From analysis, I've found that when asterisk is trying to forward a 183 it invokes indicate (as seen in the 2nd thread).  This thread has a reference to the tdata of the last 183 send (which is the same tdata as the internal heap timer for retransmission has).  Indicate will call the pjsip method pjsip_dlg_modify_response which intern removes the contact header from the tdata, causing a chance for the first thread to end up in the above infinite loop.  

As since this tdata is being shared by both a transaction and a dialog, I feel the best way to avoid conflict is to have the  pjsip_dlg_modify_response to get the tsx group lock to avoid access by the by the heap timer while it's modifying a possibly shared object.



> race condition in 183 re transmission can result in a deadlock
> --------------------------------------------------------------
>
>                 Key: ASTERISK-27966
>                 URL: https://issues.asterisk.org/jira/browse/ASTERISK-27966
>             Project: Asterisk
>          Issue Type: Bug
>      Security Level: None
>          Components: Third-Party/pjproject
>    Affects Versions: 13.20.0
>            Reporter: Torrey Searle
>              Labels: pjsip
>
> Deadlock between two threads.
> The first thread is stuck in an infinite loop on a header list size 1 containing just a contact header
> {pre}
> Thread 5 (Thread 0x7f3208352700 (LWP 29224)):
> #0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> #1 0x00007f354837c494 in _L_lock_952 () from /lib/x86_64-linux-gnu/libpthread.so.0
> #2 0x00007f354837c2f6 in _GI__pthread_mutex_lock (mutex=0x7f3219e74ef0) at ../nptl/pthread_mutex_lock.c:114
> #3 0x000000000045ae87 in __ao2_lock (user_data=0x7f3219e74f40, lock_how=(unknown: 128), lock_how at entry=AO2_LOCK_REQ_RDLOCK, file=0x7f3219e74ef0 "", file at entry=0x7f3219e74f40 "", func=0x0, func at entry=0x60af31 "astobj2_container.c", line=514, var=0x60af2c "self") at astobj2.c:164
> #4 0x000000000045dffd in internal_ao2_traverse (self=0x7f3219e74f40, flags=flags at entry=OBJ_SEARCH_KEY, cb_fn=0x7f35257f9dc0 <session_media_cmp>, arg=arg at entry=0x7f3484638360, tag=tag at entry=0x0, file=file at entry=0x0, line=0, func=0x0, type=AO2_CALLBACK_DEFAULT, data=0x0) at astobj2_container.c:336
> #5 0x000000000045e6f3 in __ao2_callback (arg=0x7f3484638360, cb_fn=<optimized out>, flags=OBJ_SEARCH_KEY, c=<optimized out>) at astobj2_container.c:455
> #6 __ao2_find (c=<optimized out>, arg=arg at entry=0x7f3484638360, flags=flags at entry=OBJ_SEARCH_KEY) at astobj2_container.c:496
> #7 0x00007f3484635aec in t38_framehook_read (session=0x7f3219791ec0, session=0x7f3219791ec0, f=0x0, chan=0x7f3218015810) at res_pjsip_t38.c:448
> #8 t38_framehook (chan=0x7f3218015810, f=0x0, event=<optimized out>, data=<optimized out>) at res_pjsip_t38.c:466
> #9 0x000000000051cc2b in framehook_list_push_event (framehooks=0x7f3218311b80, frame=0x0, event=AST_FRAMEHOOK_EVENT_READ) at framehook.c:118
> #10 0x00000000004bdd45 in __ast_read (chan=0x7f3218015810, dropaudio=0) at channel.c:3973
> #11 0x00000000004827c6 in bridge_handle_trip (bridge_channel=<optimized out>) at bridge_channel.c:2416
> #12 bridge_channel_wait (bridge_channel=<optimized out>) at bridge_channel.c:2586
> #13 bridge_channel_internal_join (bridge_channel=0x7f321a966260) at bridge_channel.c:2732
> #14 0x000000000046c600 in bridge_channel_ind_thread (data=data at entry=0x7f321a966260) at bridge.c:1782
> #15 0x00000000005e6dfa in dummy_start (data=<optimized out>) at utils.c:1238
> #16 0x00007f354837a064 in start_thread (arg=0x7f3208352700) at pthread_create.c:309
> #17 0x00007f354766262d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> {pre}
> The second thread is stuck trying to get the tsx group lock heald by the first thread
> {pre}
> Thread 16 (Thread 0x7f32ab2eb700 (LWP 6953)):
> #0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> #1 0x00007f354837c494 in _L_lock_952 () from /lib/x86_64-linux-gnu/libpthread.so.0
> #2 0x00007f354837c2f6 in _GI__pthread_mutex_lock (mutex=0x7f321a996058) at ../nptl/pthread_mutex_lock.c:114
> #3 0x00007f352dd599a5 in pj_mutex_lock (mutex=0x7f321a996058) at ../src/pj/os_core_unix.c:1265
> #4 0x00007f352dd5fce1 in grp_lock_acquire (p=0x7f321a995f48) at ../src/pj/lock.c:290
> #5 0x00007f352dd5fd22 in pj_grp_lock_acquire (grp_lock=<optimized out>) at ../src/pj/lock.c:478
> #6 0x00007f352ec514ac in pjsip_tsx_send_msg (tsx=tsx at entry=0x7f321858fe58, tdata=tdata at entry=0x7f3206374e38) at ../src/pjsip/sip_transaction.c:1783
> #7 0x00007f352ec53952 in pjsip_dlg_send_response (dlg=<optimized out>, tsx=0x7f321858fe58, tdata=tdata at entry=0x7f3206374e38) at ../src/pjsip/sip_dialog.c:1531
> #8 0x00007f352f09109a in pjsip_inv_send_msg (inv=0x7f321b7f68b8, tdata=tdata at entry=0x7f3206374e38) at ../src/pjsip-ua/sip_inv.c:3231
> #9 0x00007f35257fcb4c in ast_sip_session_send_response (session=session at entry=0x7f3219a9c7c0, tdata=0x7f3206374e38) at res_pjsip_session.c:983
> #10 0x00007f347ff2c1f5 in indicate (data=0x7f321b5ed260) at chan_pjsip.c:1246
> #11 0x00000000005d52ce in ast_taskprocessor_execute (tps=tps at entry=0x1aaceb0) at taskprocessor.c:971
> #12 0x00000000005dcc80 in execute_tasks (data=0x1aaceb0) at threadpool.c:1322
> #13 0x00000000005d52ce in ast_taskprocessor_execute (tps=0x180fdf0) at taskprocessor.c:971
> #14 0x00000000005dd558 in threadpool_execute (pool=<optimized out>) at threadpool.c:351
> #15 worker_active (worker=<optimized out>) at threadpool.c:1105
> #16 worker_start (arg=0x7f34d4001ba0) at threadpool.c:1024
> #17 0x00000000005e6dfa in dummy_start (data=<optimized out>) at utils.c:1238
> #18 0x00007f354837a064 in start_thread (arg=0x7f32ab2eb700) at pthread_create.c:309
> #19 0x00007f354766262d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> {pre}
> The first thread is an infinite loop because it's iterating a circular list of size one of type contact header
> Infinite loop in sanitize_tdata when iterating the headers. This is the root cause of the outage
> $85 = (pjsip_hdr *) 0x7f3205e5c218
> (gdb) print hdr->next
> $86 = (struct pjsip_hdr *) 0x7f3205e5c218
> From analysis, I've found that when asterisk is trying to forward a 183 it invokes indicate (as seen in the 2nd thread).  This thread has a reference to the tdata of the last 183 send (which is the same tdata as the internal heap timer for retransmission has).  Indicate will call the pjsip method pjsip_dlg_modify_response which intern removes the contact header from the tdata, causing a chance for the first thread to end up in the above infinite loop.  
> As since this tdata is being shared by both a transaction and a dialog, I feel the best way to avoid conflict is to have the  pjsip_dlg_modify_response to get the tsx group lock to avoid access by the by the heap timer while it's modifying a possibly shared object.



--
This message was sent by Atlassian JIRA
(v6.2#6252)



More information about the asterisk-bugs mailing list