<div dir="ltr"><div dir="ltr">On Thu, Dec 12, 2019 at 2:51 PM Jaco Kroon <<a href="mailto:jaco@uls.co.za">jaco@uls.co.za</a>> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Hi All,</p>
<p>So I've received a few backtraces from a customer (Still using
ast 11 ... we're in progress of getting them migrated). They all
follow this pattern (full stack below):</p>
<p class="MsoNormal">#0 ast_hashtab_lookup_internal
(tab=0x7f41bc277920, tab=0x7f41bc277920, h=<optimized out>,
obj=0x7f4203176a50) at hashtab.c:602</p>
<b>b = 0x1041</b><b><br>
</b>#1 ast_hashtab_lookup (tab=0x7f41bc277920,
obj=obj@entry=0x7f4203176a50) at hashtab.c:548#2 0x000000000053af60
in pbx_find_extension (chan=chan@entry=0x0, bypass=bypass@entry=0x0,
q=q@entry=0x7f4203176e10, context=context@entry=0x7f4207f12c70
<sip_cfg+80> "sip_registrations",
exten=exten@entry=0x7f4203176d10 "filtered",
priority=priority@entry=1, label=label@entry=0x0,
callerid=callerid@entry=0x7f4207cdde03 "",
action=action@entry=E_MATCH) at pbx.c:3438 <br>
<p class="MsoNormal">
#3 0x00007f4207c4d7c1 in register_peer_exten
(peer=peer@entry=0x7f41bd4a0ca8, onoff=0) at chan_sip.c:5023</p>
#4 0x00007f4207c65bfa in sip_destroy_peer (peer=0x7f41bd4a0ca8) at
chan_sip.c:5084
<p class="MsoNormal">Note the pointer value of b which comes from
tab->array. In this particular case this was the actual value
from tab->array[1] ... a pointer which points to the first page
of memory, and an obvious invalid pointer.<br>
</p>
<p class="MsoNormal">After some digging I realized that the
following hashtab creations does NOT set the do_locking argument
to 1 (master branch, but verified against 13 too, so I'm assuming
15, 16 and 17 as well):</p>
<p class="MsoNormal">./main/pbx.c:</p>
<p class="MsoNormal">6197 contexts_table =
ast_hashtab_create(17,<br>
6198 ast_hashtab_compare_contexts,<br>
6199 ast_hashtab_resize_java,<br>
6200 ast_hashtab_newsize_java,<br>
6201 ast_hashtab_hash_contexts,<br>
<b>6202 0);</b></p>
<p class="MsoNormal">7512 tmp->peer_table =
ast_hashtab_create(13,<br>
7513 hashtab_compare_exten_numbers,<br>
7514 ast_hashtab_resize_java,<br>
7515 ast_hashtab_newsize_java,<br>
7516 hashtab_hash_priority,<br>
<b>7517 0);</b></p>
<p class="MsoNormal">7518 tmp->peer_label_table =
ast_hashtab_create(7,<br>
7519 hashtab_compare_exten_labels,<br>
7520 ast_hashtab_resize_java,<br>
7521 ast_hashtab_newsize_java,<br>
7522 hashtab_hash_labels,<br>
<b>7523 0);</b><br>
</p>
<p class="MsoNormal">After scoping out hashtab.c, and in spite of
various comments in the header, I'm by no means convinced that
this code is in any way threadsafe without that argument being set
to 1. For pbx.c it may be just fine in a read-heavy environment,
but with regcontext as per above that assumption goes out the
door. And I'm surprised the frequency of this particular crash is
approximately once a week over an 8-node setup.</p>
<p class="MsoNormal">In master (same in 13 branch) there is four
more uses in pbx/pbx_*.c which also does the same thing, and I'm
assuming will suffer the same consequences. The EOL 11 branch has
a few extra uses, mostly setting locking to on.<br>
</p>
<p class="MsoNormal">In 13 branch there is actually a few extra
users in main/pbx.c, three specifically, none of them sets
locking.</p>
<p class="MsoNormal">Since the dialplan code is accessed from
multiple threads - can we please get an explanation of how
hashtab.c protects against concurrent access given the lack of
do_locking? From the header:</p>
<p class="MsoNormal"> 66 6. Recently updated the hash routines to
use Doubly-linked lists for buckets,<br>
67 and added a doubly-linked list that threads thru every
bucket in the table.<br>
68 The list of all buckets is on the HashTab struct. The
Traversal was modified<br>
69 to go thru this list instead of searching the bucket array
for buckets.<br>
70 <b>This also should make it safe to remove a bucket
during the traversal.</b><br>
71 Removal and destruction routines will work faster.<br>
</p>
<p class="MsoNormal">On line 70 specifically, I don't see any way
this can be true.<br></p></div></blockquote><div><br></div><div>It would be up to the higher level consumer of the hashtab API to protect it through its own mechanism. That could be a lock, or it could be a fundamental design decision such as immutable structures and objects. I do not recall precisely what the PBX core attempts to do, someone like Corey Farrell has spent more time in it than I most likely.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><p class="MsoNormal">
</p>
<p class="MsoNormal">May I suggest that the argument for do_locking
is discarded, and do_locking is forced on:</p></blockquote><div><br></div><div>I'm not comfortable making such a substantial change without confirmation that the problem also appears under current supported versions and without understanding the specific scenario and circumstances. The specific implications of it would also need to be confirmed within the PBX core itself. It is seldom that just turning on locking "just works" without any consequences.</div></div><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div style="font-family:tahoma,sans-serif"><font color="#073763">Joshua C. Colp</font></div><div style="font-family:tahoma,sans-serif"><font color="#073763">Senior Software Developer</font></div><div style="font-family:tahoma,sans-serif"><font color="#073763">Sangoma Technologies</font></div><div style="font-family:tahoma,sans-serif"><font color="#073763">Check us out at <a href="http://www.sangoma.com" target="_blank">www.sangoma.com</a> and <a href="http://www.asterisk.org" target="_blank">www.asterisk.org</a></font><br></div></div></div></div></div></div></div></div></div>