[asterisk-dev] hashtab memory corruption and resulting segfaults due to lack of thread safety

Joshua C. Colp jcolp at sangoma.com
Thu Dec 12 13:02:52 CST 2019


On Thu, Dec 12, 2019 at 2:51 PM Jaco Kroon <jaco at uls.co.za> wrote:

> Hi All,
>
> So I've received a few backtraces from a customer (Still using ast 11 ...
> we're in progress of getting them migrated).  They all follow this pattern
> (full stack below):
>
> #0  ast_hashtab_lookup_internal (tab=0x7f41bc277920, tab=0x7f41bc277920,
> h=<optimized out>, obj=0x7f4203176a50) at hashtab.c:602
>         *b = 0x1041*
> #1  ast_hashtab_lookup (tab=0x7f41bc277920, obj=obj at entry=0x7f4203176a50)
> at hashtab.c:548#2  0x000000000053af60 in pbx_find_extension
> (chan=chan at entry=0x0, bypass=bypass at entry=0x0, q=q at entry=0x7f4203176e10,
> context=context at entry=0x7f4207f12c70 <sip_cfg+80> "sip_registrations",
> exten=exten at entry=0x7f4203176d10 "filtered",    priority=priority at entry=1,
> label=label at entry=0x0, callerid=callerid at entry=0x7f4207cdde03 "",
> action=action at entry=E_MATCH) at pbx.c:3438
>
> #3  0x00007f4207c4d7c1 in register_peer_exten (peer=peer at entry=0x7f41bd4a0ca8,
> onoff=0) at chan_sip.c:5023
> #4  0x00007f4207c65bfa in sip_destroy_peer (peer=0x7f41bd4a0ca8) at
> chan_sip.c:5084
>
> Note the pointer value of b which comes from tab->array.  In this
> particular case this was the actual value from tab->array[1] ... a pointer
> which points to the first page of memory, and an obvious invalid pointer.
>
> After some digging I realized that the following hashtab creations does
> NOT set the do_locking argument to 1 (master branch, but verified against
> 13 too, so I'm assuming 15, 16 and 17 as well):
>
> ./main/pbx.c:
>
> 6197             contexts_table = ast_hashtab_create(17,
> 6198                 ast_hashtab_compare_contexts,
> 6199                 ast_hashtab_resize_java,
> 6200                 ast_hashtab_newsize_java,
> 6201                 ast_hashtab_hash_contexts,
> *6202                 0);*
>
> 7512         tmp->peer_table = ast_hashtab_create(13,
> 7513                         hashtab_compare_exten_numbers,
> 7514                         ast_hashtab_resize_java,
> 7515                         ast_hashtab_newsize_java,
> 7516                         hashtab_hash_priority,
> *7517                         0);*
>
> 7518         tmp->peer_label_table = ast_hashtab_create(7,
> 7519                             hashtab_compare_exten_labels,
> 7520                             ast_hashtab_resize_java,
> 7521                             ast_hashtab_newsize_java,
> 7522                             hashtab_hash_labels,
> *7523                             0);*
>
> After scoping out hashtab.c, and in spite of various comments in the
> header, I'm by no means convinced that this code is in any way threadsafe
> without that argument being set to 1.  For pbx.c it may be just fine in a
> read-heavy environment, but with regcontext as per above that assumption
> goes out the door.  And I'm surprised the frequency of this particular
> crash is approximately once a week over an 8-node setup.
>
> In master (same in 13 branch) there is four more uses in pbx/pbx_*.c which
> also does the same thing, and I'm assuming will suffer the same
> consequences.  The EOL 11 branch has a few extra uses, mostly setting
> locking to on.
>
> In 13 branch there is actually a few extra users in main/pbx.c, three
> specifically, none of them sets locking.
>
> Since the dialplan code is accessed from multiple threads - can we please
> get an explanation of how hashtab.c protects against concurrent access
> given the lack of do_locking?  From the header:
>
>  66  6. Recently updated the hash routines to use Doubly-linked lists for
> buckets,
>  67     and added a doubly-linked list that threads thru every bucket in
> the table.
>  68     The list of all buckets is on the HashTab struct. The Traversal
> was modified
>  69     to go thru this list instead of searching the bucket array for
> buckets.
>  70     *This also should make it safe to remove a bucket during the
> traversal.*
>  71     Removal and destruction routines will work faster.
>
> On line 70 specifically, I don't see any way this can be true.
>

It would be up to the higher level consumer of the hashtab API to protect
it through its own mechanism. That could be a lock, or it could be a
fundamental design decision such as immutable structures and objects. I do
not recall precisely what the PBX core attempts to do, someone like Corey
Farrell has spent more time in it than I most likely.


> May I suggest that the argument for do_locking is discarded, and
> do_locking is forced on:
>

I'm not comfortable making such a substantial change without confirmation
that the problem also appears under current supported versions and without
understanding the specific scenario and circumstances. The specific
implications of it would also need to be confirmed within the PBX core
itself. It is seldom that just turning on locking "just works" without any
consequences.

-- 
Joshua C. Colp
Senior Software Developer
Sangoma Technologies
Check us out at www.sangoma.com and www.asterisk.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.digium.com/pipermail/asterisk-dev/attachments/20191212/45925ea9/attachment.html>


More information about the asterisk-dev mailing list