[asterisk-dev] Problems with reregister schedule

Torbjörn Abrahamsson torbjorn.abrahamsson at gmail.com
Fri Sep 28 09:05:42 CDT 2018


Hello again!

 

As no one has provided any insights into this yet, I will combine a bump
with some more information. I have unfortunately not resolved this
yet.

 

I tried to put a custom lock around the call to ast_sched_clean_by_callback
by adding a mutex near the definition of sip_reload_lock:

 

AST_MUTEX_DEFINE_STATIC(sip_reregister_cleanup_lock);

 

And then surrounding the ast_sched_clean_by_callback call with:  

ast_mutex_lock(&sip_reregister_cleanup_lock);




ast_mutex_unlock(&sip_reregister_cleanup_lock);

 

This made no difference, which I did not really expect it to. I had added
debug prints before and after the ast_sched_clean_by_callback call, and they
indicate that the call is quite fast, so I would guess that no new thread is
created for it. My mutex would therefore only stop the cleanup from running
concurrently with itself, which should not happen anyway since, if I recall
correctly, there is already a lock preventing multiple reloads from running
at the same time.
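
For reference, the locked and instrumented call is essentially the following
(the debug message texts are illustrative rather than verbatim from my patch):

ast_debug(1, "Cleaning sip_reregister entries from the scheduler\n");
ast_mutex_lock(&sip_reregister_cleanup_lock);
ast_sched_clean_by_callback(sched, sip_reregister, my_clean_task);
ast_mutex_unlock(&sip_reregister_cleanup_lock);
ast_debug(1, "Finished cleaning sip_reregister entries\n");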

 

So it seems that there is something in the reload process that does not like
that the schedule has already been cleared. I did identify one of these
areas: in the macro AST_SCHED_DEL_UNREF a warning is printed if a schedule
entry cannot be canceled within 10 tries, and after my initial patch I got a
lot of these warnings. So I modified the macro and added:

 

void *_data = (void *)ast_sched_find_data(sched, id); \
if (!_data) { \
        continue; \
} \

 

This checks whether the schedule entry still exists and, if not, continues
without trying to cancel it. That got rid of the warnings, but the crash is
still there.
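
To be clear about where the check went: it goes at the top of the macro's
do-block, before the cancel/retry loop, so the continue leaves the
do { ... } while (0) right away when the entry no longer exists. Roughly like
this (the surrounding macro body is paraphrased from memory, not copied
verbatim from sched.h):

do { \
        int _count = 0; \
        /* Added: bail out if the scheduler no longer knows this id */ \
        void *_data = (void *)ast_sched_find_data(sched, id); \
        if (!_data) { \
                continue; /* exits the whole do/while(0), skipping the rest of the macro */ \
        } \
        while (id > -1 && ast_sched_del(sched, id) && ++_count < 10) { \
                usleep(1); \
        } \
        /* ... the stock warning, refcall and id = -1 follow here ... */ \
} while (0)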

 

After analyzing a couple of core dumps I can see the following pattern in the
stack traces:

 

__ast_string_field_ptr_grow (mgr=mgr@entry=0x15b0ac8, pool_head=pool_head@entry=0x15b09d0, needed=needed@entry=8, ptr=ptr@entry=0x15b09f8) at stringfields.c:277
set_peer_defaults (peer=peer@entry=0x15b0980) at chan_sip.c:31927
build_peer (name=name@entry=0x7f5e481663c0 "dalco.00--5060-out", v_head=0x7f5e481664c0, alt=alt@entry=0x0, devstate_only=0, realtime=0) at chan_sip.c:32150
reload_config (reason=<optimized out>) at chan_sip.c:34047
sip_do_reload (reason=<optimized out>) at chan_sip.c:34892
do_monitor (data=data@entry=0x0) at chan_sip.c:30470
dummy_start (data=<optimized out>) at utils.c:1235
start_thread (arg=0x7f5e0dd4e700) at pthread_create.c:309
clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

 

So something segfaults during build_peer and set_peer_defaults. Any pointers
to what could go wrong here? The problem does not occur on every reload;
several reloads are needed, and a fair number of registers seems to be needed
as well. Note that the line numbers probably do not match the stock source,
as we have some patches to chan_sip.

 

If I have read the code correctly, the crashing line is:

 

size_t space = (*pool_head)->size - (*pool_head)->used;

 

I cannot see how that line would be affected by having cleaned the schedule.
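
In case someone wants to dig further, one cheap thing I could try is to log
the pool pointers just before the dereference; something like this (purely a
debugging sketch, using only the names visible in the backtrace and the
crashing line):

/* Hypothetical instrumentation in __ast_string_field_ptr_grow():
 * print the pointers before they are dereferenced, so the log shows
 * whether pool_head or *pool_head is already garbage when we crash. */
ast_log(LOG_NOTICE, "grow: mgr=%p pool_head=%p *pool_head=%p needed=%zu\n",
        (void *) mgr, (void *) pool_head, (void *) *pool_head, needed);
size_t space = (*pool_head)->size - (*pool_head)->used;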

 

Any ideas on how to proceed?

 

Best regards,

Torbjörn Abrahamsson

 

 

From: Torbjörn Abrahamsson [mailto:torbjorn.abrahamsson at gmail.com] 
Sent: 19 September 2018 09:46
To: 'Asterisk Developers Mailing List'
Subject: Problems with reregister schedule

 

Hello!

 

We have encountered a problem concerning the scheduling of reregisters in
chan_sip. We are using version 13.15.0. 

 

Our problem is that the scheduler sometimes seems to contain more objects
than it should, resulting in more registers being sent than there should be.
The problem seems to occur when doing reloads, but not always.

 

If I do a “sip show registry”, I see the expected number of registers, and
if I do a “sip show sched”, I see that there are more reregister schedule
entries than the number of registers just shown. On a fresh machine these
values are the same, but after a number of reloads they begin to differ.
The registry_list seems to contain the correct number of objects. These rogue
reregisters seem to live a life of their own. This is not a really big
problem in itself, because sending 10 registers instead of 1 only consumes
more network traffic, and the REGISTRAR does not really care. But if we
remove a register from Asterisk, the rogue ones will still be there, keeping
on registering until the end of the world. The same goes for changing the
extension that a register maps to, which results in two registrations with
different contacts being sent. These cases are problematic. The only way to
stop them seems to be to restart Asterisk.

 

After looking at the code, we see that on a reload the schedule is canceled
and rebuilt. The problem is that this cancel/rebuild is based on the
registry_list, which does not contain the rogue registers. So we started to
look at possibilities to clear the whole schedule. After a little
investigation we found the ast_sched_clean_by_callback function, so we
implemented this new callback function:

 

static int my_clean_task(const void *data)
{
    return 0;
}

 

Then we modified cleanup_all_regs and added the following function call
before the call to ao2_t_callback:

 

ast_sched_clean_by_callback(sched, sip_reregister, my_clean_task);
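
For those not familiar with it, the prototype in include/asterisk/sched.h is,
if I remember it correctly:

int ast_sched_clean_by_callback(struct ast_sched_context *con, ast_sched_cb match, ast_sched_cb cleanup_cb);

That is, it deletes every pending entry whose callback is sip_reregister and
runs my_clean_task on each entry's data before dropping the entry.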

 

This seemed to solve the problem: “sip show registry” and “sip show
sched” now always show the same value. The problem now is that Asterisk
segfaults (sig 11) when doing multiple reloads. So my guess is that we need
to lock something before doing this, but I unfortunately do not see which
lock to use.

 

So, any pointers to what to do? Is our solution on the right track? Should
this be solved in another way?

 

Thanks in advance, and best regards,

 

Torbjörn Abrahamsson
