[asterisk-dev] Problems with reregister schedule

Mon Oct 1 08:56:36 CDT 2018

On Fri, Sep 28, 2018 at 9:06 AM Torbjörn Abrahamsson
<torbjorn.abrahamsson at gmail.com> wrote:
>
> Hello again!
>
>
>
> As no one has provided any insight into this yet, I will combine a bump with some more information. Unfortunately, I have not resolved this yet.
>
>
>
> I tried to put a custom lock around the call to ast_sched_clean_by_callback by defining a mutex near the definition of sip_reload_lock:
>
>
>
> AST_MUTEX_DEFINE_STATIC(sip_reregister_cleanup_lock);
>
>
>
> And then surrounding the ast_sched_clean_by_callback call with:
>
> ast_mutex_lock(&sip_reregister_cleanup_lock);
>
> …
>
> ast_mutex_unlock(&sip_reregister_cleanup_lock);
>
>
>
> This made no difference, and I had not really expected it to. I had added debug prints before and after the ast_sched_clean_by_callback call, which indicate that the call is quite fast, and I would guess that no new thread is created for it. So my mutex would only stop the call from running concurrently with itself, which should not happen anyway since, if I recall correctly, there is already a lock against multiple reloads running at the same time.
>
>
>
> So it seems that there is something in the reload process that does not like the schedule having already been cleared. I did identify one such area: the macro AST_SCHED_DEL_UNREF prints a warning if a schedule cannot be canceled in 10 tries, and after my initial patch I got a lot of these warnings. So I modified the macro and added:
>
>
>
>                 void *_data = (void *)ast_sched_find_data(sched, id); \
>                 if (!_data) { \
>                         continue; \
>                 } \
>
>
>
> This checks whether the schedule exists and, if not, continues without trying to cancel it. This got rid of the warning, but the crash is still there.
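
A side note on the macro change above, as a sketch under assumptions: where the added lines sit inside AST_SCHED_DEL_UNREF is not shown. If they sit directly in a do { ... } while (0) wrapper, which is a common shape for such macros, then "continue" jumps to the loop condition and leaves the block, so everything after it in the macro is skipped as well. The toy macro below (TOY_SCHED_DEL_UNREF, struct toy_reg and toy_cancel are hypothetical names, not Asterisk code) shows that behaviour:

```c
/* Toy stand-in for a "cancel and unref" macro. This is NOT the real
 * AST_SCHED_DEL_UNREF; it only mirrors the do { ... } while (0) shape. */
#define TOY_SCHED_DEL_UNREF(found, id, refs_dropped)                  \
	do {                                                              \
		if (!(found)) {                                               \
			continue; /* jumps to "while (0)": the block is left */   \
		}                                                             \
		(refs_dropped)++; /* stands in for the unref call */          \
		(id) = -1;        /* stands in for resetting the stored ID */ \
	} while (0)

/* Hypothetical owner object that remembers its scheduler ID. */
struct toy_reg {
	int sched_id;     /* ID of the pending task, -1 when none */
	int refs_dropped; /* how many times the "unref" step ran */
};

/* entry_exists plays the role of "the lookup returned non-NULL". */
static void toy_cancel(struct toy_reg *reg, int entry_exists)
{
	TOY_SCHED_DEL_UNREF(entry_exists, reg->sched_id, reg->refs_dropped);
}
```

In this toy, a missing entry leaves both the stored ID and the reference count untouched, which may or may not be what the patch intended.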
>
>
>
> After analyzing a couple of coredumps I can see the following pattern in the stack traces:
>
>
>
> __ast_string_field_ptr_grow (mgr=mgr@entry=0x15b0ac8, pool_head=pool_head@entry=0x15b09d0, needed=needed@entry=8, ptr=ptr@entry=0x15b09f8) at stringfields.c:277
> __ast_string_field_ptr_grow (mgr=mgr@entry=0x15b0ac8, pool_head=pool_head@entry=0x15b09d0, needed=needed@entry=8, ptr=ptr@entry=0x15b09f8) at stringfields.c:277
> set_peer_defaults (peer=peer@entry=0x15b0980) at chan_sip.c:31927
> build_peer (name=name@entry=0x7f5e481663c0 "dalco.00--5060-out", v_head=0x7f5e481664c0, alt=alt@entry=0x0, devstate_only=0, realtime=0) at chan_sip.c:32150
> reload_config (reason=<optimized out>) at chan_sip.c:34047
> sip_do_reload (reason=<optimized out>) at chan_sip.c:34892
> do_monitor (data=data@entry=0x0) at chan_sip.c:30470
> dummy_start (data=<optimized out>) at utils.c:1235
> start_thread (arg=0x7f5e0dd4e700) at pthread_create.c:309
> clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>
>
>
> So something segfaults during build_peer and set_peer_defaults. Any pointers to what could go wrong in these cases? The problem does not occur on every reload; several are needed. A fair number of registrations seem to be needed too. Note that the line numbers probably do not match upstream, as we have some patches to chan_sip.
>
>
>
> If I looked at the code correctly then the crashing line is:
>
>
>
> size_t space = (*pool_head)->size - (*pool_head)->used;
>
>
>
> I cannot see how this would be affected by having cleaned the schedule.
>
>
>
> Any ideas on how to proceed?

Hey Torbjorn,

I don't think anybody is intentionally ignoring your post - it's
probably not being responded to because you are working on code in the
deep dark belly of chan_sip and any people still familiar with it
either aren't familiar with that part of the code or are not
responding for other reasons.

From an overall project perspective (or at least for the majority of
developers that work with me), we're putting 100% of our efforts on
chan_pjsip due to code maintenance challenges with chan_sip.  We ran
into situations where fixing one chan_sip bug had a tendency to create
three new ones due to its highly interdependent code architecture.

I guess that doesn't solve your bug, but does provide a little bit of
explanation as to why you might be getting minimal feedback.

Best wishes,
Matthew Fredrickson
Asterisk Project Lead

>
>
>
> Best regards,
>
> Torbjörn Abrahamsson
>
>
>
>
>
> From: Torbjörn Abrahamsson [mailto:torbjorn.abrahamsson at gmail.com]
> Sent: 19 September 2018 09:46
> To: 'Asterisk Developers Mailing List'
> Subject: Problems with reregister schedule
>
>
>
> Hello!
>
>
>
> We have encountered a problem concerning the scheduling of reregisters in chan_sip. We are using version 13.15.0.
>
>
>
> Our problem is that the scheduler sometimes seems to contain more objects than it should, resulting in more REGISTERs being sent than there should be. The problem seems to occur when doing reloads, but not always.
>
>
>
> If I do a “sip show registry”, I see the expected number of registrations, and if I do a “sip show sched”, I see more reregister schedules than that number. On a fresh machine these values are the same, but after a number of reloads they begin to differ. The registry_list seems to contain the correct number of objects. These rogue reregisters seem to live a life of their own. Usually this is not a big problem: sending 10 REGISTERs instead of 1 only generates more network traffic, and the registrar does not really care. But if we remove a registration from Asterisk, the rogue ones are still there and keep registering until the end of the world. The same goes for changing the extension that a registration maps to, which results in two registrations with different contacts being sent. These cases are problematic. The only way to stop them seems to be to restart Asterisk.
>
>
>
> After looking at the code, we see that on a reload the schedule is canceled and rebuilt. The problem is that this cancel/rebuild is based on the registry_list, which does not contain the rogue registrations. So we started to look at ways to clear the whole schedule. After a little investigation we found the ast_sched_clean_by_callback function, and implemented this new callback function:
>
>
>
> static int my_clean_task(const void *data)
> {
>     return 0;
> }
>
>
>
> And then we modified cleanup_all_regs, and added the following function call before calling the ao2_t_callback:
>
>
>
> ast_sched_clean_by_callback(sched, sip_reregister, my_clean_task);
>
>
>
> This seemed to solve the problem: “sip show registry” and “sip show sched” now always showed the same value. The new problem was that Asterisk segfaulted (signal 11) when doing multiple reloads. My guess is that we need to take some lock before doing this, but unfortunately I do not see which lock to use.
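
To illustrate one possible side effect of clearing the schedule from outside, here is a toy model, assuming each registration object remembers the ID of its pending task. None of this is Asterisk code; toy_sched, toy_registration and the toy_* functions are hypothetical names. With a cleanup callback that does nothing, the owners keep IDs that no longer exist in the scheduler; a cleanup callback that receives the task's data can reset them:

```c
#include <stddef.h>

#define TOY_MAX_ENTRIES 8

typedef int (*toy_sched_cb)(const void *data);

struct toy_entry {
	int in_use;
	int id;
	toy_sched_cb callback;
	const void *data;
};

struct toy_sched {
	struct toy_entry entries[TOY_MAX_ENTRIES];
	int next_id;
};

/* Hypothetical owner object: remembers the ID of its pending task. */
struct toy_registration {
	int expire_id; /* -1 when nothing is scheduled */
};

/* Add a task; returns its ID, or -1 if the table is full. */
static int toy_sched_add(struct toy_sched *s, toy_sched_cb cb, const void *data)
{
	for (size_t i = 0; i < TOY_MAX_ENTRIES; i++) {
		if (!s->entries[i].in_use) {
			s->entries[i] = (struct toy_entry){ 1, s->next_id, cb, data };
			return s->next_id++;
		}
	}
	return -1;
}

/* Cancel one task by ID; returns 0 on success, -1 if no such task exists. */
static int toy_sched_del(struct toy_sched *s, int id)
{
	for (size_t i = 0; i < TOY_MAX_ENTRIES; i++) {
		if (s->entries[i].in_use && s->entries[i].id == id) {
			s->entries[i].in_use = 0;
			return 0;
		}
	}
	return -1;
}

/* Remove every task whose callback is `match`, passing each task's data
 * to `cleanup`. Returns the number of tasks removed. */
static int toy_sched_clean_by_callback(struct toy_sched *s,
	toy_sched_cb match, toy_sched_cb cleanup)
{
	int removed = 0;
	for (size_t i = 0; i < TOY_MAX_ENTRIES; i++) {
		if (s->entries[i].in_use && s->entries[i].callback == match) {
			s->entries[i].in_use = 0;
			cleanup(s->entries[i].data);
			removed++;
		}
	}
	return removed;
}

/* Stand-in for the scheduled task itself. */
static int toy_reregister(const void *data) { (void)data; return 0; }

/* Cleanup that does nothing: the owner's stored ID goes stale. */
static int toy_noop_cleanup(const void *data) { (void)data; return 0; }

/* Cleanup that resets the owner's stored ID. */
static int toy_reset_cleanup(const void *data)
{
	struct toy_registration *reg = (struct toy_registration *)data;
	reg->expire_id = -1;
	return 0;
}
```

In this toy, after a clean with toy_noop_cleanup a later toy_sched_del on the remembered ID fails because the entry is gone, while toy_reset_cleanup leaves the owner with -1 so it knows nothing is scheduled.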
>
>
>
> So, any pointers on what to do? Is our solution on the right track? Should this be solved in another way?
>
>
>
> Thanks in advance, and best regards,
>
>
>
> Torbjörn Abrahamsson
>

-- 
Matthew Fredrickson
Digium - A Sangoma Company | Asterisk Project Lead
445 Jan Davis Drive NW - Huntsville, AL 35806 - USA