<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head><meta http-equiv=Content-Type content="text/html; charset=iso-8859-1"><meta name=Generator content="Microsoft Word 15 (filtered medium)"><style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;
mso-fareast-language:EN-US;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:#0563C1;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:#954F72;
text-decoration:underline;}
span.E-postmall17
{mso-style-type:personal;
font-family:"Calibri",sans-serif;
color:windowtext;}
span.E-postmall18
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:#1F497D;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:70.85pt 70.85pt 70.85pt 70.85pt;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]--></head><body lang=SV link="#0563C1" vlink="#954F72"><div class=WordSection1><p class=MsoNormal><span style='color:#1F497D'>Hello again!<o:p></o:p></span></p><p class=MsoNormal><span style='color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>As no one has provided any insight into this yet, I will combine a bump with some more information. Unfortunately, I have not resolved the problem yet.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>I tried to put a custom lock around the call to ast_sched_clean_by_callback by adding a mutex near the definition of sip_reload_lock:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>AST_MUTEX_DEFINE_STATIC(sip_reregister_cleanup_lock);<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>And then surrounding the ast_sched_clean_by_callback call with: <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>ast_mutex_lock(&amp;sip_reregister_cleanup_lock);<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>…<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>ast_mutex_unlock(&amp;sip_reregister_cleanup_lock);<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>This made no difference, which I didn’t really expect. I had added debug prints before and after the ast_sched_clean_by_callback call, which indicate that the call is quite fast, and I would guess that a new thread isn’t created for this. 
So my mutex would only stop the cleanup from running concurrently with itself, which should not happen anyway because, if I recall correctly, there is already a lock against multiple reloads running at the same time.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>So it seems that there is something in the reload process that does not like the schedule having already been cleared. I did identify one such area: in the macro AST_SCHED_DEL_UNREF, a warning is printed if a schedule entry cannot be canceled in 10 tries, and after my initial patch I got a lot of these warnings. So I modified the macro and added:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'> void *_data = (void *)ast_sched_find_data(sched, id); \<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'> if(!_data) { \<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'> continue;\<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'> } \<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>This checks whether the schedule entry exists and, if not, continues without trying to cancel it. 
This got rid of the warning, but the crash is still there.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>After analyzing a couple of core dumps, I can see the following pattern in the stack traces:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>__ast_string_field_ptr_grow (mgr=mgr@entry=0x15b0ac8, pool_head=pool_head@entry=0x15b09d0, needed=needed@entry=8, ptr=ptr@entry=0x15b09f8) at stringfields.c:277<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>__ast_string_field_ptr_grow (mgr=mgr@entry=0x15b0ac8, pool_head=pool_head@entry=0x15b09d0, needed=needed@entry=8, ptr=ptr@entry=0x15b09f8) at stringfields.c:277<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>set_peer_defaults (peer=peer@entry=0x15b0980) at chan_sip.c:31927<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>build_peer (name=name@entry=0x7f5e481663c0 "dalco.00--5060-out", v_head=0x7f5e481664c0, alt=alt@entry=0x0, devstate_only=0, realtime=0) at chan_sip.c:32150<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>reload_config (reason=&lt;optimized out&gt;) at chan_sip.c:34047<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>sip_do_reload (reason=&lt;optimized out&gt;) at chan_sip.c:34892<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>do_monitor (data=data@entry=0x0) at chan_sip.c:30470<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>dummy_start (data=&lt;optimized out&gt;) at utils.c:1235<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>start_thread (arg=0x7f5e0dd4e700) at pthread_create.c:309<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US 
style='color:#1F497D'>clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>So something segfaults in build_peer and set_peer_defaults. Any pointers to what could go wrong in these cases? The problem does not occur on every reload; several are needed. A fair number of registrations seem to be needed too. Note that the line numbers probably do not match the stock source, as we have some patches to chan_sip.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>If I read the code correctly, the crashing line is:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>size_t space = (*pool_head)->size - (*pool_head)->used;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>I cannot see how this would be affected by having cleaned the schedule.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>Any ideas on how to proceed? 
<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>Best regards,<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'>Torbjörn Abrahamsson<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US style='color:#1F497D'><o:p> </o:p></span></p><div><div style='border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0cm 0cm 0cm'><p class=MsoNormal><b><span style='mso-fareast-language:SV'>From:</span></b><span style='mso-fareast-language:SV'> Torbjörn Abrahamsson [mailto:torbjorn.abrahamsson@gmail.com] <br><b>Sent:</b> 19 September 2018 09:46<br><b>To:</b> 'Asterisk Developers Mailing List'<br><b>Subject:</b> Problems with reregister schedule<o:p></o:p></span></p></div></div><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal>Hello!<o:p></o:p></p><p class=MsoNormal><o:p> </o:p></p><p class=MsoNormal><span lang=EN-US>We have encountered a problem with the scheduling of re-registrations in chan_sip. We are using Asterisk version 13.15.0. <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Our problem is that the scheduler sometimes seems to contain more entries than it should, resulting in more REGISTER requests being sent than expected. The problem seems to occur when doing reloads, but not always. <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>If I run “sip show registry”, I see the expected number of registrations, but if I run “sip show sched”, I see more reregister schedule entries than that. On a freshly started machine these values are the same, but after a number of reloads they begin to differ. The registry_list seems to contain the correct number of objects. 
These rogue reregisters seem to live a life of their own. In itself this is not a big problem, because sending 10 registrations instead of 1 only generates more network traffic, and the registrar does not really care. But if we remove a registration from Asterisk, the rogue ones are still there and keep registering until the end of the world. The same goes for changing the extension that a registration maps to, which results in two registrations with different contacts being sent. These cases are problematic. The only way to stop them seems to be to restart Asterisk. <o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Looking at the code, we see that on a reload the schedule is canceled and rebuilt. The problem is that this cancel/rebuild is based on the registry_list, which does not contain the rogue registrations. So we started to look for a way to clear the whole schedule. After a little investigation we found the ast_sched_clean_by_callback function. 
So we implemented this new callback function:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>static int my_clean_task(const void *data)<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>{<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US> return 0;<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US>}<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>We then modified cleanup_all_regs and added the following function call before the call to ao2_t_callback:<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>ast_sched_clean_by_callback(sched, sip_reregister, my_clean_task);<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>This seemed to solve the problem: “sip show registry” and “sip show sched” now always showed the same value. The new problem was that Asterisk segfaulted (signal 11) when doing multiple reloads. My guess is that we need to take some lock before doing this, but unfortunately I do not see which lock to use.<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>So, any pointers on what to do? Is our solution on the right track? Should this be solved in another way?<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Thanks in advance, and best regards,<o:p></o:p></span></p><p class=MsoNormal><span lang=EN-US><o:p> </o:p></span></p><p class=MsoNormal><span lang=EN-US>Torbjörn Abrahamsson<o:p></o:p></span></p></div></body></html>