[asterisk-dev] [Code Review] chan_sip refcount cleanup derived from rgagnon's review 1101

Terry Wilson reviewboard at asterisk.org
Tue May 10 16:56:40 CDT 2011



> On 2011-05-10 15:40:01, Rob Gagnon wrote:
> > Initial testing didn't go well.  Applying only this patch to 1.6.2 branch in my dev environment, and then placing one test call through it left me with 18 unbalanced objects when asterisk was shut down.
> > 
> > It's going to take some time to review the differences in code.  Right now, I was just testing by running it only, and not reading through the entire patch.
> 
> Rob Gagnon wrote:
>     Further information....
>     
>     simply starting asterisk, waiting 10 seconds, and then stopping it (without any calls hitting the box) shows the same 18 references
> 
> Rob Gagnon wrote:
>     The generic remaining references are here:
>     
>     Problem: net Refcount not zero for object ac3c5f48
>     Object ac3c5f48 history:
>        0x2aaaac3c5f48 =1   chan_sip.c:26548:load_module (allocate threadt table)
>     ==============
>     Problem: net Refcount not zero for object ac3c3008
>     Object ac3c3008 history:
>        0x2aaaac3c3008 =1   chan_sip.c:26547:load_module (allocate dialogs)
>     ==============
>     Problem: net Refcount not zero for object ac3c00c8
>     Object ac3c00c8 history:
>        0x2aaaac3c00c8 =1   chan_sip.c:26546:load_module (allocate peers_by_ip)
>     ==============
>     Problem: net Refcount not zero for object ac3bd188
>     Object ac3bd188 history:
>        0x2aaaac3bd188 =1   chan_sip.c:26545:load_module (allocate peers)
>     ==============
>     
>     
>     The others are related to the number of peers in my config.  Each of the other 14 references all start with a call to build_peer(), but end with different reference counts on exit.
> 
> Terry Wilson wrote:
>     There must be something seriously weird going on if you are getting this on 1.6.2 as I get references after starting and unloading chan_sip.so no matter how many peers I have, whether they are using qualify, registering, or whatever. Are you sure that these references aren't related to the phones sending a SUBSCRIBE that expires after several minutes?
> 
> Rob Gagnon wrote:
>     it could be something like that, I was just about to look into the logging deeper.  My thinking is that some of the refcount differences are due to peer "poking" where some are online, and some are not.
>     
>     The point I think though, is that on exit, all of that should be cleared out anyhow.  Any existing timers that might hold a reference, or whatever should all be canceled and have references removed.
>     
>     I did find this:
>     of the 14 peers I have, 
>     "sip show peers" showed 7 monitored (2 online, 5 offline), and 7 unmonitored (all 7 online)
>     
>     When I looked at the refcounter output, the correlation was there:
>     The 2 online monitored peers had 4 references remaining each.
>     The 5 offline monitored peers had 1 reference remaining each.
>     The 7 online unmonitored peers had 2 references remaining each.
>     
>     I have a place to start looking, but at least it seems all related to peers.
>     
>     Good news that "sip reload" does not cause the count of 18 to increase.

> The point I think though, is that on exit, all of that should be cleared out anyhow.  Any existing timers that might hold a reference, or whatever should all be canceled and have references removed.

Not necessarily. The refcounts should keep things around until they are done being used. The SUBSCRIBEs would not have expired (and therefor not release their refs).  Also, Asterisk will force the exit anyway and not be able to write to the /tmp/refs file that things are gone. Try just comment out all peers, do a sip reload, and wait. I've had it take up to 10 minutes for certain phones to finally have their registrations disappear this way.

Ideally we would come up with a way to force expiration of the SUBSCRIBE dialogs on exit which are keeping the peers around. Currently there is no record on the peer side of what subscribe dialogs are created in handle_request_subscribe (peer->call is never set), so there is no way to tear those down when we are unreffing the peers and we just have to let them expire naturally. I thought about trying to add the peer->call and handling it, but I was afraid of side effects. I really hate that circular dependency (peer->call->relatedpeer->...) and would like to find a way around it. Since the case of unloading the module and loading it again is fairly limited in operation (sip reload would be more likely) and restarting will of course free everything anyway, I left it as is.

I would be interested in getting a copy of your sip configs so I can add those to the tests I'm doing as well.


- Terry


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/1207/#review3515
-----------------------------------------------------------


On 2011-05-09 12:22:35, Terry Wilson wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviewboard.asterisk.org/r/1207/
> -----------------------------------------------------------
> 
> (Updated 2011-05-09 12:22:35)
> 
> 
> Review request for Asterisk Developers, David Vossel and Rob Gagnon.
> 
> 
> Summary
> -------
> 
> This is a version of rgagnon's refcount fixes with just the changes I'm positive need to be made and without debugging or whitespace improvements. The other improvements like switching ao2_t_(un)ref to (un)ref_(peer|dialog), sip_show_sched improvements, etc. would happen in a separate commit. I have included some other refcount issues that I have found. Changes are limited solely to chan_sip.c as the other files changed in the review centered around debugging improvements.
> 
> I will try to outline the differences between this patch and the rgagnon's here, to make it clear why I left some things out and added others by referencing numbered comments from review 1101.
> 
> ==========================
> Changes not included
> ==========================
> 
> 1) Add forward declarations for functions that will be referenced in sip_show_sched()
> 
>          I did not make this change, because it was debugging only improvement. It would be included in a separate patch.
> 
> 
> 3) dialog_unlink_all():
>    - Add null check to ensure we don't attempt to unlink a null dialog
> 
>           I would rather find the bug by crashing here when someone makes a silly change than it scooting by and causing some subtle issue later.
> 
>    - Change some text in dialog_unref() and registry_unref() calls to properly
>      explain what they are unreferencing, and make it easier to match to the
>      reference creation from transmit_register().  See item 27 below for
>      information on fix to transmit_register().
> 
>           Debugging only-fix not included
> 
> 4) __sip_autodestruct():
>    - Add final dialog_unref() needed to balance the ao2_alloc() for the dialog
>      There was actually a required unref here that was commented out, but it is
>      actually needed.  You will see why later, down in item 10 for the
>      find_call() function change explanation.
> 
> 8) Various places:
>    - Change any ao2_t_ref() function calls that are for a dialog object to use
>      dialog_ref(), or dialog_unref() as needed.
>    - Change any ao2_t_ref() function calls that are for a peer object to use
>      ref_peer(), or unref_peer() as needed.
>    - Change ao2_iterator_next() to ao2_t_iterator_next() in order to trouble-
>      shoot output of the refcounter utility better.
> 
>           Debugging only-fix not included
> 
> 9) sip_alloc():
>    - Remove early binding of T38 UDPTL port, as it will the code is redundant,
>      and will be executed in handle_request_invite() anyhow.
> 
>           I didn't include this because it wasn't causing a problem and I couldn't guarantee that there wouldn't be a side effect. handle_request_invite() checks to make sure we don't allocate twice, so I left this alone.
> 
> 
> 10) find_call():
>     This function is the biggest source of all dialog reference unbalancing. 
>     The function is intended to return a locked, and referenced dialog (if it
>     does not return NULL).  However, if it had to create a dialog object in
>     order to return it, it would not add a reference, only the lock.  With this
>     function not returning a consistent number of references with its dialog,
>     all other unreferencing is doomed to failure.  The change in this area
>     fixes this by adding a reference before returning the pointer to the
>     dialog.
> 
>          This change was unneeded and required a lot of other changes to offset it.
> 
> 12) sip_show_peers():
>     - Add an unref_peer() call in order to properly release the ao2_iterator
>       reference to the peer.
>     - Remove extraneous unref_peer() calls where no reference was added.
> 
>           This change was incorrect as the reference from the iterator was being kept around because we were storing the pointer to the peer in the peerarray which we are iterating through below where the unref_peer calls are made. I did set peer = peerarray[k] = unref_peer(...) to make sure that the references weren't accidentally used after they were unreffed.
> 
> 13) dialog_needdestroy():
>     - Similar to __sip_autodestruct(), added a dialog de-reference where a 
>       comment mistakenly tells you not to.  See item 10 about find_call()
> 
>           Removed because 10 was removed.
> 
> 14) sip_show_sched():
>     - Add names, and pointers to almost all functions that can be referenced
>       by the chan_sip scheduler so the CLI output will minimize the count shown
>       for "<unknown>".
> 
>           Removed because it was debugging-only
> 
> 18) stop_session_timer():
>     - Removed log write for WARNING if calling function will a NULL stimer as
>       this function can be called to ensure a session timer is stopped, and
>       there's really no need for the warning wasting space in the log.
> 
>           The only reason this was wasting space in the log was the added call to stop_session_timer() without the check of if (p->st_timer) that all other calls of stop_session_timer() had, so I left the WARNING as is. If we decide to remove the the warning, we will also remove all of the if (p->stimer) checks as well. For now, better to leave things consistent.
> 
>     - AST_SCHED_DEL_UNREF() moved inside a check to ensure there is a timer
>       running before it announces to the logs that it stopped the timer.
> 
>           AST_SCHED_DEL_UNREF() checks for id > -1 anyway, and if stimer->active is set when it isn't running, it seems like we should address that issue instead.
> 
> 21) proc_session_timer():
>     - A call to stop_session_timer() is removed, and replaced with a portion of
>       the code from stop_session_timer().  This is to avoid a double
>       de-referencing of the timer object is the timer is NOT to be rescheduled.
>       [stop_session_timer() would have de-referenced, as well as the needed
>        dialog_unref() also in this function]
>     - This change prevents core dumps!
> 
>           AST_SCHED_DEL_UNREF will not ever unref if the schedid is < 0. Since st_schedid is set = -1 before calling stop-session_timer(), I can see no way the logic would ever permit anything dangerous, so I have left it as is. If there is a crash with this patch, I'd be interested in seeing a core dump so we can figure out what is going wrong.
> 
> 22) sip_poke_noanswer():
>     - dialog_unref() and dialog_unlink_all() are reversed here.  This is due to
>       the doubly-linked nature of the peer->call and the dialog.
>     - Calling dialog_unlink_all() first will leave peer->call set to NULL,
>       causing the needed dialog_unref() to not do anything, when it needed to
> 
>           This is a little tricky. The problem is there is a circular reference with peers and pvts. p->relatedpeer->call can == p. dialog_unlink_all will unref p->relatedpeer->call and set it to NULL. When doing dialog_unlink_all(peer->call), peer->call can == call->relatedpeer->call so when it is set to NULL, the following dialog_unref(peer->call) becomes a NoOp. The question becomes, is this wrong? When peer->call is set, it is from newly sip_allloc'd pvt sip_poke_peer. The dialog leaves that function with a refcount of 2. Once for the link remaining from sip_alloc, and another for the explicit ref when we set peer->call. dialog_unlink_all() specifically unlinks the dialog from dialogs and will then unref dialog->relatedpeer->call and set it to NULL. So, the refcount will then be 0 like we would expect, and the dialog_unref(NULL) would be appropriate because we don't want to unref anymore. I don't think there would be a case where relatedpeer and relatedpeer->call weren't set in this case, so it might be perfectly fine to remove the dialog_unref from the code. But, as it is a NoOp and I don't like making changes to things that aren't specifically causing problems, I leave it in for now.
> 
> 23) sip_poke_peer():
>     - Same reversal of dialog_unref() and dialog_unlink_all() as #22 above.
> 
>           Same reason as #22
> 
>     - Removed a dialog_unref() at the end of the function as it would cause
>       the dialog to be free'd, and it is still needed in memory until either
>       the poke expires, or the response is received.
> 
>           sip_alloc returns a pvt with a refcount of 2. We immediately bump it to 3 when assigning to peer-.call. The dialog_unref takes back to two where it should be. One for the link in dialogs, and one for the peer->call.
> 
> 24) sip_request_call():
>     - Add needed dialog_unref() following dialog_unlink_all().
> 
>           Not needed. Stems from the confusion that we should hold on to the reference from sip_alloc instead of relying on the link in dialogs. Further explanation on review 1101.
> 
> 25) load_module():
>     (1.8 and trunk only):
>     - Change ao2_container_alloc() for sip_monitor_instances to
>       ao2_t_container_alloc() for tracking in the refcounter utility.
> 
>           This review is for 1.6.2 only, so this is left out (and is debugging only)
> 
> 27) transmit_register():
>     - Change some text in different dialog_ref() calls in order to properly
>       match references and unreferences that occur during dialog_unlink_all().
>     - Add a dialog_ref() when sip_alloc() is called in the function in order
>       to balance with the dialog_unref() that is called just before the 
>       function exits.
> 
>           Didn't keep the debugging only changes, and the dialog_ref also stemmed from the mistaken belief that we are supposed to hold onto the sip_alloc ref.
> 
> ==========================
> Changes included
> ==========================
> 2) ref_peer() and unref_peer():
>    (1.6.x only):
>    - Added macro definitions (based on the REF_DEBUG #define) around functions
>      in a manner similar to how dialog_ref() and dialog_unref() operate.  This
>      allows for better reference debugging as actual chan_sip line numbers will
>      appear in the /tmp/refs file when invoked.
> 
>           This was debugging only, but is so incredibly useful I snuck it in anyway. :-p
> 
> 3) dialog_unlink_all():
>    - Add call to stop_session_timer() so an ended call is not held in memory
>      for the duration of the session interval
> 
>           I included this, but wrapped it in an if() like other calls to stop_session_timer and kept the WARNING in stop_session_timer.
> 
> 5) sip_cancel_destroy():
>    - Function always returns 0.  Optimized the code to use AST_SCHED_DEL_UNREF()
> 
> 6) __sip_destroy():
>    - Moved redundant session timer free'ing to top of function only.
>    - Also, the first check of p->stimer at the top of the function was causing
>      a required timer stop, and de-referencing to not happen.
> 
> 7) interpret_t38_parameters():
>    - Change AST_SCHED_DEL() to AST_SCHED_DEL_UNREF() for t38 abort timer as it
>      is referenced in the dialog.
> 
> 
> 11) reqprep():
>     - Remove add_header() call for "Require: timer"
>     - This is mentioned in issue #18704, but I had to include it here, as the 
>       clients I was testing with were getting all messed up when using sessions.
> 
>           This I left in, even though it isn't directly related to the refcount issues. I probably should/will take it out and put it in a separate patch.
> 
> 15) handle_request_invite():
>     - Before setting the T38 abort timer, ensure one is not already running.  If
>       it is, stop that one, and start a new one.  Previously, t38id would be 
>       overwritten without any regard to the reference it had to the dialog.
> 
> 
>           I included the T38 part of the patch, but didn't keep the addition of && (c_state != AST_STATE_DOWN) mostly because I wasn't entirely sure why it was added. It very well may need to be included as well, I just need a little explanation as to the case that required it.
> 
> 16) handle_request_cancel():
>     Another large memory leak here.  Any time handle_request_cancel() was
>     called, any packets belonging to the dialog would be free'd, but two things
>     were not happening:  a) possible packet data was not free'd, and b) the
>     packet was not being unreferenced from the dialog.  The change here ensures
>     proper memory free'ing, as well as de-referencing.
> 
> 17) restart_session_timer():
>     - ast_debug() log write call moved to be ABOVE the AST_SCHED_DEL_UNREF()
>       call.
>     - With is BELOW, the output will always show the timer id as -1.
> 
> 18) stop_session_timer():
>        - ast_debug() log write call moved to be ABOVE the AST_SCHED_DEL_UNREF()
>       call.
>     - With is BELOW, the output will always show the timer id as -1.
> 
> 20) start_session_timer():
>     - In some cases a session timer is already running when start_session_timer()
>       is called again (for instance, an extra "200" message reply is sent).  The
>       change here ensures any existing session timer is stopped before a new one
>       is started.  Per RFC4028, this is the right way to do it--IE:  the timer
>       must start from the most recent "200" message.
>     - The ast_debug() message is also moved to be within the portion of the 
>       if-statement so it does not output a successful timer start should the
>       process fail to start a timer.  The dialog callid is also added to the
>       output.
> 
>           I think this and the code below it could be changed to an AST_SCHED_REPLACE_UNREF, but since it worked as-is I haven't changed it. I figured we could all decide during the review process if it would be better.
> 
> 
> 23) sip_poke_peer():
>     - Added a peer_ref() when xmitres == XMIT_ERROR because it will directly
>       call sip_poke_noanswer() without the scheduler.  sip_poke_noanswer() is
>       guaranteed to remove a reference as it is normally a scheduled callback.
>       This peer_ref() call keeps the references balanced.
> 
> 26) unload_module():
>     (all versions):
>     - Add iterator to unreference and free peers in the peers table.  Memory leak
>       just before chan_sip exits, but it was indicated as a problem by the
>       refcounter utility.
> 
>           I kept this, but changed it to use a function that shares functionality with unlink_marked_peers_from_tables
> 
> 
> ==========================
> New changes
> ==========================
> I have some changes in review 1198 that have already been "Ship It'd"
> 1. peer_is_marked() which now calls peer_sched_cleanup():
>    Change AST_SCHED_DEL to AST_SCHED_DEL_UNREF for peer->pokeexpire and also add an AST_SCHED_DEL_UNREF for peer->expire as we were leaking peers who had registered to us when removing them from sip.conf and doing a sip reload
> 
> 2. find_call():
>   When pedantic mode is enabled and the trylock fail, we do a goto restartsearch and need to unref the pvt that ao2_callback finds.
> 
> I also found one new one:
> 3. handle_request_subscribe():
>   Add a peer_unref() to get rid of the sip_alloc ref now that we are done with it and have added a ref for relatedpeer when we store authpeer there. 
> 
> 
> This addresses bug 17255.
>     https://issues.asterisk.org/view.php?id=17255
> 
> 
> Diffs
> -----
> 
>   /branches/1.6.2/channels/chan_sip.c 316920 
> 
> Diff: https://reviewboard.asterisk.org/r/1207/diff
> 
> 
> Testing
> -------
> 
> Defined peers with and without qualify and mailboxes. Registered phones to them, placed calls, transfers, unregisters/registers, removed the peer definitions and done sip reloads, and module unloads and verified with REF_DEBUG/refcounter that the refcount on all objects went to 0 (eventually).
> 
> One thing to note is that in the case of a phone subscribing to mailbox events, the SUBSCRIBE may have an Expires header value of several minutes. Because of the relatedpeer, the peer will stick around until __sip_autodestruct is called on this subscription.  Currently, there is no link from the peer side for this SUBSCRIBE's pvt, so it isn't easy to force the timeout/destruction of the dialog when you try to remove a peer from the config and do a reload. As soon as the autodestruct happens, though, the peer refcount goes to 0 and it is removed. This is how refcounting works, so I'm not too worried about it. It would be nice to kill the dialog early, though.
> 
> Since it isn't related to the problem of refcounts staying around and keeping UDP ports open, though, I vote we look at it later if we want to fix it. I just wanted to make it clear that it was happening in case anyone testing mistakenly counted it as a reference leak since it may take a very long time to go away depending on what the phone sends for an expiration time.
> 
> 
> Thanks,
> 
> Terry
> 
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.digium.com/pipermail/asterisk-dev/attachments/20110510/130f6e90/attachment-0001.htm>


More information about the asterisk-dev mailing list