[asterisk-dev] [Code Review] Repair UDP port leaks, Memory Leaks, Denial of Service in chan_sip
rgagnon
reviewboard at asterisk.org
Mon Feb 14 03:43:26 CST 2011
> On 2011-02-11 04:56:21, schmidts wrote:
> > i just have taken a short look about your changes but didnt find anything which makes no sense.
> > i am now load testing your patch to see if i can find any problems with it.
>
> schmidts wrote:
> i have done some load tests and removing the udptl allocation from sip_alloc brings a lot of sip power. and also dialogs are much faster removed and as far as i have recognized i dont see any memory leaks.
>
> nice work!
Good to hear. What were you load testing with? sipp? or something else.
- rgagnon
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/1101/#review3179
-----------------------------------------------------------
On 2011-02-11 03:24:09, rgagnon wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviewboard.asterisk.org/r/1101/
> -----------------------------------------------------------
>
> (Updated 2011-02-11 03:24:09)
>
>
> Review request for Asterisk Developers, wdoekes, rgagnon, and loloski.
>
>
> Summary
> -------
>
> 2011-02-09 by Rob Gagnon <rgagnon> Patch documentation
> ======================================================
>
> This readme applies to patch files named below under issue 17255:
> https://issues.asterisk.org/view.php?id=17255
>
> branch-1.6.1-r307181-sip-dos-mem-leak-fix.diff
> branch-1.6.2-r307181-sip-dos-mem-leak-fix.diff
> tag-1.6.2.16.1-r307181-sip-dos-mem-leak-fix.diff
> branch-1.8-r307181-sip-dos-mem-leak-fix.diff
> trunk-r307181-sip-dos-mem-leak-fix.diff
>
> As the patch created is a fair size, I am using this file to explain the
> changes made to different files (in lieu of a large post to the issue).
>
> Patches are created for branch 1.6.1, 1.6.2, and 1.8, as well as the trunk,
> and tag 1.6.2.16.1.
>
> Special thanks to "wdoekes", and "loloski" for testing, and finding the conditions under
> which entries for sip.conf "register => ...." entries were causing problems.
>
> Patches 1.6.1 and 1.6.2 change the following files:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> chan_sip.c
> rtp.h
> sched.h
> rtp.c
> sched.c
>
> Patches 1.8 and trunk change the following files:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> chan_sip.c
> sip.h
> sched.h
> sched.c
>
> The changes to chan_sip.c by far are the fix to the majority of the problem
> for this issue. I will explain the other files first.
>
> rtp.c (branch 1.6.1 and 1.6.2 only)
> ~~~~~
> Remove the forward declarations for red_write() and ast_rtcp_write() as they
> will now be defined within rtp.h.
>
> Renamed red_write() to ast_rtp_red_write() as it will basically be public now.
>
> rtp.h (branch 1.6.1 and 1.6.2 only)
> ~~~~~
> Adds prototypes for ast_rtcp_write() and ast_rtp_red_write() functions so
> they may be reference in sip_show_sched() CLI function.
>
> sched.c (all patches)
> ~~~~~~~
> Add a space to align the output of CLI command "sip show sched"
>
> sched.h (all patches)
> ~~~~~~~
> Alter the static size of cb_list array in order to fit all the callback names
> from sip_show_sched() CLI function.
>
> sip.h (branch 1.8 and trunk only)
> ~~~~~
> Add macros for ref_peer() and unref_peer() in a manner similar to existing
> macros dialog_ref() and dialog_unref() that are defined in dialog.h.
>
> chan_sip.c (all patches)
> ~~~~~~~~~~
> Many alterations in this file. They are all meant to ensure proper balance of
> dialog object (struct sip_pvt) reference increases and decreases throughout the
> file. Due to the unbalancing of the references before this patch, the
> following would occur:
>
> - Memory leaks
> - due to dialogs not being destroyed/free'd
> - Denial of Service
> - RTP, RTCP, and UDPTL ports are not freed if the related dialog object is
> not destroyed
> - Asterisk core dumps
> - In some cases, a dialog reference decrease will cause it to be destroyed
> when such an action is not expected, nor wanted. This leads to null
> pointers being used, and then KABOOM.
>
> Some changes made may not be in all files. Some are in all patches. Some are
> just in the 1.6.x patches, and others are just in the 1.8 and trunk patches.
> The notes below should indicate any differences.
>
> 1) Add forward declarations for functions that will be referenced in sip_show_sched()
>
> 2) ref_peer() and unref_peer():
> (1.6.x only):
> - Added macro definitions (based on the REF_DEBUG #define) around functions
> in a manner similar to how dialog_ref() and dialog_unref() operate. This
> allows for better reference debugging as actual chan_sip line numbers will
> appear in the /tmp/refs file when invoked.
> (1.8 and trunk only):
> - Changed to ref_peer_debug() and unref_peer_debug() similar to how
> dialog_ref_debug() and dialog_unref_debug() operate. This allows for
> better reference debugging as actual chan_sip line numbers will appear in
> the /tmp/refs file when invoked.
>
> 3) dialog_unlink_all():
> - Add null check to ensure we don't attempt to unlink a null dialog
> - Change some text in dialog_unref() and registry_unref() calls to properly
> explain what they are unreferencing, and make it easier to match to the
> reference creation from transmit_register(). See item 27 below for
> information on fix to transmit_register().
> - Add call to stop_session_timer() so an ended call is not held in memory
> for the duration of the session interval
>
> 4) __sip_autodestruct():
> - Add final dialog_unref() needed to balance the ao2_alloc() for the dialog
> There was actually a required unref here that was commented out, but it is
> actually needed. You will see why later, down in item 10 for the
> find_call() function change explanation.
>
> 5) sip_cancel_destroy():
> - Function always returns 0. Optimized the code to use AST_SCHED_DEL_UNREF()
>
> 6) __sip_destroy():
> - Moved redundant session timer free'ing to top of function only.
> - Also, the first check of p->stimer at the top of the function was causing
> a required timer stop, and de-referencing to not happen.
>
> 7) interpret_t38_parameters():
> - Change AST_SCHED_DEL() to AST_SCHED_DEL_UNREF() for t38 abort timer as it
> is referenced in the dialog.
>
> 8) Various places:
> - Change any ao2_t_ref() function calls that are for a dialog object to use
> dialog_ref(), or dialog_unref() as needed.
> - Change any ao2_t_ref() function calls that are for a peer object to use
> ref_peer(), or unref_peer() as needed.
> - Change ao2_iterator_next() to ao2_t_iterator_next() in order to trouble-
> shoot output of the refcounter utility better.
>
> 9) sip_alloc():
> - Remove early binding of T38 UDPTL port, as it will the code is redundant,
> and will be executed in handle_request_invite() anyhow.
>
> 10) find_call():
> This function is the biggest source of all dialog reference unbalancing.
> The function is intended to return a locked, and referenced dialog (if it
> does not return NULL). However, if it had to create a dialog object in
> order to return it, it would not add a reference, only the lock. With this
> function not returning a consistent number of references with its dialog,
> all other unreferencing is doomed to failure. The change in this area
> fixes this by adding a reference before returning the pointer to the
> dialog.
>
> 11) reqprep():
> - Remove add_header() call for "Require: timer"
> - This is mentioned in issue #18704, but I had to include it here, as the
> clients I was testing with were getting all messed up when using sessions.
>
> 12) sip_show_peers():
> - Add an unref_peer() call in order to properly release the ao2_iterator
> reference to the peer.
> - Remove extraneous unref_peer() calls where no reference was added.
>
> 13) dialog_needdestroy():
> - Similar to __sip_autodestruct(), added a dialog de-reference where a
> comment mistakenly tells you not to. See item 10 about find_call()
>
> 14) sip_show_sched():
> - Add names, and pointers to almost all functions that can be referenced
> by the chan_sip scheduler so the CLI output will minimize the count shown
> for "<unknown>".
>
> 15) handle_request_invite():
> - Before setting the T38 abort timer, ensure one is not already running. If
> it is, stop that one, and start a new one. Previously, t38id would be
> overwritten without any regard to the reference it had to the dialog.
>
> 16) handle_request_cancel():
> Another large memory leak here. Any time handle_request_cancel() was
> called, any packets belonging to the dialog would be free'd, but two things
> were not happening: a) possible packet data was not free'd, and b) the
> packet was not being unreferenced from the dialog. The change here ensures
> proper memory free'ing, as well as de-referencing.
>
> 17) restart_session_timer():
> - ast_debug() log write call moved to be ABOVE the AST_SCHED_DEL_UNREF()
> call.
> - With is BELOW, the output will always show the timer id as -1.
>
> 18) stop_session_timer():
> - Removed log write for WARNING if calling function will a NULL stimer as
> this function can be called to ensure a session timer is stopped, and
> there's really no need for the warning wasting space in the log.
> - AST_SCHED_DEL_UNREF() moved inside a check to ensure there is a timer
> running before it announces to the logs that it stopped the timer.
> - ast_debug() log write call moved to be ABOVE the AST_SCHED_DEL_UNREF()
> call.
> - With is BELOW, the output will always show the timer id as -1.
>
> 20) start_session_timer():
> - In some cases a session timer is already running when start_session_timer()
> is called again (for instance, an extra "200" message reply is sent). The
> change here ensures any existing session timer is stopped before a new one
> is started. Per RFC4028, this is the right way to do it--IE: the timer
> must start from the most recent "200" message.
> - The ast_debug() message is also moved to be within the portion of the
> if-statement so it does not output a successful timer start should the
> process fail to start a timer. The dialog callid is also added to the
> output.
>
> 21) proc_session_timer():
> - A call to stop_session_timer() is removed, and replaced with a portion of
> the code from stop_session_timer(). This is to avoid a double
> de-referencing of the timer object is the timer is NOT to be rescheduled.
> [stop_session_timer() would have de-referenced, as well as the needed
> dialog_unref() also in this function]
> - This change prevents core dumps!
>
> 22) sip_poke_noanswer():
> - dialog_unref() and dialog_unlink_all() are reversed here. This is due to
> the doubly-linked nature of the peer->call and the dialog.
> - Calling dialog_unlink_all() first will leave peer->call set to NULL,
> causing the needed dialog_unref() to not do anything, when it needed to
>
> 23) sip_poke_peer():
> - Same reversal of dialog_unref() and dialog_unlink_all() as #22 above.
> - Added a peer_ref() when xmitres == XMIT_ERROR because it will directly
> call sip_poke_noanswer() without the scheduler. sip_poke_noanswer() is
> guaranteed to remove a reference as it is normally a scheduled callback.
> This peer_ref() call keeps the references balanced.
> - Removed a dialog_unref() at the end of the function as it would cause
> the dialog to be free'd, and it is still needed in memory until either
> the poke expires, or the response is received.
>
> 24) sip_request_call():
> - Add needed dialog_unref() following dialog_unlink_all().
>
> 25) load_module():
> (1.8 and trunk only):
> - Change ao2_container_alloc() for sip_monitor_instances to
> ao2_t_container_alloc() for tracking in the refcounter utility.
>
> 26) unload_module():
> (all versions):
> - Add iterator to unreference and free peers in the peers table. Memory leak
> just before chan_sip exits, but it was indicated as a problem by the
> refcounter utility.
> (1.8 and trunk only):
> - Add ao2_t_ref() to de-reference the sip_monitor_instances table
>
> 27) transmit_register():
> - Change some text in different dialog_ref() calls in order to properly
> match references and unreferences that occur during dialog_unlink_all().
> - Add a dialog_ref() when sip_alloc() is called in the function in order
> to balance with the dialog_unref() that is called just before the
> function exits.
>
>
> This addresses bugs 16774, 17255, 18014, 18027, 18121, and 18704.
> https://issues.asterisk.org/view.php?id=16774
> https://issues.asterisk.org/view.php?id=17255
> https://issues.asterisk.org/view.php?id=18014
> https://issues.asterisk.org/view.php?id=18027
> https://issues.asterisk.org/view.php?id=18121
> https://issues.asterisk.org/view.php?id=18704
>
>
> Diffs
> -----
>
> http://svn.asterisk.org/svn/asterisk/trunk/channels/chan_sip.c 307669
> http://svn.asterisk.org/svn/asterisk/trunk/channels/sip/include/sip.h 307669
> http://svn.asterisk.org/svn/asterisk/trunk/include/asterisk/sched.h 307669
> http://svn.asterisk.org/svn/asterisk/trunk/main/sched.c 307669
>
> Diff: https://reviewboard.asterisk.org/r/1101/diff
>
>
> Testing
> -------
>
> Besides myself, 2 other people on this issue have been testing using different methods: wdoekes and loloski.
>
> Loloski (Ronald) reports sipp tests are coming out clean, and his company is applying the patch to their production servers.
>
> wdoekes was the first to test and find a crash location in an area where I had previously not looked. That code was modified as well, and he is now reporting running without any issues.
>
> In my case, we were running 1.6.2.16.1 release and losing all ability to provide service within 15 minutes due to memory leaking objects holding UDP ports in use, causing a denial of service. We normally handle between 8000 and 15000 call requests per hour, so it didn't take long at all for the server that was testing 1.6.2.16.1 to die. With this patch applied, we have been running for a few days, and not a single UDP port has leaked. The CLI "sip show sched" and "sip show objects" do not show increasing timers and object references. As well, a check of the "netstat -anp" command no longer shows all the UDP ports tied up.
>
> When running T38 enabled, 3 UDP ports are tied to each channel. The output of "core show channels verbose" in order to get a live channel count directly compares to the number of UDP ports per channel shown by checking the "netstat" output. Prior to the patch, there was no linear correlation like there should be.
>
>
> Thanks,
>
> rgagnon
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.digium.com/pipermail/asterisk-dev/attachments/20110214/8ece65d1/attachment-0001.htm>
More information about the asterisk-dev
mailing list