[asterisk-dev] [Code Review] Repair UDP port leaks, Memory Leaks, Denial of Service in chan_sip
rgagnon
reviewboard at asterisk.org
Thu Apr 28 17:50:41 CDT 2011
> On 2011-04-26 17:36:32, David Vossel wrote:
> > /trunk/channels/chan_sip.c, line 29434
> > <https://reviewboard.asterisk.org/r/1101/diff/2/?file=15662#file15662line29434>
> >
> > How are we guaranteed this second unref will destroy the dialog?
This is the unref that goes with the original sip_alloc
> On 2011-04-26 17:36:32, David Vossel wrote:
> > /trunk/channels/chan_sip.c, line 25082
> > <https://reviewboard.asterisk.org/r/1101/diff/2/?file=15662#file15662line25082>
> >
> > This doesn't have to change, the unref in stop_session_timer doesn't do anything if st_schedid is -1 thanks to the AST_SCHED_DEL_UNREF macro.
Yes. It does need to change. There is a logic difference between what happens here in proc_session_timer(), and what stop_session_timer() does, or should not do for us while we are in here.
> On 2011-04-26 17:36:32, David Vossel wrote:
> > /trunk/channels/chan_sip.c, lines 2697-2703
> > <https://reviewboard.asterisk.org/r/1101/diff/2/?file=15662#file15662line2697>
> >
> > Isn't this what is already done in the astobj2.h header file?
This might depend on the version of asterisk you are patching. I started with 1.4, and then migrated the patch up to 1.6.x family, then 1.8 and the trunk.
> On 2011-04-26 17:36:32, David Vossel wrote:
> > /trunk/channels/chan_sip.c, lines 2762-2765
> > <https://reviewboard.asterisk.org/r/1101/diff/2/?file=15662#file15662line2762>
> >
> > If this is happening, this is really bad.
Yes it would be bad, but not putting code in here to protect if that were to happen would be catastrophic. Other places in the code verify a pointer is valid before using it, why not here?
> On 2011-04-26 17:36:32, David Vossel wrote:
> > /trunk/channels/chan_sip.c, lines 24748-24751
> > <https://reviewboard.asterisk.org/r/1101/diff/2/?file=15662#file15662line24748>
> >
> > I don't see any reason to reduce the debug information here. This happens several other places as well.
The information is not reduced. It is made more helpful. The string only appears with REF_DEBUG enabled, and that will automatically cause the FILE, FUNCTION, LINE to appear in the refcounter output file. Having the "destroy dialog" messages all the same makes for easier debugging because you can match up "destroy dialog" messages with their corresponding "allocation" messages.
> On 2011-04-26 17:36:32, David Vossel wrote:
> > /trunk/channels/chan_sip.c, lines 24968-24970
> > <https://reviewboard.asterisk.org/r/1101/diff/2/?file=15662#file15662line24968>
> >
> > I don't see why we need to remove this log msg.
It's a useless message. It fills the log files for no reason. Like deleting a non-existent file. The end result is the file is gone.
> On 2011-04-26 17:36:32, David Vossel wrote:
> > /trunk/channels/chan_sip.c, lines 25283-25285
> > <https://reviewboard.asterisk.org/r/1101/diff/2/?file=15662#file15662line25283>
> >
> > Perhaps a comment explaining the order of this since the reasoning for this reversal is a little tricky.
The comment for this change is in step 22 under the chan_sip.c file changes. I thought it was clear. Let me know if not
> On 2011-04-26 17:36:32, David Vossel wrote:
> > /trunk/channels/chan_sip.c, lines 25333-25335
> > <https://reviewboard.asterisk.org/r/1101/diff/2/?file=15662#file15662line25333>
> >
> > A comment here would be nice as well.
The notes for step 22 in the "readme" above apply here for the same reason.
Without the change:
dialog_unlink_all() will nullify peer->call
peer->call is then passed to dialog_unref() in the next line
Since it is null, dialog_unref() won't do anything
> On 2011-04-26 17:36:32, David Vossel wrote:
> > /trunk/channels/chan_sip.c, lines 25694-25698
> > <https://reviewboard.asterisk.org/r/1101/diff/2/?file=15662#file15662line25694>
> >
> > I don't understand this change. It looks correct the way it was before.
The logic prior would perform an unref when one was not needed unless the condition on !tmpc is met.
The original comment "toss pvt ptr at end of sip_request_call" is incorrect in what it thinks it is doing. At this point in the code, it would cause a complete de-reference of the dialog, and that is not wanted unless !tmpc happens.
> On 2011-04-26 17:36:32, David Vossel wrote:
> > /trunk/include/asterisk/sched.h, lines 161-164
> > <https://reviewboard.asterisk.org/r/1101/diff/2/?file=15664#file15664line161>
> >
> > Could this cause a ABI issue?
> >
> > I'm not sure if we can make this change in a release branch since it affects the size of the structure. By doing this, a chan_sip module built with this header and these changes would crash if it was loaded with an asterisk core using the older version.
The structure needs to be bigger in order to hold all the function pointers for the "sip show sched" command. A better way to do this would have been to create a linked list of the pointers and names, but I do not plan on altering that capability now. The original code used a simple array. This just extends it some.
The struture is ONLY USED in chan_sip.c and solely for the use of "sip show sched"
PS: Why would you want to combine modules from different versions of asterisk?
- rgagnon
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/1101/#review3438
-----------------------------------------------------------
On 2011-04-26 11:31:33, rgagnon wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviewboard.asterisk.org/r/1101/
> -----------------------------------------------------------
>
> (Updated 2011-04-26 11:31:33)
>
>
> Review request for Asterisk Developers, Terry Wilson, wdoekes, rgagnon, and loloski.
>
>
> Summary
> -------
>
> 2011-02-09 by Rob Gagnon <rgagnon> Patch documentation
> ======================================================
>
> This readme applies to patch files named below under issue 17255:
> https://issues.asterisk.org/view.php?id=17255
>
> branch-1.6.1-r307181-sip-dos-mem-leak-fix.diff
> branch-1.6.2-r307181-sip-dos-mem-leak-fix.diff
> tag-1.6.2.16.1-r307181-sip-dos-mem-leak-fix.diff
> branch-1.8-r307181-sip-dos-mem-leak-fix.diff
> trunk-r307181-sip-dos-mem-leak-fix.diff
>
> As the patch created is a fair size, I am using this file to explain the
> changes made to different files (in lieu of a large post to the issue).
>
> Patches are created for branch 1.6.1, 1.6.2, and 1.8, as well as the trunk,
> and tag 1.6.2.16.1.
>
> Special thanks to "wdoekes", and "loloski" for testing, and finding the conditions under
> which entries for sip.conf "register => ...." entries were causing problems.
>
> Patches 1.6.1 and 1.6.2 change the following files:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> chan_sip.c
> rtp.h
> sched.h
> rtp.c
> sched.c
>
> Patches 1.8 and trunk change the following files:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> chan_sip.c
> sip.h
> sched.h
> sched.c
>
> The changes to chan_sip.c by far are the fix to the majority of the problem
> for this issue. I will explain the other files first.
>
> rtp.c (branch 1.6.1 and 1.6.2 only)
> ~~~~~
> Remove the forward declarations for red_write() and ast_rtcp_write() as they
> will now be defined within rtp.h.
>
> Renamed red_write() to ast_rtp_red_write() as it will basically be public now.
>
> rtp.h (branch 1.6.1 and 1.6.2 only)
> ~~~~~
> Adds prototypes for ast_rtcp_write() and ast_rtp_red_write() functions so
> they may be reference in sip_show_sched() CLI function.
>
> sched.c (all patches)
> ~~~~~~~
> Add a space to align the output of CLI command "sip show sched"
>
> sched.h (all patches)
> ~~~~~~~
> Alter the static size of cb_list array in order to fit all the callback names
> from sip_show_sched() CLI function.
>
> sip.h (branch 1.8 and trunk only)
> ~~~~~
> Add macros for ref_peer() and unref_peer() in a manner similar to existing
> macros dialog_ref() and dialog_unref() that are defined in dialog.h.
>
> chan_sip.c (all patches)
> ~~~~~~~~~~
> Many alterations in this file. They are all meant to ensure proper balance of
> dialog object (struct sip_pvt) reference increases and decreases throughout the
> file. Due to the unbalancing of the references before this patch, the
> following would occur:
>
> - Memory leaks
> - due to dialogs not being destroyed/free'd
> - Denial of Service
> - RTP, RTCP, and UDPTL ports are not freed if the related dialog object is
> not destroyed
> - Asterisk core dumps
> - In some cases, a dialog reference decrease will cause it to be destroyed
> when such an action is not expected, nor wanted. This leads to null
> pointers being used, and then KABOOM.
>
> Some changes made may not be in all files. Some are in all patches. Some are
> just in the 1.6.x patches, and others are just in the 1.8 and trunk patches.
> The notes below should indicate any differences.
>
> 1) Add forward declarations for functions that will be referenced in sip_show_sched()
>
> 2) ref_peer() and unref_peer():
> (1.6.x only):
> - Added macro definitions (based on the REF_DEBUG #define) around functions
> in a manner similar to how dialog_ref() and dialog_unref() operate. This
> allows for better reference debugging as actual chan_sip line numbers will
> appear in the /tmp/refs file when invoked.
> (1.8 and trunk only):
> - Changed to ref_peer_debug() and unref_peer_debug() similar to how
> dialog_ref_debug() and dialog_unref_debug() operate. This allows for
> better reference debugging as actual chan_sip line numbers will appear in
> the /tmp/refs file when invoked.
>
> 3) dialog_unlink_all():
> - Add null check to ensure we don't attempt to unlink a null dialog
> - Change some text in dialog_unref() and registry_unref() calls to properly
> explain what they are unreferencing, and make it easier to match to the
> reference creation from transmit_register(). See item 27 below for
> information on fix to transmit_register().
> - Add call to stop_session_timer() so an ended call is not held in memory
> for the duration of the session interval
>
> 4) __sip_autodestruct():
> - Add final dialog_unref() needed to balance the ao2_alloc() for the dialog
> There was actually a required unref here that was commented out, but it is
> actually needed. You will see why later, down in item 10 for the
> find_call() function change explanation.
>
> 5) sip_cancel_destroy():
> - Function always returns 0. Optimized the code to use AST_SCHED_DEL_UNREF()
>
> 6) __sip_destroy():
> - Moved redundant session timer free'ing to top of function only.
> - Also, the first check of p->stimer at the top of the function was causing
> a required timer stop, and de-referencing to not happen.
>
> 7) interpret_t38_parameters():
> - Change AST_SCHED_DEL() to AST_SCHED_DEL_UNREF() for t38 abort timer as it
> is referenced in the dialog.
>
> 8) Various places:
> - Change any ao2_t_ref() function calls that are for a dialog object to use
> dialog_ref(), or dialog_unref() as needed.
> - Change any ao2_t_ref() function calls that are for a peer object to use
> ref_peer(), or unref_peer() as needed.
> - Change ao2_iterator_next() to ao2_t_iterator_next() in order to trouble-
> shoot output of the refcounter utility better.
>
> 9) sip_alloc():
> - Remove early binding of T38 UDPTL port, as it will the code is redundant,
> and will be executed in handle_request_invite() anyhow.
>
> 10) find_call():
> This function is the biggest source of all dialog reference unbalancing.
> The function is intended to return a locked, and referenced dialog (if it
> does not return NULL). However, if it had to create a dialog object in
> order to return it, it would not add a reference, only the lock. With this
> function not returning a consistent number of references with its dialog,
> all other unreferencing is doomed to failure. The change in this area
> fixes this by adding a reference before returning the pointer to the
> dialog.
>
> 11) reqprep():
> - Remove add_header() call for "Require: timer"
> - This is mentioned in issue #18704, but I had to include it here, as the
> clients I was testing with were getting all messed up when using sessions.
>
> 12) sip_show_peers():
> - Add an unref_peer() call in order to properly release the ao2_iterator
> reference to the peer.
> - Remove extraneous unref_peer() calls where no reference was added.
>
> 13) dialog_needdestroy():
> - Similar to __sip_autodestruct(), added a dialog de-reference where a
> comment mistakenly tells you not to. See item 10 about find_call()
>
> 14) sip_show_sched():
> - Add names, and pointers to almost all functions that can be referenced
> by the chan_sip scheduler so the CLI output will minimize the count shown
> for "<unknown>".
>
> 15) handle_request_invite():
> - Before setting the T38 abort timer, ensure one is not already running. If
> it is, stop that one, and start a new one. Previously, t38id would be
> overwritten without any regard to the reference it had to the dialog.
>
> 16) handle_request_cancel():
> Another large memory leak here. Any time handle_request_cancel() was
> called, any packets belonging to the dialog would be free'd, but two things
> were not happening: a) possible packet data was not free'd, and b) the
> packet was not being unreferenced from the dialog. The change here ensures
> proper memory free'ing, as well as de-referencing.
>
> 17) restart_session_timer():
> - ast_debug() log write call moved to be ABOVE the AST_SCHED_DEL_UNREF()
> call.
> - With is BELOW, the output will always show the timer id as -1.
>
> 18) stop_session_timer():
> - Removed log write for WARNING if calling function will a NULL stimer as
> this function can be called to ensure a session timer is stopped, and
> there's really no need for the warning wasting space in the log.
> - AST_SCHED_DEL_UNREF() moved inside a check to ensure there is a timer
> running before it announces to the logs that it stopped the timer.
> - ast_debug() log write call moved to be ABOVE the AST_SCHED_DEL_UNREF()
> call.
> - With is BELOW, the output will always show the timer id as -1.
>
> 20) start_session_timer():
> - In some cases a session timer is already running when start_session_timer()
> is called again (for instance, an extra "200" message reply is sent). The
> change here ensures any existing session timer is stopped before a new one
> is started. Per RFC4028, this is the right way to do it--IE: the timer
> must start from the most recent "200" message.
> - The ast_debug() message is also moved to be within the portion of the
> if-statement so it does not output a successful timer start should the
> process fail to start a timer. The dialog callid is also added to the
> output.
>
> 21) proc_session_timer():
> - A call to stop_session_timer() is removed, and replaced with a portion of
> the code from stop_session_timer(). This is to avoid a double
> de-referencing of the timer object is the timer is NOT to be rescheduled.
> [stop_session_timer() would have de-referenced, as well as the needed
> dialog_unref() also in this function]
> - This change prevents core dumps!
>
> 22) sip_poke_noanswer():
> - dialog_unref() and dialog_unlink_all() are reversed here. This is due to
> the doubly-linked nature of the peer->call and the dialog.
> - Calling dialog_unlink_all() first will leave peer->call set to NULL,
> causing the needed dialog_unref() to not do anything, when it needed to
>
> 23) sip_poke_peer():
> - Same reversal of dialog_unref() and dialog_unlink_all() as #22 above.
> - Added a peer_ref() when xmitres == XMIT_ERROR because it will directly
> call sip_poke_noanswer() without the scheduler. sip_poke_noanswer() is
> guaranteed to remove a reference as it is normally a scheduled callback.
> This peer_ref() call keeps the references balanced.
> - Removed a dialog_unref() at the end of the function as it would cause
> the dialog to be free'd, and it is still needed in memory until either
> the poke expires, or the response is received.
>
> 24) sip_request_call():
> - Add needed dialog_unref() following dialog_unlink_all().
>
> 25) load_module():
> (1.8 and trunk only):
> - Change ao2_container_alloc() for sip_monitor_instances to
> ao2_t_container_alloc() for tracking in the refcounter utility.
>
> 26) unload_module():
> (all versions):
> - Add iterator to unreference and free peers in the peers table. Memory leak
> just before chan_sip exits, but it was indicated as a problem by the
> refcounter utility.
> (1.8 and trunk only):
> - Add ao2_t_ref() to de-reference the sip_monitor_instances table
>
> 27) transmit_register():
> - Change some text in different dialog_ref() calls in order to properly
> match references and unreferences that occur during dialog_unlink_all().
> - Add a dialog_ref() when sip_alloc() is called in the function in order
> to balance with the dialog_unref() that is called just before the
> function exits.
>
>
> This addresses bugs 16774, 17255, 18014, 18027, 18121, and 18704.
> https://issues.asterisk.org/view.php?id=16774
> https://issues.asterisk.org/view.php?id=17255
> https://issues.asterisk.org/view.php?id=18014
> https://issues.asterisk.org/view.php?id=18027
> https://issues.asterisk.org/view.php?id=18121
> https://issues.asterisk.org/view.php?id=18704
>
>
> Diffs
> -----
>
> /trunk/channels/chan_sip.c 308370
> /trunk/channels/sip/include/sip.h 308370
> /trunk/include/asterisk/sched.h 308370
> /trunk/main/sched.c 308370
>
> Diff: https://reviewboard.asterisk.org/r/1101/diff
>
>
> Testing
> -------
>
> Besides myself, 2 other people on this issue have been testing using different methods: wdoekes and loloski.
>
> Loloski (Ronald) reports sipp tests are coming out clean, and his company is applying the patch to their production servers.
>
> wdoekes was the first to test and find a crash location in an area where I had previously not looked. That code was modified as well, and he is now reporting running without any issues.
>
> In my case, we were running 1.6.2.16.1 release and losing all ability to provide service within 15 minutes due to memory leaking objects holding UDP ports in use, causing a denial of service. We normally handle between 8000 and 15000 call requests per hour, so it didn't take long at all for the server that was testing 1.6.2.16.1 to die. With this patch applied, we have been running for a few days, and not a single UDP port has leaked. The CLI "sip show sched" and "sip show objects" do not show increasing timers and object references. As well, a check of the "netstat -anp" command no longer shows all the UDP ports tied up.
>
> When running T38 enabled, 3 UDP ports are tied to each channel. The output of "core show channels verbose" in order to get a live channel count directly compares to the number of UDP ports per channel shown by checking the "netstat" output. Prior to the patch, there was no linear correlation like there should be.
>
>
> Thanks,
>
> rgagnon
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.digium.com/pipermail/asterisk-dev/attachments/20110428/24b39c8c/attachment-0001.htm>
More information about the asterisk-dev
mailing list