<html>
<body>
<div style="font-family: Verdana, Arial, Helvetica, Sans-Serif;">
<table bgcolor="#f9f3c9" width="100%" cellpadding="8" style="border: 1px #c9c399 solid;">
<tr>
<td>
This is an automatically generated e-mail. To reply, visit:
<a href="https://reviewboard.asterisk.org/r/1101/">https://reviewboard.asterisk.org/r/1101/</a>
</td>
</tr>
</table>
<br />
<blockquote style="margin-left: 1em; border-left: 2px solid #d0d0d0; padding-left: 10px;">
<p style="margin-top: 0;">On February 11th, 2011, 4:56 a.m., <b>schmidts</b> wrote:</p>
<blockquote style="margin-left: 1em; border-left: 2px solid #d0d0d0; padding-left: 10px;">
<pre style="white-space: pre-wrap; white-space: -moz-pre-wrap; white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;">i just have taken a short look about your changes but didnt find anything which makes no sense.
i am now load testing your patch to see if i can find any problems with it.</pre>
</blockquote>
<p>On February 14th, 2011, 2:56 a.m., <b>schmidts</b> wrote:</p>
<blockquote style="margin-left: 1em; border-left: 2px solid #d0d0d0; padding-left: 10px;">
<pre style="white-space: pre-wrap; white-space: -moz-pre-wrap; white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;">i have done some load tests and removing the udptl allocation from sip_alloc brings a lot of sip power. and also dialogs are much faster removed and as far as i have recognized i dont see any memory leaks.
nice work!</pre>
</blockquote>
</blockquote>
<pre style="white-space: pre-wrap; white-space: -moz-pre-wrap; white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;">Good to hear. What were you load testing with? sipp? or something else.</pre>
<br />
<p>- rgagnon</p>
<br />
<p>On February 11th, 2011, 3:24 a.m., rgagnon wrote:</p>
<table bgcolor="#fefadf" width="100%" cellspacing="0" cellpadding="8" style="background-image: url('https://reviewboard.asterisk.orgrb/images/review_request_box_top_bg.png'); background-position: left top; background-repeat: repeat-x; border: 1px black solid;">
<tr>
<td>
<div>Review request for Asterisk Developers, wdoekes, rgagnon, and loloski.</div>
<div>By rgagnon.</div>
<p style="color: grey;"><i>Updated 2011-02-11 03:24:09</i></p>
<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Description </h1>
<table width="100%" bgcolor="#ffffff" cellspacing="0" cellpadding="10" style="border: 1px solid #b8b5a0">
<tr>
<td>
<pre style="margin: 0; padding: 0; white-space: pre-wrap; white-space: -moz-pre-wrap; white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;">2011-02-09 by Rob Gagnon <rgagnon> Patch documentation
======================================================
This readme applies to patch files named below under issue 17255:
https://issues.asterisk.org/view.php?id=17255
branch-1.6.1-r307181-sip-dos-mem-leak-fix.diff
branch-1.6.2-r307181-sip-dos-mem-leak-fix.diff
tag-1.6.2.16.1-r307181-sip-dos-mem-leak-fix.diff
branch-1.8-r307181-sip-dos-mem-leak-fix.diff
trunk-r307181-sip-dos-mem-leak-fix.diff
As the patch created is a fair size, I am using this file to explain the
changes made to different files (in lieu of a large post to the issue).
Patches are created for branch 1.6.1, 1.6.2, and 1.8, as well as the trunk,
and tag 1.6.2.16.1.
Special thanks to "wdoekes", and "loloski" for testing, and finding the conditions under
which entries for sip.conf "register => ...." entries were causing problems.
Patches 1.6.1 and 1.6.2 change the following files:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
chan_sip.c
rtp.h
sched.h
rtp.c
sched.c
Patches 1.8 and trunk change the following files:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
chan_sip.c
sip.h
sched.h
sched.c
The changes to chan_sip.c by far are the fix to the majority of the problem
for this issue. I will explain the other files first.
rtp.c (branch 1.6.1 and 1.6.2 only)
~~~~~
Remove the forward declarations for red_write() and ast_rtcp_write() as they
will now be defined within rtp.h.
Renamed red_write() to ast_rtp_red_write() as it will basically be public now.
rtp.h (branch 1.6.1 and 1.6.2 only)
~~~~~
Adds prototypes for ast_rtcp_write() and ast_rtp_red_write() functions so
they may be reference in sip_show_sched() CLI function.
sched.c (all patches)
~~~~~~~
Add a space to align the output of CLI command "sip show sched"
sched.h (all patches)
~~~~~~~
Alter the static size of cb_list array in order to fit all the callback names
from sip_show_sched() CLI function.
sip.h (branch 1.8 and trunk only)
~~~~~
Add macros for ref_peer() and unref_peer() in a manner similar to existing
macros dialog_ref() and dialog_unref() that are defined in dialog.h.
chan_sip.c (all patches)
~~~~~~~~~~
Many alterations in this file. They are all meant to ensure proper balance of
dialog object (struct sip_pvt) reference increases and decreases throughout the
file. Due to the unbalancing of the references before this patch, the
following would occur:
- Memory leaks
- due to dialogs not being destroyed/free'd
- Denial of Service
- RTP, RTCP, and UDPTL ports are not freed if the related dialog object is
not destroyed
- Asterisk core dumps
- In some cases, a dialog reference decrease will cause it to be destroyed
when such an action is not expected, nor wanted. This leads to null
pointers being used, and then KABOOM.
Some changes made may not be in all files. Some are in all patches. Some are
just in the 1.6.x patches, and others are just in the 1.8 and trunk patches.
The notes below should indicate any differences.
1) Add forward declarations for functions that will be referenced in sip_show_sched()
2) ref_peer() and unref_peer():
(1.6.x only):
- Added macro definitions (based on the REF_DEBUG #define) around functions
in a manner similar to how dialog_ref() and dialog_unref() operate. This
allows for better reference debugging as actual chan_sip line numbers will
appear in the /tmp/refs file when invoked.
(1.8 and trunk only):
- Changed to ref_peer_debug() and unref_peer_debug() similar to how
dialog_ref_debug() and dialog_unref_debug() operate. This allows for
better reference debugging as actual chan_sip line numbers will appear in
the /tmp/refs file when invoked.
3) dialog_unlink_all():
- Add null check to ensure we don't attempt to unlink a null dialog
- Change some text in dialog_unref() and registry_unref() calls to properly
explain what they are unreferencing, and make it easier to match to the
reference creation from transmit_register(). See item 27 below for
information on fix to transmit_register().
- Add call to stop_session_timer() so an ended call is not held in memory
for the duration of the session interval
4) __sip_autodestruct():
- Add final dialog_unref() needed to balance the ao2_alloc() for the dialog
There was actually a required unref here that was commented out, but it is
actually needed. You will see why later, down in item 10 for the
find_call() function change explanation.
5) sip_cancel_destroy():
- Function always returns 0. Optimized the code to use AST_SCHED_DEL_UNREF()
6) __sip_destroy():
- Moved redundant session timer free'ing to top of function only.
- Also, the first check of p->stimer at the top of the function was causing
a required timer stop, and de-referencing to not happen.
7) interpret_t38_parameters():
- Change AST_SCHED_DEL() to AST_SCHED_DEL_UNREF() for t38 abort timer as it
is referenced in the dialog.
8) Various places:
- Change any ao2_t_ref() function calls that are for a dialog object to use
dialog_ref(), or dialog_unref() as needed.
- Change any ao2_t_ref() function calls that are for a peer object to use
ref_peer(), or unref_peer() as needed.
- Change ao2_iterator_next() to ao2_t_iterator_next() in order to trouble-
shoot output of the refcounter utility better.
9) sip_alloc():
- Remove early binding of T38 UDPTL port, as it will the code is redundant,
and will be executed in handle_request_invite() anyhow.
10) find_call():
This function is the biggest source of all dialog reference unbalancing.
The function is intended to return a locked, and referenced dialog (if it
does not return NULL). However, if it had to create a dialog object in
order to return it, it would not add a reference, only the lock. With this
function not returning a consistent number of references with its dialog,
all other unreferencing is doomed to failure. The change in this area
fixes this by adding a reference before returning the pointer to the
dialog.
11) reqprep():
- Remove add_header() call for "Require: timer"
- This is mentioned in issue #18704, but I had to include it here, as the
clients I was testing with were getting all messed up when using sessions.
12) sip_show_peers():
- Add an unref_peer() call in order to properly release the ao2_iterator
reference to the peer.
- Remove extraneous unref_peer() calls where no reference was added.
13) dialog_needdestroy():
- Similar to __sip_autodestruct(), added a dialog de-reference where a
comment mistakenly tells you not to. See item 10 about find_call()
14) sip_show_sched():
- Add names, and pointers to almost all functions that can be referenced
by the chan_sip scheduler so the CLI output will minimize the count shown
for "<unknown>".
15) handle_request_invite():
- Before setting the T38 abort timer, ensure one is not already running. If
it is, stop that one, and start a new one. Previously, t38id would be
overwritten without any regard to the reference it had to the dialog.
16) handle_request_cancel():
Another large memory leak here. Any time handle_request_cancel() was
called, any packets belonging to the dialog would be free'd, but two things
were not happening: a) possible packet data was not free'd, and b) the
packet was not being unreferenced from the dialog. The change here ensures
proper memory free'ing, as well as de-referencing.
17) restart_session_timer():
- ast_debug() log write call moved to be ABOVE the AST_SCHED_DEL_UNREF()
call.
- With is BELOW, the output will always show the timer id as -1.
18) stop_session_timer():
- Removed log write for WARNING if calling function will a NULL stimer as
this function can be called to ensure a session timer is stopped, and
there's really no need for the warning wasting space in the log.
- AST_SCHED_DEL_UNREF() moved inside a check to ensure there is a timer
running before it announces to the logs that it stopped the timer.
- ast_debug() log write call moved to be ABOVE the AST_SCHED_DEL_UNREF()
call.
- With is BELOW, the output will always show the timer id as -1.
20) start_session_timer():
- In some cases a session timer is already running when start_session_timer()
is called again (for instance, an extra "200" message reply is sent). The
change here ensures any existing session timer is stopped before a new one
is started. Per RFC4028, this is the right way to do it--IE: the timer
must start from the most recent "200" message.
- The ast_debug() message is also moved to be within the portion of the
if-statement so it does not output a successful timer start should the
process fail to start a timer. The dialog callid is also added to the
output.
21) proc_session_timer():
- A call to stop_session_timer() is removed, and replaced with a portion of
the code from stop_session_timer(). This is to avoid a double
de-referencing of the timer object is the timer is NOT to be rescheduled.
[stop_session_timer() would have de-referenced, as well as the needed
dialog_unref() also in this function]
- This change prevents core dumps!
22) sip_poke_noanswer():
- dialog_unref() and dialog_unlink_all() are reversed here. This is due to
the doubly-linked nature of the peer->call and the dialog.
- Calling dialog_unlink_all() first will leave peer->call set to NULL,
causing the needed dialog_unref() to not do anything, when it needed to
23) sip_poke_peer():
- Same reversal of dialog_unref() and dialog_unlink_all() as #22 above.
- Added a peer_ref() when xmitres == XMIT_ERROR because it will directly
call sip_poke_noanswer() without the scheduler. sip_poke_noanswer() is
guaranteed to remove a reference as it is normally a scheduled callback.
This peer_ref() call keeps the references balanced.
- Removed a dialog_unref() at the end of the function as it would cause
the dialog to be free'd, and it is still needed in memory until either
the poke expires, or the response is received.
24) sip_request_call():
- Add needed dialog_unref() following dialog_unlink_all().
25) load_module():
(1.8 and trunk only):
- Change ao2_container_alloc() for sip_monitor_instances to
ao2_t_container_alloc() for tracking in the refcounter utility.
26) unload_module():
(all versions):
- Add iterator to unreference and free peers in the peers table. Memory leak
just before chan_sip exits, but it was indicated as a problem by the
refcounter utility.
(1.8 and trunk only):
- Add ao2_t_ref() to de-reference the sip_monitor_instances table
27) transmit_register():
- Change some text in different dialog_ref() calls in order to properly
match references and unreferences that occur during dialog_unlink_all().
- Add a dialog_ref() when sip_alloc() is called in the function in order
to balance with the dialog_unref() that is called just before the
function exits.
</pre>
</td>
</tr>
</table>
<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Testing </h1>
<table width="100%" bgcolor="#ffffff" cellspacing="0" cellpadding="10" style="border: 1px solid #b8b5a0">
<tr>
<td>
<pre style="margin: 0; padding: 0; white-space: pre-wrap; white-space: -moz-pre-wrap; white-space: -pre-wrap; white-space: -o-pre-wrap; word-wrap: break-word;">Besides myself, 2 other people on this issue have been testing using different methods: wdoekes and loloski.
Loloski (Ronald) reports sipp tests are coming out clean, and his company is applying the patch to their production servers.
wdoekes was the first to test and find a crash location in an area where I had previously not looked. That code was modified as well, and he is now reporting running without any issues.
In my case, we were running 1.6.2.16.1 release and losing all ability to provide service within 15 minutes due to memory leaking objects holding UDP ports in use, causing a denial of service. We normally handle between 8000 and 15000 call requests per hour, so it didn't take long at all for the server that was testing 1.6.2.16.1 to die. With this patch applied, we have been running for a few days, and not a single UDP port has leaked. The CLI "sip show sched" and "sip show objects" do not show increasing timers and object references. As well, a check of the "netstat -anp" command no longer shows all the UDP ports tied up.
When running T38 enabled, 3 UDP ports are tied to each channel. The output of "core show channels verbose" in order to get a live channel count directly compares to the number of UDP ports per channel shown by checking the "netstat" output. Prior to the patch, there was no linear correlation like there should be.</pre>
</td>
</tr>
</table>
<div style="margin-top: 1.5em;">
<b style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Bugs: </b>
<a href="https://issues.asterisk.org/view.php?id=16774">16774</a>,
<a href="https://issues.asterisk.org/view.php?id=17255">17255</a>,
<a href="https://issues.asterisk.org/view.php?id=18014">18014</a>,
<a href="https://issues.asterisk.org/view.php?id=18027">18027</a>,
<a href="https://issues.asterisk.org/view.php?id=18121">18121</a>,
<a href="https://issues.asterisk.org/view.php?id=18704">18704</a>
</div>
<h1 style="color: #575012; font-size: 10pt; margin-top: 1.5em;">Diffs</b> </h1>
<ul style="margin-left: 3em; padding-left: 0;">
<li>http://svn.asterisk.org/svn/asterisk/trunk/channels/chan_sip.c <span style="color: grey">(307669)</span></li>
<li>http://svn.asterisk.org/svn/asterisk/trunk/channels/sip/include/sip.h <span style="color: grey">(307669)</span></li>
<li>http://svn.asterisk.org/svn/asterisk/trunk/include/asterisk/sched.h <span style="color: grey">(307669)</span></li>
<li>http://svn.asterisk.org/svn/asterisk/trunk/main/sched.c <span style="color: grey">(307669)</span></li>
</ul>
<p><a href="https://reviewboard.asterisk.org/r/1101/diff/" style="margin-left: 3em;">View Diff</a></p>
</td>
</tr>
</table>
</div>
</body>
</html>