[asterisk-dev] [Code Review] 3402: http: response body frequently missing when running in virtual machine

Scott Griepentrog reviewboard at asterisk.org
Thu Mar 27 18:23:27 CDT 2014



> On March 27, 2014, 3:30 p.m., Mark Michelson wrote:
> > My only problem with this patch is that I really want to know for sure that there is something low level and buggy happening that requires such a change before approving it. So I have a few questions:
> > 
> > 1) I notice you're using both fflush() and shutdown(). If you remove the fflush() does the bug still occur? What about if you add the fflush() but keep the close() in instead of using shutdown()? The reason I ask is that shutdown() and close() are both supposed to send a TCP FIN and flush out the write buffer on its own, so in theory the fflush() call should not be necessary. However, if removing fflush() causes the bug to return, then that leads to my next question:
> > 
> > 2) Did this bug manifest itself on many HTTP client implementations or was it only seen on one? When calling close() or shutdown(), if there is data in the socket's write buffer, this data may be sent in the TCP FIN that Asterisk sends. It may be that whatever HTTP client is being used ignores data in TCP FIN and that may be where the real bug lies. It would be interesting to use TCPdump or some tool to monitor the HTTP traffic to see if there is any correlation between the way the HTTP traffic is sent and when the bug manifests itself.
> > 
> > Overall, I'm just not comfortable saying that this fixes a race condition in the kernel since I don't know what due diligence has been done to prove that. On the other hand, I also can't really fault the changes here if they're actually helping improve the situation for an HTTP client. I guess what I'm saying is that the comments and the reasons for the code changes are bothering me more than the actual code changes themselves :)

1) The use of fflush is strictly a CYA to insure that the last data from fwrite got pushed by clib to write() prior to the low level shutdown call, and does not resolve the problem on it's own.  Nor does removing it cause the failure case to return.

2) The bug manifests itself only under very specific conditions: 

  a) POST not GET 
  b) Client is not on same ip/tcp stack/host as Asterisk
  c) request coming from json Shred user-agent (unable to duplicate failure using python requests - can't get exact matching request headers)

I used to believe that a virtual machine was a requirement, but it also fails between two physical machines - so long as the client and host have to actually transmit IP packets between them it can fail (even across WiFi and VPN it fails).

Under these circumstances, bug is about 90% failure rate (every now and then the correct body message does arrive).  Switching the fwrite() call with the body to a write() does not resolve issue.  Using write() to send an extra byte before the body results in that character arriving and not the body.  Strangely enough however, writing an extra byte after the body still results in neither arriving (may have to do with length of data written).

Exact test results with original body delivery code and alternate close sequences:

Original code that fails:
        fclose(ser->f);

Works:
        shutdown(ser->fd,2);
        fclose(ser->f);

Fails:
        close(ser->fd);
        fclose(ser->f);

Fails:
        fflush(ser->f);
        fclose(ser->f);

Fails:
        fflush(ser->f);
        close(ser->fd);

Technically, my comments mention a race condition existing without specifying where.  Honestly, I'm not entirely certain that it's in the kernel, although it's the most likely culprit at this point.


- Scott


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/3402/#review11401
-----------------------------------------------------------


On March 27, 2014, 1:52 p.m., Scott Griepentrog wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviewboard.asterisk.org/r/3402/
> -----------------------------------------------------------
> 
> (Updated March 27, 2014, 1:52 p.m.)
> 
> 
> Review request for Asterisk Developers.
> 
> 
> Bugs: ASTERISK-23548
>     https://issues.asterisk.org/jira/browse/ASTERISK-23548
> 
> 
> Repository: Asterisk
> 
> 
> Description
> -------
> 
> When running asterisk in a virtual machine, responses to ARI requests would frequently be missing.  A race condition related to closing the socket immediately after writing data onto it is resolved in this patch by insuring the output stream is flushed, and then informing TCP of the shutdown prior to the close.
> 
> 
> Diffs
> -----
> 
>   /branches/12/main/tcptls.c 411242 
>   /branches/12/main/manager.c 411242 
>   /branches/12/main/http.c 411242 
> 
> Diff: https://reviewboard.asterisk.org/r/3402/diff/
> 
> 
> Testing
> -------
> 
> Tested on Ubuntu server 12.04 with Sam's json api test from issue.
> 
> 
> Thanks,
> 
> Scott Griepentrog
> 
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.digium.com/pipermail/asterisk-dev/attachments/20140327/bbfe1c09/attachment.html>


More information about the asterisk-dev mailing list