<html>
<head>
<base href="https://wiki.asterisk.org/wiki">
<link rel="stylesheet" href="/wiki/s/2030/1/7/_/styles/combined.css?spaceKey=TOP&forWysiwyg=true" type="text/css">
</head>
<body style="background: white;" bgcolor="white" class="email-body">
<div id="pageContent">
<div id="notificationFormat">
<div class="wiki-content">
<div class="email">
<h2><a href="https://wiki.asterisk.org/wiki/display/TOP/Ice+Retry+and+the+Asterisk+SCF+Failover+demo">Ice Retry and the Asterisk SCF Failover demo</a></h2>
<h4>Page <b>edited</b> by <a href="https://wiki.asterisk.org/wiki/display/~beagles">Brent Eagles</a>
</h4>
<br/>
<div class="notificationGreySide">
<p>This page summarizes the results of an analysis of the automatic request retry facility built into ZeroC's Ice and why it did not appear to work as expected for the fail-over portion of the Asterisk SCF demo.</p>
<h2><a name="IceRetryandtheAsteriskSCFFailoverdemo-Background"></a>Background</h2>
<p>Occasional communication failures between clients and servers in a distributed system are not uncommon. ZeroC's Ice product implements a retry feature that can be enabled and configured by defining the Ice.RetryIntervals property with a list of values denoting the desired intervals between attempts to send a request. The retry feature is implemented in such a way that it does not wantonly violate "at most once" semantics, a tenet that Ice pays strict attention to.</p>
<p>During testing of the Asterisk SCF demo, issues were discovered when hosts running certain services were terminated and corosync "failed the system" over to the secondary system. The Asterisk SCF components did not implement any direct handling of connection loss. As a result, unexpected exceptions were passed through the system inappropriately, or components reached inconsistent states because they were unable to complete all of the operations for a given task. The team attempted to use the Ice.RetryIntervals property with the expectation that, once Ice detected the loss of connection, it would retry the request by sending it to the secondary. This appeared not to work and, after some speculation, it was decided to simply catch the exception (Ice::ConnectionLostException) and retry programmatically at the call site.</p>
<p>The issue was revisited in conversation repeatedly and the situation was described to an experienced ZeroC developer. Why Ice would not perform as expected remained unexplained. Until now...</p>
<h2><a name="IceRetryandtheAsteriskSCFFailoverdemo-TheExperiment"></a>The Experiment</h2>
<p>There are many ways to test the Ice retry feature, and under many scenarios it will behave exactly as expected. The scenario of interest, however, is the one created by corosync. Corosync hosts establish a "virtual network interface" that receives a shared IP address common to the hosts in the cluster. Only one host has the IP address actively assigned to it at a time. When corosync detects that the active host is no longer reachable, it causes the IP address to become active on another host. New connections to that IP address will be established with the new active host, but existing connections are broken.</p>
<p>To simulate this situation in a controlled fashion, two VMware instances of Debian were each configured with a second network interface that was not fully activated on startup. Ice hello demo servers were configured on each host to bind a listening socket on an IP address that would later be activated on one host at a time. The test client ran on a separate host. The test consisted of configuring one host to have the IP address, establishing a connection between the hello demo client and server, sending a few "sayHello()" requests, then deleting the IP address from the interface on the first server and adding it to the interface on the second server. The next request fails with a ConnectionLostException and is not retried.</p>
<p>Setting a few breakpoints made it clear what was happening. The Ice request is successfully sent on the socket even though the receiving side is not "up". Ice does not discover that the connection is lost until it attempts to read from the socket. Once the read fails, a ConnectionLostException is immediately thrown and received by the calling thread. The "send" succeeding is "by design" TCP behavior and has to do with ungraceful shutdown of the remote end of a socket. In short, the send call should not be expected to fail immediately even though the remote host has "gone". The Ice code is structured such that a request will not be retried if the entire message has been sent. This is quite reasonable, as the request may have actually been received by the remote end and fully processed.</p>
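<p>The "write succeeds, read fails" behavior can be approximated on a single machine. The following is a minimal sketch, not the test described above: it uses plain loopback sockets rather than Ice, and the peer closes its socket instead of having its IP address moved, so the way the loss is signalled differs from the corosync case. The function name <code>send_then_read_after_peer_close</code> is made up for this illustration. The point it shows is only that a successful send says nothing about whether the peer is still there.</p>

```python
import socket


def send_then_read_after_peer_close():
    """Return (bytes_sent, read_outcome) for a peer that went away before we wrote."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(5.0)
    client.connect(listener.getsockname())

    # The "server" accepts the connection and goes away without reading anything.
    server_side, _ = listener.accept()
    server_side.close()
    listener.close()

    # The write is accepted by the local TCP stack even though nobody
    # will ever process it.
    bytes_sent = client.send(b"sayHello")

    # Only the read reveals that the peer is gone.
    try:
        data = client.recv(1024)
        read_outcome = "eof" if data == b"" else "data"
    except OSError:
        read_outcome = "error"  # for example a connection reset; platform dependent
    finally:
        client.close()
    return bytes_sent, read_outcome


if __name__ == "__main__":
    print(send_then_read_after_peer_close())
```

<p>Whether the read reports end-of-stream or an error depends on the platform and on timing, which is why the sketch accepts either. In both cases the send has already returned successfully by the time the problem is visible.</p>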
<p>Interestingly, retrying on the ConnectionLostException, while needed for the demo as it was designed, could be a severe bug because ConnectionLostExceptions might occur for other reasons. Retrying the request may cause operations that should only be performed once to be duplicated.</p>
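<p>The duplication hazard can be illustrated with a toy model. Everything in this sketch is hypothetical and stands in for the real components: <code>ConnectionLost</code> plays the role of Ice::ConnectionLostException, <code>FlakyBillingService</code> is a pretend server that completes the work but loses the reply once, and <code>call_with_retry</code> is the kind of blind call-site retry described above.</p>

```python
class ConnectionLost(Exception):
    """Stand-in for Ice::ConnectionLostException (hypothetical local type)."""


class FlakyBillingService:
    """Toy server: processes the first request, then 'loses' the reply."""

    def __init__(self):
        self.records = []
        self._drop_next_reply = True

    def add_record(self, item):
        self.records.append(item)  # the work is done...
        if self._drop_next_reply:
            self._drop_next_reply = False
            raise ConnectionLost()  # ...but the caller never hears about it
        return len(self.records)


def call_with_retry(operation, attempts=3):
    """Blindly retry on ConnectionLost, as a call-site workaround might."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except ConnectionLost as exc:
            last_error = exc
    raise last_error


service = FlakyBillingService()
call_with_retry(lambda: service.add_record("call-42"))
print(service.records)  # ['call-42', 'call-42']
```

<p>The caller asked for one record and the service now holds two, because the first attempt had already been processed when the connection was reported lost.</p>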
<p>A quick check of datagram requests using UDP was also performed for academic purposes. While a request did eventually succeed, several attempts were made before the "sayHello" request finally made it to the active server. Of course, as it was a datagram request, there was no indication of failure, and an autonomous program would not have known to retry.</p>
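<p>The lack of any failure indication for datagrams is easy to see outside of Ice as well. The sketch below is an assumption-laden local illustration, not the test that was run: it sends a UDP datagram on the loopback interface to a port that was just released, so nothing is expected to be listening there. The function name <code>send_datagram_to_nobody</code> is invented for the example.</p>

```python
import socket


def send_datagram_to_nobody(payload=b"sayHello"):
    """Send a UDP datagram to a port with no listener; return the bytes 'sent'."""
    # Reserve a local UDP port, then release it so nothing is listening there.
    probe = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    probe.bind(("127.0.0.1", 0))
    target = probe.getsockname()
    probe.close()

    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # sendto() only reports how many bytes were handed to the network
        # stack; it says nothing about whether anyone received them.
        return sender.sendto(payload, target)
    finally:
        sender.close()


if __name__ == "__main__":
    print(send_datagram_to_nobody())
```

<p>The call reports the full payload length as sent even though no server exists, so a program relying on the return value alone has no reason to retry.</p>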
<h2><a name="IceRetryandtheAsteriskSCFFailoverdemo-TheConclusion"></a>The Conclusion</h2>
<p>Now that we know why Ice retry is not working, we may need to re-examine how we are approaching fail-over between Ice components. Unless we determine some way to make a socket write fail when the target end has moved, making requests on adapters bound to the corosync-mediated shared addresses will require that we retry when ConnectionLostExceptions occur. When such an exception occurs for other reasons, we could very well end up forcing a violation of at-most-once semantics. This might be harmless under some circumstances, but it could cause such situations as resource allocation without corresponding deallocation, or duplication of records that might be used for billing.</p>
<p>That being said, there may be no alternative. In this case, operations sensitive to duplicate calls will need to be constructed to handle them. In the case of resource allocation, background management tasks might be implemented to clean up resources that have been allocated but have remained idle. Such cleanup will likely be necessary anyway if "acquisition is ownership" rules apply, since a client might "go away" without having had the chance to clean up resources it has allocated.</p>
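<p>One possible shape for both ideas is sketched below. It is an illustration under stated assumptions, not a design taken from Asterisk SCF: the <code>Allocator</code> class, the caller-supplied <code>request_id</code>, and the <code>idle_timeout</code> value are all hypothetical. A retried request reuses the same request id and therefore receives the resource created by the first attempt, and a periodic <code>reap_idle()</code> call releases resources nobody has touched for a while.</p>

```python
import time


class Allocator:
    """Toy resource allocator that tolerates duplicate requests."""

    def __init__(self, idle_timeout=30.0, clock=time.monotonic):
        self._idle_timeout = idle_timeout
        self._clock = clock
        self._by_request = {}  # request_id to resource_id
        self._last_used = {}   # resource_id to time of last activity
        self._next_id = 1

    def allocate(self, request_id):
        # A retried request carries the same request_id, so it gets the
        # resource created by the first attempt instead of a second one.
        if request_id in self._by_request:
            resource_id = self._by_request[request_id]
        else:
            resource_id = self._next_id
            self._next_id += 1
            self._by_request[request_id] = resource_id
        self._last_used[resource_id] = self._clock()
        return resource_id

    def touch(self, resource_id):
        # Record activity so the resource is not treated as abandoned.
        if resource_id in self._last_used:
            self._last_used[resource_id] = self._clock()

    def reap_idle(self):
        """Release resources idle longer than idle_timeout; return their ids."""
        now = self._clock()
        expired = [rid for rid, seen in self._last_used.items()
                   if now - seen > self._idle_timeout]
        for rid in expired:
            del self._last_used[rid]
        # Forget request ids whose resource has been released.
        self._by_request = {req: rid for req, rid in self._by_request.items()
                            if rid in self._last_used}
        return expired
```

<p>In a real service the request id would have to travel with the request and the reaper would run on a timer; both are left out here to keep the sketch short.</p>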
</div>
</div>
</div>
</div>
</div>
</body>
</html>