<html>
<head>
<base href="https://wiki.asterisk.org/wiki">
<link rel="stylesheet" href="/wiki/s/2041/1/7/_/styles/combined.css?spaceKey=TOP&forWysiwyg=true" type="text/css">
</head>
<body style="background: white;" bgcolor="white" class="email-body">
<div id="pageContent">
<div id="notificationFormat">
<div class="wiki-content">
<div class="email">
<h2><a href="https://wiki.asterisk.org/wiki/display/TOP/State+Replicator+Persistence+Options">State Replicator Persistence Options</a></h2>
<h4>Page <b>edited</b> by <a href="https://wiki.asterisk.org/wiki/display/~khunt">Ken Hunt</a>
</h4>
<br/>
<h4>Full Content</h4>
<div class="notificationGreySide">
<h4><a name="StateReplicatorPersistenceOptions-Introduction"></a>Introduction</h4>
<p>Each major component of an Asterisk SCF system has a corresponding State Replicator component. The State Replicator is a sink for state changes from an active component and a source of state changes for standby components. Initial implementations of the State Replicator components are based on a simple C++ template that uses an in-memory store of the replicated state. The state store must be able to provide the current state to late-joining backup components that may not have been online when the state updates were originally generated. </p>
<map name='GLIFFY_MAP_12550532_Basic_State_Replication'></map>
<table width="100%">
<tr>
<td align="left">
<table>
<caption align="bottom">
</caption>
<tr>
<td>
<img style="border: none; width: 400px; height: 200px;"
usemap="#GLIFFY_MAP_12550532_Basic_State_Replication"
src="/wiki/download/attachments/12550532/Basic+State+Replication.png?version=15&modificationDate=1300203576391"
alt="A&#32;Gliffy&#32;Diagram&#32;named&#58;&#32;Basic&#32;State&#32;Replication"/>
</td>
</tr>
</table>
</td>
</tr>
</table>
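<p>The in-memory scheme described above can be sketched in a few lines: updates are pushed to registered standby listeners, and a late joiner is first fed a full snapshot of the current store. This is a minimal illustration only; the names and interfaces here are assumptions, not the actual Asterisk SCF template.</p>

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Illustrative sketch (not the real Asterisk SCF template): an in-memory
// state store that pushes each update to standby listeners and brings a
// late-joining listener current with a full snapshot.
class StateReplicator {
public:
    using Listener = std::function<void(const std::string&, const std::string&)>;

    void setState(const std::string& key, const std::string& value) {
        store_[key] = value;
        for (auto& l : listeners_) l(key, value);  // push to standby listeners
    }

    // A listener that joins late is first fed the entire current store.
    void addListener(Listener l) {
        for (const auto& kv : store_) l(kv.first, kv.second);
        listeners_.push_back(std::move(l));
    }

    std::size_t size() const { return store_.size(); }

private:
    std::map<std::string, std::string> store_;
    std::vector<Listener> listeners_;
};
```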
<p>Each state replication component, in its current implementation, represents a single point of failure, in that the State Replicators themselves are not replicated. All of the State Replicators are, however, built to operate against well-defined APIs. This allows us to address the robustness of the state replication mechanisms in the abstract: as we look to increase the robustness of the system, we can focus on replicating generic state, independent of the particular components involved. To provide a complete replication platform, we would like to leverage a general-purpose, off-the-shelf solution for persisting data and replicating that data across systems. (Note: As with all Asterisk SCF components, we fully expect 3rd parties will have reasons to build alternative implementations using different technologies than the persistence platforms that we choose to use.)</p>
<h4><a name="StateReplicatorPersistenceOptions-CandidateTechnologyRequirements"></a>Candidate Technology Requirements</h4>
<p>To build component State Replicators that are not single points of failure, we want to identify an open-source solution that can do the following:</p>
<ul class="alternate" type="square">
        <li>Replicate the state store across servers in near-real-time. The low-latency requirement is driven by the highly dynamic nature of the state being replicated.</li>
        <li>Persist the state store (or provide the appearance of persistence, in that another process can take over in the event of failure of an Asterisk SCF State Replicator).</li>
</ul>
<p>In addition, we want the solution to have the following characteristics: </p>
<ul class="alternate" type="square">
        <li>Simple to deploy</li>
        <li>Simple to program against</li>
        <li>Lightweight installation</li>
        <li>Compatibility with GPL V2</li>
        <li>Available on all of our target platforms</li>
</ul>
<p>We've examined several database technologies with these goals in mind. Note that the assessments below are strictly related to our particular requirements! Selecting a database / data store solution is complex, and there is no single "best answer" for all cases. </p>
<h5><a name="StateReplicatorPersistenceOptions-TheRelationalDatabaseQuestion"></a>The Relational Database Question </h5>
<p>The first choice is to decide whether the solution would benefit from a relational database. While it's common for relational database systems to include a replication capability, there are no other compelling reasons to consider such a solution. Typical relational databases are accessed (from a programming POV) using SQL, and thus require additional programming expertise from the Asterisk SCF developer. These applications are often fairly large installations due to their inherent complexity, requiring significant disk storage, services to manage connections to the database, tools for administering the database, etc. While relational databases excel when the requirements include such things as the ability to make fast queries on large datasets, or the ability to manage massive volumes of information, they don't offer much for our purposes other than the ability to replicate across servers. Even that feature is neither standards-based nor universally available across the various implementations. </p>
<p>For our needs, the installation overhead of a relational database and the added complexity for the developer seem a poor fit. Our state data is a simple set of keyed values, which are added and removed over time. The State Replicator pushes updates to all standby listeners in real time, so when the data store is needed (for late joiners), it's needed in its entirety. The query engine of a relational database doesn't appear to offer any benefit. </p>
<h5><a name="StateReplicatorPersistenceOptions-NonRelationalalternatives"></a>Non-Relational Alternatives</h5>
<p>There is an ever-growing field of non-relational data storage solutions, including object-oriented databases, document-oriented databases, graph-oriented databases, and others. Besides the persistence / structure mechanisms, we are also interested in how each of these technologies approaches replication. Given our stated need for near-real-time replication of highly dynamic data, many database replication schemes that work fine for other purposes will not provide the solution we are looking for. </p>
<p>Note: Several of these solutions support the storage of binary arrays for values and/or keys. To provide a portable solution, we'll use Ice serialization (by way of Dynamic Ice) to ensure the data can be consumed on a platform other than the one it was generated on. </p>
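<p>As a hedged illustration of why a fixed serialization format matters, the sketch below hand-rolls a little-endian length-prefixed encoding for a string value. The real implementation would use Ice's streaming facilities (by way of Dynamic Ice) rather than this toy encoding; the point is only that the on-the-wire byte layout must not depend on the generating platform.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-in for Ice serialization: a platform-independent encoding is a
// fixed byte layout, here a 4-byte little-endian length prefix followed by
// the raw bytes, regardless of the host's native endianness.
std::vector<uint8_t> encode(const std::string& value) {
    std::vector<uint8_t> out;
    uint32_t n = static_cast<uint32_t>(value.size());
    for (int i = 0; i < 4; ++i)
        out.push_back(static_cast<uint8_t>((n >> (8 * i)) & 0xff));  // little-endian
    out.insert(out.end(), value.begin(), value.end());
    return out;
}

std::string decode(const std::vector<uint8_t>& buf) {
    uint32_t n = 0;
    for (int i = 0; i < 4; ++i)
        n |= static_cast<uint32_t>(buf[i]) << (8 * i);
    return std::string(buf.begin() + 4, buf.begin() + 4 + n);
}
```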
<p>Some that we have looked at:</p>
<ul>
        <li><em>CouchDB</em><br/>
<b>License:</b> Apache 2.0 <br/>
<b>General Notes:</b> RESTful HTTP API, plugins for JavaScript/PHP/Ruby/Python/Erlang. <br/>
Accessible from C++ using cURLpp (MIT License), a C++ wrapper for libcURL (MIT License)<br/>
<b>Replication:</b> A peer-based distributed system that allows peers to access and update shared data while disconnected, then perform bi-directional replication of updates (not unlike the git repository approach). While convenient for things like synchronizing with a laptop that is offline much of the time, it is not designed to be particularly fast at replication. <br/>
<b>Final Assessment:</b> Replication approach isn't suitable for our application. </li>
</ul>
<ul>
        <li><em>MongoDB</em><br/>
<b>License:</b> GNU AGPL v3.0.<br/>
<b>General Notes:</b> We could enumerate all the cool features of MongoDB, such as its C++, C#, Java, Erlang, and Python APIs, its binary JSON-like serialization format, fast replication, and so forth... but its license is far from ideal. <br/>
<b>Final Assessment:</b> Incompatible with our license requirements.</li>
</ul>
<ul>
        <li><em>Voldemort</em><br/>
<b>License:</b> Apache 2.0 <br/>
<b>General Notes:</b> Relatively new project, somewhat lacking in documentation. Written in Java, and there is a C++ client demo that only runs on Linux. Provides automatic replication. Keys and values can be complex objects such as maps or lists. <br/>
<b>Final Assessment:</b> Immature platform support</li>
</ul>
<ul>
        <li><em>Tokyo Cabinet</em><br/>
<b>License:</b> LGPL<br/>
<b>General Notes:</b> The database is a simple data file containing records, each of which is a key/value pair. Simple C API for adding, deleting, and accessing key-value pairs. Replication requires the more complete Tokyo Tyrant database server with its high-concurrency network interface. Tokyo Cabinet is not available on Windows, though some Cygwin-based ports have been reported. <br/>
<b>Final Assessment:</b> Doesn't meet platform requirements.</li>
</ul>
<ul>
        <li><em>Berkeley DB High Availability</em><br/>
<b>License:</b> Sleepycat License <br/>
<b>General Notes:</b> Key/value database. APIs available in almost all programming languages, including ANSI C, C++, Java, C#, Perl, Python, Ruby, and Erlang. A replication group consists of one master and one or more read-only replicas. Write operations, such as key/value insert, update, or delete, are processed transactionally at the master. The master sends log records to all replicas. Replicas apply log records only when they receive a commit record. If a master fails, a replica takes over as master.<br/>
<b>Final Assessment:</b> Promising! </li>
</ul>
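<p>The commit-gated log shipping described for Berkeley DB HA can be modeled in a few lines. This is a toy illustration of the scheme (buffer log records, apply only on commit), not the Berkeley DB API; all names are invented for the sketch.</p>

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy model of master-to-replica log shipping: the replica buffers Put
// records and applies a transaction's writes only once the matching
// Commit record arrives, so an uncommitted transaction never becomes
// visible on the replica.
struct LogRecord {
    enum Kind { Put, Commit } kind;
    std::string key, value;
};

class Replica {
public:
    void receive(const LogRecord& rec) {
        if (rec.kind == LogRecord::Put) {
            pending_.push_back(rec);            // buffer until commit
        } else {                                // Commit: apply buffered writes
            for (const auto& r : pending_) store_[r.key] = r.value;
            pending_.clear();
        }
    }

    const std::map<std::string, std::string>& store() const { return store_; }

private:
    std::vector<LogRecord> pending_;
    std::map<std::string, std::string> store_;
};
```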
<ul>
        <li><em>Ice Freeze</em><br/>
<b>License:</b> Asterisk SCF license includes license to use Ice services.<br/>
<b>General Notes:</b> Ice Freeze is built on Berkeley DB. It can only be used for persisting Slice-defined types, but that should be fine for our needs. Freeze's strong point is that the Freeze Evictor works as a ServantLocator, and can persist servants to disk when they're not in use, then retrieve them as needed. It can also persist arbitrary Slice-defined types (such as our replicated component state) as Freeze Maps. Further investigation is required to determine the level of effort required to migrate Freeze to use Berkeley DB High Availability, or some other replicating storage technology. <br/>
<b>Final Assessment:</b> Freeze might need to be re-hosted</li>
</ul>
<ul>
        <li><em>memcached</em><br/>
<b>License:</b> BSD<br/>
<b>General Notes:</b> Memcached provides a distributed memory caching system for caching data and objects. It's essentially a giant hash table distributed across multiple servers; no single instance contains the entire cache. Clients must treat memcached as a transitory cache, since a server will discard the least-accessed key/values when memory starts getting full. This makes it inapplicable to our needs. <br/>
<b>Final Assessment:</b> Transitory cache not a good fit. </li>
</ul>
<ul>
        <li><em>MemcacheDB</em><br/>
<b>License:</b> BSD<br/>
<b>General Notes:</b> MemcacheDB provides a key/value persistent storage variant of memcached by using Berkeley DB as a back-end. Any memcached client can connect, since it uses the memcached protocol. In evaluating memcached, we found that there was nothing in the API that supported a "get" of the entire cache, or a mechanism to iterate over all of the keys. (This isn't necessarily a show-stopper, since the State Replicators could use a well-known key to store, in the database, a list of all the keys they've added.) <br/>
<b>Final Assessment:</b> API somewhat limited</li>
</ul>
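<p>The well-known-key workaround mentioned for MemcacheDB could look roughly like this, with a std::map standing in for the memcached get/set interface. All names here are illustrative assumptions: since the protocol offers no way to iterate keys, the replicator maintains its own newline-separated index under a reserved key.</p>

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Stand-in for a memcached-style store that only supports get/set by key.
using Cache = std::map<std::string, std::string>;

// Reserved key under which the replicator keeps its own key index,
// since the cache API itself cannot enumerate keys.
const std::string kIndexKey = "__replicator_keys__";

void replicatorSet(Cache& c, const std::string& key, const std::string& value) {
    c[key] = value;
    std::string& index = c[kIndexKey];
    // Append to the newline-separated index only if not already listed.
    if (("\n" + index + "\n").find("\n" + key + "\n") == std::string::npos)
        index += key + "\n";
}

std::vector<std::string> replicatorKeys(const Cache& c) {
    std::vector<std::string> keys;
    auto it = c.find(kIndexKey);
    if (it == c.end()) return keys;
    std::istringstream in(it->second);
    for (std::string k; std::getline(in, k); )
        if (!k.empty()) keys.push_back(k);
    return keys;
}
```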
<ul>
        <li><em>Kyoto Cabinet</em><br/>
<b>License:</b> GPL3 with <a href="http://fallabs.com/license/fossexception.txt" class="external-link" rel="nofollow">FOSS exception</a><br/>
<b>General Notes:</b> APIs for C++, C, Java, Python, Ruby, Perl, and Lua. Supposedly has excellent Windows support. Written in C++. Every key and value is a variable-length sequence of bytes; both binary data and character strings can be used as keys and values. Each key must be unique within a database. Kyoto Tycoon (a server built on the Kyoto Cabinet library) supports replication. <br/>
<b>Final Assessment:</b> Strong contender, but new. </li>
</ul>
<ul>
        <li><em>Constant Database (CDB)</em><br/>
<b>License:</b> CDB Library is public domain. The rest of the package is <a href="http://en.wikipedia.org/wiki/License-free_software" class="external-link" rel="nofollow">license-free software</a>. <br/>
<b>General Notes:</b> A CDB instance can only be rebuilt, not modified (hence the "constant" in the name). Not applicable, since we need to be able to add and remove keys from the database. <br/>
<b>Final Assessment:</b> "Constant" not applicable to our problem. </li>
</ul>
<ul>
        <li><em>Cassandra</em><br/>
<b>License:</b> Apache 2.0.<br/>
<b>General Notes:</b> A highly scalable, eventually consistent, distributed, structured key-value store. Supports synchronous / asynchronous replication. Provides a ColumnFamily-based data model richer than typical key/value systems. As opposed to the strong consistency used in typical relational databases (ACID: Atomicity, Consistency, Isolation, Durability), Cassandra is at the other end of the spectrum (BASE: Basically Available, Soft-state, Eventually consistent). <br/>
<b>Final Assessment:</b> Replication model doesn't suit our highly-dynamic data. </li>
</ul>
<ul>
        <li><em>Redis</em><br/>
<b>License:</b> BSD<br/>
<b>General Notes:</b> Keys can store different data types, not just strings, including lists and sets. Attempts to hold the entire DB in memory for fast access. Client APIs available for C, C++, Java, Python, Ruby, Erlang, Objective-C, Scala, PHP, etc. Not supported on Windows, though several people have been able to build it there. For replication, a master can support multiple slaves. When a slave first connects, the master transfers the database as a file to the slave, which saves it on disk and then loads it into memory. The master then sends the slave all new commands received from clients that modify the dataset, as a stream of commands in the same format as the Redis protocol itself. <br/>
<b>Final Assessment:</b> Strong candidate for technical reasons, but not supported on Windows. </li>
</ul>
<h5><a name="StateReplicatorPersistenceOptions-Conclusion"></a>Conclusion</h5>
<p>The most promising solutions we've looked at (when we include licensing concerns) are Berkeley DB High Availability, Kyoto Cabinet, MemcacheDB and Redis. The Berkeley DB Sleepycat License is compatible with GPL V2, replication is via network log records for each transaction (which should be relatively fast), APIs are available for most languages, and it's supported on all major platforms. </p>
<p>Kyoto Cabinet/Kyoto Tycoon is relatively new. It seems to meet our stated requirements, but a few blogs mention moving back to Tokyo Cabinet/Tokyo Tyrant over technical issues. It should probably be avoided for the near term as a risk-reduction measure. </p>
<p>Redis may be the best technical solution. It is made specifically for fast-changing datasets (though of limited size... expected to fit in memory), and is a recommended database solution for real-time communications applications. The Windows concerns would need to be addressed. </p>
<p>Given our stated requirements and goals, we should also consider a simple modification to our current in-memory implementation which would allow a State Replicator instance to operate in a standby mode, as a listener to an active State Replicator. When in standby mode, the State Replicator would simply update its in-memory state cache without forwarding the updates. This would provide a solution where, from the system's point of view, there is no longer a single point of failure. While the data would not be persisted to disk, it would be replicated and available for fast switching to active mode, just as with any other component. Given the short time frame in which component state remains relevant, it is in fact debatable whether any disk-persisted scheme will meet our latency requirements. The downside, of course, is that an in-memory solution is limited, in the amount of state it can replicate, by the physical RAM available to the process. </p>
<map name='GLIFFY_MAP_12550532_standby_replicator'></map>
<table width="100%">
<tr>
<td align="left">
<table>
<caption align="bottom">
</caption>
<tr>
<td>
<img style="border: none; width: 500px; height: 400px;"
usemap="#GLIFFY_MAP_12550532_standby_replicator"
src="/wiki/download/attachments/12550532/standby+replicator.png?version=3&modificationDate=1300203541811"
alt="A&#32;Gliffy&#32;Diagram&#32;named&#58;&#32;standby&#32;replicator"/>
</td>
</tr>
</table>
</td>
</tr>
</table>
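<p>The standby modification described above could be sketched as follows. Names and interfaces are assumptions, not the actual implementation: a standby node applies updates it hears from the active instance to its in-memory cache without re-forwarding them, and promotion is cheap because the cache is already current.</p>

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Illustrative sketch of a replicator that can run active or standby.
// Active: cache the update and forward it to standby peers.
// Standby: cache only, so updates are not re-forwarded in a loop.
class ReplicatorNode {
public:
    explicit ReplicatorNode(bool active) : active_(active) {}

    // Called by components (when active) or by the active peer (when standby).
    void onUpdate(const std::string& key, const std::string& value) {
        cache_[key] = value;
        if (active_)
            for (ReplicatorNode* p : peers_) p->onUpdate(key, value);
    }

    void addPeer(ReplicatorNode* p) { peers_.push_back(p); }

    // Fast switch to active mode: the in-memory cache is already current.
    void promote() { active_ = true; }

    const std::map<std::string, std::string>& cache() const { return cache_; }

private:
    bool active_;
    std::vector<ReplicatorNode*> peers_;
    std::map<std::string, std::string> cache_;
};
```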
<p>To choose a specific technology we need to define how we want to implement state replication across the system. A significant decision is whether to provide a data store / replication mechanism for each component state replicator independently, or to provide a single mechanism that services all of the state replicators. The following diagrams depict each scenario. </p>
<map name='GLIFFY_MAP_12550532_data_store_replication_alt'></map>
<table width="100%">
<tr>
<td align="left">
<table>
<caption align="bottom">
</caption>
<tr>
<td>
<img style="border: none; width: 972px; height: 500px;"
usemap="#GLIFFY_MAP_12550532_data_store_replication_alt"
src="/wiki/download/attachments/12550532/data+store+replication+alt.png?version=5&modificationDate=1300310658713"
alt="A&#32;Gliffy&#32;Diagram&#32;named&#58;&#32;data&#32;store&#32;replication&#32;alt"/>
</td>
</tr>
</table>
</td>
</tr>
</table>
<p>Other issues that come into play are whether you want the data store replication to be mirrored (which could be of benefit in a geographically distributed scenario) or distributed across servers for increased performance. These types of concerns will impact the best choice of a solution. For example, Kyoto Cabinet provides a database library that could be wrapped by each component's State Replicator. If you wanted to provide a single back-end for all of the State Replicators, you'd more likely use Kyoto Tycoon, the server implementation. As another example, Berkeley DB HA mirrors state across all replicated nodes, while MemcacheDB distributes the data store across servers. </p>
<p>Redis, Berkeley DB HA, and Kyoto Cabinet/Kyoto Tycoon are all intriguing technologies. We need to establish a design for our next iteration of the State Replicator to really know which of these (if any) is the best solution to go forward with. </p>
<p>If time permits, we should also consider updating our simple State Replicator to support standby operation as described. It may be "good enough" for a significant percentage of users. </p>
</div>
</div>
</div>
</div>
</div>
</body>
</html>