Zones for replica placement

With zones, you can control replica placement across data centers. A zone can represent different floors of a building, different buildings, or even different cities, or any other distinction that you configure with zone rules. With this capability, data grids of thousands of partitions can be managed with a few optional placement rules.

Zone rules

An eXtreme Scale partition has one primary shard and zero or more replica shards. The examples in this topic use the following naming convention for these shards: P is the primary shard, S is a synchronous replica, and A is an asynchronous replica. A zone rule has three components:

  • A rule name
  • A list of zones
  • An inclusive or exclusive flag
For more information about defining a zone name for a container server, see Defining zones for container servers. A zone rule specifies the set of zones in which a shard can be placed. The inclusive flag indicates that after one shard is placed in a zone from the list, all other shards for that partition are also placed in that zone. An exclusive setting indicates that each shard for a partition is placed in a different zone from the zone list. For example, with an exclusive setting, if a partition has three shards (a primary and two synchronous replicas), then the zone list must have three zones.
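
The following sketch shows how that last case might look in a deployment policy. It is a minimal illustration with hypothetical zone and rule names; the elements follow the same schema as the complete examples later in this topic:

<zoneMetadata>
	<!-- Assumes a mapSet with maxSyncReplicas="2". One exclusive rule is
	     shared by the primary and both synchronous replicas, so each of
	     the three shards is placed in a different zone. -->
	<shardMapping shard="P" zoneRuleRef="threeZones"/>
	<shardMapping shard="S" zoneRuleRef="threeZones"/>
	<zoneRule name="threeZones" exclusivePlacement="true">
		<zone name="Zone1" />
		<zone name="Zone2" />
		<zone name="Zone3" />
	</zoneRule>
</zoneMetadata>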

Each shard can be associated with one zone rule, and a zone rule can be shared between shards. When a rule is shared, the inclusive or exclusive flag extends across the shards of all types that share that rule, as the preceding sketch and the following examples show.

[Version 8.6.0.6 and later] Note: With XIO failure detection, zone placement follows the rules in this topic. However, core group failure detection is ignored in favor of the failure detection that XIO provides. The catalog servers still monitor the heartbeat of containers when clients report communication difficulties with the containers, or when the containers fail to check in with one of the catalog servers.

Examples

The following examples show various scenarios and the deployment configuration that implements each scenario.

Striping primaries and replicas across zones

You have three blade chassis and want primaries distributed across all three, with a single synchronous replica placed in a different chassis than its primary. Define each chassis as a zone, named ALPHA, BETA, and GAMMA. An example deployment XML follows:

<?xml version="1.0" encoding="UTF-8"?>
<deploymentPolicy xmlns:xsi=http://www.w3.org/2001/XMLSchema-instance 
	xsi:schemaLocation=
	"http://ibm.com/ws/objectgrid/deploymentPolicy ../deploymentPolicy.xsd"
				xmlns="http://ibm.com/ws/objectgrid/deploymentPolicy">
		<objectgridDeployment objectgridName="library">
			<mapSet name="ms1" numberOfPartitions="37" minSyncReplicas="1"
				maxSyncReplicas="1" maxAsyncReplicas="0">
			<map ref="book" />
			<zoneMetadata>
				<shardMapping shard="P" zoneRuleRef="stripeZone"/>
				<shardMapping shard="S" zoneRuleRef="stripeZone"/>
				<zoneRule name ="stripeZone" exclusivePlacement="true" >
					<zone name="ALPHA" />
					<zone name="BETA" />
					<zone name="GAMMA" />
				</zoneRule>
			</zoneMetadata>
		</mapSet>
	</objectgridDeployment>
</deploymentPolicy>

This deployment XML contains a data grid called library with a single map called book. It uses 37 partitions, each with a single synchronous replica. The zoneMetadata clause shows the definition of a single zone rule and the association of zone rules with shards. The primary and synchronous shards are both associated with the zone rule stripeZone. The zone rule contains all three zones and uses exclusive placement. This rule means that if the primary for partition 0 is placed in ALPHA, then the replica for partition 0 is placed in either BETA or GAMMA. Similarly, primaries for other partitions are placed across the zones, and each replica is placed in a different zone than its primary.
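
If you also wanted an asynchronous replica, with each of the three shards in a different chassis, the same exclusive rule could cover the A shard as well. The following sketch shows this variation; it is extrapolated from the example above, not part of the original scenario:

<mapSet name="ms1" numberOfPartitions="37" minSyncReplicas="1"
	maxSyncReplicas="1" maxAsyncReplicas="1">
	<map ref="book" />
	<zoneMetadata>
		<!-- Three shard types and three zones with exclusive placement:
		     the primary, synchronous replica, and asynchronous replica
		     are each placed in a different chassis. -->
		<shardMapping shard="P" zoneRuleRef="stripeZone"/>
		<shardMapping shard="S" zoneRuleRef="stripeZone"/>
		<shardMapping shard="A" zoneRuleRef="stripeZone"/>
		<zoneRule name="stripeZone" exclusivePlacement="true">
			<zone name="ALPHA" />
			<zone name="BETA" />
			<zone name="GAMMA" />
		</zoneRule>
	</zoneMetadata>
</mapSet>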

Asynchronous replica in a different zone than primary and synchronous replica

In this example, two buildings exist with a high-latency connection between them. You want high availability with no data loss for all failure scenarios. However, the performance impact of synchronous replication between buildings leads to a trade-off: place a primary with a synchronous replica in one building and an asynchronous replica in the other building. Normally, failures are JVM crashes or computer failures rather than large-scale issues. With this topology, you can survive the normal failures with no data loss. The loss of a building is rare enough that some data loss is acceptable in that case. You can make two zones, one for each building. The deployment XML file follows:

<?xml version="1.0" encoding="UTF-8"?>
<deploymentPolicy xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
		xsi:schemaLocation="http://ibm.com/ws/objectgrid/deploymentPolicy ../deploymentPolicy.xsd"
		xmlns="http://ibm.com/ws/objectgrid/deploymentPolicy">

	<objectgridDeployment objectgridName="library">
		<mapSet name="ms1" numberOfPartitions="13" minSyncReplicas="1"
			maxSyncReplicas="1" maxAsyncReplicas="1">
			<map ref="book" />
			<zoneMetadata>
				<shardMapping shard="P" zoneRuleRef="primarySync"/>
				<shardMapping shard="S" zoneRuleRef="primarySync"/>
				<shardMapping shard="A" zoneRuleRef="async"/>
				<zoneRule name="primarySync" exclusivePlacement="false">
					<zone name="BldA" />
					<zone name="BldB" />
				</zoneRule>
				<zoneRule name="async" exclusivePlacement="true">
					<zone name="BldA" />
					<zone name="BldB" />
				</zoneRule>
			</zoneMetadata>
		</mapSet>
	</objectgridDeployment>
</deploymentPolicy>

The primary and synchronous replica share the primarySync zone rule, which has an exclusivePlacement setting of false. So, after either the primary or the synchronous replica is placed in a zone, the other is placed in the same zone. The asynchronous replica uses a second zone rule with the same zones as the primarySync zone rule, but with the exclusivePlacement attribute set to true. This attribute means that a shard cannot be placed in a zone that already holds another shard from the same partition. As a result, the asynchronous replica is not placed in the same zone as the primary or synchronous replica.

Placing all primaries in one zone and all replicas in another zone

In this example, each partition has a primary and a single asynchronous replica. All primaries are placed in one zone and all replicas in a different zone: the primaries go to zone A and the replicas to zone B. The deployment XML file follows:

	<?xml version="1.0" encoding="UTF-8"?>

	<deploymentPolicy xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
		xsi:schemaLocation=
			"http://ibm.com/ws/objectgrid/deploymentPolicy ../deploymentPolicy.xsd"
		xmlns="http://ibm.com/ws/objectgrid/deploymentPolicy">

		<objectgridDeployment objectgridName="library">
			<mapSet name="ms1" numberOfPartitions="13" minSyncReplicas="0"
				maxSyncReplicas="0" maxAsyncReplicas="1">
				<map ref="book" />
				<zoneMetadata>
					<shardMapping shard="P" zoneRuleRef="primaryRule"/>
					<shardMapping shard="A" zoneRuleRef="replicaRule"/>
					<zoneRule name ="primaryRule">
						<zone name="A" />
					</zoneRule>
					<zoneRule name="replicaRule">
						<zone name="B" />
							</zoneRule>
						</zoneMetadata>
					</mapSet>
			</objectgridDeployment>
	</deploymentPolicy>

Here, you can see two rules: one for the primaries (P) and another for the replicas (A).

Zones over wide area networks (WAN)

You might want to deploy a single data grid over multiple buildings or data centers with slower network interconnections. Slower interconnections mean lower bandwidth and higher latency. The possibility of network partitions also increases in this environment because of network congestion and other factors. eXtreme Scale copes with this harsh environment by limiting heartbeating between zones.

Java™ virtual machines that are grouped into core groups heartbeat each other. When the catalog service organizes Java virtual machines into core groups, those groups do not span zones. A leader within each group pushes membership information to the catalog service. The catalog service verifies any reported failure before taking action by attempting to connect to the suspect Java virtual machines. If the catalog service determines that the failure detection is false, it takes no action, because the core group partition heals in a short time.

The catalog service also heartbeats core group leaders periodically at a slow rate to handle the case of core group isolation.