Shard cluster definitions

The definition for a shard cluster includes information about the shard servers, the data to shard, and the sharding method.

To run the cdr define shardCollection command, which creates a sharding definition for partitioning your table data, you must specify the following information:
  • A name for the sharding definition
  • The name of the database that contains the table that is being sharded
  • The name of the user that owns the table that is being sharded
  • The sqlhosts file group name for each database server in the shard cluster
  • The column that is used as a shard key
  • The sharding method that the database server uses to determine how rows are distributed among the shard servers:
    • With consistent hash-based sharding, the data is automatically distributed between shard servers in a way that minimizes the data movement when you add or remove shard servers.
    • With hash-based sharding, the data is automatically divided between shard servers, but when you change the shard cluster, all data is redistributed.
    • With expression-based sharding, you specify how the data is divided between shard servers. You must also specify the shard server to receive the data that is outside the scope of the expression.
  • How you want to distribute the data:
    • Insert rows on any shard server, replicate the rows to the appropriate shard server, and then delete duplicate rows from the original server. The delete method is the default and provides the same behavior as when you define sharding with MongoDB commands.
    • Insert rows on any shard server, replicate the rows to the appropriate shard server, but then keep duplicate rows on the original server. The keep method is similar to a data dissemination system.
    • Insert rows on the appropriate shard server and do not replicate rows. The informational method is useful if you want to query across multiple servers that have the same table, but you do not need to shard the data during loading. For example, you have a different database server for each of your three stores. The data from each store is always inserted in the appropriate server. You set up the sharding definition with an expression that matches database servers with their store identifiers. Then you can run sharded queries to aggregate data from all three stores. A sketch of such a definition follows this list.
  • The table column or collection field for tracking row updates
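
For example, a sharding definition for the three-store scenario might look like the following sketch. The database, table, shard-key column, and server group names are illustrative, and it assumes that the informational method is selected with --type=informational; the third server receives the remainder, which covers its own store identifier and any unexpected values:

cdr define shardCollection stores_1 db_1:joe.sales
   --type=informational --key=store_id --strategy=expression --versionCol=version
   g_store_server_1 "IN (1)"
   g_store_server_2 "IN (2)"
   g_store_server_3 REMAINDER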

Consistent hash-based sharding

When you create a consistent hash-based sharding definition, IBM® Informix® uses a hash value of a specific column or field to distribute data to the servers of a shard cluster in a consistent pattern. When you add or remove a shard server, the consistent hashing algorithm redistributes a fraction of the data. You specify how many hashing partitions to create on each shard server. The default number of hashing partitions is three. The more hashing partitions, the more evenly the data is distributed among shard servers. However, if you specify more than 10 hashing partitions, the resulting SQL statement to create the sharded table might fail because it exceeds the maximum character limit for an SQL statement.

For example, the following command creates a consistent hashing index that has three partitions on each shard server:

cdr define shardCollection collection_1 db_1:john.customers   
   --type=delete --key=b --strategy=chash --partitions=3 --versionCol=column_3      
   g_shard_server_1 g_shard_server_2 g_shard_server_3 

You can dynamically change the number of hashing partitions per shard server by running the cdr change shardCollection command.
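
For example, the following command is a minimal sketch that changes the collection_1 definition to use five hashing partitions per shard server, assuming that cdr change shardCollection accepts the same --partitions option as cdr define shardCollection:

cdr change shardCollection collection_1 --partitions=5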

Hash-based sharding

When you create a hash-based sharding definition, IBM Informix uses a hash value of a specific column or field to distribute data to the servers of a shard cluster. When you add or remove a shard server, the hashing algorithm redistributes all the data.

For example, the following command creates a hashed index that is based on shard key values, and then Enterprise Replication determines where rows with specific hashed index values are distributed:
cdr define shardCollection collection_1 db_1:john.customers
   --type=delete --key=state --strategy=hash --versionCol=version
   g_shard_server_A g_shard_server_B g_shard_server_C g_shard_server_D

Expression-based sharding

When you create an expression-based sharding definition, IBM Informix uses WHERE-clause syntax on a specific column or field to distribute data to the servers of a shard cluster.

For example, the following command sends rows with a shard-key value of NV to g_shard_server_B:
cdr define shardCollection collection_1 db_1:joe.clients
--type=delete --key=state --strategy=expression --versionCol=version
   g_shard_server_A "IN ('WA','OR','ID')"
   g_shard_server_B "IN ('CA','NV','UT','AZ')"
   g_shard_server_C "IN ('TX','OK','NM','AR','LA')"
   g_shard_server_D  REMAINDER

Sharding definitions must include the REMAINDER expression for rows or documents that have values that are not accounted for by the other expressions. For example, the previous sharding definition sends rows with a shard-key value of 'NY' to g_shard_server_D.

Expressions that are used for sharding data cannot overlap. For example, a sharding definition that is created with the following command is not valid because rows or documents with shard key values 40 to 60 would be sent to both g_shard_server_A and g_shard_server_B.
cdr define shardCollection collection_1 db_1:joe.clients
--type=delete --key=age --strategy=expression --versionCol=version
   g_shard_server_A "BETWEEN 0 AND 60"
   g_shard_server_B "BETWEEN 40 AND 100"
   g_shard_server_C  REMAINDER
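
For contrast, the following sketch shows one valid way to define the same shard cluster with ranges that do not overlap, so that each age value matches at most one expression:

cdr define shardCollection collection_1 db_1:joe.clients
--type=delete --key=age --strategy=expression --versionCol=version
   g_shard_server_A "BETWEEN 0 AND 39"
   g_shard_server_B "BETWEEN 40 AND 100"
   g_shard_server_C  REMAINDER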