High availability

With high availability, WebSphere® eXtreme Scale provides reliable data redundancy and detection of failures.

WebSphere eXtreme Scale self-organizes data grids of Java™ virtual machines into a loosely federated tree. The catalog service is at the root, and the core groups that hold container servers are at the leaves of the tree. See Caching architecture: Maps, containers, clients, and catalogs for more information.

Tip:

When XIO is enabled, the XIO transport maintains persistent socket connections between the catalogs and the containers, in addition to the connections that the high availability (HA) manager and Distribution and Consistency Services (DCS) core groups provide. WebSphere eXtreme Scale now uses these persistent connections directly to detect failures when a socket connection is lost, instead of relying on the core groups to detect lost socket connections. Then, the container server, which is the core group leader, reports the lost connections to the primary catalog.

Therefore, although you still see the HA manager and DCS stack start in containers, and core groups form, they are ignored. Subsequent updates to WebSphere eXtreme Scale will fully remove the HA manager and DCS stack from the containers. Core groups, the HA manager, and DCS are still used as described here for the catalog cluster.

These changes also allow the catalogs to treat client reports of failures as a trigger for checking whether containers are still active, which is determined with an explicit remote procedure call (RPC). Containers also periodically check in with one of the catalog servers (not necessarily the primary server) to help assure the catalog that the container is not isolated by a brownout. For now, when you enable XIO, core groups are still set up and formed, but they are ignored during failure detection for the containers and are used only for the catalog servers. To enable this transport mechanism, see Configuring IBM eXtremeIO (XIO).
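The failure detection flow in this tip can be pictured with a minimal sketch. The class, method names, and timeout parameter are hypothetical and are not part of the WebSphere eXtreme Scale API; the probe function stands in for the explicit RPC, and the product's actual internal logic might differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical illustration only; not a WebSphere eXtreme Scale class.
class ContainerLivenessTracker {
    // Time of the most recent periodic check-in for each container.
    private final Map<String, Long> lastCheckInMillis = new HashMap<>();
    // Stand-in for the explicit RPC that asks a container whether it is active.
    private final Predicate<String> probe;
    // Assumed limit on the time between check-ins before a container is suspect.
    private final long checkInTimeoutMillis;

    ContainerLivenessTracker(Predicate<String> probe, long checkInTimeoutMillis) {
        this.probe = probe;
        this.checkInTimeoutMillis = checkInTimeoutMillis;
    }

    // A container periodically checks in with a catalog server.
    void checkIn(String containerId, long nowMillis) {
        lastCheckInMillis.put(containerId, nowMillis);
    }

    // A client report is only a trigger: the catalog verifies it with the probe.
    // Returns true if the container is still active.
    boolean onClientFailureReport(String containerId) {
        boolean active = probe.test(containerId);
        if (!active) {
            lastCheckInMillis.remove(containerId); // treat the container as failed
        }
        return active;
    }

    // True if the container is known and has checked in within the timeout.
    boolean isCheckInCurrent(String containerId, long nowMillis) {
        Long last = lastCheckInMillis.get(containerId);
        return last != null && nowMillis - last <= checkInTimeoutMillis;
    }
}
```

In this sketch, a client report never removes a container by itself. The container is dropped only when the probe also fails, which mirrors the description of client reports as an impetus for an explicit check.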

Important terms

Heartbeat: A signal that is sent between servers to convey that they are running.
Quorum: A group of catalog servers that communicate and conduct placement operations in the data grid. This group consists of all of the catalog servers in the data grid, unless you manually override the quorum mechanism with administrative actions.
Brownout: A temporary loss of connectivity between one or more servers.
Blackout: A permanent loss of connectivity between one or more servers.
Data center: A geographically located group of servers that are generally connected with a local area network (LAN).
Zone: A configuration option that groups servers that share a physical characteristic. Examples of zones for a group of servers include a data center, an area network, a building, or a floor of a building.
Network partition: Two catalog servers act as primaries concurrently. Both servers make changes to the catalog server state, which leads to data corruption.
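
The quorum definition in this list can be expressed as a small check. This is a sketch under stated assumptions: the class and method names are hypothetical, catalog servers are identified by plain strings, and the manual override is reduced to a boolean flag. It is not how the product implements quorum.

```java
import java.util.Set;

// Hypothetical helper for illustration only; not a WebSphere eXtreme Scale API.
class QuorumCheck {
    // Quorum holds when every configured catalog server is reachable,
    // or when an administrator has manually overridden the mechanism.
    static boolean hasQuorum(Set<String> configuredCatalogs,
                             Set<String> reachableCatalogs,
                             boolean manuallyOverridden) {
        if (manuallyOverridden) {
            return true;
        }
        return !configuredCatalogs.isEmpty()
                && reachableCatalogs.containsAll(configuredCatalogs);
    }
}
```

With three configured catalog servers, losing contact with any one of them breaks quorum in this sketch until connectivity returns or an administrator overrides the mechanism.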