Core groups

A core group is a high availability domain of container servers. The catalog service places container servers into core groups of a limited size. A core group tries to detect the failure of its members. A single member of each core group is elected as the core group leader. The leader periodically tells the catalog service that the core group is alive and reports any membership changes, such as a Java virtual machine (JVM) failure or a newly added JVM that joins the core group.
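
The reporting loop can be pictured with the following minimal sketch. It is illustrative only; the CatalogService and CoreGroupLeader types, their method names, and the 30-second interval are hypothetical and are not WebSphere eXtreme Scale APIs.

    import java.util.Set;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical interface: not a WebSphere eXtreme Scale API.
    interface CatalogService {
        // The leader reports that the core group is alive, along with current membership.
        void reportCoreGroupAlive(String coreGroupId, Set<String> currentMembers);
    }

    // Hypothetical leader-side reporting loop.
    class CoreGroupLeader {
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        void start(CatalogService catalog, String coreGroupId, Set<String> members) {
            // Periodically tell the catalog service that the core group is alive.
            // The membership set reflects failed JVMs and newly joined JVMs.
            scheduler.scheduleAtFixedRate(
                    () -> catalog.reportCoreGroupAlive(coreGroupId, members),
                    0, 30, TimeUnit.SECONDS);
        }
    }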

Tip:

When XIO is enabled, the XIO transport maintains persistent socket connections between the catalog servers and the container servers, over and above what the high availability (HA) manager and Distribution and Consistency Services (DCS) core groups provided. WebSphere® eXtreme Scale now uses these persistent connections directly for failure detection when a socket connection is lost, replacing the core group detection of lost socket connections. The container server that is the core group leader then reports the lost connections to the primary catalog server.

Therefore, while you will still see the HA manager and DCS stack starting in containers and core groups being formed, they are ignored. Subsequent updates to WebSphere eXtreme Scale will fully remove the HA manager and DCS stack from the containers. Core groups, the HA manager, and DCS are still used as described here for the catalog server cluster.

These changes also allow the catalog servers to use client reports of failures as a prompt to check whether containers are still active, which is determined through an explicit RPC call. Containers also periodically check in with one of the catalog servers (not necessarily the primary server) to help assure the catalog that the container is not isolated by a brownout. For now, when you enable XIO, core groups are still set up and formed. However, core groups are ignored during failure detection for the containers. Core groups are still used for catalog servers. To enable this transport mechanism, see Configuring IBM eXtremeIO (XIO).
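
The following minimal sketch shows the general idea of confirming a client-reported failure with an explicit RPC check. The ContainerRpcClient interface, the ping call, and the 5-second timeout are hypothetical and are not WebSphere eXtreme Scale APIs.

    // Hypothetical RPC client: not a WebSphere eXtreme Scale API.
    interface ContainerRpcClient {
        // Explicit RPC call that asks a container whether it is still active.
        boolean ping(String containerId, long timeoutMillis);
    }

    // Hypothetical catalog-side check driven by a client failure report.
    class CatalogFailureVerifier {
        private final ContainerRpcClient rpc;

        CatalogFailureVerifier(ContainerRpcClient rpc) {
            this.rpc = rpc;
        }

        // A client report is only a hint; the catalog confirms with its own RPC call.
        boolean confirmFailure(String containerId) {
            boolean alive = rpc.ping(containerId, 5000);
            return !alive; // treat the container as failed only if the explicit check fails
        }
    }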

If a JVM socket is closed, that JVM is regarded as being no longer available. Each core group member also heartbeats over these sockets at a rate determined by configuration. If a JVM does not respond to these heartbeats within a configured maximum time period, the JVM is considered to be no longer available, which triggers a failure detection.
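
As a rough illustration of this style of detection, the following sketch sends heartbeats over an open socket and reports a failure when the peer is silent for longer than the configured maximum. The class, its parameters, and its logic are hypothetical and are not the WebSphere eXtreme Scale implementation.

    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.Socket;

    // Hypothetical heartbeat monitor: not the WebSphere eXtreme Scale implementation.
    class HeartbeatMonitor {
        private final long intervalMillis;    // heartbeat rate, determined by configuration
        private final long maxSilenceMillis;  // configured maximum time period
        private volatile long lastResponseMillis = System.currentTimeMillis();

        HeartbeatMonitor(long intervalMillis, long maxSilenceMillis) {
            this.intervalMillis = intervalMillis;
            this.maxSilenceMillis = maxSilenceMillis;
        }

        // Sends heartbeats over the open socket; if the peer stops responding within
        // the configured maximum time period, the peer JVM is reported as failed.
        void monitor(Socket peer, Runnable onFailure) {
            try {
                OutputStream out = peer.getOutputStream();
                while (true) {
                    out.write(0);              // heartbeat byte
                    out.flush();
                    Thread.sleep(intervalMillis);
                    if (System.currentTimeMillis() - lastResponseMillis > maxSilenceMillis) {
                        onFailure.run();       // heartbeat timeout triggers failure detection
                        return;
                    }
                }
            } catch (IOException | InterruptedException e) {
                onFailure.run();               // a closed socket is also treated as a failure
            }
        }

        // Called by a reader thread whenever a heartbeat response arrives from the peer.
        void recordResponse() {
            lastResponseMillis = System.currentTimeMillis();
        }
    }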

If the catalog service marks a container JVM as failed and the container JVM is later reported as being available, the container JVM is told to shut down its container servers. A JVM in this state is not visible in xscmd utility command queries. Messages in the logs of the container JVM indicate that the container JVM failed. You must manually restart these JVMs.

If the core group leader cannot contact a member, it continues to retry contacting that member.

The complete failure of all members of a core group is also a possibility. If the entire core group has failed, it is the responsibility of the catalog service to detect this loss.

The catalog service automatically creates core groups, each containing about 20 servers. The core group members provide health monitoring for the other members of the group, and each core group elects one member as the leader for communicating group information to the catalog service. Limiting the core group size allows for effective health monitoring and a highly scalable environment.
Note: In a WebSphere Application Server environment, in which core group size can be altered, eXtreme Scale does not support more than 50 members per core group.
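
As a purely illustrative sketch of placing servers into groups of a limited size, the following code splits a list of container servers into groups of at most 20 members. The class and constant names are hypothetical; the real placement is performed internally by the catalog service.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical grouping logic; the real placement is done by the catalog service.
    class CoreGroupPlacement {
        static final int MAX_CORE_GROUP_SIZE = 20; // approximate limit described above

        // Splits the known container servers into core groups of a limited size.
        static List<List<String>> placeIntoCoreGroups(List<String> containerServers) {
            List<List<String>> coreGroups = new ArrayList<>();
            for (int i = 0; i < containerServers.size(); i += MAX_CORE_GROUP_SIZE) {
                int end = Math.min(i + MAX_CORE_GROUP_SIZE, containerServers.size());
                coreGroups.add(new ArrayList<>(containerServers.subList(i, end)));
            }
            return coreGroups;
        }
    }
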
Failures are detected in the following ways:
  1. Sockets are kept open between Java™ virtual machines, and if a socket closes unexpectedly, the unexpected closure is detected as a failure of the peer Java virtual machine (see the sketch after this list). This detection catches failure cases such as a Java virtual machine exiting very quickly, and it typically allows recovery from these types of failures in less than a second.
  2. Other types of failures include operating system panics, physical server failures, network failures, intermittent network failures, and connectivity issues. These failures are discovered through heartbeating.
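
The first case can be illustrated with the following minimal sketch, in which a blocking read on the peer socket returns end-of-stream or throws an IOException as soon as the connection closes. The class name and callback are hypothetical and are not WebSphere eXtreme Scale APIs.

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.Socket;

    // Hypothetical watcher: not a WebSphere eXtreme Scale API.
    class PeerSocketWatcher {
        // Blocks reading from the peer socket; an unexpected close is seen almost
        // immediately, which is why recovery typically takes less than a second.
        void watch(Socket peer, Runnable onPeerFailure) {
            try (InputStream in = peer.getInputStream()) {
                while (in.read() != -1) {
                    // Normal traffic (heartbeats or data); keep reading.
                }
                onPeerFailure.run();  // end of stream: the peer closed its socket
            } catch (IOException e) {
                onPeerFailure.run();  // abrupt close or reset is treated the same way
            }
        }
    }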