High-availability servers

Servers in a high-availability configuration operate slightly differently from standalone servers.

All servers use the same database and a common file share. The file share holds the following directories: logs, var/email, var/plugins, and var/repository. The database stores configuration information, runtime data, and so on. Each server also independently maintains some configuration information, such as ports and hosts.

Because the servers share the database, all servers run on the same interval.

Some configuration properties remain on the server, such as database and JMS connection information. Database configuration is handled during product installation; no additional configuration is required post-installation.

Importing Files (CodeStation)

All servers poll for component version changes. The polling interval is set by a user-configurable parameter (15 minutes by default). The database handles server synchronization: before a server writes to the repository, it acquires a lock in the database. Polling times are reset after a job finishes.
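
The lock-before-write pattern described above can be sketched as follows. This is a minimal illustration, not the product's implementation: the `locks` table, the helper names (`open_shared_db`, `try_acquire`, `release`, `import_version`), and the use of SQLite as a stand-in for the shared database are all assumptions made for the example.

```python
import sqlite3


def open_shared_db():
    # Hypothetical stand-in for the shared database (in-memory SQLite).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE locks (name TEXT PRIMARY KEY, owner TEXT NOT NULL)")
    conn.commit()
    return conn


def try_acquire(conn, name, owner):
    """Return True if `owner` now holds the lock called `name`."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO locks (name, owner) VALUES (?, ?)", (name, owner))
        return True
    except sqlite3.IntegrityError:
        # The primary key already exists: another server holds the lock.
        return False


def release(conn, name, owner):
    # Only the owner's row is deleted, so a non-owner cannot release the lock.
    with conn:
        conn.execute("DELETE FROM locks WHERE name = ? AND owner = ?", (name, owner))


def import_version(conn, server, component, repository):
    """Write to the (hypothetical) repository list only while holding the lock."""
    lock_name = "repository:" + component
    if not try_acquire(conn, lock_name, server):
        return False  # another server is already importing this component
    try:
        repository.append((component, server))
    finally:
        release(conn, lock_name, server)
    return True
```

In this sketch, a server that fails to acquire the lock simply skips the import; whether the product retries or waits is not stated in this document.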

Events

Events are handled by the server that fires the event.

Workflow engine

Workflows consist of activities. Activities can be run sequentially, run in parallel with one another, or some combination of the two. A typical workflow might consist of several sequential activities, such as:

Activity A: Run process A on Agent 1
Activity B: Run process B on Agent 2
Activity C: Run process C on Agent 3

All servers constantly poll for pending workflows, so any server might initiate the workflow. The server that acquires the workflow performs the following steps:

Create a runtime instance of Activity A and acquire a database lock.
Record the command that it intends to send in the database.
Send the command to the agent over JMS.
Release the database lock.

After the agent completes the work, it sends a response message over JMS. One of the servers writes the message to the database, and one of the servers starts the next activity. The server that starts Activity B follows the same steps as described above.

In the simple workflow sketched here, the activities might all be handled by the same server, by different servers, or by some combination. The same is true if the workflow consists of three parallel activities.

An application workflow is maintained by a single record in the database, so only one thread handles a workflow at a time.

Failure handling

During application processing, command failures are marked in the workflow. Error handling is the responsibility of the application author. Component rollback can be handled with rollback commands or steps. Rolling back, as used here, means reinstalling an earlier component version.

If the server crashes while an agent is running a command, the JMS mesh assigns the workflow to another server.

If an agent crashes or otherwise disappears while running a command (remember that failed steps do not cause agents themselves to fail), the server assumes the command is still running; there is no automatic timeout. Normally, it is not practical to assign a timeout interval.
