Readme and release notes for LoadLeveler 5.1.0.18: LL_resmgr-5.1.0.18-x86_64-Linux-SLES11

Fix Readme


Abstract

This readme provides installation instructions, known limitations, and the list of problems fixed for the LoadLeveler 5.1.0.18 resource manager update (LL_resmgr-5.1.0.18-x86_64-Linux-SLES11) for SLES 11 on x86_64 systems.

Content

Readme file for: LL_resmgr-5.1.0.18-x86_64-Linux-SLES11
Product/Component Release: 5.1.0.18
Update Name: LL_resmgr-5.1.0.18-x86_64-Linux-SLES11
Fix ID: LL_resmgr-5.1.0.18-x86_64-Linux-SLES11
Publication Date: 9 May 2014
Last modified date: 9 May 2014

Installation information

Download location

Below is a list of components, platforms, and file names that apply to this Readme file.

Fix Download for Linux

Product/Component Name: LoadLeveler
Platform: Linux 64-bit, x86_64, SLES 11
Fix: LL_resmgr-5.1.0.18-x86_64-Linux-SLES11

Prerequisites and co-requisites

None

Known limitations

  • - Known Limitations

    For LL 5.1.0:

    • LoadLeveler 5.1.0.4 is a mandatory update that provides corrective fixes for LoadLeveler v5.1.0 on Linux x86 systems.
    • LoadLeveler 5.1.0.6 is a mandatory update that provides corrective fixes for LoadLeveler v5.1.0 on Linux POWER systems.
    • If the scheduler and resource manager components on the same machine are not at the same level, the daemons will not start.
    • Preemption is not supported for jobs that use collective acceleration units (CAU), that is, jobs that specify either the collective_groups LoadLeveler keyword or the MP_COLLECTIVE_GROUPS environment variable. If jobs use CAUs, the keyword PREEMPTION_SUPPORT = NONE (the default) must be specified in the LoadLeveler configuration. A configuration sketch illustrating the CAU-related settings appears at the end of this section.

    For LL 5.1.0.3+:

    • When submitting a batch job that uses Collective Acceleration Unit (CAU) groups, the MP_COLLECTIVE_GROUPS environment variable must specify the number of collective groups to be used by the job.
    • If the PREEMPTION_SUPPORT keyword is set to full in the LoadLeveler configuration file:
      • The collective_groups keyword or MP_COLLECTIVE_GROUPS environment variable cannot be specified for preemptable jobs.
    • If the PREEMPTION_SUPPORT keyword is set to no_adapter in the LoadLeveler configuration file and the collective_groups keyword or the MP_COLLECTIVE_GROUPS environment variable is set, you must set the following environment variables for the job:
      • LAPI_DEBUG_COMM_TIMEOUT=yes
      • MP_DEBUG_COMM_TIMEOUT=yes

    For LL 5.1.0.7:

    • Do not install the LL 5.1.0.7 service update if you are using or planning to use a database for the LoadLeveler configuration.

    For LL 5.1.0.12:

    • APAR IV26259 of Parallel Environment Runtime Environment (1.2.0.9 or higher) must be installed if you are using the checkpoint/restart function.

    For LL 5.1.0.13:

    • Do not install the LL 5.1.0.13 service update if you have PE Runtime Environment 1.1 installed.
    • Support for PE Runtime Environment 1.1 will be available with APAR IV33552.

    For LL 5.1.0.16:

    • LL 5.1.0.16 is only supported for Blue Gene/Q.
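
    As a concrete illustration of the CAU-related settings above, the configuration fragment below shows the two cases described in this section; the keyword names come from this readme, and everything else is an example only:

    # LoadL_config fragment (sketch): CAU jobs require preemption disabled,
    # which is the default setting.
    PREEMPTION_SUPPORT = NONE

    # Alternatively, if PREEMPTION_SUPPORT = no_adapter is used and a job sets
    # collective_groups or MP_COLLECTIVE_GROUPS, the job's environment must add:
    #   LAPI_DEBUG_COMM_TIMEOUT=yes
    #   MP_DEBUG_COMM_TIMEOUT=yes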

Installation information

Use these instructions to install the LoadLeveler update RPMs on IBM Systems running supported versions of Red Hat Enterprise Linux and SUSE Linux Enterprise Server.


  • - Read-me-first installation notes

    This installation procedure assumes that you are updating a machine running the full version of LoadLeveler.

    License RPM

    Do not proceed if the LoadL-full-license-<OS-ARCH>-<installed_license_version> license package is not installed. For example, if you are currently running LoadLeveler 4.1.x for Linux, you cannot install an update for LoadLeveler 5.1.x. You must first upgrade to LoadLeveler 5.1.0.3-0 before you can install an update package for a 5.1.0.X-0 release. Please contact your IBM marketing representative for information on how to obtain the appropriate LoadLeveler CD-ROM that has the license package.

    When uninstalling the "LoadL-<fileset>-full" RPM currently installed on your machine, do not uninstall the "LoadL-full-license" RPM for your currently installed LoadLeveler release. The "LoadL-<fileset>-full" update RPM has a dependency on the currently installed license RPM. Also, the LoadLeveler daemons will not start unless the license package is installed and you have accepted the terms and conditions of the license agreement.
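
    To confirm that the license package is present before proceeding, you can query the RPM database; this is a minimal check, assuming the standard package naming shown above:

    # List any installed LoadLeveler license packages (rpm -qa accepts glob patterns)
    rpm -qa 'LoadL-full-license*'

    If this command prints nothing, install the license package first.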

    Submit-only machines

    The update steps for a LoadLeveler submit-only machine are similar to those for a machine running the full product. Simply replace the relevant "full" file names with their "so" counterparts. Also, on a submit-only machine there is no need to run the llctl drain, llctl stop, or llctl start commands, because LoadLeveler daemons do not run on a submit-only machine.
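
    For example, using the RHEL 6 file names from the example below, the direct-update step on a submit-only machine would be the following sketch (substitute the package for your platform and level):

    # Submit-only machines install the "so" scheduler RPM instead of the "full" RPMs
    rpm -Uvh LoadL-scheduler-so-RH6-X86_64-5.1.0.4-0.x86_64.rpm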

    File name conventions

    Download packages and RPMs for LoadLeveler updates use consistent file naming conventions. By looking at the components of the file name, you can determine which update package you will need to download based on the machine's architecture, Linux operating system, and the installed version of LoadLeveler.

    Example

    If your system is running the scheduler and resource manager on RHEL 6 with LoadLeveler 5.1.0.3-0 installed, and you want to download and install the LoadLeveler 5.1.0.4-0 update, then the file name components would be as follows:

    Download package:

    LL-<fileset>-<update_version>.<arch>.<OS-ARCH>.tar.gz
    Specifies the LL_scheduler-5.1.0.4-0.x86_64-Linux-RHEL6.tar.gz file
    Specifies the LL_resmgr-5.1.0.4-0.x86_64-Linux-RHEL6.tar.gz file

    Package RPMs:

    LoadL-<fileset>-full-<OS-ARCH>-<update_version>.<arch>.rpm
    Specifies the LoadL-scheduler-full-RH6-X86_64-5.1.0.4-0.x86_64.rpm file
    Specifies the LoadL-resmgr-full-RH6-X86_64-5.1.0.4-0.x86_64.rpm file

    LoadL-scheduler-so-<OS-ARCH>-<update_version>.<arch>.rpm
    Specifies the LoadL-scheduler-so-RH6-X86_64-5.1.0.4-0.x86_64.rpm file

    LoadL-utils-<OS-ARCH>-<update_version>.<arch>.rpm
    Specifies the LoadL-utils-RH6-X86_64-5.1.0.4-0.x86_64.rpm file

    LoadL-resmgr-kbdd-<OS-ARCH>-<update_version>.<arch>.rpm
    Specifies the LoadL-resmgr-kbdd-RH6-X86_64-5.1.0.4-0.x86_64.rpm file

    BLUE GENE SYSTEM ONLY:
    LoadL-scheduler-bluegene-<OS-ARCH>-<update_version>.<arch>.rpm
    Specifies the LoadL-scheduler-bluegene-RH6-PPC64-5.1.0.6-0.ppc64.rpm file

    Currently installed LoadLeveler and License RPMs:

    LoadL-<fileset>-<OS-ARCH>-<currently_installed_version>
    Specifies that LoadL-scheduler-full-RH6-X86_64-5.1.0.3-0 is installed
    Specifies that LoadL-resmgr-full-RH6-X86_64-5.1.0.3-0 is installed

    LoadL-full-license-<OS-ARCH>-<installed_license_version>
    Specifies the LoadL-full-license-RH6-X86_64-5.1.0.3-0 license

    where

    <fileset>

    The LoadLeveler scheduler or resource manager fileset.

    <OS-ARCH>

    The Linux operating system and platform architecture of the machine where you are installing a LoadLeveler update package. For example, if you are upgrading an installation of LoadLeveler on a machine with 64-bit AMD Opteron or Intel EM64T processors running RedHat 6, then <OS-ARCH> would be RH6-X86_64.

    <update_version>

    Specifies the version number of the LoadLeveler update package that you want to install on your system. For example, if you are updating to LoadLeveler 5.1.0.4-0, then the <update_version> is the number 5.1.0.4-0. The <update_version> appears in the file name of the package download (*.tar.gz file) and in the update RPMs in the package.

    <arch>

    Used in RPM file names. Specifies the platform architecture, where <arch> =
    x86_64 (64-bit IBM System x) or ppc64 (64-bit IBM POWER servers).

    <base_version>

    Used only in the file name of the downloadable update package (gzipped tar file). Specifies the LoadLeveler Version/Release to which the update can be applied, for example, 5.1.X.

    <currently_installed_version>

    Specifies the version number of the current LoadLeveler release installed on your system.

    <installed_license_version>

    Specifies the license RPM for the base release (or a lower release) of the LoadLeveler version you are updating. For example, to install the LoadLeveler 5.1.0.4 update, you must have an installed license RPM for LoadLeveler on Linux.
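
    As a quick sanity check of these conventions, the following shell sketch assembles an update RPM file name from its components; the variable values are examples for this fix:

    #!/bin/sh
    # Compose a LoadLeveler update RPM file name from its naming components.
    FILESET=resmgr             # <fileset>
    OS_ARCH=SLES11-X86_64      # <OS-ARCH>
    UPDATE_VERSION=5.1.0.18-0  # <update_version>
    ARCH=x86_64                # <arch>
    echo "LoadL-${FILESET}-full-${OS_ARCH}-${UPDATE_VERSION}.${ARCH}.rpm"
    # Prints: LoadL-resmgr-full-SLES11-X86_64-5.1.0.18-0.x86_64.rpm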

  • - Installation procedure
    1. Change to the <UpdateDirectory> directory, that is, the directory where the *.tar.gz file for the LoadLeveler update resides and where you have write access:

      cd <UpdateDirectory>
    2. Extract the RPM files from the tar file:

      tar -xzvf LL-<fileset>-<update_version>.<arch>.<OS-ARCH>.tar.gz

      At the end of this step the files extracted from the archive should match the files listed in the Readme ("View" link) for the LoadLeveler update you downloaded.
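
      To compare the archive contents against that listing without extracting anything, you can list the tar file first:

      # List the files in the update package without extracting them
      tar -tzvf LL-<fileset>-<update_version>.<arch>.<OS-ARCH>.tar.gz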
    3. Verify that the LoadLeveler "license" RPM for the LoadLeveler version that you are updating is currently installed on this system:

      rpm -qa | grep LoadL

      The output of this command should be similar to the following:

      LoadL-<fileset>-full-<OS-ARCH>-<currently_installed_version>
      LoadL-full-license-<OS-ARCH>-<installed_license_version>
    4. If the LoadLeveler resource manager fileset is running on this machine, enter the following command to "drain" the LoadL_schedd and LoadL_startd daemons on this machine and all other machines in the LoadLeveler cluster:

      llctl -g drain

      Note: To avoid potential incompatibility problems, all machines in the cluster should be upgraded to the same LoadLeveler update release before restarting the LoadLeveler daemons.
    5. Use the llstatus command to verify that the LoadL_schedd and LoadL_startd daemons are in "Drned" (drained) state, and then enter the following command to stop the LoadLeveler daemons:

      llctl -g stop
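
      One convenient way to monitor the drain before issuing llctl -g stop is the standard watch utility, which simply re-runs llstatus until the daemons reach the drained state:

      # Re-run llstatus every 30 seconds; proceed once LoadL_schedd and
      # LoadL_startd show the "Drned" state on every machine
      watch -n 30 llstatus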
    6. To apply the update, use one of the following options:

      OPTION 1

      Uninstall the currently installed "LoadL-<fileset>-full" RPM and then install the "LoadL-<fileset>-full" update package by running the following commands:

      rpm -e LoadL-<fileset>-full-<OS-ARCH>-<currently_installed_version>
      rpm -ivh LoadL-<fileset>-full-<OS-ARCH>-<update_version>.<arch>.rpm

      For Blue Gene/Q only:
      Run these additional commands:
      rpm -e LoadL-scheduler-bluegene-<OS-ARCH>-<currently_installed_version>
      rpm -ivh LoadL-scheduler-bluegene-<OS-ARCH>-<update_version>.<arch>.rpm

      OPTION 2

      Use the -U option of the rpm command to apply the update directly:

      rpm -Uvh LoadL-<fileset>-full-<OS-ARCH>-<update_version>.<arch>.rpm

      For Blue Gene/Q only:
      Run this additional command:
      rpm -Uvh LoadL-scheduler-bluegene-<OS-ARCH>-<update_version>.<arch>.rpm

    7. Repeat step 6 for all the machines in the LoadLeveler cluster. On completion of this task, restart the LoadLeveler daemons with the following command:

      llctl -g start
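
      If the machines in your cluster share the update directory (for example over NFS) and allow root ssh, step 6 can be scripted across the cluster; this is a rough sketch, and the host names and package path are placeholders:

      #!/bin/sh
      # Apply the update on every machine in the cluster (OPTION 2 style),
      # then restart the daemons cluster-wide.
      HOSTS="node01 node02 node03"   # replace with your cluster's machines
      PKG=/shared/updates/LoadL-resmgr-full-SLES11-X86_64-5.1.0.18-0.x86_64.rpm
      for h in $HOSTS; do
          ssh root@"$h" rpm -Uvh "$PKG"
      done
      llctl -g start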

    For further information, consult the LoadLeveler Library for the appropriate version of the LoadLeveler Linux Installation Guide.

  • - Setting up control groups on a diskless (or stateless) cluster for preemption, process tracking, workload management (WLM), and checkpoint/restart

    Control groups are used for preemption, process tracking, workload management (WLM), and checkpoint/restart. A control group file system must be configured and mounted in order to enable these functions. The /etc/cgconfig.conf file is used to define control groups, their attributes, and their mount points. On diskless systems the /etc/cgconfig.conf file should be part of the diskless image.

    Complete the following steps to set up control groups on a diskless cluster.

    1. Create the cgconfig.conf file in either an existing directory or a new directory in the /install/postscripts directory on the EMS management server.

      For example:
      mkdir -p /install/postscripts/admin_files
      vi /install/postscripts/admin_files/cgconfig.conf

      Note: Files and subdirectories under /install/postscripts on the EMS are copied to the /xcatpost directory on the compute nodes before the postscripts are run.

    2. If you plan to use checkpoint/restart or LoadLeveler process tracking and preemption, the first entry in the cgconfig.conf file must be:

      mount {
        ns = /cgroup/freezer;
        freezer = /cgroup/freezer;
      }

    3. If you plan to use WLM, include the following in the cgconfig.conf file:

      mount {
        cpu = /cgroup/cpu;
        cpuacct = /cgroup/cpuacct;
        memory = /cgroup/memory;
      }

      group LOADL {
        memory{
          memory.limit_in_bytes = ;
          memory.soft_limit_in_bytes = ;
          memory.memsw.limit_in_bytes = ;
       }
       cpu{}
       cpuacct{}
      }
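
      For reference, a filled-in version of this group stanza might look like the following; the byte values are purely illustrative and must be sized for your own compute nodes:

      group LOADL {
        memory{
          memory.limit_in_bytes = 17179869184;       # example: 16 GB hard limit
          memory.soft_limit_in_bytes = 8589934592;   # example: 8 GB soft limit
          memory.memsw.limit_in_bytes = 17179869184; # example: memory+swap cap
        }
        cpu{}
        cpuacct{}
      }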

    4. In a script that is configured in the xCAT database as a postscript for the compute nodes, add the following:

      cp /xcatpost/admin_files/cgconfig.conf /etc/cgconfig.conf
      /sbin/cgconfigparser -l /etc/cgconfig.conf

      where "admin_files" is the directory name you used in step 1.

      If you plan to use scheduling affinity, add the following lines to this script after invoking /sbin/cgconfigparser:

      mkdir /dev/cpuset
      mount -t cpuset none /dev/cpuset
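
      Putting steps 2 through 4 together, a complete postscript might look like the following sketch; the admin_files directory name matches step 1, and the cpuset section is only needed for scheduling affinity:

      #!/bin/sh
      # xCAT postscript: install and activate the control group configuration
      cp /xcatpost/admin_files/cgconfig.conf /etc/cgconfig.conf
      /sbin/cgconfigparser -l /etc/cgconfig.conf

      # Only if scheduling affinity will be used
      mkdir /dev/cpuset
      mount -t cpuset none /dev/cpuset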

Additional information

  • - Package contents

    Update for SUSE Linux Enterprise Server 11 (SLES 11) on IBM servers with 64-bit AMD Opteron and Intel EM64T processors

    LL_resmgr-5.1.0.18-x86_64-Linux-SLES11 is a corrective fix for LoadLeveler for SLES 11 on 64-bit Opteron and EM64T systems, version 5.1.0.X.

    The updates contained in the LL_resmgr-5.1.0.18-x86_64-Linux-SLES11.tar.gz file available for download from this site provide corrective fixes for LoadLeveler for SLES 11 on 64-bit Opteron and EM64T systems, version 5.1.0. Updates for the full LoadLeveler product are provided.

    Update to Version:

    5.1.0.18

    Update from Version:

    5.1.0.4 through 5.1.0.18

    Update resource manager (tar file) contents:

    RPM files:

    LoadL-resmgr-full-SLES11-X86_64-5.1.0.18-0.x86_64.rpm

    LoadL-resmgr-kbdd-SLES11-X86_64-5.1.0.18-0.x86_64.rpm

    README file:

    README
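
    Before installing, you can check that the downloaded RPMs are intact and carry the expected version; both are standard rpm query options:

    # Show package metadata without installing
    rpm -qpi LoadL-resmgr-full-SLES11-X86_64-5.1.0.18-0.x86_64.rpm
    # Verify digests (and signatures, if configured)
    rpm -K LoadL-resmgr-full-SLES11-X86_64-5.1.0.18-0.x86_64.rpm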

    For further information, consult the LoadLeveler for Linux Version 5.1.0 Installation Guide.

  • - Changelog

    Notes

    Unless specifically noted otherwise, this history of problems fixed for LoadLeveler 5.1.0.x applies to:

    • LoadLeveler 5.1.0.x for Red Hat Enterprise Linux 6 (RHEL6) on servers with 64-bit Opteron or EM64T processors
    • LoadLeveler 5.1.0.x for SUSE LINUX Enterprise Server 11 (SLES11) on servers with 64-bit Opteron or EM64T processors
    • LoadLeveler 5.1.0.x for Red Hat Enterprise Linux 6 (RHEL6) on POWER servers
    • LoadLeveler 5.1.0.x for AIX 7

    Restriction section
    For LL 5.1.0:
    • If the scheduler and resource manager components on the same machine are not at the same level, the daemons will not start up.
    • Please refer to the "Known Limitations" section under the fix pack README for more limitation information for this release.


    Additional Information section
    For LL 5.1.0:
    • Please refer to "Setting up control groups on a diskless (or stateless) cluster for preemption, process tracking, workload management (WLM), and checkpoint/restart" under the "Installation Information" section for more information on how to set up control groups.

    General LoadLeveler problems

    Problems fixed in LoadLeveler 5.1.0.18 [May 8, 2014]

    • Fixed a LoadLeveler startd core dump that occurred while attempting to save a log file when the open() of the .old file fails.
    • A bug in the routing code of class BgSwitch/BgCable was fixed.
    • LoadLeveler was changed to not use cables for passthrough if a midplane has any nodeboards unavailable.
    • LoadLeveler now supports job command files that contain multiple runjob command lines.
    • LoadLeveler has been changed to correct the compatibility problem with release 5.1.0.15 and future service levels.
    • The LoadL_master daemon has been changed to correct a serialization issue, eliminating the case where resources may be unavailable because a manager daemon is not running.
    • LoadLeveler was changed to remove code that transmitted machine data which had become obsolete.
    • The ll_get_data API with type LL_StepBgSizeAllocated now returns the correct Bg Size Allocated value for sub-block jobs from the history file.
    • Resource Manager only:
      • The LoadLeveler startd daemon was changed to remove a synchronization issue which delayed returning a job to the job queue for re-dispatch after the job was rejected.
      • The startd has been changed to clear the starter process ID as soon as the startd detects that the starter process has terminated.
      • The llq -w command will no longer cause the LoadL_startd daemon to terminate with a SEGV.
      • The LoadL_schedd daemons will no longer terminate with a SEGV while processing a command from a remote cluster.
      • The LoadLeveler algorithm for calculating CPU shares was changed to successfully create WLM classes for jobs requesting a consumable CPU requirement of 66 or greater when the WLMSHARES policy is used to enforce CPU usage.
      • LoadLeveler now supports job command files that run multiple sub-block runjob command lines at the same time.
      • LoadLeveler now sets a positive OOM killer value for the slave task, so the OOM killer can successfully kill slave tasks in non-IBM MPI environments.
    • Scheduler only:
      • LoadL_negotiator will no longer assign resources from drained midplanes.
      • A block that is used to run sub-block steps will not be freed while there are running steps on it.

    Problems fixed in LoadLeveler 5.1.0.17 [September 4, 2013]

    • The LoadLeveler code has been modified to record timestamps for all 4 configuration files in shared memory and to compare each timestamp in the SHM buffer with the corresponding file's timestamp to decide whether the shared memory needs to be refreshed.
    • Added energy capping support on Power Linux.
    • Added a check to the LlNetProcess::cmRecovery member function to bypass taking any action for non-daemon processes to avoid the llctl hang.
    • Shortened the length of the suspend_control field in table TLL_CFGCluster to avoid the row length limit.
    • Fixed an issue where the LoadL_startd daemon took too long to discover an alternate region manager when restarting an execute node (one running the LoadL_startd daemon) after a region manager failover takes place.
    • Nodes on which LoadL_schedd failed to get energy consumption are now skipped in the energy consumption calculation.
    • Fixed the issue that energy consumption was incorrect when removing the ibmaem module.
    • Removed the check of decreasing column size when updating DB from PTF16 to PTF17.
    • Fixed the issue that llstatus -l failed to print energy after reconfiguration.
    • Fixed the issue that the LoadL_negotiator crashed because of processing incomplete jobs in a send all jobs transaction.
    • Added README file to explain the restrictions regarding the PTF 14 incompatibility.
    • The LoadLeveler code has been modified not to generate the output file when LoadL_schedd can't access the energy output file directory.
    • Resource Manager only:
      • Added the necessary checking for NULL before attempting to reference the object pointer to avoid core dump of LoadL_Startd.
      • The serialization issue has been corrected to avoid the core dump of Resource Manager in high stress conditions.
      • The LoadLeveler code has been changed to set the LOADL_TOTAL_TASKS environment variable for LL jobs with job type of PARALLEL.
      • Changed the description of the -p option for llrstatus command.
      • Fixed the issue that LoadL_startd crashed because a single thread attempted to acquire the same UID lock twice.
    • Blue Gene:
      • Fixed an issue where, when a BlueGene job completes, LoadLeveler may free the blocks before the runjob client processes exit.
      • Added support for the BlueGene API LiveModel::monitorBlockAllocate to receive the block deallocation event.
      • Added new catalog messages for BlueGene sub-block support.
      • Fixed the issue that LoadL_negotiator crashed at deallocateBlockThread when shutting down LL.
      • Added support for BGQ co-scheduled jobs.
      • Fixed the issue that LoadLeveler allocated blocks with wrong midplanes.
      • Fixed the issue that LoadLeveler tracked dual-use I/O link usage incorrectly.

    Problems fixed in LoadLeveler 5.1.0.16 [July 30, 2013]

    • Update for Blue Gene/Q only
    • Removed a misleading error message caused by unthread_open.
    • The LoadLeveler code has been modified to ignore a pending flush or vacate when completing an interactive step.
    • Added support for GFlops in energy reports.
    • The problem that upgrading the LoadLeveler utility package failed has been fixed.
    • Fixed a problem where calling std::sort caused invalid object pointers, leading to a LoadL_negotiator crash.
    • Fixed a problem where the resource manager never discovered the current serving CM after a CM failover.
    • The LoadLeveler code has been modified to avoid referencing a null pointer that led to an llsummary command core dump.
    • Resource Manager only:
      • The problem that LoadL_schedd failed to start if there are corrupted spool files has been fixed.
      • Added the required synchronization to the LoadL_startd to ensure the Max_Starters value does not get set incorrectly.
      • The problem that no events were found in the RAS log after killing LoadL_schedd has been fixed.
      • The problem that an aggregate adapter with no managed adapters led to a LoadL_startd core dump has been fixed.
    • Blue Gene:
      • A new configuration keyword enforce_bg_min_block_size is added. When the value is true, the I/O ratio will not affect the block size. When the value is false, the behavior is the same as before.
      • The problem that resources in a block which failed to be released were reused has been fixed.
      • The problem that an llbgstatus query caused a LoadL_negotiator core dump has been fixed.
      • The LoadLeveler code has been modified to add newly created blocks into the hash table so that co-schedule jobs can be dispatched.
      • The problem that a reservation for a large block was not honored has been fixed.
      • Added support for Blue Gene sub-block jobs.
      • The problem that LoadLeveler didn't calculate I/O links correctly has been fixed.
      • Removed a misleading message for drained resources.
      • Removed the flooding messages when a reservation becomes active.
    • Scheduler only:
      • Fixed a problem where LoadL_negotiator's management of lists of job steps that require floating resources led to a core dump.

    Problems fixed in LoadLeveler 5.1.0.15 [June 19, 2013]

    • Update for X86 LINUX (on May 20, 2013) and AIX (on June 19, 2013) ONLY
    • LoadLeveler is changed to ensure that whenever a resource manager daemon is started, it is notified of the active central manager.
    • The code path to write RAS records has been changed to avoid deadlock. The Schedd will no longer hang if there are a large number of jobs starting and terminating.
    • The S3 policy enhancement for the cluster level is added.
    • The LoadLeveler code has been modified to rename the free_list function so that the LoadL_negotiator will not terminate abnormally.
    • A new keyword that determines whether Hardware Performance Monitor counters are gathered has been added.
    • The problem that LoadL_schedd reported incorrect power value has been fixed.
    • The problem that llq printed incorrect Coschedule state for step has been fixed.
    • Resource Manager only:
      • LoadLeveler has been changed to ensure that accounting records for terminating events are transmitted and recorded in the LoadLeveler history file for all machines used to run a parallel job.
      • A resource manager crash after reconfiguring the power policy for a machine has been fixed.
      • The mail sent to the LoadLeveler administrators when a switch table error occurs is modified to reference a current document which provides information on debugging switch table problems.
      • Reference counting of LoadLeveler job objects has been corrected in the resource manager.
      • The LoadLeveler LoadL_startd daemon is fixed to remove the synchronization defect between the job termination and job step verification threads. Jobs completing normally will not be vacated for this reason.
    • Scheduler only:
      • The central manager will now recognize and account for all resources on all machines added to a LoadLeveler cluster, whether or not the machine is listed in the LoadLeveler admin file, when machine authentication is disabled.

    Problems fixed in LoadLeveler 5.1.0.14 [March 18, 2013]

    • Update for POWER LINUX ONLY
    • Fixed an intermittent dispatch problem with jobs that are submitted after an llctl drain startd command.
    • Fixed a problem that caused llsummary command core dumps.
    • LoadLeveler is modified to ignore a failure of the mkdir system call if the directory already exists.
    • Resource Manager only:
      • Fixed an issue with slow startd daemon startup.
    • Blue Gene:
      • The class job count will be decremented if a bluegene job fails because of a failed block boot, so that subsequent jobs can be scheduled with the correct class slots value.
      • When a nodeboard is not available, LL will not add the whole midplane to the block for a step.
      • A new keyword value, loadl, for bg_cache_blocks has been added. When bg_cache_blocks = loadl, an initialized static block will not be reused by LoadLeveler if it is not required explicitly, and the static block will be freed after use.
      • When a cable is not available for a job step, LoadLeveler will show both ends of the cable in llq -s command output.
    • Scheduler only:
      • LoadLeveler has been changed to keep a rejected step away from the rejecting machine in the scheduler.
      • Performance improvement to the scheduling of work by the central manager.
      • When the LoadL_negotiator dispatches a job with co-scheduled steps, it will no longer abort if one of the steps fails to dispatch.
      • The central manager will no longer core dump when removing unusable RunClassRec objects.

    Problems fixed in LoadLeveler 5.1.0.14 [March 15, 2013]

    • Update for X86 LINUX ONLY
    • Fixed an intermittent dispatch problem with jobs that are submitted after an llctl drain startd command.
    • Fixed a problem that caused llsummary command core dumps.
    • LoadLeveler is modified to ignore a failure of the mkdir system call if the directory already exists.
    • Resource Manager only:
      • Fixed an issue with slow startd daemon startup.
    • Scheduler only:
      • LoadLeveler has been changed to keep a rejected step away from the rejecting machine in the scheduler.
      • Performance improvement to the scheduling of work by the central manager.
      • When the LoadL_negotiator dispatches a job with co-scheduled steps, it will no longer abort if one of the steps fails to dispatch.
      • The central manager will no longer core dump when removing unusable RunClassRec objects.

    Problems fixed in LoadLeveler 5.1.0.13 [March 11, 2013]

    • Update for LINUX POWER ONLY
    • Corrected an issue causing the LoadL_negotiator daemon to stall for several minutes at a time.
    • Refresh of the man pages.
    • Fixed an issue where the consumablememory setting of a node was being set to 0.
    • Fixed a problem with the negotiator core dumping after a reconfig.
    • The import of environment variables containing semicolons has been corrected.
    • A new LoadLeveler job command file keyword first_node_tasks is added.
    • Blue Gene:
      • The BlueGene block holding the nodeboards which are in software error state will be freed after the job is terminated/completed. The nodeboards can be used for future scheduling.
    • Resource Manager only:
      • The Region Manager will no longer exit when the dgram port is in use.
      • Fixed an issue where LOADL_PROCESSOR_LIST was not being set correctly for serial jobs.
      • The startd daemon will not crash when the value of keyword power_management_policy is reconfigured.
    • Scheduler only:
      • Fixed issue of scheduler getting into especially long dispatching cycles.
      • Performance improvement to the scheduling of work by the central manager.

    Problems fixed in LoadLeveler 5.1.0.14 [March 4, 2013]

    • Update for AIX ONLY
    • Fixed an intermittent dispatch problem with jobs that are submitted after an llctl drain startd command.
    • Fixed a problem that caused llsummary command core dumps.
    • LoadLeveler is modified to ignore a failure of the mkdir system call if the directory already exists.
    • Resource Manager only:
      • Fixed an issue with slow startd daemon startup.
    • Scheduler only:
      • LoadLeveler has been changed to keep a rejected step away from the rejecting machine in the scheduler.
      • Performance improvement to the scheduling of work by the central manager.
      • When the LoadL_negotiator dispatches a job with co-scheduled steps, it will no longer abort if one of the steps fails to dispatch.
      • The central manager will no longer core dump when removing unusable RunClassRec objects.

    Problems fixed in LoadLeveler 5.1.0.13 [December 10, 2012]

    • Update for AIX ONLY
    • Corrected an issue causing the LoadL_negotiator daemon to stall for several minutes at a time.
    • Refresh of the man pages.
    • Performance improvement for termination of interactive jobs.
    • Fixed an issue where the consumablememory setting of a node was being set to 0.
    • Fixed a problem with the negotiator core dumping after a reconfig.
    • The import of environment variables containing semicolons has been corrected.
    • Resource Manager only:
      • The Region Manager will no longer exit when the dgram port is in use.
      • Fixed an issue where LOADL_PROCESSOR_LIST was not being set correctly for serial jobs.
    • Scheduler only:
      • Fixed issue of scheduler getting into especially long dispatching cycles.
      • Performance improvement to the scheduling of work by the central manager.

    Problems fixed in LoadLeveler 5.1.0.12 [October 12, 2012]

    • LoadLeveler now shows correct value of ConsumableCpus when machine group is configured.
    • The LoadLeveler job query commands will now return the correct "Step Cpus" value for the running job that requires ConsumableCpus in the node_resources keyword.
    • The central manager daemon will not core dump when attempting to use the VerifyJobs transaction to contact thousands of LoadLeveler startd daemons.
    • The LoadL_configurator daemon will not crash when the node tries to get the configuration data from the config hosts.
    • A core dump problem when running command llstatus -L machine has been fixed.
    • Resource Manager only:
      • The handling of hierarchical communication errors is restored to the prior release behavior.
      • LoadLeveler Startd daemon will ensure that the cpu map files are created before terminating a checkpointing job.
      • The LOADL_HOSTFILE environment variable will be set in the environment of the job prolog and the user environment prolog.
      • Obsolete code that attempted to terminate leftover job processes was removed.
      • LoadLeveler enables the use of mdcr 5 for checkpoint/restart on AIX. The name of the mdcr library will be changed to libmdcr5.so, and the binary ll_mdcr-checkpoint will be built as a 64-bit binary since libmdcr5 is 64-bit.
    • Blue Gene:
      • Once LoadLeveler detects an error on a BlueGene I/O node or compute node, it will put the nodes into drain state. In addition, if a block fails to boot three times, it will be destroyed.
    • Scheduler only:
      • The scheduler will ignore any floating resource requirement with a 0 value.
      • A deadlock problem in the resource manager daemon has been fixed.

    Problems fixed in LoadLeveler 5.1.0.11 [August 27, 2012]

    • The core dump problem when fetching step adapter usage information has been fixed.
    • Fixed an issue where the command "llstatus -l -L" showed a submit-only node as down.
    • The negotiator daemon now correctly frees memory so that a core dump will not occur.
    • Resource Manager only:
      • The Region Manager has been modified to ignore all adapters on the same subnet as the adapter that was filtered out with adapter_list. Instead of the Region Manager marking those adapters down, those adapters will remain in an HB_UNKNOWN state.
    • Blue Gene:
      • If a Blue Gene job terminates due to a kill timeout, the node used by the job is available for future jobs after the block in use has been freed.
    • Scheduler only:
      • Only the messages from the last iteration of topdog scheduling are printed in the output of the command "llq -s". The intermediate messages are not printed.
      • Accounting records that have a negative wall clock value are now skipped by the llsummary command.

    Problems fixed in LoadLeveler 5.1.0.10 [July 20, 2012]

    • The region manager failover and recovery code is changed to ensure that the resource manager is notified when a region manager becomes active, which makes all active nodes and adapters available for scheduling.
    • Resource Manager only:
      • The resource manager daemon will no longer crash on LL startup if D_FULLDEBUG is set for RESOURCE_MGR_DEBUG in the LoadL_config file.
    • Blue Gene:
      • LoadLeveler was changed to use the new checkIO() call for V1R1M1 BlueGene software.
      • The dependency check for the libbgsched shared object is removed from the LoadLeveler Blue Gene rpm so that the rpm nodeps option is no longer required.
      • The LoadLeveler llqres command will display the information for a Blue Gene reservation that specifies bg_block.
      • A check that was preventing Blue Gene reservations from being modified has been fixed so the change request can be processed.
      • When a nodeboard is down in a midplane, a Blue Gene small block job can run in the midplane if the resources meet the job requirement.
      • The nodeboard list that is returned from the BGQ scheduler API may not always be in order. LoadLeveler will sort this list to ensure it is in order before indexing on it.

    Problems fixed in LoadLeveler 5.1.0.9 [June 19, 2012]

    • Update for LINUX on 64-bit Opteron or EM64T processors ONLY
    • Implemented internal LoadLeveler data contention improvements.
    • Jobs were rejected when the schedd daemon was unable to determine the protocol versions for the nodes allocated to a job step it was trying to dispatch. The correct protocol version is now used so that jobs start correctly.
    • Fixed Negotiator daemon memory leaks.
    • Incorrect error messages were seen for the user prolog/epilog during the llctl ckconfig command; this is fixed by correcting the internal user variable names.
    • Corrected an inefficiency when reading configuration data from the database, and protected against the kinds of performance issues that had prevented LoadLeveler from starting when large systems are configured.
    • Corrected lldbupdate to be able to update from 5.1.0.6 to 5.1.0.9.

    Problems fixed in LoadLeveler 5.1.0.8 [June 15, 2012]

    • Update for LINUX on POWER ONLY
    • Implemented internal LoadLeveler data contention improvements.
    • Jobs were rejected when the schedd daemon was unable to determine the protocol versions for the nodes allocated to a job step it was trying to dispatch. The correct protocol version is now used so that jobs start correctly.
    • Fixed Negotiator daemon memory leaks.
    • Incorrect error messages were seen for the user prolog/epilog during the llctl ckconfig command; this is fixed by correcting the internal user variable names.
    • Corrected an inefficiency when reading configuration data from the database, and protected against the kinds of performance issues that had prevented LoadLeveler from starting when large systems are configured.
    • Corrected lldbupdate to be able to update from 5.1.0.6 to 5.1.0.8.

    Problems fixed in LoadLeveler 5.1.0.7 [June 8, 2012]

    • Do not install LL 5.1.0.7 service update if you are using or planning to use a database for the LoadLeveler configuration.
    • The llstatus command showed startds as up even though the llrstatus command showed that the startd and the region manager it reports to were actually down. The central manager will now be notified by the resource manager when a startd is marked as down, so the llstatus command will show the same state as the llrstatus command.
    • Fixed llconfig core dumping when trying to add a new machine_group or region to a cluster that has more than 128 machines.
    • Fixed llconfig to correctly set the island in the machine_group.
    • Blue Gene:
      • LoadLeveler will correctly calculate the I/O ratio per midplane based on hardware state to support a mixed I/O environment on Blue Gene/Q.

    Problems fixed in LoadLeveler 5.1.0.6 [April 27, 2012]

    • Mandatory service pack for Red Hat Enterprise Linux 6 (RHEL6) on POWER servers.
    • The CAU value is now allocated correctly on all the nodes on which the job is run.
    • Resource Manager only:
      • Fixed a deadlock in the region manager daemon when determining heartbeat status; llstatus will now show the correct status after a reconfig.
      • Fixed a startd daemon core dump when preempting a running job via the suspend method.
      • Fixed the checking of process tracking during job termination so jobs can terminate correctly in an environment that does not have process tracking set.
    • Blue Gene:
      • Enhanced the support for Blue Gene block booting failures by draining problem hardware from the LoadLeveler cluster.
      • Fixed problems with LoadLeveler scheduling blocks using pass through.
      • Updated llq -h command output to reflect changes in Blue Gene terminology (partitions are now referred to as blocks).
      • Corrected the display of connectivity for large blocks in llsummary output.
      • Fixed a problem calculating the minimum block size for LoadLeveler jobs when midplanes contain I/O link errors.

    Problems fixed in LoadLeveler 5.1.0.5 [April 4, 2012]

    • Fixed some memory leaks in Startd and Schedd daemons.
    • If there is no network statement in the job command file, then the default network is used, which assumes ethernet. If the cluster does not have ethernet configured, then the job will stay in the "ST" state and not run. The default network support will now use the adapter associated with the hostname with which the machine is configured in the administration file.
    • Fixed LoadL_master core dumping during llctl stop in a database environment due to timing locks.
    • Fixed LoadL_negotiator core dumping by not sending corrupted job step data to the central manager.
    • Fixed the lldbupdate command getting the 2544-019 error message by parsing the database information correctly, so LoadLeveler is able to start up.
    • Resource Manager only:
      • A problem in pe_rm_connect() that caused read() to be called on a socket that was not ready to be read has been corrected, allowing pe_rm_connect() to continue to retry the connection for the specified rm_timeout amount of time.
    • Scheduler only:
      • The list of reserved resources was not being updated properly when a reservation requesting a 0 count ended, leading to a core dump. That reservation list is now updated correctly in all cases.

    Problems fixed in LoadLeveler 5.1.0.4 [March 16, 2012]

    • Mandatory service pack for Red Hat Enterprise Linux 6 (RHEL6) and SUSE LINUX Enterprise Server 11 (SLES11) on servers with 64-bit Opteron or EM64T processors.
    • LoadLeveler can now display the host name correctly based on the name_server configuration. The previous limitation of the name_server keyword being ignored is now lifted.
    • On SLES11, a problem where lldbupdate failed to connect to the database due to an incorrect odbc.ini location has been corrected.
    • Fixed Linux schedd daemon core dump in a mixed AIX and Linux cluster when submitting a job from the AIX cluster.
    • Fixed potential central manager deadlock.

    LoadLeveler Corrective Fix listing

    Fix Level      AIX APAR numbers
    LL 5.1.0.18    resource manager: IV54299 IV54330 IV00018 IV54332 IV54333 IV54334 IV53405 IV55247 IV55997 IV55306
                   scheduler: IV54300 IV44552 IV00018 IV48119 IV50037 IV54335 IV54336 IV55248 IV55307 IV55998
    LL 5.1.0.14    resource manager: IV33552 IV29510 IV34246 IV34249 IV34251 IV32497
                   scheduler: IV23900 IV34256 IV34254 IV34248 IV34247 IV34257 IV34258 IV32624 IV34259 IV34250 IV34252 IV34255 IV34818 IV33554
    LL 5.1.0.13    resource manager: IV29990 IV31169 IV31171 IV31174 IV31177 IV31178 IV31267
                   scheduler: IV30011 IV31170 IV31172 IV31175 IV31176 IV31179 IV31400
    LL 5.1.0.8     resource manager: IV25429 IV18362 IV27838 IV27839 IV23242 IV23444 IV27848 IV25772
                   scheduler: IV25430 IV27835 IV20675 IV21261 IV27840 IV25447
    LL 5.1.0.7     resource manager: IV23818 IV22084
                   scheduler: IV23819 IV23820 IV23821
    LL 5.1.0.6     resource manager: IV16600 IV19851 IV19911 IV20248
                   scheduler: IV18061 IV19617 IV19910
    LL 5.1.0.5     resource manager: IV17276
                   scheduler: IV18682
    LL 5.1.0.4*    resource manager: IV13778 IV14094 IV14182 IV14380 IV14458 IV15105 IV16304 IV16306
                   scheduler: IV14096 IV14408 IV15545 IV16135
    LL 5.1.0.3     resource manager: IV11682 IV11747 IV11750 IV11938 IV12585
                   scheduler: IV11748 IV12199 IV12586 IV12587 IV12588
    LL 5.1.0.2     resource manager: IV07387 IV07389 IV08361 IV08531
                   scheduler: IV07388 IV07390 IV08362 IV08364 IV08532
    LL 5.1.0.1     resource manager: IV03487 IV03490 IV03498 IV05131 IZ03487
                   scheduler: IV03488 IV03489 IV03496 IV03497 IV03499 IV03500 IV05132 IZ03488

[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SGS8DD","label":"Tivoli Workload Scheduler LoadLeveler for AIX"},"Component":"","ARM Category":[],"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB57","label":"Power"}},{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SGS8DD","label":"Tivoli Workload Scheduler LoadLeveler for AIX"},"Component":"","ARM Category":[],"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB57","label":"Power"}}]

Document Information

Modified date:
17 March 2022

UID

isg400001787