Readme and Release notes for release 4.1.1.13 LoadLeveler 4.1.1.13 LL_scheduler-4.1.1.13-power-AIX Readme

Readme file for: LL_scheduler-4.1.1.13-power-AIX
Product/Component Release: 4.1.1.13
Update Name: LL_scheduler-4.1.1.13-power-AIX
Fix ID: LL_scheduler-4.1.1.13-power-AIX
Publication Date: 1 November 2012
Last modified date: 1 November 2012

Download location

Below is a list of components, platforms, and file names that apply to this Readme file.

Fix Download for AIX

Product/Component Name | Platform | Fix
LoadLeveler | AIX 5.3, AIX 6.1 | LL_scheduler-4.1.1.13-power-AIX

Prerequisites and co-requisites

None

Known issues

Known limitations

  • - Known Limitations

    For LL 4.1.1:

    • On any machine where you plan to install the scheduler rpm, you must first install the resource manager rpm.
    • If the scheduler and resource manager filesets are installed on the same machine, both must be at the same LoadLeveler level for transactions to be processed correctly.
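
    The same-level rule above can be checked before starting the daemons. The following is a minimal sketch, not part of LoadLeveler: the function name same_level is hypothetical, and obtaining the installed level strings from the system is left out and assumed to be done elsewhere.

```python
def same_level(scheduler_level: str, resource_manager_level: str) -> bool:
    """Return True when both fileset levels are identical, e.g. '4.1.1.13'.

    Hypothetical helper: the level strings are assumed to be dotted
    numeric versions taken from the installed-fileset listing.
    """
    def parse(level: str) -> tuple:
        # Compare numerically so that '4.1.1.9' and '4.1.1.09' are equal.
        return tuple(int(part) for part in level.strip().split("."))

    return parse(scheduler_level) == parse(resource_manager_level)


if __name__ == "__main__":
    print(same_level("4.1.1.13", "4.1.1.13"))
    print(same_level("4.1.1.13", "4.1.1.12"))
```

    If the check returns False, the safer course is to bring both components to the same level before starting LoadLeveler.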

Installation information

  • - Installation procedure

    Install the LoadLeveler updates on your system by using the standard smit update_all command.

    For further information, consult the LoadLeveler Library for the appropriate version of the LoadLeveler AIX Installation Guide.

Additional information

  • - Package contents

    LoadL.scheduler.full.bff | 4.1.1.13
    LoadL.scheduler.so.bff | 4.1.1.13
    LoadL.scheduler.msg.en_US.bff | 4.1.1.4
    LoadL.scheduler.webui.bff | 4.1.1.1

  • - Changelog

    Notes

    Unless specifically noted otherwise, this history of problems fixed for LoadLeveler 4.1.1.x applies to:

    • LoadLeveler 4.1.1.x for AIX 6
    • LoadLeveler 4.1.1.x for AIX 5
    • LoadLeveler 4.1.1.x for SUSE LINUX Enterprise Server 11 (SLES11) on POWER servers
    • LoadLeveler 4.1.1.x for SUSE LINUX Enterprise Server 11 (SLES11) on servers with 64-bit Opteron or EM64T processors
    • LoadLeveler 4.1.1.x for Red Hat Enterprise Linux 6 (RHEL6) on POWER servers
    • LoadLeveler 4.1.1.x for Red Hat Enterprise Linux 6 (RHEL6) on servers with 64-bit Opteron or EM64T processors
    • LoadLeveler 4.1.1.x for Red Hat Enterprise Linux 5 (RHEL5) on servers with 64-bit Opteron or EM64T processors
    • LoadLeveler 4.1.1.x for Red Hat Enterprise Linux 5 (RHEL5) on Intel based servers

    Restriction section

    For LL 4.1.1:

    • If the scheduler and resource manager components on the same machine are not at the same level, the daemons will not start up.

    General LoadLeveler problems

    Problems fixed in LoadLeveler 4.1.1.13 [November 1, 2012]

    • There is no error message displayed in Startd log when NRT is not installed on the cluster.
    • The llq command can now show error messages when a user ID issue is encountered.
    • The problem of the central manager daemon stalling and llq commands hanging has been fixed.
    • LoadLeveler will no longer core dump when the user issues the command "llstatus -L machine".
    • Resource Manager only:
      • Obsolete code which attempted to terminate left over job processes is removed.
      • LoadLeveler startd daemon will no longer abort when trying to reject a job when a network table load fails.
    • Scheduler only:
      • The LoadL_negotiator daemon no longer terminates with a SIGSEGV when trying to schedule a job step while the number of machines in the cluster has changed.
      • The top dog start time search has been optimized and additional changes reduce the number of waiting threads that are started during the dispatching loop.

    Problems fixed in LoadLeveler 4.1.1.12 [August 29, 2012]

    • A core dump problem when querying the step adapter usage information has been fixed.
    • A deadlock that caused jobs to become stuck in RP state has been eliminated; the fix also permits llctl stop to terminate the LoadL_startd process.
    • LL has sufficiently reduced its calls to the user registry.
    • The issues of completed jobs not releasing their resources for a long time, and of no jobs appearing to start at all, have been fixed.
    • Scheduler only:
      • The llmovespool command now works correctly for a multistep job in which some steps have completed and others are still running.
      • LoadLeveler will no longer display the misleading message about the image_size check in the "llq -s" output and in the Negotiator log when a reason that a machine could not be used has already been found.
      • The accounting record which has a negative wall clock value is now skipped by the llsummary command.
      • The problem of the central manager crashing with signal SIGABRT when removing a job step has been fixed.
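
    The llsummary change above (skipping accounting records with a negative wall clock value) can be illustrated with a short sketch. This is not the llsummary implementation: the record layout (a dictionary with a "wall_clock" field in seconds) and the function names are assumptions made purely for illustration.

```python
def usable_records(records):
    """Yield accounting records whose wall clock value is not negative.

    Hypothetical record layout: each record is a dict with a numeric
    'wall_clock' field. The real history file format is not shown here.
    """
    for record in records:
        if record["wall_clock"] < 0:
            # A negative wall clock value cannot be valid; skip the record
            # instead of letting it distort the summary.
            continue
        yield record


def total_wall_clock(records):
    """Sum the wall clock values of all usable records."""
    return sum(record["wall_clock"] for record in usable_records(records))


if __name__ == "__main__":
    history = [
        {"step": "a", "wall_clock": 120},
        {"step": "b", "wall_clock": -5},   # invalid, skipped
        {"step": "c", "wall_clock": 30},
    ]
    print(total_wall_clock(history))
```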

    Problems fixed in LoadLeveler 4.1.1.11 [June 7, 2012]

    • Implemented internal LoadLeveler data contention improvements.
    • Resource Manager only:
      • Under some rare conditions, the LoadL_schedd daemon can core dump when a job is rejected multiple times. The core dump was the result of an array index not being reset properly upon a 2nd dispatch of the same job step. This problem has been corrected by setting that array index back to -1 when a job step is redispatched.
      • The problem of llstatus showing an incorrect adapter state because of a deadlock in the region manager is now resolved.
    • Scheduler only:
      • If a step required CPU affinity by blocking, LoadLeveler would assign more CPUs than requested. LoadLeveler will now assign the correct number of CPUs for blocking steps requesting CPU affinity.

    Problems fixed in LoadLeveler 4.1.1.10 [April 5, 2012]

    • Parsing of the alternate regional managers is now corrected so that error message 2512-636 is no longer issued, and defining alternate regional managers will not prevent LoadLeveler from starting up. Also, setting the name_server keyword to "LOCAL" will now send only one stop to the machine, using the machine's short host name.
    • Fixed the startd daemon core dumping in cases where the job was rejected due to adapter window failures.
    • Resource Manager only:
      • LoadLeveler was modified to set the correct user ID so that checkpoint files are not deleted and the starter reads the correct checkpoint file, allowing the job to restart when the llctl flush and resume commands are executed twice.
      • Locking is added to the LoadLeveler schedd daemon to serialize threads receiving multi-cluster jobs with threads processing llq -x requests, which prevents the schedd from core dumping.
      • The LoadLeveler schedd daemon will now write the host SMT status to the accounting history file before the job is terminated, so that the SMT status of every host is shown in the llsummary -l output.
      • A problem in pe_rm_connect() that caused read() to be called on a socket that was not ready to be read has been corrected, allowing pe_rm_connect() to continue to retry the connection for the specified rm_timeout amount of time.
    • Scheduler only:

    Problems fixed in LoadLeveler 4.1.1.9 [February 9, 2012]

    • LoadLeveler can now display the host name correctly based on the name_server configuration. The previous limitation of the name_server keyword being ignored is now lifted.
    • LoadLeveler has been changed to prevent unnecessary logging of multi-cluster messages to the SchedLog.
    • The llconfig -c command would core dump if the cluster had more than 128 machines; this has been fixed.
    • If adapter window failures occur during unloading or cleaning, LoadLeveler will now mark these adapters as unusable and will give the central manager only windows that can be used.
    • Resource Manager only:
      • The timing issue between the cancel and job start transaction is now corrected so that the job will be canceled after the llcancel command is issued and the startd daemon will no longer hang after executing the llctl stop command.
      • The LoadLeveler method for reporting job step status has been corrected to report R state, even for parallel jobs which do not invoke an MPI runtime manager (e.g. poe).
    • Scheduler only:
      • There could be a problem in determining the count of idle job steps towards the maxidle limit for a user within a class when the class of a job step was modified with the llmodify command. The step count limit is now calculated correctly for the user for each of the user's classes.

    Problems fixed in LoadLeveler 4.1.1.8 [December 15, 2011]

    • LoadLeveler now prevents a potential core dump caused by a race condition when querying a terminating job.
    • LoadLeveler will ignore the machine_list keyword if the syntax is not defined correctly.
    • Changes have been made to the processing of the machine_list keyword so that hyphens can now be used as part of the machine name and that multiple number ranges can be specified in each of the machine name expressions.
      e.g. machine_list = c250f01c[02-08]n[01-08]-ib0,c250f[02-04]c[01-08]n[01-08]-ib0
    • Changes have been made in the way job keys are handled in LoadLeveler so that it is no longer possible for more than one job having the same job key to be active in the cluster at the same time.
    • Resource Manager only:
      • The LOADL_HOSTFILE environment variable is now set in the environment for the user prolog when the job type is set to mpich.
      • A startd daemon abort is now prevented by correcting the locking used when processing files in the execute directory during startup.
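
    The machine_list syntax described above (hyphens in machine names and multiple number ranges per expression) can be illustrated with a short expansion sketch. This is not LoadLeveler's parser: the function names are hypothetical, and the zero-padding behavior (padding each number to the width of the range's lower bound) is an assumption inferred from the example expression.

```python
import itertools
import re

# One numeric range such as [02-08]; both bounds are captured.
RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def expand_expression(expr):
    """Expand one machine name expression, e.g. 'c[02-03]n[01-02]-ib0'."""
    # Splitting on a pattern with two groups yields:
    # literal, low, high, literal, low, high, ..., literal
    parts = RANGE.split(expr)
    literals = parts[0::3]
    bounds = zip(parts[1::3], parts[2::3])

    choices = []
    for low, high in bounds:
        width = len(low)  # assumed: keep the zero padding of the lower bound
        choices.append(
            [str(n).zfill(width) for n in range(int(low), int(high) + 1)]
        )

    names = []
    for combo in itertools.product(*choices):
        name = literals[0]
        for value, literal in zip(combo, literals[1:]):
            name += value + literal
        names.append(name)
    return names


def expand_machine_list(value):
    """Expand a comma-separated list of machine name expressions."""
    names = []
    for expr in value.split(","):
        names.extend(expand_expression(expr.strip()))
    return names


if __name__ == "__main__":
    hosts = expand_machine_list(
        "c250f01c[02-08]n[01-08]-ib0,c250f[02-04]c[01-08]n[01-08]-ib0"
    )
    print(len(hosts), hosts[0], hosts[-1])
```

    Under these assumptions, the hyphen in the "-ib0" suffix is treated as ordinary text because only hyphens inside square brackets are read as range separators.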

    Problems fixed in LoadLeveler 4.1.1.7 [October 27, 2011]

    • The llsubmit command will fail if the smt and rset keywords are used together.
    • The processing of the preempt_class configuration keywords has been fixed so that changes will take effect after the llctl reconfig command is issued.
    • The Negotiator has been changed so that it no longer depends on processor core 0 having CPUs configured. The Negotiator will no longer core dump if it encounters such a configuration.
    • The memory error in the LoadLeveler String library is corrected to prevent crashes if the function is used.
    • The LoadLeveler commands will not generate the 2512-030 error message when there is no /etc/LoadL.cfg file on the system.
    • Resource Manager only:
      • The Startd has been fixed to ensure that the correct effective user ID is used when cleaning up job status and job usage files in the execute directory during job termination.
    • Scheduler only:
      • LoadLeveler will now select and hold CPUs that are already in use for top dog usage; therefore, other jobs can now run with CPUs that are currently available.

    Problems fixed in LoadLeveler 4.1.1.6 [September 1, 2011]

    • LoadLeveler can now handle jobs from users who belong to more than 64 system groups.
    • LoadLeveler is now able to support ETHoIB using the bond0 interface mapped to an IB User Space device on Linux systems if APAR IV06393 for the fileset rsct.lapi.rte is also applied.

    Problems fixed in LoadLeveler 4.1.1.5 [July 28, 2011]

    • Multiple configuration editor and form-based GUI issues are resolved.
    • LoadLeveler will not submit the job if there is no class in the default class list that can satisfy the job requirements.
    • LoadLeveler now creates cpuset files with permissions that are searchable by non-root users under the /dev/cpuset directory.
    • The unthread_open() error in the Schedd Log will no longer be printed when querying the remote cluster job since LoadLeveler will no longer try to route a nonexistent remote submit job command file in a multi cluster environment.
    • LoadLeveler has been enhanced so it now displays the job eligibility time.
    • Intel MPI and Open MPI are now supported under LoadLeveler.
    • Resource Manager only:
      • The llctl command is now able to support "start drained" option on the remote node.
    • Scheduler only:
      • The LoadL_negotiator daemon will not core dump when processing a multi-step job which contains a dependency statement longer than 2048 characters.
      • The "llq -s" command will provide information about why a step is in Deferred state.
      • The llsummary command and API will no longer core dump if the number of history files is greater than or equal to the PTHREAD_DATAKEYS_MAX constant value.

    Problems fixed in LoadLeveler 4.1.1.4 [May 27, 2011]

    • The llctl command will now check to make sure the Schedd daemon's port is available to be used before starting up LoadLeveler.
    • A new keyword, restart, is implemented for the class stanza in the admin configuration.
    • If a value is not set for the keyword max_starters in database configuration mode, the default value used for max_starters will be adjusted when the count of classes specified in the keyword class is changed.
    • Absolute paths containing http/https are changed to relative paths for the configuration editor to run.
    • Resource Manager only:
      • LoadLeveler will now set the right environment variables when executing the user epilog script.
      • The llmkres command should now be able to create the reservations consistently without hitting the timing error message 2512-856.
      • In a multicluster environment, the llq -s command will now invoke the correct query command on the remote cluster.
    • Scheduler only:
      • Modifying the recurring reservation's attribute will now be seen in the first occurrence's attribute value under the llqres -l command.
      • A unique security issue has been identified for TWS LoadLeveler Web User Interface that could potentially compromise your system. It is recommended that you apply this update to protect your system.

    Problems fixed in LoadLeveler 4.1.1.3 [March 25, 2011]

    • The ability to set the name_server in LoadLeveler is now disabled. The setting under LoadLeveler will now always be set to DNS.
    • When class limits were configured using the config editor, adding or updating a limit failed if there was more than one class limit. Now the config editor can be used to update class limits or add new hard and soft limits.
    • If the class-user sub-stanzas in the "default" class stanza are not defined in alphabetical order, the class-user sub-stanzas might incorrectly inherit the wrong values from the default class. LoadLeveler will now inherit the default values for the class-user sub-stanzas from the "default" class stanza correctly.
    • On Linux/P nodes, a job requesting memory affinity with MCM_MEM_NONE would always consume memory from the local MCM and would start paging once memory on the local MCM was exhausted, even though memory was available on other MCMs on the node. Now, if a job is submitted with the memory affinity option MCM_MEM_NONE, the task is bound to all the MCMs on the node and memory is consumed from all the MCMs on the node.
    • An incorrect spelling prevented the class stanza keyword striping_with_minimum_networks from being set when DB configuration was used. The spelling of the column name in the database is now corrected.
    • LoadLeveler schedd may ignore jobs if the job queue contains invalid job keys. Now, LoadLeveler schedd will collect the correct job data when scanning the job queue files.
    • The llrstatus -a reports "No adapters are available" after issuing the llrctl reconfig command. When a machine running a Resource Manager or Region Manager daemon is reconfigured, information about adapters on other machines was being wiped out. The configuration processing code in Resource Manager and Region Manager has been fixed so that existing adapter information will remain intact.
    • When configuring the resources=keyword(all) in the machine group stanza in database mode, the llstatus -R command will show no resources being set. Resources will now become effective when setting the resources=keyword(all) in the machine group stanza in database mode.
    • The schedd can core dump when a scale-across multi-cluster environment is configured incorrectly. This can happen if scale-across multi-cluster is configured and the same cluster stanza is specified as local for more than one cluster. LoadLeveler is changed to protect the schedd from core dumping when the same cluster stanza is configured as local for more than one cluster in a scale-across multi-cluster environment.
    • Resource Manager only:
      • The LoadL_startd daemon leaks memory due to a failure to release memory allocated for data structures to hold switch table data for a job step. The LoadL_startd daemon is corrected to release all memory allocated for data structures to hold switch table data for a job step, when the job step data structure is de-allocated.
      • A crash may occur in either the resource manager daemon or the negotiator daemon if those daemons received incorrect routing data during an update from startd. This could have happened when the feature keyword was used in the machine_group stanza under the administration file or database setup. The correct bits are now set by the startd daemon so that routing of the data will not cause the resource manager or the central manager to core dump.
    • Scheduler only:
      • LoadLeveler will occasionally show the wrong number of class resource slots or even miss some classes from the llclass output if too many class query requests come in simultaneously. LoadLeveler is now fixed to show the correct class resources in the llclass output.
      • When maxidle is used for a given user within a class, dependent steps can be queued at a higher priority than non-dependent steps. Dependent steps are not given a new qdate when they are put onto the idle queue, while steps at the maxidle limit for a given user within a class are given a new qdate and a new sysprio based on that qdate. A change was made so that dependent steps are also counted as "queued" steps for the purposes of enforcing maxqueued and maxidle limits, and so a dependent step which is at the maxidle limit will get a new qdate.

    Problems fixed in LoadLeveler 4.1.1.2 [January 28, 2011]

    • The LoadLeveler startd drain status was lost if the negotiator daemon restarted. The drain status is now stored on each individual startd. When the negotiator daemon restarts, the drain information is restored from all the startds.
    • The llsummary command might crash if the default class requirement value doesn't match the job requirement value. Fixed the llsummary command to select the correct requirement value from the default class list if there is no job class specified in job command file.
    • The llsummary command will fail when it tries to access invalid data memory in the job history file. Fixed the llsummary command to be able to ignore the bad data areas and just report the valid data in the job history file.
    • LoadLeveler schedd may ignore jobs if the job queue contains invalid job keys. The schedd will now collect the correct job data when scanning the job queue files.
    • The LoadLeveler command, llmodify, had a limitation where the startdate and wall_clock_limit job attributes could not be modified for idle jobs. llmodify is now enhanced to be able to modify the startdate and wall_clock_limit job attributes for idle jobs.
        New documentation:
      • In the LoadLeveler Command and API Reference, SC23-6701-00, under Chapter 1. Commands, llmodify - Change attributes of a submitted job step,
        • New keyword wall_clock_limit for the -k option: Changes the wall clock limit of a job step. The value of the specified wall clock limit must be longer than the value of the current wall clock limit. This is a LoadLeveler administrator only option.
        • New keyword startdate for the -k option: Changes the start time of an idle-like job step. This is a LoadLeveler administrator only option.
    • Resource Manager only:
      • User jobs were not launched on AIX if the group name did not match the one from the job submission. The job launch program no longer needs to verify the group name, so jobs are executed using the submitting GID number.
      • A crash may occur in either the LoadL_resource_mgr daemon or the LoadL_negotiator daemon when the feature keyword is used in the machine_group stanza under the administration file or database setup. A fix has been made in supporting the specifying of the feature keyword in the machine_group stanza in the administration file or in the database.
    • Scheduler only:
      • LoadLeveler machines and jobs may have the wrong state if some startd daemons are down and the region manager is enabled. Fixed LoadLeveler to handle machine and job status correctly when the region manager detects a machine to be down.
      • LoadLeveler was trying to load the network table for jobs with job_type=MPICH, and the job would fail to run if the network table could not be loaded. Because jobs with job_type=MPICH do not require the network table, LoadLeveler will no longer load the network table when this job type is specified in the job command file.

    Problems fixed in LoadLeveler 4.1.1.1 [December 10, 2010]

    • Fixed the schedd daemon so it will not crash if the job's output file path contains the "%" character.
    • Fixed the -s and -e options in the llsummary command to report all the jobs that match the filter requirement. In the TWS LoadLeveler documentation, Command and API Reference and the llsummary.l manual page, the -s and -e options will state the accounting data report will contain information about every job that contains at least one step that falls within the specified range.
    • Resource Manager only:
      • Fixed the processor affinity environment to be set up correctly for jobs when the job prolog is configured in LoadLeveler.
    • Scheduler only:
      • Fixed the llclass command to show the correct value for the "Free Slots" field when LoadLeveler is configured to use the LL_DEFAULT scheduler.
      • Fixed the llchres command to check requested node additions to make sure that those nodes have no jobs running on them and are not already assigned to another reservation. If no idle nodes can be found, the llchres command will fail.
      • Fixed LoadLeveler to correctly reserve the reservation's resources after the central manager daemon restarts so that jobs with overlapping resources with the reservations will not be allowed to start.
      • Fixed the central manager to make sure pending status changes to the machines are properly locked so that jobs being scheduled to the down machines will no longer crash the central manager daemon.

    TWS LoadLeveler Corrective Fix listing
    Fix Level APAR numbers
    LL 4.1.1.13 AIX : resource manager: IV30380 IV30377 IV30381 IV27947 IV28196
      Linux : resource manager:
    LL 4.1.1.13 AIX : scheduler: IV30379 IV30378 IV24379 IV27942 IV30423
      Linux : scheduler:
    LL 4.1.1.12 AIX : resource manager: IV25837 IV21264
      Linux : resource manager:
    LL 4.1.1.12 AIX : scheduler: IV25839 IV25841 IV25840 IV25448
      Linux : scheduler:
    LL 4.1.1.11 AIX : resource manager: IV21359 IV21366
      Linux : resource manager:
    LL 4.1.1.11 AIX : scheduler: IV21368 IV21370
      Linux : scheduler:
    LL 4.1.1.10 AIX : resource manager: IV16305 IV17277 IV18007 IV18020 IV18021
      Linux : resource manager:
    LL 4.1.1.10 AIX : scheduler: IV18009
      Linux : scheduler:
    LL 4.1.1.9 AIX : resource manager: IV09137 IV13850 IV13916 IV13921
      Linux : resource manager:
    LL 4.1.1.9 AIX : scheduler: IV13843 IV13917 IV13918 IZ99309
      Linux : scheduler:
    LL 4.1.1.8 AIX : resource manager: IV10207 IV11370 IV11559
      Linux : resource manager:
    LL 4.1.1.8 AIX : scheduler: IV11371 IV11561
      Linux : scheduler:
    LL 4.1.1.7 AIX : resource manager: IV07759 IV08359 IV09015
      Linux : resource manager:
    LL 4.1.1.7 AIX : scheduler: IV08360 IV08533 IV09016
      Linux : scheduler:
    LL 4.1.1.6 AIX : resource manager: IV06462 IV06510 IV06512
      Linux : resource manager:
    LL 4.1.1.6 AIX : scheduler: IV06463 IV06511 IV06513
      Linux : scheduler:
    LL 4.1.1.5 AIX : resource manager: IV00833 IV01116 IV01321 IV01332 IV02937 IV03232 IV03277 IV03299 IV03304
      Linux : resource manager:
    LL 4.1.1.5 AIX : scheduler: IV00813 IV00834 IV01325 IV01333 IV01390 IV02945 IV03233 IV03278 IV03300 IV03303 IV03305 IV03309
      Linux : scheduler: IZ99118 IV01036
    LL 4.1.1.4 AIX : resource manager: IV00027 IV00031 IV00037 IV00277 IV00304 IV00462 IZ93228 IZ99666
      Linux : resource manager:
    LL 4.1.1.4 AIX : scheduler: IV00028 IV00029 IV00032 IV00280 IV00299 IV00463
      Linux : scheduler:
    LL 4.1.1.3 AIX : resource manager: IZ93225 IZ93259 IZ93267 IZ94345 IZ94801 IZ96421 IZ96428 IZ96430
      Linux : resource manager:
    LL 4.1.1.3 AIX : scheduler: IZ89344 IZ93154 IZ93226 IZ94800 IZ96422 IZ96423 IZ96425 IZ96431 IZ96433
      Linux : scheduler:
    LL 4.1.1.2 AIX : resource manager: IZ89829 IZ90705 IZ90707 IZ91597 IZ91599 IZ92052 IZ92374
      Linux : resource manager: IZ91715
    LL 4.1.1.2 AIX : scheduler: IZ90487 IZ90706 IZ90708 IZ90875 IZ91596 IZ91598 IZ91600 IZ92375
      Linux : scheduler:
    LL 4.1.1.1 AIX : resource manager: IZ88502 IZ88504 IZ89257 IZ89260
      Linux : resource manager:
    LL 4.1.1.1 AIX : scheduler: IZ88503 IZ88506 IZ88509 IZ88511 IZ88513 IZ88681 IZ89259
      Linux : scheduler:

Document information

More support for: LoadLeveler
Reference #: 00001346
Modified date: 2012-11-01