This page is meant to serve as a general System Administrators Guide (SAG) for the HPC's Casper system.
root@bet201 (/)# llctl -g drain
root@bet201 (/)# llctl -g stop
root@bet201 (/)# llctl -g start
# mkdir -p /mnt/fix/
# touch /mnt/fix/nfs.not.mounted
# mount -o ro honggao:/mnt/cdrom/ /mnt/fix/
# smit
Do the following to install LoadLeveler 3.1 using SMIT:
Select Software Installation and Maintenance
Select Install and Update Software
Select Install Software
Select The device or directory containing the install images(/mnt/fix/)
Select The appropriate information to specify options
Press Enter (Do)
Requisite Failures
------------------
SELECTED FILESETS: The following is a list of filesets that you asked to
install. They cannot be installed until all of their requisite filesets
are also installed. See subsequent lists for details of requisites.
LoadL.full 3.1.0.0 # LoadLeveler
LoadL.tguides 3.1.0.0 # LoadLeveler TaskGuides
MISSING REQUISITES: The following filesets are required by one or more
of the selected filesets listed above. They are not currently installed
and could not be found on the installation media.
Java130.rte.bin 1.3.0.0 # Base Level Fileset
bos.cpr 5.1.0.10 # Fileset Update
After Java130.rte.bin and bos.cpr were installed, LoadLeveler 3.1 installed successfully through SMIT.
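The fileset names under "MISSING REQUISITES" in a report like the one above can be pulled out with a small filter, which is handy when the list is long. This is a minimal sketch, assuming the report layout shown above; the function name missing_requisites is hypothetical.

```shell
#!/bin/sh
# Print the fileset names listed under "MISSING REQUISITES" in an
# installp/SMIT failure report read from standard input.
# Assumption: fileset lines have the form "<name> <level> # <description>".
missing_requisites() {
    awk '
        /^[ \t]*MISSING REQUISITES:/ { in_section = 1; next }
        # Any other all-caps "SECTION NAME:" header ends the section.
        /^[ \t]*[A-Z][A-Z ]+:/       { in_section = 0 }
        in_section && $3 == "#" && $2 ~ /^[0-9]+(\.[0-9]+)+$/ { print $1 }
    '
}

# Example (not run here): missing_requisites < /tmp/smit_failure.txt
```

The file name in the example is a placeholder for wherever the report text was saved.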
# umount /mnt/fix
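The nfs.not.mounted file created before the mount acts as a sentinel: it lives in the underlying directory, so it is visible only while nothing is mounted on /mnt/fix. A script can use that to refuse to proceed when the mount is missing. The sketch below assumes the mounted file system does not itself contain a file of that name; the function name is hypothetical.

```shell
#!/bin/sh
# Return 0 if something appears to be mounted over DIR, judged by the
# sentinel file. The sentinel sits in the underlying directory, so it is
# hidden while a file system is mounted on top of it.
is_mounted_over() {
    dir=$1
    sentinel=${2:-nfs.not.mounted}
    [ -d "$dir" ] && [ ! -e "$dir/$sentinel" ]
}

# Example guard before pointing SMIT at the directory (not run here):
# is_mounted_over /mnt/fix || { echo "/mnt/fix is not mounted" >&2; exit 1; }
```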
root@veg001 (/usr/sys/inst.images/installp/ppc)# ftp service.boulder.ibm.com
Connected to service.boulder.ibm.com.
220 service.boulder.ibm.com FTP server (Version wu-2.6.2(1) Thu Aug 7 15:45:13 MDT 2003) ready.
Name (service.boulder.ibm.com:root): anonymous
331 Guest login ok, send your complete e-mail address as password.
Password:
230 Guest login ok, access restrictions apply.
ftp> mget aix/fixes/pkgs/wvgww/
mget aix/fixes/pkgs/wvgww/LoadL.full.3.1.0.17.bff? y
200 PORT command successful.
150 Opening ASCII mode data connection for aix/fixes/pkgs/wvgww/LoadL.full.3.1.0.17.bff (14333952 bytes).
226 Transfer complete.
14387449 bytes received in 140.7 seconds (99.89 Kbytes/s)
local: LoadL.full.3.1.0.17.bff remote: aix/fixes/pkgs/wvgww/LoadL.full.3.1.0.17.bff
mget aix/fixes/pkgs/wvgww/LoadL.html.3.1.0.6.bff? y
200 PORT command successful.
150 Opening ASCII mode data connection for aix/fixes/pkgs/wvgww/LoadL.html.3.1.0.6.bff (129024 bytes).
226 Transfer complete.
130584 bytes received in 1.221 seconds (104.4 Kbytes/s)
local: LoadL.html.3.1.0.6.bff remote: aix/fixes/pkgs/wvgww/LoadL.html.3.1.0.6.bff
mget aix/fixes/pkgs/wvgww/LoadL.msg.En_US.3.1.0.7.bff? y
200 PORT command successful.
150 Opening ASCII mode data connection for aix/fixes/pkgs/wvgww/LoadL.msg.En_US.3.1.0.7.bff (465920 bytes).
226 Transfer complete.
471438 bytes received in 9.389 seconds (49.04 Kbytes/s)
local: LoadL.msg.En_US.3.1.0.7.bff remote: aix/fixes/pkgs/wvgww/LoadL.msg.En_US.3.1.0.7.bff
mget aix/fixes/pkgs/wvgww/LoadL.msg.en_US.3.1.0.7.bff? y
200 PORT command successful.
150 Opening ASCII mode data connection for aix/fixes/pkgs/wvgww/LoadL.msg.en_US.3.1.0.7.bff (465920 bytes).
226 Transfer complete.
471435 bytes received in 4.369 seconds (105.4 Kbytes/s)
local: LoadL.msg.en_US.3.1.0.7.bff remote: aix/fixes/pkgs/wvgww/LoadL.msg.en_US.3.1.0.7.bff
mget aix/fixes/pkgs/wvgww/LoadL.so.3.1.0.17.bff? y
200 PORT command successful.
150 Opening ASCII mode data connection for aix/fixes/pkgs/wvgww/LoadL.so.3.1.0.17.bff (10788864 bytes).
226 Transfer complete.
10828482 bytes received in 139.8 seconds (75.66 Kbytes/s)
local: LoadL.so.3.1.0.17.bff remote: aix/fixes/pkgs/wvgww/LoadL.so.3.1.0.17.bff
ftp> quit
221-You have transferred 26289388 bytes in 5 files.
221-Total traffic for this session was 26291252 bytes in 6 transfers.
221-Thank you for using the FTP service on service.boulder.ibm.com.
221 Goodbye.
root@veg001 (/usr/sys/inst.images/installp/ppc)# smit update_all
root@peg304 (/)# lslpp -l | grep LoadL*
LoadL.full 3.1.0.17 COMMITTED LoadLeveler
LoadL.html 3.1.0.6 COMMITTED LoadLeveler HTML Pages
LoadL.msg.en_US 3.1.0.7 COMMITTED LoadLeveler Messages - U.S.
LoadL.pdf 3.1.0.0 COMMITTED LoadLeveler PDF Documentation
LoadL.so 3.1.0.17 COMMITTED LoadLeveler (Submit only)
LoadL.tguides 3.1.0.0 COMMITTED LoadLeveler TaskGuides
bos.rte.bind_cmds 5.1.0.50 COMMITTED Binder and Loader Commands
LoadL.full 3.1.0.0 COMMITTED LoadLeveler
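The unquoted pattern LoadL* in the grep above is open to shell filename expansion and, read as a regular expression, means "Load" followed by any number of "L" characters. That is why the bos.rte.bind_cmds line ("Binder and Loader Commands") appears in the listing. Matching on a literal fileset-name prefix avoids both effects. A minimal sketch follows; the function name fileset_levels is hypothetical.

```shell
#!/bin/sh
# Read "lslpp -l" style output on standard input and print
# "<fileset> <level>" for every fileset whose name starts with PREFIX.
fileset_levels() {
    prefix=$1
    awk -v p="$prefix" 'index($1, p) == 1 && NF >= 3 { print $1, $2 }'
}

# Intended use on an AIX node (not run here):
#   lslpp -l | fileset_levels LoadL.
```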
# su - loadl
# pwd
/home/loadl
# /usr/lpp/LoadL/full/bin/llinit -local /var/loadl -release /usr/lpp/LoadL/full -cm bet201
llinit does the following:
Copies the LoadL_admin and the LoadL_config files from the release directory (in the samples subdirectory) into the home directory of loadl.
Creates the LoadLeveler log, spool, and execute directories in the local directory with permissions set to 775, 700, and 1777, respectively.
Copies the LoadL_config.local file from the release directory (in the samples subdirectory) into the local directory.
Creates symbolic links from the loadl home directory to the spool, execute, and log subdirectories and the LoadL_config.local file in the local directory (if home and local directories are not identical).
Creates symbolic links from the home directory to the bin, lib, man, samples, and include subdirectories in the release directory.
The above files are copied and directories and symbolic links are created only if they don’t already exist.
/usr/lib/libllapi.a -> /usr/lpp/LoadL/full/lib/libllapi.a
/usr/lib/libllmulti.a -> /usr/lpp/LoadL/full/lib/libllmulti.a
/usr/lib/llapi_shr.o -> /usr/lpp/LoadL/full/lib/llapi_shr.o
/home/loadl/include -> /usr/lpp/LoadL/full/include
/home/loadl/bin -> /usr/lpp/LoadL/full/bin
/home/loadl/lib -> /usr/lpp/LoadL/full/lib
/home/loadl/man -> /usr/lpp/LoadL/full/man
/home/loadl/samples -> /usr/lpp/LoadL/full/samples
/home/loadl/spool -> /var/loadl/spool
/home/loadl/execute -> /var/loadl/execute
root@bet201 (/)# llextSDR > /tmp/llextSDR.out
#llextSDR: System Partition = "bet100" on Sat Aug 9 09:50:10 2003
bet316.ocs.lsu.edu: type = machine
adapter_stanzas = beu316.ocs.lsu.edu bsw316.ocs.lsu.edu bet316.ocs.lsu.edu
spacct_excluse_enable = false
alias = beu316.ocs.lsu.edu bsw316.ocs.lsu.edu
beu316.ocs.lsu.edu: type = adapter
adapter_name = en1
network_type = ethernet
interface_address = 130.39.187.82
interface_name = beu316.ocs.lsu.edu
bsw316.ocs.lsu.edu: type = adapter
adapter_name = css0
network_type = switch
interface_address = 130.39.234.86
interface_name = bsw316.ocs.lsu.edu
switch_node_number = 47
css_type = SP_Switch_MX_Adapter
bet316.ocs.lsu.edu: type = adapter
adapter_name = en0
network_type = ethernet
interface_address = 130.39.242.86
interface_name = bet316.ocs.lsu.edu
In the past (LoadLeveler 2.1), we did not specify css_type for the css0 (switch) adapter of each parallel
node in the LoadL_admin file. LoadLeveler 3.1 requires it: add the line "css_type = SP_Switch_MX_Adapter"
to each switch adapter stanza. Otherwise, LoadLeveler reports an error at startup.
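Before starting LoadLeveler 3.1, the admin file can be scanned for switch adapter stanzas that still lack a css_type line. The filter below is a minimal sketch, assuming stanzas laid out like the llextSDR output above (a "name: type = kind" header followed by "key = value" lines); the function name is hypothetical.

```shell
#!/bin/sh
# Read LoadL_admin-style stanzas on standard input and print the name of
# every stanza with "network_type = switch" but no css_type line.
# Assumption: each stanza starts with "<name>: type = <kind>" and continues
# with "key = value" lines until the next stanza header.
switch_adapters_missing_css_type() {
    awk '
        function report() {
            if (name != "" && is_switch && !has_css) print name
        }
        /^[ \t]*[^ \t#:]+:[ \t]*type[ \t]*=/ {
            report()
            name = $1; sub(/:$/, "", name)
            is_switch = 0; has_css = 0
            next
        }
        /^[ \t]*network_type[ \t]*=[ \t]*switch[ \t]*$/ { is_switch = 1 }
        /^[ \t]*css_type[ \t]*=/                        { has_css = 1 }
        END { report() }
    '
}

# Example (not run here): switch_adapters_missing_css_type < /home/loadl/LoadL_admin
```

An empty result means every switch adapter stanza carries a css_type line.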
root@bet100 (/)# spmon -d
----------------------------------- Frame 2 ----------------------------------
Host Switch Key Env Front Panel LCD/LED
Slot Node Type Power Responds Responds Switch Error LCD/LED Flashes
---- ---- ----- ----- -------- -------- ------- ----- ---------------- -------
1 17 wide on yes no N/A no LCDs are blank no
----------------------------------- Frame 3 ----------------------------------
Host Switch Key Env Front Panel LCD/LED
Slot Node Type Power Responds Responds Switch Error LCD/LED Flashes
---- ---- ----- ----- -------- -------- ------- ----- ---------------- -------
1 33 thin on yes yes N/A no LCDs are blank no
2 34 thin on yes yes N/A no LCDs are blank no
3 35 thin on yes yes N/A no LCDs are blank no
4 36 thin on yes yes N/A no LCDs are blank no
5 37 thin on yes yes N/A no LCDs are blank no
6 38 thin on yes yes N/A no LCDs are blank no
7 39 thin on yes yes N/A no LCDs are blank no
8 40 thin on yes yes N/A no LCDs are blank no
9 41 thin on yes yes N/A no LCDs are blank no
10 42 thin on yes yes N/A no LCDs are blank no
11 43 thin on yes yes N/A no LCDs are blank no
12 44 thin on yes yes N/A no LCDs are blank no
13 45 thin on yes yes N/A no LCDs are blank no
14 46 thin on yes yes N/A no LCDs are blank no
15 47 thin on yes yes N/A no LCDs are blank no
16 48 thin on yes yes N/A no LCDs are blank no
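In a listing of this size it is easy to overlook a node that is not responding, such as node 17 in Frame 2, whose Switch Responds column reads "no". The filter below is a minimal sketch that assumes the column order shown above (slot, node, type, power, host responds, switch responds); the function name is hypothetical.

```shell
#!/bin/sh
# Read "spmon -d" output on standard input and print the node number of
# every node whose "Host Responds" or "Switch Responds" column is not "yes".
# Assumption: data rows look like the listing above:
#   slot node type power host_responds switch_responds ...
unresponsive_nodes() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && NF >= 6 &&
         ($5 != "yes" || $6 != "yes") { print $2 }'
}

# Example (not run here): spmon -d | unresponsive_nodes
```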
root@peg304 (/)#llctl start
llctl: Attempting to start LoadLeveler on host peg304.ocs.lsu.edu.
LoadL_master 3.1.0.17 rlyns24a 2003/07/22 AIX 5.1 71
CentralManager = bat201
root@peg304 (/)#ps -ef | grep LoadL*
root 18378 12440 0 11:16:56 pts/1 0:00 grep LoadL*
root@peg304 (/)#tail /syslog/SchedLog
08/14 08:48:49 TI-1 ************************************************
08/14 08:48:49 TI-1 ** LOADL_SCHEDD STARTING UP **
08/14 08:48:49 TI-1 ************************************************
08/14 08:48:49 TI-1
08/14 08:48:49 TI-1 LoadLeveler: LoadL_schedd started, pid = 33674
08/14 08:48:49 TI-1 init_wakeup_timers: Set SCHEDD_INTERVAL to 60.
08/14 08:49:39 TI-4 LoadL_schedd shutting down now.
08/14 08:49:39 TI-4 LmDaemon::shutdown(): schedd shutting down with 0 jobs on the queue.
08/14 08:49:39 TI-4 Shutdown completed. Exiting now.
lslpp: 0504-132 Fileset dce.client.rte not installed.
08/14 08:53:26 TI-1 LoadLeveler: MAX_JOB_REJECT is 10.
08/14 08:53:26 TI-1 LoadLeveler: ACTION_ON_MAX_REJECT is CANCEL.
08/14 08:53:26 TI-1
08/14 08:53:26 TI-1 LmDaemon::createJobQueue(): Cannot create job queue.
08/14 08:53:26 TI-1
08/14 08:53:26 TI-1 LoadL_schedd shutting down now.
08/14 08:53:26 TI-1 LmDaemon::shutdown(): schedd shutting down with 0 jobs on the queue.
08/14 08:53:26 TI-1 Shutdown completed. Exiting now.
root@bet201 (/)# llctl -g start
root@bet201 (/)# llctl -h hostname start
where hostname is the name of the machine to start.
# mkdir -p /mnt/fix/
# touch /mnt/fix/nfs.not.mounted
# mount -o ro honggao:/mnt/cdrom/ /mnt/fix/
# smit
Do the following to install LoadLeveler 3.3.1 using SMIT:
Select Software Installation and Maintenance
Select Install and Update Software
Select Install Software
Select The device or directory containing the install images(/mnt/fix/)
Select The appropriate information to specify options
Press Enter (Do)
# umount /mnt/fix
# su - loadl
# pwd
/usr/local/home/loadl
# /usr/lpp/LoadL/full/bin/llinit -local /var/loadl -release /usr/lpp/LoadL/full -cm l1f1n01
root@l2f1c01 (/home/install)# cat loadl_init
su - loadl -c "/usr/lpp/LoadL/full/bin/llinit -local /var/loadl -release /usr/lpp/LoadL/full -cm l2f1n01"
root@l2f1c01 (/home/install)# dcp /home/install/loadl_init /usr/local/home/loadl/.
root@l2f1c01 (/home/install)# dsh /usr/local/home/loadl/loadl_init
root@l1f1n01 (/)# llextRPD
llextRPD: 2512-678 Unable to obtain machine and adapter information for the current RSCT Peer Domain.
cu_get_cluster_info utility returned a cluster name of IW indicating that this is an Independent Workstation and that no cluster has been started. Extracted data will contain information only for the local machine "l1f1n01.sys.loni.org".
#llextRPD: Cluster = "IW" ID = "3380539212" on Thu May 18 16:36:56 2006
l1f1n01.sys.loni.org: type = machine
adapter_stanzas = l1f1b01.sys.loni.org l1f1n01.sys.loni.org l1f1s01.sn.loni.org l1f1m01.ml.loni.org
alias = l1f1b01.sys.loni.org l1f1m01.ml.loni.org
l1f1b01.sys.loni.org: type = adapter
adapter_name = en4
network_type = ethernet
interface_address = 138.47.23.141
interface_name = l1f1b01.sys.loni.org
l1f1n01.sys.loni.org: type = adapter
adapter_name = en0
network_type = ethernet
interface_address = 138.47.23.21
interface_name = l1f1n01.sys.loni.org
device_driver_name = ent0
l1f1s01.sn.loni.org: type = adapter
adapter_name = sn0
network_type = switch
interface_address = 192.168.22.113
interface_name = l1f1s01.sn.loni.org
multilink_address = 192.168.24.113
logical_id = 6
adapter_type = Switch_Network_Interface_For_HPS
device_driver_name = sni0
network_id = 1
l1f1m01.ml.loni.org: type = adapter
adapter_name = ml0
network_type = multilink
interface_address = 192.168.24.113
interface_name = l1f1m01.ml.loni.org
multilink_list = sn0
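The stanza output above is easier to review as one line per adapter. The filter below is a minimal sketch that condenses llextRPD or llextSDR style output into "adapter_name network_type interface_address" rows, assuming the stanza layout shown above; the function name adapter_table is hypothetical.

```shell
#!/bin/sh
# Read llextRPD/llextSDR-style stanzas on standard input and print one line
# per adapter stanza: "<adapter_name> <network_type> <interface_address>".
# Missing values are shown as "-".
adapter_table() {
    awk '
        function report() {
            if (kind == "adapter") print adapter, net, addr
        }
        /^[ \t]*[^ \t#:]+:[ \t]*type[ \t]*=/ {
            report()
            kind = $NF; adapter = "-"; net = "-"; addr = "-"
            next
        }
        $1 == "adapter_name"      && $2 == "=" { adapter = $3 }
        $1 == "network_type"      && $2 == "=" { net = $3 }
        $1 == "interface_address" && $2 == "=" { addr = $3 }
        END { report() }
    '
}

# Example (not run here): llextRPD | adapter_table
```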
root@l1f1n01 (/)# llctl -g start
root@l1f1c01 (/)# dcp /home/install/.rhosts /.rhosts
root@l1f1c01 (/)# dcp /home/install/hosts.equiv /etc/hosts.equiv
Make sure to uncomment the rsh line in /etc/inetd.conf.
shell stream tcp6 nowait root /usr/sbin/rshd rshd
If inetd is not running under SRC control, you need to start it with "startsrc -s inetd". If inetd is already running, the "refresh -s inetd" command needs to be executed so that inetd re-reads the inetd.conf file.
root@l1f1c01 (/)# dsh refresh -s inetd
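Whether the rsh line is really uncommented can be checked before refreshing inetd. This is a minimal sketch; the function name rsh_enabled is hypothetical, and it only looks for an active "shell stream" entry in the file it is given.

```shell
#!/bin/sh
# Return 0 if FILE contains an active (uncommented) "shell" service line.
rsh_enabled() {
    grep -Eq '^[[:space:]]*shell[[:space:]]+stream[[:space:]]' "$1"
}

# Example (not run here):
# rsh_enabled /etc/inetd.conf || echo "rsh line is still commented out" >&2
```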
ENFORCE_RESOURCE_USAGE = ConsumableMemory
ENFORCE_RESOURCE_MEMORY = true
SCHEDULE_BY_RESOURCES = ConsumableMemory
ENFORCE_RESOURCE_SUBMISSION = true
resources = ConsumableMemory(16 GB)
workq: type = class
default_resources = ConsumableMemory(1600 MB)
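With these settings, a job step in class workq that does not request memory explicitly is charged the default of 1600 MB per task against the 16 GB configured on the machine. The arithmetic below is a rough sketch of how many default-sized tasks fit on one node, assuming 1 GB is counted as 1024 MB and memory is the only limiting resource; the function name is hypothetical.

```shell
#!/bin/sh
# Number of tasks of TASK_MB each that fit into NODE_GB of ConsumableMemory.
# Assumption: 1 GB is counted as 1024 MB.
tasks_per_node() {
    node_gb=$1
    task_mb=$2
    echo $(( node_gb * 1024 / task_mb ))
}

tasks_per_node 16 1600   # prints 10
```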
root@l1f1c01 (/)# dcp /home/install/epilog /usr/local/home/loadl/.
root@l1f1c01 (/home/install)# cc -o WLM_epilog WLM_epilog.c
root@l1f1c01 (/)# dcp /home/install/WLM_epilog /usr/local/home/loadl/.
In AIX 5.1D and subsequent releases, a feature called Technical Large Page Support is available. Technical Large Page Support involves the selective use of large virtual and physical memory pages to back private data segments of a process. When specified, the user process heap, the main program BSS, and the main program data areas are backed by large pages. LoadLeveler users can take advantage of this AIX feature and enable Large Page support for their jobs.
For more information about LDR_CNTRL, please refer to the AIX documentation.
If some of the machines in your LoadLeveler cluster are configured to exploit the Large Page feature, and if you want LoadLeveler to provide support for large pages, then the information contained in the items listed below is needed for effective use of this feature:
The LoadLeveler configuration keyword VM_IMAGE_ALGORITHM must be set to the value FREE_PAGING_SPACE_PLUS_FREE_REAL_MEMORY if jobs are submitted with large_page = M. This keyword specifies which algorithm the Central Manager uses to decide whether a machine has enough virtual memory to meet the requirement of the image_size keyword of a job step.
Jobs which specify large_page = M will remain in the idle state if the default setting of FREE_PAGING_SPACE is used for VM_IMAGE_ALGORITHM keyword. Using the FREE_PAGING_SPACE_PLUS_FREE_REAL_MEMORY algorithm allows LoadLeveler to consider both the free "regular" memory and the free Large Page memory when deciding if a machine in the cluster has enough virtual memory to run a job step.
The job command file keyword large_page is used to inform LoadLeveler that a job step requires Large Page support from AIX. The syntax of this keyword is:
The "llq -l" and "llsummary -l" commands have been enhanced to display information associated with the large_page keyword for LoadLeveler jobs. The output listings of these commands contain lines similar to the following:
The "llstatus -l" command has been enhanced to show information for total memory and free memory for both Large Page memory and regular memory. Below is a fragment of a representative "llstatus -l" command output:
In the above listing, Total Memory refers to the sum of regular and Large Page memory. Memory and FreeRealMemory refer to regular and free regular memory.
Two new LoadLeveler variables, TotalMemory and LargePageMemory, have been added. These variables are supported by the requirements and preferences expressions.
In the following sample job command file, the person submitting this job to LoadLeveler has requested that the job be run on a machine that has at least 1500 MB of Large Page memory configured and that machines having total memory (regular and Large Page) greater than 2800 MB are preferred.
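A job command file matching that description could look like the sketch below. It is an illustration, not the original sample: only large_page, requirements, preferences, LargePageMemory and TotalMemory come from the text above, while the job name, the command the job runs, and the helper function print_large_page_job are placeholders.

```shell
#!/bin/sh
# Print a hypothetical job command file to standard output.
# Only large_page, requirements and preferences come from the text above;
# the job name and the command the job runs are placeholders.
print_large_page_job() {
    cat <<'EOF'
#!/bin/sh
# @ job_name     = large_page_demo
# @ large_page   = M
# @ requirements = (LargePageMemory >= 1500)
# @ preferences  = (TotalMemory > 2800)
# @ queue
uname -n
EOF
}

# Example (not run here): print_large_page_job > demo.cmd && llsubmit demo.cmd
```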
Notes:
The function ll_get_data() of the LoadLeveler API has been enhanced so that Large Page information of machines can be accessed by the specifications:
The large_page information associated with job steps can be accessed by the specification LL_StepLargePage.
The AIX Workload Manager (WLM) program does not support Large Page memory. The information associated with memory statistics in the outputs of commands such as "llq -w <job_id>" is not meaningful and should be ignored.
root@peg303 (/)# cd /usr/opt/ifor/ls/os/aix/bin
root@peg303 (/usr/opt/ifor/ls/os/aix/bin)# ./i4config
Setup for:
GRL-2050: *** Fatal error from I4LLMD: License database on an invalid node.
CFG-20040: Error(s) reported by 1 or more services: 'Start Services' failed
Fix the License Server problem GRL-2050: *** Fatal error from I4LLMD:
root@peg303 (/usr/opt/ifor/ls/os/aix/bin)# cd /var/ifor
root@peg303 (/var/ifor)# i4cfg -stop
i4cfg Version 4.5.5 AIX -- LUM Configuration Tool
(c) Copyright 1995-1998, IBM Corporation, All Rights Reserved
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
0513-044 The i4llmd Subsystem was requested to stop.
root@peg303 (/var/ifor)# rm *.dat *.idx
root@peg303 (/var/ifor)# i4cfg -start
i4cfg Version 4.5.5 AIX -- LUM Configuration Tool
(c) Copyright 1995-1998, IBM Corporation, All Rights Reserved
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
0513-059 The i4llmd Subsystem has been started. Subsystem PID is 344140.
'Start Services' has completed successfully
Make sure that iFOR really is running:
root@peg303 (/var/ifor)# lssrc -a |grep -i if
i4llmd iforls 344140 active
llbd iforncs inoperative
glbd iforncs inoperative
i4lmd iforls inoperative
i4glbcd iforncs inoperative
i4gdb iforls inoperative
NOTE: only i4llmd needs to be running; the other subsystems are for network licensing, etc.
Re-enroll the licenses for C, C++, and Fortran:
root@peg303 (/var/ifor)# cd /usr/opt/ifor/ls/os/aix/bin
root@peg303 (/usr/opt/ifor/ls/os/aix/bin)# ./i4blt -a -f /usr/vacpp/vacpp_cn.lic -T 10 -R "root LSU OCS BDRH"
i4blt Version 4.6.5 AIX -- LUM Basic License Tool
(c) Copyright 1995-1998, IBM Corporation, All Rights Reserved
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
(c) Copyright 1991-1997 Gradient Technologies Inc., All Rights Reserved
(c) Copyright 1991,1992,1993, Hewlett-Packard Company, All Rights Reserved
ADM-10099: Product successfully enrolled
root@peg303 (/usr/opt/ifor/ls/os/aix/bin)# ./i4blt -a -f /usr/vac/cforaix_cn.lic -T 10 -R "root LSU OCS BDRH"
i4blt Version 4.6.5 AIX -- LUM Basic License Tool
(c) Copyright 1995-1998, IBM Corporation, All Rights Reserved
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
(c) Copyright 1991-1997 Gradient Technologies Inc., All Rights Reserved
(c) Copyright 1991,1992,1993, Hewlett-Packard Company, All Rights Reserved
ADM-10099: Product successfully enrolled
root@peg303 (/usr/opt/ifor/ls/os/aix/bin)# ./i4blt -a -f /usr/lpp/xlf/xlfaix_cn.lic -T 10 -R "root LSU OCS BDRH"
i4blt Version 4.6.5 AIX -- LUM Basic License Tool
(c) Copyright 1995-1998, IBM Corporation, All Rights Reserved
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
(c) Copyright 1991-1997 Gradient Technologies Inc., All Rights Reserved
(c) Copyright 1991,1992,1993, Hewlett-Packard Company, All Rights Reserved
ADM-10099: Product successfully enrolled
If there is still error information, you can use the following method to rebuild the License Server:
#i4cfg -stop
#cd /var/ifor
#rm *.dat *.idx *.error *.out i4ls.ini
#/usr/opt/ifor/ls/os/aix/bin/i4cnvini
#/var/ifor/i4cfg -script (rebuild license)
To start the compiler license server automatically on the interactive nodes, the following is added to /etc/rc.local on peg304 and peg303:
# start compiler license server
/var/ifor/i4cfg -start
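The check that i4llmd is active, done by eye in the lssrc listing earlier, can also be scripted, for example to start the license server only when it is down. This is a minimal sketch, assuming lssrc rows end with the status word as in the listing above; the function name subsystem_active is hypothetical.

```shell
#!/bin/sh
# Read "lssrc -a" style output on standard input; return 0 if the named
# subsystem is listed with status "active" in the last column.
subsystem_active() {
    awk -v s="$1" '$1 == s && $NF == "active" { found = 1 } END { exit !found }'
}

# Example (not run here):
# lssrc -a | subsystem_active i4llmd || /var/ifor/i4cfg -start
```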
root@bet100 (/)# kdestroy
Tickets destroyed.
root@bet100 (/)# kinit root.admin
Kerberos V4 Initialization for "root.admin"
Password:
root@bet100 (/)# klist
Ticket file: /tmp/tkt0
Principal: root.admin@OCS.LSU.EDU
Issued Expires Principal
Sep 3 15:42:30 Oct 3 15:42:30 krbtgt.OCS.LSU.EDU@OCS.LSU.EDU
root@bet100 (/)# dsh -w peg301,peg302,peg303,peg304,peg502,peg503,veg001 uptime
peg301: 03:58PM up 21 days, 5:40, 0 users, load average: 0.00, 0.01, 0.02
peg302: 03:58PM up 28 days, 18:50, 0 users, load average: 0.00, 0.01, 0.02
peg303: 03:58PM up 5 days, 6 mins, 0 users, load average: 0.00, 0.00, 0.00
peg304: 03:58PM up 19 days, 5 mins, 0 users, load average: 0.21, 0.09, 0.06
peg502: 03:58PM up 12 days, 1:08, 0 users, load average: 6.16, 6.01, 6.04
peg503: 03:58PM up 11 days, 22:36, 0 users, load average: 5.98, 5.95, 6.02
veg001: 03:58PM up 25 days, 4:06, 1 user, load average: 4.14, 4.08, 4.20
root@bet100 (/)# dsh -w bet301,bet201,bet101 uptime
bet301: 04:10PM up 8 days, 20:06, 0 users, load average: 0.04, 0.05, 0.04
bet201: 04:10PM up 11 days, 23:04, 0 users, load average: 0.10, 0.16, 0.15
bet101: 04:10PM up 23 days, 7:11, 1 user, load average: 0.05, 0.35, 0.61
root@peg304 (/)# rpm -Uvh /ld/RPMS/local/lsu-hpc-monitord-0.1-0.aix5.1.ppc.rpm
lsu-hpc-monitord ##################################################
root@peg304 (/)# rpm -qil lsu-hpc-monitord
Name : lsu-hpc-monitord Relocations: (not relocateable)
Version : 0.1 Vendor: (none)
Release : 0 Build Date: Fri Sep 19 17:49:28 CDT 2003
Install date: Fri Sep 19 17:59:19 CDT 2003 Build Host: bet201.ocs.lsu.edu
Group : System Environment/Daemons Source RPM: lsu-hpc-monitord-0.1-0.src.rpm
Size : 251362 License: GPL
Packager : OCS HPC at Louisiana State University
URL : http://unixdoc.lsu.edu/sag/
Summary : Process monitoring daemon
Description :
monitord watches all system processes. Any process owned by a non-excluded user that has exceeded the maximum established CPU time will be killed and the owner will be sent an e-mail message indicating as much. Currently parameters such as excluded users or processes and the maximum CPU time can only be changed by recompiling the code3.
/usr/local/bin/monitord_notify
/usr/local/sbin/monitord
root@peg304 (/)# /usr/local/sbin/monitord
root@peg304 (/)# ps -ef |grep monitord
root 499920 1 0 08:36:04 pts/13 0:00 /usr/local/sbin/monitord
root 737438 745542 0 09:17:09 pts/13 0:00 grep monitord
root@peg304 (/)# cat /usr/local/bin/monitord_notify
#!/bin/sh
####################################################################
#
# MONITORD_NOTIFY
#
# Notify the users about their killed processes and log each kill
#
####################################################################
####################################################################
#
# GLOBAL VARIABLES
#
####################################################################
the_date=`date +"%b %d %H:%M:%S"`
the_host=`uname -n`
the_user=`ypcat passwd.byname |grep ${1} |awk -F: '{print $5}'`
log_file=/syslog/monitord_log
####################################################################
#
# FUNCTION notify_user ()
#
# Notifies the user by electronic mail why their process was killed.
# Also sends a copy to root.
#
####################################################################
notify_user ()
{
/usr/bin/mail -s "A process has exceeded CPU time limits on Casper" -c casper@lsu.edu ${1}@lsu.edu <<-!END
The process you had running on the Casper interactive node ${the_host}
has exceeded our established CPU limit of 30 minutes. The CPU time
accumulated by your process at the time of cancelation was ${4} minutes.
For long running jobs please use our established batch queues.
See our documentation on the web at:
http://www.lsu.edu/hpc/
and click on the Support link.
Thank you,
The Casper System Administration Team
!END
}
####################################################################
#
# FUNCTION log ()
#
# Record the event
#
####################################################################
log()
{
echo "${the_date} ${the_host} monitord: \"${1} ${2} ${3} ${4} ${5}\"" >> "$log_file"
}
notify_user $*
log $*
# mpage -bLetter -loafRH /usr/local/bin/monitord_notify | lpr --printer=PASS@lasr2e.ocs.lsu.edu
# start monitord (do not start monitord when this node is used as a batch execution node)
if [ -x /usr/local/sbin/monitord ] && [ ! -f /etc/nomonitord ]
then
    /usr/local/sbin/monitord
fi
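The monitord daemon itself is compiled code and its source is not shown here. The sketch below only illustrates the kind of check its description implies: convert a ps-style CPU time to seconds and compare it with the 30 minute limit quoted in the notification mail. The function names and the accepted time formats ([[D-]HH:]MM:SS) are assumptions.

```shell
#!/bin/sh
# Convert a ps-style CPU time ([[D-]HH:]MM:SS) to seconds.
cpu_seconds() {
    echo "$1" | awk '{
        days = 0; t = $1
        if (split(t, d, "-") == 2) { days = d[1]; t = d[2] }
        n = split(t, p, ":")
        if (n == 3)      secs = p[1] * 3600 + p[2] * 60 + p[3]
        else if (n == 2) secs = p[1] * 60 + p[2]
        else             secs = p[1]
        print days * 86400 + secs
    }'
}

# Return 0 if the CPU time exceeds LIMIT minutes (default 30, the limit
# quoted in the notification mail above).
over_cpu_limit() {
    [ "$(cpu_seconds "$1")" -gt "$(( ${2:-30} * 60 ))" ]
}

# Example: over_cpu_limit 45:10 && echo "process is over the limit"
```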
# chfs -a "quota = userquota" /home
To enable both user and group quotas on the /home file system, enter:
# chfs -a "quota = userquota,groupquota" /home
The corresponding entry in the /etc/filesystems file would appear as follows:
/home:
dev = /dev/hd1
vfs = jfs
log = /dev/hd8
mount = true
check = true
quota = userquota,groupquota
options = rw
Optionally, specify alternate disk quota file names. The file names quota.user and quota.group are the default names located at the root directories of the file systems enabled with quotas. You can specify alternate names or directories for these quota files with the userquota and groupquota attributes in the /etc/filesystems file. The following sample chfs command establishes user quotas for the /exports/u00/fs000 file system on Casper, and names the quota file quota.user:
# chfs -a "userquota = /exports/u00/fs000/quota.user" /exports/u00/fs000
The corresponding entry in /etc/filesystems would appear as follows:
/exports/u00/fs000:
dev = /dev/u00_00lv
vfs = jfs
log = /dev/loglv00
mount = automatic
check = false
options = rw
account = true
quota = userquota
userquota = /exports/u00/fs000/quota.user
adsmfsm = true
Use the edquota command to create each user or group's soft and hard limits for allowable disk space and maximum number of files.
The following sample entry shows quota limits for user karsten :
root@cet101 (/)# edquota -u karsten
Quotas for user karsten:
/exports/u00/fs000: blocks in use: 45376, limits (soft = 50000, hard = 60000)
inodes in use: 846, limits (soft = 2000, hard = 2500)
This user has used 45MB of the maximum 50MB of disk space. Of the maximum 2000 files, karsten has created 846. This user has buffers of
10MB of disk space and 500 files that can be allocated to temporary storage.
# edquota -p davec nanc
Note: It is recommended that you do this each time you first enable quotas on a file system and after you reboot the system.
To enable this check and to turn on quotas during system startup, add the following lines to the /etc/rc.local file:
#Enable the Casper user quota
/usr/sbin/quotacheck -a
/usr/sbin/quotaon -a
Note: If a particular user has no files in a file system on which that user has a quota, this command displays quota: none for that user. The user's actual quota is displayed when the user has files in the filesystem.
To display quotas as the root user for user karsten, enter:
root@cet101 (/)# quota -u karsten
The system displays the following information:
Disk quotas for user karsten (uid 1130):
Filesystem blocks quota limit grace files quota limit grace
/exports/u00/fs000 45376 50000 60000 846 2000 2500
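To see at a glance how close a user is to the soft limit, the quota output can be reduced to a percentage per file system. This is a minimal sketch, assuming rows shaped like the one above (file system, blocks in use, soft limit, hard limit); the function name quota_usage is hypothetical.

```shell
#!/bin/sh
# Read "quota -u" style rows on standard input and print, for each file
# system, the percentage of the soft block limit in use (rounded down).
# Assumption: rows look like "<filesystem> <blocks> <soft> <hard> ...".
quota_usage() {
    awk '$1 ~ /^\// && $3 > 0 { printf "%s %d%%\n", $1, $2 * 100 / $3 }'
}

# Example (not run here): quota -u karsten | quota_usage
```

For the row shown above, 45376 of 50000 blocks works out to 90%.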
root@peg301 (/)# cd /usr/local/stata9
root@peg301 (/)# ./stinit
root@peg301 (/)# /usr/local/stata9/stata-se
. simulinit
. exit
simulinit creates the file /usr/local/stata9/.license/stata.sim, which must be readable and writable by all Stata users.
root@peg301 (/exports/local/stata9)# vi stata.lft /work/.stata_license/stata.sim
root@peg301 (/exports/local/stata9)# /usr/local/stata9/stata-se
___ ____ ____ ____ ____ tm
/__ / ____/ / ____/
___/ / /___/ / /___/ 9.0 Copyright 1984-2005
Statistics/Data Analysis StataCorp
4905 Lakeway Drive
Special Edition College Station, Texas 77845 USA
800-STATA-PC http://www.stata.com
979-696-4600 stata@stata.com
979-696-4601 (fax)
2-user Stata for IBM RISC64 (network) perpetual license:
Serial number: 890514525
Licensed to: Tibor Besedes
Louisiana State University
Notes:
1. (-m# option or -set memory-) 10.00 MB allocated to data
2. (-v# option or -set maxvar-) 5000 maximum variables
3. Command line editing enabled
Note: Your site can add messages to the introduction by editing the file
stata.msg in the directory where Stata is installed.
. simulinit
file /work/.stata_license/stata.sim created
. exit