This document briefly summarizes the steps needed to run LCG2 software on top of PBSPro (with the PBS server on a node other than the CE) instead of the default openpbs that is distributed with the LCG2 middleware. It is known to work with at least PBSPro 5.3.3.31432 and PBSPro 5.4.0.40152. Other versions may need further modifications, depending on version-specific PBSPro bugs and on possible changes in the output of the pbs command-line utilities.

The setup described here uses scp to transfer files between the WNs and the CE (that is, without shared home directories), because this case is a little more difficult to set up.

Credit for discovering some of the culprits (the scp.fix trick) goes to Peer Hasselmeyer from FZK, and transitively to David Smith from CERN.

Please report bugs in this document, or improvements to it, to Jiri Kosina.

* The openpbs infoprovider has to be adapted for PBSpro. The script for openpbs is /opt/edg/libexec/edg-ce-pbs and is called from ceinfo-wrapper.sh. You can either modify edg-ce-pbs in place, in which case ceinfo-wrapper.sh does not need to be touched, or make ceinfo-wrapper.sh call a newly created edg-ce-pbspro. The second option requires changes on the LCFG server (if you are using one), because ceinfo-wrapper.sh is generated automatically by the LCFG server. The rest of this document assumes that you change edg-ce-pbs. Appendix A contains a diff against edg-ce-pbs from LCG-2_0_0.

* PBSpro uses slightly different ports than openpbs, so the xinetd configuration object needs a small modification. PBSpro also uses different paths than openpbs, and this has to be reflected as well.
  In LCFG you can do this as follows:

  - in source/pbs-client-cfg.h, comment out

      EXTRA(xinetd.etcservices) pbs_mom pbs_remom_tcp pbs_remom_udp
      xinetd.etcsrvconf_pbs_mom pbs_mom 15002/tcp
      xinetd.etcsrvconf_pbs_remom_tcp pbs_remom 15003/tcp
      xinetd.etcsrvconf_pbs_remom_udp pbs_remom 15003/udp

  - add these fields (replacing your.real.pbspro.server with the hostname of your real PBSpro server)

      #define PBS_MASTER your.real.pbspro.server
      +ceinfo.SubClusterUniqueID PBS_MASTER
      +ceinfo.ClusterUniqueID PBS_MASTER
      +pbsexechost.pbshome /var/spool/PBS
      +pbsexechost.initname pbs
      EXTRA(xinetd.etcservices) pbsp_batch pbsp_mom pbsp_mgr pbsp_sched pbsp_globus_mom pbsp_globus_mgr
      xinetd.etcsrvconf_pbsp_batch pbspro_batch 15011/tcp
      xinetd.etcsrvconf_pbsp_mom pbspro_mom 15012/tcp
      xinetd.etcsrvconf_pbsp_mgr pbspro_mgr 15013/tcp
      xinetd.etcsrvconf_pbsp_sched pbspro_sched 15014/tcp
      xinetd.etcsrvconf_pbsp_globus_mom pbspro_globus_mom 15015/tcp
      xinetd.etcsrvconf_pbsp_globus_mgr pbspro_globus_mgr 15016/tcp

  - in source/ComputingElement-novoms-cfg.h, comment out

      ceinfo.SubClusterUniqueID CE_HOSTNAME
      ceinfo.ClusterUniqueID CE_HOSTNAME

* Next, uninstall the openpbs RPMs on your CE and install PBSpro (installing just the utilities is sufficient if your PBS server runs on a node other than the CE). Do not forget to remove openpbs from the LCFG rpmlists of your CE; otherwise the updaterpms LCFG object will soon replace your PBSpro with openpbs automatically.

* Now to the really tricky part. Due to some strange bugs in PBSpro, you have to touch files on the WNs after they have been copied from the CE. This is true at least for versions 5.3.3.31432 and 5.4.0.40152; Peer told me that he has been told that it is not needed in some newer versions of PBSpro. It can be done in the following way:

  - on all WNs, add this line to /etc/pbs.conf

      PBS_SCP=/usr/bin/scp.fix

  - on all WNs, create the file /usr/bin/scp.fix with the following content:

      #!/bin/bash
      # Wrapper around scp: run the real copy, then touch the copied file.
      cd ~
      /usr/bin/scp "$@"
      rc=$?
      # $3 is expected to be the destination argument; touch errors are ignored
      touch "$3" 2>/dev/null
      exit $rc

* Now, provided that you have functional PBS queues and that your WNs can copy output back to the CE without a password (so, if you are not using shared homes, either hostbased or RSA authentication must be working in ssh), your LCG2 jobs should run flawlessly in your PBSPro.

Appendix A - patch for edg-ce-pbs
=================================

--- edg-ce-pbs.old  Tue May 25 18:47:48 2004
+++ edg-ce-pbs  Wed May 26 14:32:40 2004
@@ -67,7 +67,9 @@
 B
-'-cluster ' specifies the name of the server.
+'-cluster ' specifies the name of the server. The server name supplied
+is only honored for the PBS batch system. If the PBS server named in
+doesn't exist an attempt is made to use the default PBS server for this client.
 B
@@ -361,6 +363,7 @@
 # set default values for common 'globally' computed attributes
 %AllCE = ();               # key: dn string of ce data: CE Attributes
 @AllQueues = ();           # all queues hosted the cluster
+%QueueAliases = ();        # key: queue name, value: hash of queue aliases name (used for minimal handling of PBS Routing queue)
 @AuthorizedUsersList = (); # authorized users from gridmapfile
 $CbsBinPath = "";          # the cluster batch system bin path
 @CESEBinds = ();           # the queue, se, mountpt triples/tuples
@@ -688,15 +691,17 @@
 =cut
 sub getCeInformation {
-    # all lrms get value for LRMSVersion and determine available CE's
-    # pbs and lsf get value for $Cluster & $Server
-    # condor create CondorHosts hash
+
+    # all lrms get value for LRMSVersion and determine available CE Id's
+    # also get value for $Cluster. $Server is defined for lsf and pbs but
+    # for condor it must be deterived on a per pool basis from the
+    # CondorHosts hash
+
     &getStaticData;
     # all lrms read static config, worker node, authUser information
     &readStaticInformation;
-
     # copy matching records from static configfile to CE hashs
     &staticCEDataToQueue;
@@ -740,26 +745,29 @@
 B
 post:
-    the Cluster name is ClusterArg or if none given the server info from
-    qstat -B -f shell command
-    the LrmsVersion is either pbs_version info taken from qstat -B -f
-    shell command or "-" if info could not be found
-    the ServerParam is specified as @ and the name of the Cluster
-    $ServerParam = "@$Cluster"
+    the Cluster name is defined to be the gatekeeper host. ClusterArg,
+    if specified, is used as the hostname of the PBS server to query.
+    The server name as returned by the server will be used to set
+    Server for subsequent PBS commands. The LrmsVersion is either
+    pbs_version info taken from the
+    qstat -B -f
+    shell command, or "-" if info could not be found
+    the ServerParam is specified as @ and the name of the Server
+    $ServerParam = "@$Server"
     [Max|Default][CPU|Wall]TimeServer is the value in seconds of
     resources_[max|default].[cput|walltime] from qstat -B -f $Server"
     command or if the value is missing "-"
     @AllQueues array contains queuenames from qstat -Q -f $ServerParam
-    aborts: $Cluster could not be determined,
+    aborts: no queues could be found on the cluster,
     no node information found
 B
 post:
-    the [Server|Cluster] name is ClusterArg or the full hostname determined by lsid shell command
+    The Cluster name is defined to be the gatekeeper host.
     the LrmsVersion value is taken from lsid shell command
     node and the running jobs information are gathered here in the hashs %Jobs and %Nodes
@@ -769,17 +777,16 @@
     @AllQueues contains all queues found by bqueues
-    aborts: $Cluster = "" Cluster could not be determinated,
+    aborts: no queues could be found on the cluster,
     no node information found
 B
 post:
+    The Cluster name is defined to be the gatekeeper host.
     the LrmsVersion value is taken from condor_version shell command
-    set Cluster attribute value to the condor_config_val CONDOR_HOST value of the first pool
-    of the pools specified by -queue or the default pool
-    @AllQueues contains the collector names (CENames) of all condor hosts
+    @AllQueues contains the collector names (CEIdNames) of all condor hosts
     specified with condor_config_val -pool CONDOR_HOST -name CONDOR_HOST COLLECTOR_NAME
     %CondorHosts hash contains a map collector name (CE name) -> CONDOR_HOST
@@ -791,16 +798,12 @@
 sub getStaticData {
     # PBS
     if ($Lrms eq "pbs") {
-        # full hostname for short hostname clusterargs
-        $ClusterArg = &getFullHostname($ClusterArg) if $ClusterArg ne "default";
         &getPbsServerInfo;
         # aquire node values, build nodes hash
         &getPbsCpuValues;
-        $Cluster = $Server = &getFullHostname($Server);
-
         # aquire job values, build jobs hash
         &getPBSJobAttributes;
@@ -815,9 +818,6 @@
     } elsif ($Lrms eq "condor") {
         # CONDOR
-        # full hostname for short hostname clusterargs
-        $ClusterArg = &getFullHostname($ClusterArg) if $ClusterArg ne "default";
-
         # obtain the 'LRMSVersion'
         open CONDORVERSION, "$CbsBinPath" . "condor_version |" or
             warn ("getStaticData could not open condor_version.\n");
@@ -828,7 +828,7 @@
         # add the default pool to @Queues if no condor pools were specified
         # otherwise set the $ENV(CONDOR_CONFIG) to the first pools condor_config.
-        &setCondorDefaultPool;
+        &setCondorDefaultPoolHost;
         &getCollectorNames;
         (@AllQueues) || die "getStaticData: getLsfServerInfo: no pools found on $Cluster.\n";
@@ -858,36 +858,72 @@
 # obtain values for the attributes $Server and $LrmsVersion and all queuenames
 sub getPbsServerInfo{
-    my $clusterParam = ($ClusterArg ne "default") ? $ClusterArg : "";
-    open QSTAT, "$CbsBinPath" . "qstat -B -f $clusterParam 2>&1 |" or
-        die "getPbsServerInfo: could not open qstat.\n";
+    my $clusterParam = ($ClusterArg ne "default") ? $ClusterArg : "";
-    while(<QSTAT>) {
-        $Cluster = $Server = $1, next if /^Server:\s+(\S+)/;
-        $LrmsVersion = $1, next if /pbs_version\s+=\s+(\S+)/;
-        $MaxRunningUndefined = $1, next if /^\s+max_running\s+=\s+(\d+)/;
-        $MaxCPUTimeServer = &convertHhMmSs($1),
-            next if /^\s+resources_max.cput\s+=\s+(\S+)/;
-        $MaxWallTimeServer = &convertHhMmSs($1),
-            next if /^\s+resources_max.walltime\s+=\s+(\S+)/;
-        $DefaultCPUTimeServer = &convertHhMmSs($1),
-            next if /^\s+resources_default.cput\s+=\s+(\S+)/;
-        $DefaultWallTimeServer = &convertHhMmSs($1),
-            next if /^\s+resources_default.walltime\s+=\s+(\S+)/;
-    }
-    close QSTAT;
+    my $retry;
+    do {
+        open QSTAT, "$CbsBinPath" . "qstat -B -f $clusterParam 2>&1 |" or
+            die "getPbsServerInfo: could not open qstat.\n";
+
+        my $bad_server = 0;
+
+        while(<QSTAT>) {
+            $bad_server =1, next if /cannot connect to server/;
+            $Server = $1, next if /^Server:\s+(\S+)/;
+            $LrmsVersion = $1, next if /pbs_version\s+=\s+(\S+)/;
+            $MaxRunningUndefined = $1, next if /^\s+max_running\s+=\s+(\d+)/;
+            $MaxCPUTimeServer = &convertHhMmSs($1),
+                next if /^\s+resources_max.cput\s+=\s+(\S+)/;
+            $MaxWallTimeServer = &convertHhMmSs($1),
+                next if /^\s+resources_max.walltime\s+=\s+(\S+)/;
+            $DefaultCPUTimeServer = &convertHhMmSs($1),
+                next if /^\s+resources_default.cput\s+=\s+(\S+)/;
+            $DefaultWallTimeServer = &convertHhMmSs($1),
+                next if /^\s+resources_default.walltime\s+=\s+(\S+)/;
+        }
+        close(QSTAT);
+
+        $retry = 0;
+        if ($bad_server) {
+            if ($clusterParam ne "") {
+                $retry++;
+                $clusterParam="";
+            } else {
+                die "getPbsServerInfo: qstat could not connect to server";
+            }
+        }
+    } while($retry);
-    $Server=$Cluster=$ClusterArg if $ClusterArg ne "" && $ClusterArg ne "default";
+    # For now we always define the cluster name to be the CE name
+    $Cluster = $GlobusGatekeeperHost;
     # qstat server param
     $ServerParam = "\@$Server";
+    my $queue;
     open QSTAT, "$CbsBinPath" . "qstat -Q -f $ServerParam 2>&1 |" or
         die "getPbsServerInfo: could not open qstat.\n";
-    @AllQueues = map { /Queue:\s(\S+)/ } <QSTAT>;
-    close QSTAT;
+    while(<QSTAT>) {
+        chomp(my $line = $_);
+        if ($line =~ /Queue:\s+(\S+)/) {
+            $queue = $1;
+            push(@AllQueues,$queue);
+            next;
+        }
+        if ($line =~ /route_destinations\s+=\s+(\S+)/) {
+            my $routes = $1;
+            my @routes = split(",",$routes);
+            my %alias_hash;
+            foreach my $route (@routes) {
+                $alias_hash{$route} = 1;
+            }
+            $QueueAliases{$queue} = \%alias_hash;
+            next;
+        }
+    }
+    close(QSTAT);
-    foreach my $queue (@Queues){
+    foreach $queue (@Queues){
         die "Queue $queue does not exist\n"
             unless grep {$_ eq $queue} @AllQueues;
     }
@@ -917,28 +953,32 @@
     my $jobCount=0;
     my $type = "";
-    open NODES, "$CbsBinPath" . "pbsnodes -a -s $Cluster |" or
+    open NODES, "$CbsBinPath" . "pbsnodes -a -s $Server |" or
         die "getPbsCpuValues: could not open pbsnodes.\n";
     while(<NODES>) {
         $node = $1, next if /^(\S+)/i;
-        $state = "down", next if /^\s+state\s+=\s+.*down.*/i;
-        $state = $1, next if /^\s+state\s+=\s+(\S+)/i;
-        $cpus = $1, next if /^\s+np\s+=\s+(\d+)/i;
-        $state = $1, next if /^\s+state\s+=\s+(\S+)/i;
         $type = $1, next if /^\s+ntype\s+=\s+(\S+)/i;
+        $state = "down", next if /^\s+state\s+=\s+.*offline.*/i;
+        $state = $1, next if /^\s+state\s+=\s+(\S+)/i;
+        $cpus = $1, next if /^\s+pcpus\s+=\s+(\d+)/i;
+        $jobCount= $1, next if /^\s+resources_assigned.ncpus\s+=\s+(\d+)/i;
+        $queue = $1, next if /^\s+queue\s+=\s+(\S+)/i;
-        # PARSE Jobs line
-        if (/^\s+jobs\s+=\s+(\S+.*)$/i){
-            $jobList = $1;
-            for (map { /^\s?\d+\/(\S+)/ } split(/,/,$jobList)){
-                $jobCount++;
-                $Jobs{$_} = {};
-                push (@{$Nodes{$node}{JOBS}}, $_);
-            }
-        }
         if ( /^$/ ){
-            # end of node record found
-            if (($state !~ /down/i) && ($node ne "-")){
+            # end of node record found
+            my $qvalid = 0;
+            if (defined $queue) {
+                foreach my $testq (@Queues){
+                    if ($queue eq $testq) {
+                        $qvalid = 1;
+                    }
+                }
+            } else {
+                $qvalid = 1;
+            }
+
+            if (($state !~ /down/i) && ($node ne "-") && ($qvalid eq 1)){
+
                 #STORE NODE data
                 $Nodes{$node}{TOTALCPUS} = $cpus;
                 $Nodes{$node}{NUMJOBS} = $jobCount;
@@ -982,44 +1022,53 @@
     return ($maxa > $maxb)? $maxa : $maxb;
 }
-sub getPBSJobAttributes{
-    foreach $job (sort keys %Jobs){
-        open QSTAT, "$CbsBinPath" . "qstat -f $job$ServerParam 2>&1 |" or
+sub getPBSJobAttributes {
+    my $JobId;
+    my %local_jobs;
+
+    open QSTAT, "$CbsBinPath" . "qstat -f $ServerParam 2>&1 |" or
         warn "getPBSJobAttributes: unable to open qstat.\n";
     while (<QSTAT>){
-        $Jobs{$job}{QUEUE} = $1, next if /^\s+queue\s=\s(\S+)/;
-        $Jobs{$job}{JOBSTATE} = $1, next if /^\s+job_state\s+=\s+(\S+)/;
+        $JobId = $1, next if /^Job Id:\s+(\S+)/;
+        next if !defined $JobId;
-        $Jobs{$job}{USEDWALLTIME} = &convertHhMmSs($1),
+        $local_jobs{$JobId}{QUEUE} = $1, next if /^\s+queue\s=\s(\S+)/;
+        $local_jobs{$JobId}{JOBSTATE} = $1, next if /^\s+job_state\s+=\s+(\S+)/;
+
+        $local_jobs{$JobId}{USEDWALLTIME} = &convertHhMmSs($1),
             next if /^\s+resources_used.walltime\s=\s(\S+)/;
-        $Jobs{$job}{USEDCPUTIME} = &convertHhMmSs($1),
+        $local_jobs{$JobId}{USEDCPUTIME} = &convertHhMmSs($1),
             next if /^\s+resources_used.cput\s=\s(\S+)/;
-        $Jobs{$job}{WALLTIME} = &convertHhMmSs($1),
+        $local_jobs{$JobId}{WALLTIME} = &convertHhMmSs($1),
             next if /^\s+Resource_List.walltime\s+=\s+(\S+)/;
-        $Jobs{$job}{CPUTIME} = &convertHhMmSs($1),
+        $local_jobs{$JobId}{CPUTIME} = &convertHhMmSs($1),
             next if /^\s+Resource_List.cput\s+=\s+(\S+)/;
-        $Jobs{$job}{NODECOUNT} = $1,
+        $local_jobs{$JobId}{NODECOUNT} = $1,
             next if /^\s+Resource_List.nodect\s+=\s+(\S+)/;
     }
-    close QSTAT;
+    close(QSTAT);
+    foreach my $job (keys %Jobs) {
+        # store only running jobs
+        delete $Jobs{$job},next unless (exists $local_jobs{$job}{JOBSTATE} && $local_jobs{$job}{JOBSTATE} eq "R");
+        delete $Jobs{$job},next unless exists $local_jobs{$job}{QUEUE};
-    # set to zero if missing
-    $Jobs{$job}{USEDWALLTIME} = 0 unless exists $Jobs{$job}{USEDWALLTIME};
-    $Jobs{$job}{USEDCPUTIME} = 0 unless exists $Jobs{$job}{USEDCPUTIME};
+        foreach my $attr (keys %{$local_jobs{$job}}) {
+            $Jobs{$job}{$attr} = $local_jobs{$job}{$attr};
+        }
-    # if no nodecount was found in qstat record assume job need 1 node
-    $Jobs{$job}{NODECOUNT} = 1 unless exists $Jobs{$job}{NODECOUNT};
+        # set to zero if missing
+        $Jobs{$job}{USEDWALLTIME} = 0 unless exists $Jobs{$job}{USEDWALLTIME};
+        $Jobs{$job}{USEDCPUTIME} = 0 unless exists $Jobs{$job}{USEDCPUTIME};
-    # store only running jobs
-    delete $Jobs{$job} unless $Jobs{$job}{JOBSTATE} eq "R";
-    delete $Jobs{$job} unless exists $Jobs{$job}{QUEUE};
-    }
+        # if no nodecount was found in qstat record assume job need 1 node
+        $Jobs{$job}{NODECOUNT} = 1 unless exists $Jobs{$job}{NODECOUNT};
+    }
 }
 sub lsf_cache_cmd
@@ -1046,22 +1095,20 @@
 }
-# obtain values for the attributes $Cluster and $LrmsVersion
+# obtain values for the attributes $Cluster and $LrmsVersion
+#
 sub getLsfServerInfo{
     # obtain values for the hostname and the LRMSVersion
     open LSID, "$CbsBinPath" . "lsid 2>&1 |" or
         die "getLsfServerInfo: unable to open lsid.\n";
     while(<LSID>) {
-        $Cluster = $1, last if /^My cluster name is (\S+)/;
         $LrmsVersion = $1, next if /^LSF ([^,]+), /;
     }
     close LSID;
-    $Cluster = $ClusterArg if $ClusterArg ne "" && $ClusterArg ne "default";
-
-    $Cluster ne "" or
-        die ("getLsfServerInfo: Clustername could not be determinated.\n");
+    # For now we always define the cluster name to be the CE name
+    $Cluster = $GlobusGatekeeperHost;
     $LrmsVersion ne "-" or
         warn ("getLsfServerInfo: LrmsVersion could not be determinated.\n");
@@ -1130,7 +1177,7 @@
     my $type = "";
     # get node information
-    open LSHOSTS, "$CbsBinPath" . "lshosts 2>&1 |" or
+    open LSHOSTS, "$CbsBinPath" . "lshosts -w 2>&1 |" or
         die "generateLsfInformationPool: unable to open lshosts command.\n";
     while(<LSHOSTS>) {
         next if ! /^(\S+)\s+\S+\s+\S+\s+\S+\s+(\S+)\s+.*\((.*)\)/;
@@ -1240,22 +1287,28 @@
     }
     my %hosts_matching_resource;
-    my @queueName;
-    my $lsfhosts;
-    my $resreq;
+    my $p_queueName;
+    my $p_lsfhosts;
+    my $p_resreq;
     open BQUEUES, "$CbsBinPath" . "bqueues -l 2>&1 |" or
         die "generateLsfInformationPool: Unable to open bqueues command.\n";
     while(<BQUEUES>) {
-        $queueName = $1, next if /^QUEUE:\s+(\S+)/;
-        $lsfhosts = $1, next if /^HOSTS:\s+(.+)$/;
+        $p_queueName = $1, next if /^QUEUE:\s+(\S+)/;
+        $p_lsfhosts = $1, next if /^HOSTS:\s+(.+)$/;
+        $p_resreq = $1, next if /^RES_REQ:\s+(.+)$/;
-        next if ! /^RES_REQ:\s+(.*)$/;
-        $resreq = $1;
+        next if !(defined $p_queueName && defined $p_lsfhosts);
+        next unless /^\s*$/;
+
+        my ($queueName,$lsfhosts,$resreq) = ($p_queueName, $p_lsfhosts,$p_resreq);
+        $p_queueName = $p_lsfhosts = $p_resreq = undef;
         push(@AllQueues, $queueName);
-        next if ($lsf_queues>0 && !exists $lsf_queues{$queueName});
+
+        # Need info from all queues to calculate LSFCPUPower
+        # next if ($lsf_queues>0 && !exists $lsf_queues{$queueName});
         my @hosts_in_queuespec;
         if ($lsfhosts =~ /all hosts used by the LSF Batch system/) {
@@ -1289,13 +1342,13 @@
             }
         }
-        if ($resreq ne "") {
+        if (defined $resreq) {
             my $hosts_selected_ref;
             if (! exists $hosts_matching_resource{$resreq}) {
                 my %hosts_selected;
-                open LSHOSTS, $CbsBinPath."lshosts -R \"".$resreq."\" 2>&1 |" or
+                open LSHOSTS, $CbsBinPath."lshosts -w -R \"".$resreq."\" 2>&1 |" or
                     die "generateLsfInformationPool: unable to open lshosts command.\n";
                 while(<LSHOSTS>) {
                     next if ! /^(\S+)\s+\S+\s+\S+\s+\S+\s+(\S+)\s+.*\((.*)\)/;
@@ -1329,15 +1382,14 @@
 }
-# Add the default pool to @Queues if no condor pools were specified
-# Set the cluster variable to the first Condor pool in @Queues,
-# if no ClusterArg was specified in commandline.
-
-sub setCondorDefaultPool {
+# Add the default condor pool host to @Queues if no condor hosts were specified
+#
+sub setCondorDefaultPoolHost {
     my $condorConfigFile = "";
     # use the hosts default config & localconfigfile to print information
-    # of STANDARD pool if no pools are specified.
+    # of STANDARD pool host if no pool hosts are specified.
+
     unless (@Queues){
         # these 3 ways lead to condor_config
         if (exists $ENV{CONDOR_CONFIG}) {
@@ -1351,17 +1403,17 @@
         }
         $ENV{"CONDOR_CONFIG"} = $condorConfigFile;
-        $Cluster = &getFullHostname(`"$CbsBinPath"condor_config_val CONDOR_HOST 2>&1`);
-        chomp($Cluster);
-        push (@Queues,$Cluster);
-    } else {
-        $Cluster=&getFullHostname($Queues[0]);
+        chomp(my $default_host = `"$CbsBinPath"condor_config_val CONDOR_HOST 2>&1`);
+        $default_host = getFullHostname($default_host);
+        push (@Queues,$default_host);
     }
-    $Cluster = $ClusterArg if $ClusterArg ne "default";
+
+    # For now we always define the cluster name to be the CE name
+    $Cluster = $GlobusGatekeeperHost;
 }
-# Get all CENames from the pools specified by -queue switch
+# Get all CENames from the pool hosts specified by -queue switch
 # and store them to @AllQueue array using the COLLECTOR_NAME attribute of the
 # condor hosts local config file.
 # Print a warning message if a collector name
@@ -1382,6 +1434,7 @@
         if ($val !~ /^(\S+)/ || $val eq "Host not found.\n"){
             warn "getCollectorNames:Host not found for $Queues[$pool].\n";
             $Queues[$pool] = $Queues[0]; shift @Queues;
+            $pool--;
             next;
         }
         $Queues[$pool] = $1;
@@ -1610,6 +1663,24 @@
         $lastDn = $_;
+        # We rewrite the cluster/subcluster IDs if the cluster name we're
+        # using doesn't match the one the user supplied.
+        # (NB For now we always define cluster as gatekeeper host, because of WP1 requirement)
+
+        if ($lastDn =~ /GlueClusterUniqueID=([^(,|\s)]+)/) {
+            my $cluster_id = $1;
+            if ($cluster_id eq $ClusterArg) {
+                $lastDn =~ s|GlueClusterUniqueID=$cluster_id|GlueClusterUniqueID=$Cluster|m;
+            }
+        }
+        if ($lastDn =~ /GlueSubClusterUniqueID=([^(,|\s)]+)/) {
+            my $subcluster_id = $1;
+            if ($subcluster_id eq $ClusterArg) {
+                $lastDn =~ s|GlueSubClusterUniqueID=$subcluster_id|GlueSubClusterUniqueID=$Cluster|m;
+            }
+        }
+        $_ = $lastDn;
+
         #try to read a attribute or initialize with "-"
         $fileSys = /GlueHostRemoteFileSystemName=([^,\s]+)/ ? $1 : "-";
         $cl = /GlueClusterUniqueID=([^,\s]+)/ ? $1 : "-";
@@ -2163,8 +2234,13 @@
     foreach $allQueue (@AllQueues){
         $dn = &getDnCEString(&getCeid,$allQueue,$MdsArg);
-        foreach $job (grep {$Jobs{$_}{QUEUE} eq $allQueue} keys %Jobs)
-        { push @{$AllCE{$dn}{JOBS}}, $job; }
+        foreach $job (keys %Jobs) {
+            my $jc = $Jobs{$job}{QUEUE};
+            if ($jc eq $allQueue ||
+                (exists $QueueAliases{$jc} && exists $QueueAliases{$jc}->{$allQueue})) {
+                push(@{$AllCE{$dn}{JOBS}}, $job);
+            }
+        }
     }
 }
@@ -2186,7 +2262,7 @@
             push @{$AllCE{$dn}{NODES}}, $node;
         }
     }
-}
+}
 =head2 glueCEToQueue
@@ -2262,7 +2338,9 @@
     if ($Lrms eq "pbs"){
         $cpus=0;
-        $cpus += $Nodes{$_}{TOTALCPUS} for grep {defined $Nodes{$_}{TOTALCPUS}} keys %Nodes;
+        foreach $node (keys %Nodes) {
+            $cpus += $Nodes{$node}{TOTALCPUS} if exists $Nodes{$node}{QUEUE}{$allQueue} && defined $Nodes{$node}{TOTALCPUS};
+        }
     } elsif ($Lrms eq "lsf"){
         $cpus = 0;
         # add cpus if queue uses $node
@@ -2652,34 +2730,70 @@
         $dn = &getDnCEString(&getCeid,$allQueue,$MdsArg);
         $foundOneOrMoreQueues=0;
-        open QUEUES, "${CbsBinPath}qstat -Q -f $allQueue$ServerParam 2>&1 |" or
-            die "glueCEStateToQueue: could not open qstat.\n";
-        # parse queue
-        while(<QUEUES>) {
-            # if error due to unknown queue, leave the loop
-            warn ($_), last if /unknown queue/i;
+        my %queryQueues;
+        $queryQueues{$allQueue} = 1;
+        if (exists $QueueAliases{$allQueue}) {
+            my $alias_hash = $QueueAliases{$allQueue};
+            foreach my $alias (keys %$alias_hash) {
+                $queryQueues{$alias} = 1;
+            }
+        }
-            # We found a queue!
-            if (/^Queue:\s+(\S+)/) {
+        my $totalJobs = 0;
+        my $running = 0;
+        my $enabled = $started = undef;
+        my $foundOneOrMoreQueues = 0;
+
+        foreach my $queue (keys %queryQueues) {
+
+            open QUEUES, "${CbsBinPath}qstat -Q -f $queue$ServerParam 2>&1 |" or
+                die "glueCEStateToQueue: could not open qstat.\n";
+
+            my $i_totalJobs = 0;
+            my $i_running = 0;
+            my $i_enabled = $i_started = 0;
+            my $i_foundOneOrMoreQueues = 0;
+
+            # parse queue
+            while(<QUEUES>) {
+                # if error due to unknown queue, leave the loop
+                warn ($_), last if /unknown queue/i;
+
+                # We found a queue!
+                if (/^Queue:\s+(\S+)/) {
                 #INIT
-                $totalJobs = 0;
-                $running = 0;
-                $enabled = $started = "false";
-                #save queue information
-                $foundOneOrMoreQueues = 1;
-                next;
-            }
+                    #save queue information
+                    $i_foundOneOrMoreQueues = 1;
+                    next;
+                }
                 #FETCH PARAM
-            $totalJobs = $1, next if /total_jobs\s+=\s+(\S+)/i;
+                $i_totalJobs = $1, next if /total_jobs\s+=\s+(\S+)/i;
-            $enabled = $1 eq "True", next if /enabled\s+=\s+(\S+)/i;
+                $i_enabled = $1 eq "True", next if /enabled\s+=\s+(\S+)/i;
+
+                $i_started = $1 eq "True", next if /started\s+=\s+(\S+)/i;
-            $started = $1 eq "True", next if /started\s+=\s+(\S+)/i;
+                $i_running= $1, next if /(\d+)\s+Exiting:\d+\s$/;
-            $running= $1, next if /(\d+)\s+Exiting:\d+\s$/;
+            }
+            close QUEUES;
+            $foundOneOrMoreQueues = 1 if $i_foundOneOrMoreQueues;
+
+            if ($i_foundOneOrMoreQueues) {
+                $totalJobs += $i_totalJobs;
+                $running += $i_running;
+                $enabled = 1 if (!defined $enabled && $i_enabled);
+                $started = 1 if (!defined $started && $i_started);
+                if ($queue eq $allQueue) {
+                    $enabled = 0 if !$i_enabled;
+                    $started = 0 if !$i_started;
+                }
+            }
         }
-        close QUEUES;
+
+        $enabled = 0 if !defined $enabled;
+        $started = 0 if !defined $started;
         #STORE QUEUE DATA
         if ($foundOneOrMoreQueues){
@@ -2704,7 +2818,9 @@
             unless exists $AllCE{$dn}{GlueCEStateStatus};
         $freeCpus = 0;
-        $freeCpus += $Nodes{$_}{FREECPUS} for keys %Nodes;
+        foreach my $node (keys %Nodes) {
+            $freeCpus += $Nodes{$node}{FREECPUS} if exists $Nodes{$node}{QUEUE}{$allQueue};
+        }
         $AllCE{$dn}{GlueCEStateFreeCPUs} = $freeCpus
             unless exists $AllCE{$dn}{GlueCEStateFreeCPUs};
@@ -4031,3 +4147,4 @@
 '-ttl ' specifies the value for entryTtl
 __USAGE__
 }
+
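
The worker-node steps described before Appendix A (the PBS_SCP line in /etc/pbs.conf, the executable scp.fix wrapper, and the PBSpro service entries) can be checked with a few shell helpers. This is only a minimal sketch and not part of the original setup: the function names pbs_scp_setting, check_wrapper and has_service are hypothetical, and all file locations are passed in as arguments rather than assumed.

```shell
#!/bin/bash
# Hypothetical sanity-check helpers for a WN; names are illustrative only.

# Print the value of PBS_SCP from a pbs.conf-style file (empty if not set).
# If the key appears more than once, the last occurrence is printed.
pbs_scp_setting() {
    local conf="$1"
    sed -n 's/^PBS_SCP=//p' "$conf" | tail -n 1
}

# Succeed only if PBS_SCP is set and points to an existing executable file.
check_wrapper() {
    local conf="$1" wrapper
    wrapper="$(pbs_scp_setting "$conf")"
    [ -n "$wrapper" ] && [ -x "$wrapper" ]
}

# Succeed if a services-style file maps the given service name to the
# given port/protocol pair, e.g. "pbspro_mom 15012/tcp".
has_service() {
    local file="$1" name="$2" port="$3"
    awk -v n="$name" -v p="$port" \
        '$1 == n && $2 == p { found = 1 } END { exit !found }' "$file"
}
```

On a WN configured as described above, `check_wrapper /etc/pbs.conf` should succeed. Assuming the xinetd object writes its entries to /etc/services, `has_service /etc/services pbspro_mom 15012/tcp` should succeed as well.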