Installing and Configuring OSG Resource and Service Validation (RSV) Service

Table of Contents

  1. Introduction
  2. Getting RSV package
  3. Configuring and Starting RSV services
  4. RSV Output: Results of running the probes
  5. Testers: What else to look for?
  6. Using Existing Condor 6.9.x Install
  7. General overview of the Directory Hierarchy


Introduction

The OSG-RSV package is now available through VDT 1.8.1 (which corresponds to OSG Production CE 0.8.0 and ITB CE 0.7.1). The package includes:

  1. The default probes (all the probes in the RSV probes version 1.5.0, 2007-10-24).
  2. A condor-cron based scheduling infrastructure: This includes:
    • Condor 6.9.4, necessary to get the condor-cron functionality. Note that this package is pulled in only if it's not already installed. To force RSV to use a pre-existing Condor 6.9.x install, refer to the Using Existing Condor 6.9.x Install section below, before you begin installing your CE (or RSV if you're installing it separately on a monitoring host)!
    • A configuration and submission system for the probes (on a per host basis, if you're running RSV on a monitoring host to monitor multiple CEs)
    • A mechanism to add new probes, and alter configuration options to the existing probes
    • A consumer system that digests the probe output to generate HTML pages locally for a site admin's reference, and also uploads probe output data into a central RSV database maintained by the GOC
  3. Some accompanying Gratia bits that the above consumer uses to upload data to the central RSV database.

If installed on a monitoring host by itself, without the OSG CE, this package will also bring along Globus bits it needs to run the probes, and core Gratia modules to be able to upload probe results to a central RSV database.


Getting RSV package

Installing RSV package along with fresh 0.8.0 CE install (or ITB 0.7.1)

You do not need to do anything extra, just install the CE, and the OSG-RSV package is brought along with it.

Note: By default, OSG-RSV brings along Condor 6.9.4, which is necessary to get the condor-cron functionality. To force RSV to use a pre-existing Condor 6.9.x install, refer to the Using Existing Condor 6.9.x Install section below before you begin installing your CE (or RSV, if you're installing it separately on a monitoring host)!

Installing RSV package on separate monitoring host

You can get the RSV package and its dependencies by doing the following. Also, see note above about using pre-existing Condor 6.9.x installs.

pacman -get http://vdt.cs.wisc.edu/vdt_181_cache:OSG-RSV

For ITB resources only: Installing RSV package on resource with existing ITB 0.7.1 CE

This section only applies to ITB testers. It is possible to upgrade the RSV package on an existing ITB 0.7.1 CE which is based on VDT 1.8.1. Please refer to the VDT update instructions here.


Configuring and Starting RSV services

There is a default configuration that the VDT package can provide but we recommend you customize the configuration of RSV using the following steps.

Note: Before we begin, an important but probably obvious note: The only step in the following configuration and starting-service sections that you need to do as a non-root user is the proxy creation. Every other step needs to be done as root.

Copy over sample_metrics.conf file and edit it

We provide a sample metrics.conf file, on the web or in your VDT installation at $VDT_LOCATION/osg-rsv/config/, with recommended time-intervals for various metrics.

Copy that file over for each of the hosts you want RSV to monitor:

$ cp $VDT_LOCATION/osg-rsv/config/sample_metrics.conf \
     $VDT_LOCATION/osg-rsv/config/<FQDN_of_monitored_host>_metrics.conf
and make further edits to it as described below.

Note: If you have a metrics.conf or <FQDN_of_monitored_host>_metrics.conf from a previously configured RSV, say in ITB 0.7.0, then you can copy that file over to $VDT_LOCATION/osg-rsv/config/<FQDN_of_monitored_host>_metrics.conf and skip along to the next step. We do recommend that you set the time intervals at which the various probes run to values similar to those in the sample_metrics.conf file we provide (see above).
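
When RSV monitors several hosts, the copy step above can be repeated in a loop. The following is a minimal sketch, not part of the RSV package: the function name copy_metrics_conf and the host names in the usage comment are made up for illustration. It leaves any existing per-host file (for example, one carried over from an earlier install) untouched.

```shell
# Usage: copy_metrics_conf CONFIG_DIR HOST [HOST...]
# Copies CONFIG_DIR/sample_metrics.conf to CONFIG_DIR/<HOST>_metrics.conf
# for each HOST, skipping hosts that already have a per-host file.
copy_metrics_conf() {
    config_dir=$1
    shift
    for host in "$@"; do
        target="$config_dir/${host}_metrics.conf"
        if [ -e "$target" ]; then
            echo "kept existing $target"
        else
            cp "$config_dir/sample_metrics.conf" "$target" || return 1
            echo "created $target"
        fi
    done
}

# Example (hypothetical host names):
#   copy_metrics_conf "$VDT_LOCATION/osg-rsv/config" ce1.example.edu ce2.example.edu
```

Each created file still needs to be edited by hand as described in the next section.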

Customize metrics config file

You can configure, on a per host basis:

  1. Enable/Disable a particular metric: What metrics you'd like to run against your resource.

    Any line that starts with on is enabled; change that keyword to off to disable a metric.

    For example, if your site only runs fork, PBS and condor job-managers, then you might disable the other jobmanagers-status probes.

    on jobmanagers-status-probe@org.osg.batch.jobmanager-condor-status 2 */2 * * *
    on jobmanagers-status-probe@org.osg.batch.jobmanager-fork-status 9 */2 * * *
    off jobmanagers-status-probe@org.osg.batch.jobmanager-loadleveler-status 19 */2 * * *
    off jobmanagers-status-probe@org.osg.batch.jobmanager-lsf-status 32 */2 * * *
    off jobmanagers-status-probe@org.osg.batch.jobmanager-managedfork-status 33 */2 * * *
    on jobmanagers-status-probe@org.osg.batch.jobmanager-pbs-status 24 */2 * * *
  2. Time interval between probe metric runs: How often would you like to run each of those metrics? The values after the metric name are cron-style time fields. For example, */2 as the fourth value in any line indicates that the metric (probe) will run every two hours. If you'd like the ping probe (for example) to run every 15 minutes, then you'd edit that line to read something like this:
    on ping-host-probe@org.osg.general.ping-host */15 * * * *

    Note to ITB testers only: For testing purposes, we strongly recommend you set at least 2 or 3 of the metrics to run really often, say, every 10 minutes. This will cause one of the output files, mentioned later in this document, to be generated earlier (so you won't have to wait too long).

The header field in the sample_metrics file (or in any of the generated files) has more information on what each field means.
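
Switching a metric on or off can also be scripted instead of edited by hand. The sketch below assumes the line layout shown in the examples above (the on/off keyword, then <probe>@<metric>, then the time fields); the function name set_metric_state is hypothetical and not part of RSV. It writes the modified file to standard output and does not touch the original.

```shell
# Usage: set_metric_state METRICS_CONF METRIC_NAME on|off
# Prints METRICS_CONF with the on/off keyword replaced on every line whose
# second field is <probe>@METRIC_NAME. All other lines pass through unchanged.
set_metric_state() {
    awk -v metric="$2" -v state="$3" '
        ($1 == "on" || $1 == "off") && split($2, part, "@") == 2 && part[2] == metric {
            $1 = state   # rebuilds the line with single spaces between fields
        }
        { print }
    ' "$1"
}

# Example: disable the LSF job-manager status metric, then review the result
# before replacing the real file:
#   set_metric_state host_metrics.conf org.osg.batch.jobmanager-lsf-status off > host_metrics.conf.new
```

Writing to a separate file first makes it easy to compare the result with the original before RSV is reconfigured.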


OPTIONAL STEPS

Additional command line arguments for the probes

There are files named <probe-file-name>@<metric-name>.spec inside $VDT_LOCATION/osg-rsv/specs/global-specs and/or $VDT_LOCATION/osg-rsv/specs/<FQDN_of_monitored_host>.

  • You can specify additional command-line arguments for each probe on a global (all hosts) or a per host basis, in these files. Possible options are listed on each probe's usage information; type $VDT_LOCATION/osg-rsv/bin/probes/<probe-name> -h for more information.
  • Important Note: The .spec files must NOT contain the following switches:
    • -m <metricname>
    • --uri <hostname>
    • -x <proxy file>
    These switches are added to the Condor submission file by the configure_osg_rsv script.

If you make a change to a spec file, it can be reloaded into Condor by stopping the osg-rsv service, running configure_osg_rsv, and then starting the service again. The commands to be used are listed in the Changes to the .conf file or the .spec files section below.
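
Before reloading, it may be worth checking that no .spec file contains one of the three reserved switches. The following sketch assumes the .spec files hold the extra switches as plain text; the function name find_forbidden_switches is made up for illustration and is not an RSV tool.

```shell
# Usage: find_forbidden_switches SPEC_FILE [SPEC_FILE...]
# Prints file:line:text for every line that uses -m, --uri or -x as a
# stand-alone switch. Exit status 0 means at least one such line was found.
find_forbidden_switches() {
    # /dev/null is listed so that grep always prefixes matches with the file name.
    grep -En -e '(^|[[:space:]])(-m|--uri|-x)([[:space:]=]|$)' /dev/null "$@"
}

# Example:
#   find_forbidden_switches "$VDT_LOCATION"/osg-rsv/specs/global-specs/*.spec
```

No output and a non-zero exit status mean none of the reserved switches were found.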

Local Status Page Archival Configuration

The status.html page that is generated locally for a site admin's viewing will also have a link to archives. By default, the pages are rotated at midnight every night, and archives are kept for 7 days. This can be reconfigured by editing the following files:

  • $VDT_LOCATION/osg-rsv/submissions/consumers/rotate_html_files.sub to modify when the log-rotate job is run, and how often the logs are rotated.

    Note that the above submission file will be regenerated every time configure_osg_rsv is run with the --consumers flag passed, and when it is regenerated, the default will be restored.

  • $VDT_LOCATION/osg-rsv/logs/logrotate/rotate_html_files.conf.tmpl to modify the number of days archived logs are kept.


RUN CONFIGURE_OSG_RSV SCRIPT

Run RSV configuration scripts as shown below to generate submission files for the probes. Submission files for the consumers are automatically generated at installation.

You will need to configure which username the probes run as (by default, RSV tries to run as a user called rsvuser). We recommend you use another non-root username, possibly your personal site-admin account; it must be a valid Unix account with a valid shell.

To verify that the RSV user account works properly, try running the following command as root:

su -c "/bin/date" rsvuser

If system date is not printed, and/or the exit value of the above command is non-zero, then RSV probes will not be able to run on your host.

Also, ensure that you do not set CONDOR_CONFIG to your production condor (if applicable) in global .[t]cshrc or .bashrc files. Or if you do, then create a .[t]cshrc/.bashrc in the home directory of the RSV user and set CONDOR_CONFIG to ${VDT_LOCATION}/condor-devel/etc/condor_config. This will avoid any mix ups by way of which RSV jobs may end up in your production Condor queue -- see related VDT ticket.

For example, let us say you have a valid non-root user on your host named rsvuser, and its UID is 550 in the /etc/passwd file. Then, run the configure_osg_rsv script as follows, after replacing the example values below with values that apply to your resource:

Configure Init script -- what user to run RSV under

$VDT_LOCATION/vdt/setup/configure_osg_rsv \
 --user rsvuser --init --server y 

The above command will also register the RSV service and force it to run under the username provided.

Configure running of probes for each of your CEs (separated by a space as shown below)

$VDT_LOCATION/vdt/setup/configure_osg_rsv \
 --uri "<FQDN_of_host1_to_probe> <FQDN_of_host2_to_probe>" \
 --proxy /tmp/x509up_u550 \
 --probes --gratia --verbose

The above execution of configure_osg_rsv will generate job submission files for the RSV probes to monitor host1 and host2; the files will include switches to the probes to create Gratia data uploading scripts and to print verbose information in .err files. The script will also auto-generate a <FQDN>_metrics.conf file for the host you specify, if the file does not exist already. The above command also assumes that /tmp/x509up_u550 is owned by rsvuser and has 600 permissions.
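
The ownership and permission assumption on the proxy file can be checked before running the script. This is a small sketch using standard find tests; the function name proxy_file_ok is hypothetical and not part of RSV.

```shell
# Usage: proxy_file_ok PROXY_FILE USERNAME
# Succeeds only if PROXY_FILE is a regular file owned by USERNAME (a user
# name or numeric UID) whose permissions are exactly 600.
proxy_file_ok() {
    [ -n "$(find "$1" -type f -user "$2" -perm 600 2>/dev/null)" ]
}

# Example, using the values from the text above:
#   proxy_file_ok /tmp/x509up_u550 rsvuser || echo "fix owner or permissions first"
```

The check says nothing about whether the proxy is still valid, only that the file is owned and protected as assumed.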

Note about timestamp in local timezone: If you'd like your local status pages to show timestamps in your local time zone, then add the --print-local-time switch to the above execution.

Configure local status pages to be served by VDT Apache

Note: This is an optional step. The following execution of configure_osg_rsv will edit the VDT Apache config file to serve local site-level RSV status pages on the web.

$VDT_LOCATION/vdt/setup/configure_osg_rsv \
 --setup-for-apache

If you execute this step, then after you start RSV and probe results are available, you will be able to go to the URL http://<Host's FQDN>:8080/rsv/, on a web browser that has your OSG approved cert loaded in it, to view status pages. More information on the status page here.

Note: You can combine all the switches used in the above three executions of the configure_osg_rsv script in one execution; the options are shown above in separate executions for clarity. The following webpage has more information about the configure_osg_rsv script.

Configure Gratia's metric probe to report to the appropriate RSV collector (see below):

Note: At the configure_osg.sh stage, if you answered 'y' to the question if the Gratia Metric Probe should be configured, then you can skip this step. Or you could be safe, and execute it anyway :-)

$VDT_LOCATION/vdt/setup/configure_gratia \
 --probe metric \
 --report-to <Appropriate_RSV_collector_shown_below:port>

RSV Collector to use

  • Production: rsv.grid.iu.edu:8880
  • ITB: rsv-itb.grid.iu.edu:8880


Start Condor-devel and RSV

Valid user proxy?: Before you start the osg-rsv service, make sure you have a valid proxy under the username you specified in the above configure_osg_rsv script execution.

Note: Also, you need to ensure the proxy is renewed before it expires. For example, you might create a proxy, that's valid for 8 days, every Monday morning (which gives you the cushion of one day in case you forget to renew it on Monday). Background: We toyed with the idea of using MyProxy but decided to shelve it for the initial release. We've also had a couple of suggestions from our WLCG friends about other options but we've not reviewed anything seriously yet. For now, manual renewal of the user proxy is required.

As root, start up Condor-devel and then deploy the probes and consumers with the following commands.

Note: You need to stop the osg-rsv service if it is already running before you start it again; similarly, if condor-devel is not started, you need to start it. It might be best to (attempt to) turn both off before starting them:

vdt-control --off osg-rsv
vdt-control --off condor-devel

Then turn both on, Condor-devel first:

vdt-control --on condor-devel
vdt-control --on osg-rsv

This will push the RSV jobs into Condor under the username you configured in the previous step. The log files for the probes and consumers can be found in $VDT_LOCATION/osg-rsv/logs/probes and $VDT_LOCATION/osg-rsv/logs/consumers respectively.

If you configured RSV to use the VDT Apache to serve local status pages, then enable apache and then turn it on.

$VDT_LOCATION/vdt/setup/configure_apache --server y
vdt-control --on apache

Changes to the .conf file or the .spec files

Just remember, any changes you make to the <FQDN_of_monitored_host>_metrics.conf file or any of the .spec files, require you to stop RSV, run its configure script again, and then start it.

vdt-control --off osg-rsv
# configure_osg_rsv for the probes, as shown above in the
#   Run configure_osg_rsv script section
vdt-control --on osg-rsv

Unregister RSV service

Run the above configure_osg_rsv script with the argument --server n to unregister the RSV service.


RSV Output: Results of running the probes

Once you have installed and configured RSV as documented above, there are a couple of pieces of output you can expect.

Status page for local site admin

A webpage containing each day's results is created daily. By default, this page is $VDT_LOCATION/osg-rsv/output/html/status.html, but this can be reconfigured using the configure_osg_rsv script. You can either view the local status pages directly on the monitoring host (assuming you have an HTML viewer/browser installed on it), or you can configure RSV to have the pages served on the web. Once again, beware that you need to have an OSG approved cert loaded in your web browser to be able to view these pages on the web.

The status pages should look similar to what is documented here: sample status page.

Note: Depending on how often you chose to run the various metrics in the <FQDN_of_monitored_host>_metrics.conf file, you might have to wait 5-30 mins before anything shows up in the status.html file, or even before the file shows up for the first time.

Gratia Record Uploading

Unfortunately, as of the date of writing this document, we do not have a decent way of looking up database records of probe output data that might be uploaded from your resource to the central RSV database.

You can still confirm that the records are being uploaded properly by looking at the Gratia consumer log file: $VDT_LOCATION/osg-rsv/logs/consumers/gratia-script-consumer.out. You should see a number of entries similar to this:

08-20-2007 16:35:26 - Executing script '/usr/local/grid/ . . . record.py'
OK

If you're really interested in finding out if database records have been uploaded, then please contact the RSV developers, and we will be happy to query the database, and tell you if your records were uploaded properly.



Testers: What else to look for?

Once you have RSV installed and configured, as an ITB tester (or a site admin once RSV goes into production), you can look for a couple of things:

Where are the condor-cron jobs?

You can look at the jobs submitted through condor-cron, by running the condor-devel version of condor_q. You probably want to do this in a separate shell:

. $VDT_LOCATION/vdt/etc/condor-devel-env.sh
condor_q

Note: Most of the time, you'll notice only two of those jobs are running: html-consumer and gratia-script-consumer; these jobs consume the output of the probes, and create the status.html pages, while also uploading RSV database records. The rotate_html_files job will run once every night. The RSV metric probe jobs will run periodically as per your setup (in <FQDN_of_monitored_host>_metrics.conf), and the load on the node.

Output File(s)

As already mentioned, look at $VDT_LOCATION/osg-rsv/output/html/status.html. If you have a web browser on the monitoring host that the file is on, then you could look at the file using it.

It should link all your probe-metric results to an index.html page on a per host basis (for all the hosts you have RSV configured to monitor). As mentioned earlier, it should also have a link to archives of the results; and should look similar to what is documented here: sample status page.

Gratia Records

As stated earlier, you can confirm that the records are being uploaded properly by contacting the RSV developers, or by looking at the Gratia consumer log file: $VDT_LOCATION/osg-rsv/logs/consumers/gratia-script-consumer.out. You should see a number of entries similar to this:

08-20-2007 16:35:26 - Executing script '/usr/local/grid/ . . . record.py'
OK

Possible Gratia log error

You might have seen an error in $VDT_LOCATION/osg-rsv/logs/consumers/gratia-script-consumer.out similar to this:

08-23-2007 10:59:01 - Executing script
'. . . /osg-rsv/output/gratia/2007-08-23T17:57:00Z-. . .gratia-record.py'
Gratia: Unable to log to file:
. . . /gratia/var/logs/2007-08-23.log   (<class exceptions.OSError
. . .
Operation not permitted: '. . . gratia/var/logs/2007-08-23.log'
OK

This error should not occur any more following bug fixes. Please let the RSV team know if you see it.
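
A quick way to scan the consumer log for this error is sketched below. It relies only on the two log excerpts shown in this document ("Executing script" lines and "Gratia: Unable to log to file" lines); the real log may contain other message types that this sketch ignores. The function name summarize_gratia_log is made up for illustration.

```shell
# Usage: summarize_gratia_log LOG_FILE
# Prints "executed=N failed_to_log=M", where N counts "Executing script"
# lines and M counts "Gratia: Unable to log to file" lines.
summarize_gratia_log() {
    awk '
        / - Executing script/            { executed++ }
        /^Gratia: Unable to log to file/ { failed++ }
        END { printf "executed=%d failed_to_log=%d\n", executed, failed }
    ' "$1"
}

# Example:
#   summarize_gratia_log "$VDT_LOCATION/osg-rsv/logs/consumers/gratia-script-consumer.out"
```

A non-zero failed_to_log count is the case worth reporting to the RSV team.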

Verbose information from probes if --verbose is turned on

If you provide the --verbose switch to the configure_osg_rsv script, then you can look for verbose information printed in the .err files in $VDT_LOCATION/osg-rsv/logs/probes/.

Note that content in .err files does not necessarily indicate errors in the probe execution. Rather, it helps the RSV developers investigate any errors that may occur.



Using Existing Condor 6.9.x Install

If you want RSV to use a pre-existing Condor 6.9.x installation (instead of pulling in its own condor-devel, a.k.a. Condor 6.9.4), then be sure the following variable is set appropriately. The Condor 6.9.x bin/, sbin/, etc/, lib/ ... directories should be directly under this location, e.g. /opt/condor-6.9.3.

export VDTSETUP_CONDOR_DEVEL_LOCATION=/my/condor-6.9.x/location/
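
Before starting the install, the directory can be checked for the subdirectories named above. This is a minimal sketch that only tests for bin/, sbin/, etc/ and lib/; it does not verify the Condor version, and the function name check_condor_location is hypothetical.

```shell
# Usage: check_condor_location DIR
# Prints each expected subdirectory that is missing under DIR and
# succeeds only if none is missing.
check_condor_location() {
    status=0
    for sub in bin sbin etc lib; do
        if [ ! -d "$1/$sub" ]; then
            echo "missing: $1/$sub"
            status=1
        fi
    done
    return $status
}

# Example:
#   check_condor_location "$VDTSETUP_CONDOR_DEVEL_LOCATION" && echo "layout looks right"
```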

Then return to the Getting RSV package section or the Configuring and Starting RSV services section of this document.



General overview of the Directory Hierarchy

Here's an overview of the directory hierarchy inside $VDT_LOCATION/osg-rsv when you install the OSG-RSV package:

  1. $VDT_LOCATION/osg-rsv/config

    Contains the .conf file(s) describing which metrics to run, and when, for each host you'd like to monitor. Note that this directory is a sym-link to $VDT_LOCATION/vdt-app-data/osg-rsv/config

  2. $VDT_LOCATION/osg-rsv/bin/{probes|consumers}

    Where the probes and consumer executables are stored.

  3. $VDT_LOCATION/osg-rsv/submissions/{probes|consumers}
    • The probe and consumer Condor submission files (for all the hosts that you configure RSV to monitor). These are generated at installation time by $VDT_LOCATION/vdt/setup/configure_osg_rsv.
    • This is what is submitted into Condor (note: Condor-devel as of now) when vdt-control --on osg-rsv is executed.
  4. $VDT_LOCATION/osg-rsv/specs/
    • As explained earlier in the Additional command line arguments for the probes section, these are the extra command-line argument files for the probes and consumers. Note that this directory is a sym-link to $VDT_LOCATION/vdt-app-data/osg-rsv/specs
    • There are two types of specs:
      1. Global specs: If you'd like to use some probe's command line switch(es) on all the hosts you monitor, then you can include those switch(es) in the <probe-file-name>@<metric-name>.spec file within the $VDT_LOCATION/osg-rsv/specs/global-specs directory.
      2. Host Specific specs: Note this option does not apply to you if you're only monitoring one host! If you'd like to use some probe's command line switch(es) when the probe runs only on a particular host you monitor, then you can include those switch(es) in the <probe-file-name>@<metric-name>.spec file within the $VDT_LOCATION/osg-rsv/specs/<hostname> directory.
    • Please be sure to read the caveat about not adding certain switches in these files in the above section.
  5. $VDT_LOCATION/osg-rsv/output

    This is a placeholder directory for any output produced by the consumers. Right now we have a sample consumer that will generate simple result webpages based on the probe's output status. You can see an example of the sample consumer's output here.

    Gratia record uploading scripts are also stored in a sub-directory within this directory, by default.

  6. $VDT_LOCATION/osg-rsv/logs/{probes|consumers|logrotate}

    Where Condor stores the log files for the probes and the consumers; and also where logrotation related information is stored.