16.5. Portable Suites¶
A portable or interoperable suite can run “out of the box” at different sites, or in different environments such as research and operations within a site. For convenience we just use the term site portability.
Lack of portability is a major barrier to collaborative development when sites need to run more or less the same workflow, because it is very difficult to translate changes manually between large, complicated suites.
Most suites are riddled with site-specific details such as local build configurations, file paths, host names, and batch scheduler directives, etc.; but it is possible to cleanly factor all this out to make a portable suite. Significant variations in workflow structure can even be accommodated quite easily. If the site workflows are too different, however, you may decide that it is appropriate for each site to maintain separate suites.
The recommended way to do this, which we expand on below, is:
- Put all site-specific settings in include-files loaded at the end of a generic “core” suite definition.
- Use “optional” app config files for site-specific variations in the core suite’s Rose apps.
- (Make minimal use of inlined site switches too, if necessary).
- When referencing files, reference them within the suite structure and use an install task to link external files in.
The result should actually be tidier than the original in one respect: all the messy platform-specific resource directives etc., will be hidden away in the site include-files.
16.5.1. The Jinja2 SITE Variable¶
First a suite Jinja2 variable called SITE
should be set to the site
name, either in rose-suite.conf
, or in the suite definition itself
(perhaps automatically, by querying the local environment in some way).
#!Jinja2
{% set SITE = "niwa" %}
#...
This will be used to select site-specific configuration, as described below.
16.5.2. Site Include-Files¶
If a section heading in a suite.rc file is repeated the items under it simply add to or override those defined under the same section earlier in the file (but note List Item Override In Site Include-Files). For example, this task definition:
[runtime]
[[foo]]
script = run-foo.sh
[[[remote]]]
host = hpc1.niwa.co.nz
can equally be written like this:
[runtime] # Part 1 (site-agnostic).
[[foo]]
script = run-foo.sh
[runtime] # Part 2 (site-specific).
[[foo]]
[[[remote]]]
host = hpc1.niwa.co.nz
Note
If Part 2 had also defined script
the new value would
override the original. It can sometimes be useful to set a widely used
default and override it in a few cases, but be aware that this can
make it more difficult to determine the origin of affected values.
In this way all site-specific [runtime]
settings, with their
respective sub-section headings, can be moved to the end of the file, and then
out into an include-file (file inclusion is essentially just literal inlining):
#...
{% set SITE = "niwa" %}
# Core site-agnostic settings:
#...
[runtime]
[[foo]]
script = run-foo.sh
#...
# Site-specific settings:
{% include 'site/' ~ SITE ~ '.rc' %}
where the site include-file site/niwa.rc
contains:
# site/niwa.rc
[runtime]
[[foo]]
[[[remote]]]
host = hpc1.niwa.co.nz
16.5.3. Site-Specific Graphs¶
Repeated graph
strings under the same graph section headings are
always additive (graph strings are the only exception to the normal repeat item
override semantics). So, for instance, this graph:
[scheduling]
initial cycle point = 2025
[[dependencies]]
[[[P1Y]]]
graph = "pre => model => post => niwa_archive"
can be written like this:
[scheduling]
initial cycle point = 2025
[[dependencies]]
[[[P1Y]]]
graph = "pre => model => post"
[[[P1Y]]]
graph = "post => niwa_archive"
and again, the site-specific part can be taken out to a site include-file:
#...
{% set SITE = "niwa" %}
# Core site-agnostic settings.
#...
[scheduling]
initial cycle point = 2025
[[dependencies]]
[[[P1Y]]]
graph = "pre => model => post"
#...
# Site-specific settings:
{% include 'site/' ~ SITE ~ '.rc' %}
where the site include-file site/niwa.rc
contains:
# site/niwa.rc
[scheduling]
[[dependencies]]
[[[P1Y]]]
graph = "post => niwa_archive"
Note that the site-file graph needs to define the dependencies of the
site-specific tasks, and thus their points of connection to the core
suite - which is why the core task post
appears in the graph here (if
post
had any site-specific runtime settings, to get it to run at
this site, they would also be in the site-file).
16.5.4. Inlined Site-Switching¶
It may be tempting to use inlined switch blocks throughout the suite instead of site include-files, but this is not recommended - it is verbose and untidy (the greater the number of supported sites, the bigger the mess) and it exposes all site configuration to all users:
#...
[runtime]
[[model]]
script = run-model.sh
{# Site switch blocks not recommended:#}
{% if SITE == 'niwa' %}
[[[job]]]
batch system = loadleveler
[[[directives]]]
# NIWA Loadleveler directives...
{% elif SITE == 'metoffice' %}
[[[job]]]
batch system = pbs
[[[directives]]]
# Met Office PBS directives...
{% elif SITE == ... %}
#...
{% else %}
{{raise('Unsupported site: ' ~ SITE)}}
{% endif %}
#...
Inlined switches can be used, however, to configure exceptional behaviour at one site without requiring the other sites to duplicate the default behaviour. But be wary of accumulating too many of these switches:
# (core suite.rc file)
#...
{% if SITE == 'small' %}
{# We can't run 100 members... #}
{% set ENSEMBLE_SIZE = 25 %}
{% else %}
{# ...but everyone else can! #}
{% set ENSEMBLE_SIZE = 100 %}
{% endif %}
#...
Inlined switches can also be used to temporarily isolate a site-specific change to a hitherto non site-specific part of the suite, thereby avoiding the need to update all site include-files before getting agreement from the suite owner and collaborators.
16.5.5. Site-Specific Suite Variables¶
It can sometimes be useful to set site-specific values of suite variables that
aren’t exposed to users via rose-suite.conf
. For example, consider
a suite that can run a special post-processing workflow of some kind at sites
where IDL is available. The IDL-dependence switch can be set per site like this:
#...
{% from SITE ~ '-vars.rc' import HAVE_IDL, OTHER_VAR %}
graph = """
pre => model => post
{% if HAVE_IDL %}
post => idl-1 => idl-2 => idl-3
{% endif %}
"""
where for SITE = niwa
the file niwa-vars.rc
contains:
{# niwa-vars.rc #}
{% set HAVE_IDL = True %}
{% set OTHER_VAR = "the quick brown fox" %}
Note we are assuming there are significantly fewer options (IDL or not, in this case) than sites, otherwise the IDL workflow should just go in the site include-files of the sites that need it.
16.5.6. Site-Specific Optional Suite Configs¶
During development and testing of a portable suite you can use an optional Rose
suite config file to automatically set site-specific suite inputs and thereby
avoid the need to make manual changes every time you check out and run a new
version. The site switch itself has to be set of course, but there may be other
settings too such as model parameters for a standard local test domain. Just
put these settings in opt/rose-suite-niwa.conf
(for site “niwa”)
and run the suite with rose suite-run -O niwa
.
16.5.7. Site-Agnostic File Paths in App Configs¶
Where possible apps should be configured to reference files within the suite structure itself rather than outside of it. This makes the apps themselves portable and it becomes the job of the install task to ensure all required source files are available within the suite structure e.g. via symlink into the share directory. Additionally, by moving the responsibility of linking files into the suite to an install task you gain the added benefit of knowing if a file is missing at the start of a suite rather than part way into a run.
16.5.8. Site-Specific Optional App Configs¶
Typically a few but not all apps will need some site customization, e.g. for local archive configuration, local science options, or whatever. To avoid explicit site-customization of individual task-run command lines use Rose’s built-in optional app config capability:
[runtime]
[[root]]
script = rose task-run -v -O '({{SITE}})'
Normally a missing optional app config is considered to be an error, but the round parentheses here mean the named optional config is optional - i.e. use it if it exists, otherwise ignore.
With this setting in place we can simply add a opt/rose-app-niwa.conf
to
any app that needs customization at SITE = niwa
.
16.5.9. An Example¶
The following small suite is not portable because all of its tasks are submitted to a NIWA HPC host; two task are entirely NIWA-specific in that they respectively install files from a local database and upload products to a local distribution system; and one task runs a somewhat NIWA-specific configuration of a model. The remaining tasks are site-agnostic apart from local job host and batch scheduler directives.
[cylc]
UTC mode = True
[scheduling]
initial cycle point = 2017-01-01
[[dependencies]]
[[[R1]]]
graph = install_niwa => preproc
[[[P1D]]]
graph = """
preproc & model[-P1D] => model => postproc => upload_niwa
postproc => idl-1 => idl-2 => idl-3"""
[runtime]
[[root]]
script = rose task-run -v
[[HPC]] # NIWA job host and batch scheduler settings.
[[[remote]]]
host = hpc1.niwa.co.nz
[[[job]]]
batch system = loadleveler
[[[directives]]]
account_no = NWP1623
class = General
job_type = serial # (most jobs in this suite are serial)
[[install_niwa]] # NIWA-specific file installation task.
inherit = HPC
[[preproc]]
inherit = HPC
[[model]] # Run the model on a local test domain.
inherit = HPC
[[[directives]]] # Override the serial job_type setting.
job_type = parallel
[[[environment]]]
SPEED = fast
[[postproc]]
inherit = HPC
[[upload_niwa]] # NIWA-specific product upload.
inherit = HPC
To make this portable, refactor it into a core suite.rc file that contains the clean site-independent workflow configuration and loads all site-specific settings from an include-file at the end:
# suite.rc: CORE SITE-INDEPENDENT CONFIGURATION.
{% set SITE = 'niwa' %}
{% from 'site/' ~ SITE ~ '-vars.rc' import HAVE_IDL %}
[cylc]
UTC mode = True
[scheduling]
initial cycle point = 2017-01-01
[[dependencies]]
[[[P1D]]]
graph = """
preproc & model[-P1D] => model => postproc
{% if HAVE_IDL %}
postproc => idl-1 => idl-2 => idl-3
{% endif %}
"""
[runtime]
[[root]]
script = rose task-run -v -O '({{SITE}})'
[[preproc]]
inherit = HPC
[[preproc]]
inherit = HPC
[[model]]
inherit = HPC
[[[environment]]]
SPEED = fast
{% include 'site/' ~ SITE ~ '.rc' %}
plus site files site/niwa-vars.rc
:
# site/niwa-vars.rc: NIWA SITE SETTINGS FOR THE EXAMPLE SUITE.
{% set HAVE_IDL = True %}
and site/niwa.rc
:
# site/niwa.rc: NIWA SITE SETTINGS FOR THE EXAMPLE SUITE.
[scheduling]
[[dependencies]]
[[[R1]]]
graph = install_niwa => preproc
[[[P1D]]]
graph = postproc => upload_niwa
[runtime]
[[HPC]]
[[[remote]]]
host = hpc1.niwa.co.nz
[[[job]]]
batch system = loadleveler
[[[directives]]]
account_no = NWP1623
class = General
job_type = serial # (most jobs in this suite are serial)
[[install_niwa]] # NIWA-specific file installation.
[[model]]
[[[directives]]] # Override the serial job_type setting.
job_type = parallel
[[upload_niwa]] # NIWA-specific product upload.
and finally, an optional app config file for the local model domain:
app/model/rose-app.conf # Main app config.
app/model/opt/rose-app-niwa.conf # NIWA site settings.
Some points to note:
- It is straightforward to extend support to a new site by copying an existing site file(s) and adapting it to the new job host and batch scheduler etc.
- Batch system directives should be considered site-specific unless all supported sites have the same batch system and the same host architecture (including CPU clock speed and memory size etc.).
- We’ve assumed that all tasks run on a single HPC host at both
sites. If that’s not a valid assumption the
HPC
family inheritance relationships would have to become site-specific. - Core task runtime configuration aren’t needed in site files at all
if their job host and batch system settings can be defined in common
families that are (
HPC
in this case).
16.5.10. Collaborative Development Model¶
Official releases of a portable suite should be made from the suite trunk.
Changes should be developed on feature branches so as not to affect other users of the suite.
Site-specific changes shouldn’t touch the core suite.rc file, just the relevant site include-file, and therefore should not need close scrutiny from other sites.
Changes to the core suite.rc file should be agreed by all stakeholders, and should be carefully checked for effects on site include-files:
- Changing the name of tasks or families in the core suite may break sites that add configuration to the original runtime namespace.
- Adding new tasks or families to the core suite may require corresponding additions to the site files.
- Deleting tasks or families from the core suite may require corresponding parts of the site files to be removed. And also, check for site-specific triggering off of deleted tasks or families.
However, if the owner site has to get some changes into the trunk before all collaborating sites have time to test them, version control will of course protect those lagging behind from any immediate ill effects.
When a new feature is complete and tested at the developer’s site, the suite owner should check out the branch, review and test it, and if necessary request that other sites do the same and report back. The owner can then merge the new feature to the trunk once satisfied.
All planning and discussion associated with the change should be documented on MOSRS Trac tickets associated with the suite.
16.5.11. Research-To-Operations Transition¶
Under this collaborative development model it is possible to use the same suite in research and operations, largely eliminating the difficult translation between the two environments. Where appropriate, this can save a lot of work.
Operations-specific parts of the suite should be factored out (as for site portability) into include-files that are only loaded in the operational environment. Improvements and upgrades can be developed on feature branches in the research environment. Operations staff can check out completed feature branches for testing in the operational environment before merging to trunk or referring back to research if problems are found. After sufficient testing the new suite version can be deployed into operations.
Note
This obviously glosses over the myriad complexities of the technical and scientific testing and validation of suite upgrades; it merely describes what is possible from a suite design and collaborative development perspective.