cylc-admin

Proposal - initial support for platforms

This will close cylc-flow #3421 and partially address cylc-flow #2199.

Key functionality

Example platform configurations

[platforms]
    [[desktop\d\d,laptop\d\d]]
        # hosts = platform name (default)
        # Note: "desktop01" and "desktop02" are both valid and distinct platforms
    [[sugar]]
        hosts = localhost
        batch system = slurm
    [[hpc]]
        hosts = hpcl1, hpcl2
        retrieve job logs = True
        batch system = pbs
    [[hpcl1-bg]]
        hosts = hpcl1
        retrieve job logs = True
        batch system = background
    [[hpcl2-bg]]
        hosts = hpcl2
        retrieve job logs = True
        batch system = background
[platform aliases]
    [[hpc-bg]]
        platforms = hpcl1-bg, hpcl2-bg
        # other config(s) to be added here for load balancing etc
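The comment on the desktop/laptop definition above can be illustrated with a short check. This is a sketch only, not Cylc code: matches_platform_def is a hypothetical helper showing how a comma-separated list of regular expressions in a section name could match distinct platform names.

```python
import re

# Illustrative only: a platform section name like "desktop\d\d,laptop\d\d"
# is a comma-separated list of regular expressions, so "desktop01" and
# "desktop02" are distinct platforms matching the same definition.
def matches_platform_def(section_name, platform_name):
    return any(
        re.fullmatch(pattern, platform_name)
        for pattern in section_name.split(",")
    )
```

For example, matches_platform_def(r"desktop\d\d,laptop\d\d", "desktop01") and matches_platform_def(r"desktop\d\d,laptop\d\d", "laptop42") both hold, whereas "hpc" matches neither expression.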



Example platform usage

These are examples of how a platform may be assigned to a task in the [runtime] section. They refer to the example platforms defined above.

platform = desktop01

platform = special

platform = hpc

platform = hpc-bg

platform = $(rose host-select linux)

Deprecation method

Deprecation examples

[runtime]
    [[alpha]]
        [[[remote]]]
            host = localhost
        [[[job]]]
            batch system = background
        # => platform = localhost (set at load time)

    [[beta]]
        [[[remote]]]
            host = desktop01
        [[[job]]]
            batch system = slurm
        # => validation failure (no matching platform)

    [[gamma]]
        [[[job]]]
            batch system = slurm
        # => platform = sugar (set at load time)

    [[delta]]
        [[[remote]]]
            host = $(rose host-select hpc)
            # assuming this returns "hpcl1" or "hpcl2"
        [[[job]]]
            batch system = pbs
        # => platform = hpc (set at job submission time)

    [[epsilon]]
        [[[remote]]]
            host = $(rose host-select hpc)
        [[[job]]]
            batch system = slurm
        # => job submission failure (no matching platform)

    [[zeta]]
        [[[remote]]]
            host = hpcl1
        [[[job]]]
            batch system = background
        # => platform = hpcl1-bg (set at load time)
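The deprecation rules above amount to a lookup from the deprecated [remote]host and [job]batch system settings to a matching platform. A minimal sketch, assuming the example platform definitions from earlier (platform_from_job_info and PLATFORMS are illustrative names, not the actual Cylc API):

```python
import re

# Hypothetical platform definitions mirroring the examples above.
# "hosts" defaults to the platform name; "batch system" defaults to background.
PLATFORMS = {
    "localhost": {},
    r"desktop\d\d|laptop\d\d": {},  # mirrors [[desktop\d\d,laptop\d\d]]
    "sugar": {"hosts": ["localhost"], "batch system": "slurm"},
    "hpc": {"hosts": ["hpcl1", "hpcl2"], "batch system": "pbs"},
    "hpcl1-bg": {"hosts": ["hpcl1"], "batch system": "background"},
    "hpcl2-bg": {"hosts": ["hpcl2"], "batch system": "background"},
}

def platform_from_job_info(host, batch_system):
    """Return the first platform whose hosts and batch system match the
    deprecated [remote]host / [job]batch system settings, else None."""
    for name, spec in PLATFORMS.items():
        if batch_system != spec.get("batch system", "background"):
            continue
        # hosts default to the platform name, which may itself be a regex
        hosts = spec.get("hosts", [name])
        if host in hosts or any(re.fullmatch(h, host) for h in hosts):
            return name
    return None
```

With these definitions the lookup reproduces the examples: ("localhost", "background") gives localhost as for alpha, ("hpcl1", "background") gives hpcl1-bg as for zeta, and ("desktop01", "slurm") matches nothing, which is the beta validation failure.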

Enhancements to support rose suite-run functionality

Several additions to the platform support will be needed to support rose suite-run functionality:

  1. Installation of suites on platforms
    • We already have logic to install service files on remote platforms just before the first set of jobs is submitted to each platform after suite start-up or on suite reload. This system needs to be extended to other items of the suite.
    • We will modify this to work on an install target basis rather than a platform basis.
      • Configure the id of the target used on a particular platform (install target: defaults to the platform name).
      • If multiple platforms share the same filesystems then use a common install target.
      • Note that, when we add support for separate client keys, these will be per install target rather than per platform.
      • Cylc 7 creates a .service/uuid file to detect whether a remote host shares the same filesystem as the scheduler host - we can remove this functionality. Instead we should check for the existence of a client key and fail the remote-init if one is found for a different install target (different targets used within the same workflow should not share the same run directory).
    • Currently, the files which need to be installed are transferred as a tar file via stdin. This has the advantage that the remote init can be done as a single ssh. We will change the installation to use rsync which will require a second ssh but this will be better for installing more files and allow the logging to report which files have been changed after a reload.
      • If we change cylc to re-use ssh connections by default (using ControlPersist) then this will avoid the ssh overhead (and will benefit other ssh connections as well).
      • We will probably need a third ssh connection for transferring the client public key.
    • The remote installation will include particular files/directories.
      • Items to include: .service/server.key app/ bin/ etc/ lib/. Note that .service/contact is not included since we may need to be able to modify this file on remote hosts depending on the comms method.
      • We will use the --delete rsync option so that any files removed from installed directories also get removed from the install targets on reload/restart.
      • Does this need to be configurable? (i.e. configure additional items to include)? If so, global and/or workflow level?
        • Is there any reason why workflow designers can’t just stick to the list of supported directories for files that need to be installed?
        • Are there many existing Rose suites which rely on other directories being installed? A quick search implies there are quite a few suites using a wide range of sub-directories such as python, util, data (although it is not clear how many of these would need to be installed on remote platforms).
        • If we allow global configuration this is likely to lead to portability issues where a suite only works if you modify your global config. If additional items need to be installed it’s better to force these to be configured in the suite. Note that this will require changes to some Cylc 7 suites in order for them to work at Cylc 8.
        • We will only support installing the same files to all install targets initially. If there are cases where this is an issue we can recommend the use of install tasks as a workaround.
          • We can add support for installing on a task or platform basis later if needed.
        • rose suite-run currently installs everything other than a set list of exclusions. However, it has been known for suites to write to top level directories (possibly not deliberate) so this is not a good default behaviour (especially with the --delete rsync option).
        • Proposal:
          • Install the same items to all install targets.
          • Allow the installation items to be configured at the workflow level via a new setting [scheduler]install (with a default value of app/ bin/ etc/ lib/) where you can specify which top level directories and files are to be installed, e.g. dir-to-copy/ file-to-copy (directories must have a trailing slash). Patterns are allowed (as defined by rsync), e.g. */ * would mimic the current rose suite-run behaviour.
          • The rsync filter options will be applied in the following order (note that the first matching pattern is acted on):
            • Include files required by cylc: --include='/.service/' --include='/.service/server.key'
            • Exclude directories which should never be copied: --exclude='.service/***' --exclude='log' --exclude='share' --exclude='work'
            • Add any items defined in the [scheduler]install workflow setting. Note that all items will have a leading slash added and any items ending in / will have a trailing *** added to the pattern (which matches everything in the directory). e.g. --include='/dir-to-copy/***' --include='/file-to-copy'
            • Exclude everything else: --exclude='*'
  2. Support moving some directories to different locations with symlinks to the original location.
    • New settings:
      • [symlink dirs][<install target>]run (default: none): Specifies the directory where the workflow run directories are created. If specified, the workflow run directory will be created in <run dir>/<workflow-name> and a symbolic link will be created from $HOME/cylc-run/<workflow-name>. If not specified the workflow run directory will be created in $HOME/cylc-run/<workflow-name>. All the workflow files and the .service directory get installed into this directory.
      • [symlink dirs][<install target>]log (default: none): Specifies the directory where log directories are created. If specified the workflow log directory will be created in <log dir>/<workflow-name>/log and a symbolic link will be created from $HOME/cylc-run/<workflow-name>/log. If not specified the workflow log directory will be created in $HOME/cylc-run/<workflow-name>/log.
      • [symlink dirs][<install target>]share (default: none): Specifies the directory where share directories are created. If specified the workflow share directory will be created in <share dir>/<workflow-name>/share and a symbolic link will be created from $HOME/cylc-run/<workflow-name>/share. If not specified the workflow share directory will be created in $HOME/cylc-run/<workflow-name>/share.
      • [symlink dirs][<install target>]share/cycle (default: none): Specifies the directory where share/cycle directories are created. If specified the workflow share/cycle directory will be created in <share/cycle dir>/<workflow-name>/share/cycle and a symbolic link will be created from $HOME/cylc-run/<workflow-name>/share/cycle. If not specified the workflow share/cycle directory will be created in $HOME/cylc-run/<workflow-name>/share/cycle.
      • [symlink dirs][<install target>]work (default: none): Specifies the directory where work directories are created. If specified the workflow work directory will be created in <work dir>/<workflow-name>/work and a symbolic link will be created from $HOME/cylc-run/<workflow-name>/work. If not specified the workflow work directory will be created in $HOME/cylc-run/<workflow-name>/work.
    • What happens if you want to be able to have some workflows which use different directory configurations on the same “platform”? For example, you might have some workflows where you want the run directory on $HOME and others on $DATADIR.
      • For platforms other than localhost the answer is to create separate platforms and then choose the appropriate platform in the workflow.
      • For localhost it is more complicated since there is only one localhost platform and the settings are applied earlier, at workflow install or startup time (rather than prior to first job submission). We will need a way to override the directory settings for localhost, e.g. on the command line.
    • What environment variables can be used in these settings?
      • Cylc currently only allows $HOME or $USER in the run and work directory settings. With Rose you can use any variable which is available when the remote-init command is invoked on the remote platform. Should we adopt a similar approach for Cylc?
      • Rose also provides the option of using the login shell for remote commands (which may give access to more environment variables). Should we add similar support to Cylc for sites that allow it?
    • Symlink directories are specified per install target rather than per platform. Any platforms which use the same install target must use the same symlink directories. Note that the assumption is that there are no other settings which need to be defined per install target.
      • What about comms method? If possible this should be supported per platform. For example you could have a case where HPC login nodes can use a different method compared with the nodes where jobs are run. This implies the comms method needs to go back to being specified in the job rather than in the contact file.
      • If we ever find a need to define other settings per install target we should introduce a new install target section and move the [symlink dirs] section.
    • In order to reduce duplication (and configuration errors) we will support platform inheritance so that you can define a platform based on an existing platform.
      • Note: inherit will imply using the same install target.
    • The [symlink dirs] settings are only applied when a workflow is installed (or first run on a target). Therefore, changes to these settings have no effect on running or restarting workflows.
    • At the moment this proposal does not provide any way to relocate the top level $HOME/cylc-run directory (which you can do currently via the existing Cylc run directory setting). The assumption is that it is sufficient to provide a way to move the run directory and there is no reason not to use $HOME/cylc-run.
    • Note that these new settings replace the existing Cylc run directory and work directory settings.
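The rsync filter ordering proposed in point 1 above could be assembled along these lines. This is a sketch under the proposal's assumptions, not the actual Cylc implementation; rsync_install_cmd and install_items are illustrative names standing in for the proposed [scheduler]install setting.

```python
def rsync_install_cmd(src, dst, install_items=("app/", "bin/", "etc/", "lib/")):
    """Build an rsync command applying the proposed filter ordering
    (first matching pattern wins, so order is significant)."""
    cmd = ["rsync", "-a", "--delete"]
    # 1. Include files required by cylc:
    cmd += ["--include=/.service/", "--include=/.service/server.key"]
    # 2. Exclude directories which should never be copied:
    cmd += ["--exclude=.service/***", "--exclude=log",
            "--exclude=share", "--exclude=work"]
    # 3. Items from [scheduler]install: add a leading slash; a trailing "/"
    #    becomes "/***" so everything inside the directory matches.
    for item in install_items:
        pattern = "/" + (item + "***" if item.endswith("/") else item)
        cmd.append("--include=" + pattern)
    # 4. Exclude everything else:
    cmd.append("--exclude=*")
    return cmd + [src, dst]
```

With the defaults this yields --include=/app/*** etc. sandwiched between the fixed includes/excludes and the final --exclude=*, matching the ordering described above.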

Example platform configurations:

[platforms]
    # The localhost platform is defined by default so there is no need to
    # specify it if it just uses default settings
    # [[localhost]]
    [[desktop\d\d,laptop\d\d]] # Specify install target
        install target = localhost
    [[desktop\d\d,laptop\d\d]] # Equivalent using inherit
        inherit = localhost
    [[sugar]]
        inherit = localhost
        hosts = localhost
        batch system = slurm
    [[hpc]]
        hosts = hpcl1, hpcl2
        retrieve job logs = True
        batch system = pbs
    [[hpcl1-bg]]
        inherit = hpc
        hosts = hpcl1
        batch system = background
    [[hpcl2-bg]]
        inherit = hpcl1-bg
        hosts = hpcl2
[symlink dirs]
    [[localhost]]
        log = $DATADIR
        share = $DATADIR
        share/cycle = $SCRATCH
        work = $SCRATCH
    [[hpc]]
        run = $DATADIR
        share/cycle = $SCRATCH
        work = $SCRATCH
# Alternative if we are concerned there may be other install target properties
[install targets]
    [[localhost]]
        [[[symlink dir]]]
            log = $DATADIR
            share = $DATADIR
            share/cycle = $SCRATCH
            work = $SCRATCH
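The effect of the [symlink dirs] settings above can be sketched as follows. This is illustrative only, assuming the environment variables have already been expanded; make_symlink_dirs is a hypothetical helper, not part of Cylc.

```python
import os

def make_symlink_dirs(workflow, run_root, symlink_dirs):
    """Create each configured directory under its target location and
    symlink it from the workflow run directory (run_root/<workflow>).

    symlink_dirs maps a sub-path such as "work" or "share/cycle" to an
    already-expanded target root such as the value of $SCRATCH.
    """
    for subdir, target_root in symlink_dirs.items():
        real = os.path.join(target_root, workflow, subdir)
        link = os.path.join(run_root, workflow, subdir)
        os.makedirs(real, exist_ok=True)
        os.makedirs(os.path.dirname(link), exist_ok=True)
        if not os.path.lexists(link):
            os.symlink(real, link)
```

For example, with symlink_dirs={"work": scratch, "share/cycle": scratch} the workflow's work and share/cycle directories are created under the scratch location and symlinked from $HOME/cylc-run/<workflow-name>, as described in the settings above.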

Further enhancements

There are a number of ideas for enhancements (some of which are referenced in cylc-flow #2199) which we will not attempt to address in the initial implementation. These include:

Note that these are just ideas for possible enhancements - no assumptions are made at this stage as to which ones are worth implementing.