This will close cylc-flow #3421 and partially address cylc-flow #2199.
If a platform has a list of hosts then the host used for each operation on that platform (job submission, polling, log retrieval, etc.) will be randomised. This should help balance the load. If a host becomes unavailable, another random host from the list is tried, until the list is exhausted. This handles hosts becoming unavailable after a job has been submitted, which is a major limitation of the current logic.
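The randomise-then-fail-over behaviour could be sketched as follows (illustrative only, not the actual implementation; `is_available` is a hypothetical stand-in for whatever availability check or operation attempt is performed on a host):

```python
import random

def select_host(hosts, is_available):
    """Pick a random host, falling back to other hosts until exhausted.

    ``hosts`` is the platform's host list; ``is_available`` is a callable
    returning True if an operation on that host succeeds.  Returns the
    first working host, or None once every host has been tried.
    """
    candidates = list(hosts)
    random.shuffle(candidates)  # randomise to balance load across hosts
    for host in candidates:
        if is_available(host):
            return host
    return None  # list exhausted: the operation fails
```

For example, `select_host(["hpcl1", "hpcl2"], ssh_reachable)` would try both hosts in random order before giving up.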
A platform can also be specified via an environment variable (e.g. `platform = $ROSE_ORIG_HOST`). Note that `$ROSE_ORIG_HOST` relies on Rose specifying extra options to set this in `[cylc][environment]`, which we should avoid if possible.

Example platform configuration:

[platforms]
[[desktop\d\d,laptop\d\d]]
# hosts = platform name (default)
# Note: "desktop01" and "desktop02" are both valid and distinct platforms
[[sugar]]
hosts = localhost
batch system = slurm
[[hpc]]
hosts = hpcl1, hpcl2
retrieve job logs = True
batch system = pbs
[[hpcl1-bg]]
hosts = hpcl1
retrieve job logs = True
batch system = background
[[hpcl2-bg]]
hosts = hpcl2
retrieve job logs = True
batch system = background
[platform aliases]
[[hpc-bg]]
platforms = hpcl1-bg, hpcl2-bg
# other config(s) to be added here for load balancing etc
Note: the default for `hosts` is the platform name, and platform section headings can be regular expressions matching multiple platform names. This allows us to define many platforms in a single section and handles the case where you have a large number of separate servers which are identical in terms of their platform properties other than their host name. (Platform section headings will be matched against the full name, unlike `re.match`, which is currently used for matching hosts.)

The names of all new and existing settings are subject to change until we release cylc 8.0. Initially, names will be chosen to match current usage wherever possible. We intend to review all the config settings once we have the key functionality in place.
The default platform is `localhost`. At cylc 7, any settings for localhost act as defaults for all other hosts, but we do not propose doing this for platforms. The downside is more duplication if you need to override a particular setting for every platform, but we think this is clearer and safer.

If `hosts` is defined then the platform section heading should specify a single platform name; otherwise you are defining lots of identical platforms, which makes no sense (although it is relatively harmless, so we probably won't try to prevent it or give any warnings if it happens).
For example, `hosts` could be derived from the platform name:

[platforms]
[[hpcl[12]-bg]]
hosts = ("-bg$","") # or "hosts = s/-bg$//" ?
retrieve job logs = True
batch system = background
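How platform names might be resolved against regex section headings can be sketched like this (a sketch only, assuming full-name matching; the dict and helper are illustrative, not Cylc's actual code):

```python
import re

# Hypothetical platform definitions; a comma-separated heading lists
# alternative patterns, as in the example configuration above.
PLATFORMS = {
    r"desktop\d\d,laptop\d\d": {"batch system": "background"},
    "sugar": {"hosts": "localhost", "batch system": "slurm"},
}

def lookup_platform(name, platforms=PLATFORMS):
    """Return settings for the first section heading matching ``name``.

    The whole platform name must match the pattern (via re.fullmatch),
    so "desktop01" and "desktop02" are both valid, distinct platforms
    defined by the same section.
    """
    for heading, settings in platforms.items():
        for pattern in heading.split(","):
            if re.fullmatch(pattern.strip(), name):
                return settings
    raise ValueError(f"platform not defined: {name}")
```

Here `lookup_platform("desktop01")` and `lookup_platform("desktop02")` both resolve to the first section, while an undefined name raises an error (corresponding to a validation failure).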
These are examples of how the platform may be assigned in the `[runtime]` section for a job. They refer to the example platforms shown above.

- `platform = desktop01`: the job runs on the platform `desktop01`.
- `platform = special`: there is no platform `special` defined, so this will fail validation.
- `platform = hpc`: the job is submitted to the `hpc` platform, using one of its hosts (`hpcl1` or `hpcl2`).
- `platform = hpc-bg`: the job runs on one of the platforms in the `hpc-bg` platform alias (`hpcl1-bg` or `hpcl2-bg`).
- `platform = $(rose host-select linux)`: the platform is determined by evaluating the command at job submission time.
You can't mix old with new: if `platform` is defined for any task then any settings in the `[remote]` or `[job]` sections of any task result in a validation failure.

Any settings in the `[remote]` or `[job]` sections which have equivalent settings in the new runtime section are converted using the normal deprecation method, e.g. `[job] execution time limit`.
The remaining `[remote]` and `[job]` settings are converted by searching for a platform which matches all the items set. If there is no such platform, the conversion fails, resulting in an error.
For fixed remote hosts the conversion happens at load time and the lack of a matching platform results in a validation failure.
For remote hosts defined as a command or an environment variable, the remaining `[remote]` and `[job]` settings are stored and the conversion happens at job submission. In this case the lack of a matching platform results in a submission failure.
The `[remote] host` setting is matched first by checking each platform's `hosts` setting. If this is not defined then the match is made against the valid platform names. If `[remote] host` is set to the workflow server hostname then it is converted to `localhost` before matching occurs. For example:

[runtime]
[[alpha]]
[[[remote]]]
host = localhost
[[[job]]]
batch system = background
# => platform = localhost (set at load time)
[[beta]]
[[[remote]]]
host = desktop01
[[[job]]]
batch system = slurm
# => validation failure (no matching platform)
[[gamma]]
[[[job]]]
batch system = slurm
# => platform = sugar (set at load time)
[[delta]]
[[[remote]]]
host = $(rose host-select hpc)
# assuming this returns "hpcl1" or "hpcl2"
[[[job]]]
batch system = pbs
# => platform = hpc (set at job submission time)
[[epsilon]]
[[[remote]]]
host = $(rose host-select hpc)
[[[job]]]
batch system = slurm
# => job submission failure (no matching platform)
[[zeta]]
[[[remote]]]
host = hpcl1
[[[job]]]
batch system = background
# => platform = hpcl1-bg (set at load time)
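The conversion rule can be sketched as a search over the platform definitions. This is a simplification (real platforms carry more settings than `hosts` and `batch system`) and the helper is illustrative, not Cylc's actual implementation:

```python
# Platform definitions mirroring the examples above (simplified).
PLATFORMS = {
    "localhost": {"hosts": ["localhost"]},
    "sugar": {"hosts": ["localhost"], "batch system": "slurm"},
    "hpc": {"hosts": ["hpcl1", "hpcl2"], "batch system": "pbs"},
    "hpcl1-bg": {"hosts": ["hpcl1"]},
    "hpcl2-bg": {"hosts": ["hpcl2"]},
}

def platform_from_job_config(remote, job, platforms=PLATFORMS):
    """Find the platform matching all old-style [remote]/[job] settings.

    A platform matches when the requested host appears in its ``hosts``
    list and the batch systems agree (both default to "background").
    Returns the platform name, or None, which corresponds to a
    validation failure (fixed host) or submission failure (host
    defined as a command).
    """
    host = remote.get("host", "localhost")
    batch = job.get("batch system", "background")
    for name, settings in platforms.items():
        hosts = settings.get("hosts", [name])
        if host in hosts and settings.get("batch system", "background") == batch:
            return name
    return None
```

With these definitions, the `gamma` example resolves to `sugar`, `zeta` resolves to `hpcl1-bg`, and `beta` returns `None` (a validation failure).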
Several additions to the platform support will be needed to support rose suite-run functionality:

- A new platform setting `install target` (defaults to the platform name). Platforms which share a filesystem should share the same `install target`, so that file installation happens once per `install target` rather than per platform.
- rose suite-run uses a `.service/uuid` file to detect whether a remote host shares the same filesystem as the scheduler host - we can remove this functionality. Instead we should check for the existence of a client key and fail the remote-init if one is found for a different install target (different targets used within the same workflow should not share the same run directory).
- If sites configure ssh connection sharing (`ControlPersist`) then this will avoid the ssh overhead (and will benefit other ssh connections as well).
- The files installed on each install target will be `.service/server.key app/ bin/ etc/ lib/`. Note that `.service/contact` is not included since we may need to be able to modify this file on remote hosts depending on the comms method.
- Use the `--delete` rsync option so that any files removed from installed directories also get removed from the install targets on reload/restart.
- Some suites use other top level directories such as `python`, `util` and `data` (although it is not clear how many of these would need to be installed on remote platforms). `rose suite-run` currently installs everything other than a set list of exclusions. However, it has been known for suites to write to top level directories (possibly not deliberately), so this is not a good default behaviour (especially with the `--delete` rsync option).
- Add a new workflow setting `[scheduler]install` (with a default value of `app/ bin/ etc/ lib/`) where you can specify which top level directories and files are to be installed, e.g. `dir-to-copy/ file-to-copy` (directories must have a trailing slash). Patterns are allowed (as defined by rsync), e.g. `*/ *` would mimic the current `rose suite-run` behaviour.
- The rsync options used will be `--include='/.service/' --include='/.service/server.key' --exclude='.service/***' --exclude='log' --exclude='share' --exclude='work'` plus includes derived from the `[scheduler]install` workflow setting. Note that all items will have a leading slash added and any items ending in `/` will have a trailing `***` added to the pattern (which matches everything in the directory), e.g. `--include='/dir-to-copy/***' --include='/file-to-copy' --exclude='*'`.
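The pattern rules above (leading slash added, trailing `***` for directories, exclude everything else) might be implemented along these lines (a sketch; the function name is illustrative):

```python
def rsync_install_options(install_items):
    """Build rsync --include options from a [scheduler]install style list.

    Each item gets a leading slash; items ending in "/" (directories)
    get a trailing "***" so the directory and everything inside it are
    matched.  Anything not explicitly included is excluded.
    """
    opts = []
    for item in install_items:
        pattern = "/" + item.lstrip("/")
        if pattern.endswith("/"):
            pattern += "***"  # match the directory and all its contents
        opts.append(f"--include='{pattern}'")
    opts.append("--exclude='*'")
    return opts
```

For instance, `rsync_install_options(["dir-to-copy/", "file-to-copy"])` yields the `--include='/dir-to-copy/***' --include='/file-to-copy' --exclude='*'` options shown above.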
- `[symlink dirs][<install target>]run` (default: none): specifies the directory where the workflow run directories are created. If specified, the workflow run directory will be created in `<run dir>/<workflow-name>` and a symbolic link will be created from `$HOME/cylc-run/<workflow-name>`. If not specified, the workflow run directory will be created in `$HOME/cylc-run/<workflow-name>`. All the workflow files and the `.service` directory get installed into this directory.
- `[symlink dirs][<install target>]log` (default: none): specifies the directory where log directories are created. If specified, the workflow log directory will be created in `<log dir>/<workflow-name>/log` and a symbolic link will be created from `$HOME/cylc-run/<workflow-name>/log`. If not specified, the workflow log directory will be created in `$HOME/cylc-run/<workflow-name>/log`.
- `[symlink dirs][<install target>]share` (default: none): specifies the directory where share directories are created. If specified, the workflow share directory will be created in `<share dir>/<workflow-name>/share` and a symbolic link will be created from `$HOME/cylc-run/<workflow-name>/share`. If not specified, the workflow share directory will be created in `$HOME/cylc-run/<workflow-name>/share`.
- `[symlink dirs][<install target>]share/cycle` (default: none): specifies the directory where share/cycle directories are created. If specified, the workflow share/cycle directory will be created in `<share/cycle dir>/<workflow-name>/share/cycle` and a symbolic link will be created from `$HOME/cylc-run/<workflow-name>/share/cycle`. If not specified, the workflow share/cycle directory will be created in `$HOME/cylc-run/<workflow-name>/share/cycle`.
- `[symlink dirs][<install target>]work` (default: none): specifies the directory where work directories are created. If specified, the workflow work directory will be created in `<work dir>/<workflow-name>/work` and a symbolic link will be created from `$HOME/cylc-run/<workflow-name>/work`. If not specified, the workflow work directory will be created in `$HOME/cylc-run/<workflow-name>/work`.

Notes:

- Should we support environment variables such as `$HOME` or `$USER` in the run and work directory settings? With Rose you can use any variable which is available when the remote-init command is invoked on the remote platform. Should we adopt a similar approach for Cylc?
- An alternative is to define an `[install targets]` section and move the `[symlink dirs]` settings into it.
- `inherit` will imply using the same `install target`.
- `[symlink dirs]` settings are only applied when a workflow is installed (or first run on a target). Therefore, changes to these settings have no effect on running or restarting workflows.
- We do not propose providing a way to change the `$HOME/cylc-run` directory (which you can do currently via the existing Cylc `run directory` setting). The assumption is that it is sufficient to provide a way to move the run directory and there is no reason not to use `$HOME/cylc-run`.
- These settings will replace the existing Cylc `run directory` and `work directory` settings.

Example platform configurations:
[platforms]
# The localhost platform is defined by default so there is no need to
# specify it if it just uses default settings
# [[localhost]]
[[desktop\d\d,laptop\d\d]] # Specify install target
install target = localhost
[[desktop\d\d,laptop\d\d]] # Equivalent using inherit
inherit = localhost
[[sugar]]
inherit = localhost
hosts = localhost
batch system = slurm
[[hpc]]
hosts = hpcl1, hpcl2
retrieve job logs = True
batch system = pbs
[[hpcl1-bg]]
inherit = hpc
hosts = hpcl1
batch system = background
[[hpcl2-bg]]
inherit = hpcl1-bg
hosts = hpcl2
[symlink dirs]
[[localhost]]
log = $DATADIR
share = $DATADIR
share/cycle = $SCRATCH
work = $SCRATCH
[[hpc]]
run = $DATADIR
share/cycle = $SCRATCH
work = $SCRATCH
# Alternative if we are concerned there may be other install target properties
[install targets]
[[localhost]]
[[[symlink dir]]]
log = $DATADIR
share = $DATADIR
share/cycle = $SCRATCH
work = $SCRATCH
There are a number of ideas for enhancements (some of which are referenced in cylc-flow #2199) which we will not attempt to address in the initial implementation. These include:

- Define default directives for all jobs on a platform.
- Support task management commands (kill, poll) by platform.
- Limit the number of jobs submitted to a platform at any one time.
- Custom logic to invoke for collecting job accounting information when a job completes.
- Built-in host selection functionality along the lines of `rose host-select`.
- Should we allow `platforms` (in a platform alias) to be specified as a command? This would provide an alternative to specifying a platform as a command in the suite (configure a platform alias instead).
- The `[suite servers]` section is configuring something very similar to a platform alias. Should we try to unify the approach (i.e. specify the suite servers as a platform)?
Note that these are just ideas for possible enhancements - no assumptions are made at this stage as to which ones are worth implementing.