13. Running Suites¶
This chapter covers a diverse collection of topics related to running suites. Please also see the Tutorial and Command Reference, and experiment with the example suites.
13.1. Suite Start-Up¶
There are three ways to start a suite running: cold start and warm start, which start from scratch; and restart, which starts from a prior suite state checkpoint. The only difference between cold starts and warm starts is that warm starts start from a point beyond the suite initial cycle point.
Once a suite is up and running it is typically a restart that is needed most often (but see also cylc reload). Be aware that cold and warm starts wipe out prior suite state, so you can't go back to a restart if you decide you made a mistake.
13.1.1. Cold Start¶
A cold start is the primary way to start a suite run from scratch:
$ cylc run SUITE [INITIAL_CYCLE_POINT]
The initial cycle point may be specified on the command line or in the suite.rc file. The scheduler starts by loading the first instance of each task at the suite initial cycle point, or at the next valid point for the task.
13.1.2. Warm Start¶
A warm start runs a suite from scratch like a cold start, but from the beginning of a given cycle point that is beyond the suite initial cycle point. This is generally inferior to a restart (which loads a previously recorded suite state - see Restart and Suite State Checkpoints) because it may result in some tasks rerunning. However, a warm start may be required if a restart is not possible, e.g. because the suite run database was accidentally deleted. The warm start cycle point must be given on the command line:
$ cylc run --warm SUITE [START_CYCLE_POINT]
The original suite initial cycle point is preserved, but all tasks and dependencies before the given warm start cycle point are ignored.
The scheduler starts by loading a first instance of each task at the warm start cycle point, or at the next valid point for the task. R1-type tasks behave exactly the same as other tasks - if their cycle point is at or later than the given start cycle point, they will run; if not, they will be ignored.
13.1.3. Restart and Suite State Checkpoints¶
At restart (see cylc restart --help) a suite server program initializes its task pool from a previously recorded checkpoint state. By default the latest automatic checkpoint - which is updated with every task state change - is loaded, so that the suite can carry on exactly as it was just before being shut down or killed.
$ cylc restart SUITE
Tasks recorded in the “submitted” or “running” states are automatically polled (see Task Job Polling) at start-up to determine what happened to them while the suite was down.
13.1.3.1. Restart From Latest Checkpoint¶
To restart from the latest checkpoint, simply invoke the cylc restart command with the suite name (or select “restart” in the GUI suite start dialog window):
$ cylc restart SUITE
13.1.3.2. Restart From Another Checkpoint¶
Suite server programs automatically update the “latest” checkpoint every time a task changes state, and at every suite restart, but you can also take checkpoints at other times. To tell a suite server program to checkpoint its current state:
$ cylc checkpoint SUITE-NAME CHECKPOINT-NAME
The second argument is a name that identifies the checkpoint when listing checkpoints later with:
$ cylc ls-checkpoints SUITE-NAME
For example, with checkpoints named “bob”, “alice”, and “breakfast”:
$ cylc ls-checkpoints SUITE-NAME
#######################################################################
# CHECKPOINT ID (ID|TIME|EVENT)
1|2017-11-01T15:48:34+13|bob
2|2017-11-01T15:48:47+13|alice
3|2017-11-01T15:49:00+13|breakfast
...
0|2017-11-01T17:29:19+13|latest
To see the actual task states recorded under a given checkpoint ID, for the moment you have to interrogate the suite run database directly, e.g.:
$ sqlite3 ~/cylc-run/SUITE-NAME/log/db \
'select * from task_pool_checkpoints where id == 3;'
3|2012|model|1|running|
3|2013|pre|0|waiting|
3|2013|post|0|waiting|
3|2013|model|0|waiting|
3|2013|upload|0|waiting|
Note
A checkpoint captures the instantaneous state of every task in the suite, including any tasks that are currently active, so you may want to be careful where you do it. Tasks recorded as active are polled automatically on restart to determine what happened to them.
The checkpoint ID 0 (zero) is always used for latest state of the suite, which is updated continuously as the suite progresses. The checkpoint IDs of earlier states are positive integers starting from 1, incremented each time a new checkpoint is stored. Currently suites automatically store checkpoints before and after reloads, and on restarts (using the latest checkpoints before the restarts).
Once you have identified the right checkpoint, restart the suite like this:
$ cylc restart --checkpoint=CHECKPOINT-ID SUITE
or enter the checkpoint ID in the space provided in the GUI restart window.
13.1.3.3. Checkpointing With A Task¶
Checkpoints can be generated automatically at particular points in the workflow by coding tasks that run the cylc checkpoint command:
[scheduling]
[[dependencies]]
[[[PT6H]]]
graph = "pre => model => post => checkpointer"
[runtime]
# ...
[[checkpointer]]
script = """
wait "${CYLC_TASK_MESSAGE_STARTED_PID}" 2>/dev/null || true
cylc checkpoint ${CYLC_SUITE_NAME} CP-${CYLC_TASK_CYCLE_POINT}
"""
Note
We need to “wait” on the “task started” message - which is sent in the background to avoid holding tasks up in a network outage - to ensure that the checkpointer task is correctly recorded as running in the checkpoint (at restart the suite server program will poll to confirm that the task job finished successfully). Otherwise it may be recorded in the waiting state and, if its upstream dependencies have already been cleaned up, it will need to be manually reset from waiting to succeeded after the restart to avoid stalling the suite.
13.1.3.4. Behaviour of Tasks on Restart¶
All tasks are reloaded in exactly their checkpointed states. Failed tasks are not automatically resubmitted at restart in case the underlying problem has not been addressed yet.
Tasks recorded in the submitted or running states are automatically polled on restart, to see if they are still waiting in a batch queue, still running, or if they succeeded or failed while the suite was down. The suite state will be updated automatically according to the poll results.
Existing instances of tasks removed from the suite configuration before restart are not removed from the task pool automatically, but they will not spawn new instances. They can be removed manually if necessary, with cylc remove.
Similarly, instances of new tasks added to the suite configuration before restart are not inserted into the task pool automatically, because it is very difficult in general to automatically determine the cycle point of the first instance. Instead, the first instance of a new task should be inserted manually at the right cycle point, with cylc insert.
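For example, to insert the first instance of a hypothetical new task new_task at cycle point 20170101T00Z, or to remove a lingering instance of a deleted task old_task (the suite name, task names, and cycle point are placeholders):
$ cylc insert SUITE 'new_task.20170101T00Z'
$ cylc remove SUITE 'old_task.20170101T00Z'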
13.2. Reloading The Suite Configuration At Runtime¶
The cylc reload command tells a suite server program to reload its suite configuration at run time. This is an alternative to shutting a suite down and restarting it after making changes.
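For example, after editing the suite.rc of a running suite (SUITE stands for the registered suite name):
$ cylc reload SUITE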
As for a restart, existing instances of tasks removed from the suite configuration before reload are not removed from the task pool automatically, but they will not spawn new instances. They can be removed manually if necessary, with cylc remove.
Similarly, instances of new tasks added to the suite configuration before reload are not inserted into the pool automatically. The first instance of each must be inserted manually at the right cycle point, with cylc insert.
13.3. Task Job Access To Cylc¶
Task jobs need access to Cylc on the job host, primarily for task messaging, but also to allow user-defined task scripting to run other Cylc commands.
Cylc should be installed on job hosts as on suite hosts, with different releases installed side-by-side and invoked via the central Cylc wrapper according to the value of $CYLC_VERSION - see Installing Cylc. Task job scripts set $CYLC_VERSION to the version of the parent suite server program, so that the right Cylc will be invoked by jobs on the job host.
Access to the Cylc executable (preferably the central wrapper as just described) for different job hosts can be configured using site and user global configuration files (on the suite host). If the environment for running the Cylc executable is only set up correctly in a login shell for a given host, you can set [hosts][HOST]use login shell = True for the relevant host (this is the default, to cover more sites automatically). If the environment is already correct without the login shell, but the Cylc executable is not in $PATH, then [hosts][HOST]cylc executable can be used to specify the direct path to the executable.
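A sketch of such global config entries, assuming two hypothetical job hosts - hpc-login, which needs a login shell, and data-mover, which needs the direct executable path:

# site/user global config on the suite host
[hosts]
    [[hpc-login]]
        use login shell = True
    [[data-mover]]
        use login shell = False
        cylc executable = /opt/cylc/bin/cylc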
To customize the environment more generally for Cylc on job hosts, use of job-init-env.sh is described in Configure Environment on Job Hosts.
13.4. The Suite Contact File¶
At start-up, suite server programs write a suite contact file $HOME/cylc-run/SUITE/.service/contact that records suite host, user, port number, process ID, Cylc version, and other information. Client commands can read this file, if they have access to it, to find the target suite server program.
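The file holds simple KEY=VALUE lines; the exact keys vary with Cylc version, so the following is an illustrative sketch only (values are made up):

$ cat ~/cylc-run/SUITE/.service/contact
CYLC_SUITE_HOST=suite-host.example.com
CYLC_SUITE_OWNER=username
CYLC_SUITE_PORT=43086
CYLC_VERSION=7.8.1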
13.5. Task Job Polling¶
At any point after job submission, task jobs can be polled to check that their true state conforms to what is currently recorded by the suite server program. See cylc poll --help for how to poll one or more tasks manually, or right-click poll a task or family in the GUI.
Polling may be necessary if, for example, a task job gets killed by the untrappable SIGKILL signal (e.g. kill -9 PID), or if a network outage prevents task success or failure messages getting through, or if the suite server program itself is down when tasks finish execution.
To poll a task job, the suite server program interrogates the batch system, and the job.status file, on the job host. This information is enough to determine the final task status even if the job finished while the suite server program was down or unreachable on the network.
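For example, to manually poll one task, or all active tasks in a suite (the suite name and task ID are placeholders):
$ cylc poll SUITE 'model.20170101T00Z'
$ cylc poll SUITE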
13.5.1. Routine Polling¶
Task jobs are automatically polled at certain times: once on job submission timeout; several times on exceeding the job execution time limit; and at suite restart, when any tasks recorded as active in the suite state checkpoint are polled to find out what happened to them while the suite was down.
Finally, if necessary, routine polling can be configured as a way to track job status on job hosts that do not allow network routing back to the suite host for task messaging by HTTPS or ssh. See Polling to Track Job Status.
13.6. Tracking Task State¶
Cylc supports three ways of tracking task state on job hosts:
- task-to-suite messaging via HTTPS
- task-to-suite messaging via non-interactive ssh to the suite host, then local HTTPS
- regular polling by the suite server program
These can be configured per job host in the Cylc global config file - see Global (Site, User) Config File Reference.
If your site prohibits HTTPS and ssh back from job hosts to suite hosts, before resorting to the polling method you should consider installing dedicated Cylc servers or VMs inside the HPC trust zone (where HTTPS and ssh should be allowed).
It is also possible to run Cylc suite server programs on HPC login nodes, but this is not recommended for load, run duration, and GUI reasons.
Finally, it has been suggested that port forwarding may provide another solution - but that is beyond the scope of this document.
13.6.1. HTTPS Task Messaging¶
Task job wrappers automatically invoke cylc message to report progress back to the suite server program when they begin executing, at normal exit (success) and abnormal exit (failure).
By default the messaging occurs via an authenticated, HTTPS connection to the suite server program. This is the preferred task communications method - it is efficient and direct.
Suite server programs automatically install suite contact information and credentials on job hosts. Users only need to do this manually for remote access to suites on other hosts, or suites owned by other users - see Remote Control.
13.6.2. Ssh Task Messaging¶
Cylc can be configured to re-invoke task messaging commands on the suite host via non-interactive ssh (from job host to suite host). Then a local HTTPS connection is made to the suite server program.
(User-invoked client commands (aside from the GUI, which requires HTTPS) can do the same thing with the --use-ssh command option).
This is less efficient than direct HTTPS messaging, but it may be useful at sites where the HTTPS ports are blocked but non-interactive ssh is allowed.
13.6.3. Polling to Track Job Status¶
Finally, suite server programs can actively poll task jobs at configurable intervals, via non-interactive ssh to the job host.
Polling is the least efficient task communications method because task state is updated only at intervals, not when task events actually occur. However, it may be needed at sites that do not allow HTTPS or non-interactive ssh from job host to suite host.
Be careful to avoid spamming task hosts with polling commands. Each poll opens (and then closes) a new ssh connection.
Polling intervals are configurable under [runtime] because they may depend on the expected execution time. For instance, a task that typically takes an hour to run might be polled every 10 minutes initially, and then every minute toward the end of its run. Interval values are used in turn until the last value, which is used repeatedly until finished:
[runtime]
[[foo]]
[[[job]]]
# poll every minute in the 'submitted' state:
submission polling intervals = PT1M
# poll one minute after foo starts running, then every 10
# minutes for 50 minutes, then every minute until finished:
execution polling intervals = PT1M, 5*PT10M, PT1M
A list of intervals with optional multipliers can be used for both submission and execution polling, although a single value is probably sufficient for submission polling. If these items are not configured, default values from the site and user global config will be used for the polling task communication method; polling is not done by default under the other task communications methods (but it can still be used if you like).
13.6.4. Task Communications Configuration¶
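The task communication method, and the default polling intervals used with it, can be set per job host in the site/user global config. A minimal sketch, assuming a hypothetical job host hpc-compute that can only be tracked by polling:

# site/user global config on the suite host
[hosts]
    [[hpc-compute]]
        task communication method = poll
        submission polling intervals = PT1M
        execution polling intervals = PT1M, 5*PT10M, PT5M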
13.7. The Suite Service Directory¶
At registration time a suite service directory, $HOME/cylc-run/<SUITE>/.service/, is created and populated with a private passphrase file (containing random text), a self-signed SSL certificate (see Client-Server Interaction), and a symlink to the suite source directory. An existing passphrase file will not be overwritten if a suite is re-registered.
At run time, the private suite run database is also written to the service directory, along with a suite contact file that records the host, user, port number, process ID, Cylc version, and other information about the suite server program. Client commands automatically read daemon targeting information from the contact file, if they have access to it.
13.8. File-Reading Commands¶
Some Cylc commands and GUI actions parse suite configurations or read other files from the suite host account, rather than communicate with a suite server program over the network. In future we plan to have the suite server program serve up these files to clients, but for the moment this functionality requires read access to the relevant files on the suite host.
If you are logged into the suite host account, file-reading commands will just work.
13.8.2. Remote Host, Different Home Directory¶
If you are logged into another host with no shared home directory, file-reading commands require non-interactive ssh to the suite host account, and use of the --host and --user options to re-invoke the command on the suite account.
13.8.3. Same Host, Different User Account¶
(This is essentially the same as Remote Host, Different Home Directory.)
13.9. Client-Server Interaction¶
Cylc server programs listen on dedicated network ports for HTTPS communications from Cylc clients (task jobs, and user-invoked commands and GUIs).
Use cylc scan to see which suites are listening on which ports on scanned hosts (this lists your own suites by default, but it can show others too - see cylc scan --help).
Cylc supports two kinds of access to suite server programs:
- public (non-authenticated) - the amount of information revealed is configurable, see Public Access - No Auth Files
- control (authenticated) - full control, suite passphrase required, see Full Control - With Auth Files
13.9.1. Public Access - No Auth Files¶
Without a suite passphrase the amount of information revealed by a suite server program is determined by the public access privilege level set in global site/user config ([authentication]) and optionally overridden in suites ([cylc] -> [[authentication]]):
- identity - only suite and owner names revealed
- description - identity plus suite title and description
- state-totals - identity, description, and task state totals
- full-read - full read-only access for monitor and GUI
- shutdown - full read access plus shutdown, but no other control.
The default public access level is state-totals.
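For example, a suite could restrict its own public access level to suite identity only - a sketch:

[cylc]
    [[authentication]]
        public = identity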
The cylc scan command and the cylc gscan GUI can print descriptions and task state totals in addition to basic suite identity, if that information is revealed publicly.
13.9.2. Full Control - With Auth Files¶
Suite auth files (passphrase and SSL certificate) give full control. They are loaded from the suite service directory by the suite server program at start-up, and used to authenticate subsequent client connections. Passphrases are used in a secure encrypted challenge-response scheme, never sent in plain text over the network.
If two users need access to the same suite server program, they must both possess the passphrase file for that suite. Fine-grained access to a single suite server program via distinct user accounts is not currently supported.
Suite server programs automatically install their auth and contact files to job hosts via ssh, to enable task jobs to connect back to the suite server program for task messaging.
Client programs invoked by the suite owner automatically load the passphrase, SSL certificate, and contact file too, for automatic connection to suites.
Manual installation of suite auth files is only needed for remote control, if you do not have a shared filesystem - see below.
13.10. GUI-to-Suite Interaction¶
The gcylc GUI is mainly a network client to retrieve and display suite status information from the suite server program, but it can also invoke file-reading commands to view and graph the suite configuration and so on. This is entirely transparent if the GUI is running on the suite host account, but full functionality for remote suites requires either a shared filesystem, or (see Remote Control) auth file installation and non-interactive ssh access to the suite host. Without the auth files you will not be able to connect to the suite, and without ssh you will see “permission denied” errors on attempting file access.
13.11. Remote Control¶
Cylc client programs - command line and GUI - can interact with suite server programs running on other accounts or hosts. How this works depends on whether or not you have:
- a shared filesystem such that you see the same home directory on both hosts.
- non-interactive ssh from the client account to the server account.
With a shared filesystem, a suite registered on the remote (server) host is also - in effect - registered on the local (client) host. In this case you can invoke client commands without the --host option; the client will automatically read the host and port from the contact file in the suite service directory.
To control suite server programs running under other user accounts or on other hosts without a shared filesystem, the suite SSL certificate and passphrase must be installed under your $HOME/.cylc/ directory:
$HOME/.cylc/auth/OWNER@HOST/SUITE/
ssl.cert
passphrase
contact # (optional - see below)
where OWNER@HOST is the suite host account and SUITE is the suite name. Client commands should then be invoked with the --user and --host options, e.g.:
$ cylc gui --user=OWNER --host=HOST SUITE
Note
Remote suite auth files do not need to be installed for read-only access - see Public Access - No Auth Files - via the GUI or monitor.
The suite contact file (see The Suite Contact File) is not needed if you have read access to the remote suite run directory via the local filesystem or non-interactive ssh to the suite host account - client commands will automatically read it. If you do install the contact file in your auth directory, note that the port number will need to be updated if the suite gets restarted on a different port. Otherwise use cylc scan to determine the suite port number and use the --port client command option.
Warning
Possession of a suite passphrase gives full control over the target suite, including edit run functionality - which lets you run arbitrary scripting on job hosts as the suite owner. Further, non-interactive ssh gives full access to the target user account, so we recommend that this is only used to interact with suites running on accounts to which you already have full access.
13.12. Scan And Gscan¶
Both cylc scan and the cylc gscan GUI can display suites owned by other users on other hosts, including task state totals if the public access level permits that (see Public Access - No Auth Files). Clicking on a remote suite in gscan will open a cylc gui to connect to that suite. This will give you full control, if you have the suite auth files installed; or it will display full read-only information if the public access level allows that.
13.13. Task States Explained¶
As a suite runs, its task proxies may pass through the following states:
- waiting - still waiting for prerequisites (e.g. dependence on other tasks, and clock triggers) to be satisfied.
- held - will not be submitted to run even if all prerequisites are satisfied, until released/un-held.
- queued - ready to run (prerequisites satisfied) but temporarily held back by an internal cylc queue (see Limiting Activity With Internal Queues).
- ready - ready to run (prerequisites satisfied) and handed to cylc’s job submission sub-system.
- submitted - submitted to run, but not executing yet (could be waiting in an external batch scheduler queue).
- submit-failed - job submission failed or submitted job killed (cancelled) before commencing execution.
- submit-retrying - job submission failed, but a submission retry was configured. Will only enter the submit-failed state if all configured submission retries are exhausted.
- running - currently executing (a task started message was received, or the task polled as running).
- succeeded - finished executing successfully (a task succeeded message was received, or the task polled as succeeded).
- failed - aborted execution due to some error condition (a task failed message was received, or the task polled as failed).
- retrying - job execution failed, but an execution retry was configured. Will only enter the failed state if all configured execution retries are exhausted.
- runahead - will not have prerequisites checked (and so automatically held, in effect) until the rest of the suite catches up sufficiently. The amount of runahead allowed is configurable - see Runahead Limiting.
- expired - will not be submitted to run, due to falling too far behind the wall-clock relative to its cycle point - see Clock-Expire Triggers.
13.14. What The Suite Control GUI Shows¶
The GUI Text-tree and Dot Views display the state of every task proxy present in the task pool. Once a task has succeeded and Cylc has determined that it can no longer be needed to satisfy the prerequisites of other tasks, its proxy will be cleaned up (removed from the pool) and it will disappear from the GUI. To rerun a task that has disappeared from the pool, you need to re-insert its task proxy and then re-trigger it.
The Graph View is slightly different: it displays the complete dependency graph over the range of cycle points currently present in the task pool. This often includes some greyed-out base or ghost nodes that are empty - i.e. there are no corresponding task proxies currently present in the pool. Base nodes just flesh out the graph structure. Groups of them may be cut out and replaced by single scissor nodes in sections of the graph that are currently inactive.
13.15. Network Connection Timeouts¶
A connection timeout can be set in site and user global config files (see Global (Site, User) Configuration Files) so that messaging commands cannot hang indefinitely if the suite is not responding (this can be caused by suspending a suite with Ctrl-Z), thereby preventing the task from completing. The same can be done on the command line for other suite-connecting user commands, with the --comms-timeout option.
13.16. Runahead Limiting¶
Runahead limiting prevents the fastest tasks in a suite from getting too far ahead of the slowest ones. Newly spawned tasks are released to the task pool only when they fall below the runahead limit. A low runahead limit can prevent cylc from interleaving cycles, but it will not stall a suite unless it fails to extend out past a future trigger (see Inter-Cycle Triggers). A high runahead limit may allow fast tasks that are not constrained by dependencies or clock-triggers to spawn far ahead of the pack, which could have performance implications for the suite server program when running very large suites. Succeeded and failed tasks are ignored when computing the runahead limit.
The preferred runahead limiting mechanism restricts the number of consecutive active cycle points. The default value is three active cycle points; see [scheduling] -> max active cycle points. Alternatively the interval between the slowest and fastest tasks can be specified as hard limit; see [scheduling] -> runahead limit.
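A sketch of the two alternatives (values are illustrative; configure one or the other):

[scheduling]
    # limit to five consecutive active cycle points:
    max active cycle points = 5
    # or set a hard interval between the slowest and fastest tasks:
    # runahead limit = P1D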
13.17. Limiting Activity With Internal Queues¶
Large suites can potentially overwhelm task hosts by submitting too many tasks at once. You can prevent this with internal queues, which limit the number of tasks that can be active (submitted or running) at the same time.
Internal queues behave in a first-in-first-out (FIFO) manner, i.e. tasks are released from a queue in the same order that they were queued.
A queue is defined by a name; a limit, which is the maximum number of active tasks allowed for the queue; and a list of members, assigned by task or family name.
Queue configuration is done under the [scheduling]
section of the suite.rc
file (like dependencies, internal queues constrain when a task runs).
By default every task is assigned to the default queue, which by default has a zero limit (interpreted by cylc as no limit). To use a single queue for the whole suite just set the default queue limit:
[scheduling]
[[queues]]
# limit the entire suite to 5 active tasks at once
[[[default]]]
limit = 5
To use additional queues just name each one, set their limits, and assign members:
[scheduling]
[[queues]]
[[[q_foo]]]
limit = 5
members = foo, bar, baz
Any tasks not assigned to a particular queue will remain in the default queue. The queues example suite illustrates how queues work by running two task trees side by side (as seen in the graph GUI), limited to 2 and 3 active tasks respectively:
[meta]
title = demonstrates internal queueing
description = """
Two trees of tasks: the first uses the default queue set to a limit of
two active tasks at once; the second uses another queue limited to three
active tasks at once. Run via the graph control GUI for a clear view.
"""
[scheduling]
[[queues]]
[[[default]]]
limit = 2
[[[foo]]]
limit = 3
members = n, o, p, FAM2, u, v, w, x, y, z
[[dependencies]]
graph = """
a => b & c => FAM1
n => o & p => FAM2
FAM1:succeed-all => h & i & j & k & l & m
FAM2:succeed-all => u & v & w & x & y & z
"""
[runtime]
[[FAM1, FAM2]]
[[d,e,f,g]]
inherit = FAM1
[[q,r,s,t]]
inherit = FAM2
13.18. Automatic Task Retry On Failure¶
See also [runtime] -> [[__NAME__]] -> [[[job]]] -> execution retry delays.
Tasks can be configured with a list of “retry delay” intervals, as ISO 8601 durations. If the task job fails it will go into the retrying state and resubmit after the next configured delay interval. An example is shown in the suite listed below under Task Event Handling.
If a task with configured retries is killed (by cylc kill or via the GUI) it goes to the held state so that the operator can decide whether to release it and continue the retry sequence, or to abort the retry sequence by manually resetting it to the failed state.
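A minimal sketch for a hypothetical task model - retry after one minute, then ten minutes, then twice more at hourly intervals:

[runtime]
    [[model]]
        [[[job]]]
            execution retry delays = PT1M, PT10M, 2*PT1H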
13.19. Task Event Handling¶
See also [cylc] -> [[events]] and [runtime] -> [[__NAME__]] -> [[[events]]].
Cylc can call nominated event handlers - to do whatever you like - when certain suite or task events occur. This facilitates centralized alerting and automated handling of critical events. Event handlers can be used to send a message, call a pager, or whatever; they can even intervene in the operation of their own suite using cylc commands.
To send an email, use the built-in setting [[[events]]]mail events to specify a list of events for which notifications should be sent. (The name of a registered task output can also be used as an event name in this case.) E.g. to send an email on (submission) failed and retry:
[runtime]
[[foo]]
script = """
test ${CYLC_TASK_TRY_NUMBER} -eq 3
cylc message -- "${CYLC_SUITE_NAME}" "${CYLC_TASK_JOB}" 'oopsy daisy'
"""
[[[events]]]
mail events = submission failed, submission retry, failed, retry, oops
[[[job]]]
execution retry delays = PT0S, PT30S
[[[outputs]]]
oops = oopsy daisy
By default, the emails will be sent to the current user with:
- to: set as $USER
- from: set as notifications@$(hostname)
- SMTP server at localhost:25
These can be configured using the settings:
- [[[events]]]mail to (list of email addresses)
- [[[events]]]mail from
- [[[events]]]mail smtp
By default, a cylc suite will send you no more than one task event email every 5 minutes - this is to prevent your inbox from being flooded by emails should a large group of tasks all fail at a similar time. See [cylc] -> task event mail interval for details.
Event handlers can be located in the suite bin/ directory; otherwise it is up to you to ensure their location is in $PATH (in the shell in which the suite server program runs). They should require few resources and return quickly - see Managing External Command Execution.
Task event handlers can be specified using the [[[events]]]<event> handler settings, where <event> is one of:
- ‘submitted’ - the job submit command was successful
- ‘submission failed’ - the job submit command failed
- ‘submission timeout’ - task job submission timed out
- ‘submission retry’ - task job submission failed, but will retry after a configured delay
- ‘started’ - the task reported commencement of execution
- ‘succeeded’ - the task reported successful completion
- ‘warning’ - the task reported a WARNING severity message
- ‘critical’ - the task reported a CRITICAL severity message
- ‘custom’ - the task reported a CUSTOM severity message
- ‘late’ - the task is never active and is late
- ‘failed’ - the task failed
- ‘retry’ - the task failed but will retry after a configured delay
- ‘execution timeout’ - task execution timed out
The value of each setting should be a list of command lines or command line templates (see below).
Alternatively you can use [[[events]]]handlers and [[[events]]]handler events, where the former is a list of command lines or command line templates (see below) and the latter is a list of events for which these commands should be invoked. (The name of a registered task output can also be used as an event name in this case.)
Event handler arguments can be constructed from various templates representing suite name; task ID, name, cycle point, message, and submit number; and any suite or task [meta] item.
See [cylc] -> [[events]] and [runtime] -> [[__NAME__]] -> [[[events]]] for options.
If no template arguments are supplied the following default command line will be used:
<task-event-handler> %(event)s %(suite)s %(id)s %(message)s
Note
Substitution patterns should not be quoted in the template strings. This is done automatically where required.
For an explanation of the substitution syntax, see String Formatting Operations in the Python documentation.
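As an illustration, a handler invoked with the default arguments above could be a simple shell script like the following sketch (the script name, log path, and email address are hypothetical):

#!/bin/bash
# handle-task-event: log every event; email an operator on failure
set -eu
EVENT="$1"; SUITE="$2"; TASK_ID="$3"; MESSAGE="${4:-}"
echo "$(date -u +%FT%TZ) ${SUITE} ${TASK_ID} ${EVENT} ${MESSAGE}" >> "${HOME}/cylc-task-events.log"
if [[ "${EVENT}" == "failed" ]]; then
    mail -s "[${SUITE}] ${TASK_ID} failed" oncall@example.com <<< "${MESSAGE}"
fi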
The retry event occurs if a task fails and has any remaining retries configured (see Automatic Task Retry On Failure). The event handler will be called as soon as the task fails, not after the retry delay period when it is resubmitted.
Note
Event handlers are called by the suite server program, not by task jobs. If you wish to pass additional information to them use [cylc] -> [[environment]], not task runtime environment.
The following two suite.rc snippets are examples of how to specify event handlers using the alternate methods:
[runtime]
[[foo]]
script = test ${CYLC_TASK_TRY_NUMBER} -eq 2
[[[events]]]
retry handler = "echo '!!!!!EVENT!!!!!' "
failed handler = "echo '!!!!!EVENT!!!!!' "
[[[job]]]
execution retry delays = PT0S, PT30S
[runtime]
[[foo]]
script = """
test ${CYLC_TASK_TRY_NUMBER} -eq 2
cylc message -- "${CYLC_SUITE_NAME}" "${CYLC_TASK_JOB}" 'oopsy daisy'
"""
[[[events]]]
handlers = "echo '!!!!!EVENT!!!!!' "
# Note: task output name can be used as an event in this method
handler events = retry, failed, oops
[[[job]]]
execution retry delays = PT0S, PT30S
[[[outputs]]]
oops = oopsy daisy
The handler command here - specified with no arguments - is called with the default arguments, like this:
echo '!!!!!EVENT!!!!!' %(event)s %(suite)s %(id)s %(message)s
13.19.1. Late Events¶
You may want to be notified when certain tasks are running late in a real time production system - i.e. when they have not triggered by the usual time. Tasks of primary interest are not normally clock-triggered however, so their trigger times are mostly a function of how the suite runs in its environment, and even external factors such as contention with other suites [3].
But if your system is reasonably stable from one cycle to the next, such that a given task has consistently triggered by some interval beyond its cycle point, you can configure Cylc to emit a late event if it has not triggered by that time. For example, if a task forecast normally triggers by 30 minutes after its cycle point, configure late notification for it like this:
[runtime]
[[forecast]]
script = run-model.sh
[[[events]]]
late offset = PT30M
late handler = my-handler %(message)s
Late offset intervals are not computed automatically so be careful to update them after any change that affects triggering times.
Note
Cylc can only check for lateness in tasks that it is currently aware of. If a suite gets delayed over many cycles the next tasks coming up can be identified as late immediately, and subsequent tasks can be identified as late as the suite progresses to subsequent cycle points, until it catches up to the clock.
13.20. Managing External Command Execution¶
Job submission commands, event handlers, and job poll and kill commands, are executed by the suite server program in a “pool” of asynchronous subprocesses, in order to avoid holding the suite up. The process pool is actively managed to limit it to a configurable size (process pool size). Custom event handlers should be light-weight and quick-running because they will tie up a process pool member until they complete, and the suite will appear to stall if the pool is saturated with long-running processes. Processes are killed after a configurable timeout (process pool timeout) however, to guard against rogue commands that hang indefinitely. All process kills are logged by the suite server program. For killed job submissions the associated tasks also go to the submit-failed state.
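The relevant settings are top-level items in the site/user global config; a sketch with illustrative values:

# site/user global config on the suite host
process pool size = 4
process pool timeout = PT10M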
13.21. Handling Job Preemption¶
Some HPC facilities allow job preemption: the resource manager can kill or suspend running low priority jobs in order to make way for high priority jobs. The preempted jobs may then be automatically restarted by the resource manager, from the same point (if suspended) or requeued to run again from the start (if killed).
Suspended jobs will poll as still running (their job status file says they started running, and they still appear in the resource manager queue). Loadleveler jobs that are preempted by kill-and-requeue (“job vacation”) are automatically returned to the submitted state by Cylc. This is possible because Loadleveler sends the SIGUSR1 signal before SIGKILL for preemption. Other batch schedulers just send SIGTERM before SIGKILL as normal, so Cylc cannot distinguish a preemption job kill from a normal job kill. After this the job will poll as failed (correctly, because it was killed, and the job status file records that). To handle this kind of preemption automatically you could use a task failed or retry event handler that queries the batch scheduler queue (after an appropriate delay if necessary) and then, if the job has been requeued, uses cylc reset to reset the task to the submitted state.
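A rough sketch of such a handler, assuming the default handler arguments and a hypothetical site command my-batch-query that reports whether a job has been requeued:

#!/bin/bash
# handle-preemption: if a "failed" job was actually requeued by the
# resource manager, put the task back in the submitted state.
set -eu
EVENT="$1"; SUITE="$2"; TASK_ID="$3"
sleep 60  # give the resource manager time to requeue the job
if my-batch-query --requeued "${TASK_ID}"; then   # hypothetical site command
    cylc reset --state=submitted "${SUITE}" "${TASK_ID}"
fi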
13.22. Manual Task Triggering and Edit-Run¶
Any task proxy currently present in the suite can be manually triggered at any time using the cylc trigger command, or from the right-click task menu in gcylc. If the task belongs to a limited internal queue (see Limiting Activity With Internal Queues), this will queue it; if not, or if it is already queued, it will submit immediately.
With cylc trigger --edit (also in the gcylc right-click task menu) you can edit the generated task job script to make one-off changes before the task submits.
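For example (the suite name and task ID are placeholders):
$ cylc trigger SUITE 'model.20170101T00Z'
$ cylc trigger --edit SUITE 'model.20170101T00Z'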
13.23. Cylc Broadcast¶
The cylc broadcast command overrides [runtime] settings in a running suite. This can be used to communicate information to downstream tasks by broadcasting environment variables (communication of information from one task to another normally takes place via the filesystem, i.e. the input/output file relationships embodied in inter-task dependencies). Variables (and any other runtime settings) may be broadcast to all subsequent tasks, or targeted at a specific task, at all subsequent tasks with a given name, or at all tasks with a given cycle point; see the broadcast command help for details.
Broadcast settings targeted at a specific task ID or cycle point expire and are forgotten as the suite moves on. Un-targeted variables and those targeted at a task name persist throughout the suite run, even across restarts, unless manually cleared using the broadcast command - and so should be used sparingly.
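For example, to broadcast an environment variable to all tasks named model at cycle point 20170101T00Z (names and values are placeholders):
$ cylc broadcast -n model -p 20170101T00Z -s '[environment]VERIFY_MODE=true' SUITE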
13.24. The Meaning And Use Of Initial Cycle Point¶
When a suite is started with the cylc run command (cold or warm start) the cycle point at which it starts can be given on the command line or hardwired into the suite.rc file:
cylc run foo 20120808T06Z
or:
[scheduling]
initial cycle point = 20100808T06Z
An initial cycle point given on the command line will override one in the suite.rc file.
13.24.1. The Environment Variable CYLC_SUITE_INITIAL_CYCLE_POINT¶
In the case of a cold start only, the initial cycle point is passed through to task execution environments as $CYLC_SUITE_INITIAL_CYCLE_POINT. The value is then stored in suite database files and persists across restarts, but it does get wiped out (set to None) after a warm start, because a warm start is really an implicit restart in which all state information is lost (except that the previous cycle is assumed to have completed).
The $CYLC_SUITE_INITIAL_CYCLE_POINT variable allows tasks to determine if they are running in the initial cold-start cycle point, when different behaviour may be required, or in a normal mid-run cycle point.
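A sketch of this pattern in task scripting (run-model is a placeholder command; the two environment variables are provided by Cylc in the job environment):

[runtime]
    [[model]]
        script = """
            if [[ "${CYLC_TASK_CYCLE_POINT}" == "${CYLC_SUITE_INITIAL_CYCLE_POINT}" ]]; then
                run-model --coldstart
            else
                run-model
            fi
        """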
Note however that an initial R1 graph section is now the preferred way to get different behaviour at suite start-up.
13.25. Simulating Suite Behaviour¶
Several suite run modes allow you to simulate suite behaviour quickly without running the suite’s real jobs - which may be long-running and resource-hungry:
- dummy mode:
  - runs dummy tasks as background jobs on configured job hosts
  - simulates scheduling, job host connectivity, and generates all job files on suite and job hosts
- dummy-local mode:
  - runs real dummy tasks as background jobs on the suite host, which allows dummy-running suites from other sites
  - simulates scheduling and generates all job files on the suite host
- simulation mode:
  - does not run any real tasks
  - simulates scheduling without generating any job files
Set the run mode (default live) in the GUI suite start dialog box, or on the command line:
$ cylc run --mode=dummy SUITE
$ cylc restart --mode=dummy SUITE
You can get specified tasks to fail in these modes, for more flexible suite testing. See [runtime] -> [[__NAME__]] -> [[[simulation]]] for simulation configuration.
13.25.1. Proportional Simulated Run Length¶
If task [job]execution time limit is set, Cylc divides it by [simulation]speedup factor (default 10.0) to compute simulated task run lengths (default 10 seconds).
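For example, a sketch giving a hypothetical task model a simulated run length of one minute:

[runtime]
    [[model]]
        [[[job]]]
            execution time limit = PT1H
        [[[simulation]]]
            speedup factor = 60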
13.25.2. Limitations Of Suite Simulation¶
Dummy mode ignores batch scheduler settings because Cylc does not know which job resource directives (requested memory, number of compute nodes, etc.) would need to be changed for the dummy jobs. If you need to dummy-run jobs on a batch scheduler, manually comment out script items and modify directives in your live suite, or else use a custom live mode test suite.
Note
The dummy modes ignore all configured task script items, including init-script. If your init-script is required to run even dummy tasks on a job host, note that host environment setup should be done elsewhere - see Configure Site Environment on Job Hosts.
13.25.3. Restarting Suites With A Different Run Mode?¶
The run mode is recorded in the suite run database files. Cylc will not let you restart a non-live mode suite in live mode, or vice versa. To test a live suite in simulation mode just take a quick copy of it and run the copy in simulation mode.
13.26. Automated Reference Test Suites¶
Reference tests are finite-duration suite runs that abort with non-zero exit status if any of the following conditions occur (by default):
- cylc fails
- any task fails
- the suite times out (e.g. a task dies without reporting failure)
- a nominated shutdown event handler exits with error status
The default shutdown event handler for reference tests is cylc hook check-triggering, which compares task triggering information (what triggers off what at run time) in the test run suite log to that from an earlier reference run, disregarding the timing and order of events - which can vary according to the external queueing conditions, runahead limit, and so on.
To prepare a reference log for a suite, run it with the --reference-log option, and manually verify the correctness of the reference run.
To reference test a suite, just run it (in dummy mode for the most comprehensive test without running real tasks) with the --reference-test option.
A battery of automated reference tests is used to test cylc before posting a new release version. Reference tests can also be used to check that a cylc upgrade will not break your own complex suites - the triggering check will catch any bug that causes a task to run when it shouldn't, for instance; even in a dummy mode reference test the full task job script (sans script items) executes on the proper task host by the proper batch system.
Reference tests can be configured with the following settings:
[cylc]
[[reference test]]
suite shutdown event handler = cylc check-triggering
required run mode = dummy
allow task failures = False
live mode suite timeout = PT5M
dummy mode suite timeout = PT2M
simulation mode suite timeout = PT2M
13.26.1. Roll-your-own Reference Tests¶
If the default reference test is not sufficient for your needs, firstly note that you can override the default shutdown event handler, and secondly that the --reference-test option is merely a short cut to the following suite.rc settings, which can also be set manually if you wish:
[cylc]
abort if any task fails = True
[[events]]
shutdown handler = cylc check-triggering
timeout = PT5M
abort if shutdown handler fails = True
abort on timeout = True
13.27. Triggering Off Of Tasks In Other Suites¶
Note
Please read External Triggers before using the older inter-suite triggering mechanism described in this section.
The cylc suite-state command interrogates suite run databases. It has a polling mode that waits for a given task in the target suite to achieve a given state, or receive a given message. This can be used to make task scripting wait for a remote task to succeed (for example).
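For example, task scripting might poll another suite directly like this (the suite name, task, cycle point, and option values are placeholders):
$ cylc suite-state REMOTE-SUITE --task=foo --point=20170101T00Z --status=succeeded --max-polls=10 --interval=30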
Automatic suite-state polling tasks can be defined in the graph. They get automatically-generated task scripting that uses cylc suite-state appropriately (it is an error to give your own script item for these tasks).
Here's how to trigger a task bar off a task foo in a remote suite called other.suite:
[scheduling]
[[dependencies]]
[[[T00, T12]]]
graph = "my-foo<other.suite::foo> => bar"
Local task my-foo will poll for the success of foo in suite other.suite, at the same cycle point, succeeding only when or if it succeeds. Other task states can also be polled:
graph = "my-foo<other.suite::foo:fail> => bar"
The default polling parameters (e.g. maximum number of polls and the interval between them) are printed by cylc suite-state --help and can be configured if necessary under the local polling task runtime section:
[scheduling]
[[dependencies]]
[[[T00,T12]]]
graph = "my-foo<other.suite::foo> => bar"
[runtime]
[[my-foo]]
[[[suite state polling]]]
max-polls = 100
interval = PT10S
To poll for the target task to receive a message rather than achieve a state, give the message in the runtime configuration (in which case the task status inferred from the graph syntax will be ignored):
[runtime]
[[my-foo]]
[[[suite state polling]]]
message = "the quick brown fox"
For suites owned by others, or those with run databases in non-standard locations, use the --run-dir option, or in-suite:
[runtime]
[[my-foo]]
[[[suite state polling]]]
run-dir = /path/to/top/level/cylc/run-directory
If the remote task has a different cycling sequence, just arrange for the local polling task to be on the same sequence as the remote task that it represents. For instance, if local task cat cycles 6-hourly at 0,6,12,18 but needs to trigger off a remote task dog at 3,9,15,21:
[scheduling]
[[dependencies]]
[[[T03,T09,T15,T21]]]
graph = "my-dog<other.suite::dog>"
[[[T00,T06,T12,T18]]]
graph = "my-dog[-PT3H] => cat"
For suite-state polling, the cycle point is automatically converted to the cycle point format of the target suite.
The remote suite does not have to be running when polling commences because the command interrogates the suite run database, not the suite server program.
Note
The graph syntax for suite polling tasks cannot be combined with cycle point offsets, family triggers, or parameterized task notation. This does not present a problem because suite polling tasks can be put on the same cycling sequence as the remote-suite target task (as recommended above), and there is no point in having multiple tasks (family members or parameterized tasks) performing the same polling operation. Task state triggers can be used with suite polling, e.g. to trigger another task if polling fails after 10 tries at 10 second intervals:
[scheduling]
[[dependencies]]
graph = "poller<other-suite::foo:succeed>:fail => another-task"
[runtime]
[[poller]]
[[[suite state polling]]]
max-polls = 10
interval = PT10S
13.28. Suite Server Logs¶
Each suite maintains its own log of time-stamped events under the suite server log directory:
$HOME/cylc-run/SUITE-NAME/log/suite/
By way of example, we will show the complete server log generated (at cylc-7.2.0) by a small suite that runs two 30-second dummy tasks foo and bar for a single cycle point 2017-01-01T00Z before shutting down:
[cylc]
cycle point format = %Y-%m-%dT%HZ
[scheduling]
initial cycle point = 2017-01-01T00Z
final cycle point = 2017-01-01T00Z
[[dependencies]]
graph = "foo => bar"
[runtime]
[[foo]]
script = sleep 30; /bin/false
[[bar]]
script = sleep 30; /bin/true
By the task scripting defined above, this suite will stall when foo fails. Then, the suite owner vagrant@cylon manually resets the failed task's state to succeeded, allowing bar to trigger and the suite to finish and shut down. Here's the complete suite log for this run:
$ cylc cat-log SUITE-NAME
2017-03-30T09:46:10Z INFO - Suite starting: server=localhost:43086 pid=3483
2017-03-30T09:46:10Z INFO - Run mode: live
2017-03-30T09:46:10Z INFO - Initial point: 2017-01-01T00Z
2017-03-30T09:46:10Z INFO - Final point: 2017-01-01T00Z
2017-03-30T09:46:10Z INFO - Cold Start 2017-01-01T00Z
2017-03-30T09:46:11Z INFO - [foo.2017-01-01T00Z] -submit_method_id=3507
2017-03-30T09:46:11Z INFO - [foo.2017-01-01T00Z] -submission succeeded
2017-03-30T09:46:11Z INFO - [foo.2017-01-01T00Z] status=submitted: (received)started at 2017-03-30T09:46:10Z for job(01)
2017-03-30T09:46:41Z CRITICAL - [foo.2017-01-01T00Z] status=running: (received)failed/EXIT at 2017-03-30T09:46:40Z for job(01)
2017-03-30T09:46:42Z WARNING - suite stalled
2017-03-30T09:46:42Z WARNING - Unmet prerequisites for bar.2017-01-01T00Z:
2017-03-30T09:46:42Z WARNING - * foo.2017-01-01T00Z succeeded
2017-03-30T09:47:58Z INFO - [client-command] reset_task_states vagrant@cylon:cylc-reset 1e0d8e9f-2833-4dc9-a0c8-9cf263c4c8c3
2017-03-30T09:47:58Z INFO - [foo.2017-01-01T00Z] -resetting state to succeeded
2017-03-30T09:47:58Z INFO - Command succeeded: reset_task_states([u'foo.2017'], state=succeeded)
2017-03-30T09:47:59Z INFO - [bar.2017-01-01T00Z] -submit_method_id=3565
2017-03-30T09:47:59Z INFO - [bar.2017-01-01T00Z] -submission succeeded
2017-03-30T09:47:59Z INFO - [bar.2017-01-01T00Z] status=submitted: (received)started at 2017-03-30T09:47:58Z for job(01)
2017-03-30T09:48:29Z INFO - [bar.2017-01-01T00Z] status=running: (received)succeeded at 2017-03-30T09:48:28Z for job(01)
2017-03-30T09:48:30Z INFO - Waiting for the command process pool to empty for shutdown
2017-03-30T09:48:30Z INFO - Suite shutting down - AUTOMATIC
The information logged here includes:
- event timestamps, at the start of each line
- suite server host, port and process ID
- suite initial and final cycle points
- suite start type (cold start in this case)
- task events (task started, succeeded, failed, etc.)
- suite stalled warning (in this suite nothing else can run when foo fails)
- the client command issued by vagrant@cylon to reset foo to succeeded
- job IDs - in this case process IDs for background jobs (or PBS job IDs etc.)
- state changes due to incoming task progress messages (“started at …” etc.)
- suite shutdown time and reasons (AUTOMATIC means “all tasks finished and nothing else to do”)
Note
Suite log files are primarily intended for human eyes. If you need to have an external system to monitor suite events automatically, interrogate the sqlite suite run database (see Suite Run Databases) rather than parse the log files.
13.29. Suite Run Databases¶
Suite server programs maintain two sqlite databases to record restart checkpoints and various other aspects of run history:
$HOME/cylc-run/SUITE-NAME/log/db # public suite DB
$HOME/cylc-run/SUITE-NAME/.service/db # private suite DB
The private DB is for use only by the suite server program. The identical public DB is provided for use by external commands such as cylc suite-state, cylc ls-checkpoints, and cylc report-timings. If the public DB gets locked for too long by an external reader, the suite server program will eventually delete it and replace it with a new copy of the private DB, to ensure that both correctly reflect the suite state.
You can interrogate the public DB with the sqlite3 command line tool, the sqlite3 module in the Python standard library, or any other sqlite interface.
$ sqlite3 ~/cylc-run/foo/log/db << _END_
> .headers on
> select * from task_events where name is "foo";
> _END_
name|cycle|time|submit_num|event|message
foo|1|2017-03-12T11:06:09Z|1|submitted|
foo|1|2017-03-12T11:06:09Z|1|output completed|started
foo|1|2017-03-12T11:06:09Z|1|started|
foo|1|2017-03-12T11:06:19Z|1|output completed|succeeded
foo|1|2017-03-12T11:06:19Z|1|succeeded|
13.30. Disaster Recovery¶
If a suite run directory gets deleted or corrupted, the options for recovery are:
- restore the run directory from back-up, and restart the suite
- re-install from source, and warm start from the beginning of the current cycle point
A warm start (see Warm Start) does not need a suite state checkpoint, but it wipes out prior run history, and it could re-run a significant number of tasks that had already completed.
To restart the suite, the critical Cylc files that must be restored are:
# On the suite host:
~/cylc-run/SUITE-NAME/
suite.rc # live suite configuration (located here in Rose suites)
log/db # public suite DB (can just be a copy of the private DB)
log/rose-suite-run.conf # (needed to restart a Rose suite)
.service/db # private suite DB
.service/source -> PATH-TO-SUITE-DIR # symlink to live suite directory
# On job hosts (if no shared filesystem):
~/cylc-run/SUITE-NAME/
log/job/CYCLE-POINT/TASK-NAME/SUBMIT-NUM/job.status
Note
This discussion does not address restoration of files generated and consumed by task jobs at run time. How suite data is stored and recovered in your environment is a matter of suite and system design.
In short, you can simply restore the suite service directory, the log directory, and the suite.rc file that is the target of the symlink in the service directory. The service and log directories will come with extra files that aren't strictly needed for a restart, but that doesn't matter - although depending on your log housekeeping the log/job directory could be huge, so you might want to be selective about that. (Also in a Rose suite, the suite.rc file does not need to be restored if you restart with rose suite-run - which re-installs suite source files to the run directory.)
The public DB is not strictly required for a restart - the suite server program will recreate it if need be - but it is required by cylc ls-checkpoints if you need to identify the right restart checkpoint.
The job status files are only needed if the restart suite state checkpoint contains active tasks that need to be polled to determine what happened to them while the suite was down. Without them, polling will fail and those tasks will need to be manually set to the correct state.
Warning
It is not safe to copy or rsync a potentially-active sqlite DB - the copy might end up corrupted. It is best to stop the suite before copying a DB, or else write a back-up utility using the official sqlite backup API.
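For example, the sqlite3 command line tool exposes the online backup API via its .backup command; a minimal sketch (the destination path is a placeholder):
$ sqlite3 ~/cylc-run/SUITE-NAME/.service/db ".backup '/path/to/backup/db'"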
13.31. Auto Stop-Restart¶
Cylc has the ability to automatically stop suites running on a particular host and optionally, restart them on a different host. This is useful if a host needs to be taken off-line e.g. for scheduled maintenance.
This functionality is configured via the following site configuration settings:
- [suite servers]auto restart delay
- [suite servers]condemned hosts
- [suite servers]run hosts
The auto stop-restart feature has two modes:
- Normal Mode:
  - When a host is added to the condemned hosts list, any suites running on that host will automatically shut down, then restart, selecting a new host from run hosts.
  - For safety, before attempting to stop the suite Cylc will first wait for any jobs running locally (under background or at) to complete.
  - In order for Cylc to be able to successfully restart suites, the run hosts must all be on a shared filesystem.
- Force Mode:
  - If a host is suffixed with an exclamation mark then Cylc will not attempt to automatically restart the suite, and any local jobs (running under background or at) will be left running.
For example, in the following configuration any suites running on foo will attempt to restart on pub, whereas any suites running on bar will stop immediately, making no attempt to restart.
[suite servers]
run hosts = pub
condemned hosts = foo, bar!
To prevent large numbers of suites attempting to restart simultaneously, the auto restart delay setting defines a period of time in seconds. Suites will wait for a random period of time between zero and auto restart delay seconds before attempting to stop and restart.
Suites that are started in no-detach mode cannot be automatically restarted on a different host, as the suite process would remain attached to the condemned host. Instead, a suite in no-detach mode running on a condemned host will abort with a non-zero return code. The parent process should manually handle the restart of the suite if desired.
See the [suite servers] configuration section ([suite servers]) for more details.
[3] Late notification of clock-triggered tasks is not very useful in any case because they typically do not depend on other tasks, and as such they can often trigger on time even if the suite is delayed to the point that downstream tasks are late due to their dependence on previous-cycle tasks that are delayed.
13.32. Alternate Suite Run Directories¶
The cylc register command normally creates a suite run directory at the standard location ~/cylc-run/<SUITE-NAME>/. With the --run-dir option it can create the run directory at some other location, with a symlink from ~/cylc-run/<SUITE-NAME> to allow access via the standard file path.
This may be useful for quick-running Sub-Suites that generate large numbers of files - you could put their run directories on fast local disk or RAM disk, for performance and housekeeping reasons.
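For example, a sketch registering a suite with its run directory on a hypothetical fast local disk (the paths are placeholders):
$ cylc register --run-dir=/fast-disk/$USER SUITE-NAME /path/to/SUITE-NAME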
13.33. Sub-Suites¶
A single Cylc suite can configure multiple cycling sequences in the graph, but cycles can’t be nested. If you need cycles within cycles - e.g. to iterate over many files generated by each run of a cycling task - current options are:
- parameterize the sub-cycles:
  - this is easy, but it makes more tasks per cycle, which is the primary determinant of suite size and server program efficiency
- run a separate cycling suite over the sub-cycle, inside a main-suite task, for each main-suite cycle point - i.e. use sub-suites:
  - this is very efficient, but monitoring and run-directory housekeeping may be more difficult because it creates multiple suites and run directories
Sub-suites must be started with --no-detach so that the containing task does not finish until the sub-suite does, and they should be non-cycling or have a final cycle point so they don't keep on running indefinitely.
Sub-suite names should normally incorporate the main-suite cycle point (use $CYLC_TASK_CYCLE_POINT in the cylc run command line to start the sub-suite), so that successive sub-suites can run concurrently if necessary and do not compete for the same run directory. This will generate a new sub-suite run directory for every main-suite cycle point, so you may want to put housekeeping tasks in the main suite to extract the useful products from each sub-suite run and then delete the sub-suite run directory.
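A sketch of a main-suite task that registers and runs a sub-suite for its own cycle point (the sub-suite source path and naming scheme are assumptions):

[runtime]
    [[run-subsuite]]
        script = """
            # name the sub-suite after the main-suite cycle point
            SUB="${CYLC_SUITE_NAME}-sub-${CYLC_TASK_CYCLE_POINT}"
            cylc register "${SUB}" "${CYLC_SUITE_DEF_PATH}/sub-suite"
            cylc run --no-detach "${SUB}"
        """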
For quick-running sub-suites that generate large numbers of files, consider using Alternate Suite Run Directories for better performance and easier housekeeping.