Task Implementation
Existing scripts and executables can be used as Cylc tasks without modification so long as they return standard exit status - zero on success, non-zero for failure - and do not spawn detaching processes internally (see Avoid Detaching Processes).
Job Scripts
When the scheduler determines that a task is ready to run it generates a job script for the task, and submits it to run (see Job Submission and Management).
Job scripts encapsulate configured task runtime settings: script and [environment] items, if defined, are simply concatenated in the order shown below to make the job script.
The job script is separated into two parts. User scripts and environment (env-script, [environment], pre-script, script and post-script) are isolated and executed in a separate subshell process. Any changes to environment, traps, etc., made in any of them are visible in subsequent parts, but will not interfere with the parent shell process. This parent shell executes and shares its environment with init-script, exit-script and err-script.
In particular, any environment changes in init-script will be visible in the parent shell, as well as in the subshell hosting user scripts.
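The split between the parent shell and the user-script subshell can be illustrated with a minimal sketch. This is not the job script that Cylc generates: the function name user_scripts and the variables DEMO_INIT_VAR and DEMO_USER_VAR are hypothetical, and the comments only indicate which configured item each part stands in for.

```shell
#!/usr/bin/env bash
# Illustrative miniature of the two-part layout (not the real Cylc job script).

# Stand-in for init-script: runs in the parent shell.
DEMO_INIT_VAR="set by init-script"

# Stand-in for env-script through post-script.
# The parenthesised function body runs in a subshell.
user_scripts() (
    # A change made by one part is visible to later parts of the subshell.
    DEMO_USER_VAR="set by pre-script"
    # Overriding a parent variable here does not leak back to the parent.
    DEMO_INIT_VAR="overridden by script"
    echo "${DEMO_INIT_VAR} / ${DEMO_USER_VAR}"
)
subshell_view=$(user_scripts)

# Stand-in for exit-script or err-script: back in the parent shell.
parent_view="${DEMO_INIT_VAR} / ${DEMO_USER_VAR:-unset}"

echo "subshell saw: ${subshell_view}"
echo "parent sees:  ${parent_view}"
```

The subshell inherits the value set by the init-script stand-in, while nothing it changes survives once control returns to the parent shell.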
The parent shell sets trap handlers for some signals (the exact list depends on the batch system in use) and typically resends any received signal to the subshell and its children (the whole process group, to be precise), unless the batch system doesn’t execute the job script as process leader, in which case the signal is resent to the subshell process only. Immediately after that, err-script is executed without waiting for the subshell process to finish, unless you start it with the wait command to prevent that.
The two “Cylc defined scripts” are:
cylc-env, which provides default CYLC_* environment variables, e.g. CYLC_TASK_NAME.
user-env, which is the contents of the [environment] section.
Job scripts are written to the workflow’s job log directory. They can be printed with cylc cat-log.
Inlined Tasks
Task script items can be multi-line strings of bash code, so many tasks can be entirely inlined in the flow.cylc file.
For anything more than a few lines of code, however, we recommend using external shell scripts to allow independent testing, re-use, and shell mode editing.
Interpreter
The job script (which incorporates the *-script items) runs in the bash interpreter.
Cylc searches for bash in the $PATH by first running a login bash shell, which means you can choose the bash interpreter used by modifying the $PATH in your bash configuration files (e.g. .bashrc).
Task Messages
Jobs send status messages back to the scheduler to report that execution has started, succeeded, or failed. Custom messages can also be sent by the same mechanism, with various severity levels. These can be used to trigger other tasks off specific task outputs (see Message Triggers), or to trigger execution of event handlers by the scheduler (see Event Handling), or just to write information to the scheduler log.
(If polling is configured as the communication method for a platform, the messaging system just writes messages to the local job status file for recovery by the scheduler at the next poll.)
Normal severity messages are printed to job.out and logged by the scheduler:
cylc message -- "${CYLC_WORKFLOW_ID}" "${CYLC_TASK_JOB}" \
"Hello from ${CYLC_TASK_ID}"
“CUSTOM” severity messages are printed to job.out, logged by the scheduler, and can be used to trigger custom event handlers:
cylc message -- "${CYLC_WORKFLOW_ID}" "${CYLC_TASK_JOB}" \
"CUSTOM:data available for ${CYLC_TASK_CYCLE_POINT}"
These can be used to signal special events that are neither routine information nor an error condition, such as production of a particular data file (a “data availability” event).
“WARNING” severity messages are printed to job.err, logged by the scheduler, and can be passed to warning event handlers:
cylc message -- "${CYLC_WORKFLOW_ID}" "${CYLC_TASK_JOB}" \
"WARNING:Uh-oh, something's not right here."
“CRITICAL” severity messages are printed to job.err, logged by the scheduler, and can be passed to critical event handlers:
cylc message -- "${CYLC_WORKFLOW_ID}" "${CYLC_TASK_JOB}" \
"CRITICAL:ERROR occurred in process X!"
Jobs no longer (since Cylc 8) attempt to resend messages if the server cannot be reached. Send failures normally imply a network or Cylc configuration problem that will not recover by itself, in which case a series of messaging retries just holds up job completion unnecessarily. If a job status message does not get through, the server will recover the correct task status by polling on job timeout (or earlier if regular polling is configured).
Task messages are validated by cylc.flow.unicode_rules.TaskMessageValidator.
class cylc.flow.unicode_rules.TaskMessageValidator
The rules for valid task messages:
- cannot contain : unless it occurs at the end of the first word
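The severity prefixes described above lend themselves to a small wrapper. The following is a sketch only: send_task_message is a hypothetical helper, not part of Cylc, and the label NORMAL is an assumption used here to mean "no prefix". It relies solely on the cylc message invocation shown earlier and leaves message validation to Cylc.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Cylc): prefix a message with its severity
# and send it using the same "cylc message" call shown in the examples above.
send_task_message() {
    local severity="$1"
    shift
    local text="$*"
    case "${severity}" in
        NORMAL)
            # Assumed label for a plain message: no prefix is added.
            ;;
        CUSTOM|WARNING|CRITICAL)
            text="${severity}:${text}"
            ;;
        *)
            echo "send_task_message: unknown severity '${severity}'" >&2
            return 2
            ;;
    esac
    cylc message -- "${CYLC_WORKFLOW_ID}" "${CYLC_TASK_JOB}" "${text}"
}
```

Inside a task script this could be called as send_task_message WARNING "disk space is low". Given the validation rule above, the message text itself should avoid further colons.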
Aborting Job Scripts on Error
Job scripts use set -e to abort on any error, and trap ERR, EXIT, and SIGTERM to send task failed messages back to the scheduler before aborting. Other scripts called from job scripts should therefore abort with standard non-zero exit status on error, to trigger the job script error trap.
To prevent a command that is expected to generate a non-zero exit status from triggering the exit trap, protect it with a control statement such as:
if cmp FILE1 FILE2; then
    :  # success: do stuff
else
    :  # failure: do other stuff
fi
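When the exact exit status matters, another common bash idiom is to capture it with an || list, which also prevents set -e from aborting. The sketch below is illustrative only: the temporary files and the variable names cmp_status and result are hypothetical.

```shell
#!/usr/bin/env bash
set -e

# Two small files that differ, created only for this demonstration.
workdir=$(mktemp -d)
printf 'alpha\n' > "${workdir}/FILE1"
printf 'beta\n' > "${workdir}/FILE2"

# A command on the left of || does not trigger set -e when it fails,
# so its exit status can be recorded and examined afterwards.
cmp_status=0
cmp -s "${workdir}/FILE1" "${workdir}/FILE2" || cmp_status=$?

case "${cmp_status}" in
    0) result="identical" ;;
    1) result="different" ;;
    *) result="cmp could not compare the files" ;;
esac
echo "${result}"

rm -rf "${workdir}"
```

This distinguishes an expected "files differ" result (status 1) from a genuine problem such as a missing file (status greater than 1), which a plain if/else treats the same.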
Job scripts also use set -u to abort on referencing any undefined variable (useful for picking up typos); and set -o pipefail to abort if any part of a pipe fails (by default the shell only returns the exit status of the final command in a pipeline).
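The effect of set -e (with an ERR trap) and set -u can be seen in isolation with a short sketch. It is not taken from a Cylc job script: each case runs in a fresh bash process, and the handler text and variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Each case runs in a fresh bash process, so its abort does not end this script.

# set -e with an ERR trap: the trap runs, then the shell exits at the
# failing command, so "not-reached" is never printed.
errexit_status=0
errexit_out=$(bash -c 'set -e; trap "echo handler-ran" ERR; false; echo not-reached') \
    || errexit_status=$?

# set -u: expanding an undefined variable (here a misspelt name) is fatal.
nounset_status=0
nounset_out=$(bash -c 'set -u; GREETING=hello; echo "${GREETNG}"; echo not-reached' 2>/dev/null) \
    || nounset_status=$?

echo "set -e: status=${errexit_status} output=${errexit_out}"
echo "set -u: status=${nounset_status} output=${nounset_out:-none}"
```

In both cases the command after the failure never runs, which is the behaviour that lets the job script report failure promptly.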
Custom Failure Messages
Critical events normally warrant aborting a job script rather than just sending a message. As described just above, exit 1 or any failing command not protected by the surrounding scripting will cause a job script to abort and report failure to the scheduler, potentially triggering a failed task event handler.
For failures detected by the scripting you could send a critical message back before aborting, potentially triggering a critical task event handler:
if ! /bin/false; then
    cylc message -- "${CYLC_WORKFLOW_ID}" "${CYLC_TASK_JOB}" \
        "CRITICAL:ERROR: /bin/false failed!"
    exit 1
fi
To abort a job script with a custom message that can be passed to a failed task event handler, use the built-in cylc__job_abort shell function:
if ! /bin/false; then
    cylc__job_abort "ERROR: /bin/false failed!"
fi
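A typical use is a precondition check at the start of a task script. The sketch below assumes it runs inside a Cylc job script, where cylc__job_abort is defined; require_file is a hypothetical helper and the file path in the usage note is purely illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper: abort the job with a custom message if a required
# input file is missing or empty. cylc__job_abort is only defined inside
# a running Cylc job script.
require_file() {
    local path="$1"
    if [[ ! -s "${path}" ]]; then
        cylc__job_abort "ERROR: required input ${path} is missing or empty"
    fi
}
```

A task script might then begin with require_file "${CYLC_TASK_WORK_DIR}/input.nc", assuming that is where the input is expected.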
Avoid Detaching Processes
If a task script starts background sub-processes and does not wait on them, or internally submits jobs to a job runner and then exits immediately, the detached processes will not be visible to Cylc and the task will appear to finish when the top-level script finishes. You will need to modify scripts like this to make them execute all sub-processes in the foreground (or use the shell wait command to wait on them before exiting) and to prevent job submission commands from returning before the job completes (e.g. llsubmit -s for Loadleveler, qsub -sync yes for Sun Grid Engine, and qsub -W block=true for PBS).
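The wait approach mentioned above can be sketched as follows. The worker function is a hypothetical stand-in for a real background workload; waiting on each process ID separately means that a failure in any one of them is noticed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for a real background workload.
worker() {
    sleep 1
    echo "worker $1 done"
}

# Start the workers in the background and remember their process IDs.
pids=()
for i in 1 2 3; do
    worker "${i}" &
    pids+=("$!")
done

# Wait on each PID in turn; wait returns that process's exit status.
completed=0
failed=0
for pid in "${pids[@]}"; do
    if wait "${pid}"; then
        completed=$(( completed + 1 ))
    else
        failed=$(( failed + 1 ))
    fi
done

if (( failed > 0 )); then
    echo "${failed} background process(es) failed" >&2
    exit 1
fi
echo "all ${completed} workers finished"
```

Because the script does not exit until every background process has been waited on, the task's reported state reflects the work actually done.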
If this is not possible - perhaps you don’t have control over the script or can’t work out how to fix it - one alternative approach is to use another task to repeatedly poll for the results of the detached processes:
[scheduling]
    [[graph]]
        R1 = "model => checker => post-proc"
[runtime]
    [[model]]
        # Uh-oh, this script does an internal job submission to run model.exe:
        script = "run-model.sh"
    [[checker]]
        # Fail and retry every minute (for 10 tries at the most) if model's
        # job.done indicator file does not exist yet.
        script = "[[ ! -f $RUN_DIR/job.done ]] && exit 1"
        execution retry delays = 10 * PT1M
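If the check grows beyond a one-liner, it could be moved into a small function that also explains why it failed. This is a sketch: model_finished is a hypothetical name, and RUN_DIR is assumed to be defined for the checker task, as in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only once the indicator file exists.
model_finished() {
    local run_dir="$1"
    if [[ -f "${run_dir}/job.done" ]]; then
        return 0
    fi
    echo "job.done not found in ${run_dir}; model may still be running" >&2
    return 1
}

# In the checker task this would be called as:
#   model_finished "${RUN_DIR}" || exit 1
```

The message on standard error ends up in the checker's job.err, which makes the reason for each retry easier to see.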
Use of Pipes in Job Scripts
In bash, the return status of a pipeline is normally the exit status of the last command. This is unsafe, because if an earlier command in the pipeline fails, the failure goes unnoticed and the script continues.
For safety, a Cylc job script running in bash will have the set -o pipefail option turned on automatically. If a pipeline exists in a task’s script (or other *-script) item, the failure of any part of the pipeline will cause the command to return a non-zero code at the end, which will be reported as a job failure. Due to the nature of a pipeline, the job file will trap the failure of the individual commands, as well as the whole pipeline, and will attempt to report a failure back to the workflow twice. The second message is ignored by the workflow, so this behaviour can be safely ignored. (You should probably still investigate the failure, however!)
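The difference that pipefail makes can be checked directly. The sketch below runs each case in a fresh bash process so its exit status can be inspected; the variable names are hypothetical. It also shows one way to tolerate an expected non-zero status inside a pipeline (grep finding no match).

```shell
#!/usr/bin/env bash
# Each case runs in a fresh bash process so that its exit status can be inspected.

# Default behaviour: only the last command (cat) determines the status.
default_status=0
bash -c 'false | cat' || default_status=$?

# With pipefail: the failure of the first command is reported.
pipefail_status=0
bash -c 'set -o pipefail; false | cat' || pipefail_status=$?

# PIPESTATUS records the exit status of every stage of the last pipeline.
stage_statuses=$(bash -c 'false | true | cat; echo "${PIPESTATUS[*]}"')

# grep exits with status 1 when nothing matches; "|| true" inside a group
# accepts that outcome without disabling pipefail for the other stages.
match_count=$(bash -c 'set -eo pipefail; printf "a\nb\n" | { grep -c z || true; }')

echo "default=${default_status} pipefail=${pipefail_status}"
echo "stages=${stage_statuses} matches=${match_count}"
```

Note that the grouped form also hides genuine grep errors (status greater than 1), so it is best reserved for stages where a non-zero status really is routine.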