.. _Basic Principles:

Basic Principles
================

This section covers general principles that should be kept in mind when
writing any workflow. More advanced topics are covered later:
:ref:`Efficiency And Maintainability` and :ref:`Portable Workflows Label`.


UTC Mode
--------

Cylc has full timezone support if needed, but real time NWP workflows should use
UTC mode to avoid problems at the transition between local standard time and
daylight saving time, and to enable the same workflow to run the same way in
different timezones.

.. code-block:: cylc

   [scheduler]
       UTC mode = True


Fine Or Coarse-Grained Workflows
--------------------------------

Workflows can have many small simple tasks, fewer large complex tasks, or anything
in between. A task that runs many distinct processes can be split into many
distinct tasks. The fine-grained approach is more transparent, and it allows
more task-level concurrency and quicker failure recovery - you can rerun just
what failed without repeating anything unnecessarily.


rose bunch
^^^^^^^^^^

One caveat to our fine-graining advice is that submitting a large number of
small tasks at once may be a problem on some platforms. If you have many
similar concurrent jobs you can use ``rose bunch`` to pack them into a
single task with incremental rerun capability: retriggering the task will rerun
just the component jobs that did not successfully complete earlier.
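
As an illustrative sketch (the app name, command, and arguments are invented;
see the Rose documentation for the full set of ``rose_bunch`` options), a bunch
app packs one command template and a list of argument values into a single task:

.. code-block:: ini

   # app/process-members/rose-app.conf
   mode=rose_bunch

   [bunch]
   # Command template, interpolated with each value from [bunch-args]:
   command-format=process-member.exe --member=%(member)s
   # Run up to 4 of the commands concurrently:
   pool-size=4

   [bunch-args]
   member=001 002 003 004 005 006 007 008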


.. _Monolithic Or Interdependent Workflows:

Monolithic Or Interdependent Workflows
--------------------------------------

When writing workflows from scratch you may need to decide between putting
multiple loosely connected sub-workflows into a single large workflow, or
constructing a more modular system of smaller workflows that depend on each other
through :ref:`inter-workflow triggering <Built-in Workflow State Triggers>`.
Each approach has its pros and cons, depending on your requirements and
preferences with respect to the complexity and manageability of the resulting
system.


.. _Self-Contained Workflows:

Self-Contained Workflows
------------------------

All files generated by Cylc during a workflow run are confined to the workflow
:term:`run directory` ``$HOME/cylc-run/<workflow-id>``. However, Cylc has no
control over the locations of the programs, scripts, and files that are
executed, read, or generated by your tasks at runtime. It is up to you to
ensure that all of this is confined to the run directory too, as far as
possible.

Self-contained workflows are more robust, easier to work with, and more portable.
Multiple instances of the same workflow (with different workflow names) should be
able to run concurrently under the same user account without mutual
interference.


Avoiding External Files
^^^^^^^^^^^^^^^^^^^^^^^

Workflows that use external scripts, executables, and files beyond the essential
system libraries and utilities are vulnerable to external changes: someone
else might interfere with these files without telling you.

In some cases you may need to symlink to large external files anyway, if space
or copy speed is a problem; otherwise, workflows that keep private copies of all
the files they need are more robust.
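
For example, a first-cycle install task (a sketch; the paths and task names are
illustrative) can take a private copy of external input files so that the rest
of the workflow reads only from the run directory:

.. code-block:: cylc

   [scheduling]
       [[graph]]
           R1 = "install_ancils => forecast"
   [runtime]
       [[install_ancils]]
           script = """
               # Copy (or symlink, if space is a problem) external files
               # into the workflow share directory at start-up.
               mkdir -p "${CYLC_WORKFLOW_SHARE_DIR}/ancils"
               cp /path/to/external/ancils/* "${CYLC_WORKFLOW_SHARE_DIR}/ancils/"
           """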


Confining Output To The Run Directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Output files should be confined to the run directory tree. Then all
output is easy to find, multiple instances of the same workflow can run
concurrently without interference, and other users should be able to copy and
run your workflow with few modifications. Cylc provides a ``share``
directory for generated files that are used by several tasks in a workflow
(see :ref:`Shared Task IO Paths`). Archiving tasks can use ``rose arch``
to copy or move selected files to external locations as needed (see
:ref:`Workflow Housekeeping`).


Task Host Selection
-------------------

The ``rose host-select`` command is now deprecated. Workflows should migrate
to using :term:`platforms <platform>` which provide a more efficient
solution.
See :ref:`MajorChangesPlatforms` for details.
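
For illustration (the platform and host names are invented), a platform is
defined once in the site or user ``global.cylc`` and then selected by name in
task runtime configuration:

.. code-block:: cylc

   # global.cylc
   [platforms]
       [[my_hpc]]
           hosts = loginnode1, loginnode2
           job runner = slurm

.. code-block:: cylc

   # flow.cylc
   [runtime]
       [[big_task]]
           platform = my_hpc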


Task Scripting
--------------

Non-trivial task scripting should be held in separate script files rather than
inlined in :cylc:conf:`flow.cylc`. This keeps the workflow definition tidy, and it
allows proper shell-mode text editing and independent testing of task scripts.

For automatic access by jobs, task-specific scripts should be kept in
Rose app bin directories, and shared scripts kept in (or installed to) the
workflow bin directory.
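
A typical layout (the names are illustrative) looks like this; Cylc adds the
workflow ``bin`` directory to ``PATH`` in task jobs, and ``rose task-run`` does
the same for the app ``bin`` directory:

.. code-block:: none

   my-workflow/
   ├── flow.cylc
   ├── bin/                  # shared scripts, on PATH for all task jobs
   │   └── common-utility
   └── app/
       └── forecast/
           ├── rose-app.conf
           └── bin/          # app-specific scripts, on PATH via rose task-run
               └── run-forecast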


Coding Standards
^^^^^^^^^^^^^^^^

When writing your own task scripts, make consistent use of appropriate coding
standards such as:

- `PEP8 for Python <https://www.python.org/dev/peps/pep-0008/>`_
- `Google Shell Style Guide for
  Bash <https://google.github.io/styleguide/shell.xml>`_


Basic Functionality
^^^^^^^^^^^^^^^^^^^

In consideration of future users who may not be experts in the internals of your
workflow and its tasks, all task scripts should:

- Print clear usage information if invoked incorrectly (and via the
  standard options ``-h, --help``).
- Print useful diagnostic messages in case of error. For example, if a
  file was not found, the error message should contain the full path to the
  expected location.
- Always return correct shell exit status - zero for success, non-zero
  for failure. This is used by Cylc job wrapper code to detect success and
  failure and report it back to the :term:`scheduler`.
- In shell scripts use ``set -u`` to abort on any reference to
  an undefined variable. If you really need an undefined variable to evaluate
  to an empty string, make it explicit: ``FOO=${FOO:-}``.
- In shell scripts use ``set -e`` to abort on any error without
  having to failure-check each command explicitly.
- In shell scripts use ``set -o pipefail`` to abort on any error
  within a pipeline. Note that all commands in the pipeline will still
  run; the pipeline will just exit with the rightmost non-zero exit status.

.. note::

   Examples and more details `are available <https://vaneyckt.io/posts/safer_bash_scripts_with_set_euxo_pipefail/>`_
   for the above three ``set`` commands.
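
As a minimal sketch (the script and executable names are invented), a task
script that follows these conventions might look like:

.. code-block:: bash

   #!/usr/bin/env bash
   # Abort on error, on undefined variables, and on failures inside pipelines.
   set -euo pipefail

   usage() {
       echo "Usage: process-obs -i INPUT_FILE" >&2
   }

   if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
       usage
       exit 0
   fi

   if [[ "${1:-}" != "-i" || -z "${2:-}" ]]; then
       usage
       exit 1  # non-zero exit status on incorrect invocation
   fi
   INPUT_FILE="$2"

   if [[ ! -f "$INPUT_FILE" ]]; then
       # Diagnostic message includes the full expected path.
       echo "ERROR: input file not found: $INPUT_FILE" >&2
       exit 1
   fi

   process-obs.exe "$INPUT_FILE"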


Rose Apps
---------

Rose apps allow non-shared task configuration - anything not relevant to
workflow automation - to be moved from the workflow definition into app config
files. This makes workflows tidier and easier to understand, and it allows
``rose edit`` to provide a unified metadata-enhanced view of the workflow
and its apps (see :ref:`Rose Metadata Compliance`).

Rose apps are a clear winner for tasks with complex configuration requirements.
The benefit is smaller for tasks with little configuration, but for consistency,
and to take full advantage of ``rose edit``, it makes sense to use Rose apps
for most tasks.

When most tasks are Rose apps, set the app-run command as a root-level default,
and override it for the occasional task that is not a Rose app:

.. code-block:: cylc

   [runtime]
       [[root]]
           script = rose task-run -v
       [[rose-app1]]
           #...
       [[rose-app2]]
           #...
       [[hello-world]]  # Not a Rose app.
           script = echo "Hello World"


.. _Rose Metadata Compliance:

Rose Metadata Compliance
------------------------

Rose metadata drives page layout and sort order in ``rose edit``, plus
help information, input validity checking, macros for advanced checking and app
version upgrades, and more.

To ensure the workflow and its constituent applications are run as intended,
they should be valid against any provided metadata: launch the
``rose edit`` GUI or run ``rose macro --validate`` on the
command line to highlight any errors, and correct them prior to use. If errors
are flagged incorrectly, you should endeavour to fix the metadata.

When writing a new workflow or application, consider creating metadata to
facilitate ease of use by others.
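
As a sketch (the variable name and values are invented), metadata for an app
input might look like this in the app's ``meta/rose-meta.conf``:

.. code-block:: ini

   [env=NUM_MEMBERS]
   compulsory=true
   type=integer
   range=1:100
   help=Number of ensemble members to run.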


Task Independence
-----------------

Essential dependencies must be encoded in the workflow graph, but
tasks should not rely unnecessarily on the action of other tasks.
For example, tasks should create their own output directories if they don't
already exist, even if they would normally be created by an earlier task
in the workflow. This makes it easier to run tasks alone during
development and testing.
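
For example (the task and directory names are illustrative):

.. code-block:: cylc

   [runtime]
       [[make-products]]
           script = """
               # Don't rely on an upstream task having created the directory.
               mkdir -p "${CYLC_WORKFLOW_SHARE_DIR}/products"
               make-products.exe -o "${CYLC_WORKFLOW_SHARE_DIR}/products"
           """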


.. _Clock-Triggered Tasks:

Clock-Triggered Tasks
---------------------

Tasks that wait on real time data should use
:ref:`clock triggers <Built-in Clock Triggers>`
to delay job submission until the expected data arrival time:

.. code-block:: cylc

   [scheduling]
       initial cycle point = now
       [[xtriggers]]
           # Trigger 5 min after wallclock time is equal to cycle point.
           clock = wall_clock(offset=PT5M)
       [[graph]]
           T00 = @clock => get-data => process-data

.. cylc-scope:: flow.cylc[runtime][<namespace>]

Clock-triggered tasks typically have to handle late data arrival. Task
:cylc:conf:`execution retry delays` can be used to simply retrigger
the task at intervals until the data is found, but frequently retrying small
tasks is inefficient, and multiple task
failures will be logged for what is essentially a normal condition (at least
it is normal until the data is really late).
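
The retry-based approach looks like this (the script name is illustrative):

.. code-block:: cylc

   [runtime]
       [[get-data]]
           script = get-data.sh
           # Retry every 10 minutes, up to 12 times, if the data is not there yet:
           execution retry delays = 12*PT10M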

.. cylc-scope::

Rather than using task execution retry delays to repeatedly trigger a task that
checks for a file, it may be better to have the task itself repeatedly poll for
the data (see :ref:`Custom Trigger Functions`).


.. _Rose App File Polling:

Rose App File Polling
---------------------

Rose apps have built-in polling functionality to check repeatedly for the
existence of files before executing the main app. See the ``[poll]``
section in Rose app config documentation. This is a good way to implement
check-and-wait functionality in clock-triggered tasks
(:ref:`Clock-Triggered Tasks`), for example.
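
A heavily simplified sketch (the file names are invented; check the Rose
documentation for the exact ``[poll]`` syntax supported by your Rose version):

.. code-block:: ini

   # rose-app.conf
   [poll]
   delays=10*PT30S
   all-files=$ROSE_DATAC/obs/obstore_001 $ROSE_DATAC/obs/obstore_002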

Note that frequent polling can put undue load on some filesystems, so be sure
to configure a reasonable interval between polls.


Task Execution Time Limits
--------------------------

Instead of setting job wallclock limits directly in :term:`job runner`
directives, use
:cylc:conf:`flow.cylc[runtime][<namespace>]execution time limit`.
Cylc automatically derives the correct job runner directives from this,
and it is also used to run ``background`` and ``at`` jobs via
the ``timeout`` command, and to poll tasks that have not reported as
finished by the configured time limit.
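
For example:

.. code-block:: cylc

   [runtime]
       [[model]]
           # Cylc translates this into the appropriate wallclock directive
           # for the task's job runner (e.g. PBS, Slurm).
           execution time limit = PT30M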


.. _Restricting Workflow Activity:

Restricting Workflow Activity
-----------------------------

Large workflows can overwhelm a job host by submitting too
many jobs at once:

- Large workflows that are not sufficiently limited by real time clock
  triggering or intercycle dependence may generate a lot of *runahead*
  (this refers to Cylc's ability to run multiple cycles at once, restricted
  only by the dependencies of individual tasks).
- Some workflows may have large families of tasks whose members all
  become ready at the same time.

These problems can be avoided with *runahead limiting* and *internal
queues*, respectively.


.. _Runahead Limiting:

Runahead Limiting
^^^^^^^^^^^^^^^^^

By default Cylc allows a maximum of five cycle points to be active at the same
time, but this value is configurable:

.. code-block:: cylc

   [scheduling]
       initial cycle point = 2020-01-01T00
       # Don't allow any cycle interleaving:
       runahead limit = P0


Internal Queues
^^^^^^^^^^^^^^^

Tasks can be assigned to named internal queues that limit the number of members
that can be active (i.e. submitted or running) at the same time:

.. code-block:: cylc

   [scheduling]
       initial cycle point = 2020-01-01T00
       [[queues]]
           # Allow only 2 members of BIG_JOBS to run at once:
           [[[big_jobs_queue]]]
               limit = 2
               members = BIG_JOBS
       [[graph]]
           T00 = pre => BIG_JOBS
   [runtime]
       [[BIG_JOBS]]
       [[foo, bar, baz, ...]]
           inherit = BIG_JOBS


.. _Workflow Housekeeping:

Workflow Housekeeping
---------------------

Ongoing cycling workflows can generate an enormous number of output files and logs
so regular housekeeping is very important. Special housekeeping tasks,
typically the last tasks in each cycle, should be included to archive selected
important files and then delete everything at some offset from the current
cycle point.

The Rose built-in apps ``rose_arch`` and ``rose_prune``
provide an easy way to do this: they can be configured with
file-matching patterns and cycle point offsets to perform various housekeeping
operations on matched files.
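
A typical arrangement (the task names are illustrative) puts the archiving and
pruning tasks at the end of each cycle:

.. code-block:: cylc

   [scheduling]
       [[graph]]
           P1D = "model => products => archive => housekeep"
   [runtime]
       [[archive]]     # a rose_arch app: copy selected files to the archive
           script = rose task-run
       [[housekeep]]   # a rose_prune app: prune old cycle directories and logs
           script = rose task-run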


Complex Jinja2 Code
-------------------

The Jinja2 template processor provides general programming constructs,
extensible with custom Python filters, that can be used to *generate* the
workflow definition. This makes it possible to write flexible multi-use
workflows with structure and content that varies according to various input
switches. There is a cost to this flexibility however: excessive use of Jinja2
can make a workflow hard to understand and maintain. It is difficult to say
exactly where to draw the line, but we recommend erring on the side of
simplicity and clarity: write workflows that are easy to understand and therefore
easy to modify for other purposes, rather than extremely complicated workflows
that attempt to do everything out of the box but are hard to maintain and modify.

Note that use of Jinja2 loops for generating tasks is now deprecated in favour
of built-in parameterized tasks - see :ref:`User Guide Param`.
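
For example, an ensemble of tasks can be generated without any Jinja2 loops:

.. code-block:: cylc

   [task parameters]
       member = 1..5
   [scheduling]
       [[graph]]
           R1 = "prep => model<member> => collate"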


Shared Configuration
--------------------

Configuration that is common to multiple tasks should be defined in one
place and used by all, rather than duplicated in each task. Duplication is
a maintenance risk because changes have to be made consistently in several
places at once.


Jinja2 Variables
^^^^^^^^^^^^^^^^

In simple cases you can share by passing a Jinja2 variable to all the tasks
that need it:

.. code-block:: cylc

   {% set JOB_VERSION = 'A23' %}
   [runtime]
       [[foo]]
           script = run-foo --version={{JOB_VERSION}}
       [[bar]]
           script = run-bar --version={{JOB_VERSION}}


Inheritance
^^^^^^^^^^^

Sharing by inheritance of task families is recommended when more than a few
configuration items are involved.

The simplest application of inheritance is to set global defaults in the
``[runtime][root]`` namespace, which is inherited by all tasks.
However, this should only be done for settings that really are used
by the vast majority of tasks. Over-sharing via root, particularly of
environment variables, is a maintenance risk because it can be very
difficult to be sure which tasks are using which global variables.

Any :cylc:conf:`[runtime]` settings can be shared - scripting, platform
configuration, environment variables, and so on - from
single items up to complete task or app configurations. At the latter extreme,
it is quite common to have several tasks that inherit the same complete
job configuration followed by minor task-specific additions:

.. code-block:: cylc

   [runtime]
       [[FILE-CONVERT]]
           script = convert-netcdf
           #...
       [[convert-a]]
           inherit = FILE-CONVERT
           [[[environment]]]
                 FILE_IN = file-a
       [[convert-b]]
           inherit = FILE-CONVERT
           [[[environment]]]
                 FILE_IN = file-b

Inheritance is covered in more detail from an efficiency perspective in
:ref:`The Task Family Hierarchy`.


.. _Shared Task IO Paths:

Shared Task IO Paths
^^^^^^^^^^^^^^^^^^^^

If one task uses files generated by another task (and both see the same
filesystem) a common IO path should normally be passed to both tasks via a
shared environment variable. As far as Cylc is concerned this is no different
to other shared configuration items, but there are some additional aspects
of usage worth addressing here.

Primarily, for self-containment (see :ref:`Self-Contained Workflows`) shared IO
paths should be under the *workflow share directory*, the location of which is
passed to all tasks as ``$CYLC_WORKFLOW_SHARE_DIR``.

The ``rose task-env`` utility can provide additional environment
variables that refer to static and cycle-point-specific locations under the
workflow share directory.

.. code-block:: cylc

   [runtime]
       [[my-task]]
            env-script = eval $(rose task-env -T P1D -T P2D)

For a current cycle point of ``20170105`` this will make the following
variables available to tasks:

.. code-block:: bash

   ROSE_DATA=$CYLC_WORKFLOW_SHARE_DIR/data
   ROSE_DATAC=$CYLC_WORKFLOW_SHARE_DIR/cycle/20170105
   ROSE_DATACP1D=$CYLC_WORKFLOW_SHARE_DIR/cycle/20170104
   ROSE_DATACP2D=$CYLC_WORKFLOW_SHARE_DIR/cycle/20170103

Subdirectories of ``$ROSE_DATAC`` etc. should be agreed between
different sub-systems of the workflow; typically they are named for the
file-generating tasks, and the file-consuming tasks should know to look there.

The share-not-duplicate rule can be relaxed for shared files whose names are
agreed by convention, so long as their locations under the share directory are
proper shared workflow variables. For instance the Unified Model uses a large
number of files whose conventional names (``glu_snow``, for example)
can reasonably be expected not to change, so they are typically hardwired into
app configurations (as ``$ROSE_DATA/glu_snow``, for example) to avoid
cluttering the workflow definition.

Here two tasks share a workspace under the workflow share directory
by inheritance:

.. code-block:: cylc

   # Sharing an I/O location via inheritance.
   [scheduling]
       [[graph]]
           R1 = write_data => read_data
   [runtime]
       [[root]]
            env-script = eval $(rose task-env)
       [[WORKSPACE]]
           [[[environment]]]
               DATA_DIR = ${ROSE_DATA}/png
       [[write_data]]
           inherit = WORKSPACE
           script = """
               mkdir -p $DATA_DIR
               write-data.exe -o ${DATA_DIR}
           """
       [[read_data]]
           inherit = WORKSPACE
           script = read-data.exe -i ${DATA_DIR}

In simple cases, where an appropriate family does not already exist, paths can
be shared via Jinja2 variables:

.. code-block:: cylc

   # Sharing an I/O location with Jinja2.
   {% set DATA_DIR = '$ROSE_DATA/stuff' %}
   [scheduling]
       [[graph]]
           R1 = write_data => read_data
   [runtime]
       [[write_data]]
           script = """
               mkdir -p {{DATA_DIR}}
               write-data.exe -o {{DATA_DIR}}
           """
       [[read_data]]
           script = read-data.exe -i {{DATA_DIR}}

For completeness we note that it is also possible to configure multiple tasks
to use the same work directory so they can all share files in ``$PWD``.
(Cylc executes tasks in special work directories that by default are unique
to each task). This may simplify the workflow slightly, and it may be useful if
you are unfortunate enough to have executables that are designed for IO in
``$PWD``, *but it is not recommended*. There is a higher risk
of interference between tasks; it will break ``rose task-run``
incremental file creation mode; and ``rose task-run --new`` will in
effect delete the work directories of tasks other than its intended target.

.. code-block:: cylc

   # Shared work directory: tasks can read and write in $PWD - use with caution!
   [scheduling]
       initial cycle point = 2018
       [[graph]]
           P1Y = write_data => read_data
   [runtime]
       [[WORKSPACE]]
           work sub-directory = $CYLC_TASK_CYCLE_POINT/datadir
       [[write_data]]
           inherit = WORKSPACE
           script = write-data.exe
       [[read_data]]
           inherit = WORKSPACE
           script = read-data.exe


Varying Behaviour By Cycle Point
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To make a cycling job behave differently at different cycle points you
*could* use a single task with scripting that reacts to the cycle point it finds
itself running at, but it is better to use different tasks (in different
cycling sections) that inherit the same base job configuration. This results
in a more transparent workflow that can be understood just by inspecting the
graph:

.. code-block:: cylc

   # Run the same job differently at different cycle points.
   [scheduling]
       initial cycle point = 2020-01-01T00
       [[graph]]
           T00 = pre => long_fc => post
           T12 = pre => short_fc => post
   [runtime]
       [[MODEL]]
           script = run-model.sh
       [[long_fc]]
           inherit = MODEL
           execution time limit = PT30M
           [[[environment]]]
               RUN_LEN = PT48H
       [[short_fc]]
           inherit = MODEL
           execution time limit = PT10M
           [[[environment]]]
               RUN_LEN = PT12H

The few differences between ``short_fc`` and ``long_fc``,
including :term:`job runner` resource requests, can be configured after common
settings are inherited.
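
For example, a job runner resource request could be added to one of the tasks
(the directive shown is an illustrative Slurm setting):

.. code-block:: cylc

   [runtime]
       [[long_fc]]
           # In addition to the settings shown above:
           [[[directives]]]
               --mem = 20G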

At Start-Up
^^^^^^^^^^^

Similarly, if a cycling job needs special behaviour at the initial (or any
other) cycle point, use a different logical task in an ``R1`` graph and
have it inherit the same job configuration as the general cycling task, rather
than a single task with scripting that behaves differently when it finds itself
running at the initial cycle point.
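
A minimal sketch of the recommended approach (the task names and environment
variable are illustrative):

.. code-block:: cylc

   [runtime]
       [[MODEL]]
           script = run-model.sh
       [[model]]       # the general cycling task
           inherit = MODEL
       [[model_cold]]  # initial-cycle-point variant, run from an R1 graph
           inherit = MODEL
           [[[environment]]]
               COLD_START = true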


Automating Failure Recovery
---------------------------


Job Submission Retries
^^^^^^^^^^^^^^^^^^^^^^

When submitting jobs to a remote host, use job submission retries to
automatically resubmit tasks in the event of network outages.

Note that this is distinct from job retries for
job execution failure (just below).
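
For example (the platform name is illustrative):

.. code-block:: cylc

   [runtime]
       [[get-data]]
           platform = my_hpc
           # Resubmit after 1 minute, then 5 minutes, then 30 minutes:
           submission retry delays = PT1M, PT5M, PT30M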

Job Execution Retries
^^^^^^^^^^^^^^^^^^^^^

Automatic retry on job execution failure is useful if you have good reason to
believe that a simple retry will usually succeed. This may be the case if the
job host is known to be flaky, or if the job only ever fails for one known
reason that can be fixed on a retry. For example, if a model fails occasionally
with a numerical instability that can be remedied with a short timestep rerun,
then an automatic retry may be appropriate.

.. code-block:: cylc

   [runtime]
       [[model]]
           script = """
                # Use -gt for a numeric comparison ('>' compares strings).
                if [[ $CYLC_TASK_TRY_NUMBER -gt 1 ]]; then
                    export SHORT_TIMESTEP=true
                else
                    export SHORT_TIMESTEP=false
                fi
               model.exe
           """
           execution retry delays = 1*PT0M


Failure Recovery Workflows
^^^^^^^^^^^^^^^^^^^^^^^^^^

For recovery from failures that require explicit diagnosis you can configure
alternate graph branches. In the following example, if the model fails, a
diagnosis task will trigger; if it determines that the cause of the failure is a
known numerical instability (e.g. by parsing the model job logs) it will succeed,
triggering a short-timestep run. Postprocessing can proceed from either the
original or the short-timestep model run.

.. Need to use a 'container' directive to get centered image with
   left-aligned caption (as required for code block text).

.. _fig-failure-rec:

.. container:: twocol

   .. container:: image

      .. figure:: ../img/failure-recovery.png
         :align: center

   .. container:: caption

      .. code-block:: cylc

         [scheduling]
             [[graph]]
                 R1 = """
                      model? | model_short => postproc
                      model:fail? => diagnose => model_short
                 """


Include Files
-------------

Include-files should not be overused, but they can sometimes be useful
(e.g. see :ref:`Portable Workflows Label`):

.. code-block:: cylc

   #...
   {% include 'inc/foo.cylc' %}

(Technically, this renders the named file as a Jinja2 template and inserts the
result.) Cylc also has a native include mechanism that pre-dates Jinja2 support
and literally inlines the include-file:

.. code-block:: cylc

   #...
   %include 'inc/foo.cylc'

The two methods normally produce the same result, but use the Jinja2 version if
you need to construct an include-file name from a variable (because Cylc
include-files get inlined before Jinja2 processing is done):

.. code-block:: cylc

   #...
   {% include 'inc/' ~ SITE ~ '.cylc' %}