Using Data To Define Your Workflow

Get a copy of this example

$ cylc get-resources examples/external-data-files

We often want to read in a dataset for use in defining our workflow.

The Cylc tutorial is an example of this: we want one get_observations task for each weather station in a list. Each weather station has a name (e.g. “heathrow”) and an ID (e.g. 3772).

[runtime]
    [[get_observations_heathrow]]
        script = get-observations
        [[[environment]]]
            SITE_ID = 3772
    [[get_observations_camborne]]
        script = get-observations
        [[[environment]]]
            SITE_ID = 3808
    [[get_observations_shetland]]
        script = get-observations
        [[[environment]]]
            SITE_ID = 3005
    [[get_observations_aldergrove]]
        script = get-observations
        [[[environment]]]
            SITE_ID = 3917

It can be inconvenient to write out the name and ID of each station in your workflow like this. However, you may already have this information in a more convenient format (i.e. a data file of some form).

With Cylc, we can use Jinja2 to read in a data file and then use that data to define the workflow.

The Approach

This example has three components:

A JSON file containing a list of weather stations along with all the data associated with them.

stations.json

[
    {
        "name": "camborne",
        "wmo": "03808",
        "alt": 87,
        "lat": 50.21841,
        "lon": -5.32753
    },
    {
        "name": "heathrow",
        "wmo": "03772",
        "alt": 25,
        "lat": 51.47922,
        "lon": -0.45061
    },
    {
        "name": "lerwick",
        "wmo": "03005",
        "alt": 82,
        "lat": 60.13893,
        "lon": -1.18491
    },
    {
        "name": "aldergrove",
        "wmo": "03917",
        "alt": 63,
        "lat": 54.66365,
        "lon": -6.22534
    },
    {
        "name": "exeter",
        "wmo": "03844",
        "alt": 27,
        "lat": 50.73717,
        "lon": -3.40579
    },
    {
        "name": "middle_wallop",
        "wmo": "03749",
        "alt": 90,
        "lat": 51.14987,
        "lon": -1.56994
    }
]

A Python function that reads the JSON file.

lib/python/load_data.py
```
import json


def load_json(filename):
    with open(filename, 'r') as json_file:
        return json.load(json_file)
```
We put this Python code in the workflow’s lib/python directory, which allows us to import it from within our workflow.

A flow.cylc file that uses the Python function to load the data file.

We can import Python functions with Jinja2 using the following syntax:
```
{% from "load_data" import load_json %}
```
For more information, see Importing Python modules.

The Workflow

The three files are arranged like so:

File Structure

|-- flow.cylc
|-- lib
|   `-- python
|       `-- load_data.py
`-- stations.json

The flow.cylc file:

Imports the Python function.
Uses it to load the data.
Then uses the data to define the workflow.

flow.cylc

#!Jinja2

[meta]
    title = Weather Station Workflow
    description = """
        This workflow demonstrates how to read in a data file for use in
        defining your workflow.

        We have a file called "stations.json" which contains a list of weather
        stations with some data for each. This workflow reads the
        "stations.json" file and creates a family for each weather station
        with an environment variable for each data field.

        You can load data in other formats too. Try changing "load_json" to
        "load_csv" and "stations.json" to "stations.csv" for a CSV example.
    """


{# Import a Python function to load our data. #}
{% from "load_data" import load_json %}

{# Load data from the specified file. #}
{% set stations = load_json('stations.json') %}

{# Extract a list of station names from the data file. #}
{% set station_names = stations | map(attribute="name") | list %}


{# Provide Cylc with a list of weather stations. #}
[task parameters]
    station = {{ station_names | join(', ') }}


[scheduling]
    initial cycle point = 2000-01-01
    final cycle point = 2000-01-02
    [[graph]]
        P1D = fetch<station> => process<station> => collate


[runtime]
{# Define a family for each weather station #}
{% for station in stations %}
    [[STATION<station={{ station["name"] }}>]]
        [[[environment]]]
            {# Turn the <station> parameter into an environment variable #}
            {# NB: Just to show how, we could also have used `station["name"]`. #}
            name = %(station)s
            {# Turn the data for this station into environment variables. #}
            wmo = {{ station["wmo"] }}
            alt = {{ station["alt"] }}
            lat = {{ station["lat"] }}
            lon = {{ station["lon"] }}
{% endfor %}

    # a task that gets data
    [[fetch<station>]]
        inherit = STATION<station>
        script = echo "fetch data for $name, WMO ID: $wmo"

    [[process<station>]]
        inherit = STATION<station>
        script = echo "process data for $name, location: $lat,$lon"

    [[collate]]
        script = "echo collate data for stations: {{ station_names }}"
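
To make the Jinja2 steps in this file more concrete, here is a rough Python equivalent of what the template computes before Cylc parses the rendered result. This is only an illustrative sketch: it inlines two of the station records rather than reading stations.json, and the names STATIONS_JSON and parameter_line are made up for this illustration and are not part of Cylc.

```python
import json

# Two records copied from stations.json, inlined so the sketch is self-contained.
STATIONS_JSON = """
[
    {"name": "camborne", "wmo": "03808", "alt": 87, "lat": 50.21841, "lon": -5.32753},
    {"name": "heathrow", "wmo": "03772", "alt": 25, "lat": 51.47922, "lon": -0.45061}
]
"""

# Roughly equivalent to: {% set stations = load_json('stations.json') %}
stations = json.loads(STATIONS_JSON)

# Roughly equivalent to: stations | map(attribute="name") | list
station_names = [station["name"] for station in stations]

# Roughly equivalent to: station = {{ station_names | join(', ') }}
parameter_line = "station = " + ", ".join(station_names)
print(parameter_line)  # station = camborne, heathrow
```

With the full stations.json, the same steps would produce one parameter value per station in the file.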

Data Types

We can load other types of data file too. This example also includes the same data in CSV format along with a Python function to load CSV data. To try it out, open the flow.cylc file and replace stations.json with stations.csv and load_json with load_csv.
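
For reference, a CSV loader along these lines could look something like the sketch below. It assumes that stations.csv starts with a header row whose column names match the JSON keys (name, wmo, alt, lat, lon). The load_csv function shipped with the example may be implemented differently.

```python
import csv


def load_csv(filename):
    """Return the rows of a CSV file as a list of dicts keyed by the header row."""
    # newline='' is what the csv module recommends when opening files.
    with open(filename, 'r', newline='') as csv_file:
        return list(csv.DictReader(csv_file))
```

Unlike the JSON loader, csv.DictReader returns every value as a string (e.g. "87" rather than 87). That makes no difference here, because the values are only rendered into environment variables.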

Any Python code that you import using Jinja2 will be executed using the Python environment that Cylc is running in. So if you want to import Python code that isn’t in the standard library, you may need to get your system administrator to install this dependency into the Cylc environment for you.
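
One way to check whether a module is available is to look it up with the standard library's importlib. The sketch below uses a hypothetical helper, is_available, which is not part of Cylc. To get a meaningful answer, run it with the same Python interpreter that Cylc uses.

```python
import importlib.util


def is_available(module_name):
    """Return True if a top-level module can be found in this Python environment."""
    return importlib.util.find_spec(module_name) is not None


# "json" is in the standard library, so it should always be found.
print(is_available("json"))
```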