Using Data To Define Your Workflow
Get a copy of this example
$ cylc get-resources examples/external-data-files
We often want to read in a dataset for use in defining our workflow.
The Cylc tutorial is an example of this, where we want one get_observations task for each of a list of weather stations.
Each weather station has a name (e.g. “heathrow”) and an ID (e.g. 3772).
[runtime]
    [[get_observations_heathrow]]
        script = get-observations
        [[[environment]]]
            SITE_ID = 3772
    [[get_observations_camborne]]
        script = get-observations
        [[[environment]]]
            SITE_ID = 3808
    [[get_observations_shetland]]
        script = get-observations
        [[[environment]]]
            SITE_ID = 3005
    [[get_observations_aldergrove]]
        script = get-observations
        [[[environment]]]
            SITE_ID = 3917
It can be inconvenient to write out the name and ID of each station in your workflow like this. However, you may already have this information in a more convenient format (i.e. a data file of some form).
With Cylc, we can use Jinja2 to read in a data file and use that data to define the workflow.
The Approach
This example has three components:
A JSON file containing a list of weather stations along with all the data associated with them.
[
    {
        "name": "camborne",
        "wmo": "03808",
        "alt": 87,
        "lat": 50.21841,
        "lon": -5.32753
    },
    {
        "name": "heathrow",
        "wmo": "03772",
        "alt": 25,
        "lat": 51.47922,
        "lon": -0.45061
    },
    {
        "name": "lerwick",
        "wmo": "03005",
        "alt": 82,
        "lat": 60.13893,
        "lon": -1.18491
    },
    {
        "name": "aldergrove",
        "wmo": "03917",
        "alt": 63,
        "lat": 54.66365,
        "lon": -6.22534
    },
    {
        "name": "exeter",
        "wmo": "03844",
        "alt": 27,
        "lat": 50.73717,
        "lon": -3.40579
    },
    {
        "name": "middle_wallop",
        "wmo": "03749",
        "alt": 90,
        "lat": 51.14987,
        "lon": -1.56994
    }
]
A Python function that reads the JSON file.
import json


def load_json(filename):
    with open(filename, 'r') as json_file:
        return json.load(json_file)
We put this Python code in the workflow’s lib/python directory, which allows us to import it from within our workflow.
A flow.cylc file that uses the Python function to load the data file.
We can import Python functions with Jinja2 using the following syntax:
{% from "load_data" import load_json %}
For more information, see Importing Python modules.
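To see what the loaded data looks like, the function can be tried outside of a workflow. This is a minimal sketch: it copies two records from the JSON file above into a temporary file (the temporary path is only there so the snippet runs anywhere) and loads them back with load_json.

```python
import json
import tempfile
from pathlib import Path


def load_json(filename):
    """Return the parsed contents of a JSON file."""
    with open(filename, 'r') as json_file:
        return json.load(json_file)


# Two records copied from stations.json. They are written to a temporary
# file so that this sketch runs outside of a workflow directory.
sample = [
    {"name": "camborne", "wmo": "03808", "alt": 87,
     "lat": 50.21841, "lon": -5.32753},
    {"name": "heathrow", "wmo": "03772", "alt": 25,
     "lat": 51.47922, "lon": -0.45061},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "stations.json"
    path.write_text(json.dumps(sample))
    stations = load_json(path)

# The result is a list of dicts, one per weather station.
for station in stations:
    print(station["name"], station["wmo"], station["alt"])
```

The result is a plain list of dictionaries, which is the structure the Jinja2 code in the workflow iterates over. Note that the "wmo" values are strings in the data file, so their leading zero is preserved.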
The Workflow
The three files are arranged like so:
|-- flow.cylc
|-- lib
|   `-- python
|       `-- load_data.py
`-- stations.json
The flow.cylc file:
Imports the Python function.
Uses it to load the data.
Then uses the data to define the workflow.
#!Jinja2

[meta]
    title = Weather Station Workflow
    description = """
        This workflow demonstrates how to read in a data file for use in
        defining your workflow.

        We have a file called "stations.json" which contains a list of weather
        stations with some data for each. This workflow reads the
        "stations.json" file and creates a family for each weather station
        with an environment variable for each data field.

        You can load data in other formats too. Try changing "load_json" to
        "load_csv" and "stations.json" to "stations.csv" for a CSV example.
    """

{# Import a Python function to load our data. #}
{% from "load_data" import load_json %}

{# Load data from the specified file. #}
{% set stations = load_json('stations.json') %}

{# Extract a list of station names from the data file. #}
{% set station_names = stations | map(attribute="name") | list %}

{# Provide Cylc with a list of weather stations. #}
[task parameters]
    station = {{ station_names | join(', ') }}

[scheduling]
    initial cycle point = 2000-01-01
    final cycle point = 2000-01-02
    [[graph]]
        P1D = fetch<station> => process<station> => collate

[runtime]
{# Define a family for each weather station #}
{% for station in stations %}
    [[STATION<station={{ station["name"] }}>]]
        [[[environment]]]
            {# Turn the <station> parameter into an environment variable #}
            {# NB: Just to show how, we could also have used `station["name"]`. #}
            name = %(station)s
            {# Turn the data for this station into environment variables. #}
            wmo = {{ station["wmo"] }}
            alt = {{ station["alt"] }}
            lat = {{ station["lat"] }}
            lon = {{ station["lon"] }}
{% endfor %}

    # a task that gets data
    [[fetch<station>]]
        inherit = STATION<station>
        script = echo "fetch data for $name, WMO ID: $wmo"

    [[process<station>]]
        inherit = STATION<station>
        script = echo "process data for $name, location: $lat,$lon"

    [[collate]]
        script = "echo collate data for stations: {{ station_names }}"
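It may help to see roughly what the Jinja2 expressions in this file produce. The sketch below mirrors them in plain Python for two of the stations. The function names station_parameter_line and family_sections are hypothetical helpers invented for this illustration; they are not part of Cylc or of the example, and the real rendering is done by Jinja2 when Cylc reads the flow.cylc file.

```python
# Station records shaped like the entries in stations.json (two shown).
stations = [
    {"name": "camborne", "wmo": "03808", "alt": 87,
     "lat": 50.21841, "lon": -5.32753},
    {"name": "heathrow", "wmo": "03772", "alt": 25,
     "lat": 51.47922, "lon": -0.45061},
]


def station_parameter_line(stations):
    """Mirror `stations | map(attribute="name") | list` and `join(', ')`."""
    station_names = [station["name"] for station in stations]
    return "station = " + ", ".join(station_names)


def family_sections(stations):
    """Mirror the `{% for station in stations %}` loop, one family each."""
    lines = []
    for station in stations:
        lines.append(f'[[STATION<station={station["name"]}>]]')
        lines.append("[[[environment]]]")
        # Left untouched here: Cylc substitutes the parameter value later.
        lines.append("name = %(station)s")
        for key in ("wmo", "alt", "lat", "lon"):
            lines.append(f"{key} = {station[key]}")
    return lines


print(station_parameter_line(stations))
# → station = camborne, heathrow

for line in family_sections(stations):
    print(line)
```

Each station in the data file therefore contributes one entry to the station parameter and one STATION family carrying its data as environment variables, without any station being named in the flow.cylc file itself.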
Data Types
We can load other types of data file too. This example also includes the same
data in CSV format along with a Python function to load CSV data. To try it
out, open the flow.cylc file and replace stations.json with stations.csv and
load_json with load_csv.
Any Python code that you import using Jinja2 will be executed using the Python environment that Cylc is running in. So if you want to import Python code that isn’t in the standard library, you may need to ask your system administrator to install it into the Cylc environment for you.
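As an illustration, a CSV loader can be written with the standard library alone, so it needs nothing extra installed in the Cylc environment. The following is a minimal sketch of what a load_csv function might look like, not necessarily the implementation shipped with the example. It assumes the CSV file has a header row with the same field names as the JSON file, and the sample rows are written to a temporary file only so the snippet runs anywhere.

```python
import csv
import tempfile
from pathlib import Path


def load_csv(filename):
    """Return the rows of a CSV file as a list of dicts keyed by header."""
    with open(filename, 'r', newline='') as csv_file:
        return list(csv.DictReader(csv_file))


# Assumed layout: a header row followed by one row per station.
csv_text = (
    "name,wmo,alt,lat,lon\n"
    "camborne,03808,87,50.21841,-5.32753\n"
    "heathrow,03772,25,51.47922,-0.45061\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "stations.csv"
    path.write_text(csv_text)
    stations = load_csv(path)

# Same list-of-dicts shape as the JSON loader returns.
for station in stations:
    print(station["name"], station["wmo"], station["alt"])
```

Unlike the JSON loader, csv.DictReader returns every value as a string (for example "87" rather than 87). That makes no difference when the values are only templated into environment variables, but it matters if the Jinja2 code does arithmetic or comparisons on them.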