Work on the new
cylc clean is already underway.
This command is able to locate workflow installations on different platforms and delete them.
Ideally we would want
cylc clean to be able to do a lot more.
One, simple example is re-running workflows, which requires moving/removing the log dir and database. As per the rose suite-run migration proposal this shall be done by cylc clean.
you will be able to use cylc clean to clean out the log directory (and whatever other directories you want removing)
This means that we need an interface in
cylc clean to perform this.
We could give users a pre-defined option, however, the exact requirements for this niche usage are hard to predict.
This proposal suggests developing more generic functionality in
for this and other purposes. It also highlights some viable possibilities for
future functionality which could see it largely surpass
rose_arch. Not of top importance now but worth considering for interface
related https://github.com/cylc/cylc-flow/issues/3887 closes https://github.com/cylc/cylc-flow/issues/1159
rose_arch is ~500 lines of Python and does some things we
might not want to do in Cylc).
Control over which installations of a flow are cleaned:
# remove a workflow installation on the scheduler host cylc clean myflow --local # remove a workflow installation on any remote task hosts cylc clean myflow --remote # remove a workflow installation on the scheduler host and # on any remote task hosts cylc clean myflow
Control over what is cleaned:
# delete a directory in a workflow installation cylc clean myflow --rm share/cycle/1234 # delete two directores in a workflow installation cylc clean myflow --rm share --rm work cylc clean myflow --rm share:work # equivalent
Basic glob support:
# delete all job log files from 2020 cylc clean myflow --rm 'share/cycle/2020*' # delete everything cylc clean myflow --rm '*' cylc clean myflow # equivalent
Listing rather than removing things:
# list stuff in share/cycle for cycles in 2020 cylc clean --ls 'share/cycle/2020*'
Tools for moving and tar’ing things:
# rename a directory with a timestamp cylc clean myflow --mv share/cycle/1234 # rename the log dir AND suite DB cylc clean myflow --mv log # tar a directory in a workflow installation cylc clean myflow --tar share/cycle/1234
log dir and
.service/db together implicitly
cylc play --re-run):
# remove the log dir AND suite DB cylc clean myflow --rm log # rename the log dir AND suite DB cylc clean myflow --mv log # tar the log dir AND suite DB cylc clean myflow --tar log # tar two directories together cylc clean myflow --tar work:share # --tar is shorthand for --compress <path> <format> cylc clean --compress work:share archive.tar
Cycle aware housekeeping (
# remove job logs in the range 2020-2021 cylc clean myflow --rm log/job/<cycle> --cycle-range=2021/-P1Y # tar job logs for "foo" in the range 2020-2021 cylc clean myflow --tar log/job/<cycle>/foo --cycle-range=2021/-P1Y # tar job logs for "foo" for the year before $CYLC_TASK_CYCLE_POINT inclusive cylc clean myflow --tar log/job/<cycle>/foo --cycle-range=<cycle>/-P1Y # delete the work dir for the task being run cylc clean myflow --rm work/<cycle>/<task>
# rsync the work directory to myhost # NOTE: --rsync has two arguments cylc clean myflow --rsync work myhost:archived-work # tar the work directory and scp it to myhost # NOTE: use MIME types to map compression to file extension cylc clean myflow --scp work myhost:archived-work.tar # tar two cycle directories together and scp them to myhost cylc clean myflow --scp work/1234:work/2345 myhost:archived-work.tar
Extendable interface with MIME types:
# plugins can add new clean options cylc clean myflow --[tool] [input:[input:...] [[location:]output[.format]]] # e.g. scp cylc clean myflow --scp work myhost:archived-work-dir # use MIME types to detect output format e.g. tar cylc clean myflow --scp work myhost:archived-work-dir.tar # plugins can add support for new MIME types cylc clean myflow --mytool input output.pax
Mix and match:
cylc clean \ --local --mv log/job \ -- \ --remote --tar log/job \ -- \ --rm work/<cycle>:share/<cycle> --cycle-range=2021/-P1Y
Call from Python:
from cylc.flow.clean import clean, cleanitem clean( ['myflow'], cleanitem(mv='['log/job']), cleanitem(tar='['log/job']), cleanitem(rm='['work/<cycle>', 'share/<cycle>'], cycle_range='2021/-P1Y') )
Build into a trivial main-loop plugin:
[scheduler] [[main loop]] # calls cylc-clean from the main loop via an async subprocess # so it doesn't cause any load on the scheduler process [[[clean]]] args = """ # tar log files on the scheduler host that are P1Y behind the # oldest active cycle point --local --tar log/job/<cycle> --cycle-range=<cycles>/-P1Y -- # delete work dir and log files for cycles that are P1Y behind # the oldest active cycle point on all platforms --remote --rm work/<cycle> --cycle-range=<cycles>/-P1Y """ # run every 12 hours interval = P12H # can potentially allow multiple calls of the same plugin with # different options like so [[[clean(model-output)]]] # remove output files in a targeted manner args = """ --rm work/<cycle>/model*/bigfile --cycle-range=<cycles>/-P1D """ # we could add a main-loop event which gets fired whenever a cycle # "closes" which would be good for housekeep purposes # can always use the same options configured here on the command line
# to illustrate how trivial this plugin would be ... from cylc.flow.clean import clean from cylc.flow.scripts.clean import get_option_parser @main_loop.startup def startup(_, conf) args = re.sub('^\s*#.*$', '', conf['args']) args.replace('\n', ' ') conf['args'] = get_option_parser().parse_args(args) @main_loop.periodic def periodic(_, conf): clean(conf['args'])
The rest according to future ideas and priorities.