Summarizing Aggregation over the Summer
Hello, and welcome to the final blog in the series of blogs by Hardik. If you haven’t read my previous blogs, feel free to have a look at them. The links are provided below.
- The Last Phase, GSoC 2020!
- Aggregates, a user problem?
- Local SLURM cluster setup
- Coding Aggregation Begins...
- Introducing Aggregation
So an amazing journey of Google Summer of Code (GSoC) has finally come to an end. Over the summer I learned a lot, and the learning curve is only going to get steeper from here. I’d like to thank Bradley, Brandon, Vyas, Alyssa, Mike, Simon and the rest of the signac team for their constant help and support throughout the summer. This blog post documents the work I did for GSoC 2020 on introducing aggregate operations in signac-flow.
Project Description
The signac data and workflow model is primarily designed around the concept of operations acting on jobs, where the management of each job’s data is handled by the signac package and the workflow definition and execution are handled by signac-flow. The current workflow model treats operations as always acting on single jobs. This project allows users to execute operations that accept multiple jobs as their arguments.
A practical example of using an aggregate operation is described in this pull request. In the example we aim to generate a plot of temperature (in °C) versus day for a month having 31 days. After that, we compare that plot with the average temperature of that month.
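To make the idea concrete, here is a minimal, library-free sketch of what an operation over several jobs looks like. It does not use signac or signac-flow: the jobs are plain dictionaries, the names `average_temperature` and `temperature` are hypothetical, and the temperature values are made up purely for illustration.

```python
from statistics import mean

# Stand-ins for signac jobs: one per day of a 31-day month.
# The temperature values are invented for illustration only.
jobs = [{"day": day, "temperature": 20 + (day % 5)} for day in range(1, 32)]


def average_temperature(*jobs):
    """An aggregate-style operation: it receives many jobs at once."""
    return mean(job["temperature"] for job in jobs)


# A single-job operation would see one day; this one sees the whole month.
monthly_average = average_temperature(*jobs)
```

The point is only the signature: the operation takes all the jobs of the aggregate as arguments instead of exactly one.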
I raised a pull request (now closed) that gave the team an overview of aggregation and helped me track my project. Because it was very large, merging it as a whole would not have been a good decision, so the team suggested that I break my work into several smaller steps.
I’ll now describe my approach to the project in several points.
Make FlowCondition class private (#315)
No user-facing method currently requires direct access to this class or returns instances of it, and users should never instantiate it directly. In addition, condition functions are evaluated through this class, and, as the upcoming points show, every method is internally passed a list of jobs as a single positional argument rather than as variable arguments. Exposing that calling convention could confuse users, so this class doesn’t need to be in the public API.
Make FlowOperation callable (#326)
The classes FlowCmdOperation and FlowOperation are responsible for handling the logic associated with signac operations with or without the @cmd decorator, respectively.
Previously, the logic for calling these operation functions differed internally. With aggregation being introduced, the calling convention needs to be consistent throughout the code base in order to avoid confusion.
Make JobOperation private (#325)
The JobOperation class was exposed to users for the primary purpose of using it with submit_operations and run_operations.
The use case for this class was small, and its structure was expected to change after aggregation.
Hence, JobOperation and the methods that return its instances were deprecated and are scheduled to be removed in signac-flow version 0.13.
Deprecate eligible and complete methods from the user API (#337)
The eligible and complete methods were originally used to check whether a job-operation pair was eligible to run or be submitted, or was complete, respectively.
The use case for these methods is small, and they may confuse users who need to check the eligibility of an aggregate-operation pair.
Hence, the eligible and complete methods of FlowGroup and BaseFlowOperation were deprecated and are scheduled to be removed in signac-flow version 0.13.
Enable aggregate logic in flow (#324)
Reviewing the actual aggregation feature becomes much easier once signac-flow supports aggregates of one. This pull request internally converts all jobs into aggregates of one, so every method is now internally passed a tuple containing a single job. This was the first major pull request to be merged into the master branch.
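The idea of an aggregate of one can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions, not the code from the pull request; the helper names `as_aggregates_of_one`, `run`, and `describe` are hypothetical, and strings stand in for jobs.

```python
def as_aggregates_of_one(jobs):
    """Wrap every job in a one-element tuple."""
    return [(job,) for job in jobs]


def run(operation, aggregate):
    """Call an operation with the jobs of one aggregate as positional arguments."""
    return operation(*aggregate)


def describe(job):
    # A classic single-job operation; it is unaware of aggregation.
    return f"processed {job}"


aggregates = as_aggregates_of_one(["job_a", "job_b"])
results = [run(describe, aggregate) for aggregate in aggregates]
```

Because a single job is just the smallest possible aggregate, the same internal code path can later serve operations that take many jobs.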
Change submission ID to support aggregation (#334)
Previously, every submitted JobOperation instance held a submission id, a unique id that identified the job associated with a group of operations.
This PR settled decisions such as how an aggregate-operation should be represented in a submission script and how to make the id of an aggregate associated with a group unique.
The id now contains details such as the group name, the length of the aggregate, and the concatenated job ids of the jobs in the aggregate.
An aggregate-operation is represented in a script as follows:
my-op[#1](26021048)
my-op[#2](26021048, 42b7b4f2)
my-op[#3](26021048, 42b7b4f2, 44550aef)
my-op[#4](26021048, ..., 4b893796)
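The pattern in these four lines can be reproduced with a small formatting function. This is a sketch inferred only from the examples above, not the signac-flow implementation: the function name, the eight-character truncation, and the rule of eliding the middle ids once there are more than three jobs are all my assumptions, and the full job ids in the usage lines are made up.

```python
def format_aggregate_operation(name, job_ids, max_shown=3, id_length=8):
    """Render an aggregate-operation as name[#length](shortened job ids)."""
    short = [job_id[:id_length] for job_id in job_ids]
    if len(short) > max_shown:
        # Assumed rule: keep only the first and last id for long aggregates.
        short = [short[0], "...", short[-1]]
    return f"{name}[#{len(job_ids)}]({', '.join(short)})"


# Hypothetical full ids whose prefixes match the examples in the text.
ids = ["26021048aaaa", "42b7b4f2bbbb", "44550aefcccc", "4b893796dddd"]
two_jobs = format_aggregate_operation("my-op", ids[:2])
four_jobs = format_aggregate_operation("my-op", ids)
```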
Add aggregator classes to flow (#348)
This pull request is currently the latest one that I’ve filed.
A new and more efficient way of storing aggregates, suggested by my mentors, is being implemented in this PR.
This PR introduces aggregator classes to flow, which are responsible for registering, storing, and generating aggregates whenever required.
These are aggregator, _AggregatesStore, _DefaultAggregateStore, and _MakeAggregate.
The aggregator class will be used as a decorator class for operation functions.
Its features include aggregating jobs into groups of a given size, grouping them by one or more statepoint parameters, sorting them by a statepoint parameter (optionally in reverse order), and selecting only a few jobs from the project using a select argument.
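The features listed above (fixed-size groups, grouping by a parameter, sorting, and selection) can be illustrated without signac-flow. The following is a conceptual sketch only: it does not use or reproduce the aggregator class from the PR, the function names `groups_of` and `group_by` and their parameters are hypothetical, and dictionaries stand in for job statepoints.

```python
from itertools import groupby, islice


def _prepare(jobs, sort_by, reverse, select):
    """Apply the optional selection and sorting before aggregating."""
    if select is not None:
        jobs = [job for job in jobs if select(job)]
    if sort_by is not None:
        jobs = sorted(jobs, key=lambda job: job[sort_by], reverse=reverse)
    return list(jobs)


def groups_of(jobs, num, sort_by=None, reverse=False, select=None):
    """Split jobs into tuples of at most `num` jobs each."""
    iterator = iter(_prepare(jobs, sort_by, reverse, select))
    # iter(callable, sentinel) stops once an empty tuple is produced.
    return list(iter(lambda: tuple(islice(iterator, num)), ()))


def group_by(jobs, key, sort_by=None, reverse=False, select=None):
    """Group jobs that share the same value of the statepoint parameter `key`."""
    jobs = _prepare(jobs, sort_by, reverse, select)
    # itertools.groupby only merges adjacent items, so sort by the key first.
    jobs = sorted(jobs, key=lambda job: job[key])
    return [tuple(group) for _, group in groupby(jobs, key=lambda job: job[key])]


# Five stand-in jobs with a numeric and a categorical statepoint parameter.
jobs = [{"i": i, "kind": "even" if i % 2 == 0 else "odd"} for i in range(5)]

pairs = groups_of(jobs, 2)
by_kind = group_by(jobs, "kind")
selected = groups_of(jobs, 2, sort_by="i", reverse=True,
                     select=lambda job: job["i"] > 0)
```

Here `pairs` holds three aggregates (the last one with a single job), `by_kind` holds one aggregate per value of `kind`, and `selected` drops the job with `i == 0` before pairing the rest in descending order.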
Enable aggregate status check (#335)
The changes made in this pull request handle the issue with the status check described in one of my blog posts.
Among the decisions made here: aggregates are now registered when a FlowProject is initialized, and if a user changes the aggregator associated with an operation function, the user has to register the aggregates again using the register_aggregates() method (or initialize the project once more); otherwise the previously registered aggregates will be used.
Every aggregate now has an id associated with it.
So, if a user wants to run an operation for a particular aggregate, the user can get its id using the get_aggregate_id() method.
After that, the command line option -j can be used to specify the id.
An example command to run an operation that accepts either a single job with id job_id1 or an aggregate with id aggregate_id1 would be:
python project.py run -j job_id1 aggregate_id1
Work left to do in this PR: since #348 is the latest PR for the project, this PR still needs to add support for the new features introduced there.
Add aggregation feature to flow (#336)
This pull request, when merged, will enable the users to perform actual aggregation for their workflow.
In this PR I have refactored all the templates used for status printing in order to support aggregation.
I also wrote all the necessary tests for testing the aggregation feature.
signac-flow will now provide a detailed per-aggregate status overview showing all the jobs in the aggregates associated with every aggregated operation.
Users can also use the --orphan command line option with the status check to fetch the details of “orphaned” aggregates, which were submitted previously but are no longer considered for execution because of modifications in the data space (e.g. the deletion of a job in the aggregate or the creation of new jobs that belong in that aggregate).
A sample aggregate status view for the status query python project.py status --detailed is given below.
Detailed Aggregate View:
operation    jobs_in_aggregate                 length_of_aggregate    status
-----------  --------------------------------  ---------------------  --------
compute_sum  ee5ca9ab62e9dbb7b6abbaaac6443d49  4                      [U]
compute_sum  03086200817396c6083c34ac025ec4d5  4                      [U]
compute_sum  92f821919b4b0a2f15d0ef3f5d433550  4                      [U]
compute_sum  5d63db8dc4821a190f690fd66e4dd0be  4                      [U]
compute_sum  as32asdga2e9dbb7b6abbaaac6443d49  2                      [U]
compute_sum  2sgj2k00817396c6083c34ac025ec4d5  2                      [U]
Work left to do in this PR: add support for the classes from #348 that are responsible for storing aggregates.
I’m really looking forward to seeing the aggregation feature used in the real world. This was the major portion of my GSoC journey. Now I’ll describe the work I did during the summer that was only loosely related to the aggregation project.
Add tests for Directives class (#283)
I wrote tests for two classes, Directives and DirectivesItem, which serve as a smart mapping for the environment and user-specified directives and as a specification for environment directives, respectively.
Add pre-commit hooks to signac and signac-flow (#358, #333)
This ensures that the code and documentation written by developers are compliant before they are committed. The documentation on how to set up a pre-commit hook can be found here.
For prospective GSoC 2021 students
Students applying for GSoC 2021 should start contributing to the open source community early in order to pick up the basic programming concepts and the version control system used by the organization (mostly Git). Communicating with the team is the most important part: it strengthens your bond with the community and will help you well beyond the program.