Summarizing Aggregation over the Summer

Hello, and welcome to the final post in the series of blogs by Hardik. If you haven’t read my previous posts, feel free to have a look at them. The links to the blogs are provided below.

So, finally, an amazing journey of Google Summer of Code (GSoC) has come to an end. Over the summer I learned about a lot of things, and the slope of the learning curve is definitely going to increase in the future. I’d like to thank Bradley, Brandon, Vyas, Alyssa, Mike, Simon and the rest of the signac team for their constant help and support throughout the summer. This blog post documents the work I did for GSoC 2020 on introducing aggregate operations in signac-flow.

Project Description

The signac data and workflow model is primarily designed around the concept of operations acting on jobs, where the management of the job’s data is handled by the signac package and the workflow definition and execution are handled by signac-flow. The current workflow model treats operations as always acting on single jobs. This project allows users to execute operations that accept multiple jobs as their arguments.

A practical example of using an aggregate operation is described in this pull request. In the example, we aim to plot temperature (in °C) against the day of the month for a month with 31 days. We then compare that plot with the average temperature of that month.
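To make the idea concrete, here is a minimal sketch in plain Python. It does not use the signac-flow API: the jobs are stand-in dictionaries, the temperature values are made up, and the function name average_temperature is hypothetical. It only illustrates what it means for one operation to receive many jobs at once.

```python
from statistics import mean

# Stand-ins for signac jobs: one dictionary per day of a 31-day month.
# The temperature values are invented purely for illustration.
jobs = [{"day": day, "temperature": 20.0 + (day % 5)} for day in range(1, 32)]


def average_temperature(*jobs):
    """Hypothetical aggregate operation: receives many jobs instead of one."""
    return mean(job["temperature"] for job in jobs)


# One call covers the whole month, instead of one call per job.
monthly_average = average_temperature(*jobs)

# Compare each day against the monthly average.
above_average = [job["day"] for job in jobs if job["temperature"] > monthly_average]
```

An operation that acts on a single job could not compute the monthly average on its own, because it never sees more than one day's data.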

I raised a pull request (now closed) that gave the team an overview of aggregation and helped me track my project. Because it was very large, merging it as a whole would not have been a good decision, so the team suggested that I break my work into several smaller steps.

I’ll now describe my approach to the project in several points.

Make FlowCondition class private (#315)

No user-facing method currently requires direct access to this class or returns instances of it, and users should never instantiate it directly. Condition functions are evaluated through this class, and, as the upcoming points show, every method is internally passed a list of jobs as a single positional argument rather than as variable arguments. Exposing such a class could confuse users, so it doesn’t need to be in the public API.

Make FlowOperation callable (#326)

The classes FlowCmdOperation and FlowOperation handle the logic associated with signac operations with and without the @cmd decorator, respectively. Previously, the logic for calling these operation functions differed internally, but with aggregation being introduced we need to keep it consistent throughout the code base to avoid confusion.

Make JobOperation private (#325)

The JobOperation class was exposed to users primarily for use with submit_operations and run_operations. Its use case was small, and the structure of the class was expected to change after aggregation. Hence, JobOperation and the methods that return its instances were deprecated and are scheduled to be removed in signac-flow version 0.13.

Deprecate eligible and complete methods from the user API (#337)

The eligible and complete methods were originally used to check whether a job-operation pair was eligible to run or be submitted, or was complete, respectively. The use case for these methods is small, and they may confuse users who are checking the eligibility of an aggregate-operation pair. Hence, the eligible and complete methods of FlowGroup and BaseFlowOperation were deprecated and are scheduled to be removed in signac-flow version 0.13.

Enable aggregate logic in flow (#324)

Reviewing actual aggregation becomes much easier if signac-flow first supports aggregates of one. This pull request internally converts all jobs into aggregates of one, so we now pass a tuple containing a single job to every method internally. This was the first major pull request to be merged into the master branch.
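The idea of an aggregate of one can be sketched in a few lines of plain Python. This is not the signac-flow implementation: the helper names as_aggregates_of_one and run_operation are hypothetical, and the jobs are represented by id strings only.

```python
def as_aggregates_of_one(jobs):
    """Wrap every job in a one-element tuple (an aggregate of one)."""
    return [(job,) for job in jobs]


def run_operation(operation, aggregate):
    """Call the operation with the jobs of the aggregate unpacked."""
    return operation(*aggregate)


def job_label(job):
    """Example single-job operation used only for this illustration."""
    return f"processed {job}"


job_ids = ["26021048", "42b7b4f2"]
aggregates = as_aggregates_of_one(job_ids)
results = [run_operation(job_label, aggregate) for aggregate in aggregates]
```

Because the internal code path always handles tuples, an operation that later accepts two or more jobs can go through the same call without a special case.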

Change submission ID to support aggregation (#334)

Previously, every submitted JobOperation instance held a submission id, a unique id that identified the job associated with a group of operations. This PR settled questions such as how an aggregate-operation should be represented in a submission script and how to make the id of an aggregate associated with a group unique. The id now contains details such as the group name, the length of the aggregate, and the concatenated job ids of the jobs in the aggregate. An aggregate-operation is represented in a script as follows:

my-op[#2](26021048, 42b7b4f2)
my-op[#3](26021048, 42b7b4f2, 44550aef)
my-op[#4](26021048, ..., 4b893796)
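The three lines above follow a pattern that can be reproduced with a small helper. The sketch below is only an illustration inferred from those samples, not the code in the PR: the function name format_aggregate_operation, the truncation of job ids to 8 characters, and the rule of eliding the middle ids once there are more than three are all assumptions.

```python
def format_aggregate_operation(name, job_ids, max_shown=3):
    """Build a string like 'my-op[#2](26021048, 42b7b4f2)'.

    Assumptions: ids are shortened to 8 characters, and aggregates
    longer than max_shown display only the first and last id.
    """
    shown = [job_id[:8] for job_id in job_ids]
    if len(shown) > max_shown:
        shown = [shown[0], "...", shown[-1]]
    return f"{name}[#{len(job_ids)}]({', '.join(shown)})"


two = format_aggregate_operation("my-op", ["26021048", "42b7b4f2"])
four = format_aggregate_operation(
    "my-op", ["26021048", "42b7b4f2", "44550aef", "4b893796"]
)
```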

Add aggregator classes to flow (#348)

This pull request is currently the latest one that I’ve filed. It implements a new and more efficient way of storing aggregates, suggested by my mentors. It introduces aggregator classes to flow, which are responsible for registering, storing, and generating aggregates whenever required: aggregator, _AggregatesStore, _DefaultAggregateStore, and _MakeAggregate. Users will apply the aggregator class as a decorator on operation functions. It supports aggregating jobs in groups of a given size, grouping them by multiple statepoint parameters, sorting them by a statepoint parameter (optionally in reverse order), and selecting only some jobs from the project using a select argument.
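The grouping, sorting, and selecting behaviour described here can be sketched in plain Python. This is not the aggregator class from the PR: the function names groups_of, group_by, and prepare are hypothetical, and statepoints are represented as simple dictionaries.

```python
from itertools import groupby


def groups_of(jobs, num):
    """Split jobs into consecutive aggregates of at most num jobs."""
    return [tuple(jobs[i:i + num]) for i in range(0, len(jobs), num)]


def group_by(jobs, keys):
    """Group jobs that share the same values for the given statepoint keys."""
    def key_fn(job):
        return tuple(job[key] for key in keys)

    # groupby only merges adjacent items, so sort by the same key first.
    ordered = sorted(jobs, key=key_fn)
    return [tuple(group) for _, group in groupby(ordered, key=key_fn)]


def prepare(jobs, sort_by=None, reverse=False, select=None):
    """Optionally filter jobs with select, then sort them by one key."""
    if select is not None:
        jobs = [job for job in jobs if select(job)]
    if sort_by is not None:
        jobs = sorted(jobs, key=lambda job: job[sort_by], reverse=reverse)
    return list(jobs)


# Made-up statepoints standing in for the jobs of a project.
statepoints = [{"a": i, "b": i % 2} for i in range(6)]

selected = prepare(statepoints, sort_by="a", reverse=True, select=lambda sp: sp["a"] > 0)
pairs = groups_of(selected, 2)
by_b = group_by(statepoints, ["b"])
```

Here pairs holds aggregates of at most two jobs in descending order of a, and by_b holds one aggregate per distinct value of b.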

Enable aggregate status check (#335)

The changes made in this pull request handle the issue with the status check described in one of my blog posts. Aggregates are now registered when a FlowProject is initialized. If a user changes the aggregator associated with an operation function, the user has to register the aggregates again using the register_aggregates() method (or initialize the project once again); otherwise, the previously registered aggregates will be used. Every aggregate now has an id associated with it, so a user who wants to run an operation for a particular aggregate can get its id using the get_aggregate_id() method and then specify that id with the command line option -j. An example command to run an operation that accepts either a single job with id job_id1 or an aggregate with id aggregate_id1 would be:

python run -j job_id1 aggregate_id1

Work left to do in this PR: since #348 is the latest PR for the project, this PR needs to add support for all the new features introduced there.

Add aggregation feature to flow (#336)

This pull request, when merged, will enable users to perform actual aggregation in their workflows. In this PR I refactored all the templates used for status printing to support aggregation, and I wrote the tests needed for the aggregation feature. signac-flow will now provide a detailed per-aggregate status overview that shows all the jobs in the aggregates associated with every aggregated operation. Users can also pass the --orphan command line option to the status check to fetch the details of “orphaned” aggregates, which were submitted previously but are no longer considered for execution because of modifications in the data space (e.g. the deletion of a job in the aggregate or the creation of new jobs that belong in that aggregate). A sample aggregate-status view for the status query python status --detailed is given below.

Detailed Aggregate View:

operation    jobs_in_aggregate                   length_of_aggregate  status
-----------  --------------------------------  ---------------------  --------
compute_sum  ee5ca9ab62e9dbb7b6abbaaac6443d49                      4  [U]
compute_sum  03086200817396c6083c34ac025ec4d5                      4  [U]
compute_sum  92f821919b4b0a2f15d0ef3f5d433550                      4  [U]
compute_sum  5d63db8dc4821a190f690fd66e4dd0be                      4  [U]

compute_sum  as32asdga2e9dbb7b6abbaaac6443d49                      2  [U]
compute_sum  2sgj2k00817396c6083c34ac025ec4d5                      2  [U]

Work left to do in this PR: Add support for classes responsible for storing aggregates in #348.

I’m really looking forward to seeing the aggregation feature used in the real world. This was the major portion of my GSoC journey. Now I’ll describe the work I did during the summer that was only loosely related to the aggregation project.

Add tests for Directives class (#283)

I wrote tests for two classes, Directives and DirectivesItem, which serve as a smart mapping for the environment and user-specified directives, and as a specification for environment directives, respectively.

Add pre-commit hooks to signac and signac-flow (#358, #333)

This ensures that the code and documentation written by developers are compliant before committing. The documentation on how to set up a pre-commit hook can be found here.

For prospective GSoC 2021 students

Students applying for GSoC 2021 should start contributing to the open source community early in order to learn the basic concepts of programming and the version control system used by the organization (mostly Git). Communicating with the team is the most important part: it will strengthen your bond with the community and will help you elsewhere in life too.