Specifications for referencing external data sources in serialized work #237

@eirrgang

We can embed arbitrary data in the serialized work document (a JSON file), but we don't always want to.

In some cases, we give file names as parameters, but this should be restricted to operations whose sole job is to provide the contents of the file to another operation through an API-compatible interface. Even in this case, we have not clearly defined the semantics for resolving the filesystem location or for verifying the identity or correctness of the file contents.
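To make the gap concrete, here is a minimal sketch of one way a file reference could carry enough information to check the file's identity. It is only an illustration, not something specified above: the record layout (`path`, `sha256`) and the function names are hypothetical, and the choice of SHA-256 is an assumption.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_file_reference(path) -> dict:
    """Build a hypothetical JSON-compatible record describing a file."""
    resolved = Path(path).resolve()
    return {"path": str(resolved), "sha256": file_digest(resolved)}


def verify_file_reference(reference: dict) -> bool:
    """Check that the file still exists and that its contents are unchanged."""
    path = Path(reference["path"])
    return path.is_file() and file_digest(path) == reference["sha256"]
```

An absolute path only answers the location question for a single filesystem. Resolving the same reference on another host would still need a rule that this sketch does not attempt to define.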

In other cases, we may prefer to refer to a binary large object without embedding it in the work specification.

I believe this can be addressed with a concept that will also help with several other design issues. Elements in a given work document may reference entities that are not required to exist in the same document, as long as there is a mechanism for the Context to locate these entities. We had started to consider fingerprinting graphs and subgraphs for cross-reference within a document, but this may not be the right approach (see below). We have already begun to develop the metadata requirements of a Session, by which a Context can determine and restore execution state from partially complete work. As a lightweight solution (at least initially), we can provide calls with which to provision a Context with additional knowledge of available resources. Such a resource could be a file, or something that the Context implementation understands but that is not generic enough to be specified in the API, such as a Google Cloud data object or a database connection.
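A minimal sketch of what such provisioning calls might look like. The class `SketchContext`, its methods `add_resource` and `resolve`, the resource labels, and the handle layout are all hypothetical; they are not part of the existing API.

```python
from typing import Any, Dict


class SketchContext:
    """Hypothetical Context that can be told about externally available resources."""

    def __init__(self) -> None:
        self._resources: Dict[str, Any] = {}

    def add_resource(self, label: str, handle: Any) -> None:
        """Record that `label` can be satisfied by `handle` in this Context."""
        if label in self._resources:
            raise ValueError(f"resource {label!r} is already registered")
        self._resources[label] = handle

    def resolve(self, label: str) -> Any:
        """Return the handle for `label`, or raise if it is unknown here."""
        try:
            return self._resources[label]
        except KeyError:
            raise LookupError(f"this Context cannot locate {label!r}") from None


context = SketchContext()
# A plain file that exists where this Context runs (path is illustrative).
context.add_resource("input_structure", {"kind": "file", "path": "/tmp/topol.tpr"})
# A resource only this Context implementation would understand (made-up URI).
context.add_resource("cloud_trajectory", {"kind": "cloud-object", "uri": "example://bucket/traj"})
```

The handle is deliberately opaque to the caller: only the Context implementation needs to know how to turn it into data.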

This does not necessarily require additions to the schema or API. Non-specified operations (operations outside of the "gmxapi" or "gromacs" namespaces) may not be supported in all environments. Since it is the job of the Context to find the implementation of an operation, and an operation may be provided by a Context implementation directly, we can defer to the Context on how to label and access external resources. We assume that any data source could be defined in terms of an operation with the BLOB serialized in its entirety as a JSON value, but we do not require that it ever be present as such when the Context can resolve the operation output some other way.
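The assumption in the last sentence can be sketched as a resolution rule: use the embedded value when the element carries one, otherwise ask the Context. The element layout, the `examplecontext` namespace, the `blob` operation, and the `value` and `label` parameter names are hypothetical and chosen only for illustration.

```python
from typing import Any, Mapping


def resolve_output(element: Mapping[str, Any], provisioned: Mapping[str, Any]) -> Any:
    """Return the data produced by a hypothetical data-source element.

    `provisioned` stands in for whatever the Context already knows how to find.
    """
    params = element.get("params", {})
    if "value" in params:
        # The BLOB is serialized in the document itself.
        return params["value"]
    label = params.get("label")
    if label in provisioned:
        # The Context resolves the output some other way.
        return provisioned[label]
    raise LookupError(f"cannot resolve output for label {label!r}")


embedded = {
    "namespace": "examplecontext",
    "operation": "blob",
    "params": {"value": [1.0, 2.0, 3.0]},
}
external = {
    "namespace": "examplecontext",
    "operation": "blob",
    "params": {"label": "reference_coordinates"},
}
provisioned = {"reference_coordinates": [1.0, 2.0, 3.0]}
```

Both elements describe the same data source. A consumer that calls `resolve_output` cannot tell which form was used, which is the property the paragraph above relies on.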

In the immediate future, in order to chunk work for dispatching, we will need to be able to repackage work graph records into different sub-graphs. If we extend fingerprinting so that nodes are globally unique, without requiring graphs to be separately identified, then we can design in terms of global references to unique information. The cost is that we must build a robust hierarchy or "routing" machinery so that a Context inspecting one element can track down its references.
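One way to picture node-level fingerprints is a hash over a canonical serialization of the node, where inputs are named by the fingerprints of the nodes that produce them. This is a sketch under assumptions: the node layout, the helper names, the use of SHA-256, and the operation names in the usage note are all illustrative rather than taken from the schema.

```python
import hashlib
import json
from typing import Dict, Iterable, Optional


def fingerprint(node: dict) -> str:
    """Hash a canonical JSON form of the node (sorted keys, no extra whitespace)."""
    canonical = json.dumps(node, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def add_node(store: Dict[str, dict], operation: str, params: dict,
             inputs: Optional[Dict[str, str]] = None) -> str:
    """Add a node to `store`, keyed by its fingerprint, and return that key.

    `inputs` maps an input name to the fingerprint of the producing node.
    """
    node = {"operation": operation, "params": params, "inputs": dict(inputs or {})}
    uid = fingerprint(node)
    store[uid] = node
    return uid


def subgraph(store: Dict[str, dict], uids: Iterable[str]) -> Dict[str, dict]:
    """Repackage selected nodes into a new record without rewriting references."""
    return {uid: store[uid] for uid in uids}
```

For example, after `a = add_node(store, "load_tpr", {"input": "topol.tpr"})` and `b = add_node(store, "md", {}, {"tpr": a})`, the record `subgraph(store, [b])` still names `a` as an input even though `a` is not in it. Tracking that reference down is the routing problem described above.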

In short, let's develop the deserialization heuristics to remove the requirement and assumption that a given single JSON record corresponds to a complete work graph.
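As a rough sketch of what this could mean in practice, a loader might accept any number of partial records, merge them, and report which references are still unresolved instead of failing. The record layout (a JSON object mapping an element identifier to a node with an `inputs` mapping) and the function names are hypothetical simplifications.

```python
import json
from typing import Dict, Iterable, Set


def load_records(*documents: str) -> Dict[str, dict]:
    """Merge several JSON records, each of which may hold only part of a graph."""
    elements: Dict[str, dict] = {}
    for text in documents:
        for uid, node in json.loads(text).items():
            if uid in elements and elements[uid] != node:
                raise ValueError(f"conflicting definitions for element {uid!r}")
            elements[uid] = node
    return elements


def unresolved(elements: Dict[str, dict], known_external: Iterable[str] = ()) -> Set[str]:
    """Return referenced identifiers that are neither loaded nor known externally."""
    known = set(known_external)
    missing: Set[str] = set()
    for node in elements.values():
        for ref in node.get("inputs", {}).values():
            if ref not in elements and ref not in known:
                missing.add(ref)
    return missing


# Two partial records; "n3" is defined in neither of them.
doc_a = json.dumps({"n1": {"operation": "load_tpr", "inputs": {}}})
doc_b = json.dumps({"n2": {"operation": "md", "inputs": {"tpr": "n1", "restraint": "n3"}}})
```

With this shape, an incomplete record is a normal state rather than an error: the Context decides whether the remaining references can be satisfied from other records or from resources it has been provisioned with.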

See also https://redmine.gromacs.org/issues/2901
