Data Model and Permissions

This document provides a high-level overview of the proposed data model and related permissions for the IRIDA platform.

Document History

Authors

Background

This document describes an extension to the data model exposed by the NGS Archive, and is a proposed partial implementation of the data model described by the REST API for GMI-compliant repositories.

The data model in the NGS Archive consists of five interrelated units:

NGS Archive data model.

A sequence file is the atomic unit in the NGS Archive, holding a collection of metadata and read data generated by the act of sequencing an isolate.

A run is a collection of all files and metadata generated during the execution of a sequencer.

A sample (equivalently, an isolate) refers to a physical sample collected from an environment. Each sample may have one or more files representing the digital data derived from that physical sample.

A project is a collection of related samples.

The resources in the NGS Archive are protected, requiring both authentication and authorization to view any resource:

  1. Any user is allowed to create a project.
  2. The user creating the project is assigned as a manager of that project.
  3. Managers of a project can assign permissions to other users to read (project role of ‘user’) or write (project role of ‘manager’) to the project.
  4. Project permissions are transitively applied to resources contained by a project (if a user can view a project, then the user can view the samples and files contained within that project).
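
The four rules above can be sketched in a few lines of Python. This is only an illustration of the rules as stated, not the NGS Archive implementation; the names `Project`, `ProjectRole`, `create_project`, `assign_role` and `can_read_file` are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class ProjectRole(Enum):
    USER = "user"        # read access to the project
    MANAGER = "manager"  # read and write access to the project


@dataclass
class Project:
    name: str
    members: dict = field(default_factory=dict)  # username -> ProjectRole
    samples: dict = field(default_factory=dict)  # sample name -> list of file names


def create_project(name, creator):
    # Rules 1 and 2: any user may create a project and becomes its manager.
    project = Project(name)
    project.members[creator] = ProjectRole.MANAGER
    return project


def assign_role(project, acting_user, target_user, role):
    # Rule 3: only a manager of the project may grant roles to other users.
    if project.members.get(acting_user) is not ProjectRole.MANAGER:
        raise PermissionError(f"{acting_user} is not a manager of {project.name}")
    project.members[target_user] = role


def can_read_file(project, username, sample_name, file_name):
    # Rule 4: read access to a project extends to its samples and files.
    if username not in project.members:
        return False
    return file_name in project.samples.get(sample_name, [])
```

Here a user with either role can read every file in the project, because both roles include read access.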

The proposed REST API for GMI-compliant repositories further separates a run into two resources: Run and Experiment. The experiment resource is designed to capture the laboratory methods used to prepare a sample for sequencing. The run resource captures the metadata related to the sequencing process itself (e.g., the type of sequencer used) and the files generated by the sequencer.

The data model implemented by the NGS Archive is suitable for the storage of sequencing data, but needs to be extended to support the execution of workflows.

Requirements

The IRIDA platform has several requirements not considered by the data model implemented as part of the NGS Archive:

Workflows

A key requirement of the IRIDA platform is ease of use. To that end, the requirements for executing workflows and selecting data for those workflows are:

  1. A user must be able to select a workflow for execution.
  2. A user must be presented with a collection of data to be used as input by the workflow during execution. The collection presented must contain only data that the user is permitted to read (i.e., data in projects where the user has read permission; see the section titled ‘User Groups and Permissions’, below).
  3. The user must be able to specify who (i.e., users and groups) is allowed to read the outputs produced by the execution of the workflow.

Solutions

A proposal for a data model that meets the requirements set out above includes two major components: a data project (corresponding to a study in the GMI REST API proposal and a Project in the NGS Archive) and an analysis (a container storing metadata and data related to the execution of a workflow).

Groups

A group is a logical collection of user accounts that should have the same permissions applied to a resource.

To reduce complexity, only administrative users (including managers) may create groups, and every group will be visible to all users in the local system.

Data model:

Samples

Samples will exist as separate, distinct entities. In the NGS Archive, a sample was required to be contained within a parent project; in IRIDA, however, a sample exists outside of any project and may be referenced by zero or more projects.

A sample also corresponds to an isolate, and should have relevant epidemiological metadata attached. We derive some of the metadata definition from the NCBI BioSample project, specifically from the GenomeTrakr/GMI samples, which have some epidemiological metadata attached.
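
A minimal sketch of this relationship follows, assuming that projects hold references to sample identifiers rather than owning the samples. The class names, the `projects_referencing` helper, and the metadata key and value are hypothetical and chosen only for illustration; they are not a proposed schema.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    sample_id: str
    metadata: dict = field(default_factory=dict)  # epidemiological metadata
    files: list = field(default_factory=list)     # sequence file names


@dataclass
class DataProject:
    name: str
    sample_ids: set = field(default_factory=set)  # references, not ownership

    def add_sample(self, sample):
        self.sample_ids.add(sample.sample_id)


def projects_referencing(sample, projects):
    """Return the names of every project that references the sample."""
    return [p.name for p in projects if sample.sample_id in p.sample_ids]


# Made-up example data: one sample shared by two projects, one sample in none.
isolate = Sample("sample-001", {"collection_date": "2013-05-01"})
orphan = Sample("sample-002")
outbreak = DataProject("outbreak")
surveillance = DataProject("surveillance")
outbreak.add_sample(isolate)
surveillance.add_sample(isolate)
```

Because a project stores only identifiers, removing a sample from one project leaves the sample itself, and its membership in other projects, untouched.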

Data Project

Analysis

One issue raised while preparing this document concerns the relationship between permissions on an analysis and permissions on the data that the analysis was executed on. We specified that an analysis and a data project each have a distinct set of permissions. As a consequence, a user may have permission to read an analysis without having permission to read the data used to generate it. We identified two potential problems with this scenario:

  1. Reproducibility of an analysis: a user would be able to read the instructions executed on some input data (the workflow) and the output data generated by workflow execution, but would not be able to reproduce the outputs because they lack inputs.
  2. Reconstruction of inputs from outputs: the simplest example of this is a workflow consisting of reference mapping; the outputs of a reference mapping (a BAM file, for example) can be used to trivially reconstruct the input reads, regardless of user permissions on the data project.

Workflow Execution

The general approach to workflow execution should be:

  1. The user selects the type of analysis they would like to run.
  2. The user is provided with a list of inputs that are:
    1. Suitable for use as inputs by the selected analysis (e.g., a reference genome or sets of reads, as the analysis requires), and
    2. Readable by the user.
  3. The user is provided with the opportunity to select the permissions to be applied to an analysis (i.e., who can see the outputs?)
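
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the `WORKFLOW_INPUT_KINDS` table, the `"reference_mapping"` workflow name, the `kind` labels and all function and class names are hypothetical, and the set of projects readable by the user is assumed to have been computed elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataItem:
    name: str
    kind: str     # hypothetical labels such as "reads" or "reference"
    project: str  # name of the data project that holds the item


@dataclass
class AnalysisRequest:
    workflow: str
    inputs: list
    readers: set  # users and groups allowed to read the outputs


# Hypothetical table of the input kinds each workflow accepts.
WORKFLOW_INPUT_KINDS = {"reference_mapping": {"reads", "reference"}}


def selectable_inputs(workflow, items, readable_projects):
    # Step 2: offer only items that suit the workflow and that the user may read.
    kinds = WORKFLOW_INPUT_KINDS[workflow]
    return [i for i in items if i.kind in kinds and i.project in readable_projects]


def build_request(workflow, chosen, offered, readers):
    # Step 3: record who may read the outputs; reject inputs that were never offered.
    if any(item not in offered for item in chosen):
        raise PermissionError("an input was selected that was not offered to this user")
    return AnalysisRequest(workflow, list(chosen), set(readers))
```

Checking the chosen inputs against the offered list a second time guards against a client submitting items it was never shown.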

User Groups and Permissions

User groups are synonymous with project membership. Projects have associated permissions that relate user accounts to projects. If a user has any relationship with a project, then that user is considered to be part of the group represented by that project.

Permissions Applied to Data Projects

When assigning permissions to a data project, a project owner should be allowed to choose from:

  1. Individual user accounts (by specifying an e-mail address or some other unique locator for a user account),
  2. Groups of users on the local system.

The project owner should also be provided the option to select the access level that the user(s) should be given (i.e., one of user or manager).

If the project owner selects an individual user account, then the user account should be added directly to the project members with the appropriate role.

If the project owner selects a group of users, then the group reference should be added to the project groups with the appropriate role. By adding a group, any changes made to the group are immediately applied to the permissions on all resources that the group is related to (for example, if a user account is removed from a group, then all permissions for all resources granted to the group are revoked for the individual user account immediately). In other words, a group is not added as a snapshot of the current group members, but rather is retained as a reference in the set of permissions applied to a resource.
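
The difference between a reference and a snapshot can be sketched as below. The class and method names are hypothetical. The sketch also assumes that a direct membership takes precedence over a group membership and that the first matching group wins, neither of which this document specifies.

```python
class Group:
    def __init__(self, name, members=()):
        self.name = name
        self.members = set(members)


class DataProject:
    def __init__(self, name):
        self.name = name
        self.member_roles = {}  # username -> role
        self.group_roles = {}   # Group object -> role; a reference, not a copy

    def add_user(self, username, role):
        self.member_roles[username] = role

    def add_group(self, group, role):
        # Store the group itself so later membership changes are seen here.
        self.group_roles[group] = role

    def role_for(self, username):
        # Assumption: direct membership is checked before group membership.
        if username in self.member_roles:
            return self.member_roles[username]
        for group, role in self.group_roles.items():
            if username in group.members:
                return role
        return None


lab = Group("lab", {"alice", "bob"})
project = DataProject("outbreak")
project.add_group(lab, "user")
```

Removing `bob` from `lab.members` after this point changes the result of `project.role_for("bob")` from `"user"` to `None` without any update to the project.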

The system should take the following steps when a user attempts to access a data project or data project sub-resource:

  1. Check to see if the data project is publicly available without authentication. If the data project is publicly available without authentication, access is permitted. If the data project is not publicly available without authentication, proceed to step 2.
  2. Check to see if the data project is publicly available with authentication. If it is, access is permitted when the user has provided authentication details and denied when the user has not. If the data project is privately available, proceed to step 3.
  3. The user must have provided authentication details to proceed. Check to see if the user is a member of the data project. If the user is a member, then the user is authorized to view the data project and access is permitted. If the user is not a member, proceed to step 4.
  4. Iterate over the collection of group references attached to the data project. If the user is a member of any of those groups, access is permitted. If the user is not a member of any referenced group, access is denied.
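
The four steps above can be expressed as a single function. This is a sketch under assumptions: the three-valued `Visibility` setting, the class names and `can_access` are hypothetical, and an unauthenticated request is modelled by passing `None` as the username.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PUBLIC = "public without authentication"
    AUTHENTICATED = "public with authentication"
    PRIVATE = "private"


@dataclass
class Group:
    name: str
    members: set = field(default_factory=set)


@dataclass
class DataProject:
    name: str
    visibility: Visibility = Visibility.PRIVATE
    members: set = field(default_factory=set)   # usernames with a direct role
    groups: list = field(default_factory=list)  # references to Group objects


def can_access(project: DataProject, username: Optional[str]) -> bool:
    # Step 1: public projects need no authentication.
    if project.visibility is Visibility.PUBLIC:
        return True
    # Steps 2 and 3 both require authentication details.
    if username is None:
        return False
    # Step 2: any authenticated user may view the project.
    if project.visibility is Visibility.AUTHENTICATED:
        return True
    # Step 3: direct members of a private project are authorized.
    if username in project.members:
        return True
    # Step 4: otherwise fall back to the referenced groups.
    return any(username in group.members for group in project.groups)
```

The function returns a plain boolean; distinguishing "not authenticated" from "not authorized" for error reporting is left out of the sketch.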

Permissions Applied to an Analysis

When executing a workflow, a user should be provided with the opportunity to choose which other user accounts will have read privileges on the analysis outputs. The analysis creator should be able to choose from:

  1. Individual user accounts (by specifying an e-mail address or some other unique locator for a user account),
  2. Groups of users on the local system.

If the analysis creator selects an individual user account, then the user account should be given read permissions on the analysis only.

If the analysis creator selects a group of users (by selecting a project on the local system), then the selected data project should be stored as a reference.

The system should take the following steps when a user attempts to access an analysis or analysis sub-resource:

  1. Check to see if the analysis is publicly available without authentication. If the analysis is publicly available without authentication, access is permitted. If the analysis is not publicly available without authentication, proceed to step 2.
  2. Check to see if the analysis is publicly available with authentication. If it is, access is permitted when the user has provided authentication details and denied when the user has not. If the analysis is privately available, proceed to step 3.
  3. The user must have provided authentication details to proceed. Check to see if the user has direct read access to the analysis. If so, the user is authorized to view the analysis and access is permitted. If not, proceed to step 4.
  4. Iterate over the collection of group references attached to the analysis. If the user is a member of any of those groups, access is permitted. If the user is not a member of any referenced group, access is denied.
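
A sketch of the analysis check follows, assuming (as described above) that a group is selected by choosing a project, so the analysis stores references to data projects and consults their current membership. All names are hypothetical, and `None` stands for a request without authentication details.

```python
from dataclasses import dataclass, field
from enum import Enum


class Visibility(Enum):
    PUBLIC = 1         # readable without authentication
    AUTHENTICATED = 2  # readable by any authenticated user
    PRIVATE = 3        # readable only by listed users and groups


@dataclass
class DataProject:
    name: str
    members: set = field(default_factory=set)


@dataclass
class Analysis:
    name: str
    visibility: Visibility = Visibility.PRIVATE
    readers: set = field(default_factory=set)         # direct read access
    project_refs: list = field(default_factory=list)  # groups, stored as project references


def can_read_analysis(analysis, username=None):
    if analysis.visibility is Visibility.PUBLIC:         # step 1
        return True
    if username is None:                                 # steps 2 and 3 need authentication
        return False
    if analysis.visibility is Visibility.AUTHENTICATED:  # step 2
        return True
    if username in analysis.readers:                     # step 3
        return True
    # Step 4: membership is read from each referenced project at check time.
    return any(username in project.members for project in analysis.project_refs)
```

Note that this check grants read access to the analysis only; it says nothing about whether the same user can read the input data, which is the concern raised in the Analysis section.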

Decision

This document is a proposal for implementation of two major components required by the IRIDA platform. The document will be distributed as a formal request for comments to the participants of the IRIDA platform and will change over time, including during development of the platform and related data models.