Storage and processing of data

  • Naming convention: Use short, meaningful file and folder names for easy identification.
  • Folder structure: Organize the data logically in a clear, uniform folder structure.
  • Data formats: Choose open, documented, standard formats recommended by the subject community to ensure long-term readability.
  • Metadata: Capture sufficient standardized metadata to enable interpretability and reproducibility.
  • Checksums: Calculate checksums (e.g. MD5 or SHA256) for files to verify data integrity.
  • Versioning: Use a versioning system (e.g. Git) to track changes to data and metadata.
  • Storage location: Choose a long-term archiving system that ensures permanent data integrity and access.
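
As an illustration of a uniform folder structure, the following minimal Python sketch creates the same set of subfolders for every project. The folder names in SUBFOLDERS and the function name create_project_tree are hypothetical examples, not a prescribed layout; adapt them to the conventions of your group.

```python
from pathlib import Path

# Hypothetical layout (an assumption for illustration only);
# replace these names with the conventions used in your group.
SUBFOLDERS = ("01_raw", "02_processed", "03_analysis", "04_metadata")


def create_project_tree(root, project_name):
    """Create the same subfolders for a project and return the project path."""
    project_dir = Path(root) / project_name
    for name in SUBFOLDERS:
        # exist_ok=True makes the call safe to repeat on an existing project
        (project_dir / name).mkdir(parents=True, exist_ok=True)
    return project_dir
```

Calling create_project_tree with a root folder and a short, meaningful project name creates any missing subfolders and leaves existing ones untouched, so every project starts from the same layout.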

Create a checksum (Windows)

1. Press Windows+R to open the Run window.

2. Enter cmd and click OK.

3. The Command Prompt window opens.

4. Execute the following command:

certutil -hashfile C:\path\my_file.tif SHA256

Info: A path with spaces must be enclosed in quotation marks.
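
The same SHA256 value can also be computed independently of the operating system, for example to compare a file against a checksum that was recorded earlier. The following is a minimal sketch using Python's standard hashlib module; the function names sha256_of_file and matches_checksum are chosen for illustration only.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 checksum of a file as a lowercase hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large image files need not fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file with a recorded checksum, ignoring case and spaces."""
    normalized = expected_hex.replace(" ", "").strip().lower()
    return sha256_of_file(path) == normalized
```

If matches_checksum returns False for a file whose checksum was recorded at an earlier time, the file content has changed since then and should be restored from a verified copy.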

Format for storage and long-term archiving

Biological research generates a wide range of data from instruments such as microscopes, sequencers and mass spectrometers. Each of these instruments often uses its own proprietary data format, which can frequently be read only with special software. This makes data exchange, further processing, and long-term use or archiving of the data extremely difficult.

In such cases, we recommend additionally converting the data into a common standard format. Standard formats are characterized by publicly accessible specifications, manufacturer independence and broad software compatibility.

Standard formats for microscopy data

When storing and archiving microscopy data, please avoid proprietary manufacturer formats as well as JPEG and PNG. Instead, use one of the following common data formats:

OME-TIFF: Recommended if OME-XML metadata is to be saved directly in the image file.

  • Advantages: Standardized metadata structure (OME-XML), optimized for large image data sets, supports multidimensional data, widely used.
  • Disadvantage: Not optimized for access via S3 or comparable object storage.

OME-Zarr: Object-based format for large data sets and cloud-based workflows.

  • Advantages: Combines OME metadata with efficient storage for fast access, whether local or online ("in the cloud"); scalable; optimized for S3 (Amazon Simple Storage Service) and comparable object storage; supports parallel access, large data sets and metadata.
  • Disadvantage: Not ideal for very small data sets. Splitting the data into many small files can cause problems on conventional hard disks and file systems.
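
To see why the number of files can become a problem, it helps to estimate how many chunks a data set is split into. The following minimal sketch assumes that each chunk is stored as one file or object (as in a plain directory or object store layout without sharding) and looks at a single resolution level only; the image dimensions and chunk sizes are hypothetical.

```python
import math


def count_chunks(shape, chunk_shape):
    """Number of chunks needed to cover an array of `shape`."""
    if len(shape) != len(chunk_shape):
        raise ValueError("shape and chunk_shape need the same number of axes")
    # Each axis needs ceil(size / chunk) chunks; partial edge chunks count too
    return math.prod(
        math.ceil(size / chunk) for size, chunk in zip(shape, chunk_shape)
    )


# Hypothetical image with axes (T, C, Z, Y, X)
shape = (50, 3, 100, 2048, 2048)
chunk_shape = (1, 1, 1, 512, 512)
print(count_chunks(shape, chunk_shape))  # prints 240000
```

Larger chunks reduce the number of files but increase the amount of data that has to be read for a small region, so the chunk size is a trade-off between the two.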

HDF5 (Hierarchical Data Format version 5): Suitable for very large data sets and multidimensional data.

  • Advantages: Efficient storage of large amounts of data, supports metadata, flexible.
  • Disadvantage: Not optimized for access via S3 or comparable object storage.

Please use the infrastructure provided by the IT of the School of Biology/Chemistry for storing and archiving digital research data.

Alternatively or additionally, please use only recognized national or international (subject-specific) repositories or archives.

Research data and records must be retained for at least ten years, counted from publication of the research data, from publication of the research results, or from completion of the respective research activity. Deviations may result from legal or contractual regulations, from requirements of third-party funding bodies, or from internal guidelines.
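
The minimum retention date can be worked out with a few lines of Python. This is only a sketch of the ten-year rule stated above: the function name earliest_deletion_date is hypothetical, the choice of reference date is up to you, and legal, contractual or funder requirements may demand a longer period.

```python
from datetime import date

RETENTION_YEARS = 10  # minimum stated above; other rules may require longer


def earliest_deletion_date(reference, years=RETENTION_YEARS):
    """Return the date `years` after `reference` (e.g. the publication date)."""
    try:
        return reference.replace(year=reference.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; use 1 March
        return reference.replace(year=reference.year + years, month=3, day=1)


# Hypothetical publication date
print(earliest_deletion_date(date(2025, 6, 15)))  # prints 2035-06-15
```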

Further open offers

Software-supported processing and analysis make it possible to evaluate research data. This process also requires comprehensive data management, including metadata, versioning and secure archiving, to ensure that the research data remain reproducible and usable.

Tips:
  • For image data, use a direct connection between Fiji/ImageJ and OMERO so that your results are linked to the images automatically.
  • Use tools such as Jupyter Notebook or the Macro Recorder in Fiji, which not only automate your analysis workflow but also document it reproducibly, and link them to your data.
Artistic depiction of a scenario in which a researcher shares their computational workflow with others in the Jupyter environment, taking advantage of the Binder project.
© Juliette Taka and Nicolas M. Thiéry. Publishing reproducible logbooks explainer comic strip. Zenodo. DOI: 10.5281/zenodo.4421040 (2018)