HDF5 Data Model, File Format and Library – HDF5 1.6
This document defines HDF5, a data model, file format and I/O library designed for storing, exchanging, managing and archiving complex data including scientific, engineering, and remote sensing data.
The HDF TWG, with concurrence from the SPG, recommends that NASA endorse HDF5 as a recommended standard for NASA Earth Science Data Systems.
ESDS-RFC-007 Technical Working Group Final Report
The ESDS-RFC-007 Technical Working Group (TWG) has conducted a review of ESDS-RFC-007, Hierarchical Data Format version 5 (HDF5) with the following conclusion:
That the Standards Process Group should endorse ESDS-RFC-007 (HDF5) as a Recommended Standard.
The TWG bases its recommendation on an analysis of the following factors in a NASA context:
Strengths: HDF and HDF-EOS data formats, software libraries, and application programming interfaces (APIs) have been widely used for NASA earth observation mission data for many years. The latest version of HDF, HDF5, is the current or planned data format for missions including OCO and NPOESS, totaling many tens of terabytes of data. Users cite many strengths, including:
Widespread planned use for NASA Earth science data.
Data users read only the data that they need, not the whole file. Data producers can put images, tables, multidimensional arrays, etc. into the same file.
Users do not need to be concerned with the platform in which the data are produced.
Its small set of primary structures, i.e. groups and datasets, makes the file design simple.
Ample metadata can be added to the file, groups, and datasets, making the file self-describing.
Data files can be compressed internally using different schemes, which improves data storage and usage.
The ability to store data compactly, yet allow it to be read on any platform.
Source code for writing and reading data in the format is widely and publicly available.
Supported by many third-party applications such as IDL and Matlab.
Support for a rich set of data types including composite and user-defined data types.
Support for extensions and profiles, including HDF-EOS5.
Weaknesses: HDF5 is undeniably complex and has a significant learning curve. However, users also applaud the quality of documentation and help-desk support available. Third-party tools with HDF5 support, such as IDL and Matlab, also help hide complexity from users. Users have expressed concern about the availability of long-term support for HDF5 and related tools, but this concern is somewhat alleviated by the availability of the source code.
Applicability: HDF5 is used for data archive and distribution. The strengths cited above, together with the availability of analysis tools, make the format suitable for data analysis as well. The new netCDF 4.0 will include the capability to use HDF5 as the data storage layer for the netCDF API, with the addition of many new features available in HDF5 such as user-defined types, multiple unlimited dimensions, and per-variable data compression. This merger of the two formats will further extend the HDF5 user community.
Limitations: A major limitation of HDF5 is the loss of backward compatibility with HDF4 and earlier versions. Also, unlike less complex formats, users cannot read HDF5 files directly without using the HDF5 software library. Of greater concern are recent postings on a mailing list discussing use of netCDF and HDF5 in high-performance computing applications with thousands of processors using parallel I/O, which warn of the danger of file corruption during parallel I/O if a client dies at a particular time. The HDF Group is aware of this problem and is addressing it.
Overall, HDF5 is a widely used data format with a well-defined specification that provides a standard way of storing and working with science data. The ESDS-RFC-007 TWG thus recommends its endorsement by the SPG as an Earth Science Data Systems Standard.
The TWG conducted three reviews: a Technical Review designed to determine the technical validity of the specification within a NASA context, a Usability Survey designed to identify any usability issues or concerns, and an Operational Suitability Review designed to determine the operational readiness of the specification within a NASA context.
The review of the HDF5 specification was completed over a period of approximately 15 months. Based on the responses received from three sets of survey questions and from additional research the TWG concludes that the HDF5 specification demonstrates sufficient operational readiness to be endorsed by the SPG.
ESDS-RFC-007 proposes the Hierarchical Data Format version 5 (HDF5) as an ESDS Standard. Its functionality is described in the excerpt below:
HDF5 consists of three major components: (1) a general-purpose data model, (2) a file format, and (3) an I/O library.
The HDF5 data model provides structures and operations to allow creation, storage, and access to almost any kind of scientific data structure or collection of structures. In addition to the HDF5 file object, the data model includes two primary objects (datasets and groups), a number of supporting objects (e.g., attributes and datatypes), and metadata describing how HDF5 files and objects are to be organized and accessed.
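The relationship between the two primary objects and their attributes can be made concrete with a small sketch. The Python below is a toy, in-memory illustration of the data model only: the class and method names (`Group`, `Dataset`, `create_group`, `create_dataset`, `resolve`) are hypothetical and are not the HDF5 library API, and nothing here reads or writes the HDF5 file format.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Dict, Tuple, Union


@dataclass
class Dataset:
    """A multidimensional array plus the metadata needed to interpret it."""
    shape: Tuple[int, ...]
    dtype: str
    data: Any
    attrs: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Group:
    """A container of named members, each a Dataset or another Group."""
    attrs: Dict[str, Any] = field(default_factory=dict)
    members: Dict[str, Union["Group", Dataset]] = field(default_factory=dict)

    def create_group(self, name: str) -> "Group":
        group = Group()
        self.members[name] = group
        return group

    def create_dataset(self, name: str, shape, dtype: str, data,
                       attrs=None) -> Dataset:
        dataset = Dataset(tuple(shape), dtype, data, dict(attrs or {}))
        self.members[name] = dataset
        return dataset

    def resolve(self, path: str) -> Union["Group", Dataset]:
        """Follow a '/'-separated path starting from this group."""
        node: Union[Group, Dataset] = self
        for part in path.strip("/").split("/"):
            if not part:
                continue  # an empty part means the path was just "/"
            if not isinstance(node, Group):
                raise KeyError(f"{part!r} has no parent group in {path!r}")
            node = node.members[part]
        return node


# Hypothetical example content: one group holding one annotated dataset.
root = Group(attrs={"title": "example file"})
swath = root.create_group("swath")
swath.create_dataset(
    "temperature",
    shape=(2, 3),
    dtype="float32",
    data=[[271.5, 272.0, 272.4], [270.9, 271.3, 271.8]],
    attrs={"units": "K"},
)

found = root.resolve("/swath/temperature")
print(found.shape, found.attrs["units"])
```

The sketch keeps attributes on both groups and datasets because the text above describes attributes as supporting objects that annotate the primary ones; a real file adds datatypes, storage layout, and the other metadata that the model also covers.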
The HDF5 file format describes how HDF5 data structures are represented in storage, in memory, or on other media. Because HDF5 is designed for managing large data objects and complex heterogeneous collections easily and efficiently, the HDF5 format allows for alternate representations of many objects. The format is self-describing in the sense that the structures of HDF5 objects are described within the file. The HDF5 I/O library implements the data model in a number of programming languages, including C, Fortran, C++, and Java. These APIs are designed for flexibility: they give applications full access to available HDF5 storage structures and provide features for tuning applications for particular platforms, storage requirements, or I/O access patterns.
Complementing these three technical components are the approaches of the HDF5 project to intellectual property and community standards.
The HDF5 library, which is owned by the University of Illinois, is open source, and the HDF5 copyright allows it to be used at no cost by all applications, including commercial applications. The HDF5 project and its sponsors work closely with vendors and non-commercial application developers to enable their products to support HDF5 effectively and to make sure that HDF5 meets the demands of these products for quality and performance.
As for community standards, the HDF5 project and its sponsors dedicate significant resources to developing, supporting, and enforcing standard uses of HDF5. Adherence to standards makes it possible to share data easily, and to build and share tools for accessing and analyzing data stored in HDF5. Standardization activities include establishing conventions for the use of HDF5 for particular applications. For example, HDF-EOS defines a data model built for earth science data, and the HDF-EOS API implements that data model. As another example, the HDF5 project defines standard ways to store raster images, tables, and other complex objects in HDF5, and provides high-level APIs to encourage adherence to these standards.
The RFC, a copy of the specification, and a reference manual can be downloaded here: http://earthdata.nasa.gov/our-community/esdswg/standards-process-spg/rfc... .
The specification, software library and additional information can also be downloaded here: http://hdf.ncsa.uiuc.edu/HDF5/ .