HDF-EOS5 Data Model, File Format and Library
HDF-EOS is a software library designed to support NASA Earth Observing System (EOS) science data. HDF is the Hierarchical Data Format developed by the National Center for Supercomputing Applications. Specific data structures which are containers for science data are: Grid, Point, Zonal Average and Swath. These data structures are constructed from standard HDF data objects, using EOS conventions, through the use of a software library. A key feature of HDF-EOS is a standard prescription for associating geolocation data with science data through internal structural metadata. The relationship between geolocation and science data is transparent to the end-user. Instrument- and data-type-independent services, such as subsetting by geolocation, can be applied to files across a wide variety of data products through the same library interface. The library is extensible and new data structures can be added. This document describes a proposed standard for HDF-EOS5 Grid and Swath structures, which is based on the HDF5 data model and file format, provided by the HDF Group. The HDF Group was part of the National Center for Supercomputing Applications (NCSA) until July 2006, at which time it began full operations as a non-profit 501(c)(3) company.
ESDS-RFC-008 Technical Working Group Final Report
The ESDS-RFC-008 Technical Working Group (TWG) has conducted a review of ESDS-RFC-008, Hierarchical Data Format for EOS version 5 (HDF-EOS5) with the following conclusion:
That the Standards Process Group should endorse ESDS-RFC-008 (describing the HDF-EOS5 API and data model) as a Recommended Standard.
Both HDF-EOS and HDF-EOS5 data formats, software libraries and application programming interfaces (APIs), are already widely used in NASA Earth Science Data Systems. While the APIs for these two related data formats are nearly identical, HDF-EOS5 is built on the more feature-rich HDF5 format, which is also the basis for the new netCDF 4. Therefore, we recommend that any new HDF-EOS data sets be implemented in HDF-EOS5. The TWG bases its recommendation on an analysis of the following factors in a NASA context:
Strengths: HDF and HDF-EOS have been widely used for NASA Earth observation mission data for many years. The latest version of HDF-EOS, HDF-EOS5, is the data format for four instruments on NASA's Aura satellite. Users cite many strengths, including:
Widespread use of HDF-EOS formats for NASA Earth science data. Reviewers cite tens of terabytes of data in HDF-EOS5, with thousands of users.
HDF-EOS5 inherits the benefits of HDF5, including open-source software support, internal compression, portability, support for structural data, self-describing file metadata, enhanced performance over HDF4/HDF-EOS2, and XML support. To these, HDF-EOS5 adds full support of Earth science data types. All these factors make it a flexible data format which can be easily mapped to complex Earth science data.
Reviewers note that using the HDF-EOS API is much easier than using HDF5 directly.
The HDF-EOS library enforces adherence to a specific HDF profile. By using the HDF-EOS library, developers create files which have a specific format. Also, as the developers on Aura discovered, adherence to an even more stringent set of specifications can lead to even more conformity and allow for easier data sharing.
The HDF-EOS API allows users to easily migrate from HDF4-based files to HDF5-based files. This migration would be much more difficult without the HDF-EOS API hiding most of the HDF4 to HDF5 API changes.
HDF-EOS5 takes full advantage of the HDF5 library and file format. It can handle huge volumes of data very efficiently, in both current and emerging computational environments, without any changes to HDF-EOS5 applications.
Source code for writing and reading data in the format is publicly available.
HDF-EOS5 data files are also readable by the HDF5 library and by tools which support HDF5. Several reviewers mentioned that they use IDL very effectively with data in HDF-EOS5.
Weaknesses: HDF-EOS5 is undeniably complex and has a steep learning curve. Users have also expressed concern about the availability of long-term support for HDF-EOS5 and related tools, but this concern is somewhat alleviated by the availability of the source code.
One challenge is that the HDF-EOS5 package actually consists of multiple libraries (HDF-EOS5, HDF5 and the SDP Toolkit) maintained by different organizations. When one encounters a problem or has a question, it is not always clear which organization needs to be contacted.
While HDF-EOS5 provides a valuable profile of HDF5, it still allows data to be stored in non-standard ways.
One reviewer cites problems encountered with files that have been SZIP compressed. This is not specific to HDF5/HDF-EOS5, however; it also applies to HDF/HDF-EOS with SZIP compression.
Another reviewer identified a problem with the earlier version of HDF-EOS, which occurs when a file contains two or more grids and the grids each contain identically named fields. This file structure is supported by the HDF-EOS interface, but users of tools which in turn use the basic HDF4 interface are not able to distinguish between the fields. It is not clear whether this remains a problem with HDF-EOS5.
Applicability: HDF-EOS5 is used for archive and distribution of Earth Science data. The strengths cited above, together with the availability of analysis tools, make the format suitable for data analysis as well. As a notable example of the use of HDF-EOS5 in NASA Earth Science Data Systems, the instrument teams from the Aura satellite jointly developed an HDF-EOS5 profile for their datasets, thus facilitating data sharing from four coincident instruments. Coordinated development and use of specific HDF-EOS profiles should be strongly encouraged.
Limitations: Reviewers note that development of HDF-EOS5 necessarily lags behind its parent HDF5 format. Users may be affected when a new feature is added to HDF5 which is not readily supported through the current HDF-EOS5 interface. As HDF5 continues to be actively developed, it is important that HDF-EOS5 be maintained just as actively. Further, the level of technical support available to new users of HDF-EOS5 has dropped, which may limit its adoption by new data providers. Other limitations noted by users include:
HDF-EOS5 is not supported by many third-party applications such as IDL and Matlab. However, HDF-EOS5 data can be read with the HDF5 interfaces that are more frequently supported.
HDF5 allows parallel I/O while HDF-EOS5 does not.
Suggestions for enhancements to address current limitations include:
Both forward and backward compatibility are important. In particular, tools built with new releases must be capable of reading data files written with older versions. No improvement can compensate for orphaned or lost data. The HDF-EOS5 API is very similar to that of the HDF4-based HDF-EOS2, but not identical. NASA should consider making the HDF-EOS5 library fully backward compatible.
A set of Quality Assurance (QA) tools should be developed that analyze a target dataset to verify that it is a lexically and syntactically correct HDF-EOS5 formatted dataset. The QA tools should be both distributed as open source and made available as a web-based service.
Overall, HDF-EOS5 is a widely used data format that provides a standard way of storing and working with science data. The ESDS-RFC-008 TWG thus recommends its endorsement by the SPG as an Earth Science Data Systems Standard.
The TWG conducted three reviews: a Technical Review designed to determine the technical validity of the specification within a NASA context, a Usability Survey designed to identify any usability issues or concerns, and an Operational Suitability Review designed to determine the operational readiness of the specification within a NASA context.
The review of the HDF-EOS5 specification was completed over a period of approximately 18 months. Based on the responses received from three sets of survey questions and on additional research, the TWG concludes that the HDF-EOS5 specification demonstrates sufficient operational readiness to be endorsed by the SPG.
ESDS-RFC-008, available at http://earthdata.nasa.gov/our-community/esdswg/standards-process-spg/rfc..., proposes the Hierarchical Data Format for EOS version 5 (HDF-EOS5) as a NASA ESDS Recommended Standard. Its functionality is described in the excerpt below:
HDF-EOS is a software library designed to support NASA Earth Observing System (EOS) science data. HDF is the Hierarchical Data Format developed by the National Center for Supercomputing Applications. Specific data structures which are containers for science data are: Grid, Point, Zonal Average and Swath. These data structures are constructed from standard HDF data objects, using EOS conventions, through the use of a software library. A key feature of HDF-EOS is a standard prescription for associating geolocation data with science data through internal structural metadata. The relationship between geolocation and science data is transparent to the end-user. Instrument and data type independent services, such as subsetting by geolocation, can be applied to files across a wide variety of data products through the same library interface. The library is extensible and new data structures can be added. This document describes a proposed standard for HDF-EOS5 Grid and Swath structures, which is based on the HDF5 data model and file format, provided by the HDF Group. The HDF Group was part of the National Center for Supercomputing Applications (NCSA) until July 2006, at which time it began full operations as a non-profit 501(c)(3) company.
HDF5 files consist of a directory and a collection of data objects. Every data object has a directory entry, containing a pointer to the data object location, and information defining the datatype. Much more information about HDF5 can be found in the NCSA documentation (HDF5 API Specification Reference Manual, http://hdf.ncsa.uiuc.edu/HDF5/doc/RM_H5Front.html). Many of the NCSA-defined datatypes map well to EOS datatypes. Examples include raster images, multi-dimensional arrays, and text blocks. There are other EOS datatypes, however, that do not map directly to NCSA datatypes, particularly in the case of geolocated datatypes. Examples include projected grids, satellite swaths, and field campaign or point data. Therefore, some additions to conventional HDF5 datatypes were required to fully support these datatypes.
To bridge the gap between the needs of EOS data products and the capabilities of HDF, new EOS-specific datatypes (Point, Swath, and Grid) were defined within the HDF framework. Each of these new datatypes was constructed using conventions for combining standard HDF datatypes and is supported by an Application Programming Interface (API) which aids the data product user or producer in the application of the conventions. The APIs allow data products to be created and manipulated in ways appropriate to each datatype, without users needing to know about or manipulate the underlying HDF objects.
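As an illustration of what "conventions for combining standard HDF datatypes" can look like in practice, the sketch below builds the HDF5 object paths under which Swath and Grid fields are conventionally stored. The group names used here (/HDFEOS/SWATHS, /HDFEOS/GRIDS, "Data Fields", "Geolocation Fields") follow the commonly documented HDF-EOS5 file layout rather than anything stated in this report, and the helper name field_path and the object names are hypothetical.

```python
# Sketch only: the group names follow the commonly documented HDF-EOS5
# layout; field_path and the object names used below are hypothetical.

_ROOT = {"swath": "/HDFEOS/SWATHS", "grid": "/HDFEOS/GRIDS"}


def field_path(kind, obj_name, field, geolocation=False):
    """Return the HDF5 path where a Swath or Grid field is conventionally stored."""
    if kind not in _ROOT:
        raise ValueError(f"unsupported structure kind: {kind!r}")
    if geolocation and kind != "swath":
        # Assumption of this sketch: only swaths have a separate
        # "Geolocation Fields" group; grids are located by their projection.
        raise ValueError("only swaths carry a 'Geolocation Fields' group")
    group = "Geolocation Fields" if geolocation else "Data Fields"
    return f"{_ROOT[kind]}/{obj_name}/{group}/{field}"


print(field_path("swath", "ExampleSwath", "Latitude", geolocation=True))
print(field_path("grid", "ExampleGrid", "Temperature"))
```

A path of this form is what a generic HDF5 reader would be given to reach the same field without going through the HDF-EOS5 API.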
Together, these APIs make up the HDF-EOS library. The Point interface is designed to support data that has associated geolocation information, but is not organized in any well-defined spatial or temporal way. The Swath interface is tailored to support time-ordered data such as satellite swaths (which consist of a time-ordered series of scanlines), or profilers (which consist of a time-ordered series of profiles). The Grid interface is designed to support data that has been stored in a rectilinear array based on a well-defined and explicitly supported projection. Profile data is Swath-like data without geo-referencing information attached.
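The idea behind geolocation-based services on swath data can be sketched without the library itself. The example below is a plain-Python illustration and not the HDF-EOS5 API: it assumes a swath in which each scanline carries a single latitude/longitude pair (real swaths typically geolocate individual pixels), and the names Scanline and subset_by_box are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Scanline:
    time: float          # seconds since some epoch; swaths are time-ordered
    lat: float           # geolocation field (degrees north)
    lon: float           # geolocation field (degrees east)
    values: List[float]  # science data for this scanline


def subset_by_box(swath, lat_min, lat_max, lon_min, lon_max):
    """Keep the scanlines whose geolocation falls inside the box.

    Order is preserved, so the result is still time-ordered. Boxes that
    cross the antimeridian are not handled in this sketch.
    """
    return [
        s for s in swath
        if lat_min <= s.lat <= lat_max and lon_min <= s.lon <= lon_max
    ]


swath = [
    Scanline(0.0, 10.0, 100.0, [1.0, 2.0]),
    Scanline(1.0, 20.0, 101.0, [3.0, 4.0]),
    Scanline(2.0, 30.0, 102.0, [5.0, 6.0]),
]
inside = subset_by_box(swath, 15.0, 35.0, 100.5, 102.5)
print([s.time for s in inside])  # [1.0, 2.0]
```

Because the science values travel with their geolocation in each record, the same selection logic applies regardless of which instrument produced the data, which is the point of tying geolocation to science data in a standard way.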
The original HDF-EOS library was constructed beginning in 1995, using the version of HDF available at the time, HDF4. The HDF-EOS version was called HDF-EOS2, the version number being a historical artifact. In 2001, a completely new version of HDF was introduced, HDF5. This library was based on a different data model (HDF5 for HDF4 Users: a short guide, National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign, December 3, 2002, http://www.hdfgroup.uiuc.edu/papers/papers/h4toh5/HDF5forHDF4Users.pdf) and had an interface which was very different from that of HDF4. HDF-EOS was upgraded to support HDF5 and is called HDF-EOS5. This new version of HDF-EOS supports the same data model as HDF-EOS2 and maintains the HDF-EOS2 interface to the maximum extent possible. Besides the three data types mentioned above (Grid, Swath, and Point), HDF-EOS5 also supports the Zonal Average data type, which is basically a swath-like datatype without geolocation mapping. At the present time, most EOS data products, several petabytes' worth (1 petabyte = 10^15 bytes), are produced and stored in HDF-EOS2. A growing volume of data is being created in HDF-EOS5, and both libraries are supported by NASA. Production of EOS data will continue as long as instruments continue to operate.
The software library, documentation, tools and additional information can be downloaded from http://www.hdfeos.org/.