Data Product Development Guide for Data Producers
Version 2.0, July 10, 2024
Status of this Memo
This memo provides information to the NASA Earth Science Data Systems (ESDS) community. This memo describes a “Suggested Practice” and does not define any new NASA Earth Science Data and Information System (ESDIS) Standards. Distribution of this memo is unlimited.
Change Explanation
Many changes have been implemented in this revision to Version 1.1. New material, including Figure 3, has been added regarding data products from airborne and field campaign investigations. Section 3.3 on cloud-optimized formats and services is new. Section 8 has been expanded to include Earthdata Pub. Appendices D and E have been revised to include mappings of the attributes to the Unified Metadata Model (UMM) profiles, and expanded to include the Findability, Accessibility, Interoperability, and Reusability (FAIR) sub principles.
Copyright Notice
This is a work of the U.S. Government and is not subject to copyright protection in the United States. Foreign copyrights may apply.
Suggested Citation
Ramapriyan H.K., P.J.T. Leonard, E.M. Armstrong, S.J.S. Khalsa, D.K. Smith, L.F. Iredell, D.M. Wright, G.J. Huffman, and T.R. Walker. Data Product Development Guide (DPDG) for Data Producers version 2.0. NASA Earth Science Data and Information System Standards Office, July 2024. https://doi.org/10.5067/DOC/ESCO/RFC-041VERSION2
Abstract
This Data Product Development Guide (DPDG) for Data Producers was prepared for NASA's Earth Observing System Data and Information System (EOSDIS) by the DPDG Working Group, one of the Earth Science Data System Working Groups (ESDSWGs), to aid in the development of NASA Earth Science data products.
The DPDG is intended for those who develop Earth Science data products and are collectively referred to as “data producers.” This guide is primarily for producers of Earth Science data products derived from remote sensing, in situ, and model data that are to be archived at an EOSDIS Distributed Active Archive Center (DAAC). However, producers of other Earth Science data products will also find useful guidance.
Table of Contents
1. Introduction
2. Data Product Design Process
3. Selecting a Data Product Format
4. Metadata
5. Data Compression, Chunking, and Packing
6. Tools for Data Product Testing
7. Data Product Digital Object Identifiers
8. Product Delivery and Publication
9. References
Appendix A. Abbreviations and Acronyms
Appendix B. Glossary
Appendix C. Product Testing with Data Tools
Appendix D. Important Global Attributes
Appendix E. Important Variable-Level Attributes
1 Introduction
NASA’s Earth Observing System Data and Information System (EOSDIS) is a major capability in the Earth Science Data Systems (ESDS) Program [1]. EOSDIS Science Operations (i.e., data production, archive, distribution, and user services), which are managed by the Earth Science Data and Information System (ESDIS) Project [2], are performed within a system of interconnected Science Investigator-led Processing Systems (SIPSs) and discipline-specific data centers called Distributed Active Archive Centers (DAACs).
This Data Product Development Guide (DPDG) for Data Producers was prepared for EOSDIS by the Earth Science Data System Working Groups (ESDSWGs) [3] under the supervision of the ESDIS Project to aid in the development of NASA Earth Science data products. This version is a major update to the DPDG V1.1 [4] and includes material relevant to data products from airborne and field campaign investigations, with new examples and edits to several sentences throughout the document as well as the addition of a new Figure 3. Additional information about airborne and field campaign data delivery and management can be found at [5]. Section 3.3 on cloud-optimized formats and services is new. Section 8 has been expanded to include Earthdata Pub. Appendices D and E have been revised to include mappings of the attributes to the Unified Metadata Model (UMM) profiles, and expanded to include the Findability, Accessibility, Interoperability, and Reusability (FAIR) sub principles.
The DPDG is intended for those who develop Earth Science data products and are collectively referred to as “data producers” (see Appendix B). This guide is primarily for producers of Earth Science data products derived from remote sensing, in situ, and model data that are to be archived at an EOSDIS DAAC. However, producers of other Earth Science data products will also find useful guidance.
There is an abundance of documents (e.g., regarding standards, conventions, best practices, and data formats) to direct developers in all aspects of designing and implementing data products. Moreover, some DAACs have developed guides for particular data producers and specific scientific communities [6] [7] [8] [9] [10] [11]. The DPDG aims to compile the most applicable parts of existing guides into one document that logically outlines the typical development process for Earth Science data products. Emphasis has been given to standards and best practices formally endorsed by the ESDIS Standards Coordination Office (ESCO) [12], findings from ESDSWGs, and recommendations from DAACs and experienced data producers. Ultimately, the DPDG provides developers with guidelines for how to make data products that best serve end-user communities—the primary beneficiaries of data product development. The DPDG also guides the developers to ensure that the data products are designed to be Findable, Accessible, Interoperable, and Reusable (FAIR) [13], and adhere to NASA’s open science objectives and information policies [14] [15].
The data products are assumed to be archived at a DAAC, and it is vital that data producers work closely with the DAACs to which their products are assigned to obtain details not covered in this document. This document indicates in the respective sections where such close communications between the data producers and their assigned DAACs are needed. Examples of areas requiring such communications are: product design, understanding user requirements, selecting a data format, product naming, metadata requirements, testing and receiving user feedback, optimizing product formats for use in the cloud, selection of keywords for the products to facilitate user searches, version numbering, obtaining Digital Object Identifiers (DOIs), and product delivery and publication schedules.
The ESDIS Project and the DAACs are actively engaged in migrating data products and services to the cloud, and, for this reason, we include guidance regarding cloud-optimized formats and services (Section 3.3). It is stressed that data producers should work with their assigned DAAC early in the lifecycle of product architecture and implementation when optimizing their data for cloud distribution and computing.
The organization of the rest of this document is illustrated in Figure 1, which shows the various steps in data product development and delivery. The numbers in parentheses in the figure indicate the sections where the individual steps are discussed. Sections 9, 10, and 11, not shown in the figure, cover a bibliography, authors’ addresses, and a list of contributing authors and editors, respectively. Finally, five appendices provide a list of abbreviations and acronyms, a glossary, details of selected tools useful for testing, as well as important global and variable-level attributes that should be included in the product metadata.
2 Data Product Design Process
For this guide, a data product is defined as a set of data files that can have multiple variables (a.k.a. parameters), which compose a logically meaningful group of related data [16]. This concept is equivalent to a data collection in the Common Metadata Repository (CMR) [17], and is known colloquially as a dataset (see Appendix B for further explanation of these terms).
Based on the Earth Observing System (EOS) heritage [18], standard Earth Science data products have the following characteristics. They:
- have a peer-reviewed algorithm
- have wide research and application use
- are routinely produced over spatially and temporally extensive subsets of data
- are available whenever and wherever the appropriate input data are available
Since the beginning of the EOS Program, an extensive set of data products satisfying the above criteria for standard products has been produced and archived. Experience from those products and observations of related issues regarding their interoperability and metadata content have led to several recommendations from the ESDSWGs. Following those recommendations, which have been incorporated into the following sections, will result in a good data product design. Users of standard data products (as well as users of other data products that do not necessarily meet all the standard product criteria) will benefit if the guidance provided below is followed in the product design.
2.1 Requirements: Determining User Community Needs
At the beginning of the design process, the key elements are to 1) identify the expected user communities and understand their needs regarding data formats, data structures, and metadata in addition to what is required for data search and discovery; and 2) identify the needs for data tools and services. Developers can acquire this information by surveying the scientific literature, browsing predecessor data, holding and attending data applications workshops, and working with the DAAC that will archive the data product. While it is important to do this early in the product design process, it is equally important to remember that as Earth Science evolves to serve novel uses, user communities can change significantly. This is especially true in the case of long-term projects that might involve several reprocessing cycles, where such changes would need to be accommodated in later versions of the products. It is possible that a given data product would be archived at more than one DAAC (e.g., in the case of airborne campaigns), and the DAACs may have different format and/or metadata requirements. In such cases, some negotiations may be needed between the data producers and the assigned DAACs to understand the reason for the differences and see if the DAACs can accept a common approach.
Once user communities are identified, the key questions to be considered are the following:
- How might these data be used by the identified communities?
- Which tools and services will the community need to use the data?
- Are there common workflows applied to the data (e.g., subset >> quality filter >> re-grid)?
- What are the prevalent data formats, data and metadata standards, and data structures used by the community?
- What common keywords would assist in data discovery?
- What constraints are faced by the user community (e.g., timeliness, network bandwidth, processing capacity, and disk storage)?
- What temporal and spatial resolutions, coordinate systems, and map projections are commonly used in or required by the community?
- What information on data provenance (i.e., data product history and lineage) and quality will the users need for their purposes?
- What other information do the users need to assess the suitability of the data?
- How should the data product be designed to make it useful to unexpected user communities (e.g., make the files self-describing)?
- What associated knowledge should be preserved [19] in addition to the data for the benefit of future users, when data producers are no longer available for consultation?
- How should the associated knowledge be preserved (implementation guidance is available at [20])?
- How does a shift to a cloud environment affect the preceding questions? (See Section 3).
When the data collection is sufficiently complex in size, required services, or management, then an Interface Control Document (ICD) [21] may be necessary. An ICD provides the definition, management, and control of interfaces that are crucial to successfully transferring the data and metadata from the producer to the DAAC. The assigned DAAC can assess whether an ICD is needed for any given data collection.
2.2 Design: What Constitutes a Data Product Design
A data product design should address the following:
- Data format and associated conventions (Section 3)
- Identification and structure of constituent variables1 (Section 3)
- Metadata (Section 4)
- Data chunking, internal compression, and packing (Section 5)
2.3 Implementation: Creating Sample Data Files
The data producer should create sample data files to support the evaluation of the data product design. This needs to start early in the dataset development phase for the testing and evaluation of sample products and to provide timely feedback. Ideally, this process should begin when a single sample file has been developed and well before the full data collection has been completed. The more realistic the sample data, the more helpful it is for evaluating the usefulness and appropriateness of the design. Even a product populated with random science data values can be suitable for checking usability; however, sample data coordinates should be accurately populated, as these are critical to the use of tools on a data product. The software library supporting the selected standard data format and various tools (Section 6) can be used to quickly create sample data files.
1 In this document we use the term “variable.” Other terms, such as “parameter,” are commonly used as synonyms. See Appendix B for a full explanation.
2.4 Testing: Evaluating Sample Data Products
Testing of data products should be performed with the tools and services that user communities are expected to employ (see Section 6). Typically, testing will identify structural issues of the data, such as missing or mis-identified variables and attributes, and ordering of dimensions within the variables. The data products should also be tested using compliance checkers (Section 6.2). Data producers should consult with their assigned DAACs regarding the available compliance checkers and how to use these tools. Ideally, an iterative approach should be followed: supply the data product to the assigned DAAC and to selected representatives of the expected user communities, integrate feedback, re-test the product, and solicit additional feedback.
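At their core, compliance checkers verify that required attributes and structures are present. The following toy Python sketch illustrates that idea over a plain dictionary standing in for a file's variables; the required attribute list here is illustrative, not the complete CF requirement set:

```python
# Minimal illustration of what a metadata compliance check does:
# verify that each variable carries a few expected CF attributes.
# The required list below is illustrative, not the full CF rules.
REQUIRED_VAR_ATTRS = ("units", "long_name", "_FillValue")

def check_variables(variables):
    """variables: dict mapping variable name -> dict of its attributes.
    Returns a list of human-readable problem reports."""
    problems = []
    for name, attrs in variables.items():
        for attr in REQUIRED_VAR_ATTRS:
            if attr not in attrs:
                problems.append(f"{name}: missing attribute '{attr}'")
    return problems

# Example: one compliant variable, one with a missing attribute
sample = {
    "brightness_temperature": {"units": "K",
                               "long_name": "Brightness Temperature",
                               "_FillValue": -9999.0},
    "quality_flag": {"long_name": "Quality Flag", "_FillValue": 255},
}
print(check_variables(sample))  # reports quality_flag missing 'units'
```

Real checkers (see Section 6.2) apply far richer rule sets, but the iterative workflow is the same: run the check, fix the reported problems, and re-test.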
2.5 Review: Independent Evaluation of the Data Product
Soliciting external, independent evaluations can improve the quality and usability of a data product. The following are recommendations for establishing and conducting such reviews throughout the product life cycle:
- Obtain reviews from the distributors of the data product (i.e., the relevant DAAC or DAACs)
- To maintain objectivity, the evaluators should not be directly involved in the development of the data product. Ideally, evaluators should include representatives of the expected user communities and other subject matter experts
- Perform multiple reviews during the development process to receive guidance well before the release of the product. Any resulting modifications should be documented
- Feedback should be sought on the format, content, and quality of the data. Four aspects of quality are defined in [22], namely scientific quality, product quality, stewardship quality, and service quality. At this stage, the scientific and product quality are of primary concern. The format, content, and quality should also help improve the applicability of the data product for specific uses
- Responses to reviewers’ comments should be documented and provided to the reviewers, so that any misunderstanding of the comments can be clarified
- Reviews should address various aspects of the data product, including capabilities for search, access, exploration, analysis, interoperability, and usage
- Reviews should verify that the data files and their variables have suitable names and enough supporting information to facilitate their understanding
- While the data product user guides published by the DAAC hosting the data (which are provided when users retrieve a data product or file) are essential support for the usage of a data product, data producers should strive to make their products as self-describing as possible (i.e., with embedded metadata that describe the format and the meaning of the data)
It is also helpful to have a mechanism for the user community to provide feedback on the usability and quality of the data after the products are released. Such feedback should be gathered by the assigned DAAC and be conveyed as needed to the data producers to help improve the data products.
3 Selecting a Data Product Format
Selecting an encoding (i.e., format) for a data product involves weighing the advantages and disadvantages of the applicable formats. Below are important items for DAACs and data producers to consider together when selecting a format:
- Is the format open (i.e., has an openly published specification)?
- Has a format already been specified in an existing document (e.g., an Interface Control Document)?
- Does the format provide for the widest possible use of the data product, including potentially new applications and research beyond the original intentions?
- Is the format widely used in the target user community for similar data or similar data analysis workflows?
- Was the format used for past long-term observations or models and can it therefore provide for or enable more efficient data processing and interoperability [23] with those observations or models?
- Would it serve the user community better if the data were written in the same legacy format as was used for past or long-term observations for consistency, or in a format compatible with data from other agencies, e.g., NOAA or the USGS, to increase interoperability?
- Does the format enable efficient data analysis workflows on both global and local scales? Local (as opposed to global scale) applications will often require frequent subsetting, reprojection, or reformatting of the data for combination or intercomparison with in situ point observations and physical models
- Does the format enable efficient data analysis workflows over long time periods as well as relatively near real-time?
- Does the format support “self-describing” files (see Appendix B), meaning the files contain sufficient metadata that describe the contents of the file?
- Has the ordering of dimensions been considered for facilitating readability by end users [24] (Rec. 2.10)? The ordering of dimensions can have a significant impact on the ease with which the data can be read
- Does the format provide for efficient use of storage space (e.g., internal compression of data arrays, see Section 5), keeping file sizes practical for intended users, while minimizing the need to access multiple external files?
- Is the format supported by popular third-party applications (see Section 6) and programming environments, which could expand the user base and promote further development of tools and services?
- Is the version of the format supported by user community tools? The version of the format can be important, because new versions may not be readable by the libraries and tools currently in use by the user community
- Are resources available to support a long-term commitment to the format? This includes people who can develop and maintain libraries, tools, and documentation for working with the format
- Have optimizations, novel distribution mechanisms (e.g., streaming), and usage patterns (e.g., data downloading, in-cloud use) been considered (see Section 3.3)? This may depend on the capabilities of the host facilities and the envisioned use cases
Data producers should consider the interplay of data format standards with other ESCO [12] approved standards for metadata, data search, and access [25]. Producers should consult the web pages associated with these standards to understand the strengths, weaknesses, applicability, and limitations (SWAL) of each data format in those contexts. These web pages also contain information on deprecated standards and practices.
Note that although the selected data format may not satisfy all users of a product, given sufficient user demand for a particular output format, EOSDIS can usually provide reformatting services to cater to a variety of preferences in the user community. The data producers are advised to consult with their assigned DAACs to determine what reformatting services are available or can be implemented to satisfy the expected user community demand.
3.1 Recommended Formats
While several acceptable formats are listed by ESCO [25], the highly preferred format for EOSDIS data products is network Common Data Form Version 4 (netCDF-4) [26], which uses the Hierarchical Data Format Version 5 (HDF5) [27] data storage model. Although files in netCDF-4 can in theory be written via the HDF5 library API, inadvertent use of certain HDF5 features can render files unreadable by the rich ecosystem of netCDF tools. Therefore, we recommend use of the netCDF-4 library API. The ESCO review of the netCDF-4/HDF5 File Format [26] lists the SWAL of the format, which the reader may find useful.
Some of the advantages of using netCDF-4 are:
- Files are “self-describing,” meaning they allow for inclusion of metadata that describe the contents of the file (see Appendix B)
- Supports many data storage structures, including multidimensional arrays and raster images, and naturally accommodates hierarchical groupings of variables
- Includes access to useful HDF5 features, and can be used in concert with HDF5 tools such as HDFView [28]
- Supports internal data compression (see Section 5)
- Is supported by several important programming languages and computing platforms used in Earth Science
- Provides efficient input/output on high-performance computing platforms
- Improves interoperability with popular NASA transformation and analysis tools and services
- Readily allows for conversion to a cloud-optimized format
Also, a well-established standard called the Climate and Forecast (CF) Metadata Conventions (hereafter, CF Conventions; see Appendices B, D, and E) [29] specifies a set of metadata that provide a definitive description of what the data in each variable represent, and the spatial and temporal properties of the data. The CF Conventions were developed for netCDF; thus, they are sometimes referred to together as “netCDF/CF.”
If starting a new project with a user community that does not have a preferred format, then netCDF-4 (or a cloud-optimized version of netCDF-4) should be used. Data producers using legacy formats should work towards migrating to a more contemporary format.
3.1.1 NetCDF-4
A netCDF-4 file can include global attributes, dimensions, groups2, group attributes, variables, and variable-level attributes. The global attributes provide general information regarding the file (e.g., author information, data product version, date-time range, product DOI). Dimensions can represent: 1) spatio-temporal quantities (e.g., latitude, longitude, time); 2) other physical quantities (e.g., atmospheric pressure, wavelength); and 3) instrumental quantities (e.g., along track, cross track, waveband).
A netCDF variable is an object that usually contains an array of numerical data. The structure of a variable is specified by its dimensions. The dimensions included at a given level in the hierarchy can be applied to variables at or below that level. Variable-level attributes provide specific information for each variable (e.g., coordinates, units, valid range). Groups can be created to contain variables with some commonality (e.g., ancillary data, geolocation data, science data). Group attributes apply to everything in a group. Global attributes are attached to the root group.
Note that “dimensions” and “coordinates” are two terms in netCDF/CF that should not be confused with each other. For example, in a Level 2 (L2) swath file, the dimensions can be “along_track” and “cross_track,” while the corresponding coordinates can be “latitude” and “longitude” (illustrated in Figure 2). The coordinates for each variable are specified via the CF coordinates3 attribute.
Another example is in situ data collected during an airborne campaign (illustrated in Figure 3), where the dimensions are “observation number” and “trajectory number”, and the coordinates are “time”, “longitude”, “latitude”, and “altitude”.
2 Recently, the CF Conventions have been updated to include rules for files with group hierarchies [117].
3 Words or phrases in this document that are colored purple indicate officially recognized CF attribute names or best practice names.
Data structures are containers for geolocation and science data. Guidance regarding swath structures in netCDF formats is provided in Encoding of Swath Data in the CF Convention [30]. The ESDSWG Dataset Interoperability Working Group (DIWG) has provided guidance regarding grid structures in netCDF-4 in [24] (Rec. 2.8-2.12) and [31] (Rec. 3.6). NOAA has provided a set of netCDF format templates for various types of data products [32] although these should be considered as informative, not normative. Data producers can obtain guidance and samples from their DAAC. Earthdata Search [33] can also be used to acquire a variety of data in different formats and structures.
3.1.2 GeoTIFF
The Georeferenced Tagged Image File Format (GeoTIFF, *.tif) is a georeferenced raster image format that uses the public domain Tagged Image File Format (TIFF) [34], and is used extensively in the Geographic Information System (GIS) [35] and Open Geospatial Consortium (OGC) communities [36]. Although the types of metadata that can be added to GeoTIFF files are much more limited than with netCDF-4 and HDF5, the OGC GeoTIFF Standards Working Group is planning to work on reference system metadata in the near term. Both data producers and users find this file format easy to visualize and analyze, and so it has many uses in Earth Science. OGC GeoTIFF Standard, Version 1.1 is an EOSDIS recommended format [37].
Recently, a cloud-optimized profile for GeoTIFF (called COG) has been developed to make retrieval of GeoTIFF data from Web Object Storage (object storage accessible through https) more efficient [38] [39]. Also, OGC has published a standard for COG [40] [41]. See [37] for a discussion of SWAL of GeoTIFF. DIWG has recommended that data producers include only one variable per GeoTIFF file [23].
3.2 Recognized Formats
In some cases, where the dominant user communities for a given data product have historically used other formats, it may be more appropriate to continue to use those formats instead of the formats recommended above. If such formats are not already on ESCO’s list of approved data formats, they can be submitted to ESCO for review and approval following the Request for Comments instructions [42].
3.2.1 Text Formats
NASA DAACs archive numerous datasets that are in “plain text”, typically encoded using the American Standard Code for Information Interchange (ASCII). Unicode, which is a superset of ASCII, is used to represent a much wider range of characters, including those used for languages other than English. The list of ESCO’s approved standards using ASCII includes: International Consortium for Atmospheric Research on Transport and Transformation (ICARTT), NASA Aerogeophysics ASCII File Format Convention, SeaBASS Data File Format, and YAML Encoding ASCII Format for GRACE/GRACE-FO Mission Data. Recommendations on the use of ASCII formats are presented in the ASCII File Format Guidelines for Earth Science Data [43].
It should be noted that the comma-separated value (CSV) format is also a plain text format, as are Unidata’s Common Data Language (CDL), JavaScript Object Notation (JSON), and markup languages such as HTML, XML, and KML. The main advantage of encoding data in ASCII is that the contents are human readable, searchable, and editable. The main disadvantage is that file size, if not compressed, will be much larger than if the equivalent data were stored in a well-structured binary format such as netCDF-4, HDF5, or GeoTIFF. Another disadvantage of ASCII is that print-read consistency can be lost. Different programs reading a file could convert numerical values expressed in ASCII to slightly different floating-point numbers. This could complicate certain aspects of software engineering such as unit tests.
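The print-read consistency issue can be demonstrated in a few lines of Python: writing a value with a fixed number of decimal digits and reading it back does not, in general, reproduce the original floating-point number, whereas an encoding with enough significant digits (such as Python's repr) round-trips exactly:

```python
x = 1.0 / 3.0

# Write with 6 decimal digits, as many ASCII writers do, then read back.
written = "%.6f" % x          # "0.333333"
assert float(written) != x    # precision was lost in the round trip

# Python's repr emits enough digits to round-trip a double exactly.
assert float(repr(x)) == x
```

Two programs that format values with different digit counts will therefore reconstruct slightly different floating-point numbers from the same ASCII file, which is why this complicates unit testing.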
3.2.2 ICARTT
The ICARTT format [44] arose from a consensus established across the atmospheric chemistry community for visualization, exchange, and storage of aircraft instrument observations. The format is text-based and composed of a metadata section (e.g., data source, uncertainties, contact information, and brief overview of measurement technique) and a data section. Although it was primarily designed for airborne data, the format is also used for non-airborne field campaigns.
The simplicity of the ICARTT format allows files to be created and read with a single subprogram for multiple types of collection instruments and can assure interoperability between diverse user communities. Since typical ICARTT files are relatively small, the inefficiency of ASCII for storage is not a serious concern. See [44] for a discussion of SWAL of the ICARTT format.
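The simplicity mentioned above can be seen in a sketch of a reader: the first header line of an ICARTT file begins with the number of header lines (followed by the file format index), after which the data records can be parsed directly. The following is a simplified illustration against a synthetic file, not a complete ICARTT implementation:

```python
def read_icartt_data(lines):
    """Simplified reader for an ICARTT-style text file.

    The first line begins with the number of header lines (the count
    includes that line itself); everything after the header is
    comma-separated numeric data.  This sketch skips all header
    interpretation and returns the data records as lists of floats.
    """
    n_header = int(lines[0].split(",")[0])
    data = []
    for line in lines[n_header:]:
        if line.strip():
            data.append([float(v) for v in line.split(",")])
    return data

# Synthetic 4-line header followed by two data records
sample = [
    "4, 1001",
    "PI name, affiliation",
    "Mission name",
    "Time_start, CO_ppbv",
    "0, 101.3",
    "60, 102.7",
]
rows = read_icartt_data(sample)
```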
3.2.3 Vector Data and Shapefiles
The OGC GeoPackage is a platform-independent and standards-based data format for geographic information systems implemented as an SQLite database container (*.gpkg) [45]. It can store vector features, tile matrix sets of imagery and raster maps at various scales, and extensions in a single file.
OGC has standardized the Keyhole Markup Language (KML, *.kml) format that was created by Keyhole, Inc. and is based on the eXtensible Markup Language (XML) [46]. The format delivers browse-level data (e.g., images) and small amounts of vector data (e.g., sensor paths, region boundaries, point locations), but it is voluminous for storing large data arrays. KML supports only geographic projection (i.e., evenly spaced longitude and latitude values), which can limit its usability. The format combines cartography with data geometry in a single file, which allows users flexibility to encode data and metadata in several different ways. However, this is a disadvantage to tool development and limits the ability of KML to serve as a long-term data format for archive. OpenGIS KML is an approved standard for use in EOSDIS. As noted in the recommendation, KML is primarily suited as a publishing format for the delivery of end-user visualization experiences. There are significant limitations to KML as a format for the delivery of data as an interchange format [47].
A Shapefile is a vector format for storing geometric location and attribute information of geographic features, and requires a minimum of three files to operate: the main file that stores the feature geometry (*.shp), the index file that stores the index of the feature geometry (*.shx), and the dBASE table that stores the attribute information of features (*.dbf) [48] [49]. Geographic features can be represented by points, lines, or polygons (areas). Geometries also support third and fourth dimensions as Z and M coordinates, for elevation and measure, respectively. Each of the component files is limited to 2 gigabytes. Shapefiles have several limitations that impact storage of scientific data. “For example, they cannot store null values, they round up numbers, they have poor support for Unicode character strings, they do not allow field names longer than 10 characters, and they cannot store both a date and time in a field” [50]. Additional limitations are listed in the cited article.
GeoJSON [51] is a format for encoding a variety of geographic features like Point, LineString, and Polygon. It is based on JSON and uses several types of JSON objects to represent these features, their properties, and their spatial extents.
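Because GeoJSON is plain JSON, a minimal Feature — a single point with one property — can be built with nothing more than Python's standard json module (coordinates and property values here are invented):

```python
import json

# A minimal GeoJSON Feature; note the [longitude, latitude] axis
# order mandated by the GeoJSON specification (RFC 7946).
feature = {
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [-76.85, 38.99],  # lon, lat (illustrative values)
    },
    "properties": {"site": "example-station"},
}

encoded = json.dumps(feature)
decoded = json.loads(encoded)
assert decoded["geometry"]["type"] == "Point"
```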
3.2.4 HDF5
HDF5 is a widely-used data format designed to store and organize large amounts of data. NetCDF-4 (Section 3.1.1) and HDF-EOS5 (Section 3.2.5) are both built on HDF5. NetCDF-4 is the recommended format for new Earth Science data products as this format is generally more easily utilized by existing tools and services. However, as detailed in Section 3.3.3, there are emerging strategies for enhancing HDF5 for improved S3 read access that represent important usage and performance considerations for Earth Science data distributed via the NASA Earthdata Cloud.
3.2.5 HDF-EOS5
HDF-EOS5 was a specially developed data format for the Earth Observing System based on HDF5, which has been widely used for NASA Earth Science data products and includes data structures specifically designed for Earth Science data.
HDF-EOS5 employs the HDF-EOS data model [52] [53], which remains valuable for developing Earth Science data products. The Science Data Production (SDP) Toolkit [52] and HDF-EOS5 library provide the API for creating HDF-EOS5 files that are compliant with the EOS data model.
In choosing between HDF-EOS5 and netCDF-4 with CF conventions, netCDF-4/CF is recommended over HDF-EOS5 due to the much larger set of tools supporting the format.
3.2.6 Legacy Formats
Legacy formats (e.g., netCDF-3, HDF4, HDF-EOS2, and ASCII) are those used in early EOS missions, though some missions continue to produce data products in these formats. Development of new data products or new versions of old products from early missions may continue to use the legacy format, but product developers are strongly encouraged to transition data to the netCDF-4 format for improved interoperability with data from recent missions. Legacy formats are recommended for use only in cases where the user community provides strong evidence that research will be hampered if the data formats are changed.
3.2.7 Other Formats
Some data products are provided by data producers in formats that are not endorsed by ESCO. These can include ASCII files with no header, simple binary files that are not self-describing, comma-separated value (CSV) files, proprietary instrument files, etc. Producers of such data are not necessarily NASA-funded, such as some participants in field campaigns; thus, they are under no obligation to conform to NASA’s format requirements, or they may lack adequate resources to do so.
There are other formats that are currently evolving in the community, stemming from developments in cloud computing, Big Data, and Analysis-Ready Data (ARD) [54] that are discussed in Section 3.3.
3.3 Cloud-Optimized Formats and Services
Following the ESDS Program’s strategic vision to develop and operate multiple components of NASA's EOSDIS in a commercial cloud environment, the ESDIS Project implemented the Earthdata Cloud architecture that went operational in July 2019 using Amazon Web Services (AWS) [55]. Key EOSDIS services, such as CMR and Earthdata Search, were deployed within it. Additionally, the DAACs are moving the data archives they manage into the cloud.
The AWS Simple Storage Service (S3) offers scalable solutions to data storage and on-demand/scalable cloud computing, but also presents new challenges for designing data access, data containers, and tracking data provenance. AWS S3 is a popular example of object-based cloud storage, but the general characteristics noted in this document are applicable for object-based cloud storage from other providers as well. Cloud (object) storage is typically accessed through HTTP “range-get” requests in contrast to traditional disk reads, and so packaging the data into optimal independent “chunks” (see Section 5) is important for optimizing access and computation.
Furthermore, the object store architecture allows data to be distributed across multiple physical devices, in contrast to local contiguous storage for traditional data archives, with the data content organization often described in byte location metadata (either internally or in external “sidecar” files). Thus, many cloud storage “formats” are better characterized as (data) content organization schemes (see Appendix B), defined as any means for enhancing the addressing and access of elements contained in a digital object in the cloud.
Cloud-“optimized” data containers or content organization schemes being developed to meet emerging cloud compute needs include Cloud-Optimized GeoTIFF (COG), Zarr (a for-the-cloud counterpart of HDF5 and netCDF-4, including NCZarr), and cloud-optimized point-cloud data formats (see also [56] for additional background). COG, Zarr, and cloud-optimized HDF5 and netCDF-4 (see Sections 3.3.1, 3.3.2, and 3.3.3, respectively) remain the preferred formats for raster data, while lidar and point-based irregular data are better suited to point cloud formats (see Section 3.3.4). These cloud storage optimizations, although described in well-defined specifications, are still advancing and growing in maturity with regard to their use and adaptations in cloud-based workflows, third-party software support, and web services (e.g., OPeNDAP, THREDDS, OGC WCPS). However, none of these formats requires in-cloud processing for scientific analysis; once the data have been downloaded, they work with local computer operating systems and libraries without issue. Analysis-Ready, Cloud-Optimized (ARCO) data, in which the cloud data have been prepared with complete self-describing metadata following a standard or best practice (including the necessary quality and provenance information) and with well-defined spatial and temporal coordinate systems and variables, offer significant advantages for reproducible science, computational optimization, and cost reduction.
Data producers should carefully optimize their data products for partial data reads (via HTTP or direct S3 access) to make them as cloud friendly as possible. This requires organizing the data into appropriate producer-defined chunk sizes to facilitate access. The best guidance thus far is that S3 reads are optimized in the 8-16 megabyte (MB) range [57], which presents a reasonable range of chunk sizes. The Pangeo Project [58] reported chunk sizes ranging from 10-200 MB when reading Zarr data stored in the cloud using Dask [59], and the desired chunking often depends on the likely access pattern (e.g., chunking in small Regions of Interest (ROIs) for long time-series data requests vs. chunking in larger ROI slices for large spatial requests over a smaller temporal range). At the other end of the spectrum, chunks that are too small, on the order of a few megabytes, typically impede read performance in the cloud. Data producers are advised to consult with their assigned DAAC regarding the specific approaches to their products, including the chunking implementation.
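The chunk-sizing arithmetic above is easy to check before writing any files. The following is an illustrative, standard-library-only sketch (not DAAC guidance); the (time, lat, lon) chunk shape is hypothetical, and a real product would set the chosen shape via the netCDF-4 or HDF5 chunking parameters at file-creation time.

```python
# Illustrative chunk-size arithmetic: check whether a proposed chunk shape
# for a float32 variable falls in the 8-16 MB range cited above for S3 reads.

BYTES_PER_FLOAT32 = 4

def chunk_size_mb(chunk_shape):
    """Uncompressed size of one chunk, in megabytes."""
    n = 1
    for dim in chunk_shape:
        n *= dim
    return n * BYTES_PER_FLOAT32 / 1e6

# Hypothetical (time, lat, lon) chunk for a gridded time-series product:
# 32 time steps of a 256 x 256 spatial tile.
proposed = (32, 256, 256)
size = chunk_size_mb(proposed)
print(f"{size:.1f} MB")  # 8.4 MB -> within the suggested 8-16 MB window
```

Shapes elongated along time favor time-series access; shapes elongated along the spatial dimensions favor map-like access, as the Pangeo example in the text suggests.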
3.3.1 Cloud-Optimized GeoTIFF
The COG data format builds on the established GeoTIFF format by adding features needed to optimize data use in a cloud-based environment [39] [40]. The primary addition is that internal tiling (i.e., chunking) for each layer is enabled. The tiling features enable data reads to access only the information of interest without reading the whole file. Since COG is compatible with the legacy GeoTIFF format it can be accessed using existing software (e.g., GIS software).
3.3.2 Zarr
Zarr is an emerging open-source format that emphasizes efficient storage of multidimensional array data in the cloud and fast parallel input/output (I/O) computations [60] [61]. Its data model supports compressed and chunked N-dimensional data arrays, inspired in part by the HDF5 and netCDF-4 data models. Its consolidated metadata can include a subset of the CF metadata conventions familiar to existing users of netCDF-4 and HDF5 files, allowing many useful time-series and transformation operations through third-party libraries such as xarray [62]. Zarr stores chunks of data as separate objects in cloud storage, with an external consolidated JSON metadata file containing the locations of all the data chunks. A Zarr software reader (e.g., using xarray in Python) needs only a single read of the consolidated metadata file (i.e., the sidecar file) to determine exactly where in the Zarr data store to locate data of interest, substantially reducing file I/O overhead and improving efficiency for parallel CPU access.
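The consolidated-metadata idea can be sketched with the standard library alone. The structure below is modeled loosely on Zarr v2's `.zmetadata` document but is simplified and hand-built for illustration; a real store would be created with the zarr library, and the array name and shapes are hypothetical.

```python
import json

# Toy sketch of Zarr's consolidated metadata: one JSON document describes
# every array, so a reader can work out which chunk objects exist (and which
# one holds a region of interest) without listing cloud storage at all.
consolidated = {
    "zarr_consolidated_format": 1,
    "metadata": {
        "temperature/.zarray": {
            "shape": [100, 720, 1440],   # (time, lat, lon)
            "chunks": [10, 180, 360],
            "dtype": "<f4",
        },
        "temperature/.zattrs": {"units": "K"},
    },
}

doc = json.loads(json.dumps(consolidated))  # the single "sidecar" read
meta = doc["metadata"]["temperature/.zarray"]
# Chunk grid: ceil(shape / chunks) per dimension -> the object keys that exist.
grid = [-(-s // c) for s, c in zip(meta["shape"], meta["chunks"])]
print(grid)  # [10, 4, 4]
```

After this one read, a request for a given time slice maps directly to a handful of chunk keys (e.g., `temperature/3.0.2`), which is what enables efficient parallel range-gets.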
3.3.3 NetCDF-4 and HDF5 in the Cloud
Much of NASA’s Earth Science data has historically been stored in netCDF-4 and HDF5 data files. Besides maintaining continuity with legacy data products, there are other important data life cycle reasons to continue using these formats, including data packaging, data integrity, and their self-describing characteristics. The challenge is how best to optimize the individual files for cloud storage and access. Here, data chunking plays a leading role, with general guidelines on this subject found in the introduction to Section 3.3. It has been demonstrated that the annotated Dataset Metadata Response (DMR++) [63] sidecar files, which are generated for many of NASA’s HDF5 files migrated to the Earthdata Cloud, can be translated into a JSON file with the key/value pairs that the Zarr library needs [64], making the HDF5 files directly readable as Zarr stores.
Further cloud optimization of HDF5 files, specifically, requires enhancing the internal HDF metadata structure via the “Paged Aggregation” feature at the time of file creation (or later via h5repack), so that the internal file metadata (i.e., not the global metadata) and data are organized into one or a few pages of specified size (usually on the order of mebibytes) to improve read I/O. The exact page size is important for parallel I/O operations in the cloud and for HDF libraries that can cache the pages, further improving performance.
NCZarr is an extension and mapping of the netCDF-enhanced data model to a variant of the Zarr storage model (see Section 3.3.2).
Additional discussion of cloud optimization of netCDF-4 and HDF5 files via data transformation services is provided in Section 3.3.5.
3.3.4 Point Cloud Formats
A point cloud is commonly defined as a 3D representation of the external surfaces of objects within some field of view, with each point having a set of X, Y, and Z coordinates. Point cloud data have traditionally been associated with lidar scanners, such as those on aircraft; in addition, in situ sensors such as those mounted on ocean gliders and airborne platforms can also be considered point cloud data sources. The key characteristic is that these instruments produce a large number of observations that are irregularly distributed and thus form “clouds” of points.
There are many emerging formats in this evolving genre [65]. Some noteworthy formats include Cloud-Optimized Point Cloud (COPC), Entwine Point Tiles (EPT), and Parquet. COPC builds on LAS/LAZ (LASer file format/LAS compressed file format), point cloud formats popular in the lidar community, and on specifications from EPT. EPT itself is an open-source content organization scheme and library for point cloud data that is completely lossless, uses an octree-based storage format, and stores its metadata in JSON. Parquet is a column-based data storage format suitable for tabular-style data (including point cloud and in situ data); its design lends itself to efficient queries and data access in the cloud. GeoParquet is an extension that adds interoperable geospatial types such as Point, Line, and Polygon to Parquet [66].
3.3.5 Additional Data Transformation and Access Services
For data analysis in the cloud, it is often preferred to optimize data for parallel I/O and multidimensional analysis. This is where Zarr excels, and a number of transformation services from netCDF-4 and HDF5 files to Zarr have emerged to support this need. Traditional file-level data access and subsetting via the OPeNDAP web service has also evolved to meet the needs of cloud storage.
Many of these tools enable Zarr-like parallel and chunked access capabilities to be applied onto traditional netCDF-4 and HDF5 files in AWS S3. While these services are not critical for producing data products, it is important for data producers to be aware of their use by Earth Science data consumers.
3.3.5.1 Harmony Services
The name “Harmony” refers to a set of evolving open-source, enterprise-level transformation services for data residing in the NASA Earthdata Cloud [67]. These services are accessed via a well-defined and open API, and include services for data conversion and subsetting.
Harmony-netcdf-to-zarr [68] is a service to transform netCDF-4 files to Zarr cloud storage on the fly. It aggregates individual input files into a single Zarr output that can be read using xarray calls in Python. As additional files become available, this service must be rerun to account for the new data.
Subsetting requests for trajectory (1D) and along-track/across-track data in netCDF and HDF files are executed using the Harmony L2-subsetter service, while geographically gridded Level 3 or 4 data use the Harmony OPeNDAP SubSetter service (HOSS). The Harmony-Geospatial Data Abstraction Library (GDAL)-adapter service supports reprojection.
3.3.5.2 Kerchunk
Kerchunk is a Python library to generate a “virtual” Zarr store from individual netCDF-4 and HDF5 files by creating an external metadata JSON sidecar file that contains all the locations to the individual input data chunks [69]. The “virtual” Zarr store can be read using xarray and the original netCDF-4 and HDF5 files remain unmodified in content and location. As Kerchunk leverages the fsspec library for storage backend access, it enables end users to more efficiently access parallel chunks from cloud-based S3, as well as other remote access such as data over Secure Shell (SSH) or Server Message Block (SMB).
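The reference JSON that Kerchunk emits (the fsspec "ReferenceFileSystem" format, version 1) can be sketched by hand with the standard library: each chunk key maps to a `[url, byte_offset, byte_length]` triple pointing into the untouched HDF5 file. The S3 URL, variable name, and byte ranges below are hypothetical, and a real sidecar would be generated by `kerchunk.hdf.SingleHdf5ToZarr` rather than written manually.

```python
import json

# Hand-built sketch of a Kerchunk-style reference document. The original
# netCDF-4/HDF5 file is never modified: the "virtual" Zarr store is just
# this mapping from chunk keys to byte ranges inside it.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),  # inline metadata
        # chunk key -> [url, byte_offset, byte_length] (hypothetical values)
        "precip/0.0": ["s3://example-bucket/granule.h5", 4096, 1048576],
        "precip/0.1": ["s3://example-bucket/granule.h5", 1052672, 1048576],
    },
}

url, offset, length = refs["refs"]["precip/0.1"]
print(url, offset, length)
```

A reader backed by such a document can issue independent range-gets for `precip/0.0` and `precip/0.1` in parallel, which is the mechanism behind the efficiency gains described above.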
3.3.5.3 OPeNDAP in the Cloud
The OPeNDAP Hyrax server is optimized to address the contents of netCDF-4 and HDF5 files stored in the cloud using information in the annotated DMR++ [63] sidecar file. The DMR++ file for a specific data file encodes chunk locations and byte offsets so access and parallel reads to specific parts of the file is optimized.
4 Metadata
4.1 Overview
Metadata are information about data. Metadata could be included in a data file and/or could be external to the data file. In the latter case there should be a clear connection between the metadata and the data file. As with the other aspects of data product development, it is helpful to consider the purpose of metadata in the context of how users will interact with the data and how metadata are associated with (i.e., structurally linked to) the data.
Metadata are essential for data management: they describe where and how data are produced, stored, and retrieved. Metadata are also essential for data search/discovery and interpretation, including facilitating the users’ understanding of data quality. A data producer has a responsibility to provide adequate metadata describing the data product at both the product level and the file level. The DAAC that archives the data product is responsible for maintaining the product-level metadata, known as collection metadata in the CMR [17]. The CMR is a high-performance, high-quality, continuously evolving metadata system that catalogs all data and service metadata records for EOSDIS. These metadata records are registered, modified, discovered, and accessed through programmatic interfaces leveraging standard protocols and APIs.
The ESDIS Project employs tools to interact with CMR based on the Unified Metadata Model (UMM). Profiles have been defined within the UMM based on their function or content description, such as Collection, Service, Variable, or Tool as shown in Table 1.
Data product- and file-level metadata are stipulated by the policies of the DAAC that will host the data, subject to the somewhat minimal requirements that the CMR imposes. However, EOSDIS data are expected to exceed these requirements in order to render the data more searchable and usable.
Metadata that are rich in content enable the creation of the UMM-Var, UMM-Service, or UMM-Tool records listed in Table 1. These tools and services could include subsetting of the data by variable, time, or location, and/or might enable the creation of a time series, for example. Other metadata provide information about the provenance of the data, product quality, the names of scientists or teams that created the product and the funding sources that supported the creation of the products. Data producers should work with the DAACs to determine the best approach for their products.
Metadata can be created for and associated with a data product through several methods that are different from the metadata used by the UMM/CMR system. Software libraries, such as netCDF and HDF, make populating file-level metadata straightforward. Metadata can be assigned to any object within the file (i.e., variable-level attributes), or to the file as a whole (i.e., global attributes). As such, most of the recommendations that follow apply to netCDF/HDF files, but a producer of products in other formats should aspire to conform to the degree that those formats permit. A further discussion of file-level metadata is provided in Appendices D and E.
Global attributes are meant to apply to all information in the file, and can vary from file to file within a data product. However, to maximize the self-describing nature of a file, a data producer can also include product-level metadata (i.e., information that is identical for all files) within each file. File-level metadata should be embedded in the file itself if using self-describing formats like netCDF. The DAACs may require that the metadata be provided both embedded in files and as a separate metadata file. The assigned DAAC will ensure that the physically separate metadata are properly associated with the data file to which they refer. Data product files may not contain all the available data product metadata (e.g., everything in related texts such as the Algorithm Theoretical Basis Document [ATBD]). However, they must contain enough metadata to enable data search, discovery, and scientific analysis using tools capable of recognizing metadata standardized for interoperability, based on the recommendations and standards in this document and provided by the DAACs.
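The split between global and variable-level attributes can be sketched without any netCDF library, using plain dictionaries; with the netCDF4 library the same structure would be written via `Dataset.setncattr` and `Variable.setncattr`. All names and values below (product short name, variable, fill value) are illustrative.

```python
# Dependency-free sketch of the file-level metadata split described above:
# global attributes apply to the whole file, variable-level attributes to
# one variable. Names and values are hypothetical.
granule = {
    "global_attributes": {
        "ShortName": "EXAMPLE_L3_PRECIP",            # hypothetical product
        "product_version": "1.0",
        "time_coverage_start": "2024-07-01T00:00:00Z",
    },
    "variables": {
        "precip_rate": {
            "attributes": {"units": "mm/hr", "_FillValue": -9999.0},
            "data": [[0.0, 1.5], [2.25, -9999.0]],
        }
    },
}

# A tool that understands the attributes can mask fill values automatically.
fill = granule["variables"]["precip_rate"]["attributes"]["_FillValue"]
valid = [v for row in granule["variables"]["precip_rate"]["data"]
         for v in row if v != fill]
print(len(valid))  # 3 valid values after masking the fill value
```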
4.1.1 Data Product Search and Discovery
Data product users likely experience the impact of metadata for the first time during the search and discovery process—when they are searching for data products that meet their needs. As mentioned above, the metadata that support this process are typically ingested into the CMR by the DAACs.
These metadata are used either by the CMR search engine or by other search engines that harvest metadata (e.g., data.gov or Google).
Key success criteria for metadata during the discovery process include:
- Intelligible, descriptive data product names (Section 2)
- Precise temporal and spatial coverage (Sections 4.4 and 4.5)
- Accurate and complete list of applicable Global Change Master Directory (GCMD) Science Keywords [70]
- Concise but readable (including machine readable) description of the data
Given the expectation to find a specific product among thousands of data products in the archives (over 52,750 as of February 14, 2024, in the EOSDIS catalog, the CMR), it is crucial to use GCMD keywords [70], especially for the platform (e.g., Nimbus-7 or DC-8), instrument (e.g., TOMS or HIRAD), and science (e.g., OZONE) keywords. For airborne and field campaign data it is important to include the campaign or project short name or acronym (e.g., Delta-X, ABoVE, or EXPORTS).
Reference [71] provides links through which various categories of keywords can be downloaded in a variety of formats such as comma-separated values (CSV) or can be directly viewed with the GCMD Keyword Viewer [72]. When the keywords in the current list of GCMD are not directly applicable to a data product, the data producers are advised to follow the proper GCMD list update procedure [73] in consultation with their assigned DAACs. Data producers should work closely with their assigned DAACs in obtaining and selecting keywords for their products.
4.1.2 File Search and Retrieval
Once a user has identified a data product to pursue, the user typically needs only some of the files in the data product, not all of them. When metadata are standardized, data search engines, such as Earthdata Search for CMR [33], support the specification of spatial and temporal criteria (i.e., search filters). Therefore, it is best to precisely specify the spatial extent of a given file to limit “false positives” in search results. For example, a four-point polygon provides a more precise specification of spatial extent than a bounding box (Figure 4). Data producers should consult with their assigned DAACs regarding the methods the DAACs use to specify bounding regions before deciding whether a different approach is needed for their products.
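The difference between a polygon and its bounding box is easy to quantify. The following back-of-the-envelope sketch uses flat-Earth degree areas (not great-circle geometry) and a hypothetical tilted swath footprint: the bounding box covers several times the polygon's area, and every search touching that extra area is a potential false positive.

```python
# Compare the area of a four-point swath polygon with its bounding box
# (units: square degrees, planar approximation for illustration only).

def shoelace_area(points):
    """Area of a simple polygon given (x, y) vertices in order."""
    n = len(points)
    total = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# Hypothetical tilted swath footprint as (lon, lat) vertices.
swath = [(0, 0), (10, 10), (12, 8), (2, -2)]
lons = [p[0] for p in swath]
lats = [p[1] for p in swath]
bbox_area = (max(lons) - min(lons)) * (max(lats) - min(lats))

print(shoelace_area(swath), bbox_area)  # 40.0 vs 144 square degrees
```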
4.1.3 Data Usage
High-quality metadata are essential for data readability, not only for human users, but also for software, such as decision support systems or models. The aim should be for data producers to generate files that are usable without further adjustment by a data product user, a benchmark known as ARD (e.g., [54]), whose metadata typically include:
- Coordinate information: spatial and temporal coordinates in standard representations
- Interpretation aids: units [31] (Rec. 2 and 3.3), fill (missing) value identification [31] (Rec. 3.7), offset and scale factors [24] (Rec. 2.2, 2.5 and 2.6), data uncertainty
- Human-readable and scientifically meaningful/standard variable names (see Section 4.2.3)
In addition, data producers should be mindful of other types of potentially useful documentation to facilitate the understanding of data, including but not limited to provenance information (see Section 4.6.1), particularly identifiers of the data inputs and algorithm version, and pointers to documentation such as ATBDs, and data quality assessment(s). (Some of this information is sometimes found in data producers’ README files as well). For field campaigns, this may include instrument placement and operation documents, such as images of instrument placement on an aircraft. Such documentation should be produced as early as possible in the dataset production lifecycle.
Since the primary purpose of CMR is to support search and discovery, at present, not all the data usage information is ingested into CMR. However, as automated data system capabilities evolve to provide higher levels of service for new data products, we expect CMR to include and handle more data-usage information.
4.2 Naming Data Products, Files, and Variables
4.2.1 Data Products
The name of a data product is critical to its discovery in Earthdata Search and tools developed by the DAACs. The DAACs have internal rules and guidelines for naming data products, so selection of long and short names should be a joint effort between the data producer and the assigned DAAC. DAAC involvement in naming also helps in making the data products discoverable and unique within EOSDIS. Data products must be assigned both a long name, which is meant to be human readable (and comprehensible), and a short name, which aids precise searching by keyword. The data product long name and short name are considered global attributes in that they are not associated with any particular variable but are related to the data product as a whole.
4.2.1.1 Long Name
The data product’s long name (called LongName in CMR) is a name that is scientifically descriptive of the product. It should be as brief (but as complete) as possible, as it expands on the corresponding short name, which can sometimes be indecipherable. The attribute LongName is synonymous with the CF title attribute used in netCDF documentation (see Appendix D.1).
Data producers should seek a long name that will be understandable to the target user community and is also unique within EOSDIS. A reasonable data product name may already be in use (e.g., beginning with “MODIS/Aqua” or “MODIS/Terra”), and care should be taken to avoid naming conflicts by consultation with the assigned DAAC and other relevant data producers. Interoperability should also be considered when choosing names (see Section 4.2.2).
We provide the following recommendations regarding the formulation of data product long names:
- The data source, usually the acronym/abbreviation for the project responsible for producing the data, but can also include the instrument (e.g., HIRDLS, the High Resolution Dynamics Limb Sounder; AVIRIS, the Airborne Visible / Infrared Imaging Spectrometer), platform (satellite, aircraft, ship, etc.) (e.g., TRMM, the Tropical Rainfall Measuring Mission; P-3, a NASA aircraft), or program (e.g., MEaSUREs, Making Earth System Data Records for Use in Research Environments); include both the instrument and platform names to eliminate ambiguity (e.g., MODIS/Aqua; HIRAD/ER-2). For campaigns, it may be important to include the campaign name along with instrument and platform (e.g., OLYMPEX/HIWRAP/ER-2)
- Science content (e.g., Aerosol, Precipitation Rate, Total Column Water Vapor)
- The general spatial coverage type of the data (e.g., gridded, swath, orbit, point)
- The temporal coverage per file (e.g., instantaneous, 5-minute, orbital, daily, monthly)
- Processing level (e.g., L2 or Level 2) [74]
- Spatial resolution (e.g., 3 km; if there are multiple resolutions in the product, then the highest resolution is typically stated)
- Version number (optional; details should be resolved with the assigned DAAC)
Examples of data products that follow this naming convention are provided in Table 2. Note that not all the contents suggested in the above bullets are included in names shown in each of the examples.
Note: LongName in CMR is not to be confused with the CF long_name attribute for individual variables.
4.2.1.2 Short Name
EOSDIS developed the standard Earth Science Data Type (ESDT) [75] naming conventions to provide an efficient way to archive data products by name and for convenience in working with toolkits. The short name (called ShortName in CMR) is an abbreviated version of the product name. In short names, alphanumeric characters and the underscore (“_”) are the only acceptable characters; spaces and special characters are restricted to ensure compatibility with Earthdata Search and other search systems. The short name is included in the metadata as the global attribute ShortName and in the data product’s documentation. Data producers should contact the DAAC responsible for archiving the data product to check whether there are additional restrictions on short names, such as consistency across systems.
Table 2. Examples of data products named using some of the recommended naming conventions. Deviations from this format are sometimes necessary owing to the characteristics of specific data products.
| Short Name | Long Name | Comments |
| --- | --- | --- |
| OCO2_L2_Met | OCO-2 Level 2 meteorological parameters interpolated from global assimilation model for each sounding | Includes: data source (OCO-2), scientific content (meteorological parameters), spatial coverage (global), processing level (Level 2). Missing: spatial resolution, temporal coverage, and version number |
| MYD09GQ | MODIS/Aqua Near Real Time (NRT) Surface Reflectance Daily L2G Global 250m SIN Grid | Includes: data source (MODIS/Aqua), scientific content (surface reflectance), spatial coverage (global), temporal coverage (NRT), processing level (L2G), spatial resolution (250m). Missing: version number |
| MOD05_L2 | MODIS/Terra Total Precipitable Water Vapor 5-Min L2 Swath 1 km and 5 km – NRT | Includes: data source (MODIS/Terra), scientific content (Total Precipitable Water Vapor), spatial coverage (swath), temporal coverage (5-Min; NRT), processing level (L2), spatial resolution (1 km and 5 km). Missing: version number |
| GPM_PRL1KU | GPM DPR Ku-band Received Power L1B 1.5 hours 5 km | Includes: data source (GPM DPR Ku-band), scientific content (Received Power), temporal coverage (1.5 hours), processing level (L1B), spatial resolution (5 km). Missing: spatial coverage, version number |
| GLDAS_NOAH10_M | GLDAS Noah Land Surface Model L4 Monthly 1.0 x 1.0 degree | Includes: data source (GLDAS), scientific content (Noah Land Surface Model), temporal coverage (Monthly), processing level (L4), spatial resolution (1.0 x 1.0 degree). Missing: spatial coverage, version number |
| SWDB_L3M10 | SeaWiFS Deep Blue Aerosol Optical Depth and Angstrom Exponent Monthly Level 3 Data Gridded at 1.0 Degrees | Includes: data source (SeaWiFS), scientific content (Deep Blue Aerosol Optical Depth and Angstrom Exponent), temporal coverage (Monthly), processing level (Level 3), spatial resolution (Gridded at 1.0 Degrees). Missing: spatial coverage, version number |
| AIRX2RET | AIRS/Aqua L2 Standard Physical Retrieval (AIRS+AMSU) V006 (AIRX2RET) at GES DISC | Includes: data source (AIRS/Aqua; AIRS+AMSU), scientific content (Standard Physical Retrieval), processing level (L2), version number (V006). Missing: spatial coverage, temporal coverage, spatial resolution |
| M2I3NPASM | MERRA-2 inst3_3d_asm_Np: 3d, 3-Hourly, Instantaneous, Pressure-Level, Assimilation, Assimilated Meteorological Fields 0.625 x 0.5 degree V5.12.4 | Includes: data source (MERRA-2), scientific content (Pressure-Level, Assimilation, Assimilated Meteorological Fields), spatial coverage (3d), temporal coverage (3-Hourly, Instantaneous), spatial resolution (0.625 x 0.5 degree), version number (V5.12.4). Missing: processing level |
| ATLAS_VEG_PLOTS_1541 | Arctic Vegetation Plots ATLAS Project North Slope and Seward Peninsula, AK, 1998-2000 | Includes: data source (ATLAS), location (North Slope and Seward Peninsula, AK), and temporal coverage (1998-2000). Missing: processing level, spatial and temporal resolution, version number |
| CARVE_L1_FTS_SPECTRA_1426 | CARVE: L1 Spectral Radiance from Airborne FTS Alaska, 2012-2015 | Includes: campaign (CARVE), instrument (FTS), data product level (L1), region of study (Alaska), and temporal coverage (2012-2015). Missing: spatial and temporal resolution, version number |
| DISCOVERAQ_Texas_AircraftRemoteSensing_B200_GCAS_Data | DISCOVER-AQ Texas Deployment B-200 Aircraft Remotely Sensed GCAS Data | Includes: data source (campaign - DISCOVER-AQ, deployment - Texas, platform - B-200, and instrument - GCAS), region of study (Texas). Missing: spatial and temporal coverage and resolution, version number, processing level |
| AirMOSS_L2_Precipitation_1417 | AirMOSS: L2 Hourly Precipitation at AirMOSS Sites, 2011-2015 | Includes: campaign (AirMOSS), data level (L2), temporal resolution (hourly) and temporal coverage (2011-2015), variable (precipitation). Missing: data source (rain gauge), spatial coverage and resolution, region, version number |
| DeltaX_Sonar_Bathymetry_2085 | Delta-X: Sonar Bathymetry Survey of Channels, MRD, Louisiana, 2021 | Includes: campaign (Delta-X), data source (Sonar), region (Mississippi River Delta), variable (bathymetry), temporal coverage (2021). Missing: version number, processing level, temporal and spatial resolution |
4.2.2 Files
There is no universal file-naming convention for NASA Earth Science data products, apart from the DIWG recommendations regarding the components of file names provided in [31] (Rec. 3.8-3.11). However, file names should be unique and understandable to both humans and machines as well as contain information that is descriptive of the contents.
The date-time information in the file names should adhere to the following guidelines (detailed in [31], Rec. 3.11):
- Adopt the ISO 8601 standard [76] [77] for date-time information
- The start time should appear before the end time in the file name
- Date-time fields representing the temporal extent of a file’s data should appear before any other date-time field in the file name
- All date-time fields in the file name should have the same
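The guidelines above can be sketched with the standard library. The product short name, extension, and production-time field below are hypothetical; this is one plausible layout, not a prescribed convention, and the actual pattern should be agreed with the assigned DAAC.

```python
from datetime import datetime, timezone

# Sketch of a file name following the date-time guidelines: ISO 8601 "basic"
# date-times, start time before end time, and the temporal extent ahead of
# any other date-time field (here, a production timestamp prefixed with "c").
start = datetime(2024, 7, 1, 0, 0, 0, tzinfo=timezone.utc)
end = datetime(2024, 7, 1, 23, 59, 59, tzinfo=timezone.utc)
produced = datetime(2024, 7, 10, 12, 0, 0, tzinfo=timezone.utc)

fmt = "%Y%m%dT%H%M%SZ"  # ISO 8601 basic format, UTC
filename = (
    f"EXAMPLE_L3_PRECIP_{start.strftime(fmt)}_{end.strftime(fmt)}"
    f"_c{produced.strftime(fmt)}.nc"
)
print(filename)
# EXAMPLE_L3_PRECIP_20240701T000000Z_20240701T235959Z_c20240710T120000Z.nc
```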
4.2.3 Variables
A fundamental consideration for variable names is that care should be exercised in the use of special characters or spaces in the name to ensure that they are readable and interpretable by commonly used software. Also, to promote usability and human readability the names should be meaningful.
This may include community best-practice names including modified standard names such as the CF Standard Names [78]. An example of a community approach to construction of variable names, used for ICARTT files (see Section 3.2.1), is contained in the Atmospheric Composition Variable Standard Name Convention document [79].
To optimize discovery and usability, variable names should comply with community best-practice names and endorsed standard names via the variable-level CF long_name and standard_name attributes, respectively (see Appendix E).
4.3 Versions
Data products are uniquely identified by the combination of ShortName and VersionID within the CMR [17]. Data producers can also specify a product_version as an Attribute Convention for Data Discovery (ACDD) global attribute [80]. In most cases, the product_version and VersionID are identical, although there may be some exceptions to this (e.g., when selected files associated with a limited reprocessing of data have a different value for the product_version).
However, if reprocessed data files have significant differences in terms of science content, then these files should be organized into a separate data product with a different VersionID. Guidance for setting version numbers should be sought from the assigned DAAC.
The software version used to generate a data product is specified via the CMR attribute PGEVersion. In most cases, the product_version and the PGEVersion differ.
4.4 Representing Coordinates
Earth Science data files should be produced with complete information for all geospatial coordinates to help enable software application capabilities for data visualization, mapping, reprojection, and transformation. Encoding geolocation based on the CF Conventions maximizes the ability to use the data in multiple tools and services. Please note that coordinates should not be confused with dimensions - sometimes these two things are one and the same, but this is not always the case, as explained in Section 3.1.1 (see Figures 2 and 3).
4.4.1 Spatial Coordinates
Variables representing latitude and longitude must always explicitly include the CF units attribute, because there are no default values for the units of latitude and longitude. The recommended unit of latitude is degrees_north (valid range: -90 to 90 degrees) and unit of longitude is degrees_east (valid range: -180 to 180 degrees).
Consider the spatial accuracy required to represent the position of data in the file when choosing the latitude and longitude variable datatypes. Use double precision for latitude and longitude datatypes if meter-scale geolocation of the data is required.
To support the widest range of software tools while avoiding storage of redundant geospatial coordinate data, practice the following guidelines:
- Specify coordinate boundaries by adding the CF bounds attribute [24] (Rec. 3). For example, a producer can annotate the "latitude" coordinate with the CF bounds attribute with value "latitude_bnds." The "latitude_bnds" variable would be a multi-dimensional array of the intervals for each value of variable "latitude"
- Include horizontal attributes and, as necessary, vertical attributes. For example, a producer can include the CF attribute units: degrees_north for the latitude coordinate variable; degrees_east for the longitude coordinate variable; and "m" for a height coordinate variable
- Store all coordinate data for a single file in coordinate variables only. No coordinate data, or any part thereof, should be stored in attributes, or as variable and/or group names [31] (Rec. 5)
- The most applicable type of geospatial coordinates for the data should be provided with the data files. Geospatial coordinates can be included within the data file or provided via an external file. The former approach is preferred. In the latter case, machine-actionable linkages must be included in the data file. Also, the decision about providing any additional types of geospatial coordinates is left to the data producer
- Map projection and datum information can be specified using CF grid mapping attributes [31] (Rec. 3.6). Specify both the horizontal (geodetic) datum, which is typically WGS84, and the vertical datum, which may be either an ellipsoid such as WGS84, or a local datum which would yield orthometric heights
- The latitude and longitude coordinate variables can have the CF axis attribute attached with values Y and X, respectively. Grid products that do not explicitly include variables named latitude and longitude can also include the CF axis attribute for the horizontal coordinate variables
Note that different data user communities prefer different ordering of the latitude and longitude in the data files. Data producers should target the dominant user community for their products to decide upon the order, indicate clearly what the order is, and use self-describing file formats [23].
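To make the bounds guidance concrete, here is a minimal sketch in standard-library Python. Plain lists stand in for netCDF coordinate variables, and the grid spacing is invented for illustration: a regular 1-degree latitude coordinate and the "latitude_bnds" array that the CF bounds attribute would name:

```python
# Cell centers for a regular 1-degree latitude grid, pole to pole.
latitude = [-89.5 + i for i in range(180)]

# latitude_bnds: one [lower, upper] interval per value of "latitude",
# i.e., the multi-dimensional array named by the CF "bounds" attribute.
latitude_bnds = [[c - 0.5, c + 0.5] for c in latitude]

print(latitude_bnds[0], latitude_bnds[-1])  # [-90.0, -89.0] [89.0, 90.0]
```

In a real file, "latitude" would carry the attribute bounds = "latitude_bnds", and "latitude_bnds" would be written as a variable of shape (180, 2).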
4.4.2 Temporal Coordinate
The CF Conventions represent time as an integer or float, with the units attribute set to the time unit since an epochal date-time, represented as YYYY-MM-DDThh:mm:ss (e.g., "seconds since 1993-01-01T00:00:00Z"). Use Coordinated Universal Time (UTC) instead of local time unless there is a strong justification for not doing so.
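For instance, a time coordinate value under the units string "seconds since 1993-01-01T00:00:00Z" can be computed with nothing more than the standard library; the observation time below is arbitrary:

```python
from datetime import datetime, timezone

# Epoch taken from the units attribute: "seconds since 1993-01-01T00:00:00Z"
epoch = datetime(1993, 1, 1, tzinfo=timezone.utc)

# An arbitrary UTC observation time to encode as a time coordinate value
obs = datetime(2024, 7, 10, 12, 0, 0, tzinfo=timezone.utc)

time_value = (obs - epoch).total_seconds()
print(time_value)
```

Writing the epoch with an explicit UTC offset (the trailing "Z") keeps the arithmetic unambiguous regardless of the locale of the machine that produced the file.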
For gridded observations (e.g., Level 2G or Level 3), data can be aggregated using the time coordinate axis to record the different time steps. For example, this technique can be used to aggregate sub-daily observations (e.g., hourly, 4-hourly) into a single daily file.
When files contain only a single time slice, a time axis coordinate vector variable of size 1 should be included, so that the time information is easy to locate and the degenerate (i.e., one value of time for the entire file) time axis can improve performance when aggregating additional files over time (see [24], Rec. 2.9; [31], Rec. 3.4). The time coordinate variable can have the CF axis attribute attached with value T.
Just as for the latitude and longitude coordinate variables (Section 4.4.1), temporal boundaries for each value of the time coordinate variable can be specified by including a time bounds variable (e.g., time_bnds), which is named via the value of the CF bounds attribute attached to the time variable.
4.4.3 Vertical
Some data have a vertical dimension, and, therefore, a variable should be included to describe the vertical axis. The most commonly used values to describe vertical coordinates are layer, level, pressure, height, and depth. It is important to identify the vertical coordinates using the most common standard terminology, and to include the following information:
- long_name: This CF attribute can be something as simple as "vertical level" and can also be used to clarify the CF units attribute. The valid values for the CF units attribute are provided by the Unidata units library (UDUNITS) package [81] and include units of pressure, length, temperature, and density. In general, the CF units attribute should not be included for variables that do not have physical units [31] (Rec. 3.3). Also, we recommend adding the CF standard_name attribute to describe the coordinate variable
- positive: This CF attribute refers to the direction of the increasing coordinate values, with a valid value of either up or down. If the vertical coordinate has units of pressure, then this attribute is not required. Variables representing dimensional height or depth axes must always explicitly include the CF units and positive attributes because there is no default
- axis: Setting the value of this CF attribute to Z indicates that a coordinate variable is associated with the vertical axis
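Gathered together, the attributes above form the set a producer would attach to a vertical coordinate variable when writing a file. The sketch below is a plain Python dict with illustrative values; a real product would write these through its netCDF or HDF5 API:

```python
# Illustrative CF attributes for a height coordinate variable.
# "positive" is included because height/depth axes have no default direction;
# it could be omitted only for a pressure coordinate.
height_attrs = {
    "long_name": "height above mean sea level",
    "standard_name": "height",  # a CF standard name
    "units": "m",               # a UDUNITS length unit
    "positive": "up",           # coordinate values increase upward
    "axis": "Z",                # marks the variable as the vertical coordinate
}
print(height_attrs["positive"])  # up
```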
4.5 Data Quality
It is essential that users of scientific data products have access to complete (to the degree to which all knowledge is available at the time) and properly articulated (i.e., correctly described for the user to logically understand, discern, and make well-informed decisions) information regarding the data quality, including known issues and limitations. This information will help to inform users about the potential applications of the data products and prevent data misuse. Therefore, data products should include metadata pointing to public documentation of the processes used for assessing data quality. Also, data producers should supply the documentation to the DAACs for archiving and distribution. Data producers can work with the DAACs and review boards to provide data quality information through an existing community-standardized format for describing quality (e.g., the data quality metadata model found in GHRSST [11]). For example, data quality information can be provided as a part of a data product user guide. Data quality information can also be included in file-level metadata and/or as data quality layers.
The recommended contents for capturing and ensuring data quality are provided below. More detailed explanations on these, along with examples, are found in the Data Management Plan Template for Data Producers [82]. In the discussion below, we use the term documentation to refer to somewhat extensive information that is typically stored separately from data files, with the metadata in the files including pointers (URLs) to such information.
4.5.1 Quality Information in Data Product Documentation
This subsection provides guidance regarding the information that should be covered in documents considered too extensive to be contained within the data files themselves.
- Document the process used, including data flows and organizations involved in assuring data quality. Provide the references to the ICDs between organizations that have been or will be developed. If the ICD does not exist or is a work in progress, include the names and email addresses of the lead authors responsible for drafting the ICD. See [83] for an example of an ICD. In the cases (e.g., airborne investigations and field campaigns) where formal ICDs are not produced, provide references to Data Management Plans where data quality procedures are described.
- Provide documentation of the calibration/validation (Cal/Val, see Appendix B) approach used, including sources of data, duration of the process, the targeted uncertainty budget that was used to assess performance, and the conditions under which Cal/Val are performed. As Cal/Val data sources change or are reprocessed, ensure that the information is kept up to date in a publicly accessible location with reference to the relevant geospatial and temporal coverage information that is directly applicable to those Cal/Val data products.
- Provide a description of how quality flags or indicators (see Appendix B) are used in the product and explain their meanings. The following are general considerations regarding quality flags and indicators:
- Define and create indicators to represent the quality of a data product from different aspects (e.g., data dropout rate of a sea surface temperature data product).
- Ensure that quality flags are related to a quantifiable metric that directly relates to the usefulness, validity, and suitability of the data.
- Identify quantifiable data quality criteria, such as confidence levels and the values of quality flags, which can be used as criteria for refining search queries.
- Provide ancillary quality and uncertainty flags to facilitate detection of areas that are likely to contain spurious data (e.g., ice in unexpected places).
- Provide pixel-level (or measurement-level) uncertainty information where possible and meaningful. Provide the confidence level (e.g., 95%) to indicate the statistical significance of the uncertainty estimates.
- Provide data quality variables and metadata along with detailed documentation on how the metadata are derived and suggestions on how to interpret them or use them in different applications.
- Provide definition and description of each data quality indicator, including the algorithms and data products used to derive the quality information and description of how each quality indicator can be used.
- Provide examples of idealized quality flag/indicator combinations that would likely yield optimal quality filtering (i.e., minimized bias, uncertainty, and spurious observations) for science in a particular domain of research.
- A quality summary should also be documented and disseminated whenever a new dataset or a new version of a dataset is published. The quality summary should at least be a high-level overview of strengths and limitations of the dataset and should be directly traceable and reproducible by the variables within the dataset, such as by referencing the quality flags and indicators used to derive the summary. For example, the quality summary may describe the overall percentage of data that are either missing from the dataset (due to pre-processing Quality Assessment/Quality Control (QA/QC)) or that may be optionally discarded (at the discretion of the data user) due to quality conditions that are expressed by the quality flags and indicators.
- Provide documentation of the methods used for estimating uncertainty and how uncertainty estimates are included in the data product.
- Provide documentation of known issues and caveats for users and consider leveraging DAAC resources for more expedient updating and publication of this information (e.g., forums, web announcements). Also include citations and references to the data used in the validation process.
4.5.2 Quality Information in Product Metadata
This subsection provides guidance regarding the quality information that should be provided via the metadata within the data files themselves, if possible.
- Include the uncertainties in the delivered data, with the level of detail dependent on the size of the uncertainty information. For example, provide uncertainty expressed per data (pixel) value, per file, or per data product.
- Provide pointers (URLs or citations) to the ancillary data products that are used for quality assessments, Cal/Val, validation of uncertainty budget, as well as quantification and characterization of uncertainty.
- Implement quality flags and measurement state information in the CF-compliant attributes flag_values, flag_masks, and flag_meanings [29] (Section 3.5). The choice of flag_values vs. flag_masks depends on the use case. The flag_values and flag_masks may or may not be used together. An example of a complex case in which both can be used is illustrated in [29], Section 1.7, example 3.5. In all cases, flag_meanings is used in conjunction with either flag_masks or flag_values.
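As an illustrative (not product-specific) sketch of how the bit-field form works, the following standard-library Python decodes a quality value against hypothetical flag_masks and flag_meanings attributes; the mask values and meaning names are invented:

```python
# Hypothetical CF-style flag attributes for a quality variable:
flag_masks = [1, 2, 4]                  # one bit per independent condition
flag_meanings = "land cloud sun_glint"  # blank-separated, one name per mask

def decode_flags(value, masks, meanings):
    """Return the meanings whose mask bits are set in the flag value."""
    names = meanings.split()
    return [name for mask, name in zip(masks, names) if value & mask]

print(decode_flags(5, flag_masks, flag_meanings))  # ['land', 'sun_glint']
```

With flag_values (mutually exclusive states) the test would be equality against each listed value rather than a bitwise AND.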
Consider compliance with metadata standards related to data quality—International Organization for Standardization (ISO) 19157 [84], CF Conventions [29] including those for flags and indicators, ACDD [80], and ISO 8601 [76] [77]. Plan on using an automated compliance checker (Section 6.2).
4.6 Global Attributes
Global attributes (i.e., those that apply to an entire file rather than to a particular group or variable in a file) improve data discoverability, documentation, and usability. Descriptions of the recommended global attributes, according to CF, Attribute Conventions for Data Discovery (ACDD) [80], and other conventions, are found in Appendix D.
4.6.1 Provenance
Data provenance captures the origins, lineage, custody, and ownership of data. Including provenance in the file-level and product-level metadata helps ensure transparency and reproducibility. Provenance is also essential for debugging while data products are being developed, verified, validated, and evaluated for quality. File-level provenance can be combined with the data-product-level provenance to help the user ascertain the overall data product provenance. When describing provenance, include information about the context of the data production run (e.g., list of input data and ancillary files, production time) and the environment used to create the data product (e.g., software version, processing system, processing organization). While capturing provenance metadata can be challenging, it is generally good to capture sufficient lineage information for an independent third party to be able to reproduce the generation of a science data product.
The recommended provenance metadata for the processing environment are shown in Appendix D.6.
4.7 Variable-Level Attributes
Variable-level attributes (i.e., those that apply to a specific variable) improve interoperability, documentation, and usability. The recommended variable-level attributes given by the CF and the ACDD conventions are found in Appendix E. Several of these variable-level attributes have been discussed above, such as long_name, standard_name, units, flag_values, flag_masks, and flag_meanings.
5 Data Compression, Chunking, and Packing
Data compression and chunking are two storage features provided by the HDF5 library and are available through the netCDF-4 API. HDF5's internal compression can reduce the space taken up by variables, especially those with many fill values or value repetition. The saved space can pay significant dividends in both storage space and transmission speed over the network. HDF5 includes a compression method referred to as "deflation", based on the common compressor gzip (itself based on the Lempel-Ziv algorithm). Deflation levels run from 1 to 9, with storage efficiency and time to compress increasing with each level. A level of 5 is often a good compromise. NetCDF-4/HDF5 variables are individually compressed within the file, which means that applications only need to uncompress those variables of interest, and not the whole file, as would be necessary for external compression methods such as gzip or bzip.
NetCDF-4/HDF5 variables can also be "chunked," which means that each variable is stored as a sequence of separate chunks in the file. If compression is used, each chunk is compressed separately. This allows read programs to decompress only the chunks required for a read request, and not the entire variable, resulting in even greater I/O efficiencies. Chunking also can allow a calling program to retrieve segments of data more efficiently when those data are stored in Object Storage (see Appendix B, Glossary). Note that the DIWG recommends using only the DEFLATE compression filter on netCDF-4 and netCDF-4-Compatible HDF5 data [85]. Also, applying the netCDF-4/HDF5 "shuffle" filter before deflation can significantly improve the data compression ratio for multidimensional netCDF-4/HDF5 variables.
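The effect of the shuffle filter can be sketched without HDF5 at all. The standard-library Python below models it: a hand-written byte shuffle (mimicking what the HDF5 filter does internally) is applied to a smoothly varying float field, invented for illustration, before deflate level 5:

```python
import struct
import zlib

# A smoothly varying 4-byte float field, standing in for a geophysical variable.
values = [20.0 + 0.001 * i for i in range(10000)]
raw = struct.pack(f"<{len(values)}f", *values)

def shuffle(data, elem_size):
    """Byte-shuffle: gather byte 0 of every element, then byte 1, and so on,
    as the HDF5 shuffle filter does before compression."""
    return bytes(data[j] for k in range(elem_size)
                 for j in range(k, len(data), elem_size))

plain = zlib.compress(raw, 5)                  # deflate level 5, no shuffle
shuffled = zlib.compress(shuffle(raw, 4), 5)   # shuffle, then deflate level 5

print(len(raw), len(plain), len(shuffled))
```

On fields like this one, grouping the nearly constant high-order bytes lets deflate find much longer runs; actual ratios depend on the data, so compare sizes on representative files before fixing settings.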
Chunking is appropriate for variables with a large number of values, particularly for multidimensional variables. It is helpful to consider the most likely pattern of data usage. However, where this is unknown or widely varied, “balanced chunking” is recommended, i.e., balanced access speeds for time-series and geographic cross-sections, the two most-common end-member geometries of data access. For example, Unidata has an algorithm for balanced chunking [86]. The DIWG recommends using balanced chunking for multidimensional variables contained in grid structures [24] (Rec. 2.11). Further recommendations regarding chunk sizes for use in the cloud are covered in Section 3.3 above.
In addition, the following command-line utilities can be used to chunk and compress files after the files have been written:
- h5repack (part of the HDF5 library [27])
- nccopy (part of the netCDF library [87])
- ncks (part of the NCO package [88])
These utilities are also useful in experimenting with different compression levels and chunking schemes.
Another way to reduce data size is to apply a scale and offset factor to the data values, allowing the values to be saved as a smaller data type, such as a 2-byte integer. This technique, known as "packing," is appropriate for data with a limited dynamic range. The attributes needed to reconstruct the original values are scale_factor and add_offset.
The equation to reconstruct the original values is:
final_data_value = (scale_factor * packed_data_value) + add_offset
The values for scale_factor and add_offset may be selected by the data producer, or automatically computed to minimize storage size by a packing utility such as ncpdq (part of the NCO package [89]). The DIWG recommends using packing only when the packed data are stored as integers [24] (Rec. 2.6). An additional benefit of packing is that it can sometimes make the data more compressible via deflation. That is, packing followed by the netCDF-4/HDF5 "shuffle" filter followed by deflation can result in very significant data compression.
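A minimal sketch of the packing arithmetic in plain Python follows; the field values are invented, and in practice a producer would typically let a utility such as ncpdq choose the parameters:

```python
# Field to pack: an illustrative temperature series in kelvin.
values = [273.15 + 0.01 * i for i in range(2001)]

vmin, vmax = min(values), max(values)
nbins = 2**16 - 2               # leave one int16 code free for a _FillValue
scale_factor = (vmax - vmin) / nbins
add_offset = (vmax + vmin) / 2  # center the packed codes on zero

# Pack to signed 16-bit codes, then reconstruct with the equation above.
packed = [round((v - add_offset) / scale_factor) for v in values]
unpacked = [scale_factor * p + add_offset for p in packed]

max_err = max(abs(u - v) for u, v in zip(unpacked, values))
print(f"scale_factor={scale_factor:.3e}, max error={max_err:.3e}")
```

The quantization error is bounded by half of scale_factor, which is why packing is appropriate only when that step size is small relative to the data's required precision.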
6 Tools for Data Product Testing
The steps indicated below should be followed to test compliance and usability of a new data product and to respond to any issues found during testing:
- Inspect the contents of the data file (e.g., via ncdump, h5dump, Panoply, HDFView, or a similar tool) to check for correctness of the metadata, and that the data/metadata structures agree with the product user guide (Section 6.1).
- Use automated compliance checkers to test the product against metadata conventions (Section 6.2).
- Inspect and edit the metadata using tools described in Section 6.3, if problems are found.
- Modify the data production code as required, once all the necessary changes are identified.
- Test the product with the tools that will likely be used on the product (Section 6.4).
- Validate that the packaging decisions (see Section 5) result in the desired size/performance trade-off.
6.1 Data Inspection
Dumping the data and inspecting them is a useful first check or troubleshooting technique. It can reveal obvious problems with standards compliance, geolocation (see Appendix C.1), and consistency with the data product user guide (if available). Useful tools for data inspection of netCDF-4 and HDF5 files are summarized in Table 3 (but also see [90]).
Table 3. Useful tools for inspecting netCDF-4 and HDF5 data. The "Type" column indicates the interfaces supported by the tools (command-line interface (CLI) or graphical user interface (GUI)).
Tool | Type | Access | Capabilities
HDFView | GUI | Website [28] | Read netCDF-4 and HDF5 files; views any data object; select "Table" from menu bar and then "export to text" or "export to binary"
Panoply | GUI | Website [91] | Read netCDF-4 and HDF5 files; create images and maps from the data contained in a data product
h5diff | CLI | HDF5 library [27] | Compare a pair of netCDF-4 or HDF5 files; the differences are reported as ASCII text
h5dump, h5ls | CLI | HDF5 library [27] | Dump netCDF-4 and HDF5 file content to ASCII format
IDV | GUI | IDV [92] | Integrated Data Viewer; 3D geoscience visualization and analysis tool that gives users the ability to view and analyze geoscience data in an integrated fashion
ncdump | CLI | NetCDF-4 C library [87] | Dump netCDF-4 file content to ASCII format
ncompare | CLI | NASA GitHub [93] | Compare a pair of netCDF-4 files; runs directly in Python and provides an aligned and colorized difference report for quick assessments of groups, variable names, types, shapes, and attributes; can generate .txt, .csv, or .xlsx report files
ncks | CLI | NCO Toolkit [94] | Read and dump netCDF-4 and HDF5 files
NCL | CLI | NCAR Command Language [95] | Interpreted language designed for scientific data analysis and visualization (see footnote 5)
5 NCL was put into maintenance mode in 2019. See https://geocat.ucar.edu/blog/2020/11/11/November-2020-update.
6.2 Compliance Checkers
Compliance checkers should be used while data products are being developed to ensure that the metadata fields are all populated and are meaningful. The following are recommended compliance checkers:
- Metadata Compliance Checker (MCC) is a web-based tool and service designed by the Physical Oceanography DAAC (PO.DAAC) for netCDF and HDF formats [96]
- CFChecker developed by Decker [97]
- Dismember developed by NCO [98]
- Integrated Ocean Observing System (IOOS) Compliance Checker [99]
- National Centre for Atmospheric Science (NCAS) CF Compliance Checker [100]
- Centre for Environmental Data Analysis (CEDA) [101]
6.3 Internal Metadata Editors
Data editors can be useful in tweaking metadata internal to the data files when problems surface during testing. Once the metadata (or data) have been corrected, of course, the data processing code typically needs to be modified for the actual production runs. Several useful tools available for editing data in netCDF-4 and HDF5 formats are summarized in Table 4 (but also see [90]).
Table 4. Useful tools for editing netCDF-4 and HDF5 metadata and data. The "Type" column indicates the interfaces supported by the tools (command-line interface (CLI) or graphical user interface (GUI)).
Tool | Type | Access | Capabilities
HDFView | GUI | Website [28] | Create, edit, and delete content of netCDF-4 and HDF5 files
h5diff | CLI | HDF5 library [27] | Compare a pair of netCDF-4 or HDF5 files; the differences are reported as ASCII text
ncatted | CLI | NCO Toolkit [94] | Edit netCDF-4 global, group, and variable-level attributes
ncks | CLI | NCO Toolkit [94] | For netCDF-4: subset, chunk, compress, convert between versions, copy variables from one file to another, merge files, print
ncompare | CLI | NASA GitHub [93] | Compare a pair of netCDF-4 files; runs directly in Python and provides an aligned and colorized difference report for quick assessments of groups, variable names, types, shapes, and attributes; can generate .txt, .csv, or .xlsx report files
ncrename | CLI | NCO Toolkit [94] | Rename groups, dimensions, variables, and attributes of netCDF-4 files
ncdump | CLI | NetCDF-4 C library [87] | Print the internal metadata of netCDF-4 files
ncgen | CLI | NetCDF-4 C library [87] | Convert ASCII files to netCDF-4 format
6.4 Other Community-Used Tools
Two generalized GUI tools in particular work with a large variety of netCDF and HDF products: Panoply [91] and HDFView [28]. Thus, these are recommended for at least minimal testing of data products before their release to users. Appendix C provides illustrations of these tools. In addition, it is helpful to test with tools that are in wide use by the target community for a data product, such as GIS tools for land processes products. Some of these tools are shown in Table 5. This table provides representative examples of data analysis tools and is intended to spark a discussion of which tools the target community or communities are using.
Table 5. Other community-used tools
Tool | Source | URL
ArcGIS | Esri |
AppEEARS | NASA Land Processes DAAC |
ENVI | NV5 Geospatial Software |
ERDAS IMAGINE | Hexagon |
GMT | University of Hawai'i at Mānoa |
Google Earth Engine | Google |
Grid Analysis and Display System (GrADS) | George Mason University |
GRASS | OSGeo Project |
HDFLook | HDF-EOS Tools and Information Center |
HEG | NASA Land Processes DAAC |
IDL | NV5 Geospatial Software |
IDRISI | Clark Labs |
Multi-Mission Algorithm and Analysis Platform (MAAP) | NASA and ESA |
MATLAB | MathWorks |
Octave | GNU Octave |
Python | Python Software Foundation |
Quantum GIS (QGIS) | OSGeo Project |
R | The R Project for Statistical Computing |
SeaDAS | NASA Ocean Biology DAAC |
Sentinel Application Platform (SNAP) | ESA |
7 Data Product Digital Object Identifiers
A Digital Object Identifier (DOI) is a unique alphanumeric character string (i.e., handle) that can be used to identify a data product. A DOI is permanent: once registered, it can be used to locate the object to which it refers indefinitely. Since their introduction in 2000, DOIs have been routinely assigned to journal articles and cited by the scientific community. The use of DOIs for data products is more recent but equally important for universal referencing and discoverability of data, as well as for proper attribution and citation.
The DOI handle is composed of a prefix that includes characters to identify the registrant and a suffix that includes the identification number of the registered object. In addition to the DOI handle, a web address is assigned by the DOI registration service provider. For a data product, its DOI typically leads to a web landing page (for guidelines, see [102] [103]) that provides information about the data product and services for users to obtain the data. One of the key benefits of assigning a DOI to a data product is that even if the web address changes, the DOI remains valid. This means that a DAAC can change the web address of a data product without affecting the validity of references made in published literature. In addition, the data publisher could change, but the DOI is unaffected. For a detailed description of DOIs, see the DOI Handbook [104].
The ESDIS Project has established procedures for managing DOIs for EOSDIS data [105]. The format of the DOIs managed by the ESDIS Project is [prefix]/[suffix]. Here the prefix always starts with 10 followed by a number, as in 10.5067, and is assigned to the agency (ESDIS) for its repositories, whereas the suffix uniquely identifies the object. The suffix can be semantic (containing meaningful information about the digital object) or opaque (any combination of alphanumeric characters, usually generated randomly and not having any semantic content). Examples of the URLs for two products, showing semantic and opaque DOIs respectively, are https://doi.org/10.5067/ECOSTRESS/ECO2LSTE.001 and https://doi.org/10.5067/D7GK8F5J8M8R.
Data producers should work with the assigned DAAC to obtain DOIs for their data products. The requests for DOI registration are made to the ESDIS Project by a DAAC.
The ESDIS Project uses a two-step process for registering DOIs for most DAACs. First, DOIs are reserved, so that data producers can start using them in the metadata while generating the products. Information about the DOI should be included in the data product metadata. In particular, the DOI resolving authority (i.e., https://doi.org) and the DOI identifier must be included as global attributes (see [106]). When a data product is ready to be delivered to the DAAC for public release, the DOI is registered. Until the DOI is registered, it can be modified or deleted (withdrawn). Once registered, the DOI becomes permanent.
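For file-level metadata, the two required pieces can be carried as global attributes. The sketch below is a plain Python dict with a placeholder DOI; the attribute names shown are common EOSDIS practice but are an assumption here and should be confirmed against [106]:

```python
# Placeholder DOI global attributes; the DOI suffix is NOT a registered
# identifier, and the attribute names should be verified against ESDIS guidance.
doi_attrs = {
    "identifier_product_doi_authority": "https://doi.org",  # resolving authority
    "identifier_product_doi": "10.5067/XXXX/EXAMPLE.001",   # placeholder handle
}

# The citable URL is the authority joined with the handle.
full_url = (doi_attrs["identifier_product_doi_authority"]
            + "/" + doi_attrs["identifier_product_doi"])
print(full_url)  # https://doi.org/10.5067/XXXX/EXAMPLE.001
```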
8 Product Delivery and Publication
The responsibilities for generating data products and making them available to the user community are shared between data producers and the DAACs. Earthdata Pub [107] is a set of tools and guidance to help data producers publish Earth Science data products with an assigned NASA DAAC. If a DAAC has not yet been assigned, Earthdata Pub can be used to submit information about a potential data product for consideration and possible assignment by NASA; the exception is ROSES-funded projects, which should work with their Program Scientists to have a DAAC assigned as early as possible after their funding has been approved.
Earthdata Pub creates a consistent data publication experience across the DAACs and provides the primary point of interaction for data producers and the DAAC staff. Using Earthdata Pub data producers can: 1) request to publish data at a DAAC, 2) submit information and files required to publish data in one place, 3) track the publication status of a request in real-time, and 4) communicate directly with DAAC staff support. Earthdata Pub includes resources to help data producers at each step along the way. Simple forms and workflows provide a guided data publication process.
Even though the details of the processes leading to data delivery and publication vary depending on the type of data as well as the DAAC that the data producer is working with, the primary phases of the data publication processes, from the perspective of the DAAC, are generally the same: 1) obtain the data, documentation and metadata, and related information from data producers; 2) work with data producers to generate CMR-compliant metadata and additional documentation (e.g., user guide) describing the data (see footnote 6); 3) generate or adapt appropriate data software readers as needed; and 4) release the data and documentation for access by the user community.
Each of the 12 EOSDIS DAACs has historically established publication workflows that account for the heterogeneous suite of missions, instruments, data producers, data formats, and data services managed by EOSDIS. Earthdata Pub allows each DAAC to maintain unique workflow steps while combining common steps. The specifics of data delivery and publication, such as schedules, interfaces, workflows, and procedures for submitting data product updates, are best established by communications between the data producers and their assigned DAACs through Earthdata Pub. Data producers and DAACs will agree to a Level of Service [108] for the data. Levels of Service are the services applied to data during archiving and preservation to optimize data usability and access.
To allow ample time to discuss data format and packaging, data producers should contact their assigned DAAC through Earthdata Pub as soon as sample data are ready. As of May 2024, the DAACs are actively onboarding to Earthdata Pub; the data producer should check the list of DAACs in Earthdata Pub or contact the assigned DAAC to verify the DAAC's preferred process. The general process for adding new data products to EOSDIS, as well as the requirements and responsibilities of data producers and DAACs, is described in [109]. To get started publishing data, visit Earthdata Pub: https://pub.earthdata.nasa.gov.
6 Data documentation should include at least a user guide. If the product is a geophysical retrieval, then providing an Algorithm Theoretical Basis Document is recommended as well.
9 References
[1] |
ESDIS, "Earth Science Data Systems (ESDS) Program," 6 November 2023. [Online]. Available: https://www.earthdata.nasa.gov/esds. [Accessed 9 January 2024]. |
[2] |
ESDIS, "Earth Science Data and Information System (ESDIS) Project," 16 January 2020. [Online]. Available: https://earthdata.nasa.gov/esdis. [Accessed 9 January 2024]. |
[3] |
ESDIS, "Earth Science Data System Working Groups," 1 March 2021. [Online]. Available: https://www.earthdata.nasa.gov/engage/esdswg. [Accessed 16 May 2024]. |
[4] |
H. K. Ramapriyan and P. J. T. Leonard, "Data Product Development Guide (DPDG) for Data Producers version 1.1," NASA Earth Science Data and Information System Standards Office, 21 October 2021. [Online]. Available: https://www.earthdata.nasa.gov/esdis/esco/standards-and-practices/data-product-development-guide-for-data-producers. [Accessed 9 January 2024]. |
[5] |
ESDS Program, "Airborne and Field Resources for Investigation Scientists and Data Managers," 7 December 2022. [Online]. Available: https://www.earthdata.nasa.gov/esds/impact/admg/evs. [Accessed 7 January 2024]. |
[6] |
GES DISC, "GES DISC Data and Metadata Recommendations to Data Providers," 31 March 2022. [Online]. Available: https://docserver.gesdisc.eosdis.nasa.gov/public/project/DataPub/GES_DISC_metadata_and_data_formats.pdf. [Accessed 9 January 2024]. |
[7] |
P. Meyappan, P. S. Roy, A. Soliman, T. Li, P. Mondal, S. Wang and A. K. Jain, "Documentation for the India Village-Level Geospatial Socio-Economic Data Set: 1991, 2001," NASA Socioeconomic Data and Applications Center (SEDAC), 12 March 2018. [Online]. Available: https://doi.org/10.7927/H43776SR. [Accessed 9 January 2024]. |
[8] |
PO DAAC, "PO.DAAC data management best practices," [Online]. Available: https://podaac.jpl.nasa.gov/PO.DAAC_DataManagementPractices. [Accessed 9 January 2024]. |
[9] |
NSIDC DAAC, "Submit NASA Data to NSIDC DAAC," 2024. [Online]. Available: https://nsidc.org/data/submit-data/submit-nasa-data-nsidc-daac/assigned-data. [Accessed 9 January 2024]. |
[10] |
ORNL DAAC, "Submit Data," [Online]. Available: https://daac.ornl.gov/submit/. [Accessed 9 January 2024]. |
[11] |
GHRSST Science Team, "The Recommended GHRSST Data Specification (GDS) 2.0, Document Revision 5," 9 October 2012. [Online]. Available: https://doi.org/10.5281/zenodo.4700466. [Accessed 9 January 2024]. |
[12] |
ESDIS, "ESDIS Standards Coordination Office (ESCO)," 18 May 2022. [Online]. Available: https://www.earthdata.nasa.gov/esdis/esco. [Accessed 9 January 2024]. |
[13] |
M. D. Wilkinson, "The FAIR Guiding Principles for scientific data management and stewardship," Scientific Data, vol. 3, 15 March 2016. |
[14] |
NASA, "Science Information Policy," 25 April 2023. [Online]. Available: https://science.nasa.gov/researchers/science-data/science-information-policy. [Accessed 9 January 2024]. |
[15] |
ESDS Program, "Open Data, Services, and Software Policies," NASA, 25 May 2021. [Online]. Available: https://www.earthdata.nasa.gov/engage/open-data-services-and-software. [Accessed 9 January 2024]. |
[16] |
ESDIS, "EOSDIS Glossary," 28 January 2020. [Online]. Available: https://www.earthdata.nasa.gov/learn/glossary. [Accessed 9 January 2024]. |
[17] |
ESDIS Project, "Common Metadata Repository," 12 May 2021. [Online]. Available: https://earthdata.nasa.gov/eosdis/science-system-description/eosdis-components/cmr. [Accessed 9 January 2024]. |
[18] |
G. Asrar and H. K. Ramapriyan, "Data and Information System for Mission to Planet Earth," Remote Sensing Reviews, vol. 13, pp. 1-25, 1995. https://doi.org/10.1080/02757259509532294. |
[19] |
H. K. Ramapriyan, J. F. Moses and D. Smith, "NASA Earth Science Data Preservation Content Specification, Revision C.," 3 May 2022. [Online]. Available: https://earthdata.nasa.gov/esdis/eso/standards-and-references/preservation-content-spec. [Accessed 9 January 2024]. |
[20] |
H. K. Ramapriyan, J. F. Moses and D. Smith, "Preservation Content Implementation Guidance, Version 1.0," 25 January 2022. [Online]. Available: https://doi.org/10.5067/DOC/ESO/RFC-042. [Accessed 9 January 2024]. |
[21] |
NASA, "Interface Management," [Online]. Available: https://www.nasa.gov/seh/6-3-interface-management. [Accessed 7 January 2024]. |
[22] |
H. Ramapriyan, G. Peng, D. Moroni and C.-L. Shie, "Ensuring and Improving Information Quality for Earth Science Data and Products," D-Lib Magazine, vol. 23, no. 7/8, July/August 2017. https://doi.org/10.1045/july2017-ramapriyan. |
[23] |
DIWG, "Dataset Interoperability Recommendations for Earth Science," Dataset Interoperability Working Group, 17 November 2022. [Online]. Available: https://wiki.earthdata.nasa.gov/display/ESDSWG/Dataset+Interoperability+Recommendations+for+Earth+Science. [Accessed 9 January 2024]. |
[24] |
DIWG, "Dataset Interoperability Recommendations for Earth Science, ESDS-RFC-028v1.3," ESDIS Project, 19 June 2020. [Online]. Available: https://www.earthdata.nasa.gov/esdis/esco/standards-and-practices/dataset-interoperability-recommendations-for-earth-science. [Accessed 9 January 2024]. |
[25] |
ESCO, "Standards and Practices," ESDIS Project, 21 November 2023. [Online]. Available: https://www.earthdata.nasa.gov/esdis/esco/standards-and-practices. [Accessed 9 January 2024]. |
[26] |
ESCO, "netCDF-4/HDF5 File Format," ESDIS Project, 20 May 2021. [Online]. Available: https://www.earthdata.nasa.gov/esdis/esco/standards-and-practices/netcdf-4hdf5-file-format. [Accessed 9 January 2024]. |
[27] |
ESCO, "HDF5 Data Model, File Format and Library – HDF5 1.6," ESDIS Project, 20 May 2021. [Online]. Available: https://www.earthdata.nasa.gov/esdis/esco/standards-and-practices/hdf5. [Accessed 9 January 2024]. |
[28] |
The HDF Group, "HDF View," [Online]. Available: https://www.hdfgroup.org/downloads/hdfview/. [Accessed 9 January 2024]. |
[29] |
ESDIS, "Climate and Forecast (CF) Metadata Conventions," ESDIS Standards Coordination Office, 20 May 2021. [Online]. Available: https://www.earthdata.nasa.gov/esdis/esco/standards-and-practices/climate-and-forecast-cf-metadata-conventions. [Accessed 9 January 2024]. |
[30] |
A. Jelenak, "Encoding of Swath Data in the Climate and Forecast Convention," 19 June 2018. [Online]. Available: https://github.com/Unidata/EC-netCDF-CF/blob/master/swath/swath.adoc. [Accessed 9 January 2024]. |
[31] |
DIWG, "Dataset Interoperability Recommendations for Earth Science: Part 2, ESDS-RFC- 036v1.2," June 2020. [Online]. Available: https://www.earthdata.nasa.gov/esdis/esco/standards-and-practices/dataset-interoperability-recommendations-for-earth-science. [Accessed 9 January 2024]. |
[32] |
NCEI, "NCEI NetCDF Templates v2.0," 7 December 2015. [Online]. Available: https://www.nodc.noaa.gov/data/formats/netcdf/v2.0/. [Accessed 9 January 2024]. |
[33] |
ESDIS Project, "Earthdata Search," 18 September 2023. [Online]. Available: https://www.earthdata.nasa.gov/learn/earthdata-search. [Accessed 9 January 2024]. |
[34] |
Adobe, "TIFF," [Online]. Available: https://www.adobe.com/creativecloud/file-types/image/raster/tiff-file.html. [Accessed 9 January 2024]. |
[35] |
Wikipedia, "Geographic information system," Wikipedia, 8 January 2024. [Online]. Available: https://en.wikipedia.org/wiki/Geographic_information_system. [Accessed 9 January 2024]. |
[36] |
OGC, "Open Geospatial Consortium," [Online]. Available: https://www.ogc.org/. [Accessed 9 January 2024]. |
[37] |
ESCO, "GeoTIFF File Format, ESDS-RFC-040v1.1," 1 March 2021. [Online]. Available: https://www.earthdata.nasa.gov/esdis/esco/standards-and-practices/geotiff. [Accessed 16 May 2024]. |
[38] |
COG, "Cloud Optimized GeoTIFF: An imagery format for cloud-native geospatial processing," [Online]. Available: https://www.cogeo.org/. [Accessed 9 January 2024]. |
[39] |
N. Pollack, "Cloud Optimized GeoTIFF (COG) File Format. NASA Earth Science Data and Information System Standards Coordination Office.," April 2024. [Online]. Available: https://doi.org/10.5067/DOC/ESCO/ESDS-RFC-049v1. [Accessed 6 June 2024]. |
[40] |
OGC, "OGC Cloud-Optimized GeoTIFF Standard," 14 July 2023. [Online]. Available: http://www.opengis.net/doc/is/COG/1.0. [Accessed 18 January 2024]. |
[41] |
C. Plain, "New Standard Announced for Using GeoTIFF Imagery in the Cloud," 3 January 2024. [Online]. Available: https://www.earthdata.nasa.gov/learn/articles/new-cloud-optimized-geotiff-standard. [Accessed 12 February 2024]. |
[42] |
ESCO, "Instructions to RFC Authors," ESDIS Project, 20 May 2021. [Online]. Available: https://www.earthdata.nasa.gov/esdis/esco/standards-and-references/instructions-to-rfc-authors. [Accessed 9 January 2024]. |
[43] |
ESCO, "ASCII File Format Guidelines for Earth Science Data, ESDS-RFC-027v1.1," May 2016. [Online]. Available: https://www.earthdata.nasa.gov/esdis/esco/standards-and-practices/ascii-file-format-guidelines-for-earth-science-data. [Accessed 9 January 2024]. |
[44] |
E. Northup, G. Chen, K. Aikin and C. Webster, "ICARTT File Format Standards V2.0," January 2017. [Online]. Available: https://www.earthdata.nasa.gov/esdis/esco/standards-and-practices/icartt-file-format. [Accessed 9 January 2024]. |
[45] |
OGC, "OGC GeoPackage Encoding Standard Version 1.3.1," OGC, 16 November 2021. [Online]. Available: https://www.geopackage.org/spec131/. [Accessed 9 January 2024]. |
[46] |
W3C, "Extensible Markup Language (XML)," 11 October 2016. [Online]. Available: https://www.w3.org/XML/. [Accessed 9 January 2024]. |
[47] |
ESCO, "OGC KML," ESDIS Standards Coordination Office, 20 May 2021. [Online]. Available: https://www.earthdata.nasa.gov/esdis/esco/standards-and-practices/ogc-kml. [Accessed 9 January 2024]. |
[48] |
Esri, "Shapefiles," Esri, [Online]. Available: https://doc.arcgis.com/en/arcgis-online/reference/shapefiles.htm. [Accessed 9 January 2024]. |
[49] |
Wikipedia, "Shapefile," Wikipedia, 27 August 2023. [Online]. Available: https://en.wikipedia.org/wiki/Shapefile. [Accessed 9 January 2024]. |
[50] |
Esri, "Geoprocessing considerations for shapefile output," 24 April 2009. [Online]. Available: http://webhelp.esri.com/arcgisdesktop/9.3/index.cfm?TopicName=Geoprocessing%20considerations%20for%20shapefile%20output. [Accessed 9 January 2024]. |
[51] |
Internet Engineering Task Force (IETF), "The GeoJSON Format," August 2016. [Online]. Available: https://datatracker.ietf.org/doc/html/rfc7946. [Accessed 7 January 2024]. |
[52] |
ESCO, "HDF-EOS5 Data Model, File Format and Library," ESDIS Standards Office, 20 May 2021. [Online]. Available: https://www.earthdata.nasa.gov/esdis/esco/standards-and-practices/hdf-eos5. [Accessed 9 January 2024]. |
[53] |
A. Taaheri and K. Rodrigues, "HDF-EOS5 Data Model, File Format and Library," May 2016. [Online]. Available: https://cdn.earthdata.nasa.gov/conduit/upload/4880/ESDS-RFC-008-v1.1.pdf. [Accessed 9 January 2024]. |
[54] |
CEOS, "CEOS Analysis Ready Data," Committee on Earth Observation Satellites, 18 October 2021. [Online]. Available: http://ceos.org/ard/. [Accessed 9 January 2024]. |
[55] |
ESDS Program, "Earthdata Cloud Evolution," 10 August 2023. [Online]. Available: https://www.earthdata.nasa.gov/eosdis/cloud-evolution. [Accessed 8 January 2024]. |
[56] |
Cloud-Native Geospatial Foundation, "Cloud-Optimized Geospatial Formats Guide," 2023. [Online]. Available: https://guide.cloudnativegeo.org. [Accessed 8 January 2024]. |
[57] |
AWS, "Best Practices Design Patterns: Optimizing Amazon S3 Performance," June 2019. [Online]. Available: https://d1.awsstatic.com/whitepapers/AmazonS3BestPractices.pdf. [Accessed 9 January 2024]. |
[58] |
PANGEO Team, "PANGEO - A community platform for Big Data geoscience," 2023. [Online]. Available: https://pangeo.io/. [Accessed 9 January 2024]. |
[59] |
R. Signell, A. Jelenak and J. Readey, "Cloud-Performant NetCDF4/HDF5 Reading with the Zarr Library," 26 February 2020. [Online]. Available: https://medium.com/pangeo/cloud-performant-reading-of-netcdf4-hdf5-data-using-the-zarr-library-1a95c5c92314. [Accessed 9 January 2024]. |
[60] |
Zarr Developers, "Zarr-Python Version 2.14.2," 2022. [Online]. Available: https://zarr.readthedocs.io/en/stable/index.html. [Accessed 9 January 2024]. |
[61] |
D. J. Newman, "Zarr storage specification version 2: Cloud-optimized persistence using Zarr. NASA Earth Science Data and Information System Standards Coordination Office.," April 2024. [Online]. Available: https://doi.org/10.5067/DOC/ESCO/ESDS-RFC-048v1. [Accessed 13 May 2024]. |
[62] |
xarray Developers, "Xarray documentation," 8 December 2023. [Online]. Available: https://docs.xarray.dev/en/stable/. [Accessed 8 January 2024]. |
[63] |
OPeNDAP, "DMR++: How to build & deploy dmr++ files for Hyrax," 17 October 2023. [Online]. Available: https://docs.opendap.org/index.php?title=DMR%2B%2B. [Accessed 9 January 2024]. |
[64] |
P. Quinn, "Cloud Optimized Formats: NetCDF-as-Zarr Optimizations and Next Steps," Element 84, 29 March 2022. [Online]. Available: https://www.element84.com/blog/cloud-optimized-formats-netcdf-as-zarr-optimizations-and-next-steps. [Accessed 12 January 2024]. |
[65] |
S. J. S. Khalsa, E. M. Armstrong, J. Hewson, J. F. Koch, S. Leslie, S. W. Olding and A. Doyle, "A Review of Options for Storage and Access of Point Cloud Data in the Cloud," February 2022. [Online]. Available: https://www.earthdata.nasa.gov/s3fs-public/2022-06/ESCO-PUB-003.pdf. [Accessed 12 January 2024]. |
[66] |
OGC, "GeoParquet," [Online]. Available: https://geoparquet.org/. [Accessed 8 January 2024]. |
[67] |
ESDIS, "Earthdata Harmony Documentation," [Online]. Available: https://harmony.earthdata.nasa.gov/docs. [Accessed 8 January 2024]. |
[68] |
NASA, "Service for transforming NetCDF4 files into Zarr files within Harmony," [Online]. Available: https://github.com/nasa/harmony-netcdf-to-zarr. [Accessed 12 January 2024]. |
[69] |
M. Durant, "kerchunk," 2021. [Online]. Available: https://fsspec.github.io/kerchunk/. [Accessed 8 January 2024]. |
[70] |
T. Stevens, "GCMD Keyword Access," 17 March 2021. [Online]. Available: https://wiki.earthdata.nasa.gov/display/CMR/GCMD+Keyword+Access. [Accessed 12 January 2024]. |
[71] |
ESDIS, "GCMD Keywords by Category," 12 January 2024. [Online]. Available: https://gcmd.earthdata.nasa.gov/static/kms/. [Accessed 12 January 2024]. |
[72] |
ESDIS, "GCMD Keyword Viewer," 2024. [Online]. Available: https://gcmd.earthdata.nasa.gov/KeywordViewer/scheme/all?gtm_scheme=all. [Accessed 8 January 2024]. |
[73] |
ESDIS, "NASA's GCMD releases the Keyword Governance and Community Guide Document, Version 1.0," 11 August 2016. [Online]. Available: https://www.earthdata.nasa.gov/news/nasa-s-gcmd-releases-the-keyword-governance-and-community-guide-document-version-1-0. [Accessed 12 January 2024]. |
[74] |
ESDS Program, "Data Processing Levels," 13 July 2021. [Online]. Available: https://www.earthdata.nasa.gov/engage/open-data-services-and-software/data-information-policy/data-levels. [Accessed 12 January 2024]. |
[75] |
ESDIS, "EOSDIS Glossary - "E"," 12 January 2024. [Online]. Available: https://www.earthdata.nasa.gov/learn/glossary#ed-glossary-e. [Accessed 12 January 2024]. |
[76] |
ISO, "ISO 8601-1:2019 Date and time -- Representations for information interchange -- Part 1: Basic rules," February 2019. [Online]. Available: https://www.iso.org/standard/70907.html. [Accessed 12 January 2024]. |
[77] |
ISO, "ISO 8601-2:2019 Date and time -- Representations for information interchange -- Part 2: Extensions," February 2019. [Online]. Available: https://www.iso.org/standard/70908.html. [Accessed 12 January 2024]. |
[78] |
CF Conventions, "CF Standard Name Table," CF Conventions, 19 January 2024. [Online]. Available: https://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html. [Accessed 16 May 2024]. |
[79] |
ESCO, "Atmospheric Composition Variable Standard Name Convention," 2023. [Online]. Available: https://doi.org/10.5067/DOC/ESCO/ESDS-RFC-043v1. [Accessed 16 May 2024]. |
[80] |
ESIP Documentation Cluster, "Attribute Conventions for Data Discovery," 5 September 2023. [Online]. Available: https://wiki.esipfed.org/index.php/Attribute_Convention_for_Data_Discovery. [Accessed 12 January 2024]. |
[81] |
Unidata, "UDUNITS 2.2.28 Documentation," [Online]. Available: https://docs.unidata.ucar.edu/udunits/current/. [Accessed 12 January 2024]. |
[82] |
Earth Science Data Systems (ESDS) Program, HQ SMD, "Data Management Plan (DMP) Template for Data Producers, Version 1.1," 23 June 2020. [Online]. Available: https://wiki.earthdata.nasa.gov/download/attachments/118138197/ESDIS05161_DMP_for_DPs_template.pdf?api=v2. [Accessed 12 January 2024]. |
[83] |
ESDIS Project, "ICD Between the ICESat-2 Science Investigator-led Processing System (SIPS) and the National Snow and Ice Data Center (NSIDC) Distributed Active Archive Center (DAAC)-423-ICD-007, Revision A," NASA GSFC, Greenbelt, MD, 2016. |
[84] |
ISO, "ISO 19157-1:2023 Geographic information — Data quality — Part 1: General requirements," April 2023. [Online]. Available: https://www.iso.org/standard/78900.html. [Accessed 12 January 2024]. |
[85] |
DIWG, "Use Only Officially Supported Compression Filters on NetCDF-4 and NetCDF-4-Compatible HDF5 Data," 10 January 2024. [Online]. Available: https://wiki.earthdata.nasa.gov/display/ESDSWG/Use+Only+Officially+Supported+Compression+Filters+on+NetCDF-4+and+NetCDF-4-Compatible+HDF5+Data. [Accessed 12 January 2024]. |
[86] |
Developers@Unidata, "Chunking Data: Choosing Shapes," 28 March 2013. [Online]. Available: https://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_choosing_shapes. [Accessed 12 January 2024]. |
[87] |
UCAR, "Network Common Data Form (NetCDF)," 14 March 2023. [Online]. Available: https://www.unidata.ucar.edu/software/netcdf. [Accessed 12 January 2024]. |
[88] |
NCO, "NCO 5.1.6-alpha03 User Guide - 3.32 Chunking," 7 November 2023. [Online]. Available: http://nco.sourceforge.net/nco.html#Chunking. [Accessed 12 January 2024]. |
[89] |
NCO, "NCO Users Guide, Edition 5.1.6 - Alpha03," 7 November 2023. [Online]. Available: http://nco.sourceforge.net/nco.html#ncpdq-netCDF-Permute-Dimensions-Quickly. [Accessed 12 January 2024]. |
[90] |
ESDIS, "Data Tools," 26 May 2023. [Online]. Available: https://www.earthdata.nasa.gov/learn/use-data/tools. [Accessed 12 January 2024]. |
[91] |
NASA, "Panoply netCDF, HDF and GRIB Data Viewer," 1 January 2024. [Online]. Available: https://www.giss.nasa.gov/tools/panoply/. [Accessed 12 January 2024]. |
[92] |
UCAR, "Integrated Data Viewer," 28 August 2023. [Online]. Available: https://www.unidata.ucar.edu/software/idv/. [Accessed 16 January 2024]. |
[93] |
NASA, "NASA ncompare," 2023. [Online]. Available: https://doi.org/10.5281/zenodo.10636759. [Accessed 12 February 2024]. |
[94] |
NCO, "Bienvenue sur le netCDF Operator (NCO) site," 8 November 2023. [Online]. Available: http://nco.sourceforge.net/. [Accessed 16 January 2024]. |
[95] |
NCAR, "The NCAR Command Language (Version 6.6.2) [Software]," Boulder, Colorado: UCAR/NCAR/CISL/VETS, November 2020. [Online]. Available: http://dx.doi.org/10.5065/D6WD3XH5. [Accessed 16 January 2024]. |
[96] |
PO DAAC, "Metadata Compliance Checker," [Online]. Available: https://mcc.podaac.earthdatacloud.nasa.gov. [Accessed 24 April 2023]. |
[97] |
M. Decker, "CFchecker," 11 November 2018. [Online]. Available: https://jugit.fz-juelich.de/IEK8-Modellgruppe/cfchecker. [Accessed 16 January 2024]. |
[98] |
NCO, "NCO User Guide Version 5.1.6-Alpha03 - 3.15.3 Dismembering Files," 7 November 2023. [Online]. Available: http://nco.sourceforge.net/nco.html#ncdismember. [Accessed 16 January 2024]. |
[99] |
IOOS, "IOOS Compliance Checker," 17 May 2023. [Online]. Available: https://compliance.ioos.us/index.html. [Accessed 16 January 2024]. |
[100] |
National Centre for Atmospheric Science, "CF Compliance Checker," [Online]. Available: https://cfchecker.ncas.ac.uk/. [Accessed 8 January 2024]. |
[101] |
Python Software Foundation, "The NetCDF Climate Forecast Conventions compliance checker," 2024. [Online]. Available: https://pypi.org/project/cfchecker/. [Accessed 8 January 2024]. |
[102] |
ESDIS Project, "DOI Landing Page," 13 October 2016. [Online]. Available: https://wiki.earthdata.nasa.gov/display/DOIsforEOSDIS/DOI+Landing+Page. [Accessed 16 January 2024]. |
[103] |
ESDIS Project, "DOI Documents," 19 September 2023. [Online]. Available: https://wiki.earthdata.nasa.gov/display/DOIsforEOSDIS/DOI+Documents. [Accessed 16 January 2024]. |
[104] |
International DOI Foundation, "DOI Handbook," April 2023. [Online]. Available: http://www.doi.org/hb.html. [Accessed 16 January 2024]. |
[105] |
ESDIS Project, "Digital Object Identifiers for ESDIS," 28 April 2023. [Online]. Available: https://wiki.earthdata.nasa.gov/display/DOIsforEOSDIS. [Accessed 16 January 2024]. |
[106] |
ESDIS Project, "DOI Background Information," 28 September 2023. [Online]. Available: https://wiki.earthdata.nasa.gov/display/DOIsforEOSDIS/DOI+Background+Information. [Accessed 16 January 2024]. |
[107] |
Earthdata Pub Team, "NASA Earthdata Pub," [Online]. Available: https://pub.earthdata.nasa.gov/. [Accessed 16 January 2024]. |
[108] |
ESDS Program, "Earth Science Data Systems Level of Service Model," 25 May 2021. [Online]. Available: https://www.earthdata.nasa.gov/engage/new-missions/level-of-service. [Accessed 16 January 2024]. |
[109] |
ESDS Program, "Adding New Data to EOSDIS," 25 May 2021. [Online]. Available: https://www.earthdata.nasa.gov/engage/new-missions. [Accessed 16 January 2024]. |
[110] |
PO.DAAC, "PO.DAAC Data Management Best Practices - Metadata Conventions," [Online]. Available: https://podaac.jpl.nasa.gov/PO.DAAC_DataManagementPractices#Metadata%20Conventions. [Accessed 16 January 2024]. |
[111] |
C. Davidson and R. Wolfe, "VIIRS Science Software Delivery Guide," 2021. [Online]. Available: https://doi.org/10.5067/FA8689BC-E374-11ED-B5EA-0242AC120001. [Accessed 16 January 2024]. |
[112] |
GOFAIR, "FAIR Principles," [Online]. Available: https://www.go-fair.org/fair-principles/. [Accessed 19 January 2024]. |
[113] |
Internet Assigned Numbers Authority, "Uniform Resource Identifier (URI) Schemes," 12 January 2024. [Online]. Available: https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml. [Accessed 16 January 2024]. |
[114] |
OGC, "Geographic information — Well-known text representation of coordinate reference systems," OGC, 16 August 2023. [Online]. Available: https://docs.ogc.org/is/18-010r11/18-010r11.pdf. [Accessed 26 April 2024]. |
[115] |
"Reverse DNS Look-up," [Online]. Available: https://remote.12dt.com/. [Accessed 16 January 2024]. |
[116] |
NOAA EDM, "ISO 19115 and 19115-2 CodeList Dictionaries," ESIP, 3 October 2018. [Online]. Available: http://wiki.esipfed.org/index.php/ISO_19115_and_19115-2_CodeList_Dictionaries. [Accessed 16 January 2024]. |
[117] |
B. Eaton and et al., "NetCDF Climate and Forecast (CF) Metadata Conventions (section 2.7)," 5 December 2023. [Online]. Available: http://cfconventions.org/cf-conventions/cf-conventions.html#groups. [Accessed 16 January 2024]. |