
Principal Investigator (PI): Hook Hua, NASA's Jet Propulsion Laboratory

Multi-decadal climate data records are critical to studying climate variability and change. Building them often requires merging data from multiple instruments, such as those on NASA's A-Train, whose measurements cover a wide range of atmospheric conditions and phenomena. The science Co-I of this proposal was also recently funded under a Making Earth System Data Records for Use in Research Environments (MEaSUREs) NRA to provide a merged, multi-decadal climate data record of water vapor measurements from sensors on the A-Train, operational weather satellites, and other satellites.

The data sets are being assembled from existing data sources, or produced from well-established methods published in peer-reviewed literature. However, the immense volume and inhomogeneity of the data often require an "exploratory computing" approach to product generation, in which data are processed in a variety of ways with varying algorithms, parameters, and code changes until an acceptable intermediate product is generated.

This process is repeated until a desirable final merged product can be generated. The production legacy is often lost because of the complexity of the processing steps tried along the way. Information about the source data, processing methods, parameters used, intermediate product outputs, and associated materials is often hidden in individual trials and scattered throughout the processing system(s). We propose to help users better interpret this exploratory production legacy by enabling the tracking of data, metadata, associated materials, algorithms, and parameter changes used during the production of these merged, multi-sensor data products.
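As a rough illustration of tracking parameter changes between exploratory trials, the sketch below records each trial and reports which parameters differ between two of them. The `Trial` layout, the `diff_parameters` helper, and all file, algorithm, and parameter names are hypothetical and are not part of the proposed system.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Trial:
    """One exploratory processing attempt (hypothetical record layout)."""
    trial_id: str
    algorithm: str
    parameters: Dict[str, Any]
    source_data: List[str]
    outputs: List[str] = field(default_factory=list)


def diff_parameters(old: Trial, new: Trial) -> Dict[str, Tuple[Any, Any]]:
    """Return {name: (old_value, new_value)} for parameters that changed.

    A parameter missing from one trial is reported as None on that side.
    """
    changes: Dict[str, Tuple[Any, Any]] = {}
    for name in sorted(set(old.parameters) | set(new.parameters)):
        before = old.parameters.get(name)
        after = new.parameters.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes


# Two made-up trials of the same merge step with different settings.
trial_1 = Trial("t1", "merge_v1", {"grid_deg": 1.0, "qc_flag": 0},
                ["sensor_a.nc", "sensor_b.nc"])
trial_2 = Trial("t2", "merge_v1",
                {"grid_deg": 0.5, "qc_flag": 0, "bias_correct": True},
                ["sensor_a.nc", "sensor_b.nc"])

changes = diff_parameters(trial_1, trial_2)
for name, (before, after) in changes.items():
    print(f"{name}: {before} -> {after}")
```

Keeping every trial as a record, rather than overwriting settings in place, is what would let a later reader see how the final product's parameters were arrived at.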

By leveraging existing provenance tools, we will capture the metadata associated with the exploratory computing and present the data product provenance back to the users. We will also develop generic multi-platform clients to be plugged into existing code to communicate production information back to the provenance collection tool. To improve data knowledge and use, we will also develop a web portal enabling users to track and visualize the product and processing information collected by the provenance tool. For any product generated, a fully traceable processing lineage will be available that includes the production methods, parameters, source data, and associated information used.
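One way to picture a fully traceable processing lineage is a walk back from a product through the steps and inputs that produced it. The sketch below assumes a simple in-memory dictionary as the provenance store; the store layout, the product names, and the `trace_lineage` function are illustrative assumptions, not the interface of any existing provenance tool.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical provenance store: product name -> how it was produced.
# Names that do not appear as keys are treated as original source data.
PROVENANCE: Dict[str, Dict[str, Any]] = {
    "merged_wv_v3": {"method": "merge", "parameters": {"weights": "equal"},
                     "inputs": ["regridded_a", "regridded_b"]},
    "regridded_a": {"method": "regrid", "parameters": {"grid_deg": 0.5},
                    "inputs": ["source_a"]},
    "regridded_b": {"method": "regrid", "parameters": {"grid_deg": 0.5},
                    "inputs": ["source_b"]},
}


def trace_lineage(product: str, store: Dict[str, Dict[str, Any]],
                  depth: int = 0, _path: Tuple[str, ...] = ()) -> List[str]:
    """Return indented text lines describing how `product` was derived."""
    indent = "  " * depth
    record = store.get(product)
    if record is None:
        return [f"{indent}{product} (source data)"]
    if product in _path:
        # Guard against malformed stores that contain a loop.
        return [f"{indent}{product} (cycle detected)"]
    lines = [f"{indent}{product} <- {record['method']} {record['parameters']}"]
    for parent in record["inputs"]:
        lines.extend(trace_lineage(parent, store, depth + 1, _path + (product,)))
    return lines


lineage = trace_lineage("merged_wv_v3", PROVENANCE)
print("\n".join(lineage))
```

A portal could render the same traversal as a graph; the text form here is only meant to show that methods, parameters, and source data are all reachable from the final product.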

This capability will enable researchers to cite data products in the science literature with links to their full data and service provenance.

Water vapor and cloud observations from the current generation of NASA sensors, especially those on the A-Train, cover a wide range of atmospheric scales and a wide class of phenomena. Our proposed system will improve knowledge of NASA's Earth science data quality and of the production legacy of multi-sensor, multi-decadal water vapor data records. The size, heterogeneity, and complexity of global-scale, long-term climate change data often require more complex data processing, whose production legacy must be tracked and preserved for traceability and scientific justification.

We will integrate Web Service-based tools for multi-platform data provenance tracking (such as the "Karma Tool for Provenance Collection and Storage") into existing data production environments. Generating a merged multi-decadal climate data record of water vapor measurements may require several different processing environments. We will develop generic client plugins in MATLAB, IDL, Python, and C/C++ that can be easily integrated into existing code and that communicate via Web Services with the provenance collection tool.
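To give a sense of how a Python client plugin might attach to existing code, the sketch below wraps a processing function in a decorator that reports each call as a JSON event. Everything here is an assumption made for illustration: the `track_provenance` decorator, the event fields, and the `send_to_collector` transport are hypothetical, and the transport only appends to a local list rather than calling the Karma tool or any real Web Service.

```python
import functools
import json
import time
from typing import Any, Callable, List

# Stand-in for the network: payloads are collected locally so the sketch
# runs offline. A real client would send each payload to the collection service.
SENT: List[str] = []


def send_to_collector(payload: str) -> None:
    """Hypothetical transport; replace with a Web Service call in practice."""
    SENT.append(payload)


def track_provenance(transport: Callable[[str], None] = send_to_collector):
    """Decorator factory that reports each call of the wrapped function."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            started = time.time()
            result = func(*args, **kwargs)
            event = {
                "step": func.__name__,
                "args": [repr(a) for a in args],
                "kwargs": {k: repr(v) for k, v in kwargs.items()},
                "started": started,
                "elapsed_s": time.time() - started,
            }
            transport(json.dumps(event))
            return result
        return wrapper
    return decorator


@track_provenance()
def regrid(values, grid_deg=1.0):
    """Placeholder for an existing processing routine."""
    return [v * grid_deg for v in values]


output = regrid([1.0, 2.0], grid_deg=0.5)
print(SENT[-1])
```

The appeal of a decorator is that the existing routine needs only one added line; the equivalent in MATLAB, IDL, or C/C++ would likely be an explicit call at the start and end of each processing step.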

To improve data knowledge and use, we will also develop a provenance web portal enabling users to track and visualize the product and processing information of the generated data products.