AQUA: Automated Data Query and Access for Large Earth Science Datasets

Principal Investigator (PI): Brian Wilson, NASA's Jet Propulsion Laboratory

A key challenge facing climate scientists is the difficulty of "scaling up" their statistical analysis to cover time periods of years to decades. If one wants to compare model grids from one or more models (e.g., atmosphere models) to Level 2 (L2) or L3 retrievals (of temperature, water vapor, aerosol optical depth, etc.) from multiple EOS instruments, or just inter-compare the instruments, millions of EOS granules need to be located and then staged onto disk in order to perform the analysis. Inevitably, the process must be repeated as the models or comparison algorithms are refined. Currently, the data must be "ordered" (staged onto disk at NASA's Distributed Active Archive Centers [DAACs]) by a human filling out a web form; order sizes are usually limited to a week or two at a time; and the response to the user comes via email. The EOS Clearinghouse (ECHO) provides services for space and time query, order entry, and automated delivery of granules via FTP push or pull. (Each of these service requests is forwarded to the appropriate data provider.) We propose to develop an ECHO client that will take the human out of the loop and enable transparent, machine-to-machine, automated data query and access to multiple EOS datasets on a large scale.

To accomplish this, we will develop a set of machine-callable Web Services, based on the Simple Object Access Protocol (SOAP), on top of the ECHO data query and order entry services. Together they will automate the following multi-step process (or Use Case):

1. Query ECHO for datasets (collections) that contain the physical variables of interest,
2. Query those datasets for granule IDs that satisfy the desired space/time constraints,
3. Locate granules already on-line at the DAACs or in the user's file cache,
4. or Order their staging onto disk via ECHO,
5. Fetch the granules using FTP URLs (or have them pushed),
6. or Access variables from the data files in place, using OPeNDAP URLs to subset,
7. Analyze the data by calling the scientist's data fusion algorithm,
8. Repeat for progressive time "chunks" until the desired multi-year period is covered.
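
The process above can be sketched as a simple loop over time chunks. This is only an illustrative outline, not the project's implementation: the Granule record and the query, order, fetch, and analyze callables are hypothetical stand-ins for the composite services, and no real ECHO or SciFlo API is called.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable, Iterator, List, Tuple


@dataclass(frozen=True)
class Granule:
    """Hypothetical record for one data granule (not an ECHO type)."""
    granule_id: str
    url: str
    online: bool  # True if already on disk at the DAAC or in the local cache


def time_chunks(start: date, end: date, days: int) -> Iterator[Tuple[date, date]]:
    """Yield consecutive half-open windows [lo, hi) that cover [start, end)."""
    if days < 1:
        raise ValueError("days must be at least 1")
    lo = start
    while lo < end:
        hi = min(lo + timedelta(days=days), end)
        yield lo, hi
        lo = hi


def run_workflow(
    start: date,
    end: date,
    chunk_days: int,
    query: Callable[[date, date], List[Granule]],
    order: Callable[[List[Granule]], None],
    fetch: Callable[[Granule], str],
    analyze: Callable[[List[str]], object],
) -> list:
    """Run query, locate/order, fetch, and analyze once per time chunk."""
    results = []
    for lo, hi in time_chunks(start, end, chunk_days):
        granules = query(lo, hi)                 # dataset and granule query
        offline = [g for g in granules if not g.online]
        if offline:
            order(offline)                       # request staging onto disk
        files = [fetch(g) for g in granules]     # FTP fetch or in-place access
        results.append(analyze(files))           # scientist's algorithm
    return results
```

Passing the services in as callables keeps the loop independent of how each one is implemented, so a stub used for testing and a real SOAP client could be swapped without changing the workflow.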

Each of our client services (query, locate, order, fetch, etc.) will be a composite service, which automates and hides the complexity of the multiple ECHO (SOAP) calls required to accomplish the task. Each service will have a simple SOAP interface described in the standard Web Services Description Language (WSDL), and will be published (callable) at multiple web servers. Once these services are available, they will be assembled into an automated workflow to perform the desired scientific analysis (steps 1-8). The Order/Fetch/Analyze/Repeat service cycle will automatically adapt the size of the time "chunk" to the disk space available for staging data (at the DAACs and at the client site). Using future Grid virtualization, the storage and compute resources required for a particular analysis job might be discovered and allocated on the fly, and paid for later on a utility bill.
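
One way to adapt the chunk size to the available disk space is sketched below. It is an assumption-laden illustration rather than the project's algorithm: the function names, the 80 percent safety margin, and the 30-day cap are all hypothetical, and the free space at a DAAC would have to come from some provider-side report that this sketch does not model.

```python
import shutil


def chunk_days_for_space(
    free_bytes: int,
    bytes_per_day: int,
    max_days: int = 30,
    safety_percent: int = 80,
) -> int:
    """Pick a chunk length (in days) whose data fits in the usable free space.

    Always returns at least 1; the caller must still check that a single
    day of data actually fits before ordering it.
    """
    if bytes_per_day <= 0:
        raise ValueError("bytes_per_day must be positive")
    usable = free_bytes * safety_percent // 100   # leave some headroom
    days = usable // bytes_per_day
    return max(1, min(max_days, days))


def local_free_bytes(path: str = ".") -> int:
    """Free bytes on the client-side file system that holds `path`."""
    return shutil.disk_usage(path).free
```

Because data must be staged at both ends, the tighter of the two limits would govern, for example `chunk_days_for_space(min(local_free_bytes(), daac_free_bytes), bytes_per_day)`, where `daac_free_bytes` is a hypothetical value reported by the data provider.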

Currently, a scientist wonders: Why can't I just push a button to (space/time) query, locate on-line, and/or automatically stage onto disk a year of Atmospheric Infrared Sounder (AIRS), Moderate Resolution Imaging Spectroradiometer (MODIS), and Multi-angle Imaging SpectroRadiometer (MISR) L2 or L3 products and then use them in an automated, repeatable science analysis? In the near future, this will be possible and she will only have to worry about the monthly bill.

Although the primary goal of this project is to develop automated machine-to-machine services, we will leverage existing SciFlo capabilities, and use the AJAX programming paradigm, to layer on top of the SOAP services a dynamic web browser interface for human users. The user will specify a simple "chain" of services in the dynamic GUI and then the desired workflow will execute automatically.
