Community Tools for Analysis of NASA Earth Observation System Data in the Cloud
Principal Investigator (PI): Anthony Arendt, University of Washington, Seattle
Data-intensive scientific workflows have reached a pivotal point at which traditional local computing resources can no longer meet the storage or computing demands of scientists. The Earth System Sciences (ESS) community is facing an explosion of data volumes: new datasets, sourced from models, in-situ observations, and remote sensing platforms, are being made available at volumes too large to store even at medium to large High Performance Computing (HPC) centers. NASA has estimated that by 2025 it will be storing upwards of 250 Petabytes (PB) of its data using commercial cloud services (e.g. Amazon Web Services [AWS]). Availability of these data in cloud environments, co-located with a wide range of computing resources, will revolutionize how scientists use these datasets and provide opportunities for important scientific advancements. Fully leveraging these opportunities will require new approaches to the way the ESS community handles data access, processing and analysis.
At present, tools for working with these datasets consist of convenient interfaces for discovering and downloading data (e.g. Earthdata Search) from individual Distributed Active Archive Centers (DAACs). We anticipate that the transition to cloud storage for many of these DAACs will bring immense opportunities and specific challenges to researchers. Our proposal will facilitate the ESS community's transition into cloud computing by developing technologies that build on existing open-source tools (e.g. Python, Jupyter) and integrate with the growing Pangeo ecosystem. These technologies will be deployable on the commercial cloud infrastructure where NASA's Earth Observing System Data and Information System (EOSDIS) is anticipated to be stored.
Our first task will be to deploy a scalable cloud-based JupyterHub on AWS for community use. JupyterHub is a multi-user, multi-language interactive computing environment that facilitates open-ended, exploratory analysis and data visualization. Content ('notebooks') developed on JupyterHub is both functional and fluid, in the manner of an 'executable paper' that combines data, processing and interpretation: a necessary departure from traditional publication as a sequence of static artifacts.
Our second task will be to integrate existing NASA data discovery tools with cloud-based data access protocols. While existing data discovery tools, such as the Common Metadata Repository (CMR) and Global Imagery Browse Services (GIBS), provide convenient access to dataset metadata, navigating the access, retrieval, and processing steps for these datasets is left to individual users. We will develop an advanced Python application programming interface (API) that leverages high-level tools like Xarray and Dask, allowing scientists to accelerate their analysis. Integration of this API with the Pangeo ecosystem will give its users access to cutting-edge scientific tools for pre-processing, regridding, machine learning, and visualization.
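As a rough illustration of the discovery step such an API would wrap, the sketch below builds a granule-search request against the public CMR search endpoint. It is a minimal sketch under stated assumptions, not the proposed API: the helper name build_granule_query is hypothetical, and the collection short name, time range and bounding box are placeholder values chosen only for the example. No network request is made; in practice the data links returned by such a query would be handed on to tools like Xarray and Dask.

```python
from urllib.parse import urlencode

# Public CMR granule-search endpoint (JSON response format).
CMR_GRANULE_SEARCH = "https://cmr.earthdata.nasa.gov/search/granules.json"


def build_granule_query(short_name, start, end, bbox, page_size=10):
    """Return a CMR granule-search URL.

    Hypothetical helper for illustration; not part of any existing library.
    start/end are ISO 8601 UTC timestamps; bbox is (west, south, east, north)
    in decimal degrees.
    """
    params = {
        "short_name": short_name,
        "temporal": f"{start},{end}",
        "bounding_box": ",".join(str(v) for v in bbox),
        "page_size": page_size,
    }
    return f"{CMR_GRANULE_SEARCH}?{urlencode(params)}"


url = build_granule_query(
    "NLDAS_FORA0125_H",            # example collection short name (assumed)
    "2019-01-01T00:00:00Z",
    "2019-01-02T00:00:00Z",
    (-125.0, 45.0, -116.0, 49.0),  # roughly Washington State
)
print(url)
```

Keeping the query construction separate from the request itself makes the discovery step easy to inspect before any data are transferred.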
Our third task will build on this API to provide an advanced, cloud-optimized framework for remote data retrieval. Our approach to a data retrieval system goes beyond simple slice-and-download operations (e.g. Open-source Project for a Network Data Access Protocol [OPeNDAP]) by also providing routine server-side processing.
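The difference between slice-and-download and server-side processing can be sketched with a toy example. Everything here is an assumption made for illustration: the small in-memory grid stands in for a remote dataset, and the two functions are hypothetical stand-ins for what a retrieval service might do, not part of OPeNDAP or of the proposed framework.

```python
import json

# Toy 4 x 6 grid standing in for one time step of a remote gridded dataset.
GRID = [[float(row * 6 + col) for col in range(6)] for row in range(4)]


def slice_and_download(rows, cols):
    """Conventional subsetting: the server returns every cell in the window."""
    return [r[cols] for r in GRID[rows]]


def server_side_mean(rows, cols):
    """Server-side processing: reduce the window remotely, return one number."""
    window = [v for r in GRID[rows] for v in r[cols]]
    return sum(window) / len(window)


rows, cols = slice(1, 3), slice(2, 5)
subset = slice_and_download(rows, cols)
mean = server_side_mean(rows, cols)

# Compare the size of what would travel over the network in each case.
print(len(json.dumps(subset)), "bytes for the raw subset")
print(len(json.dumps(mean)), "bytes for the reduced result")
```

With real satellite archives the window may span terabytes, so returning only the reduced result is where the saving becomes significant.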
We will demonstrate the use of these tools with several datasets including the North American Land Data Assimilation System (NLDAS), Gravity Recovery and Climate Experiment (GRACE), and Sentinel-1 synthetic aperture radar (SAR). The example applications will serve as templates for the broader community and as real-world test cases for evaluating the cloud services and applications we develop.
We also propose to help accelerate a shift in the ESS culture toward cloud computing by providing short but intensive training opportunities. Our work will provide new ways for scientists to collaborate and make full use of NASA satellite datasets.
Last Updated: Jun 11, 2019 at 9:50 AM EDT