Over the next five years, NASA's Earth Observing System Data and Information System (EOSDIS) archive will grow roughly fourfold, from 65 petabytes (PB) to more than 250 PB. This growth will be driven primarily by two upcoming high-data-volume missions: the Surface Water and Ocean Topography (SWOT) mission and the NASA-Indian Space Research Organization (ISRO) Synthetic Aperture Radar (NISAR) mission. These satellite missions will measure some of the planet's most complex processes in unprecedented detail, but the volume of data they generate may be challenging for researchers to analyze.
Harmony: Subset, Regrid, and Reproject Data in the Cloud
SWOT will generate 20 terabytes (TB) of high-resolution data on Earth's surface water resources every day. At this volume, downloading the data is impractical for researchers who want to do timely analyses. For example, storing a single day’s worth of global SWOT data would take 20 laptops, each with 1 TB of storage. At an average download speed of 25 megabits per second, each of those 20 computers would need almost four days to download its terabyte. Once NISAR is operational, it is expected to produce 86 TB of data per day. The time and energy needed to download this much data will push researchers to change how they do science.
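The download estimate above can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 megabit = 10^6 bits) and a connection that sustains its full rated speed; the function name `download_days` is purely illustrative.

```python
SECONDS_PER_DAY = 86_400
BITS_PER_BYTE = 8
BYTES_PER_TB = 10**12          # decimal terabyte (assumption)
BITS_PER_MEGABIT = 10**6       # decimal megabit (assumption)


def download_days(terabytes, megabits_per_second):
    """Return the days needed to download `terabytes` at a sustained rate."""
    total_bits = terabytes * BYTES_PER_TB * BITS_PER_BYTE
    seconds = total_bits / (megabits_per_second * BITS_PER_MEGABIT)
    return seconds / SECONDS_PER_DAY


# One laptop pulling its 1 TB share at 25 megabits per second.
per_laptop_days = download_days(1, 25)

# A single connection pulling a full day of SWOT data (20 TB).
single_connection_days = download_days(20, 25)

print(f"1 TB at 25 Mbps:  {per_laptop_days:.1f} days")
print(f"20 TB at 25 Mbps: {single_connection_days:.1f} days")
```

Under these assumptions each laptop needs about 3.7 days for its terabyte, which matches the "almost four days" figure, while a single connection would need roughly 74 days to fetch one day of SWOT data.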
To make it easier for researchers to work with these Big Data collections, EOSDIS is migrating its 50 most popular datasets to the Earthdata Cloud in 2022. In addition to the upcoming SWOT and NISAR data, the entire EOSDIS data collection is on track to be migrated over the next two to three years. This co-location of large volumes of NASA data will enable researchers to analyze data directly in the cloud. Cloud computing streamlines scientific workflows, simplifies collaboration, and reduces computing time and cost by placing data next to high-performance computing resources.
NASA’s Earth Science Data Systems (ESDS) Program is innovating to ensure the research community has open access to the tools, software, and cyberinfrastructure needed to efficiently analyze large datasets in the cloud. One such tool is Harmony, NASA’s Earthdata Cloud Services System.
Single Access Point for the Earthdata Cloud
Harmony allows users to produce analysis-ready data by subsetting, reprojecting, and converting data to a cloud-optimized format. By subsetting data to an area of interest and a temporal extent, a user can minimize the size of their data request, reducing the time it takes to generate analysis-ready data. These data can then be accessed in the cloud (using Simple Storage Service, or S3, links) or downloaded to a local computer. These data reduction services, as they are called, allow users to access and download only the data they need, saving the time, energy, and money it takes to move large files around.
In the past, if users wanted to transform NASA Earth science data, they would use data transformation services tied to individual EOSDIS Distributed Active Archive Centers (DAACs). But with the Earthdata Cloud, data from all the DAACs will be archived in one cloud-based archive; Harmony provides a single access point for these data.
Harmony allows users to reproject data to different Coordinate Reference Systems (CRS) so that the data are ready to incorporate into analyses. In addition, users can convert NetCDF data into cloud-optimized formats such as Zarr and Cloud Optimized GeoTIFF (COG). Zarr is an open-source library for storing multidimensional arrays with attributes and dimensions similar to NetCDF4. Cloud-optimized formats such as Zarr and COG allow researchers to quickly process large, complex datasets, reducing the amount of time it takes to develop new insights.
How to Use Harmony
You may not realize it, but Earthdata Search already uses Harmony to transform certain cloud-based data. There are also other ways to use Harmony services in scientific workflows. Harmony can be accessed directly through its application programming interface (API); documentation for the API can be found on the Earthdata Harmony landing page.
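As an illustration of what an API request can look like, the sketch below assembles a spatial and temporal subsetting URL. It assumes the OGC API Coverages style of endpoint used in public Harmony tutorials (`/{collection}/ogc-api-coverages/1.0.0/collections/{variable}/coverage/rangeset` with repeated `subset` parameters); the helper name `build_subset_url` and the collection ID are hypothetical placeholders, and the Harmony API documentation remains the authority on the actual parameters.

```python
from urllib.parse import urlencode

HARMONY_ROOT = "https://harmony.earthdata.nasa.gov"


def build_subset_url(collection_id, lat_range, lon_range,
                     time_range=None, variable="all", output_format=None):
    """Assemble a subsetting request URL (illustrative helper, not part of Harmony)."""
    path = (f"{HARMONY_ROOT}/{collection_id}/ogc-api-coverages/1.0.0"
            f"/collections/{variable}/coverage/rangeset")
    # Each spatial or temporal constraint is sent as its own `subset` parameter.
    params = [
        ("subset", f"lat({lat_range[0]}:{lat_range[1]})"),
        ("subset", f"lon({lon_range[0]}:{lon_range[1]})"),
    ]
    if time_range is not None:
        params.append(("subset", f'time("{time_range[0]}":"{time_range[1]}")'))
    if output_format is not None:
        params.append(("format", output_format))
    # urlencode percent-encodes parentheses, colons, and quotes.
    return f"{path}?{urlencode(params)}"


# Placeholder collection ID; substitute a real concept ID from Earthdata Search.
example_url = build_subset_url(
    "C0000000000-EXAMPLE",
    lat_range=(36.5, 40.25),
    lon_range=(-77.5, -74.75),
    time_range=("2021-01-01T00:00:00Z", "2021-01-02T00:00:00Z"),
)
print(example_url)
```

Sending the request itself requires Earthdata Login credentials, so this sketch stops at building the URL.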
Harmony developers have also created an installable Python package, called Harmony-py, that can be used to request data. An example of how to use the Harmony-py package with a Jupyter Notebook was demonstrated at the Openscapes 2021 Cloud Hackathon and has been posted on the hackathon website.
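A request made with Harmony-py might look roughly like the following. This is a minimal sketch that assumes the interface shown in the package's public README (`Client`, `Collection`, `Request`, `BBox`, `submit`, `wait_for_processing`, `download_all`) and Earthdata Login credentials available through a `.netrc` file or environment variables. The wrapper function `download_subset` is hypothetical, and the collection ID must be supplied by the caller.

```python
import datetime as dt


def download_subset(collection_id, bbox, start, stop, directory="."):
    """Request a spatial/temporal subset and download the results.

    bbox is (west, south, east, north) in degrees; start and stop are datetimes.
    """
    # Imported here so the sketch can be loaded without harmony-py installed.
    from harmony import BBox, Client, Collection, Request

    client = Client()  # assumes credentials in ~/.netrc or the environment
    request = Request(
        collection=Collection(id=collection_id),
        spatial=BBox(*bbox),
        temporal={"start": start, "stop": stop},
    )
    if not request.is_valid():
        raise ValueError("Harmony request parameters are not valid")

    job_id = client.submit(request)        # start the asynchronous job
    client.wait_for_processing(job_id)     # block until the job finishes
    futures = client.download_all(job_id, directory=directory)
    return [future.result() for future in futures]  # local file paths


# Example call (commented out because it needs credentials and a real collection ID):
# files = download_subset(
#     "<collection concept ID>",
#     bbox=(-77.5, 36.5, -74.75, 40.25),
#     start=dt.datetime(2021, 1, 1),
#     stop=dt.datetime(2021, 1, 2),
# )
```

Because the subsetting happens on the Harmony side, only the reduced files are transferred to the local machine.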
Harmony services are still under development by a community of developers within NASA’s EOSDIS. These services are currently available for 119 datasets, which are listed on the Openscapes Harmony tutorial page. Community development is a key component of NASA’s Open-Source Science Initiative (OSSI) because it helps reduce the barriers to reusing code and sharing domain knowledge. Keep up to date on the latest Harmony developments on the Harmony GitHub page.