Earthdata Cloud Evolution
With the impending arrival of new, high-data-volume missions, the need to effectively archive and process significantly larger data volumes will require new data management technologies and architectures that are more cost-effective, flexible, and scalable than traditional on-premises systems. To meet these needs, the Earth Science Data Systems (ESDS) Program has adopted a strategic vision to develop and operate multiple components of the Earth Observing System Data and Information System (EOSDIS) in a commercial cloud environment.
The Earthdata cloud migration breaks new ground both as the first and largest cloud project within NASA and because it requires transforming security policies and procedures that were written for on-premises systems so that they apply to the cloud environment.
The Earthdata Cloud architecture became operational in July 2019, and NASA's Global Hydrology Resource Center Distributed Active Archive Center (GHRC DAAC) was selected as the first DAAC to move operations to the cloud. GHRC DAAC is now operating in the cloud in parallel with the on-premises system, and data migration is nearly complete, with full cutover expected in 2020.
A dataset prioritization activity is underway to determine the schedule and priority of onboarding other DAACs and NASA mission data into the Earthdata Cloud, based on criteria such as interoperability, mobility of data, turnaround time, and science value. The first cohort of data sets, from the Oak Ridge National Laboratory DAAC (ORNL DAAC), Goddard Earth Sciences Data and Information Services Center (GES DISC), and Land Processes DAAC (LP DAAC), has been selected, and onboarding will begin in the first quarter of 2020. Going forward, additional cohorts will be selected on a yearly basis.
As Earth science data and computations move into the cloud, scientists will be able to do more with these data than ever before, including new science and new applications of large-scale analytics. The Earthdata Cloud will create opportunities for innovation around new services, such as sequencing data to support machine learning and artificial intelligence.
The Earthdata Cloud will also improve the efficiency of data systems operations, increase user autonomy, maximize flexibility, and offer shared services and controls. Researchers and commercial users of NASA Earth Science data will have increased opportunity to access and process large quantities of data quickly, allowing new types of research and analysis. Data that was previously geographically dispersed will now be accessible via the cloud, saving time and resources. Moving data to the cloud brings numerous benefits for both data users and EOSDIS, including:
- Easy access: Data users will be able to access data directly in the cloud, removing the need to download volumes of data for use.
- Rapid deployment: With an established EOSDIS cloud platform, data users can bring their algorithms and processing software to the cloud and work directly with the data in the cloud, simplifying procurement and hardware support while expediting science discovery.
- Scalability: The size and use of the archive can expand easily and rapidly as needed.
- Flexibility: Mission needs can dictate options for selecting operating systems, programming languages, databases, and other criteria to enable the best use of mission data.
- Reduced redundancy: The use of a common infrastructure with cloud-native services will reduce redundant tools and services, enable sharing, and enforce the use of community standards as well as uniform policies and processes.
- Cost effectiveness: EOSDIS and NASA pay only for the storage and services actually used. Along with scalability benefits, this allows the amount of storage or services to be continually adjusted to ensure that data and services are effectively provided at the lowest possible cost to NASA and EOSDIS.
By 2022, the ingest rate of data into the EOSDIS archive is projected to grow to as much as 47.7 PB per year, according to estimates from ESDS. As this ingest rate increases, the volume of data in the EOSDIS archive is also expected to grow, from nearly 32 PB today to more than 37 PB by 2020; by 2025, the archive is expected to hold more than 246 PB.
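To put these projections in perspective, the following minimal sketch works out the growth they imply. It treats the quoted "more than" figures as exact values and assumes a five-year span from 2020 to 2025; both are simplifications for illustration, not ESDS estimates, and the function names are hypothetical.

```python
# Figures quoted in the text above, in petabytes (PB).
VOLUME_2020_PB = 37.0            # archive volume projected for 2020
VOLUME_2025_PB = 246.0           # archive volume projected for 2025
PEAK_INGEST_PB_PER_YEAR = 47.7   # projected ingest rate by 2022


def compound_annual_growth(start: float, end: float, years: int) -> float:
    """Return the constant yearly growth rate that turns start into end."""
    return (end / start) ** (1.0 / years) - 1.0


def average_yearly_increase(start: float, end: float, years: int) -> float:
    """Return the mean absolute increase per year."""
    return (end - start) / years


# Assumption: the projections are end-of-year values five years apart.
years = 2025 - 2020
growth = compound_annual_growth(VOLUME_2020_PB, VOLUME_2025_PB, years)
increase = average_yearly_increase(VOLUME_2020_PB, VOLUME_2025_PB, years)

print(f"Implied compound growth: {growth:.1%} per year")
print(f"Average increase: {increase:.1f} PB per year")
print(f"Within projected peak ingest rate: {increase <= PEAK_INGEST_PB_PER_YEAR}")
```

Under these assumptions the archive would grow by roughly 46 percent per year, an average of about 42 PB per year, which is consistent with an ingest rate of up to 47.7 PB per year.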
This anticipated growth in both the data ingest rate and the overall archive volume poses new challenges for distributing and analyzing data that are currently stored on, and disseminated from, physical on-premises servers at the EOSDIS DAACs.
To address these challenges, ESDS is implementing the Cumulus project to move data to the cloud for processing and dissemination. Cumulus is integrated with the NASA-Compliant General Application Platform (NGAP), a custom-built, cloud-optimized platform that provides highly flexible cloud-native infrastructure, NASA-compliant IT security controls, networking services, and business cost control in Amazon Web Services (AWS). Together, Cumulus and NGAP make up the Earthdata Cloud.
Bringing the data archive together in the cloud will, for the first time, place NASA Earth observation (EO) data “close to compute” and improve management and accessibility of these data while also expediting science discovery for data users. Having EOSDIS data in the cloud will not change existing methods of user interaction with these data; it will, however, offer new methods of access that are not possible with on-premises platforms.
The DAACs will still serve as gateways to these data and provide a wide range of support services for data users. It’s likely that EOSDIS data users will not notice any difference in their interactions with the DAACs when searching for and downloading data stored in the cloud. What data users will notice is improved access to data and the ability to more efficiently utilize larger data sets for a broader range of research.
The cloud system must be able to provide services in these key areas:
- Data acquisition from data providers (such as NASA science teams).
- Data ingest (including validation and processing).
- Data archive: The system must preserve and protect NASA EO data.
- Data management: The system must support the development and execution of information lifecycle needs for NASA mission-based Earth science data sets.
- Data ingest: The system must support multi-mission, multi-discipline data ingest.
- Data storage and distribution, including disaster recovery: The system must support distribution of data, subsetting, and visualization, and must be adaptable to future technologies.
- Metadata: The harvest, creation, and publication of dataset metadata to the Common Metadata Repository (CMR).
- Metrics: Publication of metrics to the ESDIS Metrics System (EMS), which collects and organizes various metrics from the DAACs and other data providers.
As the technical and architectural aspects of this project evolve, the needs of data users remain a top priority, and the innovative technologies being developed to archive and disseminate data in the cloud may soon enable data users to do more with this valuable resource than ever before.
Data in the Cloud
GHRC DAAC is leading activities to develop the procedures necessary to effectively carry out the DAACs’ mission in the cloud. More than 400 datasets at GHRC DAAC, comprising 32 terabytes, have been migrated. GHRC DAAC was projected to have all of its holdings in the cloud by the end of fiscal year 2019.
High data volume missions such as Surface Water Ocean Topography (SWOT) and NASA-Indian Space Research Organisation Synthetic Aperture Radar (NISAR) present opportunities to further develop and test systems and architectures that will provide improved data management and user access for many ongoing Earth science missions.
The upcoming NISAR mission, with launch currently scheduled for 2022, is expected to add as much as 85 terabytes (TB) of data each day to the EOSDIS archive. Over its scheduled three-year mission, NISAR is expected to generate as much as 140 petabytes (PB) of data. To accommodate this volume, the Alaska Satellite Facility Distributed Active Archive Center (ASF DAAC) is working collaboratively with the Jet Propulsion Laboratory (JPL) to test and prototype ways of archiving and distributing NISAR data using the commercial cloud. This three-year project, called Getting Ready for NISAR (GRFN), began in 2016. GRFN successfully demonstrated key components for efficiently handling NISAR volumes in a commercial cloud, and in 2019 ASF DAAC began building out a Cumulus instance for NISAR data. ASF DAAC is also archiving and distributing Sentinel-1 data from the European Commission’s (EC) Copernicus Program into NASA-managed cloud accounts.
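As a rough check on the scale of the NISAR figures quoted above, the sketch below accumulates the daily rate over the mission and, conversely, derives the average daily rate that the stated total would imply. The 365-day year and decimal units (1 PB = 1,000 TB) are assumptions; the text does not state how the two figures were derived or what each one counts.

```python
# Figures quoted in the text above.
DAILY_TB = 85.0          # up to 85 TB added to the archive each day
MISSION_YEARS = 3        # scheduled mission length
STATED_TOTAL_PB = 140.0  # up to 140 PB generated over the mission

# Assumptions made only for this illustration.
DAYS_PER_YEAR = 365      # leap days ignored
TB_PER_PB = 1000.0       # decimal (SI) units

mission_days = MISSION_YEARS * DAYS_PER_YEAR

# Total archived if the daily rate held for the whole mission.
cumulative_pb = DAILY_TB * mission_days / TB_PER_PB

# Average daily rate needed to reach the stated mission total.
implied_daily_tb = STATED_TOTAL_PB * TB_PER_PB / mission_days

print(f"85 TB/day over {mission_days} days: {cumulative_pb:.1f} PB")
print(f"140 PB over {mission_days} days: {implied_daily_tb:.1f} TB/day")
```

Both quantities are upper bounds ("as much as") and may not measure the same thing, so they need not agree exactly. Either way, the result is tens of petabytes per year from a single mission.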
NASA has established strategic partnerships to help deliver on the potential of cloud computing.
NASA’s Office of the Chief Information Officer has chosen Amazon Web Services (AWS) as the source of general-purpose cloud services for NASA, and EOSDIS and the DAACs are building and testing prototypes to ensure that EOSDIS data and services will work successfully on this commercial cloud platform.
To facilitate research and analysis of cloud data, the ESDS IMPACT project has executed a Space Act Agreement with Google, LLC. The purpose of the agreement is to address data discovery, access, and use challenges specifically related to large volumes of NASA science data. Tasks include, but are not limited to: transferring and storing large volumes of data, improving data discovery, demonstrating discoveries possible through big data, and capturing lessons learned. IMPACT team members will work closely with Google and NASA’s Frontier Development Lab (FDL) to discuss future machine learning and artificial intelligence collaborations. FDL is an applied AI research partnership with Google that focuses on interdisciplinary problem solving to develop future AI applications in data-intensive areas.
The transparent and extendable open source processing framework being developed will adhere to the NASA policy of providing free and open access to data. Under NASA’s full and open data policy, all NASA mission data (along with the algorithms, metadata, and documentation associated with these data) must be freely available and provided to the public as soon as possible following a checkout period to ensure data accuracy and validity; there is no period of exclusive data use.
Last Updated: Mar 24, 2020 at 12:36 PM EDT