Global Hydrology Resource Center Distributed Active Archive Center (GHRC DAAC) Moves to the Earthdata Cloud
By Adam Voiland
December 5, 2019
NASA's Global Hydrology Resource Center Distributed Active Archive Center (GHRC DAAC) at NASA's Marshall Space Flight Center archives information about lightning, tropical cyclones, and other types of hazardous weather collected by NASA. Now GHRC DAAC is the first of NASA's twelve Distributed Active Archive Centers (DAACs) to migrate its data to a commercial cloud provider.
"We completed the migration in October 2019," said Manil Maskey, GHRC DAAC manager, noting that the data is newly available via Amazon Web Services (AWS). "This is a major accomplishment for us," he said. "Setting up the infrastructure to comply with NASA security policies was no small task, but we have documented the challenges and how we overcame them. We think we have mapped out a clear path that will make the process easier for other DAACs."
One of the reasons NASA decided GHRC DAAC would be the first DAAC to migrate to the NASA Earthdata Cloud was the diversity of its data holdings. It manages data from satellites, aircraft campaigns, in-situ sensors, and computer models. It is also among the smallest DAACs, which made the migration process "low-risk and high reward," noted Maskey.
"This is the beginning of a new era for us," said Mark McInerney, NASA's Earth Science Data and Information System (ESDIS) Project Deputy Project Manager. "When we get this system fully built, users won't have to spend months of time downloading and organizing data from geographically distributed data centers. The data will be co-located and users will be able to run research algorithms directly on the cloud."
GHRC DAAC has been working on the migration for a few years. It started the process in 2016 with a prototype study of a cloud-based ingest system called Cumulus. The formal migration process began in November 2017. "Now we're running parallel operations on-premise and on cloud," said Maskey. "Users downloading data shouldn't notice any difference," he added.
The GHRC DAAC migration is one of the first steps in a larger plan to migrate much of NASA's Earth science data holdings to the cloud in the coming years. The move was spurred by a tsunami of incoming data from two new missions that is poised to flood the Earth science data processing and distribution system.
The launches of the Surface Water and Ocean Topography (SWOT) mission in late 2021 and the NASA-Indian Space Research Organisation Synthetic Aperture Radar (NISAR) mission in early 2022 are projected to increase the system's ingest rate more than tenfold within two years. "It stops being practical to do this the way we have done it in the past at data volumes of that scale," said McInerney. "It made more sense to put everything in one 'data-lake' on the cloud."
GHRC DAAC is the first DAAC to transition, but it will soon be followed by several others. The Alaska Satellite Facility DAAC (ASF DAAC) at the University of Alaska, Fairbanks, is in the process of migrating SAR data, and several other DAACs are starting the process of migrating individual datasets. Data from NISAR and SWOT will flow into the data lake as well when the missions are operational.
Devising an approach that suits a government agency and works within an AWS environment posed a unique set of challenges. "We had to develop innovative solutions to make sure that the cloud system met all the same security policies and procedures that we follow for on-premises systems," McInerney noted. In several cases, this required working closely with NASA's Office of the Chief Information Officer on solutions. Because the cloud is pay-as-you-go, for instance, the ESDIS team had to build new tools to ensure that the system cannot violate the Antideficiency Act, a law that prohibits the government from spending money that has not been appropriated.
Christopher Lynnes, an ESDIS Project systems architect, underscored how the move to the cloud will benefit the research community by recounting hurdles he faced recently when trying to download and analyze temperature data for an upcoming presentation. Lynnes noted that it can currently take weeks to download, reformat, and restructure the data needed for a time-series analysis, a task that will eventually be completed in minutes.
"We're looking at things like data cubes, subsetting, and regridding on the cloud," he said. "With some missions like Terra and Aqua having archives that are nearly 20 years old, it's become critical to offer more than just a single snapshot of time." He also expects to see a proliferation of machine learning and deep learning research algorithms premised on the data being co-located on the cloud.
"But we are not going into this blindly. It is not a whim," said Katie Baynes, the manager for Cumulus. "We are doing this in as deliberate and intentional a manner as possible. NASA will continue to be the steward of the data, regardless of where it is stored. It will be backed up multiple times. It’s going to continue to be free and open to everybody, just as it always has been."
Last Updated: Mar 5, 2020 at 9:41 AM EST