The anticipated growth in the volume of data in NASA's Earth Observing System Data and Information System (EOSDIS) poses new challenges for the Distributed Active Archive Centers (DAACs) tasked with archiving, curating, and distributing them.
To address these challenges, the DAACs are moving their data holdings from on-premises archives to the cloud. The transfer of EOSDIS data to the cloud is currently taking place, but when complete, the result will be vast collections of Earth observation data that are “close to compute,” meaning users will find it easier to discover, access, and manage data, as well as analyze large datasets more efficiently, thereby enabling a broader range of research.
However, as Physical Oceanography DAAC (PO.DAAC) Project Scientist Dr. Jinbo Wang acknowledges, moving DAAC-held data to the cloud means that members of the Earth observation science community will have to move to the cloud with it.
“After unlocking the potential of cloud computing, the next level is moving the code to the data,” said Wang. “Here the majority of the science community is far behind. There are some forerunners or early adopters, such as the members of the Pangeo community, who work to develop software and infrastructure to enable Big Data geoscience research. But there is still a long way to go to educate the majority of the scientific community and bring them into the cloud computing field.”
To do that, Wang and his PO.DAAC team and science colleagues have started a coding club to promote the use of cloud computing for scientific research. Participants include scientists covering the Sentinel-6 Michael Freilich satellite mission, the Estimating the Circulation and Climate of the Ocean (ECCO) project, the Gravity Recovery and Climate Experiment Follow-On (GRACE-FO) satellite mission, the Surface Water and Ocean Topography (SWOT) satellite mission, the Salinity and Stratification at the Sea Ice Edge (SASSIE) airborne mission, and the Group for High Resolution Sea Surface Temperature (GHRSST) project, as well as members of the PO.DAAC engineering team and NASA's Jet Propulsion Laboratory (JPL) Artificial Intelligence and Analytics Group.
The club, which began meeting in March of this year, assembles weekly, and organizers regularly invite cloud experts to answer members’ questions and provide solutions to members’ computing problems. However, the club’s meetings aren’t led by technical experts.
“The PO.DAAC project science team is setting the agenda because we wanted to learn from the science community’s perspective,” said Wang. “So the experience is more relatable to and shareable with the community, and the outcome is more practical and useful for scientific applications.”
In fact, ensuring the meetings are practically useful is Wang’s primary objective.
“Every day we use our laptops and never stop to think how the laptop works,” said Wang. “We should aim for the same level of comfort and familiarity with cloud computing so that scientists can focus on their work rather than admiring the technology. Once reaching that stage, we can say that the community has adopted cloud computing.”
To reach that level of comfort and familiarity, the coding club helps members of the PO.DAAC science community understand the basics of Amazon Web Services (AWS), the cloud platform NASA uses to ingest, archive, distribute, and manage the Earth science data in NASA’s EOSDIS collection. It also helps researchers determine whether using the cloud is the best approach for their work.
“To begin, there needs to be a rigorous analysis of cloud computing and its costs from the scientific perspective,” said Wang. “Is it worth it? Is it cost effective to embrace this? No matter how wonderful cloud computing is, we will not get everybody to use the cloud. Our goal is to share our own experience and inform the community about the advantages and disadvantages in using the cloud and help them make informed decisions.”
If researchers decide it is worth using the cloud, they then learn the practices associated with getting started there, such as how to set up a collaborative workspace.
“These are basic steps for a cloud engineer, but for scientists it's a giant leap,” said Wang. “We were often lost in the jargon and the acronyms and desperately needed a translator when listening to cloud experts; we do not know enough about the cloud infrastructure to troubleshoot very elementary problems by ourselves or even ask the right questions. Many universities and institutions now start to provide cloud support for their community, but a lot of the researchers that I interact with are still wandering to find the start line. All they need is a gentle nudge toward that line, which might be just three steps away.”
A Successful Case
That was certainly true for Dr. Ian Fenty, Principal Investigator with the ECCO Consortium’s team at JPL, who, despite feeling comfortable around most computer systems and being very competent with complex nonlinear physics and numerical systems, didn’t know where or how to begin.
“Operating in the cloud environment is not that hard once you know how to do it,” he said. “It's just that no one knew what those steps were. The purpose and motivation of the coding club was to figure things out together.”
Fenty described his understanding of cloud computing prior to joining the club as “zero.” However, he felt compelled to participate in the coding club because he wanted to make use of the latest big datasets from NASA, and because a complete collection of the latest ECCO products is distributed by PO.DAAC.
“The ways we’ve been managing and grappling with large datasets are no longer tenable,” he said. “Datasets are much larger and cannot be manipulated on single workstations or even small clusters. They're growing so large that even on supercomputers it’s become a real challenge to process and analyze them. It was rumored that with cloud resources we might be able to get our hands into the data again in ways that lead us to new insights.”
Participating in the coding club provided Fenty and his colleagues with the information they needed to use those big datasets, and it removed some of the uncertainty about cloud computing in the process.