Data Chat: Kaylin Bugbee

NASA Earth science data are openly available to anyone for any purpose. As NASA data manager Kaylin Bugbee observes, open scientific data – and the growth of these data – is leading to a new, collaborative paradigm for scientific research.
Josh Blumenfeld

NASA’s Earth Science Data Systems (ESDS) Program defines open science as a collaborative culture enabled by technology that empowers the open sharing of data, information, and knowledge within the scientific community and the wider public to accelerate scientific research and understanding. For Kaylin Bugbee and her ESDS colleagues, open science brings tremendous potential for scientific discovery along with challenges for both data users and data providers.

In a recently published peer-reviewed article, Bugbee, an ESDS data manager and a member of NASA’s Interagency Implementation and Advanced Concepts Team (IMPACT) at NASA’s Marshall Space Flight Center in Huntsville, AL, and her coauthors, IMPACT manager Rahul Ramachandran and ESDS Program Executive Kevin Murphy, provide a broad look at the current state of open science. She and her coauthors observe that the combination of massive volumes of data and the technology to work with these data collaboratively is leading to new ways of pursuing scientific investigations. They also note that oversight is needed to ensure that large repositories of aggregated data, sometimes referred to as data lakes, don’t turn into data swamps.

What do we mean when we talk about open science and how is this leading to a paradigm shift in how science is conducted?

Science has been open in some measure for a long time. It really became more formalized with the introduction of the journal system in the late-16th and early-17th centuries, where the scientific community recognized that they needed a way to share their knowledge in a more transparent way. We’re seeing a paradigm shift now, often called Open Science 2.0, because of the advent of the internet and the increased accessibility [to large data collections].

To me, there are a number of different definitions or understandings of what open science is. I tend to adopt a broader definition of open science that includes three aspects. First, there’s open access to all aspects of research. This includes open access to data, software, and any information that comes out of research such as journals, blogs, and similar products. Second, there’s open access to the scientific process. This means being transparent with the process and also being open to different communities being part of the scientific process. We’re seeing this more and more, especially with citizen science activities. Finally, there’s the aspect of fostering a collaborative and inclusive process that is welcoming to everyone and open to everyone.

What are some of the challenges to making science open?

There are several that I would categorize as coming from a needed cultural change.

For example, scientists want to spend time doing science, and while they may recognize the value of making their data or code available, this is an added step that might take them away from conducting research. Some researchers also don’t realize that there is interest in the things that they make as products of their research, such as algorithms or code to analyze data.

Another challenge is working out the research rewards system. In planetary science, for example, researchers devote a huge chunk of their career to an instrument that is flown to a distant celestial body. It can take years for the instrument to get to the point of even collecting data. The scientist has a significant investment in their career just to get these data. It’s important to keep in mind the anxiety they might have about their research, their career, and the credit they receive for their work given the knowledge that the data they collect might be openly available as soon as possible not just to them, but to the world.

A further challenge is one of equity. For example, not everyone has reliable access to the internet or funding to support computing costs. While science may be more open, equity will be a challenge that needs to be addressed for certain groups.

ESDS talks about open science being a collaborative culture. How is open science leading to more collaborative science?

Image
MAAP is a virtual open and collaborative environment that leverages cloud technologies to facilitate open data use across aggregated data sets. Using MAAP, NASA and ESA are working together to make terrestrial biomass data and metadata from multiple missions and sources more interoperable across organizations. NASA MAAP image.

I’m working on a project called the Multi-Mission Algorithm and Analysis Platform, or MAAP, which is a really great example of collaborative open science in action. In the MAAP platform, we have brought together open data from NASA and ESA [European Space Agency] in a single location that both NASA and ESA scientists can use for their research. As we start doing these bigger analyses at scale, having data from all these disparate sources in a single location really makes it easier to get up to speed with running your science.

The other big thing about MAAP is that you can write code in a Jupyter environment and share the code back and forth in the platform. NASA scientists are sharing their [Jupyter] Notebooks with ESA scientists and they’re getting feedback and they’re sharing the data with each other. It’s speeding up the research process and it’s speeding up collaboration, and because we’re making all the data and the code and the infrastructure open, it makes it possible for scientists to efficiently collaborate.

Is it fair to say that working with Big Data likely will require big collaborations?

Yes. We’re already starting to see these collaborations happening on the MAAP project. MAAP is a really good example of bringing together large volumes of data from both NASA and ESA to provide a better understanding of terrestrial carbon dynamics and biomass. This problem requires an increase in both temporal observations and spatial coverage to really understand these questions. The more data that we can bring together from different organizations, the greater temporal and spatial coverage we can provide to help better conduct biomass research.

As science becomes more open, greater volumes of data will become available, requiring individual scientists to have the programming skills needed to work with and manage these data. The MAAP scientists are very interested in learning how to collaborate to work with Big Data. This is pretty exciting because effectively using Big Data by working together can lead to significant scientific outcomes.

For data providers, how can open science be balanced with the inherent challenges of managing very large data collections?

Image
During its planned three-year mission, NISAR is expected to generate as much as 140 PB of data. During its three-year mission, SWOT is expected to generate 23 PB of data. At the end of the 2020 Fiscal Year, NASA’s EOSDIS had a total archive volume of approximately 42 PB. JPL images.

In many ways, I think the challenges of managing Big Data collections require that we be more open and more transparent. We need an open, collaborative environment to have people reviewing the algorithms that we’re generating to produce data at scale and giving us feedback to help us understand what’s working and what isn’t.

There’s going to be so much data coming into NASA's EOSDIS [Earth Observing System Data and Information System] archive with the upcoming Surface Water and Ocean Topography [SWOT] and NASA/Indian Space Research Organisation Synthetic Aperture Radar [NISAR] missions. For these data to be used effectively we need open science collaborative platforms to ensure that we’re getting the most value from these amazing data assets.

You and your coauthors point out that while data can be made freely and openly available, the software necessary for working with these data is often copyrighted, making the free use of this software restricted unless the copyright owner grants a license for its use. How is NASA’s ESDS Program dealing with this issue?

I think ESDS is ahead of the curve on dealing with these issues. We require all our funded researchers to create open source software [OSS] for any code developed as part of the research process and we provide guidance on the permissive open licensing to assign to the code so that a scientist does not have to be an OSS expert to know what license to use. We also provide guidance on where and to what type of repository you should deliver that code. The one that’s most familiar is GitHub. We encourage the use of the NASA GitHub, but our intent is to encourage code to be placed in a recognized, established repository.

One big benefit of open source software is increased transparency and reproducibility. Providing the code in an open manner along with the algorithm theoretical basis documents [ATBDs] enables a scientist to understand where the data are coming from and the science and the math behind what is being generated. If a scientist wanted to, they could use the code and ATBDs to check the work and verify that they can get the same results. In addition to making code openly available, we are developing the Algorithm Publication Tool [APT] in order to make ATBDs standardized and easily discoverable to scientists and users.

We’re also speeding up the scientific process by making code openly available. We’re already seeing this in MAAP with scientists sharing biomass algorithms within the platform. Sometimes, as a scientist, you just need to take something that’s [openly available] and tweak a couple of variables to keep moving forward with an analysis. Reusing code can save valuable time for scientists.

What is the role of impact measurements and similar metrics in open science? You use the term altmetrics, which is a term I was not familiar with.

I see impact measurements as a way for scientists to better understand the impact of the data and code they are sharing. A scientist can look at a citation count and see how broadly a paper is being disseminated, but they need a similar system to show them the impact of their data and their code to help them to understand how broadly their work is being used.

Altmetrics acknowledges that research content goes other places than just in a journal. You can think of altmetrics as an equation with a number of variables – views in Facebook, citation references, etc. – that all have different weights. Combining these weighted variables gives you a score relating to the overall impact of the content.

The altmetrics that exist now don’t work very well for data, and I think it would be interesting to consider developing altmetrics for data systems, datasets, and code. In order to do this, we would need to build our own altmetric equation of what we think are the input variables of value to help us understand data and software use. These data and software-specific metrics will help us better communicate the value of making data and code openly available for use.
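
As a rough illustration of the weighted equation described above, the idea can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the signal names (citations, downloads, social media views), the weights, and the function name are all hypothetical and were chosen for illustration. They are not an established altmetric formula and not a metric used by ESDS.

```python
from typing import Mapping


def weighted_impact_score(counts: Mapping[str, float],
                          weights: Mapping[str, float]) -> float:
    """Combine usage signals into one score as a weighted sum.

    Signals that appear in `weights` but are missing from `counts`
    are treated as zero; signals without a weight are ignored.
    """
    return sum(weight * counts.get(name, 0) for name, weight in weights.items())


# Hypothetical weights, chosen only for illustration: a citation is
# assumed to count for more than a download or a social media view.
EXAMPLE_WEIGHTS = {
    "citations": 4.0,
    "downloads": 0.5,
    "social_media_views": 0.125,
}

# Hypothetical usage counts for a single dataset.
example_counts = {
    "citations": 12,
    "downloads": 340,
    "social_media_views": 2000,
}

score = weighted_impact_score(example_counts, EXAMPLE_WEIGHTS)
print(score)  # 12*4.0 + 340*0.5 + 2000*0.125 = 468.0
```

Deciding which signals belong in the equation for data and code, and how much weight each should carry, is the open question raised in the answer above; the sketch only shows the mechanics of combining them.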

Let’s talk about data swamps. You and your coauthors note that there is the potential for cloud-based data lakes to turn into data swamps filled with data of dubious quality. How can we avoid this?

This is a problem I worry about for the MAAP. In the MAAP, for example, we encourage users to openly share data. If I’m a scientist working in the MAAP and I create some data that I think are interesting, I easily can share these data with other MAAP users through a streamlined workflow we built.

Now, there is a potential danger with this data sharing, with a risk of users sharing data of questionable quality or that are poorly documented. However, there are a number of things that can be done to help mitigate this. One strategy that we are looking at is leveraging a data governance and management plan commonly used in industry. This type of plan documents the steps taken to manage data quality and integrity in a platform. One step taken in industry is the use of AI [artificial intelligence] and ML [machine learning] techniques to clean data, understand data, assess data, and annotate data. I think these are valuable tools for us to consider to help keep our data lake from becoming a data swamp.

It sounds like in the short term, it might take more time to develop these systems to oversee and manage open science and open data collections, but once these systems are established and running and accepted it could lead to more efficient science.

Yes, the goal and long-term vision for platforms like MAAP is to get tools in place to make data management processes more efficient to allow for more time to help people do their science.

As these open science policies and practices continue to develop, what are you most excited about over the next five to 10 years? Where do you see scientific exploration and discovery headed?

I am excited to see where we take interdisciplinary research. A flagship example of this at NASA is the Exoplanet Exploration Program Analysis Group [ExoPAG]. It’s an interdisciplinary group of people from astrophysics and planetary science with scientists also participating from heliophysics and Earth science. Context, information, and data are needed to work on these emerging exoplanet topics. I think having open data and open code – and making it easier for everyone to participate in the scientific process – is going to really accelerate these types of interdisciplinary research topics. I am very excited to see where the scientific community takes interdisciplinary research and what we will learn as a result.
