This blog post is a collaborative effort between Kaylin Bugbee, who leads the Science Discovery Engine project, one of NASA's Open-Source Science Initiative efforts, and Rahul Ramachandran, who leads the AI foundation model effort within NASA/IMPACT. For some time, IMPACT has been studying the effects of large language models on science, particularly on the data lifecycle. However, with the introduction and widespread use of ChatGPT, the narrative has shifted significantly, prompting a deeper examination of the impact of generative large language models (LLMs) on NASA's open science initiative.
What Are Large Language Models?
LLMs are machine learning models designed to process and generate human language. These models use neural networks, a type of machine learning algorithm, to learn statistical patterns from vast amounts of text and to make predictions based on those patterns [1]. OpenAI's GPT series, which includes GPT-2, GPT-3, and newer versions, is among the most well-known LLMs. GPT-3, for example, was trained on a dataset of over 45 terabytes of text from sources such as books, articles, and web pages [1]. Google's BERT (Bidirectional Encoder Representations from Transformers) and Facebook's RoBERTa (Robustly Optimized BERT Pre-training Approach) are other notable LLMs trained on large text corpora [2][3]. These models have a variety of applications, such as language translation, chatbots, text summarization, and generating human-like text for news articles, chat messages, and creative writing.
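To make one of these applications concrete, here is a minimal Python sketch that applies a pretrained model to text summarization using the open-source Hugging Face transformers library. The library and the specific model are our illustrative choices, not details from the post, and the model choice is an assumption.

```python
# A minimal sketch of applying a pretrained LLM to text summarization,
# using the Hugging Face `transformers` library. The model name below
# is an illustrative assumption; any summarization model would work.
from transformers import pipeline

# Load a pretrained summarization model (weights download on first run).
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

abstract = (
    "Large language models are machine learning models that process and "
    "generate human language. They are trained on large-scale text data "
    "and are used for translation, chatbots, and text summarization."
)

# Generate a short summary; max_length and min_length bound the output.
summary = summarizer(abstract, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```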
Despite the potential benefits of LLMs in speeding up tasks such as writing papers, grants, and code, there are concerns about their reliability, particularly their tendency to return false information [1]. The learning process of LLMs relies on statistical patterns of language in large databases of online text, which include untruths, biases, and outdated information [1]. As a result, LLMs are unreliable producers of accurate information, particularly for technical topics on which they have had limited training data. Moreover, the fixed context length of LLMs also poses a challenge to generating factually correct information, since outputs tend to become less coherent as the context grows longer. It is therefore essential to be aware of the limitations of LLMs and to validate their outputs before using them for critical tasks.
What Is ChatGPT?
ChatGPT is a conversational agent that utilizes the large language model GPT-3 (specifically, InstructGPT, fine-tuned with human feedback) to generate human-like responses to natural language inputs. As a killer application of GPT-3, ChatGPT has the potential to revolutionize the way we interact with computers and digital services. ChatGPT is capable of understanding complex language structures and generating text that is coherent and contextually relevant. Furthermore, it can be fine-tuned on specific tasks and domains to provide more specialized and accurate responses. ChatGPT can be used for a range of purposes, such as customer service, language translation, and even creative writing. In the scientific domain, it can also help with writing manuscripts and generating code. Overall, ChatGPT represents a significant advancement in natural language processing and has the potential to enhance human-computer interaction in numerous ways.
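For readers who want to experiment with this kind of agent programmatically, below is a minimal sketch of querying a ChatGPT-style model through OpenAI's API. The client version (the pre-1.0 `openai` Python package), model name, and prompts are our assumptions for illustration, not details from the post.

```python
# A minimal sketch of querying a ChatGPT-style model programmatically,
# assuming the `openai` Python package (pre-1.0 API) and an API key in
# the OPENAI_API_KEY environment variable. Model name is illustrative.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # the model family behind ChatGPT
    messages=[
        # A system message steers the assistant toward a domain or style,
        # a lightweight form of the task specialization described above.
        {"role": "system",
         "content": "You are a helpful scientific writing assistant."},
        {"role": "user",
         "content": "Suggest a clearer title for a paper on ocean color "
                    "remote sensing."},
    ],
    temperature=0.7,  # higher values produce more varied completions
)

print(response["choices"][0]["message"]["content"])
```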
While ChatGPT has potential, it also has several known shortcomings. First, generative LLMs are prone to producing errors and misinformation. Second, because LLMs are often trained on historical documents, their responses may incorporate biases and outdated ideas. The corpus behind ChatGPT is limited to documents from 2021 and earlier, and the model cannot browse the Internet to circumvent this restriction [6]. As the technology continues to evolve, some of these issues may be resolved, and LLM-based conversational agents like ChatGPT will likely become even more powerful and prevalent in many fields.
Implications of Large Language Models for Science
The emergence of LLM-based tools like ChatGPT has given researchers and scientists a tool for editing manuscripts, writing or checking code, and brainstorming ideas [5]. Although ChatGPT has made its debut in the scientific literature, the use of AI-generated text has become a topic of debate, with some publishers grappling with the question of whether it is appropriate to cite ChatGPT as an author [5]. While some publishers have prohibited the use of ChatGPT-generated text in scientific papers, deeming it scientific misconduct, others have yet to create policies on the use of AI tools in published literature [5]. Nevertheless, the use of LLM-based generative tools like ChatGPT has raised concerns about their reliability and tendency to return false information, underscoring the need for human oversight and for scientists to recommit to careful attention to detail in order to maintain trust in science [5][6].
To address these concerns, researchers have suggested measures such as enforcing honest use, requiring transparency about use, and detecting and watermarking AI-generated content [6]. Detection tools can help flag the use of LLMs [6]. However, as language models become more sophisticated, these tools may not be infallible, and the future of generative AI will depend on the ethical choices made by researchers.
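As one concrete, hedged illustration of how such detection heuristics can work, the sketch below scores a passage's perplexity under an open reference model (GPT-2, via the Hugging Face transformers library); machine-generated text often scores unusually low. This is our illustrative example, not one of the tools cited in [6], and it is a rough signal rather than a reliable detector.

```python
# A sketch of one common AI-text detection heuristic: perplexity scoring
# under a reference language model. Lower perplexity means the text is
# more "model-like". Assumes `torch` and `transformers` are installed.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2 (lower = more model-like)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy
        # loss over the sequence; exponentiating gives perplexity.
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return torch.exp(loss).item()

print(perplexity("The quick brown fox jumps over the lazy dog."))
```

In practice a detector would calibrate a threshold on known human- and machine-written samples; as noted above, sophisticated models can evade such signals.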
Implications of These Generative Models for Open Science
Open science is defined as a collaborative culture enabled by technology that empowers the open sharing of data, information, and knowledge within the scientific community and the wider public to accelerate scientific research and understanding [7]. Open science adheres to a number of principles:
- Transparent Science ensures that the scientific process and results are visible, accessible, and understandable [8].
- Accessible Science makes scientific data, tools, software, documentation, and publications accessible to everyone [8].
- Inclusive Science welcomes participation and collaboration in the process of science from people and organizations with diverse backgrounds [8]. This includes public engagement in the scientific process.
- Reproducible Science ensures that the scientific process and results are open so that they can be independently verified and validated [8].
The goal of open science is to accelerate the time to actionable science. While technology alone cannot achieve all of the goals of open science, it can aid in streamlining and optimizing the various steps involved in conducting research, from idea generation to data collection and analysis to publication and dissemination of findings. New technologies, such as collaborative platforms, cloud computing, artificial intelligence, and high-throughput experimentation, can help speed up data collection and analysis.