Should ChatGPT be a Research Collaborator? - Herman, Neuhauser Research Management Consulting, Inc.

Claudia Neuhauser and Brian Herman

ChatGPT made its debut in November 2022. It took about a month before it was listed as a co-author on a preprint in medRxiv: Performance of ChatGPT on USMLE: Potential for AI-Assisted Medical Education Using Large Language Models. It took another six months before a couple of researchers let ChatGPT do the research and write a preprint: The Impact of Fruit and Vegetable Consumption and Physical Activity on Diabetes Risk among Adults. Apparently, it took less than an hour to write the paper, according to the Nature article that described how ChatGPT wrote the paper.

As with any new technology, there are always positive and negative aspects, and much has been written about those related to AI (see here or here). In research, AI platforms have already demonstrated their usefulness. Just think of the protein folding problem or drug design. But the ease of use and the speed of responses of the latest Large Language Models (LLMs), like ChatGPT, have catapulted discussions in academic circles on how this new technology could be, should be, and might be used in scientific research.

It would be naïve to think that researchers will not use LLMs, and so the question we need to address now is how LLMs should be a research collaborator and not whether they should be one. Once ChatGPT and other LLMs are integrated into research teams, they may turn out to be forceful collaborators who are know-it-alls. We need to anticipate and proactively solve issues before these platforms take over the research enterprise.

Research is a social activity

Research is a profoundly social activity. To find out the latest in their area of research, researchers go to conferences, attend departmental seminars, network with colleagues, read research publications of their colleagues, and contact colleagues to discuss their discoveries. Over the last several years, social media has also become part of staying up to date on the latest, sharing new findings with colleagues, and engaging with colleagues in online discussions. Social media platforms are different from LLMs. While LLMs may engage a person in a discussion, the discussion will remain private to the person. LLMs also do not have the nuanced view that comes from social interactions in a research area where knowledge is collaboratively created, and scientists may battle contradictory hypotheses.

In the future, we will have LLMs that are way smarter and way faster than us. And therein lies the potential problem. Why engage in a discussion with a colleague when an LLM can provide deep insights that accelerate the research? And since LLMs can take on any expertise, we may no longer need interdisciplinary teams; we can simply ask the LLM to be on our team as the humanist, the statistician, or whatever expert we need. Will our social research networks simply unravel as we rely more and more on machines as collaborators? And will we then still be able to direct the research enterprise to address the most pressing societal issues?

Researchers will also feel less of a need to read publications carefully when feeding them into an LLM that links new findings to previous ones with no effort and synthesizes the ever-increasing list of publications that one needs to read to keep up with the field. This technology is around the corner, with the first LLM science search engines having just been announced.

The current LLM science search engines are taking a cautious approach, at least for now. For instance, Elsevier’s Scopus AI chatbot, as explained in Van Noorden’s Nature article, keeps the human in the loop. The human uses a regular search to identify up to ten publications of interest, and then the chatbot summarizes the research based on the abstracts of these papers. Dimensions and Web of Science are experimenting with their versions of LLMs to summarize and synthesize publications. All of these products are still in the test phase. It won’t be long before LLMs write the introductory paragraphs of publications where past research is summarized. Will our next generation of researchers still have the time and patience to slog through dense and difficult research papers, or will they simply rely on summaries written by LLMs?

Should ChatGPT be a co-author?

The preprint in medRxiv we mentioned earlier prompted the journal Nature to talk with publishers and preprint servers about listing ChatGPT as a co-author. Everyone involved in this discussion agreed that ChatGPT should not be listed among the authors since it “cannot take responsibility for the content and integrity of scientific papers,” one of the criteria of being an author.

Even if we don’t want to include LLMs and other AI platforms officially as co-authors on publications, current versions can help with summaries of previous research, as mentioned earlier, and future versions may end up doing the bulk of the work. The question that we will need to answer sooner rather than later is who owns the intellectual property generated by AI: the human scientist, the programmer who wrote the code for the AI algorithm, the individual or agency who funded the research, or those who generated the information the AI system was trained on? Some (if not most) of these issues will be decided by courts, like the decision issued by a federal appeals court in August 2022 that unequivocally said that AI cannot be named as an inventor on US patents because the “Patent Act requires an ‘inventor’ to be a natural person.” (Of course, the Patent Act could be changed at some point in the future to accommodate machines as inventors.)

What it means to be a natural person may also change as we start to merge humans and machines. Loz Blain reported in an article in the New Atlas, “Computer chip with built-in human brain tissue gets military funding,” that researchers are already experimenting with brain organoids, such as the “DishBrain,” a computer chip created by researchers at Monash University in Australia that integrates human and mouse brain cells into an array of electrodes. This chip was able to “demonstrate something like sentience” by learning how to play the game Pong within minutes. One of the researchers, Adeel Razi, explains that this could be “a new type of machine intelligence that is able to learn throughout its lifetime.”

Working alongside LLMs will greatly accelerate research. The ChatGPT-written research paper that we mentioned at the beginning of the blog is an illustration of what might come. The scientists who had ChatGPT do the research and write the paper developed an “autonomous data-to-paper system [that] led the chatbot through a step-by-step process that mirrors the scientific process,” according to the Nature article. If we move into a research environment where labs produce the data and LLMs analyze the data, interpret the results, and write the paper, we will see a tsunami of publications that can only be digested by LLMs. We may need to limit the number of publications a lab can publish. This will allow humans to keep up with research, have a hand in steering the research enterprise, and control the output quality. It would also solve some of the concerns with the publish-or-perish system science currently lives under and might force scientists to only publish their most relevant and important results.

Can ChatGPT be reliable?

We may not want to name our favorite chatbot as a co-author, but we will use LLMs to synthesize and write papers and grants and referee those same papers and grants. LLMs will be particularly useful in helping us determine whether the research is new, contradicts other research, or adds something significant because of their ability to synthesize very large amounts of data, far more than a scientist can consume in a reasonable amount of time. But the current LLMs are prone to hallucinations, meaning that they make up stuff. In addition, the scientific corpus is full of contradictions because of how science works. How do we monitor the accuracy of what these LLMs spit out?

LLMs are trained on the corpus of information available to the AI system at the time of the training. Science is a self-correcting enterprise; what is published is what the community deemed accurate at that time. This means, in particular, that the scientific record is not curated: publications whose results have been overturned by new insights are not removed from the corpus. Removing them would be devastating to science since what is considered accepted knowledge may go back and forth among different theories. LLMs are not sophisticated enough to have that nuanced view, at least not currently.

The problem of research misconduct and retractions

We only remove publications from the corpus that are identified as having falsified or fabricated data. About 1-2% of published papers are based on falsified or fabricated data. Spotting these publications is difficult, and even if an investigation concludes that the data are falsified or fabricated, it may take years before the paper is retracted. In the meantime, others cite the publication and base their research on those false results.

It takes a sleuth like Elisabeth Bik, a microbiologist who has dedicated her professional life to identifying publications that contain fabricated and falsified data. Her focus is on image manipulations, and she has screened thousands of images in research papers. Bik published her analysis of ~20,000 peer-reviewed published papers in 2016 for potential manipulation of images. She claimed to have found one in twenty-five papers (4%) to have potentially manipulated images.

The screening approach is not without serious concerns. With any screening, whether it is for suspicious images in research papers or the annual cancer screening at the doctor’s office, false positives occur. We do not know what algorithms Elisabeth Bik uses and what their false positive rate is. But if research misconduct is rare, like the 1-2% that is commonly cited, it would not be too surprising if some or even many of the accusations are not borne out by further scrutiny, just as a positive test result in an initial cancer screening may not be confirmed in a biopsy.
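The base-rate argument above can be made concrete with a short calculation. The sketch below applies Bayes' rule to estimate what share of flagged papers would truly contain manipulation. The sensitivity and specificity values are purely hypothetical assumptions chosen for illustration; they are not measured figures for Elisabeth Bik's screening or for any real tool, and the function name is our own.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Share of flagged papers that are true cases, by Bayes' rule."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


# Hypothetical screening accuracy (assumptions, not measured values).
SENSITIVITY = 0.90  # fraction of manipulated papers that get flagged
SPECIFICITY = 0.95  # fraction of clean papers that are correctly left unflagged

# The 1-2% prevalence range is the one cited in the text.
for prevalence in (0.01, 0.02):
    ppv = positive_predictive_value(prevalence, SENSITIVITY, SPECIFICITY)
    print(f"prevalence {prevalence:.0%}: {ppv:.1%} of flagged papers are true cases")
```

Under these assumed numbers, only about 15% of flagged papers would be true cases at a 1% prevalence, and about 27% at 2%. A screen with different accuracy would give different figures, but the pattern holds whenever the condition is rare: most initial flags need a human to look closer.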

Research papers with questionable images often end up on PubPeer, a public website that allows scientists and others to post comments on published studies anonymously. Of course, any proven manipulation of scientific data is wrong, and the perpetrators should be held accountable. But at this point, these algorithms are not sophisticated enough to run without a human in the loop, and public accusations based on the output of an algorithm without further scrutiny can seriously damage someone’s reputation.

Screening for suspicious images with software is quick, whereas a research misconduct investigation typically takes months and many hours of work by experts. PubPeer has been used to make allegations that resulted in research misconduct investigations, as publicized in “The Research Scandal at Stanford Is More Common Than You Think.” An anonymous comment posted in 2015 on PubPeer stimulated Theo Baker, a reporter at the student newspaper of Stanford University, to take a closer look at five papers published by Stanford’s president, Marc Tessier-Lavigne, between 1999 and 2012. Baker wrote a well-researched article in the student newspaper, which prompted Stanford to launch an investigation conducted by a special committee. The investigation concluded with a 95-page report in July 2023 after a review of thousands of documents.

The committee that reviewed the allegations concluded that Tessier-Lavigne did not engage in research misconduct and that he did not know about the data manipulation in his lab. Since it was determined during the investigation that someone in Tessier-Lavigne’s lab had manipulated data, Tessier-Lavigne will retract three of the papers and correct another two. Some of the publications have been out for approximately a quarter of a century, and the results have been used by many other scientists to move research forward on Alzheimer’s disease.

It is not clear how LLMs trained on the scientific corpus would deal with retractions and the large number of other papers that cite retracted papers. Will LLMs be able to distinguish between citing a retracted paper as part of the summary of past research and relying on results based on falsified or fabricated data from a retracted publication?

The Tessier-Lavigne story should make it clear that relying on AI to determine which publications should be included in the training corpus is fraught with problems, at least at this moment in time. We still need human experts to decide which research to use to build the knowledge edifice that won’t be a sandcastle. And, at least for now, these discussions happen among real people and often in social settings.