Text Mining Scientific Articles: Why You Need the Full Picture


For life sciences researchers, the idea of finding a needle in a haystack is less a platitude and more a reality. Faced with the unenviable task of identifying distinct facts and insights as a way to define and classify scientific works, researchers rely on text mining and natural language processing (NLP) techniques to get the job done.

However, such is the immensity of the task – dealing with tens of thousands of full-text scientific articles at a time – that the temptation is to focus on MEDLINE abstracts alone. Being shorter, more to the point and in a standardized format, abstracts have clear advantages in this line of work.

Nonetheless, more and more researchers have turned to mining full-text articles. Open Access repositories such as PubMed Central (PMC) offer a platform through which researchers can experiment with full-text text mining, and organizations are keen to make the most of the unique benefits it provides.

The main advantages of text mining full-text scientific articles are volume, information diversity, and speed.

#1: More volume = more facts

The most obvious of these benefits is the amount of information being mined. If you compare a full-text article to its abstract, there are no prizes for guessing which has more words. As a result, full-text articles contain more named entities and connections between those entities – and more facts. According to research published in the Journal of Biomedical Informatics, “Only 7.84% of the scientific claims made in full-text articles are found in their abstracts.”

#2: Greater diversity of information = clearer picture

Lengthier full-text articles also provide more diverse types of information – for example, experimental and tabular data. While the body of an article includes full information, an abstract can only summarize it. However, summarizing complex ideas and data can sometimes present a distorted version of the facts. According to a study published in BMC Medical Research Methodology, “Abstracts published in high impact factor medical journals underreport harm even when the articles provide information in the main body of the article.”

#3: Faster retrieval of facts = earlier recognition of discoveries

It may sound counter-intuitive, but researchers are likely to find scientific facts faster when mining full text rather than abstracts. The truth is, it may take as long as one or two years for a fact included in a full-text article to appear in a subsequent abstract.

So, using abstracts rather than full-text articles when text mining scientific articles may often be a false economy. It may feel like the quick route to results, but in reality only a full-text version is comprehensive. Put simply, text mining on full-text offers more facts, more kinds of facts and quicker paths to insights.

Learn about Copyright Clearance Center’s text mining solutions here. 

Author: Stephen Garfield

Stephen Garfield is the Vice President, Client Engagement at CCC. He is responsible for the annual renewal of corporate licensing solutions, as well as overseeing a strategic account management plan designed to help companies educate their employees on copyright law.