The Benefits of Text Mining Full Text Instead of Abstracts


Many researchers use the summary information in article abstracts to compile a collection of records for text mining, rather than using full-text articles.

There are two reasons for this. Abstracts are easily accessible through biomedical databases like PubMed, and they come in a format well suited to text mining: XML.
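To illustrate why XML abstracts are convenient to work with, here is a minimal sketch that pulls abstract text out of a PubMed-style record using Python's standard library. The sample record is hand-written placeholder data, and the function name `extract_abstracts` is hypothetical; only the element names (`PubmedArticle`, `PMID`, `AbstractText`) follow the layout PubMed uses in its XML exports.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written record that mimics the PubMed XML layout.
# The PMID and all text are placeholders, not real data.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000000</PMID>
      <Article>
        <ArticleTitle>Placeholder title</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">Placeholder background sentence.</AbstractText>
          <AbstractText Label="RESULTS">Placeholder results sentence.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def extract_abstracts(xml_text):
    """Return a {PMID: abstract text} mapping for each article in the XML."""
    root = ET.fromstring(xml_text.strip())
    abstracts = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext(".//PMID")
        # An abstract may be split into several labelled sections;
        # itertext() also picks up text nested inside inline markup.
        parts = [
            "".join(node.itertext()).strip()
            for node in article.iter("AbstractText")
        ]
        abstracts[pmid] = " ".join(parts)
    return abstracts


records = extract_abstracts(SAMPLE_XML)
```

Full-text articles rarely arrive in a single predictable structure like this, which is part of why abstracts are the easier starting point.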

Although using abstracts seems like an easy workaround, mining full text offers major benefits. Abstracts often omit essential facts and relationships, secondary study findings, and adverse event data. While abstracts do provide some valuable information, researchers need access to full-text articles to get the best results from text mining projects.

Access More Facts

Full-text articles provide more information than abstracts. The difference is in both volume and type of information, including detailed descriptions of methods and protocols and the complete study results. Authors often include only their most important findings in the abstract, leaving secondary study findings, discoveries, observations and other critical insights only in the full-text article.

    1. Abstracts often exclude or underrepresent data. Because abstracts are short and concise, results that are less relevant to the main idea, or outside its scope, are often left out. In some cases, critical information may reside in a footnote of the full text. By mining the entire text, including bibliographic information, researchers can obtain richer results that reveal vital patterns and information in the documents.
    2. New discoveries are more likely to be mentioned in the full text of articles before appearing in abstracts. Following initial publication of a new discovery in a particular journal, the research is often repeated and included in other publications. But there is a substantial delay between when that discovery appears in full articles and when that information appears in abstracts. In fact, it can take one to two years for discoveries to appear in the abstract of a subsequent article, according to a study conducted by Elsevier.
    3. Full-text articles are more likely to contain information on adverse events. Per a study published in BMC Medical Research Methodology, “Abstracts published in high impact factor medical journals underreport harm even when the articles provide information in the main body of the article.” This missing information can reduce the value of abstracts as the “raw material” to mine, especially in pharmacovigilance use cases, or when researchers want to make novel connections that haven’t been a major focus of the literature.

Uncover More Relationships

Full-text articles also contain more relationships between named entities than abstracts. According to a study published in the Journal of Biomedical Informatics, only 8% of the scientific claims made in full-text articles were found in their abstracts.

Another study, conducted by publisher Elsevier, compared the use of abstracts and full-text articles to derive relevant information about drugs and proteins that affect the progression of fibromyalgia. The researchers found 31 relationships in the literature by mining abstracts and an additional 53 relationships when they ran the same search across the full-text articles.
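The kind of comparison described above can be sketched in a few lines of Python. This is a toy illustration, not the method used in either study: it treats two entities as "related" when they appear in the same sentence, and the entity names (`DrugA`, `ProteinX`, `ProteinY`), the sample text, and the function `cooccurring_pairs` are all hypothetical.

```python
import re
from itertools import combinations


def cooccurring_pairs(text, entities):
    """Return the set of entity pairs mentioned together in one sentence."""
    pairs = set()
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        # Sort so each pair has a stable order regardless of set iteration.
        found = sorted(e for e in entities if e.lower() in lowered)
        pairs.update(combinations(found, 2))
    return pairs


# Hypothetical entities and text, invented purely for illustration.
ENTITIES = {"DrugA", "ProteinX", "ProteinY"}
ABSTRACT = "DrugA reduced ProteinX levels."
FULL_TEXT = (
    ABSTRACT
    + " In a secondary analysis, DrugA also altered ProteinY expression."
)

abstract_pairs = cooccurring_pairs(ABSTRACT, ENTITIES)
full_text_pairs = cooccurring_pairs(FULL_TEXT, ENTITIES)

# Relationships that only surface when the full text is mined.
only_in_full_text = full_text_pairs - abstract_pairs
print(sorted(only_in_full_text))
```

Real text mining pipelines rely on trained entity recognizers and relationship extraction models rather than substring matching, but the set difference at the end captures the same idea: the relationships that mining abstracts alone would miss.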

While text mining article abstracts yields some information, there are limitations to what can be discovered through that process. To ensure that researchers don’t miss vital data, discoveries, and assertions, the full text of the article should be mined.


Author: Mike Iarrobino

Mike Iarrobino is Director of Product Management for CCC’s award-winning content and rights workflow suite, RightFind. He has previously managed marketing technology and regulatory search products at FreshAddress, Inc., and HCPro, Inc. He speaks at webinars and conferences on the topics of data pipelines, information discovery, and knowledge management.