Q&A With the Godfather of FAIR Data Principles, Barend Mons


Barend Mons is the Founding Director of the Leiden Institute for FAIR and Equitable Science (LIFES), a joint effort by an international public-private partnership of 10 academic and private organizations to build a broad and diverse network of members who want to adopt FAIR data principles. He is also the Scientific Director of the GO FAIR Foundation, which guides people and organizations on making data universally Findable, Accessible, Interoperable, and Reusable. CCC recently spoke with Professor Mons; below is an excerpt from that conversation, edited for space and clarity.

CCC: Professor Mons, you are well-known in the industry for introducing FAIR data principles. How did this movement start, and what problems were you trying to address?

Barend Mons: I am a biologist by training. At some point in my career, while at the European Commission, I started working on malaria. During this work, I identified a language problem that needed to be clarified between Portuguese, English, and French. This led to my interest in ambiguity in language and in other areas.

I started working with thesauri and ontologies, and then we reached a point where we saw that the data we generated in biology, especially in my field of genetics, was growing enormously fast. I understood early on that we would need machines to process this data, and machines cannot deal with ambiguity. That is why I became involved in precisely defining data so it could be machine-actionable.

I always say that the “I” in FAIR means ‘the machine knows what I mean.’ The good thing is that if machines understand what you mean, you can certainly express it (and indeed with LLMs) in any language you want, whether in Portuguese or French, for example. Either way, it is the same concept.
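
Editor’s illustration: the idea that one concept sits behind many language-specific labels can be sketched in a few lines of Python. This is a minimal sketch, not a description of any tool Professor Mons mentions. The concept identifiers (`EX:0001`, `EX:0002`), the tiny label table, and the function names are all hypothetical and exist only for this example.

```python
# Hypothetical mini-thesaurus: each concept has one identifier and
# one label per language. The identifiers are made up for illustration.
CONCEPT_LABELS = {
    "EX:0001": {"en": "malaria", "pt": "malária", "fr": "paludisme"},
    "EX:0002": {"en": "mosquito", "pt": "mosquito", "fr": "moustique"},
}


def build_index(concept_labels):
    """Map every label (case-insensitive) to the set of concepts using it."""
    index = {}
    for concept_id, labels in concept_labels.items():
        for label in labels.values():
            index.setdefault(label.casefold(), set()).add(concept_id)
    return index


def resolve(term, index):
    """Return the single concept ID for a term, or None if the term is
    unknown or ambiguous (that is, it maps to more than one concept)."""
    matches = index.get(term.strip().casefold(), set())
    if len(matches) == 1:
        return next(iter(matches))
    return None


if __name__ == "__main__":
    index = build_index(CONCEPT_LABELS)
    for word in ("Malaria", "malária", "paludisme", "fever"):
        print(word, "->", resolve(word, index))
```

Whatever language the term arrives in, downstream code only ever sees the identifier, which is the sense in which the data is unambiguous for a machine.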

So this led to implementation in the pharmaceutical industry with a significant project we ran in 2012 called Open PHACTS, where individual searches were machine-readable. Then, in 2014, it boomed after our Lorentz meeting and when we published a paper. So, it was a long trajectory, but the fundamental realization was that, soon, we would not be able to do large-scale science without machines as our assistants, and we’d better make data machine-actionable.

CCC: Fast-forward to 2024. Where do we stand now in implementing FAIR data principles in pharmaceuticals, academia, and society in general?

Barend Mons: Measuring the impact of something like this is always tricky. People say that 10 years is nothing after a new concept appears, but what I can say is that when we had a meeting in January 2024, 10 years almost to the day after the inaugural meeting, we saw thousands of citations in an analysis by CWTS [Centre for Science and Technology Studies – Leiden], which is a significant institute dealing with citations. Many people might think that the citations of the paper are limited to the life sciences, but CWTS showed that, across almost 15,000 published citations, it has been cited in 4,000 recognized scientific disciplines.

So, it is all over the place, from social sciences to justice to water and the environment. There is even a recent statement by the G7 that they still stand behind the FAIR principles and that we need to continue going this way. It has also been widely accepted by funders and in policy, though less so by publishers.

CCC: Can you share your thoughts about FAIR in relation to Large Language Models (LLMs)?

Barend Mons: Another unexpected development is the rise of LLMs and what people call “AI” [artificial intelligence], even though it may not be AI. I recently heard a beautiful statement that ‘everything that LLMs produce is a hallucination’ as the machines have no clue what they are saying. The meaning behind this is that if the hallucination makes sense to us, it can be perceived as significant, whereas if it does not fit our conceptual model, we refer to it as a hallucination. The more beautiful it looks, the more dangerous it is and the harder it is to detect mistakes.

As I have discussed with CCC, there are at least two ways to restrict LLMs to prevent the craziest hallucinations. One is to feed them FAIR data, which can also be considered “Fully AI Ready” data, so they understand the data much better than when they just mine random text from the web. The other way is to constrain the output and enforce conceptual models (again a form of FAIR data, as this involves ontologies) to ensure outputs are not of low quality or a poor fit with reality. But again, if you do not feed models with proper training data, they will not interpret data as they should. So now, 10 years after its inception, FAIR has undoubtedly been propelled onto a larger stage because of LLMs.
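
Editor’s illustration: the second approach, constraining output with a conceptual model, can be sketched as a simple check of machine-generated statements against a small set of allowed concept types and relations. This is only a sketch under stated assumptions: the concept types, the relation table, and the function names below are hypothetical, no real LLM or ontology library is involved, and a production system would use a real ontology rather than a hard-coded table.

```python
# Hypothetical conceptual model: the allowed concept types and the
# relations that may hold between them. All names are illustrative.
ALLOWED_CONCEPTS = {"EX:drug", "EX:disease", "EX:gene"}
ALLOWED_RELATIONS = {
    ("EX:drug", "treats", "EX:disease"),
    ("EX:gene", "associated_with", "EX:disease"),
}


def validate_statement(subject_type, relation, object_type):
    """Return a list of problems; an empty list means the statement
    fits the conceptual model."""
    problems = []
    for concept in (subject_type, object_type):
        if concept not in ALLOWED_CONCEPTS:
            problems.append(f"unknown concept type: {concept}")
    triple = (subject_type, relation, object_type)
    if not problems and triple not in ALLOWED_RELATIONS:
        problems.append(f"relation not allowed by the model: {triple}")
    return problems


def filter_output(statements):
    """Split candidate statements into accepted and rejected ones.
    Rejected entries keep their list of problems for later review."""
    accepted, rejected = [], []
    for statement in statements:
        problems = validate_statement(*statement)
        if problems:
            rejected.append((statement, problems))
        else:
            accepted.append(statement)
    return accepted, rejected


if __name__ == "__main__":
    # Stand-in for statements extracted from a model's output.
    candidates = [
        ("EX:drug", "treats", "EX:disease"),
        ("EX:disease", "treats", "EX:drug"),
        ("EX:planet", "treats", "EX:disease"),
    ]
    accepted, rejected = filter_output(candidates)
    print("accepted:", accepted)
    for statement, problems in rejected:
        print("rejected:", statement, problems)
```

The check cannot tell whether an accepted statement is true; it only rejects statements that do not fit the conceptual model at all, which is the narrower guarantee described in the answer above.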

CCC: Is it accurate to say that the creation of LIFES was an initiative to rectify some of the issues you just described?

Barend Mons: LIFES is the natural next evolutionary step because we have realized that many people are jumping on FAIR. This followed the classical route of significant infrastructural changes—from TCP/IP [Transmission Control Protocol/Internet Protocol] to the Internet, mobile phones, electricity, and railways.

This is detailed in a paper by George Strawn, co-founder of the Internet, director of NSF [National Science Foundation], and one of my gurus, Peter Wittenburg. They are both excellent scientists in this field, and they looked at patterns in significant infrastructural changes. They discovered that after a groundbreaking idea, things that were not possible before are suddenly possible. You get a wild growth and diversification of technologies and implementations, which they called ‘Creolization.’ That is a bit of an old-fashioned word, so we decided to refer to it as technological diversification instead.

Once you arrive at this creolization or diversification phase, then at some point, the pain becomes so big that choices are made. For example, you had [ISO] OSI and TCP/IP at the beginning of the Internet. Then, finally, we saw a convergence to TCP/IP [as the standard protocol] because it was simple and NSF chose it, and finally Cisco and the industry jumped in during the late nineties. Another example: the leading telephone companies said there were all these standards; let us use one (GSM). So, at some point, there was an attractor, then there was convergence, and then it exploded. Boom.

According to George, we are in the same phase now with FAIR. George has said at some meetings recently that the impact of FAIR could be as significant as, or larger than, the impact of the Internet itself. If we create an Internet that is independently operated by machines, then there is no way to fathom what will happen. We saw that LIFES became the natural crystallization point for people with the same approach to FAIR.

CCC: Why is it essential for LIFES to have support from a mix of public and private founding members rather than just one or the other?

Barend Mons: One of the problems we saw in the last 10 years was that FAIR was developed more or less separately in academia and industry. By bringing the two together structurally under LIFES, we gain the very pragmatic, scalability-oriented approach of pharma, other sectors, and organizations like CCC, which are much more down to earth than the average scientist, who moves on to something else once they know how things work. Bringing these two worlds together was exactly what was needed to drive this convergence.

CCC: What is the vision for the core business for LIFES?

Barend Mons: We envision two lines: to help create as many FAIR data stations as possible, and to help anyone who wants to make their data FAIR and put them into a station, making all the diverse implementations for distributed learning possible. I am proud that LIFES will be prominently present with sibling institutes globally, including in the South, across Africa, Latin America, and elsewhere. This will help me realize my dream of equal access and help developing countries develop further through knowledge sharing.

Editor’s Note: CCC collaborated with key industry stakeholders to help launch LIFES.

Author: CCC

A pioneer in voluntary collective licensing, CCC advances copyright, accelerates knowledge, and powers innovation. With expertise in copyright, data quality, data analytics, and FAIR data implementations, CCC and its subsidiary RightsDirect collaborate with stakeholders on innovative solutions to harness the power of data and AI.