By Juan Ginzo, George Lindsay-Watson, Maryam Ahmed, Clare Spencer

How can AI enhance journalism? 32 journalists and technologists from across the globe have joined the 2023 JournalismAI fellowship to find out. They are working in six self-selected teams on six different projects that use AI to enhance journalism and its processes.

In this series of articles, our Fellows describe their journey so far, the progress they’ve made, and what they learned along the way. In this blog post, you’ll hear from team MP Interests Tracker, a collaboration between editorial and technical Fellows from the BBC, and The Times.

For the 2023 JournalismAI fellowship, teams from The Times and BBC teamed up with a shared aspiration. Our goal was to harness the latest advancements in GenAI to make complex data, like that in the UK Register Members Financial Interest, more accessible and easier to comprehend.

The Register of Members Financial Interest main purpose is to provide information about any financial interest which a Member of the UK Parliament has, or any benefit which he or she receives, that others might reasonably consider to influence his or her actions or words as a Member of Parliament.

We think that this kind of project will serve two purposes. It will allow journalists to better and more efficiently understand how economic interests try to influence MPs, both individually and on the aggregate. But will also provide an interesting template in how to utilise GenAI as a way to provide structure to unstructured data, taking the place of classical Natural Language Processing techniques, with better out of the box performance.

But wait, what is Generative AI?

Generative AI (GenAI) refers to a subset of artificial intelligence models that can generate content. Typically associated with generating images, videos, music or text, GenAI can be trained on vast datasets, allowing it to produce new, unique outputs based on patterns it discerns in the training data. In the context of our project, GenAI is employed not for content creation but for data extraction and structuring, particularly from unstructured or semi-structured sources. The Register, where MPs declare gifts received or any kind of financial interests they may hold, is the perfect case study.

Application to structured and unstructured data

While structured data is organised in a clear, tabular format (like databases), unstructured data lacks a specific form, making it harder to analyse using traditional methods. Think of structured data as a neatly organised bookshelf where each book represents a piece of data and unstructured data as a pile of mixed notes, photos and documents. GenAI can sift through this ‘pile’ and categorise, organise and transform it into something as structured as the bookshelf, making the information more digestible and actionable.

More importantly, it can provide this capability without the previous need for expensive labelling of the data for training purposes; what is sometimes called ‘zero-shot learning’.

The Register provides a vital insight into the intersections of economic groups and political influence. However, its potential remains largely untapped due to its current unstructured format. With the volume of data it houses, extracting meaningful insights requires innovative techniques. By leveraging GenAI, we aim to transform this resource into a structured and easily queryable tool, thereby serving public interest more efficiently and providing a great tool for journalists to help them ‘join the dots’.

Challenges and approaches

The inherent semi-structured nature of the data in the Register poses a significant challenge. Entries, divided into sections, vary widely in format, as MPs have discretion over their submissions. This inconsistency complicates data extraction using classical techniques or straightforward ‘pattern matching’. We needed a tool that has some ‘semantic’ understanding of the inputs we pass through them.

We turned to spacy_llm, a sophisticated abstraction layer atop the OpenAI API. Using it, we defined data extraction tasks combining various techniques, from identifying entities to establishing relationships between them. The tool also allowed us to refine results with ‘few shot learning’ when necessary, by utilising examples where it doesn’t perform well as a way to guide it to the correct output and data extraction.

One example of this was in Section One of the Register, where MPs record employment and earnings. We asked the tool to specify the nature of the businesses MPs were working for but found that certain companies were hard to define. Is Lloyds a bank, an insurer or an estate agent? The underlying LLM didn't know, so we had to help fill in the blanks, by providing examples of common companies (in the context of the register) and what industries they relate to.

Progress and future direction

Having developed a comprehensive pipeline — from scraping the Register to structuring MP entries — we are now focussing on optimising our data extraction methods and broadening our field range. Collaborations with journalists and editors are ongoing to align the data’s presentation with the evolving demands of journalism.

Hopefully, the MP Interest Tracker initiative will help provide an example of the potential of GenAI in revamping and revitalising the way we look into public resources. As we delve deeper, we anticipate that our methodologies will pave the way for future data-intensive projects, both in the UK and globally.

If you are interested in the subject matter or technical approach, we’d love to hear from you.

Do you have skills and expertise that could help team MP Interests Tracker? Get in touch by sending an email to Programme Manager, Lakshmi Sivadas, at lakshmi@journalismai.info.

MP Interests Tracker: Utilising GenAI to uncover insights in the UK Register of Financial Interest

Daisy: Pioneering Data Journalism with LLMs

Generating Change 2023: What have we seen before?