Daisy: Pioneering Data Journalism with LLMs

2 Oct

By Piyush Aggarwal, Ankit Kumar, Caroline Lindekamp, Andrew Ong, Vashiegaran Manogaran

How can AI enhance journalism? 32 journalists and technologists from across the globe have joined the 2023 JournalismAI fellowship to find out. They are working in six self-selected teams on six different projects that use AI to enhance journalism and its processes.

In this series of articles, our Fellows describe their journey so far, the progress they’ve made, and what they learned along the way. In this blog post, you’ll hear from team Daisy, a collaboration between editorial and technical Fellows from the India Today, CORRECTIV and Malaysiakini.

In past attempts to create interactive election dashboards, one issue has persistently cropped up: Interactions were only one-sided. Users could only access what the interface presented. Flexibility in terms of users’ specific queries was not possible. However, this may no longer be a barrier with the rapid development of Large Language Models (LLM).

The upcoming 2024 General Elections in India, poised to be the world's largest democratic exercise, will provide a compelling opportunity for data journalists to leverage the latest advancements in Artificial Intelligence (AI). As part of the JournalismAI Fellowship 2023, our project, Daisy, aims to revolutionise data journalism using LLMs.

Daisy is an AI framework that empowers users to effortlessly extract insights from complex, structured datasets using natural language queries. Daisy is poised to not just interpret data but reshape the landscape of data journalism itself.

Tackling the complexity of Indian elections data

The scale of the 2024 Lok Sabha Elections in India is staggering, with nearly 950 million eligible voters who will cast their ballots across 543 seats. Gathering and processing exhaustive datasets from diverse sources adds an additional layer of complexity to this monumental task.Daisy takes on this challenge by providing a user-friendly solution that simplifies the access and analysis of such vast and intricate datasets.

A collaborative endeavor

Daisy is not a solo effort. It's a collaborative project that brings together expertise from around the globe. It pools collective resources and knowledge to develop an LLM-driven data journalism platform.

Daisy is a collaboration between India Today Group , one of India's biggest and most diverse newsrooms, CORRECTIV, a Germany-based fact-checking newsroom with IFCN certification, and Malaysiakini, a prominent news outlet in Malaysia.

Daisy is driven by our collective passion for data journalism as well as the desire to enhance user experience. The emergence of Large Language Models has inspired us to make our data-driven dashboards more interactive while creating leeway for journalists to focus on investigations. It builds upon the previous work of the three newsrooms involved, as all of them work with data on a daily basis.

The journey so far

The journey with Daisy began by focusing on Indian elections data, specifically focusing on the data of Uttar Pradesh, one of India's most populous states, with nearly 240 million residents and over 150 million eligible voters. Piyush Aggarwal and Ankit Kumar, representing India Today, put together an exhaustive dataset of Uttar Pradesh's Assembly polls. This dataset casts its net over 16 state assembly elections from 1965 to 2022, encompassing more than 40 columns of diverse data points. These include everything from candidate profiles to constituency specifics and the patterns of voting behaviour.

Aggarwal and Kumar also crafted more than 170 unique election-related queries in English. These queries were written to uncover valuable insights from our dataset. They ranged from straightforward enquiries like, "How many candidates won the election with a victory margin of more than 10,000 votes in 2022?" to more nuanced questions such as, "How many candidates did the BJP and Congress field in the 2022 polls?” and, “How many MLAs got reelected in the 2017 polls?"

The queries were classified into 12 distinct categories to facilitate efficient exploration.

The initial choice for the LLM was OpenAI's GPT-3.5 turbo combined with the LangChain framework. However, a series of challenges emerged while making the LLM comprehend the nuances and correlations within our dataset. The model hallucinated and grappled with understanding concepts like victory margin, differentiating between districts and assembly constituencies, identifying the winners of elections, etc.

In response to these challenges, team Daisy tried various system prompts to fine-tune our approach to elicit more accurate and relevant responses from the LLM. This involved devising system prompts that included clear definitions of various attributes within the data. These tailored prompts served as a vital bridge, enabling the LLM to better grasp the context and nuances of the data and the corresponding queries.

During this, the importance of crafting precise and concise system prompts, which play a crucial role in accurately describing the dataset's intricacies and extracting the desired insights, became clear. Through iterative experimentation and refinement, we are steadily advancing our ability to harness the LLMs' capability of uncovering meaningful insights from complex datasets.

A successful start

Early efforts with Daisy yielded promising results, like the success rate of 95 per cent in comprehending and responding to natural language queries using GPT-3.5 turbo, demonstrating the potential of LLMs in simplifying data analysis for journalists and researchers alike.

Charting the path forward

Looking ahead, the focus is to broaden and diversify our repository of natural language queries to enhance the robustness of our system. A prototype, ElectionGPT (currently in Beta) has been created that can be accessed here for testing purposes. This access is limited to dedicated domain experts who will assist in augmenting the query database with a wider variety and range of textual queries.

In due course, we will extend access to this prototype to a wider audience, providing an opportunity for users to engage with Indian Lok Sabha data through a ChatBot - a unique experience that has not been available before. This expansion will open up new horizons, facilitating a deeper exploration of the data and democratising access to valuable insights.

Also, at the same time, we are actively exploring the capabilities of other LLMs, including GPT-4, Llama 2 and more. Daisy's journey is marked by innovative collaboration to redefine and push the boundaries of data journalism in this digital age.

Do you have skills and expertise that could help team Daisy? Get in touch by sending an email to Programme Manager, Lakshmi Sivadas, at lakshmi@journalismai.info.

Team Daisy

Daisy: Pioneering Data Journalism with LLMs

TimeLark: Understanding relationships over time made easy

MP Interests Tracker: Utilising GenAI to uncover insights in the UK Register of Financial Interest