Using Open Data to Sharpen Science Stories

In August 2024, Brazilian journalists Sílvia Lisboa and Carla Ruas broke a story linking high levels of pesticide to shocking rates of fetal death and birth defects in Brazil’s top crop-producing regions. The months-long investigation, published in Dialogue Earth, brought together interviews with residents and evidence from dozens of research papers.

Key to this investigation were three open datasets: one containing public health data gathered by Brazil’s national health system, another mapping land use within the country, and a third tracking pesticide use around the world. Analyzed together, these rich resources allowed the reporters to shine a light on an overlooked public health problem that had been building for years.

Many reporters are familiar with the idea of using data to bring nuance and depth to a story. They may draw on public government data, such as census records or COVID-19 dashboards. But, as my team’s research suggests, journalists aren’t widely using the vast troves of freely accessible data that researchers and organizations share online. These “open” research datasets are distinct from other types of data in a journalist’s toolkit, including data collected via Freedom of Information Act requests or scraped from the web. Not only are they by definition free to access and analyze without restriction, but they may also come with detailed descriptions outlining how the information was gathered and cleaned. This methodological transparency can make it easier for journalists to know a dataset’s limitations and strengths and put the research in context.

These characteristics make open research data a rich resource for journalists seeking to tell many kinds of science stories: not only data journalism pieces but also features, investigations, and even single-study reports. The accessibility of open data can benefit all journalists, but especially those with limited budgets and resources, including many freelancers, reporters in small newsrooms, and journalists based in the Global South, says Yao Hua Law, a science journalist and co-founder of the Malaysian environmental journalism outlet Macaranga.

Taking advantage of open data might seem daunting—especially if you don’t have formal research training. “Many journalists are hesitant to use data because they would feel that it is difficult to digest and difficult to find,” Law says. But if you know what types of open data exist, where to look, and what questions to ask experts, these data can open a world of possible stories.

Investigating New Questions

Open data can be especially useful for powering in-depth, often exclusive, investigations. Because open data are free and immediately accessible, journalists can explore potential stories with less red tape and at lower cost than with closed data. And their sheer volume and breadth allow reporters to cast a wide net. Open data can take the form of interview transcripts, archival data, visualizations, geographic data, and audio or video recordings, for example. “They are not just what you might typically think of,” says open-data researcher Kathleen Gregory.

By drawing on the unique knowledge of their beat and audience, journalists can ask questions that others might have overlooked, says Arturo Garduño Magaña, former regional engagement specialist for Latin America at DataCite, a global nonprofit that supports open-data use. “Any data can tell a story,” he says.

For instance, in Ruas and Lisboa’s Dialogue Earth investigation, the duo drew on existing research about pesticides and health to develop some initial “research questions,” such as whether regions producing crops linked to higher pesticide use also had higher illness rates. They worked with environmental-health researcher Tatiane Moraes to refine their questions before Moraes and her team at the University of São Paulo ran the analyses. “Making that partnership was essential to be able to read these datasets, find patterns, and to tell this story that had not been told before,” Ruas says.

These data-driven investigative projects might be intensive, but they don’t have to be turned around like breaking news. This gives reporters extra time to let their curiosity lead them. For example, after Nature features editor Richard Van Noorden got a tip from a source that there might be as many as 10,000 research paper retractions by the end of 2023, he dove into open datasets from both Retraction Watch and research-integrity sleuth Guillaume Cabanac. Van Noorden noticed that neither dataset seemed to capture all of the retractions. By combining these data, he uncovered a shocking rise in research retractions—a story other journalists (and even scientists) hadn’t yet been able to tell.

Key to that story was that Van Noorden had more than a month to clean and analyze the data, as well as interview sources, without worrying about getting scooped. “If it’s an exclusive data project, you don’t need to do it in three days,” he says. “That’s the great thing about these projects.”

Delivering Depth and Detail

Open data can also enhance non-investigative science stories, allowing journalists to critique and contextualize new findings. When covering studies, for example, journalists can make a habit of downloading open datasets associated with the research. These data could reveal weaknesses that may not be clear from the paper itself, such as an overrepresentation of a particular demographic among study participants, large amounts of missing or incomplete data, or the presence of outliers—extreme, but potentially meaningful, data points that researchers might toss out. Open data from qualitative research, such as interview transcripts, can hold similar clues about a study’s quality. If key findings in the paper are based on interviews with just one or two participants, for example, that’s a sign the study may not be broadly representative.
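
If you’re comfortable with a little code, even a short script can surface this kind of skew. Here’s a minimal sketch using Python’s pandas library; the file name and column names are hypothetical placeholders, not from any real study.

```python
# Quick check for demographic skew in a study's open participant data.
# "participants.csv" and its column names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("participants.csv")

# Share of each group among participants; a heavily lopsided split
# may limit how far the findings generalize.
print(df["gender"].value_counts(normalize=True))
print(df["age_group"].value_counts(normalize=True))

# How much of each column is missing? Large gaps are worth asking about.
print(df.isna().mean().sort_values(ascending=False))
```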

These and other red flags might be more apparent to researchers than journalists, who don’t all have training in scrutinizing data or the time to devote to extra analyses. When Van Noorden asks outside sources to comment on a new study, he says he sometimes points them to datasets associated with the paper to help them better evaluate the findings.

For features or other stories with space for more detail, different types of open data can bring extra specificity to your reporting. In a 2023 feature on climate-adaptive architecture for El País, freelance journalist and architecture scholar Daniel Díez Martínez used an open dataset on the water consumed in producing common building materials. With these data, Díez Martínez could show how the water footprint of wood, for example, is nearly double that of concrete. Including specific numbers allowed him to balance a “quantitative perspective and qualitative perspective,” he says.

Qualitative open datasets might include quotes or anecdotes from participants that can help journalists illustrate how findings impact actual people. Including personal perspectives helps humanize science stories, especially when reporters don’t have time to interview people directly affected by the issue they’re covering.

Similarly, visual open data can enliven stories with rich details. Freelance science journalist Sofia Quaglia used a collection of openly available videos to augment her 2024 National Geographic story on how cephalopods can change their shape and color. Quaglia linked to the video data to illustrate how some species ripple dynamic patterns across their bodies to camouflage themselves. “It’s an article that’s so focused on what this looks like,” she says. “I thought it would be really helpful for the reader to just click and actually see what we’re talking about.”

Law has also used visual data to give a sense of place to a story. When he was reporting a 2021 feature for Macaranga on the detrimental effects of a deforestation project on an Indigenous community in Pahang, Malaysia, COVID-19 travel restrictions prevented him from traveling to the logging site. So, Law turned to open data in the form of historical satellite imagery to capture the changes taking place within the community. These images became the basis of a rich description of his protagonist Omar’s environment: “Just two years ago, Omar could step out of his hut into the adjacent forest and walk westward for more than 15 km under the canopy. He would have seen signs of tapirs, sun bears, leopard cats, and elephants. Perhaps even tigers,” Law wrote. “Within a year, bare soil stretched for 5 km west of Omar’s hut, with the logging continuing unabated today.”

Finding Open Research Datasets

Many people search for open datasets the same way they search for other resources: Google. While it’s possible to find open data this way, that isn’t always the most effective strategy. General search engines rely on text, images, and videos to determine whether content is relevant to your query, which means datasets can easily be overlooked, especially if their accompanying descriptions aren’t thorough. Instead, journalists should search for data via online “repositories” developed with this task in mind.

General Collections

General repositories can be a great place to start when you’re still sketching out a story idea because they include data on a diverse range of topics. Google has created a tool specifically for finding datasets: Google Dataset Search. This resource provides access to more than 45 million datasets, including academic and government data. Users can narrow their search by filtering for type of dataset (e.g., image, text, tables) and topic, such as life sciences, agriculture, or engineering.

A screenshot of Google Dataset Search.
Google Dataset Search allows users to refine search results by filtering for specific topics, usage rights, formats, and more.

Other general portals include Zenodo, the Harvard Dataverse, the Open Science Framework (OSF), The Accountability Project, and Figshare, which both Quaglia and Díez Martínez used to access data for their stories. Some of these platforms house other resources, including books, software, and preprints, so it helps to refine your search to just datasets.

Field-Specific Data Portals

These platforms allow researchers to deposit datasets specific to a particular field or discipline, like biomedical science or marine science, and can be especially useful for journalists with a dedicated beat. For instance, when Law was covering deforestation, he says he would often start by searching Global Forest Watch, an open repository for forest-related data. Many of these portals also have helpful filters. The OpenNeuro data repository, for example, allows users to search for neuroscience datasets by publication date or the age, number, and species of participants. To find repositories for their particular beat, journalists can explore data registries, including re3data.org and DataCite Commons, or ask their sources for recommendations.

A screenshot of OpenNeuro.
The OpenNeuro data repository houses data specific to neuroscience research.

Institutional Data Collections

Universities and other research institutions often provide open access to datasets produced by their researchers and students. Accessing these troves of open data could be especially helpful when searching for data connected to a particular researcher, paper, or institution. You can find these collections via institutions’ websites or platforms such as the Registry of Open Access Repositories or OpenDOAR, both of which house academic data portals.

Research Papers

Digging through published research is another common way to find relevant open data. References to openly available datasets appear in many places within a paper, including the methods section, figure or table captions, footnotes, the reference list, or acknowledgements. Or there might be a separate “data availability” statement, which some journals, such as PLOS ONE and Nature, require authors to submit.

A screenshot of a PLOS research paper.
Research articles published in PLOS journals include a data availability statement after the abstract, pointing readers to openly available datasets or explaining why sharing data is not possible (e.g., to avoid revealing participants’ identities).

If you don’t see a dataset linked in a paper, ask the researchers whether it’s freely available online—and if it’s not, ask them why. Sometimes, researchers may be willing to share data that aren’t already publicly available. However, these closed data often come with restrictions around sharing or reuse, meaning you may not be able to report on them without permission.

Evaluating Open Data

The accessibility of open data means they’re available for anyone to dissect—providing a check on their quality. In addition, researchers usually scrutinize their own datasets and ask colleagues to check them over before posting, Garduño Magaña says. That said, it’s important for journalists to know how to spot potential weaknesses in datasets.

There are a few simple tactics you can use to evaluate open data, such as manually checking for missing values or extreme outliers. Datasets might also have a “cleanliness” issue if their data don’t follow consistent formatting patterns for dates, geographic areas, or other information. These inconsistencies can muck up analyses or contribute to misleading findings. Free tools, such as OpenRefine, are helpful for identifying variations and cleaning data, especially when working with large datasets, but they do require some coding experience.
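
These checks can also be scripted. Below is a minimal sketch in pandas that flags extreme outliers and inconsistent date formatting; the file name and column names are hypothetical placeholders for whatever dataset you’ve downloaded.

```python
# Flag outliers and formatting inconsistencies in a downloaded dataset.
# "readings.csv" and its column names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("readings.csv")

# Extreme outliers: values more than three standard deviations from
# the mean. These may be errors, or meaningful points worth a call
# to the researchers.
col = df["pesticide_level"]
outliers = df[(col - col.mean()).abs() > 3 * col.std()]
print(f"{len(outliers)} potential outliers in 'pesticide_level'")

# Dates that fail to parse often signal mixed or inconsistent formats,
# the kind of "cleanliness" issue that can skew an analysis.
parsed = pd.to_datetime(df["sample_date"], errors="coerce")
print(f"{parsed.isna().sum()} unparseable values in 'sample_date'")
```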

Taking a different approach, Alice Dreger, publisher of the newsletter Local News Blues, turned to crowdsourcing to evaluate the data behind the widely used Medill “State of Local News” map. (Disclosure: I was interviewed for one of Dreger’s stories on this data.) With the help of her peers at local newsrooms across the U.S., Dreger flagged several issues, such as many outlets being double-counted, misplaced, miscategorized, or excluded altogether from the interactive dataset, raising questions about the Medill researchers’ findings. “I suspect a lot of it is wrong because [of] the way they’re counting,” she says.

Comparing related datasets, as Van Noorden did for his Nature feature, is another (more hands-on) approach to vetting data. Is one dataset pointing to a dramatically different conclusion than another? If so, what might be causing the divergence? Big differences could point to a potential quality issue, or they could simply reflect differences in study design or data collection methods. Uncanny similarities between datasets might also be a tip-off that something isn’t quite right; those data could be copied from somewhere else.
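
For journalists who script their analyses, that comparison can start with a simple merge. The sketch below, again in pandas, assumes two hypothetical retraction datasets that share a DOI column; every file and column name here is a stand-in, not the actual Retraction Watch or Cabanac data.

```python
# Cross-check two datasets that should describe the same records.
# File and column names are hypothetical stand-ins.
import pandas as pd

a = pd.read_csv("retractions_a.csv")
b = pd.read_csv("retractions_b.csv")

# An outer merge shows which records appear in only one dataset. Large
# one-sided gaps may reflect different collection methods, or a
# quality problem worth asking the researchers about.
merged = a.merge(b, on="doi", how="outer", indicator=True,
                 suffixes=("_a", "_b"))
print(merged["_merge"].value_counts())

# Records present in both datasets that disagree on a key value.
both = merged[merged["_merge"] == "both"]
mismatch = both[both["retraction_date_a"] != both["retraction_date_b"]]
print(f"{len(mismatch)} shared records disagree on the retraction date")
```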

In addition, many open datasets come with a README file that describes how the data were gathered and processed. Critically reviewing this information could point to possible limitations. For example, if you’re vetting survey data, is the sample representative of the population the researchers are studying? Could there be a response or selection bias at play? When were the data collected? (Even newly published datasets could include data points that were collected years ago.)

Another key method for vetting datasets is to ask the experts. Contact the researchers behind a dataset to gather intel such as how the data were collected and whether there are any limitations you should know about. Ask around, too, “just as you would want an independent source to comment on a research article,” Van Noorden says. He typically asks outside sources general questions such as, “What do you think of this dataset and how it’s put together?” as well as more specific prompts, like whether a finding might be an artifact of how the data were collected, whether anything important has been left out, and if there are other datasets worth considering instead. Journalists can also find public comments from researchers critiquing datasets on peer review forums such as PubPeer.

Remember that just because a dataset is available doesn’t mean it’s good. Rely on the input of sources and your journalistic instincts to decide whether a dataset can be trusted. And if not, there’s a story there, too. This was the case for freelance climate and environmental journalist Chloe Glad, who revealed major issues with a public dataset published as part of the European Union’s initiative to plant three billion trees by 2030. The numbers seemed to tell an impressive story about conservation, but by visiting the tree-planting sites and interviewing experts in biodiversity, Glad uncovered inconsistencies across the planting projects, including differences in how the trees are being counted. The first part of her investigation, published in the Belgian magazine Wilfried in 2024, points to perhaps the most important caveat of any dataset: Numbers are only part of the story. They neglect the “nuances of the reality that’s happening on the ground,” Glad says. It’s up to journalists to bring those numbers down to earth.

Alice Fleerackers

Alice Fleerackers is a freelance writer whose work has appeared in outlets including Nautilus, The Globe and Mail, and the National Post. She is also an assistant professor in the Department of Media Studies at the University of Amsterdam, where she studies the intersections of science journalism and open science.
