Identifying High Quality Training Data for Misinformation Detection

Jaren Haber, Kornraphop Kawintiranon, Lisa Singh, Alexander Chen, Adrian Pizzo, Anna Pogrebivsky, Joyce Yang

April 2023

Abstract

Misinformation spread through social media poses a grave threat to public health, interfering with the best scientific evidence available. This spread was particularly visible during the COVID-19 pandemic. To track and curb misinformation, an essential first step is to detect it. One component of misinformation detection is finding examples of misinformation posts that can serve as training data for misinformation detection algorithms. In this paper, we focus on the challenge of collecting high-quality training data in misinformation detection applications. To that end, we demonstrate the effectiveness of a simple methodology and show its viability on five myths related to COVID-19. Our methodology incorporates both dictionary-based sampling and predictions from weak learners to identify a reasonable number of myth examples for data labeling. To aid researchers in adjusting this methodology for specific use cases, we use word usage entropy to describe when fewer iterations of sampling and training will be needed to obtain high-quality samples. Finally, we present a case study that shows the prevalence of three of our myths on Twitter at the beginning of the pandemic.
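
To make two terms from the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "dictionary-based sampling" means keeping posts that contain at least one keyword, and that "word usage entropy" can be approximated by the Shannon entropy of the word frequency distribution; the paper's exact definitions may differ. The function names and the whitespace tokenization are hypothetical.

```python
import math
from collections import Counter


def dictionary_sample(posts, keywords):
    """Keep posts containing at least one keyword (assumed sampling rule)."""
    keys = {k.lower() for k in keywords}
    # Naive whitespace tokenization; a real pipeline would use a tokenizer.
    return [p for p in posts if keys & set(p.lower().split())]


def word_usage_entropy(posts):
    """Shannon entropy, in bits, of the word distribution across posts.

    This is one plausible reading of "word usage entropy", not
    necessarily the measure used in the paper.
    """
    counts = Counter(word for post in posts for word in post.lower().split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Under this reading, a myth whose posts reuse a small, repetitive vocabulary would have low entropy, which is consistent with the abstract's point that some myths need fewer rounds of sampling and training than others.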

Type

Conference paper

Publication

In Proceedings of the International Conference on Data Science, Technology and Applications (DATA)

NLP misinformation
