Misinformation detection is challenging because claims often lack exact matches in labeled datasets. A robust system needs to retrieve relevant, semantically similar statements while still classifying them accurately. To achieve this, the system combines sparse and dense retrieval techniques. Additionally, fine-tuning a BERT-based classifier improves contextual understanding over simpler ML models.
Dataset
- 12.8k manually labelled statements
- 6 classes
Retrieving relevant information is crucial for fact-checking, especially as claims usually do not have one-to-one matches in the dataset. To improve retrieval effectiveness, a hybrid approach combining both sparse and dense retrieval is used. This is achieved by calculating a weighted sum of the respective scores.
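The weighted sum can be illustrated with a minimal sketch. It assumes that each retriever returns one score per candidate document and that the scores are min-max normalized before mixing, since BM25 and cosine scores live on different scales. The normalization step, the weight name `alpha`, and the toy numbers are assumptions for illustration, not details taken from the project.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_scores(sparse, dense, alpha=0.5):
    """Weighted sum of normalized sparse and dense scores, one per document."""
    if len(sparse) != len(dense):
        raise ValueError("score lists must cover the same documents")
    s_norm, d_norm = min_max(sparse), min_max(dense)
    return [alpha * s + (1 - alpha) * d for s, d in zip(s_norm, d_norm)]


def top_k(scores, k):
    """Indices of the k highest-scoring documents, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy scores for four documents (made-up numbers, not project results).
bm25 = [12.0, 3.0, 0.0, 9.0]
cosine = [0.20, 0.90, 0.10, 0.60]

fused = hybrid_scores(bm25, cosine, alpha=0.5)
ranking = top_k(fused, k=2)
```

Raising `alpha` favors exact keyword overlap, while lowering it favors semantic similarity; a suitable value would normally be tuned on held-out claims.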
Sparse Retrieval
BM25 ranks documents with a TF-IDF-style (Term Frequency-Inverse Document Frequency) scoring function that also saturates term frequency and normalizes for document length, making it effective for keyword-based searching. The top-k documents are selected based on their BM25 scores.
Dense Retrieval
Sparse retrieval fails to capture semantic similarity, which is problematic for fact-checking because a claim can be paraphrased without sharing keywords with its match. To address this, I used FAISS (Facebook AI Similarity Search) with sentence transformers, using 'all-MiniLM-L6-v2' for encoding. The claims are first converted to dense vector representations, and the closest dense vectors from the dataset are retrieved. The final ranking is determined by combining the scores from both methods with a weighted sum to improve both recall and precision.
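The nearest-vector lookup can be sketched in plain Python. This is not the FAISS and sentence-transformers code used in the project: the three-dimensional vectors below are hypothetical stand-ins for real embeddings, and the brute-force loop stands in for the index. An exact inner-product index over unit-length vectors would be expected to produce the same ranking; which index type the project used is not stated.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def search(query_vec, doc_vecs, k):
    """Return (index, cosine similarity) for the k documents closest to the query."""
    q = normalize(query_vec)
    scored = []
    for i, doc in enumerate(doc_vecs):
        d = normalize(doc)
        scored.append((i, sum(a * b for a, b in zip(q, d))))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real all-MiniLM-L6-v2 vectors have 384 dimensions.
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [0.8, 0.6, 0.0]

hits = search(query_vec, doc_vecs, k=2)
```

The similarity values returned here play the role of the dense scores that feed into the weighted sum.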
Transformer-based models, particularly BERT, are better suited to fact-checking than simpler ML models because they capture nuanced relationships in text. I fine-tuned a BERT-based model on the LIAR dataset for six-class veracity prediction.