note: embedded demo at the end

A Fact Checking Bot

Categorizing natural-language claims by their truthfulness, using the LIAR dataset.

built with
  • LIAR Dataset
  • FAISS
  • BM25
  • Transformers
  • Streamlit
Misinformation detection is a challenging task because claims often lack exact matches in labeled datasets. A robust system needs to retrieve relevant, semantically similar statements and then classify them accurately. To achieve this, the system combines sparse and dense retrieval techniques. Additionally, fine-tuning a BERT-based classifier improved contextual understanding over simpler ML models.

Dataset

LIAR
  • 12.8k manually labelled statements
  • 6 classes

Labels

  • pants-fire: completely false, no factual basis
  • false: factually incorrect, not outrageous
  • barely-true: some truth, but misleading
  • half-true: partially correct
  • mostly-true: mostly correct, minor inaccuracies
  • true: completely true

The system consists of two components, one for retrieval and the other for classification.

The Retrieval Component

Retrieving relevant information is crucial for fact-checking, especially as claims usually do not have one-to-one matches in the dataset. To improve retrieval effectiveness, a hybrid approach combining both sparse and dense retrieval is used. This is achieved by calculating a weighted sum of the respective scores.

Sparse Retrieval

BM25 ranks documents with a TF-IDF-style (Term Frequency-Inverse Document Frequency) scoring function that also saturates term frequency and normalizes for document length, making it effective for keyword-based search. The top-k documents are selected based on their BM25 scores.
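
To make the scoring concrete, here is a minimal pure-Python sketch of Okapi BM25 over a toy corpus. It is an illustration rather than the project's actual code: the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the helper names `bm25_scores` and `top_k` are all assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in tokenized:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Non-negative IDF variant (assumption).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term-frequency saturation with document-length normalization.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


def top_k(scores, k):
    # Indices of the k highest-scoring documents, best first.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy corpus (hypothetical statements, not taken from LIAR).
documents = [
    "the minimum wage was raised last year",
    "climate change is a hoax",
    "the senator voted against the bill",
]
scores = bm25_scores("climate change hoax", documents)
best = top_k(scores, 1)
```

Only the second statement shares terms with the query, so it is the only one with a non-zero score. A paraphrase with no shared keywords would score zero, which is the gap dense retrieval is meant to cover.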

Dense Retrieval

Sparse retrieval fails to capture semantic similarity, which is problematic for fact-checking. To address this, I used FAISS (Facebook AI Similarity Search) with sentence transformers, using 'all-MiniLM-L6-v2' for encoding. The claims are first converted to dense vector representations, and the closest dense vectors from the dataset are retrieved.
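
The nearest-neighbor step can be sketched without any dependencies. The code below is a brute-force cosine-similarity search that stands in for the FAISS index. The three-dimensional vectors are made-up placeholders for the sentence-transformer embeddings, and `normalize` and `dense_search` are hypothetical helper names.

```python
import math


def normalize(vec):
    # Scale a vector to unit length so the dot product equals cosine similarity.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dense_search(query_vec, doc_vecs, k):
    """Return (index, cosine similarity) for the k closest document vectors."""
    q = normalize(query_vec)
    sims = [sum(a * b for a, b in zip(q, normalize(d))) for d in doc_vecs]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    return [(i, sims[i]) for i in order]


# Placeholder embeddings (assumption): real ones would come from the encoder.
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [0.8, 0.6, 0.0]
neighbors = dense_search(query_vec, doc_vecs, k=2)
```

In this toy setup the second vector is closest to the query, followed by the first. An exhaustive scan like this is fine for a few thousand statements, and a dedicated index mainly pays off as the collection grows.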
The final ranking is determined by combining the scores from both methods with a weighted sum, with the aim of improving both recall and precision.
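
A minimal sketch of such a weighted sum follows. BM25 scores and cosine similarities live on different scales, so this version min-max normalizes each list first. That normalization step, the weight `alpha=0.5`, and the names `min_max` and `hybrid_rank` are assumptions for illustration, not details confirmed by the project.

```python
def min_max(scores):
    # Rescale scores to [0, 1]; a constant list maps to all zeros.
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_rank(sparse_scores, dense_scores, alpha=0.5, k=3):
    """Rank documents by alpha * sparse + (1 - alpha) * dense.

    Both lists must be aligned: index i refers to the same document.
    """
    sparse_n = min_max(sparse_scores)
    dense_n = min_max(dense_scores)
    combined = [alpha * s + (1 - alpha) * d for s, d in zip(sparse_n, dense_n)]
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)[:k]
    return [(i, combined[i]) for i in order]


# Hypothetical scores for three candidate statements.
sparse_scores = [2.0, 0.0, 4.0]   # e.g. BM25
dense_scores = [0.9, 0.8, 0.1]    # e.g. cosine similarity
ranked = hybrid_rank(sparse_scores, dense_scores, alpha=0.5)
```

With equal weights, the first statement ranks highest because it scores reasonably well under both methods. Setting `alpha=1.0` falls back to the sparse ranking alone, which makes the weight easy to tune on held-out queries.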

The Classification Component

Transformer-based models, particularly BERT, are well suited to fact-checking because they can capture nuanced relationships in text. I fine-tuned a BERT-based model on the LIAR dataset for six-class veracity prediction.
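
The fine-tuning itself depends on the training setup, but the last step of a six-class prediction is easy to illustrate: the classifier produces one logit per label, and a softmax turns those into probabilities. The sketch below assumes a particular label order and uses made-up logits. `LABELS`, `softmax`, and `predict_label` are hypothetical names, not the project's code.

```python
import math

# Assumption: label order from least to most truthful; the order the
# fine-tuned model actually uses may differ.
LABELS = ["pants-fire", "false", "barely-true", "half-true", "mostly-true", "true"]


def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Map six classifier logits to (label, probability)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return LABELS[best], probs[best]


# Made-up logits standing in for the model's output on one claim.
example_logits = [0.1, 2.0, 0.3, -1.0, 0.0, 0.5]
label, confidence = predict_label(example_logits)
```

Here the largest logit sits at index 1, so the predicted label is `false`. Reporting the probability alongside the label helps flag borderline cases, such as a claim that falls between `barely-true` and `half-true`.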

Demo

Check it out! Here are a few examples:
  • Is it true that Barack Obama was born in Kenya?
  • Is it true that climate change is a hoax?
  • Is it true that increasing the minimum wage will lead to massive job losses?