Course Project - Generative AI

Program Repair and Hint Generation

Exploring generative AI models for automated program repair and hint generation for Python programming education using the INTROPYNUS dataset.

built with
  • OpenAI API
  • GPT-4o-mini
  • Phi-3-mini
  • LoRA Fine-tuning
  • Chain-of-Thought Prompting
Programming education faces significant challenges in providing timely, personalized feedback to students. This project investigates how generative AI can automate program repair and generate educational hints for Python programming exercises using the INTROPYNUS dataset (5 tasks, with 5 buggy programs each). Through careful prompt engineering, LoRA fine-tuning, and multi-task learning approaches, we explore how to balance correctness, pedagogical value, and computational efficiency in automated programming education tools.

Problem Statement

The core objective is to develop AI systems that can automatically repair buggy Python programs and generate educational hints that guide students toward correct solutions without directly revealing the answer. This addresses the scalability challenges in programming education where personalized feedback is crucial but resource-intensive.
Program Repair
Given a buggy Python program, automatically generate a corrected version that passes all test cases while maintaining pedagogical intent.
Hint Generation
Provide natural language feedback that guides students toward solutions without directly revealing the fix or complete answer.

Dataset & Evaluation

INTROPYNUS Dataset

A comprehensive dataset of Python programming exercises with buggy implementations, correct solutions, and test cases designed for introductory programming education. Here we consider 5 tasks, each containing 5 buggy programs.

Evaluation Metrics

Program repair effectiveness is measured by RPass (repair pass rate), and hint quality is assessed against four pedagogical criteria.
RPass Rate
Percentage of buggy programs successfully repaired to pass all test cases
Hint Quality
Correctness, informativeness, concealment, and comprehensibility of generated hints
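
As a rough illustration of the RPass definition above, the sketch below counts how many repaired programs pass every test case. The `(args, expected)` test format and the helper names are assumptions made for this example, not the project's actual evaluation harness.

```python
from typing import Callable, Sequence


def passes_all_tests(func: Callable, tests: Sequence[tuple]) -> bool:
    """Return True if func(*args) == expected for every (args, expected) pair."""
    for args, expected in tests:
        try:
            if func(*args) != expected:
                return False
        except Exception:
            # A crash (e.g. RecursionError) counts as a failed test.
            return False
    return True


def rpass_rate(repairs: Sequence[Callable], tests: Sequence[tuple]) -> float:
    """Percentage of repaired programs that pass all test cases."""
    if not repairs:
        return 0.0
    passed = sum(passes_all_tests(repair, tests) for repair in repairs)
    return 100.0 * passed / len(repairs)


# Hypothetical task: double the input. One repair is correct, one is not.
doubling_tests = [((1,), 2), ((3,), 6)]
candidate_repairs = [lambda x: x * 2, lambda x: x ** 2]
print(rpass_rate(candidate_repairs, doubling_tests))
```

Treating any exception as a failure keeps a crashing repair from aborting the whole evaluation run.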

Baseline Performance

Initial evaluation of foundation models on program repair reveals a large gap between the larger and the smaller model: GPT-4o-mini performs strongly out of the box, while Phi-3-mini starts at a 36% RPass rate and needs substantial improvement.
Baseline Model Comparison

Multi-Candidate Sampling

Generating multiple repair candidates and selecting the best solution significantly improves performance, especially for smaller models. Analysis shows diminishing returns beyond 5-10 candidates.
Phi-3-mini RPass by number of sampled candidates (k):

k      RPass    Change vs. previous k
1      36%      (baseline)
5      52%      +16%
10     58%      +6%
20     62%      +4%
Sampling Performance Analysis
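
The sampling loop can be sketched as follows: draw up to k candidates and keep the first one that passes the tests. This is a minimal sketch, not the project's pipeline. The `generate` callable stands in for a model call, and the stub generator, candidate strings, and test cases are all hypothetical.

```python
import itertools
from typing import Callable, Optional, Sequence


def passes_tests(source: str, func_name: str, tests: Sequence[tuple]) -> bool:
    """Execute candidate source and check func_name against (args, expected) pairs."""
    namespace: dict = {}
    try:
        # NOTE: exec on model output is unsafe; a real system would sandbox
        # this and add a timeout (an infinite loop is not caught here).
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        return False


def repair_with_sampling(
    generate: Callable[[str], str],
    buggy_source: str,
    func_name: str,
    tests: Sequence[tuple],
    k: int,
) -> Optional[str]:
    """Sample up to k candidates; return the first that passes all tests, else None."""
    for _ in range(k):
        candidate = generate(buggy_source)
        if passes_tests(candidate, func_name, tests):
            return candidate
    return None


def make_stub_generator(candidates: Sequence[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: cycles through canned candidates."""
    pool = itertools.cycle(candidates)
    return lambda buggy_source: next(pool)


BUGGY = "def factorial(n):\n    if n == 0:\n        return 1\n    return n * factorial(n)\n"
WRONG = "def factorial(n):\n    return n\n"
RIGHT = "def factorial(n):\n    return 1 if n == 0 else n * factorial(n - 1)\n"
TESTS = [((0,), 1), ((4,), 24)]

stub = make_stub_generator([WRONG, RIGHT])
print(repair_with_sampling(stub, BUGGY, "factorial", TESTS, k=5))
```

With this stub, k=1 only ever sees the wrong candidate, while any k of 2 or more reaches the correct one, which mirrors why a larger k raises the pass rate.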

LoRA Fine-tuning Results

Low-Rank Adaptation (LoRA) fine-tuning dramatically improves small model performance. Different configurations of rank ($r$) and scaling factor ($\alpha$) offer trade-offs between performance and computational requirements.
LoRA Configuration: Performance Comparison
LoRA Configuration: Resource Usage
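
To make the rank and scaling trade-off concrete: in the standard LoRA formulation, a frozen weight matrix of shape d_out x d_in is augmented with a low-rank update B A scaled by $\alpha / r$, where B is d_out x r and A is r x d_in. The sketch below computes the resulting trainable parameter count and scale. The 3072 x 3072 layer size is an illustrative assumption, not a figure reported by the project.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank factors: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


def lora_scale(alpha: float, r: int) -> float:
    """Multiplier applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


# Assumed layer size for illustration only.
d_in = d_out = 3072
full_params = d_in * d_out

for r, alpha in [(8, 16), (16, 32), (32, 64)]:
    trainable = lora_trainable_params(d_in, d_out, r)
    share = trainable / full_params
    print(f"r={r:>2} alpha={alpha:>2} scale={lora_scale(alpha, r):.1f} "
          f"trainable={trainable} ({share:.2%} of the full matrix)")
```

Trainable parameters grow linearly with $r$, so doubling the rank doubles the adapter size while the $\alpha / r$ scale stays fixed whenever $\alpha$ is doubled along with it.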

Multi-task Learning

Training models simultaneously on program repair and hint generation tasks leverages shared representations and improves overall performance on both objectives compared to single-task approaches.
Single-task Performance
Program Repair: 80%
Hint Quality: 0.72
Models trained separately on each task
Multi-task Performance
Program Repair: 84%
Hint Quality: 0.78
Joint training improves both tasks
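
One common way to set up joint training is to mix examples from both tasks into a single dataset and mark each with a task tag. The sketch below assumes that approach. The `[REPAIR]` and `[HINT]` tags, the record fields, and the function name are hypothetical, and the project's actual data format is not described here.

```python
import random
from typing import Dict, List


def build_multitask_examples(records: List[Dict[str, str]], seed: int = 0) -> List[Dict[str, str]]:
    """Turn each annotated buggy program into one repair and one hint example."""
    examples = []
    for rec in records:
        # Same input program, different task tag and target.
        examples.append({"input": "[REPAIR]\n" + rec["buggy"], "target": rec["fixed"]})
        examples.append({"input": "[HINT]\n" + rec["buggy"], "target": rec["hint"]})
    # Shuffle so both tasks are interleaved within each training batch.
    random.Random(seed).shuffle(examples)
    return examples


records = [
    {
        "buggy": "def factorial(n):\n    if n == 0:\n        return 1\n    return n * factorial(n)",
        "fixed": "def factorial(n):\n    if n == 0:\n        return 1\n    return n * factorial(n - 1)",
        "hint": "Think about what happens to the parameter in each recursive call.",
    }
]

for example in build_multitask_examples(records):
    print(example["input"].splitlines()[0], "->", example["target"].splitlines()[0])
```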

Example Program Repair

Buggy Implementation

student_solution.py
def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n)  # Missing decrement!
Error
RecursionError: maximum recursion depth exceeded

AI-Generated Repair

repaired_solution.py
def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n - 1)  # Fixed: decrement n
Generated Hint
"Think about what happens to the parameter in each recursive call. What should change to eventually reach the base case?"

Theoretical Framework

The approach builds on pedagogical theories emphasizing guided discovery and scaffolded learning. Mathematical formalization includes repair accuracy as $P(\texttt{repair\_correct} \mid \texttt{buggy\_code}, \texttt{context})$ and hint quality as a multi-dimensional vector $\vec{q} = \begin{pmatrix} \texttt{correctness} \\ \texttt{informativeness} \\ \texttt{concealment} \\ \texttt{comprehensibility} \end{pmatrix}$.
Correctness
$C = \frac{|\{h \in H : \texttt{semantically\_correct}(h)\}|}{|H|}$
Informativeness
Measured by semantic richness and actionability of provided guidance
Concealment
$\text{conceal}(h) = 1 - \text{similarity}(h, \text{solution})$
Comprehensibility
Readability and clarity for target student population
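
The correctness and concealment formulas above can be sketched in a few lines. The similarity measure is not specified in this write-up, so the example substitutes `difflib.SequenceMatcher` from the Python standard library as a stand-in. The `is_correct` judge is likewise a hypothetical placeholder for a human or model-based check.

```python
from difflib import SequenceMatcher
from typing import Callable, Sequence


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the project's actual measure may differ."""
    return SequenceMatcher(None, a, b).ratio()


def conceal(hint: str, solution: str) -> float:
    """conceal(h) = 1 - similarity(h, solution)."""
    return 1.0 - similarity(hint, solution)


def correctness(hints: Sequence[str], is_correct: Callable[[str], bool]) -> float:
    """C = |{h in H : semantically_correct(h)}| / |H|."""
    if not hints:
        return 0.0
    return sum(1 for h in hints if is_correct(h)) / len(hints)


fix_line = "return n * factorial(n - 1)"
good_hint = "Think about what happens to the parameter in each recursive call."
leaky_hint = "Change the last line to return n * factorial(n - 1)"

print("good hint :", round(conceal(good_hint, fix_line), 2))
print("leaky hint:", round(conceal(leaky_hint, fix_line), 2))
```

A hint that quotes the fix verbatim scores lower on concealment than one that only points at the recursive call, which is the behaviour the metric is meant to capture.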

Key Findings

The research demonstrates that LoRA fine-tuning is essential for small models to achieve competitive program repair performance. The optimal configuration ($r=16$, $\alpha=32$) balances performance and computational efficiency. Multi-task learning provides synergistic benefits, and careful prompt engineering with chain-of-thought reasoning significantly enhances both repair accuracy and hint quality.
LoRA Effectiveness
Fine-tuning improves Phi-3-mini from 36% to 88% RPass rate, making small models viable for educational deployment.
Multi-task Benefits
Joint training on repair and hint generation improves both tasks through shared representation learning.
Prompt Engineering
Chain-of-thought prompting enhances reasoning quality and educational value of generated content.
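
For illustration, a chain-of-thought hint prompt might be assembled as below. The wording, the step list, and the `HINT:` output convention are hypothetical choices made for this sketch and are not the prompts used in the project.

```python
def build_hint_prompt(task_description: str, buggy_code: str) -> str:
    """Assemble a chain-of-thought style prompt (hypothetical wording)."""
    steps = [
        "1. Describe what the program is supposed to do.",
        "2. Trace the student program on a small input and locate the bug.",
        "3. Explain why that bug causes the wrong behaviour.",
        "4. Write a short hint that points toward the bug without showing corrected code.",
    ]
    return (
        "You are a tutor for an introductory Python course.\n\n"
        f"Task:\n{task_description}\n\n"
        f"Student program:\n{buggy_code}\n\n"
        "Reason step by step:\n" + "\n".join(steps) + "\n\n"
        "Finish with a line starting with 'HINT:' that contains only the hint."
    )


def extract_hint(model_output: str) -> str:
    """Return the text after the last 'HINT:' line, or '' if none is present."""
    for line in reversed(model_output.splitlines()):
        if line.startswith("HINT:"):
            return line[len("HINT:"):].strip()
    return ""


buggy = "def factorial(n):\n    if n == 0:\n        return 1\n    return n * factorial(n)"
print(build_hint_prompt("Compute n! recursively.", buggy))
```

Separating the reasoning steps from a final tagged line lets the intermediate analysis stay hidden from the student, so only the hint itself is shown.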

Limitations & Future Work

Language Expansion
Extend beyond Python to Java, C++, and JavaScript for broader educational impact
Personalization
Incorporate student learning patterns and preferences for adaptive hint generation
Advanced Pedagogy
Develop sophisticated Socratic questioning and scaffolding techniques for deeper learning