Programming education faces significant challenges in providing timely, personalized feedback to students. This project investigates how generative AI can automate program repair and generate educational hints for Python programming exercises using the INTROPYNUS dataset (5 tasks, with 5 buggy programs each). Through careful prompt engineering, LoRA fine-tuning, and multi-task learning approaches, we explore how to balance correctness, pedagogical value, and computational efficiency in automated programming education tools.
Problem Statement
The core objective is to develop AI systems that can automatically repair buggy Python programs and generate educational hints that guide students toward correct solutions without directly revealing the answer. This addresses the scalability challenges in programming education where personalized feedback is crucial but resource-intensive.
Program Repair
Given a buggy Python program, automatically generate a corrected version that passes all test cases while maintaining pedagogical intent.
Hint Generation
Provide natural language feedback that guides students toward solutions without directly revealing the fix or complete answer.
Dataset & Evaluation
INTROPYNUS Dataset
A comprehensive dataset of Python programming exercises with buggy implementations, correct solutions, and test cases designed for introductory programming education. Here we consider 5 tasks, each containing 5 buggy programs.
Evaluation Metrics
Program repair effectiveness is measured by RPass (repair pass rate); hint quality is assessed against four pedagogical criteria.
RPass Rate
Percentage of buggy programs successfully repaired to pass all test cases
Hint Quality
Correctness, informativeness, concealment, and comprehensibility of generated hints
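The RPass definition above can be sketched in a few lines of Python. This is a minimal illustration, not the project's evaluation harness: the helper names `passes_all_tests` and `rpass`, and the representation of a test case as an (arguments, expected value) pair, are assumptions made for the example.

```python
from typing import Sequence, Tuple

# Hypothetical test-case format: (positional arguments, expected return value).
TestCase = Tuple[tuple, object]


def passes_all_tests(source: str, func_name: str, tests: Sequence[TestCase]) -> bool:
    """Return True if the function defined in `source` passes every test case."""
    namespace: dict = {}
    try:
        # NOTE: exec runs arbitrary code; a real harness should sandbox student programs.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


def rpass(repairs: Sequence[str], func_name: str, tests: Sequence[TestCase]) -> float:
    """Percentage of candidate repairs that pass all test cases."""
    if not repairs:
        return 0.0
    passed = sum(passes_all_tests(src, func_name, tests) for src in repairs)
    return 100.0 * passed / len(repairs)


good = "def factorial(n):\n    return 1 if n == 0 else n * factorial(n - 1)\n"
bad = "def factorial(n):\n    return 1 if n == 0 else n * factorial(n)\n"
tests = [((0,), 1), ((3,), 6)]
print(rpass([good, bad], "factorial", tests))
```

Here one of the two candidate repairs passes, so the sketch reports a 50% pass rate.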
Baseline Performance
Initial evaluation of foundation models on program repair reveals a large gap between large and small models: GPT-4o-mini achieves strong results, while Phi-3-mini starts at a 36% RPass rate and requires substantial improvement.
[Figure: Baseline Model Comparison]
Multi-Candidate Sampling
Generating multiple repair candidates and selecting the best solution significantly improves performance, especially for smaller models. Analysis shows diminishing returns beyond 5-10 candidates.
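RPass by number of sampled candidates (k), Phi-3-mini:
k=1: 36%
k=5: 52% (+16 points over k=1)
k=10: 58% (+6 points over k=5)
k=20: 62% (+4 points over k=10)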
[Figure: Sampling Performance Analysis]
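The sample-and-select idea in this section can be sketched as a simple loop: draw up to k candidates and return the first one that passes the checks. The function `first_passing_candidate` and the stub generator below are hypothetical; in practice `generate` would call the model and `is_correct` would run the task's test suite.

```python
from typing import Callable, Optional


def first_passing_candidate(
    generate: Callable[[], str],
    is_correct: Callable[[str], bool],
    k: int,
) -> Optional[str]:
    """Sample up to k candidates and return the first that passes, else None."""
    for _ in range(k):
        candidate = generate()
        if is_correct(candidate):
            return candidate
    return None


# Stub generator standing in for a model: yields fixed candidates in order.
_stub_outputs = iter([
    "def f(x): return x",
    "def f(x): return x + 1",
    "def f(x): return x + 2",
])


def generate() -> str:
    return next(_stub_outputs)


def is_correct(source: str) -> bool:
    """Toy check: the candidate must define f with f(1) == 2."""
    namespace: dict = {}
    exec(source, namespace)  # sandbox untrusted code in real use
    return namespace["f"](1) == 2


result = first_passing_candidate(generate, is_correct, k=3)
print(result)
```

Stopping at the first passing candidate keeps the cost proportional to the number of samples actually needed, which matters when larger k brings only small gains.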
LoRA Fine-tuning Results
Low-Rank Adaptation (LoRA) fine-tuning dramatically improves small model performance. Different configurations of rank (r) and scaling factor (α) offer trade-offs between performance and computational requirements.
[Figure: LoRA Configuration: Performance Comparison]
[Figure: LoRA Configuration: Resource Usage]
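To make the rank and scaling trade-off concrete, the sketch below counts the parameters LoRA adds to a single weight matrix. In standard LoRA, a frozen weight W of shape (d_out, d_in) is updated by (α/r)·B·A, where A has shape (r, d_in) and B has shape (d_out, r). The 3072 × 3072 layer size is an illustrative assumption, not a figure reported in this project, and the helper names are hypothetical.

```python
def lora_trainable_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


def lora_scaling(alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A."""
    return alpha / r


# Illustrative layer size (an assumption for this example only).
d_out = d_in = 3072
full_params = d_out * d_in

for r, alpha in [(8, 16), (16, 32), (32, 64)]:
    added = lora_trainable_params(d_out, d_in, r)
    share = 100.0 * added / full_params
    print(f"r={r:>2} alpha={alpha:>2} scaling={lora_scaling(alpha, r):.1f} "
          f"trainable={added:,} ({share:.2f}% of the full matrix)")
```

The adapter size grows linearly with r, so doubling the rank doubles the trainable parameters for that layer; at r = 16 the adapter is roughly 1% of this example matrix.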
Multi-task Learning
Training models simultaneously on program repair and hint generation tasks leverages shared representations and improves overall performance on both objectives compared to single-task approaches.
Single-task Performance
Program Repair: 80%
Hint Quality: 0.72
Models trained separately on each task
Multi-task Performance
Program Repair: 84%
Hint Quality: 0.78
Joint training improves both tasks
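One common way to set up joint training is to mix examples from both tasks into a single dataset, marking each with a task tag. The source does not describe how the joint training data was built, so the sketch below is only one plausible arrangement; the `[REPAIR]` and `[HINT]` tags and the function `build_multitask_examples` are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple


def build_multitask_examples(
    repair_pairs: Sequence[Tuple[str, str]],  # (buggy program, repaired program)
    hint_pairs: Sequence[Tuple[str, str]],    # (buggy program, hint text)
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Tag each example with its task and shuffle both tasks into one list."""
    examples = [
        {"prompt": f"[REPAIR]\n{buggy}", "target": fixed}
        for buggy, fixed in repair_pairs
    ]
    examples += [
        {"prompt": f"[HINT]\n{buggy}", "target": hint}
        for buggy, hint in hint_pairs
    ]
    # A fixed seed keeps the interleaving reproducible across runs.
    random.Random(seed).shuffle(examples)
    return examples


buggy = "def factorial(n):\n    if n == 0:\n        return 1\n    return n * factorial(n)\n"
fixed = buggy.replace("factorial(n)\n", "factorial(n - 1)\n")
hint = "What should change in each recursive call to reach the base case?"

mixed = build_multitask_examples([(buggy, fixed)], [(buggy, hint)])
for example in mixed:
    print(example["prompt"].splitlines()[0], "->", example["target"][:40])
```

Sharing the same buggy program across both tasks is what lets a single model reuse its understanding of the bug for repair and for hint writing.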
Example Program Repair
Buggy Implementation
student_solution.py
def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n)  # Missing decrement!
Error
RecursionError: maximum recursion depth exceeded
AI-Generated Repair
repaired_solution.py
def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n - 1)  # Fixed: decrement n
Generated Hint
"Think about what happens to the parameter in each recursive call. What should change to eventually reach the base case?"
Theoretical Framework
The approach builds on pedagogical theories emphasizing guided discovery and scaffolded learning. Mathematical formalization includes repair accuracy as $P(\text{repair\_correct} \mid \text{buggy\_code}, \text{context})$ and hint quality as a multi-dimensional vector $\mathbf{q} = (\text{correctness}, \text{informativeness}, \text{concealment}, \text{comprehensibility})^\top$.
Correctness
$C = \dfrac{|\{h \in H : \text{semantically\_correct}(h)\}|}{|H|}$
Informativeness
Measured by semantic richness and actionability of provided guidance
Concealment
$\text{conceal}(h) = 1 - \text{similarity}(h, \text{solution})$
Comprehensibility
Readability and clarity for target student population
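The correctness and concealment formulas above can be sketched directly. The source does not say which similarity function is used, so this example substitutes the character-level ratio from Python's standard `difflib.SequenceMatcher` as a stand-in; the per-hint correctness judgements are assumed to come from a human or automated grader.

```python
from difflib import SequenceMatcher
from typing import Sequence


def correctness(judgements: Sequence[bool]) -> float:
    """Fraction of hints in H judged semantically correct."""
    if not judgements:
        raise ValueError("at least one hint judgement is required")
    return sum(judgements) / len(judgements)


def concealment(hint: str, solution: str) -> float:
    """1 - similarity(hint, solution), using difflib's ratio as the similarity.

    The choice of SequenceMatcher is an assumption for illustration; an
    embedding-based similarity would plug in the same way.
    """
    similarity = SequenceMatcher(None, hint, solution).ratio()
    return 1.0 - similarity


solution = "return n * factorial(n - 1)"
vague_hint = "Think about what changes in each recursive call."
leaky_hint = "Use return n * factorial(n - 1)"

print(correctness([True, True, False, True]))
print(concealment(vague_hint, solution))
print(concealment(leaky_hint, solution))
```

A hint that copies the fix scores low on concealment, while a hint that only points at the recursive call scores higher.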
Key Findings
The research demonstrates that LoRA fine-tuning is essential for small models to achieve competitive program repair performance. The optimal configuration (r = 16, α = 32) balances performance and computational efficiency. Multi-task learning provides synergistic benefits, and careful prompt engineering with chain-of-thought reasoning significantly enhances both repair accuracy and hint quality.
LoRA Effectiveness
Fine-tuning improves Phi-3-mini from 36% to 88% RPass rate, making small models viable for educational deployment.
Multi-task Benefits
Joint training on repair and hint generation improves both tasks through shared representation learning.
Prompt Engineering
Chain-of-thought prompting enhances reasoning quality and educational value of generated content.
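As an illustration of chain-of-thought prompting for this setting, the sketch below assembles a prompt that asks the model to reason through the bug before producing a repair and a hint. The wording and the function `build_repair_prompt` are hypothetical; the prompts actually used in the project are not given in this summary.

```python
def build_repair_prompt(task_description: str, buggy_code: str) -> str:
    """Assemble a chain-of-thought prompt for repair plus hint generation."""
    return (
        "You are a tutor for an introductory Python course.\n\n"
        f"Task description:\n{task_description}\n\n"
        f"Student program:\n{buggy_code}\n\n"
        "Reason step by step before answering:\n"
        "1. Describe what the program is supposed to do.\n"
        "2. Trace the program on a small input and locate the bug.\n"
        "3. Write the corrected program.\n"
        "4. Write a one-sentence hint that points to the bug "
        "without revealing the fix.\n"
    )


prompt = build_repair_prompt(
    "Compute n! recursively.",
    "def factorial(n):\n    if n == 0:\n        return 1\n    return n * factorial(n)\n",
)
print(prompt)
```

Asking for the hint last, after the trace and the fix, gives the model its own diagnosis to draw on while the instruction still requires the hint to conceal the answer.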
Limitations & Future Work
Language Expansion
Extend beyond Python to Java, C++, and JavaScript for broader educational impact
Personalization
Incorporate student learning patterns and preferences for adaptive hint generation
Advanced Pedagogy
Develop sophisticated Socratic questioning and scaffolding techniques for deeper learning