Program Repair Hint Generation

Programming feedback has two different jobs. A repair system must produce code that is demonstrably correct; a teaching system must help a learner discover that correction. This project studies both jobs on the same introductory Python exercises, then asks how model scale, candidate sampling, LoRA rank, and joint training change the result.

The Learning Problem

The input is a buggy program, its exercise context, and a test suite. The model produces either an executable patch or a natural-language hint. Those outputs share the same diagnosis, but they are judged differently: the patch by whether it passes the tests, the hint by whether it guides the learner without disclosing the fix.

Repair the program

Change the implementation just enough to pass every test while preserving the exercise’s intended solution.

Teach without revealing

Explain the underlying mistake clearly enough to guide the next attempt without disclosing the finished patch.

Evaluation Setup

INTROPYNUS sample

The experiments use five introductory Python tasks with five buggy submissions per task. Each record includes the exercise context, a failing implementation, a reference solution, and executable tests.
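The record layout described above can be pictured as a small data structure. This is a minimal sketch: the class name `RepairRecord` and all field names are hypothetical and are not taken from the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepairRecord:
    """One buggy submission. Field names are illustrative assumptions."""
    task_id: str
    exercise_context: str    # problem statement shown to the student
    buggy_source: str        # the failing implementation
    reference_source: str    # a known-correct solution
    tests: tuple[str, ...]   # executable checks, one per string


def slice_size(tasks: int, programs_per_task: int) -> int:
    """Number of records in the evaluation slice."""
    return tasks * programs_per_task


print(slice_size(tasks=5, programs_per_task=5))  # 25
```

With five tasks and five submissions each, the slice holds 25 records.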

Evaluation slice

Tasks: 5
Buggy programs per task: 5
Primary language: Python
Repair criterion: All tests pass
Hint criterion: Useful without disclosure

A deliberately compact sample for comparing prompting and fine-tuning strategies under the same test conditions.

What counts as a good answer

Repair quality is binary and executable. Hint quality is multi-dimensional: a useful hint can be technically correct and still fail if it gives away too much or assumes knowledge the learner does not yet have.

Repair: RPass

Share of generated patches that pass the complete test suite.

Hint: Correctness

The diagnosis and suggested direction match the actual defect.

Hint: Concealment

The guidance avoids reproducing the reference implementation.

Hint: Comprehensibility

The wording is clear for an introductory programming student.
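Because the repair metric is binary per patch, it reduces to a simple count. The sketch below assumes test outcomes are already available as booleans; the function name `rpass` and the input layout are illustrative choices, not the project's code.

```python
def rpass(results: list[list[bool]]) -> float:
    """Share of patches that passed every test.

    results[i][j] is True when patch i passed test j.
    A patch with no recorded tests is counted as a failure.
    """
    if not results:
        raise ValueError("no patches to score")
    passed = sum(1 for tests in results if tests and all(tests))
    return passed / len(results)


outcomes = [
    [True, True, True],     # passes the full suite
    [True, False, True],    # one failing test, so the patch fails
    [True, True, True],
    [False, False, False],
]
print(rpass(outcomes))  # 0.5
```

A patch that passes all but one test contributes nothing, which is what makes the metric strict.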

What Changed Performance

Foundation-model baseline

Prompting alone produces a large model-size gap. GPT-4o-mini repairs 92% of the evaluated programs, while Phi-3-mini reaches 36%. The smaller model therefore becomes the useful test case: can inference-time search or parameter-efficient training close that gap?

Prompt-only repair pass rates. GPT-4o-mini begins 56 percentage points ahead of Phi-3-mini on the same evaluation slice.

Candidate sampling

Sampling several possible repairs gives the smaller model more chances to escape an incorrect first approach. Most of the improvement arrives by five candidates; beyond ten, each additional sample buys progressively less.

Phi-3-mini repair performance as the number of sampled candidates increases. The first four extra candidates recover 16 points; the next fifteen recover only 10 more.

Largest gain: +16 points (k=1 → k=5)

Practical elbow: k=10 (58% RPass)

Best observed: 62% (k=20)
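One way to read candidate sampling is as a filter: draw up to k repairs and keep the first one that passes the tests. The sketch below illustrates that idea under stated assumptions. The candidate strings stand in for model samples, and `first_passing` and `factorial_tests` are hypothetical helpers rather than the project's pipeline.

```python
from typing import Callable, Iterable, Optional


def first_passing(
    candidates: Iterable[str],
    passes_tests: Callable[[str], bool],
    k: int,
) -> Optional[str]:
    """Return the first of up to k candidates that passes, else None."""
    for i, candidate in enumerate(candidates):
        if i >= k:
            break
        if passes_tests(candidate):
            return candidate
    return None


def factorial_tests(source: str) -> bool:
    """Illustrative test suite: run the candidate and check a few values."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        f = namespace["factorial"]
        return f(0) == 1 and f(1) == 1 and f(5) == 120
    except Exception:
        return False


# Stand-ins for sampled repairs; only the last one is correct.
samples = [
    "def factorial(n):\n    if n == 0:\n        return 1\n    return n + factorial(n - 1)\n",
    "def factorial(n):\n    if n == 0:\n        return 0\n    return n * factorial(n - 1)\n",
    "def factorial(n):\n    if n == 0:\n        return 1\n    return n * factorial(n - 1)\n",
]

print(first_passing(samples, factorial_tests, k=1) is None)        # True
print(first_passing(samples, factorial_tests, k=3) == samples[2])  # True
```

Each extra candidate costs another generation and another test run, which is why the diminishing returns past k=10 matter in practice.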

Parameter-efficient fine-tuning

LoRA produces a much larger shift than sampling. Increasing rank improves repair accuracy, but each fourfold step in rank roughly quadruples the trainable parameters and raises peak memory. The medium configuration, $(r=16, \alpha=32)$, is the operating point used for the remaining experiments: it stays within eight points of the highest-rank result at one quarter of its trainable parameters.

Metric	r=4, α=8	r=16, α=32	r=64, α=128
RPass rate	64%	80%	88%
Trainable parameters (lower is cheaper)	7.4M	29.8M	119.5M
Peak memory (lower is better)	0.817 GB	1.022 GB	1.625 GB

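LoRA rank sweep on Phi-3-mini. The best value in each row belongs to an extreme of the sweep: r=64 has the highest RPass, while r=4 has the fewest trainable parameters and the lowest peak memory. r=16 wins no row, but it sits 8 points below r=64 on RPass while using about a quarter of the trainable parameters and about 0.6 GB less peak memory. That trade-off is what makes it the operating point.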

Repair pass rate across the LoRA rank sweep. Rank 16 captures two thirds of the improvement between rank 4 and rank 64 (16 of 24 points) without the parameter footprint of rank 64.
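The trade-off claims for rank 16 follow directly from the table. The short check below recomputes them from the reported values; the dictionary layout and variable names are only an illustration.

```python
# Values copied from the rank sweep table (RPass in %, parameters in millions).
sweep = {
    4:  {"rpass": 64, "params_m": 7.4,   "mem_gb": 0.817},
    16: {"rpass": 80, "params_m": 29.8,  "mem_gb": 1.022},
    64: {"rpass": 88, "params_m": 119.5, "mem_gb": 1.625},
}

low, mid, high = sweep[4], sweep[16], sweep[64]

gap_to_best = high["rpass"] - mid["rpass"]
share_of_gain = (mid["rpass"] - low["rpass"]) / (high["rpass"] - low["rpass"])
param_ratio = mid["params_m"] / high["params_m"]
mem_saved_gb = high["mem_gb"] - mid["mem_gb"]

print(gap_to_best)              # 8
print(round(share_of_gain, 2))  # 0.67
print(round(param_ratio, 2))    # 0.25
print(round(mem_saved_gb, 3))   # 0.603
```

Trainable parameters grow about fourfold with each fourfold step in rank. That is the expected pattern, since a LoRA adapter on a weight matrix of shape d_out × d_in adds r(d_in + d_out) parameters, which is linear in r.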

Joint repair and hint learning

Training the two tasks together improves both outcomes. Shared representations help the model connect a diagnosis to an executable fix and then translate the same diagnosis into restrained guidance.

Program repair (RPass)

single-task: 80%, multi-task: 84% (+4 pts)

Hint quality (composite score)

single-task: 0.72, multi-task: 0.78 (+0.06)

Single-task and multi-task results. Joint training improves the executable repair rate and the human-oriented hint score together.
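Joint training needs both tasks expressed over the same input. The source does not describe the training format, so the sketch below is purely an assumption: it turns one record into two supervised examples that share the exercise and buggy program and differ only in a task tag. The tags `[REPAIR]` and `[HINT]` and the function `build_examples` are hypothetical.

```python
def build_examples(
    context: str, buggy: str, fixed: str, hint: str
) -> list[dict[str, str]]:
    """Turn one record into a repair example and a hint example.

    Both prompts share the same diagnosis input; only the task tag differs.
    The tag strings are illustrative, not the project's actual format.
    """
    shared = f"Exercise:\n{context}\n\nBuggy program:\n{buggy}\n"
    return [
        {"task": "repair", "prompt": f"[REPAIR]\n{shared}", "target": fixed},
        {"task": "hint", "prompt": f"[HINT]\n{shared}", "target": hint},
    ]


examples = build_examples(
    context="Return n! for a non-negative integer n.",
    buggy="return n * factorial(n)",
    fixed="return n * factorial(n - 1)",
    hint="What should change in each call to reach the base case?",
)
for example in examples:
    print(example["task"], "->", example["target"])
```

Sharing the input text across both examples is one simple way to let a single model see the same defect paired with two kinds of output.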

A Repair in Context

A factorial implementation exposes the distinction between patching and teaching. The recursive call repeats the same argument, so execution never approaches the base case. The repair changes one expression; the hint names the invariant the learner should inspect.

Before: student_solution.py

def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n)

After: repaired_solution.py

def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n - 1)

A minimal recursive repair. Decrementing the argument makes every call move toward the base case while leaving the intended algorithm intact.
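The difference between the two versions can be checked with a small harness. This is a sketch under assumptions: the five cases in `CASES` are illustrative and are not the exercise's real test suite, and the function names are chosen only to keep both versions in one file.

```python
def factorial_buggy(n):
    if n == 0:
        return 1
    return n * factorial_buggy(n)      # argument never changes


def factorial_fixed(n):
    if n == 0:
        return 1
    return n * factorial_fixed(n - 1)  # each call moves toward n == 0


# Illustrative (input, expected) pairs.
CASES = [(0, 1), (1, 1), (2, 2), (3, 6), (5, 120)]


def count_passed(fn) -> int:
    """Count the cases a function gets right; runaway recursion is a failure."""
    passed = 0
    for arg, expected in CASES:
        try:
            if fn(arg) == expected:
                passed += 1
        except RecursionError:
            pass
    return passed


print(count_passed(factorial_buggy))  # 1
print(count_passed(factorial_fixed))  # 5
```

The buggy version still passes the n = 0 case because the base case returns before any recursive call, which shows why a single passing test says little about a repair.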

Generated hint: “Think about what happens to the parameter in each recursive call. What should change to eventually reach the base case?”

repair-session

$ python student_solution.py
RecursionError: maximum recursion depth exceeded
$ repair-hint --task factorial --mode repair-and-hint
Inspecting the recursive call and its base-case path…
Patch: recursive argument n → n - 1
Hint: What should change in each call to reach the base case?
$ python repaired_solution.py
5 / 5 tests passed

An illustrative repair session. The same diagnosis produces an executable one-line patch and a hint that points to the invariant without revealing the final expression.

How Hint Quality Is Scored

A hint is not treated as a single scalar until the end of evaluation. Keeping the dimensions separate makes failure modes visible: a fluent answer may be wrong, and a correct answer may disclose too much.

Correctness

$C = \frac{|\{h \in H : \mathrm{correct}(h)\}|}{|H|}$

Does the hint identify the actual defect?

Concealment

$\mathrm{conceal}(h) = 1 - \mathrm{sim}(h, \mathrm{solution})$

How little of the finished solution is reproduced?

Informativeness: actionable direction

Can the learner use it to make a meaningful next attempt?

Comprehensibility: student-level clarity

Is it readable without knowledge beyond the exercise?
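The two dimensions that come with formulas can be sketched directly. The source does not say which similarity function sim is, so the example below substitutes a character-level ratio from Python's `difflib` as a stand-in. That choice, and the function names, are assumptions. Correctness judgments are taken as given booleans.

```python
from difflib import SequenceMatcher


def correctness(judgments: list[bool]) -> float:
    """C = |{h in H : correct(h)}| / |H| over per-hint correctness judgments."""
    if not judgments:
        raise ValueError("no hints to score")
    return sum(judgments) / len(judgments)


def concealment(hint: str, solution: str) -> float:
    """conceal(h) = 1 - sim(h, solution).

    sim is a stand-in here: difflib's character-level similarity ratio.
    """
    sim = SequenceMatcher(None, hint, solution).ratio()
    return 1.0 - sim


solution = "return n * factorial(n - 1)"
hint = "What should change in each call to reach the base case?"

print(correctness([True, True, False, True]))  # 0.75
print(concealment(solution, solution))         # 0.0
print(concealment(hint, solution))             # a value between 0 and 1
```

A hint that copies the solution verbatim scores 0 on concealment. Any real evaluation would need a similarity measure suited to comparing prose with code.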

What the Experiments Show

Findings

Fine-tuning changes the small-model regime.

With rank-64 LoRA, Phi-3-mini moves from 36% to 88% RPass, making local or constrained deployment substantially more plausible.

Search helps, but it is not a substitute for learning.

Twenty candidates recover 26 points; LoRA at rank 64 recovers 52, with no inference-time sampling multiplier.

The two objectives reinforce one another.

Multi-task training improves both test-passing repairs and pedagogical hint quality rather than trading one for the other.

Limits and next steps

The evaluation is deliberately small and Python-only. The next useful test is not another rank sweep; it is a broader study of whether the same behavior survives new languages, unseen exercise families, and real student interaction.

Languages

Java, C++, and JavaScript exercises with language-specific tests.

Generalization

Hold out complete problem families instead of individual programs.

Learning impact

Measure whether hints improve the student’s next independent attempt.