Who’s Harry Potter? An interactive walk through approximate unlearning in LLMs
An interactive explainer of the 2023 Microsoft Research paper that taught Llama-2-7B to forget Harry Potter without retraining from scratch. It is built with marimo and runs entirely in your browser.
This page embeds a marimo reactive notebook compiled to WebAssembly. The first load takes about 30 seconds while the Python runtime downloads; subsequent visits are local and fast.
Llama-2-7B took about 184,000 GPU-hours to train. In a 2023 paper, Eldan and Russinovich at Microsoft Research showed that about one GPU-hour of fine-tuning is enough to make it forget Harry Potter, with no retraining from scratch and almost no effect on the model’s scores on common benchmarks.
This notebook walks through how the trick works. The mechanism comes down to three small ideas; most of the work happens in how they interact.
The headline insight that drives the whole notebook: forgetting Harry isn’t deleting the word “Harry”. It’s cutting the edges between “Harry” and “Hogwarts”, “Harry” and “magic”, “Hogwarts” and “Quidditch”. Edge surgery, not amputation. The interactive graph below makes that visceral: drag the slider and watch the Harry Potter cluster pull itself apart while the surrounding language graph stays intact.
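The "edge surgery" idea can be sketched in a few lines of plain Python. This is a toy illustration, not the notebook's code and not the paper's method: the tokens, edge weights, threshold, and the `unlearn`, `nodes`, and `neighbours` helpers are all hypothetical, and the slider is assumed to scale the targeted edges linearly.

```python
# Toy illustration: tokens, weights, and threshold are made up, not from the paper.
EDGES = {
    ("Harry", "Hogwarts"): 0.9,
    ("Harry", "magic"): 0.8,
    ("Hogwarts", "Quidditch"): 0.7,
    ("Harry", "boy"): 0.6,        # generic association that should survive
    ("magic", "wand"): 0.5,       # generic association that should survive
    ("school", "teacher"): 0.7,   # unrelated background language
}

# The edges the text names as targets of the "surgery".
TARGET_EDGES = {
    ("Harry", "Hogwarts"),
    ("Harry", "magic"),
    ("Hogwarts", "Quidditch"),
}


def unlearn(edges, targets, strength):
    """Scale each target edge by (1 - strength); copy every other edge unchanged."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return {
        edge: weight * (1.0 - strength) if edge in targets else weight
        for edge, weight in edges.items()
    }


def nodes(edges):
    """Every token that appears at either end of an edge."""
    return {token for edge in edges for token in edge}


def neighbours(edges, token, threshold=0.1):
    """Tokens linked to `token` by an edge at or above `threshold`."""
    linked = set()
    for (a, b), weight in edges.items():
        if weight >= threshold and token in (a, b):
            linked.add(b if a == token else a)
    return linked


after = unlearn(EDGES, TARGET_EDGES, strength=1.0)
print(sorted(neighbours(EDGES, "Harry")))   # associations before
print(sorted(neighbours(after, "Harry")))   # associations after full unlearning
print(nodes(after) == nodes(EDGES))         # no token was deleted
```

At full strength, "Harry" keeps only its generic neighbour "boy", while every token, including "Harry" itself, is still present in the graph.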
About this project
I built this as a portfolio piece while transitioning from R-heavy physiological data analysis into Python + interactive ML tooling. It’s also a low-stakes test of a publishing pattern I want to use more: reactive notebooks as the long-form artifact, with a polished landing page hosting a WebAssembly export of the notebook itself.
- Source on GitHub: github.com/JacobBowie/marimo-unlearning
- Original paper: arXiv:2310.02238
- Released checkpoint: microsoft/Llama2-7b-WhoIsHarryPotter
Built with marimo, Altair, networkx, and Pyodide. All numbers in the notebook are taken verbatim from the paper; the network-graph edge weights are illustrative and explicitly flagged as such.