Final Project for COMS 4995: Neural Networks and Deep Learning w/ Prof. Richard Zemel, advised by Tom Zollo

Team Members: Thomas Joshi, Shayan Chowdhury, Fatih Uysal
Large Language Models (LLMs) have achieved remarkable success in a variety of code-related tasks, from autocompletion to generating entire code snippets from natural language descriptions. However, the lifecycle of real-world software projects is inherently dynamic and continuous. Repositories evolve daily: APIs are deprecated, libraries are upgraded, new bugs are discovered and fixed, and novel features are constantly requested. An adept software engineering agent must therefore not only generate correct code for an immediate request but also learn from its experiences, adapt to changes in the codebase, and, crucially, retain knowledge of how to handle past issues as the project grows and shifts. Consider the analogy of a human software engineer: one who has successfully resolved 100 bugs within a specific, complex codebase will be far more adept at tackling the 101st bug in that same repository than an equally skilled engineer encountering the codebase for the first time. This ability to accumulate and leverage experience is a hallmark of expertise and a critical capability for agents that continuously learn.
This project makes the following primary contributions:
- A Novel Benchmark Dataset (SWE-Bench-CL): We detail the construction and structure of SWE-Bench-CL, a reproducible, temporally organized benchmark designed to measure adaptation and memory retention in coding agents.
- Preliminary Dataset Analysis: We present an analysis of SWE-Bench-CL's structural characteristics, including inter-task similarity and contextual sensitivity. These findings highlight the unique challenges the benchmark poses for continual learning and inform the design of effective evaluation strategies and agent architectures.
- A Proposed Agentic Evaluation Framework: We propose a methodology for evaluating agents on SWE-Bench-CL. This framework centers on an interactive coding agent, which is notably augmented with a semantic memory module to facilitate learning from past experiences. It was developed to overcome challenges encountered with existing evaluation harnesses when applied to our continual learning setup, offering greater transparency and control.
- Specialized Continual Learning Metrics: We define a suite of evaluation metrics specifically tailored for assessing continual learning in the context of software engineering, addressing aspects like success rate, tool use efficiency, knowledge transfer, and forgetting.
This repository is organized as follows:
- `data/`: Contains the core dataset files.
  - `SWE-Bench-CL-Curriculum.json`: The continual learning benchmark dataset derived from SWE-Bench Verified.
- `eval_v1/eval_procedure.py`: Naive implementation of the continual learning experiments: generates patches, evaluates them on SWE-Bench-CL with SWE-bench's own evaluation harness (Docker containers), and calculates the metrics.
- `eval_v2_agent/eval_procedure.py`: An agentic implementation built on LangGraph, with basic file search, file editing, and unit-test execution tools, plus FAISS-based retrieval (RAG) serving as continual-learning semantic memory.
- `eval_v3_agent/eval_procedure.py`: An agentic evaluation of SWE-Bench-CL inspired by SWE-agent, integrating continual learning methods on top of LangGraph and semantic memory.
- `scripts/`: Utility scripts for dataset construction, experimentation, and analysis.
  - `SWE-Bench-CL_dataset_construction.py`: The script used to generate the `SWE-Bench-CL-Curriculum.json` dataset from the original SWE-Bench data.
- `requirements.txt`: Python dependencies for the project.
- `.env`: (User-created) File for storing API keys.
- `LICENSE`: Project license.
- `Makefile`: Makefile for potential build/automation tasks.
- `research-papers/`: Contains relevant research papers.
We developed SWE-Bench-CL, a continual learning adaptation of SWE-Bench-Verified (a human-verified refinement of the original SWE-Bench dataset), designed to evaluate how effectively AI agents can learn and retain programming knowledge over time. Our benchmark transforms the original task-independent format into sequential learning scenarios that simulate a developer's progression on real-world projects. The entire dataset is provided in JSON format at SWE-Bench-CL.json.
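As a quick sanity check, the dataset can be loaded and inspected with the standard library. The snippet below is a minimal sketch that assumes the JSON exposes a top-level `sequences` list whose entries carry a `repo` name and a `tasks` list; these field names are illustrative assumptions, not a guaranteed match to the released schema.

```python
import json

# Minimal sketch: load the benchmark and print one line per learning sequence.
# The file path follows the repository layout above; the "sequences", "repo",
# and "tasks" keys are assumed field names, not guaranteed by the schema.
with open("data/SWE-Bench-CL-Curriculum.json") as f:
    benchmark = json.load(f)

for sequence in benchmark["sequences"]:
    print(f'{sequence["repo"]}: {len(sequence["tasks"])} tasks')
```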
Transforming the original SWE-Bench-Verified dataset, we created 8 learning sequences, each associated with a different repository from the original dataset. Each sequence is designed to follow a curriculum, starting with simpler tasks and progressively introducing more complex problems.
We employed several strategies for how we sequenced the tasks within each repository:
- Chronological Ordering: Tasks within each repository are primarily ordered by their creation date, simulating the natural evolution of a codebase.
- Curriculum Learning: Within each sequence, tasks are further grouped by difficulty levels:
- Level 1: <15 min fix
- Level 2: 15 min - 1 hour
- Level 3: 1-4 hours
- Level 4: >4 hours
- Dependency Awareness: The dataset identifies potential dependencies between tasks based on overlapping file modifications, enabling evaluation of knowledge transfer between related problems. A rough sketch of the ordering and dependency logic follows this list.
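The sequencing strategies above can be summarized in a short sketch. Assuming each task record exposes a `difficulty` level (1-4), a `created_at` ISO timestamp, an `instance_id`, and the list of `modified_files` from its gold patch (all illustrative field names), the curriculum ordering and dependency detection look roughly like this:

```python
from datetime import datetime

def order_sequence(tasks):
    """Curriculum ordering: sort by difficulty level, then chronologically within a level."""
    return sorted(
        tasks,
        key=lambda t: (t["difficulty"], datetime.fromisoformat(t["created_at"])),
    )

def find_dependencies(ordered_tasks):
    """Flag a task as potentially dependent on any earlier task that touched the same files."""
    dependencies = {}
    for i, task in enumerate(ordered_tasks):
        files = set(task["modified_files"])
        dependencies[task["instance_id"]] = [
            earlier["instance_id"]
            for earlier in ordered_tasks[:i]
            if files & set(earlier["modified_files"])
        ]
    return dependencies
```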
| Repository | Tasks | Easy (<15m) | Medium (15m-1h) | Hard (1-4h) | Very Hard (>4h) | Tasks w/ Dependencies |
|---|---|---|---|---|---|---|
| django/django | 50 | 50 | 0 | 0 | 0 | 25 (50%) |
| sympy/sympy | 50 | 25 | 25 | 0 | 0 | 12 (24%) |
| sphinx-doc/sphinx | 44 | 22 | 17 | 4 | 1 | 23 (52%) |
| matplotlib/matplotlib | 34 | 15 | 19 | 0 | 0 | 13 (38%) |
| scikit-learn/scikit-learn | 32 | 13 | 18 | 1 | 0 | 4 (13%) |
| astropy/astropy | 22 | 4 | 15 | 3 | 0 | 3 (14%) |
| pydata/xarray | 22 | 5 | 15 | 1 | 1 | 13 (59%) |
| pytest-dev/pytest | 19 | 8 | 8 | 3 | 0 | 7 (37%) |
- Sequential Learning:
- Train/evaluate the agent on each sequence in order
- For each task in a sequence:
- Present the task to the agent
- Measure success (whether the agent's solution passes all tests)
- Record tool usage and solution strategy
- Forgetting Assessment:
- Periodically re-test the agent on previously solved tasks
- Calculate the forgetting rate based on performance degradation
- Transfer Evaluation:
- After completing a repository sequence, test the agent on tasks from other repositories
- Measure cross-domain transfer and knowledge retention
- Reporting:
- Learning curve: Success rate as a function of task number
- Forgetting curve: Performance on previously solved tasks over time
- Transfer matrix: How learning on one repository affects performance on others
- Tool usage patterns: How tool use evolves over time
- CL-Score: Combined metric incorporating success, forgetting, transfer, and tool use (an illustrative computation is sketched below)
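The sketch below illustrates how these quantities could be computed from a per-task results log. The log format (parallel lists of booleans for first-pass and re-test outcomes, a dictionary of per-repository-pair success rates) and the CL-Score weights are assumptions made for illustration, not the benchmark's canonical definitions.

```python
import numpy as np

def success_rate(first_pass):
    """Learning-curve point: fraction of tasks in a sequence solved on the first attempt."""
    return sum(first_pass) / len(first_pass)

def forgetting_rate(first_pass, retest):
    """Share of previously solved tasks that fail when re-tested later in the run."""
    solved = [i for i, ok in enumerate(first_pass) if ok]
    if not solved:
        return 0.0
    return sum(1 for i in solved if not retest[i]) / len(solved)

def transfer_matrix(pairwise_success):
    """Rows: repository trained on; columns: repository evaluated on; entries: success rates."""
    repos = sorted({repo for pair in pairwise_success for repo in pair})
    index = {repo: i for i, repo in enumerate(repos)}
    matrix = np.zeros((len(repos), len(repos)))
    for (source, target), rate in pairwise_success.items():
        matrix[index[source], index[target]] = rate
    return repos, matrix

def cl_score(success, forgetting, transfer_gain, tool_efficiency,
             weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted combination of the four axes; the weights here are placeholders, not tuned values."""
    w_s, w_f, w_t, w_u = weights
    return w_s * success + w_f * (1 - forgetting) + w_t * transfer_gain + w_u * tool_efficiency
```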