Final Project for COMS 4995: Neural Networks and Deep Learning w/ Prof. Richard Zemel, advised by Tom Zollo

Team Members: Thomas Joshi, Shayan Chowdhury, Fatih Uysal
Large Language Models (LLMs) have achieved remarkable success in a variety of code-related tasks, from autocompletion to generating entire code snippets from natural language descriptions. However, the lifecycle of real-world software projects is inherently dynamic and continuous. Repositories evolve daily: APIs are deprecated, libraries are upgraded, new bugs are discovered and fixed, and novel features are constantly requested. An adept software engineering agent must therefore not only generate correct code for an immediate request but also learn from its experiences, adapt to changes in the codebase, and, crucially, retain knowledge of how to handle past issues as the project grows and shifts. Consider the analogy of a human software engineer: one who has successfully resolved 100 bugs within a specific, complex codebase will be far more adept at tackling the 101st bug in that same repository than an equally skilled engineer encountering the codebase for the first time. This ability to accumulate and leverage experience is a hallmark of expertise and a critical capability for agents that continuously learn.
This project makes the following primary contributions:
- A Novel Benchmark Dataset (SWE-Bench-CL): We detail the construction and structure of SWE-Bench-CL, a reproducible, temporally organized benchmark designed to measure adaptation and memory retention in coding agents.
- Preliminary Dataset Analysis: We present an analysis of SWE-Bench-CL's structural characteristics, including inter-task similarity and contextual sensitivity. These findings highlight the unique challenges the benchmark poses for continual learning and inform the design of effective evaluation strategies and agent architectures.
- A Proposed Agentic Evaluation Framework: We propose a methodology for evaluating agents on SWE-Bench-CL. This framework centers on an interactive coding agent, which is notably augmented with a semantic memory module to facilitate learning from past experiences. It was developed to overcome challenges encountered with existing evaluation harnesses when applied to our continual learning setup, offering greater transparency and control.
- Specialized Continual Learning Metrics: We define a suite of evaluation metrics specifically tailored for assessing continual learning in the context of software engineering, addressing aspects like success rate, tool use efficiency, knowledge transfer, and forgetting.
This repository is organized as follows:
- `data/`: Contains the core dataset files.
  - `SWE-Bench-CL-Curriculum.json`: The continual learning benchmark dataset derived from SWE-Bench Verified.
- `eval_v1/eval_procedure.py`: Naive implementation of the continual learning experiments: generates patches, evaluates them on SWE-Bench-CL with SWE-bench's own evaluation harness (Docker containers), and calculates the metrics.
- `eval_v2_agent/eval_procedure.py`: An agentic implementation built on LangGraph, with basic file search, file editing, and unit-test execution tools, plus FAISS-based retrieval (RAG) serving as continual-learning semantic memory.
- `eval_v3_agent/eval_procedure.py`: An agentic evaluation of SWE-Bench-CL inspired by SWE-agent, integrating continual learning methods on top of LangGraph and semantic memory.
- `scripts/`: Utility scripts for dataset construction, experimentation, and analysis.
  - `SWE-Bench-CL_dataset_construction.py`: The script used to generate the `SWE-Bench-CL-Curriculum.json` dataset from the original SWE-Bench data.
- `requirements.txt`: Python dependencies for the project.
- `.env`: (User-created) File for storing API keys.
- `LICENSE`: Project license.
- `Makefile`: Makefile for potential build/automation tasks.
- `research-papers/`: Contains relevant research papers.
We developed SWE-Bench-CL, a continual learning adaptation of SWE-Bench-Verified (a human-verified refinement of the original SWE-Bench dataset), designed to evaluate how effectively AI agents can learn and retain programming knowledge over time. Our benchmark transforms the original task-independent format into sequential learning scenarios that simulate a developer's progression on real-world projects. The entire dataset is provided in JSON format at SWE-Bench-CL.json.
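As a quick sanity check, the dataset can be loaded and inspected with the standard library. The snippet below is a minimal sketch that assumes the JSON exposes a top-level `sequences` list whose entries carry a `repo` name and a `tasks` list; these field names are illustrative assumptions, not a guaranteed match to the released schema.

```python
import json

# Minimal sketch: load the benchmark and print one line per learning sequence.
# The file path follows the repository layout above; the "sequences", "repo",
# and "tasks" keys are assumed field names, not guaranteed by the schema.
with open("data/SWE-Bench-CL-Curriculum.json") as f:
    benchmark = json.load(f)

for sequence in benchmark["sequences"]:
    print(f'{sequence["repo"]}: {len(sequence["tasks"])} tasks')
```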
Transforming the original SWE-Bench-Verified dataset, we created 8 learning sequences, each associated with a different repository from the original dataset. Each sequence is designed to follow a curriculum, starting with simpler tasks and progressively introducing more complex problems.
We employed several strategies for how we sequenced the tasks within each repository:
- Chronological Ordering: Tasks within each repository are primarily ordered by their creation date, simulating the natural evolution of a codebase.
- Curriculum Learning: Within each sequence, tasks are further grouped by difficulty levels:
- Level 1: <15 min fix
- Level 2: 15 min - 1 hour
- Level 3: 1-4 hours
- Level 4: >4 hours
- Dependency Awareness: The dataset identifies potential dependencies between tasks based on overlapping file modifications, enabling evaluation of knowledge transfer between related problems. A rough sketch of the ordering and dependency logic follows this list.
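The sequencing strategies above can be summarized in a short sketch. Assuming each task record exposes a `difficulty` level (1-4), a `created_at` ISO timestamp, an `instance_id`, and the list of `modified_files` from its gold patch (all illustrative field names), the curriculum ordering and dependency detection look roughly like this:

```python
from datetime import datetime

def order_sequence(tasks):
    """Curriculum ordering: sort by difficulty level, then chronologically within a level."""
    return sorted(
        tasks,
        key=lambda t: (t["difficulty"], datetime.fromisoformat(t["created_at"])),
    )

def find_dependencies(ordered_tasks):
    """Flag a task as potentially dependent on any earlier task that touched the same files."""
    dependencies = {}
    for i, task in enumerate(ordered_tasks):
        files = set(task["modified_files"])
        dependencies[task["instance_id"]] = [
            earlier["instance_id"]
            for earlier in ordered_tasks[:i]
            if files & set(earlier["modified_files"])
        ]
    return dependencies
```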
| Repository | Tasks | Easy (<15m) | Medium (15m-1h) | Hard (1-4h) | Very Hard (>4h) | Tasks w/ Dependencies |
|---|---|---|---|---|---|---|
| django/django | 50 | 50 | 0 | 0 | 0 | 25 (50%) |
| sympy/sympy | 50 | 25 | 25 | 0 | 0 | 12 (24%) |
| sphinx-doc/sphinx | 44 | 22 | 17 | 4 | 1 | 23 (52%) |
| matplotlib/matplotlib | 34 | 15 | 19 | 0 | 0 | 13 (38%) |
| scikit-learn/scikit-learn | 32 | 13 | 18 | 1 | 0 | 4 (13%) |
| astropy/astropy | 22 | 4 | 15 | 3 | 0 | 3 (14%) |
| pydata/xarray | 22 | 5 | 15 | 1 | 1 | 13 (59%) |
| pytest-dev/pytest | 19 | 8 | 8 | 3 | 0 | 7 (37%) |
- Sequential Learning:
- Train/evaluate the agent on each sequence in order
- For each task in a sequence:
- Present the task to the agent
- Measure success (whether the agent's solution passes all tests)
- Record tool usage and solution strategy
- Forgetting Assessment:
- Periodically re-test the agent on previously solved tasks
- Calculate the forgetting rate based on performance degradation
- Transfer Evaluation:
- After completing a repository sequence, test the agent on tasks from other repositories
- Measure cross-domain transfer and knowledge retention
- Reporting:
- Learning curve: Success rate as a function of task number
- Forgetting curve: Performance on previously solved tasks over time
- Transfer matrix: How learning on one repository affects performance on others
- Tool usage patterns: How tool use evolves over time
- CL-Score: Combined metric incorporating success, forgetting, transfer, and tool use (an illustrative computation is sketched below)
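The sketch below illustrates how these quantities could be computed from a per-task results log. The log format (parallel lists of booleans for first-pass and re-test outcomes, a dictionary of per-repository-pair success rates) and the CL-Score weights are assumptions made for illustration, not the benchmark's canonical definitions.

```python
import numpy as np

def success_rate(first_pass):
    """Learning-curve point: fraction of tasks in a sequence solved on the first attempt."""
    return sum(first_pass) / len(first_pass)

def forgetting_rate(first_pass, retest):
    """Share of previously solved tasks that fail when re-tested later in the run."""
    solved = [i for i, ok in enumerate(first_pass) if ok]
    if not solved:
        return 0.0
    return sum(1 for i in solved if not retest[i]) / len(solved)

def transfer_matrix(pairwise_success):
    """Rows: repository trained on; columns: repository evaluated on; entries: success rates."""
    repos = sorted({repo for pair in pairwise_success for repo in pair})
    index = {repo: i for i, repo in enumerate(repos)}
    matrix = np.zeros((len(repos), len(repos)))
    for (source, target), rate in pairwise_success.items():
        matrix[index[source], index[target]] = rate
    return repos, matrix

def cl_score(success, forgetting, transfer_gain, tool_efficiency,
             weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted combination of the four axes; the weights here are placeholders, not tuned values."""
    w_s, w_f, w_t, w_u = weights
    return w_s * success + w_f * (1 - forgetting) + w_t * transfer_gain + w_u * tool_efficiency
```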