anontruder - meddler (Page 3)

Beyond Quantity: Trajectory Diversity Scaling for Code Agents

As code large language models (LLMs) evolve into tool-interactive agents via the Model Context Protocol (MCP), their generalization is increasingly limited by low-quality synthetic data and the diminishing returns of qua...

Leo Parker

3 Feb 2026 · 1 min read

BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization

Large language models (LLMs) have shown strong reasoning and coding capabilities, yet they struggle to generalize to real-world software engineering (SWE) problems that are long-horizon and out of distribution. Existing...

Aria Patel

29 Dec 2025 · 1 min read

NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software system...

Zoe Walker

14 Dec 2025 · 1 min read

Confucius Code Agent: Scalable Agent Scaffolding for Real-World Codebases

Real-world software engineering tasks require coding agents that can operate on massive repositories, sustain long-horizon sessions, and reliably coordinate complex toolchains at test time. Existing research-grade coding...

Ethan Shaw

11 Dec 2025 · 1 min read

Agint: Agentic Graph Compilation for Software Engineering Agents

LLM-based coding agents are increasingly common but still face challenges in context management, latency, reliability, reproducibility, and scalability. We present Agint, an agentic graph compiler, interpreter, and runti...

Maya Collins

24 Nov 2025 · 1 min read

Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?

Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are...

Noah Bennett

17 Nov 2025 · 1 min read

The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents

Agents are now used widely in the process of software development, but building production-ready software engineering agents is a complex task. Deploying software agents effectively requires flexibility in implementation...

Liam Carter

5 Nov 2025 · 1 min read

A Comprehensive Empirical Evaluation of Agent Frameworks on Code-centric Software Engineering Tasks

Unlike traditional automation tools or static LLM-based systems, agents combine decision-making and tool utilization to accomplish complex tasks, showing great potential in software engineering. However, existing studies...

Ava Brooks

2 Nov 2025 · 1 min read

TOM-SWE: User Mental Modeling For Software Engineering Agents

Recent advances in coding agents have made them capable of planning, editing, running, and testing complex code bases. Despite their growing ability in coding tasks, these systems still struggle to infer and track user i...

Owen Blake

24 Oct 2025 · 1 min read

TEST

TEtt

anontruder

4 Oct 2025 · 1 min read

RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents

Code agents have gained widespread adoption due to their strong code generation capabilities and integration with code interpreters, enabling dynamic execution, debugging, and interactive programming capabilities. While...

Nina Reed

2 Oct 2025 · 1 min read

TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture

While integrating tools like Code Interpreter and Search has significantly enhanced Large Language Model (LLM) reasoning in models like ChatGPT Agent and Gemini-Pro, practical guidance on optimal tool use is lacking. The...

Leo Parker

30 Sep 2025 · 1 min read

PerfBench: Can Agents Resolve Real-World Performance Bugs?

Performance bugs are inefficiencies in software that waste computational resources without causing functional failures, making them particularly challenging to detect and fix. While recent advances in Software Engineerin...

Aria Patel

28 Sep 2025 · 1 min read

Context Engineering for Multi-Agent LLM Code Assistants Using Elicit, NotebookLM, ChatGPT, and Claude Code

Large Language Models (LLMs) have shown promise in automating code generation and software engineering tasks, yet they often struggle with complex, multi-file projects due to context limitations and knowledge gaps. We pr...

Zoe Walker

9 Aug 2025 · 1 min read

SetupBench: Assessing Software Engineering Agents' Ability to Bootstrap Development Environments

Modern Large Language Model (LLM) agents promise end to end assistance with real-world software tasks, yet existing benchmarks evaluate LLM agents almost exclusively in pre-baked environments where every dependency is pr...

Ethan Shaw

11 Jul 2025 · 1 min read

Unified Software Engineering Agent as AI Software Engineer

The growth of Large Language Model (LLM) technology has raised expectations for automated coding. However, software engineering is more than coding and is concerned with activities including maintenance and evolution of...

Maya Collins

17 Jun 2025 · 1 min read

SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling

Large language models (LLMs) have advanced rapidly from conversational problem solving to addressing real-world tasks involving tool use, such as software engineering (SWE). Recent LLM-powered toolkits, such as OpenAI Co...

Noah Bennett

9 Jun 2025 · 1 min read

From Knowledge to Noise: CTIM-Rover and the Pitfalls of Episodic Memory in Software Engineering Agents

We introduce CTIM-Rover, an AI agent for Software Engineering (SE) built on top of AutoCodeRover (Zhang et al., 2024) that extends agentic reasoning frameworks with an episodic memory, more specifically, a general and re...

Liam Carter

29 May 2025 · 1 min read