We transform research into product value in a changing development landscape.
This page presents an overview of our teams in the Applied Research (AR) Division and our current projects.
Here's what we're working on right now. You can find further details, including our ongoing projects, below.
Modern codebases are complex and full of implicit structure, making them hard for AI agents to navigate efficiently. As a result, current systems often waste compute and produce poorly grounded changes.
This research focuses on improving how agents explore code before acting. By introducing structured representations and targeted search capabilities, we aim to separate understanding from modification – leading to more accurate, efficient, and project-aware AI developer tools.
Ongoing projects:
Real-world software development depends on tools like IDEs, terminals, and CI/CD systems. Agents that only generate code without using these tools remain limited, but naïve tool use is often brittle and inefficient.
This research focuses on making tool use more reliable and scalable for AI agents. By improving how agents select, execute, and learn from tool interactions, we aim to enable more capable and practical AI systems for software development.
Ongoing projects:
Multi-agent systems research examines how multiple specialized AI agents can reliably work together to solve complex, multi-step tasks. Single-agent setups often hit limits on context and specialization, while multi-agent systems introduce decomposition and parallelism but are currently fragile, expensive, and poorly understood.
This research investigates stable architectures, coordination protocols, and division-of-labor strategies that make multi-agent systems more predictable and efficient. By learning from real-world implementations and extracting reusable patterns, we aim to inform orchestration tooling and provide best-practice guidance for organizations adopting multi-agent architectures.
Ongoing projects:
Many teams struggle to find and apply trustworthy evaluation methods, especially when results are noisy, benchmarks are hard to maintain, and risks like data leakage or contamination are easy to overlook.
This research builds scalable, realistic evaluation systems through benchmark mining and generation, careful dataset curation, and techniques that reduce leakage and overfitting. Our aim is to make it much easier for both internal teams and customers to create and maintain high-quality benchmarks, speeding up evaluation workflows and improving the reliability of AI-assisted coding tools.
Ongoing projects:
When agents fail, their reasoning is often opaque, execution is spread across many steps and systems, and errors may only surface late in the process. All this makes debugging and agent improvement both slow and unreliable.
This research builds tools to capture, visualize, and analyze agent behavior, including detailed execution traces, anomaly detection, and methods for selecting high-quality traces for training. Our goal is to give both IDE users and internal agent developers better observability, so they can iterate on AI agents more quickly and with greater confidence.
Ongoing projects:
Today’s agents are fragile: small prompt tweaks, tool updates, or shifts in context can unexpectedly break behavior. On top of that, manual prompt engineering does not scale.
This research investigates automated ways to improve robustness, such as self-optimization, adaptive prompting, and dynamic system configuration. Our goal is to simplify prompt engineering, automate choices like agent topology and tool descriptions, and streamline debugging of unexpected agent behavior, ultimately giving both product teams and internal agent developers more reliable, maintainable AI systems.
Ongoing project:
Code generation only delivers value when outputs meet real specifications, yet current agents often produce solutions that look plausible but are wrong. In addition, verification can be a major bottleneck.
This research develops ways to enforce correctness through testing, formal checks, and other validation strategies, and explores how to weave these checks directly into the generation loop. The aim is to strengthen automated test generation in IDEs and make benchmarks like SWE-bench more robust by expanding and tightening their test suites.
Key Projects:
We are open to collaborating with other researchers from both academia and industry.
If you’d be interested in working with us on any of the above projects, please reach out!