
Machine Learning Research Division


At JetBrains, we are passionate about the way AI transforms software engineering.

The Machine Learning (ML) division at JetBrains Research explores ways to use ML techniques and agentic approaches to help developers and enhance software development processes. We aim to improve ML adoption for code by turning the latest academic advances into practical applications.

This page presents an overview of our research teams and collaboration opportunities.


Code modeling research

The Code Modeling Research Team focuses on advancing model capabilities in understanding and producing code. We specialize in fine-tuning procedures applicable across a wide spectrum of models and tasks. Our recent projects include supervised fine-tuning for Kotlin, reinforcement learning (RL) fine-tuning using compiler feedback, and enhancing model contexts by leveraging project information. Additionally, we actively develop comprehensive benchmarks (for instance, for plot generation, Kotlin Q&A, and test-based evaluation), because every ML venture begins with robust benchmarking.

Project adaptation

Our current flagship project is Project Adaptation — we are working on fine-tuning a model to generate more accurate and efficient code for a specific project. This project setup presents a unique challenge due to the limited amount of available data, but it also offers valuable opportunities, such as an existing CI/CD pipeline that can provide feedback on new generations. These factors shape our current focus areas: data synthesis and reinforcement learning approaches.

Code editing research

In Code Editing Research, we study how to make code models better at a variety of editing-related tasks, including reasoning over edits, better edit representations, and the generation of synthetic editing data. We also explore broader ML questions, such as improving post-RL model performance, developing new optimization methods, and applying low-variance RL techniques to language modeling.

Diff-XYZ

Diff-XYZ is a benchmark of 1,000 real-world code edits designed to isolate how different edit representations affect LLM behavior. It enables controlled evaluation across three tasks (Apply, Anti-Apply, and Diff Generation) to show how well models understand and generate code edits in various formats.

Paper Dataset
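
As an illustration, the sketch below shows how the Apply task could be scored: given the old file and a diff, the model must reproduce the new file. The data fields, the prompt, and the generate helper are hypothetical stand-ins rather than the benchmark's actual harness.

```python
# Minimal sketch of the Apply task: given the old file and an edit
# (e.g. a unified diff), the model must reconstruct the new file.
# `generate` is a hypothetical stand-in for an actual model call,
# and the sample fields are illustrative.

def apply_task_accuracy(samples, generate):
    """samples: iterable of dicts with 'old_code', 'diff', 'new_code'."""
    correct = 0
    total = 0
    for sample in samples:
        prompt = (
            "Apply the following diff to the code and output the full updated file.\n\n"
            f"### Code\n{sample['old_code']}\n\n### Diff\n{sample['diff']}\n"
        )
        prediction = generate(prompt)
        # Exact-match scoring after whitespace normalization (one possible choice).
        if prediction.strip() == sample["new_code"].strip():
            correct += 1
        total += 1
    return correct / max(total, 1)
```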

Best-of-N sampling for reinforcement learning

In this project, we address the loss of generation diversity caused by standard RL fine-tuning by deriving an RL training objective that directly optimizes the max@k metric, aligning training with Best-of-N inference. We provide an unbiased on-policy gradient estimator and an approximately unbiased off-policy version compatible with modern RL with verifiable rewards (RLVR) pipelines, along with better-performing baselines.

Paper
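
For intuition, the quantity being optimized can be written as follows. This is a generic formulation of the max@k (Best-of-N) objective under the assumption of k i.i.d. samples per prompt, not the paper's exact notation:

```latex
% max@k objective for a policy \pi_\theta and reward r:
% sample k completions per prompt x and score only the best one.
\max_{\theta} \;
\mathbb{E}_{x \sim \mathcal{D}} \,
\mathbb{E}_{y_1, \dots, y_k \sim \pi_\theta(\cdot \mid x)}
\left[ \max_{1 \le i \le k} r(x, y_i) \right]
```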

AI agents and planning

Our team's goal is to enable better decision-making for JetBrains AI agents. We have a wide variety of projects ranging from benchmarking to studying agents' behavior.

EnvBench

Environment setup is an inevitable part of any modern coding agent training or evaluation process. We developed a benchmark to measure how well different automated environment setup systems perform. The benchmark focuses on hard cases that cannot be set up with a simple static script and comprises more than 300 repositories for Python and more than 600 repositories for JVM-based languages.

Paper Dataset Code

GitGoodBench

We believe that agents shouldn’t replace humans in the software engineering process but should help them automate the boring parts. To help measure this, we established a benchmark of how well agents work with version control systems, covering abilities such as merge conflict resolution and interactive rebasing.

Paper Code + Dataset

PIPer

To democratize the process of automated environment setup, we trained a small model (based on Qwen3-8B) specialized in this task. It performs on par with GPT-4o while providing an order-of-magnitude saving in compute.

Paper Code

The Complexity Trap

We studied two popular context management strategies for agents: context compression and observation masking. Surprisingly, we found that simple observation masking often performs on par with the more intricate strategy of summarizing the history. We also proposed a combination of the two strategies that delivers further savings.

Paper Dataset Code
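
To give a flavor of observation masking, below is a minimal sketch that assumes the agent history is a list of role-tagged messages; the message schema and placeholder text are illustrative, not the exact setup from the paper.

```python
# Minimal sketch of observation masking: keep the most recent tool
# observations verbatim and replace older ones with a short placeholder,
# leaving the agent's own actions and the user/task messages untouched.
# The message schema and placeholder below are illustrative assumptions.

PLACEHOLDER = "[observation omitted to save context]"

def mask_old_observations(history, keep_last=3):
    """history: list of dicts like {'role': 'tool'|'assistant'|'user', 'content': str}."""
    observation_indices = [i for i, m in enumerate(history) if m["role"] == "tool"]
    to_keep = set(observation_indices[-keep_last:])
    masked = []
    for i, message in enumerate(history):
        if message["role"] == "tool" and i not in to_keep:
            masked.append({"role": "tool", "content": PLACEHOLDER})
        else:
            masked.append(message)
    return masked
```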

Human-AI experience

With the emergence of code-fluent LLMs, programming practices are changing. At the same time, development environments need to change to provide an optimal human-AI experience (HAX) in the IDE.

Human-AI interaction design

Work in this direction includes prototyping the integration of AI functionality into existing programming environments and developer workflows, ensuring intuitive and efficient experiences.

Impact of AI on software developers

This research area involves exploring how programmers use and perceive AI assistants, identifying the challenges they face and the benefits these tools bring, to better align AI with the real-world needs of developers.

Quality of Human-AI interaction

Our efforts in this field include identifying, evaluating, and adjusting critical aspects of AI assistants' output, from correctness to understandability, while also ensuring these tools truly support developers in their tasks.

Federated compute

ML techniques are constrained by the quality and availability of domain-specific data. Our team focuses on creating and delivering privacy-preserving solutions that lift these data constraints, allowing us to train high-quality models for our IDEs.

Federated platform

By shifting the paradigm from centralized to distributed training, the federated platform allows models to be trained on user data without that data ever leaving users' devices, enabling the deployment of user-aligned ML-based features in our IDEs.
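
To illustrate the underlying idea (a sketch of generic federated averaging, not the actual JetBrains platform), only model weights ever leave a device:

```python
# Sketch of federated averaging (FedAvg): each client trains locally on its
# own data and only the resulting weights are sent back and averaged, so raw
# examples never leave the device. The local step is a toy stand-in.

def local_update(global_weights, local_examples, lr=0.1):
    # Toy on-device step: gradient descent fitting a single-parameter model
    # w[0] to the client's local examples (a stand-in for real training).
    w = list(global_weights)
    for x in local_examples:
        w[0] -= lr * 2 * (w[0] - x)
    return w

def federated_round(global_weights, clients_examples):
    client_weights = [local_update(global_weights, ex) for ex in clients_examples]
    # Element-wise average of the client models becomes the new global model.
    return [sum(ws) / len(client_weights) for ws in zip(*client_weights)]

# Example: three clients with private data and a toy one-parameter model.
new_global = federated_round([0.0], [[1.0, 1.2], [0.8], [1.1, 0.9]])
```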

Differential privacy

By employing mathematically proven methods of private data processing, the models we train can learn general patterns in user data while remaining unable to reproduce exact copies of it. This allows users to confidently contribute to improving IDE features, knowing that their individual data will remain protected by the strongest privacy standard available in machine learning today.
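
A minimal sketch of the core mechanism, in the spirit of DP-SGD: per-example gradients are clipped and calibrated Gaussian noise is added before the update. The parameters and the plain-Python implementation are illustrative only, not a tuned production configuration.

```python
import math
import random

def dp_aggregate(per_example_grads, clip_norm=1.0, noise_multiplier=1.0):
    """Clip each example's gradient to a fixed L2 norm, sum the clipped
    gradients, and add Gaussian noise scaled to the clipping bound,
    in the spirit of DP-SGD."""
    clipped_sum = [0.0] * len(per_example_grads[0])
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / (norm + 1e-12))
        for i, g in enumerate(grad):
            clipped_sum[i] += g * scale
    sigma = noise_multiplier * clip_norm
    noisy = [s + random.gauss(0.0, sigma) for s in clipped_sum]
    # Average over the batch before applying the model update.
    return [v / len(per_example_grads) for v in noisy]
```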

Collaboration opportunities

We are open to collaborating with researchers from both academia and industry.

The topics below are particularly interesting to us at the moment.

Contact us if you are interested!

Reinforcement Learning beyond test execution

In the field of reinforcement learning for coding tasks, the most common source of rewards is test execution. It enables optimization for the widely used SWE-bench and for various benchmarks with algorithmic tasks (1, 2, 3). In practice, test execution feedback has several limitations. First, for large software repositories, running a test suite can be very expensive in both time and compute. Second, passing tests does not necessarily guarantee code correctness. Finally, in practical cases, the reward from test execution is close to binary, which can complicate the training process.

Our experiments show the effectiveness of other reward sources, such as compilation success. We can assist researchers who want to explore this topic further by sharing infrastructure for efficiently collecting reward signals about code based on static analysis, compilation, and runtime behavior.
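
As a sketch of what such a reward could look like, here is a compiler-feedback reward for a generated snippet; the compiler command and file suffix are assumptions and would differ per language and build system, and a real setup would plug into the project's actual build pipeline.

```python
import os
import subprocess
import tempfile

def compilation_reward(code: str, compiler_cmd=("kotlinc",), suffix=".kt") -> float:
    """Return 1.0 if the snippet compiles, 0.0 otherwise.
    The compiler command and suffix are illustrative assumptions."""
    with tempfile.NamedTemporaryFile("w", suffix=suffix, delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [*compiler_cmd, path],
            capture_output=True,
            timeout=120,
        )
        return 1.0 if result.returncode == 0 else 0.0
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return 0.0
    finally:
        os.remove(path)
```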

Adapting tokenizers to new domains and projects

For LLMs, reusing the same tokenizer when changing domains may be suboptimal, and techniques for mitigating the issue already exist (1, 2).

In the software engineering domain, the tokenizer can be adapted in at least two stages: (i) when specializing general models for a single programming language, and (ii) when adapting models to work with a specific software project. We believe that tokenizer adaptation may reduce inference cost and improve quality when working with the large codebases typical of software engineering.
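
One simple form of such adaptation, sketched below with the Hugging Face transformers API, is to add frequent project-specific identifiers to the vocabulary and resize the embedding matrix; the model name and token list are illustrative placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: extend a tokenizer with frequent project-specific identifiers so
# they are encoded as single tokens instead of long sub-token sequences.
# The model name and the token list are illustrative placeholders.
model_name = "Qwen/Qwen2.5-Coder-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

project_tokens = ["MyProjectConfig", "buildRenderPipeline", "KtAnalysisSession"]
num_added = tokenizer.add_tokens(project_tokens)

if num_added > 0:
    # Grow the embedding matrix; the new rows still need to be trained,
    # e.g. via continued pre-training on the project's code.
    model.resize_token_embeddings(len(tokenizer))
```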

We invite researchers studying how to adapt tokenizers to collaborate on the above cases. From our side, we are ready to share the relevant datasets and help with evaluating approaches on proprietary code that the models have not seen during training.

Agents for various software development scenarios

We are interested in agents and models that target various aspects of software development, not limited to issue resolution. Recently, we published benchmarks that assess models’ proficiency in environment setup and working with Git.

We would be happy to collaborate with researchers working on similar problems, assist in running these benchmarks, and share our fine-tuning pipelines.

Reasoning for code beyond natural language

In contrast to natural language, code can be executed, and execution provides a lot of additional information that is absent from a static code snapshot. Recent works explore this topic by measuring models’ capabilities in predicting runtime behavior (CRUXEval) and by training foundation models on runtime information (CWM). Moreover, programming projects have rich development histories that allow the extraction of code evolution data, something rarely available for natural language texts.

These unique kinds of information can be used to create new methods of reasoning for ML models: reasoning via sequential edits, reasoning about a program’s execution, and others. If you are interested in exploring new reasoning modalities for coding models, we would be happy to collaborate and share our experience of mining runtime and evolution data from code.
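
For example, one lightweight way to mine runtime information from Python code is to record per-line local variables with a trace hook; this is only a sketch, not our internal tooling.

```python
import sys

def trace_locals(func, *args, **kwargs):
    """Run `func` and record (line number, local variables) at each executed
    line of its body. A lightweight sketch of mining runtime information."""
    events = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            events.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(None)
    return result, events

def example(n):
    total = 0
    for i in range(n):
        total += i
    return total

# `trace` now holds the variable values observed at every executed line.
result, trace = trace_locals(example, 3)
```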

Integration of project-level knowledge into the models

During our work on Long Code Arena, we prepared a set of six benchmarks that require models to operate with project-level context in different settings.

We have already prepared baseline solutions for all of the benchmarks. We are deeply interested in various approaches to making progress on them, such as RAG, long-context models, agentic solutions, and other novel techniques for integrating project knowledge into the model.

We are happy to assist researchers who want to work on these benchmarks by setting up everything related to running the evaluation and baselines, as well as by helping with the data as needed.

Structured code generation

To make LLMs follow a specific generation format, we can use techniques such as structured text generation (see outlines as an example). This approach constrains models to a defined grammar, guaranteeing that the output will comply with it. Typically, the grammar is simple, as it is used to produce valid JSON or to follow basic regular expressions.

We can use the same approach to make models generate correct code for a given programming language. Moreover, we can enforce additional checks, such as the correctness of generated API calls. Structured code generation, however, remains technically challenging and raises many nuanced questions: What is the right way to sample from the model given the constraints? How can rollbacks be implemented in the presence of low-probability branches of generation?
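
Below is a toy character-level version of such constrained decoding, where the "grammar" is just a fixed set of allowed API calls and the model scores are random placeholders; real systems operate over token vocabularies and full language grammars.

```python
import random

# Toy sketch of constrained decoding: at each step, only characters that keep
# the output a prefix of some allowed string may be sampled. The "model" here
# is just random scores over the valid candidates.
ALLOWED_CALLS = ["listOf(1, 2)", "mapOf()", "println(x)"]

def valid_next_chars(prefix):
    return {call[len(prefix)] for call in ALLOWED_CALLS
            if call.startswith(prefix) and len(call) > len(prefix)}

def constrained_generate():
    prefix = ""
    while prefix not in ALLOWED_CALLS:
        candidates = valid_next_chars(prefix)
        # Stand-in for model logits: random scores over the valid characters only.
        scores = {ch: random.random() for ch in candidates}
        prefix += max(scores, key=scores.get)
    return prefix

print(constrained_generate())
```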

We would be happy to assist researchers working on this by sharing the datasets and evaluation setups for tasks such as code completion and code generation, as well as by connecting them to folks in the Kotlin and other language teams.

Optimal settings for LLM-as-a-judge evaluation in ML4SE

LLM-as-a-judge is becoming increasingly popular, replacing the classical textual similarity metrics in various evaluation suites.

There are many different ways of using LLMs for evaluation, including pointwise or pairwise grading, probability estimation by sampling or directly converting from logits, and applying different assessment models.
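
To make the design space concrete, here is a minimal sketch of pointwise grading with score averaging over several samples; call_llm is a hypothetical stand-in for any chat completion API, and the rubric and 1-5 scale are illustrative choices.

```python
import re

# Illustrative pointwise rubric; a real study would tune the prompt and scale.
JUDGE_PROMPT = """You are evaluating a generated code review comment.
Rate its helpfulness on a scale from 1 (useless) to 5 (excellent).
Reply with a single integer.

Task: {task}
Model output: {output}
"""

def pointwise_grade(task, output, call_llm, n_samples=5):
    """Pointwise LLM-as-a-judge with score averaging over several samples.
    `call_llm(prompt) -> str` is a hypothetical stand-in for a chat API call."""
    scores = []
    for _ in range(n_samples):
        reply = call_llm(JUDGE_PROMPT.format(task=task, output=output))
        match = re.search(r"[1-5]", reply)
        if match:
            scores.append(int(match.group()))
    return sum(scores) / len(scores) if scores else None
```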

Open questions remain, such as which setups researchers and practitioners should use when working on ML4SE tasks and what factors the setup depends on.

We would be happy to share our expertise in designing and conducting studies on metrics evaluation. Depending on the problem, we can also share ground truth and/or generated examples for some datasets.