
Machine Learning Research Division

At JetBrains, we are passionate about the way AI transforms software engineering.

The Machine Learning (ML) division at JetBrains Research explores ways to use ML techniques and agentic approaches to help developers and enhance software development processes. We aim to improve ML adoption for code by turning the latest academic advances into practical applications.

This page presents an overview of our research teams and collaboration opportunities.

Code modeling research

This team focuses on tasks around code modeling and code generation.

Long Code Arena

A set of benchmarks for tasks that require project-level information to solve: code completion, generation, summarization, error fixing, and more.


Kotlin ML Initiative

An initiative aimed at organizing, fine-tuning, and training models for Kotlin, as well as researching how to make the most of limited data for less popular languages.


Project-aware code modeling

The practice of studying, evaluating, and developing models that use project-wide information through retrieval techniques, expanded input sizes, and optimal context building.

Code editing research

This team is working on training generative models on the history of software project evolution.

Key research questions

  • How can we most accurately represent code edits in models' input and output?
  • How can we work with edits efficiently with smaller models?
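As a toy illustration of the first question, the same change can be represented either as a full rewrite of the edited region or as a compact diff. The sketch below is a simplified example using Python's standard difflib, not a description of the representation used in our models.

```python
import difflib

before = """def greet(name):
    print("Hello, " + name)
""".splitlines(keepends=True)

after = """def greet(name: str) -> None:
    print(f"Hello, {name}!")
""".splitlines(keepends=True)

# Representation 1: the model outputs the whole edited region verbatim.
full_rewrite = "".join(after)

# Representation 2: the model outputs only a compact diff of the change.
diff = "".join(difflib.unified_diff(before, after, fromfile="a.py", tofile="b.py"))

print(full_rewrite)
print(diff)
```

The diff representation is much shorter, which matters for smaller models, but it is also more brittle: the model has to reproduce line numbers and context exactly for the edit to apply.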

Key projects

  • Code review comment resolution
  • Instructional code generation
  • Next Edit Prediction in the editor

AI agents and planning

This team's goal is to enable better decision-making for JetBrains AI agents.

The two main directions of its research are:

  • Distilling decision-making capabilities from large pre-trained models to small task-specific ones.
  • Exploring the possibility of tuning LLMs for decision-making using reinforcement learning.

Human-AI experience

With the emergence of code-fluent LLMs, programming practices are changing. At the same time, development environments need to change to provide an optimal human-AI experience (HAX) in the IDE.

Human-AI interaction design

Work in this direction includes prototyping AI functionality integration into existing programming environments and developers' workflows, ensuring intuitive and efficient experiences.

Impact of AI on software developers

This research area explores how programmers use and perceive AI assistants, identifying the challenges they face and the benefits these tools bring, in order to better align AI with the real-world needs of developers.

Quality of Human-AI interaction

Our efforts in this field include identifying, evaluating, and adjusting critical aspects of AI assistants' output, from correctness to understandability, while also ensuring these tools truly support developers in their tasks.

Federated compute

Federated learning for software engineering tasks

Effectively applying machine learning to software development requires insights from vast, real-world codebases, yet traditional centralized methodologies often compromise data privacy. To tackle this, we're exploring federated learning techniques for efficient, privacy-preserving solutions.
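As a rough sketch of the core idea, the example below implements plain federated averaging (FedAvg) with NumPy: each client trains locally on its private data, and the server only ever sees the resulting weights, which it averages proportionally to local dataset sizes. This is a generic textbook illustration, not our actual training setup.

```python
import numpy as np

def local_update(weights, client_data, lr=0.1, steps=10):
    """Toy local training: nudge weights toward the client's data mean.

    Stands in for full gradient-based training on a private codebase;
    the raw data never leaves the client.
    """
    w = weights.copy()
    target = client_data.mean(axis=0)
    for _ in range(steps):
        w -= lr * (w - target)  # gradient of 0.5 * ||w - target||^2
    return w

def fed_avg(global_weights, client_datasets, rounds=5):
    """Federated averaging: clients train locally, the server averages weights."""
    w = global_weights
    for _ in range(rounds):
        client_weights = [local_update(w, data) for data in client_datasets]
        sizes = np.array([len(data) for data in client_datasets], dtype=float)
        # Weighted average of client models, proportional to local dataset size.
        w = np.average(client_weights, axis=0, weights=sizes)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clients = [rng.normal(loc=i, size=(20 + 10 * i, 4)) for i in range(3)]
    print(fed_avg(np.zeros(4), clients))
```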

Collaboration opportunities

We are open to collaborating with researchers from both academia and industry.

The topics below are particularly interesting to us at the moment.

Contact us if you are interested!

Integration of project-level knowledge into the models

During our work on Long Code Arena, we prepared a set of six benchmarks that require models to operate with project-level context in different settings, covering tasks such as code completion, code generation, summarization, and error fixing.

We have already prepared baseline solutions for all benchmarks. We are deeply interested in various approaches to making progress on them, such as RAG, long-context models, agentic solutions, and other novel techniques for integrating project knowledge into the model.
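As a simple illustration of the retrieval-augmented direction, the sketch below scores repository files against the current file with bag-of-words cosine similarity and prepends the closest matches to the prompt. It is a deliberately naive example for exposition, not one of our published baselines, and the prompt format is made up.

```python
import math
import re
from collections import Counter

def tokenize(code: str) -> Counter:
    # Crude lexical tokenization: identifiers and numbers only.
    return Counter(re.findall(r"[A-Za-z_][A-Za-z_0-9]*", code))

def cosine(a: Counter, b: Counter) -> float:
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def build_prompt(current_file: str, repo_files: dict[str, str], top_k: int = 2) -> str:
    """Pick the top_k most lexically similar files and prepend them as context."""
    query = tokenize(current_file)
    ranked = sorted(repo_files.items(),
                    key=lambda kv: cosine(query, tokenize(kv[1])),
                    reverse=True)
    context = "\n\n".join(f"# File: {path}\n{text}" for path, text in ranked[:top_k])
    return f"{context}\n\n# Complete the following file:\n{current_file}"

# The assembled prompt would then be passed to any code completion model.
```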

We are happy to support researchers who want to work on these benchmarks by setting up everything related to running the evaluation and baselines, as well as by helping with the data as needed.

Adapting tokenizers to new domains and projects

For LLMs, using the same tokenizer when changing domains may be suboptimal, and techniques for mitigating the issue already exist (1, 2).

In the software engineering (SE) domain, the tokenizer can be adapted in at least two stages: (i) when specializing general models for a single programming language and (ii) when adapting models to work with a specific software project. We believe that tokenizer adaptation may reduce inference costs and improve quality when working with the large codebases typical of SE.
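A minimal sketch of stage (ii), assuming a Hugging Face causal LM with a fast tokenizer: frequent project-specific identifiers are added to the vocabulary as whole tokens and the embedding matrix is resized (the new embedding rows would still need to be trained or sensibly initialized). The model name and the identifier list below are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM with a fast tokenizer works the same way.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical identifiers mined from the target project's codebase,
# e.g. its most frequent multi-token symbols.
project_identifiers = ["CoroutineScope", "viewModelScope", "MutableStateFlow"]

num_added = tokenizer.add_tokens(project_identifiers)
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} project-specific tokens; their embeddings still need training.")

# Each identifier is now a single token instead of several sub-tokens,
# which shortens prompts built from this project's code.
print(tokenizer.tokenize("val scope = CoroutineScope(Dispatchers.Main)"))
```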

We invite researchers studying how to adapt tokenizers to collaborate on the above cases. From our side, we are ready to share the relevant datasets and help with evaluating approaches on proprietary code that the models have not seen during training.

Structured code generation

To make LLMs follow a specific generation format, we can use techniques such as structured text generation (see outlines as an example). This approach constrains models to a defined grammar, guaranteeing that the output complies with it. The grammars involved are typically simple, used to produce valid JSON or to follow basic regular expressions.

We can use the same approach to make models generate correct code for a given programming language. Moreover, we can enforce additional checks, such as the correctness of generated API calls. This task, however, remains technically challenging and raises many nuanced questions: What is the right way to sample from the model given the constraints? How can rollbacks be implemented in the presence of low-probability branches of generation?
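The underlying mechanism can be sketched without any particular library: at each decoding step, tokens that cannot extend the output toward something the grammar accepts are masked out, and sampling happens over what remains. The toy example below constrains character-level generation to a small whitelist of Kotlin API calls standing in for a real grammar; libraries such as outlines apply the same masking idea against a compiled grammar automaton.

```python
import random

# Toy "grammar": a whitelist of API calls we consider valid completions.
VALID_OUTPUTS = {"list.sorted()", "list.sortedBy { it.id }", "list.reversed()"}
VOCAB = sorted({ch for s in VALID_OUTPUTS for ch in s})  # character-level vocabulary

def allowed_next(prefix: str) -> set[str]:
    """Characters that keep the prefix extendable to some valid output."""
    return {s[len(prefix)] for s in VALID_OUTPUTS
            if s.startswith(prefix) and len(s) > len(prefix)}

def fake_model_scores() -> dict[str, float]:
    """Stand-in for LLM logits: random positive scores over the vocabulary."""
    return {tok: random.random() + 1e-6 for tok in VOCAB}

def constrained_sample() -> str:
    prefix = ""
    while prefix not in VALID_OUTPUTS:
        allowed = allowed_next(prefix)
        scores = fake_model_scores()
        # The constraint is enforced by masking: only allowed tokens keep their scores.
        masked = {tok: scores[tok] for tok in allowed}
        tokens, weights = zip(*masked.items())
        prefix += random.choices(tokens, weights=weights)[0]
    return prefix

print(constrained_sample())
```

Even in this toy setting, the open questions above show up: masking skews the distribution the model was trained with, and once generation commits to a low-probability branch there is no mechanism here to roll it back.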

We would be happy to assist researchers working on this by sharing the datasets and evaluation setups for tasks such as code completion and code generation, as well as by connecting them to folks in the Kotlin and other language teams.

Optimal settings for LLM-as-a-judge evaluation in ML4SE

LLM-as-a-judge is becoming increasingly popular, replacing the classical textual similarity metrics in various evaluation suites.

There are many different ways of using LLMs for evaluation, including pointwise or pairwise grading, probability estimation by sampling or by converting directly from logits, and the use of different assessment models.
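To make two of these setups concrete, the sketch below contrasts a pointwise grader with a pairwise one. The call_llm function is a placeholder for whichever assessment model is used, and the prompts and score scale are purely illustrative.

```python
import re

def call_llm(prompt: str) -> str:
    """Placeholder for a call to an assessment model (any provider)."""
    raise NotImplementedError

def pointwise_judge(task: str, answer: str) -> int:
    """Grade a single answer on a 1-5 scale."""
    prompt = (
        "You are evaluating a solution to a software engineering task.\n"
        f"Task: {task}\nSolution: {answer}\n"
        "Rate the solution from 1 (useless) to 5 (fully correct). "
        "Reply with a single digit."
    )
    reply = call_llm(prompt)
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 1  # conservative fallback

def pairwise_judge(task: str, answer_a: str, answer_b: str) -> str:
    """Pick the better of two answers; returns 'A' or 'B'."""
    prompt = (
        "You are comparing two solutions to a software engineering task.\n"
        f"Task: {task}\nSolution A: {answer_a}\nSolution B: {answer_b}\n"
        "Reply with exactly one letter: A or B."
    )
    reply = call_llm(prompt).strip().upper()
    return "A" if reply.startswith("A") else "B"
```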

Open questions remain, such as which setups researchers and practitioners should use when working on ML4SE tasks and what factors the setup depends on.

We would be happy to share our expertise in designing and conducting studies on metric evaluation. Depending on the problem, we can also share ground truth and/or generated examples for some datasets.