JetBrains Research Highlights Updates 2023

The Year in Review

  • 36 publications
  • 17 talks at conferences
  • 14 workshops and seminars
  • 41 theses and student projects
  • 22 offline and online courses
  • 18 collaborations with universities and research groups

Machine Learning Methods in Software Engineering Lab

The ML4SE lab works on improving modern software engineering tools and discovering new ways to develop and maintain code.

Code Modeling

One of the main research directions of the ML4SE lab is developing machine learning techniques for better processing of software code at the project level. This year, we developed Long Code Arena, a benchmark for models that work with project-level context, and we are now preparing it for publication.

AI Agents

The team has embarked on a project that seeks to integrate IDEs with LLMs, specifically focusing on enhancing the model’s function-calling and planning capabilities. Currently, our efforts are centered on exploring various methodologies that allow for better action planning and execution with smaller models.
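
To make the idea concrete, here is a rough sketch of such an agent loop. The LlmClient interface, the tool names, and the "name:argument" action format are illustrative assumptions, not a JetBrains API: the planner repeatedly asks the model for the next action, executes it in the IDE, and feeds the observation back.

    // Hypothetical IDE agent loop; all names here are illustrative assumptions.
    interface LlmClient {
        fun nextAction(goal: String, history: List<String>): String
    }

    fun interface IdeTool {
        fun run(argument: String): String
    }

    fun runAgent(llm: LlmClient, tools: Map<String, IdeTool>, goal: String, maxSteps: Int = 10): List<String> {
        val history = mutableListOf<String>()
        repeat(maxSteps) {
            val action = llm.nextAction(goal, history)          // e.g. "findUsages:MyService"
            if (action == "done") return history
            val name = action.substringBefore(':')
            val argument = action.substringAfter(':', missingDelimiterValue = "")
            val observation = tools[name]?.run(argument) ?: "unknown tool: $name"
            history += "$action -> $observation"                // the observation informs the next step
        }
        return history
    }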

Kotlin ML Initiative

In 2023, we started improving the quality of openly available machine learning models working with Kotlin. Our changes included:

  • Improving datasets and benchmarks (one data-cleaning step is sketched after this list).
  • Tuning language models to better support Kotlin.
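
To give a flavor of the dataset side, below is a minimal sketch of one plausible cleaning step (our illustration, not the initiative's actual pipeline): collecting Kotlin files, skipping generated sources, and deduplicating by content hash.

    import java.io.File
    import java.security.MessageDigest

    // Illustrative corpus-cleaning step for Kotlin fine-tuning data (an assumption,
    // not the real pipeline): gather .kt files, skip generated sources, dedupe by hash.
    fun collectKotlinSamples(root: File): List<File> {
        val seenHashes = HashSet<String>()
        return root.walkTopDown()
            .filter { it.isFile && it.extension == "kt" }
            .filterNot { it.readText().contains("generated", ignoreCase = true) } // crude heuristic
            .filter { file ->
                val digest = MessageDigest.getInstance("SHA-256").digest(file.readBytes())
                seenHashes.add(digest.joinToString("") { b -> "%02x".format(b) }) // true only for the first copy
            }
            .toList()
    }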

Education Research

This research is mainly focused on educational products. We continued to develop automatic code quality assessment on the Hyperskill platform, contributed to the development of the JetBrains Academy plugin, and worked on a prototype of an analytics platform for teachers. As part of our content creation activities, we co-developed several courses for the JetBrains Academy plugin and shared the ready-to-teach open course materials for Programming in Kotlin.

Human-AI Experience (HAX)

This year, with the increasing prominence of AI, our team introduced a Human-AI Experience (HAX) research direction, emphasizing human interaction with AI, particularly within the IDE. We delineate this interaction into three aspects: Design, Impact, and Quality. The team is actively engaged in research across all three, in collaboration with TU Delft, the University of California, and internal JetBrains teams and labs.

Intelligent Collaboration Tools Lab

This lab uses data-driven methods to improve collaborative software engineering tools such as communication engines, issue trackers, and code review platforms, while also helping to devise novel approaches to tool support for collaborative work.

ICTL: Key Research Directions and Projects

Test Generation

  1. Published the TestSpark plugin on the Marketplace, drawing significant organic attention (1,400 downloads).
  2. Added LLM-based test generation support to TestSpark, supporting both JetBrains AI and OpenAI (an illustrative sketch follows this list).
  3. Had a paper about TestSpark accepted at ICSE 2024.
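
The gist of the LLM-based flow can be sketched as follows. This is a simplification: the prompt wording and the generate callback are stand-ins for a JetBrains AI or OpenAI call, not TestSpark's actual internals.

    // Simplified LLM-based test generation; `generate` is an illustrative stand-in.
    fun buildTestPrompt(classCode: String, methodName: String): String = """
        Write a JUnit 5 test class in Kotlin for the method `$methodName` in the class below.
        Return only compilable code.

        $classCode
    """.trimIndent()

    fun generateTests(generate: (String) -> String, classCode: String, methodName: String): String {
        val raw = generate(buildTestPrompt(classCode, methodName)).trim()
        // Models often wrap code in a Markdown fence; strip it before saving the test file.
        return raw.removePrefix("```kotlin").removeSuffix("```").trim()
    }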

Analysis of Development Traces

  1. Completed the Bus Factor LABS project.
  2. Published a paper, File Significance Effect on Bus Factor Calculation, at FSE 2023.

Code Review Research

  1. Demonstrated the advantage of custom ordering in code reviews; paper published at EASE 2023.
  2. Conducted a thorough overview and comparison of existing viable reviewer recommendation models.

Crash Reproduction

A joint endeavor with the Exception Analyzer team: researching the viability of crash reproduction models (classic and LLM-based) for IntelliJ development, using stack traces as input.
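
Any such pipeline first needs the stack traces in structured form. A minimal parsing sketch (our illustration of the preprocessing, not the project's code):

    // Parse JVM stack-trace frames such as "at com.example.Foo.bar(Foo.kt:42)".
    data class Frame(val className: String, val method: String, val file: String?, val line: Int?)

    private val FRAME = Regex("""\s*at\s+([\w.$]+)\.([\w$<>]+)\(([^:)]+)(?::(\d+))?\)""")

    fun parseStackTrace(trace: String): List<Frame> =
        trace.lineSequence().mapNotNull { line ->
            FRAME.matchEntire(line)?.destructured?.let { (cls, method, file, lineNo) ->
                Frame(cls, method, file.ifEmpty { null }, lineNo.toIntOrNull())
            }
        }.toList()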

Code Readability

Building a new generation of code readability datasets and models. A joint effort with the HAX team, UPorto, and Meta.

AI4SE

Launched AI for Software Engineering (AI4SE), a long-term research collaboration with Delft University of Technology, and opened several Ph.D. positions there.

Validation of AI-Generated Code

Working on ways to validate snippets generated by LLMs. A collaboration with TU Delft, ML4SE, and APAL.
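
One cheap first-line signal for such validation, shown here purely as an assumption about the general approach, is whether a generated snippet compiles at all. A sketch using the standard javax.tools API:

    import java.io.File
    import javax.tools.ToolProvider

    // Compilation as a basic validity check for a generated Java source file.
    // Requires running on a JDK; this is an illustrative check, not the project's method.
    fun compiles(source: File): Boolean {
        val compiler = ToolProvider.getSystemJavaCompiler()
            ?: error("A JDK (not a JRE) is required for the system compiler")
        // run(stdin, stdout, stderr, vararg args) returns 0 on success
        return compiler.run(null, null, null, source.absolutePath) == 0
    }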

Message Reaction Prediction

Built a model to suggest reactions to messages. This is the first academic project to satisfy the requirements of modern messengers (100+ reactions, with custom ones possible) while also producing competitive results.

Undesirable Patterns in Collective Development

Finished a massive study of undesirable patterns in collective development; we are using the results to inform our decisions about other projects as well.

Programming Languages and Program Analysis Lab

The Programming Languages and Program Analysis Lab carries out research in the areas of programming languages, static and dynamic program analysis, code generation, and related topics.

PLAN: Key Research Directions and Projects

Kotlin Memory Model Research

Started a new project aiming to provide a specification for the Kotlin memory model and a tool to check the conformance of the Kotlin compiler to this specification.
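
Conformance checking of this kind typically revolves around litmus tests. Below is the classic store-buffering test, included to illustrate the problem domain rather than the lab's tooling: under sequential consistency the outcome r1 == 0 && r2 == 0 is impossible, while a relaxed model (or compiler and hardware reordering) may allow it.

    import kotlin.concurrent.thread

    // Store-buffering litmus test: two threads each write one variable and read the other.
    var x = 0
    var y = 0
    var r1 = -1
    var r2 = -1

    fun main() {
        repeat(100_000) { iteration ->
            x = 0; y = 0
            val t1 = thread { x = 1; r1 = y }
            val t2 = thread { y = 1; r2 = x }
            t1.join(); t2.join()
            if (r1 == 0 && r2 == 0) println("weak behavior observed on iteration $iteration")
        }
    }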

Kex

Kex is a Java bytecode analysis platform.

  • Implemented a prototype that combines EvoSuite and Kex into one test generation tool.
  • Implemented a second, improved prototype for Reanimator: a tool for creating instances of objects with the required shape.

CoqPilot

Released CoqPilot, a plugin designed to help automate the writing of Coq proofs by using LLMs to generate potential proofs.

Kotlin Compiler Fuzzing

Implemented a fully generation-based fuzzer for the Kotlin compiler and improved the existing mutation-based fuzzer.
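
The core idea of generation-based fuzzing is to derive inputs from a grammar rather than mutate existing seeds. A toy version of such a generator, vastly simplified relative to the real fuzzer:

    import kotlin.random.Random

    // Grammar-driven random generation of Kotlin arithmetic expressions to feed the compiler.
    fun genExpr(rng: Random, depth: Int): String = when {
        depth == 0 || rng.nextInt(3) == 0 -> rng.nextInt(100).toString()
        else -> {
            val op = listOf("+", "-", "*").random(rng)
            "(${genExpr(rng, depth - 1)} $op ${genExpr(rng, depth - 1)})"
        }
    }

    fun genProgram(rng: Random): String =
        "fun main() { println(${genExpr(rng, depth = 4)}) }"   // wrap into a compilable unit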

Lama

Continued working on the Lama language and infrastructure:

  • Parallel and/or concurrent GC;
  • Lama LSP server.

LLM Code Generation Consistency

Started researching the possibility of assessing the quality of LLM-generated code without the need for a reference.
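
One reference-free signal, shown purely as an illustration of the idea rather than the method under study, is self-consistency: sample the model several times and measure how often the generations agree.

    // Self-consistency as a reference-free quality proxy: identical samples across
    // several generations suggest the model is more certain about its answer.
    fun consistencyScore(sample: () -> String, n: Int = 5): Double {
        val outputs = List(n) { sample().trim() }
        val largestGroup = outputs.groupingBy { it }.eachCount().values.maxOrNull() ?: 0
        return largestGroup.toDouble() / n   // 1.0 means all n samples were identical
    }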

Java-like Relaxed Memory Model

Worked on a new theoretical memory model in the style of the original Java memory model (with a commit-and-re-execute mechanism) that resolves the out-of-thin-air problem while providing simpler semantics and supporting a larger set of optimizations than existing solutions.

Gradle DSL to Kotlin Gradle DSL Conversion via LLMs

Started working on converting .gradle to .gradle.kts using LLMs. This applied research topic aims to understand the pragmatic difficulties of using LLMs for language-to-language translation.
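
A typical instance of the conversion looks like this (an illustrative pair we constructed; the version numbers are placeholders). The Groovy input shown in the comment becomes the statically typed Kotlin DSL below it:

    // Groovy build.gradle input:
    //     plugins { id 'org.jetbrains.kotlin.jvm' version '1.9.0' }
    //     dependencies { implementation 'com.squareup.okio:okio:3.6.0' }
    // Equivalent build.gradle.kts output the model should produce:
    plugins {
        id("org.jetbrains.kotlin.jvm") version "1.9.0"
    }

    dependencies {
        implementation("com.squareup.okio:okio:3.6.0")
    }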

miniKanren

Optimization efforts for relational programming:

  • Functional conversion (first iteration finished, improvements needed);
  • Offline partial deduction (investigation phase).

Fundamental Computational Sciences

This is an interdisciplinary research area that applies advanced computational methods to explore the frontiers of physics, mathematics, and computer science.

Algorithms and Complexity Theory Lab

The team continued to work on barriers for reduction-based hardness proofs, a problem in the field of fine-grained complexity. The main result is a hardness-of-showing-hardness theorem: for many practically important problems, proving the problem hard under the assumption that satisfiability is hard would yield new circuit lower bounds, which are notoriously difficult to obtain; in other words, such reduction-based hardness proofs face an inherent barrier.

Astroparticle Physics Lab

The team researched the use of instruction-fine-tuned large language models to analyze astrophysical data. The resulting LLM tool, codenamed NIMBUS, is now open source and is being integrated into Astro-COLIBRI, a real-time data platform for multi-messenger astrophysics.

HoTT and Dependent Types Lab

The team continued its work on the Arend theorem prover.

  • Arend has become mature and expressive enough to have a weak Nullstellensatz (a basic theorem from algebraic geometry) formalized in it; the classical statement is quoted after this list.
  • A number of foundational theorems pertaining to ring theory, linear algebra, and algebraic number theory have been added to the Arend standard library.
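
For reference, the classical statement being formalized reads as follows (quoted in standard mathematical form, independent of its Arend encoding):

    If $k$ is an algebraically closed field and $I \subsetneq k[x_1, \dots, x_n]$ is a proper ideal,
    then the polynomials in $I$ have a common zero: there exists $a \in k^n$ with $f(a) = 0$ for all $f \in I$.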

Applied Research

Paper-Analyzer

This year, the Paper-Analyzer team continued working on its existing question-answering (Q&A) system and applying it to PubMed articles as well as product documentation.

New directions:

  • Correcting and extending a biomedical knowledge dataset called GENIA.
  • Incorporating and improving on open-source and long-context LLMs for our extraction pipelines.

Concurrent Computing Lab

The lab focused on practical concurrent algorithms and frameworks for testing them.

  • Released new channels in Kotlin coroutines and published two papers, at PLDI and PPoPP, about the underlying synchronization algorithms.
  • Presented a paper about the Lincheck framework for testing concurrent data structures at CAV; a minimal usage sketch follows this list.
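
For readers unfamiliar with Lincheck, a minimal test in the style of the library's public examples looks like this; the counter under test is our own toy example.

    import org.jetbrains.kotlinx.lincheck.annotations.Operation
    import org.jetbrains.kotlinx.lincheck.check
    import org.jetbrains.kotlinx.lincheck.strategy.stress.StressOptions
    import org.junit.Test
    import java.util.concurrent.atomic.AtomicInteger

    // Lincheck generates concurrent scenarios over the @Operation methods and
    // verifies the observed results against a sequential specification.
    class CounterTest {
        private val counter = AtomicInteger()

        @Operation
        fun inc(): Int = counter.incrementAndGet()

        @Operation
        fun get(): Int = counter.get()

        @Test
        fun stressTest() = StressOptions().check(this::class)
    }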

Mobile Robot Algorithms Lab

The main ongoing research relates to SLAM algorithms, with a focus on high performance and the ability to run on resource-constrained hardware.

New research initiatives:

  • Lightweight containers for RISC-V.
  • Robotics CI/CD pipeline.
  • Intelligent professor assistant.

Computational Biology

The science of using biological data to develop algorithms or models in order to better understand biological systems and their relationships.

BioLabs

Neurodevelopment and Neurophysiology Lab

Scientific projects:

  • Massive-scale DNA methylation profiling of a healthy aging cohort.
  • Unified pipeline for identifying cellular composition changes during healthy aging.
  • Decomposing healthy aging trajectories from proteomics data.
  • Accurate chromatin peak calling with SPAN.