Paper-Analyzer is a web-based application that runs search queries against a collection of 30 million PubMed paper abstracts. A search query includes a gene ID (an NCBI Gene ID), a MeSH (Medical Subject Headings) term, or both. Users can also specify taxon names for genes and add context and author names to narrow the search.
We trained a model to find relations between the entities in a search query. Currently, we can find connections between genes and diseases, chemicals and genes, and chemicals and diseases. There are 13 relation types, such as marker-mechanism relations, therapeutic effect, and increase or decrease in expression, activity, or metabolic processing. Abstracts often do not state explicitly that the entities in question are related. To address this problem, we trained a natural language understanding model based on BERT, a Transformer architecture. We took positive relation examples from the Comparative Toxicogenomics Database (CTD) to train the model.
We used the PubTator application for named entity recognition (NER) and entity name normalization, but we plan to replace it with our own NER system shortly. We also plan to include gene-gene relations from Reactome in our search system.
We preprocess all the abstracts and store relations in a database.
After submitting the query, the user gets a list of resulting papers aggregated by relation endpoints and types. One can collapse relation types and sort the search results by score (model confidence), publication year, or number of papers in a group.
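The grouping and sorting described above can be sketched roughly as follows. This is only an illustration, not the application's actual code: the `Relation` record, its field names, and the choice of the highest model confidence as a group's score are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Relation:
    """One extracted relation; every field name here is hypothetical."""
    pmid: str        # paper identifier
    source: str      # first relation endpoint
    target: str      # second relation endpoint
    rel_type: str    # relation type
    score: float     # model confidence
    year: int        # publication year


GroupKey = Tuple[str, str, str]


def group_relations(relations: List[Relation]) -> Dict[GroupKey, List[Relation]]:
    """Aggregate relations by their endpoints and relation type."""
    groups: Dict[GroupKey, List[Relation]] = defaultdict(list)
    for rel in relations:
        groups[(rel.source, rel.target, rel.rel_type)].append(rel)
    return dict(groups)


def sort_groups(groups: Dict[GroupKey, List[Relation]], by: str = "score"):
    """Sort groups in descending order by score, year, or number of papers."""
    key_funcs = {
        # Assumption: a group's score is its best single-paper confidence.
        "score": lambda item: max(r.score for r in item[1]),
        # Assumption: a group's year is its most recent publication year.
        "year": lambda item: max(r.year for r in item[1]),
        "papers": lambda item: len({r.pmid for r in item[1]}),
    }
    return sorted(groups.items(), key=key_funcs[by], reverse=True)


# Made-up records, for illustration only.
example = [
    Relation("1", "gene:A", "disease:X", "marker/mechanism", 0.90, 2015),
    Relation("2", "gene:A", "disease:X", "marker/mechanism", 0.60, 2019),
    Relation("3", "chem:B", "disease:X", "therapeutic", 0.95, 2010),
]
for key, members in sort_groups(group_relations(example), by="papers"):
    print(key, len(members))
```

Keeping the group key as a plain tuple makes it easy to collapse relation types later by regrouping on the first two elements only.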
Users can explore search results at the level of particular abstracts by selecting papers grouped by relation types. One can filter abstracts by publication year using the histogram. We also provide detailed information about the papers and links to PubMed and PubTator.
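As a rough illustration of the year filter, the sketch below counts papers per publication year and keeps only those inside a selected range. It is not the application's implementation; the paper records are assumed to be dictionaries with a hypothetical `year` key.

```python
from collections import Counter
from typing import Dict, Iterable, List


def year_histogram(years: Iterable[int]) -> Dict[int, int]:
    """Count papers per publication year, ordered by year."""
    return dict(sorted(Counter(years).items()))


def filter_by_year(papers: List[dict], first: int, last: int) -> List[dict]:
    """Keep papers whose hypothetical 'year' field lies in [first, last]."""
    return [paper for paper in papers if first <= paper["year"] <= last]
```

Selecting a range on the histogram then corresponds to calling `filter_by_year` with the chosen bounds.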
We are now working on extracting additional information about entities and relations from article text. For now, one can see the contexts found in sentences that contain both entities forming a relation.
Applying the relation extraction (RE) model to PubMed abstracts produced a database of extracted relations. We will update this database whenever the model changes.
The RE database is a TSV file with columns:
We analyze the following classes of relation types:
These types represent a subset of types mentioned in CTD.
Release description:
The database can be downloaded here.
Release description: the first public release of the database. It can be downloaded here.
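Once downloaded, the TSV file can be read with Python's standard `csv` module. The sketch below assumes the file starts with a header row and takes the column names from it rather than hard-coding them; the `preview` helper and any path passed to it are hypothetical.

```python
import csv
import itertools
from typing import Dict, Iterator, List, TextIO


def read_tsv(stream: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per row, keyed by the names in the header row."""
    # QUOTE_NONE treats quote characters as ordinary text.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    yield from reader


def preview(path: str, limit: int = 5) -> List[Dict[str, str]]:
    """Return the first `limit` rows of the TSV file at `path`."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(itertools.islice(read_tsv(handle), limit))
```

Reading rows lazily keeps memory use low, which matters if the file holds relations extracted from millions of abstracts.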