Active OTHER RESEARCH-RELATED NIH (US)

Text Mining and Large Language Models for AI-Driven Evidence-Based Functional Annotation of Clinical Variants

$3M USD

Funder	NATIONAL CANCER INSTITUTE
Recipient Organization	Washington University
Country	United States
Start Date	Sep 19, 2023
End Date	Aug 31, 2028
Duration	1,808 days
Number of Grantees	2
Roles	Principal Investigator; Co-Investigator
Data Source	NIH (US)
Grant ID	`11124530`

Grant Description

Project Summary

Biomedical knowledgebases are faced with the challenge of sustaining high-quality curation in the face of ever increasing amounts of biomedical data and limited curator resources. These resources have successfully taken advantage of natural language processing (NLP) technologies to automate some curation tasks, such as document triage; however, other tasks such as free-text annotation and information extraction still require intensive manual effort that causes bottlenecks in curation workflows. The advent of Large Language Models (LLMs), which have demonstrated impressive performance in interpretation and production of natural language, opens up the possibility of automating these time-consuming tasks, so as to maximize the value of curator effort. Discussions at the recent NIH data repository and knowledgebase (DRKB) program meeting in February 2024 showcased the great interest among resources in using LLMs to scale up curation.

In this supplement application, the Clinical Interpretation of Variants in Cancer (CIViC) resource and UniProt will collaborate to develop AI-driven data curation strategies to benefit our resources and to serve as a model for other DRKB members. CIViC is dedicated to the expert curation of information about the clinical significance of cancer genome alterations to enable precision medicine. To support CIViC curation, we have previously developed a BERT-based NLP system that extracts relationships between genes, genetic variants, cancers, and drugs from sentences in the scientific articles.

In Aim 1 of this project, we will enhance this tool in two ways. First, we will add functionality that will classify sentences according to CIViC evidence types for somatic variants: predictive, diagnostic, prognostic, oncogenic, and functional. Second, we will use an LLM to verify the information extracted by the BERT-based tool. Relations that are supported by both methodologies will be scored as high confidence, necessitating less manual curator review.

In Aim 2, we will use an LLM to prepare drafts of CIViC evidence statements, which are free-text descriptions of the literature evidence supporting asserted relations. To increase the accuracy and relevance of the statements, we will provide sentences identified by the BERT-based tool as enhanced context to the LLM. Supplementing an LLM with domain-specific information, an approach known as Retrieval Augmented Generation (RAG), has been shown to improve LLM performance on biomedical tasks.

Finally, in Aim 3, we will disseminate the results from Aims 1 and 2 via the Hypothes.is community annotation platform, which is used by curators at CIViC and also at ClinGen, an NIH-funded resource focusing on the clinical relevance of genes and genetic variants. Moreover, UniProt will import CIViC relations and Evidence Statements for display in its computationally mapped bibliography and will establish cross-links with CIViC. The prototype framework developed here is generalizable to other biomedical knowledge domains and can be adopted by other data resources.
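
The Aim 1 agreement rule (a relation counts as high confidence only when both the BERT-based extractor and the LLM check support it) could be sketched roughly as below. This is only an illustration under stated assumptions, not the project's implementation: the `Relation` fields, the `score_relations` function, and the label strings `"high"` and `"needs_review"` are all hypothetical names, and the inputs are placeholders rather than real curated data.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Relation:
    """Hypothetical record for one extracted relation (field names are assumptions)."""
    gene: str
    variant: str
    cancer: str
    drug: Optional[str] = None


def score_relations(
    bert_relations: Iterable[Relation],
    llm_supported: Iterable[Relation],
) -> Dict[Relation, str]:
    """Label each BERT-extracted relation by whether the LLM check also supports it.

    Relations found by both methods are marked "high"; the rest are
    marked "needs_review" so a curator can look at them first.
    """
    supported = set(llm_supported)
    return {
        rel: ("high" if rel in supported else "needs_review")
        for rel in bert_relations
    }


# Placeholder inputs, not real curation data.
rel_a = Relation("GENE_A", "VARIANT_1", "CANCER_X", "DRUG_P")
rel_b = Relation("GENE_B", "VARIANT_2", "CANCER_Y")
scores = score_relations([rel_a, rel_b], [rel_a])
```

Here `scores` maps `rel_a` to `"high"` and `rel_b` to `"needs_review"`. A real pipeline would presumably need fuzzier matching than exact equality, since the two methods may normalize gene, variant, or drug names differently.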

All Grantees

Washington University

