System Paper. Audience: Life scientists, medical researchers, bioinformaticians, NLP researchers in the biomedical domain

EasyNER: A Customizable Easy-to-Use Pipeline for Deep Learning- and Dictionary-based Named Entity Recognition from Medical and Life Science Text


📄 Abstract

Background: Medical and life science research generates millions of publications, and it is a great challenge for researchers to utilize this information in full since its scale and complexity greatly surpass human reading capabilities. Automated text mining can help extract and connect information spread across this large body of literature, but this technology is not easily accessible to life scientists.

Methods and Results: Here, we developed an easy-to-use end-to-end pipeline for deep learning- and dictionary-based named entity recognition (NER) of typical entities found in medical and life science research articles, including diseases, cells, chemicals, genes/proteins, species and others. The pipeline can access and process large medical research article collections (PubMed, CORD-19) or raw text and incorporates a series of deep learning models fine-tuned on the HUNER corpora collection. In addition, the pipeline can perform dictionary-based NER related to COVID-19 and other medical topics. Users can also load their own NER models and dictionaries to include additional entities. The output consists of publication-ready ranked lists and graphs of detected entities and files containing the annotated texts. In addition, we provide two accessory scripts which allow processing of files in PubTator format and rapid inspection of the results for specific entities of interest. As model use cases, the pipeline was deployed on two collections of autophagy-related abstracts from PubMed and on the CORD-19 dataset, a collection of 764 398 research article abstracts related to COVID-19.

Conclusions: The NER pipeline we present is applicable in a variety of medical research settings and makes customizable text mining accessible to life scientists.
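The dictionary-based step and the ranked entity lists described in the abstract can be illustrated with a minimal sketch. This is not EasyNER's code: the dictionary contents and the function names `build_pattern`, `tag`, and `rank` are hypothetical, and plain case-insensitive regex matching is assumed in place of whatever matcher the pipeline actually uses.

```python
import re
from collections import Counter

# Hypothetical toy dictionary: entity class -> surface terms.
# These are illustrative only, not EasyNER's own term lists.
DICTIONARY = {
    "disease": ["COVID-19", "pneumonia"],
    "process": ["autophagy"],
}


def build_pattern(terms):
    """Compile one case-insensitive pattern that matches any term as a whole token."""
    # Longest terms first, so a longer term wins over a shorter one it starts with.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    # Lookarounds rather than \b, so terms containing hyphens or digits
    # are still required to be delimited by non-word characters.
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)


def tag(text, dictionary=DICTIONARY):
    """Return (start, end, matched_text, entity_class) tuples sorted by position."""
    hits = []
    for label, terms in dictionary.items():
        for match in build_pattern(terms).finditer(text):
            hits.append((match.start(), match.end(), match.group(0), label))
    return sorted(hits)


def rank(texts, dictionary=DICTIONARY):
    """Count case-normalised mentions across all texts, most frequent first."""
    counts = Counter()
    for text in texts:
        for _, _, mention, label in tag(text, dictionary):
            counts[(mention.lower(), label)] += 1
    return counts.most_common()


if __name__ == "__main__":
    abstracts = [
        "Autophagy is altered in COVID-19 pneumonia.",
        "COVID-19 and autophagy.",
    ]
    for (mention, label), count in rank(abstracts):
        print(f"{count}\t{label}\t{mention}")
```

Matching is deliberately strict here: a term only counts when it is delimited by non-word characters, so "COVID-190" would not be tagged as "COVID-19". A real dictionary tagger would typically also handle synonyms and spelling variants.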

Authors (11)

Rafsan Ahmed

Petter Berntsson

Alexander Skafte

Salma Kazemi Rashed

Marcus Klang

Adam Barvesten

+5 more

Submitted

April 16, 2023

arXiv Category

q-bio.QM


Key Contributions

Develops EasyNER, an easy-to-use, end-to-end pipeline for deep learning- and dictionary-based Named Entity Recognition (NER) in medical and life science texts. It processes large collections such as PubMed and CORD-19, as well as raw text, using deep learning models fine-tuned on the HUNER corpora collection, and lets users load their own NER models and dictionaries. This makes customizable text mining accessible to researchers without deep NLP expertise.

Business Value

Accelerates scientific discovery by enabling researchers to quickly extract and connect critical information from vast biomedical literature, potentially leading to faster drug discovery and medical advancements.

Paper Metadata

Innovation Type

Tool/Pipeline Development

Deployment Feasibility

High, as it's a pipeline designed for practical use by researchers.

Limitations Addressed

Difficulty for life scientists to access and utilize the vast amount of information in biomedical literature due to scale and complexity; lack of accessible text mining tools.

Technical Tags

Named Entity Recognition (NER), medical text mining, life science text, deep learning, dictionary-based NER, pipeline, PubMed, CORD-19, HUNER corpus, information extraction

Research Topics

Biomedical Text Mining, Named Entity Recognition, Information Extraction, Deep Learning for NLP, Life Sciences Research Support

Methods & Architectures

End-to-end pipeline, Deep learning models (fine-tuned), Dictionary-based NER, Text processing

Applications & Tasks

Medical Research; Life Sciences; Biotechnology; Pharmaceuticals; Extracting information from large biomedical literature; Making text mining accessible to life scientists; Identifying specific entities (diseases, genes, etc.); Named Entity Recognition (NER); Information Extraction; Literature analysis

Datasets & Benchmarks

Datasets

PubMed, CORD-19, HUNER corpus

Precision, Recall, F1-score for NER
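The three metrics named above can be computed as in the following minimal sketch. It assumes strict entity-level scoring, where a prediction counts only if its span and class exactly match a gold annotation; the summary does not state which matching scheme the paper uses, and the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level precision, recall and F1 under exact span-and-class matching.

    Both arguments are iterables of hashable annotations, for example
    (start, end, entity_class) tuples. Exact matching is an assumption here.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold = {(0, 9, "process"), (24, 32, "disease"), (33, 42, "disease"), (50, 55, "gene")}
    predicted = {(0, 9, "process"), (24, 32, "disease")}
    p, r, f1 = precision_recall_f1(gold, predicted)
    print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In the usage example, both predictions are correct but two gold entities are missed, so precision is perfect while recall is one half; F1, the harmonic mean, sits between the two.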

Related Fields

Natural Language Processing, Bioinformatics, Computational Biology, Medical Informatics, Machine Learning

Keywords

NER, medical text mining, life sciences, deep learning, pipeline, biomedical, information extraction, PubMed, CORD-19, research tools

Academic Context

#Biomedical Text Mining, #Named Entity Recognition, #Information Extraction, #Deep Learning for NLP, #Life Sciences Research Support

Commercial Potential

Potential Products

Specialized NER tools for pharmaceutical research, Knowledge discovery platforms for life sciences

Target Industries

Pharmaceuticals, Biotechnology, Healthcare, Research Institutions

Use Case Examples

Identifying all mentions of specific genes or proteins in research papers, Extracting disease-gene associations, Summarizing research on a particular drug

Competitive Edge

Provides an accessible, end-to-end solution combining deep learning and dictionary methods, specifically tailored for the medical and life science domain.

Market Opportunity

Large and growing market for bioinformatics and medical informatics tools.

Revenue Models

Licensing of the software, SaaS platform for text mining services

Resource Requirements

Compute Needs

Moderate for training/fine-tuning models, potentially high for processing large literature collections.

Data Requirements

Access to large biomedical literature databases (e.g., PubMed) and annotated corpora (e.g., HUNER).

Deployment Constraints

Requires domain expertise for effective use and interpretation of results.

Scalability

Designed to process large collections of research articles.

Production Readiness

Maturity Level

Research

Time to Market

1-2 years for a polished, commercial product.
