International Journal of Medical Informatics
Volume 67, Issue 1, Pages 49-61, 4 December 2002

Protein names and how to find them

  • Kristofer Franzén

      Affiliations

    • Swedish Institute of Computer Science, Box 1263, SE-164 29 Kista, Sweden
    • Corresponding author. Tel.: +46-8-633-1537; fax: +46-8-751-7230
  • Gunnar Eriksson

      Affiliations

    • Swedish Institute of Computer Science, Box 1263, SE-164 29 Kista, Sweden
  • Fredrik Olsson

      Affiliations

    • Swedish Institute of Computer Science, Box 1263, SE-164 29 Kista, Sweden
  • Lars Asker

      Affiliations

    • Virtual Genetics Laboratory AB, SE-171 77 Stockholm, Sweden
  • Per Lidén

      Affiliations

    • Virtual Genetics Laboratory AB, SE-171 77 Stockholm, Sweden
  • Joakim Cöster

      Affiliations

    • Virtual Genetics Laboratory AB, SE-171 77 Stockholm, Sweden

Abstract 

A prerequisite for all higher-level information extraction tasks is the identification of unknown names in text. Today, when large corpora can consist of billions of words, it is of utmost importance to develop accurate techniques for the automatic detection, extraction and categorization of named entities in these corpora. Although named entity recognition might be regarded as a solved problem in some domains, it still poses a significant challenge in others. In this work we focus on one of the more difficult tasks, the identification of protein names in text. This task presents several interesting difficulties because of the named entities' varying structural characteristics, their sometimes unclear status as names, the lack of common standards and fixed nomenclatures, and the specifics of the texts in the molecular biology domain in which they appear. We describe how we approached these and other difficulties in the implementation of Yapex, a system for the automatic identification of protein names in text. We also evaluate Yapex under four different notions of correctness and compare its performance to that of another publicly available system for protein name recognition.

Keywords: Knowledge, Linguistics, Natural language processing, Medical information science, Computational molecular biology, Information extraction, Protein names

PII: S1386-5056(02)00052-7
