International Journal of Medical Informatics
Volume 78, Issue 12, Pages e19-e26, December 2009

Developing a standard for de-identifying electronic patient records written in Swedish: Precision, recall and F-measure in a manual and computerized annotation trial

  • Sumithra Velupillai

      Affiliations

    • Department of Computer and Systems Sciences, Stockholm University/KTH, Forum 100, 164 40 Kista, Sweden
    • Corresponding author. Tel.: +46 8 16 11 74.
  • Hercules Dalianis

      Affiliations

    • Department of Computer and Systems Sciences, Stockholm University/KTH, Forum 100, 164 40 Kista, Sweden
  • Martin Hassel

      Affiliations

    • Department of Computer and Systems Sciences, Stockholm University/KTH, Forum 100, 164 40 Kista, Sweden
  • Gunnar H. Nilsson

      Affiliations

    • Department of Neurobiology, Care Sciences and Society, Karolinska Institutet, Stockholm, Sweden

Received 31 October 2008; received in revised form 2 March 2009; accepted 9 April 2009; published online 25 May 2009.

Abstract 

Background

Electronic patient records (EPRs) contain a large amount of information written in free text. This information is considered very valuable for research, but it is also very sensitive, since the free-text parts may contain information that could reveal the identity of a patient. Methods for de-identifying EPRs are therefore needed. The work presented here is a manual and automatic Protected Health Information (PHI) annotation trial for EPRs written in Swedish.

Methods

This study consists of two main parts: the initial creation of a manually PHI-annotated gold standard, and a preliminary automatic de-identification trial in which existing de-identification software written for American English was ported to Swedish and evaluated. Results are measured with precision, recall and F-measure.
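For reference, the three measures named above are conventionally defined as follows. This is the standard formulation, not one quoted from the article: the abstract does not say which F-measure variant was used, so the balanced form (equal weight on precision and recall) is an assumption here. TP, FP and FN denote the counts of true positives, false positives and false negatives when a set of annotations is compared against a reference set.

```latex
\begin{align}
P &= \frac{TP}{TP + FP} && \text{(precision)} \\
R &= \frac{TP}{TP + FN} && \text{(recall)} \\
F &= \frac{2PR}{P + R}  && \text{(balanced F-measure, assumed variant)}
\end{align}
```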

Results

This study reports fairly high Inter-Annotator Agreement (IAA) on the manually created gold standard, especially for specific tags such as names. The average IAA over all tags was an F-measure of 0.65 (highest pairwise agreement: 0.84). For name tags the average IAA was an F-measure of 0.80 (highest pairwise agreement: 0.91). Directly porting de-identification software written for American English to Swedish was non-trivial and yielded poor results.
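As an illustration of how an average and a highest pairwise IAA can be expressed as F-measures, the following is a minimal sketch, not the authors' procedure. It assumes that each annotation is a (start, end, tag) tuple, that two annotations agree only on an exact match of span and tag, and that the balanced F-measure is used. The annotator labels, tag names and offsets in `toy` are hypothetical and do not come from the study.

```python
from itertools import combinations


def f_measure(a, b):
    """Balanced F-measure between two annotation sets.

    Assumption: annotations are (start, end, tag) tuples and count as
    agreeing only when span and tag match exactly.
    """
    if not a and not b:
        return 1.0  # two empty annotation sets agree trivially
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(b)  # b scored against a as reference
    recall = overlap / len(a)
    return 2 * precision * recall / (precision + recall)


def pairwise_iaa(annotations):
    """Return (average, highest, per-pair scores) over all annotator pairs."""
    scores = {
        (x, y): f_measure(annotations[x], annotations[y])
        for x, y in combinations(sorted(annotations), 2)
    }
    average = sum(scores.values()) / len(scores)
    return average, max(scores.values()), scores


# Hypothetical toy data: three annotators marking one short text.
toy = {
    "A": {(0, 5, "Name"), (10, 14, "Date"), (20, 28, "Location")},
    "B": {(0, 5, "Name"), (10, 14, "Date")},
    "C": {(0, 5, "Name"), (20, 28, "Location"), (30, 35, "Age")},
}

average_iaa, highest_iaa, pair_scores = pairwise_iaa(toy)
```

On this toy input the average pairwise F-measure is about 0.62 and the highest pairwise value is 0.80 (annotators A and B). Because the balanced F-measure is symmetric in precision and recall, it does not matter which annotator of a pair is treated as the reference.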

Conclusion

Developing gold standard sets as well as automatic systems for de-identification tasks in Swedish is feasible. However, discussions and definitions of identifiable information are needed, as well as further development of both the tag sets and the annotation guidelines, in order to obtain a reliable gold standard. Completely new de-identification software needs to be developed.

Keywords: Medical informatics applications, Natural language processing, Medical record systems, Electronic patient records in Swedish, Protected health information, Ethical issues, Annotation


PII: S1386-5056(09)00069-0

doi:10.1016/j.ijmedinf.2009.04.005
