International Journal of Medical Informatics
Volume 77, Issue 2, Pages 81–97, February 2008

Predictive data mining in clinical medicine: Current issues and guidelines

  • Riccardo Bellazzi

      Affiliations

    • Dipartimento di Informatica e Sistemistica, Università di Pavia, via Ferrata 1, 27100 Pavia, Italy
    • Corresponding author. Tel.: +39 0382 505511; fax: +39 0382 505373.
  • Blaz Zupan

      Affiliations

    • Faculty of Computer Science, University of Ljubljana, Slovenia
    • Department of Human and Molecular Genetics, Baylor College of Medicine, Houston, TX, United States

Received 27 October 2006; accepted 17 November 2006.

References 

  1. Giudici P. Applied Data Mining: Statistical Methods for Business and Industry. Wiley & Sons; 2003
  2. Fayyad U, Piatetsky-Shapiro G, Smyth P. Data mining and knowledge discovery in databases. Commun. ACM. 1996;39:24–26
  3. Zupan B, Demsar J, Smrke D, Bozikov K, Stankovski V, Bratko I, et al. Predicting patient's long-term clinical status after hip arthroplasty using hierarchical decision modelling and data mining. Meth. Inf. Med. 2001;40:25–31
  4. Demsar J, Zupan B, Leban G, Curk T. Orange: from experimental machine learning to interactive data mining. In: European Conference on Machine Learning. Pisa, Italy: Springer Verlag; 2004;p. 537–539
  5. Kononenko I. Inductive and Bayesian learning in medical diagnosis. Appl. Artif. Intelligen. 1993;7:317–337
  6. Lubsen J, Pool J, van der Does E. A practical device for the application of a diagnostic or prognostic function. Meth. Inf. Med. 1978;17:127–129
  7. Mozina M, Demsar J, Kattan MW, Zupan B. Nomograms for visualization of naive Bayesian classifier. In: Proceedings of the Principles and Practice of Knowledge Discovery in Databases (PKDD-04). Pisa, Italy. 2004;p. 337–348
  8. Harrell FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York: Springer; 2001
  9. Kattan MW, Eastham JA, Stapleton AM, Wheeler TM, Scardino PT. A preoperative nomogram for disease recurrence following radical prostatectomy for prostate cancer. J. Natl. Cancer Inst. 1998;90:766–771
  10. Graefen M, Karakiewicz PI, Cagiannos I, Quinn DI, Henshall SM, Grygiel JJ, et al. International validation of a preoperative nomogram for prostate cancer recurrence after radical prostatectomy. J. Clin. Oncol. 2002;20:3206–3212
  11. Quinlan JR. C4.5: Programs for Machine Learning. San Mateo, Calif: Morgan Kaufmann Publishers; 1993
  12. Breiman L. Classification and Regression Trees. New York, London: Chapman & Hall; 1993
  13. Clark P, Niblett T. The CN2 Induction Algorithm. Mach. Learn. 1989;3:261–283
  14. Michalski RS, Kaufman K. Learning patterns in noisy data: the AQ approach. In: Paliouras G, Karkaletsis V, Spyropoulos C, editors. Machine Learning and its Applications. Berlin: Springer-Verlag; 2001;p. 22–38
  15. Lavrac N, Kononenko I, Keravnou E, Kukar M, Zupan B. Intelligent data analysis for medical diagnosis: using machine learning and temporal abstraction. AI Commun. 1998;11:191–218
  16. Hosmer DW, Lemeshow S. Applied Logistic Regression. 2nd ed. New York: Wiley; 2000
  17. Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer; 2001
  18. Schwarzer G, Vach W, Schumacher M. On the misuses of artificial neural networks for prognostic and diagnostic classification in oncology. Stat. Med. 2000;19:541–561
  19. Cristianini N, Shawe-Taylor J. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge, UK, New York: Cambridge University Press; 2000
  20. Vapnik VN. Statistical Learning Theory. New York: Wiley; 1998
  21. Cortes C, Vapnik V. Support-vector networks. Mach. Learn. 1995;20:273–297
  22. Kononenko I. Machine learning for medical diagnosis: history, state of the art and perspective. Artif. Intell. Med. 2001;23:89–109
  23. Andreassen S, Jensen FV, Olesen KG. Medical expert systems based on causal probabilistic networks. Int. J. Biomed. Comput. 1991;28:1–30
  24. Hamilton PW, Montironi R, Abmayr W, Bibbo M, Anderson N, Thompson D, et al. Clinical applications of Bayesian belief networks in pathology. Pathologica. 1995;87:237–245
  25. Galan SF, Aguado F, Diez FJ, Mira J. NasoNet, modeling the spread of nasopharyngeal cancer with networks of probabilistic events in discrete time. Artif. Intell. Med. 2002;25:247–264
  26. Luciani D, Marchesi M, Bertolini G. The role of Bayesian Networks in the diagnosis of pulmonary embolism. J. Thromb. Haemost. 2003;1:698–707
  27. Spiegelhalter DJ, Lauritzen SL. Sequential updating of conditional probabilities on directed graphical structures. Networks. 1990;20:579–605
  28. Buntine WL. A guide to the literature on learning probabilistic networks from data. IEEE Trans. Knowl. Data Eng. 1996;8:195–210
  29. Cooper GF, Herskovits E. A Bayesian method for the induction of probabilistic networks from data. Mach. Learn. 1992;9:309–347
  30. Ramoni M, Sebastiani P. Robust learning with missing data. Mach. Learn. 2001;45:147–170
  31. Herskovits EH, Gerring JP. Application of a data-mining method based on Bayesian networks to lesion-deficit analysis. Neuroimage. 2003;19:1664–1673
  32. Sebastiani P, Ramoni MF, Nolan V, Baldwin CT, Steinberg MH. Genetic dissection and prognostic modeling of overt stroke in sickle cell anemia. Nat. Genet. 2005;37:435–440
  33. Geiger D, Heckerman D. Learning Gaussian networks. In: de Mantaras RL, Poole D, editors. Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence. San Francisco, CA/Seattle, WA: Morgan Kaufmann; 1994;p. 235–243
  34. Larrañaga P, Sierra B, Gallego MY, Michelena MJ, Picaza JM. Learning Bayesian networks by genetic algorithms: a case study in the prediction of survival in malignant skin melanoma. In: Keravnou E, Garbay C, Baud R, Wyatt CJ, editors. Artificial Intelligence in Medicine Europe. Grenoble, France; 1997;p. 261–272
  35. Le PP, Bahl A, Ungar LH. Using prior knowledge to improve genetic network reconstruction from microarray data. In Silico Biol. 2004;4:335–353
  36. Chapman P, Clinton J, Kerber R, Khabaza T, Reinartz T, Shearer C, et al. CRISP-DM 1.0: Step-by-Step Data Mining Guide. The CRISP-DM Consortium; 2000
  37. Moore GW, Berman JJ. Anatomic pathology data mining. In: Cios KJ, editor. Medical Data Mining and Knowledge Discovery. Berlin/Heidelberg: Springer-Verlag; 2001;p. 61–108
  38. Hristovski D, Stare J, Peterlin B, Dzeroski S. Supporting discovery in medicine by association rule mining in Medline and UMLS. Medinfo. 2001;10:1344–1348
  39. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc. AMIA Symp. 2001;17–21
  40. Hand DJ. Data mining: statistics and more? Am. Statist. 1998;52:112–118
  41. Hand DJ, Mannila H, Smyth P. Principles of Data Mining. Cambridge, Mass: MIT Press; 2001
  42. Bloedorn E, Michalski RS. Data-driven constructive induction. IEEE Intell. Syst. 1998;13:30–37
  43. Jakulin A, Bratko I, Smrke D, Demsar J, Zupan B. Attribute interactions in medical data analysis. In: Dojat M, Keravnou E, Barahona P, editors. Proceedings of the Ninth Conference on Artificial Intelligence in Medicine in Europe (AIME 2003). Protaras, Cyprus: Springer; 2003;p. 229–238
  44. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U. S. A. 1998;95:14863–14868
  45. Yu JS, Ongarello S, Fiedler R, Chen XW, Toffolo G, Cobelli C, et al. Ovarian cancer identification based on dimensionality reduction for high-throughput mass spectrometry data. Bioinformatics. 2005;21:2200–2209
  46. Louie B, Mork P, Martin-Sanchez F, Halevy A, Tarczy-Hornoch P. Data integration and genomic medicine. J. Biomed. Inform. 2007;40:5–16
  47. Mischel PS, Cloughesy T. Using molecular information to guide brain tumor therapy. Nat. Clin. Pract. Neurol. 2006;2:232–233
  48. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–537
  49. van’t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415:530–536
  50. Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, McLaughlin ME, et al. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature. 2002;415:436–442
  51. Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC, et al. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat. Med. 2002;8:68–74
  52. Nevins JR, Huang ES, Dressman H, Pittman J, Huang AT, West M. Towards integrated clinico-genomic models for personalized medicine: combining gene expression signatures and clinical factors in breast cancer outcomes prediction. Hum. Mol. Genet. 2003;12(Spec No 2):R153–R157
  53. Futschik ME, Sullivan M, Reeve A, Kasabov N. Prediction of clinical behaviour and treatment for cancers. Appl. Bioinform. 2003;2:S53–S58
  54. Fernandez-Teijeiro A, Betensky RA, Sturla LM, Kim JY, Tamayo P, Pomeroy SL. Combining gene expression profiles and clinical parameters for risk stratification in medulloblastomas. J. Clin. Oncol. 2004;22:994–998
  55. Brenton JD, Carey LA, Ahmed AA, Caldas C. Molecular classification and molecular forecasting of breast cancer: ready for clinical application? J. Clin. Oncol. 2005;23:7350–7360
  56. Berrar D, Bradbury I, Dubitzky W. Avoiding model selection bias in small-sample genomic datasets. Bioinformatics. 2006;22:1245–1250
  57. Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J. Natl. Cancer Inst. 2003;95:14–18
  58. Ein-Dor L, Zuk O, Domany E. Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc. Natl. Acad. Sci. U. S. A. 2006;103:5923–5928
  59. Hu Z, Fan C, Oh DS, Marron JS, He X, Qaqish BF, et al. The molecular portraits of breast tumors are conserved across microarray platforms. BMC Genom. 2006;7:96
  60. Adam BL, Qu Y, Davis JW, Ward MD, Clements MA, Cazares LH, et al. Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men. Cancer Res. 2002;62:3609–3614
  61. Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, et al. Use of proteomic patterns in serum to identify ovarian cancer. Lancet. 2002;359:572–577
  62. Barbarini N, Magni P, Bellazzi R. A new approach for the analysis of mass spectrometry data for biomarker discovery. AMIA Annu. Symp. Proc. 2006;26–30
  63. Somorjai RL, Dolenko B, Baumgartner R. Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions. Bioinformatics. 2003;19:1484–1491
  64. Liu X, Minin V, Huang Y, Seligson DB, Horvath S. Statistical methods for analyzing tissue microarray data. J. Biopharm. Stat. 2004;14:671–685
  65. Bismar TA, Demichelis F, Riva A, Kim R, Varambally S, He L, et al. Defining aggressive prostate cancer using a 12-gene model. Neoplasia. 2006;8:59–68
  66. McKinney BA, Reif DM, Ritchie MD, Moore JH. Machine learning for detecting gene-gene interactions: a review. Appl. Bioinform. 2006;5:77–88
  67. Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, et al. A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J. Theor. Biol. 2006;241:252–261
  68. Wyatt CJ, Altman DG. Prognostic models: clinically useful or quickly forgotten? BMJ. 1995;311
  69. Kattan MW, Zelefsky MJ, Kupelian PA, Scardino PT, Fuks Z, Leibel SA. Pretreatment nomogram for predicting the outcome of three-dimensional conformal radiotherapy in prostate cancer. J. Clin. Oncol. 2000;18:3352–3359
  70. Shortliffe EH, Davis R, Axline SG, Buchanan BG, Green CC, Cohen SN. Computer-based consultations in clinical therapeutics: explanation and rule acquisition capabilities of the MYCIN system. Comput. Biomed. Res. 1975;8:303–320
  71. Miller RA, Pople HE, Myers JD. Internist-1, an experimental computer-based diagnostic consultant for general internal medicine. N. Engl. J. Med. 1982;307:468–476
  72. Andrews PJ, Sleeman DH, Statham PF, McQuatt A, Corruble V, Jones PA, et al. Predicting recovery in patients suffering from traumatic brain injury by using admission variables and physiological data: a comparison between decision tree analysis and logistic regression. J. Neurosurg. 2002;97:326–336
  73. Stel VS, Pluijm SM, Deeg DJ, Smit JH, Bouter LM, Lips P. A classification tree for predicting recurrent falling in community-dwelling older persons. J. Am. Geriatr. Soc. 2003;51:1356–1364
  74. Eastwood EA, Magaziner J, Wang J, Silberzweig SB, Hannan EL, Strauss E, et al. Patients with hip fracture: subgroups and their outcomes. J. Am. Geriatr. Soc. 2002;50:1240–1249
  75. Sierra B, Larranaga P. Predicting survival in malignant skin melanoma using Bayesian networks automatically induced by genetic algorithms: an empirical comparison between different approaches. Artif. Intell. Med. 1998;14:215–230
  76. Fellbaum C. WordNet: An Electronic Lexical Database. MIT Press; 1998
  77. Zupan B, Holmes JH, Bellazzi R. Knowledge-based data analysis and interpretation. Artif. Intell. Med. 2006;37:163–165
  78. Lavrac N, Dzeroski S, Pirnat V, Krizman V. The utility of background knowledge in learning medical diagnostic rules. Appl. Artif. Intelligen. 1993;7:273–293
  79. Quaglini S, Bellazzi R, Locatelli F, Stefanelli M, Salvaneschi C. An influence diagram for assessing GVHD prophylaxis after bone marrow transplantation in children. Med. Decis. Mak. 1994;14:223–235
  80. Silipo R, Vergassola R, Zong W, Berthold MR. Knowledge-based and data-driven models in arrhythmia fuzzy classification. Meth. Inf. Med. 2001;40:397–402
  81. Mani S, Shankle WR, Dick MB, Pazzani MJ. Two-stage machine learning model for guideline development. Artif. Intell. Med. 1999;16:51–71
  82. Lucas P. Expert knowledge and its role in learning Bayesian Networks in medicine: an appraisal. In: Quaglini S, Barahona P, Andreassen S, editors. Artificial Intelligence in Medicine. Berlin: Springer; 2001;p. 156–166
  83. Druzdzel MJ, van der Gaag LC. Building probabilistic networks: “Where do the numbers come from?” IEEE Trans. Knowl. Data Eng. 2000;12:481–486
  84. Coupe VMH, Van der Gaag LC, Habbema JDF. Sensitivity analysis: an aid for belief-network quantification. Knowl. Eng. Rev. 2000;15:215–232
  85. Pazzani M, Kibler D. The utility of background knowledge in inductive learning. Mach. Learn. 1992;9:57–94
  86. Pazzani MJ, Mani S, Shankle WR. Acceptance of rules generated by machine learning among medical experts. Meth. Inf. Med. 2001;40:380–385
  87. Kononenko I. Estimating attributes: analysis and extensions of RELIEF. In: European Conference on Machine Learning (ECML). 1994;p. 171–182
  88. Kohavi R, John GH. Wrappers for feature subset selection. Artif. Intell. 1997;97:273–324
  89. Zupan B, Demsar J, Kattan MW, Beck JR, Bratko I. Machine learning for survival analysis: a case study on recurrence of prostate cancer. Artif. Intell. Med. 2000;20:59–75
  90. Bellazzi R, Zupan B. Intelligent data analysis in medicine and pharmacology: a position statement. In: Workshop on Intelligent Data Analysis in Medicine and Pharmacology (IDAMAP). Brighton, UK. 1998;p. 2–5
  91. Pyle D. Data Preparation for Data Mining. San Francisco, CA: Morgan Kaufmann Publishers; 1999
  92. Lavrac N, Flach P, Zupan B. Rule evaluation measures: a unifying view. In: Workshop on Inductive Logic Programming. 1999;p. 174–185
  93. Beck JR, Shultz EK. The use of relative operating characteristic (ROC) curves in test performance evaluation. Arch. Pathol. Lab. Med. 1986;110:13–20
  94. Brier GW. Verification of forecasts expressed in terms of probability. Month. Weather Rev. 1950;78:1–3
  95. Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques With Java Implementations. San Francisco, CA: Morgan Kaufmann; 1999
  96. Hand DJ. Construction and Assessment of Classification Rules. Chichester; New York: Wiley; 1997
  97. Bohanec M, Zupan B. Integrating Decision Support and Data Mining by Hierarchical Multi-Attribute Decision Models. In: Intl. Workshop on Integration and Collaboration Aspects of Data Mining, Decision Support and Meta-Learning. Helsinki, Finland. 2001;p. 25–36
  98. de Rooij SE, Abu-Hanna A, Levi M, de Jonge E. Factors that predict outcome of intensive care treatment in very elderly patients: a review. Crit. Care. 2005;9:R307–R314
  99. van Bemmel JH, Musen MA, Helder JC. Handbook of Medical Informatics. Heidelberg, Germany: Springer Verlag; 1997
  100. Zupan B, Porenta A, Vidmar G, Aoki N, Bratko I, Beck JR. Decisions at hand: a decision support system on handhelds. Medinfo. 2001;10:566–570
  101. Zupan B, Demsar J, Kattan MW, Ohori M, Graefen M, Bohanec M, et al. Orange and Decisions-at-Hand: bridging predictive data mining and decision support. In: International Workshop on Integration and Collaboration Aspects of Data Mining, Decision Support and Meta-Learning. Helsinki, Finland. 2001;p. 151–162
  102. Abidi SS. Knowledge management in healthcare: towards ‘knowledge-driven’ decision-support services. Int. J. Med. Inf. 2001;63:5–18
  103. Pazzani M. Knowledge discovery from data? IEEE Intell. Syst. March–April 2000;10–13
  104. Cios KJ, Moore GW. Uniqueness of medical data mining. Artif. Intell. Med. 2002;26:1–24
  105. Fox J, Das SK. Safe and Sound: Artificial Intelligence in Hazardous Applications. Cambridge, Mass: MIT Press; 2000
  106. Bellazzi R, Zupan B. Intelligent data analysis. Meth. Inf. Med. 2001;40:362–364
  107. Haux R, Ammenwerth E, Herzog W, Knaup P. Health care in the information society: a prognosis for the year 2013. Int. J. Med. Inf. 2002;66:3–21
  108. Towards 2020 Science, Available at http://research.microsoft.com/towards2020science.

PII: S1386-5056(06)00274-7

doi: 10.1016/j.ijmedinf.2006.11.006
