ISPOR 19th Annual European Congress
Vienna, Austria
October 2016
PRM58
Multiple Diseases/No Specific Disease
Research on Methods (RM)
Databases & Data Management Methods (DM)
ACCURACY OF TEXT PROCESSING TOKENISERS FOR AUTOMATED IDENTIFICATION OF DISEASES AND INTERVENTIONS IN ABSTRACTS OF STUDIES ON HUMANISTIC AND ECONOMIC BURDEN OF DISEASE
Challen R1, Martin A2, Martin C2
1Terminological Ltd, Hove, UK, 2Crystallise Ltd., London, UK
OBJECTIVES: To determine the sensitivity and specificity of software based on text-processing analysis for classifying diseases and interventions in PubMed abstracts relevant to the humanistic or economic burden of disease.

METHODS: We developed an online database of abstracts of over 100,000 studies identified by a systematic search of PubMed on the humanistic and economic burden of disease (www.heoro.com). We manually indexed 10,000 abstracts to detailed ontologies of diseases and interventions, as well as to study types, PRO instruments and geographical setting. The disease and intervention ontologies were developed from MeSH terms and lists of licensed drugs from the US and UK, with new items added as they were identified from the abstracts. We used this training set to develop tokenisers to match the text, MeSH headings and metadata in the abstracts to relevant ontology items. We then assessed the initial accuracy of the tokeniser matching on a sample of 150 abstracts from the unmoderated set, using expert evaluation, prior to further software refinements.

RESULTS: Compared with expert assessment, the tokeniser matching had a sensitivity of 95% for disease ontology items and 85% for intervention ontology items. The specificity, defined as matching to any ontology items that appeared in the text, MeSH headings or metadata of each abstract, was 89% for diseases and 91% for interventions. The accuracy of matching was higher for drug terms than for non-pharmaceutical interventions, which tend to be described less consistently.

CONCLUSIONS: With overall accuracy of around 90%, the initial tokeniser matching compared reasonably with indexing of abstracts by less experienced scientists. Ongoing final expert checking and further software refinement will improve the specificity of the indexing to topics that were the focus of the research. As 90,000 abstracts could be indexed within hours, this method facilitates a streamlined approach to identifying relevant data for health economics and outcomes research.
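The tokeniser matching described in METHODS could be sketched, in minimal form, as a dictionary lookup of normalised token windows against ontology terms. Everything below is a hypothetical illustration: the ontology identifiers, term lists and abstract text are invented examples in a MeSH-like style, not data or code from the study.

```python
# Minimal sketch of dictionary-based ontology matching: normalise the text
# into tokens, then slide a short window over it and look each window up
# in an index built from the ontology's term lists.
import re

def tokenise(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(ontology):
    """Map each term (as a tuple of normalised tokens) to its ontology item ID."""
    return {tuple(tokenise(term)): item
            for item, terms in ontology.items()
            for term in terms}

def match_ontology(text, index, max_len=4):
    """Collect ontology items whose terms appear as token windows in the text."""
    tokens = tokenise(text)
    found = set()
    for i in range(len(tokens)):
        for n in range(1, max_len + 1):
            item = index.get(tuple(tokens[i:i + n]))
            if item:
                found.add(item)
    return found

# Hypothetical MeSH-style ontology entries, purely for illustration.
ontology = {
    "D003920": ["diabetes mellitus", "diabetes"],
    "D008687": ["metformin"],
}
index = build_index(ontology)
abstract = "Cost of illness in patients with diabetes mellitus treated with metformin."
print(sorted(match_ontology(abstract, index)))  # → ['D003920', 'D008687']
```

A production tokeniser would also need to handle synonyms, inflections and multi-source metadata (MeSH headings as well as free text), but the window-lookup pattern above captures the core matching step.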
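The sensitivity and specificity figures in RESULTS follow the standard definitions over the expert-evaluated sample. The counts below are invented for illustration (chosen to reproduce the reported disease figures); they are not the study's actual evaluation data.

```python
# Hedged illustration of the accuracy metrics from the expert evaluation.
def sensitivity(tp, fn):
    """Proportion of expert-identified ontology items the tokeniser also found."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Proportion of non-relevant candidate items the tokeniser correctly ignored."""
    return tn / (tn + fp)

# Hypothetical counts: 190 of 200 expert disease matches found (TP=190, FN=10),
# 178 of 200 non-matches correctly rejected (TN=178, FP=22).
print(f"sensitivity = {sensitivity(190, 10):.0%}")   # → sensitivity = 95%
print(f"specificity = {specificity(178, 22):.0%}")   # → specificity = 89%
```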