Using Natural Language Processing on the Free Text of Clinical Documents to Screen for Evidence of Homelessness Among US Veterans

Information retrieval algorithms based on natural language processing (NLP) of the free text of medical records have been used to find documents of interest from databases. Homelessness is a high priority non-medical diagnosis that is noted in electronic medical records of Veterans in Veterans Affairs (VA) facilities. Using a human-reviewed reference standard corpus of clinical documents of Veterans with evidence of homelessness and those without, an open-source NLP tool (Automated Retrieval Console v2.0, ARC) was trained to classify documents. The best performing model based on document level work-flow performed well on a test set (Precision 94%, Recall 97%, F-Measure 96). Processing of a naïve set of 10,000 randomly selected documents from the VA using this best performing model yielded 463 documents flagged as positive, indicating a 4.7% prevalence of homelessness. Human review noted a precision of 70% for these flags resulting in an adjusted prevalence of homelessness of 3.3% which matches current VA estimates. Further refinements are underway to improve the performance. We demonstrate an effective and rapid lifecycle of using an off-the-shelf NLP tool for screening targets of interest from medical records.

Publication Date: 
2013
Pages: 
537–546
Volume: 
2013
Journal Name: 
AMIA Annual Symposium Proceedings