Inferring Which Medical Treatments Work from Reports of Clinical Trials

This is a blog post about our latest paper, which can be found here, along with the corresponding GitHub code.

In this blog post, we will introduce Evidence Inference, a novel dataset that pairs sets of interventions, comparators, and outcomes (ICO frames) with the full-length text of articles describing medical randomized controlled trials (RCTs). This dataset will enable us to train machine learning models that, for example, determine whether a given article provides evidence supporting the use of aspirin [intervention] to reduce risk of stroke [outcome], as compared to placebo [comparator].

Introduction

With over 100 reports of RCTs published every day, on average, it is time-consuming, and often practically impossible, to sort through all of the relevant published literature to robustly answer questions such as: Does infliximab reduce dysmenorrhea (pain) scores, relative to placebo? Natural Language Processing (NLP) can play a key role in automating this process, minimizing costs and keeping treatment recommendations up to date with the most recently published evidence.

Example Problem

Here, we will walk through a simple instance of the dataset. Each article (RCT) has up to five "prompts" associated with it.

Example prompt:

With respect to frequency of contractions [outcome], characterize the difference between antibiotics [intervention] and placebo [comparator].

Example Rationale:

...antibiotics decreases contraction frequency in comparison to a placebo treatment (p = 0.041)...

The prompt is then answered with one of three labels:

  1. If antibiotics made contractions more common (in comparison to no antibiotics/placebo), we would select "Significantly Increased."
  2. If antibiotics made contractions less common (in comparison to no antibiotics/placebo), we would select "Significantly Decreased."
  3. If antibiotics made no difference in contractions (in comparison to no antibiotics/placebo), then we would select "No Significant Difference."

Each prompt comes with at least two corresponding 'rationales': stretches of text in the article that explain why a specific label (significantly increased, significantly decreased, or no significant difference) was chosen for the prompt.
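
To make the structure of the data concrete, the sketch below shows roughly what a single prompt record might look like, expressed as a Python dictionary. The field names and layout are purely illustrative assumptions, not the dataset's actual schema; consult the GitHub repository for the real format.

    # Illustrative sketch of one Evidence Inference prompt record.
    # These field names are hypothetical, not the dataset's actual schema.
    example_record = {
        "article": "Full text of the RCT report ...",
        "intervention": "antibiotics",
        "comparator": "placebo",
        "outcome": "frequency of contractions",
        # One of: "significantly increased", "significantly decreased",
        # or "no significant difference".
        "label": "significantly decreased",
        # Each prompt is paired with at least two supporting snippets drawn
        # verbatim from the article; only one is shown here.
        "rationales": [
            "...antibiotics decreases contraction frequency in comparison "
            "to a placebo treatment (p = 0.041)...",
        ],
    }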

Video Example

Although we employ a three-stage process to construct the dataset, the following video gives the instructions for the last stage of that process. Watching it may be helpful for understanding the data's content.

Data Collection

A large portion of the work for this project went into creating the dataset itself. The process can be split into three parts: Prompt Generation, Prompt Annotation, and Verification.

  • Prompt Generation

    Given a full-text RCT, prompt generators (medical doctors) identified a snippet that reports a relationship between an intervention, comparator, and outcome. These doctors also provided answers and accompanying rationales for the prompts they generated, as doing so required little additional effort.

  • Prompt Annotation

    Given an evidence prompt articulating an intervention, comparator, and outcome (generated as described above), prompt annotators determined whether the associated article reports results indicating that the intervention significantly increased, significantly decreased, or produced no significant difference in the outcome, relative to the comparator. The annotator was also asked to mark a snippet of text supporting their response.

  • Verification

    The verifier is responsible for checking both whether the prompt (i.e., question) is valid and can be answered from the text, and whether the responses provided are accurate. Verifiers also assess whether the associated supporting evidence provided is reasonable.

A Simple Approach

To gauge the difficulty of the task, we explored numerous baseline approaches, including majority guessing, heuristics, and neural network variants. To highlight that difficulty, we will walk through a simple neural approach (depicted in the accompanying figure).

To start, we must encode each of the following four items. This is done by mixing and matching CBoW, GRU, and Bi-GRU encoders (a minimal sketch follows the list below).

  1. Article (average length = ~4200 tokens)
  2. Outcome
  3. Intervention
  4. Comparator

The combined length of the outcome, intervention, and comparator is, on average, 12 tokens.
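
As a concrete (if simplified) illustration, here is a minimal PyTorch sketch of the two encoder flavors mentioned above. It is not the paper's implementation; the embedding and hidden dimensions are assumptions chosen for readability.

    import torch.nn as nn

    class CBoWEncoder(nn.Module):
        """Continuous bag-of-words: average word embeddings into one vector."""

        def __init__(self, vocab_size, embed_dim=200):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)

        def forward(self, token_ids):                 # (batch, seq_len)
            return self.embed(token_ids).mean(dim=1)  # (batch, embed_dim)

    class GRUEncoder(nn.Module):
        """(Bi-)GRU encoder that returns per-token hidden states."""

        def __init__(self, vocab_size, embed_dim=200, hidden_dim=128, bidirectional=True):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=bidirectional)

        def forward(self, token_ids):                 # (batch, seq_len)
            hidden_states, _ = self.gru(self.embed(token_ids))
            return hidden_states                      # (batch, seq_len, hidden_dim * num_directions)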

We then concatenate the encodings of the outcome, intervention, and comparator, and use the result to apply attention over the encoded article. This is known as conditional attention, because we condition the attention on the encodings of the ICO frame. Conditioning on the ICO frame is not strictly necessary, but it clearly helps performance, likely because it changes the scope of the problem from "find words that are important" to "find words that are important with respect to these key terms." The attention-weighted article encoding is then concatenated with the ICO encodings and fed through an MLP for a prediction. This model in particular achieves a 0.531 F1-score.
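
Below is a minimal sketch of the conditional attention step just described, again with assumed dimensions rather than the paper's actual hyperparameters. It takes per-token article states and a concatenated ICO encoding (e.g., produced by encoders like those sketched earlier) and returns class logits.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConditionalAttentionClassifier(nn.Module):
        """Attends over article tokens conditioned on the encoded ICO frame."""

        def __init__(self, article_dim=256, ico_dim=600, num_classes=3):
            super().__init__()
            # Scores each article token, conditioned on the ICO encoding.
            self.attn = nn.Linear(article_dim + ico_dim, 1)
            self.mlp = nn.Sequential(
                nn.Linear(article_dim + ico_dim, 128),
                nn.ReLU(),
                nn.Linear(128, num_classes),
            )

        def forward(self, article_states, ico_encoding):
            # article_states: (batch, seq_len, article_dim) per-token article encodings
            # ico_encoding:   (batch, ico_dim) concatenated I, C, and O encodings
            seq_len = article_states.size(1)
            ico_tiled = ico_encoding.unsqueeze(1).expand(-1, seq_len, -1)
            scores = self.attn(torch.cat([article_states, ico_tiled], dim=-1))
            weights = F.softmax(scores, dim=1)                      # one weight per token
            article_summary = (weights * article_states).sum(dim=1)
            return self.mlp(torch.cat([article_summary, ico_encoding], dim=-1))

The softmax over the sequence dimension yields one weight per token, so the roughly 4,200-token article is collapsed into a single ICO-conditioned vector before classification.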

Difficulties with this Approach

To measure the performance of our model, we compare it to an oracle model. The oracle differs from the simple neural approach in one way: rather than being given the entire article, it sees only the 'rationales' for the specific prompt. This drastically simplifies the problem, and the resulting gap in performance is large (F1-score = 0.739 for the oracle, versus 0.531 for the full-article model). The oracle's success demonstrates how much of the difficulty stems from the lengthy articles.

Potential Solutions

We have seen that our model struggles to learn due to the length of the articles. Using various flavors of attention may be a way to help the model cut down on irrelevant words, sentences, paragraphs, or even sections.

Conclusions

Through the construction of this novel dataset, we have introduced a new and challenging task. Our baseline results establish both the feasibility and the difficulty of the task. Simple baselines (e.g., rule-based methods) perform quite poorly, whereas modern neural architectures currently achieve the best results. A key direction for future research is designing more sophisticated attention mechanisms that (conditionally) identify spans of evidence pertinent to a given prompt. We hope this corpus and task provide opportunities to pursue such models.