Metrics the ABR Uses to Eliminate Problematic Exam Items
By Robert M. Barr, MD, ABR President, and Ben Babcock, PhD, ABR Psychometrician
ABR certification exists to enhance patient care by providing the public with evidence that radiologic professionals meet a high standard of expertise and skill in the radiologic sciences. By extension, certification also offers an opportunity for radiation oncologists, radiologists, and medical physicists to distinguish themselves, through training and assessment, as highly qualified in their art. Reliable assessment of an individual practitioner’s knowledge base is a critical element of medical board certification. Design and validation of these assessments require not only thoughtful consideration of the breadth and depth of exam content to support relevance but also attention to best practice applications of psychometric standards.
Hundreds of ABR volunteers actively participate in the development of exam items (questions). This process has previously been described and illustrated on our website. Two of the fundamental principles in item writing are optimizing consistency in structure (to avoid distracting the examinee from the concept under consideration) and avoiding ambiguity (to mitigate possible confusion, or a “trick,” based on similarities between correct and incorrect answer choices).
Despite the best efforts and judgment of the volunteers, committees, and ABR staff (including editors), a small number of items may not perform as expected. For example, if most examinees select an option that is not correct in a multiple-choice question that is formatted as “single best answer,” we flag it as a “problem item” and investigate. This often involves review by one or more subject matter experts (members of our volunteer corps who are experienced practitioners in the field).
When evaluating an exam item’s statistical performance, ABR staff ask two main questions. First, is the exam item of appropriate difficulty? A question that 100% of candidates answer correctly reveals little about any individual candidate’s knowledge. The same could be said about exam items that candidates answer correctly less than 25% of the time, which is lower than random chance for most items. The ABR generally excludes items in these extreme difficulty ranges from an exam’s final score. This occurs as part of the routine process of post-exam review and includes scrutiny of any items for which the correct answer is not the most common response. An expert committee decides whether the item should be scored as is, scored with different or multiple acceptable answers, or deleted from the final scoring.
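The difficulty screen described above can be sketched in a few lines of code. This is a simplified illustration, not the ABR’s actual scoring software; the function name, data layout, and exact cutoffs (a floor of 25% correct, a ceiling of 100%) are assumptions based on the thresholds mentioned in this article.

```python
def flag_by_difficulty(responses, min_p=0.25, max_p=1.0):
    """Flag items whose difficulty falls outside a usable range.

    responses: list of per-candidate rows of 0/1 item scores (one column
    per item). Returns (p_values, flagged_item_indices), where each
    p-value is the proportion of candidates answering that item correctly.
    The 0.25 floor mirrors the "lower than random chance" cutoff noted in
    the article; max_p catches items everyone answers correctly.
    """
    n_candidates = len(responses)
    n_items = len(responses[0])
    p_values = []
    flagged = []
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_candidates
        p_values.append(p)
        if p < min_p or p >= max_p:
            flagged.append(j)  # send to expert committee for review
    return p_values, flagged

# Example: 4 candidates, 3 items.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
]
p_values, flagged = flag_by_difficulty(responses)
# Item 0 is answered correctly by everyone (p = 1.0) and item 2 by no one
# (p = 0.0), so both are flagged; item 1 (p = 0.5) passes the screen.
```

In practice, flagged items are not deleted automatically; as the article notes, they are routed to subject matter experts who decide how (or whether) each item is scored.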
The second question is whether the item’s performance has a positive relationship with performance on the exam as a whole. Because all items on an ABR exam are intended to measure knowledge in a specific domain related to radiology, radiation oncology, or medical physics, it is important that there is a statistical correspondence between a given item’s performance and the larger group of exam items. To this end, ABR staff examine the correlation between each individual item and the total exam score. This is called the point-biserial correlation. One could view this as a statistical metric to see if an item “hangs together” with the rest of an exam’s content. The ABR generally does not include items with negative point-biserial correlations in its final scoring.
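For a dichotomous (0/1-scored) item, the point-biserial correlation is simply the Pearson correlation between the item-score vector and candidates’ total scores. A minimal sketch follows; it is an illustration of the statistic, not the ABR’s production code, and note that some testing programs use a “corrected” variant that removes the item itself from the total before correlating, whereas this version uses the raw total for simplicity.

```python
import math

def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item-score vector and total
    exam scores; for a dichotomous item this equals the point-biserial.
    """
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(item_scores, total_scores))
    var_x = sum((x - mean_x) ** 2 for x in item_scores)
    var_y = sum((y - mean_y) ** 2 for y in total_scores)
    return cov / math.sqrt(var_x * var_y)

# Stronger candidates (higher totals) answer this item correctly,
# so the correlation comes out positive:
item = [1, 1, 0, 0]
totals = [9, 8, 4, 3]
r = point_biserial(item, totals)
# r > 0: the item "hangs together" with the rest of the exam.
# A negative r would mean weaker candidates outperform stronger ones
# on this item, the pattern that leads to exclusion from final scoring.
```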
With these tools in hand, the ABR can weed out items that do not perform well statistically, leaving only those that are not too easy, not too hard, and “hang together” statistically. These items are just right for measuring ABR candidates’ knowledge.