

Biostatistics journal club: Why are we still using Cohen’s Kappa? – June 5

Wednesday, June 5, 2019
1:00 pm – 2:00 pm
Harvard T.H. Chan School of Public Health, Building 2, Conference Room 426 (4th Floor)


Camden Bay, PhD, Brigham and Women’s Hospital, will explore the article “Factors Affecting Intercoder Reliability: A Monte Carlo Experiment,” describing and comparing Cohen’s kappa and other summary measures of agreement. Questions for discussion: Are any of the measures of agreement mentioned better than Cohen’s kappa? How do you explain conflicting agreement results? Why is Cohen’s kappa still used almost exclusively? Registration is required.


Camden Bay, PhD
Biostatistician in the Department of Radiology and Center for Clinical Investigation, Brigham and Women’s Hospital
Instructor, Harvard Medical School

Why are we still using Cohen’s kappa?
Cohen’s kappa coefficient, the default measure of inter-rater agreement, is easy to calculate by hand, has an intuitive mathematical definition, accounts for agreement by chance, and is well-known by statisticians in all fields, from remote sensing to medical statistics. It is also very difficult to interpret, accounts for chance agreement in a specific and often inappropriate manner, and is subject to seemingly paradoxical results. Over the past 60 years, numerous articles have been published discussing these issues, but few have proposed solutions beyond suggesting that the simple percent agreement and other contingency table summary measures should be reported in addition to Cohen’s kappa, one of its adjusted variations, or perhaps an agreement chart (see “The agreement chart” (2013) by Bangdiwala SI & Shankar V). Does reporting agreement need to be so convoluted? I would like to discuss an excellent review article by Guangchao Charles Feng that describes and compares Cohen’s kappa and other summary measures of agreement like the recent Gwet’s AC1 and the versatile Krippendorff’s alpha. Do you use any of these? Are any of them better than Cohen’s kappa (and how do we define better)? How do we explain conflicting agreement results? Why are we still exclusively using Cohen’s kappa?
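As a small illustration of the issues raised above (a sketch for discussion, not material from the talk or the Feng article), the snippet below computes simple percent agreement and Cohen’s kappa from a 2×2 contingency table of two raters. With heavily skewed marginals, the two measures can conflict sharply: 90% raw agreement can coexist with a kappa near zero, one of the seemingly paradoxical results mentioned above. The table values here are invented for demonstration.

```python
# Illustrative sketch: Cohen's kappa vs. simple percent agreement
# for a k x k contingency table of two raters (rows = rater A,
# columns = rater B). Table values are hypothetical.

def percent_agreement(table):
    """Proportion of items on which the two raters agree (the diagonal)."""
    total = sum(sum(row) for row in table)
    return sum(table[i][i] for i in range(len(table))) / total

def cohens_kappa(table):
    """Cohen's kappa: observed agreement corrected for chance agreement
    estimated from each rater's marginal distribution."""
    n = sum(sum(row) for row in table)
    k = len(table)
    p_observed = sum(table[i][i] for i in range(k)) / n
    # Chance agreement: product of the raters' marginal proportions,
    # summed over categories.
    p_expected = sum(
        (sum(table[i]) / n) * (sum(table[j][i] for j in range(k)) / n)
        for i in range(k)
    )
    return (p_observed - p_expected) / (1 - p_expected)

# Skewed marginals: 90 agreements on category 1, none on category 2.
table = [[90, 5], [5, 0]]
print(round(percent_agreement(table), 3))  # 0.9
print(round(cohens_kappa(table), 3))       # -0.053
```

Here percent agreement is 0.9, yet kappa is slightly negative, because the chance-agreement term (0.905) computed from the lopsided marginals exceeds the observed agreement. This is the kind of conflicting result the discussion questions ask about.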

Feng GC. Factors affecting intercoder reliability: A Monte Carlo experiment. Qual Quant. 2013; 47:2959-2982.

In addition to being a thorough review of statistical methods for measuring agreement, the article has a comprehensive bibliography.

Slides from Dr. Bay’s presentation [PDF]
