In the preliminary phase of this study we examined inter-rater reliability (IRR) using a kappa statistic (κ) among a mixed group of expert and non-expert reviewers, who were given only a brief description of the scoring system and scored single images from a series of patients. In the second phase we explored the effect of training on the use of our HIA scoring system by assessing IRR among neuroimaging experts before and after a brief interactive training session. In this phase, multiple slices from each patient were scored. Separate κ values and intraclass correlation coefficients (ICC) were calculated from the scores given to each hippocampal image and from the asymmetry of scores between left and right for each slice. In the third phase the effect of training on non-expert reviewers was explored using an approach similar to that used with the expert reviewers.
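As an illustration of how agreement among several reviewers could be quantified, the sketch below computes a multi-rater kappa on per-image scores and on left-right asymmetry. It rests on assumptions not stated in the text: Fleiss' kappa is used as the kappa variant, asymmetry is taken as the signed difference (left minus right), and the function names and example scores are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    ratings: list of lists; ratings[i][r] is the category rater r gave item i.
    Assumption: Fleiss' kappa stands in for the unspecified kappa statistic.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({c for row in ratings for c in row})

    # counts[i][j] = number of raters who assigned item i to category j
    counts = [[Counter(row)[c] for c in categories] for row in ratings]

    # Observed agreement: proportion of agreeing rater pairs, averaged over items
    p_items = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_obs = sum(p_items) / n_items

    # Chance agreement from the overall category proportions
    total = n_items * n_raters
    p_cat = [sum(row[j] for row in counts) / total for j in range(len(categories))]
    p_exp = sum(p * p for p in p_cat)

    if p_exp == 1.0:
        # Every rating fell in one category; kappa is undefined
        return float("nan")
    return (p_obs - p_exp) / (1.0 - p_exp)


def asymmetry(left, right):
    """Signed left-minus-right difference per slice and rater (an assumed definition)."""
    return [[l - r for l, r in zip(lrow, rrow)] for lrow, rrow in zip(left, right)]


# Hypothetical scores: 4 slices, 3 reviewers, one row per slice
left_scores = [[1, 1, 2], [3, 3, 3], [4, 4, 3], [2, 2, 2]]
right_scores = [[1, 2, 2], [3, 3, 4], [2, 2, 2], [2, 1, 2]]

kappa_hia = fleiss_kappa(left_scores + right_scores)
kappa_asym = fleiss_kappa(asymmetry(left_scores, right_scores))
print(f"kappa_HIA = {kappa_hia:.2f}, kappa_Asym = {kappa_asym:.2f}")
```

Treating the scores as unordered categories, as this sketch does, ignores how far apart two disagreeing scores are; the ICC reported alongside κ is one way to take that ordering into account.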
In the preliminary phase of the study, HIA scoring of single images showed substantial agreement among expert reviewers (κHIA = 0.65), fair agreement among non-expert reviewers (κHIA = 0.27), and a fair to moderate degree of agreement among all the reviewers as a whole (κHIA = 0.40). In the second phase, prior to training there was substantial agreement among expert reviewers in regard to the individual HIA scores (κHIA = 0.62; ICCHIA = 0.81) but only moderate agreement on the degree of asymmetry (κAsym = 0.47; ICCAsym = 0.71). Training improved agreement on the individual HIA scores (κHIA = 0.58-0.72; ICCHIA = 0.76-0.84) and on the degree of asymmetry (κAsym = 0.61-0.67; ICCAsym = 0.81-0.85). Among non-expert reviewers, scores improved from only a fair degree of agreement pre-training (κHIA = 0.25, κAsym = 0.25; ICCHIA = 0.68, ICCAsym = 0.66) to a moderate level of agreement after training (κHIA = 0.54, κAsym = 0.52; ICCHIA = 0.78, ICCAsym = 0.81).
The proposed HIA scoring system has a substantial degree of inter-rater reliability among experienced neuroimaging reviewers. For these reviewers, training chiefly improves agreement on asymmetries in HIA score. Non-expert reviewers can employ the system with a moderate degree of reliability, and training improves their scoring reliability even more than it does that of experts.