The Barthel Index: comparing inter-rater reliability between Nurses and Doctors in an older adult rehabilitation unit
Abstract
Objective
To ensure accuracy in recording the Barthel Index (BI) in older people, it is essential to determine who is best placed to administer the index. The aim of this study was to compare doctors' and nurses' reliability in scoring the BI.
Methods
Sixty-five consecutive patients admitted to an older adult rehabilitation unit were assessed using the BI. Four raters recorded the BI on all patients. BI scores were compared for equivalence on the level of agreement between raters.
Results
Near-perfect correlation in the total scores between raters indicated that the final score is reliable. There was a statistically significant bias: doctors gave a higher BI score than nurses, with a mean difference of 1.2 (95% confidence interval = 0.5–1.8). Nurses demonstrated good or very good agreement on all 10 items, whereas doctors demonstrated good agreement on only 5 items.
Conclusion
The BI is highly reliable when recorded by nurses, with low interrater variation, whereas there is greater variation between doctors. When assessing older adults' activities of daily living, interrater reliability in the BI score is greater when the patient is observed performing the activities than when the self-report method is used.
1. Introduction
Mahoney and Barthel (1965) introduced the Barthel Index (BI), known as the Maryland Disability Index when first developed in 1955. The BI is a weighted and summed index designed to reflect a patient's dependency in activities of daily living (ADL). The item scores are summed, and the higher the total score, the fewer the restrictions in ADL. Gresham, Philips, and Labi (1980) referred to the BI as “the best buy” among common ADL indices. There are several methods of administering the BI, ranging from self-report or informant rating to observing or testing the patient performing the ADL, as outlined in the BI. The BI is administered by doctors, therapists, and nurses. The literature highlights the discrepancies in scoring the BI between different raters and methods (McGinnis et al., 1986, Ranhoff & Laake, 1993, Richards et al., 2000, Shinar et al., 1987, Yeo et al., 1995). Initially, the BI was developed for the assessment of patients with neuromuscular and musculoskeletal disorders, but soon after, it was widely used to assess functional change in the rehabilitation of people who have had a stroke (Gibbon, 1991). It has subsequently been used to measure ADL in a wide range of disabling conditions. The original 10-item version gives a maximum sum score of 100, and the modified version gives the 10 items a maximum sum score of 20 (Collin et al., 1988, Hartigan, 2007). The modified version is in common usage and enhances the sensitivity of the index (Collin et al., 1988, Gresham et al., 1980, Hsueh et al., 2002). The reliability and validity of the BI are well established (Collin et al., 1988, Granger et al., 1979, Mattison et al., 1991). The reliability of the BI was established when administered by different assessors using different methods of administration; however, the final score had a clinically worrying imprecision (95% confidence interval [CI] = ±4 points; Sainsbury, Seebass, Bansal, & Young, 2005).
In one study, scoring the BI according to the information collected by the physician on interviewing the patients was biased toward an inappropriately high score when compared with the score obtained by a nurse observing the patient perform the ADL (Ranhoff & Laake, 1993). A study by Richards et al. (2000) compared the results of the BI when administered by nurses and nonclinical research assistants. Both were trained in administering the BI to standardize the method of administration. Overall, the average total scores demonstrated moderate to good agreement. However, the limits of agreement were relatively wide, with significant imprecision (95% CI = ±3 points).
In another study examining the accuracy of the BI score, information was received from a family member or person involved in the patient's care to give an “informant rating.” This method demonstrated that informants tend to underestimate patients' functional ability (Dorevitch et al., 1992). Conversely, the self-report method also used in this study highlighted that patients often perceive themselves as more independent than they actually are (Dorevitch et al., 1992). The self-report method showed the greatest variance with the other methods of administration. When recording the BI, studies have highlighted that higher rater variation was mainly evident for items such as transfer, toileting, fecal continence, and dressing (Collin et al., 1988, McGinnis et al., 1986, Richards et al., 2000, Shinar et al., 1987, Sinoff & Ore, 1997). A study by Ranhoff and Laake (1993) compared information collected by two administrators. The sample of 59 patients included patients with cognitive impairment. The findings revealed that cognitive impairment influenced the reliability of the total score, and there was poor agreement on individual BI items. In addition, administrators did not receive training on how to administer the BI; therefore, discrepancies may have arisen from misinterpretation of categories in the BI. In summary, errors in recording the BI may be due to several factors such as the introduction of subjective bias, lack of training in administering the BI, poor technique by the raters, or different interpretations of the construct of the items in the BI. Therefore, the principal aim of this study was to calculate the interrater reliability of the BI between doctors and nurses when common errors in recording the BI are accounted for (Efron & Tibshirani, 1993).
2. Methods
This study prospectively examined all patients admitted to an older adult rehabilitation unit over a 4-month period. Patients with an Abbreviated Mental Test Score (AMTS) less than 7/10 were excluded from participating in this study, as they were unable to give informed consent. The Clinical Research Ethics Committee approved the study protocol. The researcher informed all nurses and doctors working in the unit of the study protocol, and those willing to participate signed a consent form. The modified BI is routinely recorded by the patient's primary nurse on admission. However, for this study, four administrators recorded the BI for each patient: in addition to the primary nurse, one other nurse and two doctors who agreed to participate in this study recorded the BI. To standardize the method of administering the BI, a brief training session on how to accurately score the BI was given to all nurses and doctors participating in this study, and the guidelines developed by Collin et al. (1988) were provided on the reverse side of the BI.
Baseline demographic details, including gender and age, were recorded for patients who consented to this study. The response rate among health care professionals was very positive: 72% (n = 36) of nurses consented, and all doctors approached (n = 6) agreed to participate. The doctors who participated in this study were senior house officers and had a special interest in older adult medicine. All nurse raters had more than 2 years' experience working in the older adult unit. The BI was recorded four times for each patient, by two different nurses and by two different doctors, within 5 days of the patient's admission. Five days was considered an appropriate time window in which to record the BI, as any longer period was likely to cause variation in patients' BI scores. Each assessor was assigned a code to ensure confidentiality and administered the BI independently of the others. When a BI score was recorded, it was placed in an envelope, then sealed and posted to the collection box located in the rehabilitation unit. Each administrator was asked to score the BI taking into consideration the guidelines and not to communicate or compare ratings with others.
2.1. Sample
During the study period, there were 120 consecutive admissions to the rehabilitation unit. Sixty-five patients consented to participate in the study. Six patients refused consent, and 43 patients met the exclusion criterion of an AMTS less than 7/10. A further six patients were not included in the final study number because their data were incomplete, with only three BI scores recorded within the 5-day time frame. The age range of the sample (n = 65) was 65–100 years (M = 81.5, SD = 7.46), and most participants (62%) were women.
2.2. Analysis of data
The BI is an ordinal scale: observations are grouped and ranked, and for this reason, nonparametric tests were applied. Data were analyzed with the aid of the Stata 8.2 statistical software package. Descriptive statistics were obtained for demographic variables. The Bland and Altman approach, which is based on graphical techniques for assessing agreement between two methods of clinical measurement, was used. The Bland and Altman (1986) plot displays the difference in scores against the mean for each subject and calculates the 95% confidence interval for the mean difference to indicate the magnitude of random measurement error. Such plots have become a standard accessory in validity or method-comparison studies. Bland and Altman (1986) devised their plot to steer researchers away from what they considered to be misuse of the correlation coefficient as a measure of validity. The Bland–Altman plot explicitly shows differences between the two raters' measurements (on the Y axis) over their range (on the X axis). Because the eye is better at judging departures from a horizontal line than from a tilted line, the difference between a pair of measurements is plotted on a graph against their mean. If the measurements are comparable, the differences should be small, centered around 0, and show no systematic variation with the mean of the measurement pairs. The plot facilitated estimating precision by calculating the standard error and CI. The Bland–Altman plot method was used to determine the level of agreement between raters. For this study, acceptable limits of agreement were defined as overall scores within two points of each other between doctors and nurses. A difference of two points on a scale of 0 to 20 is not considered of great clinical importance when scoring the BI (Collin et al., 1988, Sainsbury et al., 2005), yet a change of more than 2 points in the total score does reflect a probable genuine change (Collin et al., 1988, Fleiss, 1981).
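The quantities behind a Bland–Altman plot can be sketched in a few lines of Python. This is a minimal illustration, not the Stata analysis used in the study: the function name `bland_altman`, the normal multiplier of 1.96, and the five pairs of scores are all assumptions made for the example.

```python
import math
import statistics


def bland_altman(scores_a, scores_b, z=1.96):
    """Summarise agreement between two raters' paired total scores.

    Returns the plotted points (pair mean, pair difference), the mean
    difference (bias), the 95% limits of agreement, and the 95% CI of
    the bias. z = 1.96 assumes approximately normal differences.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = [(a + b) / 2 for a, b in zip(scores_a, scores_b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)          # sample SD of the differences
    se = sd / math.sqrt(len(diffs))       # standard error of the bias
    return {
        "points": list(zip(means, diffs)),
        "bias": bias,
        "limits_of_agreement": (bias - z * sd, bias + z * sd),
        "bias_ci": (bias - z * se, bias + z * se),
    }


# Hypothetical total BI scores (0-20) for five patients; not study data.
doctor = [14, 9, 18, 12, 20]
nurse = [12, 9, 17, 10, 20]
result = bland_altman(doctor, nurse)
print(result["bias"], result["limits_of_agreement"], result["bias_ci"])
```

With these made-up scores the bias is 1 point in favor of the doctor, and the limits of agreement describe how far an individual pair of ratings may plausibly differ, which is a different question from how precisely the bias itself is estimated.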
Interrater reliability was determined using percentage agreement. Cohen's weighted kappa coefficient was used to measure agreement on individual items of the BI, corrected for chance. A weighted measure was appropriate because a difference of one category on an ADL item represents less disagreement than a difference of two categories, and a difference of three categories represents even more disagreement. The level of interrater agreement was determined by the magnitude of the overall weighted kappa statistic. When quantifying actual levels of agreement, kappa's calculation uses a term called the proportion of chance (or expected) agreement. This is interpreted as the proportion of times raters would agree by chance alone. The CI around an estimate of kappa is a function of the absolute percentage agreement, the prevalence or variance of the condition, and the number of pairs being compared. Standard errors and CIs can be calculated to see how precise the estimates are, provided the differences follow a distribution that is approximately normal. The 95% limits of agreement approach is judged against the “gold standard” method of measurement. It tells us how far from the gold standard the measurement is likely to be. The ideal is to compare a measure with a gold standard, which is an unequivocally valid, universally accepted outcome measure that directly reflects the behavior under scrutiny (Swets, 1986). However, because no gold standard exists against which to judge one method, two nurses were compared with two doctors, and the group with the higher within-group interrater agreement was treated as the gold standard.
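As an illustration of how a weighted kappa penalizes larger disagreements more heavily, the sketch below implements Cohen's kappa with linear weights in plain Python. The choice of linear weights is an assumption (the weighting scheme used in the study is not stated), and the function name `weighted_kappa` and the ratings are hypothetical.

```python
def weighted_kappa(ratings_a, ratings_b, categories):
    """Cohen's kappa with linear disagreement weights for ordinal scores.

    `categories` lists the possible scores in ascending order. Linear
    weights are an assumption made for this sketch.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equally long, non-empty rating lists")
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two categories")
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_a)

    # Observed proportions in the k x k cross-classification table.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    obs_disagreement = 0.0
    exp_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)   # 0 on the diagonal, 1 at the extremes
            obs_disagreement += weight * observed[i][j]
            exp_disagreement += weight * row[i] * col[j]  # chance-expected
    if exp_disagreement == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1 - obs_disagreement / exp_disagreement


# Hypothetical scores on a 0-3 item (for example, transfer) for eight patients.
nurse_a = [0, 1, 2, 3, 2, 1, 3, 0]
nurse_b = [0, 1, 2, 3, 3, 1, 2, 0]
kappa = weighted_kappa(nurse_a, nurse_b, categories=[0, 1, 2, 3])
print(round(kappa, 2))
```

In this made-up example the two raters differ by one category on two of eight patients, which gives a weighted kappa of about 0.8; the same two disagreements separated by three categories would lower the value considerably.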
3. Results
3.1. Difference between doctors and nurses (DN)
Table 1 displays the breakdown of difference in average BI scores. Doctors scored patients higher for most categories when compared to nurses. The average overall difference was 1.2 (95% CI = 0.5–1.8).
Table 1. Difference between nurses and doctors in average Barthel scores for each item
| Function | Nurses | Doctors | Difference (doctors − nurses) |
|---|---|---|---|
| Overall | 11.9 | 13.1 | 1.2 |
| Bowels | 1.8 | 1.7 | −0.1 |
| Bladder | 1.4 | 1.4 | 0 |
| Grooming | 0.5 | 0.8 | 0.3 |
| Toilet use | 1.1 | 1.4 | 0.3 |
| Feeding | 1.8 | 1.9 | 0.1 |
| Transfer | 1.9 | 2.1 | 0.2 |
| Mobility | 2 | 2.1 | 0.1 |
| Dressing | 1.2 | 1.3 | 0.1 |
| Stairs | 0.2 | 0.3 | 0.1 |
| Bathing | 0 | 0.3 | 0.3 |
The Bland–Altman plot in Fig. 1 displays the difference in total BI score between doctors and nurses against the average total BI score for each patient. There was a strong rank correlation in overall scores between doctors and nurses (Spearman's ρ = 0.6968). The reliability of the BI was influenced by the degree of disability or independence in the sample examined, and the scoring of individual items was least reliable when the patient's total score was in the middle range of 10 to 16.

Fig. 1.
Bland–Altman plot displaying the difference in total BI score between doctors and nurses against the mean total BI score for each patient. Reference range for difference = −5.483 to 7.975; mean difference = 1.2, CI = 0.5–1.8; range = 0.000–20.000. The horizontal lines represent the mean difference and the 95% limits of agreement.
3.2. Difference between doctors and doctors (DD)
The Bland–Altman plot in Fig. 2 demonstrates the difference in total BI score between the two doctors against the average total BI score for each patient. Although there was strong rank correlation in BI score between doctors (Spearman's ρ = 0.74), there was a slight bias based on the mean difference (0.554; CI = −0.147 to 1.255, which includes zero), and there was wide variation, with limits of agreement of −5.1 to 6.2.

Fig. 2.
Bland–Altman comparison of the first doctor and the second doctor randomly selected. Limits of agreement (reference range for difference) = −5.1 to 6.2; mean difference = 0.554, CI = −0.147 to 1.255; range = 0.000–20.000. The horizontal lines represent the mean difference and the 95% limits of agreement.
3.3. Difference between nurses and nurses (NN)
The Bland–Altman plot in Fig. 3 displays the difference in total BI score between the two nurses against the average total BI score for each patient. There was strong rank correlation in BI score between nurses (Spearman's ρ = 0.8290). No significant bias was found based on the mean difference (0.154; CI = −0.397 to 0.705), and there was little variation, with limits of agreement of −4.293 to 4.601.

Fig. 3.
Bland–Altman comparison of the first nurse and the second nurse randomly selected. Limits of agreement (reference range for difference) = −4.29 to 4.60; mean difference = 0.15, CI = −0.397 to 0.705; range = 0.000–20.000. The horizontal lines represent the mean difference and the 95% limits of agreement.
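The relationship between the limits of agreement and the CI of the mean difference can be checked with a short calculation. The sketch assumes that the limits were computed as mean ± z·SD and the CI as mean ± z·SD/√n with the same multiplier z and n = 65 patients; under those assumptions z cancels, and the CI half-width is the half-width of the limits divided by √n. The function name `ci_from_limits` is hypothetical, and the inputs are the three-decimal limits reported for the nurse–nurse and doctor–doctor comparisons.

```python
import math


def ci_from_limits(lower_loa, upper_loa, n):
    """Recover the mean difference and its CI from limits of agreement.

    Assumes limits = mean +/- z*SD and CI = mean +/- z*SD/sqrt(n) with
    the same multiplier z, so z does not need to be known.
    """
    mean_diff = (lower_loa + upper_loa) / 2
    half_width = (upper_loa - lower_loa) / 2 / math.sqrt(n)
    return mean_diff, (mean_diff - half_width, mean_diff + half_width)


# Reported limits of agreement; n = 65 patients with complete data.
nn_mean, nn_ci = ci_from_limits(-4.293, 4.601, n=65)
dd_mean, dd_ci = ci_from_limits(-5.103, 6.211, n=65)
print(nn_mean, nn_ci)
print(dd_mean, dd_ci)
```

Under these assumptions the recovered values agree with the reported mean differences (0.154 and 0.554) and with the reported CIs to within rounding. The limits are roughly eight times wider than the CI because they describe the spread of individual differences rather than the uncertainty of the average.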
3.4. Calculation of variation between raters
The average agreement was calculated using weighted kappa between doctors and nurses when arbitrarily chosen: between the first doctor and the first nurse randomly compared (DN11), between the second doctor and the second nurse randomly compared (DN22), between the first doctor and second nurse compared (DN12), and finally between the second doctor and the first nurse compared (DN21). The levels of agreement are summarized in Table 2. Overall, between doctors and nurses, only three items (toilet use, transfer, and mobility) demonstrated a kappa coefficient greater than 0.61, corresponding to good agreement. Nurses demonstrated good and very good agreement on all 10 domains unlike doctors who only demonstrated good agreement on 5 domains (toilet use, dressing, transfer, mobility, and feeding).
Table 2. Calculation of variation between raters
| Function | NN | DD | DN11 | DN22 | DN12 | DN21 | DN |
|---|---|---|---|---|---|---|---|
| Grooming | 0.66 | 0.39 | 0.14 | 0.2 | 0.25 | 0.26 | 0.21 |
| Bathing | 0.79 | 0.51 | 0.04 | 0.13 | 0.16 | 0.007 | 0.08 |
| Bowels | 0.8 | 0.52 | 0.39 | 0.52 | 0.61 | 0.35 | 0.47 |
| Bladder | 0.78 | 0.58 | 0.47 | 0.53 | 0.57 | 0.6 | 0.54 |
| Toilet use | 0.74 | 0.69 | 0.62 | 0.63 | 0.64 | 0.61 | 0.63 |
| Dressing | 0.72 | 0.63 | 0.54 | 0.6 | 0.67 | 0.55 | 0.59 |
| Stairs | 0.82 | 0.48 | 0.55 | 0.3 | 0.36 | 0.51 | 0.43 |
| Transfer | 0.84 | 0.67 | 0.7 | 0.67 | 0.69 | 0.71 | 0.69 |
| Mobility | 0.74 | 0.68 | 0.77 | 0.64 | 0.64 | 0.84 | 0.72 |
| Feeding | 0.86 | 0.61 | 0.45 | 0.65 | 0.52 | 0.59 | 0.55 |
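The verbal labels applied to the kappa values in Table 2 can be reproduced with a small helper. The cut-offs below are an assumption: they follow the commonly used bands attributed to Altman (1991), in which values above 0.60 count as good and values above 0.80 as very good. The sketch also treats the DN column as the mean of the four doctor–nurse pairings, which matches the tabulated values to rounding but is not stated explicitly. Only three of the ten items are included, and the names `TABLE_2` and `agreement_band` are hypothetical.

```python
from statistics import mean

# Kappa values copied from Table 2 for three items.
TABLE_2 = {
    "Grooming": {"NN": 0.66, "DD": 0.39, "DN_pairs": [0.14, 0.20, 0.25, 0.26]},
    "Transfer": {"NN": 0.84, "DD": 0.67, "DN_pairs": [0.70, 0.67, 0.69, 0.71]},
    "Feeding": {"NN": 0.86, "DD": 0.61, "DN_pairs": [0.45, 0.65, 0.52, 0.59]},
}


def agreement_band(kappa):
    """Map a kappa value to a verbal band (cut-offs assumed, see lead-in)."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"


summary = {}
for item, values in TABLE_2.items():
    dn = mean(values["DN_pairs"])  # assumed derivation of the DN column
    summary[item] = {
        "NN": agreement_band(values["NN"]),
        "DD": agreement_band(values["DD"]),
        "DN": agreement_band(dn),
        "DN_kappa": dn,
    }

for item, bands in summary.items():
    print(item, bands["NN"], bands["DD"], bands["DN"])
```

For these three items the helper labels nurse–nurse agreement as good or very good throughout, while doctor–doctor agreement on grooming falls in the fair band.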
4. Discussion
Previous studies have highlighted the validity and reliability of the BI in younger patients (less than 65 years; McGinnis et al., 1986, Roy et al., 1988); however, this study considered older adults. Cognitive impairment was considered an extraneous confounding variable because in previous studies, patients who have cognitive impairment gave an unreliable BI score (McGinnis et al., 1986, Ranhoff & Laake, 1993, Richards et al., 2000). Consequently, this category of patient was excluded from participating in the study to minimize variation when assessing the congruency in scoring the BI among doctors and nurses. All patients were in-patients for rehabilitation following a medical illness. Therefore, it was unlikely that there would be dramatic changes in patients' functional ability. This coupled with the time frame of 5 days to record the BI allowed little variation within the patients' condition.
Because no gold standard exists against which to measure physical function, this study compared two nurses' BI scores with two doctors' BI scores, and the professional group with the higher within-group agreement was taken to indicate the most accurate BI score and the more reliable method. Providing the guidelines on the reverse side of the BI allowed clear definition of each performance level for each item in the BI, hence standardizing the method of assessment. There was a strong rank correlation in overall scores between doctors and nurses, and the difference in average total BI scores was small, falling within the ±2-point limits set for this study (mean difference = 1.246, CI = 0.662–1.830). However, there was good agreement on only three individual items of the BI (mobility, transfer, and toilet use). Perhaps this was related to the nature of the activities, as there are more opportunities to observe mobility and transfer than other items in the BI such as dressing, feeding, or climbing the stairs. Good agreement on the item toilet use may be influenced by the patients' physical mobility and transfer, as any restrictions in mobility and transfer may influence a patient's independence or dependence with this activity.
Previous studies used statistical methods that are inappropriate for analyzing agreement between health professionals, notably the intraclass correlation coefficient or kappa alone (Collin et al., 1988, Gresham et al., 1980, Ranhoff & Laake, 1993, Roy et al., 1988), chi-square and t tests (McGinnis et al., 1986), Kendall rank correlation (Collin et al., 1988, Roy et al., 1988), and Pearson's correlation (Roy et al., 1988, Shinar et al., 1987). Such tests are uninformative when questioning the level of agreement among more than one rater, as data that are poor in agreement can produce quite high correlations. The intraclass correlation coefficient (ICC) or kappa value can be ambiguous when there is a wide range in patient scores and depends on the clinical context in which the measure is administered (Streiner & Norman, 2003). In addition, the use of correlation is misleading because it gives the product–moment correlation coefficient between the results of two measurement methods as an indicator of agreement (Gwet, 2002). This statistic is insensitive to rater mean differences (Sainsbury et al., 2005) and does not take into account the magnitude of the differences between raters. There is a lack of consensus on the statistical method to investigate interrater reliability (Sainsbury et al., 2005).
To overcome errors in comparing the reliability of doctors' and nurses' scores in this study, the Bland–Altman plot (1986) was chosen because it explicitly displays the difference in scores against the mean for each subject and takes into account the magnitude of the differences between raters. Previous studies using the Bland–Altman method (Ranhoff, 1997, Richards et al., 2000, Yeo et al., 1995) have shown little systematic bias but a clinically worrying imprecision with a 95% CI of ±4 points or more in the modified BI. Because the limits of agreement in previous studies were relatively wide (±3 points and ±4 points; Richards et al., 2000, Roy et al., 1988), statistical advice was sought for this study and the limits of agreement agreed upon. Because a change of more than 2 points in the total score reflects a probable genuine change in the patients' functional ability (Collin et al., 1988), the limits of agreement need to be narrow. The limits of agreement in this study were ±2 points with a 95% CI, which reduced the opportunity for clinical imprecision. This study demonstrated a strong rank correlation between overall BI scores using Bland–Altman plots, substantiating the validity of the BI scores in this study.
Despite the presence of the BI guidelines in this study to standardize the method of administering the BI, the results demonstrated that doctors overestimate patients' functional ability when compared with nurses (13.1 vs. 11.9; mean difference = 1.2). Nurses recorded what patients did in their presence, whereas doctors recorded patients' self-report. This explains the nurse–doctor variation, as it cannot be explained by patient variability. The results indicate that the self-report method used by doctors is less reliable than direct observation. The difference in scoring items may be a consequence of a patient concealing difficulties or overestimating their ability to care for themselves when scored by a doctor. The guidelines allow clear definition of each item and performance level in the BI, hence standardizing the interpretation when administering the BI. However, bias may arise from how the patient perceives the questions. For instance, when patients are questioned about their capacity to perform ADL, asking “could you...?” rather than “do you...?” draws out the difference between capacity and usual performance, which accounts for much of the discrepancy.
The reliability of the BI was influenced by the degree of the patients' disability or independence. Variation was narrow in patients who were either independent or disabled, and scores were least reliable when the patient's total score was in the middle range of 10 to 16 out of 20. This finding is similar to previous studies (Collin et al., 1988, Roy et al., 1988). Overall, the level of agreement was moderate for most items between doctors and nurses. The items of transfer, mobility, and toileting demonstrated good agreement, whereas items such as bowels, bladder, dressing, stairs, and feeding demonstrated moderate agreement between doctors and nurses. Grooming and bathing appeared to be least reliable with only fair agreement, possibly because grooming is a less standardized item and raters tend to speculate on what this item entails, causing large variation. The guidelines refer to grooming as personal hygiene, for example, brushing teeth, fitting dentures, brushing hair, shaving, and face washing. However, not all patients shave or wear dentures, and face washing may be incorporated under the item of bathing, so it is difficult to score this item unless the performance of the activity is observed by the rater (Shinar et al., 1987). The second item of low agreement was bathing. The guidelines refer to this as the most difficult activity, where the patient must get in and out of the bath unsupervised and wash himself or herself. It is uncertain whether this item can be accurately scored without having observed the patient perform the activity.
There was strong rank correlation between BI scores for nurses (Spearman's ρ = 0.8290). There was no significant bias based on the mean difference (0.154), and the limits of agreement (−4.293 to 4.601) were narrow between nurses. The nurse's wider experience of assisting the patient with ADL may have influenced the assessment of the patient's functional ability. The opportunity of caring for the patient allowed the nurse to observe the patient's ability to perform ADL. This occurred frequently for ADL that are performed regularly during the day such as feeding, transfer, and toileting. It is likely that frequent contact with the patient facilitated casual observation of the patient's ability to perform ADL. The activities of least agreement between nurses were grooming and dressing, perhaps because some patients tend to assume a more sick or dependent role. Overall, the agreement between nurses was exceptionally high, indicating good interrater reliability.
There was strong rank correlation between BI scores for doctors (Spearman's ρ = 0.74). Although there was slight bias, based on the mean difference (0.554) and CI (−0.147 to 1.255), there was a wide variation with the following limits of agreement (−5.103 to 6.211). Doctors demonstrated good agreement on five items, moderate agreement on four items, and fair agreement on one item. Doctors may sometimes overestimate the patient's ability due to the method of administering the BI. Doctors are more often required to speculate on the validity of certain ADL, such as toileting (bowel or bladder continence). Doctors usually rely on the patient's chart or the patient's report. Patients may report the ability to perform ADL to avoid the embarrassment of not being able to care for themselves. Bias in subjective measurements can arise from how the patient perceives the questions, the method of administration, or misinterpretation of items in the BI. Therefore, administrators should be familiar with the BI guidelines to allow accurate assessment of the patients' activity.
5. Limitations
The sample in this study consisted of older adults admitted for rehabilitation without cognitive impairment. This limits the generalization of these findings. This study was carried out at a single site, thereby limiting the transferability of findings.
6. Conclusion
To optimize a patient's rehabilitation, clinical measures such as the BI need to be administered accurately. This study demonstrated that the nurse is best placed to administer the BI. For this study, patient variation was minimized by excluding participants with cognitive impairment and reducing the time frame for administering the BI. To substantiate the validity of the BI final score, narrow limits of agreement (±2 points) were set. The results of this study lend credence to the method of observing the patient's performance when assessing ADL because this method demonstrated high interrater reliability among nurses. The self-report method often used by doctors may be threatened when a patient is reluctant to admit difficulty with everyday activities of living. In addition, the administrator needs to be familiar with the guidelines to ensure accuracy in scoring the BI. The guidelines allow clear definition of each item and performance level in the BI, hence standardizing the interpretation and method of assessment. The BI facilitates the transmission of important information about patient functional ability between health providers; therefore, accuracy and congruency in recording the BI are vital to enable appropriate care planning.
Acknowledgments
The author would like to thank the staff at the rehabilitation hospital for participating and facilitating this study. Thank you also to all the patients who took part in this study. The authors have no conflicts of interest to declare. The authors received no financial support to conduct this study.
References
- Altman, D.G. Practical Statistics for Medical Research. London; 1991
- Bland & Altman. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1:307–310
- Collin et al. The Barthel ADL Index: A standard measure of disability. International Disability Studies. 1988;10:64–67
- Dorevitch et al. The accuracy of self and informant ratings of physical functional capacity in the elderly. Journal of Clinical Epidemiology. 1992;45(7):791–798
- Efron & Tibshirani. An introduction to the Bootstrap. California: Stanford University; 1993
- Fleiss. Statistical methods for rates and proportions. New York: Wiley; 1981
- Gibbon. Measuring stroke recovery. Nursing Times. 1991;87(44):32–34
- Granger et al. Outcome of comprehensive medical rehabilitation: Measurement by PULSES profile and Barthel Index. Archives of Physical Medicine and Rehabilitation. 1979;60:145–154
- Gresham, Philips, & Labi. ADL status of stroke: Relative merit of three standard indexes. Archives of Physical Medicine and Rehabilitation. 1980;61:355–358
- Gwet. Handbook of inter-rater reliability. Gaithersburg, USA: STATAXIS Publishing Company; 2002
- Hartigan. A comparative review of the Katz ADL and the Barthel Index in assessing the activities of daily living of older people. International Journal of Older People Nursing. 2007;2(3):204–212
- Hsueh et al. Comparison of the psychometric characteristics of the functional independence measure, 5 item Barthel index, and 10 item Barthel index in patients with stroke. Journal of Neurology, Neurosurgery, and Psychiatry. 2002;73:188–190
- Mahoney & Barthel. Functional evaluation: The Barthel Index. Maryland State Medical Journal. 1965;12:61–65
- Mattison et al. Rehabilitation status: The relationship between the Edinburgh Rehabilitation Status Scale (ERSS), Barthel Index and PULSES Profile. International Disability Studies. 1991;13:9–11
- McGinnis et al. Program evaluation of physical medicine and rehabilitation departments using self-report. Archives of Physical Medicine and Rehabilitation. 1986;67:123–125
- Ranhoff. Activities of daily living, cognitive impairment and other psychological symptoms among elderly recipients of home help. Health and Social Care in the Community. 1997;5:147–152
- Ranhoff & Laake. The Barthel ADL Index: Scoring by the physician from patient interview is not reliable. Age and Ageing. 1993;22:171–174
- Richards et al. Inter-rater reliability of the Barthel ADL Index: How does a researcher compare to a nurse? Clinical Rehabilitation. 2000;14:72–78
- Roy et al. An inter rater reliability study of the Barthel Index. International Journal of Rehabilitation Research. 1988;11:67–70
- Sainsbury, Seebass, Bansal, & Young. Reliability of the Barthel Index when used with older people. Age and Ageing. 2005;34(3):228–232
- Shinar et al. Reliability of the activities of daily living scale and its use in telephone interview. Archives of Physical Medicine and Rehabilitation. 1987;68:723–728
- Sinoff & Ore. The Barthel Activities of Daily Living Index: Self-reporting versus actual performance in the old–old (>75 years). Journal of the American Geriatrics Society. 1997;45:832–836
- Streiner & Norman. Reliability (ch. 8). In: Health Measurement Scales: A practical guide to their development and use. 3rd ed. Oxford Medical Publications; 2003. p. 104–127
- Swets. Indices of discrimination or diagnostic accuracy: Their ROC's and implied models. Psychological Bulletin. 1986;99:110–117
- Yeo et al. Barthel ADL Index: A comparison of administration methods. Clinical Rehabilitation. 1995;9:34–39
PII: S0897-1897(09)00125-6
doi:10.1016/j.apnr.2009.11.002
© 2011 Elsevier Inc. All rights reserved.
