
Validity and reliability of three definitions of hip osteoarthritis: cross sectional and longitudinal approach
M Reijman (1), J M W Hazes (3), H A P Pols (2), R M D Bernsen (1), B W Koes (1), S M A Bierma-Zeinstra (1)

(1) Department of General Practice, Erasmus MC, Rotterdam, Netherlands
(2) Department of Internal Medicine, Erasmus MC, Rotterdam
(3) Department of Rheumatology, Erasmus MC, Rotterdam

Correspondence to: Dr M Reijman, Department of General Practice, Erasmus MC – Faculty, PO Box 1738, 3000 DR Rotterdam, Netherlands; m.reijman@erasmusmc.nl

Abstract

Objectives: To compare, in a large open population, the reliability and validity of three frequently used radiological definitions of hip osteoarthritis (OA) (Kellgren and Lawrence grade, minimal joint space (MJS), and Croft grade), and to investigate whether the validity of these three definitions is sex dependent.

Methods: Subjects from the Rotterdam study (aged ⩾55 years, n = 3585) were evaluated. The inter-rater reliability was tested in a random set of 148 x rays. The validity was expressed as the ability to identify patients who show clinical symptoms of hip OA (construct validity) and as the ability to predict total hip replacement (THR) at follow up (predictive validity).

Results: Inter-rater reliability was similar for the Kellgren and Lawrence grade and MJS (κ statistics 0.68 and 0.62, respectively) but lower for Croft’s grade (κ statistic, 0.51). The Kellgren and Lawrence grade and MJS showed the strongest associations with clinical symptoms of hip OA. Sex appeared to be an effect modifier for Kellgren and Lawrence and MJS definitions, women showing a stronger association between grading and symptoms than men. However, the sex dependency was attributed to differences in height between women and men. The Kellgren and Lawrence grade showed the highest predictive value for THR at follow up.

Conclusions: Based on these findings, Kellgren and Lawrence still appears to be a useful OA definition for epidemiological studies focusing on the presence of hip OA.

Abbreviations:
  • GEE, generalised estimating equations
  • HAQ, health assessment questionnaire
  • ICC, intraclass correlation coefficient
  • LDI, lower limb disability index
  • MJS, minimal joint space
  • ROA, radiographic osteoarthritis
  • THR, total hip replacement

Keywords: hip osteoarthritis; definition; epidemiology


Osteoarthritis (OA) of the hip is of particular interest as this is often the sole joint affected by the disease, suggesting an important role for local biomechanical risk factors. In addition, the prevalence of hip OA is expected to increase with the aging of Western society,1 and the hip is crucial for independent function.2

A problem with studying hip OA is the absence of a consensus for defining it for the purposes of epidemiological research.3 To investigate occurrence and (potential) risk factors, a valid and reliable definition of hip OA is required. Most epidemiological studies have used a single hallmark (radiological signs) to define hip OA.4,5

In a previous systematic appraisal, we summarised the validity, reliability, and applicability of seven definitions of hip OA used in epidemiological studies.6 Considering the frequent use of the definitions of hip OA, it is surprising that the validity of these definitions has been so poorly investigated. Because of the lack of comparability between the different studies and because most studies only investigated a single definition, it was difficult to compare the reliability and validity of the various definitions of hip OA. Our appraisal also showed that the validity and reliability of minimal joint space (MJS; according to Croft) and Croft’s grade (a modification of the Kellgren and Lawrence grade)7 have only been studied in a male population.

The primary objective of the present study was to compare the reliability and validity of the three most commonly used radiological definitions of hip OA—Kellgren and Lawrence grade, MJS (according to Croft), and Croft’s grade—in a large open population of elderly people. Our secondary objective was to investigate whether the validity of the three definitions of hip OA was sex dependent.

METHODS

The study population consisted of participants in the Rotterdam study, a prospective cohort of men and women aged 55 years and over. The objective of the Rotterdam study is to investigate the incidence of and risk factors for chronic disabling diseases. The rationale and study design have been described previously.8 The focus is on neurogeriatric, cardiovascular, ophthalmological, and locomotor diseases. All 10 275 inhabitants of Ommoord (a district in Rotterdam, Netherlands) were invited to participate. The response rate was 78%, resulting in 7983 subjects participating in the study. Written informed consent was obtained from each participant. The medical ethics committee of the Erasmus University Medical Centre approved the Rotterdam Study.

We used a sample of 3585 subjects from the Rotterdam study, selected on the basis of the availability of hip radiographs at baseline and follow up. Because subjects had to be mobile enough to visit the research centre at baseline and follow up, and to survive the follow up period, our study population is subject to health based selection bias. Compared with the total Rotterdam study population, the study population was significantly younger (66.0 v 70.6 years), had a lower prevalence of lower limb disability at baseline (index score ⩾0.5: 12.9% v 35.5%), and had a somewhat lower prevalence of hip pain at baseline (11.7% v 12.7%).

Subjects with bilateral total hip replacement (THR) at baseline (n = 24) were excluded from analysis, which resulted in a study population of 3561 subjects. The baseline measurements were conducted between April 1990 and July 1993, and the follow up measurements between 1996 and 1999, with a mean (SD) follow up time of 6.6 (0.50) years.

Radiographic assessment

Weight bearing anteroposterior pelvic radiographs with both feet in 10° internal rotation were obtained at 70 kV, a focus of 1.8, and a focus to film distance of 120 cm, applying a Fuji High Resolution G 35×43 cm film.9 The x ray beam was centred on the umbilicus.

One independent trained reader (MR) evaluated the radiographs according to a standardised protocol, unaware of the clinical status of the patients.

At baseline, radiographic osteoarthritis (ROA) of the hip was quantified using the Kellgren and Lawrence grading system (atlas based) (table 7),6,10–13 the Croft grading system (a modification of the Kellgren and Lawrence system) (table 8), and MJS as defined by Croft (table 9).6,7,12–14 For the Croft grading scale, we assessed the individual radiographic features: MJS, osteophytes, subchondral sclerosis, and cyst formation. We recorded the presence of each individual radiographic feature (of any grade) using an atlas of individual features.12,13 Different cut off points were used to quantify hip ROA: for Kellgren and Lawrence, ⩾grade 2 (moderate) and ⩾grade 3 (severe); for the Croft grading system, ⩾grade 3 (moderate) and ⩾grade 4 (severe); and for MJS, ⩽2.5 mm (moderate), ⩽2.0 mm (intermediate), and ⩽1.5 mm (severe).

The joint space width (lateral, superior, axial, medial, and minimum) measurements were standardised using a 0.5 mm graduated magnifying glass laid directly over the radiograph.15–17
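
The cut off points in the preceding paragraphs can be turned into per-hip case definitions. The sketch below is a minimal illustration, not the study's software: the `HipReading` record and `classify` function are hypothetical names, and it assumes, as a simplification, that MJS is the narrowest of the four site measurements listed for Croft's method.

```python
from dataclasses import dataclass

# Cut off points as given in the Methods; the dictionary keys are this sketch's own labels.
KL_CUTOFFS = {"moderate": 2, "severe": 3}        # ROA if Kellgren and Lawrence grade >= value
CROFT_CUTOFFS = {"moderate": 3, "severe": 4}     # ROA if Croft grade >= value
MJS_CUTOFFS_MM = {"moderate": 2.5, "intermediate": 2.0, "severe": 1.5}  # ROA if MJS <= value


@dataclass
class HipReading:
    """One reader's scores for a single hip (hypothetical record layout)."""
    kl_grade: int
    croft_grade: int
    lateral_mm: float
    superior_mm: float
    axial_mm: float
    medial_mm: float

    @property
    def mjs_mm(self) -> float:
        # Assumption: MJS is the smallest of the four site measurements.
        return min(self.lateral_mm, self.superior_mm, self.axial_mm, self.medial_mm)


def classify(hip: HipReading) -> dict[str, bool]:
    """Return hip ROA yes/no under every definition and cut off point."""
    result = {}
    for level, grade in KL_CUTOFFS.items():
        result[f"KL_{level}"] = hip.kl_grade >= grade
    for level, grade in CROFT_CUTOFFS.items():
        result[f"Croft_{level}"] = hip.croft_grade >= grade
    for level, mm in MJS_CUTOFFS_MM.items():
        result[f"MJS_{level}"] = hip.mjs_mm <= mm
    return result
```

For a hip with Kellgren and Lawrence grade 2, Croft grade 3 and a narrowest site of 2.0 mm, the function flags moderate ROA under all three definitions, intermediate ROA under MJS, and no severe ROA.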

The follow up radiographs were evaluated for the presence of an incident THR (not present at baseline).

For all three grading systems and all measurements, inter-rater reliability was tested in a random set of 148 radiographs.18,19

Clinical assessment

At baseline, trained interviewers undertook an extensive home interview on demographic characteristics, medical history, risk factors for chronic diseases, and use of medicines.

For this study we used information on the presence of hip pain (“Did you have joint complaints of your right/left hip during the last month?”), the presence of morning stiffness, and lower limb disability. Lower limb disability was assessed using a modified version of the Stanford health assessment questionnaire.9 A lower limb disability index (LDI) was obtained by calculating the mean score of the answers to the following six questions: “Are you able to stand up from a straight chair without using your arms for support?”; “Are you able to get in and out of bed?”; “Are you able to walk outdoors on flat ground?”; “Are you able to climb up five steps?”; “Are you able to bend down to pick up clothing from the floor?”; and “Are you able to get in and out of a car?”. The answers were scored as follows: 0 = yes, without difficulty; 1 = yes, with some difficulty; 2 = yes, with much difficulty; 3 = no, unable to do (needs help). Moderate disability was defined as a score greater than 0.5 and severe disability as a score greater than 1.0 on the LDI. Moderate disability is present whenever there is at least some difficulty with three of six daily activities in the LDI.9
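
The LDI calculation can be sketched as follows. The function names are hypothetical. One assumption needs flagging: three items scored 1 give a mean of exactly 0.5, and the baseline comparison in the Methods uses an index score ⩾0.5, so this sketch treats the moderate threshold as inclusive; change `>=` to `>` if a strict threshold is intended.

```python
# Item scores: 0 = without difficulty, 1 = some difficulty,
# 2 = much difficulty, 3 = unable to do (needs help).
LDI_ITEMS = (
    "stand up from chair", "get in and out of bed", "walk outdoors",
    "climb five steps", "pick up clothing", "get in and out of car",
)


def lower_limb_disability_index(scores: list[int]) -> float:
    """Mean of the six item scores."""
    if len(scores) != len(LDI_ITEMS):
        raise ValueError(f"expected {len(LDI_ITEMS)} item scores, got {len(scores)}")
    if any(s not in (0, 1, 2, 3) for s in scores):
        raise ValueError("each item score must be 0, 1, 2 or 3")
    return sum(scores) / len(scores)


def disability_category(ldi: float, moderate: float = 0.5, severe: float = 1.0) -> str:
    """Classify an LDI value; the inclusive moderate threshold is an assumption."""
    if ldi > severe:
        return "severe"
    if ldi >= moderate:
        return "moderate"
    return "none"
```

With some difficulty on three of the six activities, `lower_limb_disability_index([1, 1, 1, 0, 0, 0])` is 0.5, which this sketch classifies as moderate disability.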

Statistical analysis

For inter-rater reliability, the κ statistic and the intraclass correlation coefficient (ICC) were calculated for the individual radiological features and for the three definitions of hip ROA.
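
The text does not specify which κ variant was used. As an illustration, assuming unweighted Cohen's κ for two readers scoring the same radiographs on a categorical outcome (for example, Kellgren and Lawrence ⩾grade 2 yes/no), the statistic compares observed agreement with the agreement expected by chance from each reader's marginal frequencies. The ICC for continuous MJS readings would need a different calculation and is not shown.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from the product of each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n**2
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return (observed - expected) / (1 - expected)
```

For two readers who agree on 8 of 10 radiographs, each calling half of them positive, chance agreement is 0.5 and κ = (0.8 − 0.5)/(1 − 0.5) = 0.6.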

Because of the absence of a gold standard, we expressed validity as construct validity and predictive validity. Construct validity is the ability of a definition to identify patients with symptoms of hip OA (hip pain, morning stiffness, or lower limb disability)20,21; predictive validity is its ability to predict important long term outcomes of the disease.21 For construct validity, we tested the association between baseline radiological hip OA according to each of the three definitions and each baseline clinical symptom (hip pain, morning stiffness, and lower limb disability) separately, using generalised estimating equations (GEE) (cross sectional design). GEE is a method for analysing repeated (correlated) measurements; here it accounts for the correlation between the left and right hips of the same subject. In addition, sensitivity and specificity were assessed using the main symptom of hip OA, hip pain, as the gold standard. We used different cut off points for the three definitions of hip ROA and also stratified the results for sex and age. A two sided p value below 0.05 was considered significant. For predictive validity, we assessed the proportion of THR after a clinically meaningful follow up period (mean 6.6 years) in patients identified by each definition as having hip ROA at baseline (longitudinal design). We also estimated the association (odds ratios) between the different definitions of hip ROA and THR at follow up using GEE. We used SPSS version 11.0 (SPSS Inc, Chicago, Illinois, USA) and SAS software, version 8.0 (SAS Institute, Cary, North Carolina, USA) for all analyses.
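
Sensitivity and specificity with hip pain as the reference standard reduce to counts in a 2×2 table. A minimal sketch, assuming each hip is one observation (the unit of analysis is not stated here) and using the hypothetical function name `sensitivity_specificity`; unlike the GEE models, it ignores the pairing of left and right hips.

```python
def sensitivity_specificity(roa: list[bool], pain: list[bool]) -> tuple[float, float]:
    """Sensitivity and specificity of an ROA definition, with hip pain as reference."""
    if len(roa) != len(pain):
        raise ValueError("roa and pain must have the same length")
    pairs = list(zip(roa, pain))
    tp = sum(r and p for r, p in pairs)               # ROA and painful
    fn = sum((not r) and p for r, p in pairs)         # no ROA but painful
    tn = sum((not r) and (not p) for r, p in pairs)   # no ROA and pain free
    fp = sum(r and (not p) for r, p in pairs)         # ROA but pain free
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one painful and one pain free hip")
    return tp / (tp + fn), tn / (tn + fp)
```

A definition that labels many pain free hips as ROA (a high prevalence of "disease") accumulates false positives, which lowers specificity.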

RESULTS

Study population

Table 1 shows the demographic characteristics of the study population of 3585 participants and the prevalence of radiographic hip OA, stratified for sex. Women were older, had a higher body mass index (BMI), and were shorter than men. The prevalence of lower limb disability and of hip pain in women was twice that in men. Men showed a higher prevalence of ROA than women when OA was defined by Kellgren and Lawrence or Croft grades. Of the subjects with hip pain, 98.8% had had pain for more than a month; of these, 30.2% had had pain for between one and five years and 51.1% for longer than five years. The prevalence of ROA was much higher when defined by Croft ⩾grade 3 than by the other definitions of moderate hip OA. The prevalence of moderate radiological hip OA defined by Kellgren and Lawrence and by MJS was similar.

Table 1

 Demographic characteristics and prevalences of radiographic hip osteoarthritis stratified for sex

Reliability

Table 2 shows the inter-rater reliability for different individual radiological features and three definitions of hip ROA. The inter-rater reliability for the different individual radiological features was relatively low, with the exception of the MJS assessed as a continuous variable. Kellgren and Lawrence ⩾grade 2 and MJS ⩽2.5 mm had comparable reliability, whereas for Croft ⩾grade 3 the reliability was somewhat lower.

Table 2

 Inter-rater reliability for individual radiological features and the three definitions of radiographic hip osteoarthritis studied (n = 148)

Construct validity

Table 3 shows the association between the three definitions of hip ROA for different cut off points and clinical symptoms of hip OA—hip pain, morning stiffness, and lower limb disability (moderate and severe). The percentages of subjects defined in these ways according to the different cut off points are given in table 1. Table 3 shows that severe hip ROA had a stronger association with symptoms than moderate hip ROA. The Kellgren and Lawrence grade and MJS demonstrated comparable associations with clinical symptoms of hip OA, especially with hip pain and lower limb disability for both moderate and severe hip ROA. The Croft grade had the weakest associations with clinical symptoms of hip OA.

Table 3

 Association between different definitions of radiographic hip osteoarthritis and clinical symptoms (n = 3561)

Sex as an effect modifier

We found that men had on average a larger joint space width than women (4.2 v 3.9 mm, respectively). Height was positively correlated with joint space width, and this correlation also held within each sex (β 0.16 for men and 0.14 for women). Because women in the present study were shorter than men, we adjusted for height. After adjustment for height, the sex difference in joint space width disappeared.
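
Why adjusting for height can remove a crude sex difference can be made explicit with the standard omitted variable identity for ordinary least squares. This is a sketch under assumptions: the two linear models and the symbols below are illustrative, not the models reported in this study. The identity holds exactly when both models are fitted by least squares to the same observations, with $\hat\delta$ the slope from regressing height on sex, which for a binary sex indicator equals the difference in mean height.

```latex
\begin{aligned}
\text{crude:}\quad & \mathrm{JSW}_i = a_0 + a_1\,\mathrm{male}_i + e_i \\
\text{adjusted:}\quad & \mathrm{JSW}_i = b_0 + b_1\,\mathrm{male}_i + b_2\,\mathrm{height}_i + u_i \\
& \hat a_1 = \hat b_1 + \hat b_2\,\hat\delta, \qquad
  \hat\delta = \overline{\mathrm{height}}_{\mathrm{men}} - \overline{\mathrm{height}}_{\mathrm{women}} \\
& \hat a_1 = 4.2 - 3.9 = 0.3\ \mathrm{mm}
\end{aligned}
```

If $\hat b_1$ is close to zero after adjustment, essentially all of the crude 0.3 mm difference is carried by the term $\hat b_2\,\hat\delta$, that is, by men being taller and height being positively related to joint space width.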

Table 4 shows the association between different definitions of hip ROA and the clinical symptoms of hip OA stratified for sex. For both Croft cut off points, the results showed no significant sex difference except for the association between Croft ⩾grade 4 and severe lower limb disability. However, sex was a significant effect modifier for the Kellgren and Lawrence grade (⩾grade 2): in women, the associations between symptoms (hip pain and lower limb disability) and hip OA defined by Kellgren and Lawrence ⩾grade 2 were significantly stronger than in men. In women, the association between symptoms and hip ROA was also stronger for Kellgren and Lawrence (⩾grade 2) than for MJS (⩽2.5 mm) or Croft grade (⩾grade 3). Women had a higher BMI than men, but adjusting for BMI did not change these associations.
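
The text does not state which test was used for effect modification. A simple stand-in for two strata is to compare the stratum specific log odds ratios with a z test (Woolf's method). This is a swapped-in technique, not the study's analysis: it treats each hip as independent, whereas the study used GEE. The function names are hypothetical and the counts in the test are invented.

```python
import math


def log_odds_ratio(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Log odds ratio and its Woolf standard error from a 2x2 table.

    a = ROA with symptom, b = ROA without symptom,
    c = no ROA with symptom, d = no ROA without symptom.
    """
    if min(a, b, c, d) == 0:
        raise ValueError("zero cell: apply a continuity correction first")
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se


def interaction_z(table_women: tuple, table_men: tuple) -> tuple[float, float]:
    """z statistic and two sided p value for equal odds ratios in two strata."""
    lw, sw = log_odds_ratio(*table_women)
    lm, sm = log_odds_ratio(*table_men)
    z = (lw - lm) / math.sqrt(sw**2 + sm**2)
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p
```

A positive z means the odds ratio is larger in the first stratum (here, women); a small p value suggests that sex modifies the association.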

Table 4

 Association between the three different definitions of radiographic hip osteoarthritis studied and clinical symptoms, stratified for sex (n = 3561)

Table 5 shows the association between two definitions of hip ROA (Kellgren and Lawrence ⩾grade 2 and MJS ⩽2.5 mm) and hip pain, stratified for sex and age (two categories). We divided men and women into a younger and an older group of equal size, split at the median age of 65.2 years. Older persons had a stronger association between hip ROA and hip pain than younger persons, especially when ROA was defined by Kellgren and Lawrence. There was a trend for hip ROA in younger men, especially when defined by Kellgren and Lawrence, to be more weakly related to hip pain than in women (both age categories) or older men. The results were similar for the association with lower limb disability.

Table 5

 Association between two definitions of radiographic hip osteoarthritis and hip pain, stratified for sex and age

Predictive validity

Table 6 shows the predictive validity of the three definitions for THR at follow up, expressed as the association between each definition of hip ROA at baseline and THR at follow up. Compared with the other definitions, the Kellgren and Lawrence grading system yielded the highest ratio of incident THRs at follow up to ROA cases at baseline, and showed the strongest association with THR at follow up.
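
The two predictive validity measures can be sketched from a crude 2×2 table of baseline ROA against incident THR: the proportion of baseline ROA cases that go on to THR, and the odds ratio with a 95% Woolf confidence interval. This is a simplification under stated assumptions: the study estimated odds ratios with GEE to allow for two hips per subject, which this crude version ignores. The function name is hypothetical and the counts in the test are invented.

```python
import math


def predictive_summary(roa_thr: int, roa_no_thr: int,
                       no_roa_thr: int, no_roa_no_thr: int):
    """THR proportion among baseline ROA cases, plus crude OR with 95% Woolf CI."""
    if min(roa_thr, roa_no_thr, no_roa_thr, no_roa_no_thr) == 0:
        raise ValueError("zero cell: apply a continuity correction first")
    # Incident THRs at follow up divided by ROA cases at baseline.
    proportion = roa_thr / (roa_thr + roa_no_thr)
    odds_ratio = (roa_thr * no_roa_no_thr) / (roa_no_thr * no_roa_thr)
    se = math.sqrt(1 / roa_thr + 1 / roa_no_thr + 1 / no_roa_thr + 1 / no_roa_no_thr)
    low = math.exp(math.log(odds_ratio) - 1.96 * se)
    high = math.exp(math.log(odds_ratio) + 1.96 * se)
    return proportion, odds_ratio, (low, high)
```

Running the function once per definition and cut off point gives the kind of side by side comparison summarised in table 6.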

Table 6

 Predictive validity of the three definitions for total hip replacement (THR) at follow up (n = 3561)

DISCUSSION

Based on the results of the present study, in which the Kellgren and Lawrence grade was the best predictor of THR at follow up and MJS was height dependent, we conclude that radiological hip OA might be better defined for epidemiological studies by the Kellgren and Lawrence system than by MJS.

The inter-rater reliability of the Kellgren and Lawrence grade in this study is similar to that described in published reports.4,6,22,23 In contrast to more recent studies, the original study of Kellgren and Lawrence showed a relatively low inter-rater reliability (ICC of 0.40).10 The inter-rater reliability of MJS according to Croft in the present study was similar to that in previous studies.4,7,22,24,25 For Croft ⩾grade 3, the inter-rater reliability in the present study had a κ value of 0.51, compared with values of 0.37 to 0.79 in earlier studies.4,7,25,26 The wide range of inter-rater reliability between these studies is mainly explained by the different cut off levels used. One study14 used the same cut off level as the present study and reported a similar κ value of 0.41. However, in Croft's original study the κ values were based on measurement of the size of the individual radiological features and not on atlas based grades. For subchondral sclerosis and osteophytes, the inter-rater reliability in the present study was similar to that reported by Croft.14

The validity of the different definitions of hip ROA has been poorly investigated in previous studies. In the present study we investigated construct and predictive validity. Because of the absence of a gold standard, we expressed predictive validity as the ability of each definition to predict THR at follow up. The requirement for THR has been proposed as a potential outcome measure on the assumption that THR is undertaken only in patients with severe disease, from both a symptomatic point of view (painful and disabling disease) and a structural one (overall severity or advanced joint space narrowing).27,28 The lower limb disability assessed by the health assessment questionnaire (HAQ) in the Rotterdam study is not a disease specific outcome measure; it measures arthritic conditions in general. On the other hand, lower limb disability is an important symptom of hip OA, and OA is the most important cause of disability in elderly people.29 Hence we included lower limb disability assessed by the HAQ, as well as the presence of hip pain and morning stiffness, as important symptoms of hip OA in the analysis. Overall, the Kellgren and Lawrence grading system had the best predictive validity of the definitions of hip ROA, and associations with symptoms of hip OA (construct validity) similar to those of MJS. MJS performed better than Croft grade on both construct and predictive validity. The weak associations between Croft ⩾grade 3 and symptoms of hip OA in the present study can be explained by the high prevalence of moderate hip OA under this definition and, presumably, the resulting low specificity when hip pain is used as the gold standard. This makes it difficult to compare the Croft definition of moderate hip OA (⩾grade 3) with the other definitions. An earlier study reported a similar prevalence of hip OA defined by Croft ⩾grade 3 and by MJS ⩽2.5 mm, and also a similar prevalence of hip pain in “disease positive” hips.14 When we excluded subjects with an incident hip fracture during the follow up period and repeated the analysis of predictive validity, the results did not change materially.

The second objective of our study was to investigate whether the relation between the three definitions and symptoms was sex dependent. Surprisingly, only the strength of the association between the Kellgren and Lawrence grading system and symptoms of hip OA was sex dependent. This finding has not been reported in previous studies. A possible explanation for this sex dependency could be a stronger relation between (femoral) osteophytes and hip pain in women: in women we found a stronger relation between osteophytes and hip pain (odds ratio 1.7 for women v 1.2 for men), although the prevalence of osteophytes was lower in women (34.3% v 43.6% in men). Contrary to our expectation, the strength of the association between MJS and symptoms of hip OA was not sex dependent.

A stronger association between hip ROA and hip pain was found in older persons than in younger ones. Because of limited power (the younger male category had a smaller sample size), the sex difference was not significant. A possible explanation for the age difference is that younger persons have greater lower limb muscle strength than older persons. Reduced muscle strength is regarded as a risk factor for pain and disability in OA,30–32 and exercise therapy aimed at improving muscle strength has a beneficial effect on pain in patients with OA of the hip or knee.33,34

The results of the present study may be affected by the quality of the radiographs; the measurement of joint space width (MJS) in particular is sensitive to radiographic technique. Important variables in the radiographic procedure are the position of the central ray of the x ray beam relative to the centre of the joint, the distance between the centre of the joint and the x ray film, and the distance between the x ray focus and the film (focus to film distance). Centring the x ray beam on the umbilicus instead of on the superior aspect of the symphysis pubis has been shown to increase joint space width by about 10% on average.16 The focus to film distance may also modify the measurement.35 On the other hand, in Croft's study14 the x ray beam was also centred 10 cm higher than in a standard anteroposterior view of the pelvis.
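
To see what a 10% inflation of joint space width would mean for the MJS cut off points, the arithmetic can be sketched as below. The assumption, labelled in the code, is that the reported average increase applies uniformly to every hip; `equivalent_reference_mjs` is a hypothetical helper, not a published correction.

```python
MAGNIFICATION = 1.10  # assumption: uniform 10% inflation with umbilicus centring


def equivalent_reference_mjs(measured_mm: float, magnification: float = MAGNIFICATION) -> float:
    """Rescale an MJS measured with umbilicus centring to the symphysis-centred scale."""
    return measured_mm / magnification


# Where each MJS cut off point would fall on the symphysis-centred scale.
for cutoff in (2.5, 2.0, 1.5):
    print(f"{cutoff} mm measured ~ {equivalent_reference_mjs(cutoff):.2f} mm")
```

Under this assumption, a hip measuring 2.7 mm on an umbilicus-centred film (not ROA by the ⩽2.5 mm cut off) would measure about 2.45 mm on a symphysis-centred film, which illustrates how centring can shift hips across a fixed MJS threshold.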

A potential source of bias in this study is probable health based selection. The subjects in the present study had to be mobile enough to visit the research centre at baseline and follow up and to survive the follow up period (mean 6.6 years), so participants were generally healthier than non-participants. In other words, patients with the most severe symptoms were probably not included. It seems likely that, in this younger and healthier population with less frequent lower limb disability and hip pain, both the prevalence of hip ROA and the magnitude of the association between the different definitions of hip ROA and symptoms of hip OA are underestimated. Because a stronger relation between hip ROA and hip pain was found in older persons, especially when ROA was defined by Kellgren and Lawrence, this underestimation may apply particularly to the Kellgren and Lawrence grade.

When we compared the results of Kellgren and Lawrence grade and MJS we found the following differences. Kellgren and Lawrence was the best predictor for a THR at follow up. As described earlier by Buckland-Wright36 and Lanyon et al,37 we also found that men had larger joint spaces than women. After adjustment for height these joint space differences between men and women disappeared. Considering these results, it is doubtful whether the given cut off point of MJS is valid for people of short stature.

When we stratified the associations between each definition and the symptoms of hip OA (hip pain and lower limb disability) for sex, we found, surprisingly, that the associations of the Kellgren and Lawrence grade with hip pain and with lower limb disability were significantly stronger in women than in men.

Based on these results, we concluded that the Kellgren and Lawrence grade is still a useful definition of hip ROA for epidemiological studies.

APPENDIX

The Kellgren and Lawrence grading system is shown in table 7, Croft’s modification of the Kellgren and Lawrence grading system in table 8, and Croft’s measurement of the minimal joint space in table 9.

Table 7

 The Kellgren and Lawrence grading system

Table 8

 Croft’s modification of the Kellgren and Lawrence grading system (“Croft grade”)

Table 9

 Croft’s measurement of the minimal joint space (lateral, superior, axial, medial)

Acknowledgments

This study was supported by a grant from the Dutch Arthritis Association. We are very grateful to F van Rooij, E van der Heijden, R Vermeeren, and L Verwey for collection of follow up data. Moreover, we thank the participating general practitioners, the pharmacists, the many field workers at the research centre in Ommoord, and of course all the participants.

REFERENCES