Objective: To test data mining methods used in pharmacosurveillance in order to identify potential emerging disease–nuisance associations in the national occupational disease surveillance and prevention network (RNV3P) database.
Methods: Proportional reporting ratios (PRR) used in pharmacosurveillance were applied to detect disproportional reporting of disease–nuisance associations which are not compensated by the national social security system.
Results: The 24 785 reports of the RNV3P yielded 3830 different disease–nuisance associations, 1344 of which were reported at least twice; 422 of these did not give entitlement to compensation by the social security system. Among these associations, 162 were potentially emergent and generated a signal, eight of which involved cancer.
Conclusion: This work is the first stage of an exploratory investigation submitting the questions raised to experts and involving participants in the network in reflection on the hypotheses generated.
The detection of new work-related diseases is an important public health issue, especially as large numbers of new processes and products are regularly introduced. The Réseau National de Vigilance et Prévention des Pathologies Professionnelles (RNV3P) was created in 2001 in France to monitor occupational and environmental diseases and, in return, to allow participating centres, when confronted with an unusual potentially work-related problem, to search the RNV3P database for similar cases reported elsewhere in France.1–3 One of its aims is to develop statistical methods or models to detect potential emerging associations between diseases and nuisances. Since 2001, this network has carried out systematic and standardised recording of all occupational health reports originating from the 29 consultation centres for occupational diseases of university hospitals in metropolitan France, to which patients are referred by occupational physicians, hospital practitioners, or general practitioners for diagnosis of the occupational nature of their disease.
The aim of this work is to test the value and limitations of the application, to the data in the RNV3P database, of safety data mining methods used in pharmacosurveillance,4 5 and in particular proportional reporting ratios (PRR). The goal is to detect previously unrecognised disease–nuisance associations (called “associations”) which do not yet give entitlement to compensation and which generate a signal with these methods. These associations are likely to reveal a new risk for workers’ health, so in this paper they will be termed “potentially emerging associations”. While this approach cannot affirm a causal relationship between the nuisance and the disease observed, it may generate new hypotheses which can be submitted to experts for evaluation.
The data analysed are the occupational health reports collected in the RNV3P database for the years 2001–2005. Data quality is supported by a computerised tool for entering, coding and checking the coherence of the information, by annual training in quality assurance for users, by a national coding group and by centralised quality control. Each occupational health report is a structured expert clinical report whose principal coded items are: principal disease and co-morbid diseases (ICD-10), principal nuisance and up to four other nuisances (INRS-CNAM), professional position (ISCO-88, edited by ILO) and sector of professional activity (NAF, edited by INSEE). These national or international codes are hierarchical and can be analysed at different degrees of precision. An association is considered as recognised, and cannot therefore be emergent, if it potentially gives entitlement to compensation (“compensated association”).
The data mining methods used in pharmacosurveillance are independent of data external to the base. They analyse drug–event pairs, which are the equivalent of the disease–occupational nuisance pairs included in the RNV3P. The structure of the disease–occupational nuisance matrix of the RNV3P is similar to that of the adverse drug event matrices used by the principal pharmacosurveillance databases (MEDWATCH in the USA, LAREB in the Netherlands, VIGIBASE for the WHO)6: they are “empty” matrices, consisting mainly of cells whose content is nil, with only about 1% of the possible pairs actually reported. For these reasons, as a preliminary exploratory approach we have chosen to use pharmacosurveillance methods to analyse the RNV3P data.
These methods generate a statistical signal when there is a discrepancy between the observed number of reports of a disease–nuisance pair in the RNV3P database and the expected number for this same pair in the database. For each pair, these methods generally use a 2×2 contingency table in which the number of reports is entered (table 1).
Each measure of association uses a different probability model to describe the distribution of the number of reports. There are few published, large-scale, systematic comparisons of the data-mining methods currently used for pharmacosurveillance.7 However, the results of the different methods tend to converge when the number of reports increases for a given pair.5 The method tested here is “proportional reporting ratios” (PRR),8 used by the British Medicines Control Agency (MCA) to analyse the data of the ADROIT (Adverse Drug Reactions On-line Information Tracking) pharmacosurveillance database. The equation which gives the value of PRR for a given pair is $\mathrm{PRR} = \dfrac{a/(a+b)}{c/(c+d)}$. This is similar to the relative risk used in epidemiology. The PRR is equal to the ratio of the probability $a/(a+b)$ of presenting, within the database, the target disease if exposed to the target nuisance, over the probability $c/(c+d)$ of presenting this disease if exposed to any nuisance other than the target nuisance. Two signal generation criteria have been proposed with PRR. The first (PRR1) requires three conditions: number of cases observed $a \geq 3$ AND $\mathrm{PRR} \geq 2$ AND $\chi^2 \geq 4$ (Yates-corrected, 1 degree of freedom), the χ2 with Yates correction being calculated from the same contingency table.8 The second signal generation criterion (PRR2) is based on the confidence interval of PRR.5 This is calculated as follows: $95\%\ \mathrm{CI} = \exp\left(\ln \mathrm{PRR} \pm 1.96\,\mathrm{SE}(\ln \mathrm{PRR})\right)$, with $\mathrm{SE}(\ln \mathrm{PRR}) = \sqrt{1/a - 1/(a+b) + 1/c - 1/(c+d)}$. A signal is generated if the lower limit of the 95% CI is greater than unity ($\mathrm{LI}_{95}(\mathrm{PRR}) > 1$). Given these definitions, the pairs signalled by PRR1 are expected to be almost entirely a subset of those signalled by PRR2, whose selection criteria are less restrictive.
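As an illustration only, the two signal generation criteria can be sketched in Python (the analysis itself was run in S-Plus). The function name `prr_signal` and the cell layout are assumptions: a is taken as the number of reports with the target disease and target nuisance, b as other diseases with the target nuisance, c as the target disease with other nuisances, and d as all remaining reports. The standard error used is the usual one for the logarithm of a ratio of two proportions.

```python
import math


def prr_signal(a, b, c, d):
    """Compute PRR, its 95% CI, the Yates chi-square and the PRR1/PRR2 flags.

    Assumed 2x2 layout (hypothetical, table 1 is not reproduced here):
      a: target disease, target nuisance
      b: other diseases, target nuisance
      c: target disease, other nuisances
      d: other diseases, other nuisances
    """
    if a <= 0 or c <= 0 or b < 0 or d < 0:
        raise ValueError("a and c must be positive; b and d must be non-negative")

    n = a + b + c + d
    prr = (a / (a + b)) / (c / (c + d))

    # Chi-square with Yates continuity correction, 1 degree of freedom.
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    corrected = max(0.0, abs(a * d - b * c) - n / 2)
    chi2 = n * corrected ** 2 / denominator if denominator > 0 else 0.0

    # Standard error of ln(PRR) and the 95% confidence interval.
    se = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(prr) - 1.96 * se)
    upper = math.exp(math.log(prr) + 1.96 * se)

    return {
        "prr": prr,
        "chi2_yates": chi2,
        "ci95": (lower, upper),
        "prr1": a >= 3 and prr >= 2 and chi2 >= 4,  # three-condition criterion
        "prr2": lower > 1,                          # lower CI bound above unity
    }


# Hypothetical counts, not taken from the RNV3P database.
strong = prr_signal(10, 90, 20, 880)
weak = prr_signal(3, 97, 10, 890)
```

With these invented counts the first pair satisfies both criteria and the second satisfies neither, because its confidence interval includes unity.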
A two-way table crossing the dimensions “principal disease” and “principal nuisance” was created, with each cell holding the number of reports for that disease and nuisance. The most detailed analysis used the full four-digit nuisance code but only the first three digits of the ICD code (which has four digits at its most informative level), in order to avoid dispersing reports over too many codes. A programme based on SPlus 6.1 software applied PRR to the RNV3P data, calculating for each pair the value of the PRR and its confidence interval, as well as the value of χ2 with Yates correction. The two signal generation criteria were then applied in turn. As in pharmacosurveillance, pairs reported only once are taken into account when calculating the margins of the 2×2 contingency table but cannot generate a signal.
Analysis of the data successively presents the distribution of the disease–nuisance associations in terms of numbers of reports, then the distribution of these associations according to whether they give entitlement to compensation or not (compensated versus non-compensated associations) and of their ability to generate a signal with PRR (PRR1 and PRR2 criteria).
Description of the data
The 24 785 reports in the RNV3P database gave 3830 different disease–nuisance associations, of which only 47% corresponded to associations giving entitlement to compensation. Figure 1 presents the distribution of the number of compensated and non-compensated associations reported from 1 to more than 50 times. Only one-third of associations were reported more than once (1344 associations); however, these accounted for 22 299 reports, or 89% of all reports in the database. Two-thirds of the associations reported at least twice (n = 922) gave entitlement to compensation (table 2), a total of 17 196 reports. The compensated diseases most frequently encountered were related to asbestos (20 different associations, 5483 reports), dermatitis (149 associations, 2485 reports), bone and muscle disorders related to repetitive trauma and load carrying (70 associations, 2284 reports), asthma (132 associations, 1774 reports), occupational deafness (8 associations, 874 reports) and rhinitis (90 associations, 744 reports).
The 422 non-compensated associations reported at least twice accounted for 5103 reports. Some are well known to clinicians and are very frequently observed, such as psychiatric disorders related to psychosocial and organisational stress (3770 reports).
Application of proportional reporting ratios
Figure 2 shows the number of disease–nuisance associations and the number of reports corresponding to the different stages of analysis: total in the database, exclusion of associations only reported once, and associations generating a signal using the PRR1 and PRR2 criteria. With the PRR1 criterion, 17% of associations generated a signal, or half the associations reported at least twice. With the PRR2 criterion, 26% of associations generated a signal, or three-quarters of the associations reported at least twice. However, the percentage of all reports involved varies little: 81% with PRR1 and 83% with PRR2. Table 3 presents the proportion of disease–nuisance associations and the proportion of reports generating a signal according to whether they are compensated or non-compensated.
Among the 162 non-compensated associations generating a signal with PRR1, 31 use codes which are too broadly based to be sufficiently informative in detecting emerging associations, for example solvents or organic acids for nuisance, or breathing difficulties for respiratory disorder. Several associations found as potentially emergent in the network are, at this time, being investigated in the scientific community: trichloroethylene and kidney tumour,9 10 laryngeal tumour and asbestos,11 12 malignant connective tissue tumour and pesticides,13 lung cancer and paints, organic solvents and thinners,14 15 systemic sclerosis and solvents,16 17 sleep disorders and solvents,18 19 multiple chemical sensitivity and formaldehyde,20 sterility and glycol ether,21 and wood or vegetable dust and sarcoidosis.22 The PRR2 criterion identified all the associations revealed by PRR1, as well as further associations of which 92% were only reported twice in the database. The subgroup of cancerous diseases (table 3) included 72 associations (1613 reports) reported at least twice, comprising 46 compensated associations (1517 reports) and 26 non-compensated associations (96 reports).
The non-compensated associations which generated a signal with the PRR1 criterion are shown in table 4, with the exception of one poorly-coded association which underwent further quality control. The seven complementary associations which generated a signal only with PRR2 and χ2 were all reported twice, except for the association kidney cancer and solvents (four reports).
Value of the application of pharmacosurveillance methods to the RNV3P
This work is a preliminary step towards automated systematic analysis of RNV3P data in order to detect potentially emerging disease–nuisance associations. The aim is twofold: to reveal, as early as possible, the new occupational risks which can be detected from the network database (hypothesis-generating function) and to emphasise the role of the network in forming expert opinion, as the hypotheses generated are transmitted to experts in the occupational disease centres who assess their pertinence.
The results presented in this paper demonstrate that proportional reporting ratios, as well as identifying compensated associations, have detected some formerly “unrecognised” associations, reported several times by the network, and which originate either from potential toxicological problems (associations where the relationship of cause to effect is not yet established) or from problems of social recognition: work-related health disorders which do not yet give entitlement to compensation. This approach by means of associations has also revealed wide discrepancies in the frequency with which disease–nuisance pairs are reported in the database.
As an illustration, pharmacosurveillance methods could also be used for pairing other variables, such as disease and activity sector or disease and profession. These new matrices yielded further information. For example, the association of chronic kidney failure (N18) with vehicle maintenance and repair (NAF 502Z) generated a signal with PRR. The hypothesis generated by this signal led to further analysis of these cases. The majority followed a nephritic syndrome (N05) cited as a co-morbid disease. They had little chance of attracting the clinicians’ attention as they came from three different consultation centres. Lastly, these cases did not generate signals during analysis of the disease×nuisance or disease×occupation matrices because they had different principal nuisance or occupation codes. It is now the task of the experts to give their opinion on the relevance of this association.
Inherent limitations of the data
RNV3P suffers from the same limitations as the pharmacosurveillance networks: there is a lack of precise information on the population from which the patients come; a potential link between the patient’s disease and occupational nuisances has to be suggested by a physician, who refers the patient to an occupational health consultation centre in a university hospital; and, lastly, the range of codes used for diseases or nuisances is very broad (synonyms or very closely related concepts) which poses a problem for analysis, as it does in pharmacosurveillance.
With regard to the first point, the aim is not however to calculate the incidence of associations but to look for potentially emerging diseases. A validation study of the network2 showed high stability, year after year, of the characteristics of the patients and of the physicians supplying the reports. With regard to the second point, heightened awareness of potential links between diseases and occupation by the medical profession should make for improved patient recruitment.
Lastly, the problem of the broad range of codes interferes with the results. For example, the relatively low proportion of compensated diseases generating a signal (55% of the associations reported at least twice generate a signal with the PRR1 criterion, 76% with the PRR2 criterion) is due to situations where the observed number of reports is low while the pathology and/or the nuisance are reported a large number of times. This should indicate that the association has become infrequent, but in practice it may also happen because the association is described with different synonymous disease codes or nuisance codes, so that the same entity is split across several associations. Nevertheless, the compensated diseases generating a signal represent 91% of the reports corresponding to compensated associations reported at least twice (93% for PRR2). This problem of a broad range of codes can be addressed by paying attention to coding (standardisation of practices by regular training courses) and also to analysis (grouping of codes). At present, two-thirds of the associations generated in the RNV3P are only reported once, even though the ICD disease code is limited to three digits in order to reach an acceptable compromise between obtaining accurate information and eliminating multiple classification bias related to the large number of disease codes available. An individual approach to the nuisance codes is necessary to assess how they can be grouped without running the risk of losing valuable information in terms of detection. Code groupings will have to be validated by all the experts: a balance must be struck between the need to preserve useful distinctions and the risk of not detecting a signal sufficiently early. However, one factor helps to decrease over time the number of associations reported only once: the network is fed by 5000 new cases every year.
Inherent limitations of the methods
These statistical methods generate hypotheses which have clinical significance only after they have been evaluated by the experts in relation to other parameters (mechanistic and pathophysiological factors, degree of exposure, duration of exposure, extra-professional risk factors, etc). Several authors warn against “over-confidence in the results of data mining” or “enthusiasm generated by promise of the data mining tools”23 24 and remind us that while data mining methods are useful for systematic screening of large databases “to discover hidden patterns of associations or unexpected occurrences”, they must only be used with full awareness of the limitations of these methods and of the databases to which they are applied.
The choice of method and of the criterion of signal generation must be determined so that they do not identify too many associations which would be considered by the experts as false positives. In particular, these methods are highly sensitive to the number of cases reported and to low overall counts for the target disease and the target nuisance. Therefore, when the number of reports is small, the signal may be very strong even if only two cases are observed. Conversely, certain known and compensated associations do not generate a signal if they are reported relatively few times compared with the number of reports citing either the target disease or the target nuisance. Lastly, a very strong signal generated by a nuisance is likely to attenuate the other associations with the same nuisance, and in the same way a very strong signal with a given disease may attenuate the other associations with the same disease.
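These three behaviours can be illustrated numerically. The Python sketch below uses invented counts (only the database total of 24 785 reports comes from the data described above), and the helper `prr` is a hypothetical name for the ratio (a/(a+b))/(c/(c+d)).

```python
def prr(a, b, c, d):
    """Proportional reporting ratio for one 2x2 table (c must be positive)."""
    return (a / (a + b)) / (c / (c + d))


# 1. Rare disease and rare nuisance: two reports are enough for a huge ratio.
#    Nuisance cited 5 times, disease cited 3 times, in 24 785 reports.
rare_pair = prr(2, 3, 1, 24779)

# 2. Frequent disease and frequent nuisance: 50 joint reports give no signal
#    because the nuisance (5000 reports) and disease (500 reports) are common.
common_pair = prr(50, 4950, 450, 19335)

# 3. Masking: a nuisance cited 5100 times, 5000 of them with one dominant
#    disease. A second disease seen 10 times with this nuisance looks
#    unremarkable...
masked = prr(10, 5090, 40, 19645)
#    ...whereas the same 10 reports stand out once the 5000 dominant reports
#    are set aside from the nuisance margin.
unmasked = prr(10, 90, 40, 19645)

for label, value in [
    ("rare pair", rare_pair),
    ("common pair", common_pair),
    ("masked", masked),
    ("unmasked", unmasked),
]:
    print(f"{label}: PRR = {value:.2f}")
```

The ratio for the rare pair runs into the thousands although it rests on two reports, while the common pair and the masked pair both fall below 1. This is one reason why signals of this kind can only be read as hypotheses for expert review.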
Other data mining methods used in pharmacosurveillance have been tested on the RNV3P database6: other “classical or frequentist forms of disproportionality analysis” (reporting odds ratios or ROR and its derivative Yule’s Q), certain methods using Poisson distribution (Poisson, sequential probability ratio test or SPRT2 methods) and a Bayesian method (Bayesian confidence propagation neural network or BCPNN, also called the information component (IC), WHO or Bate’s method). The sensitivity of these methods certainly differs depending on the number of times the association is reported. However, the above-mentioned tests, applied with the usual thresholds, produced overall about the same proportion of positive signals (with the exception of SPRT2 which is much less sensitive for pairs which are reported only a few times). No extensive study has demonstrated the superiority of any one of these methods compared with another.4 7 However, a recent study25 comparing PRR and the multi-item gamma Poisson shrinker (MGPS, also called empirical Bayes screening or EBS) emphasised the greater stability of the results of the MGPS in situations with small numbers, and so a smaller proportion of false positives. To make data mining tools more effective and to reduce the number of false positives, other factors have been taken into account in analysis, in particular with PRR: “strength of the signal”, its unexpectedness (“whether it is really new”), its “clinical importance (severity and seriousness)”, and its preventable nature (“the potential for preventive measures”).8 In other words, “those given the highest priority are Strong, New, Important and potentially Preventable (SNIP)”. This list has been supplemented by the WHO Programme for International Drug Monitoring26 which proposes additional criteria to analyse the results of the BCPNN method, such as a two-fold rise in signal generation over a 3-month period, a positive rechallenge test, etc.
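For comparison with PRR, the frequentist measures named above can be computed from the same 2×2 table. The Python sketch below is illustrative only: the function name `disproportionality` is hypothetical, the cell layout is the assumed one (a: target disease and target nuisance; b: other diseases, target nuisance; c: target disease, other nuisances; d: all other reports), and the last quantity is merely the crude observed-to-expected log ratio that underlies the information component, without the Bayesian prior used by the BCPNN. No signal thresholds are applied, since none are specified here.

```python
import math


def disproportionality(a, b, c, d):
    """Return ROR, Yule's Q and a crude observed/expected log2 ratio."""
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for these measures")

    n = a + b + c + d
    ror = (a * d) / (b * c)                      # reporting odds ratio
    yules_q = (a * d - b * c) / (a * d + b * c)  # equals (ROR - 1) / (ROR + 1)
    expected = (a + b) * (a + c) / n             # count expected under independence
    crude_ic = math.log2(a / expected)           # no Bayesian shrinkage applied
    return {"ror": ror, "yules_q": yules_q, "crude_ic": crude_ic}


# Hypothetical counts, not taken from the RNV3P database.
measures = disproportionality(10, 90, 20, 880)
```

Yule's Q is a rescaling of the ROR onto the interval from −1 to 1, which is why the two measures rank pairs identically; the shrinkage applied by Bayesian methods matters mainly for pairs with small counts.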
This type of filter could well be used in the RNV3P. Another interesting filter could reveal cases from different occupational disease consultation centres (as they may be liable to be missed by experts in each individual centre).
Lastly, only the principal nuisance is taken into account in this work, whereas other so-called secondary nuisances may be coded by the experts. Methods taking all nuisances into account are currently being developed.
Early detection of new disease–occupational nuisance associations is an important health issue. It cannot be done on the basis of the declarations of compensated occupational diseases, which refer to disease–nuisance associations already accepted by the experts. The application to the RNV3P data of data mining methods used in pharmacosurveillance has led in particular to the detection of associations which are currently being investigated.
This exploratory investigation has raised clinical queries for submission to the experts and is involving those participating in the network in reflection on the hypotheses generated. It now appears necessary to complement these methods with a modelling approach taking into account all nuisances reported by the expert for each disease and their synergistic effects. Other elements which could be taken into account include the degree of imputability attributed by the expert to each nuisance, established from data concerning exposure (duration, dose, chronological relation of symptoms to exposure) and pathophysiological plausibility. When the number of reports allows, further investigation will study the extension of disease–occupational nuisance associations over time as well as within occupations and sectors of activity.
The detection of new work-related diseases can be enhanced by the application of the safety data mining methods used in pharmacovigilance.
The application to the RNV3P data of data mining methods has led to the detection of associations which are currently being investigated.
Further investigation applying safety data mining methods to RNV3P data will study the extension of disease–occupational nuisance associations over time as well as within occupations and sectors of activity.
The authors wish to thank all physicians from the occupational health consultation centres of the university hospitals who supply their reports to the RNV3P, as well as the partners and organisers of the network: the Caisse Nationale d’Assurance Maladie (CNAM), the Agence Française de Sécurité Sanitaire de l’Environnement et du Travail (AFSSET) and the Société Française de Médecine du Travail (SFMT).
Competing interests: None declared.