Ensuring that individuals (or organisations) cannot be identified from research outputs brings additional hurdles when using administrative data, and can require extra levels of preparation. Data providers will often review any outputs before they are published, and each provider has a set of criteria that every output must clear. In the case of data that requires secure access, this includes obtaining outputs from the secure space while you are drafting articles, reports and other publications.
Statistical disclosure occurs when a small number of observations are isolated, say in a count, and could be used to identify an individual. This can take the form of primary disclosure, where the small number of observations is shown directly, or secondary disclosure, where another source (say another table in your report, or external official statistics) can be used to calculate a hidden small number.
A particular consideration when working with administrative data from a public service is the individuals who work there; for example, geographical breakdowns of court data could inadvertently identify members of the judiciary.
Data providers require researchers to undergo training before being given access to their data. The UK Data Service run a Safe Researcher Training course with the Office for National Statistics (ONS) and Administrative Data Research UK (ADR UK), often found here:
- Safe User of Research data Environments (SURE) Training course – run by ONS, the UK Data Service and the Administrative Data Research Network.
Alternatively the Medical Research Council run the following courses here, which are also accepted by some providers:
- MRC Regulatory Support Centre: Research Data and Confidentiality e-learning
- MRC’s Research, GDPR and confidentiality
If, like me, you end up going through multiple forms of these training courses and reading through a few different sets of guidelines and rules, it can be difficult to keep track of the different boundaries and criteria. So below I have highlighted a few dos and don’ts to keep in mind throughout a research project.
UK Data Service provides a “Handbook on Statistical Disclosure Control for Outputs” here.
The following is not a comprehensive account of the criteria and rules regarding statistical disclosure. There are many grey areas, and it is best to maintain good communication with the data providers and discuss issues on a case-by-case basis.
Dos
-
Keep to a threshold of 10 at all times
From a statistical standpoint this will be pretty obvious, as low frequencies can already cause problems when it comes to running tests and fitting models. If you don’t have low frequencies from the start they’ll be less of a problem later. This can mean grouping categories together, or defining the variables a little differently.
By “at all times” I mean from the very first frequency counts, down to the finest detail of every cell in the largest cross-tabulation. Secondary disclosure can be tricky to navigate when writing up your research if there are any small numbers. For example, you usually want to start with a frequency table that lays out the scale of the problem you want to address; there are no small numbers so it is fine on its own. However, if you come across a small number later, when you break down the problem by controlling factors and explanatory variables, then what you can publish and pass through disclosure control could be limited. If you know where the small numbers are you can make better decisions in your research and be prepared.
Not all data providers have a threshold of 10; some are lower at 5, so keeping to 10 is an easy way to make sure you always stay within the criteria.
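As a minimal sketch of what checking "at all times" might look like in practice, the following counts every cell of a two-way cross-tabulation and lists those below the threshold. The function name `small_cells`, the record layout and the example data are all hypothetical, and the threshold of 10 is only the rule of thumb described above; your provider's own criteria take precedence.

```python
from collections import Counter

THRESHOLD = 10  # rule of thumb from this post; providers may set their own


def small_cells(records, row_key, col_key, threshold=THRESHOLD):
    """Return {(row, col): count} for every observed cell below the threshold.

    Cells with a count of zero never appear in the data, so they are not
    reported here and need checking separately.
    """
    counts = Counter((r[row_key], r[col_key]) for r in records)
    return {cell: n for cell, n in counts.items() if n < threshold}


# Made-up records purely for illustration.
records = (
    [{"region": "North", "outcome": "A"}] * 25
    + [{"region": "North", "outcome": "B"}] * 4
    + [{"region": "South", "outcome": "A"}] * 12
    + [{"region": "South", "outcome": "B"}] * 10
)

flagged = small_cells(records, "region", "outcome")
print(flagged)  # the North/B cell (count 4) is the only one below 10
```

Running a check like this on the finest cross-tabulation you expect to need, before writing anything up, shows early on which categories may have to be grouped.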
-
Request the data table underlying every graphic and visualisation for clearance
Perfecting the aesthetics of a graph or a visualisation can be a tedious affair, with minute changes to scales, labels or even the colour scheme. At the same time, getting outputs through disclosure checks can also be tedious and lengthy at times (I once experienced delays over a graph where a single pixel could be used to calculate a small number that was hidden in a later table).
It is, therefore, advisable to request the data used to create the visualisation alongside the visualisation itself. That way you can re-create the graphic with the minute aesthetic changes without the need to pass through the disclosure control again.
On submitting a visualisation for clearance you may be required to provide the underlying data anyway to prove that there are no small counts represented in the graphic.
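One way to make this routine is to build the aggregated table first and draw the graphic from it, so the table you submit is exactly what the graphic shows. The sketch below assumes a simple bar chart of counts; the function names, the `court_type` field and the data are hypothetical, and the in-memory buffer stands in for whatever file you would save inside the secure environment.

```python
import csv
import io
from collections import Counter


def chart_table(records, key):
    """Aggregate records into the (category, count) rows a bar chart would plot."""
    counts = Counter(r[key] for r in records)
    return sorted(counts.items())


def write_chart_table(rows, handle, header=("category", "count")):
    """Write the aggregated rows as CSV so they can be submitted with the graphic."""
    writer = csv.writer(handle)
    writer.writerow(header)
    writer.writerows(rows)


# Made-up records purely for illustration.
records = [{"court_type": "Crown"}] * 30 + [{"court_type": "Magistrates"}] * 45

rows = chart_table(records, "court_type")
buffer = io.StringIO()  # stands in for a real file in the secure environment
write_chart_table(rows, buffer)
print(buffer.getvalue())
```

Once the cleared table is outside the secure space, the graphic can be redrawn from it as many times as the aesthetics require.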
-
Plan your outputs before you start
Think about what outputs you might want before you get started. Similar to my first “do”, being prepared is key. What will be an absolute “must have” in the article/report? What will you show in tables or figures? What will connect together and risk any secondary disclosure, and which would you change to prevent disclosure?
-
Be cautious of creating graphs from individual level data
The problem with individual data points on graphs is self-identification, and this includes outputs from a statistical model that show the values of any observed variable. You know your own data, so a person could work out which point corresponds to themselves, and to anyone else whose value they know.
That is not to say it is not possible to have a graph with individual data points, like outliers on a boxplot, but a good explanation will have to be given that justifies why it does not disclose any personal data.
-
Check for dominance
Dominance occurs when the largest “unit” accounts for more than 43.75% of the total, or all but the top two units account for less than 12.5% of the largest unit (as defined on the UK Data Service Safe Researcher training course; check with your provider for their specific criteria). Although this does not usually happen at the individual level, consider for example identifying a particular court or centre. Data owners will not allow you to upset the organisations providing the data, and it may not be in the public interest to identify local practices. You wouldn’t, for example, identify an interviewer or branch of a study’s survey.
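The two dominance conditions can be sketched as a small check. This follows the wording above literally, with 43.75% and 12.5% as default parameters; the function name `is_dominant` and the example figures are hypothetical, and your provider may define or apply the rule differently.

```python
def is_dominant(values, share_limit=0.4375, remainder_limit=0.125):
    """Apply the two dominance conditions as worded in this post.

    values: one non-negative total per unit (for example per court or centre).
    Returns True if either condition is met.
    """
    ordered = sorted(values, reverse=True)
    total = sum(ordered)
    if not ordered or total <= 0:
        raise ValueError("need at least one unit with a positive total")
    largest = ordered[0]
    remainder = sum(ordered[2:])  # everything except the top two units
    largest_share_too_big = largest > share_limit * total
    remainder_too_small = remainder < remainder_limit * largest
    return largest_share_too_big or remainder_too_small


# Made-up unit totals purely for illustration.
print(is_dominant([50, 30, 20]))      # True: the largest unit is 50% of the total
print(is_dominant([30, 30, 20, 20]))  # False: neither condition is met
```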
Don’ts
-
Don’t display a complete set of frequencies
Presenting frequencies can be key to portraying scale; however, they make it difficult to prevent later secondary disclosures. Personally I avoid frequency tables wherever possible, especially cross-tabulations, and stick to in-line rounded figures where necessary. Alternatively, present the distribution in terms of percentages, rates or other appropriate statistics. Beware that a total frequency can be used to calculate any hidden small values.
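To see why a published total is risky, here is a minimal sketch of the subtraction a reader could do. The function name `recover_suppressed` and the figures are made up for illustration; `None` marks a suppressed cell.

```python
def recover_suppressed(published_cells, total):
    """Return the hidden value if exactly one cell is suppressed, else None.

    published_cells: counts for one row or column, with None for a suppressed cell.
    total: the published total for that row or column.
    """
    hidden = [c for c in published_cells if c is None]
    if len(hidden) != 1:
        return None  # with zero or several hidden cells, subtraction alone is not enough
    return total - sum(c for c in published_cells if c is not None)


# Made-up row: the third cell was suppressed, but the total was published.
row = [40, 25, None, 31]
print(recover_suppressed(row, total=100))  # 4
```

The same arithmetic works across tables, which is why a total in one part of a report can undo a suppression in another.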
-
Don’t provide the minimum and maximum
The minimum and maximum often only relate to a single observation, a big no-no. A suggested solution can be to average the top/bottom 10 instead. In general, summarising continuous variables has its own set of problems, and categorising these variables for display purposes can make things a lot easier.
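Averaging the top and bottom 10 might be sketched as follows. The function name `averaged_extremes` and the example values are hypothetical, and `k=10` simply mirrors the threshold used throughout this post.

```python
from statistics import fmean


def averaged_extremes(values, k=10):
    """Return (mean of the k smallest values, mean of the k largest values)."""
    if len(values) < k:
        raise ValueError(f"need at least {k} observations")
    ordered = sorted(values)
    return fmean(ordered[:k]), fmean(ordered[-k:])


# Made-up values purely for illustration: the integers 1 to 100.
values = list(range(1, 101))
low, high = averaged_extremes(values)
print(low, high)  # 5.5 95.5
```

Note that with fewer than 20 observations the two groups overlap, so the summary becomes less meaningful as well as more disclosive.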
-
Don’t overlook 0 counts
Zeroes in themselves do not identify any individuals, but you can infer from them. Say you find zero individuals in the data are without a disability, then you meet someone and know they used the service you were analysing (and thus were present in the dataset). The results would then disclose that the individual you met has a disability, which would be a breach of their personal information.
Structural zeroes can be OK. Take highest qualification by age: it would be expected that there would be zero under-16s with a qualification from higher education.
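A simple way to avoid overlooking zeroes is to list every combination of categories that never occurs, then set aside the ones you can justify as structural. The sketch below assumes a two-way table; the function name `zero_cells`, the field names and the data are hypothetical, and which zeroes count as structural is a judgement to agree with your data provider.

```python
from collections import Counter
from itertools import product


def zero_cells(records, row_key, col_key, structural=()):
    """List (row, col) combinations with no observations, excluding structural zeroes."""
    rows = sorted({r[row_key] for r in records})
    cols = sorted({r[col_key] for r in records})
    counts = Counter((r[row_key], r[col_key]) for r in records)
    return [
        cell
        for cell in product(rows, cols)
        if counts[cell] == 0 and cell not in structural
    ]


# Made-up records purely for illustration.
records = [
    {"age": "under 16", "qualification": "none"},
    {"age": "under 16", "qualification": "school"},
    {"age": "16+", "qualification": "none"},
    {"age": "16+", "qualification": "higher"},
]

# Under-16s with a higher education qualification is an expected, structural zero.
expected = {("under 16", "higher")}
print(zero_cells(records, "age", "qualification", structural=expected))
```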
Don’t be overwhelmed. Statistical disclosure comes with a lot to take in and all the information is taught during the required training courses. Data providers are also readily available to help guide you along the way. Working together it is possible to resolve most statistical disclosure issues.