Conversation – who writes the data?

The main difference between survey and administrative data is the input process.

Design

Survey data is input either by the subject or on behalf of the subject by an interviewer or proxy. Survey questions are heavily designed, with a lot of research behind each one, to obtain the maximum response and to accommodate all possible answers, including ‘don’t know’ and ‘prefer not to say’.

Administrative data is input by an administrative person, whose job is to complete the form in order to provide a service to the subject. The form is designed around what the system needs in order to fulfil its purpose, and only what is needed according to GDPR. The data then has to pass through a process for the system to work, so the form is designed with barriers and restrictions to direct the data input, often without flexibility should the real value not fit within these boundaries. If a field must be completed, a computer-based system should flag a warning when it is empty; if the data is required to be in a certain format, that will also be flagged. But what happens when the real data does not fit within these boundaries, or simply does not make sense? It is down to the administrator to decide. For example, a date is required but the event happened over a couple of days or weeks, so the administrator puts in the earliest date. But they could just as easily have chosen the latest date, the first day of the week, the last day of the month, or any other date.

Source(s)

A survey is a one-to-one event between survey and respondent: one respondent completes one survey.

Administrative data can have multiple sources. Multiple individuals and organisations are often involved in an administrative process, and sometimes the data is secondary, carried over from a previous administrative process. For example, in the courts, administrators will input data from the applicant, the defendant, the local authority, the police, the CPS and so on. Some sections will require direct input from a single source and some sections could be completed by multiple sources. Take, say, a defendant’s name and address. Ideally this would come directly from the defendant, but it could come from the police, or any other organisation, that collected the data for their own processes (this is what makes data linkage possible). It is this complex network of data collection that creates both the pros and cons of working with administrative data.

Researchers need to keep in mind who provided the data and who input it. The individual entering data may be dependent on another source for the data while also facing a strict deadline.

Signatory

Respondents to a survey will often provide a signature to declare that it was indeed them that completed the survey.

Signatures in an administrative process will often be a declaration that the data is correct and accurate, with acceptance of responsibility should it be proven otherwise.

Researchers using administrative data need to be mindful of this declaration. It is often tempting in data processing to correct errors, such as “correcting” dates from 3018 to 2018, or a calculated age that does not match the declared age. Do not! I would seriously recommend researchers restrict themselves to flagging data as valid or invalid and applying inclusion/exclusion criteria.
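As a minimal sketch of that flag-don’t-fix approach (the column names and values here are invented for illustration), the following marks implausible records as invalid rather than overwriting them:

```python
import pandas as pd

# Hypothetical records with an impossible date and an inconsistent age
df = pd.DataFrame({
    "date_of_birth": ["1985-06-01", "3018-02-14", "1990-11-30"],
    "declared_age": [38, 35, 29],
    "application_date": ["2023-09-01", "2023-09-01", "2023-09-01"],
})
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"], errors="coerce")  # impossible dates become NaT
df["application_date"] = pd.to_datetime(df["application_date"])

# Age calculated from the recorded dates, to compare against the declared age
calc_age = (df["application_date"] - df["date_of_birth"]).dt.days // 365

# Flag, never overwrite: the declared values stay exactly as they were entered
df["valid_dob"] = df["date_of_birth"].between("1900-01-01", "2023-12-31")
df["age_consistent"] = (calc_age - df["declared_age"]).abs() <= 1

# Inclusion criteria: analyse only the records that pass both checks
analysis_sample = df[df["valid_dob"] & df["age_consistent"]]
```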

 

Statistical disclosure, the dos and don’ts

Ensuring that individuals (or organisations) cannot be identified from research outputs when using administrative data has additional hurdles to cross and can require extra levels of preparation. Data providers will often review any outputs before they are published, and each provider has a set of criteria that every output will have to clear. In the case of data that requires secure access, this will include obtaining outputs from the secure space while writing your draft articles, reports and other publications.

Statistical disclosure is when a small number of observations are isolated, say in a count, and could be used to identify an individual. This can take the form of primary disclosure, where the small numbers are shown directly, or secondary disclosure, where another source (say another table in your report, or external official statistics) can be used to calculate a hidden small number.

A particular consideration when working with administrative data from a public service is the individuals who work there, for example geographical breakdowns that inadvertently identify members of the judiciary from court data.


Data providers require researchers to undergo training before being given access to their data. The UK Data Service runs a Safe Researcher Training course with the Office for National Statistics (ONS) and Administrative Data Research UK (ADR UK), often found here:

Alternatively, the Medical Research Council runs the following courses here, which are also accepted by some providers:

  • MRC Regulatory Support Centre: Research Data and Confidentiality e-learning
  • MRC’s Research, GDPR and confidentiality

If, like me, you end up going through multiple versions of these training courses and reading through a few different sets of guidelines and rules, it can be difficult to keep track of the different boundaries and criteria. So below I have highlighted a few dos and don’ts to keep in mind throughout a research project.

UK Data Service provides a “Handbook on Statistical Disclosure Control for Outputs” here.

The following is not a comprehensive list of the criteria and rules regarding statistical disclosure; there are many grey areas, and it is best to maintain good communication with the data providers so issues can be discussed on a case-by-case basis.


Dos

  • Keep to a threshold of 10 at all times

From a statistical standpoint this will be pretty obvious, as low frequencies can already cause problems when it comes to running tests and fitting models. If you avoid low frequencies from the start, they’ll be less of a problem later. This can mean grouping categories together, or defining the variables a little differently.

By “at all times” I mean from the very first frequency counts down to the finest detail of every cell in the largest cross-tabulation. Secondary disclosure can be tricky to navigate when writing up your research if there are any small numbers. For example, you usually want to start with a frequency table that lays out the scale of the problem you want to address; there are no small numbers, so it is fine on its own. However, if you come across a small number later, when you break down the problem through the controlling factors and explanatory variables, then what you can publish and pass through the disclosure control could be limited. If you know where the small numbers are, you can make better decisions in your research and be prepared.

Not all data providers use a threshold of 10; some are lower at 5, so keeping to 10 is an easy way to make sure you always stay within the criteria.
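A small sketch of what that check might look like in practice, using an invented cross-tabulation of counts (any real criteria come from your data provider):

```python
import pandas as pd

THRESHOLD = 10  # the stricter common threshold; some providers use 5

# Invented cross-tabulation of counts (region x outcome)
tab = pd.DataFrame(
    {"granted": [125, 48, 7], "refused": [34, 21, 15]},
    index=["North", "Midlands", "South"],
)

# Flag every cell below the threshold, however large the table gets
counts = tab.stack()
small_cells = counts[counts < THRESHOLD]
print(small_cells)  # (South, granted): 7 -> consider regrouping categories early
```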

  • Request the data table underlying every graphic and visualisation for clearance

Perfecting the aesthetics of a graph or a visualisation can be a tedious affair, with minute changes to scales, labels or even altering the colour scheme. At the same time, getting outputs through disclosure checks can also be tedious and lengthy at times (I once experienced delays over a graph where a single pixel could be used to calculate a small number that was hidden in a later table).

It is, therefore, advisable to request the data used to create the visualisation alongside the visualisation itself. That way you can re-create the graphic with the minute aesthetic changes without the need to pass through the disclosure control again.

On submitting a visualisation for clearance you may be required to provide the underlying data anyway to prove that there are no small counts represented in the graphic.
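One way to work like this, sketched below with made-up figures: keep the aggregated table that feeds the graphic as its own output, so that once it has been cleared the figure can be redrawn and restyled without another round of checking.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up cleared aggregate: once this table passes disclosure control, the
# graphic can be recreated and polished without resubmission.
cleared = pd.DataFrame(
    {"year": [2018, 2019, 2020, 2021], "applications": [1520, 1610, 1180, 1390]}
)
cleared.to_csv("figure1_underlying_data.csv", index=False)  # request this alongside the figure

fig, ax = plt.subplots()
ax.bar(cleared["year"], cleared["applications"])
ax.set_xlabel("Year")
ax.set_ylabel("Applications")
fig.savefig("figure1.png", dpi=300)
```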

  • Plan your outputs before you start

Think about what outputs you might want before you get started. Similar to my first “do”, being prepared is key. What will be an absolute “must have” in the article/report? What will you show in tables or figures? What will connect together and risk any secondary disclosure, and which would you change to prevent disclosure?

  • Be cautious of creating graphs from individual level data

The problem with individual data points on graphs is self-identification. This includes outputs from a statistical model that show the values of any observed variable. You know your own data, so it is possible for a person to know which point corresponds to themselves, and to anyone else whose value they know.

That is not to say it is not possible to have a graph with individual data points, like outliers on a boxplot, but a good explanation will have to be given that justifies why it does not disclose any personal data.

  • Check for dominance

Dominance occurs when the largest “unit” accounts for more than 43.75% of the total, or when all but the top two units account for less than 12.5% of the largest unit (as defined on the UK Data Service Safe Researcher Training course; check with your provider for their specific criteria). Although this does not usually happen at the individual level, consider, for example, identifying a particular court or centre. Data owners will not allow you to upset the organisations providing the data, and it may not be in the public interest to identify local practices. You wouldn’t, for example, identify an individual interviewer or a branch of a survey study.
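The two rules above translate into a short check. This is only a sketch of my reading of them, on invented counts of cases per court; your provider’s exact definition takes precedence.

```python
# Invented counts of cases per court in a region
def dominance_flags(values):
    ordered = sorted(values, reverse=True)
    total = sum(ordered)
    largest = ordered[0]
    rest_after_top_two = sum(ordered[2:])
    rule_1 = largest > 0.4375 * total               # largest unit dominates the total
    rule_2 = rest_after_top_two < 0.125 * largest   # everyone else is tiny next to the largest
    return rule_1, rule_2

print(dominance_flags([900, 120, 80, 60, 40]))  # (True, False): the first rule is breached
```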

Don’ts

  • Don’t display a complete set of frequencies

Presenting frequencies can be key to portraying scale; however, they make it difficult to prevent later secondary disclosures. Personally I avoid frequency tables wherever possible, especially cross-tabulations, and stick to in-line rounded figures where necessary. Alternatively, present the distribution in terms of percentages, rates or other appropriate statistics. Beware that a total frequency can be used to calculate any hidden small values.

  • Don’t provide the minimum and maximum

The minimum and maximum often relate to just a single observation each, a big no-no. A suggested solution is to average the top/bottom 10 instead. In general, summarising continuous variables has its own set of problems, and categorising these variables for display purposes can make things a lot easier.
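As a rough illustration of that suggestion (on simulated values, not real data), the mean of the ten smallest and ten largest observations can stand in for the minimum and maximum:

```python
import numpy as np
import pandas as pd

# Simulated continuous variable, e.g. case duration in days
rng = np.random.default_rng(42)
durations = pd.Series(rng.integers(5, 400, size=500))

ordered = durations.sort_values()
bottom_10_mean = ordered.head(10).mean()  # stands in for the minimum
top_10_mean = ordered.tail(10).mean()     # stands in for the maximum
print(f"durations range from roughly {bottom_10_mean:.0f} to {top_10_mean:.0f} days")
```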

  • Don’t overlook 0 counts

Zeroes in themselves do not identify any individuals, but you can infer from them. Say you find that zero individuals in the data are without a disability; then you meet someone and know they used the service you were analysing (and thus were present in the dataset). The results would then disclose that the individual you met has a disability, which would be a breach of their personal information.

Structural zeroes can be OK: for highest qualification by age, for example, it would be expected that there would be zero under-16s with a qualification from higher education.


Don’t be overwhelmed. Statistical disclosure comes with a lot to take in, and all of this information is taught during the required training courses. Data providers are also readily available to help guide you along the way. Working together, it is possible to resolve most statistical disclosure issues.



Forms of administrative data

Administrative data is collected in many different settings by businesses and public services, usually in exchange for a product or service. Nowadays it is commonly accepted that private companies use customers’ data to conduct analysis and market research that informs business decisions on marketing and sales. However, research of this kind with data from public services is only just entering the spotlight.

The administrative data collected by public services will be the primary focus of this blog.

Raw Data

Raw data, direct from the source, can be a dry place to start research. In its original form, the value of raw data for research is not always obvious. On the other hand, there are endless possibilities for derived variables that are meaningful and valuable for research.

A key aspect of raw data is the data processing, which involves a number of steps and considerations. To illustrate a few:

  • Data validation:
    • Is each data item in the correct form? e.g., a date entered in a field for gender would not be valid.
    • Is an empty field missing or a non-value? e.g., an empty tick box could indicate a “no” or be part of a group where no tick boxes have been completed.
  • Variable creation:
    • What can be extracted from a text box? Can meaningful categories be created from key words? e.g., extracting the town, county and region from an address.
    • Are the dates of key moments recorded? e.g., the duration between a date of birth and the application’s issue date for calculating ages.
  • Cross-validation:
    • Does the data agree with itself? e.g., is a mother’s date of birth before that of her child, and is the mother not recorded as male?

There isn’t a one-rule-fits-all answer; the context of the data will change the answer to these questions. A minimal sketch of a few of these checks follows below.
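The sketch uses an invented extract with illustrative field names (no real system is being described here):

```python
import pandas as pd

# Invented raw extract: dates arrive as free text and may be wrong
raw = pd.DataFrame({
    "child_dob": ["2015-03-02", "2012-07-19", "2016-01-05"],
    "mother_dob": ["1990-05-11", "2014-01-01", "1988-09-30"],
    "issue_date": ["2020-06-01", "2020-06-01", "2020-06-01"],
})

# Data validation: anything that does not parse as a date becomes NaT
for col in raw.columns:
    raw[col] = pd.to_datetime(raw[col], errors="coerce")

# Variable creation: the child's age (in whole years) at the application's issue date
raw["child_age_at_issue"] = (raw["issue_date"] - raw["child_dob"]).dt.days // 365

# Cross-validation: a mother's date of birth should precede her child's
raw["dob_consistent"] = raw["mother_dob"] < raw["child_dob"]
print(raw[["child_age_at_issue", "dob_consistent"]])  # the second record fails the consistency check
```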

See here for a previous research project that started with raw data. The data was typed by hand into open text boxes by call receivers to Safeline and the Male Survivors Helpline, an anonymous service for victims of sexual violence. The first step was to correct obvious misspellings, including 15 different ways of typing the word “unknown”. 
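A toy version of that kind of clean-up maps free-text variants onto a single value; the variants listed here are invented, not the actual ones from the project:

```python
import re

# Invented variants; the real clean-up rules were more extensive
UNKNOWN_VARIANTS = {"unknown", "unkown", "unknwon", "un known", "uknown", "not known", "n/k"}

def normalise(value: str) -> str:
    cleaned = re.sub(r"\s+", " ", value.strip().lower())  # tidy case and whitespace
    return "unknown" if cleaned in UNKNOWN_VARIANTS else cleaned

print(normalise("  Unkown "))   # -> "unknown"
print(normalise("Not  Known"))  # -> "unknown"
```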

Processed Data

My experience with processed administrative data has been very similar to working with survey data, where the data owners and/or data processors (for example the Office for National Statistics, Eurostat, etc.) take responsibility for processing the data. That is not to say the data is necessarily a nice and tidy statistical dataset, but the variables and values will be consistent. The processing could also include the removal of any personal or sensitive data, or aggregation of the individual-level data to a higher level.

Decisions on data validation and on the variables created from the raw data have already been made, which means the research can be replicated and is reproducible, with multiple researchers able to access the same data. The scope of the data may be limited, though, especially if researchers were not involved in the data processing.

Data owners and processors, however, are often willing to work closely with researchers, and requests for other variables are sometimes possible.

Secure Access Data

Typically, this has involved working on a secure remote desktop with security procedures and requirements for researchers to be trained in General Data Protection Regulation (GDPR) and statistical disclosure.

Secure access data often contains a lot of valuable information at the individual level: a resource many researchers want access to in order to answer research questions with methods that just wouldn’t be possible otherwise.

An additional step in the process is clearing items through statistical disclosure control so the results of the research can be published. Different data owners and processors have slightly different requirements and boundaries, but the premise is the same. Frequencies are common ground, where no small numbers can be published (some set the minimum count at 5, some at 10), but even larger frequencies can raise issues with secondary disclosure. Secondary disclosure could reveal a small number if, say, the total count is disclosed at the start of a report and a complete breakdown of the percentages is given later.
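The arithmetic behind that kind of secondary disclosure is simple enough to show with made-up numbers:

```python
# Made-up numbers: a suppressed small cell recovered from figures published elsewhere
total_cases = 1043                      # disclosed at the start of the report
published_breakdown = [512, 341, 187]   # every category shown later except one

hidden_cell = total_cases - sum(published_breakdown)
print(hidden_cell)  # 3 -> a small count recovered without ever being published
```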

Statistical disclosure control therefore prevents the research from being reproduced by anyone without access to the data; a research project aiming to reproduce the results would have to go through the lengthy application process with that proposed purpose.

Linked Data

The advantage of secure individual-level administrative data is the ability to form data linkages, connecting data at the individual level across multiple systems, for example linking family court data with health records through probabilistic modelling.
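As a very rough, toy sketch of the idea behind probabilistic linkage (in the spirit of the Fellegi–Sunter approach, with invented field names and probabilities), each field that agrees between two records adds evidence for a match and each disagreement subtracts from it:

```python
import math

# Invented m/u probabilities: P(field agrees | true match) and P(field agrees | non-match)
FIELD_PROBS = {
    "surname":       (0.95, 0.01),
    "date_of_birth": (0.98, 0.003),
    "postcode":      (0.90, 0.05),
}

def match_weight(record_a: dict, record_b: dict) -> float:
    """Sum of log agreement/disagreement weights across the chosen fields."""
    weight = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if record_a.get(field) == record_b.get(field):
            weight += math.log2(m / u)
        else:
            weight += math.log2((1 - m) / (1 - u))
    return weight

court = {"surname": "smith", "date_of_birth": "1984-02-29", "postcode": "cv4 7al"}
health = {"surname": "smith", "date_of_birth": "1984-02-29", "postcode": "cv4 8uw"}
print(match_weight(court, health))  # a high weight suggests the same person
```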

 

All these topics will be discussed in further detail in a blog post or thread specific to each one. If you have your own experiences, please feel free to share them in the comments and I may ask you to contribute a post.

First Post

Welcome to this new blog on using administrative data for research! This is a rapidly growing area of research, with the increased availability of administrative datasets and the drive for access to secure datasets from public services. Recent examples include the ambitious data-linking programme Data First, led by the Ministry of Justice and funded by ADR UK, and the SAIL Databank, funded by the Welsh Government’s Health and Care Research Wales.

This blog will feature discussions on topics specific to the use of anonymised person-based administrative data for research, which will include:

  • Accessing secure data
  • Statistical disclosure
  • Synthetic data
  • Data processing; working with raw data
  • Creating research-relevant variables
  • The public good

The topics discussed will be based on experience, with the aim of helping to inform future researchers about using administrative data.