
Conversation – who writes the data?

The main difference between survey and administrative data is the input process.

Design

Survey data is input either by the subject or, on the subject’s behalf, by an interviewer or proxy. Survey questions are heavily designed, with a lot of research behind each one, to obtain the maximum response and to accommodate all possible answers, including ‘don’t know’ and ‘prefer not to say’.

Administrative data is input by an administrator, whose job is to complete the form in order to provide a service to the subject. The form is designed around what the system needs in order to fulfil its purpose, and only what is needed, in line with GDPR. The data then has to go through a process for the system to work, so the form is designed with barriers and restrictions that direct the data input, often without flexibility should the real value not fit within these boundaries. If a field must be completed, a computer-based system should flag a warning when it is empty; if the data is required to be in a certain format, a value in the wrong format will also be flagged. But what happens when the real data does not fit within these boundaries? Or when the data simply does not make sense? It is down to the administrator to decide. For example, a date is required but the event happened over a couple of days or weeks, so the administrator enters the earliest date. But they could equally have entered the latest date, the first day of the week, the last day of the month, or any other date.
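To make the idea of form restrictions concrete, here is a minimal sketch of the kind of check a computer-based form might run. The field names, the date format and the function validate_form are all hypothetical, chosen only for illustration.

```python
from datetime import datetime

# Hypothetical form definition: field name -> (required?, date format or None).
FORM_FIELDS = {
    "name": (True, None),
    "event_date": (True, "%d/%m/%Y"),
    "notes": (False, None),
}

def validate_form(entry):
    """Return a list of warnings for one form entry (a dict of strings)."""
    warnings = []
    for field, (required, date_format) in FORM_FIELDS.items():
        value = entry.get(field, "").strip()
        if not value:
            # Empty optional fields are accepted silently.
            if required:
                warnings.append(f"{field}: required field is empty")
            continue
        if date_format is not None:
            try:
                datetime.strptime(value, date_format)
            except ValueError:
                warnings.append(f"{field}: not in the expected format")
    return warnings

# An event spanning a week cannot be entered as it really happened.
print(validate_form({"name": "A. Person", "event_date": "1-7 March 2018"}))
```

The form will only accept a single date, so the administrator has to pick one, and nothing in the stored value records which rule they used to pick it.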

Source(s)

Surveys are a one-to-one event between survey and respondent. One respondent completes one survey.

Administrative data can contain multiple sources. Multiple individuals and organisations are often involved in an administrative process, and sometimes the data is secondary, taken from a previous administrative process. For example, in the courts, administrators will input data from the applicant, the defendant, the local authority, the police, the CPS and so on. Some sections will require direct input from a single source and some sections could be completed by multiple sources. To demonstrate, take a defendant’s name and address. Ideally this would come directly from the defendant, but it could come from the police, or any other organisation, that collected the data for their own processes (this is what makes data linkage possible). It is this complex network of data collection that creates both the pros and the cons of working with administrative data.

Researchers need to keep in mind both who provided the data and who input it. The person entering the data may be dependent on another source for it while also working to a strict deadline.

Signatory

Respondents to a survey will often provide a signature to declare that it was indeed them that completed the survey.

Signatures in an administrative process will often be a declaration that the data is correct and accurate, with acceptance of responsibility should it be proven otherwise.

Researchers using administrative data need to be mindful of this declaration. It is often tempting in data processing to correct errors, such as “correcting” dates from 3018 to 2018, or adjusting values when the calculated age does not match the declared age. Do not! I would seriously recommend that researchers restrict themselves to separating valid from invalid data and applying inclusion/exclusion criteria.
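As a rough sketch of what filtering rather than correcting might look like, the example below splits records into included and excluded sets without changing any value. The study window, the record layout and the function name are assumptions made up for this illustration.

```python
from datetime import date

# Assumed inclusion window for the study; hypothetical values.
EARLIEST = date(2010, 1, 1)
LATEST = date(2020, 12, 31)

def is_valid_issue_date(issue_date):
    """True if the date falls inside the study window. The value is never changed."""
    return EARLIEST <= issue_date <= LATEST

records = [
    {"case_id": 1, "issue_date": date(2018, 5, 14)},
    {"case_id": 2, "issue_date": date(3018, 5, 14)},  # probably a typo, but left as entered
]

included = [r for r in records if is_valid_issue_date(r["issue_date"])]
excluded = [r for r in records if not is_valid_issue_date(r["issue_date"])]

print(len(included), "included;", len(excluded), "excluded")
```

The excluded records stay exactly as declared, so the number and type of invalid entries can still be reported alongside the analysis.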

 

Forms of administrative data

Administrative data is collected in many different settings by businesses and public services, usually in exchange for a product or service. Nowadays it is commonly accepted that private companies use customers’ data to conduct analysis and market research that informs business decisions on marketing and sales. However, research of this kind with data from public services is only just entering the spotlight.

The administrative data collected by public services will be the primary focus of this blog.

Raw Data

Direct from the source, raw data can be a dry place to start research. In its original form, the research value of raw data is not always obvious. On the other hand, there are endless possibilities for derived variables that are meaningful and valuable for research.

A key aspect of raw data is the data processing, which involves a number of steps and considerations. To illustrate a few:

  • Data validation:
    • Is each data item in the correct form? e.g., a date entered in a field for gender would not be valid.
    • Is an empty field missing or a non-value? e.g., an empty tick box could indicate a “no” or be part of a group where no tick boxes have been completed.
  • Variable Creation:
    • What can be extracted from a text box? Can meaningful categories be created from key words? e.g., extracting the town, county and region from an address.
    • Are the dates of key moments recorded? e.g., the duration between a date of birth and the application’s issue date for calculating ages.
  • Cross-validation:
    • Does the data agree with itself? e.g., a mother’s date of birth being before that of their child, or a mother recorded as male.

There is no one-rule-fits-all answer: the context of the data will change the answer to these questions.
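Two of the checks listed above can be sketched in a few lines: deriving an age from two dates, and testing whether a record agrees with itself. The field names (mother_dob, child_dob, mother_sex) and both function names are hypothetical, and whether a flagged record is really an error will depend on the context of the data.

```python
from datetime import date

def age_in_years(date_of_birth, on_date):
    """Completed years between date_of_birth and on_date."""
    years = on_date.year - date_of_birth.year
    # Subtract one if the birthday has not yet occurred in on_date's year.
    if (on_date.month, on_date.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years

def cross_validate(record):
    """Return a list of internal inconsistencies found in one record."""
    problems = []
    if record["mother_dob"] >= record["child_dob"]:
        problems.append("mother's date of birth is not before the child's")
    if record["mother_sex"] == "male":
        problems.append("mother recorded as male")
    return problems

record = {
    "mother_dob": date(2015, 3, 2),
    "child_dob": date(1990, 7, 9),
    "mother_sex": "female",
}
print(age_in_years(date(2000, 6, 15), date(2018, 6, 14)))
print(cross_validate(record))
```

Here the function only reports the inconsistency; it does not guess which of the two dates was entered wrongly.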

See here for a previous research project that started with raw data. The data was typed by hand into open text boxes by call receivers at Safeline and the Male Survivors Helpline, an anonymous service for victims of sexual violence. The first step was to correct obvious misspellings, including 15 different ways of typing the word “unknown”.
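A minimal sketch of how variants of “unknown” might be collapsed into one value, using fuzzy matching from Python’s standard difflib module. This is not the method used in the project; the function name normalise_unknown and the cutoff of 0.75 are assumptions for illustration.

```python
import difflib

def normalise_unknown(value, cutoff=0.75):
    """Map close misspellings of 'unknown' to 'unknown'; leave other values as typed."""
    cleaned = value.strip().lower()
    # get_close_matches returns the candidates whose similarity is at least `cutoff`.
    if difflib.get_close_matches(cleaned, ["unknown"], n=1, cutoff=cutoff):
        return "unknown"
    return value

for typed in ["unkown", " UNKNOWN ", "Smith"]:
    print(repr(typed), "->", repr(normalise_unknown(typed)))
```

Fuzzy matching can also catch genuine values (the word “known” scores above this cutoff, for instance), so in practice a manually reviewed lookup table of the observed spellings is likely to be the safer choice.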

Processed Data

Working with processed administrative data has been very similar to my experience of working with survey data, where the data owners and/or data processors (for example the Office for National Statistics, Eurostat etc.) take responsibility for processing the data. That is not to say the data is necessarily a nice and tidy statistical dataset, but the variables and values will be consistent. The processing could also include the removal of any personal or sensitive data, or aggregation of the individual level data to a higher level.

Decisions on data validation and the variables created from the raw data have already been made, which means the research can be replicated and is reproducible, with multiple researchers able to access the same data. The scope of the data may be limited, however, especially if researchers were not involved in the data processing.

Data owners and processors are often willing to work closely with researchers, though, and requests for other variables are sometimes possible.

Secure Access Data

Typically, this has involved working on a secure remote desktop, with security procedures and requirements for researchers to be trained in the General Data Protection Regulation (GDPR) and statistical disclosure.

The data often contains a lot of valuable information at the individual level, a resource many researchers want access to in order to answer research questions with methods that just wouldn’t be possible otherwise.

An additional step in the process is clearing items through statistical disclosure so that the results of the research can be published. Different data owners and processors have slightly different requirements and boundaries, but the premise is the same. Frequencies are common ground, where no small numbers can be published (some set the minimum count at 5 and some at 10), but even larger frequencies can raise issues with secondary disclosure. Secondary disclosure could reveal a small number if the total count is disclosed at the start of a report and a complete breakdown of the percentages is given later.
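The arithmetic behind secondary disclosure is simple enough to sketch. The counts, category names and both function names below are made up for illustration, and the minimum of 5 is just one of the thresholds mentioned above; real disclosure rules are set by each data owner.

```python
def suppress_small_counts(counts, minimum=5):
    """Replace counts below the minimum with None so they are not published."""
    return {k: (v if v >= minimum else None) for k, v in counts.items()}

def recover_counts(total, percentages):
    """Show how a total plus a full percentage breakdown reveals each count."""
    return {k: round(total * p / 100) for k, p in percentages.items()}

# Made-up frequencies for three categories.
counts = {"A": 120, "B": 77, "C": 3}
print(suppress_small_counts(counts))

# The small count was suppressed, but the report also gives the total (200)
# and the full percentage breakdown, which together reveal it again.
percentages = {"A": 60.0, "B": 38.5, "C": 1.5}
print(recover_counts(200, percentages))
```

This is why a disclosure check has to look at a report as a whole rather than one table at a time.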

Statistical disclosure therefore prevents the research being reproduced by anyone without access to the data; a research project wishing to reproduce the results would have to go through the lengthy application process with that as its stated purpose.

Linked Data

The advantage of secure individual level administrative data is the ability to form data linkages, connecting data at the individual level across multiple systems, for example family court data with health records, through probabilistic modelling.
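As a rough illustration of what probabilistic linkage can mean, the sketch below scores a pair of records with Fellegi-Sunter style match weights, one common approach; the post does not say which method any particular data owner uses. The field names and the m and u probabilities are assumed values for illustration, not estimates from real data.

```python
import math

# Illustrative probabilities per field (assumed, not estimated from data):
# m = P(field agrees | same person), u = P(field agrees | different people).
FIELD_PROBS = {
    "surname": (0.95, 0.01),
    "date_of_birth": (0.97, 0.003),
    "postcode": (0.80, 0.02),
}

def match_weight(record_a, record_b):
    """Sum of log2 agreement/disagreement weights over the compared fields."""
    weight = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if record_a[field] == record_b[field]:
            weight += math.log2(m / u)              # agreement adds evidence for a match
        else:
            weight += math.log2((1 - m) / (1 - u))  # disagreement counts against it
    return weight

court = {"surname": "Jones", "date_of_birth": "1990-01-01", "postcode": "AB1 2CD"}
health = {"surname": "Jones", "date_of_birth": "1990-01-01", "postcode": "XY9 8ZW"}
print(match_weight(court, health))
```

Pairs scoring above an upper threshold would be treated as links and pairs below a lower threshold as non-links, with those in between sent for review; choosing those thresholds is a judgement that depends on the data.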

 

All these topics will be discussed in further detail in a blog post or thread specific to each one. If you have your own experiences, please feel free to share in the comments and I may ask you to contribute a post.