Administrative data is collected in many different settings by businesses and public services, usually in exchange for a product or service. Nowadays it is commonly accepted that private companies use customers’ data to conduct analysis and market research that informs business decisions on marketing and sales. However, research of this kind with data from public services is only just entering the spotlight.
The administrative data collected by public services will be the primary focus of this blog.
Raw Data
Direct from the source, raw data can be a dry place to start research. In its original form, the research value of raw data is not always obvious. On the other hand, there are endless possibilities for deriving variables that are meaningful and valuable for research.
A key aspect of working with raw data is the data processing, which involves a number of steps and considerations. To illustrate a few:
- Data validation:
  - Is each data item in the correct form? e.g., a date entered in a field for gender would not be valid.
  - Is an empty field a missing value or a genuine non-value? e.g., an empty tick box could indicate a “no”, or be part of a group where no tick boxes have been completed.
- Variable creation:
  - What can be extracted from a text box? Can meaningful categories be created from key words? e.g., extracting the town, county and region from an address.
  - Are the dates of key moments recorded? e.g., the duration between a date of birth and the application’s issue date can be used to calculate age.
- Cross-validation:
  - Does the data agree with itself? e.g., checking that a mother’s date of birth is before that of her child, or flagging a mother recorded as male.
There is no one-size-fits-all rule; the context of the data will change the answer to these questions.
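To make the three kinds of check above a little more concrete, here is a minimal sketch in Python. The field names, the gender coding and the example record are all invented for illustration; real rules would depend on the data and its context.

```python
from datetime import date

VALID_GENDERS = {"F", "M"}  # assumed coding scheme, not taken from any real dataset


def validate_gender(value):
    """Data validation: is the value in the expected form for this field?"""
    return value in VALID_GENDERS


def age_in_years(date_of_birth, issue_date):
    """Variable creation: completed years between two recorded dates."""
    had_birthday = (issue_date.month, issue_date.day) >= (
        date_of_birth.month,
        date_of_birth.day,
    )
    return issue_date.year - date_of_birth.year - (0 if had_birthday else 1)


def cross_validate(record):
    """Cross-validation: list the ways a record disagrees with itself."""
    problems = []
    if record["mother_dob"] >= record["child_dob"]:
        problems.append("mother not born before child")
    if record["mother_gender"] == "M":
        problems.append("mother recorded as male")
    return problems


# Hypothetical record containing deliberate inconsistencies.
record = {
    "mother_gender": "M",
    "mother_dob": date(1990, 5, 17),
    "child_dob": date(1988, 1, 3),
    "issue_date": date(2020, 5, 16),
}

print(validate_gender("2001-04-09"))  # False: a date typed into the gender field
print(age_in_years(record["mother_dob"], record["issue_date"]))  # 29
print(cross_validate(record))
```

Flagging a problem is deliberately kept separate from fixing it here, since whether a flagged record is corrected, set to missing or dropped is a research decision rather than a coding one.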
See here for a previous research project that started with raw data. The data was typed by hand into open text boxes by call receivers to Safeline and the Male Survivors Helpline, an anonymous service for victims of sexual violence. The first step was to correct obvious misspellings, including 15 different ways of typing the word “unknown”.
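As an illustration of that kind of clean-up (not the method used in the project itself), the sketch below uses fuzzy string matching from Python's standard library to map misspellings onto a canonical value. The example spellings and the 0.7 cutoff are my own assumptions.

```python
from difflib import get_close_matches

CANONICAL_VALUES = ["unknown"]  # the spellings we want to standardise towards


def normalise(value, canonical=CANONICAL_VALUES, cutoff=0.7):
    """Return the closest canonical spelling, or the trimmed input if none is close."""
    cleaned = value.strip().lower()
    match = get_close_matches(cleaned, canonical, n=1, cutoff=cutoff)
    return match[0] if match else value.strip()


# Invented examples of hand-typed entries.
for raw in ["Unkown", " UNKNWN ", "uknown", "Birmingham"]:
    print(repr(raw), "->", repr(normalise(raw)))
```

Fuzzy matching can over-correct: with this cutoff the genuine answer "known" is close enough to be rewritten as "unknown", so any mapping like this should be reviewed by hand before it is applied.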
Processed Data
Working with processed administrative data has been very similar to my experience of working with survey data, where the data owners and/or data processors (for example, the Office for National Statistics or Eurostat) take responsibility for processing the data. That is not to say the data is necessarily a nice and tidy statistical dataset, but the variables and values will be consistent. The processing could also include the removal of any personal or sensitive data, or aggregation of the individual-level data to a higher level.
Decisions on data validation and the variables created from the raw data have already been made, which means the research can be reproduced, since multiple researchers are able to access the same data. The scope of the data may be limited, though, especially if researchers were not involved in the data processing.
Data owners and processors, however, are often willing to work closely with researchers, and requests for other variables are sometimes possible.
Secure Access Data
Typically, working with secure access data has involved using a secure remote desktop, with security procedures and requirements for researchers to be trained in the General Data Protection Regulation (GDPR) and statistical disclosure.
Secure access data often contains a lot of valuable information at the individual level, a resource many researchers want access to in order to answer research questions with methods that just wouldn’t be possible otherwise.
An additional step in the process is clearing items through statistical disclosure so that the results of the research can be published. Different data owners and processors have slightly different requirements and boundaries, but the premise is the same. Frequencies are common ground: no small numbers can be published (some set the minimum count at 5, others at 10), but even larger frequencies can raise issues with secondary disclosure. For example, a small number could be revealed if the total count is disclosed at the start of a report and a complete breakdown of the percentages is given later.
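The arithmetic behind secondary disclosure is easy to demonstrate. In the sketch below, with invented counts and an assumed minimum of 5, the small count is suppressed in the published table but can be rebuilt from the total and the percentage breakdown.

```python
def suppress_small_counts(counts, minimum=5):
    """Replace any count below the minimum with None before publication."""
    return {
        category: (n if n >= minimum else None) for category, n in counts.items()
    }


def recover_counts(total, percentages):
    """Secondary disclosure: rebuild counts from a published total and percentages."""
    return {
        category: round(total * pct / 100) for category, pct in percentages.items()
    }


# Invented frequencies for three hypothetical categories.
counts = {"A": 112, "B": 45, "C": 3}
total = sum(counts.values())

published = suppress_small_counts(counts)
percentages = {
    category: round(100 * n / total, 1) for category, n in counts.items()
}
recovered = recover_counts(total, percentages)

print(published)    # category C is suppressed
print(percentages)  # but a full percentage breakdown is also published
print(recovered)    # so the suppressed count can be recovered
```

This is why disclosure checks look at a report as a whole rather than at each table on its own.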
Statistical disclosure therefore prevents the research from being reproduced by anyone without access to the data; a new research project would have to go through the lengthy application process with the proposed purpose of reproducing the results.
Linked Data
The advantage of secure individual-level administrative data is the ability to form data linkages, connecting data at the individual level across multiple systems, for example linking family court data with health records through probabilistic modelling.
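To give a flavour of what probabilistic linkage can look like, here is a minimal sketch of one common approach, Fellegi-Sunter style match weights. It is not necessarily the method used for any particular linkage; the fields, the records and the m and u probabilities are all invented for illustration.

```python
from math import log2

# Assumed probabilities per field (invented, not estimated from real data):
#   m = probability the field agrees when two records are the same person
#   u = probability the field agrees when two records are different people
FIELD_PROBABILITIES = {
    "surname": (0.95, 0.01),
    "date_of_birth": (0.98, 0.003),
    "postcode": (0.80, 0.02),
}


def match_weight(record_a, record_b):
    """Sum log2 agreement or disagreement weights across the comparison fields."""
    weight = 0.0
    for field, (m, u) in FIELD_PROBABILITIES.items():
        if record_a[field] == record_b[field]:
            weight += log2(m / u)  # agreement adds evidence for a match
        else:
            weight += log2((1 - m) / (1 - u))  # disagreement counts against it
    return weight


# Hypothetical records from two systems.
court_record = {"surname": "smith", "date_of_birth": "1990-05-17", "postcode": "CV1 1AA"}
health_records = [
    {"surname": "smith", "date_of_birth": "1990-05-17", "postcode": "CV2 2BB"},
    {"surname": "jones", "date_of_birth": "1975-11-02", "postcode": "B1 1AA"},
]

for candidate in health_records:
    print(candidate["surname"], round(match_weight(court_record, candidate), 2))
```

Pairs with a high total weight are treated as likely matches and pairs with a low weight as likely non-matches; where to draw those thresholds is a judgement that depends on the data.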
All these topics will be discussed in further detail in a blog post or thread specific to each one. If you have your own experiences, please feel free to share them in the comments, and I may ask you to contribute a post.