{"id":73,"date":"2021-08-27T10:45:57","date_gmt":"2021-08-27T09:45:57","guid":{"rendered":"http:\/\/wp.lancs.ac.uk\/administrative-data-research\/?p=73"},"modified":"2022-01-17T15:28:30","modified_gmt":"2022-01-17T14:28:30","slug":"statistical-disclosure-the-dos-and-donts","status":"publish","type":"post","link":"http:\/\/wp.lancs.ac.uk\/administrative-data-research\/2021\/08\/27\/statistical-disclosure-the-dos-and-donts\/","title":{"rendered":"Statistical disclosure, the dos and don&#8217;ts"},"content":{"rendered":"<p><span style=\"font-family: arial, helvetica, sans-serif\">Ensuring that individuals (or organisations) cannot be identified from research outputs <\/span><span style=\"font-family: arial, helvetica, sans-serif\">when using administrative data involves<\/span><span style=\"font-family: arial, helvetica, sans-serif\"> additional hurdles and can require extra levels of preparation. Data providers will often review any outputs before they are published, and each provider has a set of criteria that every output will have to clear. In the case of data that requires secure access, this will include obtaining outputs from the secure space while writing your draft articles, reports and other publications.<\/span><\/p>\n<p><span style=\"font-family: arial, helvetica, sans-serif\">Statistical disclosure is when a small number of observations are isolated, say in a count, and could be used to identify an individual. 
This can take the form of primary disclosure, where the small number of observations is shown directly, or secondary disclosure, where another source (say another table in your report or external official statistics) can be used to calculate a hidden small number.<\/span><\/p>\n<p><span style=\"font-family: arial, helvetica, sans-serif\">A particular consideration when working with administrative data from a public service is the individuals who work there; for example, geographical breakdowns can inadvertently identify members of the judiciary from court data.<\/span><\/p>\n<hr \/>\n<p><span style=\"font-family: arial, helvetica, sans-serif\">Data providers require researchers to undergo training before being given access to their data. The UK Data Service run a Safe Researcher Training course with the Office for National Statistics (ONS) and Administrative Data Research UK (ADR UK), usually listed <a href=\"https:\/\/ukdataservice.ac.uk\/training-events\/\">here<\/a>:<\/span><\/p>\n<ul>\n<li><span style=\"font-family: arial, helvetica, sans-serif\">Safe User of Research data Environments (SURE) Training course \u2013 run by\u00a0<a href=\"https:\/\/www.ons.gov.uk\/aboutus\/whatwedo\/statistics\/requestingstatistics\/approvedresearcherscheme\">ONS<\/a>,\u00a0<a href=\"https:\/\/ukdataservice.ac.uk\/\">the UK Data Service<\/a>,<a href=\"https:\/\/www.adruk.org\/\">\u00a0the Administrative Data Research Network<\/a>.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-family: arial, helvetica, sans-serif\">Alternatively, the Medical Research Council run the following courses <a href=\"https:\/\/byglearning.com\/mrcrsc-lms\/course\/index.php?categoryid=1\">here<\/a>, which are also accepted by some providers:<\/span><\/p>\n<ul>\n<li><span style=\"font-family: arial, helvetica, sans-serif\">MRC Regulatory Support Centre: Research Data and Confidentiality e-learning<\/span><\/li>\n<li><span style=\"font-family: arial, helvetica, sans-serif\">MRC\u2019s Research, GDPR and 
confidentiality<\/span><\/li>\n<\/ul>\n<hr \/>\n<p><span style=\"font-family: arial, helvetica, sans-serif\">If, like me, you end up going through multiple forms of these training courses and reading through a few different sets of guidelines and rules, it can be difficult to keep track of the different boundaries and criteria. So below I have highlighted a few dos and don&#8217;ts to keep in mind throughout a research project.<\/span><\/p>\n<p><span style=\"font-family: arial, helvetica, sans-serif\">The UK Data Service provides a &#8220;Handbook on Statistical Disclosure Control for Outputs&#8221; <a href=\"https:\/\/dam.ukdataservice.ac.uk\/media\/622521\/thf_datareport_aw_web.pdf\">here<\/a>.<\/span><\/p>\n<p style=\"text-align: left\"><span style=\"font-family: arial, helvetica, sans-serif\">The following is not a comprehensive account of the criteria and rules regarding statistical disclosure. There are many grey areas, and it is best to maintain good communication with the data providers and discuss issues on a case-by-case basis.<\/span><\/p>\n<hr \/>\n<h2><span style=\"font-family: arial, helvetica, sans-serif\">Dos<\/span><\/h2>\n<ul>\n<li>\n<h3><span style=\"font-family: arial, helvetica, sans-serif\">Keep to a threshold of 10 at all times<\/span><\/h3>\n<\/li>\n<\/ul>\n<p><span style=\"font-family: arial, helvetica, sans-serif\"><em>From a statistical standpoint this will be pretty obvious, as low frequencies can already cause problems when it comes to running tests and fitting models. If you don&#8217;t have low frequencies from the start, they&#8217;ll be less of a problem later. This can mean grouping categories together, or defining the variables a little differently.<\/em><\/span><\/p>\n<p><span style=\"font-family: arial, helvetica, sans-serif\"><em>By &#8220;at all times&#8221; I mean from the very first frequency counts, down to the finest detail of every cell in the largest cross-tabulation. 
Secondary disclosure can be tricky to navigate when writing up your research if there are any small numbers. For example, you usually want to start with a frequency table that lays out the scale of the problem you want to address; there are no small numbers so it is fine<\/em><em>\u00a0on its own<\/em><em>. However, if you come across a small number later, when you break down the problem by the controlling factors and explanatory variables, then what you can publish and pass through the disclosure control could be limited. If you know where the small numbers are, you can make better decisions in your research and be prepared.<\/em><\/span><\/p>\n<p><span style=\"font-family: arial, helvetica, sans-serif\"><em>Not all data providers have a threshold of 10; some are lower, at 5, so keeping to 10 is an easy way to make sure you always keep within the criteria.<\/em><\/span><\/p>\n<ul>\n<li>\n<h3><span style=\"font-family: arial, helvetica, sans-serif\">Request the data table underlying every graphic and visualisation for clearance<\/span><\/h3>\n<\/li>\n<\/ul>\n<p><span style=\"font-family: arial, helvetica, sans-serif\"><em>Perfecting the aesthetics of a graph or a visualisation can be a tedious affair, with minute changes to scales and labels, or even to the colour scheme. Getting outputs through disclosure checks can be just as tedious and lengthy (I once experienced delays over a graph where a single pixel could be used to calculate a small number that was hidden in a later table).<\/em><\/span><\/p>\n<p><span style=\"font-family: arial, helvetica, sans-serif\"><em>It is, therefore, advisable to request the data used to create the visualisation alongside the visualisation itself. 
That way you can re-create the graphic with the minute aesthetic changes without the need to pass through the disclosure control again.<\/em><\/span><\/p>\n<p><span style=\"font-family: arial, helvetica, sans-serif\"><em>On submitting a visualisation for clearance you may be required to provide the underlying data anyway, to prove that there are no small counts represented in the graphic.<\/em><\/span><\/p>\n<ul>\n<li>\n<h3><span style=\"font-family: arial, helvetica, sans-serif\">Plan your outputs before you start<\/span><\/h3>\n<\/li>\n<\/ul>\n<p><span style=\"font-family: arial, helvetica, sans-serif\"><em>Think about what outputs you might want before you get started. As with my first &#8220;do&#8221;, being prepared is key. What will be an absolute &#8220;must have&#8221; in the article\/report? What will you show in tables or figures? Which outputs could be combined and so risk secondary disclosure, and which would you change to prevent it?<\/em><\/span><\/p>\n<ul>\n<li>\n<h3><span style=\"font-family: arial, helvetica, sans-serif;font-size: 22px\">Be cautious of creating graphs from individual-level data<\/span><\/h3>\n<\/li>\n<\/ul>\n<p><span style=\"font-family: arial, helvetica, sans-serif\"><em>The problem with individual data points on graphs is self-identification. This includes outputs from a statistical model that show the values of any observed variable. 
You know your own data, so a person could work out which point corresponds to themselves, and to anyone else whose value they know.<\/em><\/span><\/p>\n<p><span style=\"font-family: arial, helvetica, sans-serif\"><em>That is not to say it is impossible to have a graph with individual data points, like outliers on a boxplot, but a good explanation will have to be given that justifies why it does not disclose any personal data.<\/em><\/span><\/p>\n<ul>\n<li>\n<h3><span style=\"font-family: arial, helvetica, sans-serif\">Check for dominance<\/span><\/h3>\n<\/li>\n<\/ul>\n<p><span style=\"font-family: arial, helvetica, sans-serif\"><em><span style=\"font-family: arial, helvetica, sans-serif\">Dominance occurs when the largest &#8220;unit&#8221; accounts for more than 43.75% of the total, or when all but the top two units together account for less than 12.5% of the largest unit (as defined on the UK Data Service Safe Researcher training course; check with your provider for their specific criteria). Although this does not usually happen at the individual level, consider for example identifying a particular court or centre. Data owners will not allow you to upset the organisations providing the data, and it may not be in the public interest to identify local practices. You wouldn&#8217;t, for example, identify an interviewer or branch of a study&#8217;s survey.<\/span><\/em><\/span><\/p>\n<h2><span style=\"font-family: arial, helvetica, sans-serif\">Don&#8217;ts<\/span><\/h2>\n<ul>\n<li>\n<h3><span style=\"font-family: arial, helvetica, sans-serif\">Don&#8217;t display a complete set of frequencies<\/span><\/h3>\n<\/li>\n<\/ul>\n<p><span style=\"font-family: arial, helvetica, sans-serif\"><em>Presenting frequencies can be key to portraying scale; however, they make it difficult to prevent later secondary disclosures. 
Personally I avoid frequency tables wherever possible, especially cross-tabulations, and stick to in-line rounded figures where necessary.\u00a0 Alternatively, present the distribution in terms of percentages, rates or other appropriate statistics. Beware that a total frequency can be used to calculate any hidden small values.<\/em><\/span><\/p>\n<ul>\n<li>\n<h3><span style=\"font-family: arial, helvetica, sans-serif\">Don&#8217;t provide the minimum and maximum<\/span><\/h3>\n<\/li>\n<\/ul>\n<p><span style=\"font-family: arial, helvetica, sans-serif\"><em>The minimum and maximum often only relate to a single observation, a big no-no.<\/em><em>\u00a0A suggested solution can be to average the top\/bottom 10 instead.<\/em><em> In general, summarising continuous variables has its own set of problems, and categorising these variables for display purposes can make things a lot easier.\u00a0<\/em><\/span><\/p>\n<ul>\n<li>\n<h3><span style=\"font-family: arial, helvetica, sans-serif\">Don&#8217;t overlook 0 counts<\/span><\/h3>\n<\/li>\n<\/ul>\n<p><span style=\"font-family: arial, helvetica, sans-serif\"><em>Zeroes in themselves do not identify any individuals, but information can be inferred from them. Say you find that zero individuals in the data are without a disability, and you then meet someone you know used the service you were analysing (and so was present in the dataset). The results would then disclose that the individual you met has a disability, which would be a breach of their personal information.<\/em><\/span><\/p>\n<p><span style=\"font-family: arial, helvetica, sans-serif\"><em>Structural zeroes can be ok. For example, in a table of highest qualification by age, you would expect zero under-16s with a qualification from higher education.<\/em><\/span><\/p>\n<hr \/>\n<p>Don&#8217;t be overwhelmed. Statistical disclosure comes with a lot to take in, and all the information is taught during the required training courses. 
Data providers are also readily available to help guide you along the way. Working together, it is possible to resolve most statistical disclosure issues.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Ensuring that individuals (or organisations) cannot be identified from research outputs when using administrative data involves additional hurdles and can require extra levels of preparation. Data providers will often review any outputs before they are published, and each provider has a set of criteria that every output will have to clear. In the case &hellip; <a href=\"http:\/\/wp.lancs.ac.uk\/administrative-data-research\/2021\/08\/27\/statistical-disclosure-the-dos-and-donts\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Statistical disclosure, the dos and don&#8217;ts<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":933,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[7],"tags":[],"class_list":["post-73","post","type-post","status-publish","format-standard","hentry","category-statistical-disclosure"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"http:\/\/wp.lancs.ac.uk\/administrative-data-research\/wp-json\/wp\/v2\/posts\/73","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/wp.lancs.ac.uk\/administrative-data-research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/wp.lancs.ac.uk\/administrative-data-research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/wp.lancs.ac.uk\/administrative
-data-research\/wp-json\/wp\/v2\/users\/933"}],"replies":[{"embeddable":true,"href":"http:\/\/wp.lancs.ac.uk\/administrative-data-research\/wp-json\/wp\/v2\/comments?post=73"}],"version-history":[{"count":20,"href":"http:\/\/wp.lancs.ac.uk\/administrative-data-research\/wp-json\/wp\/v2\/posts\/73\/revisions"}],"predecessor-version":[{"id":97,"href":"http:\/\/wp.lancs.ac.uk\/administrative-data-research\/wp-json\/wp\/v2\/posts\/73\/revisions\/97"}],"wp:attachment":[{"href":"http:\/\/wp.lancs.ac.uk\/administrative-data-research\/wp-json\/wp\/v2\/media?parent=73"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/wp.lancs.ac.uk\/administrative-data-research\/wp-json\/wp\/v2\/categories?post=73"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/wp.lancs.ac.uk\/administrative-data-research\/wp-json\/wp\/v2\/tags?post=73"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}