Updated disclosure control guidance
Since launching OpenSAFELY, we have made some updates to our disclosure control guidance based on our experience with running the OpenSAFELY output checking service. We have made the following changes, which are summarised in more detail below:
- Requiring rounding of counts in addition to suppression of low counts
- Increasing the small number suppression threshold to 7
- Encouraging release of underlying data earlier in the analysis pipeline
- A new checklist for requesting a release
- Updated recommendations for release of log files
- More detail on allowed file types
Requiring rounding of counts in addition to suppression of low counts
When output checking, as well as checking for appropriate small number suppression to avoid primary disclosure, we also have to think about secondary disclosure, where an individual’s attributes can be indirectly learned using other available information, including other releases in the same or similar projects. Making this assessment can be challenging; it’s impossible to consider all the existing public information or anticipate any future release that could pose a secondary disclosure risk. Rounding is a familiar, easy-to-understand approach that broadly protects against secondary disclosure.
We initially recommended rounding only in cases where we believed there was a secondary disclosure risk, for example when the same analysis is run at different time points. As the number of projects using OpenSAFELY, and hence the number of outputs, has grown, identifying these cases has become increasingly challenging, with projects using different but potentially overlapping population definitions and producing potentially overlapping outputs. To reduce this risk, we now recommend rounding all counts, following any small number suppression. We now make the following recommendations:
- Round all counts, including counts underlying any figures. At minimum we recommend all counts be rounded to the nearest 5.
- Rates should be calculated from rounded values. This includes crude risk ratios and odds ratios.
- Where unrounded counts are requested for release, users must explain why releasing them is important.
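The recommendations above can be sketched in a few lines of Python. This is a minimal illustration rather than an official OpenSAFELY implementation: the function names are hypothetical, the threshold of 7 and base of 5 are taken from this guidance, and treating a count of zero as a low count is an assumption of the sketch.

```python
from typing import Optional


def suppress_and_round(count: int, threshold: int = 7, base: int = 5) -> Optional[int]:
    """Redact low counts, then round what remains to the nearest multiple of `base`.

    Returns None for a redacted count. Zero is redacted as well, which is an
    assumption of this sketch.
    """
    if count <= threshold:
        return None
    # Integer arithmetic avoids floating point ties: for base 5 this adds 2
    # and floors to a multiple of 5, which is rounding to the nearest 5.
    return ((count + base // 2) // base) * base


def rate_from_rounded(numerator: int, denominator: int) -> Optional[float]:
    """Calculate a rate from the rounded values, never from the raw counts."""
    num = suppress_and_round(numerator)
    den = suppress_and_round(denominator)
    if num is None or den is None:
        return None
    return num / den


print(suppress_and_round(6))       # None (redacted)
print(suppress_and_round(123))     # 125
print(rate_from_rounded(13, 998))  # 0.015, i.e. 15 / 1000
```

Suppressing before rounding matters: applied the other way round, a true count of 8 would first become 10 and could no longer be compared against the threshold on its original scale.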
Rounding is a perturbative method of disclosure control. That is, it sacrifices data truthfulness by introducing error into the data. As a result, it has the potential to reduce the utility of any downstream outputs. We have developed documentation for midpoint 6 rounding as an alternative approach to reduce bias introduced by rounding, with an example implementation in R. In practice, considering the large number of patients in OpenSAFELY, we have found the recommended rounding to have minimal impact on the interpretation of most research outputs.
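For readers working in Python, the idea can be sketched as follows. The definition used here is an assumption: it takes midpoint 6 rounding to mean that each non-zero count is replaced by the midpoint of its band of six (1 to 6 becomes 3, 7 to 12 becomes 9, and so on), with zero left as zero. The function name is hypothetical, and the linked documentation remains the authoritative description.

```python
def round_to_midpoint(count: int, to: int = 6) -> int:
    """Replace a count with the midpoint of its band of width `to`.

    Assumed behaviour (not confirmed by this post): 0 stays 0, 1-6 becomes 3,
    7-12 becomes 9, 13-18 becomes 15, and so on.
    """
    if count == 0:
        return 0
    upper_edge = -(-count // to) * to  # ceiling division, then scale back up
    return upper_edge - to // 2


print([round_to_midpoint(n) for n in (0, 1, 6, 7, 12, 13)])  # [0, 3, 3, 9, 9, 15]
```

Under this assumption each released value sits in the middle of the range of counts it could represent, which is why the approach reduces the bias that rounding introduces.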
Increasing the small number suppression threshold to 7
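Our previous requirement was to redact counts <=5. Combined with rounding to the nearest 5, this meant that a released count of 5 could only have come from a true count of 6 or 7, whereas every other released value could have come from any of five true counts. Redacting counts <=7 before rounding makes 10 the smallest released value (covering true counts 8 to 12), so all released counts receive the same protection.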
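The arithmetic behind this change can be checked by enumeration. The short Python sketch below (the function name and the upper limit of 32 are illustrative choices) lists, for each released value, the true counts that could have produced it under each threshold.

```python
from collections import defaultdict


def candidates_by_released_value(threshold: int, base: int = 5, max_count: int = 32) -> dict:
    """Map each released (rounded) value to the true counts that round to it."""
    groups = defaultdict(list)
    # Counts at or below the threshold are redacted, so they never appear.
    for true_count in range(threshold + 1, max_count + 1):
        released = ((true_count + base // 2) // base) * base
        groups[released].append(true_count)
    return dict(groups)


old = candidates_by_released_value(threshold=5)
new = candidates_by_released_value(threshold=7)

print(old[5])    # [6, 7]
print(new[10])   # [8, 9, 10, 11, 12]
print(5 in new)  # False
```

With a threshold of 5, a released 5 narrows the true count down to two values; with a threshold of 7, every released value is consistent with five possible true counts.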
Encouraging release of underlying data earlier in the analysis pipeline
OpenSAFELY provides a way to run fully reproducible analysis pipelines all the way from data extraction to producing journal ready figures and tables. Our initial recommendation was that requests to release results from the server should be for outputs for final submission to a journal or for a small number of necessary outputs for discussion with external collaborators. This meant that all of the code for final outputs was open, and the amount of data being released was minimised to reduce the burden on reviewers.
In practice this has several drawbacks:
- It requires re-review for any minor edits to final outputs such as changing figure labels.
- It can result in a large number of outputs. For example, a set of figures produced from a single underlying table.
- Understanding the relationship between final outputs and the underlying data can be challenging for reviewers.
We have therefore progressively suggested researchers request release of outputs earlier in the pipeline. This could include data for producing figures (rather than just the figures themselves) or data for running a model that doesn’t require person-level data. This allows researchers to produce and edit downstream outputs within their local environment and avoids duplicative output review. Whilst outputs earlier in the pipeline may have more potential to include disclosive information, they are often easier to review for disclosure concerns as they have a more systematic structure.
There are some important considerations for researchers requesting release of these outputs:
- There is potential for the last leg of a project pipeline, such as producing figures, to be less reproducible as it does not have to fit into the OpenSAFELY pipeline. We recommend that users continue to develop their downstream analysis actions within the OpenSAFELY pipeline, even if they don’t intend to run them on the server against unreleased data.
- However, released data should not be committed to the repository.
- Intermediate results can contain much more data than outputs produced at the end of the analysis pipeline. The data contained within these outputs should still only be the minimum amount required to produce the downstream outputs or receive feedback from project collaborators.
A new checklist for requesting a release
We have added a new checklist to make requirements for requesting a release clearer.
Updated recommendations for release of log files
Log files are records of events, errors, warnings and information generated when code is executed. They help developers analyse and troubleshoot code behaviour by providing a record of the program’s activities, errors and important system events.
In OpenSAFELY, a log file is produced for each step of an analysis pipeline, which we call an action. During the initial phase of OpenSAFELY, we supported the release of log files, which could include analysis outputs. This reflected how analysis outputs are commonly produced in a local environment. Releasing outputs inside log files has several drawbacks: it can be difficult to disentangle verbose warning messages from research outputs that need checking; there can be numerous outputs per file; and the formatting of outputs can make them difficult to review.
We have therefore updated our handling of log files to make the following recommendations:
- Research outputs should always be output separately to any log files, using a standard format, e.g. .csv for tabular outputs.
- Review of log files to analyse errors or warnings should primarily happen on the Level 4 server and doesn’t require release.
- In exceptional circumstances, log files can be requested for release, for example if you need to discuss an error and any related data with a researcher who writes code but does not have Level 4 access. When log files are requested, the data within them should be minimised.
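As an illustration of keeping research outputs out of log files, the Python sketch below writes a table to its own .csv file and sends only progress messages to the log. The file names, logger name and function name are hypothetical, and writing the log to a file from within the script is an assumption made so the example is self-contained; how logs are captured on the server may differ.

```python
import csv
import logging
from pathlib import Path


def write_outputs(rows, output_dir):
    """Write `rows` to counts.csv and progress messages to action.log."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    table_path = output_dir / "counts.csv"  # research output, to be reviewed
    log_path = output_dir / "action.log"    # diagnostics only, no data values

    logger = logging.getLogger("analysis_action")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path)
    logger.addHandler(handler)
    try:
        # Log what happened, not the data itself.
        logger.info("Writing %d rows to %s", len(rows), table_path.name)
        with table_path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["group", "count"])
            writer.writeheader()
            writer.writerows(rows)
        logger.info("Finished writing table")
    finally:
        logger.removeHandler(handler)
        handler.close()
    return table_path, log_path
```

Only the .csv file would then need to be requested for release, and the log can stay on the server for troubleshooting.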
More detail on allowed file types
We have added more detail to the allowed file type section. In particular, this includes being more restrictive around the release of HTML outputs.
In addition to these challenges, we also observed that HTML outputs were commonly being produced when they weren’t needed, such as reports containing a single table with no narrative overview. To make output checking easier and more efficient, we therefore now make the following recommendations for HTML outputs:
- HTML files can be released, but this should be reserved for occasions where both contextual text and embedded outputs are required. Most commonly, this will be an HTML report that is intended to be hosted on reports.opensafely.org. If you are not hosting the report there, you can request the release of individual outputs and generate an HTML report locally.
- Where an HTML file is requested for release, the individual outputs should be output separately.