Principles of OpenSAFELY, for users and the team

OpenSAFELY is a highly secure, open-source software platform for the analysis of electronic health records data, where all code for data management and analysis is shared openly, by default, for scientific review and efficient re-use.

Core Design Features for Privacy and Openness

OpenSAFELY aims to substantially exceed, by design, the current legal and ethical requirements for securing sensitive healthcare data. OpenSAFELY does not move patient data outside of the secure environments where it already resides: instead, trusted analysts can run large-scale computation across near real-time pseudonymised patient records in situ. OpenSAFELY does not give users unconstrained access to raw data on a remote machine: instead, bespoke open software imposes flexible, pragmatic standards on the transformation of raw data into research-ready datasets, and handles the execution of code across the data. These constraints allow OpenSAFELY to deliver multiple additional benefits around security, quality, and efficiency. They mean that all code created for data management and analysis can be shared automatically, and in readable form, for review and re-use by all subsequent users. They allow the platform to keep informative logs of all activity. Crucially, they also allow the platform to produce simulated “dummy data” so that researchers can develop analysis code using open tools and services such as GitHub, without needing unconstrained access to view the underlying data, before their code is sent through to be securely executed in the live data environment.
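
As an illustration of how these constraints work in practice, the following is a minimal, hypothetical sketch of a study definition using the platform's cohortextractor tooling. The variable names, dates, codelist file and clinical area are invented for illustration; the key point is that the dataset is specified declaratively in shareable code, and the return_expectations arguments tell the platform how to simulate dummy data for local development before the same code is executed against the real records.

```python
# study_definition.py -- a hypothetical, minimal example for illustration only.
# The dataset is described declaratively; no patient data is visible to the author.
from cohortextractor import StudyDefinition, codelist_from_csv, patients

# Hypothetical codelist file; in practice codelists are curated and shared openly.
diabetes_codes = codelist_from_csv(
    "codelists/diabetes.csv", system="ctv3", column="code"
)

study = StudyDefinition(
    # Expectations used to generate plausible dummy data for local development.
    default_expectations={
        "date": {"earliest": "2020-02-01", "latest": "today"},
        "rate": "uniform",
        "incidence": 0.5,
    },
    # Study population: patients registered with a practice on the index date.
    population=patients.registered_as_of("2020-02-01"),
    # Age at the index date, with a realistic simulated age distribution.
    age=patients.age_as_of(
        "2020-02-01",
        return_expectations={
            "rate": "universal",
            "int": {"distribution": "population_ages"},
        },
    ),
    # Binary flag for a prior coded event from the (hypothetical) codelist.
    diabetes=patients.with_these_clinical_events(
        diabetes_codes,
        on_or_before="2020-02-01",
        returning="binary_flag",
        return_expectations={"incidence": 0.2},
    ),
)
```

Run locally, a sketch like this produces only dummy data matching the stated expectations; the identical code is then submitted for secure execution in the live data environment, and both the code and its execution are logged and shareable.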

Open Working, Open Sharing

OpenSAFELY has been created, with public and charitable funding, for the benefit of population health. All underlying code for the platform is open-source: it can be viewed, evaluated, and re-used freely by all. All analysis code that executes on the platform must also be shared for review and re-use under an Open Source Initiative approved license, with the MIT license as the default. This is in keeping with best practices, including those set out in the DataLab Open Manifesto: researchers should share their methods and code so that other teams can review their work, learn from it, and re-use it; researchers should feel safe and confident about publishing pragmatic, imperfect, working code; and people developing code for tools and services should, where possible, work in public from early on in the project.

Using OpenSAFELY is a collaborative process. All users are able to use the codelists, code, libraries, documentation and other open resources produced by users who have worked in OpenSAFELY before them. In turn, all users contribute to the development and expansion of the platform as they deliver their own work: for example by contributing to codelists, documentation, user research, research papers, code libraries, lay summaries, or blogs.

With respect to credit: we are aware of the need for researchers to receive recognition for productive outputs, of pressures to publish, and of the need for researchers to make their work sustainable by taking credit for work that they themselves have developed or resourced; we also recognise that software developers should receive prominent credit for their contributions to papers, and that great work in computational data science is produced by mixed teams of developers and researchers working hand in hand. We understand that the community is working to develop new norms and mechanisms in this space - we are contributing - and that current norms may not always work well for contributions to code, codelists, and engineering. As an interim position, we assume: all prior work that is re-used should receive appropriate recognition and attribution; all those making a contribution to code should be offered authorship, more prominently where substantial new code was developed bespoke for a specific output; and, where bespoke software tools or code are produced or funded for a specific project by a specific team, it is reasonable to expect that team to have the opportunity to be the first user, though without imposing long delays on the wider community.

Ethics and Integrity

As in all other research settings, OpenSAFELY users are expected to maintain the highest standards of research integrity as described by, for example, Universities UK’s Concordat to Support Research Integrity and the UK Policy Framework for Health and Social Care Research. Research projects must be scientifically sound and guided by ethical principles: as with all other research work, gaining appropriate research ethics approval is a prerequisite for using the OpenSAFELY platform. Mechanisms must be in place to hold the research team accountable for delivering on their pre-specified aims: in addition to other mechanisms used by the research community to achieve this objective, all activity on OpenSAFELY is clearly logged, leaving a record of every analysis executed by each user. The intention, design and process of the research should be appropriately described and justified in a research proposal or protocol, noting that the OpenSAFELY platform can currently only be used to support research that will deliver urgent results related to the global COVID-19 emergency. All research proposals submitted will be assessed by NHS England and the OpenSAFELY collaboration in accordance with the process reviewed by the OpenSAFELY Oversight Board. This ensures that the platform is used for safe projects by safe people. OpenSAFELY currently executes code against the full pseudonymised primary care records of over 58 million people, linked onto multiple additional sources of person-level data. All OpenSAFELY researchers must remain conscious of the level of responsibility required from all those analysing sensitive health data, and be cognisant of their obligations to respect the individuals to whom the data relates.

Security and Privacy

OpenSAFELY is not a data extraction service: it cannot be used to export, download, or share large extracts of patient-level data. The only data that can be released outside the secure environment is minimally disclosive summary data from an analysis, such as summary tables or figures, that has undergone strict disclosure checks and redactions. This ensures safe data, safe settings and safe outputs. The OpenSAFELY platform substantially exceeds current best practice around secure execution of analysis code on pseudonymised patient data, as described above. Where access to the OpenSAFELY platform is considered appropriate, the dataset described by the study definition should also be justified and proportionate; the OpenSAFELY team are developing tools to evaluate and facilitate proportionate data access. This is in accordance with the Caldicott principles and the DCMS data ethics framework.

Transparency

OpenSAFELY does not rely on an assumption of trust: it aims to be provably trustworthy and transparent. Analysis is done inside the data centre of the electronic health records software company, using open methods; all analytic code is shared by default; and there is a publicly available, live audit log of all research tasks completed on the OpenSAFELY platform. It is accepted that some GitHub repositories will remain closed while an analysis is in development; all GitHub repositories become open when the results of the analysis are shared.

Contributing to best practice around open science

OpenSAFELY has been built as a service that aims to embody best practice around open science and privacy preservation when working with pseudonymised patient records. All study definition code, codelists, and results released from the platform (with disclosure controls applied) are made public by the researchers at or before the time results are shared, and papers are submitted for peer-reviewed publication. To support the change in culture towards open science, OpenSAFELY will automatically make public the GitHub repository (including the study definition code, codelists and released results) 12 months after any code has been executed against the pseudonymised patient records. In addition, OpenSAFELY will create a public dashboard listing all approved projects, including: a description of purpose; the contact email of the researcher; the affiliated organisation; when the first code was run; a link to the GitHub repository (to be made public at most 12 months after the first code is executed, and in almost all cases at the time of results dissemination); and links to published material (supplied by researchers). This dashboard is supplementary to the live audit log of all analyses.

We are keen to hear feedback on these principles, and to learn from open working practices in other projects. Lastly, we are always keen to participate in collaborations and community building that will help sustain the productive ecosystem growing around the principles of modern, open, collaborative computational data science.