OpenSAFELY is a secure, transparent, open-source software platform for analysis of electronic health records data. All platform activity is publicly logged. All code for data management and analysis is shared, under open licenses and by default, for scientific review and efficient re-use.
OpenSAFELY is a set of best practices encoded as software. It can be deployed to create a Trusted Research Environment (TRE) alongside appropriate database, compute, governance, and administrative elements; or it can be deployed as a privacy-enhancing layer on any existing secure database or TRE.
OpenSAFELY software is currently deployed within the secure data centres of the two largest electronic health record providers in the NHS, to support urgent research into the Covid-19 emergency: thereby creating OpenSAFELY-TPP and OpenSAFELY-EMIS. We are also deploying OpenSAFELY in other data centres with other NHS partners to support rapid, transparent, and open analytics. Designed to provide secure and federated analytics, OpenSAFELY helps the NHS minimise the sharing of confidential patient information.
To understand the technical design choices of OpenSAFELY, it is helpful to first understand the privacy and disclosure risks which the tools aim to mitigate. “Pseudonymisation” is a widely used process for protecting patients' privacy whereby explicit identifiers such as names, addresses, and dates of birth are removed from patients' medical records before they are used or shared. Pseudonymisation is necessary to protect patients' privacy, but it is not a sufficient safeguard on its own. For example, pseudonymisation might help to prevent a researcher accidentally seeing a piece of information about someone they know, flashing past on the screen; but beyond this, it does very little to preserve privacy. For example, in a comprehensive set of NHS patient records, someone misusing the data could easily find and read Tony Blair’s entire health record by searching for a patient who matches information that is openly available on his Wikipedia page: his age, the approximate dates he was treated for an abnormal heart rhythm, and the fact that he lived in London.
Because this kind of re-identification can be easy, we believe that detailed pseudonymised health data should be handled as if it were identifiable, taking all reasonable technical steps to prevent and detect misuse of the data: it should be disseminated as little as possible; accessed in environments that do their best to prevent researchers ever needing to even view the underlying raw data; and managed in a setting where comprehensive logs of all actions are kept, and ideally shared, in a form where they can be easily reviewed at scale. These technical safeguards should sit alongside the other widely implemented administrative safeguards: scrutinising proposed analyses to ensure they are likely to have public benefit, and evaluating users to ensure that all analysts accessing data are trustworthy: functions that support safe projects and safe people.
OpenSAFELY aims to substantially exceed, by design, the current requirements on securing sensitive healthcare data.
OpenSAFELY does not move patient data outside of the secure environments where it already resides: instead, trusted analysts can run large scale computation across pseudonymised patient records in situ, and in near-real-time.
In the case of OpenSAFELY-TPP and OpenSAFELY-EMIS, we have implemented OpenSAFELY inside the data centres of the largest providers of GP electronic health record software in England, in the locations where patients' records already reside. This means that the data never moves location. (It also means that we get to work closely with EHR software developers in these companies, who know their own data extremely well).
However, this is not the only privacy safeguard. In addition, we do not give users unconstrained access to view and manipulate raw data on a remote machine: instead, users work on the data at arm’s length using OpenSAFELY services.
OpenSAFELY contains a range of flexible, pragmatic, but broadly standardised tools that users work with to convert raw patient data into “research ready” datasets, and to then execute code across those datasets. Standardising the data management pathway in this way brings numerous benefits around re-usability, efficiency, security, and transparency.
- All code created for data management and analysis can be shared, informatively, for review and re-use by all subsequent users. In most settings for NHS patient data analysis the same data management tasks are achieved by a huge range of bespoke and duplicative methods, in a huge range of different tools, with single tasks often spread between platforms or programming languages. In OpenSAFELY the data management is always done the same way, using the same OpenSAFELY tools, so code is created in a form where it can be quickly read, understood, adapted, and re-used by any user for any other data science project.
- Users are blocked from directly viewing the raw patient data or the research-ready datasets, but still write code as if they were in a live data environment. In most other settings analysts write their code - to convert raw data into finished graphs and tables - by working directly with (and seeing) the real data, iterating and testing as they go. In OpenSAFELY, the data management tools used to produce research-ready datasets also produce simulated, randomly generated “dummy data” that has the same structure as the real data, but none of the disclosive risks. Every researcher is therefore provided with a full, offline development environment where they can build all their data management and analysis code quickly, but only against dummy patient data. This minimises needless interaction with disclosive patient records and allows anyone with technical skills to swiftly check and reproduce the methods. Researchers develop all their code for statistical analysis, dashboards, graphs and tables against this dummy data, using open tools and services like GitHub. Their code is then tested automatically by the OpenSAFELY tools, using the dummy data. When it is capable of running to completion, it is packaged up inside a “container”, using a tool called “Docker”. All the data management and analysis code is then sent securely into the live data environment to be executed against the real patient data: researchers can only view their results tables and graphs, and no researcher ever needs to enter the real patient-data environment, or see the real patient data. It is useful to contrast this with other settings that work with synthetic data (real data, but with statistical noise added in an effort to preserve privacy): they typically require researchers to also use that synthetic data to run their analyses, which can undermine the reliability of the results. In OpenSAFELY the synthetic dummy data is only used for code development, not code execution.
In this way we get all the privacy preserving benefits of completely random synthetic data, but also retain all of the analytic benefits that come from executing code against real patient data.
- All code ever executed against the patient data can be shared, as an informative public log, because none of that code is disclosive of patient data. Normally TREs that execute code against real patient data try to keep a log of activity in the platform, but they cannot share every action in each user session, because the code for data management and analysis was generated while working directly with the real data, and so there is a substantial risk that the code itself might contain some disclosive information about individual patients. Some platforms log screen recordings, or keystrokes, for later review, but these can also never be shared openly, because of the disclosure risk (they are also laborious to review). OpenSAFELY code is only generated by working with dummy data, so we can be certain that it is non-disclosive. This means it can be shared, and so we do share it: all of it, automatically, in public, and by default. This means that every interested stakeholder can see what every analyst has done with patients' data inside OpenSAFELY, and all patients, professionals and policymakers can be confident that data has only been used for the purpose for which access was granted: this is crucial for building trust, and a substantial improvement on the current paradigm whereby TREs only share a list of projects with permissions. The removal of privacy risks from data management and analysis code also frees up OpenSAFELY code for sharing and re-use under open licenses: it means that there is no information governance or privacy barrier to users sharing code for others to review, critically evaluate, improve, and re-use, wherever they wish.
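The dummy-data idea described above can be sketched in a few lines. This is a minimal illustration, not the real OpenSAFELY tooling: the column names, value ranges, and schema format are all invented for the example. The point is that the generated rows share the real dataset's *structure* (columns and types) while every value is random.

```python
import csv
import io
import random
from datetime import date, timedelta

random.seed(42)  # reproducible dummy data for local development

def random_date(start=date(2020, 1, 1), end=date(2020, 12, 31)):
    """Pick a uniformly random date in [start, end]."""
    span = (end - start).days
    return start + timedelta(days=random.randint(0, span))

# Hypothetical schema: column name -> generator of a random value.
# These columns are illustrative, not OpenSAFELY's actual data model.
SCHEMA = {
    "patient_id": lambda i: i,  # sequential, entirely artificial IDs
    "age": lambda i: random.randint(0, 100),
    "sex": lambda i: random.choice(["M", "F"]),
    "region": lambda i: random.choice(["London", "North", "South"]),
    "test_date": lambda i: random_date().isoformat(),
}

def make_dummy_rows(n):
    """Rows with the real dataset's structure but random contents."""
    return [{col: gen(i) for col, gen in SCHEMA.items()} for i in range(1, n + 1)]

rows = make_dummy_rows(5)

# Analysts develop and debug their whole pipeline against rows like these;
# the same code is later executed, unchanged, against the real data.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(SCHEMA))
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Because the generator is seeded, every developer sees the same dummy dataset, which makes code review and debugging against it straightforward.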
These working methods, and the code in which they are embodied, mean that OpenSAFELY substantially exceeds current best practice around secure execution of analysis code on pseudonymised patient data, when combined with the other governance features of a strong TRE. After completion of each analysis, only minimally disclosive summary data is released outside the secure environment, such as summary tables or figures, after strict disclosivity checks and redactions, to ensure safe data, safe settings and safe outputs. When access to any TRE, including one using OpenSAFELY code, is considered to be appropriate, the dataset described by the study definition should also be justified and proportionate, in accordance with the Caldicott principles and the DCMS data ethics framework.
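One of the disclosure controls mentioned above is small-number suppression: redacting any cell in a summary table whose count is small enough to risk identifying individuals. The sketch below illustrates the idea only; the threshold of 5 and the table contents are invented for the example, not OpenSAFELY's actual policy.

```python
# Illustrative small-number suppression rule. The threshold is an
# assumption for this example, not the platform's real parameter.
SUPPRESSION_THRESHOLD = 5

def suppress(counts, threshold=SUPPRESSION_THRESHOLD):
    """Replace any count below the threshold with None (redacted)."""
    return {k: (v if v >= threshold else None) for k, v in counts.items()}

# A hypothetical summary table produced inside the secure environment.
table = {"group_a": 1200, "group_b": 3, "group_c": 47}
safe = suppress(table)
print(safe)  # group_b is redacted before the table may leave the TRE
```

In practice such checks are applied (and reviewed) before any output is released from the secure environment.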
OpenSAFELY has been built as a broad collaboration between a huge range of organisations, and users, each of whom bring unique expertise.
- The DataLab at the University of Oxford led on building the software platform, as a mixed team of software developers and traditional academic researchers.
- The EHR group at London School of Hygiene and Tropical Medicine has decades of deep expertise on working with GP data and other forms of NHS electronic health records.
- TPP and EMIS are electronic health record system suppliers covering 58 million patient records in total across England: they have very deep knowledge of electronic health records data, and have provided data infrastructure and other support pro bono for OpenSAFELY-TPP and OpenSAFELY-EMIS in the context of the global COVID-19 pandemic.
- NHS England and NHSX handle all information governance, permissions, and additional data sources.
- We also have a growing list of broader collaborations including ICNARC, ISARIC, PHOSP, ONS, the National Core Studies Longitudinal Health team representing a range of cohort studies, and more.
Together we represent a large national team of software developers, clinicians, and epidemiologists, all pooling diverse skills and knowledge to deliver high performance, highly secure, and high quality data analysis on NHS records. Our aim is to combine best practices from academia and the open source software development community. We now have a growing community of health data analysts who can speak the same language as software developers, doing “pull requests” and “code reviews”; and full stack software developers with deep knowledge and understanding of health services and NHS data.
The project is led by Ben Goldacre, Director of the DataLab; Seb Bacon, CTO at the DataLab; and Liam Smeeth, Director of LSHTM.
OpenSAFELY does not rely on an assumption of trust: it aims to be provably trustworthy, and transparent, by providing a full public log of all activity in the platform. This allows patients, professionals and the public to hold the entire system accountable, because it’s their data being used for public benefit. Below are some examples of how we drive accountability:
- All projects started within OpenSAFELY are visible to the public. For example, here are all projects by the London School of Hygiene and Tropical Medicine.
- OpenSAFELY requires all researchers to archive and publish their analytic code: this is the only way they are allowed to run code against real data. For example, this paper about school age children and COVID-19 has its own project page, which links to the full code used to generate the analysis (and a history of every version of the code ever run).
- Every time a researcher wants to run this code against real data, the event is logged in public.
- Every time a researcher changes their code, it is automatically checked in public (the green or red circles here are a record of this), to provide reassurance it can be run without errors.
- Clinical research isn’t only about writing analytic code. It also involves compiling “codelists”: sets of clinical terms that define symptoms, investigations, diseases, conditions or other actions (such as the provision of a referral letter, or a “sick note”). The OpenSAFELY platform includes a set of tools for creating and sharing codelists in public; for example, this one about haemoglobin tests, which includes an automatically generated audit trail of how the codelist was constructed.
- It is accepted that some code may remain private while an analysis is in development. However, all code is published when the results of the analysis are shared (or, for incomplete projects, as soon as possible, usually at the point of their cessation, and no later than 12 months after any code has been executed against the raw patient data).
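At its simplest, a codelist as described above is just a set of clinical codes that defines a variable, and applying it means checking which patients have a matching code in their record. The sketch below shows only the concept: the SNOMED-style codes and the records are invented examples, not a real published codelist.

```python
# Hypothetical codelist: a set of clinical codes defining one concept
# (e.g. "haemoglobin test"). These code values are invented.
haemoglobin_codelist = {"1022431000000105", "1022651000000100"}

# Invented dummy records in the shape of (patient, clinical code) events.
records = [
    {"patient_id": 1, "code": "1022431000000105"},
    {"patient_id": 2, "code": "999999"},
    {"patient_id": 3, "code": "1022651000000100"},
]

# Flag every patient with at least one record matching the codelist.
matched = {r["patient_id"] for r in records if r["code"] in haemoglobin_codelist}
print(sorted(matched))  # -> [1, 3]
```

Publishing the codelist itself, alongside the audit trail of how it was built, lets others check exactly which clinical events a study's variables captured.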
The OpenSAFELY commitment to public accountability also extends to how the software is developed. The software is developed on a non-commercial basis by the DataLab at the University of Oxford, and is freely available for the public to inspect and re-use. Full documentation is published on all aspects of the platform. All feature development and product support happens in a public forum.
As in all other research settings OpenSAFELY users are expected to maintain the highest standards of research integrity as described by, for example, Universities UK’s Concordat to Support Research Integrity and the UK Policy Framework for Health and Social Care Research. Research projects must be scientifically sound and guided by ethical principles: as with all other research work, gaining appropriate research ethics approval is a prerequisite for using the OpenSAFELY platform.
All information governance for OpenSAFELY-TPP and OpenSAFELY-EMIS is handled by NHS England. Research proposals submitted for possible execution in OpenSAFELY-TPP and OpenSAFELY-EMIS are assessed by NHS England and the OpenSAFELY collaboration in accordance with the process reviewed by the OpenSAFELY Oversight Board. This ensures that the platform is used for safe projects by safe people. In OpenSAFELY-TPP and OpenSAFELY-EMIS code can currently be executed against the full pseudonymised primary care records of over 58 million people, linked onto multiple additional sources of person-level data: this is a privilege, made possible by patients and the NHS; the scale of this privilege must never be forgotten. All OpenSAFELY researchers must remain conscious of the level of responsibility required from all those analysing sensitive health data, and be cognisant of their obligations to respect the individuals to whom the data relates. The intention, design and process of the research should be appropriately described and justified in a research proposal or protocol, noting that the OpenSAFELY platform can currently only be used to support research that will deliver urgent results related to the global COVID-19 emergency.
It is important that research teams are accountable for delivering on their pre-specified aims: in addition to other mechanisms used by the research community to achieve this objective, all activity on OpenSAFELY is clearly logged, leaving a record of every analysis executed by each user. All code is made public at the point that results are shared, if not sooner. All results from OpenSAFELY analyses must be shared within 12 months of execution (a code of conduct we require all researchers to sign up to): this should be regarded as the absolute latest date for sharing, with a clear expectation of rapid dissemination in preprints, reports, and papers.
OpenSAFELY has been built as a service that aims to embody best practice around open science. In particular, its design enforces the principle that no analysis should happen without that analysis being prespecified in code. To this end the platform imposes several important defaults on users: these all aim to help users produce high quality, openly shared code for health data science.
- The platform makes it impossible for researchers to query any real data without first writing their query as code and saving it in a code repository.
- The code is automatically tested by the system prior to running. This ensures the code is always in a reproducible state.
- We provide tools for researchers to precisely reproduce the production environment on their own computer. This minimises the risk of errors due to old conflicting versions of software, and frees them to do most of their development without touching patient data.
- This clean separation of study design and execution encourages clear, upfront, hypothesis-driven design. Moreover, the fact that every code execution is logged in public makes undisclosed p-hacking impossible.
- Integration with GitHub, the popular version control website, encourages best practices around open task management and code review.
- Because all code and configuration is recorded alongside project initiation documentation (including initial project design, and legal governance documentation), researchers can demonstrate public accountability and participation.
- Common research tasks such as data aggregation, case matching, time-based numerator/denominator pairs, low number suppression and statistical summaries are provided as libraries (called “actions”) that are reusable in any supported language (currently Python, R and Stata). Actions are rigorously tested and improved over time, benefiting the whole community. Anyone can contribute new actions; we are building a rich “actions library”.
- Variables relating to characteristics of patients are commonly developed around the concept of a “code list”: a set of codes matching clinical terms recorded in the database. We have provided a novel authoring tool that allows researchers to record and publish not only lists of codes, but also the logical process they went through to arrive at these lists.
- The fact that all code is published under an open source license makes it possible for researchers to learn from (and build on) each other's best practice.
- All study definition code, codelists, and results released from the platform (with disclosure controls applied) are made public by the researchers at or before the time when results are shared and papers are submitted for peer review publication.
- To support the change in culture towards open science OpenSAFELY will automatically make public the GitHub repository (to include the study definition code, codelists and released results) 12 months after any code has been executed against the pseudonymised patient records.
- In addition, OpenSAFELY is creating a public dashboard listing all approved projects: description of purpose; contact email of researcher; affiliated organisation; when the first code was executed; link to GitHub repository (the contents of which must be made public at the time of results dissemination, if not sooner); and links to published material (supplied by researchers). This is supplementary to the live audit log of all analyses described here.
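One of the common research tasks listed above is computing time-based numerator/denominator pairs. A reusable "action"-style helper for this might look like the sketch below; the function name and the event data are illustrative, not part of the actual OpenSAFELY actions library.

```python
from collections import defaultdict

def monthly_rates(events):
    """events: iterable of (month, had_outcome) pairs.

    Returns, per month, the proportion of patients with the outcome
    (numerator) out of all patients observed that month (denominator).
    """
    num = defaultdict(int)
    den = defaultdict(int)
    for month, had_outcome in events:
        den[month] += 1           # every observed patient counts
        num[month] += int(had_outcome)  # only outcomes count here
    return {m: num[m] / den[m] for m in sorted(den)}

# Invented dummy events for illustration.
events = [
    ("2020-03", True), ("2020-03", False),
    ("2020-04", True), ("2020-04", True),
]
print(monthly_rates(events))  # -> {'2020-03': 0.5, '2020-04': 1.0}
```

Packaging such logic as a shared, tested action means every study computes these rates the same way, rather than re-implementing them per project.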
We are keen to hear feedback on these principles, and learn from open working practices in other projects. Lastly, we are always keen to participate in collaborations and community building that will help sustain the growing productive ecosystem developing around the principles of modern, open, collaborative computational data science.
OpenSAFELY has been created, with public and charitable funding, for the benefit of population health. All underlying code for the platform is open-source: it can be viewed, evaluated, and re-used freely by all. All analysis code that executes on the platform must also be shared for review and re-use under an Open Source Initiative approved license, with the MIT license being the default.
This is in keeping with best practices, including those set out in the DataLab Open Manifesto: researchers should share their methods and code so that other teams can review their work, learn from it, and re-use it; researchers should feel safe and confident about publishing pragmatic, imperfect, working code; people developing code for tools and services should ideally work in public from early on in the project where possible.
Using OpenSAFELY is a collaborative process. All users are able to review, evaluate and re-use the codelists, code, libraries, documentation and other open resources produced by users who have worked in OpenSAFELY before them. In turn, all users contribute to the development and expansion of the platform as they deliver their own work: for example by contributing to codelists, documentation, user-research, research papers, code libraries, lay summaries, or blogs.
While open code is the norm in other disciplines - such as physics, or structural genomics - we recognise that open working can raise concerns for some users around credit, resource, and reward, especially under current arrangements for research funding. Specifically, we recognise the need for researchers to make their work sustainable by taking credit for work that they themselves have developed, or resourced, and of pressures to publish. We also recognise the shifting norms around the role of research software engineers, and the need for software developers to receive prominent credit for their contributions to papers, since great work in computational data science is only produced by mixed teams of developers and researchers working hand in hand. We are running and participating in various projects to help develop new norms and mechanisms to recognise contributions to code, data acquisition, codelists, and engineering. As an interim, we assume: where bespoke software tools, data, or code are produced or funded for a specific project by a specific team, it is reasonable to expect that team will have the opportunity to be the first user, but without long access delays for the wider community; developers contributing to code that delivered an output in OpenSAFELY should be offered authorship, more prominently where substantial new code was developed bespoke for a specific output; and all prior work on codelists, code, data acquisition, and methods that is re-used should receive appropriate recognition and attribution. Separately, we hope that OpenSAFELY users will want to participate in community efforts around developing new norms that can incentivise and deliver more open and reproducible science in healthcare.