OpenSAFELY in detail


Introduction

OpenSAFELY is an open source trusted research environment (TRE), specifically tailored for electronic health record (EHR) analysis. It’s designed to deliver high quality analytic outputs, as openly as possible, without compromising on accuracy, quality, speed or privacy.

UK GP records are incredibly powerful for research, innovation, and health service improvement. They have both breadth (one record for every citizen) and depth (huge amounts of detail about every citizen’s health). They are also a huge privacy risk, because of that breadth and depth. We knew that this privacy issue could not be ignored, so instead we engineered and innovated around it.

We designed OpenSAFELY with new ways of managing privacy risks. Researchers are given dummy data to develop their analysis, then later submit code for automated remote execution against real patient records, without ever needing to directly interact with them. The patient data remain in situ, managed and kept secure by the organisations that have been doing that job for many years.

The BMA, the Royal College of GPs and privacy campaigners MedConfidential have all given their full support to the platform, in formal written letters to the Secretary of State. These organisations objected to previous approaches for national-scale GP data analysis.

OpenSAFELY is a public asset. It is publicly funded, built by researchers and software developers at the University of Oxford, all IP is shared openly, and the Data Controller is NHS England.

OpenSAFELY is made possible by formal Directions from the Secretary of State for Health and Social Care. Those are the 2020 COVID-19 Directions, which permitted research into COVID-19 related topics; and the 2025 Pilot Directions, which allow for research on any healthcare topic.


How OpenSAFELY works

OpenSAFELY is a secure, transparent, open-source software platform for analysis of electronic health records data. All platform activity is publicly logged. All code for data management and analysis is shared, under open licenses and by default, for scientific review and efficient re-use.

The software is deployed within the secure data centres of the two largest electronic health record providers in the NHS: TPP and Optum (formerly EMIS). We are also deploying OpenSAFELY in other data centres with other NHS partners to support rapid, transparent, and open analytics.

OpenSAFELY and pseudonymisation

Pseudonymisation is a widely used process for protecting patients’ privacy, where explicit identifiers such as names, addresses, and dates of birth are removed from patients’ medical records, before those records are used or shared.

Pseudonymisation is necessary to protect patients’ privacy, but it is not a sufficient safeguard on its own. OpenSAFELY does not rely on it.

For example: pseudonymisation might help to prevent a researcher accidentally seeing a piece of information about someone they know, flashing past on the screen; but beyond this, it does very little to preserve privacy. In a comprehensive set of NHS patient records, someone misusing the data could easily find a celebrity’s entire health record by searching for information that’s already in the public domain.

That’s why we always treat pseudonymised health data as if it could still identify people.

In our view, even pseudonymised data should be:

  • disseminated as little as possible;
  • accessed in environments that do their best to prevent researchers ever needing to even view the underlying raw data; and
  • managed in a setting where comprehensive logs of all actions are kept, and ideally shared, in a form where they can be easily reviewed at scale.

We have implemented technical safeguards to address all of these, alongside other widely implemented administrative safeguards. New project proposals have to demonstrate likely public benefit; we check new users to make sure, as far as possible, that they are trustworthy. These are functions that support safe projects and safe people.

Core design features for privacy, transparency, and open working

OpenSAFELY aims to substantially exceed the current requirements on securing sensitive healthcare data. As explained above, patient data do not move outside the secure environments where they already exist. Users never get unconstrained access to that raw data.

In addition, we’ve designed OpenSAFELY with a set of built-in flexible, pragmatic, but broadly standardised tools to help users convert raw patient data into “research ready” datasets, then execute code across those datasets. That means:

All code created for data management and analysis can be shared for review and re-use by all subsequent users. In OpenSAFELY, the data management is always done the same way, using the same OpenSAFELY tools, so code is created in a form where it can be quickly read, understood, adapted, and re-used by any user for any other data science project. Code for one is code for all.

Although users cannot directly access raw patient data or the research ready datasets, they can still write code as if they were in a live data environment. In most other settings, analysts write their code, which converts raw data into finished graphs and tables, by working directly with (and seeing) the real data, iterating and testing as they go. OpenSAFELY’s dummy data has the same structure as the real data, but comes with zero risk of disclosing patient secrets.
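The dummy-data idea can be illustrated with a small sketch. This is not OpenSAFELY's actual dummy-data generator, and the schema and column names here are invented for illustration: the point is simply that rows conforming to a known schema let analysis code be developed and tested without any real records.

```python
import csv
import io
import random

# Illustrative only: a toy schema with hypothetical column names.
# Each column maps to a generator that produces a plausible dummy value.
SCHEMA = {
    "patient_id": lambda rng: rng.randrange(1, 10_000_000),
    "sex": lambda rng: rng.choice(["male", "female"]),
    "age": lambda rng: rng.randrange(0, 110),
    "hba1c_mmol_mol": lambda rng: round(rng.uniform(20, 120), 1),
}

def make_dummy_csv(n_rows, seed=0):
    """Return a CSV string of n_rows schema-conformant dummy records."""
    rng = random.Random(seed)  # seeded, so dummy runs are reproducible
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(SCHEMA))
    writer.writeheader()
    for _ in range(n_rows):
        writer.writerow({col: gen(rng) for col, gen in SCHEMA.items()})
    return buf.getvalue()

print(make_dummy_csv(3).splitlines()[0])
# → patient_id,sex,age,hba1c_mmol_mol
```

Analysis code written against a file like this has the same shape as code written against the real data, which is what makes the later remote execution possible.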

Every researcher gets a full, offline development environment where they can build all their data management and analysis code quickly. Anyone can do this, any time, before they apply to conduct a formal project. You can try the tutorial now, if you like.

This approach means there's far less privacy risk, and it allows anyone with technical skills to swiftly check and reproduce a study's methods.

Researchers develop all their code for statistical analysis, dashboards, graphs and tables against the dummy data, using open tools and services like GitHub. When a user’s code is able to run to completion, it is packaged up inside a Docker container and then sent securely into the live data environment, to be executed against the real patient data.

Researchers view only their results tables and graphs: no researcher ever needs to enter the real patient-data environment, or see the real patient data. OpenSAFELY code is generated only by working with dummy data, so we can be certain that it is non-disclosive. This means it can be shared, and so we do share it: all of it, automatically, in public, and by default.

Consequently, anyone can see what every analyst has done with patients’ data inside OpenSAFELY. All patients, professionals and policymakers can be confident that data has only been used for the purpose for which access was granted. This is crucial for building trust, and a substantial improvement on the current paradigm whereby TREs only share a list of projects with permissions.

Removing the privacy risks from the data management and analysis code also frees up OpenSAFELY code for sharing and re-use under open licenses. It means that there is no information governance or privacy barrier to users sharing code for others to review, critically evaluate, improve, and re-use, wherever they wish.

OpenSAFELY is a collaboration

OpenSAFELY has been built as a broad collaboration between a huge range of organisations and users, each of whom brings unique expertise.

Transparency and public logs

OpenSAFELY does not rely on an assumption of trust: we aim to be provably trustworthy and transparent, by providing a full public log of all activity in the platform.

This allows patients, professionals and the public to hold us and all our users accountable, because it’s their data being used for public benefit. Below are some examples of how we drive accountability:

  • All projects started within OpenSAFELY are visible to the public. For example, here are all projects by the London School of Hygiene and Tropical Medicine.
  • OpenSAFELY requires all researchers to archive and publish their analytic code: this is the only way they are allowed to run code against real data. For example, this paper about school age children and COVID-19 has its own project page, which links to the full code used to generate the analysis (and a history of every version of the code ever run).
  • Every time a researcher wants to run this code against real data, the event is logged in public.
  • Every time a researcher changes their code, it is automatically checked in public, to provide reassurance it can be run without errors.
  • Clinical research isn’t only about writing analytic code. It also involves compiling codelists: sets of clinical terms that define symptoms, investigations, diseases, conditions or other actions (such as the provision of a referral letter, or a “sick note”). OpenSAFELY includes a set of tools for creating and sharing codelists in public; for example, this one about HbA1c tests, which includes an automatically generated audit trail of how the codelist was constructed.
  • Some code might remain private while an analysis is being developed. But all code is published when the results of the analysis are shared (or, for incomplete projects, as soon as possible – usually at the point of their cessation, and no later than 12 months after any code has been executed against the raw patient data).

The OpenSAFELY commitment to public accountability also extends to how the software is developed. The software is developed on a non-commercial basis and is freely available for the public to inspect and re-use. All our documentation is published in the open. All the feature development and product support happens in a public forum.

What using the service actually looks like

Researchers write code in the Electronic Health Record Query Language (known as ehrQL for short – it rhymes with ‘circle’) to extract and shape data from the available datasets.

When they’ve prepared their dataset for analysis, they then write analysis code (in standard languages like Python, R or Stata) to produce graphs and tables, or to run statistical tests.

All the code users write is made up of individual units called actions, and those actions are organised into a pipeline. By working in this way, we ensure every user’s code is well organised. As explained above, users develop their code against dummy data, before submitting it to run on the real data.
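The action-and-pipeline idea can be sketched as follows. Real OpenSAFELY pipelines are declared in a project.yaml file rather than in Python, and this toy runner is purely illustrative: named actions depend on earlier actions, and the runner executes each one only after its prerequisites have run.

```python
# Illustrative only: a toy pipeline runner. Action names and the
# callables attached to them are invented for this sketch.
def run_pipeline(actions):
    """actions: {name: (dependency_names, callable)} -> execution order."""
    done, order = set(), []

    def run(name):
        if name in done:
            return
        deps, func = actions[name]
        for dep in deps:      # run prerequisites first
            run(dep)
        func()                # then run this action itself
        done.add(name)
        order.append(name)

    for name in actions:
        run(name)
    return order

actions = {
    "generate_dataset": ([], lambda: None),             # e.g. ehrQL extraction
    "run_model": (["generate_dataset"], lambda: None),  # e.g. R or Stata analysis
    "make_charts": (["run_model"], lambda: None),       # graphs and tables
}
print(run_pipeline(actions))
# → ['generate_dataset', 'run_model', 'make_charts']
```

Structuring work as discrete, dependency-ordered actions is what lets the platform re-run, log, and audit each step independently.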

Each package of work is known as a job. OpenSAFELY automatically keeps track of all the jobs, including every action being run, what it does, who requested it, and when it happened – there’s a live public dashboard on the web at jobs.opensafely.org, where anyone can keep an eye on what’s happening.
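The kind of record kept for each job can be sketched like this. The field names below are hypothetical, not the actual schema used by jobs.opensafely.org; the point is that every run captures what ran, who asked, and when.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Illustrative only: hypothetical field names for a public job-log entry.
@dataclass
class JobLogEntry:
    action: str        # which pipeline action ran
    requested_by: str  # who requested the run
    requested_at: str  # when it was requested (ISO 8601, UTC)
    status: str = "pending"

def log_line(entry):
    """Render one human-readable line for a public activity log."""
    return f"{entry.requested_at} {entry.requested_by} ran '{entry.action}' ({entry.status})"

entry = JobLogEntry(
    action="run_model",
    requested_by="researcher_a",
    requested_at=datetime(2024, 5, 1, 9, 30, tzinfo=timezone.utc).isoformat(),
    status="succeeded",
)
print(log_line(entry))
# → 2024-05-01T09:30:00+00:00 researcher_a ran 'run_model' (succeeded)
```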

OpenSAFELY then runs that research code automatically, at arm’s length. When each job is complete, researchers see summary results (mostly in the form of tables and graphs) inside the secure environment, using a tool called Airlock. Inside Airlock, users can see log files (useful for debugging and problem-solving), and data outputs. Airlock has automatic controls to restrict data (such as very large files, or certain file types).

Checking outputs for disclosive data

For all projects, we run an output checking service. After a researcher requests that some outputs be released from the secure environment – some graphs, or results tables – then at least two trained and qualified humans will manually check that they aren’t accidentally releasing anything that could possibly contain any information about any individual, even an anonymous individual.

Output checkers are experienced data science professionals. Current and former members of the output checking team include people from the Bennett Institute, the London School of Hygiene and Tropical Medicine and the University of Bristol.

Even with their data science backgrounds, all checkers go through formal training and have to pass an exam to complete it. That training is provided by colleagues from the Data Research, Access and Governance Network at the University of the West of England.

An output checker’s job is to examine research outputs from OpenSAFELY, and judge whether or not they pose any risk of disclosing identifiable private information. Just like the researchers, the output checkers don’t have direct access to the raw data – but they do have the experience and expertise to spot the kinds of outputs that could be problematic.

It’s important that output checking happens, and that it isn’t done in a rush; but no-one wants to delay scientific progress either. We aim to strike a sensible balance, and the majority of requests submitted by researchers have been checked in fewer than seven days.

The process for output checking looks like this:

  1. Having run a job within OpenSAFELY, researchers use Airlock to view the initial outputs within the secure environment. They are expected to carry out disclosure checking on these themselves, before passing anything up to the output checking service. (All OpenSAFELY users have to complete training on this before they’re allowed to start using the platform.) Then they’ll fill in a request form, in which they explain what each output file shows, and any disclosure controls they’ve already applied.
  2. The request is tracked as an issue on GitHub, and two output checkers are assigned to take a look.
  3. Each output checker gets access to the same output results that the researcher saw, and marks each file with a grade: approved, approved subject to change, or rejected.
  4. The reviews are sent back to the researcher who proceeds accordingly: if any files were marked as “approved subject to change”, the output checker will explain what change is necessary, and the researcher will have to re-submit for another output check after making the changes.
  5. Once the outputs are approved by both output checkers, they are released from the secure environment to the researchers, who can continue with any further analysis.

Those approved outputs are then moved to a secure location, outside the secure environment, from where they can be released to the outside world.

The output checking process is also fully audited, including requests for changes made by output checkers, and responses from the researchers. It’s called ‘Airlock’ for a reason: it’s a secure place where outputs can be viewed, understood and output-checked. Some of those outputs will be released, but many aren’t.
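The disclosure controls that researchers are expected to apply before submitting a release request (step 1 above) typically include small-number suppression and rounding. Here is a minimal sketch; the threshold and rounding base are illustrative examples, not OpenSAFELY's actual policy.

```python
# Illustrative statistical disclosure control: suppress small counts,
# round the rest. The specific numbers below are examples only.
SUPPRESS_BELOW = 8   # redact any count smaller than this
ROUND_TO = 5         # round surviving counts to the nearest multiple of 5

def apply_sdc(counts):
    """Return a disclosure-controlled copy of {category: count}."""
    safe = {}
    for category, n in counts.items():
        if n < SUPPRESS_BELOW:
            safe[category] = "[REDACTED]"  # too few people: could identify someone
        else:
            safe[category] = ROUND_TO * round(n / ROUND_TO)
    return safe

print(apply_sdc({"diabetes": 1234, "rare_condition": 3}))
# → {'diabetes': 1235, 'rare_condition': '[REDACTED]'}
```

Controls like these reduce the risk that a published table cell describes so few people that an individual could be picked out of it.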

Co-piloting to help new users

While we designed OpenSAFELY to be as simple as possible, and while we want its users to work independently, we’re conscious that it comes with a learning curve. So, we set up a Co-Pilot Programme to help new users get to grips with using the service.

Once enrolled, new external users (pilots) are paired with an experienced OpenSAFELY researcher (co-pilot), who helps them understand the OpenSAFELY philosophy, learn the various software tools, and work through each step required for a successful project.

Every new user is different. Some are more familiar with software tools like Git, and others have to learn how to use them from scratch. So the co-pilot’s role varies a lot. In addition to direct support via email and Slack, co-pilots organise regular calls to set goals, discuss progress made and resolve any issues. Co-pilots also support pilots through sessions on specific topics, such as paired programming, implementing quality assurance steps, and statistical disclosure control. More than anything, co-pilots are a calm, friendly face: moral support is as important as technical support.

The support that pilots receive from co-pilots is initially very intensive, but fades over time as pilots start to gain experience. But co-pilots don’t just disappear overnight. Additional support is offered where needed until project completion (and even beyond). This largely takes the form of continued output checking and manuscript review, but also ad-hoc meetings where co-pilots check progress and help with any issues that may still arise. Often, even after publishing a paper, a co-pilot and pilot will keep in touch to discuss new ideas and collaborations.

Co-piloting doesn’t just work one way, either. It helps us learn from users and make OpenSAFELY better.

Ethics, integrity and information governance

As in all other research settings, OpenSAFELY users are expected to maintain the highest standards of research integrity as described by, for example, Universities UK’s Concordat to Support Research Integrity and the UK Policy Framework for Health and Social Care Research. Research projects must be scientifically sound and guided by ethical principles. Gaining appropriate research ethics approval is a prerequisite for using the OpenSAFELY platform.

NHS England is responsible for all information governance, and for approving new research proposals.

All OpenSAFELY researchers must remain conscious of the level of responsibility required from all those analysing sensitive health data, and be cognisant of their obligations to respect the individuals to whom the data relates.

Information governance (IG) rules govern how researchers access patient data. They exist to help us maintain the highest standards of patient privacy, whilst still adhering to the necessary legal frameworks and best-practice ethical principles. IG is a vital component of OpenSAFELY – without those rules, and a system for maintaining and checking that they’re adhered to, OpenSAFELY simply couldn’t function. It would no longer be considered ‘safe.’

The governance of OpenSAFELY is a complex and, above all, collaborative process. NHS England is the Data Controller for the service as a whole. The GP practices themselves remain Data Controllers for the raw GP data that the OpenSAFELY tools operate on.

Day-to-day, our IG team supports researchers from one end of the process to the other – from applying to use OpenSAFELY, to publishing a paper. We help to make sure that researchers are properly trained; have the correct permissions to access data; and are given access to the relevant policies. We also check that every project using the rules for COVID-19 data access meets the relevant criteria.

We work across the whole platform to ensure that all relevant permissions are in place. This entails close work with NHS England and other external bodies such as the Health Research Authority (HRA), ONS and the Department of Health and Social Care (DHSC).

We help to identify the legal basis (under UK GDPR and Common Law) for processing patient data, supporting NHS England to complete all the necessary documentation, including the Data Protection Impact Assessment (DPIA); Data Processing Agreements (DPAs) with Optum and TPP; Data Sharing Agreements (DSAs) with data providers; and the Data Provision Notice to GP practices explaining the legal obligation they are under to share patient data with OpenSAFELY.

A lot of our time is spent talking to GPs, patients and the public, policymakers and other groups, to learn about their concerns, and to collaboratively develop solutions that will manage their concerns around data access and maintain support for OpenSAFELY across the wider community.


Earning and maintaining trust: PPIE and more

Trust is vitally important to the ongoing success of OpenSAFELY. We believe it’s important not just to say that we’re trustworthy, but to actively demonstrate it. That’s one of the reasons why we champion openness and transparency, as described above.

We actively seek input from external stakeholders, privacy campaigners, and members of the public. We run a variety of Patient and Public Involvement and Engagement (PPIE) activities, including our Digital Critical Friends group.


Crediting work done in OpenSAFELY

All OpenSAFELY researchers are expected to work in the open, as much as possible. We recognise that this causes concerns for some users around credit, resource and reward, especially under current arrangements for research funding.

Specifically, we recognise the need for researchers to make their work sustainable by taking credit for work that they themselves have developed, or resourced. And we know all about the pressure to publish.

We also recognise the shifting norms around the role of research software engineers, and the need for software developers to receive prominent credit for their contributions to papers, since great work in computational data science is only produced by mixed teams of developers and researchers working hand in hand.

Where bespoke software tools, data, or code are produced or funded for a specific project by a specific team, then it is reasonable to expect that team will have the opportunity to be the first authors, although this should not entail long access delays for the wider community. Developers who contribute to code that delivers an output in OpenSAFELY should be offered authorship, more prominently where substantial new code was developed bespoke for a specific output.

All prior work on codelists, code, data acquisition, and methods that are re-used should receive appropriate recognition and attribution.

Separately, we hope that OpenSAFELY users will want to participate in community efforts around developing new norms that can incentivise and deliver more open and reproducible science in healthcare.


→ Read some OpenSAFELY case studies

→ Read the documentation

→ Find out how to become an OpenSAFELY user