How to use dummy data in an ehrQL dataset definition🔗

Because OpenSAFELY doesn't allow direct access to individual patient records, ehrQL allows you to work with dummy data for developing analytic code on your own computer.

There are three ways to use dummy data in ehrQL.

Let ehrQL generate a dummy dataset from your dataset definition
Supply your own dummy dataset
Supply your own dummy tables

Let ehrQL generate a dummy dataset from your dataset definition🔗

ehrQL is designed so that the same command can be used to output a dummy dataset when run on your own computer and then output a real dataset when run inside the secure environment as part of an OpenSAFELY pipeline.

Refer to the documentation on generating a dummy dataset to see how this works.

You do not need to add anything to the dataset definition itself in order to generate a dummy dataset in this way. ehrQL will use the dataset definition to set up dummy data and generate matching patients.

By default, 500 patients will be generated in a dummy dataset. If you need to increase this number, you can configure it in the dataset definition with:

dataset.configure_dummy_data(population_size=1000)

Increasing the population size will increase the time required to generate the dataset.

Supply your own dummy dataset🔗

You can provide a dummy dataset file in the following formats.

Format	File extension
CSV	.csv
Compressed CSV	.csv.gz
Arrow	.arrow

Your dummy dataset file must have the relevant file extension shown in the table above.

For example, take this dataset definition from the tutorial:

from ehrql import create_dataset
from ehrql.tables.beta.core import patients, medications

dataset = create_dataset()

dataset.define_population(patients.date_of_birth.is_on_or_before("1999-12-31"))

asthma_codes = ["39113311000001107", "39113611000001102"]
latest_asthma_med = (
    medications.where(medications.dmd_code.is_in(asthma_codes))
    .sort_by(medications.date)
    .last_for_patient()
)

dataset.asthma_med_date = latest_asthma_med.date
dataset.asthma_med_code = latest_asthma_med.dmd_code

And this dummy dataset, in a CSV file named dummy.csv:

patient_id	asthma_med_date	asthma_med_code
1	2021-01-01	39113311000001107
2	2020-03-10	39113311000001107
3	2022-11-22	39113311000001107
4	2021-06-07	39113611000001102
5	2023-10-17	39113611000001102

Run the dataset definition with the dummy dataset file:

opensafely exec ehrql:v0 generate-dataset dataset_definition.py --dummy-data-file dummy.csv

Now, instead of a generated dummy dataset, you'll see the data from the dummy data file that you provided.

A screenshot of VS Code, showing the terminal after the opensafely exec command was run

Boolean values🔗

Note that in CSV data, boolean values must be specified as "T" or "F".

Add an extra had_asthma_event to the dataset above:

dataset.had_asthma_event = (
    medications.where(medications.dmd_code.is_in(asthma_codes))
    .exists_for_patient()
)

This is a boolean value and the updated dummy dataset would look like this:

patient_id	asthma_med_date	asthma_med_code	had_asthma_event
1	2021-01-01	39113311000001107	T
2	2020-03-10	39113311000001107	T
3	2022-11-22	39113311000001107	T
4	2021-06-07	39113611000001102	T
5	2023-10-17	39113611000001102	T

Dummy dataset errors🔗

ehrQL will check the column names, types and categorical values in your dummy dataset file. If errors are found, they will be shown in the terminal output.

Using ehrQL to generate a base dummy dataset🔗

Generating your own dummy dataset can be difficult (or tedious!), especially if you have a large number of output columns, or you need a large amount of data. One option is to use ehrQL to generate an initial dataset, and then modify it as you need.

Run the dataset definition with an output path:

opensafely exec ehrql:v0 generate-dataset dataset_definition.py --output dataset.csv

Now you can edit dataset.csv as you want, and rerun the dataset definition, using it as the dummy data file:

opensafely exec ehrql:v0 generate-dataset dataset_definition.py --dummy-data-file dataset.csv

Supply your own dummy tables🔗

Instead of supplying a single dummy dataset file that contains the exact output data, you can provide ehrQL with a path to a folder of dummy data tables. Dummy data tables are supplied as CSV files, with one file per table, representing the data in the backend database. Refer to the table schema reference for documentation on the available tables and columns. ehrQL will use the dummy tables as the backend data from which to extract the dataset. This is a good way to test that your dataset definition is working as expected.

The running ehrQL as a standalone action documentation demonstrates how to to run a dataset definition against the example dummy tables in the ehrQL tutorial.

Generating dummy tables🔗

ehrQL can be used to create dummy tables for you, using your dataset definition to determine the specific tables that are required.

Try this out by running the following command against the simple dataset definition above:

opensafely exec ehrql:v0 create-dummy-tables dataset_definition.py dummy-folder

A screenshot of VS Code, showing the terminal after the create-dummy-tables command was run

A new folder, dummy-folder, has been created, which contains just the two tables that the dataset definition requires - patients.csv and medications.csv.

Now you can run ehrQl with these generated tables instead:

opensafely exec ehrql:v0 generate-dataset dataset_definition.py --dummy-tables dummy-folder