Pooling and Exchanging Data: Chapter 15
- Pooled data can be stored either in centralized data warehouses or in distributed data warehouses
- Identifying an appropriate data model and associated meta-data is an important aspect of maximizing the utility of pooled patient-reported outcome (PRO) data, and there are numerous models to choose from
Combining PRO data collected across different medical sites and different time points can create robust datasets that allow more meaningful research questions to be answered. However, the aggregated data must be reasonably consistent, both in their content and in how they are specified in the data model.
There are two overarching approaches to data architecture and curation. The first is through a centralized warehouse, which stores all the extracted data from many sites. The second is through a federated/distributed warehouse, wherein sites maintain their own data and respond to data analysis queries independently.
There are hundreds of available data models to choose from. Several factors must be considered when choosing a data model, including the granularity and specificity of the data and the clinical domains supported by the data model. Examples of popular data models include PCORnet, Consolidated Clinical Document Architecture (CCDA), and Shared Health Research Informatics Network (SHRINE).
Questions and Considerations
A. What are the different architectural approaches?
Centralized data warehouse
- Centralized data warehouses store all the data extracted from many different sites that use a given system
- Centralized data warehouses are typically maintained by a coordinating center that ensures that the data are entered into the warehouse and available for use
- Centralized data storage facilitates better data analysis because it allows statisticians to see which data were collected and which are missing, and to conduct quality checks
- All sites contributing to the centralized warehouse must address legal, regulatory, and proprietary data sharing issues
- Contributing sites need to agree on a standard data interchange format
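The role of a standard interchange format can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed design: the `ProRecord` fields, the site names, the local column names in `FIELD_MAPS`, and the sample rows are all invented for the example. Each site's extract is mapped to the common record shape before pooling, after which the coordinating center can run a simple missing-data check.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ProRecord:
    """One row in a hypothetical common interchange format."""
    site_id: str
    patient_id: str
    instrument: str
    score: Optional[float]  # None marks a missing score


# Hypothetical per-site mappings from common field names to local column names.
FIELD_MAPS = {
    "site_a": {"patient_id": "mrn", "instrument": "survey", "score": "total"},
    "site_b": {"patient_id": "pid", "instrument": "form_name", "score": "raw_score"},
}


def to_common(site_id: str, row: dict) -> ProRecord:
    """Translate one local row into the agreed interchange format."""
    fmap = FIELD_MAPS[site_id]
    raw_score = row.get(fmap["score"])
    return ProRecord(
        site_id=site_id,
        patient_id=str(row[fmap["patient_id"]]),
        instrument=row[fmap["instrument"]],
        score=None if raw_score is None else float(raw_score),
    )


def pool(extracts: Dict[str, List[dict]]) -> List[ProRecord]:
    """Load every site's extract into one central list of records."""
    return [to_common(site, row) for site, rows in extracts.items() for row in rows]


def missing_by_site(records: List[ProRecord]) -> Dict[str, int]:
    """Count records with a missing score, per contributing site."""
    counts: Dict[str, int] = {}
    for record in records:
        counts.setdefault(record.site_id, 0)
        if record.score is None:
            counts[record.site_id] += 1
    return counts


# Invented sample extracts from two sites with different local schemas.
extracts = {
    "site_a": [
        {"mrn": 101, "survey": "PHQ-9", "total": 7},
        {"mrn": 102, "survey": "PHQ-9", "total": None},
    ],
    "site_b": [{"pid": "B-1", "form_name": "PHQ-9", "raw_score": "12"}],
}
warehouse = pool(extracts)
missing = missing_by_site(warehouse)
```

Because every record arrives in the same shape, the quality check is written once and applies to all sites.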
Federated/distributed data warehouse
- In federated/distributed data warehouses, data are kept in a locally maintained data warehouse at each site
- Data analysis queries can be submitted to the local sites, which run the analysis and respond with summary data
- There are fewer organizational concerns about sharing potentially identifiable patient data, as the local site has control of the data and only reports aggregated results
- Although the data are held locally, sites must still agree on how to map local data types and permissible values to the standard values and formats
- Record linkage across sites is more difficult
- It may be difficult or impossible to replicate analyses, since they are conducted at the local level
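The query-and-summarize pattern described above can be illustrated with a small Python sketch. It assumes a hypothetical `LocalSite` class and invented rows; real networks add authentication, governance review, and disclosure controls that are omitted here. Each site evaluates the query against its own rows and returns only a count and a sum, which the coordinator combines into a pooled mean.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, float]


class LocalSite:
    """Hypothetical site that keeps row-level data private."""

    def __init__(self, name: str, rows: List[Row]) -> None:
        self.name = name
        self._rows = rows  # row-level data stay at the site

    def run_query(self, predicate: Callable[[Row], bool]) -> dict:
        """Run the query locally and return summary data only."""
        matched = [row for row in self._rows if predicate(row)]
        return {
            "site": self.name,
            "n": len(matched),
            "score_sum": sum(row["score"] for row in matched),
        }


def federated_mean(
    sites: List[LocalSite], predicate: Callable[[Row], bool]
) -> Optional[float]:
    """Combine per-site summaries into a pooled mean score."""
    summaries = [site.run_query(predicate) for site in sites]
    total_n = sum(summary["n"] for summary in summaries)
    total_score = sum(summary["score_sum"] for summary in summaries)
    return total_score / total_n if total_n else None


# Invented example data for two sites.
site_a = LocalSite("site_a", [{"age": 45, "score": 10}, {"age": 70, "score": 6}])
site_b = LocalSite("site_b", [{"age": 66, "score": 8}])

# Mean score among patients aged 65 or older, computed without moving any rows.
mean_score_65_plus = federated_mean([site_a, site_b], lambda row: row["age"] >= 65)
```

The coordinator never sees individual rows, which is also why analyses are harder to replicate centrally: only the summaries leave each site.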
B. What are the considerations for choosing a common data model?
- Pooled PRO data have little value without a consistent data model and meta-data
- Considerations when selecting or creating a common data model include:
  - Granularity of data and whether person-level analyses are supported
  - De-identification and other limitations of data sets, including bins or categories of data rather than specific values (e.g., age range rather than date of birth)
  - Data specificity (e.g., how de-identification was handled with respect to dates)
  - Clinical domain(s) in the data model
  - Governance of the data model
  - Whether the model uses standard interoperability references
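To make the granularity and de-identification trade-off concrete, the following Python sketch replaces a date of birth with an age range and a visit date with a visit year. The field names, the 10-year bin width, and the open-ended top bin are illustrative assumptions, not requirements of any particular data model.

```python
from datetime import date


def age_in_years(dob: date, on: date) -> int:
    """Whole years elapsed between a date of birth and a reference date."""
    had_birthday = (on.month, on.day) >= (dob.month, dob.day)
    return on.year - dob.year - (0 if had_birthday else 1)


def age_bin(age: int, width: int = 10, top: int = 90) -> str:
    """Map an exact age to a range label; ages at or above `top` share one bin."""
    if age >= top:
        return f"{top}+"
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def deidentify(record: dict, as_of: date) -> dict:
    """Return a coarsened copy of a record with no exact dates."""
    age = age_in_years(record["date_of_birth"], as_of)
    return {
        "age_range": age_bin(age),
        "visit_year": record["visit_date"].year,  # date coarsened to year
        "score": record["score"],
    }


# Invented example record.
record = {
    "date_of_birth": date(1980, 6, 15),
    "visit_date": date(2024, 3, 2),
    "score": 11,
}
shared = deidentify(record, as_of=date(2024, 6, 14))
```

Once values are binned this way, analyses that need exact ages or intervals between visits are no longer possible, which is why binning decisions belong in the meta-data of the data model.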
C. What are some examples of data models to choose from?
- There are hundreds of data models to consider
- Several popular options are:
  - PCORnet: Developed by the Patient-Centered Outcomes Research Institute (PCORI). Describes the meaning of each data item and, in some instances, the context in which the data were collected
  - Consolidated Clinical Document Architecture (CCDA): A general-purpose, XML-based clinical data interchange format. It is commonly available in electronic health records that are certified by the Office of the National Coordinator – Authorized Testing and Certification Body. It is often used to move data from one system to another when the two systems have different internal data models
  - i2b2 – Shared Health Research Informatics Network (SHRINE): An open-source, XML-based network that allows groups to link their aggregated counts of patients who meet selected inclusion and exclusion criteria for demographics and other variables
  - Project-specific ad hoc data models: Rather than adopting an existing data model, a project can create a new one that includes only the data it requires
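As a rough illustration of a project-specific ad hoc model, the Python sketch below defines a small data dictionary (the meta-data) and checks rows against it. The field names, types, and allowed values are invented for the example and are not drawn from PCORnet, CCDA, or SHRINE.

```python
from typing import List

# Hypothetical data dictionary: field name -> (type, description, allowed values).
DATA_DICTIONARY = {
    "patient_id": (str, "Study-assigned identifier", None),
    "age_range": (str, "Binned age in years", {"18-39", "40-64", "65+"}),
    "pro_score": (int, "Total instrument score", None),
}


def validate(row: dict) -> List[str]:
    """Return a list of problems found in one row; empty means the row conforms."""
    errors = []
    for name, (field_type, _description, allowed) in DATA_DICTIONARY.items():
        if name not in row:
            errors.append(f"missing field: {name}")
            continue
        value = row[name]
        if not isinstance(value, field_type):
            errors.append(f"{name}: expected {field_type.__name__}")
        elif allowed is not None and value not in allowed:
            errors.append(f"{name}: value {value!r} not allowed")
    # Fields outside the dictionary are flagged, since the model holds only required data.
    extra = sorted(set(row) - set(DATA_DICTIONARY))
    errors.extend(f"unexpected field: {name}" for name in extra)
    return errors


# Invented example rows.
good_row = {"patient_id": "P1", "age_range": "40-64", "pro_score": 9}
bad_row = {"patient_id": "P2", "age_range": "17", "pro_score": 5}

good_errors = validate(good_row)
bad_errors = validate(bad_row)
```

Keeping the dictionary alongside the data gives every contributing site the same definition of each field, at the cost of a model that cannot be reused beyond the project without extension.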
Relevant Primary Resources
The information presented here is an overview of pooling and exchanging data. For more detailed information please see the following sources: