Real-world data (RWD) are becoming an increasingly important resource for drug development and post-market surveillance. These population-based data sources, which capture patient health status and healthcare in real-world settings, often generate real-world evidence (RWE) that provides valuable insights into the safety, effectiveness, and utilization of a marketed product. Additionally, RWD can enhance the pre-approval clinical trial process through targeted patient recruitment and identification of external control arms. Regulatory bodies are now accepting more RWD and RWE as part of their decision-making process, as highlighted in a recent U.S. Food and Drug Administration (FDA) statement: “FDA is committed to realizing the full potential of fit-for-purpose RWD to generate RWE that will advance the development of therapeutic products and strengthen regulatory oversight of medical products across their lifecycle.”1
With the proliferation of available and increasingly diverse types of RWD, researchers face a critical challenge at the outset of designing a database study: selecting the right data sources to support the research. This article provides approaches to help ensure that database feasibility assessments (conducted in advance of research to evaluate key study and logistical criteria) succeed in identifying the data sources most appropriate for evaluating the study objectives.
Identification of RWD sources with potential to satisfy evaluation criteria
Identifying a wide range of potential data sources is a crucial first step in the feasibility assessment process. Some common RWD sources include electronic medical records (EMR), patient registries, medical and pharmacy claims, and patient-generated data, such as those gathered from surveys. One of the best approaches for this first step is to review relevant published literature that presents results from database or registry studies with objectives similar to those of the current study of interest.
A second approach is to use online database search tools, which offer an efficient way to selectively identify healthcare databases that match key study criteria such as geography or population demographics. These online tools include B.R.I.D.G.E. TO DATA®, an online resource profiling global population healthcare databases for use in epidemiology and health outcomes research, and the HMA-EMA Catalogues of RWD sources and studies. These repositories collect metadata from RWD sources and studies and can help researchers identify suitable data to address specific research questions.
Importantly, researchers should consider RWD that offer unique, less traditional types of data elements not consistently found in conventional RWD sources. For instance, in the US, commercial patient support programs sponsored by pharmaceutical companies aim to improve patient access to, and overall experience with, a marketed drug. Data from these programs may provide unique data elements for a study that is evaluating treatment patterns and non-compliance with a newly marketed product.
Establishing key criteria for selecting the appropriate data sources
An essential consideration for conducting successful database feasibility assessments is prioritizing the most important criteria for database evaluation and ultimate database selection. The FDA and the European Medicines Agency (EMA) have issued publicly available guidance to support high-quality RWE generation, with the goal of continuing to strengthen the use of RWD and associated RWE for regulatory decision-making. Two recently released regulatory documents that provide valuable insights into the critical characteristics of RWD to be considered in regulatory submissions are: Journey towards a roadmap for regulatory guidance on real-world evidence (EMA, February 2025)2 and Real-world data: Assessing electronic health records and medical claims data to support regulatory decision-making for drug and biological products (FDA, July 2024).3
Some of this guidance is specific to the type of RWD, such as FDA’s recommendation that sponsors using EMRs should evaluate the completeness, accuracy, and plausibility of the data, including verifying data against the source.4 It is important to provide evidence to regulators that a selected database is appropriate for addressing the specific study question of interest. In general, the following evaluation categories should be used to assess ‘fit-for-purpose’ for selecting data sources to use in conducting the study of interest:
- Data specificity — particularly important for key variables to create relevant study cohorts and assess outcomes
- Data quality — includes data completeness, plausibility of data values, and established quality assurance/quality control plan
- Accessibility — e.g., patient-level vs aggregate data only; data must remain local to data owner vs data may leave the owner’s geographic region
- Data reliability — includes data accrual (e.g., data source and collection methods) and data assurance (e.g., processes and personnel involved in data capture to minimize errors)
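As an illustration of the data quality category above, the following is a minimal sketch of how completeness and plausibility might be profiled on a sample extract from a candidate data source. The records, field names, and plausibility ranges are all hypothetical and chosen only for illustration; actual thresholds would need clinical and epidemiological input.

```python
# Hypothetical sample extract: each dict is one patient record.
records = [
    {"patient_id": "A1", "birth_year": 1960, "hba1c_pct": 9.4},
    {"patient_id": "A2", "birth_year": 1975, "hba1c_pct": None},
    {"patient_id": "A3", "birth_year": None, "hba1c_pct": 55.0},  # implausible value
    {"patient_id": "A4", "birth_year": 1988, "hba1c_pct": 7.1},
]

# Illustrative plausibility ranges (assumptions, not regulatory thresholds).
PLAUSIBLE_RANGES = {
    "birth_year": (1900, 2025),
    "hba1c_pct": (3.0, 20.0),
}


def completeness(rows, field):
    """Fraction of rows with a non-missing value for the field."""
    if not rows:
        return 0.0
    return sum(r.get(field) is not None for r in rows) / len(rows)


def plausibility(rows, field, low, high):
    """Fraction of non-missing values inside [low, high]; None if all missing."""
    values = [r[field] for r in rows if r.get(field) is not None]
    if not values:
        return None
    return sum(low <= v <= high for v in values) / len(values)


def profile(rows):
    """Completeness and plausibility for every field with a defined range."""
    return {
        field: {
            "completeness": completeness(rows, field),
            "plausibility": plausibility(rows, field, low, high),
        }
        for field, (low, high) in PLAUSIBLE_RANGES.items()
    }


report = profile(records)
for field, metrics in report.items():
    print(field, metrics)
```

A summary like this could be requested from, or run by, a data owner when patient-level data cannot leave the owner's environment, which ties the quality check back to the accessibility category.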
Data linkage may be necessary to evaluate all study objectives
In some instances, data sources may need to be combined to provide sufficient data to address the study objectives. For example, pharmacy claims can reveal patterns in drug adherence or switching between medications, but cannot provide clinical data often needed to identify study cohorts or evaluate study outcomes. Laboratory results can be used to evaluate a number of healthcare measures, such as the impact of a drug on biomarkers, as illustrated by the potential effect of a GLP-1 therapy on hemoglobin A1c levels in patients with diabetes.5 In situations where no single data source is sufficient to generate results that address all study objectives (or may not cover all relevant timeframes required to evaluate study objectives), exploration of the potential for data linkage across patient populations using common patient identifiers (e.g., Datavant de-identified patient token) should be considered.
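The linkage idea described above can be sketched as a join on a shared de-identified token. This is a simplified illustration only: the token strings, field names, and records are invented, and the sketch assumes that both sources have already been tokenized by the same process. It does not reflect any vendor's actual token format or matching method.

```python
from collections import defaultdict

# Hypothetical pharmacy claims and laboratory results keyed by a shared token.
pharmacy_claims = [
    {"token": "tok_001", "drug": "GLP-1 RA", "fill_date": "2023-01-15"},
    {"token": "tok_002", "drug": "GLP-1 RA", "fill_date": "2023-02-01"},
    {"token": "tok_003", "drug": "GLP-1 RA", "fill_date": "2023-02-20"},
    {"token": "tok_004", "drug": "GLP-1 RA", "fill_date": "2023-03-05"},
]
lab_results = [
    {"token": "tok_001", "test": "HbA1c", "value_pct": 9.2, "result_date": "2023-01-02"},
    {"token": "tok_001", "test": "HbA1c", "value_pct": 7.8, "result_date": "2023-07-10"},
    {"token": "tok_003", "test": "HbA1c", "value_pct": 8.5, "result_date": "2023-02-10"},
    {"token": "tok_005", "test": "HbA1c", "value_pct": 6.9, "result_date": "2023-04-01"},
]


def link_on_token(claims, labs):
    """Group claims and labs for every token that appears in both sources."""
    labs_by_token = defaultdict(list)
    for lab in labs:
        labs_by_token[lab["token"]].append(lab)

    linked = {}
    for claim in claims:
        token = claim["token"]
        if token in labs_by_token:
            entry = linked.setdefault(
                token, {"claims": [], "labs": labs_by_token[token]}
            )
            entry["claims"].append(claim)
    return linked


def linkage_rate(claims, labs):
    """Share of distinct patients in the claims source who also have lab data."""
    claim_tokens = {c["token"] for c in claims}
    lab_tokens = {lab["token"] for lab in labs}
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & lab_tokens) / len(claim_tokens)


linked = link_on_token(pharmacy_claims, lab_results)
rate = linkage_rate(pharmacy_claims, lab_results)
print(f"Linked patients: {sorted(linked)}; linkage rate: {rate:.0%}")
```

In a feasibility assessment, the overlap rate is often as informative as the linked records themselves, because a low rate may mean the combined data cannot support the required cohort size.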
Evolving tools for RWD feasibility
Technological advancements are continuously improving our ability to evaluate and utilize RWD effectively. In particular, the evolution of artificial intelligence (AI) and machine learning (ML) presents numerous exciting possibilities that enhance our ability to ‘see’ into the information that is captured within a database. AI/ML enables multiple tools that can be leveraged in RWD assessment, including natural language processing, large language models, named entity recognition, and machine vision. Natural language processing, for instance, can be used to determine whether specific types of information that are otherwise highly challenging to obtain are captured in unstructured EMR notes, such as positive identification of genetic mutations as the underlying etiology of a patient’s disease. Algorithms can be created to enable the efficient AI-based evaluation of databases and their suitability for studies.
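To make the unstructured-notes example concrete, the sketch below screens a handful of invented clinical notes for mentions of a gene mutation and applies a crude negation check. It uses regular expressions as a deliberately simple stand-in for a real natural language processing pipeline; the gene list, negation terms, and notes are illustrative assumptions, and a production approach would need validated models and clinical review.

```python
import re
from collections import Counter

# Invented example notes; not drawn from any real data source.
notes = [
    "Genetic testing confirmed a pathogenic BRCA1 mutation.",
    "No EGFR mutation detected on panel.",
    "Patient reports improved appetite; follow up in 3 months.",
    "KRAS variant identified; discussed targeted therapy.",
]

# Illustrative gene list and negation cues (assumptions for this sketch).
GENE_MENTION = re.compile(
    r"\b(BRCA1|BRCA2|EGFR|KRAS)\b\s+(mutation|variant)", re.IGNORECASE
)
NEGATION = re.compile(r"\b(no|not|negative for|without)\b", re.IGNORECASE)


def screen_note(text):
    """Classify a note as a positive, negated, or absent mutation mention."""
    match = GENE_MENTION.search(text)
    if match is None:
        return "no_mention"
    # Crude rule: any negation cue before the mention marks it as negated.
    preceding = text[: match.start()]
    if NEGATION.search(preceding):
        return "negated_mention"
    return "positive_mention"


labels = [screen_note(note) for note in notes]
summary = Counter(labels)
print(summary)
```

Even a rough screen like this, run on a sample of notes, can indicate whether a database captures the information at all before investing in a full NLP evaluation.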
Another technological advancement that enables efficient and comprehensive assessment of data captured in an RWD source (structured and unstructured) is the availability of software platforms that provide clinically validated code lists — medications, diagnoses, procedures, etc. — as a resource to researchers who are faced with evaluating data sources that differ in their underlying coding of relevant variables. Commonly referred to as Computable Operational Definitions (CODefs), these AI-informed, indication-specific libraries provide current definitions of study design elements such as disease cohort and outcome identification.
Potential implications of inadequate feasibility assessment
Importantly, serious negative consequences can result from selecting inappropriate or unreliable data to conduct a study. Failing to select suitable data sources from the outset can incur unnecessary costs through the purchase and processing of datasets that do not fit research needs. In a worst-case scenario, the wrong data can introduce bias and lead to erroneous study conclusions, damaging the researchers’ and sponsor’s reputations, and even leading to regulatory rejection.
Partnering with experts
Formal database feasibility assessments should follow a rigorous process, starting with the development of a focused data owner questionnaire to assess a wide variety of data characteristics, including population coverage, years of data availability, data elements, and potential for linkage. The methods for conducting a gap analysis, which identifies whether the variables needed for the evaluation are present, should be documented in advance and then followed.
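A gap analysis of this kind can be sketched as a comparison between the variables a study requires and the variables each data owner reports as available. The variable names, database names, and questionnaire responses below are hypothetical and serve only to show the shape of the comparison.

```python
# Hypothetical variables required by the study protocol.
REQUIRED_VARIABLES = {
    "diagnosis_codes",
    "dispensing_dates",
    "lab_results",
    "date_of_death",
}

# Hypothetical data owner questionnaire responses: variables reported as available.
questionnaire_responses = {
    "Claims database A": {"diagnosis_codes", "dispensing_dates", "procedure_codes"},
    "EMR database B": {
        "diagnosis_codes",
        "dispensing_dates",
        "lab_results",
        "date_of_death",
    },
}


def gap_analysis(required, sources):
    """Return, per data source, the sorted list of required variables it lacks."""
    return {
        name: sorted(required - available) for name, available in sources.items()
    }


gaps = gap_analysis(REQUIRED_VARIABLES, questionnaire_responses)
fully_covered = [name for name, missing in gaps.items() if not missing]

for name, missing in gaps.items():
    status = "no gaps" if not missing else "missing: " + ", ".join(missing)
    print(f"{name}: {status}")
```

Sources with gaps are not necessarily excluded; a missing variable may instead point to a need for data linkage or a supplementary data source.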
Partnering with an experienced organization such as UBC brings informed insights and years of expertise in the feasibility assessment process. Our experienced team of data analysts, epidemiologists, clinicians, regulatory experts, and information technology specialists has conducted dozens of global database and registry feasibility assessments covering multiple data source types and varied disease and exposed populations. Many of these assessments supported sponsors’ regulatory requirements, such as a post-authorization safety study (PASS). UBC can greatly improve a sponsor’s ability to effectively leverage RWE to support regulatory submissions and discretionary research.
Database feasibility assessments that evaluate key study and logistical criteria are a first step toward ensuring that the benefits of RWD/RWE are maximized for your clinical study. To learn more about how RWD/RWE can be effectively leveraged, as well as data interoperability and the generation strategies that support it, read UBC’s case study, “Enriching clinical studies with longitudinal real-world data,” which describes a specific application of RWD/RWE enrichment.
References
1. Center for Drug Evaluation and Research and Center for Biologics Evaluation and Research. (2025, June 9). Real-world evidence. U.S. Food and Drug Administration. Accessed August 11, 2025. https://www.fda.gov/science-research/science-and-research-special-topics/real-world-evidence
2. Human Medicines Division/Methodology Working Party. (2025, February 19). Journey towards a roadmap for regulatory guidance on real-world evidence. European Medicines Agency. Accessed August 11, 2025. https://www.ema.europa.eu/en/documents/other/journey-towards-roadmap-regulatory-guidance-real-world-evidence_en.pdf
3. Center for Drug Evaluation and Research, Center for Biologics Evaluation and Research, and Oncology Center of Excellence. (2024, July 25). Real-world data: Assessing electronic health records and medical claims data to support regulatory decision-making for drug and biological products. U.S. Food and Drug Administration. Accessed August 11, 2025. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/real-world-data-assessing-electronic-health-records-and-medical-claims-data-support-regulatory
4. U.S. Food and Drug Administration. (2024, July). Real-world data: Assessing electronic health records and medical claims data to support regulatory decision-making for drug and biological products.
5. Tran V, Tran H, Demirel S, Thompson-Moore N. Impact of Glucagon-Like Peptide 1 Receptor Agonists in Patients with Hemoglobin A1c of 9% or Greater. J Pharm Pract. 2023;36(5):1125-1133. doi:10.1177/08971900221087933
About UBC
United BioSource LLC (UBC) is the leading provider of evidence development solutions with expertise in uniting evidence and access. UBC helps biopharma mitigate risk, address product hurdles, and demonstrate safety, efficacy, and value under real-world conditions. UBC leads the market in providing integrated, comprehensive clinical, safety, and commercialization services and is uniquely positioned to seamlessly integrate best-in-class services throughout the lifecycle of a product.
About the Authors

Irene Cosmatos, MSc, is a Senior Director of Epidemiology & Real-World Evidence at UBC. She and her team support UBC’s work that involves the use of diverse, observational healthcare databases, registries, or patient medical charts to answer sponsors’ research questions. She brings more than 25 years of experience in analyses of large, retrospective patient databases across all therapeutic areas, US and non-US, in support of epidemiologic and health outcomes research.

Jeff Lowry is the Senior Director of Technology Solutions Services at UBC leading UBC’s enterprise data services and technology solution services team. With a focus on patient identity management and study independent data model designs, Jeff and his teams have produced flexible, innovative models that integrate clinical, safety, and commercial data to support complete patient lifecycle solutions.