What is a p value and what does it mean?

  • Dorothy Anne Forbes
  • Correspondence to Dorothy Anne Forbes Faculty of Nursing, University of Alberta, Level 3, Edmonton Clinic Health Academy, Edmonton, Alberta, T6G 1C9, Canada; dorothy.forbes{at}ualberta.ca

https://doi.org/10.1136/ebnurs-2012-100524


Researchers aim to draw the strongest possible conclusions from limited amounts of data. To do this, they need to overcome two problems. First, important differences in the findings can be obscured by natural variability and experimental imprecision, making it difficult to distinguish real differences from random variability. Second, researchers' natural inclination is to conclude that differences are real and to minimise the contribution of random variability. Statistical probability guards against this temptation.1

Statistical probability, expressed as p values, reveals whether the findings in a research study are statistically significant, meaning that the findings are unlikely to have occurred by chance. To understand the p value concept, it is important to understand its relationship with the α level. Before conducting a study, researchers specify the α level, which is most often set at 0.05 (5%). This conventional level was based on the writings of Sir Ronald Fisher, an influential statistician, who in 1926 reported that he preferred the 0.05 cut-off for separating the probable from the improbable.2 Researchers who set α at 0.05 are willing to accept a 5% chance of concluding that a difference exists when in fact it does not. However, researchers may adopt probability cut-offs that are more generous (eg, an α set at 0.10 accepts a 10% chance of such an error) or more stringent (eg, an α set at 0.01 accepts a 1% chance). The design of the study, its purpose, or the researcher's intuition may influence the setting of the α level.2

To illustrate how setting the α level may affect the conclusions of a study, let us examine a research study that compared the annual incomes of hospital-based and community-based nurses. The mean annual income was reported to be $70 000 for hospital-based nurses and $60 000 for community-based nurses, and the p value of the comparison was 0.08. If the researchers set the α level at 0.05, they would conclude that there was no significant difference between the annual incomes of hospital-based and community-based nurses, since the p value of 0.08 exceeded the α level of 0.05. However, if the α level had been set at 0.10, the p value of 0.08 would be less than the α level and the researchers would conclude that there was a significant difference. Two very different conclusions from the same data.3
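The decision rule itself is mechanical: compare the observed p value with the pre-set α. A minimal sketch in Python (using the hypothetical p value of 0.08 from the example above) shows that the same result is declared significant or not depending only on the chosen α:

```python
# Hypothetical p value from the income comparison above.
p_value = 0.08

# The same p value leads to different conclusions under different pre-set alpha levels.
for alpha in (0.05, 0.10):
    decision = "significant" if p_value <= alpha else "not significant"
    print(f"alpha = {alpha:.2f}: difference is {decision}")
# alpha = 0.05: difference is not significant
# alpha = 0.10: difference is significant
```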

It is easy to read far too much into the word significant, because the statistical use of the word has a meaning entirely distinct from its everyday meaning. Just because a difference is statistically significant does not mean that it is important or interesting. In the example above, the findings are statistically significant at the 0.10 α level, but the researchers are accepting a 1 in 10 risk that the result is due to chance; this risk of an erroneous conclusion is higher than at the 0.05 α level, where chance results are accepted only 5 times in 100, or 1 in 20. In the end, the reader must decide whether the researchers selected an appropriate α level and whether the conclusions are meaningful.

  • 1. Graphpad. What is a p value? 2011. http://www.graphpad.com/articles/pvalue.htm (accessed 10 Dec 2011).
  • 2. Munroe BH, Jacobsen BS.
  • 3. El-Masri MM.

Competing interests None.


P-Value And Statistical Significance: What It Is & Why It Matters

Saul Mcleod, PhD, and Olivia Guy-Evans, MSc (Simply Psychology)

The p-value in statistics quantifies the evidence against a null hypothesis. A low p-value suggests data is inconsistent with the null, potentially favoring an alternative hypothesis. Common significance thresholds are 0.05 or 0.01.

[Figure: the p-value explained on a normal distribution curve]

Hypothesis testing

When you perform a statistical test, a p-value helps you determine the significance of your results in relation to the null hypothesis.

The null hypothesis (H0) states no relationship exists between the two variables being studied (one variable does not affect the other). It states the results are due to chance and are not significant in supporting the idea being investigated. Thus, the null hypothesis assumes that whatever you try to prove did not happen.

The alternative hypothesis (Ha or H1) is the one you would believe if the null hypothesis is concluded to be untrue.

The alternative hypothesis states that the independent variable affected the dependent variable, and the results are significant in supporting the theory being investigated (i.e., the results are not due to random chance).

What a p-value tells you

A p-value, or probability value, is a number describing how likely it is that your data would have occurred by random chance alone, that is, if the null hypothesis were true.

The level of statistical significance is often expressed as a p-value between 0 and 1.

The smaller the p -value, the less likely the results occurred by random chance, and the stronger the evidence that you should reject the null hypothesis.

Remember, a p-value doesn’t tell you if the null hypothesis is true or false. It just tells you how likely you’d see the data you observed (or more extreme data) if the null hypothesis was true. It’s a piece of evidence, not a definitive proof.

Example: Test Statistic and p-Value

Suppose you’re conducting a study to determine whether a new drug has an effect on pain relief compared to a placebo. If the new drug has no impact, your test statistic will be close to the one predicted by the null hypothesis (no difference between the drug and placebo groups), and the resulting p-value will be close to 1. It may not be precisely 1 because real-world variations may exist. Conversely, if the new drug indeed reduces pain significantly, your test statistic will diverge further from what’s expected under the null hypothesis, and the p-value will decrease. The p-value will never reach zero because there’s always a slim possibility, though highly improbable, that the observed results occurred by random chance.
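To make this concrete, here is a minimal sketch in Python (the pain scores, group sizes, and the use of SciPy's independent-samples t-test are illustrative assumptions, not data from a real trial). It shows how the test statistic and the p-value move together:

```python
import numpy as np
from scipy import stats

# Simulated pain scores (0-10 scale, lower = less pain); all numbers are assumptions.
rng = np.random.default_rng(seed=1)
placebo = rng.normal(loc=5.0, scale=1.0, size=50)
drug = rng.normal(loc=4.2, scale=1.0, size=50)

# Two-sample (independent) t-test comparing the group means.
t_stat, p_value = stats.ttest_ind(drug, placebo)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A test statistic near 0 (groups similar) gives a p-value near 1;
# a large |t| (groups far apart) gives a small p-value.
```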

P-value interpretation

The significance level (alpha) is a set probability threshold (often 0.05), while the p-value is the probability you calculate based on your study or analysis.

A p-value less than or equal to a predetermined significance level (typically 0.05, sometimes 0.01) is considered statistically significant: the observed data provide strong evidence against the null hypothesis.

This suggests the effect under study likely represents a real relationship rather than just random chance.

For instance, if you set α = 0.05, you would reject the null hypothesis if your p -value ≤ 0.05. 

It indicates strong evidence against the null hypothesis, because data at least this extreme would be expected less than 5% of the time if the null hypothesis were true.

Therefore, we reject the null hypothesis in favor of the alternative hypothesis.

Example: Statistical Significance

Upon analyzing the pain relief effects of the new drug compared to the placebo, the computed p-value is less than 0.01, which falls well below the predetermined alpha value of 0.05. Consequently, you conclude that there is a statistically significant difference in pain relief between the new drug and the placebo.

What does a p-value of 0.001 mean?

A p-value of 0.001 is highly statistically significant beyond the commonly used 0.05 threshold. It indicates strong evidence of a real effect or difference, rather than just random variation.

Specifically, a p-value of 0.001 means there is only a 0.1% chance of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is correct.

Such a small p-value provides strong evidence against the null hypothesis, leading to rejecting the null in favor of the alternative hypothesis.

A p-value greater than the significance level (typically p > 0.05) is not statistically significant. It does not show that the null hypothesis is true; it only indicates that the data do not provide strong evidence against it.

This means we fail to reject the null hypothesis and do not accept the alternative hypothesis. Note that you cannot prove or accept the null hypothesis; you can only reject it or fail to reject it.

Note: even when the p-value falls below your significance threshold, it does not mean that there is a 95% probability that the alternative hypothesis is true.

One-Tailed Test

[Figure: one-tailed test; statistical significance in A/B testing]

Two-Tailed Test

[Figure: two-tailed test]

How do you calculate the p-value?

Most statistical software packages like R, SPSS, and others automatically calculate your p-value. This is the easiest and most common way.

Online resources and tables are available to estimate the p-value based on your test statistic and degrees of freedom.

These tables help you understand how often you would expect to see your test statistic under the null hypothesis.
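As a sketch of what those tables (or software) are doing, the snippet below converts a test statistic and its degrees of freedom into a two-tailed p-value; the t statistic of 2.31 and the 28 degrees of freedom are made-up numbers for illustration:

```python
from scipy import stats

# Hypothetical test statistic and degrees of freedom from a t-test.
t_stat, df = 2.31, 28

# Two-tailed p-value: the area in both tails beyond |t| under the t distribution.
p_two_tailed = 2 * stats.t.sf(abs(t_stat), df)
print(f"p = {p_two_tailed:.3f}")  # roughly 0.03
```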

Understanding the Statistical Test:

Different statistical tests are designed to answer specific research questions or hypotheses. Each test has its own underlying assumptions and characteristics.

For example, you might use a t-test to compare means, a chi-squared test for categorical data, or a correlation test to measure the strength of a relationship between variables.

Be aware that the number of independent variables you include in your analysis can influence the magnitude of the test statistic needed to produce the same p-value.

This factor is particularly important to consider when comparing results across different analyses.

Example: Choosing a Statistical Test

If you’re comparing the effectiveness of just two different drugs in pain relief, a two-sample t-test is a suitable choice for comparing these two groups. However, when you’re examining the impact of three or more drugs, it’s more appropriate to employ an Analysis of Variance (ANOVA). Utilizing multiple pairwise comparisons in such cases can lead to artificially low p-values and an overestimation of the significance of differences between the drug groups.
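For example, a single one-way ANOVA can replace several pairwise t-tests when comparing three drugs. The sketch below uses made-up pain-relief scores and SciPy's f_oneway:

```python
from scipy import stats

# Hypothetical pain-relief scores for three drugs (higher = more relief).
drug_a = [4.1, 3.8, 4.5, 4.0, 3.9]
drug_b = [5.0, 5.2, 4.8, 5.1, 4.9]
drug_c = [4.6, 4.4, 4.7, 4.5, 4.8]

# One-way ANOVA tests whether at least one group mean differs from the others.
f_stat, p_value = stats.f_oneway(drug_a, drug_b, drug_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```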

How to report

A statistically significant result cannot prove that a research hypothesis is correct (which implies 100% certainty).

Instead, we may state our results “provide support for” or “give evidence for” our research hypothesis (as there is still a slight probability that the results occurred by chance and the null hypothesis was correct – e.g., less than 5%).

Example: Reporting the results

In our comparison of the pain relief effects of the new drug and the placebo, we observed that participants in the drug group experienced a significant reduction in pain ( M = 3.5; SD = 0.8) compared to those in the placebo group ( M = 5.2; SD  = 0.7), resulting in an average difference of 1.7 points on the pain scale (t(98) = -9.36; p < 0.001).

The 6th edition of the APA style manual (American Psychological Association, 2010) states the following on the topic of reporting p-values:

“When reporting p values, report exact p values (e.g., p = .031) to two or three decimal places. However, report p values less than .001 as p < .001.

The tradition of reporting p values in the form p < .10, p < .05, p < .01, and so forth, was appropriate in a time when only limited tables of critical values were available.” (p. 114)

  • Do not use a zero before the decimal point for the statistic p, because p cannot be greater than 1. In other words, write p = .001 instead of p = 0.001.
  • Pay attention to italics (p is always italicized) and spacing (a space on either side of the = sign).
  • p = .000 (as output by some statistical packages such as SPSS) is impossible and should be written as p < .001.
  • The opposite of significant is “nonsignificant,” not “insignificant.”

Why is the p-value not enough?

A lower p-value is sometimes interpreted as meaning there is a stronger relationship between two variables.

However, statistical significance only means that data this extreme would be unlikely (e.g., less than a 5% chance) if the null hypothesis were true; it says nothing about the size of the difference.

To understand the strength of the difference between the two groups (control vs. experimental), a researcher needs to calculate the effect size.
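One widely used effect-size measure for a two-group comparison is Cohen's d. Here is a minimal sketch using the pooled-standard-deviation formula; the group values are hypothetical:

```python
import numpy as np

# Hypothetical outcome scores for a control and an experimental group.
control = np.array([5.2, 4.1, 6.3, 4.8, 5.9, 5.5])
experimental = np.array([4.3, 3.6, 5.1, 4.0, 4.7, 4.4])

# Cohen's d = mean difference divided by the pooled standard deviation.
pooled_sd = np.sqrt((control.var(ddof=1) + experimental.var(ddof=1)) / 2)
cohens_d = (control.mean() - experimental.mean()) / pooled_sd
print(f"Cohen's d = {cohens_d:.2f}")
# Rough benchmarks: ~0.2 small, ~0.5 medium, ~0.8 large.
```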

When do you reject the null hypothesis?

In statistical hypothesis testing, you reject the null hypothesis when the p-value is less than or equal to the significance level (α) you set before conducting your test. The significance level is the probability of rejecting the null hypothesis when it is true. Commonly used significance levels are 0.01, 0.05, and 0.10.

Remember, rejecting the null hypothesis doesn’t prove the alternative hypothesis; it just suggests that the alternative hypothesis may be plausible given the observed data.

The p -value is conditional upon the null hypothesis being true but is unrelated to the truth or falsity of the alternative hypothesis.

What does p-value of 0.05 mean?

If your p-value is less than or equal to 0.05 (the significance level), you would conclude that your result is statistically significant. This means the evidence is strong enough to reject the null hypothesis in favor of the alternative hypothesis.

Are all p-values below 0.05 considered statistically significant?

No, not all p-values below 0.05 are considered statistically significant. The threshold of 0.05 is commonly used, but it’s just a convention. Statistical significance depends on factors like the study design, sample size, and the magnitude of the observed effect.

A p-value below 0.05 means there is evidence against the null hypothesis, suggesting a real effect. However, it’s essential to consider the context and other factors when interpreting results.

Researchers also look at effect size and confidence intervals to determine the practical significance and reliability of findings.

How does sample size affect the interpretation of p-values?

Sample size can impact the interpretation of p-values. A larger sample size provides more reliable and precise estimates of the population, leading to narrower confidence intervals.

With a larger sample, even small differences between groups or effects can become statistically significant, yielding lower p-values. In contrast, smaller sample sizes may not have enough statistical power to detect smaller effects, resulting in higher p-values.

Therefore, a larger sample size increases the chances of finding statistically significant results when there is a genuine effect, making the findings more trustworthy and robust.

Can a non-significant p-value indicate that there is no effect or difference in the data?

No, a non-significant p-value does not necessarily indicate that there is no effect or difference in the data. It means that the observed data do not provide strong enough evidence to reject the null hypothesis.

There could still be a real effect or difference, but it might be smaller or more variable than the study was able to detect.

Other factors like sample size, study design, and measurement precision can influence the p-value. It’s important to consider the entire body of evidence and not rely solely on p-values when interpreting research findings.

Can P values be exactly zero?

While a p-value can be extremely small, it cannot technically be exactly zero. When a p-value is reported as p = 0.000, the actual p-value is simply too small for the software to display. This is often interpreted as strong evidence against the null hypothesis. For p-values less than 0.001, report them as p < .001.
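A small, hypothetical helper (not part of any standard package) can apply these reporting conventions: exact values to three decimals, no leading zero, and a floor of p < .001:

```python
def format_p(p: float) -> str:
    """Format a p-value following the APA-style conventions described above."""
    if p < 0.001:
        return "p < .001"
    # Report the exact value to three decimals and drop the leading zero.
    return f"p = {p:.3f}".replace("0.", ".", 1)

print(format_p(0.0004))  # p < .001
print(format_p(0.031))   # p = .031
```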

Further Information

  • P Value Calculator From T Score
  • P-Value Calculator For Chi-Square
  • P-values and significance tests (Khan Academy)
  • Hypothesis testing and p-values (Khan Academy)
  • Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond “p < 0.05”.
  • Criticism of using the “p < 0.05” threshold.
  • Publication manual of the American Psychological Association
  • Statistics for Psychology Book Download




The clinician’s guide to p values, confidence intervals, and magnitude of effects

  • Mark R. Phillips
  • Charles C. Wykoff
  • Lehana Thabane
  • Mohit Bhandari
  • Varun Chaudhary

for the Retina Evidence Trials InterNational Alliance (R.E.T.I.N.A.) Study Group

Eye, volume 36, pages 341–342 (2022)


Introduction

There are numerous statistical and methodological considerations within every published study, and the ability of clinicians to appreciate the implications and limitations associated with these key concepts is critically important. These implications often have a direct impact on the applicability of study findings – which, in turn, often determine the appropriateness for the results to lead to modification of practice patterns. Because it can be challenging and time-consuming for busy clinicians to break down the nuances of each study, herein we provide a brief summary of 3 important topics that every ophthalmologist should consider when interpreting evidence.

p-values: what they tell us and what they don’t

Perhaps the most universally recognized statistic is the p-value. Most individuals understand the notion that (usually) a p-value <0.05 signifies a statistically significant difference between the two groups being compared. While this understanding is widely shared, it is far more important to understand what a p-value does not tell us. Attempting to inform clinical practice patterns through interpretation of p-values alone is overly simplistic and fraught with potential for misleading conclusions. A p-value represents the probability that the observed result (the difference between the groups being compared), or one more extreme, would occur by random chance, assuming that the null hypothesis (the scenario that there are no differences between the groups being compared) is true. For example, a p-value of 0.04 indicates that a difference at least as large as the one observed would have a 4% chance of occurring by random chance alone if the groups truly did not differ. When this probability is small, the null hypothesis becomes less plausible and a real difference between the groups becomes more likely [ 1 ]. Studies use a predefined threshold to determine when a p-value is sufficiently small to support the study hypothesis. This threshold is conventionally a p-value of 0.05; however, there are reasons and justifications for studies to use a different threshold if appropriate.

What a p-value cannot tell us is the clinical relevance or importance of the observed treatment effects [ 1 ]. Specifically, a p-value does not provide details about the magnitude of effect [ 2 , 3 , 4 ]. Despite a significant p-value, it is quite possible for the difference between the groups to be small. This phenomenon is especially common with larger sample sizes, in which comparisons may yield statistically significant differences that are not clinically meaningful. For example, a study may find a statistically significant difference (p < 0.05) in visual acuity outcomes between two groups, while the difference between the groups may amount to only 1 letter or less. While this may in fact be a statistically significant difference, it is likely not large enough to make a meaningful difference for patients. Thus, p-values lack vital information on the magnitude of effects for the assessed outcomes [ 2 , 3 , 4 ].
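A quick simulation illustrates the point. In the sketch below (Python, with assumed letter scores and group sizes, not data from any real study), a difference of about one letter between two very large groups produces a small p-value even though the difference is unlikely to matter clinically:

```python
import numpy as np
from scipy import stats

# Simulated visual-acuity letter scores; the group means differ by only ~1 letter.
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=70.0, scale=10.0, size=5000)
group_b = rng.normal(loc=71.0, scale=10.0, size=5000)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
diff = group_b.mean() - group_a.mean()
print(f"mean difference = {diff:.2f} letters, p = {p_value:.2g}")
# With n = 5000 per group, a ~1-letter difference is "statistically significant"
# despite being clinically trivial.
```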

Overcoming the limitations of interpreting p-values: magnitude of effect

To overcome this limitation, it is important to consider both (1) whether the p-value of a comparison is significant according to the pre-defined statistical plan, and (2) the magnitude of the treatment effects (commonly reported as an effect estimate with 95% confidence intervals) [ 5 ]. The magnitude of effect is most often represented as the mean difference between groups for continuous outcomes, such as visual acuity on the logMAR scale, and as the risk or odds ratio for dichotomous/binary outcomes, such as the occurrence of adverse events. These measures quantify the observed effect in the study comparison. As suggested in the previous section, understanding the actual magnitude of the difference provides information that an isolated p-value does not [ 4 , 5 ]. Interpretation of study results should therefore shift from a binary judgement of significant versus not significant towards a more critical appraisal of the clinical relevance of the observed effect [ 1 ].

There are a number of important metrics, such as the Minimally Important Difference (MID), that help to determine whether a difference between groups is large enough to be clinically meaningful [ 6 , 7 ]. When clinicians can identify (1) the magnitude of effect within a study and (2) the MID (the smallest change in the outcome that a patient would deem meaningful), they are far better placed to understand the effects of a treatment and to articulate the pros and cons of a treatment option to patients with reference to treatment effects that can be considered clinically valuable.

The role of confidence intervals

Confidence intervals are estimates that provide a lower and upper bound for the estimate of the magnitude of effect. By convention, 95% confidence intervals are most commonly reported. These intervals represent the range within which we can, with 95% confidence, expect the true treatment effect to fall. For example, a mean difference in visual acuity of 8 letters (95% confidence interval: 6 to 10) suggests that the best estimate of the difference between the two study groups is 8 letters, and that we have 95% certainty that the true value lies between 6 and 10 letters. When interpreting this clinically, one can consider the clinical scenarios at each end of the confidence interval: if the patient’s outcome were the most conservative, in this case an improvement of 6 letters, would its importance to the patient differ from the most optimistic outcome of 10 letters? When the clinical value of the treatment effect does not change between the lower and upper confidence limits, there is enhanced certainty that the treatment effect will be meaningful to the patient [ 4 , 5 ]. In contrast, if the clinical merits of a treatment appear different at the lower versus the upper confidence limit, one may be more cautious about the benefits to be anticipated with treatment [ 4 , 5 ].
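As a simple illustration, a 95% confidence interval for a mean difference can be sketched from the point estimate and its standard error using the normal approximation; the mean difference of 8 letters and the standard error below are assumed values chosen to reproduce the example above:

```python
from scipy import stats

# Assumed point estimate and standard error of the mean difference (letters).
mean_diff = 8.0
se_diff = 1.02

# 95% confidence interval via the normal approximation (z ~ 1.96).
z = stats.norm.ppf(0.975)
lower, upper = mean_diff - z * se_diff, mean_diff + z * se_diff
print(f"95% CI: {lower:.1f} to {upper:.1f} letters")  # ~6.0 to 10.0
```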

There are a number of important details for clinicians to consider when interpreting evidence. Through this editorial, we hope to provide practical insights into fundamental methodological principles that can help guide clinical decision making. P-values are one small component to consider when interpreting study results; a much deeper appreciation of the results is available when the treatment effects and associated confidence intervals are also taken into consideration.

Change history

19 January 2022

A Correction to this paper has been published: https://doi.org/10.1038/s41433-021-01914-2

References

1. Li G, Walter SD, Thabane L. Shifting the focus away from binary thinking of statistical significance and towards education for key stakeholders: revisiting the debate on whether it’s time to de-emphasize or get rid of statistical significance. J Clin Epidemiol. 2021;137:104–12. https://doi.org/10.1016/j.jclinepi.2021.03.033

2. Gagnier JJ, Morgenstern H. Misconceptions, misuses, and misinterpretations of p values and significance testing. J Bone Joint Surg Am. 2017;99:1598–603. https://doi.org/10.2106/JBJS.16.01314

3. Goodman SN. Toward evidence-based medical statistics. 1: the p value fallacy. Ann Intern Med. 1999;130:995–1004. https://doi.org/10.7326/0003-4819-130-12-199906150-00008

4. Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, et al. Statistical tests, p values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31:337–50. https://doi.org/10.1007/s10654-016-0149-3

5. Phillips M. Letter to the editor: editorial: threshold p values in orthopaedic research-we know the problem. What is the solution? Clin Orthop. 2019;477:1756–8. https://doi.org/10.1097/CORR.0000000000000827

6. Devji T, Carrasco-Labra A, Qasim A, Phillips MR, Johnston BC, Devasenapathy N, et al. Evaluating the credibility of anchor based estimates of minimal important differences for patient reported outcomes: instrument development and reliability study. BMJ. 2020;369:m1714. https://doi.org/10.1136/bmj.m1714

7. Carrasco-Labra A, Devji T, Qasim A, Phillips MR, Wang Y, Johnston BC, et al. Minimal important difference estimates for patient-reported outcomes: a systematic survey. J Clin Epidemiol. 2020. https://doi.org/10.1016/j.jclinepi.2020.11.024


Author information

Authors and Affiliations

Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, ON, Canada

Mark R. Phillips, Lehana Thabane, Mohit Bhandari & Varun Chaudhary

Retina Consultants of Texas (Retina Consultants of America), Houston, TX, USA

Charles C. Wykoff

Blanton Eye Institute, Houston Methodist Hospital, Houston, TX, USA

Biostatistics Unit, St. Joseph’s Healthcare-Hamilton, Hamilton, ON, Canada

Lehana Thabane

Department of Surgery, McMaster University, Hamilton, ON, Canada

Mohit Bhandari & Varun Chaudhary

NIHR Moorfields Biomedical Research Centre, Moorfields Eye Hospital, London, UK

Sobha Sivaprasad

Cole Eye Institute, Cleveland Clinic, Cleveland, OH, USA

Peter Kaiser

Retinal Disorders and Ophthalmic Genetics, Stein Eye Institute, University of California, Los Angeles, CA, USA

David Sarraf

Department of Ophthalmology, Mayo Clinic, Rochester, MN, USA

Sophie J. Bakri

The Retina Service at Wills Eye Hospital, Philadelphia, PA, USA

Sunir J. Garg

Center for Ophthalmic Bioinformatics, Cole Eye Institute, Cleveland Clinic, Cleveland, OH, USA

Rishi P. Singh

Cleveland Clinic Lerner College of Medicine, Cleveland, OH, USA

Department of Ophthalmology, University of Bonn, Bonn, Germany

Frank G. Holz

Singapore Eye Research Institute, Singapore, Singapore

Tien Y. Wong

Singapore National Eye Centre, Duke-NUS Medical School, Singapore, Singapore

Centre for Eye Research Australia, Royal Victorian Eye and Ear Hospital, East Melbourne, VIC, Australia

Robyn H. Guymer

Department of Surgery (Ophthalmology), The University of Melbourne, Melbourne, VIC, Australia


  • Varun Chaudhary
  • Mohit Bhandari
  • Charles C. Wykoff
  • Sobha Sivaprasad
  • Lehana Thabane
  • Peter Kaiser
  • David Sarraf
  • Sophie J. Bakri
  • Sunir J. Garg
  • Rishi P. Singh
  • Frank G. Holz
  • Tien Y. Wong
  • Robyn H. Guymer

Contributions

MRP was responsible for conception of idea, writing of manuscript and review of manuscript. VC was responsible for conception of idea, writing of manuscript and review of manuscript. MB was responsible for conception of idea, writing of manuscript and review of manuscript. CCW was responsible for critical review and feedback on manuscript. LT was responsible for critical review and feedback on manuscript.

Corresponding author

Correspondence to Varun Chaudhary .

Ethics declarations

Competing interests.

MRP: Nothing to disclose. CCW: Consultant: Acuela, Adverum Biotechnologies, Inc, Aerpio, Alimera Sciences, Allegro Ophthalmics, LLC, Allergan, Apellis Pharmaceuticals, Bayer AG, Chengdu Kanghong Pharmaceuticals Group Co, Ltd, Clearside Biomedical, DORC (Dutch Ophthalmic Research Center), EyePoint Pharmaceuticals, Genentech/Roche, GyroscopeTx, IVERIC bio, Kodiak Sciences Inc, Novartis AG, ONL Therapeutics, Oxurion NV, PolyPhotonix, Recens Medical, Regeneron Pharmaceuticals, Inc, REGENXBIO Inc, Santen Pharmaceutical Co, Ltd, and Takeda Pharmaceutical Company Limited; Research funds: Adverum Biotechnologies, Inc, Aerie Pharmaceuticals, Inc, Aerpio, Alimera Sciences, Allergan, Apellis Pharmaceuticals, Chengdu Kanghong Pharmaceutical Group Co, Ltd, Clearside Biomedical, Gemini Therapeutics, Genentech/Roche, Graybug Vision, Inc, GyroscopeTx, Ionis Pharmaceuticals, IVERIC bio, Kodiak Sciences Inc, Neurotech LLC, Novartis AG, Opthea, Outlook Therapeutics, Inc, Recens Medical, Regeneron Pharmaceuticals, Inc, REGENXBIO Inc, Samsung Pharm Co, Ltd, Santen Pharmaceutical Co, Ltd, and Xbrane Biopharma AB – unrelated to this study. LT: Nothing to disclose. MB: Research funds: Pendopharm, Bioventus, Acumed – unrelated to this study. VC: Advisory Board Member: Alcon, Roche, Bayer, Novartis; Grants: Bayer, Novartis – unrelated to this study.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original version of this article was revised: In this article the middle initial in author name Sophie J. Bakri was missing.


Cite this article.

Phillips, M.R., Wykoff, C.C., Thabane, L. et al. The clinician’s guide to p values, confidence intervals, and magnitude of effects. Eye 36 , 341–342 (2022). https://doi.org/10.1038/s41433-021-01863-w


Received: 11 November 2021

Revised: 12 November 2021

Accepted: 15 November 2021

Published: 26 November 2021

Issue Date: February 2022

DOI: https://doi.org/10.1038/s41433-021-01863-w


P-Value: What It Is, How to Calculate It, and Why It Matters


In statistics, a p-value is a number that indicates how likely you are to obtain a result at least as extreme as the one actually observed if the null hypothesis is correct.

The p-value serves as an alternative to rejection points to provide the smallest level of significance at which the null hypothesis would be rejected. A smaller p-value means stronger evidence in favor of the alternative hypothesis.

P-value is often used to promote credibility for studies or reports by government agencies. For example, the U.S. Census Bureau stipulates that any analysis with a p-value greater than 0.10 must be accompanied by a statement that the difference is not statistically different from zero. The Census Bureau also has standards in place stipulating which p-values are acceptable for various publications.

Key Takeaways

  • A p-value is a statistical measurement used to validate a hypothesis against observed data.
  • A p-value measures the probability of obtaining the observed results, assuming that the null hypothesis is true.
  • The lower the p-value, the greater the statistical significance of the observed difference.
  • A p-value of 0.05 or lower is generally considered statistically significant.
  • P-value can serve as an alternative to—or in addition to—preselected confidence levels for hypothesis testing.


P-values are usually found using p-value tables or spreadsheets/statistical software. These calculations are based on the assumed or known probability distribution of the specific statistic tested. The sample size, which determines the reliability of the observed data, directly influences the accuracy of the p-value calculation. P-values are calculated from the deviation between the observed value and a chosen reference value, given the probability distribution of the statistic, with a greater difference between the two values corresponding to a lower p-value.

Mathematically, the p-value is calculated using integral calculus from the area under the probability distribution curve for all values of statistics that are at least as far from the reference value as the observed value is, relative to the total area under the probability distribution curve. Standard deviations, which quantify the dispersion of data points from the mean, are instrumental in this calculation.

The calculation for a p-value varies based on the type of test performed. The three test types describe the location on the probability distribution curve: lower-tailed test, upper-tailed test, or two-tailed test. In each case, the degrees of freedom play a crucial role in determining the shape of the distribution and thus the calculation of the p-value.

In a nutshell, the greater the difference between two observed values, the less likely it is that the difference is due to simple random chance, and this is reflected by a lower p-value.
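The sketch below shows how each test type maps onto the reference distribution, using an arbitrary z statistic of 1.75 and a standard normal distribution as illustrative assumptions:

```python
from scipy import stats

# Hypothetical observed z statistic under a standard normal reference distribution.
z = 1.75

p_upper = stats.norm.sf(z)            # upper-tailed: area to the right of z
p_lower = stats.norm.cdf(z)           # lower-tailed: area to the left of z
p_two = 2 * stats.norm.sf(abs(z))     # two-tailed: area in both tails beyond |z|
print(f"upper: {p_upper:.3f}, lower: {p_lower:.3f}, two-tailed: {p_two:.3f}")
# upper ~0.040, lower ~0.960, two-tailed ~0.080
```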

The P-Value Approach to Hypothesis Testing

The p-value approach to hypothesis testing uses the calculated probability to determine whether there is evidence to reject the null hypothesis. This determination relies heavily on the test statistic, which summarizes the information from the sample relevant to the hypothesis being tested. The null hypothesis, also known as the conjecture, is the initial claim about a population (or data-generating process). The alternative hypothesis states whether the population parameter differs from the value of the population parameter stated in the conjecture.

In practice, the significance level is stated in advance to determine how small the p-value must be to reject the null hypothesis. Because different researchers use different levels of significance when examining a question, a reader may sometimes have difficulty comparing results from two different tests. P-values provide a solution to this problem.

Even a low p-value is not necessarily proof of statistical significance, since there is still a possibility that the observed data are the result of chance. Only repeated experiments or studies can confirm if a relationship is statistically significant.

For example, suppose a study comparing returns from two particular assets was undertaken by different researchers who used the same data but different significance levels. The researchers might come to opposite conclusions regarding whether the assets differ.

If one researcher used a confidence level of 90% and the other required a confidence level of 95% to reject the null hypothesis, and if the p-value of the observed difference between the two returns was 0.08 (corresponding to a confidence level of 92%), then the first researcher would find that the two assets have a difference that is statistically significant , while the second would find no statistically significant difference between the returns.

To avoid this problem, the researchers could report the p-value of the hypothesis test and allow readers to interpret the statistical significance themselves. This is called a p-value approach to hypothesis testing. Independent observers could note the p-value and decide for themselves whether that represents a statistically significant difference or not.

Example of P-Value

An investor claims that their investment portfolio’s performance is equivalent to that of the Standard & Poor’s (S&P) 500 Index . To determine this, the investor conducts a two-tailed test.

The null hypothesis states that the portfolio’s returns are equivalent to the S&P 500’s returns over a specified period, while the alternative hypothesis states that the portfolio’s returns and the S&P 500’s returns are not equivalent—if the investor conducted a one-tailed test , the alternative hypothesis would state that the portfolio’s returns are either less than or greater than the S&P 500’s returns.

The p-value hypothesis test does not necessarily make use of a preselected confidence level at which the investor should reset the null hypothesis that the returns are equivalent. Instead, it provides a measure of how much evidence there is to reject the null hypothesis. The smaller the p-value, the greater the evidence against the null hypothesis.

Thus, if the investor finds that the p-value is 0.001, there is strong evidence against the null hypothesis, and the investor can confidently conclude that the portfolio’s returns and the S&P 500’s returns are not equivalent.

Although this does not provide an exact threshold as to when the investor should accept or reject the null hypothesis, it does have another very practical advantage. P-value hypothesis testing offers a direct way to compare the relative confidence that the investor can have when choosing among multiple different types of investments or portfolios relative to a benchmark such as the S&P 500.

For example, for two portfolios, A and B, whose performance differs from the S&P 500 with p-values of 0.10 and 0.01, respectively, the investor can be much more confident that portfolio B, with a lower p-value, will actually show consistently different results.

Is a 0.05 P-Value Significant?

A p-value less than 0.05 is typically considered to be statistically significant, in which case the null hypothesis should be rejected. A p-value greater than 0.05 means that deviation from the null hypothesis is not statistically significant, and the null hypothesis is not rejected.

What Does a P-Value of 0.001 Mean?

A p-value of 0.001 indicates that if the null hypothesis tested were indeed true, then there would be a one-in-1,000 chance of observing results at least as extreme. This leads the observer to reject the null hypothesis because either a highly rare data result has been observed or the null hypothesis is incorrect.

How Can You Use P-Value to Compare 2 Different Results of a Hypothesis Test?

If you have two different results, one with a p-value of 0.04 and one with a p-value of 0.06, the result with a p-value of 0.04 will be considered more statistically significant than the p-value of 0.06. Beyond this simplified example, you could compare a 0.04 p-value to a 0.001 p-value. Both are statistically significant, but the 0.001 example provides an even stronger case against the null hypothesis than the 0.04.

The p-value is used to measure the significance of observational data. When researchers identify an apparent relationship between two variables, there is always a possibility that this correlation might be a coincidence. A p-value calculation helps determine if the observed relationship could arise as a result of chance.

U.S. Census Bureau. “ Statistical Quality Standard E1: Analyzing Data .”


What is p-value: How to Calculate It and Statistical Significance

“What is a p-value?” are words often uttered by early career researchers and sometimes even by more experienced ones. The p-value is an important and frequently used concept in quantitative research. It can also be confusing and easily misused. In this article, we delve into what is a p-value, how to calculate it, and its statistical significance.


What is a p-value

The p-value, or probability value, is the probability of obtaining results at least as extreme as yours purely by chance, given that the null hypothesis is true. P-values are used in hypothesis testing to find evidence that differences in values or groups exist. P-values are determined through the calculation of the test statistic for the test you are using and are based on the assumed or known probability distribution.

For example, you are researching a new pain medicine that is designed to last longer than the current commonly prescribed drug. Please note that this is an extremely simplified example, intended only to demonstrate the concepts. From previous research, you know that the underlying probability distribution for both medicines is the normal distribution, which is shown in the figure below.

[Figure: normal probability distribution for the two pain medicines]

You are planning a clinical trial for your drug. If your results show that the average length of time patients are pain-free is longer for the new drug than that for the standard medicine, how will you know that this is not just a random outcome? If this result falls within the green shaded area of the graph, you may have evidence that your drug has a longer effect. But how can we determine this scientifically? We do this through hypothesis testing.

What is a null hypothesis

Stating your null and alternative hypotheses is the first step in conducting a hypothesis test. The null hypothesis (H0) is what you’re trying to disprove, usually a statement that there is no relationship between two variables or no difference between two groups. The alternative hypothesis (Ha) states that a relationship exists or that there is a difference between two groups. It represents what you’re trying to find evidence to support.

Before we conduct the clinical trial, we create the following hypotheses:

H0: the mean longevity of the new drug is equal to that of the standard drug

Ha: the mean longevity of the new drug is greater than that of the standard drug

Note that the null hypothesis states that there is no difference in the mean values for the two drugs. Because Ha includes “greater than,” this is an upper-tailed test. We are not interested in the area under the lower side of the curve.

Next, we need to determine our criterion for deciding whether or not the null hypothesis can be rejected. This is where the critical p-value comes in. If we assume the null hypothesis is true, how much longer does the new drug have to last?


Let’s say your results show that the new drug lasts twice as long as the standard drug. In theory, this could still be a random outcome, due to chance, even if the null hypothesis were true. However, at some point, you must consider that the new drug may just have a better longevity. The researcher will typically set that point, which is the probability of rejecting the null hypothesis given that it is true, prior to conducting the trial. This is the critical p-value. Typically, this value is set at p = .05, although, depending on the circumstances, it could be set at another value, such as .10 or .01.

Another way to consider the null hypothesis that might make the concept clearer is to compare it to the adage “innocent until proven guilty.” It is assumed that the null hypothesis is true unless enough strong evidence can be found to disprove it. Statistically significant p-value results can provide some of that evidence, which makes it important to know how to calculate p-values.

How to calculate p-values

The p-value that is determined from your results is based on the test statistic, which depends on the type of hypothesis test you are using. That is because the p-value is actually a probability, and its value, and calculation method, depends on the underlying probability distribution. The p-value also depends in part on whether you are conducting a lower-tailed test, upper-tailed test, or two-tailed test.

The actual p-value is calculated by integrating the probability distribution function to find the relevant areas under the curve using integral calculus. This process can be quite complicated. Fortunately, p-values are usually determined by using tables, which use the test statistic and degrees of freedom, or statistical software, such as SPSS, SAS, or R.

For example, with the simplified clinical test we are performing, we assumed the underlying probability distribution is normal; therefore, we decide to conduct a t-test to test the null hypothesis. The resulting t-test statistic will indicate where along the x-axis, under the normal curve, our result is located. The p-value will then be, in our case, the area under the curve to the right of the test statistic.
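A minimal sketch of this upper-tailed test in Python (SciPy 1.6 or later for the alternative argument; the simulated relief durations and group sizes are assumptions, not trial data):

```python
import numpy as np
from scipy import stats

# Simulated hours of pain relief for each drug; all numbers are assumptions.
rng = np.random.default_rng(seed=42)
standard_drug = rng.normal(loc=6.0, scale=1.5, size=40)
new_drug = rng.normal(loc=7.0, scale=1.5, size=40)

# Upper-tailed two-sample t-test: Ha is that the new drug lasts longer.
t_stat, p_value = stats.ttest_ind(new_drug, standard_drug, alternative="greater")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# The p-value is the area under the t distribution to the right of the test statistic.
```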

Many factors affect the hypothesis test you use and therefore the test statistic. Always make sure to use the test that best fits your data and the relationship you’re testing. The sample size and number of independent variables you use will also impact the p-value.

P-Value and statistical significance

You have completed your clinical trial and have determined the p-value. What’s next? How can the result be interpreted? What does a statistically significant result mean?

A statistically significant result means that the p-value you obtained is small enough that the result is not likely to have occurred by chance. P-values are reported in the range of 0–1, and the smaller the p-value, the less likely it is that the null hypothesis is true and the greater the indication that it can be rejected. The critical p-value, or the point at which a result can be considered to be statistically significant, is set prior to the experiment.

In our simplified clinical trial example, we set the critical p-value at 0.05. If the p-value obtained from the trial was found to be p = .0375, we can say that the results were statistically significant, and we have evidence for rejecting the null hypothesis. However, this does not mean that we can be absolutely certain that the null hypothesis is false. The results of the test only indicate that the null hypothesis is likely false.  


P-value table

So, how can we interpret the p-value results of an experiment or trial? A p-value table, prepared prior to the experiment, can sometimes be helpful. This table lists possible p-values and their interpretations.

  • p > 0.05: Results are not statistically significant; do not reject the null hypothesis
  • p < 0.05: Results are statistically significant; in general, reject the null hypothesis
  • p < 0.01: Results are highly statistically significant; reject the null hypothesis

How to report p-values in research

P-values, like all experimental outcomes, are usually reported in the results section, and sometimes in the abstract, of a research paper. Enough information also needs to be provided so that the readers can place the p-values into context. For our example, the test statistic and effect size should also be included in the results.

To enable readers to clearly understand your results, the significance threshold you used (the critical p-value) should be reported in the methods section of your paper. For our example, we might state that “In this study, the statistical threshold was set at p = .05.” The sample sizes and assumptions should also be discussed there, as they will greatly impact the p-value.

How one can use p-value to compare two different results of a hypothesis test?

What if we conduct two experiments using the same null and alternative hypotheses? Or what if we conduct the same clinical trial twice with different drugs? Can we use the resulting p-values to compare them?

In general, it is not a good idea to compare results using only p-values. A p-value only reflects the probability that those specific results occurred by chance; it is not related to any other results and does not indicate degree. So, just because you obtained a p-value of .04 with one drug and a value of .025 with a second drug does not necessarily mean that the second drug is better.

Using p-values to compare two different results may be more feasible if the experiments are exactly the same and all other conditions are controlled except for the one being studied. However, so many different factors impact the p-value that it would be difficult to control them all.

Why just using p-values is not enough while interpreting two different variables

P-values can indicate whether or not the null hypothesis should be rejected; however, p-values alone are not enough to show the relative size differences between groups. Therefore, both the statistical significance and the effect size should be reported when discussing the results of a study.

For example, suppose the sample size in our clinical trials was very large, maybe 1,000, and we found the p-value to be .035. The difference between the two drugs is statistically significant because the p-value was less than .05. However, if we looked at the difference in the actual times the drugs were effective, we might find that the new drug lasted only 2 minutes longer than the standard drug. Large sample sizes generally show even very small differences to be significant. We would need this information to make any recommendations based on the results of the trial.

Statistical significance, or p-values, are dependent on both sample size and effect size. Therefore, they all need to be reported for readers to clearly understand the results.

Things to consider while using p-values

P-values are very useful tools for researchers. However, much care must be taken to avoid treating them as black and white indicators of a study’s results or misusing them. Here are a few other things to consider when using p-values:

  • When using p-values in your research report, it’s a good idea to pay attention to your target journal’s guidelines on formatting. Typically, p-values are written without a leading zero. For example, write p = .01 instead of p = 0.01. Also, p-values, like all other variables, are usually italicized, and spaces are included on both sides of the equal sign.
  • The significance threshold needs to be set prior to the experiment being conducted. Setting the significance level after looking at the data to ensure a positive result is considered unethical.
  • P-values have nothing to say about the alternative hypothesis. If your results indicate that the null hypothesis should be rejected, it does not mean that you accept the alternative hypothesis.
  • P-values never prove anything. All they can do is provide evidence to support rejecting or not rejecting the null hypothesis. Statistics are extremely non-committal.
  • “Nonsignificant” is the opposite of significant. Never report that the results were “insignificant.”

Frequently Asked Questions (FAQs) on p-value  

Q: What influences p-value?   

The primary factors that affect p-value in statistics include the size of the observed effect, sample size, variability within the data, and the chosen significance level (alpha). A larger effect size, a larger sample size, lower variability, and a lower significance level can all contribute to a lower p-value, indicating stronger evidence against the null hypothesis.  

Q: What does p-value of 0.05 mean?   

A p-value of 0.05 is a commonly used threshold in statistical hypothesis testing. It represents the level of significance, typically denoted as alpha, which is the probability of rejecting the null hypothesis when it is true. If the p-value is less than or equal to 0.05, it suggests that the observed results are statistically significant at the 5% level, meaning they are unlikely to occur by chance alone.  

Q: What is the p-value significance of 0.15?  

The significance of a p-value depends on the chosen threshold, typically called the significance level or alpha. If the significance level is set at 0.05, a p-value of 0.15 would not be considered statistically significant. In this case, there is insufficient evidence to reject the null hypothesis. However, it is important to note that significance levels can vary depending on the specific field or study design.  

Q: Which p-value to use in T-Test?   

When performing a T-Test, the p-value obtained indicates the probability of observing the data if the null hypothesis is true. The appropriate p-value to use in a T-Test is based on the chosen significance level (alpha). Generally, a p-value less than or equal to the alpha indicates statistical significance, supporting the rejection of the null hypothesis in favour of the alternative hypothesis.  

Q: Are p-values affected by sample size?   

Yes, sample size can influence p-values. Larger sample sizes tend to yield more precise estimates and narrower confidence intervals. This increased precision can affect the p-value calculations, making it easier to detect smaller effects or subtle differences between groups or variables. This can potentially lead to smaller p-values, indicating statistical significance. However, it’s important to note that sample size alone is not the sole determinant of statistical significance. Consider it along with other factors, such as effect size, variability, and chosen significance level (alpha), when determining the p-value.  


The p value – definition and interpretation of p-values in statistics

This article examines the most common statistic reported in scientific papers and used in applied statistical analyses – the p-value. It goes through the definition, illustrated with examples, and discusses its utility, interpretation, and common misinterpretations of observed statistical significance and significance levels. It is structured as follows:

  • What does ‘p’ in ‘p-value’ stand for?
  • What does p measure and how to interpret it
  • A p-value only makes sense under a specified null hypothesis
  • How to calculate a p-value
  • A practical example: p-values as convenient summary statistics
  • Quantifying the relative uncertainty of data
  • Easy comparison of different statistical tests
  • p-value interpretation in outcomes of experiments (randomized controlled trials)
  • p-value interpretation in regressions and correlations of observational data
  • Mistaking statistical significance with practical significance
  • Treating the significance level as the likelihood of the observed effect
  • Treating p-values as likelihoods attached to hypotheses
  • “A high p-value means the null hypothesis is true”
  • “Lack of statistical significance suggests a small effect size”
  • p-value definition and meaning

The technical definition of the p -value is (based on [4,5,6]):

A p-value is the probability that the data-generating mechanism corresponding to a specified null hypothesis produces an outcome as extreme as, or more extreme than, the one observed.

However, it is only straightforward to understand for those already familiar in detail with terms such as ‘probability’, ‘null hypothesis’, ‘data generating mechanism’, ‘extreme outcome’. These, in turn, require knowledge of what a ‘hypothesis’, a ‘statistical model’ and ‘statistic’ mean, and so on. While some of these will be explained on a cursory level in the following paragraphs, those looking for deeper understanding should consider consulting the following glossary definitions: statistical model , hypothesis , null hypothesis , statistic .

A slightly less technical and therefore more accessible definition is:

A p -value quantifies how likely it is to erroneously reject a specific statistical hypothesis, were it true, based on a given set of data.

Let us break these definitions down and work through several examples to make sense of both.

p stands for probability, where probability means the frequency with which an event occurs under certain assumptions. The most common example is the frequency with which a coin lands heads under the assumption that it is equally balanced (a fair coin toss). That frequency is 0.5 (50%).

Capital ‘P’ stands for probability in general, whereas lowercase ‘p’ refers to the probability of a particular data realization. To expand on the coin toss example: P would stand for the probability of heads in general, whereas p could refer to the probability of landing a series of five heads in a row, or the probability of landing less than or equal to 38 heads out of 100 coin flips.

Since p stands for probability, what a p-value measures is indeed a probability, but a probability of a very specific kind, described below.

In everyday language the term ‘probability’ might be used as synonymous to ‘chance’, ‘likelihood’, ‘odds’, e.g. there is 90% probability that it will rain tomorrow. However, in statistics one cannot speak of ‘probability’ without specifying a mechanism which generates the observed data. A simple example of such a mechanism is a device which produces fair coin tosses. A statistical model based on this data-generating mechanism can be put forth and under that model the probability of 38 or less heads out of 100 tosses can be estimated to be 1.05%, for example by using a binomial calculator . The p -value against the model of a fair coin would be ~0.01 (rounding it to 0.01 from hereon for the purposes of the article).

The way to interpret that p -value is: observing 38 heads or less out of the 100 tosses could have happened in only 1% of infinitely many series of 100 fair coin tosses. The null hypothesis in this case is defined as the coin being fair, therefore having a 50% chance for heads and 50% chance for tails on each toss.
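The figure of ~0.01 quoted above can be reproduced with any binomial calculator; as a minimal sketch, here is the same calculation done with SciPy's binomial distribution (the numbers come straight from the coin example above):

```python
# Minimal sketch: p-value of 38 or fewer heads in 100 tosses against a fair-coin null.
from scipy.stats import binom

n_tosses, p_fair, heads = 100, 0.5, 38

# P(X <= 38) under the fair-coin model, i.e. the probability of an outcome
# as extreme as, or more extreme than, the one observed (in the "few heads" direction).
p_value = binom.cdf(heads, n_tosses, p_fair)
print(round(p_value, 4))   # ~0.0105, i.e. the ~0.01 quoted in the text
```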

Assuming the null hypothesis is true allows the comparison of the observed data to what would have been expected under the null. It turns out the particular observation of 38/100 heads is a rather improbable and thus surprising outcome under the assumption of the null hypothesis. This is measured by the low p -value which also accounts for more extreme outcomes such as 37/100, 36/100, and so on all the way to 0/100.

If one had a predefined level of statistical significance at 0.05, then one would claim that the outcome is statistically significant, since its p-value of 0.01 meets the 0.05 significance level (0.01 ≤ 0.05). The relationship between p-values, the significance level (p-value threshold), and the statistical significance of an outcome is illustrated in the graph below:

[Figure: P-value and significance level explained]

In fact, had the significance threshold been at any value above 0.01, the outcome would have been statistically significant, therefore it is usually said that with a p -value of 0.01, the outcome is statistically significant at any level above 0.01 .

Continuing with the interpretation: were one to reject the null hypothesis based on this p -value of 0.01, they would be acting as if a significance level of 0.01 or lower provides sufficient evidence against the hypothesis of the coin being fair. One could interpret this as a rule for a long-run series of experiments and inferences . In such a series, by using this p -value threshold one would incorrectly reject the fair coin hypothesis in at most 1 out of 100 cases, regardless of whether the coin is actually fair in any one of them. An incorrect rejection of the null is often called a type I error as opposed to a type II error which is to incorrectly fail to reject a null.

A more intuitive interpretation proceeds without reference to hypothetical long-runs. This second interpretation comes in the form of a strong argument from coincidence :

  • there was a low probability (0.01 or 1%) that something would have happened assuming the null was true
  • it did happen so it has to be an unusual (to the extent that the p -value is low) coincidence that it happened
  • this warrants the conclusion to reject the null hypothesis

It stems from the concept of severe testing as developed by Prof. Deborah Mayo in her various works [1,2,3,4,5] and reflects an error-probabilistic approach to inference.

A p -value only makes sense under a specified null hypothesis

It is important to understand why a specified ‘null hypothesis’ should always accompany any reported p-value and why p-values are crucial in so-called Null Hypothesis Statistical Tests (NHST). Statistical significance only makes sense when referring to a particular statistical model, which in turn corresponds to a given null hypothesis. A p-value calculation has a statistical model and a statistical null hypothesis defined within it as prerequisites, and a statistical null is only interesting because of some tightly related substantive null, such as ‘this treatment does not improve outcomes’. The relationship is shown in the chart below:

[Figure: The relationship between a substantive hypothesis, the statistical model, the significance threshold, and the p-value]

In the coin example, the substantive null that is interesting to (potentially) reject is the claim that the coin is fair. It translates to a statistical null hypothesis (model) with the following key properties:

  • heads having 50% chance and tails having 50% chance, on each toss
  • independence of each toss from any other toss. The outcome of any given coin toss does not depend on past or future coin tosses.
  • homogeneity of the coin behavior over time (the true chance does not change across infinitely many tosses)
  • a binomial error distribution

The resulting p -value of 0.01 from the coin toss experiment should be interpreted as the probability only under these particular assumptions.

What happens, however, if someone is interested in rejecting the claim that the coin is somewhat biased against heads? To be precise: the claim that it has a true frequency of heads of 40% or less (hence 60% for tails) is the one they are looking to deny with a certain evidential threshold.

The p-value needs to be recalculated under their null hypothesis, so now the same 38 heads out of 100 tosses result in a p-value of ~0.38. If they were interested in rejecting such a null hypothesis, then these data provide poor evidence against it, since an outcome of 38/100 heads or fewer would not be unusual at all if that null were in fact true (it would occur with probability ~38%).

Similarly, the p-value needs to be recalculated for a claim of bias in the other direction, say that the coin produces heads with a frequency of 60% or more. The probability of observing 38 or fewer heads out of 100 under this null hypothesis is so extremely small (p-value ≈ 0.000007364, or 7.364 × 10⁻⁶ in standard form) that maintaining a claim of a 60/40 bias in favor of heads becomes near-impossible for most practical purposes.
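As a minimal sketch, the two recalculated p-values above can likewise be reproduced with SciPy by evaluating the binomial CDF at the boundary of each one-sided null hypothesis (40% heads for the first claim, 60% for the second):

```python
# Minimal sketch: the same 38/100 heads evaluated against two different nulls.
from scipy.stats import binom

heads, n_tosses = 38, 100

# Null: the true frequency of heads is 40% or less (evaluated at the boundary, 0.4)
print(binom.cdf(heads, n_tosses, 0.4))   # ~0.38, not surprising at all under this null

# Null: the true frequency of heads is 60% or more (evaluated at the boundary, 0.6)
print(binom.cdf(heads, n_tosses, 0.6))   # ~7.4e-06, extremely surprising under this null
```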

A p -value can be calculated for any frequentist statistical test. Common types of statistical tests include tests for:

  • absolute difference in proportions;
  • absolute difference in means;
  • relative difference in means or proportions;
  • goodness-of-fit;
  • homogeneity
  • independence
  • analysis of variance (ANOVA)

and others. Different statistics would be computed depending on the error distribution of the parameter of interest in each case, e.g. a t-value, z-value, chi-square (Χ²) value, F-value, and so on.

p -values can then be calculated based on the cumulative distribution functions (CDFs) of these statistics whereas pre-test significance thresholds (critical values) can be computed based on the inverses of these functions. You can try these by plugging different inputs in our critical value calculator , and also by consulting its documentation.
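As a minimal sketch of that relationship (using a z statistic and an arbitrary observed value of 2.3, which is not taken from the text), the p-value comes from the CDF of the statistic's distribution, and the critical value comes from its inverse:

```python
# Minimal sketch: p-value from the CDF, critical value from the inverse CDF.
from scipy.stats import norm

z = 2.3          # hypothetical observed z statistic
alpha = 0.05     # pre-test significance threshold

# Two-tailed p-value: probability of a statistic at least this far from zero
p_value = 2 * norm.sf(abs(z))          # sf(x) = 1 - cdf(x)
print(round(p_value, 4))               # ~0.0214

# Two-tailed critical value for alpha = 0.05, from the inverse CDF (quantile function)
critical_value = norm.ppf(1 - alpha / 2)
print(round(critical_value, 2))        # ~1.96
```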

In its generic form, a p -value formula can be written down as:

p = P(d(X) ≥ d(x₀); H₀)

where P stands for probability, d(X) is a test statistic (distance function) of a random variable X, x₀ is the observed realization of X, and H₀ is the selected null hypothesis. The semi-colon means ‘assuming’. The distance function is the aforementioned cumulative distribution function for the relevant error distribution. In its generic form, a distance function equation can be written as:

d(x̄) = (x̄ − μ₀) / (σ / √n)

where x̄ is the arithmetic mean of the observed values, μ₀ is a hypothetical or expected mean to which x̄ is compared, σ is the standard deviation, and n is the sample size. The result of a distance function will often be expressed in a standardized form: the number of standard deviations (standard errors) between the observed value and the expected value.

The p -value calculation is different in each case and so a different formula will be applied depending on circumstances. You can see examples in the p -values reported in our statistical calculators, such as the statistical significance calculator for difference of means or proportions , the Chi-square calculator , the risk ratio calculator , odds ratio calculator , hazard ratio calculator , and the normality calculator .

A very fresh (as of late 2020) example of the application of p -values in scientific hypothesis testing can be found in the recently concluded COVID-19 clinical trials. Multiple vaccines for the virus which spread from China in late 2019 and early 2020 have been tested on tens of thousands of volunteers split randomly into two groups – one gets the vaccine and the other gets a placebo. This is called a randomized controlled trial (RCT). The main parameter of interest is the difference between the rates of infections in the two groups. An appropriate test is the t-test for difference of proportions, but the same data can be examined in terms of risk ratios or odds ratio.

The null hypothesis in many of these medical trials is that the vaccine is at most 30% efficient (30% efficiency or less). A statistical model can be built about the expected difference in proportions if the vaccine’s efficiency is 30% or less, and then the actual observed data from a medical trial can be compared to that null hypothesis. Most trials set their significance level at the minimum required by the regulatory bodies (FDA, EMA, etc.), which is usually set at 0.05. So, if the p-value from a vaccine trial is calculated to be below 0.05, the outcome would be statistically significant and the null hypothesis of the vaccine being less than or equal to 30% efficient would be rejected.

Let us say a vaccine trial results in a p -value of 0.0001 against that null hypothesis. As this is highly unlikely under the assumption of the null hypothesis being true, it provides very strong evidence against the hypothesis that the tested treatment has less than 30% efficiency.

However, many regulators stated that they require at least 50% proven efficiency. They posit a different null hypothesis and so the p -value presented before these bodies needs to be calculated against it. This p -value would be somewhat increased since 50% is a higher null value than 30%, but given that the observed effects of the first vaccines to finalize their trials are around 95% with 95% confidence interval bounds hovering around 90%, the p -value against a null hypothesis stating that the vaccine’s efficiency is 50% or less is likely to still be highly statistically significant, say at 0.001 . Such an outcome is to be interpreted as follows: had the efficiency been 50% or below, such an extreme outcome would have most likely not been observed, therefore one can proceed to reject the claim that the vaccine has efficiency of 50% or less with a significance level of 0.001 .

While this example is fictitious in that it doesn’t reference any particular experiment, it should serve as a good illustration of how null hypothesis statistical testing (NHST) operates based on p -values and significance thresholds.

The utility of p -values and statistical significance

It is not often appreciated how much utility p-values bring to the practice of performing statistical tests for scientific and business purposes.

Quantifying relative uncertainty of data

First and foremost, p-values are a convenient expression of the uncertainty in the data with respect to a given claim. They quantify how unexpected a given observation is, assuming some claim which is put to the test is true. If the p-value is low, the probability of observing data this extreme under the null hypothesis is low, which means the data cast serious doubt on the claim being tested. Therefore, anyone defending the substantive claim which corresponds to the statistical null hypothesis would be pressed to concede that their position is untenable in the face of such data.

If the p-value is high, then the uncertainty with regard to the null hypothesis is low and we are not in a position to reject it, hence the corresponding claim can still be maintained.

As evident by the generic p -value formula and the equation for a distance function which is a part of it, a p -value incorporates information about:

  • the observed effect size relative to the null effect size
  • the sample size of the test
  • the variance and error distribution of the statistic of interest

It would be much more complicated to communicate the outcomes of a statistical test if one had to communicate all three pieces of information. Instead, by way of a single value on the scale of 0 to 1 one can communicate how surprising an outcome is. This value is affected by any change in any of these variables.

Another useful quality is that, provided the minimal assumptions behind significance tests are met, a p-value from one statistical test can easily and directly be compared to another: the strength of the statistical evidence offered by the data relative to a null hypothesis of interest is roughly the same in two tests if they have approximately equal p-values.

This is especially useful in conducting meta-analyses of various sorts, or for combining evidence from multiple tests.

p-value interpretation in outcomes of experiments

When a p-value is calculated for the outcome of a randomized controlled experiment, it is used to assess the strength of evidence against a null hypothesis of interest, such as that a given intervention does not have a positive effect. If H₀: μ₀ ≤ 0%, the observed effect is μ₁ = 30%, and the calculated p-value is 0.025, this can be used to reject the claim H₀: μ₀ ≤ 0% at any significance level ≥ 0.025. This, in turn, allows us to claim that H₁, a complementary hypothesis called the ‘alternative hypothesis’, is in fact true. In this case, since H₀ is μ₀ ≤ 0%, H₁ is μ₁ > 0%, so that the two hypotheses exhaust the parameter space, as illustrated below:

[Figure: Composite null versus composite alternative hypothesis in NHST]

A claim like the one above corresponds to what is called a one-sided null hypothesis. There could be a point null as well; for example, the claim that an intervention has no effect whatsoever translates to H₀: μ₀ = 0%. In such a case the corresponding p-value refers to that point null and hence should be interpreted as rejecting the claim of the effect being exactly zero. For those interested in the differences between point null hypotheses and one-sided hypotheses, the articles on onesided.org should be an interesting read. TLDR: most of the time you’d want to reject a directional claim and hence a one-tailed p-value should be reported [8].

These finer points aside, after observing a low enough p -value, one can claim the rejection of the null and hence the adoption of the complementary alternative hypothesis as true. The alternative hypothesis is simply a negation of the null and is therefore a composite claim such as ‘there is a positive effect’ or ‘there is some non-zero effect’. Note that any inference about a particular effect size within the alternative space has not been tested and hence claiming it has probability equal to p calculated against a zero effect null hypothesis (a.k.a. the nil hypothesis) does not make sense.

p-value interpretation in regressions and correlations of observational data

When performing statistical analyses of observational data, p-values are often reported alongside regression coefficients and correlation coefficients. A p-value falling below a specified statistical significance threshold measures how surprising the observed correlation or regression coefficient would be if the variable of interest were in fact orthogonal to the outcome variable, that is, how likely it would be to observe the apparent relationship if there were no actual relationship between the two variables.

Our correlation calculator outputs both p -values and confidence intervals for the calculated coefficients and is an easy way to explore the concept in the case of correlations. Extrapolating to regressions is then straightforward.
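As a minimal sketch with made-up data (not from the article), this is how a p-value is obtained alongside a correlation coefficient; the null hypothesis is that the two variables are uncorrelated:

```python
# Minimal sketch: correlation coefficient with its p-value, on invented data.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.4 * x + rng.normal(size=50)      # weak linear relationship plus noise

r, p_value = pearsonr(x, y)
# p_value answers: how surprising would a correlation at least this strong be
# if x and y were in fact unrelated (null hypothesis of zero correlation)?
print(f"r = {r:.2f}, p = {p_value:.4f}")
```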

Misinterpretations of statistically significant p -values

There are several common misinterpretations [7] of p -values and statistical significance and no calculator can save one from falling for them. The following errors are often committed when a result is seen as statistically significant.

A result may be highly statistically significant (e.g. p -value 0.0001) but it might still have no practical consequences due to a trivial effect size. This often happens with overpowered designs, but it can also happen in a properly designed statistical test. This error can be avoided by always reporting the effect size and confidence intervals around it.

Observing a highly significant result, say a p-value of 0.01, does not mean that the observed difference is likely to equal the true difference. In fact, the probability of that is much, much smaller. Remember that statistical significance has a strict meaning in the NHST framework.

For example, if the observed effect size μ 1 from an intervention is 20% improvement in some outcome and a p -value against the null hypothesis of μ 0 ≤ 0% has been calculated to be 0.01, it does not mean that one can reject μ 0 ≤ 20% with a p -value of 0.01. In fact, the p -value against μ 0 ≤ 20% would be 0.5, which is not statistically significant by any measure.

To make claims about a particular effect size it is recommended to use confidence intervals or severity, or both.

A related error is treating the p-value as a probability attached to a hypothesis: for example, stating that a p-value of 0.02 means that there is a 98% probability that the alternative hypothesis is true, or that there is a 2% probability that the null hypothesis is true. This is a logical error.

By design, even if the null hypothesis is true, p -values equal to or lower than 0.02 would be observed exactly 2% of the time, so one cannot use the fact that a low p -value has been observed to argue there is only 2% probability that the null hypothesis is true. Frequentist and error-statistical methods do not allow one to attach probabilities to hypotheses or claims, only to events [4] . Doing so requires an exhaustive list of hypotheses and prior probabilities attached to them which goes firmly into decision-making territory. Put in Bayesian terms, the p -value is not a posterior probability.

Misinterpretations of statistically non-significant outcomes

Statistically non-significant p-values, that is, p-values greater than the specified significance threshold α (alpha), can lead to a different set of misinterpretations. Due to the ubiquitous use of p-values, these are committed often as well.

A common mistake is treating a high p-value (a statistically non-significant result) as evidence, by itself, that the null hypothesis is true: for example, claiming after observing p = 0.2 that there is no effect, e.g. no difference between two means.

However, it is trivial to demonstrate why it is wrong to interpret a high p -value as providing support for the null hypothesis. Take a simple experiment in which one measures only 2 (two) people or objects in the control and treatment groups. The p -value for this test of significance will surely not be statistically significant. Does that mean that the intervention is ineffective? Of course not, since that claim has not been tested severely enough. Using a statistic such as severity can completely eliminate this error [4,5] .

A more detailed response would say that failure to observe a statistically significant result, given that the test has enough statistical power, can be used to argue for accepting the null hypothesis to the extent warranted by the power and with reference to the minimum detectable effect for which it was calculated. For example, if the statistical test had 99% power to detect an effect of size μ₁ at level α and it failed, then it could be argued that it is quite unlikely that there exists an effect of size μ₁ or greater, as in that case one would have most likely observed a significant p-value.

This is a softer version of the above mistake, wherein instead of claiming support for the null hypothesis, a statistically non-significant (high) p-value is taken, by itself, as indicating that the effect size must be small.

This is a mistake since the test might have simply lacked power to exclude many effects of meaningful size. Examining confidence intervals and performing severity calculations against particular hypothesized effect sizes would be a way to avoid this issue.

References:

[1] Mayo, D.G. 1983. “An Objective Theory of Statistical Testing.” Synthese 57 (3): 297–340. DOI:10.1007/BF01064701.
[2] Mayo, D.G. 1996. “Error and the Growth of Experimental Knowledge.” Chicago, Illinois: University of Chicago Press. DOI:10.1080/106351599260247.
[3] Mayo, D.G., and A. Spanos. 2006. “Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction.” The British Journal for the Philosophy of Science 57 (2): 323–357. DOI:10.1093/bjps/axl003.
[4] Mayo, D.G., and A. Spanos. 2011. “Error Statistics.” In Handbook of the Philosophy of Science, Volume 7 – Philosophy of Statistics, 1–46. Elsevier.
[5] Mayo, D.G. 2018. “Statistical Inference as Severe Testing.” Cambridge: Cambridge University Press. ISBN: 978-1107664647.
[6] Georgiev, G.Z. 2019. “Statistical Methods in Online A/B Testing.” ISBN: 978-1694079725.
[7] Greenland, S. et al. 2016. “Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations.” European Journal of Epidemiology 31: 337–350. DOI:10.1007/s10654-016-0149-3.
[8] Georgiev, G.Z. 2018. “Directional claims require directional (statistical) hypotheses.” [online, accessed Dec 07, 2020, at https://www.onesided.org/articles/directional-claims-require-directional-hypotheses.php]


An applied statistician, data analyst, and optimizer by calling, Georgi has expertise in web analytics, statistics, design of experiments, and business risk management. He covers a variety of topics where mathematical models and statistics are useful. Georgi is also the author of “Statistical Methods in Online A/B Testing”.


P-Value in Statistical Hypothesis Tests: What is it?

P Value Definition

A p value is used in hypothesis testing to help you support or reject the null hypothesis . The p value is the evidence against a null hypothesis . The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.

P values are expressed as decimals, although it may be easier to understand them if you convert them to percentages. For example, a p value of 0.0254 is 2.54%. This means that, if only chance were at work (i.e. the null hypothesis were true), you would see results at least this extreme about 2.54% of the time. That’s pretty tiny. On the other hand, a large p value of .9 (90%) means results like yours would be completely unsurprising under chance alone, providing no evidence against the null hypothesis. Therefore, the smaller the p-value, the stronger the evidence that your results are not due to chance alone (i.e. “significant”).

When you run a hypothesis test , you compare the p value from your test to the alpha level you selected when you ran the test. Alpha levels can also be written as percentages.


P Value vs Alpha level

Alpha levels are controlled by the researcher and are related to confidence levels . You get an alpha level by subtracting your confidence level from 100%. For example, if you want to be 98 percent confident in your research, the alpha level would be 2% (100% – 98%). When you run the hypothesis test, the test will give you a value for p. Compare that value to your chosen alpha level. For example, let’s say you chose an alpha level of 5% (0.05). If the results from the test give you:

  • A small p (≤ 0.05), reject the null hypothesis . This is strong evidence that the null hypothesis is invalid.
  • A large p (> 0.05) means the evidence against the null hypothesis is weak, so you do not reject the null.

P Values and Critical Values


What if I Don’t Have an Alpha Level?

In an ideal world, you’ll have an alpha level. But if you do not, you can still use the following rough guidelines in deciding whether to support or reject the null hypothesis:

  • If p > .10 → “not significant”
  • If p ≤ .10 → “marginally significant”
  • If p ≤ .05 → “significant”
  • If p ≤ .01 → “highly significant.”

How to Calculate a P Value on the TI 83

Example question: The average wait time to see an E.R. doctor is said to be 150 minutes. You think the wait time is actually less. You take a random sample of 30 people and find their average wait is 148 minutes with a standard deviation of 5 minutes. Assume the distribution is normal. Find the p value for this test.

  • Press STAT then arrow over to TESTS.
  • Press ENTER for Z-Test .
  • Arrow over to Stats. Press ENTER.
  • Arrow down to μ0 and type 150. This is our null hypothesis mean.
  • Arrow down to σ. Type in your std dev: 5.
  • Arrow down to xbar. Type in your sample mean : 148.
  • Arrow down to n. Type in your sample size : 30.
  • Arrow to <μ0 for a left tail test . Press ENTER.
  • Arrow down to Calculate. Press ENTER. P is given as .014, or about 1%.

The probability that you would get a sample mean of 148 minutes is tiny, so you should reject the null hypothesis.

Note : If you don’t want to run a test, you could also use the TI 83 NormCDF function to get the area (which is the same thing as the probability value).
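If you prefer software to a graphing calculator, here is a minimal sketch of the same left-tailed z-test in Python; the inputs are exactly the sample statistics from the example question:

```python
# Minimal sketch: the E.R. wait-time example as a left-tailed z-test.
from math import sqrt
from scipy.stats import norm

mu0, xbar, sigma, n = 150, 148, 5, 30

z = (xbar - mu0) / (sigma / sqrt(n))   # standardized distance from the null mean
p_value = norm.cdf(z)                  # left-tailed test: P(Z <= z)
print(round(z, 3), round(p_value, 4))  # z ~ -2.191, p ~ 0.014, matching the TI-83 result
```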



An Easy Introduction to Statistical Significance (With Examples)

Published on January 7, 2021 by Pritha Bhandari . Revised on June 22, 2023.

If a result is statistically significant , that means it’s unlikely to be explained solely by chance or random factors. In other words, a statistically significant result has a very low chance of occurring if there were no true effect in a research study.

The p value , or probability value, tells you the statistical significance of a finding. In most studies, a p value of 0.05 or less is considered statistically significant, but this threshold can also be set higher or lower.

Table of contents

  • How do you test for statistical significance?
  • What is a significance level?
  • Problems with relying on statistical significance
  • Other types of significance in research
  • Frequently asked questions about statistical significance

In quantitative research , data are analyzed through null hypothesis significance testing, or hypothesis testing. This is a formal procedure for assessing whether a relationship between variables or a difference between groups is statistically significant.

Null and alternative hypotheses

To begin, research predictions are rephrased into two main hypotheses: the null and alternative hypothesis .

  • A null hypothesis ( H 0 ) always predicts no true effect, no relationship between variables , or no difference between groups.
  • An alternative hypothesis ( H a or H 1 ) states your main prediction of a true effect, a relationship between variables, or a difference between groups.

Hypothesis testing always starts with the assumption that the null hypothesis is true. Using this procedure, you can assess the likelihood (probability) of obtaining your results under this assumption. Based on the outcome of the test, you can reject or retain the null hypothesis.

  • H 0 : There is no difference in happiness between actively smiling and not smiling.
  • H a : Actively smiling leads to more happiness than not smiling.

Test statistics and p values

Every statistical test produces:

  • A test statistic that indicates how closely your data match the null hypothesis.
  • A corresponding p value that tells you the probability of obtaining this result if the null hypothesis is true.

The p value determines statistical significance. An extremely low p value indicates high statistical significance, while a high p value means low or no statistical significance.

Next, you perform a t test to see whether actively smiling leads to more happiness. Using the difference in average happiness between the two groups, you calculate:

  • a t value (the test statistic) that tells you how much the sample data differs from the null hypothesis,
  • a p value showing the likelihood of finding this result if the null hypothesis is true.


The significance level , or alpha (α), is a value that the researcher sets in advance as the threshold for statistical significance. It is the maximum risk of making a false positive conclusion ( Type I error ) that you are willing to accept .

In a hypothesis test, the  p value is compared to the significance level to decide whether to reject the null hypothesis.

  • If the p value is  higher than the significance level, the null hypothesis is not refuted, and the results are not statistically significant .
  • If the p value is lower than the significance level, the results are interpreted as refuting the null hypothesis and reported as statistically significant .

Usually, the significance level is set to 0.05 or 5%. That means your results must have a 5% or lower chance of occurring under the null hypothesis to be considered statistically significant.

The significance level can be lowered for a more conservative test. That means an effect has to be larger to be considered statistically significant.

The significance level may also be set higher for significance testing in non-academic marketing or business contexts. This makes the study less rigorous and increases the probability of finding a statistically significant result.

As best practice, you should set a significance level before you begin your study. Otherwise, you can easily manipulate your results to match your research predictions.

It’s important to note that hypothesis testing can only show you whether or not to reject the null hypothesis in favor of the alternative hypothesis. It can never “prove” the null hypothesis, because the lack of a statistically significant effect doesn’t mean that absolutely no effect exists.

When reporting statistical significance, include relevant descriptive statistics about your data (e.g., means and standard deviations ) as well as the test statistic and p value.

There are various critiques of the concept of statistical significance and how it is used in research.

Researchers classify results as statistically significant or non-significant using a conventional threshold that lacks any theoretical or practical basis. This means that even a tiny 0.001 decrease in a p value can convert a research finding from statistically non-significant to significant with almost no real change in the effect.

On its own, statistical significance may also be misleading because it’s affected by sample size. In extremely large samples , you’re more likely to obtain statistically significant results, even if the effect is actually small or negligible in the real world. This means that small effects are often exaggerated if they meet the significance threshold, while interesting results are ignored when they fall short of meeting the threshold.

The strong emphasis on statistical significance has led to a serious publication bias and replication crisis in the social sciences and medicine over the last few decades. Results are usually only published in academic journals if they show statistically significant results—but statistically significant results often can’t be reproduced in high quality replication studies.

As a result, many scientists call for retiring statistical significance as a decision-making tool in favor of more nuanced approaches to interpreting results.

That’s why APA guidelines advise reporting not only p values but also  effect sizes and confidence intervals wherever possible to show the real world implications of a research outcome.

Aside from statistical significance, clinical significance and practical significance are also important research outcomes.

Practical significance shows you whether the research outcome is important enough to be meaningful in the real world. It’s indicated by the effect size of the study.

Clinical significance is relevant for intervention and treatment studies. A treatment is considered clinically significant when it tangibly or substantially improves the lives of patients.



Statistical significance is a term used by researchers to state that it is unlikely their observations could have occurred under the null hypothesis of a statistical test . Significance is usually denoted by a p -value , or probability value.

Statistical significance is arbitrary – it depends on the threshold, or alpha value, chosen by the researcher. The most common threshold is p < 0.05, which means that the data is likely to occur less than 5% of the time under the null hypothesis .

When the p -value falls below the chosen alpha value, then we say the result of the test is statistically significant.

A p -value , or probability value, is a number describing how likely it is that your data would have occurred under the null hypothesis of your statistical test .

P -values are usually automatically calculated by the program you use to perform your statistical test. They can also be estimated using p -value tables for the relevant test statistic .

P -values are calculated from the null distribution of the test statistic. They tell you how often a test statistic is expected to occur under the null hypothesis of the statistical test, based on where it falls in the null distribution.

If the test statistic is far from the mean of the null distribution, then the p -value will be small, showing that the test statistic is not likely to have occurred under the null hypothesis.

No. The p -value only tells you how likely the data you have observed is to have occurred under the null hypothesis .

If the p -value is below your threshold of significance (typically p < 0.05), then you can reject the null hypothesis, but this does not necessarily mean that your alternative hypothesis is true.



Statistics By Jim

Making statistics intuitive

How to Find the P value: Process and Calculations

By Jim Frost

P values are everywhere in statistics . They’re in all types of hypothesis tests. But how do you calculate a p-value ? Unsurprisingly, the precise calculations depend on the test. However, there is a general process that applies to finding a p value.

In this post, you’ll learn how to find the p value. I’ll start by showing you the general process for all hypothesis tests. Then I’ll move on to a step-by-step example showing the calculations for a p value. This post includes a calculator so you can apply what you learn.

General Process for How to Find the P value

To find the p value for your sample , do the following:

  • Identify the correct test statistic.
  • Calculate the test statistic using the relevant properties of your sample.
  • Specify the characteristics of the test statistic’s sampling distribution.
  • Place your test statistic in the sampling distribution to find the p value.

Before moving on to the calculations example, I’ll summarize the purpose for each step. This part tells you the “why.” In the example calculations section, I show the “how.”

Identify the Correct Test Statistic

All hypothesis tests boil your sample data down to a single number known as a test statistic. T-tests use t-values. F-tests use F-values. Chi-square tests use chi-square values. Choosing the correct one depends on the type of data you have and how you want to analyze it. Before you can find the p value, you must determine which hypothesis test and test statistic you’ll use.

Test statistics assess how consistent your sample data are with the null hypothesis. As a test statistic becomes more extreme, it indicates a larger difference between your sample data and the null hypothesis.

Calculate the Test Statistic

How you calculate the test statistic depends on which one you’re using. Unsurprisingly, the method for calculating test statistics varies by test type. Consequently, to calculate the p value for any test, you’ll need to know the correct test statistic formula.

To learn more about test statistics and how to calculate them for other tests, read my article, Test Statistics .

Specify the Properties of the Test Statistic’s Sampling Distribution

Test statistics are unitless, making them tricky to interpret on their own. You need to place them in a larger context to understand how extreme they are.

The sampling distribution for the test statistic provides that context. Sampling distributions are a type of probability distribution. Consequently, they allow you to calculate probabilities related to your test statistic’s extremeness, which lets us find the p value!

[Figure: Probability distribution plot displaying a t-distribution]

Like any distribution, the same sampling distribution (e.g., the t-distribution) can have a variety of shapes depending upon its parameters . For this step, you need to determine the characteristics of the sampling distribution that fit your design and data.

That usually entails specifying the degrees of freedom (changes its shape) and whether the test is one- or two-tailed (affects the directions the test can detect effects). In essence, you’re taking the general sampling distribution and tailoring it to your study so it provides the correct probabilities for finding the p value.

Each test statistic’s sampling distribution has unique properties you need to specify. At the end of this post, I provide links for several.

Learn more about degrees of freedom and one-tailed vs. two-tailed tests .

Placing Your Test Statistic in its Sampling Distribution to Find the P value

Finally, it’s time to find the p value because we have everything in place. We have calculated our test statistic and determined the correct properties for its sampling distribution. Now, we need to find the probability of values more extreme than our observed test statistic.

In this context, more extreme means further away from the null value in both directions for a two-tailed test or in one direction for a one-tailed test.

At this point, there are two ways to use the test statistic and distribution to calculate the p value. The formulas for probability distributions are relatively complex. Consequently, you won’t calculate it directly. Instead, you’ll use either an online calculator or a statistical table for the test statistic. I’ll show you both approaches in the step-by-step example.

In summary, calculating a p-value involves identifying and calculating your test statistic and then placing it in its sampling distribution to find the probability of more extreme values!

Let’s see this whole process in action with an example!

Step-by-Step Example of How to Find the P value for a T-test

For this example, assume we’re tasked with determining whether a sample mean is different from a hypothesized value. We’re given the sample statistics below and need to find the p value.

  • Mean: 330.6
  • Standard deviation: 154.2
  • Sample size: 25
  • Null hypothesis value: 260

Let’s work through the step-by-step process of how to calculate a p-value.

First, we need to identify the correct test statistic. Because we’re comparing one mean to a null value, we need to use a 1-sample t-test. Hence, the t-value is our test statistic, and the t-distribution is our sampling distribution.

Second, we’ll calculate the test statistic. The t-value formula for a 1-sample t-test is the following:

t = (x̄ − μ₀) / (s / √n)

  • x̄ is the sample mean.
  • µ 0 is the null hypothesis value.
  • s is the sample standard deviation.
  • n is the sample size
  • Collectively, the denominator is the standard error of the mean .

Let’s input our sample values into the equation to calculate the t-value.

t = (330.6 − 260) / (154.2 / √25) = 70.6 / 30.84 ≈ 2.289

Third, we need to specify the properties of the sampling distribution to find the p value. We’ll need the degrees of freedom.

The degrees of freedom for a 1-sample t-test is n – 1. Our sample size is 25. Hence, we have 24 DF. We’ll use a two-tailed test, which is the standard.

Now we’ve got all the necessary information to calculate the p-value. I’ll show you two ways to take the final step!

P-value Calculator

One method is to use an online p-value calculator, like the one I include below.

Enter the following in the calculator for our t-test example.

  • In What do you want? , choose Two-tailed p-value (the default).
  • In What do you have? , choose t-score .
  • In Degrees of freedom (d) , enter 24 .
  • In Your t-score , enter 2.289 .

The calculator displays a result of 0.031178.

There you go! Using the standard significance level of 0.05, our results are statistically significant!
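If you'd rather not use an online calculator, here is a minimal sketch of the same final step in Python: place the t-value of 2.289 in a t-distribution with 24 degrees of freedom and read off the two-tailed p-value.

```python
# Minimal sketch: two-tailed p-value for t = 2.289 with 24 degrees of freedom.
from scipy.stats import t

t_value, df = 2.289, 24
p_value = 2 * t.sf(t_value, df)        # sf gives one tail; double it for a two-tailed test
print(round(p_value, 5))               # ~0.03118, matching the calculator's result
```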

Using a Statistical Table to Find the P Value

The other common method is using a statistical table. In this case, we’ll need to use a t-table. For this example, I’ll truncate the rows. You can find my full table here: T-Table .

This method won’t find the exact p value, but you’ll find a range and know whether your results are statistically significant.

[Figure: T-table for finding the p value]

Start by looking in the row for 24 degrees of freedom, highlighted in light green. We need to find where our t-score of 2.289 fits in. I highlight the two table values that our t-value fits between, 2.064 and 2.492. Then we look at the two-tailed row at the top to find the corresponding p values for the two t-values.

In this case, our t-value of 2.289 produces a p value between 0.02 and 0.05 for a two-tailed test. Our results are statistically significant, and they are consistent with the calculator’s more precise results.

Displaying the P value in a Chart

In the example above, you saw how to calculate a p-value starting with the sample statistics. We calculated the t-value and placed it in the applicable t-distribution. I find that the calculations and numbers are dry by themselves. I love graphing things whenever possible, so I’ll use a probability distribution plot to illustrate the example.

Using statistical software, I’ll create the graphical equivalent of calculating the p-value above.

[Figure: Chart for finding the p value]

This chart has two shaded regions because we performed a two-tailed test. Each region has a probability of 0.01559. When you sum them, you obtain the p-value of 0.03118. In other words, the likelihood of a t-value falling in either shaded region when the null hypothesis is true is 0.03118.

I showed you how to find the p value for a t-test. Click the links below to see how it works for other hypothesis tests:

  • One-Way ANOVA F-test
  • Chi-square Test of Independence

Now that we’ve found the p value, how do you interpret it precisely? If you’re going beyond the significant/not significant decision and really want to understand what it means, read my posts, Interpreting P Values  and Statistical Significance: Definition & Meaning .

If you’re learning about hypothesis testing and like the approach I use in my blog, check out my Hypothesis Testing book! You can find it at Amazon and other retailers.



Reader Interactions


January 9, 2024 at 9:58 am

how did you get the 0.01559? is it from the t table or somewhere else. please put me through


January 9, 2024 at 3:13 pm

The value of 0.01559 comes from the t-distribution. It’s the probability of each red shaded region in the graph I show. These regions are based on the t-value. Typically, you’ll use either statistical software or a t-distribution calculator to find probabilities associated with t-values. Or use a t-table. I used my statistical software. You don’t calculate those probabilities yourself because the calculations are complex.

I hope that helps!


November 23, 2022 at 2:08 am

Simply superb. Easy for us who are starters to enjoy statistic made enjoyable.


November 22, 2022 at 6:41 pm

I like the way your presentation so that every one can undersanf in the simplest way. If you can support this by power point it will be more intetrsted. I know it takes your valuable time. However, forwarding your knowledge to those who need is more valuable, supporting and appreciation. Continue doing this teaching approach. Thank you. I wish you all the best. God bless you.



P-Value: A Complete Guide

Published by Owen Ingram at August 31st, 2021 , Revised On August 3, 2023

You might have come across this term many times in hypothesis testing. Can you tell what a p-value is and how to calculate it? For those who are new to this term, sit back and read this guide to find out all the answers. Those already familiar with it should continue reading too, because you might get a chance to dig deeper into the p-value and its significance in statistics.

Before we start with what a p-value is, there are a few other terms you must be clear of. And these are the null hypothesis and alternative hypothesis .

What are the Null Hypothesis and Alternative Hypothesis?

 The alternative hypothesis is your first hypothesis predicting a relationship between different variables . On the contrary, the null hypothesis predicts that there is no relationship between the variables you are playing with.

For instance, suppose you want to check the impact of two fertilizers on the growth of two sets of plants. Group A of plants is given fertilizer A, while group B is given fertilizer B. By using a two-tailed t-test, you can then test whether there is a difference in growth between the two fertilizers.

Null Hypothesis : There is no difference in growth between the two sets of plants.

Alternative Hypothesis: There is a difference in growth between the two groups.

What is the P-value?

The p-value in statistics is the probability of getting outcomes at least as extreme as the outcomes of a statistical hypothesis test, assuming the null hypothesis to be correct. To put it in simpler words, it is a calculated number from a statistical test that shows how likely you are to have found a set of observations if the null hypothesis were plausible.

This means that p-values are used as alternatives to rejection points for providing the smallest level of significance at which the null hypothesis can be rejected. If the p-value is small, the evidence in favour of the alternative hypothesis is stronger. Similarly, if the value is large, the evidence in favour of the alternative hypothesis is weaker.

How is the P-value Calculated?

You can either use the p-value tables or statistical software to calculate the p-value. The calculated numbers are based on the known probability distribution of the statistic being tested.

The online p-value tables depict how frequently you can expect to see test statistics under the null hypothesis. P-value depends on the statistical test one uses to test a hypothesis.

  • Different statistical tests can have different predictions, hence developing different test statistics. Researchers can choose a statistical test depending on what best suits their data and the effect they want to test
  • The number of independent variables in your test determines how large or small the test statistic must be to produce the same p-value


When is a P-value Statistically Significant?

Before we talk about when a p-value is statistically significant, let’s first find out what does it mean to be statistically significant.

Any guesses?

To be statistically significant is another way of saying that a p-value is so small that it might reject a null hypothesis.

Now the question is how small?

If a p-value is smaller than 0.05, then it is statistically significant. This means that the evidence against the null hypothesis is strong. Because there would be less than a 5 per cent chance of seeing results this extreme if the null hypothesis were correct, we reject the null hypothesis in favour of the alternative hypothesis.

Nevertheless, if the p-value is less than the threshold of significance, the null hypothesis can be rejected, but that does not mean there is a 95 per cent probability of the alternative hypothesis being true. Note that the p-value is conditioned on the null hypothesis being true; it says nothing directly about the correctness or falsity of the alternative hypothesis.

When the p-value is greater than 0.05, it is not statistically significant, which indicates that the evidence against the null hypothesis is weak. So, the alternative hypothesis is not supported in this case, and the null hypothesis is retained. An important thing to keep in mind here is that you still cannot accept the null hypothesis: you can only reject it or fail to reject it.

Here is a table showing hypothesis interpretations:

P-value     Decision
> 0.05      Not statistically significant; do not reject the null hypothesis.
≤ 0.05      Statistically significant; reject the null hypothesis in favour of the alternative hypothesis.
≤ 0.01      Highly statistically significant; reject the null hypothesis in favour of the alternative hypothesis.

Is it clear now? We thought so! Let’s move on to the next heading, then.

How to Use P-value in Hypothesis Testing?

Follow these three simple steps to use p-value in hypothesis testing .

Step 1: Find the level of significance. Make sure to choose the significance level during the initial steps of the design of a hypothesis test. It is usually 0.10, 0.05, and 0.01.

Step 2: Now calculate the p-value. As we discussed earlier, there are two ways of calculating it. A simple way out would be using Microsoft Excel, which allows p-value calculation with Data Analysis ToolPak .

Step 3: Start comparing the p-value with the significance level and deduce conclusions accordingly. Following the general rule, if the value is less than the level of significance, there is enough evidence to reject the null hypothesis of an experiment.
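As a minimal sketch of these three steps in code (the growth measurements below are invented purely for illustration), here is the fertilizer example from earlier run as a two-tailed t-test in Python:

```python
# Minimal sketch: the three steps applied to the (hypothetical) fertilizer data.
from scipy import stats

alpha = 0.05                                     # Step 1: choose the significance level

growth_a = [20.1, 22.3, 19.8, 21.5, 23.0, 20.7]  # growth with fertilizer A (hypothetical, cm)
growth_b = [18.2, 19.9, 17.5, 20.1, 18.8, 19.3]  # growth with fertilizer B (hypothetical, cm)

t_stat, p_value = stats.ttest_ind(growth_a, growth_b)   # Step 2: compute the p-value
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

if p_value < alpha:                              # Step 3: compare the p-value with alpha
    print("Reject the null hypothesis: the two fertilizers appear to differ.")
else:
    print("Fail to reject the null hypothesis: no significant difference detected.")
```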

FAQs About P-Value

What is a null hypothesis?

It is a statistical theory suggesting that there is no relationship between a set of variables .

What is an alternative hypothesis?

The alternative hypothesis is your first hypothesis predicting a relationship between different variables .

What is the p-value?

The p-value in statistics is the probability of getting outcomes at least as extreme as the outcomes of a statistical hypothesis test, assuming the null hypothesis to be correct. It is a calculated number from a statistical test that shows how likely you are to have found a set of observations if the null hypothesis were plausible.

What is the level of significance?

The level of significance (alpha) is the threshold chosen before the test is run; if the p-value falls below it, the result is declared statistically significant. The table in the section above shows the usual cut-offs.


The P value: What it really means

As nurses, we must administer nursing care based on the best available scientific evidence. But for many nurses, critical appraisal, the process used to determine the best available evidence, can seem intimidating. To make critical appraisal more approachable, let’s examine the P value and make sure we know what it is and what it isn’t.

Defining P value

The P value is the probability of obtaining results at least as extreme as those observed if chance alone were operating, that is, if there were no true difference. To better understand this definition, consider the role of chance.

The concept of chance is illustrated with every flip of a coin. The true probability of obtaining heads in any single flip is 0.5, meaning that over the long run heads would come up in half of the flips and tails in the other half. But if you were to flip a coin 10 times, you likely would not obtain heads five times and tails five times. You’d be more likely to see a seven-to-three split or a six-to-four split. Chance is responsible for this variation in results.
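A quick simulation makes this concrete. The sketch below (purely illustrative) flips a fair coin ten times on five separate occasions; exact five-and-five splits turn out to be the exception rather than the rule.

```python
# Simulate repeated runs of 10 fair coin flips to show chance variation.
import random

random.seed(1)  # fixed seed so the illustration is reproducible
for trial in range(1, 6):
    heads = sum(random.random() < 0.5 for _ in range(10))  # count heads in 10 flips
    print(f"Trial {trial}: {heads} heads, {10 - heads} tails")
```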

Just as chance plays a role in determining the flip of a coin, it plays a role in the sampling of a population for a scientific study. When subjects are selected, chance may produce an unequal distribution of a characteristic that can affect the outcome of the study. Statistical inquiry and the P value are designed to help us determine just how large a role chance plays in study results. We begin a study with the assumption that there will be no difference between the experimental and control groups. This assumption is called the null hypothesis. When the results of the study indicate that there is a difference, the P value helps us determine the likelihood that the difference is attributed to chance.

Competing hypotheses

In every study, researchers put forth two kinds of hypotheses: the research or alternative hypothesis and the null hypothesis. The research hypothesis reflects what the researchers hope to show—that there is a difference between the experimental group and the control group. The null hypothesis directly competes with the research hypothesis. It states that there is no difference between the experimental group and the control group.

It may seem logical that researchers would test the research hypothesis—that is, that they would test what they hope to prove. But probability theory requires that they test the null hypothesis instead. To support the research hypothesis, the data must contradict the null hypothesis. By demonstrating a difference between the two groups, the data contradict the null hypothesis.

Testing the null hypothesis

Now that you know why we test the null hypothesis, let’s look at how we test the null hypothesis.

After formulating the null and research hypotheses, researchers decide on a test statistic they can use to determine whether to accept or reject the null hypothesis. They also propose a fixed-level P value. The fixed level P value is often set at .05 and serves as the value against which the test-generated P value must be compared. (See Why .05?)

A comparison of the two P values determines whether the null hypothesis is rejected or retained. If the P value associated with the test statistic is less than the fixed-level P value, the null hypothesis is rejected because there’s a statistically significant difference between the two groups. If the P value associated with the test statistic is greater than the fixed-level P value, the null hypothesis is retained (we fail to reject it) because no statistically significant difference between the groups has been demonstrated.

The decision to use .05 as the threshold in testing the null hypothesis is completely arbitrary. The researchers credited with establishing this threshold warned against strictly adhering to it.

Remember that warning when appraising a study in which the test statistic is greater than .05. The savvy reader will consider other important measurements, including effect size, confidence intervals, and power analyses when deciding whether to accept or reject scientific findings that could influence nursing practice.

Real-world hypothesis testing

How does this play out in real life? Let’s assume that you and a nurse colleague are conducting a study to find out if patients who receive backrubs fall asleep faster than patients who do not receive backrubs.

1. State your null and research hypotheses

Your null hypothesis will be that there will be no difference in the average amount of time it takes patients in each group to fall asleep. Your research hypothesis will be that patients who receive backrubs fall asleep, on average, faster than those who do not receive backrubs. You will be testing the null hypothesis in hopes of supporting your research hypothesis.

2. Propose a fixed-level P value

Although you can choose any value as your fixed-level P value, you and your research colleague decide you’ll stay with the conventional .05. If you were testing a new medical product or a new drug, you would choose a much smaller P value (perhaps as small as .0001). That’s because you would want to be as sure as possible that any difference you see between groups is attributed to the new product or drug and not to chance. A fixed-level P value of .0001 would mean that the difference between the groups was attributed to chance only 1 time out of 10,000. For a study on backrubs, however, .05 seems appropriate.

3. Conduct hypothesis testing to calculate a probability value

You and your research colleague agree that a randomized controlled study will help you best achieve your research goals, and you design the process accordingly. After consenting to participate in the study, patients are randomized to one of two groups:

  • the experimental group that receives the intervention—the backrub group
  • the control group—the non-backrub group.

After several nights of measuring the number of minutes it takes each participant to fall asleep, you and your research colleague find that on average, the backrub group takes 19 minutes to fall asleep and the non-backrub group takes 24 minutes to fall asleep.

Now the question is: Would you have the same results if you conducted the study using two different groups of people? That is, what role did chance play in helping the backrub group fall asleep 5 minutes faster than the non-backrub group? To answer this, you and your colleague will use an independent samples t-test to calculate a probability value.

An independent samples t-test is a kind of hypothesis test that compares the mean values of two groups (backrub and non-backrub) on a given variable (time to fall asleep).
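For readers who like to see the mechanics, here is a hedged sketch of such a test in Python. The minutes-to-fall-asleep values are made up for illustration only; they are not data from any actual study.

```python
# Independent samples t-test on two hypothetical groups (minutes to fall asleep).
from scipy import stats

backrub_group = [17, 20, 18, 21, 19, 16, 22, 18, 20, 19]
non_backrub_group = [25, 23, 26, 22, 24, 27, 23, 25, 22, 24]

t_stat, p_value = stats.ttest_ind(backrub_group, non_backrub_group)
print(f"t = {t_stat:.2f}, P = {p_value:.4f}")

# Compare the test-generated P value with the fixed-level P value of .05
if p_value < 0.05:
    print("Reject the null hypothesis: the difference is statistically significant.")
else:
    print("Retain the null hypothesis: no statistically significant difference was found.")
```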

Hypothesis testing is really nothing more than testing the null hypothesis. In this case, the null hypothesis is that the amount of time needed to fall asleep is the same for the experimental group and the control group. The hypothesis test addresses this question: If there’s really no difference between the groups, what is the probability of observing a difference of 5 minutes or more, say 10 minutes or 15 minutes?

We can define the P value as the probability of observing a time difference at least as large as the one found if chance alone were responsible. Some find it easier to understand the P value when they think of it in relationship to error. In this case, the P value is defined as the probability of committing a Type 1 error. (A Type 1 error occurs when a true null hypothesis is incorrectly rejected.)

4. Compare and interpret the P value

Early on in your study, you and your colleague selected a fixed-level P value of .05, meaning that you were willing to accept that 5% of the time, your results might be caused by chance. Also, you used an independent samples t-test to arrive at a probability value that will help you determine the role chance played in obtaining your results. Let’s assume, for the sake of this example, that the probability value generated by the independent samples t-test is .01 (P = .01). Because this P value associated with the test statistic is less than the fixed-level statistic (.01 < .05), you can reject the null hypothesis. By doing so, you declare that there is a statistically significant difference between the experimental and control groups. (See Putting the P value in context.)

In effect, you’re saying that the chance of observing a difference of 5 minutes or more, when in fact there is no difference, is less than 5 in 100. If the P value associated with the test statistic had been greater than .05, you would have retained the null hypothesis, meaning that no statistically significant difference between the control and experimental groups was demonstrated. Retaining the null hypothesis would mean that a difference of 5 minutes or more between the two groups would be expected to occur by chance more than 5 times in 100.

Putting the P value in context

Although the P value helps you interpret study results, keep in mind that many factors can influence the P value—and your decision to accept or reject the null hypothesis. These factors include the following:

  • Insufficient power. The study may not have been designed appropriately to detect an effect of the independent variable on the dependent variable. Therefore, a real effect may have gone undetected, causing you to incorrectly retain the null hypothesis (a Type II error).
  • Unreliable measures. Instruments that don’t meet consistency or reliability standards may have been used to measure a particular phenomenon.
  • Threats to internal validity. Various biases, such as selection of patients, regression, history, and testing bias, may unduly influence study outcomes.

A decision to accept or reject study findings should focus not only on P value but also on other metrics including the following:

  • Confidence intervals (an estimated range of values with a high probability of including the true population value of a given parameter)
  • Effect size (a value that measures the magnitude of a treatment effect)

Remember, the P value tells you only how likely the observed difference between groups would be if no true difference existed. It doesn’t tell you the magnitude of the difference.
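As a rough illustration of that point, the sketch below (with invented samples) reports an effect size (Cohen's d) and an approximate 95% confidence interval for the mean difference alongside the P value, so the magnitude of the difference is visible as well.

```python
# Report magnitude (effect size and confidence interval), not just the P value.
import numpy as np
from scipy import stats

group_a = np.array([19.0, 17.0, 20.0, 18.0, 21.0, 19.0, 16.0, 22.0])
group_b = np.array([24.0, 25.0, 23.0, 26.0, 22.0, 24.0, 27.0, 23.0])

t_stat, p_value = stats.ttest_ind(group_a, group_b)

diff = group_a.mean() - group_b.mean()
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)  # equal group sizes assumed
cohens_d = diff / pooled_sd                                           # standardised effect size

se = np.sqrt(group_a.var(ddof=1) / len(group_a) + group_b.var(ddof=1) / len(group_b))
df = len(group_a) + len(group_b) - 2
margin = stats.t.ppf(0.975, df) * se                                  # half-width of an approximate 95% CI

print(f"P = {p_value:.4f}, mean difference = {diff:.2f}, Cohen's d = {cohens_d:.2f}, "
      f"95% CI ({diff - margin:.2f}, {diff + margin:.2f})")
```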

5. Communicate your findings

The final step in hypothesis testing is communicating your findings. When sharing research findings in writing or discussion, remember that hypotheses are statements about relationships or differences in populations. Study findings support or fail to support them; they are never proved or disproved. Scientific findings are always subject to change. But each study leads to better understanding and, ideally, better outcomes for patients.

Key concepts

The P value isn’t the only concept you need to understand to analyze research findings. But it is a very important one. And chances are that understanding the P value will make it easier to understand other key analytical concepts.


Kenneth J. Rempher, PhD, RN, MBA, CCRN, APRN,BC, is Director, Professional Nursing Practice at Sinai Hospital of Baltimore (Md.). Kathleen Urquico, BSN, RN, is a Direct Care Nurse in the Rubin Institute of Advanced Orthopedics at Sinai Hospital of Baltimore.


Quantitative vs. Qualitative Research Design: Understanding the Differences


As a future professional in the social and education landscape, research design is one of the most critical strategies that you will master to identify challenges, ask questions and form data-driven solutions to address problems specific to your industry. 

Many approaches to research design exist, and not all work in every circumstance. While all data-focused research methods are valid in their own right, certain research design methods are more appropriate for specific study objectives.

Unlock our resource to learn more about jump starting a career in research design — Research Design and Data Analysis for the Social Good .

We will discuss the differences between quantitative (numerical and statistics-focused) and qualitative (non-numerical and human-focused) research design methods so that you can determine which approach is most strategic given your specific area of graduate-level study. 

Understanding Social Phenomena: Qualitative Research Design

Qualitative research focuses on understanding a phenomenon based on human experience and individual perception. It is a non-numerical methodology relying on interpreting a process or result. Qualitative research also paves the way for uncovering other hypotheses related to social phenomena. 

In its most basic form, qualitative research is exploratory in nature and seeks to understand the subjective experience of individuals based on social reality.

Qualitative data is…

  • often used in fields related to education, sociology and anthropology; 
  • designed to arrive at conclusions regarding social phenomena; 
  • focused on data-gathering techniques like interviews, focus groups or case studies; 
  • dedicated to perpetuating a flexible, adaptive approach to data gathering;
  • known to lead professionals to deeper insights within the overall research study.

You want to use qualitative data research design if:

  • you work in a field concerned with enhancing humankind through the lens of social change;
  • your research focuses on understanding complex social trends and individual perceptions of those trends;
  • you have interests related to human development and interpersonal relationships.

Examples of Qualitative Research Design in Education

Here are just a few examples of how qualitative research design methods can impact education:

Example 1: Former educators participate in in-depth interviews to help determine why a specific school is experiencing a higher-than-average turnover rate compared to other schools in the region. These interviews help determine the types of resources that will make a difference in teacher retention. 

Example 2: Focus group discussions occur to understand the challenges that neurodivergent students experience in the classroom daily. These discussions prepare administrators, staff, teachers and parents to understand the kinds of support that will augment and improve student outcomes.

Example 3: Case studies examine the impacts of a new education policy that limits the number of teacher aides required in a special needs classroom. These findings help policymakers determine whether the new policy affects the learning outcomes of a particular class of students.

Interpreting the Numbers: Quantitative Research Design

Quantitative research tests hypotheses and measures connections between variables. It relies on insights derived from numbers — countable, measurable and statistically sound data. Quantitative research is a strategic research design used when basing critical decisions on statistical conclusions and quantifiable data.

Quantitative research provides numerical, quantifiable data that may support or refute a theory or hypothesis.

Quantitative data is…

  • often used in fields related to education, data analysis and healthcare; 
  • designed to arrive at numerical, statistical conclusions based on objective facts;
  • focused on data-gathering techniques like experiments, surveys or observations;
  • dedicated to using mathematical principles to arrive at conclusions;
  • known to lead professionals to objective, reproducible observations within the overall research study.

You want to use quantitative data research design if:

  • you work in a field concerned with analyzing data to inform decisions;
  • your research focuses on studying relationships between variables to form data-driven conclusions;
  • you have interests related to mathematics, statistical analysis and data science.

Examples of Quantitative Research Design in Education

Here are just a few examples of how quantitative research design methods may impact education:

Example 1: Researchers compile data to understand the connection between class sizes and standardized test scores. Researchers can determine if and what the relationship is between smaller, intimate class sizes and higher test scores for grade-school children using statistical and data analysis.

Example 2: Professionals conduct an experiment in which a group of high school students must complete a certain number of community service hours before graduation. Researchers compare those students to another group of students who did not complete service hours — using statistical analysis to determine if the requirement increased college acceptance rates.

Example 3: Teachers take a survey to examine an education policy that restricts the number of extracurricular activities offered at a particular academic institution. The findings help better understand the far-reaching impacts of extracurricular opportunities on academic performance.
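As a toy illustration of Example 1 above, the short sketch below (all numbers invented) quantifies the relationship between class size and average test score with a correlation coefficient and its p-value.

```python
# Hypothetical class sizes and average test scores for ten classrooms.
from scipy import stats

class_sizes = [15, 18, 20, 22, 25, 28, 30, 32, 35, 38]
test_scores = [88, 86, 85, 83, 82, 80, 78, 77, 75, 72]

r, p_value = stats.pearsonr(class_sizes, test_scores)
print(f"Pearson r = {r:.2f}, p = {p_value:.4f}")  # a negative r suggests larger classes go with lower scores
```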

Making the Most of Research Design Methods for Good: Vanderbilt University’s Peabody College

Vanderbilt University's Peabody College of Education and Human Development offers a variety of respected, nationally-recognized graduate programs designed with future agents of social change in mind. We foster a culture of excellence and compassion and guide you to become the best you can be — both in the classroom and beyond.

At Peabody College, you will experience

  • an inclusive, welcoming community of like-minded professionals;
  • the guidance of expert faculty with real-world industry experience;
  • opportunities for valuable, hands-on learning experiences;
  • the option of specializing depending on your specific area of interest.

Explore our monthly publication — Ideas in Action — for an inside look at how Peabody College translates discoveries into action.

Please click below to explore a few of the graduate degrees offered at Peabody College:

  • Child Studies M.Ed. — a rigorous Master of Education degree that prepares students to examine the developmental, learning and social issues concerning children and that allows students to choose from one of two tracks (the Clinical and Developmental Research Track or the Applied Professional Track).
  • Cognitive Psychology in Context M.S. — an impactful Master of Science program that emphasizes research design and statistical analysis to understand cognitive processes and real-world applications best, making it perfect for those interested in pursuing doctoral studies in cognitive science.
  • Education Policy M.P.P — an analysis-focused Master of Public Policy program designed for future leaders in education policy and practice, allowing students to specialize in either K-12 Education Policy, Higher Education Policy or Quantitative Methods in Education Policy. 
  • Quantitative Methods M.Ed. — a data-driven Master of Education degree that teaches the theory and application of quantitative analysis in behavioral, social and educational sciences.

Connect with the Community of Professionals Seeking to Enhance Humankind at Peabody College

At Peabody College, we equip you with the marketable, transferable skills needed to secure a valuable career in education and beyond. You will emerge from the graduate program of your choice ready to enhance humankind in more meaningful ways than you could have imagined.

If you want to develop the sought-after skills needed to be a force for change in the social and educational spaces, you are in the right place.

We invite you to request more information; we will connect you with an admissions professional who can answer all your questions about choosing one of these transformative graduate degrees at Peabody College. You may also take this opportunity to review our admissions requirements and start your online application today.


The Sydney Morning Herald

Ozempic is much more than a weight-loss drug, studies show

By Mary Ward


Ozempic and similar new diabetes drugs are very likely to improve heart and kidney health for sufferers, research is revealing. But Australia’s supply shortage will not be resolved this year.

New research from Sydney’s George Institute for Global Health and Royal North Shore Hospital suggests Ozempic can be safely used alongside existing treatments to reduce heart and kidney complications.

Colleen Alexander has taken medication for type 2 diabetes for years. Credit: Dominic Lorrimer

The paper, published in The Lancet Diabetes & Endocrinology last week and presented at the European Renal Association Congress, analysed data from more than 72,000 patients worldwide.

It found no increase in adverse events when combining newer diabetes drugs known as GLP1-RAs, such as semaglutide (sold as Ozempic), with older medications, known as flozins.

George Institute Clinical Associate Professor Brendon Neuen said the analysis showed that combining the treatments was a valid course of action.

“There has been a lot of interest in using them in combination, but we haven’t had the data to support that until now,” he said.

Because the two classes of drug work in different ways – GLP1-RAs enhance insulin release and sensitivity, while flozins lower blood glucose by increasing its excretion in urine – Neuen said it was considered very likely that combining the drugs improved effectiveness.

“The way we are interpreting this is, it is very likely to provide additive protection,” he said.

The new research follows the highly anticipated results of the international FLOW trial, a large-scale clinical trial that suggested a weekly dose of semaglutide reduced risk of major kidney failure in people with type 2 diabetes and chronic kidney disease by 24 per cent.

University of Sydney diabetes researcher Professor Stephen Twigg, who was not involved in either of the recent papers, said there was a “rich tapestry” of research coming through in the field.

“There are studies showing these drugs have been good at preventing some major health outcomes that we may not have been predicting,” he said.

Twigg said the next steps should be to assess the cost-effectiveness of such treatment – weighing up the price of subsidising expensive medications against the saving they could provide for the health system – and to examine whether different drugs performed better for different people.

So-called “personalised” or “precision” medicine is a booming field in oncology, where genetic testing is increasingly guiding cancer treatments.

“But in personalising care for diabetes, we are still learning about sub-groups of patients and how we can best target interventions,” Twigg said.

“As more evidence becomes available, we are getting a better feel for which medicines are going to be more effective.”

Neuen said further research should look at whether combining flozins with GLP1-RAs was effective in people who do not have type 2 diabetes.

Ozempic has been in short supply in Australia since 2022, when its popularity skyrocketed as social media videos credited the drug for rapid weight loss. Due to demand, people with type 2 diabetes complained pharmacists were struggling to fill their scripts.

The Therapeutic Goods Administration then discouraged GPs from prescribing it off-label for weight loss.

The latest advice from the drug’s manufacturer, Novo Nordisk, is that supply will remain limited throughout 2024. The TGA has also cracked down on compounded versions of the drug, due to quality control concerns.

Type 2 diabetes patient Colleen Alexander has taken a flozin for several years. Having lost half a kidney to cancer, the 82-year-old from Gladesville, in Sydney’s north, said she would be discussing her ongoing treatment options with her doctor.


Study finds that engagement with Level2 leads to improved type 2 diabetes outcomes

Value-based care solution, Level2, helps members with type 2 diabetes improve control over their condition, while also reducing costs.


In a study recently showcased at the 2024 American Diabetes Association Scientific Sessions , 73% of UnitedHealthcare members enrolled in Level2 with a starting hemoglobin A1C above 7.0 had a clinically meaningful improvement in this measurement. Among members included in the study, the average reduction in A1C was 1.39 percentage points after one year and 1.36 at two years, demonstrating that Level2 can lead to sustained, lower glucose levels among participants. 1

Level2® , a value-based care solution that combines wearable technology with customized clinical support, aims to help improve the health of people living with type 2 diabetes.

Level2   is a virtual care organization designed to help people living with type 2 diabetes work to improve their blood sugar levels, improve their health, and even work towards remission. 

13.3% higher productivity costs and 2x higher medical costs for employees with type 2 diabetes compared to those without diabetes 3

Grounded in metabolic and behavioral science, Level2 equips eligible participants with continuous glucose monitors (CGM) and personalized support from a care team to help encourage healthier lifestyle decisions, like food choices, exercise and sleep patterns.

Level2 aims to help reduce eligible employers’ overall cost of care by improving health outcomes and avoiding costly complications.

What does this mean for employers? With about 11.5% of Americans living with diabetes, 2 employers likely have a significant portion of their workforces managing diabetes or prediabetes. Type 2 diabetes diagnoses have a direct impact on productivity and costs.

“Type 2 diabetes and related complications are largely preventable,” says Dr. Donna O’Shea, chief medical officer of population health for UnitedHealthcare. “With the right support, it can also be more effectively managed through a combination of consistent lifestyle changes, helping achieve better health and avoiding costly complications. That can be good news for workers, their families and their employers.”

For eligible employers with self-funded health plans, the Level2 Assured Value Program can not only help eligible employees better manage and potentially even improve their type 2 diabetes, but also aims to ensure employers realize value from their paid program fees.

Here’s how the Level2 program assures eligible employers realize value:

  • For eligible employers with more than 125 covered employees with type 2 diabetes, 100% of program fees are reconciled against the actual medical and pharmacy claims savings generated, assuring employers realize value
  • If the value generated is less than the eligible employer’s paid program fees, Level2 returns the difference to the employer

Eligible employers can add the Level2 Assured Value Program to most UnitedHealthcare administered plans. The reduction in the employer’s overall cost of care is achieved by improving health outcomes and avoiding costly complications of unmanaged type 2 diabetes, as well as by reducing member reliance on medications.

How Level2 helped one employer better manage their type 2 diabetes spend

To help employees better manage their type 2 diabetes, Midwest automotive group Gurley Leep Automotive Family started offering its employees Level2.

With Level2, Gurley Leep employees and their dependent family members were able to earn 100% coverage on common type 2 diabetes medication, supplies, lab work and PCP visits by engaging in activities that help manage their condition.

In its first year of offering Level2, Gurley Leep experienced higher engagement among employees than they had with their previous diabetes management program in addition to 5% guaranteed savings — and many employees are thrilled at the results.

“With Level2, UnitedHealthcare is working to change the impact of type 2 diabetes by helping people better manage their condition using continuous glucose monitoring, coaching, lifestyle modifications and medication management,” said Dr. Rhonda L. Randall, chief medical officer for UnitedHealthcare Employer & Individual. 

“Level2 is one way we are working to reduce the impact of type 2 diabetes by helping people work to improve their condition while offering employers incentives to adopt evidence-based solutions. Level2’s combination of data, technology and personal support can help members achieve better health and support employers seeking to address potentially one of their biggest drivers of medical costs.” — Dr. Rhonda Randall, Chief Medical Officer, UnitedHealthcare Employer & Individual


  • Open access
  • Published: 09 July 2024

Outcome evaluation of technical strategies on reduction of patient waiting time in the outpatient department at Kilimanjaro Christian Medical Centre—Northern Tanzania

  • Manasseh J. Mwanswila   ORCID: orcid.org/0000-0003-3378-2865 1 , 2 ,
  • Henry A. Mollel 2 &
  • Lawrencia D. Mushi 2  

BMC Health Services Research, volume 24, Article number: 785 (2024)


The Tanzania healthcare system is beset by prolonged waiting time in its hospitals particularly in the outpatient departments (OPD). Previous studies conducted at Kilimanjaro Christian Medical Centre (KCMC) revealed that patients typically waited an average of six hours before receiving the services at the OPD making KCMC have the longest waiting time of all the Zonal and National Referral Hospitals. KCMC implemented various interventions from 2016 to 2021 to reduce the waiting time. This study evaluates the outcome of the interventions on waiting time at the OPD.

This was an analytical cross-sectional mixed-methods study using an explanatory sequential design. The study enrolled 412 patients who completed a structured questionnaire, and in-depth interviews (IDI) were conducted with 24 participants (12 healthcare providers and 12 patients) from 3rd to 14th July 2023. A documentary review was also conducted to establish benchmarks for waiting time. Quantitative data analysis included descriptive statistics and bivariable and multivariable analyses. All statistical tests were conducted at the 5% significance level. Thematic analysis was used to analyse the qualitative data.

The findings suggest that, after implementation of the technical strategies, the overall median OPD waiting time decreased significantly to 3 h 30 min (IQR 2.51–4.08), a 45% reduction from the previous six-hour wait. Substantial improvements were observed in the waiting times for registration (9 min), payment (10 min), triage (14 min for insured patients), and pharmacy (4 min). Among the implemented strategies, electronic medical records emerged as a significant predictor of reduced waiting time (AOR = 2.08, 95% CI 1.10–3.94, p-value = 0.025). IDI findings suggested a positive shift in patients' perceptions of OPD waiting time. Problems that still need addressing include the ineffective implementation of the block appointment system and the extension of clinic days, which was linked to issues of ownership, organizational culture, insufficient training, and ineffective follow-up. The shared use of central modern diagnostic equipment between inpatient and outpatient services at the radiology department also resulted in delays.

The established technical strategies have been effective in reducing waiting time, although further action is needed to attain the global standard of 30 min to 2 h OPD waiting time.


The Tanzanian healthcare system is beset by prolonged waiting times in its hospitals, particularly in the outpatient departments. The reported contributing factors include the increased need for healthcare due to uncontrolled population growth, an inadequate number of medical experts, underdeveloped healthcare systems, and ineffective referral systems [ 1 ]. The audit report from the Ministry of Health on the management of referral and emergency healthcare services at zonal and regional referral hospitals showed a high OPD waiting time. Previous studies suggest that the average waiting time at Muhimbili National Hospital OPD was 4 – 6 h; Muloganzila Zonal Referral Hospital was 3 – 4 h; Bugando Medical Centre was 2.5 h; Mbeya Zonal Hospital was 3 – 4 h; and Kilimanjaro Christian Medical Centre (KCMC) was 6 h [ 1 , 2 ]. According to these data, KCMC has the longest waiting time of any zonal or national referral hospital in Tanzania. In response to the long waiting time, KCMC implemented a series of interventions that were incorporated into the strategic plan from 2016 to 2021. The interventions included the use of a block appointment system, the transition from paper to electronic medical records (EMRs), the extension of clinic days and the acquisition of modern diagnostic equipment.

Effective scheduling is crucial to minimize patient waiting times. Appointment systems should include rules for setting appointments and sequencing patients' arrivals, aligning them with doctors' schedules. Studies have shown that optimizing block appointment scheduling can significantly reduce patient waiting times without increasing physician idle time [ 3 , 4 , 5 ]. Effective appointment scheduling has been shown to significantly reduce patient waiting time in outpatient facilities. A study conducted in the USA demonstrated that planning appointment slots can decrease waiting time by as much as 56%. This evidence suggests that optimizing block appointment scheduling is a viable strategy to enhance outpatient efficiency [ 6 ]. A study in Sri Lanka demonstrated that implementing a well-structured appointment scheduling system could reduce total patient waiting time by over 60%. Therefore, adopting a block appointment system allows for more efficient allocation of resources and scheduling, ultimately enhancing the overall patient experience and optimizing healthcare delivery [ 7 , 8 ]. A study in Mozambique introduced a block appointment scheduling system to evaluate its impact on waiting time. The findings revealed a reduction in waiting time of 1 h and 40 min (100 min). The study concluded that by introducing block appointment scheduling, patient arrivals were distributed more evenly throughout the day, resulting in reduced waiting times [ 9 ].

The implementation of electronic medical records (EMRs) has been shown to offer significant advantages in healthcare delivery, particularly in less developed nations. Evidence indicates that EMRs can decrease patient waiting time, lower hospital operating costs, improve communication between departments and enable doctors to share best practices. Unlike paper-based records, EMRs provide greater flexibility and leverage, enhancing overall healthcare efficiency [ 10 ]. Long waiting times in OPDs are often exacerbated by inefficiencies in managing patient records. A tertiary medical college hospital in Mangalore, Karnataka, evaluated patient waiting time and identified disorganized manual files as a primary cause of delays. These findings underscore the disadvantages of paper-based records and suggest that implementing electronic medical records (EMRs) can greatly enhance efficiency [ 11 ]. Reducing outpatient waiting times is a critical challenge for healthcare systems. Evidence from a study in Korea demonstrated that implementing EMRs can significantly reduce waiting time by nearly 60% and enhance operational efficiency [ 12 ]. Addressing long waiting time in the OPD is essential for enhancing patient satisfaction and healthcare efficiency. A systematic survey study aimed at utilizing various models to shorten OPD waiting time found that healthcare providers significantly favored electronic medical records (EMRs) over manual records. The primary reasons cited were significant time savings and a consequent reduction in long waiting time [ 13 ].

The issue of long waiting time in outpatient departments (OPDs) is a prevalent problem faced by healthcare facilities worldwide. A study conducted in Brazil applied Lean thinking and an action research strategy to address patient flow issues and identify the causes of prolonged waiting time at the OPD. The study's findings highlighted that many hospitals globally are tackling this issue by investing in electronic medical records (EMRs) to transition away from manual medical records. This evidence suggests that implementing technical strategies, such as EMRs, can significantly improve patient flow and reduce waiting times [ 14 ].

Extending clinic days throughout the week has been found to be more effective in reducing waiting times than extending clinic hours. Studies have demonstrated substantial reductions in patient waiting times and increased patient satisfaction following the extension of clinic days. A study in Canada found that extending clinic days was more effective in reducing waiting time than extending clinic hours: extending clinic days resulted in a 26% reduction in average waiting time, whereas extending clinic hours led to a 16% reduction. This research provides valuable insights for healthcare administrators seeking to optimize clinic operations and enhance patient experience [ 15 ]. At a tertiary care hospital in Oman, the findings revealed a substantial 56% reduction in patient waiting time following the extension of clinic days. Additionally, patient feedback indicated a high level of satisfaction with the extended clinic days, with 97% of patients reporting satisfaction with the service [ 16 ]. Extending clinic days throughout the week has also demonstrated promising results in a study conducted at a tertiary care hospital in India. The findings revealed a noteworthy 46% reduction in average patient waiting time following the extension of clinic days. This substantial decrease underscores the effectiveness of extending clinic days in streamlining patient flow and improving efficiency. Consequently, these results provide compelling evidence supporting the rationale for extending clinic days throughout the week as a viable intervention to alleviate patient waiting times and enhance overall healthcare service delivery [ 17 ].

Utilizing modern equipment in healthcare settings has shown significant potential in reducing patient waiting times. A study conducted at a tertiary care hospital in Italy evaluated the effect of modern equipment on patient waiting time. The findings indicated a notable reduction, with an average decrease of 14 min per patient following the introduction of modern equipment. These results suggest that integrating modern equipment can be a highly effective intervention for improving operational efficiency and reducing patient waiting time [ 18 ]. Modern equipment can be instrumental in reducing patient waiting times. A study at a tertiary care hospital in Pakistan revealed that one of the primary causes of prolonged waiting time was the lack of adequate examination equipment. By addressing the equipment deficiencies highlighted in the study, healthcare providers can significantly reduce waiting times, thereby improving patient satisfaction and overall efficiency. Therefore, investing in modern equipment is justified as a strategic intervention to enhance patient flow and optimize healthcare service delivery [ 19 , 20 , 21 , 22 ]. Modern equipment is essential for reducing patient waiting times in healthcare facilities. An audit assessment conducted in zonal hospitals in Tanzania by the Ministry of Health revealed that outdated equipment, such as x-ray machines, significantly contributed to long waiting times. The limited capacity of these machines meant that only a certain number of patients could be attended to each day, and the equipment required rest periods to avoid overheating. These findings underscore the necessity of updating and maintaining modern medical equipment to improve patient throughput and reduce waiting times [ 2 ].

In Tanzania, the Ministry of Health has not established a gold-standard waiting time for patients receiving services at the OPD [ 2 ]. However, the United States Institute of Medicine (IOM) has established a gold-standard OPD patient waiting time, which suggests that medical care should be provided to at least 90% of patients no later than 30 min after their scheduled appointment time [ 23 , 24 ]. The UK Patient's Charter has recommended the same standard as the IOM [ 25 ]. The absence of a gold-standard waiting time carries several significant implications. It results in inconsistent patient experiences with unpredictable waiting times across facilities, leading to frustration and dissatisfaction. Prolonged and varied waiting times can compromise the quality of care, affecting patient outcomes. Inefficient resource allocation becomes a challenge, hampering the ability to determine staffing and infrastructure needs [ 26 ]. The lack of a benchmark also reduces accountability, and healthcare providers may not be incentivized to improve waiting time. It adversely affects patient satisfaction and the reputation of healthcare providers, and can exacerbate healthcare disparities [ 19 ]. Hence, the findings from this research will provide valuable insights to the hospital management, enabling them to reinforce substantial improvements in patient waiting time and target areas where progress has been limited within the OPD at KCMC.

The objective of this study was to assess patient waiting time at KCMC after the interventions. The specific objectives were to determine OPD patient waiting time since implementation of the interventions began and to assess the effect of the technical strategies on patient waiting time.

Design and methods

The study was conducted at Kilimanjaro Christian Medical Centre (KCMC) Outpatient department. KCMC is located in the foothills of the snow-capped Mount Kilimanjaro. It is one among the six zonal consultant hospitals in Tanzania. It was established in 1971 as a Zonal Referral Consultant hospital owned by the Evangelical Lutheran Church of Tanzania (ELCT) under the Good Samaritan Foundation (GSF). The referral hospital was established in order to serve the northern, eastern and central zone of Tanzania. Its record in medical services, research, and education has significant influence in Tanzania, East Africa and beyond. It serves a potential catchment population of 15 million people with 630 official bed capacity. The hospital has a number of clinical departments namely, General Surgery, Orthopaedic and Trauma, Dental, Dermatology, Paediatric, Eye, Otorhinolaryngology, Obstetric and Gynaecological and Internal Medicine. There are 1300 staff seeing about 1200 outpatients and 800 inpatients. The hospital has 100 specialists, 52 medical doctors, 465 nurses and the remaining 643 are paramedical and supporting staff. This area was chosen because the outpatient department at KCMC sees a high volume of patients on a regular basis from diverse backgrounds, including rural and urban populations of Tanzania as well as neighbouring countries. For instance in the year 2022, a total of 301,091 patients attended KCMC hospital, of which 92% ( n  = 277,013) attended the OPD. This high patient volume made it a suitable location for studying patient waiting time.

Study design

This was an outcome evaluation in which an analytical cross-sectional design was used to examine the subject matter. The study employed a mixed-methods explanatory sequential approach.

Population and sampling

The study surveyed 412 patients quantitatively and conducted qualitative interviews with 12 patients and 12 healthcare providers. Patients who took part in the quantitative survey were not included in the qualitative sample. The quantitative sample size was obtained using the following formula [ 27 ]:

n = Z²P(1 − P)/d²

where:

n = sample size;

Z = the standard normal deviate, which is 1.96 for a 95% confidence interval;

P = the estimated proportion of patients attending the OPD at KCMC, set at 0.5 in the absence of prior research data;

d = the margin of error, which is 5% (0.05).

Therefore, the minimum sample size for this study was 384 patients, adjusted to approximately 422 to allow for a 10 percent non-response rate.
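As a quick arithmetic check of the formula, the snippet below reproduces the figures reported above (Z = 1.96, P = 0.5, d = 0.05, plus a 10 percent non-response allowance); the rounding follows the values given in the text.

```python
import math

Z, P, d = 1.96, 0.5, 0.05
n_min = (Z ** 2) * P * (1 - P) / d ** 2      # Cochran's formula for a single proportion = 384.16
n_reported = math.floor(n_min)               # 384, as reported in the text
n_adjusted = math.floor(n_reported * 1.10)   # 422 after adding 10% for non-response

print(n_reported, n_adjusted)                # 384 422
```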

Quantitative sampling

The systematic sampling process was designed to select 412 patients for interviews over 10 working days, with a minimum daily arrival of 500 patients. The daily interview target was calculated by dividing the total number of patients (412) by the number of days (10), giving an average of 41.2 interviews per day.

The systematic sampling process began with setting up a consent desk and queue number system. Patients were informed about the survey, and consent was obtained. Each patient was assigned a unique queue number upon arrival.

To determine the sampling interval, the total daily patients (500) were divided by the daily interview targets (41 or 42 patients). This resulted in a sampling interval of approximately 12. A random starting point between 1 and 12 was selected, and from this point, every 12th patient was chosen for the interview.

For the daily interview allocation, 42 patients were interviewed on the first 5 days, and 41 patients were interviewed on the remaining 5 days. This method ensured an even distribution of interviews and a representative sample for the survey.
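The selection logic described above can be summarised in a few lines. This is only an illustrative sketch, not the tool actually used during data collection.

```python
# Systematic sampling: roughly 500 arrivals a day, a target of about 42 interviews.
import random

daily_arrivals = 500
daily_target = 42
interval = round(daily_arrivals / daily_target)    # sampling interval of about 12

random.seed(2023)                                  # fixed seed so the example is reproducible
start = random.randint(1, interval)                # random starting queue number between 1 and the interval
selected_queue_numbers = list(range(start, daily_arrivals + 1, interval))

print(f"Interval {interval}, start at queue number {start}, "
      f"{len(selected_queue_numbers)} patients selected")
```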

Qualitative sample size

This study adopted a sample size of 12 respondents per group for the qualitative data collection, because it has been suggested that in practical research, data saturation in a relatively homogeneous population can be achieved with this sample size [ 28 ]. Therefore, twelve (12) healthcare providers at the OPD and twelve (12) patients were selected, making a total sample of 24 for the qualitative study.

Qualitative sampling

To select 12 healthcare providers purposive sampling was employed. We targeted specific roles to ensure a comprehensive representation of the outpatient department: doctors, nurses, management, cashiers, and medical records personnel. The selection included 3 doctors, 3 nurses, 2 management personnel, 2 cashiers, and 2 medical records personnel. Doctors were chosen based on their direct patient interaction and diverse specializations within outpatient care. Nurses were selected to represent varying levels of experience, from junior to senior roles. Management personnel were chosen for their administrative and operational oversight responsibilities. Cashiers who handle patient transactions and medical records personnel involved in managing patient records were also included. This purposive sampling strategy aimed to capture a holistic view of the outpatient department's operations and challenges, providing valuable insights for the study. Also to select 12 patients we used convenience sampling. We chose individuals based on their accessibility and willingness to participate at the outpatient department. This approach involved approaching patients who were readily available and consented to participate in the study. The sampling process took place over several days, with researchers stationed in the waiting area to identify potential participants. Patients were approached in a systematic manner, ensuring a mix of different ages, genders, and medical conditions to achieve a varied sample. Each patient was briefly informed about the study's purpose and asked for their consent to participate. Those who agreed were included in the sample until the target of 12 patients was reached. This method was chosen for its practicality and ease of implementation, allowing researchers to quickly gather insights from a diverse group of patients without the need for complex selection criteria.

Inclusion criteria

The study focused on patients aged 18 and older who attended the OPD during the data collection period.

Exclusion criteria

Patients below 18 years or who were severely ill or had scheduled admission appointments were excluded, as well as first time attendees (new patients) because they lacked prior experience with the implemented interventions.

Data collection tools and procedures

The researcher developed a structured questionnaire as the data collection tool. The tool covered socio-demographic characteristics, including age, gender, marital status, education level, occupation, place of address, mode of payment and year of attendance at KCMC. The measurement scale for the technical strategies was ordinal, based on fourteen (14) Likert-scale questions with response options of 1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree and 5 = strongly agree, allowing patients to indicate their level of agreement or disagreement with statements related to the technical strategies. Additionally, strongly disagree and disagree were consolidated as disagree, and neutral, agree and strongly agree were consolidated as agree, following the approach used in a previous study [ 29 ]. The internal reliability of the fourteen items used to assess the effectiveness of technical strategies on reducing patient waiting time was measured using Cronbach’s alpha, which was found to be 0.940. The survey included questions on arrival time, time the queue number was issued, registration waiting time, payment waiting time, triage waiting time, waiting time to see the doctor, pharmacy waiting time, laboratory waiting time, radiology waiting time and exit time. These data were collected from patients who attended clinics such as the general OPD, orthopedic, medical, surgical, urology, ear nose and throat, diabetic, cardiac, neurology and neurosurgery clinics. Waiting time was measured with a stopwatch.
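For readers unfamiliar with the internal-consistency statistic quoted above, the sketch below shows one common way to compute Cronbach's alpha from a respondents-by-items matrix. The response matrix here is simulated, so its alpha will not equal the study's 0.940.

```python
# Cronbach's alpha for a (respondents x items) matrix of Likert scores.
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    k = responses.shape[1]                          # number of items
    item_vars = responses.var(axis=0, ddof=1)       # variance of each item
    total_var = responses.sum(axis=1).var(ddof=1)   # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
base = rng.integers(1, 6, size=(50, 1))             # shared component makes items correlate
noise = rng.integers(-1, 2, size=(50, 14))          # small item-level noise
responses = np.clip(base + noise, 1, 5)             # 50 respondents x 14 items, scored 1-5

print(f"Cronbach's alpha = {cronbach_alpha(responses):.3f}")
```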

Semi-structured guides for conducting in-depth interviews with patients and healthcare providers were developed. The interview guide had questions on socio-demographic and technical strategies such as the new block appointment system, use of EMR, extension of clinic days and availability of modern diagnostic equipment.

Also, the researcher conducted a documentary review, analyzing written records detailing time allocation before the studied event. This approach offered insights into past practices, aiding pattern and trend analysis. It involved reviewing benchmarks like a six-hour average waiting time, median waiting time for specific clinics, and total treatment duration for patients in various clinics. The six-hour benchmark was derived from the Ministry of Health's assessment report on OPD waiting time at KCMC and patients' information was not matched or linked to this report. Therefore, we considered the six-hour mark as our reference point. The data collection was conducted for two consecutive weeks, from 3rd July to 14th July 2023.

Data analysis

Quantitative data

The data collected were imported into STATA (version 18.0) for further analysis. Descriptive statistics: the analysis began with the presentation of data using various methods, including figures, graphs, and frequency distributions. The effect of each response was rated on a scale of 1 to 5. Subsequently, cut-off points were used for each area to categorize the effectiveness of each intervention strategy as follows: 1–1.8 (very low), 1.8–2.6 (low), 2.6–3.4 (medium), 3.4–4.2 (high), and 4.2–5 (very high) [ 30 ]. Also, in this study, efficacy was determined by calculating the percentage reduction in OPD waiting time achieved through the implementation of the intervention strategies, with the current overall OPD waiting time (as shown in Table 4) used as the numerator and the 6-h benchmark as the denominator [ 2 ].

The study defined the dependent variable as follows: overall patient waiting time, which was captured using a stopwatch, was categorized as a binary dummy variable. A value of 1 represented OPD waiting time less than 3 h, while a value of 0 indicated OPD waiting time exceeding 3 h. Comparison with Standards: The analysis involved evaluating OPD waiting time against established benchmarks. This included comparing the waiting time with the standards outlined in the Patients Charter of the United Kingdom (UK) and the recommendations from the United States Institute of Medicine (IOM), which advocate that at least 90% of patients should receive medical care within 30 min of their scheduled appointment time. Additionally, the study compared the observed 6-h waiting time, set as outpatient waiting time at KCMC Zonal Hospital, to assess whether there was any reduction post-intervention. Statistical Tests: To explore potential associations between dependent and independent variables, statistical tests were employed. Logistic regression analysis, encompassing both bivariate and multivariate analyses, was conducted. The multivariable analysis included all variables with p  < 0.200 as identified during the bivariable analysis. It was further adjusted for sex, level of education, and mode of payment. All statistical analyses were conducted at a significance level of 0.05. These analytical steps were taken to provide a comprehensive assessment of the effect of the intervention on patient waiting time.
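The study's models were fitted in STATA. Purely to illustrate the kind of analysis described (a binary outcome of waiting under three hours, with odds ratios from logistic regression), here is a hedged Python sketch on synthetic data; the predictor, coefficients and data are invented and do not reproduce the study's results.

```python
# Logistic regression on a synthetic binary outcome: waited under 3 hours (1) or not (0).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 400
emr_used = rng.integers(0, 2, size=n)              # hypothetical exposure: processed via EMR or not
log_odds = -0.5 + 0.7 * emr_used                   # assumed true relationship for the simulation
waited_under_3h = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

X = sm.add_constant(emr_used.astype(float))        # intercept plus the single predictor
model = sm.Logit(waited_under_3h, X).fit(disp=False)

odds_ratios = np.exp(model.params)                 # exponentiate coefficients to obtain odds ratios
print(model.summary())
print("Odds ratios (const, emr_used):", np.round(odds_ratios, 2))
```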

Qualitative data

All interview transcripts were transcribed verbatim and translated into English. In order to maintain the original meaning, back-translation was employed, and the analysis was done using the English transcripts. Thematic analysis was employed, using both deductive and inductive reasoning. A preliminary codebook, aligned with the study objectives, was developed, and the final codebook was imported into the Atlas.ti 7.0 qualitative data analysis software. Inductive codes were assigned to text segments capturing new themes that emerged and were not pre-determined. The codes were sorted into categories, which were then clustered into sub-themes and aligned into themes; the entire process of analysis was iterative. To ensure rigor, validity, and the mitigation of bias in the qualitative component, attention was paid to credibility, transferability, dependability, and confirmability, thereby enhancing trustworthiness [ 31 , 32 ]. In this study, credibility ensured that the data accurately reflected the real experiences and perceptions of those involved in the waiting process, allowing for subsequent decision-making. Transferability sought to make the findings relevant and applicable to healthcare settings beyond the specific study setting, so that solutions can be adapted and implemented effectively in different contexts. Dependability ensured that the methods used were consistent and reliable over time, enabling replication of the study's results. Confirmability ensured that the strategies for reducing waiting time were grounded in the data collected rather than influenced by the researchers' biases, thereby increasing the objectivity and validity of the research findings.

Ethical clearance

The Clearance Committee of the Mzumbe University Directorates of Research, Publication and Postgraduate Studies provided ethical clearance with reference number MU/DPGS/INT/38/Vol. IV/236. Subsequently, the proposal was submitted for evaluation to the College Research Ethics and Review Committee (CRERC) at Kilimanjaro Christian Medical University College, Moshi. The CRERC granted approval, as indicated by certificate number 2639. Additionally, the data collection procedure received endorsement from the directors of KCMC Hospital, reference number KCMC/P.1/Vol. XII. Prior to data collection, participants provided written informed consent. To ensure respondents' autonomy, patients were fully informed about the purpose and nature of the study and given the option to withdraw at any time without any impact on their medical care. Patients were questioned only after completing their medical care, and interviews were conducted in a private office within the OPD premises.

In this study, the initial calculated sample size was 422 patients. However, out of this group, only 412 patients consented to participate and completed the questionnaire. This resulted in a response rate of 97.6%. The median age was 52 (IQR, 38–65), with the majority aged over sixty. Over half were female (53.6%, n  = 221), and the majority were married (76%, n  = 313). Most had a basic education, including primary (44.7%, n  = 184) and secondary education (26.7%, n  = 110). More than half were peasant farmers (52.4%, n  = 218), and the vast majority (94.7%, n  = 338) resided within the KCMC catchment area. The majority were insurance patients (82.0%, n  = 338), and more than two-thirds (66.5%, n  = 274) had attended KCMC before the intervention's inception (Table  1 ).

Demographic characteristics in the qualitative sample for healthcare providers

A total of 12 healthcare providers were enrolled of whom half were male (50%, n  = 6) and half (50%, n  = 6) were female (Table  2 ).

Demographic characteristics in the qualitative sample for patients

A total of 12 patients were enrolled of whom half were male (50%, n  = 6) and half (50%, n  = 6) were female (Table  3 ).

Sub-themes from the in-depth interviews

During the IDIs, the sub-themes that emerged were: ownership, training, organizational culture, ineffective follow-up, effective follow-up and enhanced process simplification (Table  4 ).

OPD waiting time since the inception of implementation of the interventions

Following the intervention, the overall median waiting time in the OPD was 3.30 h (IQR 2.51–4.08), a reduction of 2.30 h from the pre-intervention level.

The median waiting time for registration was 9 min (IQR 0.03–0.15). For payment, the median waiting time was 10 min (IQR 0.07–0.15). At triage, patients using out-of-pocket payments experienced a median waiting time of 17 min (IQR 0.05–0.19), while those with insurance had a median waiting time of 14 min (IQR 0.06–0.19), and the median waiting time to see a doctor was 1.36 h (IQR 0.51–2.01). The time from arrival to actually seeing a doctor was 3.08 h (IQR 2.13–3.30). Furthermore, the median consultation time was 19 min (IQR 0.15–0.24), waiting time at the pharmacy was 4 min (IQR 0.02–0.06), and at the laboratory it was 31 min (IQR 0.20–0.37). Waiting time at radiology varied based on the specific service: X-ray services in different rooms had average waiting times ranging from 35 min to 1.15 h with varying IQRs (0.23–2.19), while ultrasound services had a median waiting time of 32 min (Table  5 ).
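
As a rough illustration of how such stage-level summaries can be derived from the stopwatch timestamps described in the methods, the sketch below computes a median and IQR for one stage. The column names and timestamps are hypothetical and are not taken from the study data.

    import pandas as pd

    def summarise_stage(minutes: pd.Series) -> str:
        q1, median, q3 = minutes.quantile([0.25, 0.5, 0.75])
        return f"median {median:.0f} min (IQR {q1:.0f}-{q3:.0f})"

    # Hypothetical timestamps recorded with the stopwatch protocol
    df = pd.DataFrame({
        "arrival": pd.to_datetime(["2023-07-03 07:55", "2023-07-03 08:10", "2023-07-03 08:30"]),
        "registered": pd.to_datetime(["2023-07-03 08:04", "2023-07-03 08:21", "2023-07-03 08:38"]),
    })
    registration_wait = (df["registered"] - df["arrival"]).dt.total_seconds() / 60
    print(summarise_stage(registration_wait))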

Qualitative findings

Registration (medical records department).

The adoption of electronic medical records (EMRs) appears to have enhanced the overall efficiency of the KCMC OPD registration process, benefiting both patients and staff.

"I have been receiving treatment here at KCMC for over 20 years. In the past, in the medical records department, it was necessary to have someone, a staff member, whom you would contact in advance, preferably three days before your clinic day, so that they could start looking for your file. This way, you could save time waiting. However, nowadays, this process is no longer in place. When I arrive, I simply present my card, and in no time, I'm on my way to the next area. There's no longer any time wasted at the reception." (IDI – Male Patient, aged 67 years)

Another interviewee added that:

"Nowadays, with the system in place, the process is streamlined, allowing me to efficiently register as many patients as possible in a short amount of time. I no longer have to leave the reception area to search for files, which has significantly improved the efficiency of the registration process." (IDI – Male healthcare provider (HCP), aged 45 years)

Waiting time to see the doctor

The issue of waiting time for patients to see the doctor has emerged as a significant concern within the healthcare facility. This concern is consistently echoed in both the quantitative data and qualitative interviews.

For example, one HCP aged 40 years reported:

" […] commencing clinics promptly can be challenging for doctors, as it is crucial for them to first participate in the morning report, which provides essential updates on the status of hospitalized patients." (IDI – male HCP, aged 40 years).

After probing as to why the medical staff could not split into two teams of doctors, so that one team could attend to outpatients, the response was as follows:

"We have a limited number of doctors, making it challenging to divide them into two groups. Moreover, admitted patients demand our additional attention, as some rely on oxygen for breathing, while others are too ill to walk. Unlike outpatients, the majority of whom can independently come for treatment, we kindly request their understanding as we prioritize the care of our admitted patients." (IDI – male HCP, aged 40 years).

A female patient aged 53 years made the following observations:

“[….] Mmh! I want to highlight that delay in seeing the doctor can have serious consequences. It can lead to a worsening of symptoms or conditions, increase stress levels, and ultimately result in reduced satisfaction with the healthcare service. It's imperative that we address these extended waiting times. This is crucial not just for the comfort of the patient, but also to ensure that medical care is administered in a timely and effective manner.” (IDI – female patient, aged 53 years).

In the pharmacy department, there has been a notable improvement in waiting time. Patients now experience a comfortable and efficient process, with minimal time spent before receiving their prescribed medications.

"With the use of a computerized system, things have been greatly simplified. The waiting time to collect medicine has become short. When I come here, I wait for just a little while and quickly get my medicine." (IDI – Male patient, aged 45 years).
“Apart from using the computerized system in place, which has simplified things, the hospital administration has managed to establish three additional pharmacies apart from this one, thus reducing congestion in a single pharmacy, as it used to be in the past. That's why now a patient can be served quickly.” (IDI – male HCP, aged 50).

Laboratory department

In the laboratory department, the waiting time has been a subject of varying experiences among patients. Some patients have reported relatively short waiting periods, while others have encountered longer waits.

“I have been patiently waiting for a long time to be called for my tests, I’ve not yet been called up to now.” (IDI – female patient, aged 43 years).

Another interviewee shared that:

"I've noticed that one of the main reasons for long waiting time at the laboratory here is the limited space. The laboratory rooms at the Outpatient Department (OPD) have remained the same since the hospital was established, which means they can only accommodate a small number of patients at a time. This often leads to a backlog of patients waiting to get their tests done. It's clear that expanding the laboratory facilities is crucial to reduce these extended waiting time and ensure more efficient service delivery for everyone” (IDI – male HCP, aged 55 years).

Radiology department

Despite having modern diagnostic equipment, which appears to have significantly contributed to reducing patient waiting time, there are still instances where patients experience long waiting time in the radiology department.

"For me, even though waiting for an X-ray may take some time, I don't mind the wait. I've noticed a significant improvement in waiting time compared to before. In addition nowadays, when I have an X-ray, I can also consult with my doctor on the same day, which wasn't possible in the past” (IDI – male patient, aged 40 years).

One interviewee highlighted a crucial factor contributing to the extended waiting time at the radiology department and pointed out that:

“The same rooms at the radiology department are utilized for both outpatient and inpatient cases. As a result, priority is often given to the admitted patients, leading to longer waiting time for those seeking outpatient radiology services. This dual-use of facilities poses a challenge in managing patient flow and significantly contributes to the observed delays in the radiology department”. (IDI – female HCP, aged 49 years)

Patient OPD waiting time with six (6) and three (3) hour thresholds

Not a single patient managed to complete treatment within the recommended 30-min window following their scheduled appointment. When assessed against the KCMC benchmark of a 6-h timeframe, the vast majority of patients (98.3%, n = 407; 95% CI 97.0%–99.5%) received OPD services in less than six hours. However, when the threshold was reduced to three hours, 31% (n = 128; 95% CI 26.6%–35.6%) of all surveyed patients received OPD services in less than three hours (Fig.  1 ).

Figure 1. Patient OPD waiting time with six (6) and three (3) hour thresholds (n = 412)
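
The confidence intervals quoted for these proportions are consistent with a normal-approximation (Wald) interval, as the short sketch below illustrates for the three-hour threshold; this is offered only as a plausible reconstruction, not as the authors' actual calculation.

    from math import sqrt

    def wald_ci(successes: int, n: int, z: float = 1.96):
        # 95% normal-approximation confidence interval for a proportion
        p = successes / n
        half_width = z * sqrt(p * (1 - p) / n)
        return p - half_width, p + half_width

    lo, hi = wald_ci(128, 412)  # patients served in under three hours
    print(f"31.1% (95% CI {lo:.1%}-{hi:.1%})")  # roughly 26.6%-35.5%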

Furthermore, during the in-depth interviews (IDIs), patients emphasized that they had received OPD services within a timeframe of under three hours.

For instance, a 58-year-old female patient remarked:

“ Certainly, drawing from my extensive experience of over 15 years attending KCMC hospital, I can attest to the positive changes in the waiting time for OPD services. Patients, including myself, are genuinely appreciative of this effective reduction in waiting time. I personally find it remarkable that I can now complete all the necessary OPD services in just about three hours, which is a stark contrast to the longer waiting periods we used to endure. This improvement has undoubtedly enhanced the overall patient experience and contributes positively to our healthcare journey”. (IDI – female patient, aged 58 years)

Effect of technical strategies on patient waiting time

Descriptive statistics of the technical strategies.

The study assessed the effectiveness of various technical strategies on reducing patient waiting time, categorized into four domains: block appointment, implementation of electronic medical records (EMR), extension of clinic days throughout the week, and utilization of modern diagnostic tools. The self-reported data were analyzed using mean scores and standard deviations, measured on a Likert scale ranging from 1 (strongly disagree) to 5 (strongly agree). The effectiveness of strategies in reducing patient waiting time was categorized as follows: very low (1–1.8), low (1.8–2.6), medium (2.6–3.4), high (3.4–4.2), and very high (4.2–5).

Overall, the average effectiveness of the technical strategies in reducing patient waiting time was found to be very high, with a mean score of 4.27 (SD = 0.904). Specifically, the new block appointment system obtained a mean score of 4.36 (SD = 0.856), a descriptive equivalent of "very high". Additionally, the introduction of hourly appointments demonstrated positive effects, with a mean score of 4.18 (SD = 1.024), a descriptive equivalent of "high". The transition from paper-based to electronic medical records was also effective, obtaining a mean score of 4.09 (SD = 1.033), a descriptive equivalent of "high". Moreover, the extension of clinic days obtained a mean score of 4.31 (SD = 0.832), a descriptive equivalent of "very high". Finally, the availability of modern diagnostic services achieved a mean score of 4.30 (SD = 0.861), a descriptive equivalent of "very high" (Table  6 ).

Bivariable analysis of technical strategies and patient waiting time

Bivariable regression analysis established a significant association between patient waiting time and both the new block appointment system (OR 3.34; CI 1.28–8.77; p = 0.014) and the hourly appointment system (OR 2.49; CI 1.01–6.13; p = 0.047) (Table  7 ).

Multivariable analysis between technical strategies and patient waiting time

Multivariable logistic regression was employed to determine which technical strategies played a significant role in reducing patient waiting time. The adjusted odds ratios indicate an association between the reduction of patient waiting time and migrating from paper-based to electronic medical records; electronic medical records remained a significant factor for patient waiting time (AOR = 2.08, 95% CI 1.10–3.94, p = 0.025). The new block appointment system showed a higher point estimate for a positive effect on reducing waiting time, although the finding was not statistically significant (AOR = 2.49; 95% CI 0.68–9.10, p = 0.168) (Table  8 ).

Qualitative findings with regards to technical strategies

  • Block appointment

Based on the findings from in-depth interviews, both patients and healthcare providers expressed varying opinions on the block appointment system.

One female patient aged 62 years said that:

"Over the years, I have become accustomed to coming in the morning. I can't come at other time besides the morning; it would disrupt my plans." (IDI, female patient, aged 62 years).

A male healthcare provider shared his experience:

“ The truth is, we haven't been very successful in using block appointments. We tried it on the first and second days, but things went back to how they were before. The problem is, patients arrive very early in the morning, and you find them all crowded, waiting for service. Once a patient arrives, they must be attended to. We've realized that this block appointment system requires the whole team to be involved, from medical records (reception) to doctors, nurses, and the patients themselves” (IDI – male HCP, aged 51 years).

However, it's important to note that amidst these negative perspectives, several interviewees also acknowledged the positive effect of the system.

“ Since the introduction of the appointment system in 2020, we've observed a significant reduction in patient waiting time, which has led to quicker and more efficient service delivery for patients. When a patient arrives, the waiting area is usually less crowded. Furthermore, doctors now have more spaced-out appointments, allowing them to devote ample time to each patient.” (IDI – female HCP, aged 47 years).
  • Electronic medical records

The quantitative finding regarding migrating from paper based to electronic medical records aligns with our qualitative findings.

"From my experience, dealing with physical files presented its own set of challenges. There was a lengthy process, and files were prone to being misplaced, including important test results. Sometimes, files would be delayed in reaching the clinic. This was particularly problematic for patients who arrived early; if their files couldn't be located promptly, it would cause a delay. However, with the new system in place, everything operates swiftly and efficiently. The system has truly revolutionized the process” (IDI – female HCP, aged 55 years)

A patient shared his experience:

“When I come for treatment nowadays, I no longer experience the frustration of my test results going missing or my file being unavailable." (IDI – male patient, aged 50 years)

Extension of clinic days

The implementation of the daily clinic schedule has yielded mixed results.

One interviewee stated that:

"In our department, the limited number of staff has posed a challenge. Conducting daily clinics becomes demanding, as the same doctor is tasked with conducting ward rounds, making decisions for admitted patients, and performing surgery. However, once we have an adequate staff complement, we can begin seeing patients on a daily basis." (IDI – male HCP, aged 42 years)

On the contrary, the extension of clinic days was also described as a highly beneficial strategy, serving as one of the key measures to address patient waiting time at the facility.

“It has significantly reduced the patient waiting time. In the past, clinics used to run until 6 pm in the evening. Since they implemented the daily clinic schedule, patients are now seen earlier, and the clinics end earlier. This is because patients have been scheduled throughout the week”. (IDI – female HCP, aged 48 years)

Another interviewee offered a more qualified view:

“It has helped in limiting the number of patients flocking to a single clinic, but it doesn't necessarily reduce patient waiting time." (IDI – male HCP, aged 51 years)

One patient shared his experience:

"These days, I finish my treatments earlier than I used to.” (IDI – male patient, aged 49 years)

Availability of modern diagnostic equipment

The integration of modern diagnostic equipment stands as a substantial contributor to the reduction of patient waiting time. This positive trend is supported by both our quantitative and qualitative findings, affirming the significance of having advanced diagnostic tools readily accessible within our healthcare facility.

" Nowadays, the procedure has become significantly more simplified. You just need to consult the system to retrieve the patient's results. When you open it, you can readily peruse the information, making the process more efficient. If I require additional specifics about the condition, it's easy to locate them in the patient's file. I simply access it in the system, and their image is readily available, leading to a substantial time-saving." (IDI – male HCP, aged 40 years)

A male patient aged 45 years shared a similar experience:

"I now do my investigations on the same day and return to the doctor for my results. This contrasts with the past when I needed to be scheduled for a different day to pick up the results. This has resulted in a considerable time-saving." (IDI – male patient, aged 45 years)

OPD waiting time

Following the intervention, the overall median waiting time in the OPD was reduced to 3.30 h, in contrast to the six-hour (6-h) waiting time prior to the intervention, a reduction of 45% and an indication of the effectiveness of the intervention. This improvement is substantial and suggests that the interventions have had a positive effect.
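
In concrete terms, and treating the reported 3.30 h as 3.3 decimal hours, the reduction relative to the 6-h baseline works out as (6.0 − 3.3) / 6.0 × 100% ≈ 45%, which is the figure quoted above.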

These findings align with other research that involved adding more human resources and changing business and management practices, which demonstrated significant success in reducing wait times in the USA, China, Sri Lanka and Taiwan by 15%, 78%, 60%, and 50%, respectively [ 33 ].

The study at KCMC found a low median registration waiting time of 9 min. This is not congruent with findings in China and Saudi Arabia, where registration times were notably higher [ 24 , 34 ]. In Ethiopia, waiting times varied, with some patients waiting over an hour [ 35 ], while another study reported a median wait of 18 min [ 36 ]. In Kenya, registration waiting times were even shorter, at 5.8 min [ 37 ]. These discrepancies could be explained by variations in patient flow management or in data collection techniques. Overall, the study shows that the waiting time for registration has decreased substantially at KCMC, demonstrating the efficiency of the technical strategies that have been put in place to cut down on waiting time.

In terms of payment processing, the median waiting time was 10 min. Although the majority of patients were insured, the mode of payment had no significant association with waiting time, suggesting that insured patients were handled just as quickly as patients paying with cash. These results are broadly congruent with a study conducted in a tertiary care hospital in Pune, India, where patients spent an average of 7 min at the cashier [ 26 ]. This shared emphasis on streamlined payment processes underscores their significance in enhancing the patient experience and reinforces the importance of efficient payment processing in healthcare settings.

At the triage area, patients paying cash had a median waiting time of 17 min, while insured patients experienced a slightly shorter median waiting time of 14 min. These results are congruent with a study at a hospital in Northeast Thailand, which found that insured patients had an average triage waiting time of 13 min [ 38 ]. The consistency between these studies suggests that insurance status may play a role in patient waiting time, with insured patients benefiting from somewhat more efficient service and a well streamlined patient flow. However, it is important to note that regional contexts may influence waiting time, and these results may vary in different healthcare settings and countries.

The median waiting time from arrival to consultation with a doctor was 3.08 h. These results resonate with research from Nigeria, where 38% of respondents waited over 2 h for a consultation [ 39 ], and with another study that found an average waiting time of 137.02 ± 53.64 min before seeing a doctor [ 40 ]. In contrast, some studies reported shorter waiting times, such as 40 min in India [ 26 ], while over 90% of patients waited more than 20 min in Saudi Arabia [ 24 ] and more than half of patients waited over 60 min in Ethiopia [ 35 ]. Involvement of doctors in teaching students, long ward rounds, staff constraints and prioritizing inpatients over outpatients could all contribute to doctors arriving late at the clinics, causing increased stress, discomfort and impatience among patients.

The study's findings emphasize a positive aspect of healthcare delivery at KCMC, specifically in pharmacy services, with a remarkably short median waiting time of 4 min. This aligns with research in Iran, which also reported the pharmacy as having the shortest average waiting time, at 5 ± 3 min [ 41 ], and with findings in Kenya, where patients experienced a similar pattern with an average waiting time of 5.5 min [ 37 ]. However, these results contrast with a study in a tertiary care hospital in Pune, India, which revealed a 15-min average waiting time at the pharmacy [ 26 ]. In Ethiopia, one study found that only 23.6% of patients received their prescribed drugs within ≤ 30 min, while comparable numbers received them within 30–60 min or after more than 60 min [ 35 ]. During the interviews, patients commended the computerized system's effectiveness in streamlining the medication collection process. In addition, three new pharmacies have been added to the existing one, reducing congestion and allowing patients to receive faster service. These positive results serve as evidence of the efficiency with which KCMC's pharmacy services have integrated technology.

The median waiting time at the laboratory department was 31 min. This is congruent with a study in Ethiopia, which reported a similar median waiting time of 31 min, reflecting consistency in laboratory waiting times within Ethiopian healthcare settings [ 36 ]. Similarly, another study noted that 58.1% of patients received laboratory services within 30 to 60 min, with only 12.0% served within ≤ 30 min [ 35 ]. On the contrary, a study in Nigeria revealed a longer waiting time, with patients waiting over 50 min on average for laboratory services, suggesting that KCMC's laboratory waiting time may be more favourable when compared with other hospitals [ 42 ]. Nevertheless, another study reported a significantly shorter average waiting time of 12.75 min, which suggests that there may be variations in waiting time between KCMC and an Indian healthcare facility. The reason for the long waiting time at KCMC could be the limited space within the laboratory rooms, which allows only a few patients to be accommodated at any given time; this emphasizes the necessity of expanding facilities to improve the effectiveness of service delivery [ 26 ].

Waiting time at the radiology department showed significant differences depending on which investigation was ordered: the median waiting time for X-ray services varied between rooms, from 35 min to 1.15 h, whereas the median waiting time for ultrasound services was 32 min. Important insights into patient experiences were obtained through the in-depth interviews; some patients expressed contentment with the waiting time for X-rays because they were able to get the results and continue with further treatment from their doctors on the same day. Various studies have revealed differing median radiology waiting times: Iran reported 27 ± 11 min [ 41 ], India recorded 36.05 min [ 26 ], and studies in Ethiopia indicated 33 min [ 36 ] and 60 min [ 35 ], all relatively shorter waiting times. Conversely, Nigeria showed the longest waiting time for radiological services, at 77 min [ 42 ]. Issues were identified within KCMC's radiology department, such as the dual use of rooms for outpatient and inpatient cases, which prioritized admitted patients and resulted in longer waits for outpatients. This organizational practice complicates patient flow management and contributes considerably to the perceived delays in the radiology department. The findings emphasize that waiting times in radiology are influenced by resource availability, facility organization, and patient flow management.

Technical strategies on patient waiting time

The implemented block appointment system appears to have the potential to improve waiting time, even though its effect was not statistically significant. Early patient arrivals continue to be problematic, which emphasizes how crucial efficient patient education and coordination are in order to reap the full rewards of this system. Similar findings from Nigeria show that appointments with specific times are uncommon, resulting in early patient arrivals and possible delays in the start of services [ 8 ]. However, in other countries where it has been used, the block appointment system has proved successful: research conducted in the United States [ 43 ] and the United Kingdom [ 44 ] has demonstrated its effectiveness in reducing patient wait times, and studies in Thailand [ 5 ] and Sri Lanka [ 7 ] demonstrated the possible advantages of carefully planned scheduling by showing how the use of appointment systems can dramatically reduce average waiting time. Block appointment scheduling also successfully spread out patient arrivals throughout the day, as shown by a pilot study conducted in Mozambique, which significantly decreased waiting time [ 9 ]. Hence, coordinated efforts involving medical records, physicians, nurses, and patients themselves are needed to operate the system.

The transition from paper to electronic medical records had a significant and positive impact on reducing long waiting times at the OPD. Various studies have underlined the benefits of electronic medical records over paper-based systems, including how they can improve patient waiting time, increase efficiency, and improve the delivery of healthcare services [ 10 , 11 , 12 ]. Another study highlighted the preference for electronic health records among healthcare providers due to their efficiency and speed in patient care; by eliminating labour-intensive procedures, space limitations, and document-misplacement problems associated with manual filing systems, the switch to electronic records helped to create more efficient and productive operations [ 13 ]. The entire patient experience was greatly enhanced, since patients were no longer frustrated by lost records or delayed test results. The implementation of electronic health records has also proven beneficial in reducing extended wait times in outpatient clinics, as evidenced by a study carried out in Brazil [ 14 ].

The extension of clinic days yielded a mean score of 4.31 (SD = 0.832), signifying a positive effect. Qualitative findings from healthcare providers and patients, however, shed light on the practical challenges of extending clinic days: the department's small staffing complement posed a significant challenge, as doctors had to manage multiple responsibilities, such as ward rounds, decision-making for admitted patients, and surgery. These findings are not congruent with those from other locations where clinic days have been extended. For instance, one study suggested that extending clinic days was more effective, resulting in a 26% reduction in average waiting time [ 15 ]. Additionally, another study found a significant 56% reduction in average waiting time after extending clinic days, coupled with high patient satisfaction rates [ 16 ]. Similarly, in another study, extending clinic days resulted in a 46% decrease in average waiting time; that study also found that patient satisfaction was high and that the number of patients seen each day had increased [ 17 ].

The availability of modern diagnostic services had a mean score of 4.30 (SD = 0.861), signifying a positive effect. This demonstrates that advanced diagnostic equipment played a significant role in streamlining healthcare processes and enhancing efficiency. Qualitative findings from both healthcare providers and patients supported this, highlighting how digital systems and modern equipment simplified procedures and expedited healthcare services. Access to electronic patient information and test results contributed to time savings. These findings are congruent with studies conducted in Italy [ 18 ], Pakistan [ 19 , 20 ], and Iran [ 21 ], which all demonstrated reductions in waiting time following the acquisition of modern equipment. A study from India also supported the positive impact of modern equipment on patient waiting time [ 22 ]. Additionally, audit assessments in Tanzania by the Ministry of Health and equipment-related observations in zonal hospitals emphasized the critical role of modern equipment in healthcare settings. Outdated equipment can lead to extended patient waiting time, underscoring the importance of maintaining and upgrading diagnostic facilities to improve healthcare efficiency and patient care [ 2 ].

The implemented technical strategies resulted in a substantial reduction in overall OPD waiting time to a median of 3.30 h, marking a 45% reduction from the previous six-hour wait. While there have been notable improvements in registration, payment, triage, and pharmacy services, issues remain in doctor consultations, laboratory, and radiology services, resulting in extended waiting times for some patients. The adoption of electronic medical records emerged as the most effective technical strategy, emphasizing its critical role in improving OPD efficiency. Despite these advancements, additional improvements are required to meet the global standard of waiting times ranging from 30 min to 2 h. The ineffective implementation of block appointments and of the extension of clinic days appears to stem from a lack of ownership and proactive involvement by hospital managers in driving these strategies forward. Furthermore, the hospital's dominant organizational culture seemed resistant to change, which could hinder the effective implementation of these strategies. The results also indicated a possible training shortfall, suggesting that personnel may not have had enough training to properly adopt and implement the new strategies, and there was a lack of effective follow-up and management by hospital managers, potentially hindering sustained implementation. Finally, the shared use of central modern diagnostic equipment between inpatient and outpatient services at the radiology department resulted in delays that affected waiting time; a comprehensive review of the diagnostic service structure might be necessary to alleviate delays and streamline services for both inpatient and outpatient care.

Limitations of the study

Since only one hospital was involved in the study, generalization to cover the rest of Tanzania remains uncertain. Additionally, there was a chance that selection bias might have impacted the findings.

Availability of data and materials

Data is available upon request from the corresponding author.

Abbreviations

CRERC: College Research Ethics and Review Committee

HCP: Healthcare provider

IDI: In-depth interview

IOM: Institute of Medicine

IQR: Interquartile range

KCMC: Kilimanjaro Christian Medical Centre

MoHCDGEC: Ministry of Health, Community Development, Gender, Elderly and Children

MU: Mzumbe University

OPD: Outpatient department

Msengwa AS, Rashidi J, Mniachi RA. Waiting time and Resource Allocation for Out-patient Department: A case of Mwananyamala Hospital in Dar es Salaam, Tanzania. Tanzania J Popul Stud Dev. 2020;27:64–81.


Ministry of Health, Community Development, Gender, Elderly and Children. Performance audit report on management of referral and emergency healthcare services in higher level referral hospitals – Tanzania. Report of the Controller and Auditor General of the United Republic of Tanzania; 2019.

Lee H, Choi EK, Min KA, Bae E, Lee H, Lee J. Physician-customized strategies for reducing outpatient waiting Time in South Korea using queueing theory and probabilistic metamodels. Int J Environ Res Public Health. 2022;19:2073.


Mardiah FP, Basri MH. The analysis of appointment system to reduce outpatient waiting time at Indonesia's public hospital. 2013;3:27–33. https://doi.org/10.5923/j.hrmr.20130301.06.

Panaviwat C, Lohasiriwat H, Tharmmaphornphilas W. Designing an appointment system for an outpatient department. IOP Conf. Ser. Mater. Sci. Eng. 2014;58. https://doi.org/10.1088/1757-899X/58/1/012010 .

Huang YL, Hancock WM, Herrin GD. An alternative outpatient scheduling system: Improving the outpatient experience. IIE Trans Healthc Syst Eng. 2012;2:97–111. https://doi.org/10.1080/19488300.2012.680003 .

Algiriyage N, Sampath R, Pushpakumara C, Wijayarathna G. A Simulation for Reduced Outpatient Waiting Time. 2020.


Ogaji D, Mezie-Okoye M. Waiting time and patient satisfaction: Survey of patients seeking care at the general outpatient clinic of the University of Port Harcourt Teaching Hospital. Port Harcourt Med Ical J. 2017;11:148–55. https://doi.org/10.4103/phmj.phmj_41_17 .

Steenland M, Dula J, Albuquerque A De, Fernandes Q, Cuco RM, Chicumbe S, et al. Effects of appointment scheduling on waiting time and utilisation of antenatal care in Mozambique. 2019. https://doi.org/10.1136/bmjgh-2019-001788

Achampong EK. Electronic health record system: a survey in Ghanaian hospitals. 2012;1:2–5. https://doi.org/10.4172/scientificreports.1.

Thapa R, Saldanha S, Bucker N, Rishith P. An assessment of patient waiting and consultation time in the outpatient department at a selected tertiary care teaching hospital. J Evol Med Dent Sci. 2018;7:984–8. https://doi.org/10.14260/jemds/2018/225.

Cho KW, Kim SM, Chae YM, Song YU. Application of queueing theory to the analysis of changes in outpatients’ waiting times in hospitals introducing EMR. Healthc Inform Res. 2017;23:35–42.

Chandra D. Reducing waiting time of outdoor patients in hospitals using different types of models: a systematic survey. Int J Bus Manag Rev. 2017;4:79–91.

Lot LT, Sarantopoulos A, Min LL, Perales SR, Boin I de FSF, Ataide EC de. Using Lean tools to reduce patient waiting time. Leadersh Heal Serv. 2018;31:343–51. https://doi.org/10.1108/LHS-03-2018-0016 .

O’Brien MA, Rogers S, Jamtvedt G, Oxman AD, Odgaard-Jensen J, Kristoffersen DT, Forsetlund L, Bainbridge D, Freemantle N. Extending clinic hours or extending clinic days: which works better for reducing waiting times? Cochrane Database Syst Rev. 2016;7:CD012775. https://doi.org/10.1002/14651858.CD012775 .

Al-Abri R, Al-Balushi A, Al-Sinawi H, Al-Dhuhli H. Extension of outpatient clinic hours reduces waiting time and improves patient satisfaction. Oman Med J. 2019;34:224–30. https://doi.org/10.5001/omj.2019.41 .

Gopalan S, Singh H, Adeyemi O, Padmanabhan P. Impact of extending the hours of the outpatient department and increasing the number of doctors on patient waiting times at a tertiary care hospital in India. J Fam Med Prim Care. 2018;7:268–73. https://doi.org/10.4103/jfmpc.jfmpc_233_16 .

Botti L, Buonocore D, Conforti D, Di Giovanni P, Mininno V. Effectiveness of modern equipment availability on reducing waiting time and enhancing patient satisfaction in a tertiary care hospital outpatient clinic. J Med Syst. 2019;43:275.

Sarwat A. The effects of waiting time and satisfaction among patients visiting medical outpatient department of a tertiary care hospital. J Pak Psychiatr Soc. 18(03):25–9. https://doi.org/10.63050/jpps.18.03.99 .

Khan MA, Khan MU, Ahmed N, Ahmed N. Reduction in waiting time of patients in a tertiary care hospital: role of modern equipment. J Pak Med Assoc. 2019;69:1443–7.

Hemmati F, Mahmoudi G, Dabbaghi F, Fatehi F, Rezazadeh E. The factors affecting the waiting time of outpatients in the emergency unit of selected teaching hospitals of Tehran. Electron J Gen Med. 2018;15(4):em66. https://doi.org/10.29333/ejgm/93135 .

Gupta S, Nayak RP. Availability of modern equipment and its impact on waiting time of patients in a tertiary care hospital. Indian J Public Health. 2017;61:131–4.

Usman SO, Olowoyeye E, Adegbamigbe OJ, Olubayo GP, Ibijola AA, Tijani AB, et al. Patient waiting time: gaps and determinants of patients waiting time in hospitals in our communities to receive quality services. Eur J Med Heal Sci. 2020;2:2–5. https://doi.org/10.24018/ejmed.2020.2.1.136 .

Al Harajin RS, Al Subaie SA, Elzubair AG. The association between waiting time and patient satisfaction in outpatient clinics: findings from a tertiary care hospital in Saudi Arabia. J Family Community Med. 2019;26:17–22. https://doi.org/10.4103/jfcm.JFCM.

Diri GE, Eledo B. Comparative Study on the performance of health workers in the reduction of patients waiting time in public hospitals in Yenagoa. Sci Heal Centre, Fed Med. 2020;5:1–13.

Pandit A, Varma EL, Amruta P. Impact of OPD waiting time on patient satisfaction. Int Educ Res J. 2016;2:86–90.

Cochran WG. Sampling techniques. 3rd ed. New York: John Wiley & Sons; 1977.

Boddy CR. Sample size for qualitative research. Qual Mark Res. 2016;19:426–32. https://doi.org/10.1108/QMR-06-2016-0053 .

AbdulazizAlsubaie H, Alnaim M, Alsubaie S. Literature review: waiting time and patient satisfaction relationship. Int J Sci Basic Appl Res. 2021;59:161–4.

Farzandipur M, Jeddi FR, Azimi E. Factors affecting successful implementation of hospital information systems. Acta Inform Medica. 2016;24:51–5. https://doi.org/10.5455/aim.2016.24.51-55 .

Graneheim UH, Lundman B. Qualitative content analysis in nursing research: concepts, procedures and measures to achieve trustworthiness. Nurse Educ Today. 2004;24:105–12. https://doi.org/10.1016/j.nedt.2003.10.001.

Johnson JL, Adkins D, Chauvin S. A review of the quality indicators of rigor in qualitative research. Am J Pharm Educ. 2020;84:138–46. https://doi.org/10.5688/ajpe7120 .

Almomani I, Alsarheed A. Enhancing outpatient clinics management software by reducing patients' waiting time. J Infect Public Health. 2016;9:734–43.


Xie Z, Or C. Associations between waiting times, service times, and patient satisfaction in an endocrinology outpatient department: a time study and questionnaire survey. Inquiry. 2017;54. https://doi.org/10.1177/0046958017739527.

Geta ET, Edessa AM. Satisfaction with Waiting Time and Associated Factors among Outpatients at Nekemte Referral Hospital, Western Ethiopia. 2020;5:18–25. https://doi.org/10.11648/j.rs.20200502.12 .

Biya M, Gezahagn M, Birhanu B, Yitbarek K, Getachew N, Beyene W. Waiting time and its associated factors in patients presenting to outpatient departments at Public Hospitals of Jimma Zone, Southwest Ethiopia. BMC Health Serv Res. 2022;22:1–8. https://doi.org/10.1186/s12913-022-07502-8 .

Wafula RB, Ayah R. Factors Associated With Patient Waiting Time at a Medical Outpatient Clinic: A Case Study of University of Nairobi Health Services. Int J Innov Res Med Sci. 2021;6:915–8. https://doi.org/10.23958/ijirms/vol06-i12/1307 .

Boonma A, Sethanan K, Talangkun S, Laonapakul T. Patient waiting time and satisfaction in GP clinic at a tertiary hospital in Thailand. MATEC Web Conf. 2018;192:01034.  https://doi.org/10.1051/matecconf/201819201034 .

Abdulsalam A, Khan HTA. Hospital services for ill patients in the geopolitical zone, Nigeria: patient's waiting time and level of satisfaction. 2020. https://doi.org/10.1177/1054137316689582.

Usman S, Olowoyeye E, Adegbamigbe O, Olubayo G, Ibijola A, Tijani A. Patient waiting time: gaps and determinants of patients waiting time in hospitals in our communities to receive quality services. Eur J Med Heal Sci. 2020;2:2–5.

Jalili R, Mohebbi M, Asefzadeh S, Mohebbi M. Evaluation of waiting time and satisfaction in outpatients in Imam Hossein Polyclinic of Zanjan using patient-pathway analysis. 2021;10:34–41.

Ogaji DS, Mezie-Okoye M. Waiting time and patient satisfaction: Survey of patients seeking care at the general outpatient clinic of the University of Port Harcourt Teaching Hospital. Port Harcourt Med J. 2017:148–55. https://doi.org/10.4103/phmj.phmj_41_17 .

Baker J, Moore T. The impact of block scheduling on patient wait times in a tertiary care outpatient setting. J Healthc Manag. 2017;62:54–65.

Hensher M, Price M, Adomakoh S, et al. Reducing waiting time in outpatient clinics: A case study. J Med Syst. 2016;40:212. https://doi.org/10.1007/s10916-016-0571-6 .


Acknowledgements

We extend our gratitude to the patients who participated in this study and to the research assistants who contributed to data collection, namely Geofrey A. Sikaluzwe, Mbayani J. Kivuyo, Richard Hezron Mwamahonje, Emmanuel M. Mabula, Abel E. Lucas and Amos Francis; to Dr. (Mrs) Angela Savage for proofreading; to Dr. Bernard Njau for his continued guidance; and to Dr. Theresia Mkenda for availing us of research assistants.

This study had no funding.

Author information

Authors and Affiliations

Kilimanjaro Christian Medical Centre, P. O. Box 3010, Moshi, Tanzania

Manasseh J. Mwanswila

Department of Health Systems Management, School of Public Administration and Management, Mzumbe, P.O. Box 2, Morogoro, Tanzania

Manasseh J. Mwanswila, Henry A. Mollel & Lawrencia D. Mushi


Contributions

M.J.M conceptualized and conducted the study, handling data collection, analysis, and initial manuscript drafting. H.A.M and L.D.M provided oversight and reviewed the process from proposal to final manuscript. All authors reviewed the manuscript.

Corresponding author

Correspondence to Manasseh J. Mwanswila .

Ethics declarations

Ethics approval and consent to participate.

The Clearance Committee of the Mzumbe University Directorates of Research, Publication and Postgraduate Studies provided ethical clearance with reference number MU/DPGS/INT/38/Vol. IV/236. Subsequently, the proposal was submitted for evaluation to the College Research Ethics and Review Committee (CRERC) at Kilimanjaro Christian Medical University College, Moshi. The CRERC granted approval, as indicated by certificate number 2639. Additionally, the data collection procedure received endorsement from the directors of KCMC Hospital, reference number KCMC/P.1/Vol. XII. Prior to data collection, participants provided written informed consent. To ensure respondents' autonomy, patients were fully informed about the purpose and nature of the study and given the option to withdraw at any time without any impact on their medical care. Patients were questioned only after completing their medical care.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.



Cite this article

Mwanswila, M.J., Mollel, H.A. & Mushi, L.D. Outcome evaluation of technical strategies on reduction of patient waiting time in the outpatient department at Kilimanjaro Christian Medical Centre—Northern Tanzania. BMC Health Serv Res 24, 785 (2024). https://doi.org/10.1186/s12913-024-11231-5


Received: 20 January 2024; Accepted: 21 June 2024; Published: 09 July 2024.


  • Waiting time
  • Patient waiting time



A cosmic tool for studying twisters and other severe storms

Physicists say particle-finding technique has value on Earth.

Cosmic rays could offer scientists another way to track and study violent tornadoes and other severe weather phenomena, a new study suggests. 

By combining local weather data with complex astrophysics simulations, researchers explored whether a device that typically detects high-energy particles called muons could be used to remotely measure tornado-producing supercell thunderstorms. 

Conventional tornado-tracking instrumentation relies on measurements made by technologies like drones or weather balloons, but those methods often require humans to get dangerously close to the path of an oncoming storm. 

Yet by studying how these storms affect muons, which are heavier than electrons and travel through matter at nearly the speed of light, scientists gain another tool for building a more accurate picture of the underlying weather conditions.


The study was published on the open-access preprint server arXiv.

Compared to other cosmic particles, muons have many unique real-world applications, including helping scientists peer inside large, dense objects like the pyramids and detect hazardous nuclear material. Now, the simulations in this paper by study author William Luszczak imply that supercell thunderstorms cause very slight changes in the number, direction and intensity of these particles.

To determine this, the researchers applied a three-dimensional cloud model that could account for multiple variables, including wind, potential temperature, rain, snow and hail. Then, using atmospheric observations gathered from the 2011 supercell that passed through El Reno, Oklahoma , and spawned a tornado outbreak, Luszczak applied that information to measure variations in air pressure in the region around a simulated storm over the span of an hour. 

Overall, the results showed that muons are indeed affected by the pressure field inside these storms, though more research is needed to understand the process.

In terms of how well it could work in the field, the concept is especially appealing, as utilizing muons to predict and analyze future weather patterns would also mean scientists wouldn’t necessarily have to try to place instruments very near a tornado to gain these pressure measurements, said Luszczak. 

Still, the type of muon particle detector that Luszczak’s paper considers is much smaller than other more well-known cosmic ray projects, such as the Pierre Auger Observatory in Argentina and the University of Utah’s Telescope Array.  

Unfortunately, these detectors don’t reside in places where they can study tornadoes, said Luszczak, but if placed in a region like Tornado Alley in the United States, researchers imagine that the device could easily complement typical meteorological and barometric measurements for tornadic activity. 

That said, the device’s size also influences how precise its measurements are, as scaling it up enhances the number of particles it can detect, said Luszczak. 

The smallest detector researchers describe in this paper is 50 meters across, or about the size of five buses. But while such a tool would be portable enough to ensure scientists could place it near many different types of storm systems, being so small would likely cause it to face some errors in its data-gathering, said Luszczak. 

Despite these potential setbacks, as supercell thunderstorms typically form and disappear in short periods, the paper emphasizes it may be well worth future scientists’ time to consider implementing a large detector in some regions – one that would likely be a permanent stationary establishment to catch as many muons as possible during severe weather events. 

More importantly, because current weather modeling systems are directly linked to when and where severe weather alerts are issued, using cosmic rays to strengthen those models would give the public a more detailed sense of a storm’s various twists and turns as well as more time to prepare for the phenomenon.  

“By having better measurements of the atmosphere surrounding a tornado, our modeling improves, which then improves the accuracy of our warnings,” said Luszczak. “This concept has a lot of promise, and it’s a really exciting idea to try to put into action.”

Leigh Orf of the University of Wisconsin-Madison was a co-author.  



The Value of p-Value in Biomedical Research


Significance tests and the corresponding p-values play a crucial role in decision making. In this commentary, the meaning, interpretation and misinterpretation of p-values are presented. Alternatives for evaluating the reported evidence are also discussed.

INTRODUCTION

Evidence-based medicine aims to apply scientific information retrieved from research to certain parts of medical practice. In particular, it seeks to assess the quality of evidence relevant to the risks and benefits of individuals' characteristics or treatments [ 1 ]. According to the Centre for Evidence-Based Medicine, " Evidence-based medicine is the conscientious, explicit and judicious use of current best evidence in making decisions about the care of individual patients " [ 2 ]. A cornerstone of evidence-based medicine is decision quality. Under the concept of evidence-based medicine, research is categorized and ranked according to the strength of its freedom from various biases. The strongest evidence for therapeutic interventions is provided by meta-analyses of randomized, double-blind, controlled clinical trials. On the contrary, case reports and expert opinion have little value. The U.S. Preventive Services Task Force [ 1 ] ranks scientific evidence in the following order: (a) evidence obtained from more than one randomized controlled trial (Level I); (b) evidence obtained from controlled trials without randomization (Level II-1), from prospective or case-control epidemiologic studies (Level II-2), or from multiple time series with or without the intervention (Level II-3); and (c) opinions of respected authorities, based on clinical experience, descriptive studies, or reports of expert committees (Level III). The UK National Health Service uses a similar system with categories labelled A, B, C, and D. Anytime a selection must be made among several alternative choices, a decision is being made, and the role of the researcher is to assist in this process. Especially when decisions are complicated and require careful consideration and systematic review of the available information, the researcher's role becomes paramount.

Evidence-based medicine attempts to express clinical research using mathematical methods. Tools used by researchers include likelihood ratios, various (univariate or multivariate) statistical tests, the area under the receiver operating characteristic (ROC) curve and many others. The p-value is one of the most widely used statistical terms in decision making in biomedical research, and it assists investigators in drawing conclusions about the significance of a research question. To date, most researchers base their decisions on the value of the probability p. However, the term p-value is often mis- or over-interpreted, leading to serious methodological errors and misinterpretations [ 3 ]. In this article the interpretation of the p-value and some alternative options are discussed.

DEFINITION OF THE P-VALUE

In statistical science, the p-value is the probability of obtaining a result at least as extreme as the one that was actually observed in the biological or clinical experiment or epidemiological study, given that the null hypothesis is true [ 4 ]. The testing of hypotheses is fundamental in statistics, and it could be considered a “method” of making statistical decisions using experimental data. At this point we have to introduce some terms regarding hypothesis testing. There are two hypotheses, the null and the alternative. Usually, the null hypothesis indicates no association between the investigated factors or characteristics (measured using random variables), e.g., “the prevalence of cardiovascular disease is equal between males and females”; thus, “there is no association between gender and the disease”. On the other hand, the alternative hypothesis indicates an association between the investigated variables (i.e. the prevalence of cardiovascular disease differs between genders (two-sided hypothesis), or the prevalence in males is greater than the prevalence in females, or the prevalence in females is greater than the prevalence in males (one-sided hypothesis)). In the 1920s, Fisher [ 5 ] proposed significance tests as a means of examining the discrepancy between the data and the null hypothesis. Some of the most often used significance tests in biomedical research are the Z-test, the Student’s t-test, the F-test and the chi-square test, among others.

In statistical theory, the p-value is a random variable defined over the sample space (i.e. the set of all possible outcomes) of the experiment, such that its distribution under the null hypothesis is uniform on the interval (0, 1). For example, a phase III clinical trial (experiment) is performed to determine whether total cholesterol levels differ between a group under drug A treatment and a group under drug B treatment. For simplicity, it is assumed that baseline levels of cholesterol were equal, and that after 12 months of treatment a mean absolute reduction in total cholesterol levels of 27±10 mg/dl was observed in group A and a mean absolute reduction of 25±10 mg/dl was observed in group B. If 100 patients were allocated to each treatment arm, and taking into account the assumptions of the appropriate significance test, the p-value of this hypothesis test is equal to 0.15. In this case the null hypothesis is that “in the population the mean absolute reductions were equal”, against the alternative that “in the population the mean absolute reductions were not equal”. The p-value of this result is the chance of observing a difference of at least 2 mg/dl between the two treatment arms when the reduction in cholesterol levels is in fact the same in both groups (i.e., under the null hypothesis). A p-value of 0.15 therefore means that a difference at least as large as the one observed would arise by chance about 15% of the time if the null hypothesis were true. In Fisher’s approach the null hypothesis is never proved, but is possibly disproved. Moreover, Fisher suggested 0.05 as a threshold of significance (i.e., α); if the p-value is less than α, there is evidence to reject the null hypothesis. However, there has been considerable criticism about this choice and its usefulness. Despite the criticisms made, all agree that the significance level should be decided before the data are viewed, and compared against the p-value after the test has been performed. Moreover, although p-values are widely used, there are several misunderstandings about them. In the text below, an attempt is made to clarify what the p-value really is and what it is not.
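The numbers in this example can be reproduced with a standard two-sample t-test. The sketch below is illustrative only; it assumes the summary statistics quoted above (27±10 vs 25±10 mg/dl, 100 patients per arm) and an ordinary two-sided, equal-variance test, so the small difference from the 0.15 quoted in the text simply reflects the exact test used.

```python
# Minimal sketch of the cholesterol example: two-sided, equal-variance t-test
# from summary statistics (assumptions: 27±10 vs 25±10 mg/dl, n=100 per arm).
from scipy import stats

t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=27, std1=10, nobs1=100,   # drug A: mean absolute reduction, SD, n
    mean2=25, std2=10, nobs2=100,   # drug B
    equal_var=True,
)
print(f"t = {t_stat:.2f}, p = {p_value:.2f}")   # roughly t = 1.41, p = 0.16
```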

WHAT THE P-VALUE IS AND WHAT IT IS NOT

The p-value is not the probability that the null hypothesis is true, because hypotheses do not have probabilities in classical statistics. Moreover, the p-value is not the probability of falsely rejecting the null hypothesis. Falsely rejecting the null hypothesis is a Type I error, and confusing the two is a version of the so-called “prosecutor's fallacy”. The Type I error rate is closely related to the p-value, since we reject the null hypothesis when the p-value is less than a pre-defined level, α. The p-value does not indicate the size or importance of the observed effect. Thus, a very small p-value, say one reported as <0.001, does not necessarily mean a strong association (in contrast to an effect size, which is a measure of the strength of the relationship between two variables, e.g., odds ratio, relative risk, correlation coefficient, Cohen’s d, etc [ 5 , 6 ]). Moreover, the p-value is influenced by sample size. For example, Fig. (1) illustrates the impressive decrease in the p-value with increasing sample size, keeping the observed findings constant. It can be seen that if the initial sample size is doubled (i.e., n=200 per treatment arm) the study’s results achieve significance.

Fig. (1). Theoretical example of p-values in relation to sample size for the same difference in the data.
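To make Fig. (1) concrete, the short sketch below keeps the observed difference fixed and varies only the per-arm sample size; the numbers are illustrative and assume the same summary statistics as the example above.

```python
# Illustrative companion to Fig. (1): the same 2 mg/dl difference (SD 10)
# becomes "significant" purely by increasing the sample size per arm.
from scipy import stats

for n in (50, 100, 200, 400):
    _, p = stats.ttest_ind_from_stats(27, 10, n, 25, 10, n, equal_var=True)
    print(f"n per arm = {n:4d}  ->  p = {p:.3f}")
# n = 100 gives p ~ 0.16 (not significant at alpha = 0.05);
# doubling to n = 200 gives p ~ 0.046, crossing the 0.05 threshold.
```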

Another major issue that influences medical decision making is the multiple comparisons problem, which occurs when a family of statistical inferences is considered simultaneously. For example, with just one hypothesis test performed at the 5% significance level, there is only a 5% probability of obtaining a result at least as extreme as the one that was observed when the null hypothesis is true. However, with 100 tests performed and all null hypotheses being true, it is much more likely that at least one null hypothesis will be rejected. These errors are called false positives, and many mathematical techniques have been developed to control them. Most of these techniques modify the significance level α in order to account for the inflation of the type I error rate and make the comparison against the p-value more accurate.
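The inflation of the type I error rate with multiple tests, and the simplest α adjustment (Bonferroni), can be illustrated with a few lines of arithmetic; the sketch below assumes independent tests, which is the idealized case.

```python
# Family-wise error rate for m independent tests at alpha = 0.05,
# and the corresponding Bonferroni-adjusted per-test threshold.
alpha = 0.05
for m in (1, 5, 20, 100):
    fwer = 1 - (1 - alpha) ** m      # chance of at least one false positive
    print(f"{m:3d} tests: P(>=1 false positive) = {fwer:.3f}, "
          f"Bonferroni per-test alpha = {alpha / m:.4f}")
# With 100 tests and all null hypotheses true, the chance of at least one
# spurious "significant" result is about 0.994.
```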

For all the aforementioned reasons, many journals have long recommended that authors present confidence intervals instead of, or alongside, p-values, since p-values on their own are not considered a sound summary of the evidence [ 7 ].

Finally, the p-value is not the probability that a replication of the experiment would yield the same conclusion. For this reason Killeen [ 8 ] proposed p_rep as a statistical alternative to the p-value, which estimates the probability of replicating an effect. An approximation of p_rep is the following:

p_{rep} = \left[ 1 + \left( \frac{p}{1 - p} \right)^{2/3} \right]^{-1}

The lower the p-value, the higher the p_rep. The Association for Psychological Science (APS) recommends that authors contributing to its journals present p_rep instead of p-values. However, considerable criticism has been made: for example, p_rep does not take prior probabilities into account [ 9 ], and it does not bring any additional information on the significance of the result of a given experiment.
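For illustration, Killeen's approximation above can be evaluated directly; the sketch below is a plain transcription of that formula, not a recommendation to use p_rep.

```python
# Killeen's approximation of p_rep from a two-sided p-value (illustrative only).
def p_rep(p: float) -> float:
    return 1.0 / (1.0 + (p / (1.0 - p)) ** (2.0 / 3.0))

for p in (0.10, 0.05, 0.01, 0.001):
    print(f"p = {p:<6} ->  p_rep = {p_rep(p):.3f}")
# Smaller p-values map to higher estimated replication probabilities,
# e.g. p = 0.05 gives a p_rep of roughly 0.88.
```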

Recently, Ioannidis [ 10 ] suggested that more “detailed” statistical methods should be applied, such as the Bayes factor B, to interpret ‘‘significant’’ associations. In general, Bayesian inference is a method for determining how scientific belief should be modified by observed data. Most importantly, Bayes factors require the addition of background knowledge to be transformed into inferences. The simplest form of Bayes factor is the likelihood ratio (i.e., the ratio Λ of the maximum probability of a result under two different hypotheses, the null where no association is observed and the alternative). The minimum Bayes factor is objective and can be used instead of the p-value as a measure of evidential strength. However, medical researchers have not been enthusiastic to adopt Bayesian statistical methodologies, which are often perceived as a subjective approach to evidence-based analysis. Despite the criticism, for many scientists the use of the Bayes factor B is an alternative to the classical hypothesis testing mentioned above. In particular, as Ioannidis observed, when the factor B was calculated on 272 observational studies and 50 meta-analyses on gene-disease associations (752 studies) for which statistically significant associations had been claimed (p<0.05), the statistically significant results offered less than strong support to the credibility of 54–77% of the epidemiologic associations and 44–70% of the 50 associations from genetic meta-analyses [ 10 ].

In brief, unlike p-values, Bayes factors have a sound interpretation that allows their use in both inference and decision making, since they make the distinction clear between experimental evidence and inferential conclusions while providing a framework in which to combine prior with current evidence.
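As a hedged illustration of the minimum Bayes factor idea, the sketch below uses the widely cited normal-approximation form exp(−z²/2) (Goodman's formulation); this specific formula is an assumption added here for illustration and is not taken from the article itself.

```python
# Minimum Bayes factor from a two-sided p-value via the normal approximation
# BF_min = exp(-z**2 / 2); values below 1 favour the alternative hypothesis.
import math
from scipy import stats

def min_bayes_factor(p: float) -> float:
    z = stats.norm.isf(p / 2)        # z-score corresponding to a two-sided p-value
    return math.exp(-z * z / 2)      # best-case evidence for H0 relative to H1

for p in (0.05, 0.01, 0.001):
    print(f"p = {p:<6} ->  minimum Bayes factor = {min_bayes_factor(p):.3f}")
# p = 0.05 corresponds to a minimum Bayes factor of about 0.15, i.e. the data are
# at best ~7 times more likely under the alternative than under the null.
```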

CONCLUDING REMARKS

In this article an attempt was made to interpret the meaning of the p-value, a probability that is the basis of decision making in most biomedical research. Recent guidelines for presenting the results of clinical experiments or observational studies suggest providing confidence intervals instead of, or together with, p-values, and giving the effect sizes of the investigated associations. Nevertheless, the p-value still has significant value when correctly interpreted and used.


Published on 12.7.2024 in Vol 26 (2024)

A Computed Tomography–Based Fracture Prediction Model With Images of Vertebral Bones and Muscles by Employing Deep Learning: Development and Validation Study

Authors of this article:


Original Paper

  • Sung Hye Kong 1*, MD, PhD
  • Wonwoo Cho 2*, MSc
  • Sung Bae Park 3, MD, PhD
  • Jaegul Choo 2*, PhD
  • Jung Hee Kim 4*, MD, PhD
  • Sang Wan Kim 5, MD, PhD
  • Chan Soo Shin 4, MD, PhD

1 Department of Internal Medicine, Seoul National University Bundang Hospital, Seongnam, Republic of Korea

2 Kim Jaechul Graduate School of AI, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea

3 Department of Neurosurgery, Seoul National University Boramae Hospital, Seoul, Republic of Korea

4 Department of Internal Medicine, Seoul National University Hospital, Seoul, Republic of Korea

5 Department of Internal Medicine, Seoul National University Boramae Hospital, Seoul, Republic of Korea

*these authors contributed equally

Corresponding Author:

Jung Hee Kim, MD, PhD

Department of Internal Medicine

Seoul National University Hospital

101 Dae-hak ro, Jongno-gu

Seoul, 03080

Republic of Korea

Phone: 82 220724839

Email: [email protected]

Background: With the progressive increase in aging populations, the use of opportunistic computed tomography (CT) scanning is increasing, which could be a valuable method for acquiring information on both muscles and bones of aging populations.

Objective: The aim of this study was to develop and externally validate opportunistic CT-based fracture prediction models by using images of vertebral bones and paravertebral muscles.

Methods: The models were developed based on a retrospective longitudinal cohort study of 1214 patients with abdominal CT images between 2010 and 2019. The models were externally validated in 495 patients. The primary outcome of this study was defined as the predictive accuracy for identifying vertebral fracture events within a 5-year follow-up. The image models were developed using an attention convolutional neural network–recurrent neural network model from images of the vertebral bone and paravertebral muscles.

Results: The mean ages of the patients in the development and validation sets were 73 years and 68 years, and 69.1% (839/1214) and 78.8% (390/495) of them were females, respectively. The areas under the receiver operating characteristic curve (AUROCs) for predicting vertebral fractures were higher for images of the vertebral bone and paravertebral muscles than for bone-only images in the external validation cohort (0.827, 95% CI 0.821-0.833 vs 0.815, 95% CI 0.806-0.824, respectively; P <.001). The AUROCs of these image models were higher than those of the fracture risk assessment models (0.810 for major osteoporotic risk, 0.780 for hip fracture risk). For the clinical model using age, sex, BMI, use of steroids, smoking, possible secondary osteoporosis, type 2 diabetes mellitus, HIV, hepatitis C, and renal failure, the AUROC value in the external validation cohort was 0.749 (95% CI 0.736-0.762), which was lower than that of the image model using vertebral bones and muscles ( P <.001).

Conclusions: The model using the images of the vertebral bone and paravertebral muscle showed better performance than that using the images of the bone-only or clinical variables. Opportunistic CT screening may contribute to identifying patients with a high fracture risk in the future.

Introduction

The globally aging society has driven an increase in the incidence of fragility fractures and imposed a significant burden on health care systems, societies, and most importantly, on patients and their families [ 1 - 3 ]. Thus, proactively identifying patients with a high risk of fractures is vital. There are well-established methods to evaluate the risk of fractures, such as dual-energy X-ray absorptiometry (DXA) to assess bone mineral density (BMD), which is a reference standard for the diagnosis of osteoporosis [ 4 ]. However, a large proportion of patients have never undergone DXA, and 60% of the patients with major osteoporotic fractures do not receive proper treatment to reduce the risk of fractures [ 5 ].

Opportunistic computed tomography (CT) scans can be a novel approach for identifying patients with a high risk of fractures. Along with the increase in progressively aging populations, the use of opportunistic CT scanning is increasing, with over 80 million examinations performed each year in the United States [ 6 ]. Retrieval of information that can help assess the fracture risks from opportunistic CT scans does not require additional costs, time, or equipment, and data can be retrospectively acquired. Thus, it may help reduce the efforts associated with screening patients with high risks of fractures. Several studies have assessed BMD by using opportunistic CT scans [ 7 ], mainly utilizing the attenuation data of the trabecular bone of the spine [ 8 , 9 ].

There have been significant advances in deep learning techniques for medical image analysis, such as the convolutional neural network (CNN) method [ 10 ]. The CNN method facilitates the utilization of highly representative, data-driven image features, arranged in a layered hierarchical structure, which are effective in successfully classifying medical images. Various images have been used with CNNs to classify patients with a high fracture risk [ 11 - 14 ]. Previous studies have primarily focused on the use of radiographic images for fracture detection, with AUROCs reported in the range of 0.73-0.80 [ 14 , 15 ]. However, there is a scarcity of research utilizing CT images, and existing work is limited to bone texture analysis. Our study may fill this gap by applying CNN techniques to CT scans, which may provide a more accurate assessment of fracture risk owing to the detailed and comprehensive nature of CT imaging. Further, as paravertebral muscles are among the critical contributing factors to vertebral fractures [ 16 , 17 ], CT could be a valuable method for acquiring information on both the muscle and the vertebral bone. Nevertheless, to our knowledge, no previous study has reported using images of both the vertebral bone and muscle from CT scans with the CNN method. Therefore, we aimed to develop and externally validate a CT-based fracture prediction model by using images of vertebral bones and muscles and employing a deep learning method. This study may help identify patients at high risk of fractures among those who undergo opportunistic CT scans for screening or other purposes.

Study Design and Participants

This study was based on a retrospective longitudinal cohort study of 32,435 patients having abdominal CT images at Seoul National University Bundang Hospital between 2010 and 2019. Patients who met all the inclusion criteria were included. Inclusion criteria were as follows: (1) patients who had abdominal CT imaging at Seoul National University Bundang Hospital between 2010 and 2019 and had follow-up images at the 5-year timepoint, (2) those who were aged between 50 and 80 years, and (3) those who were followed up for over a year. Further, patients who met any one of the exclusion criteria were excluded. The exclusion criteria and the number of excluded patients were as follows: (1) patients who were younger than 50 years or older than 80 years (n=2643), (2) those whose follow-up periods were less than a year (n=8029), and (3) those who had compression fractures or spinal surgery at the baseline (n=3258) ( Figure 1 ). Finally, 18,505 patients were included in the analysis. During follow-up, 693 patients experienced vertebral fractures, while the remaining 17,812 patients did not. Among the 693 patients, after excluding 85 patients owing to the poor image quality or inappropriate CT protocols, 608 patients remained as cases.

For the control group, we selected individuals from the same time frame as the fracture cases. Among 17,812 patients who did not experience fracture, after excluding 2141 patients with poor image quality, we selected 606 age-, sex-, and BMI-matched individuals at a ratio of 1 patient to 1 control within a similar follow-up period. The fracture events were determined by reviewing medical records, with efforts to exclude any fractures associated with trauma. If patients had multiple CT scans during the follow-up, the earliest CT scan was used.

As a result, 1214 patients were eligible for analysis and constituted the development set. In addition, we developed an external validation set of 495 patients from Seoul National University Boramae Hospital between 2012 and 2013, using the same protocol but without case-control matching. The external validation set was developed to assess the performance of the model; using the same protocol ensured consistency in evaluation while allowing for a broader application of the findings in real-world settings.


Ethics Approval

This study protocol was approved by the institutional review board of the Seoul National University Bundang Hospital (B-2104-677-402). The requirement for informed consent was waived owing to the retrospective design of this study. This study was conducted in accordance with the ethical standards laid down in the 1964 World Medical Association Declaration of Helsinki and its later amendments. This study also complies with the ethical principles for medical research.

Primary Outcome

The primary outcome of this study was the predictive accuracy of vertebral fracture events between T12 and L4 occurring within 5 years. Vertebral fractures were defined as morphometric fractures and were confirmed using radiographic or CT images. These images were adjudicated by SHK and JHK, who were blinded to any patient information prior to their assessments. Morphometric vertebral fractures were confirmed on radiographic or reconstructed CT images, with the anterior (Ha), middle (Hm), and posterior (Hp) heights of each vertebral body from T11 to L4 measured. Patients were classified as having no vertebral fracture when, on gross visual inspection, vertebral height and shape were within the normal range. The mean (SD) ratio of normal vertebral height was obtained from patients without incident fractures. We first calculated the anterior to posterior (Ha/Hp), middle to posterior (Hm/Hp), and posterior to posterior above and below (Hpi/Hpi+1 and Hpi/Hpi–1) ratios. A vertebral fracture was defined as present if any of these ratios was more than 3 SDs below the normal mean for that vertebral level, which was 0.91 (SD 0.08), as described in previous reports [ 18 , 19 ].
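A minimal sketch of this morphometric rule is shown below; the function and variable names are illustrative only, and it assumes vertebral heights are already measured in consistent units.

```python
# Flag a vertebra as fractured if any height ratio falls more than 3 SDs below
# the normal mean ratio (0.91, SD 0.08), as described above.
NORMAL_MEAN, NORMAL_SD = 0.91, 0.08
THRESHOLD = NORMAL_MEAN - 3 * NORMAL_SD          # = 0.67

def is_morphometric_fracture(ha: float, hm: float, hp: float,
                             hp_above: float, hp_below: float) -> bool:
    """Check anterior/middle/posterior height ratios against the 3-SD rule."""
    ratios = [ha / hp, hm / hp, hp / hp_above, hp / hp_below]
    return any(r < THRESHOLD for r in ratios)

# Example: a wedge-shaped vertebra with a markedly reduced anterior height.
print(is_morphometric_fracture(ha=18.0, hm=25.0, hp=28.0,
                               hp_above=28.5, hp_below=28.0))   # True (18/28 ~ 0.64)
```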

Measurements of Clinical Factors

Sociodemographic factors, including age, sex, and medical history, were obtained from a review of electronic medical records at baseline. Height and body weight were measured using standard methods by trained staff with a scale and wall-mounted extensometer, and the participants wore lightweight clothes. BMI was calculated as weight divided by height in meters squared (kg/m 2 ). Current smokers were defined as patients who were smoking during the study period, while current alcohol consumers were defined as those who consumed 3 or more units of alcohol daily. The use of glucocorticoids was defined as using oral glucocorticoids or having been exposed to oral glucocorticoids for more than 3 months at a prednisolone dose >5 mg or its equivalent doses. Possible secondary osteoporosis is defined as osteoporosis that occurs due to factors other than primary menopause or age-related causes. It includes patients with osteoporosis and concurrent diagnosis with type 1 diabetes, osteogenesis imperfecta in adulthood, hyperthyroidism, hypogonadism, premature menopause (age<45 years), chronic malnutrition, malabsorption, or chronic liver disease [ 20 ].

CT Protocols, Image Preprocessing, and Deep Learning Techniques

Intravenous contrast-enhanced images were obtained using CT scanners with 64 detector rows (Brilliance; Philips Medical Systems). All the patients were placed in a supine position and scanned from the diaphragm to the symphysis pubis. The reference tube current–time product was empirically set, aiming at effective radiation doses of 2 mSv. The effective tube current–time product generally ranges between 25 mA and 40 mA. The actual radiation dose was adjusted according to the body size by automatically modulating the tube current (Dose-Right; Philips Medical Systems). The values of tube voltage, collimation, rotation speed, and pitch were 120 kVp, 64 mm×0.625 mm, 0.5 seconds, and 0.891, respectively. Patients were administered 2 mL iopromide/kg (Ultravist 370; Schering) intravenously at a rate of 3 mL/s via the antecubital vein, and scanning was initiated 60 seconds after the enhancement of the descending aorta reached 150 HU. From each helical scan, the images were reconstructed using a section thickness of 5 mm.

Consecutive image processing was applied to all the CT images for accurate deep learning–based image analysis. Each axial slice of the abdominal CT scan was resampled to obtain a pixel spacing of 1×1 mm 2 . The signal intensity of each CT image was min-max normalized to the –1 to 1 range after windowing the Hounsfield unit values in the range of –200 to 1000. Subsequently, 2 classes of image data, that is, vertebral body only (bone-only) and vertebral body with paravertebral muscles (bone+muscle) were extracted from each CT image. Based on the manual annotations of the vertebral body, excluding the intervertebral disc, the vertebral body regions from T12 to L4 were extracted from the CT images, where each bone-only image had 96×96 pixels in the axial plane. Centered on the vertebral body, the images of the paravertebral muscle were automatically cropped using a rectangular box (96×144 pixels) in the axial plane (Figure S1 in Multimedia Appendix 1 ).
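A short sketch of this preprocessing is given below, assuming the CT volume is already loaded as a NumPy array of Hounsfield units; resampling to 1×1 mm pixel spacing and the manual vertebral annotations are omitted for brevity.

```python
import numpy as np

def window_and_normalize(hu: np.ndarray, lo: float = -200.0, hi: float = 1000.0) -> np.ndarray:
    """Clip Hounsfield units to [lo, hi] and min-max normalize to [-1, 1]."""
    clipped = np.clip(hu, lo, hi)
    return 2.0 * (clipped - lo) / (hi - lo) - 1.0

def crop_bone_muscle_patch(axial_slice: np.ndarray, cy: int, cx: int,
                           height: int = 96, width: int = 144) -> np.ndarray:
    """Crop a bone+muscle patch (96x144 px) centered on the vertebral body."""
    top, left = cy - height // 2, cx - width // 2
    return axial_slice[top:top + height, left:left + width]
```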

Deep learning–based image features for the 5-year risk analysis of vertebral fractures were extracted using the attention CNN–recurrent neural network (CNN-RNN) model for image data ( Figure 2 ) [ 21 , 22 ]. In the CNN-RNN model, ImageNet-pretrained ResNeXt-50 and gated recurrent units were employed as the CNN encoder backbone and the RNN recurrent decoder, respectively. As inputs of the model, 14 equidistant axial slices were extracted from the T12 to L4 vertebrae region of each CT image, where the starting slice was randomly selected at each training iteration for data augmentation. In the training phase, the CNN-RNN model was optimized using the Adam optimizer and cross-entropy loss, where the learning rate and batch size were 1e-5 and 128, respectively. In our image-based fracture prediction model, the utilization of CT images was primarily driven by deep learning methods, particularly CNNs. Although specific imaging parameters such as attenuation values and density were not directly used as standalone inputs, the CNN’s learning process inherently captured these aspects as part of the comprehensive image analysis. The model processed the entire CT images, extracting deep features that potentially included characteristics related to bone and muscle attenuation and density, among others. This approach allowed for a sophisticated interpretation of the CT scans, identifying nuanced patterns indicative of fracture risk. To incorporate clinical variables into our image-based prediction model, we first standardized the clinical variables to ensure consistency and comparability. Following this, we concatenated the standardized variables to the image features in the final layer of the CNN-RNN model.
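The sketch below illustrates one way such an attention CNN-RNN could be wired up in PyTorch: an ImageNet-pretrained ResNeXt-50 encoder over 14 axial slices, a GRU decoder over the slice sequence, and a simple additive attention over time steps. Layer sizes and the exact attention form are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn
import torchvision

class AttentionCnnRnn(nn.Module):
    def __init__(self, hidden_dim: int = 256, num_classes: int = 2):
        super().__init__()
        backbone = torchvision.models.resnext50_32x4d(weights="IMAGENET1K_V1")
        feat_dim = backbone.fc.in_features           # 2048-dim pooled CNN features
        backbone.fc = nn.Identity()
        self.encoder = backbone
        self.decoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.attention = nn.Linear(hidden_dim, 1)    # one score per slice
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, slices, 1, H, W); repeat the channel for the RGB backbone
        b, s, _, h, w = x.shape
        x = x.expand(-1, -1, 3, -1, -1).reshape(b * s, 3, h, w)
        feats = self.encoder(x).reshape(b, s, -1)    # (batch, slices, 2048)
        hidden, _ = self.decoder(feats)              # (batch, slices, hidden)
        weights = torch.softmax(self.attention(hidden), dim=1)
        pooled = (weights * hidden).sum(dim=1)       # attention-weighted summary
        return self.classifier(pooled)

model = AttentionCnnRnn()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)   # learning rate as stated in the text
criterion = nn.CrossEntropyLoss()
logits = model(torch.randn(2, 14, 1, 96, 144))               # 14 axial slices per patient
```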

To understand how the model identifies and differentiates key areas for predicting vertebral fractures in CT images, we employed the gradient-weighted class activation mapping technique. This approach involves highlighting the most crucial regions within the images, marked by a bright red overlay, thereby revealing the model’s decision-making process and focal areas for classification ( Figure 3 ). The image models were developed with a high-performance computing server with 4 NVIDIA GeForce GTX 1080 Ti (NVIDIA) graphic processing units and the Ubuntu 16.04.4 operating system.


Statistical Analyses

For the baseline characteristics, continuous parameters were presented as means with standard deviations, depending on the distribution, and categorical data were presented as proportions. Comparisons between groups for continuous variables were analyzed using the 2-sided Student t test, whereas the χ 2 test was used for categorical variables. The area under the receiver operating characteristic curve (AUROC) was calculated to compare the preprocessed images. Cases predicted to have a fracture event that experienced one during follow-up were defined as true positive. Those predicted to have a fracture but who did not experience one were designated as false positive. Cases predicted to be free from fracture events but that experienced one during follow-up were defined as false negative. True-negative cases were those predicted to be free of fracture events with no fractures during follow-up. Sensitivity and specificity were calculated for each time series as follows: sensitivity = true positive / (true positive + false negative) and specificity = true negative / (true negative + false positive). The risk prediction performance measures were gauged using 10-fold cross-validation.
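The sketch below restates these definitions in code, assuming binary labels and predicted probabilities from any of the models; the toy arrays are placeholders.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])            # 1 = fracture within 5 years
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3])
y_pred = (y_prob >= 0.5).astype(int)

tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))
tn = int(np.sum((y_pred == 0) & (y_true == 0)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))

sensitivity = tp / (tp + fn)          # true positive / (true positive + false negative)
specificity = tn / (tn + fp)          # true negative / (true negative + false positive)
auroc = roc_auc_score(y_true, y_prob)
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, AUROC={auroc:.2f}")
```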

The image-only 5-year risk analyses of vertebral fractures were conducted by applying a fully connected layer, which generates binary prediction results, to the CNN-RNN feature extractor. In addition to the image-only model, clinical models (models A, B, C, and D) were developed by analyzing the corresponding clinical variables via XGBoost. The clinical variables included age, sex, BMI, use of steroids, smoking status, and possible secondary osteoporosis. Model A included age and sex as independent variables; model B additionally incorporated BMI; model C further included clinical variables such as the use of glucocorticoids, history of alcohol consumption, smoking, and possible secondary osteoporosis; and model D additionally included type 2 diabetes mellitus, HIV, hepatitis C infection [ 23 ], and renal failure [ 24 ].
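A hedged sketch of such a clinical-variable model is shown below, using XGBoost as in the text; the DataFrame column names are hypothetical placeholders standing in for the (numerically encoded) variables listed for model D.

```python
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

MODEL_D_FEATURES = ["age", "sex", "bmi", "steroid_use", "smoking",
                    "secondary_osteoporosis", "t2dm", "hiv", "hepatitis_c", "renal_failure"]

def fit_clinical_model(df: pd.DataFrame, features: list) -> tuple:
    """Fit and 10-fold cross-validate an XGBoost classifier for 5-year vertebral fracture."""
    X, y = df[features], df["fracture_within_5y"]
    clf = XGBClassifier(n_estimators=200, max_depth=3, eval_metric="logloss")
    auroc = cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean()
    return clf.fit(X, y), auroc
```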

The matching of cases and controls in our study was conducted based on age and sex. This process was facilitated using the propensity score matching method, implemented through the MatchIt package in R (version 4.1.2; R Foundation). PyTorch and Scikit-learn libraries from Python were used for the analyses. A P value <.05 was considered significant. Correction for multiple testing was not performed across models. Statistical analyses were performed using Python (version 3.8.10; Python Software Foundation). The programs used in the experiments were PyCharm (JetBrains s.r.o.) and Visual Studio (Microsoft Corp).

Clinical Characteristics

A total of 1709 individuals were included in the analysis. The participants were divided into a development set from Seoul National University Bundang Hospital (n=1214) and an external validation set (n=495) from Seoul National University Boramae Medical Center. As shown in Table 1 , the development set was older (mean 72.5, SD 7.9 years) than the external validation set (mean 67.6, SD 8.6 years), with a statistically significant difference ( P <.001). The proportion of females in the external validation set (390/495, 78.8%) was higher than that in the development set (839/1214, 69.1%), with this difference also being significant ( P <.001). However, no significant differences were observed in weight and BMI between the 2 sets. When considering lifestyle factors, there was a higher prevalence of current smokers (284/1214, 23.4% vs 28/495, 5.7%; P <.001) and current drinkers (236/1214, 19.4% vs 34/495, 6.9%; P <.001) in the development set than those in the external validation set, respectively. The use of steroids was similar across both groups ( P =.56), while the prevalence of possible secondary osteoporosis was significantly higher in the development set (122/1214, 10.1% vs 16/495, 3.2%, respectively; P <.001). Within 5 years of follow-up, 454 (37.4%) and 61 (12.3%) individuals experienced vertebral fractures in the development and external validation sets, respectively.

In the development set (n=1214), participants were matched based on age, sex, and BMI to compare those with incident fractures (n=608) to those without (n=606). There was a less than 1-year age difference (mean age 72.0, SD 7.6 years in the nonfracture group vs 72.9, SD 8.2 years in the fracture group; P <.001) and a BMI difference of less than 0.5 (mean BMI 23.7, SD 3.4 kg/m² in the nonfracture group vs 23.5, SD 3.5 kg/m² in the fracture group; P =.39). Gender distribution was balanced between the 2 groups (422/606, 69.6% females in the nonfracture group vs 417/608, 68.6% females in the fracture group; P =.96).

Despite these matched parameters, a higher prevalence of current smokers was noted in the fracture group than in the nonfracture group (169/608, 27.8% vs 115/606, 18.9%, respectively; P =.001) along with a significantly higher use of steroids (114/608, 18.8% vs 47/606, 7.8%, respectively; P <.001) and a greater prevalence of possible secondary osteoporosis (77/608, 12.7% vs 45/606, 7.4%, respectively; P =.001). No significant differences were observed in the height, weight, and current drinking status between the 2 groups ( Table 2 ).


Table 1. Clinical characteristics of the development and external validation sets.

Variable | Development set (n=1214) | External validation set (n=495) | P value (a)
Age (years), mean (SD) | 72.5 (7.9) | 67.6 (8.6) | <.001
Female, n (%) | 839 (69.1) | 390 (78.8) | <.001
Height (cm), mean (SD) | 157.0 (8.4) | 155.2 (7.9) | <.001
Weight (kg), mean (SD) | 58.2 (9.9) | 57.6 (9.7) | .22
BMI (kg/m²), mean (SD) | 23.6 (3.5) | 23.9 (3.6) | .18
Current smoker, n (%) | 284 (23.4) | 28 (5.7) | <.001
Current drinker, n (%) | 236 (19.4) | 34 (6.9) | <.001
Use of steroids (b), n (%) | 161 (13.3) | 57 (11.5) | .56
Possible secondary osteoporosis (c), n (%) | 122 (10.1) | 16 (3.2) | <.001
Vertebral fracture within 5 years, n (%) | 608 (50) | 61 (12.3) | <.001

(a) The variables between the groups were compared using the 2-sided Student t test for continuous variables and the χ 2 test for categorical variables.

(b) Use of steroids was defined as the use of prednisolone 5 mg daily or equivalent over 3 months.

(c) Possible secondary osteoporosis includes patients with osteoporosis and a concurrent diagnosis of type 1 diabetes, osteogenesis imperfecta in adulthood, hyperthyroidism, hypogonadism, premature menopause (<45 years), chronic malnutrition, malabsorption, or chronic liver disease.


Table 2. Clinical characteristics of the development set according to incident fracture status (a).

Variable | Incident fracture (-) (n=606) | Incident fracture (+) (n=608) | P value
Age (years), mean (SD) | 72.0 (7.6) | 72.9 (8.2) | <.001
Females, n (%) | 422 (69.6) | 417 (68.6) | .96
Height (cm), mean (SD) | 156.9 (8.5) | 157.1 (8.3) | .32
Weight (kg), mean (SD) | 58.29 (9.5) | 58.11 (10.4) | .23
BMI (kg/m²), mean (SD) | 23.7 (3.4) | 23.5 (3.5) | .39
Current smoker, n (%) | 115 (18.9) | 169 (27.8) | .001
Current drinker, n (%) | 113 (18.7) | 123 (20.2) | .92
Use of steroids, n (%) | 47 (7.8) | 114 (18.8) | <.001
Possible secondary osteoporosis (c), n (%) | 45 (7.4) | 77 (12.7) | .001

(a) The variables between the groups were compared using the 2-sided Student t test for continuous variables and the χ 2 test for categorical variables. Fracture (-) and (+) groups represent participants who did not and did experience fractures at 5 years of follow-up, respectively.

(c) Possible secondary osteoporosis includes patients with osteoporosis and a concurrent diagnosis of type 1 diabetes, osteogenesis imperfecta in adulthood, hyperthyroidism, hypogonadism, premature menopause (<45 years), chronic malnutrition, malabsorption, or chronic liver disease.

Comparisons Between the Performances of Image Models in Predicting Vertebral Fractures

As demonstrated in Table 3 , for the development set, the models using images that included both vertebral bone and paravertebral muscle showed significantly better AUROC, accuracy, and precision values compared to those using bone-only images. Specifically, the bone-only images had an AUROC of 0.677 (95% CI 0.674-0.680) and accuracy of 0.669 (95% CI 0.665-0.673). In contrast, the images including both bone and muscle exhibited an AUROC of 0.739 (95% CI 0.737-0.741) and accuracy of 0.719 (95% CI 0.715-0.722; all P <.001). The fracture risk assessment tool (FRAX) model for major osteoporotic fracture and hip fracture showed lower AUROCs of 0.557 and 0.563, respectively, indicating a significantly better performance of our image model (all P <.001).

Similar trends were observed in the external validation set, where bone-only images resulted in an AUROC of 0.815 (95% CI 0.806-0.824) and accuracy of 0.754 (95% CI 0.752-0.756), while the combined bone and muscle images demonstrated an AUROC of 0.827 (95% CI 0.821-0.833) and accuracy of 0.812 (95% CI 0.798-0.826; all P <.001), though the specificity value was similar between the 2 groups. The FRAX model for major osteoporotic fracture and hip fracture had AUROCs of 0.810 and 0.780, respectively. Again, these results confirmed the superior predictive capability of our image-based model (all P <.001).


Table 3. Performance of the image models (bone only vs bone + muscle) in predicting vertebral fractures.

Development set:
Metric | Bone only | Bone + muscle | P value
AUROC (a) (95% CI) | 0.677 (0.674-0.680) | 0.739 (0.737-0.741) | <.001
Accuracy (95% CI) | 0.669 (0.665-0.673) | 0.719 (0.715-0.722) | <.001
Sensitivity (95% CI) | 0.746 (0.739-0.753) | 0.761 (0.746-0.776) | .23
Specificity (95% CI) | 0.601 (0.586-0.616) | 0.634 (0.625-0.643) | .002

External validation set:
Metric | Bone only | Bone + muscle | P value
AUROC (a) (95% CI) | 0.815 (0.806-0.824) | 0.827 (0.821-0.833) | .04
Accuracy (95% CI) | 0.754 (0.752-0.756) | 0.812 (0.798-0.826) | <.001
Sensitivity (95% CI) | 0.645 (0.613-0.677) | 0.704 (0.675-0.733) | .054
Specificity (95% CI) | 0.844 (0.810-0.877) | 0.855 (0.835-0.875) | .43

(a) AUROC: area under the receiver operating characteristic curve.

Comparisons Between the Performances of Image and Clinical Models

Compared to the clinical models, the image model using vertebral bone and muscle showed significantly higher performance than the clinical models in predicting the vertebral fractures during the 5-year follow-up period in the development and external validation sets ( Figure 4 , Table 4 ). In the development set, the images that included vertebral bone and muscle had significantly better AUROC and accuracy than the clinical model D, which included age, sex, BMI, history of alcohol consumption, smoking, possible secondary osteoporosis, type 2 diabetes mellitus, HIV, hepatitis C infection status, and renal failure (AUROC 0.667, 95% CI 0.661-0.672 and accuracy 0.640, 95% CI 0.661-0.649; all P <.001, Table 4 ). In addition, the performance did not show a significant change when the clinical variables were added to the image-only model (Table S1 in Multimedia Appendix 1 ).



Table 4. Performance of the image-only model, clinical models, and FRAX (g) in predicting vertebral fractures in the development and external validation sets.

Development set:
Model | AUROC (95% CI) | P value | Accuracy (95% CI) | P value | Sensitivity (95% CI) | P value | Specificity (95% CI) | P value
Image-only (b) | 0.739 (0.737-0.741) | Reference | 0.719 (0.716-0.722) | Reference | 0.761 (0.746-0.776) | Reference | 0.634 (0.625-0.643) | Reference
Clinical model A (c) | 0.647 (0.643-0.651) | <.001 | 0.620 (0.614-0.626) | <.001 | 0.681 (0.643-0.719) | .03 | 0.575 (0.549-0.601) | <.001
Clinical model B (d) | 0.631 (0.626-0.636) | <.001 | 0.612 (0.610-0.614) | <.001 | 0.675 (0.639-0.711) | .02 | 0.558 (0.517-0.598) | .003
Clinical model C (e) | 0.663 (0.659-0.667) | <.001 | 0.637 (0.631-0.643) | <.001 | 0.723 (0.694-0.752) | .11 | 0.553 (0.521-0.585) | <.001
Clinical model D (f) | 0.667 (0.661-0.672) | <.001 | 0.640 (0.661-0.649) | <.001 | 0.729 (0.690-0.768) | .13 | 0.560 (0.527-0.593) | .005
FRAX (MOF (h)) (i) | 0.557 | <.001 | 0.557 | <.001 | 0.442 | <.001 | 0.672 | <.001
FRAX (hip) (i) | 0.563 | <.001 | 0.556 | <.001 | 0.449 | <.001 | 0.663 | <.001

External validation set:
Model | AUROC (95% CI) | P value | Accuracy (95% CI) | P value | Sensitivity (95% CI) | P value | Specificity (95% CI) | P value
Image-only | 0.827 (0.821-0.833) | Reference | 0.812 (0.798-0.826) | Reference | 0.704 (0.675-0.733) | Reference | 0.855 (0.834-0.875) | Reference
Clinical model A | 0.731 (0.725-0.737) | <.001 | 0.651 (0.629-0.673) | <.001 | 0.715 (0.683-0.747) | <.001 | 0.656 (0.629-0.683) | <.001
Clinical model B | 0.733 (0.725-0.737) | <.001 | 0.654 (0.625-0.683) | <.001 | 0.728 (0.673-0.783) | <.001 | 0.662 (0.621-0.703) | <.001
Clinical model C | 0.745 (0.733-0.757) | <.001 | 0.669 (0.646-0.692) | <.001 | 0.713 (0.678-0.748) | <.001 | 0.720 (0.689-0.751) | <.001
Clinical model D | 0.749 (0.736-0.762) | <.001 | 0.675 (0.643-0.707) | <.001 | 0.729 (0.690-0.768) | <.001 | 0.686 (0.650-0.722) | <.001
FRAX (MOF) (i) | 0.810 | <.001 | 0.810 | <.001 | 0.262 | <.001 | 0.887 | <.001
FRAX (hip) (i) | 0.780 | <.001 | 0.685 | <.001 | 0.705 | <.001 | 0.682 | <.001

(b) Image model represents the model using bone and muscle.

(c) Model A includes age and sex.

(d) Model B additionally includes BMI.

(e) Model C additionally includes history of drinking, smoking, and possible secondary osteoporosis.

(f) Model D includes age, sex, BMI, history of alcohol consumption, smoking, possible secondary osteoporosis, type 2 diabetes mellitus, HIV, hepatitis C infection status, and renal failure.

(g) FRAX: fracture risk assessment tool.

(h) MOF: major osteoporotic fracture.

(i) Since this was calculated for a single data set, there are no 95% CI values.

As depicted in Figure 4 , in the external validation set, the images including vertebral bone and muscle showed a significantly better AUROC and accuracy than the clinical model D (AUROC 0.749, 95% CI 0.736-0.762 and accuracy 0.675, 95% CI 0.643-0.707; all P <.001). The results were similar for clinical models A, B, C, and D, which showed poorer performance than the image model.

In this study, we developed and externally validated a vertebral fracture prediction model by using abdominal CT images. In the development cohort, the performance of predicting vertebral fractures represented by AUROC was 0.688 (SD 0.001) by using images of vertebral bone-only and 0.736 (SD 0.003) by using images of vertebral bone and paravertebral muscle. In the validation cohort, the performances (AUROC) were 0.698 (SD 0.001) and 0.729 (SD 0.002) for images of vertebral bone-only and images of vertebral bone and paravertebral muscle, respectively. In addition, the performance of the model using images of vertebral bone and muscle was significantly better than that of the clinical models using age, sex, BMI, use of steroids, smoking status, and possible secondary osteoporosis, which showed performances of 0.635 (SD 0.002) and 0.698 (SD 0.021), respectively, for the development and validation cohorts.

Our model shows that the image models using vertebral bone and muscle had a better performance than those using images of vertebral bone-only. Osteosarcopenia, defined by combined occurrence of bone loss and sarcopenia, is one of the critical risk factors for osteoporotic fractures [ 25 , 26 ]. The paravertebral muscles are essential components of the vertebral column and are associated with osteoporotic vertebral fractures [ 27 , 28 ]. In previous studies, information retrieved from muscle images, such as cross-sectional area, volume, and degree of fat infiltration in the paravertebral muscle, was correlated with vertebral stability and the risk of fractures [ 28 , 29 ]. Specifically, Kim et al [ 30 ] reported lower cross-sectional areas and greater fat infiltration of the paravertebral muscles in patients with vertebral fractures than in those without fractures. This implies that not only the density and quality of the bones are correlated with the risk of fractures but also the quality of the muscles supporting and communicating with the bones [ 17 ]. Fat infiltration in the muscles, called myosteatosis, has been reported to be associated with an increased risk of fractures [ 17 , 31 ]. Thus, in line with previous studies, our study results imply that information from the images of the paravertebral muscles in addition to the information from the images of vertebral bones can help predict vertebral fractures more accurately.

Further, the image-based learning model with images of both vertebral bone and muscle showed better performance than the clinical variable–based models. This finding is consistent with a previous report that showed that information from the images of vertebral bones and muscles from CT scans can be used to predict major osteoporotic fractures and is comparable with FRAX [ 32 ]. Another group reported different algorithms by using opportunistic CT-based bone assessments for osteoporotic fracture prediction [ 33 ]. They showed that CT-based predictors (vertebral compression fractures, simulated DXA T-scores, and lumbar trabecular density) with metadata of age and sex showed better performance in AUROC than FRAX [ 33 ]. However, in that model, muscle information was not considered [ 33 ], which may further improve the performance. In addition to the attenuation information, we used information from the image itself on the quality of the bone and muscle structure, similar to the trabecular bone score [ 13 ]. The trabecular bone score is an algorithm used to calculate the microstructure of the bone based on DXA images [ 34 ]. More than 50% of the osteoporotic fractures occur in patients with a normal or osteopenic range of BMD [ 35 ], which implies that the microarchitecture of the bone is also a key determinant of bone strength [ 36 ]. Similarly, in our study, the model used the information on the qualities of bones and muscles from CT images, demonstrating the potential value of CT images that may include rich and various informative data for the metabolic diseases of bones and muscles.

We also observed that the performance did not significantly change when clinical variables were added to the image-only model. There is a possibility that information such as age and gender could already be reflected to some extent in the image itself [ 37 ]. Therefore, there could be an insignificant improvement in performance because this information poses a redundant input to the model. It is widely accepted that there is a noticeable sex difference in the size of the vertebral body and paravertebral muscles [ 37 ], and BMI could be positively correlated with the size of the vertebrae and muscles. Moreover, although based on high-resolution peripheral quantitative CT, previous work has shown that each bone has different characteristics according to age and sex, such as calcification and size, which could have influenced our analysis [ 38 ]. Vertebral endplate calcification also increases with age, implying that age information can be reflected in the image [ 39 ]. Finally, as reported in previous studies, smoking and alcohol consumption status can be associated with low muscle mass [ 40 ], which may explain why adding simple clinical variables to the images did not significantly improve the model: the image already contains some of this clinical information. The results are clinically promising and can be utilized in the future, as opportunistic CT scans alone, without detailed clinical variables, may automatically provide the risk of osteoporotic fractures.

To extract pertinent information from each CT scan, we designed an image-only model to prevent overfitting and to focus on the essential regions. Since 3D CNN models, which have a large number of parameters to be optimized, tend to overfit the training data [ 41 ], the CNN encoder of our model took consecutive 2D images as its input data while keeping their sequential information with the RNN decoder [ 42 ]. The input processing strategy served as a robust data augmentation method because our model could exploit different 2D image sets from a single 3D CT scan at each training iteration. In addition, an attention module was applied to the CNN encoder to further enhance its robustness. The attention module automatically guided the image model to concentrate on essential regions [ 43 ] for the prediction of vertebral fractures. Thus, the attention CNN-RNN model avoids making predictions based on background regions, except for the vertebral body and paravertebral muscles. Unlike previous CNN model–based deep learning algorithms, which were limited to 2D X-ray analysis or bone texture analysis, our CNN-RNN model showed robust performance in fracture prediction. Owing to its design to mitigate the overfitting problem of conventional 3D CNN models [ 42 ], the CNN-RNN model could extract effective information from 3D CT images, which were intractable in previous approaches. In addition, the attention module forced our model to focus on important regions in the CT images by removing the effects of the background regions [ 43 ].

Our study has several limitations. The data set did not contain BMD owing to the retrospective study design, although BMD is an essential predictor of osteoporotic fracture; it was therefore difficult to compare a clinical model containing BMD with the image model. The model provides 5-year fracture prediction instead of 10-year prediction owing to the follow-up duration of the data set, which is relatively short for utilization in real-world practice. Due to this short time frame, we also could not show results for nonvertebral fractures because the number of cases was too small. In addition, the paravertebral muscles were included without distinction among the psoas, intervertebral, multifidus, longissimus, iliocostalis, and quadratus lumborum muscles; therefore, it is difficult to interpret the contribution of each muscle. The number of images in the development set may not be sufficient for model optimization. Moreover, because the model was based on contrast-enhanced CT scans, its utility in other contrast settings could be low. There was also a disparity in vertebral fracture incidence between the development and the external validation sets, which may affect the external validity and generalizability of our fracture prediction model. The retrospective nature inherently carries the potential for selection bias, including confounding by indication; although we employed propensity score adjustment to mitigate this bias, residual bias may still be present. Another limitation is the exclusion of radiographic imaging data of poor quality from our models. This decision might have introduced detection bias, such that it may have impacted the diagnostic accuracy of our models in correctly identifying positive versus negative fracture cases. Further, we could not assess the reproducibility of these measurements through interexaminer and intraexaminer κ value assessments, which could be considered a limitation of our study. Future prospective studies could benefit from including such reproducibility assessments.

Our study has several strengths. Our study was longitudinally designed to observe future fracture events in patients who did not have baseline fractures. Furthermore, in the development cohort, we used controls with matched clinical variables, which made it possible to attenuate the effects of the major clinical variables in the model. It was also externally validated, which helped prove the generalizability of the model. In addition, the model used the image itself as an input, which made it possible to utilize the information on vertebral bone and muscle quality and quantity. This inclusion of the muscle image reflected the interplay between muscle health and fracture risk. For instance, factors such as muscle mass and muscle steatosis, which are visible in CT images as darker and more heterogeneous areas compared to normal muscle, could be crucial inputs. These muscle attributes, automatically analyzed by the CNN, contribute significantly to the model’s ability to discern patients at higher risk of fractures, offering a more comprehensive view than bone analysis alone. In addition, by sequentially applying bones and muscles to the model, it was possible to check the degree of contribution of muscles and bones to the model performance, thereby increasing the interpretability of the model. In addition, the differences in the clinical characteristics between development and external validation sets were purposefully leveraged to assess the generalizability of our model across populations with varying clinical profiles.

In this study, we showed that a deep learning model of the CNN-RNN structure based on CT images of the muscle and vertebral bone could help predict the risk of vertebral fractures. The model using images of the vertebral bone and muscle showed better performance than the model using images of the vertebral bone-only. This implies that the information from the muscle images provides additional key information for predicting fractures. In addition, the model using images showed better performance than the model using clinical variables, suggesting that images can provide useful information in addition to having known clinical variables. This study has clinical significance in suggesting that opportunistic CT screening with deep learning algorithms utilizing bone and muscle images may contribute to identifying patients with a high fracture risk in the future. Further prospective studies are needed to broaden the applicability of our model.

Acknowledgments

The study was funded by the National Research Foundation of Korea (grants 2020R1A2C2011587 and 2021R1A2C2003410).

Data Availability

The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.

Conflicts of Interest

None declared.

Multimedia Appendix 1: Supplementary table and figure.

  • Tran O, Silverman S, Xu X, Bonafede M, Fox K, McDermott M, et al. Long-term direct and indirect economic burden associated with osteoporotic fracture in US postmenopausal women. Osteoporos Int. Jun 2021;32(6):1195-1205.
  • Williams SA, Daigle SG, Weiss R, Wang Y, Arora T, Curtis JR. Economic burden of osteoporosis-related fractures in the US Medicare population. Ann Pharmacother. Jul 2021;55(7):821-829.
  • Ahn SH, Park S, Park SY, Yoo J, Jung H, Nho J, et al. Osteoporosis and osteoporotic fracture fact sheet in Korea. J Bone Metab. Nov 2020;27(4):281-290.
  • Stone K, Seeley D, Lui L, Cauley J, Ensrud K, Browner W, et al. Osteoporotic Fractures Research Group. BMD at multiple sites and risk of fracture of multiple types: long-term results from the Study of Osteoporotic Fractures. J Bone Miner Res. Nov 2003;18(11):1947-1954.
  • Amarnath ALD, Franks P, Robbins JA, Xing G, Fenton JJ. Underuse and overuse of osteoporosis screening in a regional health system: a retrospective cohort study. J Gen Intern Med. Dec 2015;30(12):1733-1740.
  • CT market outlook report. IMV. Jan 01, 2022. URL: https://imvinfo.com/product/2022-ct-market-outlook-report/ [accessed 2024-06-26]
  • Lenchik L, Weaver AA, Ward RJ, Boone JM, Boutin RD. Opportunistic screening for osteoporosis using computed tomography: state of the art and argument for paradigm shift. Curr Rheumatol Rep. Oct 13, 2018;20(12):74.
  • Pickhardt PJ, Pooler BD, Lauder T, del Rio AM, Bruce RJ, Binkley N. Opportunistic screening for osteoporosis using abdominal computed tomography scans obtained for other indications. Ann Intern Med. Apr 16, 2013;158(8):588-595.
  • Alacreu E, Moratal D, Arana E. Opportunistic screening for osteoporosis by routine CT in Southern Europe. Osteoporos Int. Mar 2017;28(3):983-990.
  • Dutta P, Upadhyay P, De M, Khalkar R. Medical image analysis using deep convolutional neural networks: CNN architectures and transfer learning. 2020. Presented at: 2020 International Conference on Inventive Computation Technologies (ICICT); Feb 26-28:175-180; Coimbatore, India.
  • Kong SH, Lee J, Bae BU, Sung JK, Jung KH, Kim JH, et al. Development of a spine X-ray-based fracture prediction model using a deep learning algorithm. Endocrinol Metab. Aug 2022;37(4):674-683.
  • Derkatch S, Kirby C, Kimelman D, Jozani MJ, Davidson JM, Leslie WD. Identification of vertebral fractures by convolutional neural networks to predict nonvertebral and hip fractures: a registry-based cohort study of dual x-ray absorptiometry. Radiology. Nov 2019;293(2):405-411.
  • Kong SH, Hong N, Kim J, Kim DY, Kim JH. Application of the trabecular bone score in clinical practice. J Bone Metab. May 2021;28(2):101-113.
  • Kong SH, Shin CS. Applications of machine learning in bone and mineral research. Endocrinol Metab. Oct 2021;36(5):928-937.
  • Muehlematter UJ, Mannil M, Becker AS, Vokinger KN, Finkenstaedt T, Osterhoff G, et al. Vertebral body insufficiency fractures: detection of vertebrae at risk on standard CT images using texture analysis and machine learning. Eur Radiol. May 2019;29(5):2207-2217.
  • Zhang S, Chen H, Xu H, Yi Y, Fang X, Wang S. Computed tomography-based paravertebral muscle density predicts subsequent vertebral fracture risks independently of bone mineral density in postmenopausal women following percutaneous vertebral augmentation. Aging Clin Exp Res. Nov 2022;34(11):2797-2805.
  • Kim H, Kim C. Quality matters as much as quantity of skeletal muscle: clinical implications of myosteatosis in cardiometabolic health. Endocrinol Metab. Dec 2021;36(6):1161-1174.
  • Genant H, Wu C, van Kuijk C, Nevitt M. Vertebral fracture assessment using a semiquantitative technique. J Bone Miner Res. Sep 1993;8(9):1137-1148.
  • Shin CS, Kim MJ, Shim SM, Kim JT, Yu SH, Koo BK, et al. The prevalence and risk factors of vertebral fractures in Korea. J Bone Miner Metab. Mar 2012;30(2):183-192.
  • Kanis J, Johansson H, McCloskey E, Liu E, Åkesson KE, Anderson F, et al. Previous fracture and subsequent fracture risk: a meta-analysis to update FRAX. Osteoporos Int. Dec 2023;34(12):2027-2045.
  • Wang J, Yang Y, Mao J, Huang Z, Huang C, Xu W. CNN-RNN: A unified framework for multi-label image classification. 2016. Presented at: Proceedings of the IEEE conference on computer vision and pattern recognition; March 20; Anaheim, CA.
  • Khaki S, Wang L, Archontoulis SV. A CNN-RNN framework for crop yield prediction. Front Plant Sci. 2019;10:1750.
  • Dong H, Cortés YI, Shiau S, Yin M. Osteoporosis and fractures in HIV/hepatitis C virus coinfection: a systematic review and meta-analysis. AIDS. Sep 10, 2014;28(14):2119-2131.
  • Ganesan K, Jandu J, Anastasopoulou C, et al. Secondary osteoporosis. StatPearls [Internet]. Apr 2, 2023:1-20.
  • Teng Z, Zhu Y, Teng Y, Long Q, Hao Q, Yu X, et al. The analysis of osteosarcopenia as a risk factor for fractures, mortality, and falls. Osteoporos Int. Nov 2021;32(11):2173-2183.
  • Cedeno-Veloz B, López-Dóriga Bonnardeauxa P, Duque G. [Osteosarcopenia: A narrative review]. Rev Esp Geriatr Gerontol. 2019;54(2):103-108.
  • Yagi M, Hosogane N, Watanabe K, Asazuma T, Matsumoto M, Keio Spine Research Group. The paravertebral muscle and psoas for the maintenance of global spinal alignment in patient with degenerative lumbar scoliosis. Spine J. Apr 2016;16(4):451-458.
  • Habibi H, Takahashi S, Hoshino M, Takayama K, Sasaoka R, Tsujio T, et al. Impact of paravertebral muscle in thoracolumbar and lower lumbar regions on outcomes following osteoporotic vertebral fracture: a multicenter cohort study. Arch Osteoporos. Jan 03, 2021;16(1):2.
  • Choi M, Kim S, Park C, Malla H, Kim S. Cross-sectional area of the lumbar spine trunk muscle and posterior lumbar interbody fusion rate: a retrospective study. Clin Spine Surg. Jul 2017;30(6):E798-E803.
  • Kim JY, Chae SU, Kim GD, Cha MS. Changes of paraspinal muscles in postmenopausal osteoporotic spinal compression fractures: magnetic resonance imaging study. J Bone Metab. Nov 2013;20(2):75-81.
  • Lang T, Cauley J, Tylavsky F, Bauer D, Cummings S, Harris T, et al. Health ABC Study. Computed tomographic measurements of thigh muscle cross-sectional area and attenuation coefficient predict hip fracture: the health, aging, and body composition study. J Bone Miner Res. Mar 2010;25(3):513-519.
  • Pickhardt PJ, Graffy PM, Zea R, Lee SJ, Liu J, Sandfort V, et al. Automated abdominal CT imaging biomarkers for opportunistic prediction of future major osteoporotic fractures in asymptomatic adults. Radiology. Oct 2020;297(1):64-72.
  • Dagan N, Elnekave E, Barda N, Bregman-Amitai O, Bar A, Orlovsky M, et al. Automated opportunistic osteoporotic fracture risk assessment using computed tomography scans to aid in FRAX underutilization. Nat Med. Jan 2020;26(1):77-82.
  • Pothuaud L, Carceller P, Hans D. Correlations between grey-level variations in 2D projection images (TBS) and 3D microarchitecture: applications in the study of human trabecular bone microarchitecture. Bone. Apr 2008;42(4):775-787.
  • Siris ES, Chen Y, Abbott TA, Barrett-Connor E, Miller PD, Wehren LE, et al. Bone mineral density thresholds for pharmacological intervention to prevent fractures. Arch Intern Med. May 24, 2004;164(10):1108-1112.
  • Silva B, Leslie W, Resch H, Lamy O, Lesnyak O, Binkley N, et al. Trabecular bone score: a noninvasive analytical method based upon the DXA image. J Bone Miner Res. Mar 2014;29(3):518-530.
  • Gilsanz V, Boechat MI, Gilsanz R, Loro ML, Roe TF, Goodman WG. Gender differences in vertebral sizes in adults: biomechanical implications. Radiology. Mar 1994;190(3):678-682.
  • Kazakia GJ, Nirody JA, Bernstein G, Sode M, Burghardt AJ, Majumdar S. Age- and gender-related differences in cortical geometry and microstructure: Improved sensitivity by regional analysis. Bone. Feb 2013;52(2):623-631.
  • Bernick S, Cailliet R. Vertebral end-plate changes with aging of human vertebrae. Spine (Phila Pa 1976). 1982;7(2):97-102.
  • Zhong J, Xie W, Wang X, Dong X, Mo Y, Liu D, et al. The prevalence of sarcopenia among Hunan province community-dwelling adults aged 60 years and older and its relationship with lifestyle: diagnostic criteria from the Asian Working Group for Sarcopenia 2019 update. Medicina (Kaunas). Oct 30, 2022;58(11):1562.
  • Kumawat S, Raman S. Lp-3dcnn: Unveiling local phase in 3D convolutional neural networks. 2019. Presented at: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; January 9; Long Beach, CA, USA. [ CrossRef ]
  • Chung J, Gulcehre C, Cho K, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. ArXiv. Preprint posted online on December 11, 2014. [ CrossRef ]
  • Jetley S, Lord N, Lee N, Torr P. Learn to pay attention. ArXiv. Preprint posted online on April 26, 2018. 2018. [ CrossRef ]

Abbreviations

AUROC: area under the receiver operating characteristic curve
BMD: bone mineral density
CNN: convolutional neural network
CT: computed tomography
DXA: dual-energy X-ray absorptiometry
FRAX: fracture risk assessment
RNN: recurrent neural network

Edited by A Mavragani; submitted 27.04.23; peer-reviewed by M Liu, SH Lee, R Mpofu; comments to author 07.12.23; revised version received 27.01.24; accepted 30.05.24; published 12.07.24.

©Sung Hye Kong, Wonwoo Cho, Sung Bae Park, Jaegul Choo, Jung Hee Kim, Sang Wan Kim, Chan Soo Shin. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 12.07.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.

Talent at a turning point: How people analytics can help

Employees are quitting. The talent gap is widening. And leaders are grappling with the hybrid dilemma—what an imminent return to the office might look like and why.

In this episode of McKinsey Talks Talent, HR expert David Green, coauthor (with Jonathan Ferrar) of Excellence in People Analytics (Kogan Page, July 2021), speaks with McKinsey’s Bryan Hancock and Bill Schaninger about a talent market in the throes of change—and how HR leaders can use people analytics to navigate the current inflection point successfully.

The McKinsey Talks Talent podcast is hosted by Lucia Rahilly.

HR in the spotlight

Lucia Rahilly: So, David, we are closing in on two years of the COVID-19 pandemic. It’s obviously been a massively challenging crisis that has put both lives and livelihoods at risk for employees across the globe. What has this crisis meant for the role of HR?

David Green: Well, I suppose it’s been HR’s chance to shine, and in many companies, it has. There’s an elevated role for the CHRO (chief human resources officer), which means more expectations for the function, a thirst for data to drive decisions around people, more interest from the C-level, and more demands from the C-level as well. Good people analytics teams have been more focused on employees and on understanding how employees are feeling at the various stages of the pandemic, and are then building that into their approach to hybrid work.

Bryan Hancock: I think that the role of HR and the role of the CHRO is going to continue to be elevated for the next few years. The pandemic was a unique human event that affected people as individuals. Now, as we come back and adjust to the new normal, HR has an opportunity to continue to step up, to keep innovating, and to guide with data, facts, and insights, not just intuition.

Bill Schaninger: There’s a little bit of “watch what you wish for.” Because now—while HR is unbelievably front and center, with critical roles and critical pools and “how are we going to respond to the return to the office,” et cetera—that same light shines on deficiencies in the function.

Where maybe in the past HR might’ve carried some folks who were pleasant or good order takers or good caretakers, what HR is demanding now is being numerate, understanding the value tree, really knowing how to use analytics—all that kind of stuff. It’s all laid bare now. So, it is a wonderful time for the function, but the bar for individuals has been raised dramatically.

Rising resignations, rising pressures

Lucia Rahilly: In the US, we hear so much about what some are calling the Great Resignation and we are calling the Great Attrition: employees reassessing their priorities and quitting their jobs at record rates.

David Green: There’s a lot of column inches devoted to it here, and in Europe as well, although maybe not quite as much as in the US. We work with around 90 large global organizations, about half of them headquartered in the US. The person I’m speaking to is usually the head of people analytics. They’ve had a lot of panicked executives saying, “Oh my God, everyone’s leaving.” When they actually look at their data, in most cases it’s not more than they would normally expect it to be. They’re seeing numbers that are maybe a little bit higher, in some cases, than in 2019—but that is roughly what we would expect as a correction from 2020. Certainly, in most cases, the companies that I’ve spoken to are not seeing numbers that justify the panic at the moment.

Bryan Hancock: I think what we are seeing is that some people are choosing to leave the workforce and not necessarily go to another job. When you look in the US at workforce participation rates, they’re down. If you look at who is leaving, it’s disproportionately women, and it is disproportionately people toward retirement age. When we start looking at other populations, that’s where you need a real focus on the facts and the insights that let you solve the problems around flexibility, comfort of coming to work—whatever it may be. Which is to say, let’s not necessarily paint with a broad brush. Although we do see dissatisfaction broadly, let’s really dive into who’s leaving and why.

Bill Schaninger: One of the things I’ve been toying with, and I don’t know that we have a great answer, is do we have to do a fundamental reset, almost, on the offer? We’re facing this moment where there’s been a fixation on wages, but even as the wages have gone up, in many cases to $25, $30 an hour, you’re missing the point, which is “who I work for, the conditions I work under, the nature of the interactions—it has to be better.” I’m curious as to your experience of that part. It’s beyond the data, you know what I mean? This idea of a higher calling.

David Green: I definitely think there’s a purpose that people want to have at work, and that’s now coming out. Employee expectations have gone up. It’s been happening for a while, but maybe the pandemic acts as a bit of a catalyst for it. What’s fascinating is some of the research that you’ve been doing at McKinsey. There’s a growing disconnect around the return to work between executives and employees who aren’t ready yet. Generally, large numbers of people want more hybrid work moving forward.

I wonder if one of the consequences of the Great Resignation and all the press around it is that maybe some of these executives will start to be a bit more flexible and come closer to what employees are looking for in the hybrid workplace, which actually will benefit executives in the long run. Maybe there’ll be a good consequence of all the column inches that have been written about the Great Resignation.

I like the way that you guys have kind of reframed it as the Great Attraction, depending on the way a company approaches it.

Lucia Rahilly: Bill or Bryan, are you seeing that shift in mindset among employers—toward embracing, or at least being more accepting of, a hybrid culture?

Bill Schaninger: I think two-thirds are still in the stage of “it’s either transitory or slowly they’ll come to their senses, and we’re going to bring them back.” Maybe a third are wrapping their heads around the hybrid model and saying, “Well, this could be pretty interesting.”

What people analytics can do

Lucia Rahilly: David, tell us a little bit about what we’re talking about when we discuss people analytics and how it helps HR leaders improve retention during this interval of churn.

David Green: Excellence in People Analytics has 30 case studies of real-life people analytics in companies. There are a couple that touch on attrition. What can people analytics do? I think the key thing is to separate the signal from the noise. It can help organizations understand whether they actually have a problem with attrition and, if so, where: Which job families? Which locations? Is it people at a particular tenure? Is it certain groups? As Bryan said, it is women who are disproportionately leaving the workplace.

If attrition is a problem, what can you do about it? If it’s in parts of the business that you’re either looking to divest or invest in less, attrition can arguably be your friend. If it’s in areas of the business that you’re really trying to grow, and people are leaving and going to your competitors, then clearly it’s a problem that you want to try to address. But you need to understand why people are leaving—if they are leaving—before you can even think about what you can do to solve it.
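
To make that segmentation concrete, here is a minimal sketch, in Python with pandas, of how an analytics team might compare attrition by segment against a company-wide baseline. The frame, the column names (job_family, tenure_band, left_in_period), and the numbers are all hypothetical illustrations, not figures from the discussion.

    import pandas as pd

    # Hypothetical HR snapshot: one row per employee, with a flag indicating
    # whether the employee left during the reporting period.
    hr = pd.DataFrame({
        "job_family":     ["Nursing", "Nursing", "Engineering", "Engineering", "Sales", "Sales"],
        "location":       ["East", "West", "East", "West", "East", "West"],
        "tenure_band":    ["0-2y", "2-5y", "0-2y", "5y+", "0-2y", "2-5y"],
        "left_in_period": [1, 0, 0, 1, 1, 0],
    })

    overall_rate = hr["left_in_period"].mean()

    # Attrition rate by segment, compared with the company-wide baseline,
    # to show where turnover is genuinely elevated rather than just noisy.
    by_segment = (
        hr.groupby(["job_family", "tenure_band"])["left_in_period"]
          .agg(headcount="count", attrition_rate="mean")
          .assign(vs_baseline=lambda d: d["attrition_rate"] - overall_rate)
          .sort_values("vs_baseline", ascending=False)
    )

    print(f"Overall attrition rate: {overall_rate:.1%}")
    print(by_segment)

The same pattern scales to any slicing variable; the point is simply to see where turnover is genuinely elevated before anyone reacts to an anecdote.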

Lucia Rahilly: Bryan, walk us through some of the Great Attrition/Great Attraction research that we did.

Bryan Hancock: There’s a disconnect between what an employer and an employee think the main issue is. The employer is saying, “Hey, people must be leaving for another job, a better job, and better pay.” Employees are saying, “No, I’m leaving because I don’t feel valued at work.” Even asking the right questions and getting the right frame can—before you get more advanced forms of analytics on it—bring a fact-based and broader lens to make sure we’re having the right conversation.

A really good people analytics function combines the broad view—the broad understanding of organizational research, the broad understanding that this is a field that’s been around for a while, and we know what motivates people—and then brings that to bear to highlight individual facts.

Bill Schaninger: We started getting the data back, and I said, “Isn’t it an interesting pattern here, all the things that the managers are saying are exogenous: ‘the employee is maximizing for the money, my competitor is being foolish about raising the floor.’” It’s everything that was outside them, that allowed them to point the finger at someone else. Managers should just hold up the mirror to themselves and realize that they’ve caused this environment where employees don’t feel valued. They don’t feel well looked after. They feel like they’re a piece of machinery.

I’m hopeful we can help managers without, maybe, poking them in the eye so much, but maybe it takes a little poke in the eye.

David Green: Microsoft has published some research that they’ve been doing during the pandemic. They found that managers are even more important in a remote or hybrid work environment. They need to be checking in, to be doing one-on-ones regularly. If they’re not, don’t be surprised if people get demotivated and decide to leave. Understanding that is the job of people analytics.

Then we can start doing something about attrition, which is a problem in organizations, and start to nudge managers and leaders around behaviors that will actually encourage people to stay because they feel valued, they feel looked after, they’re given a great employee experience. If you do these sorts of things, then people are going to be much less inclined to look elsewhere.

Yes, people sometimes will get a 40 percent pay raise on a new job. That’s just going to happen. There’s not much you can do about that. You can obviously make sure that you’re paying market rates or above-market rates, if that’s what you want to do. But I think that by creating the right culture in the organization and making people feel more valued, you can keep people more than you lose.

Bryan Hancock: The point of the research on the middle manager  is exactly what we’re seeing at our clients. In the course of the pandemic, what we saw is that some people are naturally very good managers—they knew how to check in, how to use the one-on-ones. Then, on the other end of the spectrum, there are some people that never checked in.

At one point during the pandemic, there was a survey, and 40 percent of the employees surveyed said that no one had called to check in on them—no manager, no individual. And those people were 40 percent more likely to be exhibiting some sign of mental distress. I think companies are now recognizing that and saying, “Hey, if the role of the manager got elevated during the pandemic, what does it mean in a hybrid world?” And a number of organizations are now saying, “Gosh, if it mattered when everybody was remote, doesn’t it matter at least as much, if not more, when we’ve got a mixed model, with some people in the office, some remote? Don’t we need to have those one-on-one coaching skills, as well as intentionality about when we’re all coming together as a team and when we’re separated?”

David Green: That, again, is where people analytics teams come in—listening to employees, conducting regular pulse surveys, looking at some of the passive data as well. By looking at some of the metadata, people analytics teams can see the managers who are checking in regularly with their employees and understand the behaviors that drive engagement, that drive performance from teams.
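
As a rough illustration of that metadata point, the sketch below counts one-on-one meetings per manager from calendar-style records and sets that against a pulse-survey engagement score. The tables, field names, and values are assumptions made up for illustration; they do not describe any particular analytics platform.

    import pandas as pd

    # Hypothetical calendar metadata: one row per one-on-one meeting.
    meetings = pd.DataFrame({
        "manager_id": ["m1", "m1", "m2", "m1", "m3"],
        "week":       [1, 2, 1, 3, 2],
    })

    # Hypothetical pulse-survey results: average engagement per manager's team.
    engagement = pd.DataFrame({
        "manager_id": ["m1", "m2", "m3"],
        "engagement_score": [4.2, 3.1, 3.4],
    })

    # Check-in frequency per manager over the observed period.
    checkins = (
        meetings.groupby("manager_id")
                .size()
                .rename("one_on_ones")
                .reset_index()
    )

    summary = checkins.merge(engagement, on="manager_id")
    print(summary)
    print("Correlation between check-ins and engagement:",
          round(summary["one_on_ones"].corr(summary["engagement_score"]), 2))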

Making a difference in diversity, equity, and inclusion

Lucia Rahilly: David, in our Women in the Workplace research, we saw that women managers were much likelier than men managers to call to see how their reports were doing. We also know from other research that women and people of color have been among the most affected during the pandemic and that people of color, in particular, are more likely than White employees to attribute quitting to a lack of a sense of belonging in the organization. Do you see analytics as playing a role in promoting diversity, equity, and inclusion in the workplace?

David Green: We conducted annual research among over a hundred organizations this year. And one of the questions we asked was, “What are the top three areas in your organization where people analytics is adding the most value?” Diversity, equity, and inclusion came out on top—54 percent of respondents included that in their top three. And that’s gone up significantly since we did that research last year.

Now we’re seeing that people analytics is really helping organizations move beyond counting diversity, to measuring inclusion. We’re still at the early stages of that, in many respects. Companies are starting to understand the importance of inclusion and belonging. They’re measuring it in surveys, and they’ve got people analytics teams that can be on top of that as well.

Second, through passive network analysis, you can start to understand the links and the strength of relationships within teams and between teams. I think that is helping. Leaders want to be better at diversity, equity, and inclusion and to meet the expectations of their employees. They also want their organizations to be better at it.
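
A tiny sketch of that idea, using the networkx library on a made-up collaboration graph, is shown below; it simply compares the weight of within-team ties with cross-team ties. Node names, team labels, and edge weights are hypothetical.

    import networkx as nx

    # Hypothetical collaboration graph built from communication metadata:
    # nodes are employees (with a team attribute), edge weights are
    # interaction counts over some period. All values are illustrative.
    G = nx.Graph()
    G.add_nodes_from([
        ("a", {"team": "Finance"}), ("b", {"team": "Finance"}),
        ("c", {"team": "HR"}),      ("d", {"team": "HR"}),
    ])
    G.add_weighted_edges_from([
        ("a", "b", 12),  # within Finance
        ("c", "d", 9),   # within HR
        ("a", "c", 2),   # across teams
    ])

    within, between = 0, 0
    for u, v, w in G.edges(data="weight"):
        if G.nodes[u]["team"] == G.nodes[v]["team"]:
            within += w
        else:
            between += w

    print(f"Within-team interaction weight: {within}")
    print(f"Between-team interaction weight: {between}")
    print(f"Cross-team share: {between / (within + between):.0%}")

In practice such a graph would be built from anonymized communication or calendar metadata, with the privacy safeguards discussed later in the conversation.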

Bill Schaninger: That pivot toward moving upstream and asking, “What’s the felt experience”—that really encouraged us to go back and look at how we were measuring inclusion, not just to ask a few “engagement-y” questions but instead to ask, “What’s your sense of the organization overall; what are you personally experiencing in your company and team and with your manager?”

From complex data to compelling story

Bill Schaninger: When you think about the advanced math, how do you get some of these insights without losing people in the math? At some of our clients, you get some really cool quant jocks, and they lose everyone on the third word.

David Green: It’s turning that complex math into a compelling story that’s going to resonate with whichever audience you’re delivering it to, and that shows the impact it has on the objective. I wonder if, in addition to the kind of active network analysis that’s been going on for years, we now have the technology to do this at scale by looking at some of the metadata. Of course, you need to be careful about the ethics and the privacy and make sure there’s a benefit for employees as well. But you’re right. You’ve got to take quite complicated insights and turn them into a compelling story that drives action you can then measure.

Bryan Hancock: We took our new inclusion assessment and put it in our inaugural Race in the Workplace survey. Our focus was on Black leaders in corporate America. What became clear is that more Black workers in corporate America were leaving before they ever got promoted. But the numbers were so small in absolute terms—maybe out of every hundred workers, you might’ve had one or two more Black workers than expected leave and one or two fewer White workers.

So from an awareness standpoint, an individual manager wouldn’t pick it up. But when you look at the data you say, “This is like an invisible revolving door. What’s going on in there?” That’s something that makes an executive say, “OK, I now know what I need to do with our new entry-level diverse talent. I know I need to focus on that. Now, let me go back and figure out the next level of detail. What are the levels of initiatives? How do I check up on this? How do I follow up?”
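
One way to check whether a gap that small is more than noise in a given sample is an ordinary significance test on the counts. The sketch below uses scipy's chi-squared test of independence; the numbers are purely illustrative, not data from the survey being discussed.

    from scipy.stats import chi2_contingency

    # Made-up counts: employees who left before a first promotion versus
    # those who stayed, for two groups of 100 people each. The absolute gap
    # is only a couple of people per hundred, as in the example above.
    table = [
        [7, 93],   # group A: left, stayed
        [5, 95],   # group B: left, stayed
    ]

    chi2, p_value, dof, expected = chi2_contingency(table)
    print("Attrition before promotion: 7% vs 5%")
    print(f"chi-squared = {chi2:.2f}, p = {p_value:.3f}")
    # A large p value on a sample this small does not mean the gap is not
    # real; aggregated across a whole company, the same difference can be
    # both statistically detectable and practically important.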

David Green: Another thing from network analytics is that I’ve seen a few examples where high-performing women who don’t have strong networks at the senior level don’t get promoted and leave the organization. Men, who are quite good, generally, at changing their networks as they move up an organization, were getting promoted. I think the academic research backs this up. When you make people aware of that, then they might change their behaviors and consciously build those networks.

Managing privacy risks

Bill Schaninger: David, you said something earlier that was interesting about the challenges with privacy. The US has some challenges on the data front, often around security and what you’re doing with hashing and things like that. Europe, I’ve always found, is way more sensitive to the idea of “Big Brother-ish” tracking of my movements. In your experience, what’s the balance there? Because the insight you can get from this is pretty awesome.

David Green: I think you’re right. There is a balance there. At any organization that wants to use people analytics, it’s OK to start small, be transparent right from the start, think about the benefit for the employees, and ask, “What are the benefits that we’re trying to drive out of this? What is the business problem we’re trying to solve?”

You’ve got to speak to your privacy team. You’ve got to engage with works councils in Europe. You have to clearly articulate the benefit for employees and how you’re going to protect that data. It can be frustrating in terms of time because it slows up the process at the start. But as you say, you can get some really, really rich insights out of some of these technologies that actually have a really clear and direct benefit for employees and the business.

Rethinking the workplace

Bryan Hancock: Are you seeing a link now between the people analytics team and the real-estate team? We’re hearing a lot of organizations start to ask, “What should the workspace look like? What does it do?” Are you seeing linkages across the teams or are they existing in silos?

David Green: They’re definitely starting to see linkages, particularly in the companies that are maybe more advanced in people analytics. They are bringing exactly that sort of data together. And they’re thinking, “OK, in some parts of the world, our people are back in the office, but we’ve got these hybrid work models in place now. People are using the office differently. We need to measure how people are using the office and then redesign the workplaces with intention.” And I think we’ll see more of that in the next 18 months, two years.

Bill Schaninger: Maybe six months ago, Bryan, you and I had a run of webinars. There was an architect from Atlanta talking about “repurposing the space.” So much of this was around flexibility. I think the consensus was that we’d been on a two- or three-decade run of increasing density and lowering the square footage per person, and we were perfectly happy when we had teleworkers and remote workers. Now we may need to go in another direction and pay a little bit more for configurability if we’re talking about a combination of individual work, team-based work, or even lecture-hall-style communication.

The people agenda now is almost stemming the tide of dramatically increasing the span of bosses, increasing the density of office space, hoteling—that whole thing. So much of this had almost gone unchecked. Now we’re saying, “Hey, if we want to bring them back, we’ve got to use the workspace differently.”

I’ve found in Europe that you often have a bit more intervention on things like sunlight. Are you seeing that or is that a US thing and we’re just late to the party?

David Green: We’re definitely seeing that. It makes sense, doesn’t it? Part of understanding people is understanding how they use workspaces. If we can make workspaces more productive, then that’s good. People become more productive; hopefully, more engaged; and maybe less likely to leave, as well.

Recruiting beyond the usual suspects

Lucia Rahilly: What’s the role of analytics in helping HR leaders to fill the surging volume of open roles as folks quit and the talent gap widens?

David Green: The two big use cases of people analytics, going back years, have been attrition and recruiting. It’s almost like coming back full circle now in many respects. The people analytics teams have access to technology that can really help companies. We’re using analytics to automate parts of the recruitment process. In many respects, that actually widens the funnel. By automating, you can potentially open up the process and get a more diverse set of people applying in the first place, which is obviously good.

One big investment bank that I spoke to recently is using analytics to help hiring managers understand how many applicants they’re likely to get if they require certain education and experience in their role profiles, and how tweaking one or two things might change the applicant pool. If they change the language they’re using, they might get more female applicants, for example, if they’re looking for a software engineer. I think analytics is playing a big, big role in that. You can look at analytics across the recruitment process. You can start to see where you might be suffering significant candidate drop-off. You can start to understand whether you have a problem around offer-to-accept rates.

I would argue that recruiting doesn’t stop once the person starts. You also need to think about onboarding. You need to understand whether managers are having one-to-ones with new starters in the first week or two. Does that have an impact on people’s time to productivity? Does that have an impact on first-year attrition? There’s so much that analytics can do.

And then the other bit that I haven’t mentioned is bringing external data in to understand things like the supply of and demand for talent, and the locations where we might want to hire—particularly now that hybrid work is potentially opening that up as well.
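
The candidate drop-off analysis mentioned above can start as something as simple as stage-to-stage conversion rates. Here is a minimal sketch; the stage names and counts are entirely hypothetical.

    # Hypothetical counts of candidates reaching each stage of a recruiting funnel.
    funnel = [
        ("Applied",        1200),
        ("Screened",        600),
        ("Interviewed",     180),
        ("Offer made",       45),
        ("Offer accepted",   27),
    ]

    print(f"{'Stage':<16}{'Count':>8}{'Conversion':>12}")
    for i, (stage, count) in enumerate(funnel):
        if i == 0:
            conversion = ""
        else:
            prev = funnel[i - 1][1]
            conversion = f"{count / prev:.0%}"
        print(f"{stage:<16}{count:>8}{conversion:>12}")
    # A sharp drop at one stage, or a weak offer-accept rate (here 60%),
    # points to where the process itself, rather than the labor market,
    # is losing candidates.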

Bill Schaninger: You mentioned framing the description of the job as a way of making it more appealing to candidates. I’m assuming that the lexicon you’re using triggers different behavior. That’s great.

David Green: Using natural-language processing helps to identify words that may put off female applicants or other groups. There’s academic research suggesting that if you put bullet points on a job description, men will apply if they meet half of them, whereas women tend not to apply unless they feel they meet at least 90 percent of them. So the more bullet points there are, the more likely you are to end up with a heavily male slate, perhaps.
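
A toy version of that kind of language check is sketched below. The two word lists are short, illustrative stand-ins, not a validated lexicon like those used in the academic research on gendered wording in job advertisements.

    import re

    # Toy, illustrative word lists; real tools use validated lexicons that
    # are much longer.
    MASCULINE_CODED = {"competitive", "dominant", "ninja", "aggressive", "rockstar"}
    FEMININE_CODED  = {"collaborative", "supportive", "nurturing", "interpersonal"}

    def gendered_terms(job_description: str) -> dict:
        """Return the masculine- and feminine-coded words found in a job ad."""
        words = set(re.findall(r"[a-z]+", job_description.lower()))
        return {
            "masculine": sorted(words & MASCULINE_CODED),
            "feminine":  sorted(words & FEMININE_CODED),
        }

    ad = ("We want a competitive, aggressive engineering ninja "
          "to dominate the market.")
    print(gendered_terms(ad))
    # {'masculine': ['aggressive', 'competitive', 'ninja'], 'feminine': []}

Real tools score whole documents against much larger lexicons and suggest neutral alternatives; the mechanics, though, are essentially this.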

What types of data and insight matter?

Bryan Hancock: Have you seen organizations navigate and manage through all of the new offerings and make sure that they pick the types of data and types of insights that will matter most to them, not just the ones that seem cool to a person who heard about them on a podcast?

David Green: You probably need someone in your team spending half their time scanning and understanding the market and running proofs of concept. A lot of the smaller vendors will do that, but you’re right; it’s not there everywhere. So now the regulators are coming in. There’s been regulation in New York recently around using AI in the hiring process. The US Equal Employment Opportunity Commission is looking into it as well—the use of algorithms in hiring and people management generally. But, of course, the most important thing is that you’ve got to make sure that what talent acquisition professionals are telling you is actually valid. You’ve got to be careful around bias. Particularly if you’ve got a problem with diversity in your organization, you don’t want to perpetuate that through hiring as well.

Moving people analytics to the center of your HR strategy

Lucia Rahilly: Last question: Where do HR leaders stand in terms of their own skills in data-driven decision making? Do you see that there’s work to be done?

David Green: I think there’s work to be done. We did a survey with a focus on data-driven culture. Over a hundred companies participated. Ninety percent said that their CHROs have now communicated that people analytics is a core component of HR strategy, but only 42 percent said that their companies have a data-driven culture in HR at the moment. You could argue that the first sign is that the CHRO says it’s important. They use these data in their conversations with executives. Maybe they celebrate people in the HR team who are using data, setting that as an example to others and making it very clear that it’s expected.

And there are technologies coming in that are enabling organizations to democratize the data, both for HR business partners, who are particularly important in this, and for managers in the business. This is a big change for HR, so you’ve got to bring in change management and support people through that process. Data literacy is a core skill that they need to have.

Bryan Hancock: I think HR is well along the journey. We now have an understanding that HR is no longer just in the business of feeling good about people. It is in the business of bringing data, facts, and insight into the people side of work. I think there is a real understanding and appreciation of that across the board.

What we’re doing is shifting the skills of folks who used to deal with transactional issues and may have dealt with investigations—a number of things that required a different skill set. Now we’re shifting them not just to have data literacy but also to ask the right questions, to synthesize in the right way, and to compellingly advocate for solutions. The next push on analytics isn’t just the analytics but how to equip the team to use it.

David Green: It’s absolutely key. There is this mistaken idea that suddenly everyone in HR needs to become a data scientist or a statistician. But as you said, the important thing is the ability to ask the right questions and maybe to work with the business to develop hypotheses you can test with analytics. Then it’s communicating the insights and driving the change in order to implement them.

Lucia Rahilly: Let’s close there. David, thanks so much for being with us today.

David Green: Well, it’s been a pleasure. I’ve really enjoyed the conversation.

David Green is managing partner of Insight222. Lucia Rahilly is global editorial director of McKinsey Global Publishing and is based in McKinsey's New York office.

Comments and opinions expressed by interviewees are their own and do not represent or reflect the opinions, policies, or positions of McKinsey & Company or have its endorsement.
