Read Time: 20 minutes
“You have to evaluate each hypothesis in the light of the evidence of what you already know about it.” – R.A. Fisher
Depending on the source, anywhere from 39-90% of acute low back pain cases will fully resolve without intervention. Regardless of your hand skills, educational prowess, or eloquent pain science metaphors, many patients will perform just as well following the recommendations of a Facebook comment section. You personally have likely experienced episodes of low back pain; it’s hard to avoid with current school curriculums. Despite the aggravating acute pain after sitting too long or helping a friend move, your back felt fine a few days later.
Now, let’s say a patient comes into the clinic with acute low back pain, but this is the fifth episode they have experienced in the last two months. They are highly anxious, they frequently smoke, they have a desk job, and they think exercise is a waste of time. We would use that information to adjust the base rate. It may be tempting as a clinician to assume a poor prognosis is guaranteed and intervention is immediately needed; however, the base rate tells us the odds are the patient will recover for a sixth time. How much we rely on the base rate is the challenge.
When R.A. Fisher made the statement above, he was referring to Bayesian statistics and base rates. A Bayesian approach starts with the information you already know, the base rate, and layers on additional information. In many cases, the base rate is the best available evidence. Since randomized controlled trials are not perfectly pragmatic – we cannot control our environment in the real world as we can in a research setting – variability will enter the equation. Base rates are even more important when the evidence is weak. A Bayesian approach allows us to stay grounded in data with more support and then layer on the details of the situation.
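To make the idea concrete, here is a minimal sketch of a Bayesian update in Python. Every number in it is hypothetical – the base rate and the likelihoods of seeing this patient profile among recoverers versus non-recoverers are invented purely for illustration:

```python
# Illustrative Bayesian update. All numbers are hypothetical, not from the literature.

def update(prior, likelihood_if_h, likelihood_if_not_h):
    """Apply Bayes' rule: P(H|E) = P(E|H)P(H) / P(E)."""
    numerator = likelihood_if_h * prior
    evidence = numerator + likelihood_if_not_h * (1 - prior)
    return numerator / evidence

# Base rate: assume ~70% of acute LBP episodes resolve without intervention.
prior_recovery = 0.70

# Hypothetical likelihoods of seeing this patient profile (recurrent episodes,
# anxiety, smoking, sedentary job) given recovery vs. non-recovery.
p_profile_given_recovery = 0.20
p_profile_given_no_recovery = 0.40

posterior = update(prior_recovery, p_profile_given_recovery, p_profile_given_no_recovery)
print(f"Posterior probability of recovery: {posterior:.2f}")
```

Even with a risk profile that is twice as common among non-recoverers, the posterior still favors recovery – the base rate does much of the work, which is exactly Fisher’s point.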
Quick note on Introduction and Discussion sections
In the last post, I provided some key areas of research design to be aware of when reading studies. That discussion, however, only concerned the set-up of a study. The other key component that needs assessment with a fine-tooth comb is the results, specifically the statistical analysis. Before I get into that exciting topic, I will briefly address arguably a third area to be aware of: the author(s)’ interpretation of the results, formally known as the discussion section. The main reason I am hesitant to cover the discussion section, or the introduction, is their relative lack of importance.
The discussion is naturally biased toward the direction the author(s) want you to go. As a published author, I can attest to this confidently. Unfortunately, research is not black and white. There are publication biases (studies with preferred outcomes are more likely to be published), conflicts of interest (more publications = more grants, promotions, tenure), and strong personalities that are difficult to temper when the discussion section provides the freedom to influence your interpretation of the results. This is why I encourage clinicians, especially residents and mentees, to focus on the methods and results to have a firm understanding of what the study accomplished and to develop their own interpretation of the study. Now, back to the strategies for completing said interpretations.
Disclaimer: I am not an expert in statistics
Ok, seeing as universities offer PhDs in statistics, I am not going to attempt to cover the entire profession in a couple of paragraphs. For a more technical yet easily digestible review of the basics, I highly recommend onlinestatbook.com. If you enjoy reading books to garner more information, I recommend How Not to Be Wrong: The Power of Mathematical Thinking by Jordan Ellenberg. For the purpose of this post, I will focus on a few of the more common issues when interpreting the results of studies: significance, correlations, and confidence intervals. Additionally, I am only skimming the surface to address some fallacies and provide a starting point. For those more statistically inclined, I encourage seeking more technical and advanced sources.
This post is meant to be more surface level and there are disagreements even within the statistics profession on both the types of statistical tests to apply and their interpretation. In case I haven’t already piqued your interest in the remainder of this post, to further draw you in, consider the following. The marker of something being significant, a p-value of 0.05 or less, was essentially decided by one man and has remained in effect largely due to tradition. This same man, the aforementioned R.A. Fisher, argued that the early data on smoking supported the notion that the presence of lung cancer caused people to smoke, rather than the other way around. More on that later.
What does ‘significant’ mean?
I am starting with p-values because the p-value is typically the first, and often the most heavily weighed, piece of information sought when reviewing research. I can pretty confidently claim that most individuals at some point have reviewed a paper, found the p-values, and immediately made a determination of the value of the tested intervention based solely on whether the magical threshold of 0.05 was achieved. But what exactly does p<0.05 mean? Why is this the marker for significance?
For those of you still with me and reading this exciting post, I am glad you asked. To obtain the p-value, or probability value, you first need to determine the null hypothesis. The null hypothesis is the position that there is no relationship between two measured phenomena or no association among groups. I will use one of my first publications as an example.
The objective of my study was to examine the potential relationship between physical therapy treatment outcomes and chronicity of low back pain (LBP) in the outpatient setting. It was a retrospective observational study, meaning all the treatments were complete at the time of designing the study. This type of study has several limitations that restrict the ability to determine causation – which I will cover in more detail later – but it can still provide value.
The study included just shy of 12,000 patients treated in outpatient physical therapy clinics across 11 states. I measured functional outcomes using a tool called the Focus on Therapeutic Outcomes (FOTO) Low Back Functional Status (FS) Patient-Reported Outcome Measure (PROM). It assesses the perceived physical abilities of patients experiencing low back pain impairments. It produces a functional score on a linear metric ranging from 0 (low functioning) to 100 (high functioning). The difference between the intake FS score and the final FS score produced the FS change, which represented the overall improvement of the episode of care. Here are the results:
- The mean FS change was 16.997 (n=11945).
- Patients with chronic symptoms (> 90 days duration) had an FS change of 15.920 (n=7264) across 14.63 visits.
- Patients with subacute symptoms (15-90 days) had an FS change of 21.66 (n=3631) across 14.05 visits.
- Patients with acute symptoms (0-14 days) had an FS change of 29.32 (n=1050) across 13.66 visits.
- Stepwise regression analysis revealed a significant beta for chronicity (-4.155) with all models.
I concluded patients will likely achieve superior patient-reported functional outcomes through physical therapy if they seek care in the acute stages of symptoms rather than waiting. Furthermore, the number of treatment session and duration of care were similar between groups, indicating potential ineffective or insufficient care was provided for patients with chronic pain. Let’s look at the statistics in more detail.
The null hypothesis would be that chronicity does not have a relationship or association with the clinical outcomes. If the null hypothesis were true, we would see similar FOTO scores between the chronicity groups. So, suppose the null hypothesis were true: the p-value (<0.001 in my paper) is the probability (under this hypothesis) of obtaining results as extreme as the ones I actually observed – ranging from a 14.50 mean for patients with chronic symptoms to a 25.89 mean for patients with acute symptoms. Essentially, it is very unlikely that we would observe these results if there were no relationship between chronicity and outcomes. Note I said relationship and not cause; we will get to that shortly.
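If you want to see what “results as extreme as the ones observed, assuming the null” looks like mechanically, here is a toy simulation in Python. This is not a reanalysis of my data – the group sizes, standard deviation, and observed difference are all invented for illustration:

```python
import random
import statistics

random.seed(0)

# Toy illustration of a p-value under the null hypothesis. Group sizes,
# the SD, and the observed difference are hypothetical, not from the study.
n_per_group = 50
observed_diff = 6.0  # hypothetical observed difference in FS change

def null_difference():
    # Under the null, both groups are drawn from one shared distribution.
    a = [random.gauss(17, 15) for _ in range(n_per_group)]
    b = [random.gauss(17, 15) for _ in range(n_per_group)]
    return abs(statistics.mean(a) - statistics.mean(b))

# The p-value estimate: how often does chance alone produce a difference
# at least as extreme as the one we observed?
n_sims = 2000
extreme = sum(null_difference() >= observed_diff for _ in range(n_sims))
p_value = extreme / n_sims
print(f"Estimated p-value: {p_value:.3f}")
```

The p-value is nothing more mystical than the long-run frequency of seeing a difference that large when no real difference exists.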
Another note: the nature of an observational study yields a lot of confounding factors. Symptom chronicity is not the only differing variable between the chronicity groups. It is difficult to replicate the exact nature of the study from a patient population standpoint. The collected demographics indicated the patient groups were similar save for chronicity, but psychosocial variables were not assessed. The interventions were not controlled either, as I simply pulled outcome data from all low back pain patients treated in 122 clinics over a 2-year period. The intention was to gather a pragmatic sample of outcomes with respect to chronicity, but the nature of the study limits our ability to reproduce it. Thus, the statistical significance needs to be approached with caution. When assessing randomized controlled trials, you can often place more confidence in the observed effect – not always, as there are some poorly run trials – but there are still limitations with p-values.
As stated, the marker of 0.05 does not carry any special meaning. This was simply the threshold Fisher decided was acceptable to deem a test significant. We are starting to see some resistance to this line of thinking, so much so that some researchers are specifically calling for the research community to no longer use the term ‘statistical significance’ and to simply report the p-values. It makes sense, as, unfortunately, studies are wholly dismissed – or in some cases not published – if the all-important threshold is not achieved. If one study achieves an outcome with a p-value of 0.049, it is not more meaningful than one that obtains a p-value of 0.051. However, researchers, clinicians, and educators alike will make career decisions based solely on that threshold. This can be the difference between providing a treatment that may truly be beneficial for your patient or teaching something that is not supported. Before we move on from p-values, it is important to address how they can be manipulated as well.
The pros and cons of big data
Big data is becoming more commonplace in research. For the most part, this is a good thing. It provides a greater array of research opportunities and allows for more detailed analysis through subgrouping. One of the primary issues with assessing the outcomes of a study is statistical power, which is the probability that an experiment will reject a false null hypothesis. Underpowered studies are a particular problem in the fields of rehabilitation and exercise physiology.
In order to achieve satisfactory levels of statistical power, you need enough subjects/patients. If you fail to achieve a large enough subject pool, the variability will make statistical assessment difficult. While ‘statistical significance’ should be interpreted with caution, p-values are still valuable. Even if you utilize a successful intervention, a small group of patients may lead to a large p-value due to the variety of interceding variables.
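A small simulation illustrates the point. The effect size (5 points), standard deviation (15), and sample sizes below are hypothetical, and a simple known-SD z-test stands in for the fancier tests real studies use:

```python
import random
import statistics

random.seed(1)

# Sketch of statistical power: how often does a REAL effect reach p < 0.05
# at different sample sizes? All parameters are hypothetical.
def estimate_power(n_per_group, true_diff=5.0, sd=15.0, sims=1000):
    hits = 0
    for _ in range(sims):
        a = [random.gauss(0, sd) for _ in range(n_per_group)]
        b = [random.gauss(true_diff, sd) for _ in range(n_per_group)]
        diff = statistics.mean(b) - statistics.mean(a)
        se = sd * (2 / n_per_group) ** 0.5   # known-SD z-test for simplicity
        if abs(diff) / se > 1.96:            # |z| > 1.96 corresponds to p < 0.05
            hits += 1
    return hits / sims

power_by_n = {n: estimate_power(n) for n in (10, 50, 200)}
for n, p in power_by_n.items():
    print(f"n = {n:>3} per group -> power ~ {p:.2f}")
```

With 10 patients per group, the very same intervention effect is usually missed; with 200 per group, it is almost always detected. The intervention never changed – only the sample size did.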
Remember all the markers of rigor mentioned in the previous post? Those are all assessed to minimize the number of biases and variables that impact a study. For example, if an experimenter is not blinded to the intervention they are providing, they can impact the outcomes. The subject can interpret body language, confidence, and attention to detail portrayed by an experimenter to glean whether they believe the experimental intervention or the control is being provided. There are other variables that cannot be controlled for, such as the patient’s mood that day, how much they slept, what they ate for breakfast, or how attractive they find the experimenter and whether it is distracting them. All these factors can impact the outcome of an experiment.
How do you mitigate these influences? By increasing your subject pool. The higher the number of participants, the more the variables are watered down and regression to the mean starts to take effect. However, with big data, the opposite can occur. With large data sets – while I am pleased with the 11,945 patients in my study, ‘big data’ often refers to sets in the hundreds of thousands or millions – almost everything is statistically significant. It makes it difficult to tease out what truly matters and what is simply deemed “different” due to the large number of people.
For example, in my paper, the pain ratings of the groups fluctuated between 5.93 and 6.45, but the p-value was <0.001. Yes, those numbers are likely truly different, but the patients are unlikely to be having a different pain experience. When data sets hit the millions, a difference between 5.93 and 5.98 may reach the 0.05 threshold. Aside from being aware of this for purposes of drawing conclusions, it is important to be aware of something referred to as ‘p-hacking’ or ‘data dredging’.
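A quick sketch makes this concrete. Using a simple known-SD z-test and a hypothetical standard deviation of 2 for pain ratings, the same clinically trivial 0.05-point difference goes from nowhere near significant to overwhelmingly “significant” purely as a function of sample size:

```python
import math

# Two-sided p-value for a difference in means, assuming a known SD.
# The pain means (5.93 vs 5.98) echo the article; the SD of 2 is hypothetical.
def z_test_p(mean_a, mean_b, sd, n_per_group):
    se = sd * math.sqrt(2 / n_per_group)
    z = abs(mean_a - mean_b) / se
    # two-sided tail probability from the normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

for n in (100, 10_000, 1_000_000):
    p = z_test_p(5.93, 5.98, sd=2.0, n_per_group=n)
    print(f"n = {n:>9} per group -> p = {p:.4f}")
```

Nothing about the patients changed between the rows; only the sample size did. This is why effect sizes matter alongside p-values, as discussed later in the post.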
P-hacking is manipulating data sets to “find” statistically significant differences. In his aforementioned book, Jordan Ellenberg eloquently writes, “P-hacking is torturing the data until it confesses.” P-hacking is not the act of replacing data with fake numbers; that is a whole new level of unethical and fraudulent “research”. I am referring to the organization of actual data to achieve the desired 0.05 significance level. This typically happens when retrospectively seeking a relationship rather than prospectively determining what variables and parameters to assess. When conducting research, we must develop a hypothesis and then test it, ideally repeatedly with different samples. Grabbing a large data set and searching for relationships will always yield something. I can safely venture that most patients who achieve a positive outcome in physical therapy wear shoes; however, the act of wearing shoes does not guarantee a positive outcome.
Just because there is a relationship does not mean it is actionable. Another method of p-hacking is manipulating a data set to tell the story you want. For example, let’s say you want to “prove” patients can achieve a predicted outcome in only 5 visits of treatment. You then rationalize all the parameters to achieve that goal. Perhaps you start by looking at the entire data set and five visits falls short of the desired effect. You then apply an age restriction of 25-45 years old, limit the data to clinics within a 25-mile area (selection bias), only allow for clinicians with at least five years of clinical experience, ensure the visits are spaced over at least one month, and limit the chronicity of symptoms to acute. Once all these parameters are applied, you achieve your desired effect and submit a manuscript with all the listed parameters as your inclusion/exclusion criteria. That is p-hacking to tell a desired story. It is very different from pulling a data set with those parameters from the start as a hypothesis, observing multiple outcomes, and then reproducing the test on a separate data set with the same parameters; or better yet, conducting a prospective cohort study and following a set sample.
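The “searching will always yield something” claim can be demonstrated directly. The sketch below invents 100 pure-noise “predictors” and tests each against a pure-noise outcome; despite there being no real relationships at all, a handful will still cross the 0.05 threshold by chance alone:

```python
import math
import random

random.seed(3)

# Data dredging in miniature: 100 noise "predictors" tested against a noise
# outcome. Everything here is simulated; no real patient data is involved.
n_patients = 200
outcome = [random.gauss(0, 1) for _ in range(n_patients)]

def correlation_p(x, y):
    """Approximate two-sided p-value for Pearson r via the Fisher z transform."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)
    z = abs(math.atanh(r)) * math.sqrt(n - 3)
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

n_predictors = 100
false_positives = sum(
    correlation_p([random.gauss(0, 1) for _ in range(n_patients)], outcome) < 0.05
    for _ in range(n_predictors)
)
print(f"{false_positives} of {n_predictors} noise predictors reached p < 0.05")
```

Roughly 5 in 100 meaningless predictors will look “significant” at the 0.05 level, which is exactly what testing a hypothesis on fresh data is designed to catch.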
A key point in proper statistical analysis is to test a hypothesis with evidence (data) that was not used in constructing the hypothesis. This is critical because every data set contains a few patterns due entirely to chance. If the hypothesis is not tested on a different data set from the same statistical population, it is impossible to assess the likelihood that chance alone would produce such patterns. It is important to realize that statistical significance under the incorrect procedure is completely spurious – significance tests do not protect against p-hacking. My study provides some value, but the hypothesis needs to be repeatedly tested with different data sets. The lack of a control group limits the ability to fully understand the impact of chronicity on the outcome. While the study shows a relationship between chronicity and outcomes, it is important we don’t take that to mean ‘the sooner a patient seeks care, the better the outcome’ in all cases. All we can do is observe the relationship, not draw conclusions about an effect. This brings us to correlations.
Cancer causes us to smoke
People commonly state that “correlation does not equal causation.” First and foremost, this is true. Unfortunately, the phrase is often treated like the “no offense, but…” approach of acknowledging an issue and plowing through anyway. Similar to how someone will proceed to say something offensive under the assumption that the disclaimer makes it okay, correlations are still treated as causation and lead to flawed decision making.
A correlation is simply an association between two variables. This relationship can be positive or negative. For example, studying is positively correlated with better test scores: the more I study, the higher I score. While this correlation makes sense, many correlated variables can be completely unrelated, yet the association remains. For example, the number of golfers using wooden clubs is inversely correlated with the number of licensed physical therapists over the past century. As the therapist workforce increases, fewer golfers use wooden clubs. I can safely assume therapists are not stealing all the wooden golf clubs for themselves. These have nothing to do with one another, but strictly looking at the data, a relationship exists. You can find many other entertaining and clearly unrelated examples. Looking at two data sets in a correlation – A and B – the relationship could be that A causes B, B causes A, or an unknown C causes both A and B.
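To show how easily two unrelated trends produce a strong correlation, here is a small sketch. The numbers are entirely invented – any two series that merely drift over time (here, the hidden “C” is simply the passage of years) will correlate strongly:

```python
# Invented data: two series that each trend over time but have no causal link.
years = list(range(10))
wooden_club_golfers = [100 - 9 * y + (-1) ** y for y in years]   # steadily falling
licensed_pts = [50 + 12 * y + (-1) ** (y + 1) for y in years]    # steadily rising

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

r = pearson_r(wooden_club_golfers, licensed_pts)
print(f"r = {r:.3f}")  # strongly negative, despite no causal connection
```

The near-perfect negative correlation here is driven entirely by the shared time trend, the “unknown C” in the A/B/C framing above.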
R.A. Fisher, a successful statistician who has made lasting impacts on assessments of effect sizes, does a great job of outlining the limitations of correlation. To illustrate how misleading correlations can be, let’s look at how he combated the growing public concerns of the negative health implications associated with cigarette smoking. He was a proponent of smoking and took issue with the rise of correlation studies in the 1940s and 1950s linking smoking to cancer. Here is what he had to say regarding the matter:
“Is it possible then, that lung cancer – that is to say, the pre-cancerous condition which must exist and is known to exist for years in those who are going to show overt lung cancer – is one of the causes of smoking cigarettes? I don’t think it can be excluded. I don’t think we know enough to say that it is such a cause. But the pre-cancerous condition is one involving a certain amount of slight chronic inflammation. The causes of smoking cigarettes may be studied among your friends, to some extent, and I think you will agree that a slight cause of irritation – a slight disappointment, an unexpected delay, some sort of mild rebuff, a frustration – are commonly accompanied by pulling out a cigarette and getting a little compensation for life’s minor ills in that way. And so, anyone suffering from a chronic inflammation in part of the body (something that does not give rise to conscious pain) is not unlikely to be associated with smoking more frequently or smoking rather than not smoking. It is the kind of comfort that might be a real solace to anyone in the fifteen years of approaching lung cancer. And to take the poor chap’s cigarettes away from him would be rather like taking away his white stick from a blind man. It would make an already unhappy person a little more unhappy than he needs be.”
As Ellenberg puts it in his book How Not to Be Wrong: The Power of Mathematical Thinking, “one sees here both a brilliant and rigorous statistician’s demand that all possibilities receive fair consideration.” Fisher was correct in his assessment of correlation statistics. The epidemiologist Jan Vandenbroucke stated the arguments “might have become textbook classics for their impeccable logic and clear exposition of data and argument if only the authors had been on the right side.” As we all know, decades of more rigorous studies of varying types have allowed us to conclude that smoking does in fact contribute to the development of cancer. The type of assessment Fisher displayed is the opposite of what we typically see when individuals interpret correlation results. It can be very tempting to see a correlation and conclude that causation must be present. And it might be! But we cannot draw a cause-and-effect conclusion without more evidence.
Some research questions will never pass a review board and thus cannot be tested with a randomized controlled trial. Regarding smoking, we cannot randomly allocate a few hundred people to smoking one pack of cigarettes a day, allocate another few hundred to a control group, and then see which group has a higher death rate in 20 years. Additionally, the longer a study progresses, the more potential for external biases and influences – such as diet and exercise habits – to alter the study. But when we look at all the data, the picture is clear.
While still observational in nature, we have many studies with large subject pools that consistently show increased smoking increases the risk for lung cancer. If you stop smoking, the risk decreases. If you smoke unfiltered cigarettes compared to filtered, the risk increases. If you smoke two packs a day compared to one, the risk increases. Any way you look at the problem, smoking increases cancer risk. As compiling a large volume of studies requires substantial time and resources, improving the quality of individual studies can expedite our ability to draw conclusions.
The more bias and variables we eliminate (well-controlled randomized controlled trials), the larger the trial size, and the more frequently a result is reproduced with similar findings, the more confident we can be in drawing a conclusion. If we see a large volume of correlation studies pointing to the same conclusions – as has occurred with smoking – we can start feeling more confident about a potential cause-and-effect relationship. Correlations have a place in research, as they point to relationships that may be relevant, but they are an incomplete analysis.
This does not mean all information short of randomized controlled trials should be ignored. In treatment, for example, the three legs of evidence-based practice are research evidence, clinical expertise, and patient values and perspective. The legs are not equally weighted, but all should be considered. The foundation is the research, starting with randomized controlled trials, and the application to patients is then modified by the clinician’s expertise and the patient’s values and perspective. A lumbar manipulation may be indicated based on patient presentation, but if I as a clinician have no experience with or confidence in delivering the technique, or if the patient hates the sound of knuckles cracking and is scared of being manipulated, applying the technique will be a disaster. The issue lies in inappropriate use of the data. We can, however, compile similar findings across multiple patient populations and slightly different research questions to increase our confidence in our assessment.
Clinical versus statistical significance
The end goal of research is to apply the knowledge gained to real-world practice. We have an abundance of data demonstrating levels of significance and associations, but how do we know if the data is clinically meaningful? The word ‘significant’ has been watered down. We are often either numb to its use or make every decision based on achieving the p<0.05 threshold. This is where measuring the effect size comes into play.
Effect sizes are quantitative measures of the magnitude of a phenomenon or outcome in a study. They include measures such as the mean difference between groups, correlation coefficients, and regression coefficients. Another statistical feature worth understanding to better assess effect sizes is the confidence interval. I am not going to spend much time here, but I believe these are important for anyone reading research to understand.
A confidence interval is a range of values that we believe the true value lies within. It is used to indicate the probability that the assessed value is representative of the entire population. Let’s go back to my study for a practical application.
The average improvement in FS for all low back pain patients treated with physical therapy was 17.00 with a 95% confidence interval of 16.70-17.28. That means that if I took 100 new samples of patients that met the same inclusion and exclusion criteria and received the same interventions as the population I assessed, the mean improvement should fall within the range of 16.70-17.28 in roughly 95 of the 100 samples. It does not mean 95% of patients would achieve a score improvement within the range of 16.70-17.28. That is a common misconception. The difference between the means of entire samples and the scores of individual patients needs to be noted. Why do we care about CIs? If my CI were 10.00-20.00, that would indicate that in 95% of future samples, the mean could be as low as 10 and as high as 20. This is where understanding the magnitude of a value is important as well.
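The repeated-sampling interpretation can be checked by simulation. The sketch below draws many samples from a hypothetical population (mean 17, SD 15 – invented numbers, not my study data), computes a 95% CI from each, and counts how often the interval captures the true mean:

```python
import random
import statistics

random.seed(4)

# What "95% confidence" means: across repeated samples, ~95% of the computed
# intervals contain the true population mean. Population values are hypothetical.
true_mean, sd, n = 17.0, 15.0, 400

def ci_covers_truth():
    sample = [random.gauss(true_mean, sd) for _ in range(n)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    low, high = m - 1.96 * se, m + 1.96 * se
    return low <= true_mean <= high

trials = 1000
coverage = sum(ci_covers_truth() for _ in range(trials)) / trials
print(f"Coverage across {trials} samples: {coverage:.3f}")
```

Note what is being counted: whole-sample means falling inside intervals, not individual patients. No individual patient is anywhere near guaranteed to land in a range as narrow as 16.70-17.28.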
The minimal clinically important difference (MCID) for the FOTO Low Back FS is 5. Thus, a difference between 16.70 and 17.28 doesn’t mean much clinically to a patient and would be hardly noticeable in the clinic. However, a difference in improvement of 10 compared to 20 can be the difference between being able to walk your dog safely or having to leave your furry friend at home. Large confidence intervals make it more difficult to apply the results to the general population, as the results are less compatible with the test and background assumptions (i.e., biases).[4, 5] Understanding the magnitude of change and the confidence in outcomes is vital if we are going to translate research into clinical practice. That is the goal, is it not?
Currently, there is a large push for translational science, which is taking basic science research and bringing it to clinical practice. We often see a divide between these two worlds. Researchers complain that clinicians rely too heavily on anecdotal experiences and fail to stay up to date with evidence, while clinicians complain that researchers don’t know what it is like to be in “the real world,” where the clinic cannot be controlled like a study. There is a balance that needs to be struck, but it cannot happen until researchers understand the demands and challenges of clinical practice and clinicians understand how to read, interpret, and apply evidence. As I stated earlier, this is not a comprehensive list of all the statistical analyses you will find in studies; they are, however, among the most common and most frequently misunderstood. It can be equally, if not more, detrimental to misrepresent data than to avoid it altogether. This is not a license to bury our heads in the sand and go back to reading abstracts and relying on anecdotal evidence. Reading and understanding literature takes time and effort; however, it is well worth it.
- van Tulder, M., et al., Chapter 3. European guidelines for the management of acute nonspecific low back pain in primary care. Eur Spine J, 2006. 15 Suppl 2: p. S169-91.
- Walston, Z. and J. McLester, Impact of low back pain chronicity on patient outcomes treated in outpatient physical therapy: a retrospective observational study. Arch Phys Med Rehabil, 2019.
- Schuemie, M.J., et al., Interpreting observational studies: why empirical calibration is needed to correct p-values. Stat Med, 2014. 33(2): p. 209-18.
- Chow, Z.R. and S. Greenland, Semantic and cognitive tools to aid statistical inference. Applied Statistics, 2019. 21(Sept.).
- Greenland, S. and Z.R. Chow, To aid statistical inference, emphasize unconditional descriptions of statistics. Applied Statistics, 2019. 21(Sept.).
ABOUT THE AUTHOR
Zach has numerous research publications in peer-reviewed rehabilitation and medical journals. He has developed and taught weekend continuing education courses in the areas of plan of care development, exercise prescription, pain science, and nutrition. He has presented full education sessions at the APTA NEXT conference and the ACRM, PTAG, and FOTO annual conferences, as well as multiple platform sessions and posters at CSM. He is an active member of the Orthopedic and Research sections of the American Physical Therapy Association and the Physical Therapy Association of Georgia. He currently serves on the APTA Science and Practice Affairs Committee and the PTAG Barney Poole Leadership Academy.