Hey Reader, what a pleasure to have you here again!
Today, I want to convince you of two things: A) experimental design is a sustainable practice, and B) there is a lot for us to learn.
I would argue that even in the best labs around the world, scientists have a good understanding of how to set up controls, but very few can go beyond that.
So, let us explore how to properly use experimental design to make our research more effective and save resources!
Today's Lesson: Experimental Design
How to leverage planning for sustainability & science
Number Of The Day
0
I could not find a single publication that quantifies how many resources can be saved through better experimental design. Prospective design is a topic that is often overlooked, even in most statistics courses. In my search I only came across this website, which provides an anecdotal but impressive quantification of the potential savings.
Designing Experiments Sustainably
Scientists often focus on controls during planning.
However, not all biases and confounders can be eliminated with controls. Additionally, opportunities that allow for higher impact (in terms of statistical significance and implication), cost savings, and time efficiency are never discovered.
A Concrete Example
Let’s assume it is our job to test the side effects of a new drug in mice. We need to find out if this drug affects body weight, anxiety levels, or pain sensitivity.
Normally, a scientist chooses a protocol established in their lab or replicates the methodology of another paper.
The Problem
This would mean setting up 5-7 mice per experimental group, as that is a common number. Three groups (anxiety vs. pain testing vs. body weight) would be arranged, since that yields three figures to publish.
Results would be judged significant or not and based on this analysis it would be decided whether the drug has side effects or not.
Indeed, we would get results that might be publishable. Apart from being unsustainable due to wasting resources, we would not even know if these data are “right” or “true”.
It might sound surprising, but based on our study design we could not say whether we would have been able to detect an effect at all!
Here’s how we can enhance statistical validity while reducing resource and mice use.
What Is Meaningful?
Instead of simply conducting an experiment with “standard sample size” and hoping to see an effect, the first step is to determine the smallest difference we need to detect to reliably speak of a “difference”.
So what would constitute a meaningful outcome?
First, you need to decide what constitutes a meaningful difference. Once you have your data, you calculate the observed difference. You use statistics because you need to account for the variation in the data. For example, the mouse with the lowest data point in Drug #2 might have enzymes that degrade the drug faster, resulting in fewer side effects. Still, you would infer that Drug #2 leads to side effects overall. Finally, you make a decision regarding the side effects. The calculation of significance does not tell you how large the difference is, only how certain you are that any difference exists.
If mice lose 1% of their body weight, you might assume biological variation independent of the drug. If you observe 4% you might want to investigate further, but it’s unlikely you would suggest immediately discontinuing the drug. If the mice start eating each other alive due to starvation, however, your reaction might differ, and you might be on the way to a Nobel Prize…
We could set a threshold of 5% body weight loss as a significant side effect because we know there is a 3% body weight variation in the strain of mice we use.
How Many Mice Do We Really Need?
We also know that due to biological variation, each measurement must be repeated. But how often should you repeat it, or how many mice are needed to reliably detect a 5% difference?
Finding out is unbelievably simple: you assess the variation (e.g., the precision of your scale and the weight variation of your mice) and then enter this information into a simple online calculator. Done.
This is a simple online power/sample size calculator from the University of Vienna. It helps you calculate proper sample sizes or tells you the power given a certain sample size. It asks very basic questions, like the difference between the groups you want to detect and the variation of your data. Normally, you can estimate the latter from previous experiments.
What are alpha and beta?
Colloquially, power is the counterpart of alpha, the significance level. More precisely, alpha is the probability of seeing an effect when none exists, and beta is the probability of missing an effect that’s actually there. Power is 1 − β: the probability of detecting an effect that’s actually there. Note that with a constant sample size, you can only increase power further by accepting a less strict significance level (a larger α).
As you can see on the right, there are some more statistical aspects to consider. However, don’t be intimidated: if they confuse you, a two-minute Google search will answer your questions, since these are very basic. Here are more and more and even more such calculators, and here some advanced ideas to increase power without changing sample sizes.
Suppose your result shows that you need at least 10 mice per group to reliably detect an effect with 80% power (the standard value). This means in 20% of cases, you might miss an effect even if it’s present. For habit’s sake we also keep the significance level at 0.05 (the likelihood that you will observe this result or an even bigger one if in reality there was no side-effect of the drug).
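To make the calculator step less of a black box, here is a minimal sketch of the arithmetic behind such a tool. It assumes a two-sided comparison of two equally sized groups and a normal approximation. The function name and the example inputs are illustrative assumptions, not the method of the Vienna calculator, and a t-based calculator will usually ask for slightly more animals at small sample sizes.

```python
import math
from statistics import NormalDist


def mice_per_group(difference, sd, alpha=0.05, power=0.80):
    """Approximate animals per group for a two-sided, two-group comparison.

    Normal approximation: n = 2 * ((z_alpha + z_power) * sd / difference)^2.
    `difference` and `sd` must share a unit (here: % body weight).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # guards against false positives
    z_power = z.inv_cdf(power)          # guards against missed effects
    n = 2 * ((z_alpha + z_power) * sd / difference) ** 2
    return math.ceil(n)


# Assumed inputs: detect a 5% weight difference with a 3% or a 4% spread.
n_low_spread = mice_per_group(difference=5.0, sd=3.0)
n_high_spread = mice_per_group(difference=5.0, sd=4.0)
print(n_low_spread, n_high_spread)
```

The required number rises quickly with the spread of the data, which is why a realistic estimate of the variation from earlier experiments matters so much.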
This process is important because otherwise we could never say whether we could have detected an effect with relative certainty.
No Sustainability without Scientific Sustainability
You might now say, “Patrick, we’re now using 10 mice – even more than before! How is that sustainable?”
It’s sustainable because now we can be reasonably confident that we would detect an effect if there is one. With fewer than 10 mice, the inherent variation is too big to draw robust conclusions. Thus, we would have to judge “we are uncertain”.
Imagine you had used only 4 instead of 7 replicates. After deleting the light-green data points, could you still tell the difference properly?
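The same question can be put into numbers. The sketch below estimates the power of a two-group comparison for different group sizes, again under a normal approximation. The 5% difference comes from the example above, while the 4% standard deviation is purely an assumption (biological variation plus measurement noise), chosen so that 10 animals land close to 80% power.

```python
import math
from statistics import NormalDist


def approximate_power(n_per_group, difference, sd, alpha=0.05):
    """Approximate chance of detecting `difference` with n animals per group.

    Two-sided, two-group comparison under a normal approximation; the
    negligible chance of a significant result in the wrong direction is ignored.
    """
    z = NormalDist()
    standard_error = sd * math.sqrt(2 / n_per_group)
    return z.cdf(difference / standard_error - z.inv_cdf(1 - alpha / 2))


# Assumed inputs: 5% difference, 4% standard deviation (hypothetical).
for n in (4, 7, 10):
    print(n, round(approximate_power(n, difference=5.0, sd=4.0), 2))
```

Under these assumptions, the smaller groups miss a real effect far more often than the planned 20% of cases, which is exactly the situation that leads to repeated trials.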
This is why most scientists will simply repeat experiments until they reach a point of “certainty”. In our case, that would mean running a second trial and ending up using 12 mice instead of 10.
Also, consider the wasted time, and the frustration while trying to analyze the data.
Even worse, imagine the scientists had judged “no significant difference detectable”: a green light to continue with the drug!
Reducing The Number Of Mice (Finally)
Originally, each effect would be tested on a different group of mice. What a waste! Given the focus on controls, scientists are primed to separate as much as possible. Fear dominates: carryovers! Contamination!
However, multiple side-effects can occur simultaneously in patients. Thus, I would argue we can test a mouse for all side-effects.
I would suggest a crossover design. Assuming anxiety and pain-sensitivity would occur after 5 days but weight loss could start any time within 3 weeks, we could measure anxiety at the end of week one and pain sensitivity at the end of week two with weight measurement conducted throughout.
This interval minimizes carry-over effects from one test to the next (e.g., through stress) since we keep one week in between.
Improving Once More
However, does it make mice more anxious when they have been losing weight for 2 rather than just 1 week? Good question, let’s iterate once more:
For instance, we could test anxiety first and then pain sensitivity in 5 mice, and reverse the order for the other 5, given that symptoms occur after 5 days and remain stable thereafter.
This design has multiple advantages: instead of three separate groups, we have only one, sparing a significant number of mice (and avoiding the additional work of setting up breeding and caring for them).
Additionally, we reduce the often large inter-individual differences among mice since we test all side effects in the same mouse. Despite this, we still maintain the optimal number of mice to detect effects with statistical robustness.
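The counterbalanced order described above can be sketched in a few lines. This is a minimal illustration under assumptions: the mouse labels, the function name, and the fixed seed are hypothetical, and the only point is that the split into the two test orders is random and balanced.

```python
import random


def assign_test_order(mouse_ids, seed=None):
    """Randomly split mice into two equally sized test-order groups."""
    rng = random.Random(seed)
    shuffled = list(mouse_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    orders = {}
    for mouse in shuffled[:half]:
        orders[mouse] = ("anxiety", "pain sensitivity")
    for mouse in shuffled[half:]:
        orders[mouse] = ("pain sensitivity", "anxiety")
    return orders


# Hypothetical labels for the 10 mice of the single group.
mice = [f"mouse_{i:02d}" for i in range(1, 11)]
schedule = assign_test_order(mice, seed=2024)
for mouse in sorted(schedule):
    week_1, week_2 = schedule[mouse]
    print(f"{mouse}: week 1 = {week_1}, week 2 = {week_2}, weight = throughout")
```

Randomizing which mice get which order, rather than picking them by hand, keeps the order effect from being confounded with cage position, handling, or similar factors.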
What If Crossovers Are Not an Option?
Experimental design, much like sustainable practice, is resilient. If we cannot use the design mentioned above, we will find alternatives, e.g., a group sequential design.
In this approach, groups are tested in sequence rather than simultaneously, allowing us to avoid generating unnecessary results.
The fundamental idea: instead of starting all groups at once, we collect only data that is meaningful. In this way, we can stop the experimental series at some point instead of blindly generating useless data just because it seemed sensible in the beginning.
For instance, if we observe weight loss in the first group, we might decide not to test for anxiety or pain sensitivity, as the drug would not make it to market anyway. However, if all mice are fine after one week, we can test those for anxiety and proceed with the next group of animals.
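The decision flow of this example can be sketched as follows. Note the assumptions: the function name and the weight-loss values are hypothetical, and the rule simply compares the group mean with the 5% threshold from above. A formal group sequential design would instead use pre-specified statistical stopping boundaries, so treat this as an illustration of the logic only.

```python
from statistics import mean


def next_step(weight_loss_percent, threshold=5.0):
    """Decide what to do after the first group's week-one weight data.

    Simplified decision rule on the group mean; a formal group sequential
    design would use pre-specified statistical stopping boundaries instead.
    """
    if mean(weight_loss_percent) >= threshold:
        return "stop: weight loss alone rules the drug out"
    return "continue: test anxiety, then start the next group"


# Hypothetical week-one weight losses (% of body weight), for illustration only.
print(next_step([6.1, 5.4, 7.0, 4.9, 6.3]))
print(next_step([0.8, 1.5, 0.2, 1.1, 0.9]))
```

Writing the stopping rule down before the first animal is tested is what keeps the decision from drifting once data starts coming in.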
Applying The Knowledge
The example from above can be translated to any experiment whether it involves the electrophysiology of neurons or protein levels in cancerous cell lines.
If you have a question, you start from an unknown and construct a method to immediately measure the outcome you were interested in.
Your brain did all of this automatically in just a few seconds, because the uncertainty that goes along with asking a question immediately signals relevance.
However, what your brain normally leaves out is:
A) confounders and biases that could lead you to incorrect conclusions
B) how to find the answer most efficiently.
Your mind considers the effort required and the resources needed as equally important. As a result, thinking longer doesn’t seem like a feasible option for optimization to your brain initially.
Normally, we connect what we are curious about with what we need to know. We figure out which measure might be appropriate and then determine how to measure it. This often happens within a few seconds to minutes. However, once we take a step further, we begin to wonder about the interpretability of our measurement, thereby starting an iterative process of reviewing our assumptions at each step. Through this process, we can prompt ourselves to avoid biases and confounders, as well as optimize resource use. From time to time, we may even realize that the question we initially asked was not the right one to answer our actual inquiry.
I am pretty sure this also happened in the apple vs banana example. However, in our world there is enough food, time, and opportunity to allow for thoughtful planning.
Thus, let us take a deep breath and reconsider instead of following what we are habituated to do (rushing ahead to get some data).
Upcoming Lesson:
More Sustainability For Plastic Items
How We Feel Today
If you have a wish or a question, feel free to reply to this email. Otherwise, I wish you a beautiful week! See you again on the 28th : )