Mixed models: why or why not?

(But probably why yes!)

Phillip M. Alday

Beacon Biosignals

The General Linear Model

Classical tests are all regression in disguise

  • See more examples of this at: https://lindeloev.github.io/tests-as-linear/
  • Two independent samples of 10 elements
  • Both true variance of 1
  • \(a\) has mean 0
  • \(b\) has mean 1
[Figure: the two simulated samples]
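The test and regression outputs on the next few slides look like the output of Julia’s HypothesisTests.jl and GLM.jl; here is a minimal sketch of how data like this could be simulated and analyzed that way (the seed, and therefore the exact numbers, are an assumption on my part):

using DataFrames, Distributions, GLM, HypothesisTests, Random

# simulate two independent samples of 10 with true variance 1,
# true means 0 and 1 (illustrative seed)
rng = MersenneTwister(42)
a = rand(rng, Normal(0, 1), 10)
b = rand(rng, Normal(1, 1), 10)

# classical tests
EqualVarianceTTest(a, b)   # two-sample t-test assuming equal variances
OneWayANOVATest(a, b)      # one-way ANOVA

# the same comparison as a regression
dat = DataFrame(x=repeat(["a", "b"]; inner=10), y=vcat(a, b))
lm(@formula(y ~ 1 + x), dat)   # treatment/dummy coding by default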

t-test

Two sample t-test (equal variance)
----------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          -1.12202
    95% confidence interval: (-1.723, -0.5208)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.0010

Details:
    number of observations:   [10,10]
    t-statistic:              -3.9206839147444157
    degrees of freedom:       18
    empirical standard error: 0.2861791307330337

ANOVA

One-way analysis of variance (ANOVA) test
-----------------------------------------
Population details:
    parameter of interest:   Means
    value under h_0:         "all equal"
    point estimate:          NaN

Test summary:
    outcome with 95% confidence: reject h_0
    p-value:                     0.0010

Details:
    number of observations: [10, 10]
    F statistic:            15.3718
    degrees of freedom:     (1, 18)

Linear regression

treatment, i.e. dummy coding

─────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)  -0.229102    0.202359  -1.13    0.2724  -0.654243   0.196039
x: b          1.12202     0.286179   3.92    0.0010   0.520778   1.72326
─────────────────────────────────────────────────────────────────────────
F-test against the null model:
F-statistic: 15.37 on 20 observations and 1 degrees of freedom, p-value: 0.0010

Linear regression

effects, i.e. sum coding

───────────────────────────────────────────────────────────────────────
                Coef.  Std. Error     t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  0.331907     0.14309  2.32    0.0323  0.0312873   0.632527
x: b         0.561009     0.14309  3.92    0.0010  0.260389    0.861629
───────────────────────────────────────────────────────────────────────
F-test against the null model:
F-statistic: 15.37 on 20 observations and 1 degrees of freedom, p-value: 0.0010

Linear regression

full dummy coding, i.e. one-hot, i.e. dummy coding without an intercept

──────────────────────────────────────────────────────────────────
          Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────
x: a  -0.229102    0.202359  -1.13    0.2724  -0.654243   0.196039
x: b   0.892916    0.202359   4.41    0.0003   0.467775   1.31806
──────────────────────────────────────────────────────────────────
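The three parameterizations above are the same model fit with different contrast codings; a hedged sketch of how each could be requested, assuming the dat data frame from the earlier sketch and StatsModels-style contrasts:

using GLM, StatsModels

# treatment (dummy) coding: intercept = mean of reference level a,
# slope = difference b - a
lm(@formula(y ~ 1 + x), dat; contrasts=Dict(:x => DummyCoding()))

# effects (sum) coding: intercept = grand mean of the group means,
# slope = half the b - a difference
lm(@formula(y ~ 1 + x), dat; contrasts=Dict(:x => EffectsCoding()))

# full dummy / one-hot coding: drop the intercept, one coefficient per group mean
lm(@formula(y ~ 0 + x), dat)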

What happens with 3+ groups?

  • group \(c\) with true mean -1, true variance 1
[Figure: the three simulated samples]

Linear regression

dummy coding

─────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)  -0.229102    0.214824  -1.07    0.2957  -0.669883   0.21168
x: b          1.12202     0.303806   3.69    0.0010   0.498659   1.74538
x: c         -0.184637    0.303806  -0.61    0.5484  -0.807997   0.438722
─────────────────────────────────────────────────────────────────────────
F-test against the null model:
F-statistic: 10.84 on 30 observations and 2 degrees of freedom, p-value: 0.0004

ANOVA

One-way analysis of variance (ANOVA) test
-----------------------------------------
Population details:
    parameter of interest:   Means
    value under h_0:         "all equal"
    point estimate:          NaN

Test summary:
    outcome with 95% confidence: reject h_0
    p-value:                     0.0004

Details:
    number of observations: [10, 10, 10]
    F statistic:            10.8357
    degrees of freedom:     (2, 27)

Explicit regression gives you more control than classical tests

  • but also more responsibility!
  • you can test distinct but related hypotheses
  • you get explicit estimates of effect sizes
  • you can customize different parts of the model to get variations
    • mixture of continuous and categorical predictors (ANOVA + ANCOVA)
    • control which interactions are present
    • interactions are resolved as part of a single step: no post-hoc t-test necessary
    • control over the ‘family’ / response distribution to model e.g. yes/no responses (binomial), counts (Poisson), etc. (see the sketch after this list)
  • relationship between ANOVA tests and t-tests becomes more explicit
    • ANOVA is an omnibus test
    • t-tests are individual contrasts
    • more complicated tests are variations on model comparisons
  • contrasts can be hard but…
    • they are no harder than your research question
    • explicit choice of contrasts and model comparison more informative than the types of sums of squares
  • lack of balance not a problem
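As an example of the control over the response family mentioned above, swapping the family is just a different argument to the same machinery; a sketch with a hypothetical data frame df and hypothetical columns correct, ncorrect, and condition, assuming GLM.jl:

using DataFrames, Distributions, GLM

# logistic regression for a yes/no (0/1) response
glm(@formula(correct ~ 1 + condition), df, Binomial(), LogitLink())

# Poisson regression for counts
glm(@formula(ncorrect ~ 1 + condition), df, Poisson(), LogLink())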

But what about repeated measures?

  • Two dependent* samples of 10 elements
  • Both true variance of 1
  • \(a\) has mean 0
  • \(b\) has mean 1
[Figure: the two dependent samples]

Paired samples t-test

One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0
    point estimate:          -1.12202
    95% confidence interval: (-1.674, -0.5705)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.0013

Details:
    number of observations:   10
    t-statistic:              -4.602356111250003
    degrees of freedom:       9
    empirical standard error: 0.24379206812308213

One-sample t-test on the difference

One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0
    point estimate:          -1.12202
    95% confidence interval: (-1.674, -0.5705)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.0013

Details:
    number of observations:   10
    t-statistic:              -4.602356111250003
    degrees of freedom:       9
    empirical standard error: 0.24379206812308213

Linear regression on the difference

─────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)  -1.12202    0.243792  -4.60    0.0013   -1.67351  -0.570522
─────────────────────────────────────────────────────────────────────────
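The three analyses above are equivalent; a minimal sketch, assuming paired vectors a and b (e.g. two measurements on the same 10 units):

using DataFrames, GLM, HypothesisTests

d = a .- b   # within-pair differences

OneSampleTTest(a, b)                 # paired t-test: tests the pairwise differences against 0
OneSampleTTest(d)                    # identical: one-sample t-test on the differences
lm(@formula(d ~ 1), DataFrame(d=d))  # identical: intercept-only regression on the differences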

Pairwise differences are not easily generalizable

  • what happens if we have 3+ groups? (rmANOVA ✔)
  • what happens if our covariates change from one measurement to the next within groups? (rmANOVA ✔ between vs. within variables)
  • what happens if we have more than 2 measurements per group? (rmANOVA ✔)
  • what if some groups are missing one or more measurements? (rmANOVA ❓)
  • what happens if there are multiple grouping variables? (rmANOVA ❌)
  • what happens if the conditional distribution is not normal? (rmANOVA ❌)

Regression and repeated measures

Strategies with classical regression

within-groups regression

  • aggregating within-group results may not propagate error correctly
  • all groups treated equally
  • unable to handle more complex grouping structures
  • separate by-item and by-subject analyses as a potential stopgap
  • no pooling of information between groups
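A sketch of this no-pooling strategy, assuming a hypothetical long-format data frame longdat with columns subj, days, and reaction:

using DataFrames, GLM, Statistics

# first stage: a separate regression per subject
bysubj = combine(groupby(longdat, :subj)) do sdf
    m = lm(@formula(reaction ~ 1 + days), sdf)
    (; intercept=coef(m)[1], slope=coef(m)[2])
end

# second stage: summarize the per-subject slopes; note that the uncertainty
# of each first-stage fit is simply discarded here
mean(bysubj.slope), std(bysubj.slope) / sqrt(nrow(bysubj))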

ignore grouping structure or include group as covariate (complete pooling)

  • violates independence assumption
  • standard errors incorrect (too small)
  • if group included as categorical variable, explosion in number of parameters
    • more complicated to interpret
    • lower power
  • all observations treated equally
  • complete pooling of information between groups

Mixed-effects Models

  • can handle more complicated grouping structures
  • can handle imbalance at all levels
  • better group-level predictions
  • can handle both between and within variables seamlessly
  • partitioning of group vs. observation variance based on the evidence
  • partial pooling of information between groups

Why mixed models?

Classic example dataset: sleepstudy

  • reaction time study following \(x\) days of sleep restriction
  • on average, we expect a worsening of reaction time over several days
  • individuals may differ in baseline reaction time or worsening

Classic example dataset: sleepstudy

[Figure: sleepstudy data]

Classic example dataset: sleepstudy

[Figure: sleepstudy data, by subject]

Partitioning between- and within-group variance

────────────────────────────────────────────────────────
                 Est.      SE      z       p   σ_subj
────────────────────────────────────────────────────────
(Intercept)  251.4051  6.6323  37.91  <1e-99  23.7805
days          10.4673  1.5022   6.97  <1e-11   5.7168
Residual                                       25.5918
────────────────────────────────────────────────────────
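A summary like the one above can be obtained with MixedModels.jl, which ships the sleepstudy data; a minimal sketch (the printed format differs slightly from the table above):

using MixedModels

sleepstudy = MixedModels.dataset(:sleepstudy)

# fixed effect for days plus correlated by-subject random intercepts and slopes
fm1 = fit(MixedModel,
          @formula(reaction ~ 1 + days + (1 + days | subj)),
          sleepstudy)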

Shrinkage and borrowing strength

Shrinkage and borrowing strength: most changed

Shrinkage and borrowing strength: moderately changed

Shrinkage and borrowing strength: least changed

Shrinkage and borrowing strength

Subject-level predictions

And many more reasons!

  • Multiple levels: (partial) crossing and nesting
  • Parsimony
  • One unified framework
    • Normal and non-normal responses
    • Mixture of categorical and continuous predictors
    • Balance isn’t an issue
  • Explicit model:
    • effect estimates
    • easier to see impact of potential violations of assumptions
    • much clearer distinction between significance and explanatory power

Why not mixed models?

(what you need to watch out for when moving to mixed models)

Contrast coding

  • Hinted at earlier, but contrast coding requires thinking rather explicitly about your actual hypotheses beyond “there is a difference somewhere”
  • Results in the literature are not interpretable without knowing the contrast scheme used (Brehm and Alday 2022)
  • The same problem existed historically for ANOVA – results are not interpretable without knowing whether Type I, II, or III sums of squares were used
  • Good tutorial in R: Schad et al. (2020)

Random-effects selection

  • This is a huge topic and the source of a long debate.
  • There are problems with many of the proposed rules of thumb, because rules of thumb tend to ignore the tradeoffs involved
  • See also Baayen, Davidson, and Bates (2008), Matuschek et al. (2017), Bates et al. (2018), and Bates (2019)

Convergence, compute time and the computational vs. statistical problems

  • Unlike classical tests and OLS regression, which are based on direct computations, mixed models require a more complicated fitting process
    • This can break down in various ways
    • This can take substantially longer than ANOVA
  • Breakdowns of the fitting process can often be resolved by better understanding the warnings and the deeper meaning of the statistics in question
  • Overly cautious warnings in some software (e.g. lme4) have often been interpreted as a failure of the software instead of a statement about the statistical problem (see also https://rpubs.com/palday/lme4-singular-convergence)
  • Folk Theorem of Statistical Computing (Gelman): When you have computational problems, often there’s a problem with your model

Breakdown of some overly simple definitions from introductory statistics

  • “degrees of freedom” no longer a trivial concept
  • p-values often depend on degrees of freedom, so they are now more difficult
  • largely averted by using confidence intervals (see also Cumming 2014); a bootstrap-based sketch follows this list
  • \(R^2\) and standardized effect sizes are also more challenging, see e.g. these links.
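One way to get such intervals for a mixed model is a parametric bootstrap; a hedged sketch assuming MixedModels.jl’s parametricbootstrap and shortestcovint and the fm1 model from the sleepstudy sketch above:

using DataFrames, MixedModels, Random

# simulate new responses from the fitted model and refit 1000 times
boot = parametricbootstrap(MersenneTwister(1234), 1000, fm1)

# shortest 95%-coverage intervals for all parameters
# (fixed effects, variance components, correlations)
DataFrame(shortestcovint(boot))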

References

Baayen, R. H., D. J. Davidson, and D. M. Bates. 2008. “Mixed-Effects Modeling with Crossed Random Effects for Subjects and Items.” Journal of Memory and Language, Special Issue: Emerging Data Analysis, 59 (4): 390–412. https://doi.org/10.1016/j.jml.2007.12.005.
Bates, Douglas M. 2019. “Complexity in Fitting Linear Mixed Models.” Nextjournal, August. https://doi.org/10.33016/nextjournal.100002.
Bates, Douglas, Reinhold Kliegl, Shravan Vasishth, and Harald Baayen. 2018. “Parsimonious Mixed Models.” arXiv:1506.04967 [Stat], May. http://arxiv.org/abs/1506.04967.
Brehm, Laurel, and Phillip M. Alday. 2022. “Contrast Coding Choices in a Decade of Mixed Models.” Journal of Memory and Language 125 (August): 104334. https://doi.org/10.1016/j.jml.2022.104334.
Cumming, Geoff. 2014. “The New Statistics: Why and How.” Psychological Science 25 (1): 7–29. https://doi.org/10.1177/0956797613504966.
Matuschek, Hannes, Reinhold Kliegl, Shravan Vasishth, Harald Baayen, and Douglas Bates. 2017. “Balancing Type I Error and Power in Linear Mixed Models.” Journal of Memory and Language 94 (June): 305–15. https://doi.org/10.1016/j.jml.2017.01.001.
Schad, Daniel J., Shravan Vasishth, Sven Hohenstein, and Reinhold Kliegl. 2020. “How to Capitalize on a Priori Contrasts in Linear (Mixed) Models: A Tutorial.” Journal of Memory and Language 110 (February): 104038. https://doi.org/10.1016/j.jml.2019.104038.