using BoxCox
using CairoMakie
using Random
activate!(; type="svg")
CairoMakie.
= abs2.(randn(MersenneTwister(42), 1000))
x let f = Figure()
= Axis(f[1,1]; xlabel="x", ylabel="density")
ax density!(ax, x)
= Axis(f[1,2]; xlabel="theoretical quantiles", ylabel="observed values")
ax qqnorm!(ax, x)
colsize!(f.layout, 1, Aspect(1, 1.0))
colsize!(f.layout, 2, Aspect(1, 1.0))
resize_to_layout!(f)
fend
BoxCox.jl
A lightweight package with nice extensions
The Box-Cox Transformation
Box & Cox (1964)
\[ \begin{cases} \frac{x^{\lambda} - 1}{\lambda} &\quad \lambda \neq 0 \\ \log x &\quad \lambda = 0 \end{cases} \]
The denominator serves to normalize the transformation and preserve the original direction of the effect (the sign is flipped when \(\lambda < 0\)). In application, we may only care about the numerator (e.g. when it suggests using “speed” instead of “time”.)
Example: Square of a Normal Distribution
Fitting the transformation
= fit(BoxCoxTransformation, x) bc
Box-Cox transformation
estimated λ: 0.2073
resultant transformation:
y^0.2 - 1
-----------
0.2
Examining the fitted transformation
If an appropriate Makie backend is loaded, then you can also do diagnostic plots.
let f = Figure(; size=(500, 300))
= Axis(f[1,1])
ax boxcoxplot!(ax, bc;
=0.95)
conf_level
fend
Applying the fitted transformation
bc.(x)'
1×1000 adjoint(::Vector{Float64}) with eltype Float64:
-1.04197 -1.37758 -3.74249 -1.89776 … -3.3449 -0.295085 -1.06768
let f = Figure(; size=(800, 300)), bcx = bc.(x)
= Axis(f[1,1]; xlabel="x", ylabel="density")
ax density!(ax, bcx)
= Axis(f[1,2]; xlabel="theoretical quantiles", ylabel="observed values")
ax qqnorm!(ax, bcx; qqline=:fitrobust)
colsize!(f.layout, 1, Aspect(1, 1.0))
colsize!(f.layout, 2, Aspect(1, 1.0))
resize_to_layout!(f)
fend
Conditional Distributions: transforming the response of a regression model
Example: Tree Growth
using DataFrames
using RDatasets: dataset as rdataset
= rdataset("datasets", "trees") trees
describe(trees)
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
Symbol | Float64 | Real | Float64 | Real | Int64 | DataType | |
1 | Girth | 13.2484 | 8.3 | 12.9 | 20.6 | 0 | Float64 |
2 | Height | 76.0 | 63 | 76.0 | 87 | 0 | Int64 |
3 | Volume | 30.171 | 10.2 | 24.2 | 77.0 | 0 | Float64 |
Linear Regression
= trees[!, :Volume]
y = hcat(ones(length(y)),
X log.(trees[!, :Height]),
log.(trees[!, :Girth]))
= fit(BoxCoxTransformation, X, y) bc_tree
Box-Cox transformation
estimated λ: -0.0673
resultant transformation:
y^-0.1 - 1
------------
-0.1
let f = Figure(; size=(400, 300))
boxcoxplot!(Axis(f[1,1]), bc_tree; conf_level=0.95)
fend
\(\lambda=0\) is well within the 95% CI and log fits in well with the rest of the model.
Linear Regression
using StatsModels
fit(BoxCoxTransformation,
@formula(Volume ~ 1 +
log(Height) +
log(Girth)),
trees)
Box-Cox transformation
estimated λ: -0.0673
resultant transformation:
y^-0.1 - 1
------------
-0.1
If StatsModels.jl is loaded (even indirectly via e.g. GLM.jl), then you can also use the @formula
macro to specify the regression model.
This also works for mixed models!
Reaction time in the sleep study
using MixedModels
= fit(MixedModel,
model @formula(reaction ~ 1 + days + (1 + days | subj)),
dataset(:sleepstudy))
Est. | SE | z | p | σ_subj | |
(Intercept) | 251.4051 | 6.6323 | 37.91 | <1e-99 | 23.7805 |
days | 10.4673 | 1.5022 | 6.97 | <1e-11 | 5.7168 |
Residual | 25.5918 |
Fitting the transformation
= fit(BoxCoxTransformation,
bc_mixed model)
Box-Cox transformation
estimated λ: -1.0747
resultant transformation:
y^-1.1 - 1
------------
-1.1
let f = Figure(; size=(400, 300))
boxcoxplot!(Axis(f[1,1]), bc_mixed; conf_level=0.95)
fend
\(\text{time}^{-1}\) has a natural interpretation as speed and -1 is nearly as good as the “optimal” transformation. We thus use our domain expertise and use speed (as responses per second) instead of the fitted result.
Speed in the sleep study
= fit(MixedModel,
model_bc @formula(1000 / reaction ~ 1 + days + (1 + days | subj)),
dataset(:sleepstudy))
Est. | SE | z | p | σ_subj | |
(Intercept) | 3.9658 | 0.1056 | 37.55 | <1e-99 | 0.4190 |
days | -0.1110 | 0.0151 | -7.37 | <1e-12 | 0.0566 |
Residual | 0.2698 |
We use 1000 in the numerator to scale things back to seconds from milliseconds.
References
Version Info
Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 8 × 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
JULIA_LOAD_PATH = @:@stdlib
This page was rendered from git revision 8c549ca .