# libraries needed for these examples
library(tidyverse)
library(dplyr)
set.seed(8675309) # makes sure random numbers are reproducible
The General Linear Model (GLM) a general mathematical framework for expressing relationships among variables that can express or test linear relationships between a numerical dependent variable and any combination of categorical or continuous independent variables.
There are some mathematical conventions that you need to learn to understand the equations representing linear models. Once you understand those, learning about the GLM will get much easier.
GLM成份 | 數學表示式 |
---|---|
依變項 (DV) | \(Y\) |
總平均 | \(\mu\) (希臘字母,音讀 “mu”) |
主要效果 | \(A, B, C, \ldots\) |
交互作用 | \(AB, AC, BC, ABC, \ldots\) |
隨機誤差 | \(S(Group)\) |
\(Y\) ~ \(\mu+A+B+AB+S(Group)\)
The linear equation predicts the dependent variable (\(Y\)) as the sum of the grand average value of \(Y\) (\(\mu\), also called the intercept), the main effects of all the predictor variables (\(A+B+C+ \ldots\)), the interactions among all the predictor variables (\(AB, AC, BC, ABC, \ldots\)), and some random error (\(S(Group)\)). The equation for a model with two predictor variables (\(A\) and \(B\)) and their interaction (\(AB\)) is written like this:
Don’t worry if this doesn’t make sense until we walk through a concrete example.
A good way to learn about linear models is to simulate data where you know exactly how the variables are related, and then analyse this simulated data to see where the parameters show up in the analysis.
We’ll start with a very simple linear model that just has a single categorical factor with two levels. Let’s say we’re predicting reaction times for congruent and incongruent trials in a Stroop task for a single participant. Average reaction time (mu
) is 800ms, and is 50ms faster for congruent than incongruent trials (effect
).
A factor (因子) is a categorical variable that is used to divide subjects into groups, usually to draw some comparison. Factors are composed of different levels(層次, 資料型態必須是numeric). Do not confuse factors with levels!
You need to represent categorical factors with numbers. The numbers, or coding you choose will affect the numbers you get out of the analysis and how you need to interpret them. Here, we will effect code the trial types so that congruent trials are coded as +0.5, and incongruent trials are coded as -0.5.
A person won’t always respond exactly the same way. They might be a little faster on some trials than others, due to random fluctuations in attention, learning about the task, or fatigue. So we can add an error term to each trial. We can’t know how much any specific trial will differ, but we can characterise the distribution of how much trials differ from average and then sample from this distribution.
Here, we’ll assume the error term is sampled from a normal distribution with a standard deviation of 100 ms (the mean of the error term distribution is always 0). We’ll also sample 100 trials of each type, so we can see a range of variation.
So first create variables for all of the parameters that describe your data.
n_per_grp <- 100
mu <- 800 # average RT
effect <- 50 # average difference between congruent and incongruent trials
error_sd <- 100 # standard deviation of the error term
trial_types <- c("一致" = 0.5, "不一致" = -0.5) # effect code 獨變項/線性模型因子,向量數值是層次
Then simulate the data by creating a data table with a row for each trial and columns for the trial type and the error term (random numbers samples from a normal distribution with the SD specified by error_sd
). For categorical variables, include both a column with the text labels (trial_type
) and another column with the coded version (trial_type.e
) to make it easier to check what the codings mean and to use for graphing. Calculate the dependent variable (RT
) as the sum of the grand mean (mu
), the coefficient (effect
) multiplied by the effect-coded predictor variable (trial_type.e
), and the error term.
dat <- data.frame(
trial_type = rep(names(trial_types), each = n_per_grp) ## 注意重覆的數值是characters
) %>%
mutate(
trial_type.e = recode(trial_type, !!!trial_types), ## 見以下說明
error = rnorm(nrow(.), 0, error_sd),
RT = mu + effect*trial_type.e + error
)
glimpse(dat)
## Rows: 200
## Columns: 4
## $ trial_type <chr> "一致", "一致", "一致", "一致", "一致", "一致", "一致", "~
## $ trial_type.e <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.~
## $ error <dbl> -99.658235, 72.182415, -61.720882, 202.939157, 106.541605~
## $ RT <dbl> 725.3418, 897.1824, 763.2791, 1027.9392, 931.5416, 923.72~
The !!!
(triple bang) in the code recode(trial_type, !!!trial_types)
is a way to expand the vector trial_types <- c(“congruent” = 0.5, “incongruent” = -0.5)
. It’s equivalent to recode(trial_type, “congruent” = 0.5, “incongruent” = -0.5)
. This pattern avoids making mistakes with recoding because there is only one place where you set up the category to code mapping (in the trial_types
vector).
Last but not least, always plot simulated data to make sure it looks like you expect.
ggplot(dat, aes(trial_type, RT)) + ## 繪圖成分未包括'error'
geom_violin() +
geom_boxplot(aes(fill = trial_type),
width = 0.25, show.legend = FALSE)
Now we can analyse the data we simulated using the function lm()
. It takes the formula as the first argument. This is the same as the data-generating equation, but you can omit the error term (this is implied), and takes the data table as the second argument. Use the summary()
function to see the statistical summary.
my_lm <- lm(RT ~ trial_type.e, data = dat)
summary(my_lm)
##
## Call:
## lm(formula = RT ~ trial_type.e, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -264.917 -66.789 3.589 65.049 197.709
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 800.075 6.792 117.80 < 2e-16 ***
## trial_type.e 60.311 13.584 4.44 1.49e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 96.05 on 198 degrees of freedom
## Multiple R-squared: 0.09054, Adjusted R-squared: 0.08595
## F-statistic: 19.71 on 1 and 198 DF, p-value: 1.495e-05
Notice how the estimate for the (Intercept)
is close to the value we set for mu
and the estimate for trial_type.e
is close to the value we set for effect
.
Change the values of mu
and effect
, resimulate the data, and re-run the linear model. What happens to the estimates?
You can use the residuals()
function to extract the error term for each each data point. This is the DV values, minus the estimates for the intercept and trial type. We’ll make a density plot of the residuals below and compare it to the normal distribution we used for the error term.
res <- residuals(my_lm)
ggplot(dat) +
stat_function(aes(0), color = "grey60",
fun = dnorm, n = 101,
args = list(mean = 0, sd = error_sd)) +
geom_density(aes(res, color = trial_type))
You can also compare the model residuals to the simulated error values. If the model is accurate, they should be almost identical. If the intercept estimate is slightly off, the points will be slightly above or below the black line. If the estimate for the effect of trial type is slightly off, there will be a small, systematic difference between residuals for congruent and incongruent trials.
ggplot(dat) +
geom_abline(slope = 1) +
geom_point(aes(error, res,color = trial_type)) +
ylab("Model Residuals") +
xlab("Simulated Error")
Q-Q plot 可檢查資料的殘差是否符合常態分佈。
What happens to the residuals if you fit a model that ignores trial type (e.g., lm(Y ~ 1, data = dat)
)?
You can use the estimates from your model to predict new data points, given values for the model parameters. For this simple example, we just need to now the trial type to make a prediction.
For congruent trials, you would predict that a new data point would be equal to the intercept estimate plus the trial type estimate multiplied by 0.5 (the effect code for congruent trials).
int_est <- my_lm$coefficients[["(Intercept)"]]
tt_est <- my_lm$coefficients[["trial_type.e"]]
tt_code <- trial_types[["一致"]]
int_est
## [1] 800.0747
tt_est
## [1] 60.31121
tt_code
## [1] 0.5
new_congruent_RT <- int_est + tt_est * tt_code
new_congruent_RT
## [1] 830.2303
You can also use the predict()
function to do this more easily. The second argument is a data table with columns for the factors in the model and rows with the values that you want to use for the prediction.
predict(my_lm, newdata = tibble(trial_type.e = 0.5)) ## 模擬資料的總平均
## 1
## 830.2303
If you look up this function using ?predict
, you will see that “The function invokes particular methods which depend on the class of the first argument.”
What this means is that predict()
works differently depending on whether you’re predicting from the output of lm()
or other analysis functions. You can search for help on the lm version with ?predict.lm
.
In the example above, we used effect coding for trial type. You can also use sum coding, which assigns +1 and -1 to the levels instead of +0.5 and -0.5. More commonly, you might want to use treatment coding, which assigns 0 to one level (usually a baseline or control condition) and 1 to the other level (usually a treatment or experimental condition).
Here we will add sum-coded and treatment-coded versions of trial_type
to the dataset using the recode()
function.
## 設定新的效果編碼 方法1
dat <- dat %>% mutate(
trial_type.sum = recode(trial_type, "一致" = +1, "不一致" = -1),
trial_type.tr = recode(trial_type, "一致" = 1, "不一致" = 0)
)
If you define named vectors with your levels and coding, you can use them with the recode()
function if you expand them using !!!
.
## 設定新的效果編碼 方法2
tt_sum <- c("一致" = +1,
"不一致" = -1)
tt_tr <- c("一致" = 1,
"不一致" = 0)
dat <- dat %>% mutate(
trial_type.sum = recode(trial_type, !!!tt_sum),
trial_type.tr = recode(trial_type, !!!tt_tr)
)
Here are the coefficients for the effect-coded version. They should be the same as those from the last analysis.
lm(RT ~ trial_type.e, data = dat)$coefficients ## 運用原始效果編碼預測模擬資料的總平均與實驗效果
## (Intercept) trial_type.e
## 800.07467 60.31121
Here are the coefficients for the sum-coded version. This give the same results as effect coding, except the estimate for the categorical factor will be exactly half as large, as it represents the difference between each trial type and the hypothetical condition of 0 (the overall mean RT), rather than the difference between the two trial types.
lm(RT ~ trial_type.sum, data = dat)$coefficients ## 運用"tt_sum"編碼預測模擬資料的總平均與實驗效果
## (Intercept) trial_type.sum
## 800.0747 30.1556
Here are the coefficients for the treatment-coded version. The estimate for the categorical factor will be the same as in the effect-coded version, but the intercept will decrease. It will be equal to the intercept minus the estimate for trial type from the sum-coded version.
lm(RT ~ trial_type.tr, data = dat)$coefficients ## 運用"tt_tr"編碼預測模擬資料的總平均與實驗效果
## (Intercept) trial_type.tr
## 769.91907 60.31121
思考:你要相信那一種效果編碼呈現的預測?先決定你要“確認”
推薦相關教材:
Common statistical tests are linear models (or: how to teach stats) by Jonas Kristoffer Lindeløv
翻译:常见统计检验的本质都是线性模型 by 黄俊文
比較t test, ANOVA與線性迴歸的分析結果
The t-test is just a special, limited example of a general linear model.
t.test(RT ~ trial_type.e, data = dat, var.equal = TRUE)
##
## Two Sample t-test
##
## data: RT by trial_type.e
## t = -4.4398, df = 198, p-value = 1.495e-05
## alternative hypothesis: true difference in means between group -0.5 and group 0.5 is not equal to 0
## 95 percent confidence interval:
## -87.09946 -33.52296
## sample estimates:
## mean in group -0.5 mean in group 0.5
## 769.9191 830.2303
What happens when you use other codings for trial type in the t-test above? Which coding maps onto the results of the t-test best?
ANOVA is also a special, limited version of the linear model.
my_aov <- aov(RT ~ trial_type.e, data = dat)
summary(my_aov, intercept = TRUE)
## Df Sum Sq Mean Sq F value Pr(>F)
## (Intercept) 1 128023897 128023897 13875.67 < 2e-16 ***
## trial_type.e 1 181872 181872 19.71 1.49e-05 ***
## Residuals 198 1826848 9227
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The easiest way to get parameters out of an analysis is to use the broom::tidy()
function. This returns a tidy table that you can extract numbers of interest from. Here, we just want to get the F-value for the effect of trial_type. Compare the square root of this value to the t-value from the t-tests above.
注意t檢定與f檢定的數值關係!
f <- broom::tidy(my_aov)$statistic[1]
sqrt(f)
## [1] 4.43981
剖析一因子獨立樣本變異數分析
We’ll walk through an example of a one-way ANOVA with the following equation:
\(Y_{ij} = \mu + A_i + S(A)_{ij}\)
This means that each data point (\(Y_{ij}\)) is predicted to be the sum of the grand mean (\(\mu\)), plus the effect of factor A (\(A_i\)), plus some residual error (\(S(A)_{ij}\)).
Let’s create a simple simulation function so you can quickly create a two-sample dataset with specified Ns, means, and SDs.
製造示範用模擬資料。兩組各有五筆觀察值。
two_sample <- function(n = 10, m1 = 0, m2 = 0, sd1 = 1, sd2 = 1) {
s1 <- rnorm(n, m1, sd1)
s2 <- rnorm(n, m2, sd2)
data.frame(
Y = c(s1, s2),
grp = rep(c("A", "B"), each = n)
)
}
Now we will use two_sample()
to create a dataset dat
with N=5 per group, means of -2 and +2, and SDs of 1 and 1 (yes, this is an effect size of d = 4).
dat <- two_sample(5, -2, +2, 1, 1) ## 兩組平均值異預期是0
dat
## Y grp
## 1 -2.0883648 A
## 2 -1.9188596 A
## 3 -0.6118548 A
## 4 -2.0234106 A
## 5 -1.0124073 A
## 6 1.8553036 B
## 7 1.3965484 B
## 8 1.8102043 B
## 9 2.8726288 B
## 10 0.7976260 B
You can calculate how each data point (Y
) deviates from the overall sample mean (\(\hat{\mu}\)), which is represented by the horizontal grey line below and the deviations are the vertical grey lines. You can also calculate how different each point is from its group-specific mean (\(\hat{A_i}\)), which are represented by the horizontal coloured lines below and the deviations are the coloured vertical lines.
You can use these deviations to calculate variability between groups and within groups. ANOVA tests whether the variability between groups is larger than that within groups, accounting for the number of groups and observations.
先了解ANOVA表的資訊
aov(Y ~ grp, data = dat) %>% summary(intercept = TRUE) ## 傳統統計書的ANOVA表不會顯示截距項目
## Df Sum Sq Mean Sq F value Pr(>F)
## (Intercept) 1 0.116 0.116 0.225 0.648
## grp 1 26.854 26.854 52.091 9.09e-05 ***
## Residuals 8 4.124 0.516
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can use the estimation equations for a one-factor ANOVA to calculate the model components.
mu
is the overall meana
is how different each group mean is from the overall meanerr
is residual error, calculated by subtracting mu
and a
from Y
This produces a decomposition matrix, a table with columns for Y
, mu
, a
, and err
.
decomp <- dat %>%
select(Y, grp) %>%
mutate(mu = mean(Y)) %>% # calculate mu_hat
group_by(grp) %>%
mutate(a = mean(Y) - mu) %>% # calculate a_hat for each grp
ungroup() %>%
mutate(err = Y - mu - a) # calculate residual error
Y | grp | mu | a | err |
---|---|---|---|---|
-2.0883648 | A | 0.1077414 | -1.638721 | -0.5573853 |
-1.9188596 | A | 0.1077414 | -1.638721 | -0.3878802 |
-0.6118548 | A | 0.1077414 | -1.638721 | 0.9191247 |
-2.0234106 | A | 0.1077414 | -1.638721 | -0.4924312 |
-1.0124073 | A | 0.1077414 | -1.638721 | 0.5185721 |
1.8553036 | B | 0.1077414 | 1.638721 | 0.1088414 |
1.3965484 | B | 0.1077414 | 1.638721 | -0.3499138 |
1.8102043 | B | 0.1077414 | 1.638721 | 0.0637421 |
2.8726288 | B | 0.1077414 | 1.638721 | 1.1261665 |
0.7976260 | B | 0.1077414 | 1.638721 | -0.9488362 |
Calculate sums of squares for mu
, a
, and err
.
SS <- decomp %>%
summarise(SS_Y = sum(Y^2), ## 觀察值平方和
mu = sum(mu*mu), ## 總平均(截距)平方和
a = sum(a*a), ## 分組離均差平方和
err = sum(err*err)) ## 殘差平方和
SS_Y | mu | a | err |
---|---|---|---|
31.09436 | 0.1160821 | 26.85406 | 4.124216 |
If you’ve done everything right, SS$mu + SS$a + SS$err
should equal the sum of squares for Y.
SS_Y <- sum(decomp$Y^2)
all.equal(SS_Y, SS$mu + SS$a + SS$err) ## 觀察值平方和 vs. 三種平方和的總和
## [1] TRUE
Divide each sum of squares by its corresponding degrees of freedom (df) to calculate mean squares. The df for mu
is 1, the df for factor a
is K-1
(K is the number of groups), and the df for err
is N - K
(N is the number of observations).
K <- n_distinct(dat$grp) # 資料組數
N <- nrow(dat) # 觀察值總樣本數
df <- c(mu = 1, a = K - 1, err = N - K) # 自由度
MS <- SS / df # 離均差平方和平均;變異數期望值
SS_Y | mu | a | err |
---|---|---|---|
31.09436 | 0.1160821 | 3.356757 | 4.124216 |
Then calculate an F-ratio for mu
and a
by dividing their mean squares by the error term mean square. Get the p-values that correspond to these F-values using the pf()
function.
F_mu <- MS$mu / MS$err # 截距F值
F_a <- MS$a / MS$err # 主要效果F值
p_mu <- pf(F_mu, df1 = df['mu'], df2 = df['err'], lower.tail = FALSE) # 截距p值
p_a <- pf(F_a, df1 = df['a'], df2 = df['err'], lower.tail = FALSE) # 主要效果p值
Put everything into a data frame to display it in the same way as the ANOVA summary function.
## 自行排版的ANOVA表
my_calcs <- data.frame(
term = c("Intercept", "grp", "Residuals"),
Df = df,
SS = c(SS$mu, SS$a, SS$err),
MS = c(MS$mu, MS$a, MS$err),
F = c(F_mu, F_a, NA),
p = c(p_mu, p_a, NA)
)
term | Df | SS | MS | F | p | |
---|---|---|---|---|---|---|
mu | Intercept | 1 | 0.116 | 0.116 | 0.028 | 0.871 |
a | grp | 1 | 26.854 | 3.357 | 0.814 | 0.393 |
err | Residuals | 8 | 4.124 | 4.124 | NA | NA |
Now run a one-way ANOVA on your results and compare it to what you obtained in your calculations.
再看一次ANOVA表的資訊
aov(Y ~ grp, data = dat) %>% summary(intercept = TRUE)
## Df Sum Sq Mean Sq F value Pr(>F)
## (Intercept) 1 0.116 0.116 0.225 0.648
## grp 1 26.854 26.854 52.091 9.09e-05 ***
## Residuals 8 4.124 0.516
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using the code above, write your own function that takes a table of data and returns the ANOVA results table like above.