6 Data wrangling 2
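Because our university has not purchased the campus server edition of RStudio, the demonstrations and explanations in 6.1 Walkthrough video do not apply to what is actually taught in this unit. Apart from students who do not yet have their own computer, from this unit onward students should be able to use R and RStudio fluently on their own machine.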
6.1 Walkthrough video
There is a walkthrough video of this chapter available via Zoom.
- Video notes: this video was recorded in 2020, when we recommended using the server over installing R on your computer. With more experience of the server, we now strongly encourage you to install R on your computer if you can. There are no other differences between the video and this book chapter.
6.2 Activity 1: Set-up
- Open R Studio and ensure the environment is clear.
- Open the stub-wrangling-2.Rmd file and ensure that the working directory is set to your Data Skills folder and that the two .csv data files (participant-info.csv and ahi-cesd.csv) are in your working directory (you should see them in the file pane).
- If you’re on the server, avoid a number of issues by restarting the session: click Session - Restart R.
- Type and run the code below to load the tidyverse package and to load in the data files.
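library(tidyverse)
# Read in the two data files from the working directory
dat <- read_csv('ahi-cesd.csv')
pinfo <- read_csv('participant-info.csv')
# Keep only rows whose id and intervention appear in both tables
all_dat <- inner_join(dat, pinfo, by = c("id", "intervention"))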
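To see what the join step does without the course files, here is a minimal sketch on made-up data. The tables toy_dat and toy_pinfo and their values are hypothetical stand-ins, not the real ahi-cesd.csv or participant-info.csv.

```r
library(dplyr)

# Hypothetical toy tables standing in for the two .csv files
toy_dat <- tibble(
  id = c(1, 2, 3),
  intervention = c(4, 1, 2),
  ahiTotal = c(63, 73, 77)
)
toy_pinfo <- tibble(
  id = c(1, 2, 4),
  intervention = c(4, 1, 3),
  age = c(35, 59, 51)
)

# inner_join() keeps only rows whose id AND intervention appear in both tables
toy_all <- inner_join(toy_dat, toy_pinfo, by = c("id", "intervention"))
toy_all
```

Only participants 1 and 2 appear in both toy tables, so the joined table has two rows and carries the columns from both sides.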
Now let’s start working with our tidyverse verb functions…
6.3 Activity 2: Select
Select the columns ahiTotal, cesdTotal, sex, age, educ, income, occasion, elapsed.days from all_dat and store the result in an object called summarydata.
summarydata <- select(all_dat, ahiTotal, cesdTotal, sex, age, educ, income, occasion, elapsed.days)
If you get an error message when using select that says unused argument, it means that R is trying to use the wrong version of the select function. There are two solutions. First, you can save your work, restart the R session (click Session - Restart R), and then run all your code again from the start. Second, you can replace select with dplyr::select, which tells R exactly which version of the select function to use. We’d recommend restarting the session because it gets you into the habit, and it’s a useful thing to try for a range of problems.
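As a small illustration of the dplyr::select form, the sketch below uses a made-up table (toy is hypothetical, not the course data). MASS is mentioned only as one example of a package that also defines a select function; which package causes the clash on your machine is an assumption.

```r
library(dplyr)

# Hypothetical toy table
toy <- tibble(id = 1:3, ahiTotal = c(63, 73, 77), age = c(35, 59, 51))

# The package::function form always calls dplyr's select(), even if another
# loaded package (MASS is one example) defines its own select()
toy_small <- dplyr::select(toy, ahiTotal, age)
names(toy_small)
```

The id column is dropped because it was not listed, which is the same behaviour you should see with the plain select call once the conflict is gone.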
Pause here and interpret the above code and output
- How would you translate this code into English?
- What columns have been removed from the data?
6.4 Activity 3: Arrange
Arrange the data in the variable created above (summarydata) by ahiTotal with lowest score first.
ahi_asc <- arrange(summarydata, by = ahiTotal)
- How could you arrange this data in descending order (highest score first)?
arrange(summarydata, by = desc(ahiTotal))
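The two sort directions can be compared side by side on a made-up table. This is only a sketch: toy and its scores are hypothetical, and the column is passed to arrange directly rather than through by =.

```r
library(dplyr)

# Hypothetical toy table
toy <- tibble(id = 1:4, ahiTotal = c(73, 63, 77, 60))

toy_asc  <- arrange(toy, ahiTotal)        # lowest score first
toy_desc <- arrange(toy, desc(ahiTotal))  # highest score first

# After sorting, the first row holds the extreme value
toy_asc$ahiTotal[1]
toy_desc$ahiTotal[1]
```

Reading the first row of each sorted table is a quick way to find the smallest and largest scores.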
What is the smallest ahiTotal score?
What is the largest ahiTotal score?
6.5 Activity 4: Filter
Filter the data in ahi_asc to keep only participants who are under 65 years of age (the code uses age < 65, so anyone aged exactly 65 is also removed).
age_65max <- filter(ahi_asc, age < 65)
- What does filter() do?
- How many observations are left in age_65max after running filter()?
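Whether a boundary value survives a filter depends on the comparison operator. The sketch below, on a made-up table (toy and its ages are hypothetical), contrasts < with <= and uses nrow to count the observations that remain.

```r
library(dplyr)

# Hypothetical toy table with one participant aged exactly 65
toy <- tibble(id = 1:4, age = c(35, 65, 70, 64))

under_65 <- filter(toy, age < 65)   # drops the 65-year-old
up_to_65 <- filter(toy, age <= 65)  # keeps the 65-year-old

# nrow() counts the observations left after filtering
nrow(under_65)
nrow(up_to_65)
```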
6.6 Activity 5: Summarise
Then, use summarise to create a new object called data_median, which holds the median ahiTotal score in this filtered data, and give the resulting column the name median_score.
data_median <- summarise(age_65max, median_score = median(ahiTotal))
What is the median score?
Change the above code to give you the mean score. What is the mean score to 2 decimal places?
summarise(age_65max, mean_score = mean(ahiTotal))
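A single summarise call can return several statistics at once. The sketch below is run on made-up scores (toy is hypothetical); wrapping the mean in round is one possible way to report it to 2 decimal places.

```r
library(dplyr)

# Hypothetical toy scores
toy <- tibble(ahiTotal = c(60, 63, 73, 77))

toy_summary <- summarise(
  toy,
  median_score = median(ahiTotal),
  mean_score = round(mean(ahiTotal), 2)  # rounded to 2 decimal places
)
toy_summary
```

The result is a one-row table with one column per statistic, named by the left-hand side of each argument.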
6.7 Activity 6: Group_by
Use mutate to create a new column called Happiness_Category in age_65max which categorises participants based on whether they score above the median ahiTotal score for all participants.
Then, group the data stored in age_65max by Happiness_Category, and store it in an object named happy_dat.
Finally, use summarise to calculate the median cesdTotal score for participants who scored above and below the median ahiTotal score and save it in a new object named data_median_group.
# Flag participants scoring above 74, used here as the median ahiTotal score
age_65max <- mutate(age_65max, Happiness_Category = (ahiTotal > 74))
# Group by the new TRUE/FALSE column
happy_dat <- group_by(age_65max, Happiness_Category)
# One median cesdTotal score per group
data_median_group <- summarise(happy_dat, median_score = median(cesdTotal))
If you get what looks like an error that says summarise() ungrouping output (override with .groups argument), don’t worry: this isn’t an error, it’s just R telling you what it has done. This message was added in a very recent update to the tidyverse, which is why it doesn’t appear in some of the walkthrough videos.
Pause here and interpret the above code and output
- What does group_by() do?
- How would you change the code to group by education rather than Happiness_Category?
group_by(age_65max, educ)
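The mutate, group_by, summarise sequence can be tried end to end on made-up data. In this sketch toy and its values are hypothetical, and the median is computed with median(ahiTotal) rather than typed in as a number; that is an alternative to the hard-coded value, not the approach used in this chapter's code.

```r
library(dplyr)

# Hypothetical toy scores
toy <- tibble(
  ahiTotal  = c(60, 63, 73, 77, 80),
  cesdTotal = c(30, 20, 10, 6, 2)
)

# TRUE for participants scoring above the median ahiTotal (73 in this toy data)
toy <- mutate(toy, Happiness_Category = ahiTotal > median(ahiTotal))

# group_by() marks the groups; summarise() then returns one row per group
toy_grouped <- group_by(toy, Happiness_Category)
toy_medians <- summarise(toy_grouped, median_score = median(cesdTotal))
toy_medians
```

Grouping by a different column only requires swapping the column name in the group_by call.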
6.8 Activity 7: Data visualisation
Copy, paste and run the code below in a new code chunk to create a plot of depression scores grouped by income level using the age_65max data.
ggplot(age_65max, aes(x = as.factor(income), y = cesdTotal, fill = as.factor(income))) +
  geom_violin(trim = FALSE, show.legend = FALSE, alpha = .6) +
  geom_boxplot(width = .2, show.legend = FALSE, alpha = .5) +
  scale_fill_viridis_d(option = "D") +
  scale_x_discrete(name = "Income Level", labels = c("Below Average", "Average", "Above Average")) +
  scale_y_continuous(name = "Depression Score")
Which income group has the highest median depression scores?
Which group has the highest density of scores at any one point?
Density is represented by the curvy line around the boxplot that looks a little bit like a (drunk) violin. The fatter the violin, the more data points there are at any one point. This means that in the above plot, the Above Average group has the highest density because this has the widest violin, i.e., there are lots of people in the Above Average income group with a score of about 5.
- Is income group a between-subject or within-subject variable?
Between-subject designs are where different participants are in different groups. Within-subject designs are where the same participants are in all groups. Income is an example of a between-subject variable because each participant can only be in one level of the independent variable.
6.9 Activity 8: Make R your own
Finally, you can customise how R Studio looks to make it feel more like your own personal version. Click Tools - Global Options - Appearance. You can change the default font, font size, and general appearance of R Studio, including using dark mode. Play around with the settings and see which one you prefer. You’re going to spend a lot of time with R, so it might as well look nice!
6.10 Finished!
Well done! As a final step, try knitting the file to HTML. Remember to save your Markdown in your Data Skills folder and make a note of any mistakes you made and how you fixed them.