10 Reshaping data
本單元轉換為課前觀看示範錄影,上課時間進行作業練習的模式。上課練習用的資料不同於電子書。
Let’s continue what we started in AQ Data and Recap by hand but now using R to calculate a score for each participant.
10.0.1 Activity 1: Load in the data
- Open a new R markdown document, name it “Reshaping Data” and save it in your Data Skills folder.
- Set the working directory to your Data Skills folder.
- Type and run the code that loads the
tidyverse
package. - Use
read_csv()
to load in the data. you should create three objectsresponses
,scoring
andqformats
that contain the respective data.
10.0.2 Activity 2: pivot_longer()
The first step is to transform the data from wide format to long format. To do this, we will use the function pivot_longer()
. pivot_longer()
takes multiple columns and collapses them so that each unique variable has it’s own column and has four main arguments:
data
is the name of the object you want to transformnames_to
is the name of the new column that you will create that will contain the names of the original wide format columnsvalues_to
is the name of the column that will contain the existing values.cols
are the original columns you want to collapse.
These functions can seem a bit abstract and it is better to show than tell. Run the below code in a new code chunk and then compare how rlong
looks compared to responses
and see if you can figure out what effect each argument had.
<- pivot_longer(data = responses,
rlong names_to = "Question",
values_to = "Response",
cols = Q1:Q10)
You have now created a tibble with 660 observations and 3 variables; 10 observations per 66 participants and 3 variables. Let’s recreate the example from the AQ Data and Recap and only use one participant. We can do that by using filter()
(which you used last semester) to create a new object called rlong_16.
10.0.3 Activity 3: filter()
Pause here and test your knowledge
- What does
filter()
do?
- Create a new object called
rlong_16
that usesfilter()
to keep only the data from participant Id 16.
Every year, the biggest problem with these exercises is typos caused by not paying attention to the exact spelling and capitalisation. Remember, Question
is not the same as question
, Response
is not the same as response
, and Id
is not the same as ID.
10.0.4 Activity 4: inner_join()
The next step is to match each question with its format (F or R) that is stored in qformats. That is, we need to join together the two objects using inner_join()
like we did in Psych 1A.
- Create a new object called
rlong_16_join
that usesinner_join()
to join togetherrlong_16
andqformats
by their common column. - If you get the error
Error: by can't contain join column XXXX which is missing from LHS
it means that you have made a typo. Check the exact spelling and capitalisation of the variable names.
inner_join()
matches up rows in the two tables where both tables have the same value for the variable named in the third argument, “Question.” It then combines the columns from the two tables, copying rows where necessary.
To state it more simply, what it does is the following: For each row in rlong
, it checks the value of the column Question
, and looks for rows with the same value in qformats
, and then essentially combines all of the other columns in the two tables for these matching rows. If there are values that don’t match, the rows get dropped. The inner_join()
function is one of the most useful and time-saving operations in data wrangling so keep practising as it will keep reappearing time after time.
10.0.5 Activity 5: Another inner_join()
Now that we have matched up each question with its corresponding format, we can now “look up” the corresponding scores in the scoring table based on the format and the response. This means we have to use inner_join()
once again to join rlong_16_join
with scoring
**
- Create a new object named
scores_16
that joins togetherrlong_16_join
withscoring
. - Be careful to tell R all the columns the two objects have in common. Remember that when you need to specify multiple variables you will need to use the syntax
by = c("var1", "var2)
. - The reason we have to do two separate
inner_joins()
is because they can only join two tables at once and we have three, so it requires two steps.
10.0.6 Activity 6: Calculating the AQ score
Now you need to calculate the total AQ score for participant 16.
- Create a new object called
AQ_16
. Usesummarise()
andsum()
to add up the numbers in the columnScore
fromscores_16
and call the result of this calculationAQ_score
. - This is a difficult task to do from memory but try it anyway - if you get anywhere near the right solution you’re doing extremely well!
10.0.7 Activity 7: Calculating all scores
Next we’re going to do the same thing but for all participants. The first two steps are the same but we just use the full data rlong
rather than the filtered dataset.
- Run the below code in a new code chunk.
<- inner_join(rlong, qformats, "Question")
rlong_join <- inner_join(rlong_join, scoring, c("QFormat", "Response")) scores
The final part of calculating the scores requires an extra step because now we don’t just want to calculate one score, we want to calculate a score for each participant which means that we need to use group_by()
to group by Id
. We’re not going to use it in this chapter but we also want our object to show us the participant’s gender so we will also add gender
to the grouping. If you want to refresh your memory about how group_by()
works, revise Data Wrangling 1.
<- group_by(scores, Id, gender)
scores_grouped <- summarise(scores_grouped, total_score = sum(Score)) AQ_all
10.0.8 Activity 8: Visualisation
Finally, use ggplot()
and geom_histogram()
to make a histogram of all the total AQ scores. Try and make it look pretty by changing the axis labels and the theme. You can check the solution code to see how the below example was made, but you can make yours look different.
- Hint 1:
ggplot(data, aes(x)) + geom_histogram()
- Hint 2: Add
binwidth = 1
togeom_histogram()
to change the width of the bars.
10.0.9 Activity solutions - Reshping data
10.0.9.1 Activity 1
library(tidyverse)
<- read_csv ("scoring.csv")
scoring <- read_csv("responses.csv")
responses <- read_csv("qformats.csv") qformats
10.0.9.2 Activity 3
<- filter(rlong, Id == 16) rlong_16
10.0.9.3 Activity 4
<- inner_join(rlong_16, qformats, "Question") rlong_16_join
10.0.9.4 Activity 5
<- inner_join(rlong_16_join, scoring, c("QFormat", "Response")) scores_16
10.0.9.5 Activity 6
<- summarise(scores_16, AQ_score = sum(Score)) AQ_16
10.0.9.6 Activity 8
ggplot(AQ_all, aes(x = total_score)) +
geom_histogram(binwidth = 1, colour = "black", fill = "grey") +
theme_minimal()+
scale_x_continuous(name = "Total AQ Score", breaks = c(0,1,2,3,4,5,6,7,8,9,10)) +
scale_y_continuous(name = "Count")