7 Manipulating Data in the tidyverse
7.1 What is the Tidyverse
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. These packages include:
dplyr
for data manipulationtibble
for data organizingggplot2
for data visualizationtidyr
for tidy-ing your datareadr
for reading data into R
When we installed the tidyverse
package, we installed all of the above packages. This means that all of the specialized functions of these packages are available to use.
7.2 Tibbles
Tibbles are a modern take on data frames. They keep the features that have stood the test of time and drop the features that used to be convenient but are now frustrating (ie. converting character vectors to factors).
Create a tibble
You can create a tibble using the tibble()
function:
# Create
friends_data <- tibble(
name = c("Nicolas", "Thierry", "Bernard", "Jerome"),
age = c(27, 25, 29, 26),
height = c(180, 170, 185, 169),
married = c(TRUE, FALSE, TRUE, TRUE)
)
# Print
friends_data
## # A tibble: 4 × 4
## name age height married
## <chr> <dbl> <dbl> <lgl>
## 1 Nicolas 27 180 TRUE
## 2 Thierry 25 170 FALSE
## 3 Bernard 29 185 TRUE
## 4 Jerome 26 169 TRUE
Reading in tibbles
Tibbles can be read into R using the read_csv()
function. Note that the read_csv()
function is different than the read.csv()
function. The major difference is that read_csv()
will store the data as a tibble and read.csv()
will store the data as a data frame.
Data frames versus tibbles
For some applications, you will need to use data frames and for others tibbles. You will find that some older functions don’t work on tibbles. They can be easily converted using the functions as.data.frame()
or as_tibble()
:
## sample generation clade strain cit run genome_size
## 1 REL606 0 N/A REL606 unknown 4.62
## 2 REL1166A 2000 unknown REL606 unknown SRR098028 4.63
## 3 ZDB409 5000 unknown REL606 unknown SRR098281 4.60
## 4 ZDB429 10000 UC REL606 unknown SRR098282 4.59
## 5 ZDB446 15000 UC REL606 unknown SRR098283 4.66
## 6 ZDB458 20000 (C1,C2) REL606 unknown SRR098284 4.63
## # A tibble: 30 × 7
## sample generation clade strain cit run genome_size
## <chr> <int> <chr> <chr> <chr> <chr> <dbl>
## 1 REL606 0 N/A REL606 unknown "" 4.62
## 2 REL1166A 2000 unknown REL606 unknown "SRR098028" 4.63
## 3 ZDB409 5000 unknown REL606 unknown "SRR098281" 4.6
## 4 ZDB429 10000 UC REL606 unknown "SRR098282" 4.59
## 5 ZDB446 15000 UC REL606 unknown "SRR098283" 4.66
## 6 ZDB458 20000 (C1,C2) REL606 unknown "SRR098284" 4.63
## 7 ZDB464* 20000 (C1,C2) REL606 unknown "SRR098285" 4.62
## 8 ZDB467 20000 (C1,C2) REL606 unknown "SRR098286" 4.61
## 9 ZDB477 25000 C1 REL606 unknown "SRR098287" 4.65
## 10 ZDB483 25000 C3 REL606 unknown "SRR098288" 4.59
## # ℹ 20 more rows
## sample generation clade strain cit run genome_size
## 1 REL606 0 N/A REL606 unknown 4.62
## 2 REL1166A 2000 unknown REL606 unknown SRR098028 4.63
## 3 ZDB409 5000 unknown REL606 unknown SRR098281 4.60
## 4 ZDB429 10000 UC REL606 unknown SRR098282 4.59
## 5 ZDB446 15000 UC REL606 unknown SRR098283 4.66
## 6 ZDB458 20000 (C1,C2) REL606 unknown SRR098284 4.63
As you will see in the following sections, many of the useful data frame functions are the same for tibbles.
7.3 Exploring tibbles
We can explore the contents of a tibble in several ways. By typing the name of the tibble in the console, we can view the first ten rows of a tibble as above, which tells us lots of information about the column types and the number of rows. We can also use the glimpse()
function:
## Rows: 30
## Columns: 7
## $ sample <chr> "REL606", "REL1166A", "ZDB409", "ZDB429", "ZDB446", "ZDB45…
## $ generation <int> 0, 2000, 5000, 10000, 15000, 20000, 20000, 20000, 25000, 2…
## $ clade <chr> "N/A", "unknown", "unknown", "UC", "UC", "(C1,C2)", "(C1,C…
## $ strain <chr> "REL606", "REL606", "REL606", "REL606", "REL606", "REL606"…
## $ cit <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "un…
## $ run <chr> "", "SRR098028", "SRR098281", "SRR098282", "SRR098283", "S…
## $ genome_size <dbl> 4.62, 4.63, 4.60, 4.59, 4.66, 4.63, 4.62, 4.61, 4.65, 4.59…
As with data frames, we can return a vector containing the values of a variable (column) using the $ sign:
## [1] 0 2000 5000 10000 15000 20000 20000 20000 25000 25000 30000 30000
## [13] 31500 31500 31500 32000 32000 32500 32500 33000 33000 33000 34000 34000
## [25] 36000 36000 38000 38000 40000 40000
Again, similar to data frames, we can use the subsetting operator [] directly on tibbles. A tibble is two-dimensional, so we must pass two arguments to the [] operator; the first indicates the row(s) we require and the second indicates the column(s). To return the value in row 10, column 1:
## # A tibble: 1 × 1
## sample
## <chr>
## 1 ZDB483
Similarly, to return the values in rows 25 to 30, and columns 1 to 3:
If we leave an index blank, this acts as a wildcard and matches all of the rows or columns:
## # A tibble: 1 × 7
## sample generation clade strain cit run genome_size
## <chr> <int> <chr> <chr> <chr> <chr> <dbl>
## 1 CZB154 33000 Cit+ REL606 plus SRR098026 4.76
## # A tibble: 30 × 3
## sample generation clade
## <chr> <int> <chr>
## 1 REL606 0 N/A
## 2 REL1166A 2000 unknown
## 3 ZDB409 5000 unknown
## 4 ZDB429 10000 UC
## 5 ZDB446 15000 UC
## 6 ZDB458 20000 (C1,C2)
## 7 ZDB464* 20000 (C1,C2)
## 8 ZDB467 20000 (C1,C2)
## 9 ZDB477 25000 C1
## 10 ZDB483 25000 C3
## # ℹ 20 more rows
You can also refer to columns by name with quotation marks:
## # A tibble: 30 × 1
## sample
## <chr>
## 1 REL606
## 2 REL1166A
## 3 ZDB409
## 4 ZDB429
## 5 ZDB446
## 6 ZDB458
## 7 ZDB464*
## 8 ZDB467
## 9 ZDB477
## 10 ZDB483
## # ℹ 20 more rows
Note that subsetting a tibble using the [] method returns another tibble. In contrast, using the $ sign to extract a variable returns a vector:
## [1] "REL606" "REL1166A" "ZDB409" "ZDB429" "ZDB446" "ZDB458"
## [7] "ZDB464*" "ZDB467" "ZDB477" "ZDB483" "ZDB16" "ZDB357"
## [13] "ZDB199*" "ZDB200" "ZDB564" "ZDB30*" "ZDB172" "ZDB158"
## [19] "ZDB143" "CZB199" "CZB152" "CZB154" "ZDB83" "ZDB87"
## [25] "ZDB96" "ZDB99" "ZDB107" "ZDB111" "REL10979" "REL10988"
For more information on tibbles see the tibbles vignette.
7.4 Tidy data
Tidy data is data that’s easy to work with: it’s easy to manage (with dplyr), visualize (with ggplot2) and model (with R’s hundreds of modeling packages). Most importantly, tidy data is data where each column is a variable and each row is an observation.
Arranging your data in this way makes it easier to work with because you have a consistent way of referring to variables (as column names) and observations (as row indices). When use tidy data and tidy tools, you spend less time worrying about how to feed the output from one function into the input of another, and more time answering your questions about the data.
To tidy messy data, you first identify the variables in your dataset, then use the tools provided by tidyr
to move them into columns. tidyr
provides three main functions for tidying your messy data:
gather()
separate()
spread()
gather()
The gather()
function takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer. Here’s an example. In this experiment we’ve given three people two different drugs and recorded their heart rate:
messy <- data.frame(
name = c("Wilbur", "Petunia", "Gregory"),
drugA = c(67, 80, 64),
drugB = c(56, 90, 50)
)
messy
## name drugA drugB
## 1 Wilbur 67 56
## 2 Petunia 80 90
## 3 Gregory 64 50
We have three variables (name, drug and heartrate), but only name is currently in a column. We use gather()
to gather the a and b columns into key-value pairs of drug and heartrate:
## name drug heartrate
## 1 Wilbur drugA 67
## 2 Petunia drugA 80
## 3 Gregory drugA 64
## 4 Wilbur drugB 56
## 5 Petunia drugB 90
## 6 Gregory drugB 50
Now each column is a variable and each row is an observation - tidy! Note: more about the “pipe” operator later.
separate()
Sometimes two variables are clumped together in one column. separate() allows you to tease them apart. We have some measurements of how much time people spend on their phones, measured at two locations (work and home), at two times. Each person has been randomly assigned to either treatment or control.
set.seed(10)
messy <- data.frame(
id = 1:4,
trt = sample(rep(c('control', 'treatment'), each = 2)),
work.T1 = runif(4),
home.T1 = runif(4),
work.T2 = runif(4),
home.T2 = runif(4)
)
## id trt work.T1 home.T1 work.T2 home.T2
## 1 1 treatment 0.08513597 0.6158293 0.1135090 0.05190332
## 2 2 control 0.22543662 0.4296715 0.5959253 0.26417767
## 3 3 control 0.27453052 0.6516557 0.3580500 0.39879073
## 4 4 treatment 0.27230507 0.5677378 0.4288094 0.83613414
To tidy this data, we first use gather() to turn columns work.T1, home.T1, work.T2 and home.T2 into a key-value pair of key and time:
## id trt key time
## 1 1 treatment work.T1 0.08513597
## 2 2 control work.T1 0.22543662
## 3 3 control work.T1 0.27453052
## 4 4 treatment work.T1 0.27230507
## 5 1 treatment home.T1 0.61582931
## 6 2 control home.T1 0.42967153
## 7 3 control home.T1 0.65165567
## 8 4 treatment home.T1 0.56773775
## 9 1 treatment work.T2 0.11350898
## 10 2 control work.T2 0.59592531
## 11 3 control work.T2 0.35804998
## 12 4 treatment work.T2 0.42880942
## 13 1 treatment home.T2 0.05190332
## 14 2 control home.T2 0.26417767
## 15 3 control home.T2 0.39879073
## 16 4 treatment home.T2 0.83613414
Next we use separate() to split the key into location and time, using a regular expression to describe the character that separates them:
## id trt location timepoint time
## 1 1 treatment work T1 0.08513597
## 2 2 control work T1 0.22543662
## 3 3 control work T1 0.27453052
## 4 4 treatment work T1 0.27230507
## 5 1 treatment home T1 0.61582931
## 6 2 control home T1 0.42967153
## 7 3 control home T1 0.65165567
## 8 4 treatment home T1 0.56773775
## 9 1 treatment work T2 0.11350898
## 10 2 control work T2 0.59592531
## 11 3 control work T2 0.35804998
## 12 4 treatment work T2 0.42880942
## 13 1 treatment home T2 0.05190332
## 14 2 control home T2 0.26417767
## 15 3 control home T2 0.39879073
## 16 4 treatment home T2 0.83613414
You can also use integer position to split at.
7.5 What is dplyr?
The package dplyr
is a package that tries to provide easy tools for the most common data manipulation tasks. It is built to work directly with tibbles. The thinking behind it was largely inspired by the package plyr
which has been in use for some time but suffered from being slow in some cases. dplyr
addresses this by porting much of the computation to C++. An additional feature is the ability to work with data stored directly in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query returned.
This addresses a common problem with R in that all operations are conducted in memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can have a database of many 100s GB, conduct queries on it directly and pull back just what you need for analysis in R.
7.6 Selecting columns and filtering rows
We’re going to learn some of the most common dplyr
functions: select()
, filter()
, mutate()
, group_by()
, and summarize()
. To select columns of a data frame, use select()
. The first argument to this function is the data frame (tibble), and the subsequent arguments are the columns to keep.
## sample clade cit genome_size
## 1 REL606 N/A unknown 4.62
## 2 REL1166A unknown unknown 4.63
## 3 ZDB409 unknown unknown 4.60
## 4 ZDB429 UC unknown 4.59
## 5 ZDB446 UC unknown 4.66
## 6 ZDB458 (C1,C2) unknown 4.63
## 7 ZDB464* (C1,C2) unknown 4.62
## 8 ZDB467 (C1,C2) unknown 4.61
## 9 ZDB477 C1 unknown 4.65
## 10 ZDB483 C3 unknown 4.59
## 11 ZDB16 C1 unknown 4.61
## 12 ZDB357 C2 unknown 4.62
## 13 ZDB199* C1 minus 4.62
## 14 ZDB200 C2 minus 4.63
## 15 ZDB564 Cit+ plus 4.74
## 16 ZDB30* C3 minus 4.61
## 17 ZDB172 Cit+ plus 4.77
## 18 ZDB158 C2 minus 4.63
## 19 ZDB143 Cit+ plus 4.79
## 20 CZB199 C1 minus 4.59
## 21 CZB152 Cit+ plus 4.80
## 22 CZB154 Cit+ plus 4.76
## 23 ZDB83 Cit+ minus 4.60
## 24 ZDB87 C2 plus 4.75
## 25 ZDB96 Cit+ plus 4.74
## 26 ZDB99 C1 minus 4.61
## 27 ZDB107 Cit+ plus 4.79
## 28 ZDB111 C2 minus 4.62
## 29 REL10979 Cit+ plus 4.78
## 30 REL10988 C2 minus 4.62
EXERCISE 7.6.1
Would the two calls to the select function below return the same tibble or not?
select(metadata, sample, clade, cit, genome_size)
select(metadata, -generation,-strain,-run)
To choose rows, use filter():
## sample generation clade strain cit run genome_size
## 1 ZDB564 31500 Cit+ REL606 plus SRR098289 4.74
## 2 ZDB172 32000 Cit+ REL606 plus SRR098042 4.77
## 3 ZDB143 32500 Cit+ REL606 plus SRR098040 4.79
## 4 CZB152 33000 Cit+ REL606 plus SRR097977 4.80
## 5 CZB154 33000 Cit+ REL606 plus SRR098026 4.76
## 6 ZDB87 34000 C2 REL606 plus SRR098035 4.75
## 7 ZDB96 36000 Cit+ REL606 plus SRR098036 4.74
## 8 ZDB107 38000 Cit+ REL606 plus SRR098038 4.79
## 9 REL10979 40000 Cit+ REL606 plus SRR098029 4.78
7.7 Pipes
But what if you wanted to select
and filter
? There are three ways to do this: use intermediate steps, nested functions, or pipes. With the intermediate steps, you essentially create a temporary data frame and use that as input to the next function. This can clutter up your workspace with lots of objects. You can also nest functions (i.e. one function inside of another). This is handy, but can be difficult to read if too many functions are nested as the process from inside out. The last option, pipes, are a fairly recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to many things to the same data set. Pipes in R look like %>%
and are made available via the magrittr
package installed as part of tidyverse. If you’re familiar with the Unix shell, you may already have used pipes to pass the output from one command to the next. The concept is the same, except the shell uses the |
character rather than R’s pipe operator %>%
.
The pipe operator can be tedious to type. In Rstudio pressing Ctrl + Shift + M
under Windows / Linux will insert the pipe operator. On the mac, use ⌘ + Shift + M
.
## sample generation clade cit
## 1 ZDB564 31500 Cit+ plus
## 2 ZDB172 32000 Cit+ plus
## 3 ZDB143 32500 Cit+ plus
## 4 CZB152 33000 Cit+ plus
## 5 CZB154 33000 Cit+ plus
## 6 ZDB87 34000 C2 plus
## 7 ZDB96 36000 Cit+ plus
## 8 ZDB107 38000 Cit+ plus
## 9 REL10979 40000 Cit+ plus
In the above we use the pipe to send the data set first through filter
, to keep rows where cit was equal to ‘plus’, and then through select
to keep the sample and generation and clade columns. When the data frame is being passed to the filter()
and select()
functions through a pipe, we don’t need to include it as an argument to these functions anymore.
EXERCISE 7.7.1
Does the order of filter()
and select()
above matter? Why or why not?
Answer to yourself or to a neighbour and then confirm your answer by trying it both ways.
Note that the above is the same as the nested version:
## sample generation clade
## 1 ZDB564 31500 Cit+
## 2 ZDB172 32000 Cit+
## 3 ZDB143 32500 Cit+
## 4 CZB152 33000 Cit+
## 5 CZB154 33000 Cit+
## 6 ZDB87 34000 C2
## 7 ZDB96 36000 Cit+
## 8 ZDB107 38000 Cit+
## 9 REL10979 40000 Cit+
If we wanted to create a new object with this smaller version of the data we could do so by assigning it a new name:
Ecoli_citplus <- metadata %>%
filter(cit == "plus") %>%
select(sample, generation, clade)
Ecoli_citplus
## sample generation clade
## 1 ZDB564 31500 Cit+
## 2 ZDB172 32000 Cit+
## 3 ZDB143 32500 Cit+
## 4 CZB152 33000 Cit+
## 5 CZB154 33000 Cit+
## 6 ZDB87 34000 C2
## 7 ZDB96 36000 Cit+
## 8 ZDB107 38000 Cit+
## 9 REL10979 40000 Cit+
We can think of the filter()
and select()
functions as verbs in the sentence; they do things to the data flowing through the pipeline.
7.8 Mutate
Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions or find the ratio of values in two columns. For this we’ll use mutate()
.
To create a new column of genome size in bp:
## sample generation clade strain cit run genome_size genome_bp
## 1 REL606 0 N/A REL606 unknown 4.62 4620000
## 2 REL1166A 2000 unknown REL606 unknown SRR098028 4.63 4630000
## 3 ZDB409 5000 unknown REL606 unknown SRR098281 4.60 4600000
## 4 ZDB429 10000 UC REL606 unknown SRR098282 4.59 4590000
## 5 ZDB446 15000 UC REL606 unknown SRR098283 4.66 4660000
## 6 ZDB458 20000 (C1,C2) REL606 unknown SRR098284 4.63 4630000
## 7 ZDB464* 20000 (C1,C2) REL606 unknown SRR098285 4.62 4620000
## 8 ZDB467 20000 (C1,C2) REL606 unknown SRR098286 4.61 4610000
## 9 ZDB477 25000 C1 REL606 unknown SRR098287 4.65 4650000
## 10 ZDB483 25000 C3 REL606 unknown SRR098288 4.59 4590000
## 11 ZDB16 30000 C1 REL606 unknown SRR098031 4.61 4610000
## 12 ZDB357 30000 C2 REL606 unknown SRR098280 4.62 4620000
## 13 ZDB199* 31500 C1 REL606 minus SRR098044 4.62 4620000
## 14 ZDB200 31500 C2 REL606 minus SRR098279 4.63 4630000
## 15 ZDB564 31500 Cit+ REL606 plus SRR098289 4.74 4740000
## 16 ZDB30* 32000 C3 REL606 minus SRR098032 4.61 4610000
## 17 ZDB172 32000 Cit+ REL606 plus SRR098042 4.77 4770000
## 18 ZDB158 32500 C2 REL606 minus SRR098041 4.63 4630000
## 19 ZDB143 32500 Cit+ REL606 plus SRR098040 4.79 4790000
## 20 CZB199 33000 C1 REL606 minus SRR098027 4.59 4590000
## 21 CZB152 33000 Cit+ REL606 plus SRR097977 4.80 4800000
## 22 CZB154 33000 Cit+ REL606 plus SRR098026 4.76 4760000
## 23 ZDB83 34000 Cit+ REL606 minus SRR098034 4.60 4600000
## 24 ZDB87 34000 C2 REL606 plus SRR098035 4.75 4750000
## 25 ZDB96 36000 Cit+ REL606 plus SRR098036 4.74 4740000
## 26 ZDB99 36000 C1 REL606 minus SRR098037 4.61 4610000
## 27 ZDB107 38000 Cit+ REL606 plus SRR098038 4.79 4790000
## 28 ZDB111 38000 C2 REL606 minus SRR098039 4.62 4620000
## 29 REL10979 40000 Cit+ REL606 plus SRR098029 4.78 4780000
## 30 REL10988 40000 C2 REL606 minus SRR098030 4.62 4620000
EXERCISE 7.8.1
Use the
mutate()
function to mutate generation into a fraction from 0 to 1.What if you want to change a column instead of create a new column. Can you do that with
mutate
?
7.9 Split-apply-combine data analysis and the summarize() function
Many data analysis tasks can be approached using the “split-apply-combine” paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr
makes this very easy through the use of the group_by()
function, which splits the data into groups. When the data is grouped in this way summarize()
can be used to collapse each group into a single-row summary. summarize()
does this by applying an aggregation or summary function to each group. For example, if we wanted to group by citrate-using mutant status and find the number of rows of data for each status, we would do:
## # A tibble: 3 × 2
## cit `n()`
## <chr> <int>
## 1 minus 9
## 2 plus 9
## 3 unknown 12
Here the summary function used was n()
to find the count for each group. We can also apply many other functions to individual columns to get other summary statistics. For example, in the R base package we can use built-in functions like mean
, median
, min
, and max
. By default, all R functions operating on vectors that contains missing data will return NA
. It’s a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it. When dealing with simple statistics like the mean
, the easiest way to ignore NA
(the missing data) is to use na.rm=TRUE
(rm stands for remove).
EXERCISE 7.9.1
Use the group_by()
and summarize()
to view mean genome_size
by mutant status (ie. cit).
You can group by multiple columns too:
## `summarise()` has grouped output by 'cit'. You can override using the `.groups`
## argument.
## # A tibble: 13 × 3
## # Groups: cit [3]
## cit clade mean_size
## <chr> <chr> <dbl>
## 1 minus C1 4.61
## 2 minus C2 4.62
## 3 minus C3 4.61
## 4 minus Cit+ 4.6
## 5 plus C2 4.75
## 6 plus Cit+ 4.77
## 7 unknown (C1,C2) 4.62
## 8 unknown C1 4.63
## 9 unknown C2 4.62
## 10 unknown C3 4.59
## 11 unknown N/A 4.62
## 12 unknown UC 4.62
## 13 unknown unknown 4.62
Looks like for one of these clones, the clade is missing. We could then discard those rows using filter():
metadata %>%
group_by(cit, clade) %>%
summarize(mean_size = mean(genome_size, na.rm = TRUE)) %>%
filter(!is.na(clade))
## `summarise()` has grouped output by 'cit'. You can override using the `.groups`
## argument.
## # A tibble: 13 × 3
## # Groups: cit [3]
## cit clade mean_size
## <chr> <chr> <dbl>
## 1 minus C1 4.61
## 2 minus C2 4.62
## 3 minus C3 4.61
## 4 minus Cit+ 4.6
## 5 plus C2 4.75
## 6 plus Cit+ 4.77
## 7 unknown (C1,C2) 4.62
## 8 unknown C1 4.63
## 9 unknown C2 4.62
## 10 unknown C3 4.59
## 11 unknown N/A 4.62
## 12 unknown UC 4.62
## 13 unknown unknown 4.62
You can also summarize multiple variables at the same time:
metadata %>%
group_by(cit, clade) %>%
summarize(mean_size = mean(genome_size, na.rm = TRUE),
min_generation = min(generation))
## `summarise()` has grouped output by 'cit'. You can override using the `.groups`
## argument.
## # A tibble: 13 × 4
## # Groups: cit [3]
## cit clade mean_size min_generation
## <chr> <chr> <dbl> <int>
## 1 minus C1 4.61 31500
## 2 minus C2 4.62 31500
## 3 minus C3 4.61 32000
## 4 minus Cit+ 4.6 34000
## 5 plus C2 4.75 34000
## 6 plus Cit+ 4.77 31500
## 7 unknown (C1,C2) 4.62 20000
## 8 unknown C1 4.63 25000
## 9 unknown C2 4.62 30000
## 10 unknown C3 4.59 25000
## 11 unknown N/A 4.62 0
## 12 unknown UC 4.62 10000
## 13 unknown unknown 4.62 2000
For a summary of the tidyr and dplyr functions, see this Handy dplyr cheatsheet.
EXERCISE 7.9.2
Create a tibble containing each unique clade (removing the samples with unknown clades) and the rank of it’s mean genome size. (note that ranking genome size will not sort the table; the row order will be unchanged. You can use the arrange()
function to sort the table).
There are several functions for ranking observations, which handle tied values differently. For this exercise it doesn’t matter which function you choose. Use the help options to find a ranking function.
7.10 Other great resources
- Data Wrangling tutorial - an excellent four part tutorial covering selecting data, filtering data, summarising and transforming your data.
- R for Data Science
- Data wrangling with R and RStudio - 55 minute webinar from RStudio