R, like computer architecture as a whole (including programming languages), is based on binary information: the current either flows (TRUE) or does not (FALSE), depending on the conditions (capital letters on purpose). These logical values form a distinct type of data and can, like other types, be combined into vectors using the same functions as for numbers and characters.
Create a logical vector with 3 TRUE and 2 FALSE values and call it my_logical. Note that these are internal R symbols, so you should not use quotation marks ("").
Expected result:
## [1] TRUE TRUE TRUE FALSE FALSE
Advice: To make your code shorter, you can use T instead of TRUE and F instead of FALSE.
Formally, logical values correspond to 0 (for FALSE) and 1 (for TRUE) and behave like them in every mathematical operation.
Calculate the sum of the my_logical vector.
Expected result:
## [1] 3
Curiosity
The sum of a logical vector can be used to check a data frame for the presence of NA cells. To do this, combine the is.na() function used on the data frame with the sum() function.
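For instance, a minimal sketch with a small, hypothetical data frame (not the course data):
df <- data.frame(x = c(1, NA, 3), y = c(NA, NA, 6))  # toy data frame with three NA cells
is.na(df)        # logical matrix: TRUE marks every NA cell
sum(is.na(df))   # TRUE counts as 1, so the result is 3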
However, logical vectors have one important distinctive characteristic: they can be used for subsetting. To achieve this, you need to provide a logical vector with TRUE for the elements you want to keep and FALSE for the elements you want to discard. The length of the logical vector used for subsetting has to be equal to the number of elements (e.g., columns) in the object you want to subset.
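As an illustration (a generic sketch, not the exercise below):
v <- c(10, 20, 30)
v[c(TRUE, FALSE, TRUE)]                           # keeps the 1st and 3rd elements: 10 30
# the same idea works for the columns of a data frame:
head(iris[, c(TRUE, FALSE, FALSE, FALSE, TRUE)])  # keeps only Sepal.Length and Species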
Save the first 6 rows of the built-in CO2 dataset to a chosen variable. Then, using a logical vector, return the 1st and 5th columns.
Expected result:
## Plant uptake
## 1 Qn1 16.0
## 2 Qn1 30.4
## 3 Qn1 34.8
## 4 Qn1 37.2
## 5 Qn1 35.3
## 6 Qn1 39.2
Normally, no one creates logical vectors by hand. They are created automatically as the result of different comparisons. The most common is testing for equality, which is done with a double equal sign (==).
Check whether, in R, 5 equals 5.00 and π equals 3.14.
Expected result:
## [1] TRUE
## [1] FALSE
The same applies to all other comparisons, but the symbols differ:
• == - equal
• != - not equal
• > - greater than
• >= - greater than or equal
• < - less than
• <= - less than or equal
Curiosity
A double equal sign == is used for equality testing because the single sign = is already in use. It serves as an alternative to the assignment arrow; however, for code clarity, the arrow is the recommended form.
Note that when comparing two vectors with the symbols shown above, R does not treat the operation as a single comparison. It compares them element by element, recycling the shorter vector and returning a logical vector as long as the longer one. This is the same rule as for mathematical operations on vectors.
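A short sketch of this element-wise behaviour (generic vectors, not the exercise answer):
c(1, 2, 3, 4) == c(1, 2)   # the shorter vector is recycled: TRUE TRUE FALSE FALSE
c(1, 2, 3) > 2             # a single number is recycled too: FALSE FALSE TRUE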
Manually create two vectors: one of prime numbers and one of even numbers, both within the range <0, 11>. Check whether they are equal.
Expected result:
## [1] TRUE FALSE FALSE FALSE FALSE
To check whether vectors (as a whole) are identical, use the identical() function.
Create two integer vectors from 1 to 10. Give them different names and compare them with the identical() function. Then, change one value within the first vector and repeat the comparison.
Expected result:
## [1] TRUE
## [1] FALSE
Another useful tool is the %in% operator. It tells you whether the elements of the first vector are present in the second one. Note that it is focused on the first vector only, so there is no recycling. The result of an operation with %in% is a logical vector.
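A quick sketch (hypothetical letter vectors):
c("a", "b", "c") %in% c("b", "c", "d")   # FALSE TRUE TRUE
c("b", "c", "d") %in% c("a", "b", "c")   # TRUE TRUE FALSE -- the order of the vectors matters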
Create two character vectors, each consisting of a set of individual characters. The first should contain your name and the second one the name of another person from the group. Check how many letters your names have in common. Then, change the order of names and repeat the comparison.
The exclamation mark ! works in R as the negation symbol (it reverses a statement). Any logical vector preceded by ! is reversed: FALSE is changed into TRUE and vice versa.
Having a vector of 3 TRUE and 3 FALSE values, return its negation.
Expected result:
## [1] FALSE FALSE FALSE TRUE TRUE TRUE
More typically, an exclamation mark ! is used to negate comparisons. Note that the idea is the same: you negate a logical vector by negating the action that produces it. Remember that a negated comparison should be enclosed in parentheses, e.g., !(2 == 2).
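For example (a minimal sketch):
!(2 == 2)          # FALSE
!(c(1, 2, 3) > 2)  # TRUE TRUE FALSE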
Create a sequence of integers from 1 to 100 in which each subsequent element is larger by 3 than the previous one. Then, create a logical vector indicating which elements are larger than 50. Do not use the > sign.
Expected result:
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
The real power of logic in programming comes from combining comparisons (use parentheses for clarity). There are two basic operators (a short sketch follows the list):
• & - and - the condition is TRUE if both comparisons are TRUE
• | - or - the condition is TRUE if at least one comparison is TRUE
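A minimal sketch of both operators:
x <- c(1, 5, 10)
(x > 2) & (x < 8)   # FALSE TRUE FALSE -- both conditions must hold
(x < 2) | (x > 8)   # TRUE FALSE TRUE  -- at least one condition holds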
For an integer vector from 1 to 10, create a logical vector indicating which elements are smaller or larger than 5. Use a logical operator.
Expected result:
## [1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
For an integer vector from 1 to 10, create a logical vector indicating which elements are divisible by both 2 and 3. Use a logical operator.
Expected result:
## [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
Curiosity
R also has double versions of the & and | operators: && and ||. They work on single logical values, and for those their outcome is the same; their role is related to optimization. With a double operator, R evaluates the first condition and checks the second one only if necessary (e.g., in an "AND" statement, if the first condition is FALSE there is no need to check the second one, as the result will always be FALSE). They are often used when the conditions are severely time-consuming.
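A sketch of this short-circuit behaviour (slow_check() is a made-up stand-in for an expensive condition):
slow_check <- function() { Sys.sleep(2); TRUE }  # hypothetical, deliberately slow check
FALSE && slow_check()   # returns FALSE immediately; slow_check() is never evaluated
TRUE  || slow_check()   # returns TRUE immediately for the same reason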
Frequently, the question is not about logical vectors themselves but rather about which elements of a vector fulfill a given condition. The answer is provided by the which() function. It takes a comparison as an argument and returns a vector of indexes that can be used for subsetting.
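A minimal sketch:
v <- c(5, 12, 8, 20)
which(v > 10)     # [1] 2 4 -- the positions, not the values
v[which(v > 10)]  # [1] 12 20 -- the indexes can be reused for subsetting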
Construct a vector with the first 20 integers divisible by 3. Which of its elements are larger than or equal to 21?
Expected result:
## [1] 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Using the indexes of values larger than or equal to 21 from the previous exercise, return the corresponding values from the previously constructed vector.
Expected result:
## [1] 21 24 27 30 33 36 39 42 45 48 51 54 57 60
As stated before, subsetting can be done directly with logical vectors (TRUE for each kept element). In practice, it is even simpler. All you need to provide is a condition instead of a coordinate, e.g., vector[condition] will return only the elements fulfilling the given condition.
For example: there is a vector a <- c(1, 2, 3, 4). We want to subset the elements of this vector that are greater than 2. The formula will look like this: a[a > 2]. The result will be: [1] 3 4.
The logic behind this is as follows:
1. The condition inside the brackets (a > 2) generates a logical vector - positions are denoted as a series of TRUE and FALSE. In this example, the logical vector is: [1] FALSE FALSE TRUE TRUE.
2. Only the elements corresponding to TRUE are kept, so here the last two elements of the vector are returned: [1] 3 4.
Note that you cannot see the TRUE/FALSE vector itself, but it is in fact generated and used during the subsetting operation.
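The same example, written out as code:
a <- c(1, 2, 3, 4)
a > 2     # FALSE FALSE TRUE TRUE -- the hidden logical vector
a[a > 2]  # [1] 3 4 -- only the TRUE positions are returned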
For an integer vector from 1 to 100, return elements higher than the vector’s mean.
Expected result:
## [1] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
## [20] 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
## [39] 89 90 91 92 93 94 95 96 97 98 99 100
The same pattern applies to subsetting data frames.
Using the built-in CO2 dataset, return the observations for the Qn2 plant.
Expected result:
## Plant Type Treatment conc uptake
## 8 Qn2 Quebec nonchilled 95 13.6
## 9 Qn2 Quebec nonchilled 175 27.3
## 10 Qn2 Quebec nonchilled 250 37.1
## 11 Qn2 Quebec nonchilled 350 41.8
## 12 Qn2 Quebec nonchilled 500 40.6
## 13 Qn2 Quebec nonchilled 675 41.4
## 14 Qn2 Quebec nonchilled 1000 44.3
Advice: To obtain a logical vector with one element for each row, you need to make a comparison based on a column, e.g., my_data[my_data$my_column != 5, ] would return the observations (including all columns) from my_data in which the value of my_column does not equal 5.
Using the built-in CO2 dataset, return the observations for the Mississippi chilled plants with an uptake higher than 20 µmol/(m² × s).
Expected result:
## Plant Type Treatment conc uptake
## 69 Mc1 Mississippi chilled 675 22.2
## 70 Mc1 Mississippi chilled 1000 21.9
dplyr is a widely used R package that simplifies the manipulation and management of data frames.
Install and load the dplyr package.
Load the rats.csv file into an object called my_data. View the first 10 lines of the file.
Expected result (first 10 rows):
## Glycogen Treatment Rat Liver
## 1 131 1 1 1
## 2 130 1 1 1
## 3 131 1 1 2
## 4 125 1 1 2
## 5 136 1 1 3
## 6 142 1 1 3
## 7 150 1 2 1
## 8 148 1 2 1
## 9 140 1 2 2
## 10 143 1 2 2
Modify my_data by adding an ID column at the beginning.
Expected result (first 10 rows):
## ID Glycogen Treatment Rat Liver
## 1 1 131 1 1 1
## 2 2 130 1 1 1
## 3 3 131 1 1 2
## 4 4 125 1 1 2
## 5 5 136 1 1 3
## 6 6 142 1 1 3
## 7 7 150 1 2 1
## 8 8 148 1 2 1
## 9 9 140 1 2 2
## 10 10 143 1 2 2
All subsequent functions come from the loaded dplyr package. Importantly, names of columns provided to dplyr functions do not need quotation marks.
To sort the data, use the arrange() function in the following manner: arrange(dataset, ordering_column).
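A generic illustration with the built-in CO2 dataset (not the exercise answer):
library(dplyr)
head(arrange(CO2, uptake))   # rows ordered by increasing uptake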
Obtain the observations from my_data sorted by increasing levels of glycogen. Use the arrange() function.
Expected result (first 10 rows):
## ID Glycogen Treatment Rat Liver
## 1 4 125 1 1 2
## 2 26 125 3 1 1
## 3 36 127 3 2 3
## 4 2 130 1 1 1
## 5 1 131 1 1 1
## 6 3 131 1 1 2
## 7 25 134 3 1 1
## 8 35 134 3 2 3
## 9 29 135 3 1 3
## 10 5 136 1 1 3
To obtain the descending order, put the name of the column inside the desc() function.
Using the arrange() function, sort the observations from my_data by decreasing levels of glycogen.
Expected result (first 10 rows):
## ID Glycogen Treatment Rat Liver
## 1 23 162 2 2 3
## 2 11 160 1 2 3
## 3 13 157 2 1 1
## 4 20 155 2 2 1
## 5 15 154 2 1 2
## 6 18 153 2 1 3
## 7 24 152 2 2 3
## 8 19 151 2 2 1
## 9 7 150 1 2 1
## 10 12 150 1 2 3
You can also sort the data by multiple columns. To do this, add columns in the following manner: arrange(dataset, ordering_column1, ordering_column2). Note that the priority of sorting is denoted by the order of the function arguments.
Using the arrange() function, obtain the observations from my_data sorted first by the Treatment column and then by the Rat column (both in ascending order).
Expected result (first 10 rows):
## ID Glycogen Treatment Rat Liver
## 1 1 131 1 1 1
## 2 2 130 1 1 1
## 3 3 131 1 1 2
## 4 4 125 1 1 2
## 5 5 136 1 1 3
## 6 6 142 1 1 3
## 7 7 150 1 2 1
## 8 8 148 1 2 1
## 9 9 140 1 2 2
## 10 10 143 1 2 2
To select particular columns, use the select() function in the following way: select(dataset, column_name1, column_name2). Note that all mentioned columns will be preserved, and the rest will be discarded.
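A minimal sketch with the built-in CO2 dataset:
library(dplyr)
head(select(CO2, Plant, uptake))   # keeps only the Plant and uptake columns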
Obtain the Glycogen and Liver columns from my_data. Use the select() function.
Expected result (first 10 rows):
## Glycogen Liver
## 1 131 1
## 2 130 1
## 3 131 2
## 4 125 2
## 5 136 3
## 6 142 3
## 7 150 1
## 8 148 1
## 9 140 2
## 10 143 2
You can also use a minus sign - preceding the column name, which means "all except this column".
Obtain the ID, Glycogen, Treatment, and Liver columns from my_data. Use the select() function and the minus (-) sign.
Expected result (first 10 rows):
## ID Glycogen Treatment Liver
## 1 1 131 1 1
## 2 2 130 1 1
## 3 3 131 1 2
## 4 4 125 1 2
## 5 5 136 1 3
## 6 6 142 1 3
## 7 7 150 1 1
## 8 8 148 1 1
## 9 9 140 1 2
## 10 10 143 1 2
To subset the observations, use the filter() function in the following manner: filter(dataset, your_logical_condition). Note that the logical conditions are always related to the values inside a given column.
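A minimal sketch with the built-in CO2 dataset:
library(dplyr)
head(filter(CO2, conc == 95))   # only the rows in which conc equals 95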
Using the filter() function, obtain the observations for which Treatment equals 1.
Expected result (first 10 rows):
## ID Glycogen Treatment Rat Liver
## 1 1 131 1 1 1
## 2 2 130 1 1 1
## 3 3 131 1 1 2
## 4 4 125 1 1 2
## 5 5 136 1 1 3
## 6 6 142 1 1 3
## 7 7 150 1 2 1
## 8 8 148 1 2 1
## 9 9 140 1 2 2
## 10 10 143 1 2 2
You can also combine several logical conditions by using logical operators (see above). Note, however, that for each observation under consideration, the combination needs to result in a single TRUE or FALSE.
Using the filter() function, obtain the observations for which Treatment equals 3 and the glycogen level is higher than 135.
Expected result:
## ID Glycogen Treatment Rat Liver
## 1 27 138 3 1 2
## 2 28 138 3 1 2
## 3 30 136 3 1 3
## 4 31 138 3 2 1
## 5 32 140 3 2 1
## 6 33 139 3 2 2
## 7 34 138 3 2 2
To create a new column based on the others, use the mutate() function in the following manner: mutate(dataset, new_column_name = recipe_for_values). The "recipe for values" is often a mathematical formula or a simple mathematical function based on the values of other columns. Note that it is just a modification of the given value for each observation separately.
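A minimal sketch with the built-in CO2 dataset (double_uptake is just an illustrative column name):
library(dplyr)
head(mutate(CO2, double_uptake = uptake * 2))   # new column computed row by row from uptake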
Using the mutate() function, create a new column called log_Gly that will be a natural logarithm transformation of the Glycogen column. Overwrite my_data.
Expected result (first 10 rows):
## ID Glycogen Treatment Rat Liver log_Gly
## 1 1 131 1 1 1 4.875197
## 2 2 130 1 1 1 4.867534
## 3 3 131 1 1 2 4.875197
## 4 4 125 1 1 2 4.828314
## 5 5 136 1 1 3 4.912655
## 6 6 142 1 1 3 4.955827
## 7 7 150 1 2 1 5.010635
## 8 8 148 1 2 1 4.997212
## 9 9 140 1 2 2 4.941642
## 10 10 143 1 2 2 4.962845
Make a summary of your dataset with the summarise() function. The syntax is as follows: summarise(dataset, name_of_summary1 = recipe_for_value1, name_of_summary2 = recipe_for_value2, ...). The recipe in this case is any aggregating function (e.g., mean()) that accepts a whole column (vector) and returns a single statistic. Note that this function results in a new table with 1 row of summary statistics and as many columns as the number of statistics mentioned.
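A minimal sketch with the built-in CO2 dataset:
library(dplyr)
summarise(CO2, mean_uptake = mean(uptake), max_conc = max(conc))   # one-row summary table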
Use the summarise() function to create a summary of my_data containing the mean, median, maximum value, minimum value, and standard deviation of the Glycogen column.
Expected result:
## mean median max min st_dev
## 1 142.2222 141 162 125 9.754445
You can also count the number of observations corresponding to the groups within a given column, e.g., check how many observations were collected for each treatment. To obtain this, use the count() function by typing count(dataset, given_column).
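A minimal sketch with the built-in CO2 dataset:
library(dplyr)
count(CO2, Type)   # number of observations for each Type (Quebec / Mississippi)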
Use the count() function to check how many observations were collected for each treatment used in the study.
Expected result:
## Treatment n
## 1 1 12
## 2 2 12
## 3 3 12
As you probably noticed, all dplyr functions take the data as the first argument. Thanks to this characteristic, you can create a pipeline in which the next function uses the output generated by the previous one. In that case, you should provide the dataset argument to the first function only.
A pipeline is created by connecting subsequent functions with the pipe operator %>%. Remember to skip the data argument in all functions except the first one, e.g., select(dataset, column1, column2) %>% filter(column1 > 50).
The keyboard shortcut for the %>% operator is Ctrl + Shift + M (Windows) or Cmd + Shift + M (Mac).
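A minimal pipeline sketch with the built-in CO2 dataset:
library(dplyr)
CO2 %>%
  filter(conc > 500) %>%      # keep only high-concentration observations
  select(Plant, uptake) %>%   # keep two columns
  arrange(desc(uptake)) %>%   # sort by decreasing uptake
  head()                      # show only the first rows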
Curiosity
In R version 4.1.0, a new pipe operator was introduced: |>. This operator is part of base R, so you do not have to load additional packages to use it (unlike %>%, which requires loading dplyr).
Using a pipeline and my_data:
1. Select the ID, Glycogen, and Liver columns.
2. Obtain the observations with glycogen levels lower than 140.
3. Sort the result by the Glycogen column in descending order.
Expected result (first 10 rows):
## ID Glycogen Liver
## 1 33 139 2
## 2 27 138 2
## 3 28 138 2
## 4 31 138 1
## 5 34 138 2
## 6 5 136 3
## 7 30 136 3
## 8 29 135 3
## 9 25 134 1
## 10 35 134 3
Performing any function over a complete dataset is often not what you really want. Imagine 3 species with a trait of interest. It can be the case that the overall mean does not reflect the variability among species. To check the value for each species separately, you need to group your dataset. Do this by using the group_by() function in the following manner: group_by(dataset, column_with_groups). This produces a grouped data frame and causes all subsequent functions to operate on each group separately.
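A minimal sketch with the built-in CO2 dataset:
library(dplyr)
CO2 %>%
  group_by(Type) %>%                        # one group per Type
  summarise(mean_uptake = mean(uptake))     # the mean is now computed per group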
Using a pipe, the group_by() function, and the my_data data frame, create a summary table with the mean and standard deviation of the Glycogen column for each treatment separately.
Expected result:
## # A tibble: 3 × 3
## Treatment mean st_dev
## <int> <dbl> <dbl>
## 1 1 140. 10.3
## 2 2 151 5.66
## 3 3 135. 4.71
Advice: To perform an action on the whole dataset again, use the ungroup() function.
Using a pipe, the group_by() function, and the my_data data frame, create a new column with the deviations of Glycogen values from the arithmetic mean in a given Treatment.
Expected result (first 10 rows):
## # A tibble: 10 × 7
## # Groups: Treatment [1]
## ID Glycogen Treatment Rat Liver log_Gly std_dev
## <int> <int> <int> <int> <int> <dbl> <dbl>
## 1 1 131 1 1 1 4.88 10.3
## 2 2 130 1 1 1 4.87 10.3
## 3 3 131 1 1 2 4.88 10.3
## 4 4 125 1 1 2 4.83 10.3
## 5 5 136 1 1 3 4.91 10.3
## 6 6 142 1 1 3 4.96 10.3
## 7 7 150 1 2 1 5.01 10.3
## 8 8 148 1 2 1 5.00 10.3
## 9 9 140 1 2 2 4.94 10.3
## 10 10 143 1 2 2 4.96 10.3
**************** ADVANCED ******************
Imagine having two data frames corresponding to the same study system. In both of them, there is a column with individual IDs. How can you bind them together? Using cbind() is rather a bad idea, as the order of observations can differ.
The solution is to use one of the _join() functions.
Data frame 1:
## ID V1
## 1 ind_1 red
## 2 ind_2 blue
## 3 ind_3 green
Data frame 2:
## ID V2
## 1 ind_2 black
## 2 ind_3 blue
## 3 ind_4 blue
left_join() - joins the values from the second table (right) that correspond to observations in the first one (left). If there is no suitable value in the second table, NA is returned.
left_join(df1, df2, by = "ID")
## ID V1 V2
## 1 ind_1 red <NA>
## 2 ind_2 blue black
## 3 ind_3 green blue
right_join() - joins the values from the first table (left) that correspond to observations in the second one (right). If there is no suitable value in the first table, NA is returned.
right_join(df1, df2, by = "ID")
## ID V1 V2
## 1 ind_2 blue black
## 2 ind_3 green blue
## 3 ind_4 <NA> blue
inner_join() - returns only those observations that have corresponding values in both tables.
inner_join(df1, df2, by = "ID")
## ID V1 V2
## 1 ind_2 blue black
## 2 ind_3 green blue
full_join() - joins what can be joined, but keeps all observations. If a suitable value is missing, NA is returned.
full_join(df1, df2, by = "ID")
## ID V1 V2
## 1 ind_1 red <NA>
## 2 ind_2 blue black
## 3 ind_3 green blue
## 4 ind_4 <NA> blue
Each of the abovementioned functions can be used by typing: _join(first_table, second_table, by = "shared_column_name"). Notice that the name of the shared column should be the same in both tables (e.g., ID).
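The two small data frames shown above can be recreated and joined like this (a sketch, assuming dplyr is loaded):
library(dplyr)
df1 <- data.frame(ID = c("ind_1", "ind_2", "ind_3"), V1 = c("red", "blue", "green"))
df2 <- data.frame(ID = c("ind_2", "ind_3", "ind_4"), V2 = c("black", "blue", "blue"))
left_join(df1, df2, by = "ID")   # keeps all rows of df1; ind_1 gets NA in V2
full_join(df1, df2, by = "ID")   # keeps every ID; NA wherever there is no match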
Execute the code below to create a new data frame with an ID and a weight column. Join the observations from the my_data and new_data data frames using the proper _join() function. Keep all observations from my_data, but only those from new_data that have their counterparts in my_data.
new_data <- data.frame("ID" = c(2:100), "weight" = rnorm(99, mean = 150, sd = 20))
Expected result (first 10 rows):
## ID Glycogen Treatment Rat Liver log_Gly weight
## 1 1 131 1 1 1 4.875197 NA
## 2 2 130 1 1 1 4.867534 155.1685
## 3 3 131 1 1 2 4.875197 126.8394
## 4 4 125 1 1 2 4.828314 135.9751
## 5 5 136 1 1 3 4.912655 151.2915
## 6 6 142 1 1 3 4.955827 129.2615
## 7 7 150 1 2 1 5.010635 148.5248
## 8 8 148 1 2 1 4.997212 145.3898
## 9 9 140 1 2 2 4.941642 144.0690
## 10 10 143 1 2 2 4.962845 169.9980
Curiosity
1. Using joins lets us avoid repeating the same information across many data frames.
2. The concept of joins is common in many computer languages, but it is most often used in database management (e.g., SQL).
1. Create a vector from 1 to 50. Then, using R only, calculate how many elements of this vector are higher than or equal to the positive square root of 100. *A positive square root can be calculated with the sqrt() function.
2. Load the built-in CO2 data set as hw_data. Using logical expressions, display the rows of the dataset that have a concentration below 350 AND an uptake of over 30.
3. Load the dplyr package. In the hw_data data frame, sort the observations first by concentration, then by uptake. Overwrite the hw_data variable so that the observations are permanently sorted in this manner.
4. Using one of the dplyr functions, add a new column called uptake_percent to the hw_data data frame that displays the uptake value as a percentage of the maximum value of that column (so that the highest value, 45.5, corresponds to 100%). Then, remove the original uptake column using a dplyr function.
5. Use the pipe operator to perform multiple functions on the hw_data data frame. Create a summary data frame that displays the mean and standard deviation of the new uptake_percent column for the Treatment groups. Save this summary as a .csv file called "CO2_uptake_summary_Your_Name.csv".
Upload both your R script and .csv file to the Pegaz platform.