R, like computer architecture as a whole (including programming languages), is based on binary information: the current either flows (TRUE) or does not (FALSE), depending on the conditions (capital letters on purpose). These logical values form a distinct type of data and can, like other types, be combined into vectors using the same functions as for numbers and characters.
Create a logical vector with 3 TRUE and 2 FALSE values and call it my_logical. Note that these are internal R symbols, so you should not use quotation marks ("").
Expected result:
## [1] TRUE TRUE TRUE FALSE FALSE
Advice: To make your code shorter, you can use T instead of TRUE and F instead of FALSE.
Formally, logical values correspond to 0 (for FALSE) and 1 (for TRUE) and behave like them in every mathematical operation.
Calculate the sum of the my_logical vector.
Expected result:
## [1] 3
Curiosity
The sum of a logical vector can be used to check a data frame for the presence of NA cells. To do this, combine the is.na() function used on the data frame with the sum() function.
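For instance, a minimal sketch with a small, hypothetical data frame (not the course data):
df <- data.frame(x = c(1, NA, 3), y = c(NA, NA, 6))  # toy data frame with three NA cells
is.na(df)        # logical matrix: TRUE marks every NA cell
sum(is.na(df))   # TRUE counts as 1, so the result is 3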
However, logical vectors have one important distinctive characteristic: they can be used for subsetting. To achieve this, you need to provide a logical vector with TRUE for the elements you want to keep and FALSE for the elements you want to discard. The length of the logical vector used for subsetting has to be equal to the number of elements (e.g., columns) in the object you want to subset.
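As an illustration (a generic sketch, not the exercise below):
v <- c(10, 20, 30)
v[c(TRUE, FALSE, TRUE)]                           # keeps the 1st and 3rd elements: 10 30
# the same idea works for the columns of a data frame:
head(iris[, c(TRUE, FALSE, FALSE, FALSE, TRUE)])  # keeps only Sepal.Length and Species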
Save the first 6 rows of the built-in CO2 dataset to a chosen variable. Then, using a logical vector, return the 1st and 5th columns.
Expected result:
## Plant uptake
## 1 Qn1 16.0
## 2 Qn1 30.4
## 3 Qn1 34.8
## 4 Qn1 37.2
## 5 Qn1 35.3
## 6 Qn1 39.2
Normally, no one creates logical vectors by hand. They are created automatically as the result of different comparisons. The most common is testing for equality, which is done with a double equal sign (==).
Check whether, in R, 5 equals 5.00 and π equals 3.14.
Expected result:
## [1] TRUE
## [1] FALSE
The same applies to all other comparisons, but the symbols differ:
• == - equal
• != - not equal
• > - greater than
• >= - greater than or equal
• < - less than
• <= - less than or equal
Curiosity
A double equal sign == is used for equality testing because the single sign = is already in use. It serves as an alternative to the assignment arrow; however, for code clarity, the arrow is the recommended form.
Note that when comparing two vectors with the symbols shown above, R does not treat the operation as a single comparison. It compares them element by element, recycling the shorter vector and returning a logical vector as long as the longer one. This is the same rule as for mathematical operations on vectors.
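A short sketch of this element-wise behaviour (generic vectors, not the exercise answer):
c(1, 2, 3, 4) == c(1, 2)   # the shorter vector is recycled: TRUE TRUE FALSE FALSE
c(1, 2, 3) > 2             # a single number is recycled too: FALSE FALSE TRUE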
Manually create two vectors: one of prime numbers and one of even numbers, both within the range <0, 11>. Check whether they are equal.
Expected result:
## [1] TRUE FALSE FALSE FALSE FALSE
To check whether vectors (as a whole) are identical, use the identical() function.
Create two integer vectors from 1 to 10. Give them different names and compare them with the identical() function. Then, change one value within the first vector and repeat the comparison.
Expected result:
## [1] TRUE
## [1] FALSE
Another useful tool is the %in% operator. It tells you whether the elements of the first vector are present in the second one. Note that it is focused on the first vector only, so there is no recycling. The result of an operation with %in% is a logical vector.
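A quick sketch (hypothetical letter vectors):
c("a", "b", "c") %in% c("b", "c", "d")   # FALSE TRUE TRUE
c("b", "c", "d") %in% c("a", "b", "c")   # TRUE TRUE FALSE -- the order of the vectors matters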
Create two character vectors, each consisting of a set of individual characters. The first should contain your name and the second one the name of another person from the group. Check how many letters your names have in common. Then, change the order of names and repeat the comparison.
The exclamation mark ! works in R as the negation symbol (it reverses a statement). Any logical vector preceded by ! is reversed: FALSE is changed into TRUE and vice versa.
Having a vector of 3 TRUE and 3 FALSE values, return its negation.
Expected result:
## [1] FALSE FALSE FALSE TRUE TRUE TRUE
More typically, an exclamation mark ! is used to negate comparisons. Note that the idea is the same: you negate a logical vector by negating the action that produces it. Remember that a negated comparison should be enclosed in parentheses, e.g., !(2 == 2).
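For example (a minimal sketch):
!(2 == 2)          # FALSE
!(c(1, 2, 3) > 2)  # TRUE TRUE FALSE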
Create a sequence of integers from 1 to 100 in which each subsequent element is larger by 3 than the previous one. Then, create a logical vector indicating which elements are larger than 50. Do not use the > sign.
Expected result:
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
The real power of logic in programming comes from combining comparisons (use parentheses for clarity). There are two basic operators (a short sketch follows the list):
• & - and - the condition is TRUE if both comparisons are TRUE
• | - or - the condition is TRUE if at least one comparison is TRUE
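A minimal sketch of both operators:
x <- c(1, 5, 10)
(x > 2) & (x < 8)   # FALSE TRUE FALSE -- both conditions must hold
(x < 2) | (x > 8)   # TRUE FALSE TRUE  -- at least one condition holds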
For an integer vector from 1 to 10, create a logical vector indicating which elements are smaller or larger than 5. Use a logical operator.
Expected result:
## [1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
For an integer vector from 1 to 10, create a logical vector indicating which elements are divisible by both 2 and 3. Use a logical operator.
Expected result:
## [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
Curiosity
R also has double versions of the & and | operators: && and ||. They work on single logical values, and for those their outcome is the same; their role is related to optimization. With a double operator, R evaluates the first condition and checks the second one only if necessary (e.g., in an "AND" statement, if the first condition is FALSE there is no need to check the second one, as the result will always be FALSE). They are often used when the conditions are severely time-consuming.
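A sketch of this short-circuit behaviour (slow_check() is a made-up stand-in for an expensive condition):
slow_check <- function() { Sys.sleep(2); TRUE }  # hypothetical, deliberately slow check
FALSE && slow_check()   # returns FALSE immediately; slow_check() is never evaluated
TRUE  || slow_check()   # returns TRUE immediately for the same reason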
Frequently, the question is not about logical vectors themselves but rather about which elements of a vector fulfill a given condition. The answer is provided by the which() function. It takes a comparison as an argument and returns a vector of indexes that can be used for subsetting.
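A minimal sketch:
v <- c(5, 12, 8, 20)
which(v > 10)     # [1] 2 4 -- the positions, not the values
v[which(v > 10)]  # [1] 12 20 -- the indexes can be reused for subsetting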
Construct a vector with the first 20 integers divisible by 3. Which of its elements are larger than or equal to 21?
Expected result:
## [1] 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Using the indexes of values larger than or equal to 21 from the previous exercise, return the corresponding values from the previously constructed vector.
Expected result:
## [1] 21 24 27 30 33 36 39 42 45 48 51 54 57 60
As stated before, subsetting can be done directly with logical vectors (TRUE for each kept element). In practice, it is even simpler. All you need to provide is a condition instead of a coordinate, e.g., vector[condition] will return only the elements fulfilling the given condition.
For example: there is a vector a <- c(1, 2, 3, 4). We want to subset the elements of this vector that are greater than 2. The formula will look like this: a[a > 2]. The result will be: [1] 3 4.
The logic behind this is as follows:
1. The condition inside the brackets (a > 2) generates a logical vector - positions are denoted as a series of TRUE and FALSE. In this example, the logical vector is: [1] FALSE FALSE TRUE TRUE.
2. Only the elements corresponding to TRUE are kept, so here the last two elements of the vector are returned: [1] 3 4.
Note that you cannot see the TRUE/FALSE vector itself, but it is in fact generated and used during the subsetting operation.
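The same example, written out as code:
a <- c(1, 2, 3, 4)
a > 2     # FALSE FALSE TRUE TRUE -- the hidden logical vector
a[a > 2]  # [1] 3 4 -- only the TRUE positions are returned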
For an integer vector from 1 to 100, return elements higher than the vector’s mean.
Expected result:
## [1] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
## [20] 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
## [39] 89 90 91 92 93 94 95 96 97 98 99 100
The same pattern applies to subsetting data frames.
Using the built-in CO2 dataset, return the observations for the Qn2 plant.
Expected result:
## Plant Type Treatment conc uptake
## 8 Qn2 Quebec nonchilled 95 13.6
## 9 Qn2 Quebec nonchilled 175 27.3
## 10 Qn2 Quebec nonchilled 250 37.1
## 11 Qn2 Quebec nonchilled 350 41.8
## 12 Qn2 Quebec nonchilled 500 40.6
## 13 Qn2 Quebec nonchilled 675 41.4
## 14 Qn2 Quebec nonchilled 1000 44.3
Advice: To obtain a logical vector with one element for each row, you need to make a comparison based on a column, e.g., my_data[my_data$my_column != 5, ] would return the observations (including all columns) from my_data in which the value of my_column does not equal 5.
Using the built-in CO2 dataset, return the observations for the Mississippi chilled plants with an uptake higher than 20 µmol/(m² × s).
Expected result:
## Plant Type Treatment conc uptake
## 69 Mc1 Mississippi chilled 675 22.2
## 70 Mc1 Mississippi chilled 1000 21.9
dplyr is a widely used R package that simplifies the manipulation and management of data frames.
Install and load the dplyr package.
Load the rats.csv file into an object called my_data. View the first 10 lines of the file.
Expected result (first 10 rows):
## Glycogen Treatment Rat Liver
## 1 131 1 1 1
## 2 130 1 1 1
## 3 131 1 1 2
## 4 125 1 1 2
## 5 136 1 1 3
## 6 142 1 1 3
## 7 150 1 2 1
## 8 148 1 2 1
## 9 140 1 2 2
## 10 143 1 2 2
Modify my_data by adding an ID column at the beginning.
Expected result (first 10 rows):
## ID Glycogen Treatment Rat Liver
## 1 1 131 1 1 1
## 2 2 130 1 1 1
## 3 3 131 1 1 2
## 4 4 125 1 1 2
## 5 5 136 1 1 3
## 6 6 142 1 1 3
## 7 7 150 1 2 1
## 8 8 148 1 2 1
## 9 9 140 1 2 2
## 10 10 143 1 2 2
All subsequent functions come from the loaded dplyr package. Importantly, names of columns provided to dplyr functions do not need quotation marks.
To sort the data, use the arrange() function in the following manner: arrange(dataset, ordering_column).
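A generic illustration with the built-in CO2 dataset (not the exercise answer):
library(dplyr)
head(arrange(CO2, uptake))   # rows ordered by increasing uptake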
Obtain the observations from my_data sorted by increasing levels of glycogen. Use the arrange() function.
Expected result (first 10 rows):
## ID Glycogen Treatment Rat Liver
## 1 4 125 1 1 2
## 2 26 125 3 1 1
## 3 36 127 3 2 3
## 4 2 130 1 1 1
## 5 1 131 1 1 1
## 6 3 131 1 1 2
## 7 25 134 3 1 1
## 8 35 134 3 2 3
## 9 29 135 3 1 3
## 10 5 136 1 1 3
To obtain the descending order, put the name of the column inside the desc() function.
Using the arrange() function, sort the observations from my_data by decreasing levels of glycogen.
Expected result (first 10 rows):
## ID Glycogen Treatment Rat Liver
## 1 23 162 2 2 3
## 2 11 160 1 2 3
## 3 13 157 2 1 1
## 4 20 155 2 2 1
## 5 15 154 2 1 2
## 6 18 153 2 1 3
## 7 24 152 2 2 3
## 8 19 151 2 2 1
## 9 7 150 1 2 1
## 10 12 150 1 2 3
You can also sort the data by multiple columns. To do this, add columns in the following manner: arrange(dataset, ordering_column1, ordering_column2). Note that the priority of sorting is denoted by the order of the function arguments.
Using the arrange() function, obtain the observations from my_data sorted first by the Treatment column and then by the Rat column (both in ascending order).
Expected result (first 10 rows):
## ID Glycogen Treatment Rat Liver
## 1 1 131 1 1 1
## 2 2 130 1 1 1
## 3 3 131 1 1 2
## 4 4 125 1 1 2
## 5 5 136 1 1 3
## 6 6 142 1 1 3
## 7 7 150 1 2 1
## 8 8 148 1 2 1
## 9 9 140 1 2 2
## 10 10 143 1 2 2
To select particular columns, use the select() function in the following way: select(dataset, column_name1, column_name2). Note that all mentioned columns will be preserved, and the rest will be discarded.
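A minimal sketch with the built-in CO2 dataset:
library(dplyr)
head(select(CO2, Plant, uptake))   # keeps only the Plant and uptake columns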
Obtain the Glycogen and Liver columns from my_data. Use the select() function.
Expected result (first 10 rows):
## Glycogen Liver
## 1 131 1
## 2 130 1
## 3 131 2
## 4 125 2
## 5 136 3
## 6 142 3
## 7 150 1
## 8 148 1
## 9 140 2
## 10 143 2
You can also use a minus sign - preceding the column name, which means "all except this column".
Obtain the ID, Glycogen, Treatment, and Liver columns from my_data. Use the select() function and the minus (-) sign.
Expected result (first 10 rows):
## ID Glycogen Treatment Liver
## 1 1 131 1 1
## 2 2 130 1 1
## 3 3 131 1 2
## 4 4 125 1 2
## 5 5 136 1 3
## 6 6 142 1 3
## 7 7 150 1 1
## 8 8 148 1 1
## 9 9 140 1 2
## 10 10 143 1 2
To subset the observations, use the filter() function in the following manner: filter(dataset, your_logical_condition). Note that the logical conditions are always related to the values inside a given column.
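A minimal sketch with the built-in CO2 dataset:
library(dplyr)
head(filter(CO2, conc == 95))   # only the rows in which conc equals 95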
Using the filter() function, obtain the observations for which Treatment equals 1.
Expected result (first 10 rows):
## ID Glycogen Treatment Rat Liver
## 1 1 131 1 1 1
## 2 2 130 1 1 1
## 3 3 131 1 1 2
## 4 4 125 1 1 2
## 5 5 136 1 1 3
## 6 6 142 1 1 3
## 7 7 150 1 2 1
## 8 8 148 1 2 1
## 9 9 140 1 2 2
## 10 10 143 1 2 2
You can also combine several logical conditions by using logical operators (see above). Note, however, that for each observation under consideration, the combination needs to result in a single TRUE or FALSE.
Using the filter() function, obtain the observations for which Treatment equals 3 and the glycogen level is higher than 135.
Expected result:
## ID Glycogen Treatment Rat Liver
## 1 27 138 3 1 2
## 2 28 138 3 1 2
## 3 30 136 3 1 3
## 4 31 138 3 2 1
## 5 32 140 3 2 1
## 6 33 139 3 2 2
## 7 34 138 3 2 2
To create a new column based on the others, use the mutate() function in the following manner: mutate(dataset, new_column_name = recipe_for_values). The "recipe for values" is often a mathematical formula or a simple mathematical function based on the values of other columns. Note that it is just a modification of the given value for each observation separately.
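A minimal sketch with the built-in CO2 dataset (double_uptake is just an illustrative column name):
library(dplyr)
head(mutate(CO2, double_uptake = uptake * 2))   # new column computed row by row from uptake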
Using the mutate() function, create a new column called log_Gly that will be a natural logarithm transformation of the Glycogen column. Overwrite my_data.
Expected result (first 10 rows):
## ID Glycogen Treatment Rat Liver log_Gly
## 1 1 131 1 1 1 4.875197
## 2 2 130 1 1 1 4.867534
## 3 3 131 1 1 2 4.875197
## 4 4 125 1 1 2 4.828314
## 5 5 136 1 1 3 4.912655
## 6 6 142 1 1 3 4.955827
## 7 7 150 1 2 1 5.010635
## 8 8 148 1 2 1 4.997212
## 9 9 140 1 2 2 4.941642
## 10 10 143 1 2 2 4.962845
Make a summary of your dataset with the summarise() function. The syntax is as follows: summarise(dataset, name_of_summary1 = recipe_for_value1, name_of_summary2 = recipe_for_value2, ...). The recipe in this case is any aggregating function (e.g., mean()) that accepts a whole column (vector) and returns a single statistic. Note that this function results in a new table with 1 row of summary statistics and as many columns as the number of statistics mentioned.
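A minimal sketch with the built-in CO2 dataset:
library(dplyr)
summarise(CO2, mean_uptake = mean(uptake), max_conc = max(conc))   # one-row summary table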
Use the summarise() function to create a summary of my_data containing the mean, median, maximum value, minimum value, and standard deviation of the Glycogen column.
Expected result:
## mean median max min st_dev
## 1 142.2222 141 162 125 9.754445
You can also count the number of observations corresponding to the groups within a given column, e.g., check how many observations were collected for each treatment. To obtain this, use the count() function by typing count(dataset, given_column).
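A minimal sketch with the built-in CO2 dataset:
library(dplyr)
count(CO2, Type)   # number of observations for each Type (Quebec / Mississippi)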
Use the count() function to check how many observations were collected for each treatment used in the study.
Expected result:
## Treatment n
## 1 1 12
## 2 2 12
## 3 3 12
As you probably noticed, all dplyr functions take the data as the first argument. Thanks to this characteristic, you can create a pipeline in which the next function uses the output generated by the previous one. In that case, you should provide the dataset argument to the first function only.
A pipeline is created by connecting subsequent functions with the pipe operator %>%. Remember to skip the data argument in all functions except the first one, e.g., select(dataset, column1, column2) %>% filter(column1 > 50).
The keyboard shortcut for the %>% operator is Ctrl + Shift + M (Windows) or Cmd + Shift + M (Mac).
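A minimal pipeline sketch with the built-in CO2 dataset:
library(dplyr)
CO2 %>%
  filter(conc > 500) %>%      # keep only high-concentration observations
  select(Plant, uptake) %>%   # keep two columns
  arrange(desc(uptake)) %>%   # sort by decreasing uptake
  head()                      # show only the first rows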
Curiosity
In R version 4.1.0, a new pipe operator was introduced: |>. This operator is part of base R, so you do not have to load additional packages to use it (unlike %>%, which requires loading dplyr).
Using a pipeline and my_data:
1. Select the ID, Glycogen, and Liver columns.
2. Obtain the observations with glycogen levels lower than 140.
3. Sort the result by the Glycogen column in descending order.
Expected result (first 10 rows):
## ID Glycogen Liver
## 1 33 139 2
## 2 27 138 2
## 3 28 138 2
## 4 31 138 1
## 5 34 138 2
## 6 5 136 3
## 7 30 136 3
## 8 29 135 3
## 9 25 134 1
## 10 35 134 3
Performing any function over a complete dataset is often not what you really want. Imagine 3 species with a trait of interest. It can be the case that the overall mean does not reflect the variability among species. To check the value for each species separately, you need to group your dataset. Do this by using the group_by() function in the following manner: group_by(dataset, column_with_groups). This produces a grouped data frame and causes all subsequent functions to operate on each group separately.
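A minimal sketch with the built-in CO2 dataset:
library(dplyr)
CO2 %>%
  group_by(Type) %>%                        # one group per Type
  summarise(mean_uptake = mean(uptake))     # the mean is now computed per group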
Using a pipe, the group_by() function, and the my_data data frame, create a summary table with the mean and standard deviation of the Glycogen column for each treatment separately.
Expected result:
## # A tibble: 3 × 3
## Treatment mean st_dev
## <int> <dbl> <dbl>
## 1 1 140. 10.3
## 2 2 151 5.66
## 3 3 135. 4.71
Advice: To perform an action on the whole dataset again, use the ungroup() function.
Using a pipe, the group_by() function, and the my_data data frame, create a new column with the deviations of Glycogen values from the arithmetic mean in a given Treatment.
Expected result (first 10 rows):
## # A tibble: 10 × 7
## # Groups: Treatment [1]
## ID Glycogen Treatment Rat Liver log_Gly std_dev
## <int> <int> <int> <int> <int> <dbl> <dbl>
## 1 1 131 1 1 1 4.88 10.3
## 2 2 130 1 1 1 4.87 10.3
## 3 3 131 1 1 2 4.88 10.3
## 4 4 125 1 1 2 4.83 10.3
## 5 5 136 1 1 3 4.91 10.3
## 6 6 142 1 1 3 4.96 10.3
## 7 7 150 1 2 1 5.01 10.3
## 8 8 148 1 2 1 5.00 10.3
## 9 9 140 1 2 2 4.94 10.3
## 10 10 143 1 2 2 4.96 10.3
**************** ADVANCED ******************
Imagine having two data frames corresponding to the same study system. In both of them, there is a column with individual IDs. How can you bind them together? Using cbind() is rather a bad idea, as the order of observations can differ.
The solution is to use one of the _join() functions.
Data frame 1:
## ID V1
## 1 ind_1 red
## 2 ind_2 blue
## 3 ind_3 green
Data frame 2:
## ID V2
## 1 ind_2 black
## 2 ind_3 blue
## 3 ind_4 blue
left_join() - joins the values from the second table (right) that correspond to observations in the first one (left). If there is no suitable value in the second table, NA is returned.
left_join(df1, df2, by = "ID")
## ID V1 V2
## 1 ind_1 red <NA>
## 2 ind_2 blue black
## 3 ind_3 green blue
right_join() - joins the values from the first table (left) that correspond to observations in the second one (right). If there is no suitable value in the first table, NA is returned.
right_join(df1, df2, by = "ID")
## ID V1 V2
## 1 ind_2 blue black
## 2 ind_3 green blue
## 3 ind_4 <NA> blue
inner_join() - returns only those observations that have corresponding values in both tables.
inner_join(df1, df2, by = "ID")
## ID V1 V2
## 1 ind_2 blue black
## 2 ind_3 green blue
full_join() - joins what can be joined, but keeps all observations. If a suitable value is missing, NA is returned.
full_join(df1, df2, by = "ID")
## ID V1 V2
## 1 ind_1 red <NA>
## 2 ind_2 blue black
## 3 ind_3 green blue
## 4 ind_4 <NA> blue
Each of the abovementioned functions can be used by typing: _join(first_table, second_table, by = "shared_column_name"). Notice that the name of the shared column should be the same in both tables (e.g., ID).
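The two small data frames shown above can be recreated and joined like this (a sketch, assuming dplyr is loaded):
library(dplyr)
df1 <- data.frame(ID = c("ind_1", "ind_2", "ind_3"), V1 = c("red", "blue", "green"))
df2 <- data.frame(ID = c("ind_2", "ind_3", "ind_4"), V2 = c("black", "blue", "blue"))
left_join(df1, df2, by = "ID")   # keeps all rows of df1; ind_1 gets NA in V2
full_join(df1, df2, by = "ID")   # keeps every ID; NA wherever there is no match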
Execute the code below to create a new data frame with an ID and a weight column. Join the observations from the my_data and new_data data frames using the proper _join() function. Keep all observations from my_data, but only those from new_data that have their counterparts in my_data.
new_data <- data.frame("ID" = c(2:100), "weight" = rnorm(99, mean = 150, sd = 20))
Expected result (first 10 rows):
## ID Glycogen Treatment Rat Liver log_Gly weight
## 1 1 131 1 1 1 4.875197 NA
## 2 2 130 1 1 1 4.867534 155.1685
## 3 3 131 1 1 2 4.875197 126.8394
## 4 4 125 1 1 2 4.828314 135.9751
## 5 5 136 1 1 3 4.912655 151.2915
## 6 6 142 1 1 3 4.955827 129.2615
## 7 7 150 1 2 1 5.010635 148.5248
## 8 8 148 1 2 1 4.997212 145.3898
## 9 9 140 1 2 2 4.941642 144.0690
## 10 10 143 1 2 2 4.962845 169.9980
Curiosity
1. Using joins lets us avoid repeating the same information across many data frames.
2. The concept of joins is common in many computer languages, but it is most often used in database management (e.g., SQL).
1. Create a vector from 1 to 50. Then, using R only, calculate how many elements of this vector are higher than or equal to the positive square root of 100. *A positive square root can be calculated with the sqrt() function.
2. Load the built-in CO2 data set as hw_data. Using logical expressions, display the rows of the dataset that have a concentration below 350 AND an uptake of over 30.
3. Load the dplyr package. In the hw_data data frame, sort the observations first by concentration, then by uptake. Overwrite the hw_data variable so that the observations are permanently sorted in this manner.
4. Using one of the dplyr functions, add a new column called uptake_percent to the hw_data data frame that displays the uptake value as a percentage of the maximum value of that column (so that the highest value, 45.5, corresponds to 100%). Then, remove the original uptake column using a dplyr function.
5. Use the pipe operator to perform multiple functions on the hw_data data frame. Create a summary data frame that displays the mean and standard deviation of the new uptake_percent column for the Treatment groups. Save this summary as a .csv file called "CO2_uptake_summary_Your_Name.csv".
Upload both your R script and .csv file to the Pegaz platform.