Data wrangling refers to the process of transforming raw and basic data into a more useful, neat, and tidy form. It includes everything from modifying the appearance of tables and organizing their contents to performing mathematical operations on the data and preparing it for various types of statistical analysis. In the first part, we will cover the data wrangling capabilities of base R.

Data frame

A data frame is something you probably know as a “table”. However, it has some unique characteristics:

  1. It always has a header. It means that each column must have its unique name. If names are not provided, R creates them automatically as “V” (as “variable”) followed by the consecutive number.
  2. The table is always complete. It means that each column has the same number of rows. If some data are missing, R puts NA (“not available”) in a given cell.
  3. An individual column of a data frame, similarly to vectors, can contain only one type of data, but different columns can be of different types.

During most statistical analyses, you will be operating on data frames.

Curiosity
In R, there is also an object called a matrix that will not be covered during this course. It looks and behaves similarly to the data frame, but all cells must contain the same type of data (often numbers). It is usually used in mathematics (matrix operations) and graphics (image encoding).
Recently, the object called tibble is rising in popularity and possibly will replace data.frame in the near future. The differences between them are rather subtle and do not affect our exercises directly. If you want to read more, follow the link.

In the last class, we learned how to upload an existing file into R. Now we will learn how to modify data frames. Remember the optimal coding workflow.

Another good habit to add to the ones we learned last time is inspecting your data once it is uploaded into RStudio. You can do this in the following ways:

visually inspect the file by either clicking on it in the environment (top right square) or by using the View(dataframe) function – reminder that R is case sensitive and View starts with a capital V
• ask R to display the top six rows of the data frame using the head() function, or the bottom six rows using the tail() function
• get a brief description of the data frame using the summary() function – this is particularly important, as it displays the type of data in each column of the data frame

Exercise 1

Upload class_3_data.xlsx into R and save it to a variable called my_data.

Exercise 2

View the uploaded data frame. Access the head() function manual. View the top 10 rows of my_data data frame and generate a short description for it.

Types of data

All values inside one column must be of the same data type, but the columns can be of different types. The type of data within a column is defined by its class. The most popular classes are:

character - strings (remember that numbers surrounded by quotation marks are also treated as strings)
factor - usually strings, represent categorical variables - variables that can take on a limited number of distinct values. When creating or saving data (data.frame() or read.csv()), you can control whether character vectors are automatically converted into factors using the stringAsFactors argument (for R version >= 4.0.0)
numeric - real numbers
integer - only integers
logical - logical values (TRUE or FALSE)

You can check the class of a value or a vector using the class() function. To use it on a data frame, you need to specify which columns you want to check. To check every column in a data frame, use the str() function.

Exercise 3

Check the class of every column of the my_data data frame.

You can also convert different data types with the use of as.[class_name] functions family e.g. as.character(), as.factor() ect.. Note, however, that not all conversions are permitted e.g., a letter cannot be converted into an integer. To change the class of a column, you need to replace the whole column with the result of an as.[class_name] function. We will learn how to do it once we discuss subsetting.

Subsetting

A data frame can be subsetted (=accessed or displayed in a specific way) with the use of coordinates (indexes) enclosed within the square brackets ([]). In case of data frames, there are always 2 coordinates: table[row_number, column_number].

Note that vectors have only one coordinate, the position, so they can be called one-dimensional objects. Data frames have two dimensions: rows and columns.

Exercise 4

Return the value from the 5th row and 3rd column of my_data.

Expected result:

## # A tibble: 1 × 1
##    mass
##   <dbl>
## 1    24

You can call for multiple values in the same time. To define a range of coordinates, use a colon (:) separating range borders - e.g., table[row_number_1:row_number_2, column_number].

Exercise 5

Return first 5 rows for the last 2 columns of my_data.

Expected result:

## # A tibble: 5 × 2
##    mass score
##   <dbl> <dbl>
## 1    22    94
## 2    29    79
## 3    22    88
## 4    29    86
## 5    24    28

Exercise 6

Now, save the 4th column of my_data as a variable called my_column_4. Call the variable, so its content is displayed in the console.

Tip: If you want to call for all rows or columns, it is enough to leave blank space instead of the respective coordinate - e.g., table[,column_number].

Expected result (first 10 rows):

## # A tibble: 30 × 1
##    score
##    <dbl>
##  1    94
##  2    79
##  3    88
##  4    86
##  5    28
##  6    22
##  7    32
##  8    36
##  9    76
## 10    74
## # ℹ 20 more rows

Exercise 7

Return the 3rd, 4th, and the 5th value from the variable my_column_4.

Expected result:

## # A tibble: 3 × 1
##   score
##   <dbl>
## 1    88
## 2    86
## 3    28

The table can also be subsetted with the use of columns’ (or rows’) names.

Exercise 8

Return the 4th column of my_data, but use the column name instead of its number. Don’t forget about quotation marks ("").

Expected results (first 10 rows):

## # A tibble: 30 × 1
##    score
##    <dbl>
##  1    94
##  2    79
##  3    88
##  4    86
##  5    28
##  6    22
##  7    32
##  8    36
##  9    76
## 10    74
## # ℹ 20 more rows

Another way to obtain the whole column is to use the dollar sign $ between the table’s and column’s names. Such an expression is automatically treated as a vector, so it can be directly subsetted to get a particular row. - e.g. table$column_name[row_number].

Exercise 9

Return values from the 5th to the 15th row of the mass column from my_data. Use the dollar sign $.

Expected result:

##  [1] 24 27 22 20 25 24 28 25 20 23 29

If desired rows (or columns) do not follow each other and the range option cannot be used, a vector of coordinates should be provided.

Note that the outcome is no longer a table. As we asked for just one column, a series of numbers (vector) was returned.

Curiosity
Pulling one row from the data frame will not result in a vector. It is because a row can contain elements of different types, which is not allowed for vectors.

Exercise 10

Create a vector with values 3, 5, 7, 9, and 12 and save it to a variable. Call it.

Expected result:

## [1]  3  5  7  9 12

Exercise 11

Return rows of the score column corresponding to the values in the vector created before.

Expected result:

## [1] 88 28 32 76 49

You can also subset everything except specified columns (or rows). To do this, put a minus (-) before an index or a vector of indices.

Exercise 12

Return all columns of my_data except the 2nd one.

Expected result (first 10 rows):

## # A tibble: 30 × 3
##    individual  mass score
##    <chr>      <dbl> <dbl>
##  1 ind001        22    94
##  2 ind002        29    79
##  3 ind003        22    88
##  4 ind004        29    86
##  5 ind005        24    28
##  6 ind006        27    22
##  7 ind007        22    32
##  8 ind008        20    36
##  9 ind009        25    76
## 10 ind010        24    74
## # ℹ 20 more rows

Simple operations on data frames/ tibbles

1. Replacement

To replace a given value in your table assign a new value to this place in your table with the use of an arrow (<-). It works just as with variables assignment - e.g.table[row_number,column_number] <- new_value

Exercise 13

Replace the 5th value in the column sex with the label ‘Unknown’. Call column sex and check whether it was indeed replaced.

Expected result:

##  [1] "M"       "M"       "M"       "M"       "Unknown" "M"       "M"      
##  [8] "M"       "M"       "M"       "M"       "M"       "M"       "M"      
## [15] "M"       "F"       "F"       "F"       "F"       "F"       "F"      
## [22] "F"       "F"       "F"       "F"       "F"       "F"       "F"      
## [29] "F"       "F"

2. Mathematical operations

You can use classical mathematical operators: +, -, *, and /. Remember, however, that mathematical operations make sense only for integer or numeric data types.

Exercise 14

Recalculate and modify the mass column to convert it from kilograms to pounds (1 kilogram equals around 2.20 pounds). Display modified column.

Expected result:

##  [1] 48.4 63.8 48.4 63.8 52.8 59.4 48.4 44.0 55.0 52.8 61.6 55.0 44.0 50.6 63.8
## [16] 61.6 57.2 48.4 46.2 57.2 50.6 68.2 59.4 59.4 74.8 72.6 55.0 46.2 39.6 74.8

You can also use simple summary functions from the previous class (see Class 2).

Exercise 15

Calculate the mean value of the score column.

Expected result:

## [1] 57.53333

3. Simple summaries

nrow() - number of rows
ncol() - number of columns
dim() - dimensions of an object

Exercise 16

Return the total number of cells within my_data.

Expected result:

## [1] 120

4. Deleting rows with missing data

Missing data, as stated before, are represented as NA (not available) in R. Most of the functions will raise an error every time the NA is provided as the argument.

Exercise 17

Choose one cell and replace its value with NA. Do not add quotation marks as NA is an internal R symbol (just as π). Print the whole row.

Rows with missing data can be removed with the na.omit() function. To save changes, the result of the na.omit() function must be saved as a variable. Note that, in practice, deleting missing data must be well justified. Make sure they do not provide any important information.

Exercise 18

Check the number of rows of my_data. Remove the row with missing data. Replace variable my_data with the modified table. Remember that it cannot be undone. Check if the number of rows has changed.

Adding a new column or row

  1. Adding a new column is relatively simple. All you need is a vector. However, remember three things:

    • The length of the vector must equal the number of rows of a data frame
    • The order of values within a vector corresponds to the row numbers
    • The name of the vector will become the name of the added column

We are going to add a column with the IDs of observations. Note that a column with IDs is often necessary in statistical analysis. It is also inherent to the data in long format, which is strongly advised.

Exercise 19

Create a vector with ID numbers starting from 100. Use one of the functions introduced above to obtain the desired length of the vector. Call the vector ID.

Expected result:

##  [1] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118
## [20] 119 120 121 122 123 124 125 126 127 128

You can combine data frames with the use of the cbind() function in the following manner: cbind(data_frame1,data_frame2). Note that a vector can be seen as a data frame with only one column.

Exercise 20

Add generated IDs as the first column in my_data dataframe. Overwrite the my_data variable.

  1. Adding a new row is more complicated as you cannot create a vector with different types of data. Firstly, you need to create a new data frame (similar to the old one) consisting of only new row (or rows). To do this, use the data.frame() function in the following manner: data.frame(columnName1 = value1, columnName2 = value2,…).

Exercise 21

Create a data frame added_rows, with one row and the following values: 130, ind031, F, 55.7, 77. Columns’ names should correspond to those of my_data data frame. Call it.

Expected result:

##    ID individual sex mass score
## 1 130     ind031   F 55.7    77

Curiosity To combine a data frame with more rows at the same time, replace the values for each column with the vectors.

To stick data frames by rows use the rbind() function in the following manner: rbind(data_frame1, data_frame2)

Exercise 22

Add generated row at the end of my_data dataframe. Overwrite the my_data variable. Print the last 6 rows of my_data to check the operation result.

Expected result:

##     ID individual sex mass score
## 25 124     ind026   F 72.6    83
## 26 125     ind027   F 55.0    42
## 27 126     ind028   F 46.2    48
## 28 127     ind029   F 39.6    37
## 29 128     ind030   F 74.8    48
## 30 130     ind031   F 55.7    77

Sometimes we need to change the class of a column in a dataframe. This can be done using as.[class] functions: as.numeric(), as.character(), as.factor(), etc.

Exercise 23

Use the class() function to check the data type of the sex column in my_data. Then, convert this column into a factor and overwrite it. Finally, use class() again to confirm the conversion.

Expected result:

## [1] "character"
## [1] "factor"

Saving the data frame

To save your data frame into a file, use the write.table() function.

Exercise 24

Save modified my_data to the .csv file.

Homework

Prepare your homework in the form of a script file (.R) and call it “class_3_homework_Your_name.R”.
Perform all exercises using the built-in ToothGrowth dataset. You can view details about the dataset with ?ToothGrowth. To begin, assign it to a variable with my_data <- ToothGrowth, and then carry out all operations on my_data as we did in today’s class.
Include all subsequent steps in a script file.
1. Check the class of each column in the my_data data frame. Find the shortest tooth length.
2. Add a new row at the end of my_data. The values in a new row should be as follows: 26.2, OJ, 2.0. Write the modified my_data dataset as the my_data_modified variable. The my_data should still contain the original dataset.
3. Add column with IDs - “obs_1” to “obs_n” (“n” is the number of the last observation)- for each observation as the first column of my_data. Overwrite my_data_modified variable. Tip: to create a vector with IDs, you can use paste() to connect “obs_” with the numbers (1-n) instead of typing each ID manually.
4. Modify my_data by removing its first 4 rows. Overwrite the my_data_modified variable.
5. Save a modified dataset to a “class_3_toothgrowth_modified_Your_Name.csv” file.

Upload both your R script and .csv to the “Class 3” tab on the Pegaz platform.