Data wrangling refers to the process of transforming raw and basic data into a more useful, neat, and tidy form. It includes everything from modifying the appearance of tables and organizing their contents to performing mathematical operations on the data and preparing it for various types of statistical analysis. In the first part, we will cover the data wrangling capabilities of base R.
A data frame is something you probably know as a “table”. However, it has some unique characteristics:
NA
(“not available”) in a given cell.During most statistical analyses, you will be operating on data frames.
Curiosity
In R, there is also an object called a matrix that will not be covered during this course. It looks and behaves similarly to the data frame, but all cells must contain the same type of data (often numbers). It is usually used in mathematics (matrix operations) and graphics (image encoding).
Recently, the object called tibble is rising in popularity and possibly will replace data.frame in the near future. The differences between them are rather subtle and do not affect our exercises directly. If you want to read more, follow the link.
In the last class, we learned how to upload an existing file into R. Now we will learn how to modify data frames. Remember the optimal coding workflow.
Another good habit to add to the ones we learned last time is inspecting your data once it is uploaded into RStudio. You can do this in the following ways:
• visually inspect the file by either clicking on it
in the environment (top right square) or by using the
View(dataframe)
function – reminder that R is case
sensitive and View starts with a capital V
• ask R to display the top six rows of the data frame
using the head()
function, or the bottom six rows using the
tail()
function
• get a brief description of the data frame using the
summary()
function – this is particularly important, as it
displays the type of data in each column of the data frame
Upload class_3_data.xlsx into R and save it to a
variable called my_data
.
View the uploaded data frame.
Access the head()
function manual. View the top 10 rows of
my_data
data frame and generate a short description for
it.
All values inside one column must be of the same data type, but the columns can be of different types. The type of data within a column is defined by its class. The most popular classes are:
• character
- strings (remember that numbers surrounded
by quotation marks are also treated as strings)
• factor
- usually strings, represent categorical variables
- variables that can take on a limited number of distinct values. When
creating or saving data (data.frame()
or
read.csv()
), you can control whether character vectors are
automatically converted into factors using the
stringAsFactors
argument (for R version >= 4.0.0)
• numeric
- real numbers
• integer
- only integers
• logical
- logical values (TRUE
or
FALSE
)
You can check the class of a value or a vector using
the class()
function. To use it on a data frame, you need
to specify which columns you want to check. To check every column in a
data frame, use the str()
function.
Check the class of every column of
the my_data
data frame.
You can also convert different data types with the
use of as.[class_name] functions family
e.g. as.character()
, as.factor()
ect.. Note,
however, that not all conversions are permitted e.g., a
letter cannot be converted into an integer. To change the class of a
column, you need to replace the whole column with the result of an
as.[class_name]
function. We will learn how to do it once
we discuss subsetting.
A data frame can be subsetted (=accessed or
displayed in a specific way) with the use of
coordinates (indexes) enclosed within the
square brackets ([]
). In case of data
frames, there are always 2 coordinates:
table[row_number, column_number]
.
Note that vectors have only one coordinate, the position, so they can be called one-dimensional objects. Data frames have two dimensions: rows and columns.
Return the value from the 5th row
and 3rd column of my_data
.
Expected result:
## # A tibble: 1 × 1
## mass
## <dbl>
## 1 24
You can call for multiple values in the same time.
To define a range of coordinates, use a colon
(:
) separating range borders - e.g.,
table[row_number_1:row_number_2, column_number]
.
Return first 5 rows for the last 2
columns of my_data
.
Expected result:
## # A tibble: 5 × 2
## mass score
## <dbl> <dbl>
## 1 22 94
## 2 29 79
## 3 22 88
## 4 29 86
## 5 24 28
Now, save the 4th column of
my_data
as a variable called my_column_4
. Call
the variable, so its content is displayed in the
console.
Tip: If you want to call for all rows or
columns, it is enough to leave blank space instead of the
respective coordinate - e.g.,
table[,column_number]
.
Expected result (first 10 rows):
## # A tibble: 30 × 1
## score
## <dbl>
## 1 94
## 2 79
## 3 88
## 4 86
## 5 28
## 6 22
## 7 32
## 8 36
## 9 76
## 10 74
## # ℹ 20 more rows
Return the 3rd, 4th, and the 5th
value from the variable my_column_4
.
Expected result:
## # A tibble: 3 × 1
## score
## <dbl>
## 1 88
## 2 86
## 3 28
The table can also be subsetted with the use of columns’ (or rows’) names.
Return the 4th column of
my_data
, but use the column name instead of its number.
Don’t forget about quotation marks
(""
).
Expected results (first 10 rows):
## # A tibble: 30 × 1
## score
## <dbl>
## 1 94
## 2 79
## 3 88
## 4 86
## 5 28
## 6 22
## 7 32
## 8 36
## 9 76
## 10 74
## # ℹ 20 more rows
Another way to obtain the whole column is to use the
dollar sign $
between the table’s and column’s
names. Such an expression is automatically treated as a vector,
so it can be directly subsetted to get a particular row. -
e.g. table$column_name[row_number]
.
Return values from the 5th to the
15th row of the mass column from my_data
. Use the dollar
sign $
.
Expected result:
## [1] 24 27 22 20 25 24 28 25 20 23 29
If desired rows (or columns) do not follow each other and the range option cannot be used, a vector of coordinates should be provided.
Note that the outcome is no longer a table. As we asked for just one column, a series of numbers (vector) was returned.
Curiosity
Pulling one row from the data frame will not result in a vector. It is because a row can contain elements of different types, which is not allowed for vectors.
Create a vector with values 3, 5, 7, 9, and 12 and save it to a variable. Call it.
Expected result:
## [1] 3 5 7 9 12
Return rows of the
score
column corresponding to the values in the vector
created before.
Expected result:
## [1] 88 28 32 76 49
You can also subset everything except specified columns (or
rows). To do this, put a minus (-
) before an index
or a vector of indices.
Return all columns of
my_data
except the 2nd one.
Expected result (first 10 rows):
## # A tibble: 30 × 3
## individual mass score
## <chr> <dbl> <dbl>
## 1 ind001 22 94
## 2 ind002 29 79
## 3 ind003 22 88
## 4 ind004 29 86
## 5 ind005 24 28
## 6 ind006 27 22
## 7 ind007 22 32
## 8 ind008 20 36
## 9 ind009 25 76
## 10 ind010 24 74
## # ℹ 20 more rows
To replace a given value in your table assign a new value to this
place in your table with the use of an arrow (<-
). It
works just as with variables assignment -
e.g.table[row_number,column_number] <- new_value
Replace the 5th value in the
column sex
with the label ‘Unknown’. Call column
sex
and check whether it was indeed
replaced.
Expected result:
## [1] "M" "M" "M" "M" "Unknown" "M" "M"
## [8] "M" "M" "M" "M" "M" "M" "M"
## [15] "M" "F" "F" "F" "F" "F" "F"
## [22] "F" "F" "F" "F" "F" "F" "F"
## [29] "F" "F"
You can use classical mathematical operators: +
,
-
, *
, and /
. Remember, however,
that mathematical operations make sense only for integer or
numeric data types.
Recalculate and modify the
mass
column to convert it from kilograms to pounds (1
kilogram equals around 2.20 pounds). Display modified
column.
Expected result:
## [1] 48.4 63.8 48.4 63.8 52.8 59.4 48.4 44.0 55.0 52.8 61.6 55.0 44.0 50.6 63.8
## [16] 61.6 57.2 48.4 46.2 57.2 50.6 68.2 59.4 59.4 74.8 72.6 55.0 46.2 39.6 74.8
You can also use simple summary functions from the previous class (see Class 2).
Calculate the mean value of the
score
column.
Expected result:
## [1] 57.53333
• nrow()
- number of rows
• ncol()
- number of columns
• dim()
- dimensions of an object
Return the total number of cells
within my_data
.
Expected result:
## [1] 120
Missing data, as stated before, are represented as
NA
(not available) in R. Most of the functions will raise
an error every time the NA
is provided as the argument.
Choose one cell and replace its
value with NA
. Do not add quotation marks as
NA
is an internal R symbol (just as π
). Print
the whole row.
Rows with missing data can be removed with the na.omit()
function. To save changes, the result of the na.omit()
function must be saved as a variable. Note that, in practice,
deleting missing data must be well justified. Make sure they do not
provide any important information.
Check the number of rows of
my_data
. Remove the row with missing data. Replace variable
my_data
with the modified table. Remember that it cannot be
undone. Check if the number of rows has changed.
Adding a new column is relatively simple. All you need is a vector. However, remember three things:
We are going to add a column with the IDs of observations. Note that a column with IDs is often necessary in statistical analysis. It is also inherent to the data in long format, which is strongly advised.
Create a vector with ID numbers starting from 100. Use one of the functions introduced above to obtain the desired length of the vector. Call the vector ID.
Expected result:
## [1] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118
## [20] 119 120 121 122 123 124 125 126 127 128
You can combine data frames with the use of the cbind()
function in the following manner:
cbind(data_frame1,data_frame2)
. Note that a vector
can be seen as a data frame with only one column.
Add generated IDs as the first
column in my_data
dataframe. Overwrite the
my_data
variable.
data.frame()
function in
the following manner:
data.frame(columnName1 = value1, columnName2 = value2,…)
.Create a data frame
added_rows
, with one row and the following values:
130
, ind031
, F
,
55.7
, 77
. Columns’ names should correspond to
those of my_data
data frame. Call it.
Expected result:
## ID individual sex mass score
## 1 130 ind031 F 55.7 77
Curiosity To combine a data frame with more rows at the same time, replace the values for each column with the vectors.
To stick data frames by rows use the rbind()
function in
the following manner: rbind(data_frame1, data_frame2)
Add generated row at the end of
my_data
dataframe. Overwrite the my_data
variable. Print the last 6 rows of my_data
to check the
operation result.
Expected result:
## ID individual sex mass score
## 25 124 ind026 F 72.6 83
## 26 125 ind027 F 55.0 42
## 27 126 ind028 F 46.2 48
## 28 127 ind029 F 39.6 37
## 29 128 ind030 F 74.8 48
## 30 130 ind031 F 55.7 77
Sometimes we need to change the class of a column in
a dataframe. This can be done using as.[class]
functions:
as.numeric()
, as.character()
,
as.factor()
, etc.
Use the class() function to check the data type of the sex column in my_data. Then, convert this column into a factor and overwrite it. Finally, use class() again to confirm the conversion.
Expected result:
## [1] "character"
## [1] "factor"
To save your data frame into a file, use the
write.table()
function.
Save modified my_data
to the .csv file.
Prepare your homework in the form of
a script file (.R) and call it “class_3_homework_Your_name.R”.
Perform all exercises using the built-in ToothGrowth dataset. You can
view details about the dataset with ?ToothGrowth
. To begin,
assign it to a variable with my_data <- ToothGrowth
, and
then carry out all operations on my_data
as we did in
today’s class.
Include all subsequent steps in a script file.
1. Check the class of
each column in the my_data
data frame. Find the shortest
tooth length.
2. Add a new row at the end of my_data
.
The values in a new row should be as follows: 26.2, OJ, 2.0. Write the
modified my_data
dataset as the
my_data_modified
variable. The my_data
should
still contain the original dataset.
3. Add column with IDs - “obs_1”
to “obs_n” (“n” is the number of the last observation)- for each
observation as the first column of my_data
. Overwrite
my_data_modified
variable. Tip: to create a vector with
IDs, you can use paste()
to connect “obs_” with the numbers
(1-n) instead of typing each ID manually.
4. Modify
my_data
by removing its first 4 rows. Overwrite the
my_data_modified
variable.
5. Save a modified dataset
to a “class_3_toothgrowth_modified_Your_Name.csv” file.
Upload both your R script and .csv to the “Class 3” tab on the Pegaz platform.