Data visualization is a vital tool for statistics
and research. Plots, graphs, and other visual representations of data
not only make complex datasets more accessible but also
reveal patterns, trends, and outliers we might not
notice otherwise. Base R provides all necessary functions for creating
quick plots that you can use for your own information, e.g. in accessing
the distribution of data. To make plots more
aesthetically pleasing, readable, and publication ready, we will use
ggplot2
package.
During this class, we will be working on the internal dataset
iris
, which describes various dimensions of flowers. You
can access the description of the dataset by typing a question mark
before its name. Here is a botanical visual of the flower structure.
Assign the internal data set
iris
to an object called my_data
. View the
dataset and create its short summary.
Expected result (first 10 rows):
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
The two most common types of plots are scatterplots and histograms.
Histograms are usually used to demonstrate differences between groups or their central tendencies (e.g. means).
We’ll start with a scatterplot.
The basic R function to create any kind of plot is, simply,
plot()
. To set it up, provide arguments for x axis, and y
axis: plot(x = x_axis_data, y = y_axis_data)
, where
x
and y
specify variables that define x and y
axes values respectively.
Create a scatterplot using
my_data
data frame. Provide the Sepal.length
column as x, and Sepal.width
as y arguments of the
plot()
function.
Expected result:
Alternatively, the same command can be given using the
tilde ~
sign. In coding language, this can
be read as “depends on”, and is used not only in plots, but also in
e.g. statistical functions.
Create a plot using the same
function but writing the arguments down as Sepal.width
depending on Sepal.length
. Provide the name of the data
frame as data argument. Remember that, on a standard plot, the y axis
shows the dependent variable, and x shows the independent
variable.
The only difference in the results should be the displayed titles for the axes.
The basic plot()
function allows us to somewhat
customize the plots. For example, we can change the name of the axes, as
well as the title of the plot.
Copy and modify the code from
exercise 3. Add axis titles using the arguments xlab
and
ylab
. Add the title of the plot using the main
argument.
Remember that, for any function, you can always access its manual by typing its name with a question mark at the beginning into the R console.
Curiosity You can use the
par()
function to further customize the plot. Inspecting the manual will show you the various options you have.
What if we want to add grouping to the scatterplot? We can add another argument to the function too, for example, color the individual observations according to group.
Modify the plot function to color
the observations by Species
. Do this by adding a
col
argument and setting the Species
column as
its value.
Expected result:
Curiosity The basic plot function allows further modification, such as, e.g., addition of a legend to the plot. We will not focus on this as making the plots more readable is easier in
ggplot2
.
Now let’s try to make a histogram.
The base R function for making a histogram is hist()
.
The only argument you need to provide is the column of the data you want
to work on. The group of the observations will be shown on the x axis,
and the number of observations in that group will be shown on the y
axis.
Create a histogram plot of the
sepal length using my_data
.
Expected outcome:
You can customize this plot using the same arguments as in the
plot()
function to add axes labels and a title.
Additionally, we can adjust some other things about this plot, for
example changing the grouping on the x axis using the
breaks
argument. The breaks
argument can take
a single number (which will correspond to the number of bins) or a
numeric vector (which will correspond to the boundaries of the
bins).
Update the histogram code by adding axes names, title, and setting it to display 4 bins.
From now on, we are going to use the ggplot2
package.
Install and load the package:
install.packages("ggplot2")
library(ggplot2)
The syntax to generate a plot includes three groups of functions:
ggplot()
- starts a new
plotgeom_point()
- displays
the datatheme()
- changes
the appearance of a plotYou can use many functions that will contribute to a given plot but
each of them should be connected by the plus symbol (+
)
e.g. ggplot(…) + geom_point() + geom_line() + theme(…)
.
Both the ggplot()
and the body functions share two
crucial arguments:
data
- data frame based on which plotting should be
performedmapping
- the value needs to be in a form of
aes()
(aesthetic) function with following arguments:
x
- name of the column with values for the x
(horizontal) axis (without quotation marks)y
- name of the column with values for the y (vertical)
axis (without quotation marks)Note that the arguments passed to the ggplot()
function
are “parental”. It means they would be used if no arguments are provided
for the body functions. It is called inheritance in the programming
(because every subsequent function in the line “inherits” the arguments
of the parent function).
To create the plotting area we supply the name of the dataset and map two variables to x and y aesthetics.
ggplot(data = iris, mapping = aes(x = Petal.Length, y = Petal.Width))
We see the plot, with both axes scaled according to the ranges of the
two variables to be plotted, but there’s nothing else on the plot. It’s
because we have to add the visual layer(s) representing the data, which
are called geoms. Not every geom is appropriate for
displaying every type of data, you can find the complete list of geoms
here.
For scatterplot we’ll use geom_point()
. Also, we can omit
data =
and mapping =
when we keep the order of
the arguments.
You can also move mapping
from ggplot()
to
the body - here, to geom_point()
. Both options will give
the same result.
ggplot(data = iris) + geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width))
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
Now let’s mark each species with a distinct color. It’s very simple
and natural in ggplot2()
- we just add color
aesthetics and map the variable Species
to it. The results
of both formulas below will be the same.
ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + geom_point()
ggplot(data = iris) + geom_point(mapping = aes(x = Sepal.Length, y = Sepal.Width, color = Species))
Create a scatterplot presenting
the relationship between sepals’ length and width. Use
geom_point()
function.
Expected result:
Now, let’s make it pretty!
theme…()
function. For example, it is often useful to apply
theme_bw()
as it removes unnecessary
background. To check all built-in possibilities follow the link. Be
aware that the choice of any specific theme…()
function
reduces adjustment flexibility. The other option is to use plain
theme()
function and adjust all characteristics on your
own.labs()
function with
arguments: title
for the title of plot, x
for
x-axis label and y
for y-axis label.Modify previous scatterplot by
setting theme_bw()
function. Also, set a main title and
axis labels according to the plot below.
Tip: By default plot title is left-aligned. To center the title
you can use the theme() function:
theme(plot.title = element_text(hjust = 0.5))
. The
hjust
argument controls horizontal justification, with 0.5
centering the title.
Expected result:
As you have seen in the dataset description, the observations were collected for the three species of irises. To display them in different colors add color argument to aesthetics and set it as the name of column with species description for each observation. Notice that the legend will be automatically added.
Modify previous scatterplot by displaying the observations from different species with distinct colors.
Expected result:
Curiosity
There is an entire family of the scale…() functions that personalize the way observations are displayed e.g.:
1.scale_x_discrete(
) (for groups),scale_x_continuous()
(for gradient) - modify x axis functionalities (e.g. its range)
2.scale_y_discrete()
(for groups),scale_y_continuous()
(for gradient) - modify y axis functionalities (e.g. its range)
3.scale_color_manual()
- to set personalized colors and legend labels
Points alone are often not enough to see a trend. To get a better
idea we need a regression line. You can draw it by
adding geom_smooth()
function to the plot’s body. Remember
about required arguments (either assigned or inherited). Also, to obtain
the linear regression, set geom_smooth()
method
to "lm"
.
Draw a linear regression line for the scatterplot from exercise 10.
Expected result:
Regression line in blue, gray area represents confidence intervals
generated by default.
It seems that there is no relationship when you
consider all points together but what about
relationships within species? Do it by adding
color
argument to the geom_smooth()
aesthetics
and setting it as the name of column with species description for each
observation.
Modify a scatterplot from exercise 11 to obtain separate regression lines for each species.
Expected result:
Curiosity Using
group
argument instead of color will also generate separate lines but all will be of the same color.
Until now, we were working on plots with continuous values on
x and y axes. What if we want to plot sepal width alone over
all species? As species are on discrete scale we should
consider a boxplot. Generate it in the similar way with
the use of geom_boxplot()
function. Note, that
discrete variables can be of character, logical or factor
type.
If you have named groups it is highly recommended to encode them as
factors, because factor levels can be ordered manually,
which allows you to control the order in which groups appear on the
plot.
Generate a boxplot representing the sepal’s length across different iris species.
Expected result:
An advice: If you do not know how to interpret the boxplot, follow the link.
To set a common color for all boxes, use fill
and
color
argument of geom_boxplot()
function and
provide it with a color name. It would set the color of filling and
contours respectively. To color the boxes or its contours according to
column’ values e.g., Species
you have to place
fill
or color
argument in
aes()
.
Modify plot from exercise 13. by
setting the color of boxes’ contours according to
Species
.
If you have tens, or perhaps a hundred data points, it may be useful
to show them all together with boxplot. To avoid points overlapping each
other, use geom_jitter()
.
Modify plot from exercise 14. by
adding data points. Use geom_jitter()
function
Another commonly used type of a plot is histogram.
It depicts the number of observations within given ranges of
values. Generate it with the geom_histogram()
function. Specify the aesthetics for x-axis only as y-values are
computed by R.
To set a common color for all contours and bars, use fill
and color
argument of geom_histogram()
function and provide it with a color name.
Generate a histogram of sepal length. Set bars and contours color.
Expected result:
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
It looks like there are too many bins (the default is 30), let’s reduce their number to 10. It’d also be nice to to show each species with different colour.
Generate a histogram of sepal
length. Set bars’ color according to Species
. Add
bins = 10
argument to geom_histogram()
function to decrease the number of bins.
Expected result:
Still not great. The bars from different species overlap each other.
To show the bars for each species next to each other we adjust their
position with position = "dodge"
in
geom_histogram()
function.
Modify histogram from exercise 17 to show the bars for each species next to each other.
Expected result:
You can also think about histograms as calculating the
frequency rather than the number of observations. To do this
set the y
argument of aesthetics to
(after_stat(count))/(sum(after_stat(count)))
. Note that
after_stat(count))
is a function that allows you to access
computed statistics while creating a plot.
Generate a histogram of sepal length with the relative frequencies on the y axis.
Expected result:
Note that the histograms looks almost identical but the y-axis values were changed.
Using histograms, you can also generate a density
distribution. Think about it as the graphical visualization of
relative probabilities that a randomly sampled
observations come from a given interval. Note that sampling
based on density function would generate the histogram similar to the
one from the real-data. It can be used to roughly assess the
normality of data. Do it by changing y argument of
geom_histogram()
function to
after_stat(density)
. Additionally, use
geom_density()
function. Remember about aesthetics’
argument x
and note you can use alpha
argument
to set distribution transparency.
Modify the histogram from exercise
16th to display density on y axis. Add also density distribution using
geom_density()
.
Expected result:
Modify the histogram above and
display density distributions for each species separately on a single
plot. Take advantage of color
argument inside
aesthetics.
Expected result:
When R generates plots based on our commands, they are treated like any other object – meaning that they are not automatically stored in the app memory unless we save the code that generates them or export the images out of R studio. There are 2 main ways to do this:
If you only need to save one image, both options are equally convenient. If, however, you are generating multiple plots per script, it can be faster to include code that saves them after every line that generates the image.
In base R, to save a plot you need first initialize an appropriate graphic device with a file name, than use plotting commands, and finally close the device. It looks like this:
png("name.png")
plot(arguments)
dev.off()
Other than .png, you can save it in .jpg, .bmp, and .tiff formats, using graphic devices. Remember that by default, R will save the image to your working directory.
In ggplot
, we can use the function
ggsave()
. By default, it saves the last plot
generated in the code. It takes the name of a file to be
generated (with a format extension) as the main argument, and it can be
adjusted using many other arguments. For example:
ggsave("scatterplot_1.pdf", width = 20, height = 20, units = "cm")
Save one of the plots we generated today using your preferred method.
Use the built in
mtcars
data set for all exercises. As a reminder, you can
pull up a description of the inbuilt data set by typing a question mark
before its name into the R console.
1. Using base R, generate a
scatterplot to demonstrate a relationship between any 2 continuous
variables. Add a grouping by color
to the plot,
descriptions of axes, and a title.
2. Using base R, generate a
histogram for any appropriate variable. Set the number of bins on it to
5, add axes descriptions, and plot title. 3. Load the
ggplot2
library and use it for all subsequent tasks. Draw a
scatterplot of cars’ horsepower against weight. Include a regression
line to check the relationship.
4. Group values in the above
scatterplot based on the number of cylinders. Does it change the
relationship
5. Plot a relationship between the number of forward
gears and rate of combustion (miles/per gallon).
Save all the graphs with your preferred method. Upload your R script (including commands for saving graphs) on the Pegaz platform. Do not upload graph files itself.