Update: January 16, 2018. Updated the post to include the data from FSA and FSAdata packages.

In our work, presenting the status of fish stocks are very important. It can help the local fishers as well as the Local Government Units in crafting an ordinance or measures to manage the fish stocks in their respective jurisdiction. The data cannot tell the real status unless it has a form - a graph or chart.

One of the graphs produced by my colleagues are based on the length frequency distribution data. They used Microsoft Excel in making the graph, and manually draw a rectangles inside the plot to differentiate the lengths of immature, mature, and mega-spawner of a single species.

Histogram produced in Excel

Above is an example of the said plot, but it is stacked according to the fishing gears that caught that particular species (not shown).

Honestly, I find it tiring especially in the context of reproducibility. In addition, if there are some editions in the raw data, they have to do a series of pivoting and manually producing the graph.

So I try to recreate the said graph, with a little modifications, using R and the ggplot2 package.

In this tutorial, I wanted to produce a histogram of length frequency by using the ggplot2 package in R. If you are new to ggplot2, there are many free online resources you can read: ggplot2 (the official website of the package), and this one from STHDA.

Preliminary Steps

To follow this tutorial, first install the tidyverse package - a suite of R packages developed by Hadley Wickham. You can view the official documentation here and here.

You can install it by running the code inside the R terminal/console:

install.packages("ggplot2")
install.packages("dplyr")
install.packages("magrittr")

Lastly, you may also install ggthemes needed to tweak the appearance of your graph(s).

install.packages("ggthemes")

Once installed, you can load it by typing:

library(ggplot2)
library(dplyr)
library(magrittr)

The Data

I used the CiscoTL data from the FSAdata and its meta-documentation can be found here. I save the data into a CSV file and you can download it from here.

head(cisco_data)
##   X lakeid year4 sampledate gearid spname length weight sex
## 1 1     TR  1981  8/11/1981 VGN032  CISCO    140   21.4   F
## 2 2     TR  1981  8/10/1981 VGN032  CISCO    146   22.3   F
## 3 3     TR  1981  8/11/1981 VGN032  CISCO    147   23.3   F
## 4 4     TR  1981  8/19/1981 VGN032  CISCO    153   23.5   F
## 5 5     TR  1981  8/19/1981 VGN032  CISCO    150   24.0   F
## 6 6     TR  1981  8/19/1981 VGN032  CISCO    152   24.0   F

The Process

Histogram is a type of graphical method that is used to display the distribution of your data. This graph is a close relative of bar chart, but this is primarily used if your data is continuous, such as length measurements. This graph relies on bins, a range of measurement values consisting of upper and lower limits. To find the appropriate bins for your data, you must first find the class size and class interval.

You can compute for the class interval by using the formula:

Class Interval = (Range / Class Size)

First find the range of your data by getting the maximum value and subtracting it with the minimum value. In our data, the range can be computed as:

cisco_data_range <- max(cisco_data$length) - min(cisco_data$length)

The range is 322. Then we must specify the class size. According to several articles, there is no hard and fast rule in selecting the number of class size. This is up to the researcher, but it must be enough to show the distribution of your data.

class_size <- 20

Say, for example, you settled in a class size of 20, then finding the class interval is simply dividing the range with the class size.

class_interval <- cisco_data_range / class_size

The next step is to find the lower and the upper limits of the bins.

lower_limit <- seq(min(cisco_data$length), max(cisco_data$length), by = class_interval)

The code above simply made a sequence of numbers beginning from the minimum value up to the maximum value with an interval of 16.1, which is the class interval. To find the upper limit of the bin, we simply add the lower limit to the class interval, and subtract 0.1.

upper_limit <- (lower_limit + class_interval) - 0.1

To apply function to the values, we must first convert the vectors to a data frame:

df_bin <- as.data.frame(cbind(lower_limit, upper_limit))

Now that it is converted into a data frame, we can now compute for the midlengths of the class bins. We can make a new column containing the midlength by using the mutate function in dplyr package.

df_bin <- df_bin %>% 
  mutate(midlength = (lower_limit + upper_limit) / 2)

Placing the limits of the class intervals midway between two numbers (e.g., 89.1) ensures that every score will fall in an interval rather than on the boundary between intervals.

Take note that we used a class size of 20 in our computation, but, if you didn’t noticed, the number of class size generated was actually 21. Take note of this. You can check it by:

nrow(df_bin)
## [1] 21

Now that we calculated the needed values, we now have to find out the needed values specific for this species. Generally, when presenting the length frequency distribution in the form of histogram, my colleagues added a vertical line representing the length-at-first maturity (Lm) of the species. The value for Lm can be accessed at FishBase.

Navigate to the said website and search for “Coregonus artedii”, and we can find the value of length-at-first maturity as 17.1 cm.

Length-at-first maturity data at FishBase for Cisco

cisco_lm_cm <- 17.1

Reviewing the documentation of the data, the lengths are measured in millimeters, so we need to convert the said value.

cisco_lm_mm <- cisco_lm_cm * 10

In addition to the Lm line, another vertical line is added to the graph, representing the starting length of the so-called mega-spawner. I asked my colleagues on how to compute this, and this can be done by multiplying the maximum recorded length for that species by 0.7. The reference I am yet to find out.

cisco_mega <- (max(cisco_data$length)) * 0.7

Okay, the values are now calculated and ready. It is now the time to make the graph.

First, make a variable containing the title of our graph:

my_title <- expression(paste("Length frequency distribution of Cisco (", italic("Coregonus artedi"), ")")) 

Custom theme

I found a custom ggplot2 theme online, located here. I made a little modification on his custom theme to suit my needs.

theme_pub <- function(base_size = 14, base_family = "Liberation Serif") {
  library(grid)
  library(ggthemes)
  (theme_foundation(base_size = base_size, base_family = base_family)
    + theme(plot.title = element_text(face = "bold",
                                      size = rel(1.2),
                                      margin = margin(b = 0)),
            text = element_text(),
            panel.background = element_rect(colour = NA),
            plot.background = element_rect(colour = NA),
            plot.subtitle = element_text(margin = margin(t = 5, b = 10)),
            panel.border = element_rect(colour = NA),
            axis.title = element_text(face = "bold", size = rel(1)),
            axis.title.y = element_text(angle = 90, vjust = 2),
            axis.title.x = element_text(vjust = -0.2),
            axis.text.y = element_text(),
            axis.text.x = element_text(size = 10, angle = 25), 
            axis.line = element_line(colour = "black"),
            axis.ticks = element_line(),
            panel.grid.major = element_line(colour = "#f0f0f0"),
            panel.grid.minor = element_blank(),
            legend.key = element_rect(colour = NA),
            legend.position = "bottom",
            legend.direction = "horizontal",
            legend.key.size = unit(0.2, "cm"),
            legend.title = element_text(face = "italic"),
            legend.text = element_text(size = rel(1)),
            plot.margin = unit(c(10,5,5,5), "mm"),
            strip.background = element_rect(colour = "#f0f0f0", fill = "#f0f0f0"),
            strip.text = element_text(face = "bold")
    ))
  
}

You can change the value for the base_family to Times New Roman or any font you like.

You can save it as a separate R scripts, example, custom-theme.R, and in your document, you can source it by:

source("scripts/custom-theme.R")

Plotting

We will do the graph piece by piece. First, let’s see what the basic histogram look like in ggplot2.

plot1 <- ggplot(data = cisco_data, aes(x = length)) +
  geom_histogram()

print(plot1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Basic histogram in ggplot2

The basic histogram is using the default bins, which is set to 30, as you can see in the message after you run print(plot1).

To use our computed value, we must assigned that value to the binwidth option in geom_histogram.

plot2 <- ggplot(data = cisco_data, aes(x = length)) +
  geom_histogram(binwidth = class_interval)

print(plot2)

Histogram with a binwidth = 16.1

Another way to make the histogram is to use the bins option instead of binwidth, but take note that the value in the said option must be the same as the value of your actual class size, which in our case, is 21.

plot3 <- ggplot(data = cisco_data, aes(x = length)) +
  geom_histogram(bins = class_size + 1)

print(plot3)

Histogram with bins = 21

I added 1 to the class_size variable to make it 21. As you can see, the generated plots are the same. If you plot it using just the class_size variable in the bins option, the generated plot is different. Try it to see.

Also, take note that the numbers in the x-axis ranges from 100 to 400, with an interval of 100s. If you want to use the midlengths as the numbers in the x-axis, we can use the breaks option. This must be supplied to the argument scale_x_continuous.

plot4 <- ggplot(data = cisco_data, aes(x = length)) +
  geom_histogram(binwidth = class_interval) +
  scale_x_continuous(breaks = df_bin$midlength)

print(plot4)

Modifying the x-axis in histogram

Now that we have the code for our base histogram, we can now tweak it to suit our needs. First, we will change the color of our graph. We will just add fill argument inside the geom_histogram. We can supply this with a color name or its HEX value. Likewise, if we want to change the color of the boundary of each bar, we can add color argument.

plot5 <- ggplot(data = cisco_data, aes(x = length)) +
  geom_histogram(binwidth = class_interval, fill = "gray", color = "black") +
  scale_x_continuous(breaks = df_bin$midlength)

print(plot5)

Tweaking the color of the graph

Now we will add the title we made and modify the axis labels. In this part, we will reuse the codes of plot5 so that we will not re-type it again and again.

plot6 <- plot5 +
  labs(title = my_title,
        subtitle = "The samples were taken from Trout Lake, WI, 1981-2006.",
        x = "Midlength (mm)", 
        y = "Frequency",
        caption = "(Based on the data from fishR website by Derek Ogle)")

print(plot6)

Adding titles and labels in the graph

The labs function is self-explanatory. It holds the title and the axis labels. In addition, you can add a caption below the graph using the caption argument.

Starting this part, we will reuse the codes of the previous plots to generate the final histogram. Now, we will add a vertical line indicating the location of the length-at-first maturity of the species.

plot7 <- plot6 +
  geom_vline(aes(xintercept = cisco_lm_mm), color = "red",
            linetype = "dashed", size = 0.5)

print(plot7)

Adding a vertical line

Add an arrow indicating that the said line is where the lenght-at-first maturity at. Since the red line is at 171 mm, the pointed part of the arrow must be at 172 mm (xend). The tail part of the arrow (x) must extend from xend and to its right (you can specify anywhere the tail ends). y and yend must be the same if you want an straight arrow.

plot8 <- plot7 +
  geom_segment(aes(x = 220, y = 2100, xend = 172, yend = 2100),
              arrow = arrow(length = unit(0.2, "cm")))

print(plot8)

Adding an arrow

Add a label next to the arrow.

plot9 <- plot8 +
  annotate("text", x = 300, y = 2100, label = "Length-at-first maturity (171 mm)")

print(plot9)

Adding text inside the graph

Add a rectangle to enclosed the lengths that are below length-at-first maturity. xmin must be set at -Inf to cover the whole area to the left of the red vertical line. xmax should contain a value just below the length-at-first maturity. ymax must be set to Inf to cover the height of the chart since we do not know the actual value of the maximum value in the y-axis since it is automatically computed by geom_histogram.

plot10 <- plot9 +
  annotate("rect", xmin = -Inf, xmax = 170, ymin = 0, ymax = Inf,
            alpha = 0.2)

print(plot10)

Adding rectangle inside the graph

Add text inside the rectangle area indicating that the lengths are immature or juveniles. You will need to re-adjust the values in the x and y options.

plot11 <- plot10 +
  annotate("text", x = 97, y = 1700, label = "Juveniles/\nImmature") 

print(plot11)

Adding text inside the rectangle area

The \n is a code to make a break in your text. Add text indicating that the lengths after the red line are mature.

plot12 <- plot11 +
  annotate("text", x = 230, y = 1700, label = "Mature")

print(plot12)

Adding another text inside the graph

Add another rectangle to indicate that the lengths beginning at 276.5 mm are mega spawners. The value of xmax and ymax must be set to Inf.

plot13 <- plot12 +
  annotate("rect", xmin = 276.5, xmax = Inf, ymin = 0, ymax = Inf,
            alpha = 0.2)

print(plot13)

Adding another rectangle inside the graph

Add a text inside the rectangle indicating that the said lengths are mega spawners.

plot14 <- plot13 +
  annotate("text", x = 350, y = 1700, label = "Mega Spawner")

print(plot14)

Adding another text inside the rectangle area

Lastly, apply the custom theme to the graph. The plot below is the final histogram.

plot15 <- plot14 +
  theme_pub()

print(plot15)

Applying the custom theme

Now, we can save the final graph as a .tif picture.

tiff("histogram-cisco.tif", 
      res = 600, compression = "lzw", height = 5, width = 9, units = "in")
print(plot15)
invisible(dev.off())

Complete code

histogram_cisco <- ggplot(data = cisco_data, aes(x = length)) +
  geom_histogram(binwidth = class_interval, fill = "gray", color = "black") +
  scale_x_continuous(breaks = df_bin$midlength) +
  labs(title = my_title,
        subtitle = "The samples were taken from Trout Lake, WI, 1981-2006.",
        x = "Midlength (mm)", 
        y = "Frequency",
        caption = "(Based on the data from fishR website by Derek Ogle)") +
  geom_vline(aes(xintercept = cisco_lm_mm), color = "red",
              linetype = "dashed", size = 0.5) +
  geom_segment(aes(x = 220, y = 2100, xend = 172, yend = 2100),
              arrow = arrow(length = unit(0.2, "cm"))) +
  annotate("text", x = 300, y = 2100, label = "Length-at-first maturity (171 mm)") +
  annotate("rect", xmin = -Inf, xmax = 170, ymin = 0, ymax = Inf,
            alpha = 0.2) +
  annotate("text", x = 97, y = 1700, label = "Juveniles/\nImmature") +
  annotate("text", x = 230, y = 1700, label = "Mature") +
  annotate("rect", xmin = 276.5, xmax = Inf, ymin = 0, ymax = Inf,
            alpha = 0.2) +
  annotate("text", x = 350, y = 1700, label = "Mega Spawner") +
  theme_pub()

Hope that you enjoy following the tutorial.

Note: Take note that you have to re-adjust and re-run the codes several times to produced your desired graph. This is also true to the custom theme.