**Update:** January 16, 2018. Updated the post to include the data from `FSA`

and `FSAdata`

packages.

In our work, presenting the status of fish stocks are very important. It can help the local fishers as well as the Local Government Units in crafting an ordinance or measures to manage the fish stocks in their respective jurisdiction. The data cannot tell the real status unless it has a form - a graph or chart.

One of the graphs produced by my colleagues are based on the length frequency distribution data. They used Microsoft Excel in making the graph, and manually draw a rectangles inside the plot to differentiate the lengths of immature, mature, and mega-spawner of a single species.

*Histogram produced in Excel*

Above is an example of the said plot, but it is stacked according to the fishing gears that caught that particular species (not shown).

Honestly, I find it tiring especially in the context of reproducibility. In addition, if there are some editions in the raw data, they have to do a series of pivoting and manually producing the graph.

So I try to recreate the said graph, with a little modifications, using `R`

and the `ggplot2`

package.

In this tutorial, I wanted to produce a histogram of length frequency by using the `ggplot2`

package in R. If you are new to `ggplot2`

, there are many free online resources you can read: ggplot2 (the official website of the package), and this one from STHDA.

## Preliminary Steps

To follow this tutorial, first install the `tidyverse`

package - a suite of R packages developed by Hadley Wickham. You can view the official documentation here and here.

You can install it by running the code inside the R terminal/console:

```
install.packages("ggplot2")
install.packages("dplyr")
install.packages("magrittr")
```

Lastly, you may also install `ggthemes`

needed to tweak the appearance of your graph(s).

`install.packages("ggthemes")`

Once installed, you can load it by typing:

```
library(ggplot2)
library(dplyr)
library(magrittr)
```

## The Data

I used the *CiscoTL* data from the `FSAdata`

and its meta-documentation can be found here. I save the data into a CSV file and you can download it from here.

`head(cisco_data)`

```
## X lakeid year4 sampledate gearid spname length weight sex
## 1 1 TR 1981 8/11/1981 VGN032 CISCO 140 21.4 F
## 2 2 TR 1981 8/10/1981 VGN032 CISCO 146 22.3 F
## 3 3 TR 1981 8/11/1981 VGN032 CISCO 147 23.3 F
## 4 4 TR 1981 8/19/1981 VGN032 CISCO 153 23.5 F
## 5 5 TR 1981 8/19/1981 VGN032 CISCO 150 24.0 F
## 6 6 TR 1981 8/19/1981 VGN032 CISCO 152 24.0 F
```

## The Process

Histogram is a type of graphical method that is used to display the distribution of your data. This graph is a close relative of bar chart, but this is primarily used if your data is continuous, such as length measurements. This graph relies on bins, a range of measurement values consisting of upper and lower limits. To find the appropriate bins for your data, you must first find the class size and class interval.

You can compute for the class interval by using the formula:

Class Interval = (Range / Class Size)

First find the range of your data by getting the maximum value and subtracting it with the minimum value. In our data, the range can be computed as:

`cisco_data_range <- max(cisco_data$length) - min(cisco_data$length)`

The range is 322. Then we must specify the class size. According to several articles, there is no hard and fast rule in selecting the number of class size. This is up to the researcher, but it must be enough to show the distribution of your data.

`class_size <- 20`

Say, for example, you settled in a class size of 20, then finding the class interval is simply dividing the range with the class size.

`class_interval <- cisco_data_range / class_size`

The next step is to find the lower and the upper limits of the bins.

`lower_limit <- seq(min(cisco_data$length), max(cisco_data$length), by = class_interval)`

The code above simply made a sequence of numbers beginning from the minimum value up to the maximum value with an interval of 16.1, which is the class interval. To find the upper limit of the bin, we simply add the lower limit to the class interval, and subtract 0.1.

`upper_limit <- (lower_limit + class_interval) - 0.1`

To apply function to the values, we must first convert the vectors to a data frame:

`df_bin <- as.data.frame(cbind(lower_limit, upper_limit))`

Now that it is converted into a data frame, we can now compute for the midlengths of the class bins. We can make a new column containing the midlength by using the `mutate`

function in `dplyr`

package.

```
df_bin <- df_bin %>%
mutate(midlength = (lower_limit + upper_limit) / 2)
```

Placing the limits of the class intervals midway between two numbers (e.g., 89.1) ensures that every score will fall in an interval rather than on the boundary between intervals.

Take note that we used a class size of 20 in our computation, but, if you didn’t noticed, the number of class size generated was actually 21. Take note of this. You can check it by:

`nrow(df_bin)`

`## [1] 21`

Now that we calculated the needed values, we now have to find out the needed values specific for this species. Generally, when presenting the length frequency distribution in the form of histogram, my colleagues added a vertical line representing the length-at-first maturity (L_{m}) of the species. The value for L_{m} can be accessed at FishBase.

Navigate to the said website and search for “*Coregonus artedii*”, and we can find the value of length-at-first maturity as 17.1 cm.

*Length-at-first maturity data at FishBase for Cisco*

`cisco_lm_cm <- 17.1`

Reviewing the documentation of the data, the lengths are measured in millimeters, so we need to convert the said value.

`cisco_lm_mm <- cisco_lm_cm * 10`

In addition to the L_{m} line, another vertical line is added to the graph, representing the starting length of the so-called mega-spawner. I asked my colleagues on how to compute this, and this can be done by multiplying the maximum recorded length for that species by 0.7. The reference I am yet to find out.

`cisco_mega <- (max(cisco_data$length)) * 0.7`

Okay, the values are now calculated and ready. It is now the time to make the graph.

First, make a variable containing the title of our graph:

`my_title <- expression(paste("Length frequency distribution of Cisco (", italic("Coregonus artedi"), ")")) `

## Custom theme

I found a custom ggplot2 theme online, located here. I made a little modification on his custom theme to suit my needs.

```
theme_pub <- function(base_size = 14, base_family = "Liberation Serif") {
library(grid)
library(ggthemes)
(theme_foundation(base_size = base_size, base_family = base_family)
+ theme(plot.title = element_text(face = "bold",
size = rel(1.2),
margin = margin(b = 0)),
text = element_text(),
panel.background = element_rect(colour = NA),
plot.background = element_rect(colour = NA),
plot.subtitle = element_text(margin = margin(t = 5, b = 10)),
panel.border = element_rect(colour = NA),
axis.title = element_text(face = "bold", size = rel(1)),
axis.title.y = element_text(angle = 90, vjust = 2),
axis.title.x = element_text(vjust = -0.2),
axis.text.y = element_text(),
axis.text.x = element_text(size = 10, angle = 25),
axis.line = element_line(colour = "black"),
axis.ticks = element_line(),
panel.grid.major = element_line(colour = "#f0f0f0"),
panel.grid.minor = element_blank(),
legend.key = element_rect(colour = NA),
legend.position = "bottom",
legend.direction = "horizontal",
legend.key.size = unit(0.2, "cm"),
legend.title = element_text(face = "italic"),
legend.text = element_text(size = rel(1)),
plot.margin = unit(c(10,5,5,5), "mm"),
strip.background = element_rect(colour = "#f0f0f0", fill = "#f0f0f0"),
strip.text = element_text(face = "bold")
))
}
```

You can change the value for the `base_family`

to Times New Roman or any font you like.

You can save it as a separate R scripts, example, `custom-theme.R`

, and in your document, you can source it by:

`source("scripts/custom-theme.R")`

## Plotting

We will do the graph piece by piece. First, let’s see what the basic histogram look like in `ggplot2`

.

```
plot1 <- ggplot(data = cisco_data, aes(x = length)) +
geom_histogram()
print(plot1)
```

`## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.`

*Basic histogram in ggplot2*

The basic histogram is using the default bins, which is set to 30, as you can see in the message after you run `print(plot1)`

.

To use our computed value, we must assigned that value to the `binwidth`

option in `geom_histogram`

.

```
plot2 <- ggplot(data = cisco_data, aes(x = length)) +
geom_histogram(binwidth = class_interval)
print(plot2)
```

*Histogram with a binwidth = 16.1*

Another way to make the histogram is to use the `bins`

option instead of `binwidth`

, but take note that the value in the said option must be the same as the value of your actual class size, which in our case, is 21.

```
plot3 <- ggplot(data = cisco_data, aes(x = length)) +
geom_histogram(bins = class_size + 1)
print(plot3)
```

*Histogram with bins = 21*

I added 1 to the `class_size`

variable to make it 21. As you can see, the generated plots are the same. If you plot it using just the `class_size`

variable in the `bins`

option, the generated plot is different. Try it to see.

Also, take note that the numbers in the x-axis ranges from 100 to 400, with an interval of 100s. If you want to use the midlengths as the numbers in the x-axis, we can use the `breaks`

option. This must be supplied to the argument `scale_x_continuous`

.

```
plot4 <- ggplot(data = cisco_data, aes(x = length)) +
geom_histogram(binwidth = class_interval) +
scale_x_continuous(breaks = df_bin$midlength)
print(plot4)
```

*Modifying the x-axis in histogram*

Now that we have the code for our base histogram, we can now tweak it to suit our needs. First, we will change the color of our graph. We will just add `fill`

argument inside the `geom_histogram`

. We can supply this with a color name or its HEX value. Likewise, if we want to change the color of the boundary of each bar, we can add `color`

argument.

```
plot5 <- ggplot(data = cisco_data, aes(x = length)) +
geom_histogram(binwidth = class_interval, fill = "gray", color = "black") +
scale_x_continuous(breaks = df_bin$midlength)
print(plot5)
```

*Tweaking the color of the graph*

Now we will add the title we made and modify the axis labels. In this part, we will reuse the codes of `plot5`

so that we will not re-type it again and again.

```
plot6 <- plot5 +
labs(title = my_title,
subtitle = "The samples were taken from Trout Lake, WI, 1981-2006.",
x = "Midlength (mm)",
y = "Frequency",
caption = "(Based on the data from fishR website by Derek Ogle)")
print(plot6)
```

*Adding titles and labels in the graph*

The `labs`

function is self-explanatory. It holds the title and the axis labels. In addition, you can add a caption below the graph using the `caption`

argument.

Starting this part, we will reuse the codes of the previous plots to generate the final histogram. Now, we will add a vertical line indicating the location of the length-at-first maturity of the species.

```
plot7 <- plot6 +
geom_vline(aes(xintercept = cisco_lm_mm), color = "red",
linetype = "dashed", size = 0.5)
print(plot7)
```

*Adding a vertical line*

Add an arrow indicating that the said line is where the lenght-at-first maturity at. Since the red line is at 171 mm, the pointed part of the arrow must be at 172 mm (`xend`

). The tail part of the arrow (`x`

) must extend from `xend`

and to its right (you can specify anywhere the tail ends). `y`

and `yend`

must be the same if you want an straight arrow.

```
plot8 <- plot7 +
geom_segment(aes(x = 220, y = 2100, xend = 172, yend = 2100),
arrow = arrow(length = unit(0.2, "cm")))
print(plot8)
```

*Adding an arrow*

Add a label next to the arrow.

```
plot9 <- plot8 +
annotate("text", x = 300, y = 2100, label = "Length-at-first maturity (171 mm)")
print(plot9)
```

*Adding text inside the graph*

Add a rectangle to enclosed the lengths that are below length-at-first maturity. `xmin`

must be set at `-Inf`

to cover the whole area to the left of the red vertical line. `xmax`

should contain a value just below the length-at-first maturity. `ymax`

must be set to `Inf`

to cover the height of the chart since we do not know the actual value of the maximum value in the y-axis since it is automatically computed by `geom_histogram`

.

```
plot10 <- plot9 +
annotate("rect", xmin = -Inf, xmax = 170, ymin = 0, ymax = Inf,
alpha = 0.2)
print(plot10)
```

*Adding rectangle inside the graph*

Add text inside the rectangle area indicating that the lengths are immature or juveniles. You will need to re-adjust the values in the `x`

and `y`

options.

```
plot11 <- plot10 +
annotate("text", x = 97, y = 1700, label = "Juveniles/\nImmature")
print(plot11)
```

*Adding text inside the rectangle area*

The `\n`

is a code to make a break in your text. Add text indicating that the lengths after the red line are mature.

```
plot12 <- plot11 +
annotate("text", x = 230, y = 1700, label = "Mature")
print(plot12)
```

*Adding another text inside the graph*

Add another rectangle to indicate that the lengths beginning at 276.5 mm are mega spawners. The value of `xmax`

and `ymax`

must be set to `Inf`

.

```
plot13 <- plot12 +
annotate("rect", xmin = 276.5, xmax = Inf, ymin = 0, ymax = Inf,
alpha = 0.2)
print(plot13)
```

*Adding another rectangle inside the graph*

Add a text inside the rectangle indicating that the said lengths are mega spawners.

```
plot14 <- plot13 +
annotate("text", x = 350, y = 1700, label = "Mega Spawner")
print(plot14)
```

*Adding another text inside the rectangle area*

Lastly, apply the custom theme to the graph. The plot below is the final histogram.

```
plot15 <- plot14 +
theme_pub()
print(plot15)
```

*Applying the custom theme*

Now, we can save the final graph as a .tif picture.

```
tiff("histogram-cisco.tif",
res = 600, compression = "lzw", height = 5, width = 9, units = "in")
print(plot15)
invisible(dev.off())
```

## Complete code

```
histogram_cisco <- ggplot(data = cisco_data, aes(x = length)) +
geom_histogram(binwidth = class_interval, fill = "gray", color = "black") +
scale_x_continuous(breaks = df_bin$midlength) +
labs(title = my_title,
subtitle = "The samples were taken from Trout Lake, WI, 1981-2006.",
x = "Midlength (mm)",
y = "Frequency",
caption = "(Based on the data from fishR website by Derek Ogle)") +
geom_vline(aes(xintercept = cisco_lm_mm), color = "red",
linetype = "dashed", size = 0.5) +
geom_segment(aes(x = 220, y = 2100, xend = 172, yend = 2100),
arrow = arrow(length = unit(0.2, "cm"))) +
annotate("text", x = 300, y = 2100, label = "Length-at-first maturity (171 mm)") +
annotate("rect", xmin = -Inf, xmax = 170, ymin = 0, ymax = Inf,
alpha = 0.2) +
annotate("text", x = 97, y = 1700, label = "Juveniles/\nImmature") +
annotate("text", x = 230, y = 1700, label = "Mature") +
annotate("rect", xmin = 276.5, xmax = Inf, ymin = 0, ymax = Inf,
alpha = 0.2) +
annotate("text", x = 350, y = 1700, label = "Mega Spawner") +
theme_pub()
```

Hope that you enjoy following the tutorial.

*Note:* Take note that you have to re-adjust and re-run the codes several times to produced your desired graph. This is also true to the custom theme.