Project Motivation

In pursuit of a data set for the “Data Analysis & Visualisation” module, I quickly found that a lot of statistical databases are quite difficult to digest. In my time of need, I turned to the most trusted friend of the millennial, the avocado - a large botanical berry, also referred to as the avocado pear. Not only are avocados highly nutritious and delicious on toast, they also produce an abundance of rich data, perfect for creating visualisations.

The dataset I selected was among the few I really understood, and I wanted to add a degree of light-hearted novelty to my work in these times of doom and gloom. So here I present data visualisations of avocado prices, sales & distribution across the US between 2015 and 2018.

#path to image of avocado
imgpath1 <- '/Users/katie/Desktop/PGrad/Semester 2/Data analysis/avo.jpg'
#include graphics
knitr::include_graphics(imgpath1)

The Data

The data is sourced from a page on Kaggle and was originally gathered from the Hass Avocado Board website. It contains the average price, total sales and regional data of Hass avocados in the USA between 2015 and 2018. In total, the data has 14 variables and 18,242 data entries.

#Path to the data
avocado.csv <- '/Users/katie/Desktop/PGrad/Semester2/Dataanalysis/avocado.csv'
#Load the data from the path stored above
ogavo <- read.csv(avocado.csv, header = TRUE, sep = ",")
#Display the data
head(ogavo)

##   X       Date AveragePrice Total.Volume   X4046     X4225  X4770 Total.Bags
## 1 0 2015-12-27         1.33     64236.62 1036.74  54454.85  48.16    8696.87
## 2 1 2015-12-20         1.35     54876.98  674.28  44638.81  58.33    9505.56
## 3 2 2015-12-13         0.93    118220.22  794.70 109149.67 130.50    8145.35
## 4 3 2015-12-06         1.08     78992.15 1132.00  71976.41  72.58    5811.16
## 5 4 2015-11-29         1.28     51039.60  941.48  43838.39  75.78    6183.95
## 6 5 2015-11-22         1.26     55979.78 1184.27  48067.99  43.61    6683.91
##   Small.Bags Large.Bags XLarge.Bags         type year region
## 1    8603.62      93.25           0 conventional 2015 Albany
## 2    9408.07      97.49           0 conventional 2015 Albany
## 3    8042.21     103.14           0 conventional 2015 Albany
## 4    5677.40     133.76           0 conventional 2015 Albany
## 5    5986.26     197.69           0 conventional 2015 Albany
## 6    6556.47     127.44           0 conventional 2015 Albany

Due to the sheer volume of data, the analysis focused specifically on 6 variables. The variables selected represent the price, sales & (brief) regional data of avocados across 3 years, which are central to the research questions asked during this analysis.

The table below provides a description of the variables used - including variables present in the original data set and those which were created during the analysis, such as ‘monthabb.’

Variable	Explanation
Date	Date of sale
Average Price	Average price of a single avocado
Total Volume	Total number of avocados sold
Type	Either conventional (non-organic) or organic
Year	Year of sale (2015 - 2018)
Region	Either city, region or state of sale (USA)
Month	Month of sale in numerical format (January = 1)
Monthabb	Month of sale in abbreviated format (January = Jan)

Note: I excluded analysis of the year 2018, as the data set was limited to entries between January-March.

Research Questions

Aims of the data analysis & visualisation production:

In which month of the year are conventional/organic avocados the cheapest/most expensive? Does this vary across years?
Is there a relationship between avocado price and popularity? Does this vary between conventional and organic avocado?
Which are most popular - organic or conventional avocados?
Which regions of the US purchase the most avocados?

Data Preparation

To prepare the variables ‘Average Price’ & ‘Total Volume’ for plotting on a time series graph, I created variables which represent months in numerical and abbreviated formats, respectively.

#Changing the date column from factor to a date variable
ogavo$Date <- as.Date(ogavo$Date, "%Y-%m-%d")

#Ordering the rows from the earliest to the latest date
ogavodate <- ogavo[order(as.Date(ogavo$Date, format = "%Y-%m-%d")),]

#Adding the column "month", which represents the numeric value of the month, e.g: January = 1
ogavodate$month <- format(as.Date(ogavodate$Date), "%m")

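#Adding the column "monthabb" - which contains monthly abbreviations from the "month" column created, e.g: January/1 = Jan
#Stored as a factor with calendar-ordered levels, so grouped summaries sort Jan-Dec rather than alphabetically
ogavodate$monthabb <- factor(month.abb[as.numeric(ogavodate$month)], levels = month.abb)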
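
One thing worth checking at this step is how month abbreviations sort. The following is a minimal sketch on toy dates (not the avocado data; the names `dates`, `abb.chr` and `abb.fct` are hypothetical). Plain character abbreviations sort alphabetically, whereas a factor whose levels follow `month.abb` sorts in calendar order, which matters once values are grouped by month and passed to a time series.

```r
# Toy dates (illustrative only, not taken from the avocado data)
dates <- as.Date(c("2015-01-04", "2015-04-05", "2015-08-02", "2015-12-27"))

# Month as a number, then as an abbreviation
month.num <- as.integer(format(dates, "%m"))
abb.chr <- month.abb[month.num]                             # plain character vector
abb.fct <- factor(month.abb[month.num], levels = month.abb) # calendar-ordered factor

sort(abb.chr)                # alphabetical order: "Apr" "Aug" "Dec" "Jan"
as.character(sort(abb.fct))  # calendar order:     "Jan" "Apr" "Aug" "Dec"
```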

Avocado Prices, 2015 - 2017: Visualisation 1

I began my analysis by comparing how the price of organic and non-organic (conventional) avocados varies across the year, for each year available in the dataset (excluding 2018), in order to deduce exactly when the best time to buy avocados is and whether prices are improving or spiking.

#Load dplyr for the pipe (%>%), select, filter, group_by and summarise
library(dplyr)

#Create a variable which calculates the average price of a single conventional avocado, for each month, across each year in the dataset. The step is then repeated for organic avocados
convavo <- ogavodate %>% 
  select(type, year, monthabb, AveragePrice) %>% 
  filter(type == "conventional", year %in% c(2015, 2016, 2017)) %>% #%in% keeps every row from these years; == would recycle the vector and drop rows
  group_by(year, monthabb) %>% 
  summarise(avg=mean(AveragePrice))
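
When filtering on several years at once, the choice between `==` and `%in%` matters. Here is a small sketch on a toy vector (made-up values, not the avocado data) showing how the two behave: `==` compares position by position and recycles the shorter vector, while `%in%` tests each value for membership in the set.

```r
# Toy year column (illustrative only)
year <- c(2015, 2015, 2015, 2016, 2016, 2016, 2017, 2017, 2017)
wanted <- c(2015, 2016, 2017)

# `==` recycles `wanted` along `year`, so only positions that happen to line up match
n.equal <- sum(year == wanted)

# `%in%` asks, for every element of `year`, whether it appears anywhere in `wanted`
n.in <- sum(year %in% wanted)

n.equal  # 3 rows kept
n.in     # 9 rows kept
```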

#Data preparation - using the variable created of average conventional avocado prices to prepare the time series parameters: the year the data begins and a frequency of 12 (representing the number of months per year)
conavo.price <- ts(convavo$avg, start=2015, frequency = 12)

#Labelling parameters for the conventional avocado graph
ggstitle.c <- "Average Price of Conventional Avocados Per Year"
ylab <- "Average Price ($)"
"2015" <- 'seagreen'
"2016" <- 'yellowgreen'
"2017" <- 'chocolate4'

#Load the plotting packages: ggplot2 for the layers and forecast for 'ggseasonplot'
library(ggplot2)
library(forecast)

#Conventional(price) graph, 'ggseasonplot' used to display seasonal time series for separate years
#year.labels = TRUE labels each line with its year
line.conv.year <- ggseasonplot(conavo.price, year.labels = TRUE)
line.conv.year <- line.conv.year + 
  labs(title = ggstitle.c) + 
  theme_minimal() +
  ylab(ylab) + 
  scale_color_manual(values=c(`2015`, `2016`, `2017`))

#Saving the conventional avocado price graph
ggsave("convavoprice.png", plot = line.conv.year)

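#Format the graphs so they're arranged together for easy visualisation
#line.org.year is the organic graph, built by repeating the steps above for organic avocados
library(gridExtra)
price.grid <- grid.arrange(line.conv.year, line.org.year, nrow = 2)

#Saving the formatted graphs (the arranged grid is passed explicitly, as ggsave otherwise saves only the last ggplot)
ggsave("Price.avo.png", plot = price.grid)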

Aim 1 - In which month of the year are conventional/organic avocados the cheapest/most expensive? Has this changed over the years?

Conventional: The most affordable months for conventional avocado purchases are between March & May. This is relatively consistent in 2016 & 2017, with 2016 showing an additional drop in price in September. The year 2015 followed a less uniform pattern, with prices falling a month earlier than in later years but remaining relatively consistent across the year, in contrast to 2016 & 2017, where prices peaked between October & December. Therefore, purchases of conventional avocados during the winter months should be avoided at all costs, as the graph illustrates this is when prices tend to rise.

Organic: As expected, organic avocados cost significantly more than conventional throughout the year. From the graph, it is clear that 2017 displays the most variation in price. It yields both the cheapest and most expensive months of all years recorded - the cheapest being April, with a second but less pronounced dip in August, and the most expensive being February & December ($2.25 for a single avocado!?!). A similar pattern to conventional avocado prices is followed, with price troughs occurring around April/May and again in August/September, before a peak during the winter months.
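
The cheapest and most expensive months can also be read off a monthly series programmatically rather than by eye. This is a minimal sketch, assuming a monthly `ts` object like the one built for the graphs but filled with made-up numbers. The helper names `cheapest.month` and `priciest.month` are hypothetical and not part of the original analysis.

```r
# Made-up monthly average prices for two years (illustrative only, not the avocado data)
avg <- c(1.10, 1.05, 1.00, 0.98, 1.02, 1.08, 1.12, 1.15, 1.20, 1.25, 1.22, 1.18,
         1.12, 1.07, 1.01, 0.97, 1.00, 1.06, 1.10, 1.16, 1.21, 1.28, 1.24, 1.19)

# start = 2015 and frequency = 12 tell ts() that the first value is January 2015
price.ts <- ts(avg, start = 2015, frequency = 12)

# Hypothetical helpers: cut out one calendar year, then find the position of the min/max
cheapest.month <- function(series, yr) {
  one.year <- window(series, start = c(yr, 1), end = c(yr, 12))
  month.abb[which.min(one.year)]
}
priciest.month <- function(series, yr) {
  one.year <- window(series, start = c(yr, 1), end = c(yr, 12))
  month.abb[which.max(one.year)]
}

cheapest.month(price.ts, 2015)  # "Apr" for these made-up numbers
priciest.month(price.ts, 2016)  # "Oct" for these made-up numbers
```

Both helpers rely on the values being stored in calendar order (January first), since `ts()` assigns months purely by position.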

Avocado Sales, 2015 - 2017: Visualisation 2

The analysis was then repeated for the average volume of avocados sold, to compare directly with the trends in price.

#Similar preparation format as before, but listing total volume as opposed to average price
sold.conavo <- ogavodate %>% 
  select(type, year, monthabb, Total.Volume) %>% 
  filter(type == "conventional", year %in% c(2015, 2016, 2017)) %>% 
  group_by(year, monthabb) %>% 
  summarise(avg=mean(Total.Volume))

#Time series prep
soldc <- ts(sold.conavo$avg, start=2015, frequency = 12)

#Graph label prep
c.sold.title <- "Mean Monthly Sales of Conventional Avocados, USA (2015 - 2017)"
s.ylab <- "No. of Avocados"

#Conventional(sold) graph
line.sold.conavo <- ggseasonplot(soldc, year.labels = TRUE)
line.sold.conavo <- line.sold.conavo +
  labs(title = c.sold.title) +
  ylab(s.ylab) +
  theme_minimal() +
  scale_y_continuous(labels = scales::comma) + #formats the y-axis values with comma separators rather than scientific notation
  scale_color_manual(values=c(`2015`, `2016`, `2017`))

#Save the output
ggsave("soldconavo.png", plot = line.sold.conavo)

## Saving 7 x 5 in image

Aim 2 - Is there a relationship between avocado price and popularity?

Interestingly, one of the highest sales peaks occurs when the price is at its lowest in the calendar year - April for both conventional & organic avocados, and August for organic avocados. Moreover, the opposite relationship may be observed in June and the winter months, in which sales fall as the price of avocados increases.
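
The visual impression of price and sales moving in opposite directions could be backed up with a correlation coefficient. The sketch below uses only the six rows printed by `head(ogavo)` earlier (conventional avocados, Albany, late 2015), so it illustrates the approach and says nothing about the data set as a whole. The name `albany` is hypothetical.

```r
# First six rows of the data as printed earlier (conventional avocados, Albany, late 2015)
albany <- data.frame(
  AveragePrice = c(1.33, 1.35, 0.93, 1.08, 1.28, 1.26),
  Total.Volume = c(64236.62, 54876.98, 118220.22, 78992.15, 51039.60, 55979.78)
)

# Spearman's rank correlation: a negative value means volume tends to fall as price rises
rho <- cor(albany$AveragePrice, albany$Total.Volume, method = "spearman")
rho
```

Spearman's rank correlation is used here as an assumption-light choice, since it only asks whether the two variables move in opposite directions and does not require a straight-line relationship.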

Avocado popularity by type: Visualisation 3

Previous analyses have indicated conventional avocados are much more popular than organic ones - but I wanted to plot the data on a pie chart in order to easily visualise the proportion of sales occupied by each type.

#Preparing a variable which contains the average number of sales for each type of avocado
organic.conventional <- ogavodate %>% 
  select(type, Total.Volume) %>% 
  group_by(type) %>% 
  summarise(avg=mean(Total.Volume))

#Plot labels
colls.co <- c("forestgreen", "yellowgreen") #Colour for each segment of the pie chart 
pielabels <- c("Conventional", "Organic") #Label for each segment
mainpie <- c("Avocado Popularity by Type") #Title
#Create percentages to attach to the labels
co <- c(1653212.90, 47811.21) #Average sales of conventional and organic avocados, taken from the summary calculated above
pct <- round(co/sum(co)*100)
newpielabels <- paste(pielabels, pct)
newpielabels <- paste(newpielabels, "%", sep = "")

#Plot the pie chart
pie(organic.conventional$avg,
    col = colls.co, 
    main = mainpie,
    labels = newpielabels, 
    border = "white", #Colour of segment border
    radius = 1, #Size of the pie
    cex = 0.9) #Size of labels

## Saving 7 x 5 in image

Aim 3 - Which are most popular - organic or conventional avocados?

Looking at the pie chart, it is clear conventional avocados are the most popular choice - with organic avocados making up less than 5% of total sales.
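
To put a more precise number on that share, the percentages can be formatted to one decimal place. This small sketch uses the two average sales figures quoted in the pie chart code. The names `type.avg`, `share` and `pie.labels` are hypothetical.

```r
# Average sales per type, as quoted in the pie chart code
type.avg <- c(Conventional = 1653212.90, Organic = 47811.21)

# Share of the combined average, in percent
share <- type.avg / sum(type.avg) * 100

# Build labels such as "Organic 2.8%" (%% prints a literal percent sign)
pie.labels <- sprintf("%s %.1f%%", names(type.avg), share)
pie.labels  # "Conventional 97.2%" "Organic 2.8%"
```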

Avocado popularity by region: Visualisation 4

Next, I wanted to look at how avocado sales are distributed across the US, specifically in the top 5 sales regions. The table below identifies the regions with the largest avocado sales: West, California, South Central, North East & South East, USA.

#Average sales volume per region, to be ranked from highest to lowest
region.sold.total <- ogavodate %>% 
  select(region, Total.Volume) %>% 
  group_by(region) %>% 
  summarise(avg=mean(Total.Volume))

#Ordering the data from highest to lowest
order.region.total <- region.sold.total[order(region.sold.total$avg, decreasing = TRUE),]

#Showing the top 6 rows: total US sales followed by the top 5 avocado sales regions
head(order.region.total)

## # A tibble: 6 x 2
##   region             avg
##   <fct>            <dbl>
## 1 TotalUS      17351302.
## 2 West          3215323.
## 3 California    3044324.
## 4 SouthCentral  2991952.
## 5 Northeast     2110299.
## 6 Southeast     1820232.
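
Since `TotalUS` is an aggregate and not a region, it can be dropped before ranking so that the top five come out directly. Here is a minimal base R sketch using the rounded averages printed above. The names `region.avg`, `no.total` and `top5` are hypothetical.

```r
# Rounded regional averages, copied from the printed table above
region.avg <- data.frame(
  region = c("TotalUS", "West", "California", "SouthCentral", "Northeast", "Southeast"),
  avg = c(17351302, 3215323, 3044324, 2991952, 2110299, 1820232),
  stringsAsFactors = FALSE
)

# Drop the national aggregate, rank from highest to lowest, keep the first five rows
no.total <- region.avg[region.avg$region != "TotalUS", ]
top5 <- head(no.total[order(no.total$avg, decreasing = TRUE), ], 5)
top5$region
```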

I opted to plot the figures in a stacked barchart, in order to visualise the distribution of sales for each type among each region.

#Creating a variable containing the avocado type (conventional/organic) & average sales data from the top 5 sales regions
top.five <- ogavodate %>% 
  select(type, region, Total.Volume) %>% 
  filter(region %in% c("West", "California", "SouthCentral", "Northeast", "Southeast")) %>%
  group_by(region, type) %>% 
  summarise(avg=mean(Total.Volume))

#Plot labels
xlab.b <- "No. of Avocados Sold"
ylab.b <- "US Region"
ggtitle.b <- "Average Avocado Sales in the Top 5 US Sales Regions (2015-2018)"

#Plotting a bar chart to display avocado sales in the top 5 sales regions, using the "stack" feature to stack conventional & organic avocados
region.plot <- ggplot(top.five) 
region.plot + 
  geom_bar(aes(fill = type, x = region, y = avg), width = 0.5, position = "stack", stat = "identity", colour = "black") +
  ggtitle(ggtitle.b) +
  ylab(xlab.b) +
  xlab(ylab.b) +
  theme_minimal() +
  theme(axis.text.y = element_text(angle = 40, vjust = 0.6)) +
  scale_y_continuous(labels = scales::comma) +
  scale_fill_manual(values = c("forestgreen", "yellowgreen")) + 
  coord_flip() #Flips the x and y axes so the bars run horizontally

## Saving 7 x 5 in image

Aim 4 - Which regions of the US purchase the most avocados?

The data shows the western region of the US purchases the most avocados, closely followed by California (also positioned in the west), which is perhaps unsurprising given the avocado is the official fruit of California and the state produces the largest amount of avocados in the USA. The data illustrates avocados are significantly more popular in the west in contrast to the east.

Summary

I believe the most important message to be taken from this analysis is to buy avocados between March & May - when they are, on average, at their cheapest price of the entire year.

Caveats

1 - Regional Data

The regional data holds a mix of regions, states, counties and cities - each inconsistently represented and slightly ambiguous in definition. Moreover, it is unclear whether regions include states and cities which are also present in the data. For example, it is unclear whether the “West” region also includes “California” (which is positioned on the western border). Thus, from this data alone it is difficult to understand the sales distribution and price variation of avocados per region/state/city specifically.

2 - Seasonality

The data doesn’t specify where exactly the avocados are sourced. This is problematic as avocados sourced from different countries will vary in season. For example, the Californian avocado season spans from spring through summer, whereas the Mexican avocado season spans from November to April - this is a variable that will undoubtedly have an impact on the quality of the avocados and therefore the sales.

3 - Avocado type

The analysis is limited to the sale of Hass avocados. Thus, it is unclear whether the price, sale and distribution are reflective of other types of avocado, such as Pinkerton or Bacon (yes, Bacon is a variety of avocado).

Future Pursuits

There were several variables left untouched due to the time constraints of the module. Given the time or the resources, future endeavours which use this data may focus more specifically on the type of Hass avocado (small Hass: 4046, large Hass: 4225, or extra large Hass: 4770) to note the sales & price correlations for each. Moreover, further analysis into regional data, looking specifically at cities, would also be an interesting future pursuit.
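
As a starting point for the size analysis, the share of each size code can be computed row by row. The sketch below uses only the first three rows printed by `head(ogavo)` earlier. In those three rows the size columns plus `Total.Bags` add up to `Total.Volume`, but whether that holds across the whole data set is an assumption that would need checking. The names `avo`, `sizes` and `size.share` are hypothetical.

```r
# First three rows of the data as printed earlier (conventional avocados, Albany, December 2015)
avo <- data.frame(
  Total.Volume = c(64236.62, 54876.98, 118220.22),
  X4046 = c(1036.74, 674.28, 794.70),
  X4225 = c(54454.85, 44638.81, 109149.67),
  X4770 = c(48.16, 58.33, 130.50),
  Total.Bags = c(8696.87, 9505.56, 8145.35)
)

# Percentage of size-coded sales taken by each size, per row
sizes <- as.matrix(avo[, c("X4046", "X4225", "X4770")])
size.share <- round(100 * sizes / rowSums(sizes), 1)
size.share

# Gap between Total.Volume and (size columns + Total.Bags); close to zero in these rows
volume.gap <- avo$Total.Volume - (unname(rowSums(sizes)) + avo$Total.Bags)
max(abs(volume.gap))
```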

This analysis was performed using R. The R Markdown file, images & data plots are available on GitHub.

PSY6422 Project

Katie Moran

23/05/2020