While looking for a data set for the “Data Analysis & Visualisation” module, I quickly found that a lot of statistical databases are quite difficult to digest. In my time of need, I turned to the most trusted friend of the millennial, the avocado - a large botanical berry, also referred to as the avocado pear. Not only is it highly nutritious and delicious on toast, it also produces an abundance of rich data, perfect for creating visualisations.
The dataset I selected was among the few I really understood, and I wanted to add a degree of light-hearted novelty to my work in these times of doom and gloom. So here I present data visualisations of avocado prices, sales & distribution across the US between 2015 and 2018.
#path to image of avocado
imgpath1 <- '/Users/katie/Desktop/PGrad/Semester 2/Data analysis/avo.jpg'
#include graphics
knitr::include_graphics(imgpath1)
The data is sourced from a page on Kaggle, which originally gathered it from the Hass Avocado Board website. It contains the average price, total sales and regional data of Hass avocados in the USA between 2015 and 2018. In total, the data has 14 variables and 18,242 data entries.
#Path to the data
avocado.csv <- '/Users/katie/Desktop/PGrad/Semester2/Dataanalysis/avocado.csv'
#Load the data from the path stored above (rather than from a file named "avocado.csv" in the working directory)
ogavo <- read.csv(avocado.csv, header = TRUE, sep = ",")
#Display the data
head(ogavo)
## X Date AveragePrice Total.Volume X4046 X4225 X4770 Total.Bags
## 1 0 2015-12-27 1.33 64236.62 1036.74 54454.85 48.16 8696.87
## 2 1 2015-12-20 1.35 54876.98 674.28 44638.81 58.33 9505.56
## 3 2 2015-12-13 0.93 118220.22 794.70 109149.67 130.50 8145.35
## 4 3 2015-12-06 1.08 78992.15 1132.00 71976.41 72.58 5811.16
## 5 4 2015-11-29 1.28 51039.60 941.48 43838.39 75.78 6183.95
## 6 5 2015-11-22 1.26 55979.78 1184.27 48067.99 43.61 6683.91
## Small.Bags Large.Bags XLarge.Bags type year region
## 1 8603.62 93.25 0 conventional 2015 Albany
## 2 9408.07 97.49 0 conventional 2015 Albany
## 3 8042.21 103.14 0 conventional 2015 Albany
## 4 5677.40 133.76 0 conventional 2015 Albany
## 5 5986.26 197.69 0 conventional 2015 Albany
## 6 6556.47 127.44 0 conventional 2015 Albany
Due to the sheer volume of data, the analysis focused specifically on 6 variables. The variables selected represent the price, sales & (brief) regional data of avocado across 3 years - which are central to the research questions asked during this analysis.
The table below provides a description of the variables used - including variables present in the original data set and those which were created during the analysis, such as ‘monthabb.’
Variable | Explanation |
---|---|
Date | Date of sale |
Average Price | Average price of a single avocado |
Total Volume | Total number of avocados sold |
Type | Either conventional (non-organic) or organic |
Year | Year of sale (2015 - 2018) |
Region | Either city, region or state of sale (USA) |
Month | Month of sale in numerical format (January = 1) |
Monthabb | Month of sale in abbreviated format (January = Jan) |
Note: I excluded analysis of the year 2018, as the data set was limited to entries between January-March.
Aims of the data analysis & visualisation production:
In which month of the year are conventional/organic avocados the cheapest/most expensive? Does this vary across years?
Is there a relationship between avocado price and popularity? Does this vary between conventional and organic avocado?
Which are most popular - organic or conventional avocados?
Which regions of the US purchase the most avocados?
The first step was to create variables representing the month in numerical and abbreviated formats, so that ‘Average Price’ & ‘Total Volume’ could be plotted on a time series graph.
#Changing the date column from factor to a date variable
ogavo$Date <- as.Date(ogavo$Date, "%Y-%m-%d")
#Ordering the rows from the earliest to the latest date (Date is already a date variable at this point)
ogavodate <- ogavo[order(ogavo$Date),]
#Adding the column "month", which holds the zero-padded month number as text, e.g: January = "01"
ogavodate$month <- format(ogavodate$Date, "%m")
#Adding the column "monthabb" - which contains monthly abbreviations looked up from the "month" column, e.g: "01" = Jan
ogavodate$monthabb <- month.abb[as.numeric(ogavodate$month)]
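The month columns can be checked on a handful of rows before touching the full data. This is a minimal sketch, assuming three made-up dates in place of `ogavo$Date`; it uses base R only.

```r
# Toy dates standing in for ogavo$Date (illustrative values, not the full data)
dates <- as.Date(c("2015-12-27", "2015-01-04", "2016-04-17"))

# Sort chronologically; order() works directly on Date objects
dates <- dates[order(dates)]

# Zero-padded month number as a character string, e.g. "01"
month <- format(dates, "%m")

# month.abb is a built-in constant, and indexing it is vectorised
monthabb <- month.abb[as.integer(month)]

print(data.frame(dates, month, monthabb))
```

Because indexing `month.abb` is vectorised, no loop or `sapply()` call is needed for the lookup.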
I began the analysis by comparing how the price of organic and non-organic (conventional) avocados varies across the year, for each year available in the dataset (excluding 2018). The aim was to work out exactly when the best time to buy avocados is, and whether prices are improving or spiking.
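#Load dplyr for the data manipulation verbs and the %>% pipe
library(dplyr)
#Create a variable which calculates the average price of a single conventional avocado, for each month, across each year in the dataset. The step is then repeated for organic avocados
convavo <- ogavodate %>%
select(type, year, month, AveragePrice) %>%
filter(type == "conventional", year %in% c(2015, 2016, 2017)) %>% #%in% tests membership; == would recycle the vector and silently drop rows
group_by(year, month) %>% #the zero-padded month keeps rows in calendar order for ts(); monthabb would sort alphabetically
summarise(avg=mean(AveragePrice))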
#Data preparation - turning the monthly averages for conventional avocados into a time series which starts in 2015, with a frequency of 12 (the number of months per year)
conavo.price <- ts(convavo$avg, start=2015, frequency = 12)
#Labelling parameters for the conventional avocado graph
ggstitle.c <- "Average Price of Conventional Avocados Per Year"
ylab <- "Average Price ($)"
"2015" <- 'seagreen'
"2016" <- 'yellowgreen'
"2017" <- 'chocolate4'
#Load forecast (for ggseasonplot) and ggplot2 (for labels, themes and scales)
library(forecast)
library(ggplot2)
#Conventional(price) graph, 'ggseasonplot' used to display seasonal time series for separate years, with each line labelled by its year
line.conv.year <- ggseasonplot(conavo.price, year.labels = TRUE)
line.conv.year <- line.conv.year +
labs(title = ggstitle.c) +
theme_minimal() +
ylab(ylab) +
scale_color_manual(values=c(`2015`, `2016`, `2017`))
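#Saving the conventional avocado price graph
ggsave("convavoprice.png", plot = line.conv.year)
#Format the graphs so they're arranged together for easy visualisation (line.org.year is the organic plot, built in the same way as line.conv.year)
price.grid <- gridExtra::arrangeGrob(line.conv.year, line.org.year, nrow = 2)
grid::grid.draw(price.grid)
#Saving the formatted graphs - the arranged grid is passed explicitly, as ggsave() otherwise saves the last ggplot only
ggsave("Price.avo.png", plot = price.grid)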
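`ts()` has no notion of month names: it assumes the values arrive in calendar order. The following is a minimal sketch on toy values (not the avocado data) of how sorting by month abbreviation differs from sorting by calendar month, and one way to keep the order right with a factor. The data frame `monthly` and its numbers are invented for illustration.

```r
# Toy monthly averages, deliberately out of order (values are made up)
monthly <- data.frame(monthabb = c("Mar", "Jan", "Feb", "Apr"),
                      avg = c(3, 1, 2, 4))

# Sorting on the abbreviation itself is alphabetical: Apr, Feb, Jan, Mar
alpha <- monthly[order(as.character(monthly$monthabb)), ]

# Declaring the calendar order as factor levels makes order() follow the calendar
monthly$monthabb <- factor(monthly$monthabb, levels = month.abb)
cal <- monthly[order(monthly$monthabb), ]

# Only the calendar-ordered values line up with ts()'s Jan, Feb, ... positions
series <- ts(cal$avg, start = c(2015, 1), frequency = 12)
print(series)
```

Grouping on a zero-padded numeric month ("01" to "12") has the same effect, since those strings already sort in calendar order.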
Aim 1 - In which month of the year are conventional/organic avocados the cheapest/most expensive? Has this changed over the years?
Conventional: The most affordable months for conventional avocado purchases are between March & May. This is relatively consistent in 2017 & 2016, with the latter showing an additional drop in price in September. The year 2015 followed a less uniform pattern, with prices falling a month earlier than in later years but remaining relatively consistent across the year, in contrast to 2016 & 2017, where prices peaked between October & December. Therefore, purchases of conventional avocados during winter months should be avoided at all costs, as the graph illustrates this is when prices tend to rise.
Organic: As expected, organic avocados cost significantly more than conventional ones throughout the year. From the graph, it is clear 2017 displays the most variation in price. It yields both the cheapest and most expensive months of all years recorded - the former being April, with a second but less pronounced dip in August, and the latter being February & December ($2.25 for a single avocado!?!). A similar pattern to conventional avocado prices is followed, with price troughs occurring around April/May and again in August/September, before a peak during the winter months.
The analysis was repeated, but instead plotting the average volume of avocados sold, to directly compare with the trends in price.
#Similar preparation format as before, but using total volume as opposed to average price
sold.conavo <- ogavodate %>%
select(type, year, month, Total.Volume) %>%
filter(type == "conventional", year %in% c(2015, 2016, 2017)) %>% #%in% tests membership; == would recycle the vector and silently drop rows
group_by(year, month) %>% #the zero-padded month keeps rows in calendar order for ts()
summarise(avg=mean(Total.Volume))
#Time series prep
soldc <- ts(sold.conavo$avg, start=2015, frequency = 12)
#Graph label prep
c.sold.title <- "Mean Monthly Sales of Conventional Avocados, USA (2015 - 2017)"
s.ylab <- "No. of Avocados"
#Conventional(sold) graph
line.sold.conavo <- ggseasonplot(soldc, year.labels = TRUE)
line.sold.conavo <- line.sold.conavo +
labs(title = c.sold.title) +
ylab(s.ylab) +
theme_minimal() +
scale_y_continuous(labels = scales::comma) + #formats the y-axis values with thousands separators instead of scientific notation
scale_color_manual(values=c(`2015`, `2016`, `2017`))
#Save the output
ggsave("soldconavo.png")
## Saving 7 x 5 in image
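Filtering rows on several years depends on how the comparison is written. This is a minimal sketch on toy values (the vector `years` is made up, not drawn from the data), showing how `==` and `%in%` behave against a vector of years.

```r
# Toy year column: three rows for each of four years
years <- c(rep(2015, 3), rep(2016, 3), rep(2017, 3), rep(2018, 3))
wanted <- c(2015, 2016, 2017)

# == recycles `wanted` along `years` and compares position by position,
# so a row only matches when its year happens to line up with the recycled value
eq <- years == wanted

# %in% asks, for every row, whether its year appears anywhere in `wanted`
isin <- years %in% wanted

cat("rows kept by ==  :", sum(eq), "\n")
cat("rows kept by %in%:", sum(isin), "\n")
```

On a real data frame the `==` version still runs without an error, so the lost rows are easy to miss; comparing `nrow()` before and after the filter is a quick check.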
Aim 2 - Is there a relationship between avocado price and popularity?
Interestingly, one of the highest sales peaks occurs when the price is at its lowest in the calendar year - April for both conventional & organic avocados, and August for organic avocados. Moreover, the opposite relationship may be observed in June and the winter months, in which the price of avocados increases as sales fall.
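One hedged way to put a number on this kind of relationship is a Pearson correlation between monthly mean price and monthly mean volume. The values below are invented purely for illustration and are not taken from the dataset; with the real data, the `avg` columns of the monthly price and volume summaries would take their place.

```r
# Made-up monthly means (illustrative only, NOT from the avocado data)
price  <- c(1.10, 1.05, 0.98, 0.95, 1.00, 1.12)
volume <- c(900, 950, 1020, 1080, 1000, 880)

# Pearson correlation: values near -1 mean sales rise as price falls
r <- cor(price, volume)
print(round(r, 2))
```

A correlation only describes how the two series move together; on its own it does not show that lower prices cause higher sales.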
Previous analyses have indicated conventional avocados are much more popular than organic ones - but I wanted to plot the data on a pie chart in order to easily visualise the proportion of sales occupied by each type.
#Preparing a variable which contains the average number of sales for each type of avocado
organic.conventional <- ogavodate %>%
select(type, Total.Volume) %>%
group_by(type) %>%
summarise(avg=mean(Total.Volume))
#Plot labels
colls.co <- c("forestgreen", "yellowgreen") #Colour for each segment of the pie chart
pielabels <- c("Conventional", "Organic") #Label for each segment
mainpie <- c("Avocado Popularity by Type") #Title
#Create percentages to attach to the labels, using the averages calculated above (1653212.90 conventional, 47811.21 organic)
co <- organic.conventional$avg
pct <- round(co/sum(co)*100)
newpielabels <- paste(pielabels, pct)
newpielabels <- paste(newpielabels, "%", sep = "")
#Plot the pie chart
pie(organic.conventional$avg,
col = colls.co,
main = mainpie,
labels = newpielabels,
border = "white", #Colour of segment border
radius = 1, #Size of the pie
cex = 0.9) #Size of labels
## Saving 7 x 5 in image
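The label percentages can be reproduced from the two averages quoted in the code above. This is a minimal sketch in base R; the named vector `avg.volume` is a stand-in for `organic.conventional$avg`.

```r
# Average volumes per type, as quoted in the pie chart code above
avg.volume <- c(Conventional = 1653212.90, Organic = 47811.21)

# Share of each type, rounded to a whole percentage
pct <- round(avg.volume / sum(avg.volume) * 100)

# Build "Name NN%" labels in one step
labels <- paste0(names(avg.volume), " ", pct, "%")
print(labels)
```

Deriving the labels from the summary in this way keeps them in step with the data if the filter or the dataset changes.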
Aim 3 - Which are most popular - organic or conventional avocados?
Looking at the pie chart, it is clear conventional avocados are the most popular choice - with organic avocados making up less than 5% of total sales.
Next, I wanted to look at how avocado sales are distributed across the US, specifically in the top 5 sales regions. The table below identifies the regions with the largest avocado sales; West, California, South Central, North East & South East, USA.
#Average sales volume per region, to be ranked from highest to lowest
region.sold.total <- ogavodate %>%
select(region, Total.Volume) %>%
group_by(region) %>%
summarise(avg=mean(Total.Volume))
#Ordering the data from highest to lowest average volume
order.region.total <- region.sold.total[order(region.sold.total$avg, decreasing = TRUE),]
#Showing the first 6 rows: total US sales, followed by the top 5 avocado sales regions
head(order.region.total)
## # A tibble: 6 x 2
## region avg
## <fct> <dbl>
## 1 TotalUS 17351302.
## 2 West 3215323.
## 3 California 3044324.
## 4 SouthCentral 2991952.
## 5 Northeast 2110299.
## 6 Southeast 1820232.
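Since `TotalUS` is an aggregate rather than a region, it can be dropped before ranking. This is a minimal sketch, assuming a small data frame rebuilt by hand from the rounded figures printed above; `region.avg` is a hypothetical stand-in for `region.sold.total`.

```r
# Rounded averages copied from the printed table above
region.avg <- data.frame(
  region = c("TotalUS", "West", "California", "SouthCentral", "Northeast", "Southeast"),
  avg    = c(17351302, 3215323, 3044324, 2991952, 2110299, 1820232)
)

# Drop the national aggregate so only real regions are ranked
no.total <- region.avg[region.avg$region != "TotalUS", ]

# Rank from highest to lowest and keep the top 5
top5 <- head(no.total[order(no.total$avg, decreasing = TRUE), ], 5)
print(top5)
```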
I opted to plot the figures in a stacked barchart, in order to visualise the distribution of sales for each type among each region.
#Creating a variable containing the avocado type (conventional/organic) & average sales data from the top 5 sales regions
top.five <- ogavodate %>%
select(type, region, Total.Volume) %>%
filter(region %in% c("West", "California", "SouthCentral", "Northeast", "Southeast")) %>% #%in% keeps every row from these regions; == would recycle the vector and drop rows
group_by(region, type) %>%
summarise(avg=mean(Total.Volume))
#Plot labels
xlab.b <- "No. of Avocados Sold"
ylab.b <- "US Region"
ggtitle.b <- "Average Avocado Sales in the Top 5 US Sales Regions (2015-2018)"
#Plotting a bar chart to display avocado sales in the top 5 sales regions, using the "stack" feature to stack conventional & organic avocados
region.plot <- ggplot(top.five)
region.plot +
geom_bar(aes(fill = type, x = region, y = avg), width = 0.5, position = "stack", stat = "identity", colour = "black") +
ggtitle(ggtitle.b) +
ylab(xlab.b) +
xlab(ylab.b) +
theme_minimal() +
theme(axis.text.y = element_text(angle = 40, vjust = 0.6)) +
scale_y_continuous(labels = scales::comma) +
scale_fill_manual(values = c("forestgreen", "yellowgreen")) +
coord_flip() #Flips the x and y axes so the bars run horizontally
## Saving 7 x 5 in image
Aim 4 - Which regions of the US purchase the most avocados?
The data shows the western region of the US purchases the most avocados, closely followed by California (also positioned in the west), which is perhaps unsurprising given the avocado is the official fruit of California and the state produces the largest amount of avocados in the USA. The data illustrates avocados are significantly more popular in the west than in the east.
I believe the most important message to be taken from this analysis is to buy avocados between March & May - when they’re, on average, at their cheapest of the entire year.
The regional data holds a mix of regions, states, counties and cities - each inconsistently represented and slightly ambiguous in definition. Moreover, it is unclear whether regions include the states and cities which are also present in the data. For example, it is unclear whether the “West” region also includes “California” (which is positioned on the western border). Thus, from this data alone it is difficult to understand the sales distribution and price variation of avocados per region/state/city specifically.
The data doesn’t specify where exactly the avocados are sourced. This is problematic, as avocados sourced from different countries will vary in season. For example, Californian avocado season spans from spring through summer, whereas Mexican avocado season spans from November to April - this is a variable that will undoubtedly have an impact on the quality of the avocado and, therefore, the sales.
The analysis is limited to the sale of Hass avocados. Thus, it is unclear whether the price, sales and distribution are reflective of other types of avocado, such as Pinkerton or Bacon (yes, Bacon is a variety of avocado).
There were several variables left untouched due to the time constraints of the module. Given the time or the resources, future endeavours which use this data may focus more specifically on the type of Hass avocado (small Hass: 4046, large Hass: 4225, or extra large Hass: 4770) to note the sales & price correlations for each. Further analysis of the regional data, looking specifically at cities, would also be an interesting future pursuit.
This analysis was performed using R. The R Markdown file, images & data plots are available on GitHub.