Monday, March 8, 2021

Mod 9: Visualization in R

Mod-9.utf8


Set environment

env <- c("readr","tidyverse","lubridate","lattice","ggplot2")
lapply(env, library, character.only = 1)


Load, clean, tidy, & inspect dataset - “TechStocks.csv”

# read in data
stocks.tmp <- read_csv("TechStocks.csv", col_names = TRUE) %>%
    mutate(Day = t) %>%
    mutate(Date_chr = Date) %>%
    mutate(Date_lub = mdy(Date)) %>%
    select(Day,Date_chr,Date_lub,AAPL,GOOG,MSFT)
## Warning: Missing column names filled in: 'X1' [1]
## 
## -- Column specification --------------------------------------------------------
## cols(
##   X1 = col_double(),
##   Date = col_character(),
##   AAPL = col_double(),
##   GOOG = col_double(),
##   MSFT = col_double(),
##   t = col_double()
## )
# pivot to tidy the data
stocks <- stocks.tmp %>%
    pivot_longer(cols=c("AAPL","GOOG","MSFT"),
                 names_to = "Symbol", values_to = "Price" )

# inspect the data
glimpse(stocks)
## Rows: 1,512
## Columns: 5
## $ Day      <dbl> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7...
## $ Date_chr <chr> "12/1/2015", "12/1/2015", "12/1/2015", "12/2/2015", "12/2/...
## $ Date_lub <date> 2015-12-01, 2015-12-01, 2015-12-01, 2015-12-02, 2015-12-0...
## $ Symbol   <chr> "AAPL", "GOOG", "MSFT", "AAPL", "GOOG", "MSFT", "AAPL", "G...
## $ Price    <dbl> 117.34, 767.04, 55.22, 116.28, 762.38, 55.21, 115.20, 752....
summary(stocks)
##       Day          Date_chr            Date_lub             Symbol         
##  Min.   :  1.0   Length:1512        Min.   :2015-12-01   Length:1512       
##  1st Qu.:126.8   Class :character   1st Qu.:2016-06-01   Class :character  
##  Median :252.5   Mode  :character   Median :2016-11-29   Mode  :character  
##  Mean   :252.5                      Mean   :2016-11-30                     
##  3rd Qu.:378.2                      3rd Qu.:2017-06-01                     
##  Max.   :504.0                      Max.   :2017-12-01                     
##      Price        
##  Min.   :  48.43  
##  1st Qu.:  69.41  
##  Median : 116.22  
##  Mean   : 335.95  
##  3rd Qu.: 742.21  
##  Max.   :1054.21


This is a dataset of AAPL, GOOG, and MSFT stock prices from 1 December 2015 to 1 December 2017. A time-series plot will be appropriate here.

Date_chr is the date as character type (imported).
Date_lub is the date as lubridate type.


Base graphics

Base graphics provides the basics of graphics in R.
It is probably best used when exploring datasets as the call is very simple - a call to plot.

# simple plot
plot(Price ~ Day, data=stocks)


This is quite messy because the data is plotted on the same chart and uses the same price scale. Let’s look at just AAPL.

# filter then plot
stocks %>% filter(Symbol=="AAPL") %>%
    plot(Price ~ Day, data=.)


Base graphics can be enhanced but it takes more effort and parameters can appear a bit cryptic e.g. usage of cex, yaxt, or las. The workflow can also be awkward when comparing/creating related plots as in this case with 3 tech stocks. Each plot must be rendered individually which can create charting inconsistencies. Also, a legend will have to be manually created.

Still, some simple enhancements can be made as shown below.

# pipe stocks and build the plot
stocks %>% filter(Symbol=="AAPL") %>%
    plot(Price ~ Day, cex = 0.25, yaxt="none", data=., 
        main="AAPL", col.main="blue",
        xlab="Days (starts on 12-1-2015)", ylab="Price"
        )
axis(2, seq(0,200,25),las=2)
abline( lm(AAPL ~ Day, data=stocks.tmp) )


Lattice

Lattice provides several high-level lattice functions. The xyplot is one type and is great for time-series plots. A simply call to xyplot looks very much like a call to plot but its default formatting is much better.

# pipe and filter to plot
stocks %>% filter(Symbol=="AAPL") %>%
    xyplot(Price ~ Day, data=.)


Lattice makes it easy to do a comparison. Key to note here is that scales = "free" has been set. Without it we end with the same scaling problem as above but, it could introduce a misleading interpretation.

# straight plot
xyplot(Price ~ Day | Symbol, cex = 0.25, data=stocks,
       group = Symbol,
       type = c("p", "smooth"),
       scales = "free",
       main="Stocks (on free scales)", col.main="blue",
        xlab="Days (starts on 12-1-2015)", ylab="Price"
       )


GGPLOT2

GGPLOT uses the Grammar of Graphics approach which identifies 7 layers: data, aesthetics, geometries, facets, statistics, coordinates, and themes. This provides a systematic way to create a visualization.

We’ll plot the same 3 tech stocks but instead we’ll plot NOT price but instead the percentage increase since 1 December 2015. The result should be similar to the side-by-side comparison above but now we’ll compare percentages and could provide for much fairer analysis.

# create a cumulative percentage column
stocksPerc <- stocks %>%
    group_by(Symbol) %>%
    mutate(Perc = ((Price - Price[1]) / Price[1]) * 100
           ) %>%
    ungroup()

# plotting
stocksPerc %>% ggplot( aes(x=Date_lub,y=Perc) ) + 
    ggtitle("Stocks (percent chart)") + xlab("") + ylab("Percent") +
    geom_point(size=0.5, col=if_else(stocksPerc$Perc<0,"red","green3") ) + 
    facet_wrap(~Symbol) +
    theme_light() +
    theme(plot.title = element_text(hjust = 0.5)) +
    theme(axis.text.x = element_text(angle = 45, vjust = 0.5, hjust=1)) +
    theme(axis.title.y = element_text(angle = 0, vjust = 0.5, hjust=1))




GitHub

Related file(s) can be found at Git Me

No comments:

Post a Comment