Tuesday, January 26, 2021

Mod 3: data.frame


The provided data

names <- c("Jeb", "Donald", "Ted", "Marco", "Carly", "Hillary", "Bernie")
abcPoll <- c(4, 62, 51, 21, 2, 14, 15)
cbsPoll <- c(12, 75, 43, 19, 1, 21, 19)


The data.frame

pollResult <- data.frame(names,abcPoll,cbsPoll)
str(pollResult)
## 'data.frame':    7 obs. of  3 variables:
##  $ names  : chr  "Jeb" "Donald" "Ted" "Marco" ...
##  $ abcPoll: num  4 62 51 21 2 14 15
##  $ cbsPoll: num  12 75 43 19 1 21 19
pollResult
##     names abcPoll cbsPoll
## 1     Jeb       4      12
## 2  Donald      62      75
## 3     Ted      51      43
## 4   Marco      21      19
## 5   Carly       2       1
## 6 Hillary      14      21
## 7  Bernie      15      19


Let’s rank the candidates in names by their poll results. There are two ways to do it.


(1) Ranking: simple average

Caveat: a simple average of raw counts can be skewed when the polls have different totals.

avgPolls <- (abcPoll+cbsPoll)*.5
pollResult <- cbind(pollResult,avgPolls)
pollResult[ order(-avgPolls), ]
##     names abcPoll cbsPoll avgPolls
## 2  Donald      62      75     68.5
## 3     Ted      51      43     47.0
## 4   Marco      21      19     20.0
## 6 Hillary      14      21     17.5
## 7  Bernie      15      19     17.0
## 1     Jeb       4      12      8.0
## 5   Carly       2       1      1.5
colSums( pollResult[2:3] )
## abcPoll cbsPoll 
##     169     190


Here we see the candidates ranked by avgPolls.
What is the issue? abcPoll and cbsPoll have different raw totals: 169 vs 190, probably because of different polling methodologies. In a simple average of raw counts, the poll with the larger total carries more weight. With this small dataset it is probably not a big deal, but it could pose problems with larger datasets.
Let’s consider a proportional perspective.
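
To make the concern concrete, here is a minimal sketch with made-up numbers. The candidates A, B, C and the vectors pollA and pollB are hypothetical and are not part of the provided data. One poll's total is ten times the other's, and the two ranking methods disagree.

```r
# Hypothetical poll counts for three made-up candidates (not real data).
# pollB's total (1000) is ten times pollA's total (100).
candidates <- c("A", "B", "C")
pollA <- c(70, 20, 10)
pollB <- c(200, 500, 300)

# Simple average of raw counts: pollB dominates because its numbers are larger.
rawAvg <- (pollA + pollB) * .5

# Average of proportions: each poll is first rescaled so it sums to 1.
propAvg <- (prop.table(pollA) + prop.table(pollB)) * .5

candidates[order(-rawAvg)]   # ranking by raw average
candidates[order(-propAvg)]  # ranking by proportional average
```

With these numbers the raw average ranks the candidates B, C, A, while the proportional average ranks them A, B, C.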

(2) Ranking: proportional

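Same as (1) above, but apply prop.table to each poll first. This normalizes the data: each poll is rescaled so that its values sum to 1, which puts both polls on the same scale.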

prop_abc <- prop.table(abcPoll)
prop_cbs <- prop.table(cbsPoll)
propFrame <- data.frame(names,prop_abc,prop_cbs)

avgProps <- (prop_abc+prop_cbs)*.5
propResult <- cbind(propFrame,avgProps)
propResult[ order(-avgProps), ]
##     names   prop_abc    prop_cbs    avgProps
## 2  Donald 0.36686391 0.394736842 0.380800374
## 3     Ted 0.30177515 0.226315789 0.264045469
## 4   Marco 0.12426036 0.100000000 0.112130178
## 6 Hillary 0.08284024 0.110526316 0.096683276
## 7  Bernie 0.08875740 0.100000000 0.094378698
## 1     Jeb 0.02366864 0.063157895 0.043413267
## 5   Carly 0.01183432 0.005263158 0.008548739
colSums( propFrame[2:3] )
## prop_abc prop_cbs 
##        1        1


prop_abc and prop_cbs each sum to 1, as they should.
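
The sum-to-1 check follows from what prop.table does with a plain numeric vector: it divides each element by the vector's total. A small sketch, where byHand is just an illustrative name:

```r
abcPoll <- c(4, 62, 51, 21, 2, 14, 15)

# On a plain vector, prop.table() divides each element by the vector's sum.
byHand <- abcPoll / sum(abcPoll)

# Compare the manual calculation with prop.table().
matchesPropTable <- isTRUE(all.equal(prop.table(abcPoll), byHand))
matchesPropTable

# The proportions add up to 1 (within floating-point tolerance).
sumsToOne <- isTRUE(all.equal(sum(prop.table(abcPoll)), 1))
sumsToOne
```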

In this case, the simple average and the proportional average produce the same ranking. The two methods can disagree when poll totals differ more widely, so it is something to watch for next time.
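
Rather than comparing the two printed tables by eye, the orderings can be compared directly. This is one possible check, not the only one, and sameRanking is just an illustrative name:

```r
names <- c("Jeb", "Donald", "Ted", "Marco", "Carly", "Hillary", "Bernie")
abcPoll <- c(4, 62, 51, 21, 2, 14, 15)
cbsPoll <- c(12, 75, 43, 19, 1, 21, 19)

# Recompute both averages exactly as in sections (1) and (2).
avgPolls <- (abcPoll + cbsPoll) * .5
avgProps <- (prop.table(abcPoll) + prop.table(cbsPoll)) * .5

# order() returns the row indices in ranked order; identical() compares them.
sameRanking <- identical(order(-avgPolls), order(-avgProps))
sameRanking

# The shared ranking, from highest to lowest.
names[order(-avgPolls)]
```

For this data sameRanking is TRUE, matching the two tables above.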



GitHub

Related file(s) can be found at Git Me
