Chapter 3 Univariate data

3.1 Introduction

  • Statistics
    • Prediction through data analysis - the process of collecting and organizing data and drawing reliable inferences about unknown facts from it

  • Data - numerical values that represent facts
  • McDonough, Information Economics (1963)
    • Wisdom: patterned knowledge
    • Knowledge: valuable information
    • Information: meaningful data
    • Data: a plain listing of facts
  • Univariate: Single variable
  • Data collection process
    • Case: One of several different possible items of interest
    • Variable: Some measurement of a case
    • Univariate data set: A set of measurements for a variable

\[ x_1, x_2, ..., x_n \]

library(UsingR)
exec.pay
?exec.pay
  • Levels of measurement
    • Nominal – names of people
    • Ordinal – finishing order in a race
    • Interval – finish-line times of runner 1 and runner 2
    • Ratio – finish-line time measured from the start time

  • Data type in R
    • Numeric data types
      • Discrete data - counts, numbers of occurrences
      • Continuous data - height, weight; exact values are typically not shared between cases
    • Factors data - Categories to group the data
    • Character data - Identifiers
    • Date and time
    • Hierarchical data - network structures

3.2 Data vectors

  • Using combine function
#The number of whale beachings in Texas during the 1990s
whale <- c(74, 122, 235, 111, 292, 111, 211, 133, 156, 79)
#Object `whale` is a data vector == (univariate) data set

# The size 
length(whale)
sum(whale)
sum(whale)/length(whale)
mean(whale)
  • Vectorization
whale - mean(whale)
whale^2 - mean(whale)
sqrt(whale)
  • Adding values to a vector variable
x <- 1
x <- c(x, 2)
x
x <- c(x, 3, 3, 3, 4)
x
  • Missing/NULL values
    • NA: not available; the value is missing
    • NULL: a reserved value for the empty object (length 0)
    • NaN: not a number (e.g., 0/0)
    • Inf: infinity (e.g., 1/0)
hip_cost <- c(10500, 45000, 74100, NA, 83500)
sum(hip_cost)
sum(hip_cost, na.rm=TRUE)
?sum
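The four special values behave differently in tests and in arithmetic. The following is a minimal base R sketch; the vector `v` is made up for illustration and is not part of the course data.

```r
# Hypothetical vector mixing an ordinary value with NA, NaN and Inf
v <- c(1, NA, 0/0, 1/0)

is.na(v)      # TRUE for both NA and NaN
is.nan(v)     # TRUE only for NaN
is.finite(v)  # FALSE for NA, NaN and Inf

# NULL is an empty object: it has length 0 and vanishes inside c()
length(NULL)
length(c(1, NULL, 2))

# na.rm = TRUE drops NA and NaN, but Inf is not missing, so the sum is Inf
sum(v, na.rm = TRUE)
```

Because `Inf` is a legitimate numeric value, it has to be filtered separately (for example with `is.finite()`) if it should not influence a summary.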
  • Attributes: names in data vectors
head(precip)
class(precip)
length(precip)
names(precip)
order(names(precip))

test_scores <- c(100, 90, 80)
names(test_scores) <- c("Alice", "Bob", "Shirley")
  • Indexing
head(precip)
precip[1]
precip[2:10]
precip[c(1,3,5)]
precip[-1]
precip["Seattle Tacoma"]
precip[c("Seattle Tacoma", "Portland")]
precip[2] <- 10
  • Functions for generating structured data
1:5
seq(1,5, by=1)
seq(0, 100, by=10)
seq(0, 100, length.out=11)
?seq
rep(5, times=10)
rep(1:3, times=4)

3.3 Data type

  • Numeric data
class(1)
class(pi)
class(seq(1,5,by=1))
  • Character data
ch <- c("Lincoln", "said", "and")
class(ch)
  • Combining strings - paste function
paste("X", 1:10)
paste("X", 1:10, sep="")
paste("The", "quick", "brown", "fox")
paste(c("The", "quick", "brown", "fox"))
paste(c("The", "quick", "brown", "fox"), collapse=" ")
x <- 1:10
paste(x)
paste(x, collapse=":")
  • Factors
x <- c("Red", "Blue", "Yellow", "Green", "Blue", "Green")
y <- factor(x)
y
  • Adding a level
levels(y)
y[1] <- "Gold" # warning: "Gold" is not a level yet, so y[1] becomes NA
y
levels(y) <- c(levels(y), "Gold")
levels(y)
y
y[1] <- "Gold"
y
  • Ordered factors (e.g., swapping the positions of the levels)
#library(UsingR)
str(Cars93)
x <- Cars93$Origin
plot(x)
x <- factor(x, levels = c("non-USA", "USA")) # reorder the levels; assigning to levels(x) would relabel the data instead
levels(x)
plot(x)
  • Logical data
    • TRUE and FALSE
    • “is” functions
    • Comparison by <, <=, ==, !=, >=, >
    • Combination by !, &, |
is.na(1)
is.numeric(1)
is.logical(TRUE)

pi < 3
precip < 30
which(precip < 30)
any(precip < 30)
all(precip < 30)
any(39 == precip)
which(39 == precip)
sum(precip < 30)
sum(c(TRUE, TRUE))

x <- Cars93$Origin
x == "USA" 
which(x == "USA")
i <- which(x == "USA")
x[i]

x <- 1:100
x < 10
x > 90
x < 10 | x >90
which(x < 10 | x >90)
i <- which(x < 10 | x >90)
x[i]
x[x < 10 | x >90]
  • Date and time
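    • Unix time (POSIX time)
    • The time elapsed since 00:00:00 Coordinated Universal Time (UTC) on January 1, 1970, expressed in seconds
    • Unix time stored as a signed 32-bit integer can count at most 2,147,483,647 (2^31 - 1) seconds after 00:00 UTC on January 1, 1970, which is 03:14:07 UTC on January 19, 2038. One second later, at 03:14:08 UTC, the counter overflows. This arithmetic overflow is known as the Year 2038 problem. (Wikipedia)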
library(lubridate)
current_time <- now() # current date-time, stored as seconds since 1970-01-01 UTC
as.numeric(current_time)
as.numeric(now())
month(current_time)

3.3.1 Example - Recoding values

The following data set is the dwt variable (father's weight) of the babies data. Find the erroneous values, replace them with NA, and then compute the mean.

x <- babies$dwt
x
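One possible approach is sketched below on a made-up vector rather than on the real data. It assumes that unknown values were recorded with a sentinel code of 999; that coding is an assumption here and should be checked with `?babies` before applying the same steps to `babies$dwt`.

```r
# Hypothetical measurements; 999 is assumed to be the code for "unknown"
x <- c(110, 148, 999, 197, 190, 999, 175)

# Inspect the distinct values: an implausibly large code stands out
sort(unique(x))

# Recode the sentinel as NA, then average the remaining values
x[x == 999] <- NA
x
mean(x, na.rm = TRUE)   # (110 + 148 + 197 + 190 + 175) / 5 = 164
```

Without `na.rm = TRUE` the mean would be `NA`, and without the recoding step the sentinel values would inflate it.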

3.3.2 Example - Average distance from center

\[\begin{equation} (| x_1 - \bar{x} | + |x_2 - \bar{x}| + ... + |x_n - \bar{x}| )/n \end{equation}\]

x <- rivers

3.4 Functions

  • Define a function
my_mean <- function(x){
    total <- sum(x)
    n <- length(x)
    return(total/n)
}
  • Write a function named get_dist for the example 3.3.2, and use it for the rivers data
get_dist <- function(x){
    return()
}
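One way to fill in the body is sketched here; it is a possible solution, not the only one. It relies on vectorization: `x - mean(x)` produces all deviations at once, `abs()` takes their absolute values, and `mean()` divides their sum by n.

```r
# Average absolute distance from the mean (formula of Example 3.3.2)
get_dist <- function(x) {
  mean(abs(x - mean(x)))
}

# Small check by hand: deviations are 2, 1, 0, 1, 2, so the result is 6/5
get_dist(c(1, 2, 3, 4, 5))

# Applied to the rivers data that ships with R
get_dist(rivers)
```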

3.5 Numeric summaries



  • Representative values (summary statistics)
  • Center – commonly known as “average” or “mean” but not the only one.
    • median, mode, etc
  • Spread – Variability of a data set.
    • No variability – mean is everything
    • Large variability – mean informs much less
    • Spread tells how much confidence an interpretation based on the center alone deserves
    • Measured as distance from the center
  • Shape – how much can be interpreted from the center and spread depends on the shape of the distribution.
    • e.g., bell shape – both sides are equally likely, large values are rather unlikely, and values tend to cluster near the center.
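The point about spread can be illustrated with two made-up vectors that share the same center. This is only a toy sketch; `a` and `b` are hypothetical values.

```r
# Two hypothetical data sets with the same mean
a <- c(10, 10, 10, 10, 10)   # no variability: the mean says everything
b <- c(0, 5, 10, 15, 20)     # same center, but the values are spread out

mean(a)
mean(b)

# The standard deviation separates the two cases
sd(a)   # 0
sd(b)   # about 7.91
```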

3.6 Center

3.6.1 Sample mean

\[ \bar{x} = \frac{1}{n} (x_1 + x_2 + ... + x_n) = \frac{1}{n}\sum_i{x_i} \]

head(kid.weights)
str(kid.weights)
wts <- kid.weights$weight
length(wts)
plot(wts)
mean(wts)
devs <- wts - mean(wts) # deviations from the mean (centering)
plot(devs)
mean(devs) # essentially 0 after centering
  • Trimmed mean
mean(wts)
wts[wts<120]
mean(wts[wts<120])
mean(wts, trim=0.2) # trim is the fraction dropped from each end (0 to 0.5)
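What `trim` does can be seen on a small made-up vector with one extreme value. In this sketch `trim = 0.1` removes 10% of the observations from each end of the sorted data before averaging; `v` is hypothetical.

```r
# Hypothetical data with one extreme value
v <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 100)

mean(v)              # 15.4, pulled up by the value 100

# Drop 10% (here one value) from each end, then average the rest
mean(v, trim = 0.1)  # 6.5

# The same result computed by hand on the sorted data
mean(sort(v)[2:9])   # 6.5
```

According to the documentation of `mean()`, values of `trim` outside the range 0 to 0.5 are taken as the nearest endpoint, so a large value such as 0.8 simply returns the median.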

3.6.2 Measure of Position

  • pth quantile - the value for which 100∙p percent of the data are smaller and 100∙(1 - p) percent are larger
  • Median - splits the data in half (p = 0.5)
  • Percentiles - the same as quantiles, but on a scale of 0 to 100
x <- 0:5
length(x)
quantile(x, 0.25)
median(x)
quantile(x, seq(0, 1, by=0.2))
quantile(x)
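The value returned by `quantile(x, 0.25)` is not one of the data points because R interpolates between neighbors. The sketch below reproduces the default method (type 7 in `?quantile`) by hand for a single p below 1; the helper names `h`, `lo` and `q` are made up for illustration.

```r
x <- 0:5
p <- 0.25
n <- length(x)
xs <- sort(x)

# Fractional position of the quantile in the sorted data: 2.25
h <- (n - 1) * p + 1
lo <- floor(h)

# Linear interpolation between the 2nd and 3rd sorted values
q <- xs[lo] + (h - lo) * (xs[lo + 1] - xs[lo])
q                               # 1.25

# Same value from the built-in function
quantile(x, p, names = FALSE)   # 1.25
```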
  • Robustness
mean(wts)
median(wts)
plot(wts)
abline(h=mean(wts), col="red")
abline(h=median(wts), col="blue")
wts2 <- wts[wts<120]
abline(h=mean(wts2), col="red", lty=2)
  • Boxplot
x <- 0:5
quantile(x)
boxplot(x)
text(x=1.3, y=quantile(x, 0.25), labels = "1st quartile")
text(x=1.3, y=quantile(x, 0.5), labels = "2nd quartile")
text(x=1.3, y=quantile(x, 0.75), labels = "3rd quartile")

3.7 Spread

  • Range - the distance between the smallest and largest values
  • Sample variance
    • Distance - \[ d_i = x_i - \bar{x} \]

\[\begin{equation} s^2 = \frac{1}{n-1}\sum_i(x_i - \bar{x})^2 \end{equation}\]

  • Sample standard deviation
    • How far the measurements lie from the mean \[\begin{equation} s = \sqrt{s^2} = \sqrt{ \frac{1}{n-1}\sum_i(x_i - \bar{x})^2 } \end{equation}\]
wts <- kid.weights$weight
var(wts)
sd(wts)

plot(wts)
boxplot(wts)
hist(wts)
hist(wts, breaks = 50)
hist(wts, 50)
abline(v=mean(wts), col="red")
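The formulas for the range, sample variance and standard deviation can be verified step by step against `var()` and `sd()`. The vector `v` below is made up so that the arithmetic is easy to follow.

```r
# Hypothetical data with mean 5
v <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(v)

# Range as a single number: largest minus smallest value
diff(range(v))

# Deviations from the mean and their squared sum (32 for this data)
d <- v - mean(v)
sum(d^2)

# Sample variance divides by n - 1, not n
s2 <- sum(d^2) / (n - 1)
s2
var(v)      # same value

# Sample standard deviation is the square root of the variance
sqrt(s2)
sd(v)       # same value
```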
  • z-score
    • How big (small) is the value relative to the others
    • \(z=3\): the value is 3 standard deviations above the mean

\[\begin{equation} z_i = \frac{x_i - \bar{x}}{s} \end{equation}\]

  • Example - z-score: write a function that computes the z-scores of wts and draw a histogram of them
wts <- kid.weights$weight
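A possible shape for such a function is sketched below on a made-up vector; the name `get_z` is hypothetical. The same call would apply to `wts` once UsingR is loaded.

```r
# z-score: deviation from the mean in units of the standard deviation
get_z <- function(x) {
  (x - mean(x)) / sd(x)
}

# Hypothetical data
v <- c(2, 4, 4, 4, 5, 5, 7, 9)
z <- get_z(v)
z

# By construction, z-scores have mean 0 and standard deviation 1
mean(z)
sd(z)

# The histogram keeps the shape of the data; only the scale changes
hist(z)
```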
  • Interquartile range (IQR)
    • Middle 50% of the data
    • Difference between Q3 and Q1
  • Example - IQR
    Compute the IQR of the values of the wts variable
wts <- kid.weights$weight
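The sketch below computes the IQR on a made-up vector in two ways, from the quartiles and with the built-in `IQR()`, and contrasts it with the standard deviation when one extreme value is present. The same calls would apply to `wts`.

```r
# Hypothetical data with one extreme value
v <- c(1, 3, 5, 7, 9, 11, 13, 15, 100)

# First and third quartiles: 5 and 13
q <- quantile(v, c(0.25, 0.75), names = FALSE)
q

# IQR is the difference Q3 - Q1
q[2] - q[1]   # 8
IQR(v)        # 8

# The standard deviation is strongly affected by the value 100
sd(v)
```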

3.8 Shape

  • Symmetry and skew

\[\begin{equation} \text{sample skewness} = \sqrt{n} \frac{\sum{(x_i - \bar{x})^3}}{\left(\sum{(x_i - \bar{x})^2}\right)^{3/2}} = \frac{1}{n}\sum{z_i^3} \end{equation}\]

myskew <- function(x){
  n <- length(x)
  z <- (x-mean(x))/sd(x)
  return(sum(z^3)/n)
}

wts <- kid.weights$weight
hist(wts, 50)
myskew(wts)

z <- rnorm(length(wts))
hist(z, br=50)
myskew(z)
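The two forms of the skewness formula agree exactly only when the z-scores are standardized with the 1/n version of the standard deviation. With `sd()`, which divides by n - 1, the z-score form is smaller by the factor ((n-1)/n)^(3/2), which is negligible for large n. The sketch below checks this on a made-up vector; the function names are hypothetical.

```r
# Moment form: sqrt(n) * sum(d^3) / (sum(d^2))^(3/2)
skew_moment <- function(x) {
  n <- length(x)
  d <- x - mean(x)
  sqrt(n) * sum(d^3) / sum(d^2)^(3/2)
}

# z-score form using sd(), as in myskew
skew_z <- function(x) {
  z <- (x - mean(x)) / sd(x)
  sum(z^3) / length(x)
}

# Hypothetical right-skewed data
v <- c(1, 2, 3, 4, 10)
n <- length(v)

skew_moment(v)
skew_z(v)

# The two differ by the factor ((n - 1) / n)^(3/2)
skew_moment(v) * ((n - 1) / n)^(3/2)
```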
  • Sample excess kurtosis
    • Measure of tails

\[\begin{equation} \text{sample excess kurtosis} = n \frac{\sum{(x_i - \bar{x})^4}}{\left(\sum{(x_i - \bar{x})^2}\right)^2} - 3 = \frac{1}{n}\sum{z_i^4} - 3 \end{equation}\]

mykurtosis <- function(x){
  n <- length(x)
  z <- (x-mean(x))/sd(x)
  return(sum(z^4)/n - 3)
}

wts <- kid.weights$weight
hist(wts, 50)
mykurtosis(wts)

z <- rnorm(length(wts))
hist(z, br=50)
mykurtosis(z)

3.9 Viewing the shape

  • Dot plots – have trouble with repeated values; only used for small data sets
  • Stem-and-leaf plot – shows the range, median, and shape, but only suits small data sets; it has trouble with clustered data and requires rounding
  • Histogram – breaks the range of the data into subintervals and counts the data points in each subinterval
  • Density plots
wts <- kid.weights$weight
xrange <- range(wts)
den <- density(wts)
plot(den, xlim=xrange, xlab="densities", main="")

  • Boxplots
    • It shows center, spread, shape
    • Five-number summary of a univariate data set: min, Q1, median, Q3, and max
    • These give a good summary of even very large data sets.
    • Outliers – points lying more than 1.5 x IQR below Q1 or above Q3
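The 1.5 x IQR rule can be applied by hand and compared with what the boxplot machinery reports. This is a sketch on a made-up vector. Note that `boxplot.stats()` uses hinges rather than `quantile()`, so the two can differ slightly on other data, although they coincide here.

```r
# Hypothetical data with one extreme value
v <- c(1, 3, 5, 7, 9, 11, 13, 15, 100)

q1 <- quantile(v, 0.25, names = FALSE)   # 5
q3 <- quantile(v, 0.75, names = FALSE)   # 13
iqr <- q3 - q1                           # 8

# Fences: anything outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR] is flagged
lower <- q1 - 1.5 * iqr                  # -7
upper <- q3 + 1.5 * iqr                  # 25
v[v < lower | v > upper]                 # 100

# Outliers as identified by the function behind boxplot()
boxplot.stats(v)$out                     # 100
```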

3.10 Categorical data

  • Tabulating data
x <- babies$smoke
x <- factor(x, labels=c("never", "now", "until current", "once, quit", "unknown"))
table(x)
out <- table(x)
prop <- 100*out/sum(out)
round(prop, digits=2)
barplot(out)
barplot(prop)
dotplot(out)
dotplot(prop)
pie(out)