[R3.3] 단순 선형회귀 분석(+등분산/평균비교/이상치)

2017. 1. 8. 21:49

/*******************************************************************************************************************
-- Title : [R3.3] 단순 선형회귀 분석(+등분산/평균비교/이상치)
-- Reference : http://blog.naver.com/gksshdk8003/220820705546
-- Key word : lm linear model 선형회귀 선형모델 galton tapply var.test var test 등분산 검정 t.test t test
평균비교 p-value p value outlier 이상치 outliertest outlierTest shapiro.test shapiro test
resid Shapiro Wilk normality test r-squared p value
*******************************************************************************************************************/

-- Chart

-- R

# -- http://blog.naver.com/gksshdk8003/220820705546

# ------------------------------

# -- Galton 샘플 다운로드

# ------------------------------

fmlyDF <- read.csv("http://www.math.uah.edu/stat/data/Galton.csv", header = TRUE, stringsAsFactors = FALSE)

head(fmlyDF)

write.csv(fmlyDF,"e:\\galton.csv", row.names=TRUE)

# ------------------------------

# -- DF 사전 탐색

# ------------------------------

View(fmlyDF)

summary(fmlyDF)

str(fmlyDF)

# ------------------------------

# -- 가설 정의

# ------------------------------

# 아버지의 키에 따라 자녀의 키가 영향을 받는다?

# ------------------------------

# -- 그래프를 이용한 데이터 탐색

# ------------------------------

par(mfrow=c(1,1))

plot(fmlyDF$Height~fmlyDF$Father, data=fmlyDF, col="blue")

tempmean = tapply(fmlyDF$Height, fmlyDF$Gender, mean)

tempmean

barplot(tempmean, col=rainbow(10))

# ------------------------------

# -- Two Sample T-test

# ------------------------------

# -- 등분산 검정

var.test(fmlyDF$Height~fmlyDF$Gender)

# F test to compare two variances

# data: fmlyDF$Height by fmlyDF$Gender

# F = 0.81129, num df = 432, denom df = 464, p-value = 0.02747

# alternative hypothesis: true ratio of variances is not equal to 1

# 95 percent confidence interval:

# 0.6741029 0.9770039

# sample estimates:

# ratio of variances

# 0.8112897

# 해석 : 유의확률 0.02747 < 0.05 이므로 귀무가설 기각 = 두 그룹의 분산이 동일하지 않음을 의미.

# 귀무가설 : 두 집단의 분산이 같다?

# -- 집단간 평균비교(두 집단의 평균이 같은지 검정)

t.test(fmlyDF$Height~fmlyDF$Gender, var.equal=FALSE)

# Welch Two Sample t-test

# data: fmlyDF$Height by fmlyDF$Gender

# t = -30.662, df = 895.02, p-value < 2.2e-16

# alternative hypothesis: true difference in means is not equal to 0

# 95 percent confidence interval:

# -5.446293 -4.791018

# sample estimates:

# mean in group F mean in group M

# 64.11016 69.22882

# 해석 : p-value가 매우 작아 유의미함 = 두 집단 간의 평균키의 차이가 통계학적으로 유의미한 차이.

# ------------------------------

# -- 남자 자녀 데이터만 추출

# ------------------------------

sr_data = subset(fmlyDF, Gender == "M")

head(sr_data)

model = lm(Height~Father, data=sr_data)

model

# Call:

# lm(formula = Height ~ Father, data = sr_data)

# Coefficients:

# (Intercept) Father

# 38.2589 0.4477

# ------------------------------

# -- 이상치(outlier) 탐색

# ------------------------------

install.packages("car")

library(car)

# -- outlierTest 함수를 이용한 이상치 탐색

outlierTest(model)

# rstudent unadjusted p-value Bonferonni p

# 289 3.941997 9.3321e-05 0.043394

# 479 -3.932263 9.7054e-05 0.045130

# 289번, 479번 관측치의 경우 Bonferonni P-value가 0.05보다 작게 나와

# 이상치인 것을 알 수 있다.

# ------------------------------

# -- 제거한 결과를 토대로 모형 재검증

# ------------------------------

sr_data2 = subset(sr_data, rownames(sr_data) != "289" &

rownames(sr_data) != "479")

sr_data2

# -- lm() 적용

model = lm(Height~Father, data=sr_data2)

# -- outlierTest 함수를 이용한 이상치 탐색

outlierTest(model) # 60번 관측치가 또 이상하지만 그냥 Pass.

# -- 적합도 검정

summary(model)

# Call:

# lm(formula = Height ~ Father, data = sr_data2)

# Residuals:

# Min 1Q Median 3Q Max

# -8.437 -1.491 0.015 1.626 7.955

# Coefficients:

# Estimate Std. Error t value Pr(>|t|)

# (Intercept) 38.39154 3.28263 11.695 <2e-16 ***

# Father 0.44583 0.04743 9.399 <2e-16 ***

# ---

# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Residual standard error: 2.35 on 461 degrees of freedom

# Multiple R-squared: 0.1608, Adjusted R-squared: 0.159

# F-statistic: 88.34 on 1 and 461 DF, p-value: < 2.2e-16

# 해석)

# - Coefficients에서 y-절편(Intercept)에 대한 P-value가

# 둘다 2e-16 < 0.05으로 유의미함.

# - Father에 대한 회귀계순(Slope)의 P-value(2e-16) < 0.05라 유의미함.

# - R-squared값이 0.1608로 전체 변동량의 약 16%정도를 이 모형을 통해 설명 가능.

# R-squared값이 비교적 낮게 나타난 이유는 자녀키가 아버지키 외에 영향을 받는 요인이 있을 수 있음.

# - F-통계량 P-value(2.2e-16) < 0.05로 모형전반이 유의미성을 나타남.

# - 단순 선형회귀모형은 회귀계수가 곧 모형의 유의미성을 나타냄.

# - Father의 T-value(9.399)를 제곱하면 F-통계량 값인 88.34와 동일.

# 이 역시 단순 선형회귀모형의 성질 중 하나임.

# -- 차트에서 확인

par(mfrow=c(2,2))

plot(model)

# -- shapiro 테스트

shapiro.test(resid(model)) # resid()함수는 인수로 적합한 모형식을 입력받아 해당 모형의 잔차를 return

# Shapiro-Wilk normality test

# data: resid(model)

# W = 0.9975, p-value = 0.7193

# 해석) P-value가 0.7193로 상당히 크게 나와 귀무가설이 채택되어 위 모형의 자차들이 정규

# 분포를 따르고 있음을 설명.

# ------------------------------

# -- 결과 해석

# ------------------------------

# 위 과정으로 아래 회귀식 얻어 매우 잘 적합되었음을 확인

# 회귀식 : Height = 0.446 x Father + 38.392

# 그러므로 아버지키에 0.446인치를 곱하고 38.392인치를 더하면 대략 아들키 예측 가능.

저작자표시 비영리 변경금지

디비랑[dɪ'bɪraŋ]

[R3.3] 단순 선형회귀 분석(+등분산/평균비교/이상치)

+ Recent posts

티스토리툴바