Little World

Example Page 1

Sun, 05 May 2019 00:00:00 +0100

In this tutorial, I'll share my top 10 tips for getting started with Academic:

Tip 1

Tip 2

Example Page 2

Sun, 05 May 2019 00:00:00 +0100

Here are some more tips for getting started with Academic:

Tip 3

Tip 4

Example Talk

Sat, 01 Jun 2030 13:00:00 +0000

Slides can be added in a few ways:

Create slides using Academic's Slides feature and link to them using the slides parameter in the front matter of the talk file.
Upload an existing slide deck to static/ and link to it using the url_slides parameter in the front matter of the talk file.
Embed your slides (e.g. Google Slides) or presentation video on this page using shortcodes.

Further talk details can easily be added to this page using Markdown and $\rm \LaTeX$ math code.

Replacing Stata and SAS with R

Mon, 20 Jan 2020 00:00:00 +0000

Installing Stata

First install the two packages ncurses5-compat-libs and libpng12, then run:

% sudo -s

cd /tmp/

mkdir statafiles

cd statafiles

tar -zxf /home/you/Downloads/Stata14Linux64.tar.gz

cd /usr/local

mkdir stata14

cd stata14

/tmp/statafiles/install

After installing, add the installation directory to the PATH environment variable. I chose to edit /etc/profile and add:

export PATH="$PATH:/usr/local/stata14"

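To apply the change without rebooting, run source /etc/profile.

The license file can be copied directly into the installation directory, or you can put stata.lic.tar.gz in that directory.

Calling Stata from R

This is done with the RStata package.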

#run Stata in R----
library("RStata")
options("RStata.StataPath" = "D:\\Stata15\\StataSE-64") #office
options("RStata.StataPath" = "/usr/local/stata14/stata") #linux #cannot use stata-se?
options("RStata.StataVersion" = 14)

Exchanging data among the three environments

In R, use two packages:

library(haven)   # need read_dta/write_dta to read and write dta files
library(rio)     # rio::import reads SAS data
library(dplyr)   # %>% and mutate_if
library(stringr) # str_c
# haven::read_sas can also import sas7bdat
# data_loc is assumed to hold the path to the data directory
f1 <- str_c(data_loc, "after2007.sas7bdat", sep = "/")
o1 <- str_c(data_loc, "after2007.dta", sep = "/")
after2007_raw <- import(f1)
after2007_raw %>%
  mutate_if(is.numeric, as.integer) %>%
  write_dta(o1, version = 12)
# Write version 12 because SAS only supports Stata 12 files (or earlier), while haven supports Stata versions 8-15.

If none of the above methods can read the sas7bdat file, go through SAS instead.

/* import Stata data file, only supports version 12 or earlier */
PROC IMPORT OUT= WORK.S1 
            DATAFILE= "E:\after2007.dta" 
            DBMS=STATA REPLACE;
RUN;

proc export data=raw1 outfile= "D:\sample.dta" replace;
run;

The Catcher in the Rye

Sun, 17 Nov 2019 00:00:00 +0000

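It turns out The Catcher in the Rye is not a story about a scarecrow and crows; it is about an angsty teenager's failed attempt to run away from home. If I had read it during my own angsty teenage phase, I would probably have loved it. Not that I don't like it now. What is even more remarkable is that after all these years I had never been spoiled at all. To Kill a Mockingbird is the same: it is not a story about a hunter. Just how illiterate am I!

Tinkering with Manjaro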

Wed, 06 Nov 2019 00:00:00 +0000

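How it started

It all began with tinkering with a home theater. My first solution used Windows as the server, which was not ideal, so I got a "black Synology" (Synology's system running on non-Synology hardware). Along the way with the black Synology I picked up countless new skills. On top of that, I was lucky enough to have a public IP address, which greatly widened what I could tinker with. After playing with docker on the Synology system for a while, I wanted a Linux machine to play with, but I didn't want to experiment on my own desktop. I looked around and read that the best Linux laptop is a Chromebook, supposedly with a silky-smooth Chrome experience and the longest battery life of any Linux laptop, and cheap too, so I bought one without hesitation. When it arrived it turned out to be really good: Chrome OS plus Android plus Linux is simply amazing, and it basically covers what I need when I go out, though I don't much like going out anyway. But while playing with it I saw people saying that the best Linux distribution is WSL. WSL requires an up-to-date Windows 10, but my desktop at home had been stuck on a 2015 version of Windows and fell into a blue-screen loop whenever I tried to upgrade. This time I gritted my teeth and upgraded the system, and got WSL running. I thought of using it to run docker and deploy services, but the Windows firewall gave me a headache just looking at it, so I gave up on that. While using the Chromebook, though, I found its touchpad gestures really satisfying! Wanting the same on Windows, I bought a Lenovo touchpad; it was an old model that only supports Windows 8 gestures. Then I fancied that Linux might have better touchpad drivers (dream on: in practice Manjaro recognized it only as a mouse, not as a touchpad), and so I set up dual boot…

The result now is fantastic; I feel like I've saved a lot of money on buying servers! I'm really so clever!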

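Setup notes

In case I need to reinstall someday, I'm writing down the installation notes for my future self. The ISO I installed from was Manjaro KDE; don't use the xfce edition.

After installation, switch to mirrors in China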

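# Rank the mirrors in China; usually pick the top two
sudo pacman-mirrors -i -c China -m rank
## Refresh the package databases
sudo pacman -Syy 
## Add the archlinuxcn repository
sudo nano /etc/pacman.conf

Append the following at the end of the file: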

[archlinuxcn]
SigLevel = Optional TrustedOnly
Server = https://mirrors.tuna.tsinghua.edu.cn/archlinuxcn/$arch

sudo pacman -Syyu  # refresh the package databases and upgrade
sudo pacman -S archlinuxcn-keyring  # install the keyring to import the GPG keys

NVIDIA driver

sudo mhwd -a pci nonfree 0300
sudo reboot
nvidia-settings

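Chinese input method

# Chinese fonts
sudo pacman -S adobe-source-han-sans-cn-fonts adobe-source-han-serif-cn-fonts
sudo pacman -S fcitx fcitx-googlepinyin fcitx-im fcitx-configtool

# Edit ~/.xinitrc or ~/.xprofile (for example with nano ~/.xprofile; sudo is not needed for files in your home directory) and add: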

export GTK_IM_MODULE=fcitx
export QT_IM_MODULE=fcitx
export XMODIFIERS="@im=fcitx"

Deepin desktop

Install DDE

sudo pacman -S deepin deepin-extra

Modify /etc/lightdm/lightdm.conf

sudo cp /etc/lightdm/lightdm.conf /etc/lightdm/lightdm.conf.bak

sudo sed -i 's/greeter-session=lightdm-.*/greeter-session=lightdm-deepin-greeter/g' /etc/lightdm/lightdm.conf

sudo sed -i 's/user-session=xfce/user-session=deepin/g' /etc/lightdm/lightdm.conf

Choose the desktop: log out, then select the deepin desktop icon in the lower-right corner of the login screen.

Installing miniconda

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh 

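# Edit ~/.bash_profile and append the following environment variable at the end (note that $PATH comes first)
export PATH="$PATH:$HOME/miniconda3/bin"

# After editing
source ~/.bash_profile

# Enter the base environment or a newly created Python environment
source activate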

pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

After that, packages can be installed with conda and pip.

Setting up services and opening ports

I use ufw.

Starting RStudio Server automatically at boot

sudo rstudio-server verify-installation

# Check status
systemctl status rstudio-server
# Start
systemctl start rstudio-server
# Stop
systemctl stop rstudio-server

# Start automatically at boot
sudo systemctl enable rstudio-server

So satisfying! This post was written in RStudio Server.

Jupyter lab

Add a jupyter.service file under /etc/systemd/system

#sudo nano /etc/systemd/system/jupyter.service
[Unit]
Description=Jupyter Lab

[Service]
Type=simple
PIDFile=/run/jupyter.pid
ExecStart=/home/wyih/anaconda3/bin/jupyter lab --ip 192.168.6.100 --config=/home/wyih/.jupyter/jupyter_notebook_config.py
User=wyih
Group=wyih
WorkingDirectory=/home/wyih/Jupyter Notebook
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Enable the service

systemctl enable jupyter.service
systemctl daemon-reload
systemctl restart jupyter.service

jupyter_notebook_config.py configuration:

c.NotebookApp.ip = '*'  # IP address the server listens on; '*' means all interfaces
c.NotebookApp.password = u'sha1:xxx:xxx' # the password hash string generated earlier
c.NotebookApp.open_browser = False # do not open a local browser at startup
c.NotebookApp.port = 8889 # port to use
c.NotebookApp.allow_remote_access = True
## Whether to allow the notebook to run as the root user.
c.NotebookApp.allow_root = True

Chrome Remote Desktop

Installed “chrome-remote-desktop” from the AUR, along with the Chrome extension.
Ran crd --setup in the terminal as a normal user; it asked for the sudo password.
Edited the “.chrome-remote-desktop-session” file, deleting the # in front of the “exec /usr/bin/startkde” line.
Accepted the screen resolution.
Ran crd --restart.

It seems CRD still cannot start automatically.

System backup and restore

I haven't figured this out yet.

Data Vis Chapter 8

Wed, 09 Oct 2019 00:00:00 +0000

head(asasec)

##                                Section         Sname Beginning Revenues
## 1      Aging and the Life Course (018)         Aging     12752    12104
## 2     Alcohol, Drugs and Tobacco (030) Alcohol/Drugs     11933     1144
## 3 Altruism and Social Solidarity (047)      Altruism      1139     1862
## 4            Animals and Society (042)       Animals       473      820
## 5             Asia/Asian America (024)          Asia      9056     2116
## 6            Body and Embodiment (048)          Body      3408     1618
##   Expenses Ending Journal Year Members
## 1    12007  12849      No 2005     598
## 2      400  12677      No 2005     301
## 3     1875   1126      No 2005      NA
## 4     1116    177      No 2005     209
## 5     1710   9462      No 2005     365
## 6     1920   3106      No 2005      NA

p <-
  ggplot(
    data = subset(asasec, Year == 2014),
    mapping = aes(x = Members,
                  y = Revenues, label = Sname)
  )

p + geom_point() + geom_smooth()

p <-
  ggplot(
    data = subset(asasec, Year == 2014),
    mapping = aes(x = Members,
                  y = Revenues, label = Sname)
  )

p + geom_point(mapping = aes(color = Journal)) + geom_smooth(method = "lm")

p0 <-
  ggplot(
    data = subset(asasec, Year == 2014),
    mapping = aes(x = Members,
                  y = Revenues, label = Sname)
  )

p1 <-
  p0 + geom_smooth(method = "lm", se = FALSE, color = "gray80") +
  geom_point(mapping = aes(color = Journal))
library(ggrepel)
p2 <- p1 + geom_text_repel(data = subset(asasec, Year == 2014 &
                                           Revenues > 7000),
                           size = 2)

p3 <- p2 + labs(
  x = "Membership",
  y = "Revenues",
  color = "Section has own Journal",
  title = "ASA Sections",
  subtitle = "2014 Calendar year.",
  caption = "Source: ASA annual report."
)
p4 <- p3 + scale_y_continuous(labels = scales::dollar) +
  theme(legend.position = "bottom")
p4

Use Color Palette

Use the RColorBrewer package. Access the colors by specifying the scale_color_brewer() or scale_fill_brewer() functions, depending on the aesthetic you are mapping.

p <- ggplot(data = organdata,
            mapping = aes(x = roads, y = donors,
                          color = world))
p + geom_point(size = 2) + scale_color_brewer(palette = "Set2") +
  theme(legend.position = "top")

p + geom_point(size = 2) + scale_color_brewer(palette = "Pastel2") +
  theme(legend.position = "top")

p + geom_point(size = 2) + scale_color_brewer(palette = "Dark2") +
  theme(legend.position = "top")

Specify colors manually, via scale_color_manual() or scale_fill_manual(). Try demo('colors') to see the color names in R.

cb_palette <-
  c(
    "#999999",
    "#E69F00",
    "#56B4E9",
    "#009E73",
    "#F0E442",
    "#0072B2",
    "#D55E00",
    "#CC79A7"
  )

p4 + scale_color_manual(values = cb_palette)
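
As a small aside on named colors: base R's colors() returns every color name R recognizes, and col2rgb() converts a name to its RGB components. The sketch below uses only base R to check that a few names used in these notes are valid and to show how a name maps to a hex code. The vector `used` is just an illustrative selection, and `name_to_hex` is a hypothetical helper, not part of any package.

```r
# Names that appear in the plots in these notes (an illustrative selection).
used <- c("tomato", "steelblue", "orange", "purple", "lightblue", "gray80")

# colors() lists every color name that R recognizes.
all(used %in% colors())

# Convert color names to hex strings via their RGB components
# (hypothetical helper written for this example).
name_to_hex <- function(name) {
  m <- col2rgb(name)  # 3 x n integer matrix with rows red, green, blue
  rgb(m["red", ], m["green", ], m["blue", ], maxColorValue = 255)
}

name_to_hex(used)
```

Hex codes obtained this way can be passed to scale_color_manual(values = ...) in the same way as the cb_palette vector.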

library(dichromat)
library(RColorBrewer)

Default <- brewer.pal(5, "Set2")

types <- c("deutan", "protan", "tritan")
names(types) <- c("Deuteranopia", "Protanopia", "Tritanopia")

color_table <- types %>% purrr::map(~ dichromat(Default, .x)) %>%
  as_tibble() %>% add_column(Default, .before = TRUE)

color_table

## # A tibble: 5 x 4
##   Default Deuteranopia Protanopia Tritanopia
##   <chr>   <chr>        <chr>      <chr>     
## 1 #66C2A5 #AEAEA7      #BABAA5    #82BDBD   
## 2 #FC8D62 #B6B661      #9E9E63    #F29494   
## 3 #8DA0CB #9C9CCB      #9E9ECB    #92ABAB   
## 4 #E78AC3 #ACACC1      #9898C3    #DA9C9C   
## 5 #A6D854 #CACA5E      #D3D355    #B6C8C8

Layer Color and Text Together

# Democrat Blue and Republican Red (party_colors is defined just before it is used in p2)
p0 <- ggplot(
  data = subset(county_data, flipped == "No"),
  mapping = aes(x = pop, y = black / 100)
)
p1 <-
  p0 + geom_point(alpha = 0.15, color = "gray50") + scale_x_log10(labels =
                                                                    scales::comma)
p1

party_colors <- c("#2E74C0", "#CB454A")
p2 <- p1 + geom_point(
  data = subset(county_data, flipped == "Yes"),
  mapping = aes(x = pop, y = black / 100, color = partywinner16)
) +
  scale_color_manual(values = party_colors) 
p2

p3 <-
  p2 + scale_y_continuous(labels = scales::percent) + labs(
    color = "County flipped to ... ",
    x = "County Population (log scale)",
    y = "Percent Black Population",
    title = "Flipped counties, 2016",
    caption = "Counties in gray did not flip."
  )
p3

p4 <-
  p3 + geom_text_repel(
    data = subset(county_data, flipped == "Yes" & black > 25),
    mapping = aes(x = pop, y = black / 100, label = state),
    size = 2
  )
p4 + theme_minimal() + theme(legend.position = "top")

Themes

theme_set(theme_bw()) 
p4 + theme(legend.position = "top")

theme_set(theme_dark()) 
p4 + theme(legend.position = "top")

p4 + theme_gray()

library(ggthemes)
theme_set(theme_economist())
p4 + theme(legend.position = "top")

theme_set(theme_wsj())
p4 + theme(
  plot.title = element_text(size = rel(0.6)),
  legend.title = element_text(size = rel(0.35)),
  plot.caption = element_text(size = rel(0.35)),
  legend.position = "top"
)

Claus O. Wilke’s cowplot package contains a well-developed theme suitable for figures whose final destination is a journal article. Bob Rudis’s hrbrthemes package has a distinctive and compact look and feel that takes advantage of some freely available typefaces.

library(hrbrthemes)
theme_set(theme_ipsum())
p4 + theme(legend.position = "top")

p4 + theme(
  legend.position = "top",
  plot.title = element_text(
    size = rel(2),
    lineheight = .5,
    family = "Times",
    face = "bold.italic",
    colour = "orange"
  ),
  axis.text.x = element_text(
    size = rel(1.1),
    family = "Courier",
    face = "bold",
    color = "purple"
  )
)

Use Theme Elements

yrs <- c(seq(1972, 1988, 4), 1993, seq(1996, 2016, 4))
mean_age <-
  gss_lon %>%
  # use the vectorized & here; && is not valid on vectors inside filter()
  filter(age %nin% NA & year %in% yrs) %>%
  group_by(year) %>%
  summarize(xbar = round(mean(age, na.rm = TRUE), 0))
mean_age$y <- 0.3
yr_labs <- data.frame(x = 85, y = 0.8, year = yrs)

p <-
  ggplot(data = subset(gss_lon, year %in% yrs),
         mapping = aes(x = age))
p1 <-
  p + geom_density(
    fill = "gray20",
    color = FALSE,
    alpha = 0.9,
    mapping = aes(y = ..scaled..)
  ) +
  geom_vline(
    data = subset(mean_age, year %in% yrs),
    aes(xintercept = xbar),
    color = "white",
    size = 0.5
  ) +
  geom_text(
    data = subset(mean_age, year %in% yrs),
    aes(x = xbar, y = y, label = xbar),
    nudge_x = 7.5,
    color = "white",
    size = 3.5,
    hjust = 1
  ) +
  geom_text(data = subset(yr_labs, year %in% yrs), aes(x = x, y = y, label = year)) +
  facet_grid(year ~ ., switch = "y")

p1 + 
  theme(
    plot.title = element_text(size = 16),
    axis.text.x = element_text(size = 12),
    axis.title.y = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    strip.background = element_blank(),
    strip.text.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  ) +
  labs(x = "Age", y = NULL, title = "Age Distribution of\nGSS Respondents")

library(ggridges)
p <-
  ggplot(data = gss_lon, mapping = aes(x = age, y = factor(
    year, levels = rev(unique(year)), ordered = TRUE
  )))
p + geom_density_ridges(alpha = 0.6,
                        fill = "lightblue",
                        scale = 1.5) + scale_x_continuous(breaks = c(25, 50, 75)) + scale_y_discrete(expand = c(0.01, 0)) + labs(x = "Age", y = NULL, title = "Age Distribution of\nGSS Respondents") +
  theme_ridges() + theme(title = element_text(size = 16, face = "bold"))

Two y-axes

head(fredts)

##         date  sp500 monbase  sp500_i monbase_i
## 1 2009-03-11 696.68 1542228 100.0000  100.0000
## 2 2009-03-18 766.73 1693133 110.0548  109.7849
## 3 2009-03-25 799.10 1693133 114.7012  109.7849
## 4 2009-04-01 809.06 1733017 116.1308  112.3710
## 5 2009-04-08 830.61 1733017 119.2240  112.3710
## 6 2009-04-15 852.21 1789878 122.3245  116.0579

fredts_m <-
  fredts %>% select(date, sp500_i, monbase_i) %>% gather(key = series, value = score, sp500_i:monbase_i)
head(fredts_m)

##         date  series    score
## 1 2009-03-11 sp500_i 100.0000
## 2 2009-03-18 sp500_i 110.0548
## 3 2009-03-25 sp500_i 114.7012
## 4 2009-04-01 sp500_i 116.1308
## 5 2009-04-08 sp500_i 119.2240
## 6 2009-04-15 sp500_i 122.3245

p <-
  ggplot(data = fredts_m,
         mapping = aes(
           x = date,
           y = score,
           group = series,
           color = series
         ))
p1 <-
  p + geom_line() + theme(legend.position = "top") + labs(x = "Date", y = "Index", color = "Series")
p <-
  ggplot(data = fredts,
         mapping = aes(x = date, y = sp500_i - monbase_i))
p2 <- p + geom_line() + labs(x = "Date", y = "Difference")

cowplot::plot_grid(p1, p2, nrow = 2, rel_heights = c(0.75, 0.25), align = "v")

Using two y-axes gives you an extra degree of freedom to mess about with the data that, in most cases, you really should not take advantage of.
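
A common alternative to a second y-axis, and the one the sp500_i and monbase_i columns above appear to follow, is to rescale each series so that its first observation equals 100 and then plot both on a single axis. Here is a minimal base R sketch using the first three rows of the head(fredts) output shown above. The helper name `index_to_100` is hypothetical, and the assumption that the indexed columns were built this way is inferred from the printed values rather than stated in the source.

```r
# First three rows copied from the head(fredts) output above.
fred_head <- data.frame(
  date    = as.Date(c("2009-03-11", "2009-03-18", "2009-03-25")),
  sp500   = c(696.68, 766.73, 799.10),
  monbase = c(1542228, 1693133, 1693133)
)

# Rescale a series so that its first value is 100 (hypothetical helper).
index_to_100 <- function(x) 100 * x / x[1]

fred_head$sp500_i   <- index_to_100(fred_head$sp500)
fred_head$monbase_i <- index_to_100(fred_head$monbase)

# Compare these with the sp500_i and monbase_i columns printed above.
round(fred_head[, c("sp500_i", "monbase_i")], 4)
```

Because both indexed series share the same units (percent of the starting value), they can be drawn against one y-axis without any arbitrary choice of a second scale.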

p <- ggplot(data = yahoo, mapping = aes(x = Employees, y = Revenue))
p + geom_path(color = "gray80") + geom_text(aes(color = Mayer, label = Year),
                                            size = 3,
                                            fontface = "bold") +
  theme(legend.position = "bottom") + labs(
    color = "Mayer is CEO",
    x = "Employees",
    y = "Revenue (Millions)",
    title = "Yahoo Employees vs Revenues, 2004-2014"
  ) + scale_y_continuous(labels = scales::dollar) + scale_x_continuous(labels = scales::comma)

p <-
  ggplot(data = yahoo,
         mapping = aes(x = Year, y = Revenue / Employees))
p + geom_vline(xintercept = 2012) + geom_line(color = "gray60", size = 2) + annotate(
  "text",
  x = 2013,
  y = 0.44,
  label = " Mayer becomes CEO",
  size = 2.5
) +
  labs(x = "Year\n", y = "Revenue/Employees", title = "Yahoo Revenue to Employee Ratio, 2004-2014")

Saying no to pie

p_xlab <-
  "Amount Owed, in thousands of Dollars" 
p_title <- "Outstanding Student Loans" 
p_subtitle <- "44 million borrowers owe a total of $1.3 trillion" 
p_caption <- "Source: FRB NY"
f_labs <-
  c(`Borrowers` = "Percent of\nall Borrowers", `Balances` = "Percent of\nall Balances")
p <-
  ggplot(data = studebt,
         mapping = aes(x = Debt, y = pct / 100, fill = type))
p + geom_bar(stat = "identity") + scale_fill_brewer(type = "qual", palette = "Dark2") + scale_y_continuous(labels = scales::percent) + guides(fill = FALSE) + theme(strip.text.x = element_text(face = "bold")) + labs(
  y = NULL,
  x = p_xlab,
  caption = p_caption,
  title = p_title,
  subtitle = p_subtitle
) + facet_grid( ~ type, labeller = as_labeller(f_labs)) + coord_flip()

library(viridis)
p <-
  ggplot(studebt, aes(y = pct / 100, x = type, fill = Debtrc)) 
p + geom_bar(stat = "identity", color = "gray80") + scale_x_discrete(labels = as_labeller(f_labs)) + scale_y_continuous(labels = scales::percent) + scale_fill_viridis(discrete = TRUE) + guides(
    fill = guide_legend(
      reverse = TRUE,
      title.position = "top",
      label.position = "bottom",
      keywidth = 3,
      nrow = 1
    )
  ) +
  labs(
    x = NULL,
    y = NULL,
    fill = "Amount Owed, in thousands of dollars",
    caption = p_caption,
    title = p_title,
    subtitle = p_subtitle
  ) +
  theme(
    legend.position = "top",
    axis.text.y = element_text(face = "bold", hjust = 1, size = 12),
    axis.ticks.length = unit(0, "cm"),
    panel.grid.major.y = element_blank()
  ) +
  coord_flip()

See http://r-graph-gallery.com/ for more examples.

Data Vis Chapter 6

Thu, 26 Sep 2019 00:00:00 +0000

p <-  ggplot(data = gapminder,
             mapping = aes(x = log(gdpPercap), y = lifeExp))

p + geom_point(alpha = 0.1) +
  geom_smooth(color = "tomato",
              fill = "tomato",
              method = MASS::rlm) + #robust regression line
  geom_smooth(color = "steelblue",
              fill = "steelblue",
              method = "lm")

p + geom_point(alpha = 0.1) +
  geom_smooth(
    color = "tomato",
    method = "lm",
    size = 1.2,
    formula = y ~ splines::bs(x, 3),
    se = FALSE
  )

p + geom_point(alpha = 0.1) +
  geom_quantile( # specialized version of geom_smooth that can fit quantile regressions
    color = "tomato",
    size = 1.2,
    method = "rqss",
    lambda = 1,
    quantiles = c(0.20, 0.5, 0.85)
  )

## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)

Show Several Fits at Once, with a Legend

model_colors <- RColorBrewer::brewer.pal(3, "Set1")
model_colors

## [1] "#E41A1C" "#377EB8" "#4DAF4A"

p0 <- ggplot(data = gapminder,
             mapping = aes(x = log(gdpPercap), y = lifeExp))

p1 <- p0 + geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS")) +
  geom_smooth(
    method = "lm",
    formula = y ~ splines::bs(x, df = 3),
    aes(color = "Cubic Spline", fill = "Cubic Spline")
  ) +
  geom_smooth(method = "loess",
              aes(color = "LOESS", fill = "LOESS"))

p1 + scale_color_manual(name = "Models", values = model_colors) +
  scale_fill_manual(name = "Models", values = model_colors) +
  theme(legend.position = "top")

Model-based Graphics

min_gdp <- min(gapminder$gdpPercap)
max_gdp <- max(gapminder$gdpPercap)
med_pop <- median(gapminder$pop)

pred_df <- expand.grid(gdpPercap = (seq(from = min_gdp, to = max_gdp,
length.out = 100)), pop = med_pop, continent = c("Africa",
"Americas", "Asia", "Europe", "Oceania"))

dim(pred_df)

## [1] 500   3

out <- lm(formula = lifeExp ~ gdpPercap + pop + continent, data = gapminder)

pred_out <- predict(object = out, newdata = pred_df, interval = "predict")
pred_df <- cbind(pred_df, pred_out)

p <-
  ggplot(
    data = subset(pred_df, continent %in% c("Europe", "Africa")),
    aes(
      x = gdpPercap,
      y = fit,
      ymin = lwr,
      ymax = upr,
      color = continent,
      fill = continent,
      group = continent
    )
  )

p + geom_point(
  data = subset(gapminder,
                continent %in% c("Europe", "Africa")),
  aes(x = gdpPercap, y = lifeExp,
      color = continent),
  alpha = 0.5,
  inherit.aes = FALSE
) +
  geom_line() +
  geom_ribbon(alpha = 0.2, color = FALSE) +
  scale_x_log10(labels = scales::dollar)

Tidy Model Objects with Broom

Get component-level statistics with tidy()

library(broom)
out_comp <- tidy(out)
out_comp %>% round_df()

## # A tibble: 7 x 5
##   term              estimate std.error statistic p.value
##   <chr>                <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)          47.8      0.34     141.         0
## 2 gdpPercap             0        0         19.2        0
## 3 pop                   0        0          3.33       0
## 4 continentAmericas    13.5      0.6       22.5        0
## 5 continentAsia         8.19     0.570     14.3        0
## 6 continentEurope      17.5      0.62      28.0        0
## 7 continentOceania     18.1      1.78      10.2        0

The “not in” operator %nin% is available via socviz. prefix_strip(), also from socviz, drops prefixes.

#confidence interval
out_conf <- tidy(out, conf.int = TRUE)
out_conf <- subset(out_conf, term %nin% "(Intercept)")
out_conf$nicelabs <- prefix_strip(out_conf$term, "continent")
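
For readers without socviz installed, the two helpers used above can be approximated in base R. The definitions below are a rough sketch of the behavior described, not socviz's actual source. The names `%notin%` and `strip_prefix` are hypothetical stand-ins, and `strip_prefix` does not reproduce any extra formatting the socviz version may apply.

```r
# "Not in": TRUE where an element of x is absent from table.
`%notin%` <- function(x, table) !(x %in% table)

# Drop a leading prefix from each string, leaving other strings unchanged.
strip_prefix <- function(x, prefix) {
  has_prefix <- startsWith(x, prefix)
  x[has_prefix] <- substring(x[has_prefix], nchar(prefix) + 1)
  x
}

# A few term names like those in the tidy() output above.
terms <- c("(Intercept)", "gdpPercap", "pop",
           "continentAmericas", "continentAsia")

# Remove the intercept, then strip the "continent" prefix.
kept <- terms[terms %notin% "(Intercept)"]
strip_prefix(kept, "continent")
```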

p <- ggplot(out_conf,
            mapping = aes(
              x = reorder(nicelabs, estimate),
              y = estimate,
              ymin = conf.low,
              ymax = conf.high
            ))
p + geom_pointrange() + coord_flip() + labs(x = "", y = "OLS Estimate")

Get observation-level statistics with augment()

out_aug <- augment(out)
p <- ggplot(data = out_aug, mapping = aes(x = .fitted, y = .resid))
p + geom_point()

Get model-level statistics with glance()

Broom is able to tidy (and augment, and glance at) a wide range of model types.

library(survival)

out_cph <- coxph(Surv(time, status) ~ age + sex, data = lung)
out_surv <- survfit(out_cph)
out_tidy <- tidy(out_surv)
p <- ggplot(data = out_tidy, mapping = aes(time, estimate))
p + geom_line() + geom_ribbon(mapping = aes(ymin = conf.low,
                                            ymax = conf.high),
                              alpha = 0.2)

Grouped Analysis

nest and unnest

out_le <- gapminder %>%
  group_by(continent, year) %>%
  nest()

fit_ols <- function(df) {
  lm(lifeExp ~ log(gdpPercap), data = df)
}

out_le <- gapminder %>%
  group_by(continent, year) %>%
  nest() %>%
  mutate(model = map(data, fit_ols))
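
The nest-then-map pattern above fits one model per group. For intuition, roughly the same idea can be written in base R with split() and lapply(). The sketch below uses the built-in mtcars data instead of gapminder (an assumption made so that it runs without extra packages), grouping by cyl and regressing mpg on wt. The names `fit_ols_mpg`, `by_cyl`, and `slopes` are illustrative only.

```r
# Fit one regression per group, analogous to nest() followed by map().
fit_ols_mpg <- function(df) {
  lm(mpg ~ wt, data = df)
}

# split() returns a list of data frames, one per value of cyl.
by_cyl <- split(mtcars, mtcars$cyl)

# lapply() applies the fitting function to each group's data frame.
models <- lapply(by_cyl, fit_ols_mpg)

# Collect the slope on wt from each model into one small data frame.
slopes <- data.frame(
  cyl   = names(models),
  slope = vapply(models, function(m) coef(m)[["wt"]], numeric(1)),
  row.names = NULL
)
slopes
```

The tidyverse version keeps the data, the models, and their tidied summaries together in one data frame, which is what makes the later unnest() step convenient.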



out_tidy <- gapminder %>%
  group_by(continent, year) %>%
  nest() %>%
  mutate(model = map(data, fit_ols),
         tidied = map(model, tidy)) %>%
  unnest(tidied) %>% # .drop is deprecated as of tidyr 1.0.0; list-columns are now preserved
  filter(term %nin% "(Intercept)" &
           continent %nin% "Oceania")

p <- ggplot(
  data = out_tidy,
  mapping = aes(
    x = year,
    y = estimate,
    ymin = estimate - 2 * std.error,
    ymax = estimate + 2 * std.error,
    group = continent,
    color = continent
  )
)

p + geom_pointrange(position = position_dodge(width = 1)) +
  scale_x_continuous(breaks = unique(gapminder$year)) +
  theme(legend.position = "top") +
  labs(x = "Year", y = "Estimate", color = "Continent")

Plot Marginal Effects

library(margins)
gss_sm$polviews_m <- relevel(gss_sm$polviews, ref = "Moderate")
out_bo <- glm(obama ~ polviews_m + sex * race,
              family = "binomial",
              data = gss_sm)
summary(out_bo)

## 
## Call:
## glm(formula = obama ~ polviews_m + sex * race, family = "binomial", 
##     data = gss_sm)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9045  -0.5541   0.1772   0.5418   2.2437  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       0.296493   0.134091   2.211  0.02703 *  
## polviews_mExtremely Liberal       2.372950   0.525045   4.520 6.20e-06 ***
## polviews_mLiberal                 2.600031   0.356666   7.290 3.10e-13 ***
## polviews_mSlightly Liberal        1.293172   0.248435   5.205 1.94e-07 ***
## polviews_mSlightly Conservative  -1.355277   0.181291  -7.476 7.68e-14 ***
## polviews_mConservative           -2.347463   0.200384 -11.715  < 2e-16 ***
## polviews_mExtremely Conservative -2.727384   0.387210  -7.044 1.87e-12 ***
## sexFemale                         0.254866   0.145370   1.753  0.07956 .  
## raceBlack                         3.849526   0.501319   7.679 1.61e-14 ***
## raceOther                        -0.002143   0.435763  -0.005  0.99608    
## sexFemale:raceBlack              -0.197506   0.660066  -0.299  0.76477    
## sexFemale:raceOther               1.574829   0.587657   2.680  0.00737 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2247.9  on 1697  degrees of freedom
## Residual deviance: 1345.9  on 1686  degrees of freedom
##   (1169 observations deleted due to missingness)
## AIC: 1369.9
## 
## Number of Fisher Scoring iterations: 6

bo_m <- margins(out_bo)
summary(bo_m)

##                            factor     AME     SE        z      p   lower
##            polviews_mConservative -0.4119 0.0283 -14.5394 0.0000 -0.4674
##  polviews_mExtremely Conservative -0.4538 0.0420 -10.7971 0.0000 -0.5361
##       polviews_mExtremely Liberal  0.2681 0.0295   9.0996 0.0000  0.2103
##                 polviews_mLiberal  0.2768 0.0229  12.0736 0.0000  0.2319
##   polviews_mSlightly Conservative -0.2658 0.0330  -8.0596 0.0000 -0.3304
##        polviews_mSlightly Liberal  0.1933 0.0303   6.3896 0.0000  0.1340
##                         raceBlack  0.4032 0.0173  23.3568 0.0000  0.3694
##                         raceOther  0.1247 0.0386   3.2297 0.0012  0.0490
##                         sexFemale  0.0443 0.0177   2.5073 0.0122  0.0097
##    upper
##  -0.3564
##  -0.3714
##   0.3258
##   0.3218
##  -0.2011
##   0.2526
##   0.4371
##   0.2005
##   0.0789

The margins library comes with several plot methods of its own. If you wish, at this point you can just try plot(bo_m) to see a plot of the average marginal effects, produced with the general look of a Stata graphic. Other plot methods in the margins library include cplot(), which visualizes marginal effects conditional on a second variable, and image(), which shows predictions or marginal effects as a filled heatmap or contour plot.

bo_gg <- as_tibble(summary(bo_m))
prefixes <- c("polviews_m", "sex")
bo_gg$factor <- prefix_strip(bo_gg$factor, prefixes)
bo_gg$factor <- prefix_replace(bo_gg$factor, "race", "Race: ")

bo_gg %>% select(factor, AME, lower, upper)

## # A tibble: 9 x 4
##   factor                     AME    lower   upper
##   <chr>                    <dbl>    <dbl>   <dbl>
## 1 Conservative           -0.412  -0.467   -0.356 
## 2 Extremely Conservative -0.454  -0.536   -0.371 
## 3 Extremely Liberal       0.268   0.210    0.326 
## 4 Liberal                 0.277   0.232    0.322 
## 5 Slightly Conservative  -0.266  -0.330   -0.201 
## 6 Slightly Liberal        0.193   0.134    0.253 
## 7 Race: Black             0.403   0.369    0.437 
## 8 Race: Other             0.125   0.0490   0.200 
## 9 Female                  0.0443  0.00967  0.0789

p <- ggplot(data = bo_gg, aes(
  x = reorder(factor, AME),
  y = AME,
  ymin = lower,
  ymax = upper
))

p + geom_hline(yintercept = 0, color = "gray80") +
  geom_pointrange() + coord_flip() +
  labs(x = NULL, y = "Average Marginal Effect")

pv_cp <- cplot(out_bo, x = "sex", draw = FALSE)

##    xvals     yvals     upper     lower
## 1   Male 0.5735849 0.6378653 0.5093045
## 2 Female 0.6344507 0.6887845 0.5801169

p <- ggplot(data = pv_cp, aes(
  x = reorder(xvals, yvals),
  y = yvals,
  ymin = lower,
  ymax = upper
))

p + geom_hline(yintercept = 0, color = "gray80") +
  geom_pointrange() + coord_flip() +
  labs(x = NULL, y = "Conditional Effect")

Plots for Surveys

library(survey)

## Loading required package: grid

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## 
## Attaching package: 'survey'

## The following object is masked from 'package:graphics':
## 
##     dotchart

library(srvyr)

## 
## Attaching package: 'srvyr'

## The following object is masked from 'package:stats':
## 
##     filter

options(survey.lonely.psu = "adjust")
options(na.action = "na.pass")

gss_wt <- subset(gss_lon, year > 1974) %>%
  mutate(stratvar = interaction(year, vstrat)) %>%
  as_survey_design(
    ids = vpsu,
    strata = stratvar,
    weights = wtssall,
    nest = TRUE
  )

out_grp <- gss_wt %>%
  filter(year %in% seq(1976, 2016, by = 4)) %>%
  group_by(year, race, degree) %>%
  summarize(prop = survey_mean(na.rm = TRUE)) # calculate  properly weighted survey means

## Warning: Factor `degree` contains implicit NA, consider using
## `forcats::fct_explicit_na`

out_mrg <- gss_wt %>%
  filter(year %in% seq(1976, 2016, by = 4)) %>%
  mutate(racedeg = interaction(race, degree)) %>%
  group_by(year, racedeg) %>%
  summarize(prop = survey_mean(na.rm = TRUE))

## Warning: Factor `racedeg` contains implicit NA, consider using
## `forcats::fct_explicit_na`

out_mrg <- gss_wt %>%
  filter(year %in% seq(1976, 2016, by = 4)) %>%
  mutate(racedeg = interaction(race, degree)) %>%
  group_by(year, racedeg) %>%
  summarize(prop = survey_mean(na.rm = TRUE)) %>%
  separate(racedeg, sep = "\\.", into = c("race", "degree"))

## Warning: Factor `racedeg` contains implicit NA, consider using
## `forcats::fct_explicit_na`

p <- ggplot(
  data = subset(out_grp, race %nin% "Other"),
  mapping = aes(
    x = degree,
    y = prop,
    ymin = prop - 2 * prop_se,
    ymax = prop + 2 * prop_se,
    fill = race,
    color = race,
    group = race
  )
)

dodge <- position_dodge(width = 0.9)

p + geom_col(position = dodge, alpha = 0.2) +
  geom_errorbar(position = dodge, width = 0.2) +
  scale_x_discrete(labels = scales::wrap_format(10)) +
  scale_y_continuous(labels = scales::percent) +
  scale_color_brewer(type = "qual", palette = "Dark2") +
  scale_fill_brewer(type = "qual", palette = "Dark2") +
  labs(
    title = "Educational Attainment by Race",
    subtitle = "GSS 1976-2016",
    fill = "Race",
    color = "Race",
    x = NULL,
    y = "Percent"
  ) +
  facet_wrap( ~ year, ncol = 2) +
  theme(legend.position = "top")

## Warning: Removed 13 rows containing missing values (geom_col).

## Warning: Removed 13 rows containing missing values (geom_errorbar).

p <- ggplot(
  data = subset(out_grp, race %nin% "Other"),
  mapping = aes(
    x = year,
    y = prop,
    ymin = prop - 2 * prop_se,
    ymax = prop + 2 * prop_se,
    fill = race,
    color = race,
    group = race
  )
)

p + geom_ribbon(alpha = 0.3, aes(color = NULL)) + #Use ribbon to show the error range
  geom_line() + #Use line to show a time trend
  facet_wrap( ~ degree, ncol = 1) +
  scale_y_continuous(labels = scales::percent) +
  scale_color_brewer(type = "qual", palette = "Dark2") +
  scale_fill_brewer(type = "qual", palette = "Dark2") +
  labs(
    title = "Educational Attainment by Race",
    subtitle = "GSS 1976-2016",
    fill = "Race",
    color = "Race",
    x = NULL,
    y = "Percent"
  ) +
  theme(legend.position = "top")

## Warning: Removed 13 rows containing missing values (geom_path).

Other useful packages: infer, GGally

Data Visualization Chapter 2-4

Thu, 26 Sep 2019 00:00:00 +0000

Chapter 2

geom_point

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point()

Chapter 3

geom_smooth

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() + geom_smooth()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

scale_x_log10

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() + geom_smooth(method = "gam") + scale_x_log10()

scales::dollar

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() +
  geom_smooth(method = "gam") +
  scale_x_log10(labels = scales::dollar)

Wrong way to set color

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp,
color = "purple"))
p + geom_point() + geom_smooth(method = "loess") + scale_x_log10()

The aes() function is for mappings only. Do not use it to set a property to a particular value. To set a property, do it in the geom_ function you are using, outside the mapping = aes(...) step.

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(color = "purple") + geom_smooth(method = "loess") + scale_x_log10()

The various geom_ functions can take many other arguments that will affect how the plot looks but do not involve mapping variables to aesthetic elements. “alpha” is an aesthetic property that points (and some other plot elements) have, and to which variables can be mapped. It controls how transparent the object will appear when drawn. It’s measured on a scale of zero to one.

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(alpha = 0.3) + geom_smooth(color = "orange", se = FALSE,
                                          size = 8, method = "lm") + scale_x_log10()

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y=lifeExp))
p + geom_point(alpha = 0.3) +
  geom_smooth(method = "gam") +
  scale_x_log10(labels = scales::dollar) +
  labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
       title = "Economic Growth and Life Expectancy",
       subtitle = "Data points are country-years",
       caption = "Source: Gapminder.")

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp,
                                            color = continent))
p + geom_point() + geom_smooth(method = "loess") + scale_x_log10()

The color of the standard error ribbon is controlled by the fill aesthetic.

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp,
                                            color = continent, fill = continent))
p + geom_point() + geom_smooth(method = "loess") + scale_x_log10()

Aesthetics Can Be Mapped per Geom

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(mapping = aes(color = factor(year))) + 
  geom_smooth(method = "loess") +
  scale_x_log10()

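Order doesn’t matter for the mappings: scale functions can be added before or after the geoms with the same result. Geoms are drawn in the order they are added, so later layers sit on top of earlier ones. Besides scale_x_log10(), you can try scale_x_sqrt() and scale_x_reverse().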

p <- ggplot(data = gapminder, mapping = aes(x = pop, y = lifeExp))
p + geom_smooth(method = "loess") + 
  geom_point(mapping = aes(color = continent)) + 
  scale_x_reverse(labels = scales::number)

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(mapping = aes(color = log(pop))) + scale_x_log10()

Save plots

p_out <-  p + geom_point() + geom_smooth(method = "loess") + scale_x_log10()
ggsave("my_figure.pdf", plot = p_out)

Chapter 4

Group data and the “Group” Aesthetic

p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line()

Use the group aesthetic to tell ggplot explicitly about this country-level structure.

p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line(aes(group = country))

Facet to make small multiples

Use facet_wrap() to split the plot by continent.

p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line(aes(group = country)) + facet_wrap(~continent)

Add further enhancements

p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line(color="gray70", aes(group = country)) + 
  geom_smooth(size= 1.1, method = "loess", se = FALSE) +
  scale_y_log10(labels=scales::dollar) +
  facet_wrap(~continent , ncol = 5) +
  labs(x = "Year",
       y = "GDP per capita on Five Continents")

Use facet_grid

p <- ggplot(data = gss_sm, mapping = aes(x = age, y = childs))
p + geom_point(alpha = 0.2) +
  geom_smooth() + 
  facet_grid(sex ~ race)

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## Warning: Removed 18 rows containing non-finite values (stat_smooth).

## Warning: Removed 18 rows containing missing values (geom_point).

p <- ggplot(data = gss_sm, mapping = aes(x = age, y = childs))
p + geom_point(alpha = 0.2) +
  geom_smooth() + 
  facet_grid(sex ~ race + degree)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Warning: Removed 18 rows containing non-finite values (stat_smooth).

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 62.87

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 2.13

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 582.26

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : span too small.
## fewer data values than degrees of freedom.

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
## at 62.87

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 2.13

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal
## condition number 0

## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other
## near singularities as well. 582.26

## Warning: Removed 18 rows containing missing values (geom_point).

Geoms can transform data

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion))
p + geom_bar()

geom_bar() calls the default stat_ function associated with it, stat_count().

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion))
p + geom_bar(mapping = aes(y = ..prop..))

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion))
p + geom_bar(mapping = aes(y = ..prop.., group = 1))
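A rough base-R sketch of why group = 1 is needed. It assumes, as a simplification, that the prop statistic divides each bar's count by the total count of the group the bar belongs to. The region vector is made-up illustrative data, not gss_sm.

```r
# Made-up illustrative data (not gss_sm): region of ten respondents
region <- c("Northeast", "Midwest", "Midwest", "South", "South",
            "South", "South", "West", "West", "West")

counts <- table(region)

# Without group = 1, each x value is its own group, so every
# bar's count is divided by itself and the proportion is 1
prop_by_bar <- counts / counts

# With group = 1, all rows form a single group, so each count
# is divided by the overall total
prop_overall <- counts / sum(counts)

print(prop_by_bar)
print(prop_overall)
```

This is why the first plot shows every bar at height one, while the version with group = 1 shows each region's share of all respondents.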

table(gss_sm$religion)

## 
## Protestant   Catholic     Jewish       None      Other 
##       1371        649         51        619        159

p <- ggplot(data = gss_sm, mapping = aes(x = religion, color = religion))
p + geom_bar()

p <- ggplot(data = gss_sm, mapping = aes(x = religion, fill = religion))
p + geom_bar() + guides(fill = FALSE)

p + geom_bar()

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar()

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "fill")

If you want separate bars, use position = "dodge":

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "dodge", mapping = aes(y = ..prop..,
                                               group = religion))

However, with group = religion the proportions don’t sum to one within each region. They sum to one across regions for each religion.

p <- ggplot(data = gss_sm, mapping = aes(x = religion))
p + geom_bar(position = "dodge", mapping = aes(y = ..prop..,
                                               group = bigregion)) +
  facet_wrap(~bigregion, ncol=1)
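The difference between the two groupings can be checked numerically. The sketch below uses base R's prop.table() on a small made-up table of counts (not the gss_sm data). The margin argument plays the role that the group aesthetic plays in the plots.

```r
# Made-up counts for illustration (not the gss_sm data)
tab <- matrix(c(30, 10,
                20, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(religion = c("Protestant", "Catholic"),
                              bigregion = c("Midwest", "South")))

# Like group = religion: each religion's counts are divided by that
# religion's total, so proportions sum to one across regions
by_religion <- prop.table(tab, margin = 1)

# Like group = bigregion: each region's counts are divided by that
# region's total, so proportions sum to one within a region
by_region <- prop.table(tab, margin = 2)

print(rowSums(by_religion))  # one per religion
print(colSums(by_region))    # one per region
print(colSums(by_religion))  # within-region sums under group = religion
```

The last line shows that the within-region sums are generally not one when the grouping variable is religion.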

Histograms and Density Plots

p <- ggplot(data = midwest, mapping = aes( x = area))
p + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p <- ggplot(data = midwest, mapping = aes( x = area))
p + geom_histogram(bins = 10)

oh_wi <- c("OH", "WI")
p <- ggplot(data = subset(midwest, subset = state %in% oh_wi),
            mapping = aes(x = percollege, fill = state))
p + geom_histogram(alpha = 0.4, bins = 20)

p <- ggplot(data = midwest, mapping = aes( x = area))
p + geom_density()

p <- ggplot(data = midwest, mapping = aes( x = area, fill = state,
                                           color = state))
p + geom_density(alpha = 0.3)

Avoid Transformations When Necessary

p <- ggplot(data = titanic, mapping = aes(x = fate, y = percent,
                                          fill = sex))
p + geom_bar(position = "dodge", stat = "identity") + theme(legend.position = "top")

p <- ggplot(data = oecd_sum,
            mapping = aes(x = year, y = diff, fill = hi_lo))
p + geom_col() + guides(fill = FALSE) + 
  labs(x = NULL, y = "Difference in Years",
       title = "The US Life Expectancy Gap",
       subtitle = "Difference between US and OECD 
       average life expectancies, 1960-2015",
       caption = "Data: OECD. After a chart by Christopher Ingraham,
       Washington Post, December 27th 2017.")

## Warning: Removed 1 rows containing missing values (position_stack).

Data Visualization Chapter 5

Thu, 26 Sep 2019 00:00:00 +0000

Chapter 5

Use Pipes to Summarize Data

rel_by_region <- gss_sm %>%
  group_by(bigregion, religion) %>%
  summarize(N = n()) %>%
  mutate(freq = N / sum(N),
         pct = round((freq*100), 0))

## Warning: Factor `religion` contains implicit NA, consider using
## `forcats::fct_explicit_na`

rel_by_region

## # A tibble: 24 x 5
## # Groups:   bigregion [4]
##    bigregion religion       N    freq   pct
##    <fct>     <fct>      <int>   <dbl> <dbl>
##  1 Northeast Protestant   158 0.324      32
##  2 Northeast Catholic     162 0.332      33
##  3 Northeast Jewish        27 0.0553      6
##  4 Northeast None         112 0.230      23
##  5 Northeast Other         28 0.0574      6
##  6 Northeast <NA>           1 0.00205     0
##  7 Midwest   Protestant   325 0.468      47
##  8 Midwest   Catholic     172 0.247      25
##  9 Midwest   Jewish         3 0.00432     0
## 10 Midwest   None         157 0.226      23
## # … with 14 more rows

rel_by_region %>% group_by(bigregion) %>% summarize(total = sum(pct))

## # A tibble: 4 x 2
##   bigregion total
##   <fct>     <dbl>
## 1 Northeast   100
## 2 Midwest     101
## 3 South       100
## 4 West        101
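The totals of 101 are a rounding artifact: each percentage is rounded on its own before the sum is taken. A minimal sketch, using hypothetical shares chosen for illustration rather than the actual gss_sm values:

```r
# Hypothetical shares for one region; they sum to one
freq <- c(0.466, 0.247, 0.004, 0.226, 0.057)

# Round each share separately, as in the mutate() step
pct <- round(freq * 100, 0)

# The rounded values need not add up to 100
total_rounded_first <- sum(pct)

# Rounding after summing does give 100
total_summed_first <- round(sum(freq) * 100, 0)

print(pct)
print(c(rounded_first = total_rounded_first,
        summed_first = total_summed_first))
```

A small discrepancy like this is expected whenever percentages are rounded before they are added.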

p <- ggplot(rel_by_region, aes(x = bigregion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
  labs(x = "Region",y = "Percent", fill = "Religion") +
  theme(legend.position = "top")

Use coord_flip()

p <- ggplot(rel_by_region, aes(x = bigregion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
  labs(x = "Region",y = "Percent", fill = "Religion") +
  guides(fill = FALSE) + 
  coord_flip() + 
  facet_grid(~ bigregion)

p <- ggplot(rel_by_region, aes(x = religion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
  labs(x = NULL,y = "Percent", fill = "Religion") +
  guides(fill = FALSE) + 
  coord_flip() + 
  facet_grid(~ bigregion)

Continuous Variables by Group or Category

p <- ggplot(data = organdata, mapping = aes(x = year, y = donors))
p + geom_line(aes(group = country)) + facet_wrap(~country)

## Warning: Removed 34 rows containing missing values (geom_path).

p <- ggplot(data = organdata, mapping = aes(x = country, y = donors))
p + geom_boxplot() + coord_flip()

## Warning: Removed 34 rows containing non-finite values (stat_boxplot).

p <- ggplot(data = organdata,
            mapping = aes(x = reorder(country, donors, na.rm = TRUE),
                          y = donors))
p + geom_boxplot() + labs(x = NULL) + coord_flip()

## Warning: Removed 34 rows containing non-finite values (stat_boxplot).
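What reorder() does to the factor levels can be seen without a plot. The sketch below uses made-up data: the country and donors vectors are illustrative, not organdata. By default reorder() orders the levels by the mean of the second argument, and extra arguments such as na.rm = TRUE are passed on to that summary function.

```r
# Made-up data: three countries, one missing donor value
country <- factor(c("A", "A", "B", "B", "C", "C"))
donors  <- c(10, 12, 5, NA, 20, 22)

# Levels are sorted by the group means; na.rm = TRUE is passed
# on to mean(), so B is scored by its one observed value
ordered_country <- reorder(country, donors, na.rm = TRUE)
print(levels(ordered_country))

# Without na.rm = TRUE the score for B is NA
no_narm <- reorder(country, donors)
print(attr(no_narm, "scores"))
```

Leaving out na.rm = TRUE therefore affects every group that has a missing value, which is why the plots above pass it explicitly.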

p <- ggplot(data = organdata, mapping = aes(x = reorder(country, donors, na.rm = TRUE), 
                                            y = donors, fill = world))
p + geom_boxplot() + labs(x = NULL) + 
  coord_flip() + theme(legend.position = "top")

## Warning: Removed 34 rows containing non-finite values (stat_boxplot).

p <- ggplot(data = organdata, mapping = aes(x = reorder(country, donors, na.rm = TRUE), 
                                            y = donors, color = world))
p + geom_point() + labs(x = NULL) + 
  coord_flip() + theme(legend.position = "top")

## Warning: Removed 34 rows containing missing values (geom_point).

p <-
  ggplot(data = organdata,
         mapping = aes(
           x = reorder(country, donors, na.rm = TRUE),
           y = donors,
           color = world
         ))
p + geom_jitter() + labs(x = NULL) +
  coord_flip() + theme(legend.position = "top")

## Warning: Removed 34 rows containing missing values (geom_point).

p <-
  ggplot(data = organdata,
         mapping = aes(
           x = reorder(country, donors, na.rm = TRUE),
           y = donors,
           color = world
         ))
p + geom_jitter(position = position_jitter(width = 0.15)) + labs(x = NULL) +
  coord_flip() + theme(legend.position = "top")

## Warning: Removed 34 rows containing missing values (geom_point).

by_country <-
  organdata %>% group_by(consent_law, country) %>% summarize(
    donors_mean = mean(donors, na.rm = TRUE),
    donors_sd = sd(donors, na.rm = TRUE),
    gdp_mean = mean(gdp, na.rm = TRUE),
    health_mean = mean(health, na.rm = TRUE),
    roads_mean = mean(roads, na.rm = TRUE),
    cerebvas_mean = mean(cerebvas, na.rm = TRUE)
  )

by_country

## # A tibble: 17 x 8
## # Groups:   consent_law [2]
##    consent_law country donors_mean donors_sd gdp_mean health_mean
##    <chr>       <chr>         <dbl>     <dbl>    <dbl>       <dbl>
##  1 Informed    Austra…        10.6     1.14    22179.       1958.
##  2 Informed    Canada         14.0     0.751   23711.       2272.
##  3 Informed    Denmark        13.1     1.47    23722.       2054.
##  4 Informed    Germany        13.0     0.611   22163.       2349.
##  5 Informed    Ireland        19.8     2.48    20824.       1480.
##  6 Informed    Nether…        13.7     1.55    23013.       1993.
##  7 Informed    United…        13.5     0.775   21359.       1561.
##  8 Informed    United…        20.0     1.33    29212.       3988.
##  9 Presumed    Austria        23.5     2.42    23876.       1875.
## 10 Presumed    Belgium        21.9     1.94    22500.       1958.
## 11 Presumed    Finland        18.4     1.53    21019.       1615.
## 12 Presumed    France         16.8     1.60    22603.       2160.
## 13 Presumed    Italy          11.1     4.28    21554.       1757 
## 14 Presumed    Norway         15.4     1.11    26448.       2217.
## 15 Presumed    Spain          28.1     4.96    16933        1289.
## 16 Presumed    Sweden         13.1     1.75    22415.       1951.
## 17 Presumed    Switze…        14.2     1.71    27233        2776.
## # … with 2 more variables: roads_mean <dbl>, cerebvas_mean <dbl>

by_country <- organdata %>% group_by(consent_law, country) %>%
  summarize_if(is.numeric, lst(mean, sd), na.rm = TRUE) %>%
  ungroup()
by_country

## # A tibble: 17 x 28
##    consent_law country donors_mean pop_mean pop_dens_mean gdp_mean
##    <chr>       <chr>         <dbl>    <dbl>         <dbl>    <dbl>
##  1 Informed    Austra…        10.6   18318.         0.237   22179.
##  2 Informed    Canada         14.0   29608.         0.297   23711.
##  3 Informed    Denmark        13.1    5257.        12.2     23722.
##  4 Informed    Germany        13.0   80255.        22.5     22163.
##  5 Informed    Ireland        19.8    3674.         5.23    20824.
##  6 Informed    Nether…        13.7   15548.        37.4     23013.
##  7 Informed    United…        13.5   58187.        24.0     21359.
##  8 Informed    United…        20.0  269330.         2.80    29212.
##  9 Presumed    Austria        23.5    7927.         9.45    23876.
## 10 Presumed    Belgium        21.9   10153.        30.7     22500.
## 11 Presumed    Finland        18.4    5112.         1.51    21019.
## 12 Presumed    France         16.8   58056.        10.5     22603.
## 13 Presumed    Italy          11.1   57360.        19.0     21554.
## 14 Presumed    Norway         15.4    4386.         1.35    26448.
## 15 Presumed    Spain          28.1   39666.         7.84    16933 
## 16 Presumed    Sweden         13.1    8789.         1.95    22415.
## 17 Presumed    Switze…        14.2    7037.        17.0     27233 
## # … with 22 more variables: gdp_lag_mean <dbl>, health_mean <dbl>,
## #   health_lag_mean <dbl>, pubhealth_mean <dbl>, roads_mean <dbl>,
## #   cerebvas_mean <dbl>, assault_mean <dbl>, external_mean <dbl>,
## #   txp_pop_mean <dbl>, donors_sd <dbl>, pop_sd <dbl>, pop_dens_sd <dbl>,
## #   gdp_sd <dbl>, gdp_lag_sd <dbl>, health_sd <dbl>, health_lag_sd <dbl>,
## #   pubhealth_sd <dbl>, roads_sd <dbl>, cerebvas_sd <dbl>,
## #   assault_sd <dbl>, external_sd <dbl>, txp_pop_sd <dbl>

p <- ggplot(data = by_country,
            mapping = aes(
              x = donors_mean,
              y = reorder(country, donors_mean),
              color = consent_law
            ))
p + geom_point(size = 3) +
  labs(x = "Donor Procurement Rate",
       y = "", color = "Consent Law") +
  theme(legend.position = "top")

p <- ggplot(data = by_country,
            mapping = aes(x = donors_mean,
                          y = reorder(country, donors_mean)))

p + geom_point(size = 3) +
  facet_wrap( ~ consent_law, scales = "free_y", ncol = 1) +
  labs(x = "Donor Procurement Rate",
       y = "")

p <- ggplot(data = by_country,
            mapping = aes(x = reorder(country,
                                      donors_mean), y = donors_mean))

p + geom_pointrange(mapping = aes(ymin = donors_mean - donors_sd,
                                  ymax = donors_mean + donors_sd)) +
  labs(x = "", y = "Donor Procurement Rate") + coord_flip()

Plot Text Directly

p <- ggplot(data = by_country,
            mapping = aes(x = roads_mean,
                          y = donors_mean))
p + geom_point() + geom_text(mapping = aes(label = country))

p <- ggplot(data = by_country,
            mapping = aes(x = roads_mean,
                          y = donors_mean))
p + geom_point() + geom_text(mapping = aes(label = country), hjust = 0)

geom_text_repel() from ggrepel usually places labels better than geom_text().

library(ggrepel)

p_title <-
  "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <- "1824-2016"
p_caption <- "Data for 2016 are provisional."
x_label <- "Winner's share of Popular Vote"
y_label <- "Winner's share of Electoral College Votes"

p <- ggplot(elections_historic,
            aes(x = popular_pct, y = ec_pct,
                label = winner_label))

p + geom_hline(yintercept = 0.5,
               size = 1.4,
               color = "gray80") +
  geom_vline(xintercept = 0.5,
             size = 1.4,
             color = "gray80") +
  geom_point() +
  geom_text_repel() +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) +
  labs(
    x = x_label,
    y = y_label,
    title = p_title,
    subtitle = p_subtitle,
    caption = p_caption
  )

Label Outliers

p <- ggplot(data = by_country,
            mapping = aes(x = gdp_mean, y = health_mean))

p + geom_point() +
  geom_text_repel(data = subset(by_country, gdp_mean > 25000),
                  mapping = aes(label = country))

p <- ggplot(data = by_country,
            mapping = aes(x = gdp_mean, y = health_mean))

p + geom_point() +
  geom_text_repel(
    data = subset(
      by_country,
      gdp_mean > 25000 | health_mean < 1500 |
        country %in% "Belgium"
    ),
    mapping = aes(label = country)
  )

organdata$ind <- organdata$ccode %in% c("Ita", "Spa") &
  organdata$year > 1998

p <- ggplot(data = organdata,
            mapping = aes(x = roads,
                          y = donors, color = ind))
p + geom_point() +
  geom_text_repel(data = subset(organdata, ind),
                  mapping = aes(label = ccode)) +
  guides(label = FALSE, color = FALSE)

## Warning: Removed 34 rows containing missing values (geom_point).

Write and Draw in the Plot Area

p <- ggplot(data = organdata, mapping = aes(x = roads, y = donors))
p + geom_point() + annotate(
  geom = "text",
  x = 91,
  y = 33,
  label = "A surprisingly high \n recovery rate.",
  hjust = 0
)

## Warning: Removed 34 rows containing missing values (geom_point).

p <- ggplot(data = organdata,
            mapping = aes(x = roads, y = donors))
p + geom_point() +
  annotate(
    geom = "rect",
    xmin = 125,
    xmax = 155,
    ymin = 30,
    ymax = 35,
    fill = "red",
    alpha = 0.2
  ) +
  annotate(
    geom = "text",
    x = 157,
    y = 33,
    label = "A surprisingly high \n recovery rate.",
    hjust = 0
  )

## Warning: Removed 34 rows containing missing values (geom_point).

Scales, Guides, and Themes

p <- ggplot(data = organdata,
            mapping = aes(x = roads,
                          y = donors,
                          color = world))
p + geom_point()

## Warning: Removed 34 rows containing missing values (geom_point).

p <- ggplot(data = organdata,
            mapping = aes(x = roads,
                          y = donors,
                          color = world))
p + geom_point() +
  scale_x_log10() +
  scale_y_continuous(breaks = c(5, 15, 25),
                     labels = c("Five", "Fifteen", "Twenty Five"))

## Warning: Removed 34 rows containing missing values (geom_point).

p <- ggplot(data = organdata,
            mapping = aes(x = roads, y = donors,
                          color = world))
p + geom_point() +
  scale_color_discrete(labels = c("Corporatist", "Liberal",
                                  "Social Democratic", "Unclassified")) +
  labs(x = "Road Deaths",
       y = "Donor Procurement", color = "Welfare State")

## Warning: Removed 34 rows containing missing values (geom_point).

p <- ggplot(data = organdata,
            mapping = aes(x = roads, y = donors,
                          color = world))
p + geom_point() + labs(x = "Road Deaths", y = "Donor Procurement") +
  guides(color = FALSE)

## Warning: Removed 34 rows containing missing values (geom_point).

Meta-Analysis Note 1

Wed, 25 Sep 2019 00:00:00 +0000

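Chapter 1 of the book mainly defines terminology and distinguishes meta-analysis from other kinds of literature review. Meta-analysis differs from qualitative summaries and from quantitative approaches such as informal vote counting (which generally summarizes conclusions by majority rule) and formal vote counting (which adds statistical analysis on top of that in order to reach a statistically significant conclusion). Beyond asking whether an effect exists, meta-analysis focuses mainly on the size of the effect (effect size).

The meta-analysis workflow has five stages:

Formulate a problem

When formulating the problem and starting the review, clarify where the focus lies. For example, decide whether you want a more general conclusion and sample, or a conclusion and sample with a restricted scope. This determines which studies are kept or dropped in stages two and three.

Retrieve the relevant literature

Sample from as many sources as possible so that the literature is representative and unbiased. The latter is hard to achieve because journals prefer to publish papers with significant results, so do not forget unpublished working papers, theses, and dissertations.

Evaluate and select the literature

This stage assesses the relevance of the literature obtained in the previous stage. The research question fixed in stage one is refined further here.

Analyze and interpret the literature

This is the most time-consuming and difficult stage. The data from the literature must be organized and entered.

Write the literature review

A few points to note:

Describe the whole review process completely, and record every inclusion and exclusion decision made along the way.
The key is to answer the question of interest. If it cannot be answered, explain why and what future work would be needed to answer it.
Avoid simply piling up a list of references.

The author's suggestions

Collecting, evaluating, organizing, and analyzing the literature takes a great deal of time and effort, so planning is essential. Some specific suggestions:

Keep records during literature collection.
Store the literature systematically. If you collaborate with others, make sure everyone uses the same system to store and process it.

My suggestions on tools

For literature collection, Mendeley's private group feature is useful: members of the same group can open PDFs and add annotations directly in the client. Mendeley limits the number of private groups for free users, but the limit can be worked around by adding subfolders under a group.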

SEM and GSEM

Wed, 25 Sep 2019 00:00:00 +0000

SEM

sem (bmi <- age children incomeln educ quickfood)

This gives the unstandardized solution. The sem command uses maximum likelihood estimation rather than the ordinary least-squares (OLS) estimation used by the regress command. Add the standardized option to get the standardized solution, just as you would add the beta option to regress.

The method(mlmv) option (maximum likelihood with missing values): Estimation is less robust to violations of the multivariate normality assumption when using the method(mlmv) option than when using maximum likelihood estimation with listwise deletion of observations with missing values. Because some of the five variables in our model are not normally distributed, the method(mlmv) option needs to be used with caution. The estimation performed with the method(mlmv) option also assumes that the missing values are MAR¹. By contrast, listwise deletion assumes that missing values are MCAR¹, which is a much more restrictive assumption.

sem (bmi <- age children incomeln educ quickfood), method(mlmv) standardized

estat eqgof

The OLS regression solution and the SEM solution without MLMV, which uses listwise deletion, are producing the same standardized parameter estimates and $R^2$s. As noted, the z values are slightly larger than the t-values, and the p-values are slightly smaller. The z tests for the SEM solution are directly testing the standardized solution. The regress solution’s t tests are testing the significance of the unstandardized B coefficients and do not directly test the significance of the Betas. The regress command does not provide such a direct test for the significance of Betas.

Notice that the $R^2$ using sem with method(mlmv) is actually slightly smaller. Using all the available information in the SEM solution with MLMV is not cheating if the assumptions are met. The MAR assumption for the SEM solution is more realistic than the MCAR assumption required for listwise deletion to be unbiased.

There are three rules to follow when using the maximum likelihood with missing values estimation.

Generate an indicator variable for each variable in your model to reflect whether an observation has a missing value.
Correlate potential auxiliary variables to see whether they predict missing value indicator variables.
Include additional auxiliary variables that are substantially correlated with a person’s score on a variable that has missing values.

Getting auxiliary variables into your SEM command: I did not understand this part.

GSEM

logit obese age children incomeln educ quickfood
listcoef
glm obese age children incomeln educ quickfood, family(binomial) link(logit)
glm, eform

The logit command is a special application of the generalized linear model. We can obtain the same results by using the glm command. The glm command requires us to specify the family of our model, family(binomial), and the link function, link(logit). To obtain the odds ratio, we can replay these results by using glm, eform.
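The same equivalence can be sketched in R, the language used in the other notes on this site. The data below are made up for illustration (a binary predictor called exposed, not the variables from the model above). With a single binary predictor, the exponentiated glm() coefficient should match the odds ratio computed by hand from the 2 x 2 table.

```r
# Made-up data: binary outcome and binary predictor
#   exposed = 0: 30 not obese, 10 obese
#   exposed = 1: 20 not obese, 20 obese
exposed <- rep(c(0, 0, 1, 1), times = c(30, 10, 20, 20))
obese   <- rep(c(0, 1, 0, 1), times = c(30, 10, 20, 20))

# Logistic regression as a generalized linear model with a
# binomial family and a logit link
fit <- glm(obese ~ exposed, family = binomial(link = "logit"))

# Exponentiating a logit coefficient gives an odds ratio
odds_ratio <- exp(coef(fit))[["exposed"]]

# Odds ratio computed directly from the 2 x 2 table
direct <- (20 / 20) / (10 / 30)

print(c(glm = odds_ratio, table = direct))
```

Exponentiating the coefficients is the R counterpart of replaying the Stata results with glm, eform.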

I did not understand the rest; I will come back to it later.

¹ Missing at Random (MAR): This is where the unfortunate names come in. Missing at Random means the propensity for a data point to be missing is not related to the missing data, but it is related to some of the observed data.

Panel data in R vs in Stata

Tue, 27 Aug 2019 00:00:00 +0000

Panel data with one way fixed effect

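library(plm)

# Year dummies enter the formula; firm (stkcd) effects are removed by the within transformation
mm1 <- invforward ~ TOBINQ + inv + top3 + size + lev + cash + loss + lnage + cfo + sd + ic + factor(year)
zzz <- plm(mm1, data = sample, model = "within", index = c("stkcd"))

Same as Stata's xtreg with i.year and the fe option, without a robust vcetype. The $R^2$ computed this way matches the within $R^2$ that Stata reports.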

m1 <- invforward ~ TOBINQ + inv + top3 + size + lev + cash + loss + lnage + cfo + sd + ic
zz <- plm(m1,data=sample,model="within",index=c("stkcd", "year"),effect = "twoways")
summary(zz)

Same as Stata's xtreg with i.year and the fe option, without a robust vcetype, but the $R^2$ is smaller than the within $R^2$ that Stata reports.

vcetype robust

library(lmtest)  # provides coeftest()

# Same as Stata's xtreg with i.year and the fe and robust options
zz_r <- coeftest(zz, vcov. = function(x) vcovHC(x, type = "sss"))
# OR
zzz_r <- coeftest(zzz, vcov. = function(x) vcovHC(x, type = "sss"))

Comparing coefficients across groups

This works for OLS:

sur_diff <-  MVBV ~ (Dm + Dh + EBV + DmEBV +DhEBV)*g_layer
h2t <- h2 %>%
  filter(g_layer != 2)%>%
  mutate(g_layer = ifelse(g_layer == 1, 0, 1))
mm <- lm(sur_diff,data=h2t)
ttt <-  coeftest(mm, vcov.=function(x) vcovHC(x, cluster="group", type="HC1"))

stargazer(fpm,models_growth_layer,type = "text", column.labels = table4_label)
stargazer(fpm_r,robusts_growth_layer,type = "text", column.labels = table4_label,
          add.lines=c("DhEBV(4)-(2)", str_c(round(ttt[12,1],3),"**(p=",round(ttt[12,4],3),")")))

This does not work for panel data, with either one-way or two-way fixed effects. Adding the interaction terms directly is recommended instead.

Difference in Difference

Wed, 10 Jul 2019 00:00:00 +0000

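Effect evaluation model

“Does raising the minimum wage reduce employment?”

“Does a higher minimum wage reduce the number of full-time restaurant employees?”

Let $MinWage$ be a dummy variable indicating that the minimum wage was raised, and let $FEmp$ be the number of full-time restaurant employees.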

\[ FEmp_i=FEmp_{0,i}+\beta^*MinWage_i \]

\[ FEmp_i=\beta_0+\beta_1 MinWage_i+\epsilon_i \]

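OLS is a consistent estimator when $FEmp_{0,i}$, the number of employees in the absence of a minimum wage increase, is unrelated to whether the restaurant is affected by the increase.

Let $s$ denote the state in which the restaurant is located. The original effect model can then be written as:

\[ \begin{eqnarray} FEmp_{is}=FEmp_{0,is}+\beta^*MinWage_{s} \tag{7.1} \end{eqnarray} \]

	Pre	Post
Control	$MinWage=0$:PA	$MinWage=0$:PA
Treatment	$MinWage=0$:NJ	$MinWage=1$:NJ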

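Multiple regression model

The type of restaurant (large chain, café, snack bar, and so on) affects how many employees it hires.

\[ \begin{eqnarray} FEmp_{is} =FEmp_{0,-type,is}+\beta^*MinWage_s+\gamma'type_{is} \tag{7.2} \end{eqnarray} \]

where $ FEmp_{0,-type,is}=FEmp_{0,is}-\mathbb{E}(FEmp_{0,is}|type_{is}) $.

When thinking about omitted variable bias, every potential confounder must be considered at the aggregate level (aggregated by treatment and control group).

Fixed effects

Group fixed effects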

\[ FEmp_{is}=FEmp_{0,is}+\beta^*MinWage_{s} \]

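In most cases the treatment and control groups already differ in their characteristics before the policy takes effect, that is, $ FEmp_{0,is}=FEmp_{0,-\alpha_s,is}+\alpha_s $ where $\alpha_s$ is the group-specific confounder effect.

If there are no other confounders, we can estimate the following regression model: $ FEmp_{ist}=\alpha_s+\beta^* MinWage_{st}+\epsilon_{ist} $

Time fixed effects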

\[ FEmp_{ist}=FEmp_{0,-(\alpha_s,\delta_t),ist}+\alpha_s+\delta_t+\beta^*MinWage_{st} \]

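The corresponding regression model is: $ FEmp_{ist}=\alpha_s+\delta_t+\beta^* MinWage_{st}+\epsilon_{ist} $

Tracking vs. not tracking units

Although $FEmp_{ist}$ is measured at the level of the individual restaurant (subscript $i$), the fixed effects only go down to the group level (subscript $s$). Estimation therefore does not require tracking the same restaurants over time: the restaurants sampled in each period can differ.

DiD estimator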

\[ \begin{eqnarray} FEmp_{ist}=\alpha_s+\delta_t+\beta^*MinWage_{st}+\epsilon_{ist} \tag{7.3} \end{eqnarray} \]

\[ FEmp_{ist}=\beta_0+\alpha_1D1_s+\delta_1B1_t+\beta_1MinWage_{st}+\epsilon_{ist} \]

Let $D1=1$ be the dummy variable for observations from state 1 (NJ).
Let $B1 = 1$ be the dummy variable for the period after the policy takes effect.
$MinWage_{st}=D1_s\times B1_t$

State	t=0	t=1
NJ	D1=1,B1=0	D1=1,B1=1
PA	D1=0,B1=0	D1=0,B1=1
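With two states and two periods, the coefficients of the dummy-variable regression can be read off the four cell means. The sketch below illustrates this in Python; the employment counts are made up for illustration and are not from any dataset, and the variable names are assumptions.

```python
from statistics import mean

# Hypothetical full-time employment counts per restaurant (illustrative only).
# Keys are (state, period); period 0 is before the policy, period 1 is after.
# The restaurants sampled in each period do not have to be the same ones.
femp = {
    ("NJ", 0): [20.0, 22.0, 18.0],
    ("NJ", 1): [21.0, 23.0, 19.0],
    ("PA", 0): [24.0, 26.0, 22.0],
    ("PA", 1): [22.0, 24.0, 20.0],
}

cell_mean = {cell: mean(values) for cell, values in femp.items()}

# Change over time within each state.
change_nj = cell_mean[("NJ", 1)] - cell_mean[("NJ", 0)]
change_pa = cell_mean[("PA", 1)] - cell_mean[("PA", 0)]

# DiD estimate: the coefficient on MinWage = D1 * B1.
did = change_nj - change_pa

# Remaining coefficients of the saturated regression, from the same cell means.
beta0 = cell_mean[("PA", 0)]                           # intercept: PA before
alpha1 = cell_mean[("NJ", 0)] - cell_mean[("PA", 0)]   # D1: NJ vs PA, before
delta1 = change_pa                                     # B1: time change in PA

print("DiD estimate:", did)
```

Because the model is saturated (four parameters for four cells), the fitted value for each cell equals its sample mean, which is why the subtraction of mean differences and the regression coefficient on $MinWage_{st}$ coincide.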

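Clustered standard errors

There are four groups, G1 to G4, whose error variances and cross-group covariances need attention. When the error terms may be clustered, the standard errors of the estimator have to be adjusted accordingly.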

Panel Data

Wed, 10 Jul 2019 00:00:00 +0000

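Effect evaluation model

\[ mrall=mrall_{-BeerTax}+\beta^*BeerTax \]

Does raising the beer tax (BeerTax) help reduce the traffic fatality rate (mrall)?

Fixed effects model

Let $W$ denote "how much a state likes to drink".

$W$ is related to $mrall_{-BeerTax}$
$W$ is related to $BeerTax$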

\[ mrall=(mrall_{-BT}-\mathbb{E}(mrall_{-BT}|W))+\mathbb{E}(mrall_{-BT}|W) + \beta^*BeerTax \]

\[ mrall_{-BT,-W}\equiv mrall_{-BT}-\mathbb{E}(mrall_{-BT}|W) \]

\[ mrall=mrall_{-BT,-W}+\mathbb{E}(mrall_{-BT}|W)+\beta^*BeerTax \]

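$mrall_{-BT,-W}$ is the "non-beer-tax causes of traffic fatalities" with the influence of $W$ removed:

It is unrelated to $W$.
If two observations share the same drinking culture, that is, the same $W$, their $\mathbb{E}(mrall_{-BT}|W)$ is the same.

Assume that a place's drinking culture does not change over time, so that the same state has the same $W$ at different points in time.

Let $\mathbb{E}(mrall_{-BT,it}|W_i)=\alpha_i$. The effect model can then be written as: $ mrall_{it}=mrall_{-BT,-W,it}+\alpha_i+\beta^*BeerTax_{it} $ where $\alpha_i$ is the fixed effect of state $i$:

$BeerTax$ is unrelated to $mrall_{-BT,-W}$
$BeerTax$ is related to $\alpha$

Within-group least squares

Differenced OLS gets around the obstacle that $\alpha_i$ is unobservable

\[ mrall_{i1}-mrall_{i0}=\beta^* (BeerTax_{i1}-BeerTax_{i0})+(mrall_{-BT,-W,i1}-mrall_{-BT,-W,i0}) \]

If $t$ covers more than two periods, use the within-group mean as the reference point for differencing.

That is, use $x_1-\bar{x},x_2-\bar{x},...,x_n-\bar{x}$ with $\bar{x}=\sum_{i=1}^n x_i/n$:

\[ \begin{align} \bar{mrall}_i&=\sum_{t=1}^T mrall_{it}/T \\ \bar{BeerTax}_i&=\sum_{t=1}^T BeerTax_{it}/T\\ \bar{mrall}_{-BT,-W,i}&=\sum_{t=1}^T mrall_{-BT,-W,it}/T \end{align} \]

\[ mrall_{it}-\bar{mrall}_i=\beta^*\left( BeerTax_{it}-\bar{BeerTax}_i\right)+(mrall_{-BT,-W,it}-\bar{mrall}_{-BT,-W,i}) \]

Under the fixed effects model, we can estimate the following regression by least squares: $ mrall_{it}-\bar{mrall}_i=\beta_0+\beta_1\left( BeerTax_{it}-\bar{BeerTax}_i\right)+\epsilon_{it} $ where $\hat{\beta}_1$ is a consistent estimate of $\beta^*$.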
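The within transformation can be sketched in a few lines of Python. The panel below is simulated under stated assumptions (three hypothetical states, no noise, a true effect of -0.5, and state effects that rise with the tax level), so it only illustrates the mechanics and says nothing about real beer tax data.

```python
from statistics import mean

# Hypothetical noise-free panel: mrall = alpha_i + beta * beer_tax.
# States with a larger fixed effect alpha_i also have higher taxes,
# so alpha_i acts as a confounder.
true_beta = -0.5
alpha = {"A": 1.0, "B": 3.0, "C": 5.0}
taxes = {"A": [0.2, 0.4, 0.6], "B": [1.0, 1.2, 1.4], "C": [2.0, 2.2, 2.4]}
panel = {s: [(x, alpha[s] + true_beta * x) for x in taxes[s]] for s in alpha}


def slope_through_origin(xs, ys):
    """Least squares slope for a regression without intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def ols_slope(xs, ys):
    """Least squares slope with intercept (demean everything first)."""
    xbar, ybar = mean(xs), mean(ys)
    return slope_through_origin([x - xbar for x in xs], [y - ybar for y in ys])


# Pooled OLS ignores alpha_i and picks up the confounding.
pooled_x = [x for rows in panel.values() for x, _ in rows]
pooled_y = [y for rows in panel.values() for _, y in rows]
beta_pooled = ols_slope(pooled_x, pooled_y)

# Within estimator: subtract each state's own time average, then regress.
within_x, within_y = [], []
for rows in panel.values():
    xbar = mean(x for x, _ in rows)
    ybar = mean(y for _, y in rows)
    within_x += [x - xbar for x, _ in rows]
    within_y += [y - ybar for _, y in rows]
beta_within = slope_through_origin(within_x, within_y)

print("pooled:", beta_pooled, "within:", beta_within)
```

In this constructed example the pooled slope has the wrong sign, while demeaning by state removes $\alpha_i$ and recovers the assumed effect. After demeaning, the intercept $\beta_0$ is zero by construction, which is why a slope through the origin is enough.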

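Common fixed effects models

Identity fixed effect: $\alpha_i$
Time fixed effect: $\delta_t$

\[ mrall_{-BT,it}=mrall_{-BT,-W_i,-Z_t}+\alpha_i+\delta_t \]

$W_i$ is a variable that biases the estimate of the effect coefficient and is constant along the $i$ dimension.
$Z_t$ is a variable that biases the estimate of the effect coefficient and is constant along the $t$ dimension.

For example, $Z_t$ could be the state of the US economy as a whole.

The corresponding regression model: $ mrall_{it}=\alpha_i+\delta_t+\beta_1 BeerTax_{it}+\epsilon_{it} $

Generalized fixed effects model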

\[ mrall=mrall_{-BeerTax}+\beta^*BeerTax \]

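but $ \begin{equation} mrall_{-BT,it}\not\perp BeerTax_{it} \tag{5.1} \end{equation} $

Multiple regression control

First consider which variables give rise to (5.1); in statistics these are called confounders. Confounders for which data are available (call them $Z$) can be used to extend the model to: $ mrall_{it}=mrall_{-BT,-Z,it}+\beta^*BeerTax_{it}+\gamma'Z_{it} $ where: $ mrall_{-BT,-Z}=mrall_{-BT}-\mathbb{E}(mrall_{-BT}|Z) $

Fixed effects model

Confounders for which there are no data, but which are fixed along some dimension, are assumed to fall into two classes:

$W_i$: fixed within the same identity.
$V_t$: fixed within the same time period.

\[ \begin{eqnarray} mrall_{it}=mrall_{-BT,-(Z,W,V),it}+\beta^*BeerTax_{it}+\\ \alpha_i+\delta_t+\gamma'Z_{it} \tag{5.2} \end{eqnarray} \]

(5.2) is a fairly general fixed effects model: it has fixed effects along two dimensions plus control variables.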

Random effects model

\[ mrall_{it}=mrall_{-BT,-Z,it}+\beta^*BeerTax_{it}+\gamma'Z_{it} \]

Setup of the random effects (RE) model:

Use the regression model:

\[ \begin{eqnarray} mrall_{it}=\beta_0+\beta_{1}BeerTax_{it}+\gamma'Z_{it}+\nu_{it} \tag{5.3} \end{eqnarray} \]

Assume that $\nu_{it}$ has a particular structure, $\nu_{it}=\alpha_i+\epsilon_{it}$.

The assumptions are:

$\nu_{it}\perp BeerTax_{it}$
$var(\alpha_i|X)=\sigma_{\alpha}^2$
$var(\epsilon_{it}|X)=\sigma^2$
$cov(\epsilon_{it},\epsilon_{is}|X)=0$

The random effects model rests on strong assumptions about the error term, so it is not recommended.

Hausman test

Fixed effects model (FE)

Use within-group least squares to estimate $\beta_1$ in the following regression model: $ mrall_{it}=\beta_0+\beta_{1}BeerTax_{it}+\gamma'Z_{it}+\alpha_i+\epsilon_{it} $

Random effects model (RE)

Use GLS to estimate $\beta_1$ in the following regression model: $ mrall_{it}=\beta_0+\beta_{1}BeerTax_{it}+\gamma'Z_{it}+\nu_{it} $

$\nu_{it}=\alpha_i+\epsilon_{it}$

Assumptions

The RE assumptions about variances and covariances all hold.
$\epsilon_{it} \perp BeerTax_{it} | \alpha_i,Z_{it}$

H0: $\alpha_i \perp BeerTax_{it} |Z_{it}$

Under H0 the RE model is appropriate; if H0 is rejected, use FE.

Linear Regression

Thu, 04 Jul 2019 00:00:00 +0000

OLS estimator

The method to compute (or estimate) $b_0$ and $b_1$ we illustrated above is called Ordinary Least Squares, or OLS. $b_0$ and $b_1$ are therefore also often called the OLS coefficients. By solving problem

\[ \begin{align} e_i & = y_i - \hat{y}_i = y_i - \underbrace{\left(b_0 + b_1 x_i\right)}_\text{prediction}\\ e_1^2 + \dots + e_N^2 &= \sum_{i=1}^N e_i^2 \equiv \text{SSR}(b_0,b_1) \\ (b_0,b_1) &= \arg \min_{\text{int},\text{slope}} \sum_{i=1}^N \left[y_i - \left(\text{int} + \text{slope } x_i\right)\right]^2 \end{align} \]

one can derive an explicit formula for them:

$ \begin{equation} b_1 = \frac{cov(x,y)}{var(x)} \end{equation} $ i.e. the estimate of the slope coefficient is the covariance between $x$ and $y$ divided by the variance of $x$, both computed from our sample of data. With $b_1$ in hand, we can get the estimate for the intercept as

\[\begin{equation} b_0 = \bar{y} - b_1 \bar{x} \end{equation}\]

where $\bar{z}$ denotes the sample mean of variable $z$. The interpretation of the OLS slope coefficient $b_1$ is as follows. Given a line as in $y = b_0 + b_1 x$,

$b_1 = \frac{d y}{d x}$ measures the change in $y$ resulting from a one unit change in $x$
For example, if $y$ is wage and $x$ is years of education, $b_1$ would measure the effect of an additional year of education on wages.

There is an alternative representation for the OLS slope coefficient which relates to the correlation coefficient $r$. Remember that $r = \frac{cov(x,y)}{s_x s_y}$, where $s_z$ is the standard deviation of variable $z$. With this in hand, we can derive the OLS slope coefficient as

\[ \begin{align} b_1 &= \frac{cov(x,y)}{var(x)} \\ &= \frac{cov(x,y)}{s_x s_x} \\ &= r\frac{s_y}{s_x} \end{align} \]

In other words, the slope coefficient is equal to the correlation coefficient $r$ times the ratio of standard deviations of $y$ and $x$.
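A quick numerical check of both representations, written as a small Python sketch. The data points are invented for illustration, and the helper `cov` is a hypothetical name that uses the $1/N$ convention.

```python
from math import sqrt
from statistics import mean

# Hypothetical sample (illustrative values only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def cov(a, b):
    """Sample covariance with the 1/N convention; cov(a, a) is the variance."""
    abar, bbar = mean(a), mean(b)
    return sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b)) / len(a)


# OLS coefficients from the explicit formulas.
b1 = cov(x, y) / cov(x, x)
b0 = mean(y) - b1 * mean(x)

# Alternative representation through the correlation coefficient r.
s_x, s_y = sqrt(cov(x, x)), sqrt(cov(y, y))
r = cov(x, y) / (s_x * s_y)
b1_from_r = r * s_y / s_x

print(b0, b1, b1_from_r)
```

Whether covariance and variance are divided by $N$ or $N-1$ does not matter for the slope, as long as the same convention is used in numerator and denominator.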

Linear Regression without Regressor

\[ \begin{equation} y = b_0 \end{equation} \]

This means that our minimization problem becomes very simple: We only have to choose $b_0$! We have

$ b_0 = \arg\min_{\text{int}} \sum_{i=1}^N \left[y_i - \text{int}\right]^2, $ which is a quadratic equation with a unique optimum such that $ b_0 = \frac{1}{N} \sum_{i=1}^N y_i = \overline{y}. $

Least Squares without regressor $x$ estimates the sample mean of the outcome variable $y$, i.e. it produces $\overline{y}$.

Regression without an Intercept

We follow the same logic here, except that a different term is dropped from our initial equation. The minimisation problem now becomes: \[ \begin{align} b_1 &= \arg\min_{\text{slope}} \sum_{i=1}^N \left[y_i - \text{slope } x_i \right]^2\\ \mapsto b_1 &= \frac{\frac{1}{N}\sum_{i=1}^N x_i y_i}{\frac{1}{N}\sum_{i=1}^N x_i^2} = \frac{\overline{xy}}{\overline{x^2}} \end{align} \]

Least Squares without intercept (i.e. with $b_0=0$) is a line that passes through the origin.

In this case we only get to choose the slope $b_1$ of this anchored line.¹

Centering A Regression

By centering or demeaning a regression, we mean subtracting from both $y$ and $x$ their respective averages to obtain $\tilde{y}_i = y_i - \bar{y}$ and $\tilde{x}_i = x_i - \bar{x}$. We then run a regression without intercept as above. That is, we use $\tilde{x}_i,\tilde{y}_i$ instead of $x_i,y_i$ in

\[ \begin{align} b_1 &= \arg\min_{\text{slope}} \sum_{i=1}^N \left[y_i - \text{slope } x_i \right]^2\\ \mapsto b_1 &= \frac{\frac{1}{N}\sum_{i=1}^N x_i y_i}{\frac{1}{N}\sum_{i=1}^N x_i^2} = \frac{\overline{xy}}{\overline{x^2}} \end{align} \]

to obtain our slope estimate $b_1$:

\[ \begin{align} b_1 &= \frac{\frac{1}{N}\sum_{i=1}^N \tilde{x}_i \tilde{y}_i}{\frac{1}{N}\sum_{i=1}^N \tilde{x}_i^2} \\ &= \frac{\frac{1}{N}\sum_{i=1}^N (x_i - \bar{x}) (y_i - \bar{y})}{\frac{1}{N}\sum_{i=1}^N (x_i - \bar{x})^2} \\ &= \frac{cov(x,y)}{var(x)} \end{align} \]

This last expression is identical to the standard OLS estimate for the slope coefficient derived in the OLS estimator section. We note the following:

Adding a constant to a regression produces the same result as centering all variables and estimating without intercept. So, unless all variables are centered, always include an intercept in the regression.

Standardizing A Regression

Standardizing a variable $z$ means to demean as above, but in addition to divide the demeaned value by its own standard deviation. Similarly to what we did above for centering, we define transformed variables $\breve{y}_i = \frac{y_i-\bar{y}}{\sigma_y}$ and $\breve{x}_i = \frac{x_i-\bar{x}}{\sigma_x}$ where $\sigma_z$ is the standard deviation of variable $z$. From here on, you should by now be used to what comes next! As above, we use $\breve{x}_i,\breve{y}_i$ instead of $x_i,y_i$:

\[ \begin{align} b_1 &= \frac{\frac{1}{N}\sum_{i=1}^N \breve{x}_i \breve{y}_i}{\frac{1}{N}\sum_{i=1}^N \breve{x}_i^2} \\ &= \frac{\frac{1}{N}\sum_{i=1}^N \frac{x_i - \bar{x}}{\sigma_x} \frac{y_i - \bar{y}}{\sigma_y}}{\frac{1}{N}\sum_{i=1}^N \left(\frac{x_i - \bar{x}}{\sigma_x}\right)^2} \\ &= \frac{Cov(x,y)}{\sigma_x \sigma_y} \\ &= Corr(x,y) \end{align} \]

After we standardize both $y$ and $x$, the slope coefficient $b_1$ in the regression without intercept is equal to the correlation coefficient.
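The centering and standardizing results can be checked numerically. The following Python sketch uses made-up data; `center`, `sd`, and `slope_no_intercept` are hypothetical helper names, and the standard deviation uses the $1/N$ convention.

```python
from math import sqrt
from statistics import mean

# Hypothetical sample (illustrative values only).
x = [1.0, 2.0, 4.0, 7.0]
y = [3.0, 5.0, 4.0, 10.0]
n = len(x)


def slope_no_intercept(xs, ys):
    """Least squares slope of a line forced through the origin."""
    return sum(xi * yi for xi, yi in zip(xs, ys)) / sum(xi * xi for xi in xs)


def center(v):
    vbar = mean(v)
    return [vi - vbar for vi in v]


def sd(v):
    """Standard deviation with the 1/N convention."""
    return sqrt(mean(d * d for d in center(v)))


# Slope of the regression with intercept, from the normal equations on raw data.
b1_ols = (n * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y)) / (
    n * sum(xi * xi for xi in x) - sum(x) ** 2
)

# Centering, then regressing without intercept, gives the same slope.
b1_centered = slope_no_intercept(center(x), center(y))

# Without centering, the no-intercept slope is a different number.
b1_raw_no_intercept = slope_no_intercept(x, y)

# Standardizing both variables turns the slope into the correlation coefficient.
x_std = [d / sd(x) for d in center(x)]
y_std = [d / sd(y) for d in center(y)]
b1_standardized = slope_no_intercept(x_std, y_std)
corr = mean(a * b for a, b in zip(center(x), center(y))) / (sd(x) * sd(y))

print(b1_ols, b1_centered, b1_raw_no_intercept, b1_standardized, corr)
```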

Predictions and Residuals

Now we want to ask how our residuals $e_i$ relate to the prediction $\hat{y_i}$. Let us first think about the average of all predictions $\hat{y_i}$, i.e. the number $\frac{1}{N} \sum_{i=1}^N \hat{y_i}$. Let's just take

\[ \begin{equation} \hat{y}_i = b_0 + b_1 x_i \end{equation} \]

and plug this into this average, so that we get

\[ \begin{align} \frac{1}{N} \sum_{i=1}^N \hat{y_i} &= \frac{1}{N} \sum_{i=1}^N \left(b_0 + b_1 x_i\right) \\ &= b_0 + b_1 \frac{1}{N} \sum_{i=1}^N x_i \\ &= b_0 + b_1 \bar{x} \end{align} \]

But that last line is just equal to the formula for the OLS intercept $b_0 = \bar{y} - b_1 \bar{x}$! That means of course that

$ \frac{1}{N} \sum_{i=1}^N \hat{y_i} = b_0 + b_1 \bar{x} = \bar{y} $ in other words:

The average of our predictions $\hat{y_i}$ is identically equal to the mean of the outcome $y$. This implies that the average of the residuals is equal to zero.

Related to this result, we can show that the prediction $\hat{y}$ and the residuals are uncorrelated, something that is often called orthogonality between $\hat{y}_i$ and $e_i$. We would write this as

\[ \begin{align} Cov(\hat{y},e) &=\frac{1}{N} \sum_{i=1}^N (\hat{y}_i-\bar{y})(e_i-\bar{e}) = \frac{1}{N} \sum_{i=1}^N (\hat{y}_i-\bar{y})e_i \\ &= \frac{1}{N} \sum_{i=1}^N \hat{y}_i e_i-\bar{y} \frac{1}{N} \sum_{i=1}^N e_i = 0 \end{align} \]

Correlation, Covariance and Linearity

It is important to keep in mind that Correlation and Covariance relate to a linear relationship between x and y. Given how the regression line is estimated by OLS (see just above), you can see that the regression line inherits this property from the Covariance.

Always visually inspect your data, and don't rely exclusively on summary statistics like mean, variance, correlation and regression line. All of those assume a linear relationship between the variables in your data.

Analysing $Var(y)$

Analysis of Variance (ANOVA) refers to a method to decompose variation in one variable as a function of several others. We can use this idea on our outcome $y$. Suppose we wanted to know the variance of $y$, keeping in mind that, by definition, $y_i = \hat{y}_i + e_i$. We would write

\[ \begin{align}Var(y) &= Var(\hat{y} + e)\\ &= Var(\hat{y}) + Var(e) + 2 Cov(\hat{y},e)\\ &= Var(\hat{y}) + Var(e) \end{align} \]

We have seen that the covariance between prediction $\hat{y}$ and error $e$ is zero, that's why we have $Cov(\hat{y},e)=0$. What this tells us in words is that we can decompose the variance in the observed outcome $y$ into a part that relates to variance as explained by the model and a part that comes from unexplained variation. Finally, we know the definition of variance, and can thus write down the respective formulae for each part:

\[Var(y) = \frac{1}{N}\sum_{i=1}^N (y_i - \bar{y})^2\]
$Var(\hat{y}) = \frac{1}{N}\sum_{i=1}^N (\hat{y_i} - \bar{y})^2$, because the mean of $\hat{y}$ is $\bar{y}$ as we know.
Finally, $Var(e) = \frac{1}{N}\sum_{i=1}^N e_i^2$, because the mean of $e$ is zero. We can thus formulate how the total variation in outcome $y$ is apportioned between model and unexplained variation:

The total variation in outcome $y$ (often called SST, or total sum of squares) is equal to the explained sum of squares (SSE) plus the sum of squared residuals (SSR). We thus have SST = SSE + SSR.

Assessing the Goodness of Fit

In our setup, there exists a convenient measure for how good a particular statistical model fits the data. It is called $R^2$ (R squared), also called the coefficient of determination. We make use of the just introduced decomposition of variance, and write the formula as

\[ \begin{equation}R^2 = \frac{\text{variance explained}}{\text{total variance}} = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}\in[0,1] \end{equation} \]

It is easy to see that a good fit is one where the sum of explained squares (SSE) is large relative to the total variation (SST). In such a case, we observe an $R^2$ close to one. In the opposite case, we will see an $R^2$ close to zero. Notice that a small $R^2$ does not imply that the model is useless, just that it explains a small fraction of the observed variation.
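The decomposition and the two equivalent formulas for $R^2$ can be verified on a tiny example. This is a Python sketch with invented data; it also checks the two properties of the residuals used along the way (zero mean and orthogonality to the predictions).

```python
from statistics import mean

# Hypothetical sample (illustrative values only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

xbar, ybar = mean(x), mean(y)

# OLS fit.
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum(
    (xi - xbar) ** 2 for xi in x
)
b0 = ybar - b1 * xbar

y_hat = [b0 + b1 * xi for xi in x]          # predictions
e = [yi - yh for yi, yh in zip(y, y_hat)]   # residuals

sst = sum((yi - ybar) ** 2 for yi in y)       # total sum of squares
sse = sum((yh - ybar) ** 2 for yh in y_hat)   # explained sum of squares
ssr = sum(ei ** 2 for ei in e)                # sum of squared residuals

r2 = sse / sst
r2_alt = 1 - ssr / sst

print(sst, sse, ssr, r2)
```

Dividing SST, SSE, and SSR by $N$ gives $Var(y)$, $Var(\hat{y})$, and $Var(e)$, so the sums of squares and the variances tell the same story.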

¹ This slope is related to the angle between the vectors $\mathbf{a} =(\overline{x},\overline{y})$ and $\mathbf{b} = (\overline{x},0)$. Hence, it is related to the scalar projection of $\mathbf{a}$ on $\mathbf{b}$.

Instrumental Variables

Thu, 04 Jul 2019 00:00:00 +0000

Effect evaluation model

\[Y_{i}={Y}_{-p,i}+\beta_i P_{i}\]

\[ Y_i=Y_{-P,i}+\beta^* P_i \]

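\[ \begin{equation} Y_i=\beta_0+\beta_1P_i+w_i'\gamma+\varepsilon_i \tag{3.2} \end{equation} \]

Conditional on $w_{i}$, the cigarette price $P_{i}$ must be independent of the "non-price-effect cigarette sales" $Y_{-P}$, that is: $P_i\perp Y_{-P,i} | w_i$. An equivalent statement: the cigarette price $P_{i}$ must be independent of the "non-price-effect cigarette sales after controlling for $w_{i}$".

Decompose $Y_{-P}$ conditional on $rincome$: $ \begin{equation} Y_{i}=Y_{-P,i}-\mathbb{E}(Y_{-P,i}|rincome_{i})+\beta^{*}P_{i}+\mathbb{E}(Y_{-P,i}|rincome_{i}) \tag{3.3} \end{equation} $

Split the data into groups by the value of the conditioning variable $w_{i}$ and look at the slope between cigarette price $P_{i}$ and cigarette sales $Y_{i}$ within each group. If $w_{i}$ is well chosen, the association between $P_{i}$ and $Y_{i}$ within a group reflects the true effect slope. $Y_{i}$ is sometimes disturbed by $Y_{-P,i}$, which affects the slope we observe, but because $Y_{-P,i}$ is no longer related to $P_{i}$, these disturbances cancel out in large samples and the true effect slope is recovered.

If no choice of $w_{i}$ can control the disturbance that $Y_{-P,i}$ causes in the association with $Y_{i}$, we have to transform the data and remove the disturbance from the raw data directly. The two most common approaches are the instrumental variable method and the panel data fixed effects model.

Instrumental variable method: use an instrument to keep the part of $P_{i}$ that is not correlated with $Y_{-P,i}$.
Panel data: use a variable transformation to remove the part of $P_{i}$ that is correlated with $Y_{-P,i}$.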

\[ Y_i=Y_{-p,i}+\beta\mathbb{E}(P_i|z_i)+\beta (P_i-\mathbb{E}(P_i|z_i)) \]

Relevance condition

$\mathbb{E}(P|z)\neq \text{constant}$, that is, $z$ has explanatory power for $P$

Exclusion condition

$Y_{-p,i}+\beta(P_i-\mathbb{E}(P_i|z_i))$ is unrelated to $z_{i}$
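To see what the two conditions buy, here is a minimal Python sketch with one instrument and no control variables, in which case TSLS reduces to the ratio $cov(z,Y)/cov(z,P)$. The data are constructed by hand (not real cigarette data): `u` is a hypothetical unobserved confounder that raises both price and sales, `z` is an instrument that is exactly uncorrelated with `u` in the sample, and the assumed price effect is -2.

```python
from statistics import mean


def cov(a, b):
    """Sample covariance with the 1/N convention; cov(a, a) is the variance."""
    abar, bbar = mean(a), mean(b)
    return sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b)) / len(a)


# Constructed data (assumptions for illustration only).
z = [0.0, 0.0, 1.0, 1.0]        # instrument, e.g. a tax dummy
u = [-1.0, 1.0, -1.0, 1.0]      # unobserved confounder, cov(z, u) = 0
true_beta = -2.0
price = [zi + ui for zi, ui in zip(z, u)]                       # relevance: z moves price
sales = [true_beta * p + 3.0 * ui for p, ui in zip(price, u)]   # 3*u plays the role of Y_{-P}

# OLS is biased because price is correlated with the confounder.
beta_ols = cov(price, sales) / cov(price, price)

# IV estimate with a single instrument.
beta_iv = cov(z, sales) / cov(z, price)

# The same number via two stages: fit price on z, then regress sales on the fit.
pi1 = cov(z, price) / cov(z, z)
price_hat = [mean(price) + pi1 * (zi - mean(z)) for zi in z]
beta_tsls = cov(price_hat, sales) / cov(price_hat, price_hat)

print(beta_ols, beta_iv, beta_tsls)
```

In this constructed case OLS even gets the sign wrong, while the instrument keeps only the variation in price that is unrelated to the confounder and recovers the assumed effect.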

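Three assumptions

\[ \begin{equation} Y_i=\beta_0+\beta_1 P_i + \gamma_1 rincome_i + \epsilon_i \tag{3.5} \end{equation} \]

Q1: Does my instrument satisfy the exclusion condition (or exogeneity condition)?

Is the cigarette tax unrelated to "non-price-driven sales" under the control conditions?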

\[ Y =\underset{(\times k)}{X}\beta+\underset{(\times p)}{W}\gamma +\epsilon \]

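where $X$ is the set of variables whose effect is being evaluated and $W$ is the set of control variables, so $\epsilon$ is "the value of $Y$ net of the effect of $X$, conditional on $W$". In addition, we have found instruments $\underset{(\times m)}{Z}$ and want to test:

$H_{0}$: the instruments $Z$ are unrelated to the regression error $\epsilon$

Run TSLS and obtain $ \hat{\epsilon}_{_{TSLS}}=Y-\hat{Y}_{TSLS} $.
Regress $ \hat{\epsilon}_{_{TSLS}} $ on the full instrument set ($Z$ and $W$), run the joint test that all coefficients are 0, and compute the test statistic $J=mF\sim\chi^{2}(m-k)$, where $F$ is the F statistic of the joint test.

The test has $m-k$ degrees of freedom, so $m$ must be greater than $k$. When they are equal, the test cannot be carried out.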

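Q2: Is my instrument relevant enough?

Is the cigarette tax really strongly related to the price?

The instruments $Z$ must be "strongly enough" related to the effect variable $X$; otherwise the large-sample asymptotic distribution of $\hat{\beta}_{_{TSLS}}$ will not be normal.

Consider the first-stage regression of TSLS: $X=Z\alpha_z+W\alpha_w+u$. We want $\alpha_z$ to be jointly significant.

Testing rule

$H_0$: the instruments $Z$ are only weakly relevant.

Regress $X$ on the full instrument set ($Z$, $W$) and run the joint F test of $\alpha_z=0$.
Reject $H_0$ if $F>10$.

Q3: Is my worry about omitted variable bias (OVB) unnecessary?

Perhaps an instrument is not needed at all: in regression model (3.5), $P$ may already be unrelated to $\epsilon$ (the "non-price-driven sales under the control conditions"), in which case (3.5) can be estimated directly by least squares. $ \begin{equation} Y =X\beta+W\gamma +\epsilon \tag{3.6} \end{equation} $ $H_0$: the estimate of $\beta$ in regression model (3.6) does not suffer from OVB, so either OLS or TSLS can be used; in large samples, $\hat{\beta}_{OLS}\approx\hat{\beta}_{TSLS}$.

$H_1$: the estimate of $\beta$ in regression model (3.6) does suffer from OVB, so only TSLS can be used; in large samples, $\hat{\beta}_{OLS}\neq \hat{\beta}_{TSLS}$.

Hausman test statistic: $ H\equiv\left(\hat{\beta}_{IV}-\hat{\beta}_{OLS}\right)^{'}\left[V(\hat{\beta}_{IV}-\hat{\beta}_{OLS})\right]^{-1}\left(\hat{\beta}_{IV}-\hat{\beta}_{OLS}\right)\sim\chi_{(df)}^{2}, $ where df is the number of $\beta$ coefficients.

Reject $H_0$ only when $H>\chi_{(df)}^{2}(\alpha)$.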

Ghost Blog Workflow

Wed, 26 Jun 2019 00:00:00 +0000

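Update, Sep 25, 2019: this workflow is not ideal. I have switched to Blogdown and Git to manage the blog and am still figuring things out.

~~Ghost is finally more or less set up, so from now on I should keep proper notes. Things I read in the past were all forgotten after a while, which was frustrating.~~

The current workflow is as follows

Keep drafts in the Draft folder under Synology Drive
Write markdown in Typora and save
Save one copy each to Leanote and Evernote; this can probably be automated with IFTTT, to be looked into later
Another option is to run git init in the Draft folder and push to GitHub as a backup.
Publish through Ghost

Required code injection

Formulas

Paste the following script into the Post Header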

<script type="text/javascript" src="https://cdn.bootcss.com/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>  
<script type="text/x-mathjax-config">  
    MathJax.Hub.Config({
        tex2jax: {
            inlineMath: [['$$','$$'], ['\\\\(','\\\\)']],
            processEscapes: true
        }
    });
</script>

Syntax highlighting

Paste the following into the Post Header

<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.16.0/themes/prism-tomorrow.css">

Paste the following scripts into the Post Footer

<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.16.0/prism.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.16.0/components/prism-python.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.16.0/components/prism-r.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.16.0/components/prism-sas.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.16.0/components/prism-bash.js"></script>

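prism.js does not support Stata, so make do with what is available. Which components to load depends on what the posts need.

Things to watch out for

Ghost does not generate TOC entries for H1; the TOC starts from H2
Some formulas in Markdown need to be escaped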

Logistic Regression

Wed, 26 Jun 2019 00:00:00 +0000

Odds ratios

An odds ratio of 1.0 is equivalent to a beta weight of 0.0.

Group	Diseased	Healthy
Exposed	$D_E$	$H_E$
Not exposed	$D_N$	$H_N$

$OR={\frac {D_{E}/H_{E}}{D_{N}/H_{N}}}$

The distribution of the odds ratio is far from normal. Taking the natural logarithm of the odds ratio gives a quantity whose distribution is much closer to normal.

$\ln(OR)$ is the coefficient on the logit scale, where $logit = \ln\left(\frac{p}{1-p}\right)$ is the log odds.

When the mean of the outcome is around 0.50, OLS regression and logistic regression produce similar results, but when the probability is close to 0 or 1, logistic regression is especially important.

Logistic regression

The logit command gives the regression coefficients to estimate the logit score. The logistic command gives us the odds ratios we need to interpret the effect size of the predictors.

Both commands give the same results, except that logit gives the coefficients for estimating the logit score and logistic gives the odds ratios.

The McFadden pseudo-$R^2$ represents how much larger the log likelihood is for the final solution than for the intercept-only model. A value of 0.02, for example, means the log likelihood for the fitted model is 2% larger than the log likelihood for the intercept-only model. This is not explained variance. The pseudo-$R^2$ is often a small value, and many researchers do not report it. The biggest mistake is to report it and interpret it as explained variance.

If you are interested in specific effects of individual variables, it is better to rely on odds ratios for interpreting results of logistic regression. ~~This shows that mothers who smoke have 2.02 times greater odds of having a low-birthweight child.~~

Odds ratios tell us what happens to the odds of an outcome, whereas risk ratios tell us what happens to their probability.

For binary predictor variables, you can interpret the odds ratios and percentages directly. For variables that are not binary, you need to have some other standard. One solution is to compare specific examples, such as having no dinners with the family versus having seven dinners with them each week. Another solution is to evaluate the effect of a 1-standard-deviation change for variables that are not binary. The listcoef command, from the spost13 package, helps with this: after a logit or logistic regression, run listcoef, help or listcoef, help percent.

Group	Experimental (E)	Control (C)
Events (E)	EE	CE
Non-events (N)	EN	CN

$ RR={\frac {EE/(EE+EN)}{CE/(CE+CN)}}={\frac {EE(CE+CN)}{CE(EE+EN)}}. $ The relative risk compares the risk of an event under exposure to some condition with the risk without that exposure.

$OR={\frac {EE/CE}{EN/CN}}={\frac {EE\cdot CN}{EN\cdot CE}}$ The odds of an event are the ratio of the event occurring to it not occurring, and the odds ratio compares these odds between the two groups.

The risk ratio is different from the odds ratio, although it asymptotically approaches it for small probabilities of outcomes. If EE is substantially smaller than EN, then EE/(EE + EN) $ \scriptstyle \approx $ EE/EN. Similarly, if CE is much smaller than CN, then CE/(CN + CE) $ \scriptstyle \approx $ CE/CN. $ RR={\frac {EE(CE+CN)}{CE(EE+EN)}}\approx {\frac {EE\cdot CN}{EN\cdot CE}}=OR. $

The difference is small with a rare outcome. The relative risk is appealing, but it should not be used in a study that controls the number of people in each category.
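The rare-outcome approximation is easy to see numerically. The Python sketch below uses two invented 2x2 tables (one with a rare event, one with a common event); the function and variable names are hypothetical and follow the EE/EN/CE/CN notation of the table above.

```python
from math import log


def risk_ratio(ee, en, ce, cn):
    """RR = [EE / (EE + EN)] / [CE / (CE + CN)]."""
    return (ee / (ee + en)) / (ce / (ce + cn))


def odds_ratio(ee, en, ce, cn):
    """OR = (EE * CN) / (EN * CE)."""
    return (ee * cn) / (en * ce)


# Hypothetical counts: events are rare in the first table, common in the second.
rare = dict(ee=10, en=990, ce=5, cn=995)
common = dict(ee=600, en=400, ce=300, cn=700)

rr_rare, or_rare = risk_ratio(**rare), odds_ratio(**rare)
rr_common, or_common = risk_ratio(**common), odds_ratio(**common)

# An odds ratio of 1.0 corresponds to a coefficient (log odds ratio) of 0.0.
log_or_null = log(odds_ratio(50, 50, 50, 50))

print(rr_rare, or_rare)       # nearly identical when the event is rare
print(rr_common, or_common)   # clearly different when the event is common
```

Both tables have the same risk ratio of 2, but only in the rare-event table is the odds ratio close to it.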

Hypothesis testing

The overall chi-squared test, which has k degrees of freedom, tells us only that the overall model has at least one significant predictor.

Testing individual coefficients

The z test in the Stata output is actually the square root of the Wald chi-squared test.

The likelihood-ratio chi-squared test for each parameter estimate is based on comparing two logistic models, one with the individual variable we want to test included and one without it. The likelihood-ratio test is the difference in the likelihood-ratio chi-squared values for these two models (this appears as LR chi2(1) near the upper right corner of the output). Because the two models differ by one parameter, this difference has 1 degree of freedom.

use nlsy97_chapter11, clear
logistic drank30 male dinner97 pdrink97
estimates store a
logistic drank30 age97 male dinner97 pdrink97
* subtracts the chi-squared values and estimates the probability of the chi-squared difference
lrtest a

or just use lrdrop1
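The arithmetic behind a 1-degree-of-freedom likelihood-ratio test can be sketched in Python. The log likelihoods below are hypothetical numbers, not output from the nlsy97 models, and `lr_test_1df` is an illustrative helper rather than a port of Stata's lrtest. For 1 degree of freedom the chi-squared tail probability can be written with the complementary error function, which also shows why the z test is the square root of a chi-squared test.

```python
from math import erfc, sqrt


def chi2_sf_1df(stat):
    """Upper-tail probability of a chi-squared(1) variable: P(Z^2 > stat)."""
    return erfc(sqrt(stat / 2.0))


def lr_test_1df(ll_reduced, ll_full):
    """LR statistic and p-value for nested models differing by one parameter."""
    lr = 2.0 * (ll_full - ll_reduced)
    if lr < 0:
        raise ValueError("the full model cannot have a lower log likelihood")
    return lr, chi2_sf_1df(lr)


# Hypothetical log likelihoods of the model without and with the tested variable.
lr_stat, p_value = lr_test_1df(ll_reduced=-1000.0, ll_full=-998.0)

# A z statistic of 1.96 squared is a chi-squared(1) value with p close to 0.05.
p_from_z = chi2_sf_1df(1.96 ** 2)

print(lr_stat, p_value, p_from_z)
```

The two models have to be fitted on the same observations for the comparison to be valid, which is what the `if !mi(...)` condition in the Stata code of the next section takes care of.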

Testing sets of coefficients

test pdrink97 dinner97
* it is the same as:
logistic drank30 age97 male if !mi(dinner97) &!mi(pdrink97)
estimates store a
logistic drank30 age97 male pdrink97 dinner97 
lrtest a
lrdrop1

This overall test only tells us that at least one of them is significant.

Margins

logit drank30 age97 i.black pdrink97 dinner97
margins, dydx(black) atmeans
margins black, atmeans
margins, at(pdrink97=(1 2 3 4 5)) atmeans
marginsplot

We can run the logistic regression using the i. label for this categorical variable, i.black. This produces the same results for the logistic regression as if we had simply used black, but the results will work properly if we follow this command with other postestimation commands.

Nested logistic regressions

The nestreg command is extremely general, applicable across a variety of regression models, including logistic, negative binomial, Poisson, probit, ordered logistic, tobit, and others. It also works with the complex sample designs for many regression models.

Power analysis

powerlog, p1(.70) p2(.75) alpha(.05)
powerlog, p1(.70) p2(.75) alpha(.05) rsq(.30) help

Measurement, reliability, and validity

Wed, 26 Jun 2019 00:00:00 +0000

Constructing a Scale

recode empathy2 empathy4 empathy5 (1=5 "Does not describe very well") ///
  (2=4) (3=3) (4=2) (5=1 "Describes very well"), pre(rev) label(empathy)
egen empathy = rowmean(empathy1 revempathy2 empathy3 revempathy4 ///
  revempathy5 empathy6 empathy7)
egen miss = rowmiss(empathy1 revempathy2 empathy3 revempathy4 ///
   revempathy5 empathy6 empathy7) 
egen empathya = rowmean(empathy1 revempathy2 empathy3 revempathy4 ///
   revempathy5 empathy6 empathy7) if miss < 3

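One drawback of the rowmean() function is that it simply adds up the scores on the items a person answered and divides by the number of items answered, so a scale score can rest on very few items unless a limit on missing items is imposed, as with the if miss < 3 condition above.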
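The logic of reverse coding and of a row mean with a cap on missing items can be sketched in Python. This is an illustration of the idea, not a reimplementation of Stata's egen; `reverse_code` and `row_mean` are hypothetical names, and `None` stands in for a missing answer.

```python
def reverse_code(score, low=1, high=5):
    """Reverse a Likert item: 1 <-> 5, 2 <-> 4, 3 stays 3. Missing stays missing."""
    return None if score is None else low + high - score


def row_mean(scores, max_missing=None):
    """Mean of the answered items; missing if too many items are unanswered."""
    answered = [s for s in scores if s is not None]
    n_missing = len(scores) - len(answered)
    if not answered:
        return None
    if max_missing is not None and n_missing > max_missing:
        return None
    return sum(answered) / len(answered)


# Hypothetical respondent who skipped three of the seven items.
respondent = [4, None, 5, None, None, 4, 3]

plain = row_mean(respondent)                    # mean of the four answered items
capped = row_mean(respondent, max_missing=2)    # mirrors "if miss < 3"

print(plain, capped)
```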

Reliability

Stability means that if you measure a variable today using a particular scale and then measure it again tomorrow using the same scale, your results will be consistent (correlation $r$ with pwcorr, intraclass correlation $\rho_I$).
Equivalence means that you have two measures of the same variable and they produce consistent results (correlation $r_{xx}$). A low correlation means either that the measure is not reliable or that the measures are not truly equivalent.
Internal consistency: a reliable test is internally consistent if the score for the first half of the items is highly correlated with the score for the second half of the items (correlation $r_{x_Ax_B}$; alpha, $\alpha$). In general, an $\alpha>0.8$ is considered good reliability, and many researchers feel an $\alpha>0.7$ is adequate reliability ($\alpha=\sigma^2_{True}/(\sigma^2_{True}+\sigma^2_{error})$). However, for this interpretation to be used, we need to assume that the scale is valid.
alpha empathy1 revempathy2 empathy3 revempathy4 revempathy5 ///
  empathy6 empathy7, asis item min(5)
The asis (as is) option means that we do not want Stata to change the signs of any of our variables. The bottom row of the output table, Test scale, reports the $\alpha$ for the scale (0.7462). Above this value is the $\alpha$ we would obtain if we dropped each item, one at a time. The item-test correlation column reports the correlation of each item with the total score of the seven items. The item-rest correlation column reports the correlation of each item with the total of the other items. The equivalent of alpha for items that are dichotomous is the Kuder–Richardson measure of reliability (alpha).
Rater consistency is important when you have observers rating a video, observed behavior, essay, or something else where two or more people are rating the same information. Here reliability means that a pair of raters gives consistent results (kappa, $\kappa$: kap coder1 coder2). $\kappa$ only gives us credit for the extent to which the agreement exceeds what we would have expected to get by chance alone. kappa tends to be lower than alpha.
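Two of the reliability measures above have simple closed forms. The Python sketch below computes Cronbach's alpha as $\frac{k}{k-1}\left(1-\frac{\sum_j \sigma^2_j}{\sigma^2_{total}}\right)$ and Cohen's kappa as $(p_o-p_e)/(1-p_e)$. These are the standard textbook formulas, the data are invented, and the output is not meant to reproduce Stata's alpha or kap tables.

```python
from collections import Counter
from statistics import variance


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (no missing values)."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def cohen_kappa(coder1, coder2):
    """Agreement between two raters beyond what chance alone would produce."""
    n = len(coder1)
    p_observed = sum(a == b for a, b in zip(coder1, coder2)) / n
    freq1, freq2 = Counter(coder1), Counter(coder2)
    p_chance = sum(freq1[c] / n * freq2[c] / n for c in freq1)
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical answers of five respondents to three items.
scores = [[4, 5, 4], [2, 3, 3], [5, 5, 4], [1, 2, 2], [3, 3, 4]]
alpha = cronbach_alpha(scores)

# Hypothetical codes assigned by two raters to four essays.
kappa = cohen_kappa(["a", "a", "b", "b"], ["a", "b", "b", "b"])

print(alpha, kappa)
```

In the kappa example the raters agree on 3 of 4 essays, but half of that agreement would be expected by chance given how often each rater uses each code, so kappa is only 0.5.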

Validity

A valid measure is one that measures what it is supposed to be measuring.
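Face validity: give the questionnaire to friends and family to fill in and ask them whether it is any good. It refers to how valid the measurement instrument appears on the surface.
Content validity: have a group of people with relevant experience review the items and ask them whether the items are well designed and whether anything should be changed. Content validity ratio (CVR): Judges rate each item as essential, useful, or not necessary. $CVR=(Ne - N/2)/(N/2)$ , in which the $Ne$ is the number of panelists indicating "essential" and $N$ is the total number of panelists. You can keep the items that have a relatively high CVR and drop those that do not.
Criterion validity: correlate the measurement instrument with other measurable instruments; the correlation coefficient between the test scores and a specific criterion expresses how valid the instrument is.
- (1) Concurrent validity: correlate the designed items with a standard instrument (same concept, same variables). For example, to measure pain tolerance, compare a four-item scale that takes one minute with a standard 45-item instrument that takes one hour; if R = 0.92 (high correlation), the new items have concurrent validity.
- (2) Predictive validity: a survey can predict future events, behaviors, attitudes, or outcomes. For example, for patients' need for painkillers after surgery, look at the scores of 24 patients, where a higher score means a higher tolerance. Correlating the 24 scores with the amount of painkillers taken gives R = -0.82, which indicates that high pain tolerance goes with low painkiller use. Correlating SAT scores (which are meant to predict the first-semester college grade average) with the first-semester college grade average gives R = 0.42, which indicates no predictive validity. If R increases year by year, however, that indicates predictive validity.
Construct validity:
- We can assess the convergent and divergent validity of our measure, hope, by seeing whether it is positively correlated with variables with which we believe it converges and negatively correlated with variables with which we believe it diverges (ttest, esize, pwcorr).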
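The content validity ratio from the list above is a one-line formula; a small Python sketch makes its range concrete. The panel sizes are hypothetical and `content_validity_ratio` is an illustrative name.

```python
def content_validity_ratio(n_essential, n_panelists):
    """CVR = (Ne - N/2) / (N/2): -1 if nobody, 0 if half, +1 if everyone says essential."""
    half = n_panelists / 2
    return (n_essential - half) / half


# Hypothetical panel of 10 judges rating three items.
cvr_all = content_validity_ratio(10, 10)    # every judge says "essential"
cvr_most = content_validity_ratio(8, 10)
cvr_half = content_validity_ratio(5, 10)

print(cvr_all, cvr_most, cvr_half)
```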
  
Factor analysis

Exploratory factor analysis, which Stata calls principal factor analysis (PF): the variance is partitioned into the shared variance and unique or error variance. The shared variance is how much of the variance in any one item can be explained by the rest of the items.
Principal-component factor analysis (PCF)

With putdocx, Stata 15 can create Word documents!

Terminology

Extraction
Eigenvalues: In the case of PCF analysis, if there are 10 items, the sum of the eigenvalues will be 10. The factors will be ordered from the most important, which has the largest eigenvalue, to the least important, which has the smallest eigenvalue. In PF analysis, the sum of the eigenvalues will be less than the number of items, and the eigenvalues’ interpretation is complex.
Communality and uniqueness: PF analysis tries to explain the shared variance. PCF analysis tries to explain all the variance, which is why it is ideal for the uniqueness to approach zero.
Loadings: how clusters of items are most related to one or another of the factors. If an item has a loading over 0.4 on a factor, it is considered a good indicator of that factor.
Simple structure: This is a pattern of loadings where each item loads strongly on just one factor and a subset of items load strongly on each factor. When an item loads strongly on more than one factor, it is factorially confounded.
Scree plot: This is a graph showing the eigenvalue for each factor. When doing a PCF analysis, we usually drop factors that have eigenvalues in the neighborhood of 1.0 or smaller.
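Rotation: there are many rotation methods, but they fall into two broad classes: orthogonal and oblique. The purpose of rotation is to make the factors more meaningful and, at the same time, to examine the relationships among the factors. More specifically, an orthogonal rotation assumes the factors are uncorrelated, whereas an oblique rotation assumes the factors are correlated to some degree.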
Factor score: weights each item based on how related it is to the factor. Also the factor score is scaled to have a mean of 0.0 and a variance of 1.0.

Use PCF when you have a set of items that you believe all measure one concept. In this situation, you would be interested in the first principal factor. You would want to see if it explained a substantial part of the total variance for the entire set of items, and you would want most of the items to have a loading of 0.4 or above on this factor. Because PCF analysis is trying to explain all the variance in the items, the uniqueness for each item should approach zero. Generally, we should consider any factor that has an eigenvalue of more than 1. A visual way to examine the eigenvalues is with a scree plot.

factor rnatspac rnatenvir rnatheal rnatcity rnatcrime rnatdrug ///
	rnateduc rnatrace rnatarms rnatfare rnatroad rnatsoc rnatchld rnatsci, pcf
screeplot

If, on the other hand, you want to identify two or more latent variables that represent interpretable dimensions of some concept, then PF analysis is probably best.

Rotation

Orthogonal: rotate. With a varimax rotation, we can think of the loadings as being the estimated correlation between each item and each factor.
Oblique: rotate, promax

Use estat common to get the correlation matrix of the promax-rotated common factors.

Get one factor score

However, this distinction rarely makes a lot of practical difference. The factor score may make a difference if there are some items with very large loadings, say, 0.9, and others with very small loadings, say, 0.2. But we would probably drop the weakest items. When the loadings do not vary a great deal, computing a factor score or a mean/total score will produce comparable results.

factor rnatenvir rnatheal rnatcity rnatcrime rnatdrug rnateduc rnatrace ///
	rnatfare rnatsoc rnatchld, pcf
predict libfscore, norotate
egen libmean = rowmean(rnatenvir rnatheal rnatcity rnatcrime rnatdrug ///
	rnateduc rnatrace rnatfare rnatsoc rnatchld)

correlation higher than 0.9...

Missing values

Wed, 26 Jun 2019 00:00:00 +0000

Many advanced Stata estimation models can use multiple imputation for handling missing values.

Auxiliary variables are variables that can help to make estimates on incomplete data, while they are not part of the main analysis (Collins et al., 2001).

Include all variables in the analysis model, including the dependent variable.
Include auxiliary variables that predict patterns of missingness.
Include additional variables that predict a person’s score on a variable that has missing values.

The imputation model is then used to generate a complete dataset.

Once you have included a reasonably large number of variables, adding additional variables may not be helpful because of multicollinearity.

Drop any participant who does not have complete information on every item used in the analysis. This approach goes by several names, including full case analysis, casewise deletion, or listwise deletion.

There will be a substantial loss of power because of the reduced sample size.
Listwise deletion can introduce substantial bias. (survival bias)

One alternative to listwise deletion involves substituting the mean on a variable for anybody who does not have a response. This has two serious limitations. First, people who are average on a variable are often more likely to give an answer than are people who have an extreme value. Second, when you give several people the same score on a variable, these people have zero variance on the variable. This artificially reduced variance will seriously bias our parameter estimates.
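The variance problem with mean substitution shows up even in a toy example. In the Python sketch below, five observed values are hypothetical and five missing answers are filled in with the observed mean.

```python
from statistics import mean, pvariance

# Hypothetical observed scores; five more respondents did not answer.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
n_missing = 5

# Mean substitution: every missing answer gets the observed mean.
filled = observed + [mean(observed)] * n_missing

var_observed = pvariance(observed)   # variance among those who answered
var_filled = pvariance(filled)       # variance after mean substitution

print(var_observed, var_filled)
```

The mean is unchanged, but the variance is cut in half because half of the "observations" now sit exactly at the mean.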

The key to understanding multiple imputation is that the imputed missing values will not contain any unique information once the variables in the model and the auxiliary variables are allowed to explain the patterns of missing values and predict the score of the missing values. The imputed values for variables with missing values are simply consistent with the observed data. This allows us to use all available information in our analysis.

Multiple imputation

A powerful way of working with missing values involves multiple imputation. The command mi involves three straightforward steps:

Create m complete datasets by imputing the missing values. Each dataset will have no missing values, but the values imputed for missing values will vary across the datasets.
Do your analysis in each of the m complete datasets.
Pool your m solutions to get one solution.
- The parameter estimates—for example, regression coefficients—will be the mean of their corresponding values in the datasets.
- The standard errors used for testing significance will combine the standard errors from the solutions plus the variance of the parameter estimates across the solutions. If each solution is yielding a very different estimate, this uncertainty is added to the standard errors. The degrees of freedom are also adjusted based on the number of imputations and the proportion of data that have missing values.
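The pooling step can be sketched in a few lines using Rubin's rules, the standard combination rules for multiple imputation. The estimates and standard errors below are made up, and the sketch leaves out the degrees-of-freedom adjustment mentioned above.

```python
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Pool m estimates of one parameter across imputed datasets."""
    m = len(estimates)
    pooled = mean(estimates)                       # mean of the m estimates
    within = mean([se ** 2 for se in std_errors])  # average sampling variance
    between = variance(estimates)                  # variance across datasets
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5

# hypothetical coefficient and standard error from m = 5 imputed datasets
coefs = [0.50, 0.54, 0.46, 0.52, 0.48]
ses = [0.10, 0.10, 0.10, 0.10, 0.10]

pooled_coef, pooled_se = pool_rubin(coefs, ses)
print(pooled_coef, pooled_se)
```

The pooled standard error is larger than any single-dataset standard error because the disagreement between the imputed datasets is added in.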

The most widely used approach is using multivariate normal regression (MVN). mi impute mvn is designed for continuous variables. mi impute chained is another useful alternative.

A missing value will have a code of ., .a, .b, etc. Remember that a missing value is recorded in a Stata dataset as an extremely high value. Within mi, the missing-value code . (dot) has a special meaning: it denotes the missing values eligible for imputation. If you have a set of missing values that should not be imputed, record them as extended missing values, that is, as .a, .b, etc. Conversely, recode an extended missing value to . when it should be imputed:

recode agem (.a = .)

misstable summarize ln_wagem gradem agem ttl_expm tenurem not_smsa south blackm
misstable patterns ln_wagem gradem agem ttl_expm tenurem not_smsa south blackm
quietly misstable summarize ln_wagem gradem agem ttl_expm tenurem not_smsa south blackm, gen(miss_)

then

logit miss_ln_wagem gradem agem ttl_expm tenurem not_smsa south blackm if ln_wagem <= .
logit miss_gradem ln_wagem agem ttl_expm tenurem not_smsa south blackm if gradem <= .
logit miss_agem ln_wagem gradem ttl_expm tenurem not_smsa south blackm if agem <= .
logit miss_ttl_expm ln_wagem gradem agem tenurem not_smsa south blackm if ttl_expm <= .
logit miss_tenurem ln_wagem gradem agem ttl_expm not_smsa south blackm if tenurem <= .
logit miss_blackm ln_wagem gradem agem ttl_expm tenurem not_smsa south if blackm <= .

Or use pwcorr, obs sig to find potential auxiliary variables.

Any variable that is statistically significant in these logistic regressions should be included in the imputation step.

mi set flong
mi register imputed ln_wagem gradem agem ttl_expm tenurem blackm
mi register regular not_smsa south

The mi set flong command tells Stata how to arrange our multiple datasets: flong (full and long) or mlong (marginal and long). The mi register imputed command registers all the variables that have missing values and need to be imputed. The mi register regular command registers all the variables that have no missing values or for which we do not want to impute values.

mi impute mvn ln_wagem gradem agem ttl_expm tenurem blackm, add(20) rseed(2121)

This generates m = 20 imputed datasets. The _mi_m variable identifies the datasets and ranges from 0 (the original data) to 20.

mi estimate: regress ln_wagem gradem agem ttl_expm tenurem not_smsa south blackm

To get the pooled $R^2$ and standardized $\beta$s, use mibeta:

mibeta ln_wagem gradem agem ttl_expm tenurem not_smsa south blackm, fisherz miopts(vartable)

When impossible values are imputed (the recommendation is not to adjust them): binary variables, squares, and interactions (compute the product in the original dataset first, then impute).

Multilevel analysis

Wed, 26 Jun 2019 00:00:00 +0000

Multilevel analysis can address the lack of independence of the observations when you are analyzing grouped data. See Stata Multilevel Mixed-Effects Reference Manual.

groups of individuals
panel data

Fixed-effects regression models

\[y_{it} = \beta_0 +\beta x_{it}+\mu_i+\eta_{it}\]

If $\mu_i$ correlates with $x_{it}$, use the fixed-effects model. If $\mu_i$ is independent of $x_{it}$, random-effects models give consistent estimates.

For xtreg, see the Stata Longitudinal-Data/Panel-Data Reference Manual.
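Why the correlation between $\mu_i$ and $x_{it}$ matters can be illustrated with simulated data. This is a rough sketch in plain Python, not a substitute for xtreg: the panel (200 units, 5 periods, true $\beta = 1$) is invented, and the fixed-effects estimate is obtained by the within transformation, that is, by subtracting each unit's own mean.

```python
import random

def ols_slope(x, y):
    """Slope from a simple regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(42)
true_beta = 1.0
xs, ys, x_within, y_within = [], [], [], []
for _ in range(200):                  # 200 hypothetical units
    mu = rng.gauss(0, 1)              # unobserved, time-invariant unit effect
    # x depends on mu, so mu_i is correlated with x_it
    x_i = [mu + rng.gauss(0, 1) for _ in range(5)]
    y_i = [true_beta * x + mu + rng.gauss(0, 1) for x in x_i]
    xs += x_i
    ys += y_i
    # within transformation: subtract each unit's own mean
    mx_i, my_i = sum(x_i) / 5, sum(y_i) / 5
    x_within += [x - mx_i for x in x_i]
    y_within += [y - my_i for y in y_i]

pooled = ols_slope(xs, ys)              # ignores mu_i
within = ols_slope(x_within, y_within)  # fixed-effects (within) estimate
print(round(pooled, 2), round(within, 2))
```

The pooled slope absorbs part of $\mu_i$ and overshoots the true value, while the within estimate stays close to it.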

Random-effects regression models

\[y_{it} = \beta_0 +\beta x_{it}+\gamma z_i +\mu_i+\eta_{it}\]

This model assumes that $\mu_i$ is independent of $x_{it}$.

The fixed component, $\beta_0 +\beta x_{it}+\gamma z_i$, describes the overall relationship between our dependent variable and our independent variables. The random component, $\mu_i$, represents the effects of the unobserved time-invariant variables.

score = fixed part + random effects + error

Going back and forth between wide and long formats: reshape wide and reshape long

reshape long drink, i(id) j(wave)
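What the reshape does to the data layout can be sketched in plain Python. The column names drink0, drink1, drink2 and the values are hypothetical; the helper only mimics the idea of reshape long and is not Stata's implementation.

```python
# hypothetical wide data: one row per person, one drink column per wave
wide = [
    {"id": 1, "drink0": 3, "drink1": 4, "drink2": 6},
    {"id": 2, "drink0": 0, "drink1": 1, "drink2": 1},
]

def reshape_long(rows, stub, i, j="wave"):
    """Turn columns stub0, stub1, ... into one row per (i, j) pair."""
    long_rows = []
    for row in rows:
        for name, value in row.items():
            suffix = name[len(stub):]
            if name.startswith(stub) and suffix.isdigit():
                long_rows.append({i: row[i], j: int(suffix), stub: value})
    return long_rows

long_data = reshape_long(wide, stub="drink", i="id")
for row in long_data:
    print(row)
```

Each person now contributes one row per wave, which is the layout that mixed expects.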

Random-intercept model

linear model

mixed drink c.wave || id:
estimates store linear
margins, at(wave=(0(2)10))
marginsplot

quadratic term

mixed drink c.wave##c.wave || id:
estimates store quadratic
margins, at(wave=(0(2)10))
marginsplot
lrtest linear quadratic

A proportional reduction in error (PRE) measuring how much the residual (error) variance is reduced by adding the quadratic term may be useful. We will call the random-intercept linear model “Model 1” and the random-intercept quadratic model “Model 2”.

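\[\mathrm{PRE} = \frac{\mathrm{var}(\text{Residual})_{\text{Model 1}} - \mathrm{var}(\text{Residual})_{\text{Model 2}}}{\mathrm{var}(\text{Residual})_{\text{Model 1}}}\]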
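As a quick worked example of the PRE formula, here is a minimal Python sketch. The two residual variances are invented for illustration; in practice they would be read from the var(Residual) rows of the two mixed outputs.

```python
def pre(var_resid_model1, var_resid_model2):
    """Proportional reduction in residual variance from Model 1 to Model 2."""
    return (var_resid_model1 - var_resid_model2) / var_resid_model1

# hypothetical residual variances for the linear and quadratic models
pre_value = pre(4.0, 3.0)
print(pre_value)  # 0.25
```

A value of 0.25 would mean that adding the quadratic term removes a quarter of the residual variance left by the linear model.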

Treating time as a categorical variable

mixed drink i.wave || id:
estimates store means
margins, at(wave=(0(2)10))
marginsplot
lrtest linear means
lrtest quadratic means

Random-coefficients model

mixed drink c.wave || id: wave, cov(unstructured)
predict yhat_drink, fitted

Including a time-invariant covariate

* Random coefficients model with time invariant covariate
* gender coded as male = 1, female = 0
mixed drink c.wave i.male || id: wave
margins male, at(wave=(0(2)8))
marginsplot

* Random coefficients, with wave interacting with the
* time invariant covariate--gender coded
mixed drink c.wave##i.male || id: wave
margins male, at(wave=(0(2)8))
marginsplot

mixed drink c.wave##c.wave##i.male || id: wave
margins male, at(wave=(0(2)8))
marginsplot

Multiple Regressions

Wed, 26 Jun 2019 00:00:00 +0000

Note: toc is not compatible with markup: mmark

Basic

F: tests whether there is a significant relationship between the outcome and the set of predictors.
R2: how much of the outcome variance is explained by the regression model.
Adj-R2: R2 adjusted to remove chance effects.
Coef.: unstandardized regression coefficients.
t: coef/standard error.
Std. Err.: the standard error of each coefficient, that is, how much the estimated coefficient would vary from sample to sample. By contrast, the Root MSE represents the typical distance that the observed values fall from the regression line; it tells you how wrong the regression model is on average, in the units of the response variable.
, beta gives beta weights, which are based on standardizing all variables to have a mean of 0 and a standard deviation of 1: a 1-standard-deviation change in the independent variable produces a beta-standard-deviation change in the dependent variable. Beta weights are interpreted similarly to correlations: |beta| < 0.2 is considered a weak effect, between 0.2 and 0.5 a moderate effect, and > 0.5 a strong effect. They normally range from -1 to +1; a value outside this range points to a multicollinearity problem.
Increment in R2: the squared part correlation, or semipartial R2 (Semipartial Corr.^2 in pcorr), measures the part of the outcome variance that is uniquely explained by each predictor. Another way to compare predictors is the partial correlation.
Distribution of the dependent variable: histogram env_con, frequency normal kdensity (kdensity adds a kernel density estimate). Skewness: 0 is normal; <0 is negative or left skew; >0 is positive or right skew. Kurtosis: 3 is normal; <3 is flatter than normal with lighter tails (negative kurtosis); >3 is more peaked than normal with heavier tails (positive kurtosis). sktest tests both.
Distribution of the residuals: for large samples, normality is not a critical issue. rvfplot, yline(0) draws the residual-versus-fitted plot. To deal with nonnormal residuals, we can use reg y xs, vce(robust) or the bootstrap, reg y xs, vce(bootstrap, reps(1000)); either changes the standard errors and hence the t-values. However, Andrew J. Leone, Miguel Minutti-Meza, and Charles E. Wasley (2019), Influential Observations and Inference in Accounting Research, The Accounting Review, In-Press, discuss robust regression using robreg; what is the difference? Also check Correcting for Cross-Sectional and Time-Series Dependence in Accounting Research.
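The link between the unstandardized coefficient and the beta weight can be made concrete with a small Python sketch. The education and income values are invented; the point is only that rescaling the slope by sd(x)/sd(y) gives the same number as regressing z-scores on z-scores.

```python
from statistics import mean, stdev

def slope(x, y):
    """Unstandardized slope from a simple regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def zscores(values):
    """Standardize to mean 0 and standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

educ = [8, 10, 12, 12, 14, 16, 18]   # hypothetical years of education
inc = [20, 24, 30, 27, 41, 45, 60]   # hypothetical income

b = slope(educ, inc)                          # unstandardized coefficient
beta = b * stdev(educ) / stdev(inc)           # beta weight by rescaling
beta_z = slope(zscores(educ), zscores(inc))   # slope on standardized data
print(round(b, 3), round(beta, 3), round(beta_z, 3))
```

With a single predictor the beta weight equals the correlation, which is why it must fall between -1 and +1 in that case.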

regress env_con educat inc com3 hlthprob epht3, beta
predict envhat
preserve
set seed 515
sample 100, count
twoway (scatter env_con envhat) (lfit env_con envhat)
restore

Diagnostic statistics

Rstandard:

The standardized residual is the residual divided by its standard deviation.

regress env_con educat inc com3 hlthprob epht3, beta
predict yhat
predict residual, residual
predict rstandard, rstandard
list respnum env_con yhat residual rstandard if abs(rstandard) > 2.58 & rstandard < .
dfbeta
list respnum rstandard _dfbeta_1 if abs(_dfbeta_1) > 2/sqrt(3769) & _dfbeta_1 < .
estat vif

Influential observations: DFbeta. You could think of this as redoing the regression model, omitting just one observation at a time and seeing how much difference omitting each observation makes to each coefficient. **A value of |DFbeta| > 2/sqrt(N)** indicates that an observation has a large influence. DFbeta is more specific than rstandard because it is computed separately for each predictor.
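The leave-one-out idea can be sketched for a single predictor in plain Python. This assumes one common scaled definition (change in the slope divided by its standard error computed without the observation); the data are made up, with one deliberately odd point at the end.

```python
def fit(x, y):
    """Return slope, root MSE, and Sxx of a simple regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sxx
    a0 = my - b * mx
    sse = sum((c - a0 - b * a) ** 2 for a, c in zip(x, y))
    return b, (sse / (n - 2)) ** 0.5, sxx

def dfbeta_slope(x, y):
    """Scaled change in the slope when each observation is left out in turn."""
    b_all, _, sxx = fit(x, y)
    result = []
    for i in range(len(x)):
        x_rest, y_rest = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        b_rest, rmse_rest, _ = fit(x_rest, y_rest)
        result.append((b_all - b_rest) / (rmse_rest / sxx ** 0.5))
    return result

# hypothetical data: ten well-behaved points plus one outlier (index 10)
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [0.1, 2.2, 3.9, 6.1, 8.0, 9.8, 12.1, 14.2, 15.9, 18.0, 0.0]

cutoff = 2 / len(x) ** 0.5
flagged = [i for i, d in enumerate(dfbeta_slope(x, y)) if abs(d) > cutoff]
print(round(cutoff, 3), flagged)
```

The odd last point exceeds the 2/sqrt(N) cutoff, mirroring what the list command above does with _dfbeta_1.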

. dfbeta
(739 missing values generated)
                       _dfbeta_1: dfbeta(educat)
(739 missing values generated)
                       _dfbeta_2: dfbeta(inc)
(739 missing values generated)
                       _dfbeta_3: dfbeta(com3)
(739 missing values generated)
                       _dfbeta_4: dfbeta(hlthprob)
(739 missing values generated)
                       _dfbeta_5: dfbeta(epht3)

Multicollinearity: The more correlated the predictors, the more they overlap and, hence, the more difficult it is to identify their independent effects. In such situations, you can have multicollinearity, in which one or more of the predictors are virtually redundant. Run estat vif after the regression to get the variance inflation factor (VIF): if the VIF is greater than 10 for any variable, a multicollinearity problem may exist. If the average VIF is substantially greater than 1.00, there still could be a problem. Possible remedies are dropping a variable or creating a scale that combines the overlapping variables into one variable. The tolerance, 1/VIF = 1 - R2 (from regressing X1 on the other Xs), tells how much of the variance in the independent variable is available to predict the outcome variable independently.
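The relation 1/VIF = 1 - R2 can be worked through with a tiny Python sketch. The correlation of 0.9 is a made-up value; with only two predictors, the R2 from regressing one on the other is simply the squared correlation.

```python
def vif(r_squared):
    """Variance inflation factor from the R2 of X1 regressed on the other Xs."""
    return 1 / (1 - r_squared)

r = 0.9                      # hypothetical correlation between two predictors
vif_value = vif(r ** 2)      # two predictors: R2 equals r squared
tolerance = 1 / vif_value    # share of variance not shared with the other X
print(round(vif_value, 2), round(tolerance, 2))
```

Even a correlation of 0.9 gives a VIF of about 5.3, so the VIF > 10 rule flags only very strong overlap.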

Weighted data

regress env_con educat inc com3 hlthprob epht3 [pweight=finalwt], beta

When you run a weighted regression this way, Stata automatically reports robust standard errors, whether you ask for them or not, because weighted data require them.

Categorical predictors and hierarchical regression

regress smday97 age97 male psmoke97 aa hispanic other if !missing(smday97, ///
	age97, male, psmoke97, aa, hispanic, other), beta
test aa hispanic other

nested regressions

nestreg: regress smday97 (age97 male) (psmoke97) (aa hispanic other), beta

If you put i. as a stub in front of a categorical variable, Stata will make the first category the reference category and then generate a dummy variable for each of the remaining categories.

regress smday97 age97 male psmoke97 i.race
* change the reference category (what Stata refers to as the base level)
regress smday97 age97 male psmoke97 ib3.race
regress smday97 age97 male psmoke97 ib(last).race

interaction

g ed_male = educ*male
reg inc educ male ed_male,beta
nestreg: regress inc (educ male) (ed_male), beta
regress inc i.male##c.educ, beta

Some researchers choose to center quantitative independent variables, such as education, before computing the interaction terms. Centering is important for independent variables where a value of zero may not be meaningful.

summarize educ
generate educ_c = educ - r(mean)

margins helps us to interpret the interaction term:

margins male, at(educ=(8 10 12 14 16 18))
marginsplot

nonlinear

regress ln_wage c.ttl_exp##c.ttl_exp, beta
margins, at(ttl_exp = (0(2)28))
marginsplot

Power analysis

We have no idea

Wed, 26 Jun 2019 00:00:00 +0000

Bosons make up one of the two classes of particles, the other being fermions.

So far, we have some hints and some ideas about what the smallest distance in the universe might be (the Planck length). We have a pretty good catalog of twelve matter particles that so far we haven’t been able to break further apart (the Standard Model). And we have a list of three possible ways that these particles can interact (the electroweak and strong forces and gravity).

An example preprint / working paper

Sun, 07 Apr 2019 00:00:00 +0000

Supplementary notes can be added here, including code and math.

Slides

Tue, 05 Feb 2019 00:00:00 +0000

Welcome to Slides

Academic

Features

Efficiently write slides in Markdown
3-in-1: Create, Present, and Publish your slides
Supports speaker notes
Mobile friendly slides

Controls

Next: Right Arrow or Space
Previous: Left Arrow
Start: Home
Finish: End
Overview: Esc
Speaker notes: S
Fullscreen: F
Zoom: Alt + Click
PDF Export: E

Code Highlighting

Inline code: variable

Code block:

porridge = "blueberry"
if porridge == "blueberry":
    print("Eating...")

Math

In-line math: $x + y = z$

Block math:

$$ f\left( x \right) = \;\frac{{2\left( {x + 4} \right)\left( {x - 4} \right)}}{{\left( {x + 4} \right)\left( {x + 1} \right)}} $$

Fragments

Make content appear incrementally

{{% fragment %}} One {{% /fragment %}}
{{% fragment %}} **Two** {{% /fragment %}}
{{% fragment %}} Three {{% /fragment %}}

Press Space to play!

One Two Three

A fragment can accept two optional parameters:

class: use a custom style (requires definition in custom CSS)
weight: sets the order in which a fragment appears

Speaker Notes

Add speaker notes to your presentation

{{% speaker_note %}}
- Only the speaker can read these notes
- Press `S` key to view
{{% /speaker_note %}}

Press the S key to view the speaker notes!

Themes

black: Black background, white text, blue links (default)
white: White background, black text, blue links
league: Gray background, white text, blue links
beige: Beige background, dark text, brown links
sky: Blue background, thin dark text, blue links

night: Black background, thick white text, orange links
serif: Cappuccino background, gray text, brown links
simple: White background, black text, blue links
solarized: Cream-colored background, dark green text, blue links

Custom Slide

Customize the slide style and background

{{< slide background-image="/img/boards.jpg" >}}
{{< slide background-color="#0000FF" >}}
{{< slide class="my-style" >}}

Custom CSS Example

Let's make headers navy colored.

Create assets/css/reveal_custom.css with:

.reveal section h1,
.reveal section h2,
.reveal section h3 {
  color: navy;
}

Questions?

Ask

Documentation

Privacy Policy

Thu, 28 Jun 2018 00:00:00 +0100

…

Terms

Thu, 28 Jun 2018 00:00:00 +0100

…

External Project

Wed, 27 Apr 2016 00:00:00 +0000

Internal Project

Wed, 27 Apr 2016 00:00:00 +0000

An example journal article

Tue, 01 Sep 2015 00:00:00 +0000

Supplementary notes can be added here, including code and math.

An example conference paper

Mon, 01 Jul 2013 00:00:00 +0000

Supplementary notes can be added here, including code and math.

Little World

Example Page 1

Tip 1

Tip 2

Example Page 2

Tip 3

Tip 4

Example Talk

Replacing Stata and SAS with R

Installing Stata

Calling Stata from R

Exchanging data across the three environments

The Catcher in Rye

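Tinkering with Manjaro

Background

Tinkering notes

Switching to Chinese mirrors after installation

NVIDIA graphics driver

Chinese input method

Deepin desktop

Installing miniconda

Setting up services and opening ports

Starting Rstudio Server automatically at boot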

Jupyter lab

Chrome Remote Desktop

System backup and restore

Data Vis Chapter 8

Use Color Palette

Layer Color and Text Together

Themes

Use Theme Elements

Two y-axes

Data Vis Chapter 6

Show Several Fits at Once, with a Legend

Model-based Graphics

Tidy Model Objects with Broom

get component-level statistics with tidy()

Get observation-level statistics with augment()

Grouped Analysis

Plots for Surveys

Data Visualization Chapter 2-4

Chapter 2

Chapter 3

Wrong way to set color

Aesthetics Can Be Mapped per Geom

Save plots

Chapter 4

Group data and the “Group” Aesthetic

Facet to make small multiples

Geoms can transform data

Histograms and Density Plots

Avoid Transformations When Necessary

Data Visualization Chapter 5

Chapter 5

Use Pipes to Summarize Data

Continuous Variables by Group or Category

Write and Draw in the Plot Area

Scales, Guides, and Themes

Meta-Analysis Note 1

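Formulate a problem

Obtain the relevant literature

Evaluate and select the literature

Analyze and interpret the literature

Writing the literature review

A few suggestions from the author

My suggestions on tools

Latest suggestions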

SEM and GSEM

SEM

GSEM

Panel data in R vs in Stata

Panel data with one way fixed effect

vcetype robust

Comparing coefficients across groups

Difference in Difference

Effect evaluation model

Multiple regression model

Fixed effects

Group fixed effects

Time fixed effects