Stata | Little World

Replacing Stata and SAS with R

Mon, 20 Jan 2020 00:00:00 +0000

Installing Stata

First install the two packages ncurses5-compat-libs and libpng12, then:

% sudo -s

cd /tmp/

mkdir statafiles

cd statafiles

tar -zxf /home/you/Downloads/Stata14Linux64.tar.gz

cd /usr/local

mkdir stata14

cd stata14

/tmp/statafiles/install

After installation, add the installation directory to the PATH environment variable. I chose to edit /etc/profile and add:

export PATH="$PATH:/usr/local/stata14"

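To make this take effect without rebooting, run source /etc/profile.

The license file can be copied directly into the installation directory, or you can place stata.lic.tar.gz in that directory.

Calling Stata from R

This is done with the RStata package.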

#run Stata in R----
library("RStata")
options("RStata.StataPath" = "D:\\Stata15\\StataSE-64") #office
options("RStata.StataPath" = "/usr/local/stata14/stata") #linux #cannot use stata-se?
options("RStata.StataVersion" = 14)

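Exchanging data among the three environments

In R, use two packages:

library(haven)   # need read_dta / write_dta for .dta files
library(rio)     # rio::import reads SAS data
library(dplyr)   # mutate_if and %>%
library(stringr) # str_c
# haven::read_sas can also import sas7bdat
# data_loc is the data directory, assumed to be defined earlier
f1 <- str_c(data_loc, "after2007.sas7bdat", sep = "/")
o1 <- str_c(data_loc, "after2007.dta", sep = "/")
after2007_raw <- import(f1)
after2007_raw %>%
  mutate_if(is.numeric, as.integer) %>%
  write_dta(., o1, version = 12)
# Because SAS only supports Stata 12 files (or earlier), while haven supports Stata versions 8-15.

If none of the methods above can read the sas7bdat file, use SAS as an intermediary: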

/* import a Stata data file; only Stata 12 or earlier is supported */
PROC IMPORT OUT= WORK.S1 
            DATAFILE= "E:\after2007.dta" 
            DBMS=STATA REPLACE;
RUN;

proc export data=raw1 outfile= "D:\sample.dta" replace;
run;

SEM and GSEM

Wed, 25 Sep 2019 00:00:00 +0000

SEM

sem bmi <- age children incomeln educ quickfood

This gives us the unstandardized solution. The sem command uses maximum likelihood estimation rather than the ordinary least-squares (OLS) estimation used by the regress command. Add the standardized option to sem just as you would add the beta option to regress.

option method(mlmv) (maximum likelihood with missing values): Estimation is less robust to the assumption of multivariate normality when using the method(mlmv) option than when using maximum likelihood estimation with listwise deletion of observations with missing values. Because some of the five variables in our model are not normally distributed, the method(mlmv) option needs to be used with caution. The estimation performed when we use the method(mlmv) option also assumes that the missing values are MAR¹ . By contrast, when listwise deletion is used we are assuming that missing values are MCAR¹, and this is a much more restrictive assumption.

sem bmi <- age children incomeln educ quickfood, method(mlmv) standardized

estat eqgof

The OLS regression solution and the SEM solution without MLMV, which uses listwise deletion, are producing the same standardized parameter estimates and $R^2$s. As noted, the z values are slightly larger than the t-values, and the p-values are slightly smaller. The z tests for the SEM solution are directly testing the standardized solution. The regress solution’s t tests are testing the significance of the unstandardized B coefficients and do not directly test the significance of the Betas. The regress command does not provide such a direct test for the significance of Betas.

Notice that the $R^2$ using sem with method(mlmv) is actually slightly smaller. Using all the available information in the SEM solution with MLMV is not cheating if the assumptions are met. The MAR assumption for the SEM solution is more realistic than the MCAR assumption required for listwise deletion to be unbiased.

There are three rules to follow when using the maximum likelihood with missing values estimation.

Generate an indicator variable for each variable in your model to reflect whether an observation has a missing value.
Correlate potential auxiliary variables to see whether they predict missing value indicator variables.
Include additional auxiliary variables that are substantially correlated with a person’s score on a variable that has missing values.

Getting auxiliary variables into your SEM command??? (I did not understand this part.)

GSEM

logit obese age children incomeln educ quickfood
listcoef
glm obese age children incomeln educ quickfood, family(binomial) link(logit)
glm, eform

The logit command is a special application of the generalized linear model. We can obtain the same results by using the glm command. The glm command requires us to specify the family of our model, family(binomial), and the link function, link(logit). To obtain the odds ratio, we can replay these results by using glm, eform.

I did not understand the rest; I will come back to it later.

¹ Missing at Random (MAR): This is where the unfortunate names come in. Missing at Random means the propensity for a data point to be missing is not related to the missing data, but it is related to some of the observed data.

Panel data in R vs in Stata

Tue, 27 Aug 2019 00:00:00 +0000

Panel data with one way fixed effect

library(plm)
mm1 <- invforward ~ TOBINQ + inv + top3 + size + lev + cash + loss + lnage + cfo + sd + ic + factor(year)
zzz <- plm(mm1, data = sample, model = "within", index = c("stkcd"))

This is the same as xtreg with i.year and the fe option in Stata, without a robust vcetype. The $R^2$ computed this way matches the within $R^2$ that Stata reports.

m1 <- invforward ~ TOBINQ + inv + top3 + size + lev + cash + loss + lnage + cfo + sd + ic
zz <- plm(m1,data=sample,model="within",index=c("stkcd", "year"),effect = "twoways")
summary(zz)

This is also the same as xtreg with i.year and the fe option, without a robust vcetype, but the $R^2$ is smaller than the within $R^2$ that Stata reports.

vcetype robust

library(lmtest) # coeftest
zz_r <- coeftest(zz, vcov. = function(x) vcovHC(x, type = "sss")) # same as Stata xtreg i.year, fe r
# OR
zzz_r <- coeftest(zzz, vcov. = function(x) vcovHC(x, type = "sss"))

Comparing coefficients across groups

For OLS, the following works:

sur_diff <-  MVBV ~ (Dm + Dh + EBV + DmEBV +DhEBV)*g_layer
h2t <- h2 %>%
  filter(g_layer != 2)%>%
  mutate(g_layer = ifelse(g_layer == 1, 0, 1))
mm <- lm(sur_diff,data=h2t)
ttt <-  coeftest(mm, vcov.=function(x) vcovHC(x, cluster="group", type="HC1"))

stargazer(fpm,models_growth_layer,type = "text", column.labels = table4_label)
stargazer(fpm_r,robusts_growth_layer,type = "text", column.labels = table4_label,
          add.lines=c("DhEBV(4)-(2)", str_c(round(ttt[12,1],3),"**(p=",round(ttt[12,4],3),")")))

This does not work for panel data, with either one-way or two-way fixed effects. I suggest adding the interaction terms directly instead.

Logistic Regression

Wed, 26 Jun 2019 00:00:00 +0000

Odds ratios

An odds ratio of 1.0 is equivalent to a beta weight of 0.0.

Group	Diseased	Healthy
Exposed	$D_E$	$H_E$
Not exposed	$D_N$	$H_N$

$OR={\frac {D_{E}/H_{E}}{D_{N}/H_{N}}}$

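The distribution of the odds ratio is far from normal. Taking the natural logarithm of the odds ratio gives a distribution that is much closer to normal. The logit is the natural logarithm of the odds, so a logit coefficient $b$ is the natural logarithm of the corresponding odds ratio.

$logit(p) = \ln\left(\frac{p}{1-p}\right), \quad b = \ln(OR)$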

When the mean is around 0.50, the OLS regression and logistic regression produce consistent results, but when the probability is close to 0 or 1, the logistic regression is especially important.

Logistic regression

The logit command gives the regression coefficients to estimate the logit score. The logistic command gives us the odds ratios we need to interpret the effect size of the predictors.

Both commands give the same results, except that logit gives the coefficients for estimating the logit score and logistic gives the odds ratios.

The McFadden pseudo-$R^2$ represents how much larger the log likelihood is for the final solution than for the intercept-only model. A value of 0.02, for example, means the log likelihood for the fitted model is 2% larger than the log likelihood for the intercept-only model. This is not explained variance. The pseudo-$R^2$ is often a small value, and many researchers do not report it. The biggest mistake is to report it and interpret it as explained variance.

If you are interested in specific effects of individual variables, it is better to rely on odds ratios for interpreting results of logistic regression. ~~This shows that mothers who smoke have 2.02 times greater odds of having a low-birthweight child.~~

Odds ratios tell us what happens to the odds of an outcome, whereas risk ratios tell us what happens to their probability.

For binary predictor variables, you can interpret the odds ratios and percentages directly. For variables that are not binary, you need to have some other standard. One solution is to compare specific examples, such as having no dinners with the family versus having seven dinners with them each week. Another solution is to evaluate the effect of a 1-standard-deviation change for variables that are not binary. The listcoef command, from the spost13 package, reports this. After a logit or logistic regression, run listcoef, help or listcoef, help percent.

Group	Experimental (E)	Control (C)
Events (E)	EE	CE
Non-events (N)	EN	CN

$RR={\frac {EE/(EE+EN)}{CE/(CE+CN)}}={\frac {EE(CE+CN)}{CE(EE+EN)}}.$

Relative risk is the risk of an event occurring under exposure to some condition, relative to the risk without that exposure. (See also the oddsrisk command.)

$OR={\frac {EE/CE}{EN/CN}}={\frac {EE\cdot CN}{EN\cdot CE}}$

The odds of an event are the ratio of the event occurring to the event not occurring.

The risk ratio is different from the odds ratio, although it asymptotically approaches it for small probabilities of outcomes. If EE is substantially smaller than EN, then $EE/(EE+EN) \approx EE/EN$. Similarly, if CE is much smaller than CN, then $CE/(CN+CE) \approx CE/CN$. Thus

$RR={\frac {EE(CE+CN)}{CE(EE+EN)}}\approx {\frac {EE\cdot CN}{EN\cdot CE}}=OR.$

The difference is small with a rare outcome. The relative risk is appealing, but it should not be used in a study that controls the number of people in each category.
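As a quick arithmetic check of the RR and OR formulas, here is a minimal sketch using Stata scalars. The four cell counts are invented for illustration and come from no study; the outcome is deliberately rare (about 1 to 2%) so that the two measures end up close to each other.

```stata
* Hypothetical 2x2 counts (illustrative only, not from the source)
scalar EE = 20     // experimental group, event
scalar EN = 980    // experimental group, no event
scalar CE = 10     // control group, event
scalar CN = 990    // control group, no event

* Risk ratio: ratio of the two event probabilities
scalar RR = (EE/(EE + EN)) / (CE/(CE + CN))

* Odds ratio: ratio of the two event odds
scalar OR = (EE/EN) / (CE/CN)

display "RR = " %6.4f RR "   OR = " %6.4f OR
```

With these counts the risk ratio is exactly 2 and the odds ratio is 19800/9800, about 2.02. Making the event more common (say EE = 200, EN = 800) pulls the two apart.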

Hypothesis testing

The overall chi-squared test has k degrees of freedom, where k is the number of predictors. It tells us only whether the overall model has at least one significant predictor.

Testing individual coefficients

The z test in the Stata output is actually the square root of the Wald chi-squared test.
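The relationship between the z test and the Wald chi-squared test can be checked directly. The sketch below uses the auto dataset that ships with Stata and an arbitrary model (foreign on mpg and weight); both choices are assumptions made for illustration and have nothing to do with the drinking example in this post.

```stata
* Illustration on Stata's shipped auto dataset (an assumption; any logit model works)
sysuse auto, clear
logit foreign mpg weight

* z statistic for mpg, computed by hand from the stored results
scalar z_mpg = _b[mpg] / _se[mpg]

* Wald chi-squared test of the same coefficient
test mpg

* r(chi2) left behind by -test- should equal z squared
display "z^2 = " z_mpg^2 "   Wald chi2 = " r(chi2)
```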

The likelihood-ratio chi-squared test for each parameter estimate is based on comparing two logistic models, one with the individual variable we want to test included and one without it. The likelihood-ratio test is the difference in the likelihood-ratio chi-squared values for these two models (this appears as LR chi2(1) near the upper right corner of the output). The difference between the two likelihood-ratio chi-squared values has 1 degree of freedom.

use nlsy97_chapter11, clear
* Restrict the reduced model to cases with age97 observed so both models use the same sample
logistic drank30 male dinner97 pdrink97 if !mi(age97)
estimates store a
logistic drank30 age97 male dinner97 pdrink97
* lrtest subtracts the chi-squared values and estimates the probability of the chi-squared difference
lrtest a

or just use lrdrop1

Testing sets of coefficients

test pdrink97 dinner97
* test is a Wald test; the likelihood-ratio counterpart is:
logistic drank30 age97 male if !mi(dinner97) & !mi(pdrink97)
estimates store a
logistic drank30 age97 male pdrink97 dinner97
lrtest a
lrdrop1

This overall test only tells us that at least one of them is significant.

Margins

logit drank30 age97 i.black pdrink97 dinner97
margins, dydx(black) atmeans
margins black, atmeans
margins, at(pdrink97=(1 2 3 4 5)) atmeans
marginsplot

We can run the logistic regression using the i. label for this categorical variable, i.black. This produces the same results for the logistic regression as if we had simply used black, but the results will work properly if we follow this command with other postestimation commands.

Nested logistic regressions

The nestreg command is extremely general, applicable across a variety of regression models, including logistic, negative binomial, Poisson, probit, ordered logistic, tobit, and others. It also works with the complex sample designs for many regression models.

Power analysis

powerlog, p1(.70) p2(.75) alpha(.05)
powerlog, p1(.70) p2(.75) alpha(.05) rsq(.30) help

Measurement, reliability, and validity

Wed, 26 Jun 2019 00:00:00 +0000

Constructing a Scale

recode empathy2 empathy4 empathy5 (1=5 "Does not describe very well") ///
  (2=4) (3=3) (4=2) (5=1 "Describes very well"), pre(rev) label(empathy)
egen empathy = rowmean(empathy1 revempathy2 empathy3 revempathy4 ///
  revempathy5 empathy6 empathy7)
egen miss = rowmiss(empathy1 revempathy2 empathy3 revempathy4 ///
   revempathy5 empathy6 empathy7) 
egen empathya = rowmean(empathy1 revempathy2 empathy3 revempathy4 ///
   revempathy5 empathy6 empathy7) if miss < 3

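One drawback to using the rowmean() function is that it simply adds the scores on the items a person answers and divides by the number of items answered, so someone who answered only one item still gets a scale score. This is why empathya above is computed only when fewer than three items are missing.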
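The behavior of rowmean() with missing items is easy to see on a toy dataset. The three respondents and the items q1 to q3 below are made up for illustration; the cutoff nmiss < 2 plays the role of miss < 3 in the empathy example.

```stata
* Toy data (hypothetical): respondent 2 answers one item, respondent 3 answers none
clear
input id q1 q2 q3
1 4 5 3
2 4 . .
3 . . .
end

egen m        = rowmean(q1 q2 q3)               // mean of the answered items only
egen nmiss    = rowmiss(q1 q2 q3)               // number of unanswered items
egen m_strict = rowmean(q1 q2 q3) if nmiss < 2  // require at least two answered items

list
```

Respondents 1 and 2 both get m = 4, even though respondent 2 answered a single item; m_strict keeps the score only for respondent 1.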

Reliability

Stability means that if you measure a variable today using a particular scale and then measure it again tomorrow using the same scale, your results will be consistent. (Measures: correlation r via pwcorr; intraclass correlation $\rho_I$.)
Equivalence means that you have two measures of the same variable and they produce consistent results. (Measure: correlation $r_{xx}$.) A low correlation means either that the measure is not reliable or that the measures are not truly equivalent.
Internal consistency: a reliable test would be internally consistent if the score for the first half of the items was highly correlated with the score for the second half of the items. (Measures: correlation $r_{x_Ax_B}$; Cronbach's alpha, $\alpha$, via the alpha command.) In general, an $\alpha>0.8$ is considered good reliability, and many researchers feel an $\alpha>0.7$ is adequate reliability. ($\alpha=\sigma^2_{True}/(\sigma^2_{True}+\sigma^2_{error})$) However, for this interpretation to be used, we need to assume that the scale is valid.
alpha empathy1 revempathy2 empathy3 revempathy4 revempathy5 ///
  empathy6 empathy7, asis item min(5)
The asis (as is) option means that we do not want Stata to change the signs of any of our variables. The bottom row of the output table, Test scale, reports the $\alpha$ for the scale (0.7462). Above this value is the $\alpha$ we would obtain if we dropped each item, one at a time. The item-test correlation column reports the correlation of each item with the total score of the seven items. The item-rest correlation column reports the correlation of each item with the total of the other items. The equivalent of alpha for items that are dichotomous is the Kuder–Richardson measure of reliability, which the alpha command also reports.
Rater consistency is important when you have observers rating a video, observed behavior, essay, or something else where two or more people are rating the same information. Here reliability means that a pair of raters gives consistent results. (Measure: kappa, $\kappa$, via kap coder1 coder2.) $\kappa$ only gives us credit for the extent to which the agreement exceeds what we would have expected to get by chance alone. Kappa tends to be lower than alpha.

Validity

A valid measure is one that measures what it is supposed to be measuring.
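Face validity: give the questionnaire you designed to friends and family to fill out, and ask them whether it is a good questionnaire. It refers to how valid the instrument appears in its outward form.
Content validity: ask a group of people with relevant experience to review the items, and ask them whether the items are well designed and whether anything should be revised. Content validity ratio (CVR): Judges rate each item as essential, useful, or not necessary. $CVR=(N_e - N/2)/(N/2)$, in which $N_e$ is the number of panelists indicating "essential" and $N$ is the total number of panelists. You can keep the items that have a relatively high CVR and drop those that do not.
Criterion validity: correlate the instrument with other measurable instruments. The correlation coefficient between the test scores and a specific criterion indicates how valid the instrument is.
- (1) Concurrent validity: correlate the items you designed with a standard instrument (same concept, same variables). Example: to measure pain tolerance, you have four items that take one minute to complete and a standard instrument with 45 items that takes one hour. If r = 0.92 (a high correlation), the original items have concurrent validity.
- (2) Predictive validity: a survey can predict future events, behaviors, attitudes, or outcomes. Example: patients' need for painkillers after surgery. Take the scores of 24 patients, where a higher score means higher tolerance of surgery. Correlating the 24 scores with the amount of painkillers taken gives r = -0.82, meaning that high pain tolerance goes with low painkiller use. Correlating SAT scores (which are supposed to predict the first-semester college grade average) with the first-semester college grade average gives r = 0.42, which indicates no predictive validity. If r increases year by year, however, that indicates predictive validity.
Construct validity:
- We can assess the convergent and divergent validity of our measure, hope, by seeing whether it is positively correlated with variables with which we believe it converges and negatively correlated with variables with which we believe it diverges. Useful commands: ttest, esize, pwcorr.

Factor analysis

Exploratory factor analysis, which Stata calls principal factor (PF) analysis: the variance is partitioned into the shared variance and the unique or error variance. The shared variance is how much of the variance in any one item can be explained by the rest of the items.
Principal-component factor (PCF) analysis.

putdocx: Stata 15 can create Word documents!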

Terminology

Extraction
Eigenvalues: In PCF analysis, if there are 10 items, the sum of the eigenvalues will be 10. The factors will be ordered from the most important, which has the largest eigenvalue, to the least important, which has the smallest eigenvalue. In PF analysis, the sum of the eigenvalues will be less than the number of items, and the eigenvalues’ interpretation is complex.
Communality and uniqueness: PF analysis tries to explain the shared variance. PCF analysis tries to explain all the variance, which is why it is ideal for the uniqueness to approach zero.
Loadings: how clusters of items are most related to one or another of the factors. If an item has a loading over 0.4 on a factor, it is considered a good indicator of that factor.
Simple structure: This is a pattern of loadings where each item loads strongly on just one factor and a subset of items load strongly on each factor. When an item loads strongly on more than one factor, it is factorially confounded.
Scree plot: This is a graph showing the eigenvalue for each factor. When doing a PCF analysis, we usually drop factors that have eigenvalues in the neighborhood of 1.0 or smaller.
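Rotation: There are many rotation methods, but they fall into two broad classes: orthogonal and oblique. The purpose of rotation is to make the factors more meaningful and, at the same time, to examine the relationships among the factors. More specifically, an orthogonal rotation assumes that the factors are uncorrelated, whereas an oblique rotation assumes that the factors are correlated to some degree.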
Factor score: weights each item based on how related it is to the factor. Also the factor score is scaled to have a mean of 0.0 and a variance of 1.0.

Use PCF when you have a set of items that you believe all measure one concept. In this situation, you would be interested in the first principal factor. You would want to see if it explained a substantial part of the total variance for the entire set of items, and you would want most of the items to have a loading of 0.4 or above on this factor. Because PCF analysis is trying to explain all the variance in the items, the uniqueness for each item should approach zero. Generally, we should consider any factor that has an eigenvalue of more than 1. A visual way to examine the eigenvalues is with a scree plot.

factor rnatspac rnatenvir rnatheal rnatcity rnatcrime rnatdrug ///
	rnateduc rnatrace rnatarms rnatfare rnatroad rnatsoc rnatchld rnatsci, pcf
screeplot

If, on the other hand, you want to identify two or more latent variables that represent interpretable dimensions of some concept, then PF analysis is probably best.

Rotation

Orthogonal: rotate. With a varimax rotation (the default for rotate), we can think of the loadings as being the estimated correlation between each item and each factor.
Oblique: rotate, promax

Run estat common to get the correlation matrix of the promax-rotated common factors.

Get one factor score

A factor score weights each item by how strongly it is related to the factor, whereas a mean or total score weights all items equally. However, this distinction rarely makes a lot of practical difference. The factor score may make a difference if there are some items with very large loadings, say, 0.9, and others with very small loadings, say, 0.2. But we would probably drop the weakest items. When the loadings do not vary a great deal, computing a factor score or a mean/total score will produce comparable results.

factor rnatenvir rnatheal rnatcity rnatcrime rnatdrug rnateduc rnatrace ///
	rnatfare rnatsoc rnatchld, pcf
predict libfscore, norotate
egen libmean = rowmean(rnatenvir rnatheal rnatcity rnatcrime rnatdrug ///
	rnateduc rnatrace rnatfare rnatsoc rnatchld)

The correlation between libfscore and libmean is higher than 0.9.

Missing values

Wed, 26 Jun 2019 00:00:00 +0000

Many advanced Stata estimation models can use multiple imputation for handling missing values.

Auxiliary variables are variables that can help to make estimates on incomplete data, while they are not part of the main analysis (Collins et al., 2001).

Include all variables in the analysis model, including the dependent variable.
Include auxiliary variables that predict patterns of missingness.
Include additional variables that predict a person’s score on a variable that has missing values.

The imputation model is then used to generate a complete dataset.

Once you have included a reasonably large number of variables, adding additional variables may not be helpful because of multicollinearity.

The traditional approach is to drop any participant who does not have complete information on every item used in the analysis. This approach goes by several names, including full case analysis, casewise deletion, or listwise deletion. It has two drawbacks:

There will be a substantial loss of power because of the reduced sample size.
Listwise deletion can introduce substantial bias. (survival bias)

One alternative to listwise deletion involves substituting the mean on a variable for anybody who does not have a response. This has two serious limitations. First, people who are average on a variable are often more likely to give an answer than are people who have an extreme value, so the mean is often a poor guess for the people who did not answer. Second, when you give several people the same score on a variable, these people have zero variance on the variable. This artificially reduced variance will seriously bias our parameter estimates.

The key to understanding multiple imputation is that the imputed missing values will not contain any unique information once the variables in the model and the auxiliary variables are allowed to explain the patterns of missing values and predict the score of the missing values. The imputed values for variables with missing values are simply consistent with the observed data. This allows us to use all available information in our analysis.

Multiple imputation

A powerful way of working with missing values involves multiple imputation. The command mi involves three straightforward steps:

Create m complete datasets by imputing the missing values. Each dataset will have no missing values, but the values imputed for missing values will vary across the datasets.
Do your analysis in each of the m complete datasets.
Pool your m solutions to get one solution.
- The parameter estimates—for example, regression coefficients—will be the mean of their corresponding values in the datasets.
- The standard errors used for testing significance will combine the standard errors from the solutions plus the variance of the parameter estimates across the solutions. If each solution is yielding a very different estimate, this uncertainty is added to the standard errors. Also the degrees of freedom is adjusted based on the number of imputations and proportion of data that have missing values.
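The pooling step can be sketched by hand with the standard combining rules (Rubin's rules), which is what the description above amounts to: the pooled estimate is the mean of the m estimates, and the total variance is the average squared standard error plus (1 + 1/m) times the between-imputation variance. The three coefficients and standard errors below are invented for illustration; in practice mi estimate does this for you.

```stata
* Hypothetical coefficient and standard error from m = 3 imputed datasets
* (numbers are invented for illustration)
scalar m   = 3
scalar b1  = 0.50
scalar b2  = 0.54
scalar b3  = 0.46
scalar se1 = 0.10
scalar se2 = 0.10
scalar se3 = 0.10

* Pooled estimate: mean of the m estimates
scalar bbar = (b1 + b2 + b3)/m

* Within-imputation variance: mean of the squared standard errors
scalar W = (se1^2 + se2^2 + se3^2)/m

* Between-imputation variance: variance of the estimates across datasets
scalar B = ((b1 - bbar)^2 + (b2 - bbar)^2 + (b3 - bbar)^2)/(m - 1)

* Total variance and pooled standard error
scalar T = W + (1 + 1/m)*B
display "pooled b = " bbar "   pooled SE = " sqrt(T)
```

Here the pooled standard error (about 0.110) is larger than any single-dataset standard error (0.10) because the estimates disagree across the imputed datasets.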

The most widely used approach is using multivariate normal regression (MVN). mi impute mvn is designed for continuous variables. mi impute chained is another useful alternative.

A missing value will have a code of ., .a, .b, etc. Remember that a missing value is recorded in a Stata dataset as an extremely high value. Within mi, the system missing-value code, . (dot), has a special meaning: it denotes the missing values eligible for imputation. If you have a set of missing values that should not be imputed, you should record them as extended missing values, that is, as .a, .b, etc. Conversely, to make an extended missing value eligible for imputation, recode it to a dot, for example recode agem (.a = .).

misstable summarize ln_wagem gradem agem ttl_expm tenurem not_smsa south blackm
misstable patterns ln_wagem gradem agem ttl_expm tenurem not_smsa south blackm
quietly misstable summarize ln_wagem gradem agem ttl_expm tenurem not_smsa south blackm, gen(miss_)

then

logit miss_ln_wagem gradem agem ttl_expm tenurem not_smsa south blackm if ln_wagem <= .
logit miss_gradem ln_wagem agem ttl_expm tenurem not_smsa south blackm if gradem <= .
logit miss_agem ln_wagem gradem ttl_expm tenurem not_smsa south blackm if agem <= .
logit miss_ttl_expm ln_wagem gradem agem tenurem not_smsa south blackm if ttl_expm <= .
logit miss_tenurem ln_wagem gradem agem ttl_expm not_smsa south blackm if tenurem <= .
logit miss_blackm ln_wagem gradem agem ttl_expm tenurem not_smsa south if blackm <= .

Or use pwcorr , obs sig to find potential auxiliary variables.

Any variable that is statistically significant in these logistic regressions should be included in the imputation step.

mi set flong
mi register imputed ln_wagem gradem agem ttl_expm tenurem blackm
mi register regular not_smsa south

The mi set flong command tells Stata how to arrange our multiple datasets: flong (full and long) or mlong (marginal and long). The mi register imputed command registers all the variables that have missing values and need to be imputed. The mi register regular command registers all the variables that have no missing values or for which we do not want to impute values.

mi impute mvn ln_wagem gradem agem ttl_expm tenurem blackm, add(20) rseed(2121)

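This generates m = 20 imputed datasets. The _mi_m variable identifies the datasets and ranges from 0 (the original data) to 20.

* fit the model in each imputed dataset and pool the estimates
mi estimate: regress ln_wagem gradem agem ttl_expm tenurem not_smsa south blackm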

To get pooled $R^2$ and standardized $\beta$s use mibeta

mibeta ln_wagem gradem agem ttl_expm tenurem not_smsa south blackm, fisherz miopts(vartable)

When impossible values are imputed (the recommendation is not to adjust them): binary variables, squares, and interactions (compute the product in the original dataset first, then impute it).

Multilevel analysis

Wed, 26 Jun 2019 00:00:00 +0000

Multilevel analysis can address the lack of independence of the observations when you are analyzing grouped data. See Stata Multilevel Mixed-Effects Reference Manual.

groups of individuals
panel data

Fixed-effects regression models

\[y_{it} = \beta_0 +\beta x_{it}+\mu_i+\eta_{it}\]

If $\mu_i$ correlates with $x_{it}$, use fixed effects.
If $\mu_i$ is independent of $x_{it}$, random-effects models give consistent estimates.

For xtreg, see the Stata Longitudinal-Data/Panel-Data Reference Manual.
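The choice between the two estimators can be sketched as follows. The dataset (nlswork, downloaded with webuse) and the predictors are assumptions picked only for illustration. The Hausman test at the end is not part of these notes; it is one common way to check whether $\mu_i$ appears to be correlated with the regressors.

```stata
* Illustration on the nlswork teaching dataset (an assumption; webuse needs internet access)
webuse nlswork, clear
xtset idcode year

* Fixed effects: consistent even if mu_i is correlated with x_it
xtreg ln_wage age tenure, fe
estimates store fe

* Random effects: assumes mu_i is independent of x_it
xtreg ln_wage age tenure, re
estimates store re

* Hausman test comparing the two sets of coefficients
hausman fe re
```

A small p-value from hausman suggests that the random-effects assumption does not hold and that the fixed-effects estimates are the safer choice.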

Random-effects regression models

\[y_{it} = \beta_0 +\beta x_{it}+\gamma z_i +\mu_i+\eta_{it}\]

Assume $\mu_i$ is independent of $x_{it}$.

The fixed component, $\beta_0 +\beta x_{it}+\gamma z_i$, describes the overall relationship between our dependent variable and our independent variables. The random component, $\mu_i$, represents the effects of the unobserved time-invariant variables.

score = fixed part + random effects + error

Going back and forth between wide and long formats : reshape wide and reshape long

reshape long drink, i(id) j(wave)

Random-intercept model

linear model

mixed drink c.wave || id:
estimates store linear
margins, at(wave=(0(2)10))
marginsplot

quadratic term

mixed drink c.wave##c.wave || id:
estimates store quadratic
margins, at(wave=(0(2)10))
marginsplot
lrtest linear quadratic

A proportional reduction in error (PRE) measuring how much the residual (error) variance is reduced by adding the quadratic term may be useful. We will call the random-intercept linear model “Model 1” and the random-intercept quadratic model “Model 2”.

$PRE = \frac{\mathrm{var}(\mathrm{Residual})_{Model 1}-\mathrm{var}(\mathrm{Residual})_{Model 2}}{\mathrm{var}(\mathrm{Residual})_{Model 1}}$
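As a worked example of the PRE, suppose the two mixed outputs report the residual variances below. Both numbers are hypothetical; in practice they are read from the var(Residual) row of each output.

```stata
* Hypothetical residual variances read off the two mixed outputs
scalar var_m1 = 2.50    // var(Residual), Model 1 (linear)
scalar var_m2 = 2.20    // var(Residual), Model 2 (quadratic)

* Proportional reduction in error from adding the quadratic term
scalar PRE = (var_m1 - var_m2)/var_m1
display "PRE = " %5.3f PRE
```

With these values the quadratic term reduces the residual variance by 12%.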

Treating time as a categorical variable

mixed drink i.wave || id:
estimates store means
margins, at(wave=(0(2)10))
marginsplot
lrtest linear means
lrtest quadratic means

Random-coefficients model

mixed drink c.wave || id: wave, cov(unstructured)
predict yhat_drink, fitted

Including a time-invariant covariate

* Random coefficients model with time invariant covariate
* gender coded as male = 1, female = 0
mixed drink c.wave i.male || id: wave
margins male, at(wave=(0(2)8))
marginsplot

* Random coefficients, with wave interacting with the
* time invariant covariate--gender coded
mixed drink c.wave##i.male || id: wave
margins male, at(wave=(0(2)8))
marginsplot

mixed drink c.wave##c.wave##i.male || id: wave
margins male, at(wave=(0(2)8))
marginsplot

Multiple Regressions

Wed, 26 Jun 2019 00:00:00 +0000

Note: toc is not compatible with markup: mmark

Basic

F: tests whether there is a significant relationship between the outcome and the set of predictors.
R2: how much of the outcome variance is explained by the regression model.
Adj-R2: $R^2$ adjusted to remove chance effects.
Coef.: unstandardized regression coefficients.
t: coef/standard error.
Std. Err.: the standard error of each coefficient, that is, the estimated sampling variability of that coefficient. It is the Root MSE, not Std. Err., that represents the average distance that the observed values fall from the regression line and so tells you how wrong the regression model is on average, in the units of the response variable.
,beta gives beta weights, which are based on standardizing all variables to have a mean of 0 and a standard deviation of 1. Beta weights are interpreted similarly to correlations: beta < 0.2 is considered a weak effect, between 0.2 and 0.5 a moderate effect, and > 0.5 a strong effect. They normally range from -1 to +1; a value outside that range suggests a multicollinearity problem. A 1-standard-deviation change in the independent variable produces a beta-standard-deviation change in the dependent variable.
Increment in R2: the squared part (semipartial) correlation, so called because it measures the part that is uniquely explained by the variable. pcorr reports it as Semipartial Corr.^2; it estimates only the unique effect of each predictor. Another way to compare predictors is the partial correlation.
Distribution of the dependent variable: histogram env_con, frequency normal kdensity (kdensity adds a kernel density estimate). Skewness: 0 is normal; < 0 is negative or left skew; > 0 is positive or right skew. Kurtosis: 3 is normal; < 3 means lighter tails than the normal (negative excess kurtosis); > 3 means heavier tails than the normal (positive excess kurtosis). Test both with sktest.
Distribution of the residuals: for large samples, normality is not a critical issue. rvfplot, yline(0) draws the residual-versus-fitted plot. To deal with nonnormally distributed residuals, we can use reg y xs, vce(robust) or the bootstrap, reg y xs, vce(bootstrap, reps(1000)); either changes the standard errors and hence the t-values. However, Andrew J. Leone, Miguel Minutti-Meza, and Charles E. Wasley (2019), Influential Observations and Inference in Accounting Research, The Accounting Review, in press, discuss robust regression using robreg. What is the difference? Also check Correcting for Cross-Sectional and Time-Series Dependence in Accounting Research.
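The beta weights reported by the beta option can be reproduced by hand: each one is the unstandardized coefficient multiplied by sd(x)/sd(y). The sketch below uses the auto dataset that ships with Stata and an arbitrary model; both are assumptions for illustration and are unrelated to the env_con example.

```stata
* Illustration on Stata's shipped auto dataset (an assumption)
sysuse auto, clear
regress price mpg weight, beta

* beta = b * sd(x) / sd(y), computed on the estimation sample
quietly summarize price if e(sample)
scalar sd_y = r(sd)
quietly summarize mpg if e(sample)
scalar sd_x = r(sd)
scalar beta_mpg = _b[mpg]*sd_x/sd_y

display "beta for mpg = " beta_mpg
```

The displayed value should match the Beta column for mpg in the regression output.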

regress env_con educat inc com3 hlthprob epht3, beta
predict envhat
preserve
set seed 515
sample 100, count
twoway (scatter env_con envhat) (lfit env_con envhat)
restore

Diagnostic statistics

Rstandard:

The standardized residual is the residual divided by its standard deviation.

regress env_con educat inc com3 hlthprob epht3, beta
predict yhat
predict residual, residual
predict rstandard, rstandard
list respnum env_con yhat residual rstandard if abs(rstandard) > 2.58 & rstandard < .
dfbeta
list respnum rstandard _dfbeta_1 if abs(_dfbeta_1) > 2/sqrt(3769) & _dfbeta_1 < .
estat vif

Influential observations: DFbeta. You could think of this as redoing the regression model, omitting just one observation at a time and seeing how much difference omitting each observation makes. An absolute value of DFbeta > 2/sqrt(N) indicates that an observation has a large influence. DFbeta is more specific than rstandard because it is computed separately for each predictor.

. dfbeta
(739 missing values generated)
                       _dfbeta_1: dfbeta(educat)
(739 missing values generated)
                       _dfbeta_2: dfbeta(inc)
(739 missing values generated)
                       _dfbeta_3: dfbeta(com3)
(739 missing values generated)
                       _dfbeta_4: dfbeta(hlthprob)
(739 missing values generated)
                       _dfbeta_5: dfbeta(epht3)

Multicollinearity: The more correlated the predictors, the more they overlap and, hence, the more difficult it is to identify their independent effects. In such situations, you can have multicollinearity, in which one or more of the predictors are virtually redundant. Run estat vif after the regression to get the variance inflation factor (VIF): if the VIF is above 10 for any variable, a multicollinearity problem may exist. If the average VIF is substantially greater than 1.00, there still could be a problem. (Remedies: drop a variable, or create a scale that combines the correlated variables into one variable.) $1/VIF = 1 - R^2$, where $R^2$ comes from regressing $X_1$ on the other Xs. This quantity tells how much of the variance in the independent variable is available to predict the outcome variable independently.
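The identity $1/VIF = 1 - R^2$ can be verified with an auxiliary regression. The sketch below again uses the shipped auto dataset and an arbitrary set of predictors, both assumptions chosen only so that the example runs anywhere.

```stata
* Illustration on Stata's shipped auto dataset (an assumption)
sysuse auto, clear
regress price mpg weight length
estat vif

* Auxiliary regression: mpg on the remaining predictors
quietly regress mpg weight length
scalar tol_mpg = 1 - e(r2)       // tolerance, 1/VIF
scalar vif_mpg = 1/(1 - e(r2))   // VIF

display "1/VIF for mpg = " tol_mpg "   VIF for mpg = " vif_mpg
```

The two displayed numbers should agree with the mpg row of the estat vif table.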

Weighted data

regress env_con educat inc com3 hlthprob epht3 [pweight=finalwt], beta

When you do a weighted regression this way, Stata automatically uses robust standard errors, whether you ask for them or not, because data with sampling weights require robust standard errors.

Categorical predictors and hierarchical regression

regress smday97 age97 male psmoke97 aa hispanic other if !missing(smday97, ///
	age97, male, psmoke97, aa, hispanic, other), beta
test aa hispanic other

nested regressions

nestreg: regress smday97 (age97 male) (psmoke97) (aa hispanic other), beta

If you put i. as a stub in front of a categorical variable, Stata will make the first category the reference category and then generate a dummy variable for each of the remaining categories.

regress smday97 age97 male psmoke97 i.race
* change the reference category, or what Stata refers to as the base level
regress smday97 age97 male psmoke97 ib3.race
regress smday97 age97 male psmoke97 ib(last).race

interaction

g ed_male = educ*male
reg inc educ male ed_male,beta
nestreg: regress inc (educ male) (ed_male), beta
regress inc i.male##c.educ, beta

Some researchers choose to center quantitative independent variables, such as education, before computing the interaction terms. Centering is important for independent variables where a value of zero may not be meaningful.

summarize educ
generate educ_c = educ - r(mean)

The margins command helps us interpret the interaction term:

margins male, at(educ=(8 10 12 14 16 18))
marginsplot

nonlinear

regress ln_wage c.ttl_exp##c.ttl_exp, beta
margins, at(ttl_exp = (0(2)28))
marginsplot

Power analysis
