##------ Aug 2, 2010 9:14:54 PM ------##
> ##############################################
> # Lab 7: PEER INFLUENCE & QAP REGRESSION LAB #
> ##############################################
>
> # NOTE: if you have trouble because some packages are not installed,
> # see lab 1 for instructions on how to install all necessary packages.
>
> ###########################################################################
> #
> # Lab 7
> #
> # Part I - Develop peer influence models that meet the high standards of
> # publication in today's top journals (i.e., addressing autocorrelation
> # and issues of causality).
> #
> # Part II - Use QAP regression to predict increases in social interaction
> # and task interaction for two classrooms. Basically, run dyadic-level
> # regressions suitable for continuous variables and count data (something
> # ERGMs, i.e. Exponential Random Graph Models, cannot do).
> #
> ###########################################################################
>
>
> ##########################################################################
> # PART I -- PEER INFLUENCE
> ###########################################################################
> #
> # This lab examines peer effects in classroom data. It first introduces
> # the reader to the necessary data processing and manipulation, then basic
> # visualization related to peer effects. It then introduces the Linear Network
> # Autocorrelation Model (lnam), which can be used to better model the
> # dependencies in network data (well, better than an OLS model anyway).
> #
> # The next section introduces the reader to the concept of matching--treating
> # the explanatory variable of interest as an experimental "treatment"
> # and then matching on other covariates. Matching allows the application
> # of straightforward analytical techniques used to analyze experiments,
> # namely comparison of means. Code for bootstrapping is introduced to allow for
> # non-parametric estimation of the standard error of the mean. Code is also provided
> # to produce a dotplot showing the means and a (bootstrapped) 95% CI.
> #
> # The data was collected by Dan McFarland (CITE). The key measures of interest
> # for this example are the time 1 and 2 measures of how much each student liked
> # the subject, and the friend network at time 1.
> #
> # The codebook is available here: http://stanford.edu/~messing/codebook.v12.txt
> ###############################################################################
>
> #Clear Variables
> rm(list=ls(all=TRUE))
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 1305812 34.9 1967602 52.6 1967602 52.6
Vcells 1308905 10.0 4504317 34.4 4920950 37.6
>
> # Install and load necessary packages (setting a CRAN repository explicitly
> # so that install.packages does not stop to prompt for a mirror):
> install.packages("reshape", dependencies = TRUE, repos = "http://cran.r-project.org")
> library(reshape)
> library(igraph)
>
> data(studentnets.peerinfl, package="NetData")
>
> ######################################################################################
> # Now we need to reshape our attitudes data according to semester. The data is
> # currently in "long" format, so that student ids and measures are repeated on
> # multiple lines and "semester" is a single variable. We need
> # to transform the data so that measures are repeated on a single
> # line, or "wide" format. We'll use the excellent "reshape" package
> # to accomplish this. Take a look at the documentation using ?reshape
>
> attitudesw = reshape( attitudes, idvar="std_id",
+ timevar="sem_id", direction="wide" )
>
> # Notice that all semester 1 variables now feature a ".1" suffix, while all
> # semester 2 variables feature a ".2" suffix
>
> # We'll first visualize the data. We want to know whether the change in
> # a student's appreciation of a subject (t_2 - t_1) moves in the direction of
> # his/her friends' appreciation at t_1.
>
> # So, we'll first take the difference of a few dependent variables
> # for which we might expect peer influence (e.g. "sub" (liking of subject),
> # "cmt" (liking of classmates), and "tch" (liking of teacher)),
> # and then take the mean value for an individual's friends' values at t_1.
>
> # First create the delta variables:
> attitudesw$deltasub = attitudesw$sub.2 - attitudesw$sub.1
> attitudesw$deltatch = attitudesw$tch.2 - attitudesw$tch.1
> attitudesw$deltacmt = attitudesw$cmt.2 - attitudesw$cmt.1
>
> # Next we'll create the mean of friends' sub.1.
> # We'll use the adjacency matrix, multiply that by the attitudesw$sub.1
> # (element-wise, not matrix multiplication), and then take the mean of each row.
>
> sem1graph = graph.data.frame(d = sem1[1:2], directed=TRUE)
> #sem1graph = network(x = sem1[1:2], directed=TRUE)
> #sem2graph = graph.data.frame(d = sem2, directed=TRUE)
>
> # But first, we'll need to clean the data to make sure the rows line up!
>
> # let's check to see whether the edge data has the same cases as the attitudes data
> which(!(V(sem1graph)$name %in% as.character(attitudesw$std_id)))
[1] 535 546 547 551 558 560 561 571
> which(!(as.character(attitudesw$std_id) %in% V(sem1graph)$name ))
[1] 44 56 65 83 86 92 107 130 161 196 203 204 230 247 315 396 424 433 434
[20] 435 436 449 490 510 550 571 572
>
> # They are not the same... we'll need to address this below.
>
> # Now let's get the matrix representation of the network:
> sem1matrix = get.adjacency(sem1graph, binary=T)
> # This is often referred to as "W" in the literature
>
> # When you have a large square matrix like this, you can get a good idea of
> # its density/sparsity using the image function.
> image(sem1matrix)
>
> # It's also generally good to know the degree distribution of any network you
> # are handling:
>
> plot(density(degree(sem1graph)))
>
> # Looks like degree might be distributed exponentially. We can check as follows:
> simexpdist100 = rexp(n=length(V(sem1graph)), rate = .100)
> simexpdist125 = rexp(n=length(V(sem1graph)), rate = .125)
> simexpdist150 = rexp(n=length(V(sem1graph)), rate = .150)
> lines(density(simexpdist100), col="red")
> lines(density(simexpdist125), col="blue")
> lines(density(simexpdist150), col="green")
>
> # It's not a precise fit, but it might be a good approximation.
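# If you want a slightly more formal check than eyeballing the density lines,
# a one-sample Kolmogorov-Smirnov test is one option. The sketch below uses
# simulated stand-in data rather than degree(sem1graph); note the p-value is
# only approximate once the rate is estimated from the same data, and degree
# is discrete while the exponential is continuous.

```r
# Compare an observed vector to an exponential fit via a KS test.
set.seed(1)
deg <- rexp(400, rate = 0.125)   # stand-in for degree(sem1graph)
rate.hat <- 1 / mean(deg)        # maximum-likelihood estimate of the rate
ks <- ks.test(deg, "pexp", rate = rate.hat)
ks$p.value                       # large p = no strong evidence against exponential
```

# (ks.test lives in base R's stats package, so no extra packages are needed.)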
>
> # Let's reorder the attitudes data to make sure it's in the same order as
> # sem1matrix
> attitudesw = attitudesw[match(row.names(sem1matrix), attitudesw$std_id),]
>
> # Make sure it worked (this should return 0 if so):
> which(row.names(sem1matrix)!=as.character(attitudesw$std_id))
integer(0)
> which(colnames(sem1matrix)!=as.character(attitudesw$std_id))
integer(0)
>
> # Let's also make sure that igraph read the graph in correctly:
> # (note that igraph can behave strangely if you pass it a
> # parameter like "directed = FALSE" for a directed network like this one).
>
> sem1$alter_id[sem1$std_id == 149824]
[1] 119516 122634 114679 124451 132669 120849 117808 147258
> V(sem1graph)[0]
Vertex sequence:
[1] "149824"
> V(sem1graph)[unique(neighbors(sem1graph, v=0, mode = 1))]
Vertex sequence:
[1] "122634" "114679" "119516" "117808" "120849" "132669" "147258" "124451"
>
> # looks good.
>
> # Now we can compute the mean of sub.1 for a student's friends,
> # by element-wise multiplying sub.1 by the matrix and then taking the
> # mean of each non-zero cell in each row:
>
> # Let's first test this out with a simple matrix to make sure we know what
> # we are doing:
> (M = matrix(c(1,0,1,0,1,0,1,0,1), nrow=3, ncol=3))
[,1] [,2] [,3]
[1,] 1 0 1
[2,] 0 1 0
[3,] 1 0 1
>
> (R = c(1,2,3))
[1] 1 2 3
>
> R * M
[,1] [,2] [,3]
[1,] 1 0 1
[2,] 0 2 0
[3,] 3 0 3
>
> # This is multiplying the first row by 1, the second row by 2, and so on.
> # Instead we want:
>
> t(R * t(M))
[,1] [,2] [,3]
[1,] 1 0 3
[2,] 0 2 0
[3,] 1 0 3
>
> # This is multiplying the first column by 1, the second column by 2, and so on.
> # Recall that we will be analyzing this data so that rows are cases, so
> # this is what we want if we are going to calculate the mean of each
> # student's friends' sub.1. (The first option would return the
> # student's own sub.1 for each non-zero row, not his/her friends' sub.1).
>
> sem1Wxsub1 = t(attitudesw$sub.1 * t(sem1matrix))
> sem1Wxtch1 = t(attitudesw$tch.1 * t(sem1matrix))
> sem1Wxcmt1 = t(attitudesw$cmt.1 * t(sem1matrix))
>
> # Now we'll take the mean of cells not equal to zero:
> attitudesw$frsub1mean = numeric(length(attitudesw$std_id))
> attitudesw$frtch1mean = numeric(length(attitudesw$std_id))
> attitudesw$frcmt1mean = numeric(length(attitudesw$std_id))
> for(i in 1:length(attitudesw$std_id)){
+ attitudesw$frsub1mean[i] = mean(sem1Wxsub1[i,sem1Wxsub1[i,]!=0 ], na.rm=T)
+ attitudesw$frtch1mean[i] = mean(sem1Wxtch1[i,sem1Wxtch1[i,]!=0], na.rm=T)
+ attitudesw$frcmt1mean[i] = mean(sem1Wxcmt1[i,sem1Wxcmt1[i,]!=0], na.rm=T)
+ }
>
> # Now let's plot the data and look at the relationship:
> plot(attitudesw$frsub1mean, jitter(attitudesw$deltasub))
> plot(attitudesw$frtch1mean, jitter(attitudesw$deltatch))
> plot(attitudesw$frcmt1mean, jitter(attitudesw$deltacmt))
>
> # Looks noisy, but there might be a relationship for "sub".
>
> ############################################################################
> # An alternative way to compute this is iteratively. The advantage
> # of the following approach is that if you are working with a large
> # data set, you do not need to load an adjacency matrix into
> # memory; you can keep things in edgelist format, which is MUCH
> # more efficient.
>
> attitudesw$mfrsub.1 = numeric(length(attitudesw$sub.1))
> attitudesw$mfrtch.1 = numeric(length(attitudesw$sub.1))
> attitudesw$mfrcmt.1 = numeric(length(attitudesw$sub.1))
>
> # If you thought the number of alters who like the class mattered
> # more than the mean, you might uncomment the code for the nfrsubgt3
> # variable and incorporate it into your analysis (two occurrences):
> #attitudesw$nfrsubgt3 = numeric(length(attitudesw$sub.1))
>
> for(i in 1:length(attitudesw$std_id)){
+ # first get alters of student i:
+ altrs = sem1$alter_id[ sem1$std_id == attitudesw$std_id[i] ]
+
+ # then get alters' attitudes
+ altatts = attitudesw$sub.1[ attitudesw$std_id %in% altrs ]
+
+ # now count how many friends like the class more than "3"
+ # attitudesw$nfrsubgt3[i] = length(which(altatts > 3))
+
+ # then take the mean, ignoring NAs:
+ attitudesw$mfrsub.1[i] = mean(altatts, na.rm = TRUE)
+
+ # Note that this can all be done in one line of code:
+ # attitudesw$mfrsub.1[i] = mean(attitudesw$sub.1[ attitudesw$std_id %in% sem1$alter_id[sem1$std_id %in% attitudesw$std_id[i]]], na.rm=TRUE )
+ attitudesw$mfrtch.1[i] = mean(attitudesw$tch.1[ attitudesw$std_id %in% sem1$alter_id[sem1$std_id %in% attitudesw$std_id[i]]], na.rm=TRUE )
+ attitudesw$mfrcmt.1[i] = mean(attitudesw$cmt.1[ attitudesw$std_id %in% sem1$alter_id[sem1$std_id %in% attitudesw$std_id[i]]], na.rm=TRUE )
+ }
> # if you are going to run this with a lot of data, you may wish to include
> # the following lines inside your for-loop:
> # gc()
> # print(i)
>
> # this will run the garbage collector, gc(), which helps R manage memory
> # and print(i) will let you know your progress as R chugs away.
>
> # The plot is exactly the same:
> plot(attitudesw$mfrsub.1, jitter(attitudesw$deltasub))
>
> # And the correlation should be 1
> cor(attitudesw$mfrsub.1,attitudesw$frsub1mean, use= "complete")
[1] 1
>
> # The plots suggest that there might be a relationship for "sub", or
> # how much each student likes the subject.
>
> # Let's cheat a little bit and run linear models predicting
> # change in each of our variables based on mean friend's values at t=1.
> # The data violates many of the assumptions of OLS regression so the
> # estimates are not particularly meaningful, other than to let us know
> # that we may be on the right path.
>
> summary(lm(deltasub~mfrsub.1, data=attitudesw))
Call:
lm(formula = deltasub ~ mfrsub.1, data = attitudesw)
Residuals:
Min 1Q Median 3Q Max
-2.9122 -0.7139 0.0878 0.2530 2.8400
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.831 0.223 -3.73 0.00023 ***
mfrsub.1 0.248 0.078 3.18 0.00162 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.805 on 343 degrees of freedom
(227 observations deleted due to missingness)
Multiple R-squared: 0.0286, Adjusted R-squared: 0.0258
F-statistic: 10.1 on 1 and 343 DF, p-value: 0.00162
> summary(lm(deltatch~mfrtch.1, data=attitudesw))
Call:
lm(formula = deltatch ~ mfrtch.1, data = attitudesw)
Residuals:
Min 1Q Median 3Q Max
-2.9288 0.0609 0.0746 0.0884 3.0609
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1159 0.2072 -0.56 0.58
mfrtch.1 0.0138 0.0692 0.20 0.84
Residual standard error: 0.806 on 342 degrees of freedom
(228 observations deleted due to missingness)
Multiple R-squared: 0.000116, Adjusted R-squared: -0.00281
F-statistic: 0.0396 on 1 and 342 DF, p-value: 0.842
> summary(lm(deltacmt~mfrcmt.1, data=attitudesw))
Call:
lm(formula = deltacmt ~ mfrcmt.1, data = attitudesw)
Residuals:
Min 1Q Median 3Q Max
-2.9559 -0.9209 0.0591 0.0861 3.0231
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0607 0.3003 0.2 0.84
mfrcmt.1 -0.0419 0.1046 -0.4 0.69
Residual standard error: 0.886 on 342 degrees of freedom
(228 observations deleted due to missingness)
Multiple R-squared: 0.00047, Adjusted R-squared: -0.00245
F-statistic: 0.161 on 1 and 342 DF, p-value: 0.689
>
> # Significance is always encouraging, even though we are only predicting a tiny
> # fraction of the variance (R^2 = .026)
>
> # We'll cheat a little more and remove everyone whose attitudes did not change:
> summary(lm(deltasub~mfrsub.1, data=attitudesw, subset = deltasub!=0 ))
Call:
lm(formula = deltasub ~ mfrsub.1, data = attitudesw, subset = deltasub !=
0)
Residuals:
Min 1Q Median 3Q Max
-2.793 -0.793 -0.493 1.207 2.608
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.006 0.509 -3.94 0.00013 ***
mfrsub.1 0.600 0.178 3.38 0.00094 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.19 on 145 degrees of freedom
(221 observations deleted due to missingness)
Multiple R-squared: 0.0729, Adjusted R-squared: 0.0665
F-statistic: 11.4 on 1 and 145 DF, p-value: 0.000945
> summary(lm(deltatch~mfrtch.1, data=attitudesw, subset = deltasub!=0 ))
Call:
lm(formula = deltatch ~ mfrtch.1, data = attitudesw, subset = deltasub !=
0)
Residuals:
Min 1Q Median 3Q Max
-1.9165 -0.8949 0.0943 0.1085 3.0835
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1267 0.3503 -0.36 0.72
mfrtch.1 0.0108 0.1199 0.09 0.93
Residual standard error: 0.923 on 144 degrees of freedom
(222 observations deleted due to missingness)
Multiple R-squared: 5.64e-05, Adjusted R-squared: -0.00689
F-statistic: 0.00812 on 1 and 144 DF, p-value: 0.928
> summary(lm(deltacmt~mfrcmt.1, data=attitudesw, subset = deltasub!=0 ))
Call:
lm(formula = deltacmt ~ mfrcmt.1, data = attitudesw, subset = deltasub !=
0)
Residuals:
Min 1Q Median 3Q Max
-2.830 -0.803 0.125 0.253 3.307
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.673 0.513 -1.31 0.19
mfrcmt.1 0.183 0.182 1.00 0.32
Residual standard error: 0.94 on 144 degrees of freedom
(222 observations deleted due to missingness)
Multiple R-squared: 0.00692, Adjusted R-squared: 2.67e-05
F-statistic: 1 on 1 and 144 DF, p-value: 0.318
>
> # Very nice: for "sub" the effect grows in strength and significance,
> # and R^2 goes up to .067.
>
> # Let's also take a look at the plot:
> pdf("1.1_Peer_effect_on_opinion_of_subject.pdf")
> plot( x = attitudesw$mfrsub.1,
+ y = jitter(attitudesw$deltasub),
+ main = "Peer effects on opinion of subject",
+ xlab = "mean friends' opinion",
+ ylab = expression(Delta~"opinion from"~t[1]~to~t[2])
+ )
>
> # We can draw a regression line on the plot based on the regression
> # we estimated above:
> abline(lm(deltasub~mfrsub.1, data=attitudesw ))
> dev.off()
windows
2
> # We've got some evidence of a peer effect on how much a student
> # likes the subject. Our evidence is based on temporal precedence:
> # your friends' opinion of the subject predicts how much
> # your opinion of the subject changes from t_1 to t_2.
>
> ############################################################################
> # Now let's run some "real" models.
> # We'll first estimate the model Y_2 = Y_1 + W*Y_alter, using a network
> # autocorrelation model. Read up on it here:
>
> # Roger Th. A. J. Leenders, Modeling social influence through network
> # autocorrelation: constructing the weight matrix, Social Networks, Volume 24, Issue 1, January 2002, Pages 21-47, ISSN 0378-8733, DOI: 10.1016/S0378-8733(01)00049-1.
> # (http://www.sciencedirect.com/science/article/B6VD1-44SKCC2-1/2/cae2d6b4cf4c1c21f4f870fd2d58b5cc)
>
> # This model arguably takes into account the correlation between friends in
> # the disturbances (error term). Yet there are problems with this kind of model.
> # See:
>
> # Eric J. Neuman, Mark S. Mizruchi, Structure and bias in the network autocorrelation
> # model, Social Networks, In Press, Corrected Proof, Available online 31 May 2010, ISSN 0378-8733, DOI: 10.1016/j.socnet.2010.04.003.
> # (http://www.sciencedirect.com/science/article/B6VD1-506H0SN-1/2/73a16c627271d29d9283b6d69b873a07)
>
> # With that in mind, let's proceed:
> library(sna)
> library(numDeriv)
>
> # We'll use the mfrsub.1 variable to approximate the W*Y_alter term. (If
> # we actually use the matrix, we cannot estimate the model--there
> # would be the same number of variables as cases).
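# The "same number of variables as cases" problem is easy to demonstrate on
# toy data (simulated here, not the lab's):

```r
# Toy demonstration of the identification problem: with as many predictors as
# cases, OLS runs out of degrees of freedom and lm() returns NA coefficients
# for the columns it cannot estimate.
set.seed(3)
n <- 5
X <- matrix(rnorm(n * n), nrow = n)  # n cases, n predictors
y <- rnorm(n)
fit <- lm(y ~ X)
coef(fit)                            # some coefficients come back NA
```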
>
> # Let's remove the NA values from the attitudes measures
> # (otherwise lnam will crash). We'll remove rows with NAs for
> # sub.1, sub.2, or mfrsub.1. Note that we could remove any row with an
> # NA value with the syntax: na.omit(attitudesw), but if there are
> # rows that contain our variables of interest but are NA for some other
> # variable, na.omit would also drop these rows and we would lose valuable
> # data.
>
> atts = attitudesw[!is.na(attitudesw$sub.2),]
> atts = atts[!is.na(atts$sub.1),]
> atts = atts[!is.na(atts$mfrsub.1),]
> W = sem1matrix
>
> # Now we'll make sure the rows and columns in W are in the same
> # order as atts:
> W = W[match(atts$std_id,row.names(W)), match(atts$std_id,colnames(W))]
> # Let's make sure (this will return 0 if we did it right):
> which(rownames(W) != atts$std_id)
integer(0)
>
> # Now we'll estimate the peer influence model using lnam, which stands for
> # Linear Network Autocorrelation Model.
> pim1<-lnam(atts$sub.2,cbind(atts$sub.1, atts$mfrsub.1), W2 = W)
> summary(pim1)
Call:
lnam(y = atts$sub.2, x = cbind(atts$sub.1, atts$mfrsub.1), W2 = W)
Residuals:
Min 1Q Median 3Q Max
-2.472 -0.472 0.134 0.440 1.998
Coefficients:
Estimate Std. Error Z value Pr(>|z|)
X1 0.6064 0.0404 15.02 < 2e-16 ***
X2 0.3489 0.0424 8.22 2.2e-16 ***
rho2.1 0.0271 0.0192 1.41 0.16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Estimate Std. Error
Sigma 0.726 0.00077
Goodness-of-Fit:
Residual standard error: 0.732 on 342 degrees of freedom (w/o Sigma)
Multiple R-Squared: 0.414, Adjusted R-Squared: 0.409
Model log likelihood: -379 on 341 degrees of freedom (w/Sigma)
AIC: 767 BIC: 782
Null model: meanstd
Null log likelihood: -476 on 343 degrees of freedom
AIC: 957 BIC: 964
AIC difference (model versus null): 190
Heuristic Log Bayes Factor (model versus null): 182
>
> # This shows a pretty strong peer effect on liking of the subject at t_2.
> # Notice that the adjusted R^2 is .41. This is not good news: even after
> # controlling for t_1 values of attitude toward the subject, we are
> # predicting less than half of the variance in t_2 values. That means that
> # there is a good chance of omitted variable bias.
>
> # Below is an attempt to run the model with the actual WYalt as the X matrix.
> # For the reasons stated above, it doesn't work. It is included in case the
> # reader wants to uncomment this code and see for him/herself.
>
> #WYalt = sem1Wxsub1
> #WYalt = WYalt[match(atts$std_id,row.names(WYalt)), match(atts$std_id,colnames(WYalt))]
> #image(WYalt)
> #
> ##pim2<-lnam(y = atts$sub.2, x = WYalt, W2 = W)
> #pim2<-lnam(y = atts$sub.2, x = WYalt)
> #summary(pim2)
>
>
> ############################################################################
> # Matching
> #
> # One method that is less vulnerable to some problems related to regression is
> # matching. In matching, one divides the key explanatory variable into discrete
> # "treatments" and then matches participants so that they are as similar on the
> # remaining set of covariates as possible. A simple difference in means/medians/
> # quantiles can be used. From there we can use a simple t-test to estimate the
> # "treatment" effect without having to worry about multicollinearity.
> # If the data violates the normality assumption (as will be the
> # case here), non-parametric methods such as a permutation test or bootstrapped
> # standard errors can be used. This approach also violates fewer statistical
> # assumptions and therefore should be expected to be less biased than running a
> # simple OLS model.
> # For the original article on matching, see:
> # Donald B. Rubin, Matching to Remove Bias in Observational Studies
> # Biometrics, Vol. 29, No. 1 (Mar., 1973), pp. 159-183
> # Stable URL: http://www.jstor.org/stable/2529684
>
> # Also worthy of note is the Rubin Causal Model, which uses matching.
> # Read up on it here: http://en.wikipedia.org/wiki/Rubin_Causal_Model
>
> # However, matching is not a cure-all and is obviously sensitive to the variables
> # on which one matches. For an excellent article on this, see:
>
> # Jeffrey A. Smith, Petra E. Todd, Does matching overcome LaLonde's critique of
> # nonexperimental estimators?, Journal of Econometrics, Volume 125, Issues 1-2,
> # Experimental and non-experimental evaluation of economic policy and models,
> # March-April 2005, Pages 305-353.
> # (http://www.sciencedirect.com/science/article/B6VC0-4CXTY59-1/2/06728bd79899fd9a5b814dfea9fd1560)
>
> # A good introduction to matching using the MatchIt package can be found courtesy
> # of Gary King here: http://gking.harvard.edu/matchit/docs/matchit.pdf
>
> library(MatchIt)
>
> # First, we dichotomize the independent variable of interest, mean friends'
> # opinion of the subject: mfrsub.1
> atts$mftreat = ifelse(atts$mfrsub.1 > 3, 1,0)
> atts = atts[c("mftreat", "deltasub", "mfrsub.1","egrd.1", "sub.1", "tot.1", "frn.1", "cmt.1", "prev.1")]
>
> # Before we do any matching, let's take a look at our new treatment variable.
>
> # We'll make this into a factor, which will help us interpret our results. A factor
> # is a special type of object in R used for qualitative (nominal or ordinal)
> # data/variables. A factor is also generally more efficient with respect to memory,
> # because it stores an integer value corresponding to a label (e.g. 1 for "low" and 2
> # for "high") rather than storing the entire string.
>
> # Factors have an order, which you can set by ordering the labels in ascending order
> # when you create the factor, or by using the reorder() function. We'll talk about this
> # more in the section on dotplots below.
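# A quick standalone illustration of how factor levels and their order work
# (toy data, not the lab's):

```r
# A factor stores integer codes plus labels, and the order of the levels is
# whatever order you list them in.
x <- c("high", "low", "low", "high")
f <- factor(x, levels = c("low", "high"))  # explicit ascending order
as.integer(f)   # the stored codes: 2 1 1 2
levels(f)       # "low" "high"
```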
> atts$mftreatf = factor(atts$mftreat, labels = c("low", "high"))
>
> # Let's compute the means:
> tapply(X = atts$deltasub, INDEX = list(atts$mftreatf), FUN = mean)
low high
-0.21545 0.06061
>
> # Note, the tapply function is useful if you have more than one factor over
> # which you want to calculate some function. For example, if there were another
> # factor, say exptreat, we could get the mean for each cell using this syntax:
> # tapply(X = atts$deltasub, INDEX = list(atts$mftreat, atts$exptreat), FUN = mean)
>
> # A t-test is not totally kosher here because the DV is ordinal, which violates
> # the assumptions behind the t-test.
> # But let's take a look anyway:
> t.test(deltasub~mftreatf, data=atts)
Welch Two Sample t-test
data: deltasub by mftreatf
t = -3.054, df = 207.2, p-value = 0.002552
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.45424 -0.09786
sample estimates:
mean in group low mean in group high
-0.21545 0.06061
>
> # A better alternative is the Wilcoxon test (a.k.a. Mann-Whitney test),
> # suitable for ordinal variables.
> wilcox.test(x=atts$deltasub[atts$mftreatf=="low"],
+ y=atts$deltasub[atts$mftreatf=="high"],
+ conf.int = TRUE)
Wilcoxon rank sum test with continuity correction
data: atts$deltasub[atts$mftreatf == "low"] and atts$deltasub[atts$mftreatf == "high"]
W = 10216, p-value = 0.008776
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
-4.395e-05 -6.498e-05
sample estimates:
difference in location
-4.853e-05
>
> # Now let's match and redo the analyses.
>
> # We normally would assign each subject/case to treatment or control groups based on
> # whatever corresponds more closely to a treatment or a control. In this case, it
> # is debatable whether having friends who like the class is more of a "treatment"
> # than having friends who do not like the class. However, there are 246 students
> # whose friends do not like the class versus 99 whose friends do like the class,
> # so we can say that having friends who like the class is certainly more unusual
> # than the opposite. Accordingly, we'll conceptualize having friends who like the
> # class as the treatment.
>
> # In the actual matching procedure, what we'll do is try to find individuals in
> # the "control" condition (having friends who do not like the subject) who look
> # most similar to each treated individual based on a variety of variables. Ideally,
> # we will achieve "balance" on the covariates between those in the treatment condition,
> # and those we select from the control condition, so that the individuals are as similar
> # as possible.
>
> # Here are the variables we'll use (all are at time 1):
>
> # sub.1 - how much the student likes the subject
> # egrd.1 - expected grade
> # tot.1 - how much the student likes the way the course is taught
> # frn.1 - how interesting the student finds his/her friends
> # cmt.1 - how interesting the student finds his/her classmates
> # prev.1 - has the student had the teacher previously
>
> # We'll use "nearest neighbor" matching, which uses a distance measure to
> # select the individual in the "control" group that is closest to each
> # individual in the "treatment" group. The default distance measure is the
> # logistic regression propensity score.
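# To make the propensity-score distance concrete, here is a hand-rolled sketch
# of nearest-neighbor matching on simulated data (all names below are made up
# for illustration; matchit handles this, and more, for us):

```r
# Rough sketch of nearest-neighbor matching on a propensity score.
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
treat <- rbinom(n, 1, plogis(0.5 * x1))   # treatment assignment depends on x1

# The propensity score is the fitted probability of treatment from a
# logistic regression of treatment on the covariates:
ps <- fitted(glm(treat ~ x1 + x2, family = binomial))

# For each treated unit, grab the control with the closest propensity score.
# (Unlike matchit's default, this toy version matches WITH replacement.)
trt  <- which(treat == 1)
ctrl <- which(treat == 0)
matches <- sapply(trt, function(i) ctrl[which.min(abs(ps[ctrl] - ps[i]))])
```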
>
> m.out = matchit(mftreat ~ egrd.1 + sub.1 + tot.1 + frn.1 + cmt.1 + prev.1,
+ data = atts, method = "nearest")
>
> summary(m.out)
Call:
matchit(formula = mftreat ~ egrd.1 + sub.1 + tot.1 + frn.1 +
cmt.1 + prev.1, data = atts, method = "nearest")
Summary of balance for all data:
Means Treated Means Control SD Control Mean Diff eQQ Med eQQ Mean
distance 0.348 0.262 0.116 0.086 0.058 0.086
egrd.1 3.364 3.122 0.872 0.242 0.000 0.253
sub.1 3.051 2.760 0.850 0.290 0.000 0.323
tot.1 2.869 2.638 0.896 0.230 0.000 0.242
frn.1 3.444 3.459 0.575 -0.015 0.000 0.030
cmt.1 2.758 2.805 0.815 -0.047 0.000 0.040
prev.1 0.384 0.163 0.370 0.221 0.000 0.222
eQQ Max
distance 0.223
egrd.1 1.000
sub.1 1.000
tot.1 1.000
frn.1 1.000
cmt.1 1.000
prev.1 1.000
Summary of balance for matched data:
Means Treated Means Control SD Control Mean Diff eQQ Med eQQ Mean
distance 0.348 0.336 0.139 0.012 0 0.012
egrd.1 3.364 3.333 0.782 0.030 0 0.030
sub.1 3.051 2.990 0.886 0.061 0 0.121
tot.1 2.869 2.818 0.952 0.051 0 0.051
frn.1 3.444 3.434 0.574 0.010 0 0.051
cmt.1 2.758 2.768 0.831 -0.010 0 0.030
prev.1 0.384 0.364 0.483 0.020 0 0.020
eQQ Max
distance 0.075
egrd.1 1.000
sub.1 1.000
tot.1 1.000
frn.1 1.000
cmt.1 1.000
prev.1 1.000
Percent Balance Improvement:
Mean Diff. eQQ Med eQQ Mean eQQ Max
distance 85.94 99.18 85.56 66.25
egrd.1 87.46 0.00 88.00 0.00
sub.1 79.13 0.00 62.50 0.00
tot.1 78.09 0.00 79.17 0.00
frn.1 32.23 0.00 -66.67 0.00
cmt.1 78.65 0.00 25.00 0.00
prev.1 90.87 0.00 90.91 0.00
Sample sizes:
Control Treated
All 246 99
Matched 99 99
Unmatched 147 0
Discarded 0 0
> plot(m.out)
Waiting to confirm page change...
>
> # Now let's assess the extent to which our matching algorithm effectively matched each
> # treatment subject to a control subject, i.e., the extent to which we achieved balance.
> # The plot function displays Q-Q plots of the control (X-axis) versus treatment group
> # (Y-axis) for the original sample and the matched sample. The Q-Q plots indicate perfect
> # balance on the covariate in question if all dots fall exactly on the
> # 45-degree line. The further the deviation from this line, the worse the balance
> # on that variable.
> # Based on the plots, matching appears to have improved balance on our covariates a
> # little bit, but not perfectly.
>
> # Ideally we want to experiment with other covariates and try to find the combination
> # that best and most parsimoniously captures the key phenomena we want to control for.
>
> matchatts = match.data(m.out)
>
> # We'll make our treatment variable into a factor:
> matchatts$mftreatf = factor(matchatts$mftreat, labels = c("low", "high"))
>
> # Let's compute the means:
> tapply(X = matchatts$deltasub, INDEX = list(matchatts$mftreatf), FUN = mean)
low high
-0.26263 0.06061
>
> t.test(deltasub~mftreatf, data=matchatts)
Welch Two Sample t-test
data: deltasub by mftreatf
t = -2.85, df = 190.3, p-value = 0.004856
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.54695 -0.09951
sample estimates:
mean in group low mean in group high
-0.26263 0.06061
>
> wilcox.test(x=matchatts$deltasub[matchatts$mftreatf=="low"],
+ y=matchatts$deltasub[matchatts$mftreatf=="high"],
+ conf.int = TRUE)
Wilcoxon rank sum test with continuity correction
data: matchatts$deltasub[matchatts$mftreatf == "low"] and matchatts$deltasub[matchatts$mftreatf == "high"]
W = 3948, p-value = 0.007398
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
-9.145e-05 -2.258e-05
sample estimates:
difference in location
-2.757e-05
>
> # We can also perform a full permutation test. A permutation test is an exact test of
> # whether it is possible to reject the null hypothesis that two distributions are
> # the same. The difference in sample means (the test statistic) is calculated and
> # recorded for every possible way of dividing the pooled values into two groups of
> # size n_A and n_B, i.e., for every permutation of the group labels A and B. The
> # set of these calculated differences is the exact distribution of possible
> # differences under the null hypothesis that group label does not matter. This
> # test functions like a t-test but does not rely on any distributional
> # assumptions about the data.
>
> # For additional information, see:
> # http://en.wikipedia.org/wiki/Resampling_%28statistics%29#Permutation_tests
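# Before using the coin package, here is a hand-rolled sketch of the idea on
# toy data: an approximate (Monte Carlo) version that samples random
# permutations rather than enumerating all of them.

```r
# Monte Carlo permutation test of a difference in means (toy data).
set.seed(7)
a <- rnorm(30, mean = 0)
b <- rnorm(30, mean = 0.8)
obs    <- mean(b) - mean(a)          # observed test statistic
pooled <- c(a, b)
perm.diffs <- replicate(2000, {
  lab <- sample(length(pooled))      # shuffle the group labels
  mean(pooled[lab[31:60]]) - mean(pooled[lab[1:30]])
})
p.val <- mean(abs(perm.diffs) >= abs(obs))  # two-sided p-value
```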
>
> library(coin)
> independence_test(deltasub ~ mftreatf, data = matchatts,
+ distribution = exact())
Exact General Independence Test
data: deltasub by mftreatf (low, high)
Z = -2.800, p-value = 0.006102
alternative hypothesis: two.sided
>
> # Another alternative is to look at bootstrapped estimates of the mean and standard
> # error. This is often an attractive way to estimate statistical significance
> # because our data violates the distributional assumptions involved in estimating
> # standard errors the usual way (namely, that our dependent variable resembles anything
> # like a normal or t distribution--it cannot, because it is ordinal with only
> # 7 levels). Because bootstrapping relies on resampling instead of distributional
> # assumptions to estimate variance, it is more robust.
>
> # There are packages available to generate bootstrapped estimates, but seeing the code
> # is a valuable way to get a sense of exactly how bootstrapping works. Here is the code
> # for a bootstrapping function designed to estimate the mean and standard error:
>
> b.mean <- function(data, nsim) {
+ # Create a list object to store the sets of resampled data (in this case nsim = 1000).
+ resamples = list()
+
+ # Create a vector of the same length to store the resampled means
+ r.mean = numeric(nsim)
+
+ # Generate a sample with replacement
+ for(i in 1:nsim){
+ # generates a random sample of our data with replacement
+ resamples[[i]] = sample(x=data, size=length(data), replace=T)
+
+ # Now calculate the mean for this iteration
+ r.mean[i] = mean(resamples[[i]], na.rm=T)
+ }
+
+ # Calculate the mean of the mean of the simulated estimates above:
+ mean.adj = mean(r.mean, na.rm=T)
+
+ # Calculate how this differs from the arithmetic mean estimate of the original
+ # sample:
+ bias = mean(data, na.rm=T) - mean.adj
+
+ # Generate the standard error of the estimate
+ std.err <- sqrt(var(r.mean))
+
+ # Return results
+ return( data.frame(mean = mean(data, na.rm=T), #the mean estimate
+ mean.adj = mean.adj, # the adjusted estimate based on simulations
+ bias = bias, # the mean minus the adjusted estimate
+ std.err=std.err # the standard error of the estimate (based on simulations)
+ )
+ )
+ }
>
> # Before we use it on our data, let's make sure it works. We will simulate some
> # data for which we know the parameters and then estimate the parameters using
> # the bootstrap:
>
> simdata = rnorm(n = 1000, mean = 5, sd = 1)
> b.mean(simdata, nsim=1000)
mean mean.adj bias std.err
1 4.984 4.983 0.0007549 0.03228
>
> # For comparison, calculate the sample mean and the analytic standard error
> mean(simdata)
[1] 4.984
> (se = sd(simdata)/sqrt(length(simdata)-1))
[1] 0.03127
> # here we use the formula SE = SD/sqrt(N - 1)
>
> # We have demonstrated that our bootstrapping function is a good estimator
> # for "perfect" data that meets the assumptions behind traditional
> # statistical tests. Now let's move on to our data.
>
> # For those whose friends have, on average, positive attitudes:
> b.mean(data=matchatts$deltasub[matchatts$mftreatf=="high"],
+ nsim=1000)
mean mean.adj bias std.err
1 0.06061 0.06266 -0.002051 0.07486
>
> # For those whose friends have, on average, negative attitudes:
> b.mean(data=matchatts$deltasub[matchatts$mftreatf=="low"],
+ nsim=1000)
mean mean.adj bias std.err
1 -0.2626 -0.2617 -0.0009192 0.08626
>
> # Here's how to do it using the boot package, which is faster and more
> # extensible than the R-code above:
>
> library(boot)
>
> # You have to write a special function for the boot package, which
> # can be confusing at first, but is useful for more complicated
> # types of bootstrapping. The boot package is also orders of magnitude
> # faster than the hand-rolled loop above, which matters if you start to do
> # more complicated bootstrapping or use a higher number of simulations.
>
> # First, write a basic function that will return the estimate in question
> # for x data using d cases:
>
> samplemean <- function(x, d) {
+ return(mean(x[d]))
+ }
>
> (bootlow = boot(data = matchatts$deltasub[matchatts$mftreatf=="low"],
+ statistic = samplemean,
+ R = 1000))
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = matchatts$deltasub[matchatts$mftreatf == "low"],
statistic = samplemean, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* -0.2626 -0.001212 0.08757
>
> (boothigh =boot(data = matchatts$deltasub[matchatts$mftreatf=="high"],
+ statistic = samplemean,
+ R = 1000))
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = matchatts$deltasub[matchatts$mftreatf == "high"],
statistic = samplemean, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* 0.06061 0.002737 0.06992
>
> # Note that estimates of the standard errors will be slightly different
> # each time due to the randomness involved in resampling. As you might
> # expect, the variability of these estimates decreases as the number of
> # simulations increases.
>
> # Based on our bootstrapped estimates and standard error calculations,
> # it's clear that the two means are quite different. So, based on our
> # matched samples, it seems there is a significant peer effect.
>
>
> ##########################################################################
> # Let's try full matching so that we are not throwing away any of our data:
> m.out = matchit(mftreat ~ egrd.1 + sub.1 + tot.1 + frn.1 + cmt.1 + prev.1,
+ data = atts, method = "full")
> matchatts = match.data(m.out)
>
> # We'll make this into a factor, which will help us interpret our results:
> matchatts$mftreatf = factor(matchatts$mftreat, labels = c("low", "high"))
>
> # Note that we'll now have to use the weights that the output provided because
> # we told it to use the full sample.
>
> # Let's compute the means. (Caution: weighted.mean's weight argument is named
> # "w", not "weights", and tapply does not subset the extra argument along with
> # X, so the call below effectively returns unweighted group means. The properly
> # weighted means and standard errors are computed with Hmisc below.)
> tapply(X = matchatts$deltasub,
+ INDEX = list(matchatts$mftreatf),
+ FUN = weighted.mean, weights = matchatts$weights )
low high
-0.21545 0.06061
>
> # There is no straightforward way to compute a t-test in R with weighted data.
>
> # We can however compute weighted standard errors and a corresponding
> # 95% CI:
>
> # Recall that a 95% CI of an estimate is calculated by taking the estimate
> # +/- the standard error * 1.96. The value 1.96 is just the normal
> # approximation to the 0.025 and 0.975 quantiles of the relevant statistical
> # distribution. We'll use qt(quantile, N-1) to get exact t quantiles here.
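# As a sketch, the qt-based CI for an ordinary (unweighted) mean looks like
# this; t.ci is a made-up helper name, and the weighted version below follows
# the same pattern with weighted means and variances:

```r
t.ci <- function(x, conf = 0.95) {
  n  <- sum(!is.na(x))
  m  <- mean(x, na.rm = TRUE)
  se <- sd(x, na.rm = TRUE) / sqrt(n)
  alpha <- 1 - conf
  # estimate +/- the exact t quantiles times the standard error
  m + qt(c(alpha / 2, 1 - alpha / 2), df = n - 1) * se
}
```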
> library(Hmisc)
> meanlow = wtd.mean(matchatts$deltasub[matchatts$mftreatf=="low"],
+ weights = matchatts$weights[matchatts$mftreatf=="low"])
> varlow = wtd.var(matchatts$deltasub[matchatts$mftreatf=="low"],
+ weights = matchatts$weights[matchatts$mftreatf=="low"])
> selow = sqrt(varlow)/sqrt(sum(matchatts$weights[matchatts$mftreatf=="low"]))
> meanlow + selow * qt(.025, ( sum(matchatts$weights[matchatts$mftreatf=="low"]) - 1))
[1] -0.4554
> meanlow + selow * qt(.975, ( sum(matchatts$weights[matchatts$mftreatf=="low"]) - 1))
[1] -0.2456
>
> meanhigh = wtd.mean(matchatts$deltasub[matchatts$mftreatf=="high"],
+ weights = matchatts$weights[matchatts$mftreatf=="high"])
> varhigh = wtd.var(matchatts$deltasub[matchatts$mftreatf=="high"],
+ weights = matchatts$weights[matchatts$mftreatf=="high"])
> sehigh = sqrt(varhigh)/sqrt(sum(matchatts$weights[matchatts$mftreatf=="high"]))
> meanhigh + sehigh * qt(.025, ( sum(matchatts$weights[matchatts$mftreatf=="high"]) - 1))
[1] -0.08417
> meanhigh + sehigh * qt(.975, ( sum(matchatts$weights[matchatts$mftreatf=="high"]) - 1))
[1] 0.2054
>
> # Here's how to do it with the boot package. Note that the weights must be
> # indexed by the same resampled cases d as the data, so that each resampled
> # observation keeps its own weight:
> sample.wtd.mean <- function(x, d, wts) {
+ return(weighted.mean(x[d], w = wts[d]))
+ }
>
> (bootlow = boot(data = matchatts$deltasub[matchatts$mftreatf=="low"],
+ wts = matchatts$weights[matchatts$mftreatf=="low"],
+ statistic = sample.wtd.mean,
+ R = 1000))
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = matchatts$deltasub[matchatts$mftreatf == "low"],
statistic = sample.wtd.mean, R = 1000, wts = matchatts$weights[matchatts$mftreatf ==
"low"])
Bootstrap Statistics :
original bias std. error
t1* -0.3505 0.1349 0.08139
> (boothigh =boot(data = matchatts$deltasub[matchatts$mftreatf=="high"],
+ wts = matchatts$weights[matchatts$mftreatf=="high"],
+ statistic = sample.wtd.mean,
+ R = 1000))
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = matchatts$deltasub[matchatts$mftreatf == "high"],
statistic = sample.wtd.mean, R = 1000, wts = matchatts$weights[matchatts$mftreatf ==
"high"])
Bootstrap Statistics :
original bias std. error
t1* 0.06061 -0.003505 0.07104
> # We can make a nice plot of the results with the lattice package:
> library(lattice)
>
> # Dot plots:
> visual = list()
> visual$var = c("Friends do not like subj.", "Friends like subj.")
> visual$M = c(bootlow$t0, boothigh$t0)
> visual$se = c(sd(bootlow$t), sd(boothigh$t))
> visual$N = c(length(matchatts$weights[matchatts$mftreatf=="low"]),
+ length(matchatts$weights[matchatts$mftreatf=="high"]))
> visual$lo = visual$M - visual$se
> visual$up = visual$M + visual$se
> visual = as.data.frame(visual)
>
> #print it out:
> pdf(file="1.2_dotplot_matchoutput_bootstrapped_SE.pdf", width = 6, height = 3)
> dotplot( reorder(var,M) ~ M,
+ data = visual,
+ main="Effect of friends' attitude on opinion change,\nmatched samples with bootstrapped SE",
+ panel = function (x, y, subscripts, groups, ...) {
+ panel.dotplot(x, y)
+ panel.segments(x0=visual$lo[subscripts],y0=as.numeric(y),
+ x1=visual$up[subscripts],y1=as.numeric(y),
+ lty = 1 )
+ },
+ xlab=expression(Delta~"opinion from "~t[1]*~to~t[2]),
+ xlim= c(-.5, .2)
+ )
> dev.off()
windows
2
>
> # Very nice!
>
> #
> # QUESTION #1 - What do these results show?
> # What assumptions are being made?
> # What might lead you to question the results? What would improve them?
> #
>
> #########################################################################
> # Extra-credit:
> #
> # Use the McFarland dataset to acquire other individual level variables
> # and develop better matching. Then follow this lab and explore the
> # variety of outcomes suitable for illustrating peer effects in
> # classrooms (e.g., PSAT scores, engagement, conflict, etc). Results
> # are likely suitable for an A-journal publication.
> #
> # Again, the codebook is available here: http://stanford.edu/~messing/codebook.v12.txt
> #
> #########################################################################
>
> #########################################################################
> # PART II -- QAP REGRESSION
> #########################################################################
> #
> # Here we want to compare different classrooms and discern what leads them
> # to become more or less sociable and academically engaged.
> #
> # Class m173 is an accelerated math class of all 10th graders taught
> # by a rigid teacher who used teacher-centered instructional formats.
> # Class m182 is a regular math class of 10th and 11th grade students
> # taught by a casual teacher who used student-centered instructional formats.
> #
> #########################################################################
>
> # igraph and sna don't mix well. Before we load sna, we will detach igraph.
> detach(package:igraph)
>
> # Load the "sna" library
> library(sna)
>
> #Clear Variables
> rm(list=ls(all=TRUE))
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 1307477 35.0 2105982 56.3 1967602 52.6
Vcells 1313679 10.1 4373206 33.4 5466508 41.8
>
> #(2) QAP Regressions for m173
>
> # Note that each matrix must be the same size (n x n). You'll want to make
> # sure all input and output matrices are the same size. Predictor matrices
> # thus are not individual-level variables; they are differences, matches,
> # etc. You'd use a matrix of differences for quantitative (node-level)
> # variables, and matches in the case of categorical/dichotomous (node-level)
> # variables. There's a really easy way to do it in R
> # (say your node-level variables are x and y):
> # x <- seq(1,4)
> # y <- seq(2,5)
> # outer(x,y,FUN="-") # find the difference between x and y
> # outer(x,y,FUN="==") # produce a 0/1 similarity matrix
> # For this lab we will use matrices that were previously generated in UCINet
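# A tiny runnable demonstration of the outer() idiom above, using made-up
# node-level attributes (age and gender here are illustrative, not lab data):

```r
age    <- c(15, 16, 16, 17)       # hypothetical quantitative attribute
gender <- c("F", "F", "M", "M")   # hypothetical categorical attribute

# Dyadic absolute-difference matrix for a quantitative variable
age.diff <- abs(outer(age, age, FUN = "-"))

# Dyadic 0/1 match matrix for a categorical variable
gender.match <- outer(gender, gender, FUN = "==") * 1

age.diff[1, 4]      # 2: actors 1 and 4 differ by two years
gender.match[1, 2]  # 1: actors 1 and 2 are both "F"
```

Each resulting n x n matrix can then be stacked into the predictor array exactly like the UCINet-generated matrices used below.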
>
> # Loading predictor matrices. Each matrix is a n x n matrix
> # data is saved in a CSV format. You can open the files in Excel
> # to see the structure.
>
> data(studentnets.mrqap173, package="NetData")
>
> # Look at what we loaded via ls():
> ls()
[1] "m173_sem1_FRN" "m173_sem1_GND" "m173_sem1_RCE" "m173_sem1_SEAT"
[5] "m173_sem1_SSL" "m173_sem1_TSL" "m173_sem2_SSL" "m173_sem2_TSL"
>
> # We need the data in matrix format
> # predictor matrices
> m173_sem1_SSL <- as.matrix(m173_sem1_SSL)
> m173_sem1_TSL <- as.matrix(m173_sem1_TSL)
> m173_sem1_FRN <- as.matrix(m173_sem1_FRN)
> m173_sem1_SEAT <- as.matrix(m173_sem1_SEAT)
> m173_sem1_RCE <- as.matrix(m173_sem1_RCE)
> m173_sem1_GND <- as.matrix(m173_sem1_GND)
>
> # Load response matrices
> m173_sem2_SSL <- as.matrix(m173_sem2_SSL)
> m173_sem2_TSL <- as.matrix(m173_sem2_TSL)
>
> # In order to run the QAP regression we must create an array of matrices
> # containing all the predictor matrices. We are, in effect, creating a
> # 3-d array (predictor x n x n). (Note: although the array below is named
> # response_matrices, it actually holds the predictor matrices.)
>
> # Important: All matrices must be the same size!
>
> response_matrices <- array(NA, c(6, length(m173_sem1_SSL[1,]),length(m173_sem1_SSL[1,])))
> response_matrices[1,,] <- m173_sem1_SSL
> response_matrices[2,,] <- m173_sem1_TSL
> response_matrices[3,,] <- m173_sem1_FRN
> response_matrices[4,,] <- m173_sem1_SEAT
> response_matrices[5,,] <- m173_sem1_RCE
> response_matrices[6,,] <- m173_sem1_GND
>
> ##############################
> #(2a) SSL2 <- SSL1 + TSL1 + FRN1 + SEAT1 + RCE + GND
> ##############################
>
> # Fit a netlm model by using netlm, the response matrix and the array of predictor matrices
> # This may take a LONG time.
> nl<-netlm(m173_sem2_SSL,response_matrices)
>
> # Make the model easier to read by adding labels for each predictor matrix.
> nlLabeled <- list()
> nlLabeled <- summary(nl)
> # Labels are provided in the same order as they were assigned in the response_matrices array
> nlLabeled$names <- c("Intercept", "Social normalized and labeled (SSL1)", "Task normalized and labeled (TSL1)", "Friends 1=friend, 2=best friend(FRN1)", "Seat in first semester (Seat1)","Race (RCE)","Gender (GND)")
>
> # Round the coefficients to two decimals
> nlLabeled$coefficients = round(nlLabeled$coefficients, 2)
> nlLabeled
OLS Network Model
Residuals:
0% 25% 50% 75% 100%
-1.652584 -0.067206 0.008679 0.015217 2.924943
Coefficients:
Estimate Pr(<=b) Pr(>=b) Pr(>=|b|)
Intercept -0.02 0.368 0.632 0.611
Social normalized and labeled (SSL1) 0.45 1.000 0.000 0.000
Task normalized and labeled (TSL1) 0.03 0.949 0.051 0.055
Friends 1=friend, 2=best friend(FRN1) 0.16 0.998 0.002 0.002
Seat in first semester (Seat1) 0.08 0.997 0.003 0.003
Race (RCE) 0.00 0.486 0.514 0.917
Gender (GND) 0.01 0.591 0.409 0.839
Residual standard error: 0.3437 on 643 degrees of freedom
Multiple R-squared: 0.3817 Adjusted R-squared: 0.3759
F-statistic: 66.16 on 6 and 643 degrees of freedom, p-value: 0
Test Diagnostics:
Null Hypothesis: qap
Replications: 1000
Coefficient Distribution Summary:
Intercept Social normalized and labeled (SSL1)
Min -5.30425 -2.87324
1stQ -1.13972 -0.78419
Median -0.20386 -0.18491
Mean -0.24972 0.04001
3rdQ 0.64090 0.51252
Max 3.96993 9.14054
Task normalized and labeled (TSL1) Friends 1=friend, 2=best friend(FRN1)
Min -5.15919 -3.70427
1stQ -0.78578 -0.85105
Median -0.36675 -0.17124
Mean 0.01945 -0.07891
3rdQ 0.24560 0.59761
Max 8.72870 5.35088
Seat in first semester (Seat1) Race (RCE) Gender (GND)
Min -3.60216 -3.86594 -3.97813
1stQ -0.93956 -1.06593 -0.90991
Median -0.09426 -0.10965 -0.01830
Mean -0.05039 0.02182 0.02029
3rdQ 0.78198 1.00907 0.90703
Max 4.95567 4.71237 4.19721
>
> ##############################
> #(2b) TSL2 <- TSL1 + SSL1 + FRN1 + SEAT1 + RCE + GND
> ##############################
>
> # Fit a netlm model by using netlm, the response matrix and the array of predictor matrices
> nl<-netlm(m173_sem2_TSL,response_matrices)
>
> #make the model easier to read
> nlLabeled <- list()
> nlLabeled <- summary(nl)
> nlLabeled$names <- c("Intercept", "Social normalized and labeled (SSL1)", "Task normalized and labeled (TSL1)", "Friends 1=friend, 2=best friend(FRN1)", "Seat in first semester (Seat1)","Race (RCE)","Gender (GND)")
>
> nlLabeled$coefficients = round(nlLabeled$coefficients, 2)
> nlLabeled
OLS Network Model
Residuals:
0% 25% 50% 75% 100%
-6.79570 -0.10045 -0.00923 0.02628 7.90741
Coefficients:
Estimate Pr(<=b) Pr(>=b) Pr(>=|b|)
Intercept 0.10 0.824 0.176 0.199
Social normalized and labeled (SSL1) -0.28 0.005 0.995 0.027
Task normalized and labeled (TSL1) 1.01 1.000 0.000 0.000
Friends 1=friend, 2=best friend(FRN1) 0.01 0.576 0.424 0.918
Seat in first semester (Seat1) -0.14 0.009 0.991 0.024
Race (RCE) -0.04 0.438 0.562 0.762
Gender (GND) -0.09 0.143 0.857 0.285
Residual standard error: 0.8241 on 643 degrees of freedom
Multiple R-squared: 0.756 Adjusted R-squared: 0.7538
F-statistic: 332.1 on 6 and 643 degrees of freedom, p-value: 0
Test Diagnostics:
Null Hypothesis: qap
Replications: 1000
Coefficient Distribution Summary:
Intercept Social normalized and labeled (SSL1)
Min -3.337718 -5.319507
1stQ -0.278468 -0.698403
Median 0.654365 -0.024098
Mean 0.618341 0.066353
3rdQ 1.497785 0.615423
Max 4.341192 9.049366
Task normalized and labeled (TSL1) Friends 1=friend, 2=best friend(FRN1)
Min -2.000817 -4.106646
1stQ -1.009243 -0.767947
Median -0.681597 -0.102887
Mean 0.083602 0.014590
3rdQ -0.113823 0.754001
Max 21.325293 5.109040
Seat in first semester (Seat1) Race (RCE) Gender (GND)
Min -3.751419 -4.091117 -4.133658
1stQ -0.856588 -1.238601 -0.870247
Median 0.007167 -0.266552 -0.041021
Mean 0.027726 -0.111301 0.008289
3rdQ 0.867606 0.927934 0.885494
Max 4.222760 5.180458 3.826009
>
> ##############################
> #(3) QAP Regressions for m182
> ##############################
>
> # Repeat for class m182
>
> # Clear Variables
> rm(list=ls(all=TRUE))
> gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 1307477 35.0 2105982 56.3 2105982 56.3
Vcells 1313679 10.1 4373206 33.4 5466508 41.8
>
> data(studentnets.mrqap182, package = "NetData")
>
> # Again, look at what we loaded via ls():
> ls()
[1] "m182_sem1_FRN" "m182_sem1_GND" "m182_sem1_RCE" "m182_sem1_SEAT"
[5] "m182_sem1_SSL" "m182_sem1_TSL" "m182_sem2_SSL" "m182_sem2_TSL"
>
> # Again, we need the data in matrix format
> # predictor matrices
> m182_sem1_SSL <- as.matrix(m182_sem1_SSL)
> m182_sem1_TSL <- as.matrix(m182_sem1_TSL)
> m182_sem1_FRN <- as.matrix(m182_sem1_FRN)
> m182_sem1_SEAT <- as.matrix(m182_sem1_SEAT)
> m182_sem1_RCE <- as.matrix(m182_sem1_RCE)
> m182_sem1_GND <- as.matrix(m182_sem1_GND)
>
> #response matrices
> m182_sem2_SSL <- as.matrix(m182_sem2_SSL)
> m182_sem2_TSL <- as.matrix(m182_sem2_TSL)
>
> # This class requires some extra data manipulation. A student exits the
> # class at the end of first semester and another enters at the start of
> # second semester. These students must be removed before you can conduct
> # the QAP regression (every matrix needs the same row-column orderings).
> # Remove actor 15 from the TSL and SSL matrices of second semester
> # (actor 15 is the new student).
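# If your copies of the matrices still contain the extra actor, drop the row
# and its matching column together so the orderings stay aligned. A toy 3x3
# sketch (the real operation would be, e.g., m[-15, -15] on each affected
# second-semester matrix; the NetData copies used below are already aligned):

```r
m <- matrix(1:9, nrow = 3)  # toy sociomatrix for 3 actors
m.reduced <- m[-2, -2]      # drop actor 2's row AND column together
dim(m.reduced)              # 2 x 2
```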
>
> response_matrices <- array(NA, c(6, length(m182_sem1_SSL[1,]),length(m182_sem1_SSL[1,])))
> response_matrices[1,,] <- m182_sem1_SSL
> response_matrices[2,,] <- m182_sem1_TSL
> response_matrices[3,,] <- m182_sem1_FRN
> response_matrices[4,,] <- m182_sem1_SEAT
> response_matrices[5,,] <- m182_sem1_RCE
> response_matrices[6,,] <- m182_sem1_GND
>
> ##############################
> #(3a) SSL2 <- SSL1 + TSL1 + FRN1 + SEAT1 + RCE + GND
> ##############################
>
> #Fit a netlm model
> nl<-netlm(m182_sem2_SSL,response_matrices)
>
> #make the model easier to read
> nlLabeled <- list()
> nlLabeled <- summary(nl)
> nlLabeled$names <- c("Intercept", "Social normalized and labeled (SSL1)", "Task normalized and labeled (TSL1)", "Friends 1=friend, 2=best friend(FRN1)", "Seat in first semester (Seat1)","Race (RCE)","Gender (GND)")
>
> nlLabeled$coefficients = round(nlLabeled$coefficients, 2)
> nlLabeled
OLS Network Model
Residuals:
0% 25% 50% 75% 100%
-4.5908 -0.5230 -0.2287 0.0743 10.7076
Coefficients:
Estimate Pr(<=b) Pr(>=b) Pr(>=|b|)
Intercept 0.17 0.617 0.383 0.668
Social normalized and labeled (SSL1) 0.41 0.994 0.006 0.006
Task normalized and labeled (TSL1) 0.03 0.628 0.372 0.785
Friends 1=friend, 2=best friend(FRN1) 2.58 1.000 0.000 0.000
Seat in first semester (Seat1) -0.03 0.413 0.587 0.832
Race (RCE) -0.15 0.351 0.649 0.690
Gender (GND) 0.35 0.801 0.199 0.382
Residual standard error: 2.045 on 203 degrees of freedom
Multiple R-squared: 0.4146 Adjusted R-squared: 0.3973
F-statistic: 23.97 on 6 and 203 degrees of freedom, p-value: 0
Test Diagnostics:
Null Hypothesis: qap
Replications: 1000
Coefficient Distribution Summary:
Intercept Social normalized and labeled (SSL1)
Min -3.9033769 -3.6978696
1stQ -0.7189104 -0.9429362
Median 0.2007986 -0.1384686
Mean 0.1741948 -0.0144357
3rdQ 1.0677049 0.7784918
Max 4.3684892 6.2733129
Task normalized and labeled (TSL1) Friends 1=friend, 2=best friend(FRN1)
Min -3.3560750 -3.6077484
1stQ -0.8815202 -0.8455367
Median -0.0587745 -0.1048226
Mean 0.0042537 0.0059329
3rdQ 0.7515210 0.7859172
Max 4.5697663 5.3162769
Seat in first semester (Seat1) Race (RCE) Gender (GND)
Min -4.3700085 -3.5759379 -3.7249285
1stQ -0.8523221 -0.8802622 -0.9423154
Median 0.0250436 0.0109940 -0.0090869
Mean -0.0004824 0.0129409 0.0348913
3rdQ 0.8398148 0.8522812 0.9661874
Max 3.7955797 4.1698180 5.0220140
>
> ##############################
> #(3b) TSL2 <- SSL1 + TSL1 + FRN1 + SEAT1 + RCE + GND
> ##############################
>
> #Fit a netlm model (the response here is the second-semester task matrix)
> nl<-netlm(m182_sem2_TSL,response_matrices)
>
> #make the model easier to read
> nlLabeled <- list()
> nlLabeled <- summary(nl)
> nlLabeled$names <- c("Intercept", "Social normalized and labeled (SSL1)", "Task normalized and labeled (TSL1)", "Friends 1=friend, 2=best friend(FRN1)", "Seat in first semester (Seat1)","Race (RCE)","Gender (GND)")
>
> nlLabeled$coefficients = round(nlLabeled$coefficients, 2)
> nlLabeled
>
> # Report your results and describe what they mean. Repeat for ETSL of second semester.
>
> # QUESTION #2 - Compare your results for m173 and m182.
> # Do the classes have different results?
> # If not, why not?
> # If so, why so?
> # What sort of story can you derive about the change in
> # task and social interactions within these classrooms?
> #
> # Remember you are predicting changes in relations of sociable or task
> # interaction using other types of relations (i.e., friendship,
> # proximate seating, same race, etc).
>
> #########################################################################
> #
> # Extra-credit - run the QAP regression part of the lab on the
> # entire McFarland dataset of classrooms, and perform meta-analyses
> # on the results using multi-level modeling - then presto -
> # you will have results on academic and social relations in classrooms
> # which is suitable for publication in an A-journal.
> #
> #########################################################################
> setwd("Y:\\cgi-bin\\sna_drupal\\sna_R_labs\\output\\lab_8")