R wrap-up
To wrap up our introduction to R, we are going to tackle a larger project: analyzing the data of Clauset et al. (Science Advances 2015) on inequality in faculty hiring. The authors manually compiled a list of faculty hires at major US research institutions in three disciplines (CS, Business, History). For each faculty member, they recorded the institution that granted their PhD.
Load and clean the data
For example, to load the business data (assuming your working directory is the sandbox):
library(tidyverse)
edges <- read_tsv("../data/Dataset 1. Business_edgelist.txt")
This is a list of edges (links connecting nodes in a network):
- The first column contains the ID of the source university
- The second column contains the ID of the hiring university
- The third column specifies the rank of the hire (Assistant, Associate, Full)
- The fourth column contains the gender of the candidate
We rename the columns for easier programming:
edges <- edges %>% rename(from = `# u`, to = "v")
Now we read in the information about institutions:
vertices <- read_tsv("../data/Dataset 2. Business_vertexlist.txt")
We do some cleanup of the data:
vertices[vertices == "."] <- NA
vertices <- vertices %>% rename(id = `# u`,
                                usn = USN2012,
                                nrc = `NRC--`,
                                region = Region)
vertices$usn <- as.integer(vertices$usn)
- Write a function that accepts a filename for the edges, and one for the vertices, and performs the cleaning operations outlined above. The function should return a list with a tibble called vertices and another called edges. Note however that the names of the columns vary among data sets! Can you find a way to rename them anyway? (hint)
- Run the data cleaning on the three data sets, and store the lists in cs, business and histry (without the o, to avoid overwriting the command history)
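One way to cope with the varying column names is to rename by position or by name prefix rather than by exact name. The following is a minimal sketch, not a reference solution: the function name read_hiring_data is made up for illustration, and it assumes (without having checked every file) that the first two edge columns are always the source and hiring institution, that the first vertex column is the ID, and that the ranking columns always start with "USN" and "NRC". It also reads "." as missing directly in read_tsv instead of replacing it afterwards.

```r
library(readr)
library(dplyr)

# Hypothetical helper: read and clean one discipline.
# Assumptions: edge columns 1-2 are (source, hiring institution); vertex
# column 1 is the ID; ranking columns start with "USN" and "NRC".
read_hiring_data <- function(edge_file, vertex_file) {
  edges <- read_tsv(edge_file)
  # Rename by position, so the exact header text does not matter
  names(edges)[1:2] <- c("from", "to")

  # Treat "." as a missing value while reading
  vertices <- read_tsv(vertex_file, na = c("", "NA", "."))
  names(vertices)[1] <- "id"
  vertices <- vertices %>%
    rename(usn = starts_with("USN"),   # e.g. USN2012
           nrc = starts_with("NRC"),
           region = Region) %>%
    mutate(usn = as.integer(usn))

  list(vertices = vertices, edges = edges)
}

# Usage, with the file names given earlier:
# business <- read_hiring_data("../data/Dataset 1. Business_edgelist.txt",
#                              "../data/Dataset 2. Business_vertexlist.txt")
```

If the prefix assumption fails for one of the files, rename will leave that column untouched and the mutate step will stop with an error, which makes the mismatch easy to spot.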
Gender imbalance in rank
- Produce a barplot of the count of M/F hires for each rank. In which field do we have the least gender imbalance?
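As a starting point, a grouped barplot can be built with geom_bar, which counts rows for you. This sketch uses a made-up toy table, and the column names rank and gender are assumptions: check what the third and fourth columns are actually called in your edges tibble.

```r
library(dplyr)
library(ggplot2)

# Toy data standing in for e.g. business$edges (values are made up;
# the column names rank and gender are assumed)
toy_edges <- tibble(
  rank   = c("Asst", "Asst", "Assoc", "Full", "Full", "Full"),
  gender = c("F", "M", "M", "M", "M", "F")
)

# One bar per gender within each rank; bar height is the number of rows
p <- ggplot(toy_edges, aes(x = rank, fill = gender)) +
  geom_bar(position = "dodge")
p
```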
Top feeder institutions
- For each discipline, tally the number of times each institution is the PhD-granting institution (feeder). Join with the vertices table to get the name of the institution.
Note: Remove the row All others, which lumps together institutions in the rest of the world.
- Write a function that plots the cumulative distribution of the number of hires. Do a few institutions contribute disproportionately?
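The tally and the cumulative distribution could be sketched as follows. The toy edge list is made up, the function names are hypothetical, and the sketch assumes the All others row has already been removed and that the source column is called from, as in the renaming step above.

```r
library(dplyr)
library(ggplot2)

# Toy edge list standing in for e.g. history$edges (values are made up)
toy_edges <- tibble(from = c(1, 1, 1, 1, 2, 2, 3, 4),
                    to   = c(2, 3, 4, 1, 1, 3, 1, 2))

# Hypothetical helper: hires per feeder, largest first, with the
# cumulative proportion of all hires
cumulative_hires <- function(edges) {
  edges %>%
    count(from, name = "hires") %>%
    arrange(desc(hires)) %>%
    mutate(rank = row_number(),
           cum_prop = cumsum(hires) / sum(hires))
}

# Hypothetical helper: step plot of the cumulative proportion
plot_cumulative_hires <- function(edges) {
  ggplot(cumulative_hires(edges), aes(x = rank, y = cum_prop)) +
    geom_step() +
    labs(x = "Institutions, ranked by number of hires",
         y = "Cumulative proportion of hires")
}

p <- plot_cumulative_hires(toy_edges)
p
```

A curve that rises steeply and then flattens indicates that a few feeders account for most hires. To label institutions by name, join the tallied table with vertices on the institution ID before plotting.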
Gini coefficient
You have seen that in history four institutions provide more than 25% of the professors. To quantify the inequality you can use the Gini coefficient.
The Gini coefficient is used to measure income/wealth inequality. If $x_i$ is the wealth or income of person $i$, and there are $n$ people, then the Gini coefficient $G$ is given by:
$$G = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} |x_i - x_j|}{2 n \sum_{i=1}^{n} x_i}$$
The index ranges from 0 to approximately 1, with higher values standing for larger inequality. For example, the Gini coefficient for income inequality is 0.27 for Sweden, 0.34 for Canada, 0.41 for the US and 0.63 for South Africa.
- Compute the Gini coefficient for the number of hires from each institution. How does it compare with economic inequality?
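A direct translation of the standard mean-absolute-difference definition, G = sum over all pairs of |x_i - x_j|, divided by 2 n sum(x), fits in a few lines of base R. This is only a sketch: the function name gini is our own, and the outer call builds an n-by-n matrix, which is fine for a few hundred institutions but wasteful for very long vectors.

```r
# Gini coefficient via the mean absolute difference:
# G = sum_i sum_j |x_i - x_j| / (2 * n * sum_i x_i)
gini <- function(x) {
  n <- length(x)
  # outer() builds the matrix of all pairwise differences x_i - x_j
  sum(abs(outer(x, x, "-"))) / (2 * n * sum(x))
}

gini(c(1, 1, 1, 1))   # perfect equality: 0
gini(c(0, 0, 0, 10))  # one unit holds everything: (n - 1) / n = 0.75
```

The second example shows why the index only approaches 1: with n units its maximum is (n - 1) / n. For the exercise, x would be the vector of hire counts per feeder institution.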
Proportion of hires from an institution
- For each discipline and hiring institution, compute the proportion of hires holding a PhD from each feeder institution (e.g., in CS, 15.4% of the faculty at Stanford hold a PhD from Berkeley).
- For each discipline, find the 10 most “inbred” institutions (i.e., those hiring their own graduates in the highest proportion).
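Both steps can be sketched with count and a grouped mutate. The toy edge list below is made up, and the sketch assumes the columns are named from (PhD-granting institution) and to (hiring institution), as in the renaming step above.

```r
library(dplyr)

# Toy edge list standing in for e.g. cs$edges (values are made up)
toy_edges <- tibble(from = c(1, 1, 2, 2, 2, 3, 3),
                    to   = c(1, 2, 2, 2, 1, 2, 2))

# For each hiring institution (to), the share of its hires that come
# from each feeder (from)
hire_props <- toy_edges %>%
  count(to, from, name = "hires") %>%
  group_by(to) %>%
  mutate(prop = hires / sum(hires)) %>%
  ungroup()

# Self-hires are the rows where feeder and hiring institution coincide;
# keep the (up to) 10 institutions with the largest self-hire proportion
inbred <- hire_props %>%
  filter(from == to) %>%
  arrange(desc(prop)) %>%
  slice_head(n = 10)
inbred
```

With real data you may want to exclude institutions with very few hires, since a department with two hires, one of them its own graduate, would otherwise rank as highly inbred.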
Here’s a possible solution