Part of project: UIUC Affiliation Network

Let's create a UIUC network visualization!

One of the benefits of being a student at a public university is that university-related data, such as employee salaries, is publicly available. Since UIUC is one of the largest public institutions in the US, I thought it would be interesting to examine the affiliation network of its employees. In this series of posts, I will walk you through how I created a visualization of the UIUC network.

1. Obtaining the data

The data for the project is available from the Grey Book. We will use the 2015-2016 salary data, which contains the college and departmental affiliations of all faculty, staff and university administrators. The data is available as a PDF or as a collection of HTML pages. We will scrape the HTML pages, since running OCR on the PDF is more complicated.

We will use Python to scrape the data. BeautifulSoup is a great library for scraping HTML, and the module urllib.request will be used to open URLs. We will also use re for regular expressions and write the results to a plain text file.

Let us first open the table of contents page, TOC.html, and parse it.

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

weblink = ""
webpage = urlopen(weblink + "TOC.html")
soup = BeautifulSoup(webpage, "html.parser")
mydivs = soup.findAll("table", { "summary" : re.compile(".*Urbana-Champaign.*")})

The function urlopen opens the URL of the website. We use the findAll function to find the table tags for the Urbana-Champaign campus. Note the use of re.compile to match the campus name within each table's summary attribute.
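To see how a compiled pattern filters on an attribute, here is a minimal sketch that runs without touching the network. The HTML snippet and its summary strings are made up for illustration and are not the real contents of TOC.html. BeautifulSoup tests a compiled pattern with its search() method, so the surrounding ".*" is optional.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for TOC.html: one table per campus.
html = """
<table summary="Departments at Chicago campus">
  <tr><td><a href="c1.html">C1</a></td></tr>
</table>
<table summary="Departments at Urbana-Champaign campus">
  <tr><td><a href="u1.html">U1</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the tables whose summary attribute contains the campus name.
pattern = re.compile("Urbana-Champaign")
tables = soup.find_all("table", {"summary": pattern})

summaries = [t["summary"] for t in tables]
print(summaries)
```

Only the second table survives the filter, because the pattern is searched for anywhere inside the attribute value.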

links = []
for div in mydivs:
    # extend rather than assign, so links from every matching table are kept
    links.extend(div.findAll('a'))

Since TOC.html contains multiple links to different departments, we need to obtain all of the links. We do so by iterating through the matching tables, finding the a tags, and storing them in the list links. The URL of each link is held in its href attribute.
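The href values in a table of contents are usually relative, so each one has to be combined with the address of the page it came from before it can be opened. This post does that by string concatenation; as an alternative, the sketch below uses urljoin from the standard library, which also copes with subdirectories. The base URL and file names here are hypothetical placeholders, not the real Grey Book addresses.

```python
from urllib.parse import urljoin

# Hypothetical address of the table of contents page.
base = "https://example.org/greybook/TOC.html"

# Hypothetical relative links, as they might appear in href attributes.
hrefs = ["dept01.html", "sub/dept02.html"]

# urljoin resolves each relative link against the page it was found on.
urls = [urljoin(base, href) for href in hrefs]

for url in urls:
    print(url)
```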

The next few lines look convoluted, but they are quite straightforward. First, we create a file to hold the scraped data. We then iterate through each of the links obtained in the previous step. Again, urlopen is used to open each link for parsing. The table of interest this time is the table of salaries, which looks like this:


The corresponding html code is as follows:

Now, it is clear that we need the cells with the classes “dept-heading” and “empl” from the table. We can obtain them by iterating through the table with findAll, storing the department name whenever a cell begins with a number (detected with a regular expression). Below is the code.

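# "with" closes the output file even if a page fails to parse
with open("uiucData.txt", "w") as f:
    for a in links:
        subpage = urlopen(weblink + a['href'])
        soup2 = BeautifulSoup(subpage, "html.parser")
        table = soup2.find("table", {"summary": "Table of Salary Ranges"})
        department = None
        for row in table.findAll("th", {"class": ["dept-heading", "empl"]}):
            text = row.text.strip()
            if re.match(r"\d+\s-\s", text):
                # department heading: drop the leading number and dash
                department = re.sub(r"\d+\s-\s", "", text)
            elif department is not None:
                # employee cell: pair it with the most recent department
                print(department + "\t" + text, file=f)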

We print the result to the file uiucData.txt, and that’s it! Each line holds a department and an employee separated by a tab, so this file will serve as an edge list for our graph.
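The pairing step can be tried in isolation, without any scraping. The sketch below assumes the intent is to attach each employee cell to the most recent department heading. The helper name to_edges and the cell texts are made up for illustration.

```python
import re

# A department heading starts with a number, a space, a dash and a space.
DEPT_HEADING = re.compile(r"\d+\s-\s")

def to_edges(cells):
    """Pair each employee cell with the most recent department heading."""
    edges = []
    department = None
    for text in cells:
        text = text.strip()
        if DEPT_HEADING.match(text):
            # Heading: remember the department name without its numeric prefix.
            department = DEPT_HEADING.sub("", text, count=1)
        elif department is not None:
            # Employee: emit one (department, employee) edge.
            edges.append((department, text))
    return edges

# Made-up cell texts in the order they might appear in a salary table.
cells = [
    "123 - Computer Science",
    "Doe, Jane",
    "Roe, Richard",
    "456 - Physics",
    "Poe, Alex",
]

edges = to_edges(cells)
for department, employee in edges:
    print(f"{department}\t{employee}")
```

Cells that appear before any heading are skipped rather than written with a missing department.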

In the upcoming posts, we will discuss how to ingest the data to create a graph, followed by creating an interactive network.