Analysis of Colleges Vs. Housing Prices in Boston

By: Skye Nygaard, Stella Johnson, and Yuge Xue

Abstract

Using the following code, we seek to analyze the effect of presence and number of colleges had on home prices in the boston area. Once that analysis is complete, we want to know the effect that specifically admission rate and population for these colleges has on their surrounding regions. It is thought that since admission rate serves as a proxy for quality, as it goes down and population go up, pressure will be put on the surrounding area, driving up prices. We want to know the effect at a micro scale (Northeastern vs Roxbury) as well as a macro scale (colleges vs nearby regions).

Ultimately, when the analysis is complete, we get significant results. Places with colleges in Boston have higher prices, and a greater average increase in housing prices. Additionally, though results cannot be zoomed in as far as Northeastern vs. Roxbury, a statistically signifcant result is obtained stating that lowering admission rate increases housing prices nearby, as does raising the student population.

Ethical Analysis

Ultimately the data set did not return anything on a fine enough scale to harm (or help) individuals. While the conclusions were significant; that colleges have a real role on the prices of housing in areas, no particular college is named so it is simply a knowledge building exercise. One potential misuse would be for someone to say to a college, "look at this data, you have to spend money to adress this problem." This data set only applies to colleges in Boston and doesn't establish the direction of causality, so trying to force actions based on this data would be extremely irresponsible. However, it is still worth making note of the analysis. If these colleges are correlated so strongly with housing prices in the way specified, policy should be made with those considerations in mind.

No data with a personal component was included, only averaged data. This ensures a pretty high and representative quality of data, improving analysis. Zillow and the Federal Reserve would not be doing too well if their data could no be trusted, and this authority is transferred to my analysis. However, no data set is perfect and as is outlined in the inconsistencies section, the results should be taken with this in mind. The statistics were also done in such a way as to introduce minimal bias. Once the regression results for Northeastern were found, there were many other data pieces available to cherry pick which had a high correlation value. However, showing that data would be dishonest to the overall picture. Three seperate analyses were created which would seperate bias in the data. Overall, there is minimal harm, and the gain of knowledge that colleges could have this effect on their local area.

In [70]:
import requests
import pickle
import csv
import pandas
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import statistics
import numpy as np
from IPython.display import Image

Data Gathering

The first step to any analysis is data collection. First we write a simple function using the requests module to return data, in JSON format.

In [5]:
def pull_data_simple(address):
    data = requests.get(address)
    return data.json()

College Data

There are two pages worth of colleges to pull data for, so we get a json file for the first page, then the second, as dictated by the Department of Education API documentation. We receive lists containing dictionaries for colleges in Boston, which are then merged, and returned.

https://collegescorecard.ed.gov/data/documentation/

In [60]:
API_Key = "nzzbzy16WMaKbB1aN6gJTMERlo2xDa60np95P4U2"

def pull_college_data():
    MA = pull_data_simple("https://api.data.gov/ed/collegescorecard/v1/schools.json?school.city=Boston&api_key=" + API_Key)
    MA2 = pull_data_simple("https://api.data.gov/ed/collegescorecard/v1/schools.json?school.city=Boston&api_key=" + API_Key + "&page=1")
    for i in range(len(MA2["results"])):
        MA["results"].append(MA2["results"][i])
    return MA

MA_colleges = pull_college_data()

Next we want to take certain data from each dictionary, and store it away for later use. The desired values are "school", "longitude", "latitude", "zipcode", "admission rate", "undergrad population". This function takes in the raw dictionaries and processes out nice lists featuring only the desired data.

In [61]:
# returns in format [[header], [school, longitude, latitude, zipcode, [admission rate by year], [undergrad population by year]]]

def get_college_data(source):
    data = []
    header = ["school", "longitude", "latitude", "zipcode", "admission rate", "undergrad population"]
    data.append(header)
    years = list(range(1996, 2017))
    for i in range(len(source["results"])):  # runs through each college
        # checks if it has admissions data, proxy for being a 'real' college
        if source["results"][i]["2016"]["admissions"]["admission_rate"]["overall"]: 
            college = []
            college.append(source["results"][i]["school"]["name"])
            college.append(source["results"][i]["location"]["lon"])
            college.append(source["results"][i]["location"]["lat"])
            college.append(source["results"][i]["school"]["zip"])
            admission = [source["results"][i][str(year)]["admissions"]["admission_rate"]["overall"] for year in years]
            pop = [source["results"][i][str(year)]["student"]["size"] for year in years]
            college.append(admission)
            college.append(pop)
            if len(college[3]) > 6:
                college[3] = college[3].replace(college[3][-5:], "")  # removes irrelevant last 4 numbers for zipcode
            data.append(college)
    return data

college_data = get_college_data(MA_colleges)

Zip Codes

A csv with zip codes in Massachusetts and their size was downloaded in order to call housing data as well as for use in map making. Unique values were taken, which were be used to call the housing data.

http://bostonopendata-boston.opendata.arcgis.com/datasets/53ea466a189b4f43b3dfb7b38fa7f3b6_1

In [63]:
# creates sorted list of just zipcodes for use in python

with open("ZIP_Codes.csv") as zips:
    boston_zips = zips.readlines()
    boston_zips = [line.split(",") for line in boston_zips]
    boston_zips = {boston_zips[i][1]:True for i in range(1, len(boston_zips))}
    boston_zips = sorted(list(boston_zips.keys()))

Without modification, this list creates an error when trying to call housing prices. Upon closer inspection, there is an issue with three of the zipcodes. One is Harvard Business School, one is the Government Center, and one is the Prudential center. None of these zip codes have publically purchasable housing, and must be removed before the call is made.

In [64]:
bad_zip = ["02163", "02199", "02203"]
for zip in bad_zip:
    boston_zips.pop(boston_zips.index(zip))

Housing Data

In order to obtain data on housing prices in Boston, Quandl's API was found, containing data created aggregated by Zillow. It featured median housing prices each month, callable using many different scopes. One scope was zip code. A function was written to call data for years 1996 to 2016, for each zip code in Boston.

https://www.quandl.com/data/ZILLOW/Z02115_ZHVIAH-Zillow-Home-Value-Index-Zip-Zillow-Home-Value-Index-All-Homes-02115-Boston-MA

In [65]:
# returns in format [[header], ['zip', [list of years]]]

API_Key = "4tx4qumaWAfGieujhD63"

def pull_housing_data(zip_data):
    # Zillow Home Value Index - All Homes
    data = pull_data_simple("http://www.quandl.com/api/v3/datasets/ZILLOW/Z" + zip_data[0] + "_ZHVIAH?api_key=" + API_Key)
    housing_data = []
    header = ["zip"]
    header_dates = []
    for i in range(1, (len(data["dataset"]["data"])+1)):
        date = data["dataset"]["data"][-i][0]  # gets each date in yyyy-mm-dd format
        year = int(date[0:4])
        if year < 2017:
            header_dates.append(date)
    header.append(header_dates)
    housing_data.append(header)
    for zip in zip_data:
        data = pull_data_simple("http://www.quandl.com/api/v3/datasets/ZILLOW/Z" + zip + "_ZHVIAH?api_key=" + API_Key)
        info = [zip]
        price = []
        for i in range(1, (len(data["dataset"]["data"]) + 1)):
            date = data["dataset"]["data"][-i][0]
            year = int(date[0:4])
            if year < 2017:
                price.append(data["dataset"]["data"][-i][1])
        info.append(price)
        housing_data.append(info)
    return housing_data

housing_data = pull_housing_data(boston_zips)

Writing Data For Later

Pickle is a default module that was imported, letting us store the data in the list format for future use. This served invaluable for the goal of not using up all the API calls.

https://docs.python.org/3/library/pickle.html

In [67]:
# with open("CollegeData.pkl", "wb") as f:
#     pickle.dump(college_data, f)

# with open("CollegeData.pkl", "rb") as f:
#     CollegeData = pickle.load(f)

# with open("HousingData.pkl", "wb") as f:
#     pickle.dump(housing_data, f)

# with open("HousingData.pkl", "rb") as f:
#     HousingData = pickle.load(f)

Exporting Data For GIS

In order to get college proximity we exported the lattitude, longitude, zip code, and name of each college. A buffer was then created in GIS software, approximating the area of effect for students in the college. The buffer radius was approximately 3.3 times the campus radius, as an estimate of distance students would be willing to go to find outside housing.

In [145]:
# writes the data for use in outside program

def write_college_csv(college_data, filepath):
    college_data_csv = [college_data[0][:4]]
    for i in range(1, len(college_data)):
        college_data_csv.append(college_data[i][:4])
    with open(filepath, "w") as write_csv:
        line_writer = csv.writer(write_csv, quoting=csv.QUOTE_MINIMAL)
        for line in college_data_csv:
            line_writer.writerow(line)

write_college_csv(college_data, "CollegeData.csv")
In [71]:
# using IPython.display, displays picture of GIS inline

Image(filename='BufferDisplay.png') 
Out[71]:

Reimporting

The data from GIS was read back into Python, in csv format. The CSV contained many duplicates, redundant data that had to be processed through. A function was created to match each college with each zip code it was in proximity to. The total number of matched colleges was then obtained, and stored in dictionary format

In [13]:
# read GIS file

def read_zip_csv(filepath):
    with open(filepath) as f:
        file = csv.reader(f)
        header = next(file)
        data = list(file)
    return header, data


# creates tuples of each pair of college and nearby zipcode, using file as read in by read_zip_csv
# returns [(college, zip), ()] sorted by zipcode

def get_unique_pairs(filepath):
    header, data = read_zip_csv(filepath)
    list_of_pairs = []
    zip_loc = header.index('ZIP5')
    bad_zip = ["02163", "02199", "02203"]
    for line in data:
        if not line[zip_loc] in bad_zip:
            list_of_pairs.append((line[2], line[header.index('ZIP5')]))
    unique_pairs = set(list_of_pairs)
    affected_zipcode = sorted(unique_pairs)
    return affected_zipcode

unique_pairs = get_unique_pairs("Zipcodes_Near_College.csv")


# using the newly created tuple pairs, and the old list of zip codes, creates dict of number of colleges in each zip
# returns {'zip': num_colleges, } number of colleges is an integer

def get_count_values(unique_pairs, list_of_zips):
    zip_college_dict = {zip: 0 for zip in list_of_zips}
    for k, v in unique_pairs:
        if v in zip_college_dict:
            zip_college_dict[v] += 1
    return zip_college_dict

colleges_in_zip = get_count_values(unique_pairs, boston_zips)

ANCOVA

In order to take an initial look at the overall trends of the prices, each of the zips was added to a large pandas dataframe. This is condusive to running regressions, and made it easy to get subsets. These subsets were assigned based on the number of colleges: 0, 1, 2-4, 5+.

This first function creates a percentage increase, to analyze the relative slopes between groups, of increase from the initial housing price.

In [74]:
# takes in the housing data from the initial API call and processing
# returns [header, [zip, [price index by month]]], same format as other data

def housing_index(housing_data):
    full_house_copy = housing_data[:]
    house_index = []
    house_index.append(full_house_copy.pop(0))
    for line in full_house_copy:
        house_index.append([line[0], [line[1][i]/line[1][0] for i in range(len(line[1]))]])
    return house_index

percentage_increase_housing = housing_index(housing_data)

Utilizing the housing data and the percentage increase function, the next function creates a dataframe containing the year as either a decimal for python or yyyymm format for GIS; the zip code, the price, the percentage increase in price over a base level, and the number of colleges (used to group). It takes in the dictionary of zips to number of colleges, and a boolean of whether you want the year in decimal format, defaulting to true. It utilized the two sets of housing data.

In [77]:
# takes in dict of zips to number of colleges, a boolean of wheter you want the year in decimal or yyyymm format
# returns data frame for use in regression, with 
# columns={0: 'year', 1: 'zip', 2: 'price ($)', 3: 'cumulative increase', 4: 'num_colleges'}

def make_ancova_list(zipdict, year_as_decimal=True):
    ancova_list = []
    housingmodifiable = housing_data[:]
    percentage_increase_housing = housing_index(housing_data)
    header = housingmodifiable.pop(0)
    j = 1
    for line in housingmodifiable:
        if year_as_decimal:
            years = [int(year[:4]) + int(year[5:7])/12 for year in header[1]]
        else:
            years = [year[:4] + year[5:7] for year in header[1]]
        for i in range(len(line[1])):
            ancova_list.append([years[i], (line[0]), line[1][i], percentage_increase_housing[j][1][i], zipdict[line[0]]])
        j += 1
    df = pandas.DataFrame(ancova_list)
    df.rename(columns={0: 'year', 1: 'zip', 2: 'price ($)', 3: 'cumulative increase', 4: 'num_colleges'}, inplace=True)
    return df

ancova_df = make_ancova_list(colleges_in_zip, housing_data)

Finally, the groups were created, and a regression was run on each. The line of best fit (multiple linear regression) was created for each group, and added to the graphs. Two seperate graphs were created, one for general price changes, and one for percentage over a base level.

https://towardsdatascience.com/simple-and-multiple-linear-regression-in-python-c928425168f9

In [147]:
# takes in the ancova data frame constructed using the previous function, the column for the dependant variable, 
# either absolute price or percentage increase
# groups the data, and runs a predictive regression for each group, then plots it

def ancova(ancova_df, dependant_variable):
    predicted_sets = []
    groups = [0, 1, 4, 11]
    group_names = ['0', '1', '2-4', '5-11']
    bottom_of_set = -1
    years = list(sorted(set(ancova_df['year'].values)))
    years = [[year] for year in years]
    for i in range(len(groups)):
        ancova_subset = ancova_df[(ancova_df['num_colleges'] <= groups[i]) & (ancova_df['num_colleges'] > bottom_of_set)]
        dependant = ancova_subset[dependant_variable]
        independant = ancova_subset['year'].values.reshape(-1, 1)
        r2 = LinearRegression().fit(independant, dependant)
        predicted = r2.predict(years)
        predicted_sets.append(predicted)
        plt.plot(years, predicted, label=str(group_names[i]) + ' Colleges')
        bottom_of_set = groups[i]
        plt.legend()
    plt.xlabel('time in years')
    plt.ylabel(str(dependant_variable))
    plt.title('time in years vs ' + str(dependant_variable))

figure_size = plt.rcParams["figure.figsize"]
figure_size[0] = 24.0
figure_size[1] = 8.0
plt.rcParams["figure.figsize"] = figure_size
plt.subplot(1, 2, 1)
ancova(ancova_df, 'price ($)')
plt.subplot(1, 2, 2)
ancova(ancova_df, 'cumulative increase')
plt.show()

At this point, animations can also shed light on this phenomenon. "RawIndexFinal" and "RawPriceFinal".

Exporting Ancova For Display

This data was exported to csv using pandas, this time utlilizing a False in the function and returning the data in yyyymm format. This was used to create the animations displayed.

In [80]:
GIS_ANCOVA = make_ancova_list(colleges_in_zip, False)
GIS_ANCOVA.to_csv('ancova_data.csv', sep=',', index=False)

Aggregating Housing Prices By Year

In order to conform the housing data to the college data, it has to be converted to year averages. The first of the next two functions takes the data in the month format, and averages it across the year. The second utilizes the function, repeating for each zip code and storing the data in a dictionary.

In [82]:
# Takes in the given zip code and the housing data.
# Returns a list with the data averaged into 21 values, one for each year

def price_grouping(zip_code, housing_data):
    grouping = []
    for housing in housing_data:
        if zip_code == housing[0]:
            grouping = housing[1]
    result = [0]
    years = housing_data[0][1]
    current_year = years[1][:4]
    current_pos = 0
    entries = 0
    for year in years:
        if year[:4] == current_year:
            result[current_pos] += grouping[years.index(year)]
            entries += 1
        else:
            current_year = year[:4]
            result[current_pos] = round(result[current_pos] / entries, 2)
            entries = 0
            current_pos += 1
            result.append(grouping[years.index(year)])
    result[current_pos] = round(result[current_pos] / entries, 2)

    return result
In [83]:
# takes in a list of zips, and housing data containing those zips
# utilizes the price_grouping function to create a dictionary of zipcodes to average housing prices by year
# returns {'zip': average price by year, }

def house_dict(list_of_zips, housing_data):
    housing_prices_year = {}
    for zip in list_of_zips:
        housing_prices_year[zip] = price_grouping(zip, housing_data)
    return housing_prices_year

housing_price_dict = house_dict(boston_zips, housing_data)

Data Normalization

Inflation, GDP, and popuation density all obviously have a larger impact on housing than colleges have. To best adjust for all these factors at once, a csv of the Boston Housing Price Index was obtained from the Federal Reserve Database (FRED). This was imported, and all the housing values were adjusted accordingly. This would allow for analysis in differences between zip codes as the main factor, rather than general changes distorting the data. It was also done in part to adress the issue of multicollinearity, but that issue is not especially relevant as long as the data is correct.

https://fred.stlouisfed.org/series/ATNHPIUS14454Q

https://en.wikipedia.org/wiki/Multicollinearity

In [22]:
with open('ATNHPIUS14454Q.csv', 'r') as f:
    boston_housing_index = [line.strip().split(',') for line in f.readlines()]
boston_housing_index = [float(boston_housing_index[i][1]) / float(boston_housing_index[1][1]) for i in range(1, len(boston_housing_index))]
In [23]:
# takes in dict of house prices, data set of housing index to adjust prices
# reuturns a dictionary with the lists of the first divided by the second list

def adjust_housing(housing_price_dict, adjuster):
    adjusted_dict = {}
    for zip in housing_price_dict:
        adjusted_dict[zip] = [housing_price_dict[zip][i]/adjuster[i] for i in range(len(adjuster))]
    return adjusted_dict

adjusted_housing_price_dict = adjust_housing(housing_price_dict, boston_housing_index)

Data Inconsistency

If our data were completely representative then the index would result in a dictionary with a stable level of prices, on average. However, our data still shows a notable upward trend meaning that the Federal Reserve and Zillow have slightly different data.

Individual Regression

In order to make the data more usable, a dictionary is created mapping each college to its respective data.

In [85]:
# takes in college list data
# returns same data in more usable dictionary in format {college: (admission by year, population by year), }

def college_dict(college_list):
    dict_of_colleges = {}
    for i in range(1, len(college_list)):
        dict_of_colleges[college_list[i][0]] = (college_list[i][4], college_list[i][5])
    return dict_of_colleges

college_numbers_dict = college_dict(college_data)

The next function takes in a specified college and zipcode, then gets the data from the respective datasets, and returns it in a format that can be easily made into a dataframe. The regression is run on this dataframe. This allows individual regressions, but a lot of power is gained when utilizing the tuple pairs created earlier.

In [93]:
# takes in the data dictionaries, a college, and a zipcode
# returns the housing data in a list, and the acceptance and population data in a list of lists

def regression_data(college, zipcode, college_dict=college_numbers_dict, housing_dict=adjusted_housing_price_dict):
    housing = []
    acceptance = []
    population = []
    for i in range(len(college_dict[college][0])):
        if college_dict[college][0][i] and housing_dict[zipcode][i]:  # checks if not None
            housing.append(housing_dict[zipcode][i])
            acceptance.append(college_dict[college][0][i])
            population.append(college_dict[college][1][i])
    return housing, [acceptance, population]

As we now have the dataset for regressions, we can run regression to determine correlation between housing prices and the Northeastern's admission rate + student population. This allow us to see how much impact our school has had in different areas.

In [101]:
# takes in a zip, and a 0 or 1, representing whether you want the admission rate plot (0), or the population plot (1)
# returns the regression plot and R2 value

def neu_impact(zip, adm_pop):
    data = regression_data('Northeastern University', zip)
    if adm_pop == 0:
        plt.title(f'Northeastern Admission Rates vs {zip} Housing Prices')
        plt.xlabel('Admission Rate')
        plt.ylabel('Housing Prices at {zip}')
        plt.plot(data[1][adm_pop], data[0], "ro")
        plt.axis([0, 1, 0, 300000])
        fit = LinearRegression().fit([[i] for i in data[1][0]], data[0])
        r_2 = fit.score([[i] for i in data[1][0]], data[0])
        print(f'R2 = {r_2}')
    elif adm_pop == 1:
        plt.title(f'Northeastern Student Population vs {zip} Housing Prices')
        plt.xlabel('Population')
        plt.ylabel('Housing Prices at {zip}')
        plt.plot(data[1][adm_pop], data[0], "ro")
        plt.axis([0, 25000, 0, 300000])
        fit = LinearRegression().fit([[i] for i in data[1][1]], data[0])
        r_2 = fit.score([[i] for i in data[1][1]], data[0])
        print(f'R2 = {r_2}')
    x = np.linspace(0, 1000000, 1000)
    plt.plot(x, float(fit.coef_)*x + fit.intercept_, linestyle='-')
    plt.show()
In [102]:
figure_size = plt.rcParams["figure.figsize"]
figure_size[0] = 6.0
figure_size[1] = 4.0
plt.rcParams["figure.figsize"] = figure_size

# Northeastern's impact on Roxbury(02119) Housing Prices
neu_impact("02119", 0)
neu_impact("02119", 1)

# Northeastern's impact on local(02115) Housing Prices
neu_impact("02115", 0)
neu_impact("02115", 1)
R2 = 0.0428451155214491
R2 = 0.1350821434302133
R2 = 0.352914530113418
R2 = 0.745496592514964

For Northeastern, prices go up as population falls! That's not what we hypothesize. Additional, the r2 values are wildly different for each of the plots. It seems that a small scale approach is not applicable for this data. We'll have to look at all the college to see if there's any overall trend.

Overall Regression

Three functions were created, which would allow the regressions to be run on every college vs. zip code where it is in proximity, and then every college vs. zip code where it is not in proximity. The first group acts as a test, and the second as a control. The coefficent, or slope of each variable is returned, as well as the correlation coefficent. If there is real impact from our statistics, then the correlation for zip codes in proximity would be higher than for those not in proximity. Additionally, if the data points towards the relationship we expect, then there will be a negative relationship between admission rate and price, and a positive relationship between population and price.

In [100]:
# takes in a list of r2 values to add to, a college and zip to run the regression for
# gets the data from the dictionaries using the previous function, and runs regression for it
# returns the list with the new r2 added, in addition to the coefficent values for each college variable

def get_r2(list_of_r2, college, zip):
    dependant, independant = regression_data(college, zip)
    df = pandas.DataFrame(independant)
    df = df.transpose()
    target = pandas.DataFrame(dependant)
    fit = LinearRegression().fit(df, target)
    coef_values.append(fit.coef_)
    return list_of_r2 + [fit.score(df, target)]


# runs get_r2 for each college and zip in the tuple pairs
# returns r2 list for the test variable

def dependant_regressions(unique_pairs):
    r2dependant = []
    for college, zip in unique_pairs:
        r2dependant = get_r2(r2dependant, college, zip)
    return r2dependant

# runs get_r2 for every college vs every zip code
# removes value that are in the test list
# returns the control r2 list

def independant_regressions(r2dependant, college_numbers_dict, colleges_in_zip):
    r2independant = []
    for college in college_numbers_dict:
        for zip in colleges_in_zip:
            r2independant = get_r2(r2independant, college, zip)
    for val in r2dependant:
        r2independant.remove(val)
    return r2independant

Histograms were created to display the correlation coefficents for each dataset. The shape should reflect the overall impact, with more towards the right side being a higher level of correlation.

In [126]:
coef_values = []

bins_of_tenths = [i/10 for i in (range(0, 11))]
r2dependant = dependant_regressions(unique_pairs)
plt.hist(r2dependant, bins=bins_of_tenths)
plt.title('R2 of Dependant Zip Codes')
plt.ylabel('# of Zip Codes')
plt.xlabel('R2 Value')
Out[126]:
Text(0.5,0,'R2 Value')

For the proximate zip codes, or the ones that should be dependant on college data, it does show a notable amount at the high end, though with also a good amount at the very low end.

The following function averages each coefficent, as obtained in arrays during the regression calls.

In [127]:
# takes in list of arrays of admission and population coefficents
# returns the averages for each data set

def get_coef(coef_values):
    average_admission_coef = 0
    average_population_coef = 0
    for i in range(len(coef_values[0])):
        average_admission_coef += coef_values[0][i][0]
        average_population_coef += coef_values[0][i][1]
    return average_admission_coef / len(coef_values[0]), average_population_coef / len(coef_values[0])
In [128]:
dependant_coef = get_coef(coef_values)
In [129]:
coef_values = []

r2independant = independant_regressions(r2dependant, college_numbers_dict, colleges_in_zip)
plt.hist(r2independant, bins=bins_of_tenths)
plt.title('R2 of Independent Zip Codes')
plt.ylabel('# of Zip Codes')
plt.xlabel('R2 Value')
Out[129]:
Text(0.5,0,'R2 Value')

The regressions for non-proximate data, that should be independant, show the random scatter that would be expected when running correlations for things that should not be correlated.

In [130]:
independant_coef = get_coef(coef_values)

Final Analysis

When we look at the slopes, both relationships are in the expected direction. We can see that the proximate dataset has a stronger effect from both admission and population data, likely reflecting regression to the mean in the control dataset.

In [131]:
print(f"For the dependant zip codes, the slope of admission rate vs price is {dependant_coef[0]},\n\
and the slope of undergraduate population vs price is {dependant_coef[1]}\n")

print(f"For the independant zip codes, the slope of admission rate vs price is {independant_coef[0]},\n\
and the slope of undergraduate population vs price is {independant_coef[1]}")
For the dependant zip codes, the slope of admission rate vs price is -89464.28849311528,
and the slope of undergraduate population vs price is 72.33355742744872

For the independant zip codes, the slope of admission rate vs price is -36668.91641116601,
and the slope of undergraduate population vs price is 2.060226541325107

Similarly, when we look at the averages for correlation coefficents in the two groups, there is a fairly notable difference of 0.089.

In [132]:
mean_dependant = statistics.mean(r2dependant)
mean_independant = statistics.mean(r2independant)

print('The average r2 for the dependant is', round(mean_dependant, 2))
print('The average r2 for the independant is', round(mean_independant, 2))
print('The average difference is', round(mean_dependant - mean_independant, 3))
The average r2 for the dependant is 0.55
The average r2 for the independant is 0.46
The average difference is 0.089

T-Test

Stats models contains support for t-tests. A t-test is a robust analysis comparing two groups, of whether the means are significantly different. The tests return that the data is! The admission rate and population data had a greater effect on the areas where it should have an effect, and in the direction that was expected. We can reject the null hypothesis of no effect.

https://www.statsmodels.org/dev/generated/statsmodels.stats.weightstats.ttest_ind.html

In [140]:
t_test_results = sm.stats.ttest_ind(r2dependant, r2independant)
print(f"There is a t statistic of {round(t_test_results[0], 4)} and a p-value of {round(t_test_results[1], 4)}, with {t_test_results[2]} degrees of freedom.\n\
This is a statistically significant result.")
There is a t statistic of 2.2463 and a p-value of 0.0251, with 576.0 degrees of freedom.
This is a statistically significant result.

T-statistic of 2.24, p-value of 0.025, meaningful difference!

Exporting the Final Data for GIS display

All of the data has to be exported in the format used earlier, to display visually changes through time. The following exports the data into a GIS usable CSVs.

The first function normalizes the undergraduate population data by dividing it by the first value. Admission is normalized as a change from the first value.

In [144]:
# takes in the college data
# normalizes admission and population data by the first available data point
# returns the data in a list that pandas will easily make into a dataframe, and then a csv

def make_college_csv(collegedata):
    collegecsv = []
    for college in collegedata:
        admission = collegedata[college][0]
        population = collegedata[college][1]
        for i in range(len(years)):
            collection = [college, years[i]]
            admission_reference = next((val for val in admission if val is not None), None) # gets the first non none value
            population_reference = next((val for val in population if val is not None), None)
            if population[i] is not None:
                collection.append(population[i] / population_reference)
            else:
                collection.append(None)
            if admission[i] is not None:
                collection.append(admission[i] - admission_reference)
            else:
                collection.append(None)
            collegecsv.append(collection)
    return collegecsv


# turns the price data into a usable list for export to pandas, then GIS

def make_price_csv(price_dict, adjusted_dict, years):
    list_of_rows = []
    for zip in price_dict:
        for i in range(len(price_dict[zip])):
            list_of_rows.append([zip, years[i], price_dict[zip][i], adjusted_dict[zip][i]])
    return list_of_rows


# normalizes the adjusted prices by the first year
# returns in same data formate as taken in, with the cummulative increase substituted for the price

def make_adjusted_housing_index(housedict):
    adjusted_dict = {}
    for zip in housedict:
        adjusted_dict[zip] = [value/housedict[zip][0] for value in housedict[zip]]
    return adjusted_dict




years = range(1996, 2017)

adjusted_dict = make_adjusted_housing_index(adjusted_housing_price_dict)

GIS_college_data = make_college_csv(college_numbers_dict)
GIS_college_df = pandas.DataFrame(GIS_college_data)
GIS_college_df.to_csv('buffer_data.csv', sep=',', index=False)

GIS_housing_data = make_price_csv(adjusted_housing_price_dict, adjusted_dict, years)
GIS_housing_df = pandas.DataFrame(GIS_housing_data)
GIS_housing_df.to_csv('polygon_data.csv', sep=',', index=False)