Identifying Institutional Peers Through Cluster Analysis

Jorge Martinez
University of Houston
TAIR 2020 Conference

March 2, 2020


Introduction

Strategic plan for the UH Centennial 2027: UH100 Dare to Dream
How are we performing and how can we improve?

“Our University is too important on so many levels to be left without a strong vision for the next eight years. We are light years ahead of where we were eight years ago, and in another eight years, we will be equally advanced from where we are today.”
- Provost Paula Myrick Short

Why is benchmarking important?

Benchmarking is the process by which policymakers compare the performance, practices, and policies of institutions or groups of institutions to gain insight (Betsinger et al. 2013).

“An institution that does not routinely evaluate all aspects of the organization and make the changes necessary to address its shortcomings, from the curriculum to the physical plant, is jeopardizing its future.”

Barbara Bender
Rutgers University

Why is benchmarking important?

Answers questions like “What makes an institution highly ranked?”

“Through analyzing the best practices of peer institutions, then adapting and developing programs for their own campuses, higher education leaders can improve the quality of programs and services that they provide” (Bender 2002: p. 119).

Why is benchmarking important?

A tool to overcome resistance to change

“Benchmarking can help overcome resistance to change that can be very strong in conservative organizations, such as colleges and universities, that have had little operational change in many years” (Alstete 1995: p. 25).

“Using the appropriate institutions and programs as models, especially schools with higher reputational ratings, can generate an enthusiasm for change that can transform an institution” (Bender 2002: p. 116).

Benchmarking as an evidence-based tool

to encourage policymakers to learn to think like scientists, and

to learn how to solve problems primarily with reference to the evidence generated by professional, scientific, and technical methods of inquiry (Witting 2017: p. 2)

How to identify peers

Texas accountability peers (Emerging research)
Nationally recognized peers (USNWR)
Public/Private
Financial peers (faculty salary, tuition)

Types of peers

Comparable: Similar institutional level (two or four-year), control (public, private) and enrollment characteristics (size, race)
Aspirational: Similar yet significantly different in several key performance indicators (graduation rate, research expenditures)
Competitors: Similar yet more students choose to attend a competing institution
Consortium: Belonging to a specific organization (AAU, Power 5, College Completion Summit)

Data

2018 Integrated Postsecondary Education Data System (IPEDS)
Filter variables
- Sector: Public, 4-year or above
- Carnegie Classification 2018 Basic: Doctoral Universities - Very High Research Activity
- Size Category: 20,000 and above
Final sample: 85 institutions (rows = observations, columns = variables)
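
The three filters above amount to a row subset. The following is a minimal sketch, assuming a data frame with hypothetical columns `sector`, `carnegie_basic`, and `size_category`; actual IPEDS extracts use their own variable names and codes, and the rows here are made up.

```r
# Toy stand-in for an IPEDS extract; column names, labels, and rows are hypothetical
ipeds <- data.frame(
  institution    = c("Univ A", "Univ B", "Univ C", "Univ D"),
  sector         = c("Public, 4-year or above", "Private nonprofit, 4-year or above",
                     "Public, 4-year or above", "Public, 4-year or above"),
  carnegie_basic = c("Doctoral Universities: Very High Research Activity",
                     "Doctoral Universities: Very High Research Activity",
                     "Doctoral Universities: High Research Activity",
                     "Doctoral Universities: Very High Research Activity"),
  size_category  = c("20,000 and above", "20,000 and above",
                     "20,000 and above", "10,000 - 19,999"),
  stringsAsFactors = FALSE
)

# Keep only rows that pass all three filters
peer_sample <- subset(
  ipeds,
  sector == "Public, 4-year or above" &
    carnegie_basic == "Doctoral Universities: Very High Research Activity" &
    size_category == "20,000 and above"
)

peer_sample$institution
```

In this toy example only "Univ A" passes all three filters.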

Data

Cluster Variables

Percent admitted
Admissions yield
SAT/ACT 25th & 75th percentiles
Undergraduate & graduate enrollment
Percent Race/Ethnicity
Percent Women
FTE for last academic year, 2017-18
Percent full-time, first-time awarded Pell
Student-to-Faculty Ratio

All academic rank faculty
Six-year graduation rate
Total degrees awarded
Core revenues and expenses
Instruction, research, and student service expenses as percent of core
Endowment assets
In-state, out-of-state tuition
Total price of attendance in-state on campus

Cluster analysis

Data Preparation
Calculate difference/distance between institutions by their characteristics
Cluster so that difference is minimized within clusters and maximized between clusters
No response variable = unsupervised method
Tool: R

Step 1: Prepare Data

1A - Remove or estimate missing data

Handle missing data by removing or estimating them
9 of the 85 institutions contain at least one missing value (11%)
Of the 2,890 data points, 50 data points were imputed (1.7%)
Use knnImputation from the DMwR package to identify the k closest observations based on Euclidean distance and compute weighted averages
To remove incomplete rows instead: peer_data <- na.omit(peer_data)
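
The counts on this slide (institutions with a missing value, share of data points affected) can be computed directly. The sketch below uses a small made-up data frame, not the IPEDS sample, and only shows the knnImputation call as a comment because it assumes the DMwR package is installed.

```r
# Toy data with missing values; names and numbers are illustrative only
toy <- data.frame(
  pct_admitted  = c(62, 48, NA, 71, 55),
  six_year_grad = c(58, 81, 66, NA, 73),
  stu_fac_ratio = c(21, 17, 19, 23, 18)
)

# Rows (institutions) with at least one missing value
rows_missing <- sum(!complete.cases(toy))

# Share of all cells that are missing, in percent
pct_cells_missing <- 100 * sum(is.na(toy)) / (nrow(toy) * ncol(toy))

# Option 1: drop incomplete rows
complete_only <- na.omit(toy)

# Option 2 (not run): k-nearest-neighbour imputation, if DMwR is installed
# imputed <- DMwR::knnImputation(toy, k = 3)

rows_missing
round(pct_cells_missing, 1)
nrow(complete_only)
```

Dropping rows is simpler but shrinks the sample. Imputation keeps all institutions at the cost of estimated values.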

Step 1: Prepare Data

1B - Standardize data

Remove influence of different scales: $, %, N
Transform data so that each variable has mean $\mu = 0$ and standard deviation $\sigma = 1$
Use scaled <- scale(peer_data)
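
As a quick check of what scale() does, the sketch below standardizes a data set by hand and compares the result. It uses the built-in USArrests data as a stand-in, since peer_data is not available here.

```r
# USArrests is a built-in numeric data set used here as a stand-in for peer_data
x <- USArrests

# Standardize by hand: subtract each column mean, then divide by each column sd
manual <- sweep(sweep(as.matrix(x), 2, colMeans(x)), 2, apply(x, 2, sd), "/")

# scale() does the same centering and scaling in one call
scaled <- scale(x)

round(colMeans(scaled), 10)   # every column mean is 0
apply(scaled, 2, sd)          # every column sd is 1
all.equal(manual, scaled, check.attributes = FALSE)
```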

Step 1: Prepare Data

1B - Standardize data

head(peer_data)

## # A tibble: 6 x 45
##   tuition_in_state tuition_out_sta… pct_native pct_black pct_hispanic pct_white
##              <dbl>            <dbl>      <dbl>     <dbl>        <dbl>     <dbl>
## 1             8568            19704          0        22            3        59
## 2            10780            29230          0        11            5        76
## 3             9624            28872          0         6            3        76
## 4            10104            27618          1         3           19        48
## 5            10467            31688          1         4           24        49
## 6             7384            23422          1         4            8        74
## # … with 39 more variables: pct_two_more <dbl>, pct_nonresident <dbl>,
## #   pct_asian_pi <dbl>, pct_women <dbl>, stu_fac_ratio <dbl>,
## #   pct_admitted <dbl>, yield <dbl>, price_instate_oncampus <dbl>,
## #   enrollment_undg <dbl>, enrollment_grad <dbl>, degree_ba <dbl>,
## #   degree_ma <dbl>, degree_phd_research <dbl>, degree_phd_prof <dbl>,
## #   degree_phd_other <dbl>, six_year_grad <dbl>, pct_pell <dbl>,
## #   revenue_core <dbl>, revenue_pct_tuition <dbl>, expense_core <dbl>,
## #   expense_instructional <dbl>, expense_research <dbl>,
## #   expense_student_service <dbl>, endowment <dbl>, sat_read25 <dbl>,
## #   sat_read75 <dbl>, sat_math25 <dbl>, sat_math75 <dbl>, act_comp25 <dbl>,
## #   act_comp75 <dbl>, fte1718 <dbl>, faculty_total <dbl>,
## #   degree_phd_total <dbl>, degree_total <dbl>, cluster2 <dbl>, cluster5 <dbl>,
## #   us_rank2020 <dbl>, aau_member <dbl>, aau_year <dbl>

Step 1: Prepare Data

1B - Standardize data

scaled <- scale(peer_data)
head(as_tibble(scaled))

## # A tibble: 6 x 45
##   tuition_in_state tuition_out_sta… pct_native pct_black pct_hispanic pct_white
##              <dbl>            <dbl>      <dbl>     <dbl>        <dbl>     <dbl>
## 1          -0.398           -1.20       -0.280     2.85        -0.809     0.291
## 2           0.339            0.0767     -0.280     0.803       -0.651     1.26 
## 3          -0.0461           0.0288     -0.280    -0.127       -0.809     1.26 
## 4           0.114           -0.139       0.910    -0.685        0.459    -0.334
## 5           0.235            0.406       0.910    -0.499        0.855    -0.277
## 6          -0.793           -0.701       0.910    -0.499       -0.413     1.14 
## # … with 39 more variables: pct_two_more <dbl>, pct_nonresident <dbl>,
## #   pct_asian_pi <dbl>, pct_women <dbl>, stu_fac_ratio <dbl>,
## #   pct_admitted <dbl>, yield <dbl>, price_instate_oncampus <dbl>,
## #   enrollment_undg <dbl>, enrollment_grad <dbl>, degree_ba <dbl>,
## #   degree_ma <dbl>, degree_phd_research <dbl>, degree_phd_prof <dbl>,
## #   degree_phd_other <dbl>, six_year_grad <dbl>, pct_pell <dbl>,
## #   revenue_core <dbl>, revenue_pct_tuition <dbl>, expense_core <dbl>,
## #   expense_instructional <dbl>, expense_research <dbl>,
## #   expense_student_service <dbl>, endowment <dbl>, sat_read25 <dbl>,
## #   sat_read75 <dbl>, sat_math25 <dbl>, sat_math75 <dbl>, act_comp25 <dbl>,
## #   act_comp75 <dbl>, fte1718 <dbl>, faculty_total <dbl>,
## #   degree_phd_total <dbl>, degree_total <dbl>, cluster2 <dbl>, cluster5 <dbl>,
## #   us_rank2020 <dbl>, aau_member <dbl>, aau_year <dbl>

Step 2: Calculate distance

Select a distance measure

Many measures: Euclidean, Manhattan, Pearson, Spearman, or Kendall correlation distances
Euclidean distance (n dimensions): $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

Step 2: Calculate distance

Euclidean distance in 1, 2 (n=2), and 3 (n=3) dimensions

Step 2: Calculate distance

Euclidean distance n-dimensions
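
The progression from one to n dimensions uses the same calculation throughout. The sketch below defines an illustrative helper, `euclid` (not part of the original slides), and checks it against base R's dist() on the built-in USArrests data, used here as a stand-in.

```r
# Euclidean distance between two numeric vectors of equal length
euclid <- function(x, y) sqrt(sum((x - y)^2))

euclid(3, 7)                      # 1 dimension: |3 - 7| = 4
euclid(c(0, 0), c(3, 4))          # 2 dimensions: 5
euclid(c(1, 2, 3), c(1, 2, 5))    # 3 dimensions: 2

# n dimensions: compare two rows of a standardized data set with dist()
scaled <- scale(USArrests)        # built-in stand-in for the institutional data
by_hand <- euclid(scaled[1, ], scaled[2, ])
by_dist <- as.matrix(dist(scaled, method = "euclidean"))[1, 2]
all.equal(by_hand, by_dist)
```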

Step 2: Calculate distance

Euclidean distance matrix

library(factoextra)
distance <- get_dist(sample25) # computes distance matrix (Euclidean by default)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07")) # heatmap of pairwise distances

(Figure: heatmap of the Euclidean distance matrix)

Step 3: Cluster Institutions

K-Means Clustering

Partition data set into k-clusters
High intra-cluster similarity and low inter-cluster similarity
Each cluster is represented by a centroid (the mean of the points in the cluster)
Hartigan-Wong algorithm (1979):
$W(C_k) = \sum_{x_i \in C_k} (x_i - \mu_k)^2$
where:
- $x_i$ is an institution belonging to cluster $C_k$
- $\mu_k$ is the mean value of the points in cluster $C_k$
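
The within-cluster sum of squares can be reproduced by hand from a fitted model. The sketch below uses the built-in USArrests data as a stand-in and an arbitrary choice of three clusters. It compares the manual calculation with the `withinss` component returned by kmeans().

```r
set.seed(123)
scaled <- scale(USArrests)              # built-in stand-in for the institutional data
fit <- kmeans(scaled, centers = 3, nstart = 25)

# W(C_k) by hand: squared deviations of each member from its cluster mean, summed
w_by_hand <- sapply(seq_len(3), function(k) {
  members <- scaled[fit$cluster == k, , drop = FALSE]
  mu_k <- colMeans(members)
  sum(sweep(members, 2, mu_k)^2)
})

w_by_hand
fit$withinss                            # kmeans() reports the same quantities
all.equal(w_by_hand, fit$withinss)
```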

Step 3: Cluster Institutions

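K-Means Clustering: Total within-cluster sum of squares

$\text{tot.withinss} = \sum_{k=1}^{K} W(C_k) = \sum_{k=1}^{K} \sum_{x_i \in C_k} (x_i - \mu_k)^2$ (Eq. 3)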

Step 3: Cluster Institutions

K-Means Clustering

Specify number of clusters k
Randomly select k institutions from data to initialize centers
Assign each institution to its closest centroid
Update each cluster centroid by calculating the new mean of the data in the cluster
Iterate assignment and update steps until cluster assignments stop changing (or the total within-cluster sum of squares reaches its minimum, Eq. 3)
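
The steps above can be sketched in a few lines of base R. Note that this is the simple assign-and-update loop (often called Lloyd's algorithm), not the Hartigan-Wong variant that kmeans() uses by default. The function name `simple_kmeans` and the USArrests stand-in data are illustrative assumptions.

```r
# Minimal assign-and-update k-means on a numeric matrix (illustrative only)
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Pick k distinct rows as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every row to every center
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d2, ties.method = "first")
    # Stop once assignments no longer change
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Move each center to the mean of its assigned rows
    for (j in seq_len(k)) {
      if (any(cluster == j)) centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
    }
  }
  list(cluster = cluster, centers = centers, iterations = iter)
}

set.seed(123)
toy <- scale(USArrests)          # built-in stand-in for the institutional data
fit <- simple_kmeans(toy, k = 3)
table(fit$cluster)
fit$iterations
```

Because the starting centers are random, different seeds can give different solutions, which is why kmeans() is usually called with several random starts (nstart).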

Step 3: Cluster Institutions

K-Means Clustering: k=3 (solution reached after 4 iterations)

(Figures: successive k-means iterations for k=3)

K-means clustering

k2 <- kmeans(scaled, centers=2, nstart = 25) # performs clustering on matrix
fviz_cluster(k2, geom="point", data=scaled)

(Figure: cluster plot of the k=2 solution)

Selecting optimal clusters

1. Elbow Method

Define clusters such that total intra-cluster variation is minimized: $\min \sum_{k=1}^{K} W(C_k)$
Total within-cluster sum of squares (wss) measures compactness of clustering

Selecting optimal clusters

1. Elbow Method

Steps:
- Run clustering algorithm for different values of k
- For each k, calculate total within-cluster sum of squares (wss)
- Plot wss against k
- The bend (elbow) in the plot indicates the optimal number of clusters
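
These steps can also be carried out without a helper package. The sketch below uses the built-in USArrests data as a stand-in and an arbitrary range of k from 1 to 10.

```r
set.seed(123)
scaled <- scale(USArrests)          # built-in stand-in for the institutional data

# Total within-cluster sum of squares for k = 1..10
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(scaled, centers = k, nstart = 25)$tot.withinss
})

# Plot wss against k and look for the bend
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

With k=1 the wss equals the total sum of squares of the standardized data, so the curve always starts at its maximum and falls as k grows.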

Selecting optimal clusters

1. Elbow Method

set.seed(123)
fviz_nbclust(scaled, kmeans, method="wss")

Selecting optimal clusters

2. Average Silhouette Method

Measures quality of clustering
How close each point in one cluster is to points in the neighboring clusters
How far away is each cluster
Range of -1 to 1 where:
- +1: far away from neighboring clusters
- 0: on the boundary between two clusters
- -1: likely assigned to the wrong cluster
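
Silhouette widths can be inspected directly with silhouette() from the cluster package. The sketch below uses the built-in USArrests data as a stand-in and an arbitrary k=2 solution.

```r
library(cluster)                     # silhouette() ships with R's recommended packages

set.seed(123)
scaled <- scale(USArrests)           # built-in stand-in for the institutional data
fit <- kmeans(scaled, centers = 2, nstart = 25)

# One silhouette width per observation: s(i) = (b - a) / max(a, b), where
# a = mean distance to own cluster, b = mean distance to the nearest other cluster
sil <- silhouette(fit$cluster, dist(scaled))

avg_width <- mean(sil[, "sil_width"])
avg_width                            # closer to 1 means better-separated clusters
range(sil[, "sil_width"])
```

The average silhouette method repeats this for several values of k and prefers the k with the largest average width.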

Selecting optimal clusters

2. Average Silhouette Method

fviz_nbclust(scaled, kmeans, method="silhouette")

fviz_cluster(k2, geom="point", data=scaled)

Selected k=5 Clusters

Mixed support from Elbow Method (k=5 or 6)
Second best option by Silhouette Method (k=2 is best)
k=5 shows clear separation of UH cluster (2) from neighbors
WSS for UH cluster smallest in k=5 clustering
- k=2 (1,349.7)
- k=3 (565.3)
- k=4 (468.7)
- k=5 (276.1)
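
A comparison like the one above can be produced by tracking the cluster that contains a focal institution across several values of k. The sketch below uses the built-in USArrests data, with the row "Texas" as a stand-in for the focal institution (UH on the slides).

```r
set.seed(123)
scaled <- scale(USArrests)           # built-in stand-in for the institutional data
focal <- "Texas"                     # stand-in for the focal institution

# For each k, find the focal row's cluster and report that cluster's WSS and size
focal_wss <- t(sapply(2:5, function(k) {
  fit <- kmeans(scaled, centers = k, nstart = 25)
  cl <- fit$cluster[[focal]]
  c(k = k, wss = fit$withinss[cl], size = fit$size[cl])
}))

focal_wss
```

Because a cluster's WSS tends to fall as k grows and clusters shrink, it may help to read it alongside the cluster size.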

Cluster 2 Summary

Aspirational Peers

Similar yet significantly different in several key performance indicators

Similar demographic and input characteristics
- Percent admitted
- Admissions yield
- SAT Reading and Math 25th percentile scores
- Undergraduate enrollment
- Race/ethnicity
- Percent women
- Percent awarded Pell
- FTE 2017-18
- All academic rank faculty
- In-state and out-of-state tuition

Exceptional output
- Six-year graduation rate
- Endowment
- % Expenses as Research
- Graduate enrollment
Which cluster institutions out-perform UH (+1 s.d.)?
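
One way to answer this question is to count, for each institution in the cluster, the output metrics on which it exceeds the focal institution by at least one standard deviation. That reading of "+1 s.d." is an assumption, and the institution names and values below are made up for illustration.

```r
# Hypothetical standardized output metrics for institutions in one cluster
cluster_members <- data.frame(
  institution      = c("Focal U", "Peer A", "Peer B", "Peer C"),
  six_year_grad    = c(-0.2,  1.1,  0.3,  0.9),
  endowment        = c(-0.5,  0.8, -0.4,  0.1),
  expense_research = c(-0.1,  1.3,  0.2,  1.0),
  enrollment_grad  = c( 0.0,  0.4,  1.5, -0.3)
)
outputs <- c("six_year_grad", "endowment", "expense_research", "enrollment_grad")

focal_row <- cluster_members[cluster_members$institution == "Focal U", outputs]
others <- cluster_members[cluster_members$institution != "Focal U", ]

# Count the metrics on which each institution exceeds the focal value by >= 1 s.d.
# (the data are standardized, so 1 unit = 1 standard deviation)
gap <- sweep(as.matrix(others[, outputs]), 2, unlist(focal_row))
others$n_outperform <- rowSums(gap >= 1)

others[order(-others$n_outperform), c("institution", "n_outperform")]
```

In this toy example "Peer A" exceeds the focal institution on three of the four metrics, "Peer C" on two, and "Peer B" on one.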

Aspirational Peers

University of Houston Aspirational Peers

Conclusion

Why benchmarking is important
Identified and transformed data set for analysis
Unbiased, replicable way of identifying institutional peers using K-means clustering
Comparable Peers
- Florida International University
- Georgia State University
- University of Central Florida
- University of Illinois at Chicago
- University of Nevada - Las Vegas
- University of North Texas
- University of South Florida
- University of Texas at Arlington
- University of Texas at El Paso

Aspirational Peers
- University of California - Davis
- University of California - Irvine
- University of California - Riverside
- University of California - San Diego
- University of California - Santa Barbara
- University of Texas at Dallas
- Stony Brook University

Website: jxmartinez.com

Resources

Alstete, Jeffrey W. 1995. “Benchmarking in Higher Education: Adapting Best Practices to Improve Quality.” ASHE-ERIC Higher Education Report No. 5. Washington D.C.: The George Washington University Graduate School of Education and Human Development.

Luna, Andrew. 2018. “Selecting Peer Institutions Using Cluster Analysis.” Austin Peay State University.

Bender, Barbara E. 2002. “Chapter 8: Benchmarking as an Administrative Tool for Institutional Leaders.” New Directions for Higher Education 118: 113-120.

Betsinger, Alicia et al. 2013. “Peer Selection: Methodology and Models.” Texas Association for Institutional Research 35th Annual Conference. February, 2013.

Boehmke, Bradley. 2017. “UC Business Analytics R Programming Guide: k-Means Cluster Analysis.” University of Cincinnati.

Lang, Daniel W. and Qiang Zha. 2004. “Comparing Universities: A Case Study between Canada and China.” Higher Education Policy 17(4).

Shueler, Brian. 2016. “University of Wyoming Peer Institutions.” University of Wyoming.

Witting, Antje. 2017. “Insights from ‘Policy Learning’ on How to Enhance the Use of Evidence by Policymakers.” Palgrave Communications 3(49): 1-9.