Collegiate Facebook Data

(for CS 277, Data Mining)


Download: [Matlab Data] [Technical Report]

Summary:

This dataset contains five anonymized Facebook university networks (Caltech, Oklahoma, Princeton, UNC Chapel Hill, and Georgetown), recorded in 2005. For each network, the adjacency matrix is provided, along with information about each user such as student/faculty status, gender, major, and dorm.

For more information about the data, please consult the technical report.

Data Format:

The zip file contains one Matlab .mat file per university. Each .mat file contains two matrices: the adjacency matrix of the social network and a matrix of user attributes. Note that the mappings from attribute values back to their original textual representation (e.g. the name of the high school) are not available.
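For those working outside Matlab, the following is a minimal Python sketch for inspecting what a .mat file holds. It assumes SciPy is installed and that the files are in a format `scipy.io.loadmat` can read. The file name in the comment and the helper names `user_variables` and `inspect_mat_file` are hypothetical, and the actual variable names should be taken from the file itself or the technical report.

```python
def user_variables(mat_contents):
    """Return {name: shape} for the non-metadata entries of a loaded .mat dict."""
    return {
        name: getattr(value, "shape", None)
        for name, value in mat_contents.items()
        # loadmat adds bookkeeping keys such as __header__ and __version__
        if not name.startswith("__")
    }


def inspect_mat_file(path):
    """Load a .mat file and report the variables it holds."""
    # Imported here so that user_variables() can be used without SciPy.
    from scipy.io import loadmat
    return user_variables(loadmat(path))


# Example (hypothetical file name, adjust to the files in the zip):
# print(inspect_mat_file("Caltech.mat"))
```

Printing the variable names and shapes first is a cheap way to confirm which matrix is the adjacency matrix (square) and which holds the user attributes (one row per user).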

Potential Project Ideas:

All these ideas are intertwined, so please feel free to mix and match ideas.

Joint Modeling of Links and Attributes:

The task is to learn a model over both the friendship links (in the adjacency matrix) and the user attributes. Given the links of a user, can that user's attributes be predicted? For example, can we predict the gender or major of an individual based on knowledge of that individual's Facebook friends? Conversely, given the attributes of a user, can that user's friendship links be predicted? To evaluate the model reliably, one can perform cross-validation: hide a subset of the attributes or links, predict them from the rest, and compare the predictions against the held-out values.
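As a possible starting point, here is a minimal Python sketch of attribute prediction with cross-validation. It is a simple baseline (majority vote among a user's friends), not a joint model. The network is assumed to have been converted from the adjacency matrix into a dictionary of friend lists, and the toy network, labels, and function names are made up for illustration.

```python
import random
from collections import Counter


def predict_from_friends(user, neighbors, known_labels, default):
    """Predict a user's label as the most common known label among friends."""
    votes = Counter(known_labels[f] for f in neighbors[user] if f in known_labels)
    return votes.most_common(1)[0][0] if votes else default


def cross_validate(neighbors, labels, n_folds=5, seed=0):
    """Hide one fold of labels at a time and return the overall accuracy."""
    users = sorted(labels)
    random.Random(seed).shuffle(users)
    folds = [users[i::n_folds] for i in range(n_folds)]
    correct = 0
    for held_out in folds:
        hidden = set(held_out)
        known = {u: y for u, y in labels.items() if u not in hidden}
        # Fall back to the most common visible label when no friend has one.
        default = Counter(known.values()).most_common(1)[0][0]
        correct += sum(
            predict_from_friends(u, neighbors, known, default) == labels[u]
            for u in held_out
        )
    return correct / len(users)


# Toy network (made up): two tight groups joined by a single friendship.
neighbors = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
accuracy = cross_validate(neighbors, labels, n_folds=3)
print(f"held-out accuracy: {accuracy:.2f}")
```

Any proposed model can be dropped in place of `predict_from_friends`, which makes this baseline a useful reference point when reporting results.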

Transfer Learning:

Since the data consists of five independent Facebook networks, a natural question is whether these networks share similar social structures. If so, can we use these commonalities to improve predictions over links/attributes? The task is similar to the joint modeling task above: learn a model over the links/attributes that captures the commonalities shared by the five networks. Given complete information about four of the networks, the goal is to perform link/attribute prediction on the fifth network and determine whether knowledge of the other four networks improves prediction quality.
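One very simple way to share information across networks is sketched below in Python: link rates for user pairs with matching and non-matching attribute values are pooled over the source networks and then used to score candidate links in the target network. This is only an illustrative baseline under the assumption that each network is given as an edge list plus a user-to-value dictionary. The function names and toy data are made up, and enumerating all pairs is only practical for small networks or sampled pairs.

```python
from itertools import combinations


def pair_counts(edges, attribute):
    """Count linked and total user pairs, split by whether the attribute matches."""
    edge_set = {frozenset(e) for e in edges}
    counts = {True: [0, 0], False: [0, 0]}  # match? -> [linked pairs, all pairs]
    for u, v in combinations(sorted(attribute), 2):
        match = attribute[u] == attribute[v]
        counts[match][1] += 1
        if frozenset((u, v)) in edge_set:
            counts[match][0] += 1
    return counts


def pooled_link_rates(source_networks):
    """Pool pair counts over the source networks and return link rates."""
    total = {True: [0, 0], False: [0, 0]}
    for edges, attribute in source_networks:
        for match, (linked, pairs) in pair_counts(edges, attribute).items():
            total[match][0] += linked
            total[match][1] += pairs
    return {m: (linked / pairs if pairs else 0.0) for m, (linked, pairs) in total.items()}


def score_pair(u, v, target_attribute, rates):
    """Score a candidate link in the target network using the pooled rates."""
    return rates[target_attribute[u] == target_attribute[v]]


# Toy source networks (made up): (edge list, user -> attribute value).
net_a = ([(0, 1), (2, 3)], {0: "x", 1: "x", 2: "y", 3: "y"})
net_b = ([(0, 1), (1, 2)], {0: "x", 1: "x", 2: "y"})
rates = pooled_link_rates([net_a, net_b])

# Score two candidate links in a (made-up) target network.
target_attribute = {0: "x", 1: "y", 2: "x"}
print(score_pair(0, 2, target_attribute, rates), score_pair(0, 1, target_attribute, rates))
```

To test whether transfer helps, the same scores can be computed from rates estimated on the target network's own training links and the two sets of predictions compared on held-out links.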

Feature Selection:

An important task is determining which features are most informative for link prediction. For instance, perhaps the probability of a friendship link is as high as 99% if two people are in the same dorm and have the same major. The most informative features may be complicated functions of the attributes, and the task is to find them. One approach is exploratory data analysis, using Matlab to quickly create plots that reveal correlations within the data. Another is to search for these features in an automated way.
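As one example of an automated approach, the Python sketch below ranks attributes by the mutual information between "the two users share this attribute value" and "the two users are friends", computed over all user pairs. Mutual information is just one possible scoring criterion chosen here for illustration. The attribute names, toy data, and function names are made up, and all-pairs enumeration would need to be replaced by pair sampling on larger networks.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        c / n * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def rank_features(edges, attributes):
    """Rank attributes by how informative a shared value is about friendship."""
    edge_set = {frozenset(e) for e in edges}
    users = sorted(next(iter(attributes.values())))
    pairs = list(combinations(users, 2))
    linked = [frozenset(p) in edge_set for p in pairs]
    scores = {
        name: mutual_information([values[u] == values[v] for u, v in pairs], linked)
        for name, values in attributes.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (made up): friendships follow dorms exactly, but not gender.
edges = [(0, 1), (2, 3)]
attributes = {
    "gender": {0: "f", 1: "m", 2: "f", 3: "m"},
    "dorm": {0: 1, 1: 1, 2: 2, 3: 2},
}
ranking = rank_features(edges, attributes)
for name, score in ranking:
    print(f"{name}: {score:.3f} bits")
```

The same scoring function can be applied to combined features, such as "same dorm and same major", by building the corresponding indicator for each pair.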

Contact:

Feel free to email Arthur Asuncion (asuncion 'at' uci 'dot' edu) or consult the technical report for details about the data.