Research Opportunities in Wayne Hayes's Group

Reviews

I've supervised and mentored dozens of young researchers (grads, undergrads, high school students) over the years. It's fun and productive for all --- you, me, other students, and other scientific collaborators. It's hard work, but rewarding. Here's what some have said about working in my research group:

Selection Criteria

To work in my group, you need to demonstrate yourself to be significantly above average among your peers: you need to undertake one of the following challenges and submit the results to me. If they're not correct... no problem, I give second chances. You're also allowed to ask questions, but try to keep them to a minimum. The primary criterion by which you will be judged is how well you can perform, and then write up, these tasks independently, without much help. Note: If you already have any undergraduate degree, then you need to do the "extra" work required in each of the challenges. This applies to potential grad students or STEM teachers applying to work with me, for example.

In all cases you need to submit a PDF write-up, with histograms or figures plotted. I don't need your code. I want to see your write-up including a description of you did, and why, and your results with commentary. THIS IS JUST AS MUCH A TEST OF YOUR COMMUNICATION SKILLS AS CODING. Doing research requires critical thinking and the ability to explain your rationale for what you did AND WHY. Without that your code or blind results are worthless.

I have several projects running. Three of them are described in the PowerPoint presentation here. You only need to do ONE of the below challenges, depending upon which project you're interested in working on. You need to do the task, and then write it up nicely, with graphs or plots to illustrate your answer. You should be able to do the task within about a week at most, but the answer has to be GOOD. If you hand in a GOOD solution later, that's better than a crappy solution earlier. In other words, a good solution is required, but faster is better than slower.

Let me know if there are any ambiguities, but your job is to do this task with as little supervision from me as is possible.

Please direct all questions to me at whayes@uci.edu.

The Projects

First, even if you are not inerested in the scientific aspects of any of these projects, many of them can benefit with parallelism, and it's usually nontrivial to implement. If you are interestid in message-passing parallelism using MPI or other methods of parallelism, consider joining anyway. (You'll still need to choose a project below and pass the challenge.)

(0) Physics-based tracking of moving and growing cells in a colony: YouTube video

The left side shows frames from a real video of a growing bacterial colony; the right frame shows our algorithm tracking the growth and motion of each individual bacterium during its whole life cycle from being born, moving, growing, to splitting into two daughter cells. Biologists need to track cells in video frames for many purposes, including tracking the growth of cancer cells, learning about the growth of embryos, learning how bacteria move, learning how genetic changes to a cell result in functional changes during it's lifetime... it's a huge research area. Although there already exist several cell tracking algorithms out there, we are working on a novel approach that seems to have several advantages. In order to join this project, your task is to take the above animated GIF, and automatically estimate the number of bacteria in each of the frames, and produce a text file whose only output is one integer per line, representing the count, and the number of lines should equal the number of frames. You only need to use one of the two sides. You can use any language you want, and any method you want, as long as it's automatic, and you wrote it yourself (cite any references you use). Describe your algorithm and the output, and send your PDF write-up to me by email. Extra work for those who already have an undergrad degree: You must create two algorithms, one that can handle each side of the above image. Compare the results and explain any differences.

Alternatively, if you prefer "library" type (reading and writing) research, use Google Scholar to look up and describe everything you can find out about cell tracking that has been published in the last 2 calendar years. (For example, if it's now 2018, then click "since 2017" on the left-hand side toolbar in Google Scholar and see what you can find.) There is no set number of words or pages, but you should describe all relevant work that has happened in cell tracking in the designated time period, devoting at least one paragraph to each paper you find.

(1) Graph theory applied to network database engines and biological network alignment: YouTube Video

Network modeling and network database queries

We are working on a database engine to allow complex network-based queries. This basically means we are trying to solve the subgraph isomorphism problem in such a way as to allow lightning-fast graph-based database queries of a large network corpus. Additionally, we are working on modeling networks theoretically so as to be able to create synthetic databases of graphs that "model" real graphs.

Network Alignment

Network alignment is the task of finding large subgraphs, or near-subgraphs, that are common between two or more large networks. The differerence between this and network modeling and database queries is that in queries, we are generally looking for very small (10-20 nodes at most) exact or near-exact matches inside a large database of networks; in alignment, we are given a small number of input graphs (maybe half a dozen), and we're looking for much larger (hundreds or thousands of nodes) of near-matches between them.

1) If you want to do either the graph database and network modeling, or the biological network alignment project, you need to know what a graph is and how to work with them, especially how to code with them. Your task is the following: you're given a text file representing a network. The first line of the file is N, the number of nodes. You will name the nodes from 0 through N-1. The remaining lines will have two integers per line, representing an edge. You don't know in advance how many edges there are, you just keep reading until you reach end of file. Your task is to compute and output two different types of connected componets: strong and weak. In other words, you are to compute the number of WEAKLY CONNECTED COMPONENTS in the graph, treating it as undirected (and you can ignore duplicate edges); and then a second number which is the number of STRONGLY CONNECTED COMPONENTS, and output both integers. Below I provide some sample inputs. I don't care what language you use. In addition, in your write up, include a histogram of the distribution of DEGREES of nodes; in the undirected case, just count all degrees; in the strongly connected (directed) case, provide histograms for both the in-degree and out-degree. (That means you'll have three histograms per graph.) The histogram shows how many nodes have degree zero, degree 1, etc., up to the max degree. If you don't know any of these terms, look them up, don't ask me. The data for this project is here. Analyze all the graphs in the zip file. Extra work for those who already have a degree: Read the GRAAL paper and count the number of graphlets of size 3 in all the networks.

Alternatively, if you prefer "library" type (reading and writing) research, use Google Scholar to look up and describe everything you can find out about biological network alignment that has been published in the last 2 calendar years. (For example, if it's now 2018, then click "since 2017" on the left-hand side toolbar in Google Scholar and see what you can find.) There is no set number of words or pages, but you should describe all relevant work that has happened in biological network alignment in the designated time period, devoting at least one paragraph to each paper you find.

(2) Image analysis of spiral galaxies: YouTube Video

2) If you want to work in the Galaxy Image Analysis project, then you should start by playing around with any galaxy images you find on the web and putting them into the SpArcFiRe webpage. Once you get the hang of it, you have two choices:

(A) find an image of NGC5054, or take the one from my paper with Darren Davis (cited on the above web page), and try to find a set of SpArcFiRe options on the website that can find the "dim" arm on the right hand side of the image of that galaxy in the above paper.

(B) Go get the following file: here Each row is some data about a galaxy, and the columns have names in the top row. You don't need to know what all of the columns mean, but pay attention to these ones: P_CS: the probability that this galaxy is a spiral. numDcoArcsGEXXX for various values of XXX: the number of arms in that spiral galaxy that are longer than XXX. Your task is to plot a histogram of the number of galaxies with N or more arms of length XXX, for each of the XXX values in the file. It would be best to plot all the histograms on one figure to be easily able to compare them to each other.

Alternatively, if you prefer "library" type (reading and writing) research, use Google Scholar to look up and describe everything you can find out about automated galaxy classification that has been published in the last 2 calendar years. (For example, if it's now 2018, then click "since 2017" on the left-hand side toolbar in Google Scholar and see what you can find.) There is no set number of words or pages, but you should describe all relevant work that has happened in automated galaxy classification in the designated time period, devoting at least one paragraph to each paper you find.

Extra work for those who already have a degree: Tell me about your astronomy and/or physics background.

(3)Theoretical Machine Learning

See here for the challenge.