To work in my group, you need to demonstrate yourself to be significantly above average among your peers: you need to undertake one of the following challenges and submit the results to me. If they're not correct... no problem, I give second chances. You're also allowed to ask questions, but try to keep them to a minimum. The primary criterion by which you will be judged is how well you can perform, and then write up, these tasks independently, without much help. Note: If you already have any undergraduate degree, then you need to do the "extra" work required in each of the challenges. This applies to potential grad students or STEM teachers applying to work with me, for example.
In all cases you need to submit a PDF write-up, with histograms or figures plotted. I don't need your code. I want to see your write-up including a description of what you did, and why, and your results with commentary. THIS IS JUST AS MUCH A TEST OF YOUR COMMUNICATION SKILLS AS CODING. Doing research requires critical thinking and the ability to explain your rationale for what you did AND WHY. Without that, your code or blind results are worthless.
I have several projects running. Three of them are described in the PowerPoint presentation here. You only need to do ONE of the below challenges, depending upon which project you're interested in working on. You need to do the task, and then write it up nicely, with graphs or plots to illustrate your answer. You should be able to do the task within about a week at most, but the answer has to be GOOD. If you hand in a GOOD solution later, that's better than a crappy solution earlier. In other words, a good solution is required, but faster is better than slower.
Let me know if there are any ambiguities, but your job is to do this task with as little supervision from me as is possible.
Please direct all questions to me at email@example.com.
The left side shows frames from a real video of a growing bacterial colony; the right side shows our algorithm tracking the growth and motion of each individual bacterium during its whole life cycle, from being born, moving, and growing to splitting into two daughter cells. Biologists need to track cells in video frames for many purposes, including tracking the growth of cancer cells, learning about the growth of embryos, learning how bacteria move, and learning how genetic changes to a cell result in functional changes during its lifetime... it's a huge research area. Although several cell tracking algorithms already exist, we are working on a novel approach that seems to have several advantages. In order to join this project, your task is to take the above animated GIF, automatically estimate the number of bacteria in each of the frames, and produce a text file containing nothing but one integer per line, the count for that frame, so that the number of lines equals the number of frames. You only need to use one of the two sides; I'd recommend the right side (red lines on an otherwise black and grey image are easy to isolate). You can use any language you want, and any method you want, as long as it's automatic. Describe your algorithm and the output, and send your PDF write-up to me by email. Extra work for those who already have an undergrad degree: You must create two algorithms, one for each side of the above image. Compare the results and explain any differences.
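One possible starting point for the right side, offered only as a rough sketch and not as the expected solution: treat each connected blob of red pixels as one bacterium and count the blobs with a flood fill. Everything here is an assumption made for illustration: the function names `is_red` and `count_red_blobs`, the colour thresholds, the use of 8-connectivity, and above all the idea that bacteria never touch each other in the tracked image. Decoding the GIF into frames (for example with an imaging library) is left out; a frame is taken to be a list of rows of (r, g, b) tuples.

```python
from collections import deque


def is_red(pixel, min_red=150, max_other=100):
    """Hypothetical threshold: strong red channel, weak green and blue."""
    r, g, b = pixel[:3]
    return r >= min_red and g <= max_other and b <= max_other


def count_red_blobs(frame):
    """Count 8-connected groups of red pixels in one frame.

    frame is a list of rows, each row a list of (r, g, b) tuples.
    Assumption: one red blob corresponds to one bacterium.
    """
    height = len(frame)
    width = len(frame[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if seen[y][x] or not is_red(frame[y][x]):
                continue
            # Found an unvisited red pixel: start a new blob and flood it.
            count += 1
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and not seen[ny][nx]
                                and is_red(frame[ny][nx])):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


# Tiny synthetic frame: K = black, G = grey, R = red.
K, G, R = (0, 0, 0), (128, 128, 128), (255, 0, 0)
frame = [
    [R, R, K, K, K],
    [K, K, K, G, R],
    [K, K, K, K, R],
    [R, K, K, K, K],
]
print(count_red_blobs(frame))  # 3
```

Writing one such count per frame, one per line, would give the required text file. Touching or overlapping bacteria would be merged into a single blob by this approach, which is exactly the kind of limitation the write-up should discuss.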
1) If you want to do the biological network alignment project, you need to know what graphs are and how to work with them, especially how to code with them. Your task is the following: you're given a text file representing a network. The first line of the file is N, the number of nodes. You will name the nodes from 0 through N-1. The remaining lines will have two integers per line, representing an edge. You don't know in advance how many edges there are; you just keep reading until you reach end of file. When you are done, you are to compute the number of CONNECTED COMPONENTS in the graph, and output a single integer. Below I provide some sample inputs. I don't care what language you use. In addition, in your write-up, include a histogram of the distribution of DEGREES of nodes. That is, how many nodes have degree zero, degree 1, etc., up to the max degree. If you don't know any of these terms, look them up. The data for this project is here. Analyze all the graphs in the zip file. The graphs are undirected, which means each edge can only exist in your graph once, even if it is listed multiple times (or with the node endpoints reversed) in the input file. Extra work for those who already have a degree: Treat the graphs as directed, and count the strongly and weakly connected components. In addition, read the GRAAL paper and count the graphlets of size 2 and 3 in all the networks.
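To make the input format and the two requested quantities concrete, here is a minimal sketch of the undirected case, assuming a union-find (disjoint-set) structure for the components. The function name `analyze_graph` is made up for illustration, and skipping self-loops is an assumption the task description does not settle. It is one of several reasonable approaches (a BFS or DFS over adjacency lists would work equally well).

```python
from collections import Counter


def analyze_graph(text):
    """Return (connected component count, {degree: number of nodes})."""
    lines = [line for line in text.splitlines() if line.strip()]
    n = int(lines[0])          # first line: number of nodes, named 0..N-1
    parent = list(range(n))    # union-find forest, each node its own root

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Deduplicate edges: "1 0" and "0 1" are the same undirected edge.
    edges = set()
    for line in lines[1:]:
        u, v = (int(tok) for tok in line.split()[:2])
        if u == v:
            continue  # assumption: self-loops are ignored
        edges.add((min(u, v), max(u, v)))

    degree = [0] * n
    components = n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv     # merge two components into one
            components -= 1

    histogram = Counter(degree)
    return components, dict(sorted(histogram.items()))


# Made-up sample: 5 nodes, the edge 0-1 listed three times in total.
sample = "5\n0 1\n1 0\n1 2\n0 1\n"
print(analyze_graph(sample))  # (3, {0: 2, 1: 2, 2: 1})
```

In the sample, nodes 0, 1 and 2 form one component and nodes 3 and 4 are isolated, giving three components; the histogram says two nodes have degree 0, two have degree 1, and one has degree 2.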
2) If you want to work in the Galaxy Image Analysis project, then you should start by playing around with any galaxy images you find on the web and putting them into the SpArcFiRe webpage. Once you get the hang of it, you have two choices:
(A) Find an image of NGC5054, or take the one from my paper with Darren Davis (cited on the above web page), and try to find a set of SpArcFiRe parameters that can find the "dim" arm on the right-hand side of the image of that galaxy in the above paper.
(B) Go get the following file: here. Each row is some data about a galaxy, and the columns have names in the top row. You don't need to know what all of the columns mean, but pay attention to these ones: P_CS: the probability that this galaxy is a spiral. numDcoArcsGEXXX for various values of XXX: the number of arms in that spiral galaxy that are longer than XXX. Your task is to plot a histogram of the number of galaxies with N or more arms longer than XXX, for each of the XXX values in the file. It would be best to plot all the histograms on one figure so they are easy to compare with each other. Extra work for those who already have a degree: Tell me about your astronomy and/or physics background.
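The "N or more arms" counting can be sketched as follows, before any plotting. This is only an illustration under stated assumptions: the file is taken to be comma-delimited text, the function name `cumulative_arm_counts` is hypothetical, and the column suffixes `000` and `100` in the sample are invented (the real XXX values come from the file's header row).

```python
import csv
import io

PREFIX = "numDcoArcsGE"  # column-name prefix given in the task description


def cumulative_arm_counts(table_text, delimiter=","):
    """For each numDcoArcsGEXXX column, return a list whose entry N is
    the number of galaxies with N or more arms in that column."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter=delimiter)
    rows = list(reader)
    columns = [c for c in (reader.fieldnames or []) if c.startswith(PREFIX)]
    result = {}
    for name in columns:
        # Skip empty cells; tolerate values written as "2" or "2.0".
        values = [int(float(row[name])) for row in rows
                  if row[name] not in ("", None)]
        top = max(values, default=0)
        result[name] = [sum(1 for v in values if v >= n)
                        for n in range(top + 1)]
    return result


# Invented three-galaxy table, for illustration only.
sample = (
    "name,P_CS,numDcoArcsGE000,numDcoArcsGE100\n"
    "g1,0.9,3,1\n"
    "g2,0.8,2,0\n"
    "g3,0.1,0,0\n"
)
print(cumulative_arm_counts(sample))
# {'numDcoArcsGE000': [3, 2, 2, 1], 'numDcoArcsGE100': [3, 1]}
```

Each resulting list can be drawn as one series on a shared figure. Whether to restrict the rows to likely spirals using P_CS is a judgment call that this sketch deliberately leaves open.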
3) If you want to work on the global warming project, then you should start by going to the website http://issm.jpl.nasa.gov and seeing what the project is all about. The ISSM architecture is briefly described in this PPT file. Then pick a challenge below:
Many existing and upcoming NASA satellites orbit the Earth to measure a variety of physical quantities related to the terrestrial water cycle. NASA scientists routinely develop tools to analyze these data as well as numerical models that help simulate the movement of water on Earth. Together these computer programs have the potential to help water resources management agencies, particularly in the developing world, where observations are sparse. However, such agencies often have to focus on critical real-time decision-making activities and therefore have only limited resources available to learn how to analyze state-of-the-art NASA observations. There is hence much to be gained in easing the usage of NASA hydrology tools. If you want to work on the water resources project, then perform the challenge below:
Your challenge is to publish a simple Python code on GitHub and run it within a hosted Continuous Integration service. Your code must download data directly from the URL http://rapid-hub.org/data/angles_UCI_CS.csv, recognize that the top line is a header and print it with a third header column added, and then repeat the following lines of data together with the cosine value for each angle; that is, print all values of “station_id”, “angle_degrees” and the associated cosine values to standard output. (PS: the data file does not go into the repo. Your code must download the data, because the data may change behind your back, although the format of the data will not change.) You might use the Ubuntu or macOS capabilities of Travis CI, or the Windows capabilities of AppVeyor, depending on your preferred OS. Note that these services are free for open source software. Hint: this challenge requires a YAML file (*.yml) in your repository, tailored to the hosted Continuous Integration service you selected.
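The data-handling half of this challenge might look roughly like the sketch below. It assumes the file has exactly two comma-separated columns, “station_id” and “angle_degrees”, as the description implies. The new header name `cosine`, the six-decimal formatting, and the function names are choices made up for illustration. The CI configuration file is not shown. The sample at the bottom uses invented rows so the block runs offline; `main()` is defined but not called here.

```python
import csv
import io
import math
import urllib.request

DATA_URL = "http://rapid-hub.org/data/angles_UCI_CS.csv"  # URL from the task


def add_cosine_column(csv_text, new_header="cosine"):
    """Return all rows with a third column holding cos(angle_degrees).

    Assumption: first row is the header, remaining rows have two fields.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, data = rows[0], rows[1:]
    out = [header + [new_header]]
    for station_id, angle_degrees in data:
        # math.cos expects radians, so convert from degrees first.
        cosine = math.cos(math.radians(float(angle_degrees)))
        out.append([station_id, angle_degrees, f"{cosine:.6f}"])
    return out


def main():
    """Download the live file and print the augmented table to stdout."""
    with urllib.request.urlopen(DATA_URL) as response:
        text = response.read().decode("utf-8")
    for row in add_cosine_column(text):
        print(",".join(row))


# Offline illustration with made-up rows (not the real data).
sample = "station_id,angle_degrees\nA,0\nB,60\nC,180\n"
for row in add_cosine_column(sample):
    print(",".join(row))
# station_id,angle_degrees,cosine
# A,0,1.000000
# B,60,0.500000
# C,180,-1.000000
```

In the published script, the sample lines at the bottom would be replaced by a call to `main()`, so that the CI job downloads the current data on every run.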