
Topological Molecular Biology lab at UC Davis

Author: Jake Roggenbuck
Working at Capital One as a SWE Intern and Conducting Programming Language Research

While I was still in high school, I emailed a few professors and PIs at UC Davis to ask whether I could help with machine learning in their labs. Sure enough, the TMB lab let me help with research on predicting whether viruses can infect humans using logistic regression.

Here is a presentation I put together after building the regression model, which predicts whether a virus is likely to infect humans based on the DNA sequence of its spike protein.

Here is what I wrote about it in November 2021:

This fall, I started an internship with the Arsuaga-Vazquez lab, the Topological Molecular Biology lab at UC Davis. I’m working with professors and UCD students to construct machine learning models and other tools to predict the transmissibility of new COVID-19 variants by looking at their genetic sequence and the proteins it encodes. With my expertise in Python, I am a super valuable asset to this project. I am developing tools for the isolation and extraction of the region that encodes the spike protein, as it is in a slightly different location for each sequence and variant - and may look different from one variant to another. Next fall, I hope to continue this type of meaningful work at a UC campus.

A lot of work was done in Python scripts, notebooks, and Google Sheets. Seen here is a test dataset I used for part of my regression model experimentation.

The main process for this research involved finding existing SARS-CoV-2 and SARS-CoV-1 sequences on the NCBI website. I would then find a research paper describing each sequence and note the start and end “residues” (the indices used to locate genes in the DNA). From there, I would look for where the spike protein started and ended and write a script that looked similar to this:
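As a rough illustration of that step, here is a minimal sketch (not the original script from the lab) that slices a gene out of a genome using 1-based start and end coordinates and translates it into a protein. The names `extract_region` and `translate`, the toy genome, and the coordinates are all made up for this example; real coordinates would come from the paper or NCBI record for each sequence.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def extract_region(genome: str, start: int, end: int) -> str:
    """Return genome[start..end] using 1-based, inclusive coordinates."""
    if not (1 <= start <= end <= len(genome)):
        raise ValueError(f"coordinates {start}..{end} fall outside the genome")
    return genome[start - 1:end].upper()


def translate(dna: str) -> str:
    """Translate DNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy genome (hypothetical): 4 bases of padding, a tiny "gene", 3 more bases.
toy_genome = "CCCC" + "ATGTTTGTTTAA" + "GGG"
gene_dna = extract_region(toy_genome, start=5, end=16)
print(gene_dna)             # ATGTTTGTTTAA
print(translate(gene_dna))  # MFV
```

Because the gene sits at a slightly different position in each sequence, the start and end values have to be looked up per sequence rather than hard-coded once.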

A lot of work went into handling thousands of sequences and fitting a regression model. It also took a lot of work to understand why the model worked for some sequences and not others. Over the course of a year, I learned how much thoroughness research demands.
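To give a feel for what fitting a model like this involves, here is a toy sketch of logistic regression trained with plain gradient descent. Everything in it is an assumption made for illustration: the features (normalized 2-mer counts), the tiny made-up sequences and labels, and the helper names `kmer_features`, `fit_logistic`, and `predict_proba`. It is not the lab's pipeline or feature set.

```python
import math
from itertools import product

K = 2  # k-mer length (an assumption chosen for this example)
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_features(seq: str) -> list[float]:
    """Turn a DNA string into normalized k-mer frequencies."""
    counts = {kmer: 0 for kmer in KMERS}
    windows = max(len(seq) - K + 1, 1)
    for i in range(len(seq) - K + 1):
        kmer = seq[i:i + K]
        if kmer in counts:
            counts[kmer] += 1
    return [counts[kmer] / windows for kmer in KMERS]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=1.0, epochs=500):
    """Batch gradient descent on the mean log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(X, y):
            z = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(z) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


def predict_proba(seq: str, weights, bias) -> float:
    z = sum(w * x for w, x in zip(weights, kmer_features(seq))) + bias
    return sigmoid(z)


# Made-up training data: label 1 = "positive", label 0 = "negative".
sequences = ["GCGCGGCC", "GGCCGCGC", "CGCGGCGG", "ATATTAAT", "TTAATATA", "AATTATAT"]
labels = [1, 1, 1, 0, 0, 0]

X = [kmer_features(s) for s in sequences]
w, b = fit_logistic(X, labels)

for seq in ["GGGCCC", "ATATAT"]:
    print(seq, round(predict_proba(seq, w, b), 3))
```

Inspecting the learned weights per k-mer is one simple way to start asking why a model of this kind gets some sequences right and others wrong.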

Above are all the files I wrote for different tasks, such as checking why certain false positives occurred or why certain gene translations didn’t work as expected.

This research went on to inform the lab’s next experiments, which led to their paper “Using machine learning to detect coronaviruses potentially infectious to humans,” published in Nature.

I’m so thankful to have had this experience to learn about research, regression, and machine learning more generally. For a high schooler, being part of a lab, a machine learning literature circle, and a cutting-edge research project could not have been a better experience.