Overview: Data Carpentry Workshop on Genomics

  • January 31, 2019
iJOBS Blog

On January 17th and 18th, Rutgers iJOBS hosted a two-day Genomics Workshop designed by Data Carpentry. This is the third time Rutgers has held this workshop, alternating between Newark and New Brunswick. There were two instructors, Niel Infante and Amanda Charbonneau, along with three assistants from Rutgers: Katarzyna Tyc, Zhenru Zhou, and Roman Wernyj.

Here is the description of the program:

Data Carpentry workshops are for any researcher who has data they want to analyze, and no prior computational experience is required. This hands-on workshop teaches basic concepts, skills and tools for working more effectively with data. The focuses of this workshop will be working with genomics data, and data management & analysis for genomics research. They will cover metadata organization in spreadsheets, data organization, connecting to and using cloud computing, the command line for sequence quality control and bioinformatics workflows.

Because my own research on autism spectrum disorder uses bioinformatic tools, I thought this workshop might offer a way to automate my data analysis. I often compare the expression of genes implicated in the disorder between unaffected and affected individuals. This type of analysis tends to generate hundreds to thousands of data points per experiment, which makes the analysis time-consuming and often tedious.

Day 1

The first day of the workshop started off with introductions, during which participants stated what they hoped to learn during the course. Many were interested in RNA sequencing (RNA-seq) data analysis; however, only a handful of people had RNA-seq data collected and ready to analyze. We discussed different aspects of planning an RNA-seq experiment: how to set up a plate to avoid bias (machine or human), and how to structure a spreadsheet. Here are some cardinal rules for organizing a spreadsheet.

image1

We then moved on to talking about where to store data once the samples have been analyzed.
The sequencing facility will send documentation (metadata) and the sequencing files themselves. The raw data from the facility will be the foundation of your sequencing analysis, so it is extremely important to store it properly. After about 2 weeks (check with your sequencing facility), the facility will delete the data permanently, and it will be lost if you haven’t saved it. Yikes! This is because sequencing facilities do not have the capacity to store everyone’s data indefinitely. Fear not, here are some guidelines for data storage. Using all three suggestions simultaneously will ensure your files are securely stored:

image2

Lucky for us, Rutgers has data storage options available; you can get more information here. If you are not from Rutgers, have left Rutgers, or want other options, you can use resources like Amazon S3, Microsoft Azure, or Google Cloud. It is also a good idea to keep all data on at least two external hard drives, stored in physically different locations.

End of Day 1 & Day 2

Towards the end of day 1, and for all of day 2, we worked on analyzing genomic data using the command line. More specifically, we used Bash, a commonly used Unix shell that gives the user the power to do simple tasks quickly. Here is a Bash scripting cheat sheet I found online containing many of the commands we used in class.

During these hands-on parts of the workshop, we used two sticky notes, one blue (I understand!) and one yellow (Help me!), to indicate whether we were able to work through the example or were having trouble. I thought this was a great technique! Did you look away and miss some of the commands? Put a yellow sticky note on your computer and someone will come help you! This was very useful when commands were being entered at a faster pace.

For the remainder of the workshop, we worked on a sample data set, mapping the sequenced data to the E. coli genome and searching for SNPs. It was interesting and pretty exciting to go through this process.
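To give a flavor of this kind of task, here is a minimal sketch of counting the reads in a FASTQ file that contain a given sequence. It is written in Python rather than Bash and is not taken from the workshop materials; the function name, the tiny three-read example, and the search sequences are all made up for illustration. It also assumes the common four-lines-per-read FASTQ layout, so it would need adjusting for files with wrapped sequence lines.

```python
import io


def count_reads_with_motif(handle, motif):
    """Count FASTQ records whose sequence line contains `motif`.

    Assumes each record is exactly four lines:
    header, sequence, '+' separator, quality string.
    """
    motif = motif.upper()
    count = 0
    for line_number, line in enumerate(handle):
        # The sequence is the second line of every four-line record.
        if line_number % 4 == 1 and motif in line.strip().upper():
            count += 1
    return count


# Hypothetical three-read FASTQ file, invented for this example.
EXAMPLE_FASTQ = """@read1
ACGTNNNNNNNNNNACGT
+
IIIIIIIIIIIIIIIIII
@read2
ACGTACGTACGTACGTAC
+
IIIIIIIIIIIIIIIIII
@read3
TTTTNNNNNNNNNNTTTT
+
IIIIIIIIIIIIIIIIII
"""

if __name__ == "__main__":
    # Count reads containing a run of ten unknown bases (N).
    bad_reads = count_reads_with_motif(io.StringIO(EXAMPLE_FASTQ), "NNNNNNNNNN")
    print("Reads with a run of ten Ns:", bad_reads)
```

With a real file, the same function could be called on an opened file instead of the in-memory example, for instance `count_reads_with_motif(open("sample.fastq"), "NNNNNNNNNN")`, where `sample.fastq` is a placeholder name.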
We learned many different commands that could be useful for analysis. While I don’t think it would be very useful for anyone if I just gave a list of commands, here is an extensive reference manual for Bash. Again, it might not be very useful to read straight through, but it would be helpful if you have a project you want to work on.

Overall, this workshop helped me and the other participants become familiar with the command line and how to use Bash. Toward the end of the workshop, I found myself easily calling up different saved files, opening them, changing directories, and looking for specific sequences within the sequenced data. However, I will say that I did not learn how to take Excel files of data and use the command line to automate the data analysis. I spoke to the participants sitting around me, and some of them seemed to feel the same way. Thus the question remains unanswered: “How do I use the skills learned during the workshop to help with my own data analysis?” Perhaps this workshop would be more appropriate if you already have RNA sequencing data.

If you were unable to attend the workshop, you can see what was covered on Day 1 & Day 2, in its entirety, on this webpage. This will allow you to work through the examples and all of the lessons on your own time. Further, this workshop might be available again in 2020, so keep an eye out!

image3

Day 2: Course instructor Niel Infante explaining how to open sequencing data and view different components of the analysis. You can see blue sticky notes on participants’ computers, indicating they are understanding the material!

This article was written by Monal Mehta. Edits and contributions were made by Eileen Oni and Helena Mello.