Making Data Science More Efficient: Workshop Recap

  • January 14, 2016

By: Maria Qadri

[Image: hands typing on a keyboard. Caption: How to Program as a Biomedical Scientist 101]

A major component of our doctoral training is learning how to think critically and use data effectively. To work towards this goal, iJOBS partnered with Data Carpentry to run a two-day workshop that taught basic programming skills and introduced commonly used computational tools to biomedical scientists at Rutgers. The target audience was people with little to no prior programming knowledge, and I have typically taught introductory programming courses for undergraduates. I decided to attend anyway to gain experience with some new tools, explore a different perspective on how to approach data science, and perhaps pick up a few teaching tips.

On the first day, the room was laden with laptops and coffee cups. The most valuable thing you could have brought to the seminar was an extra power strip or extension cord. We were equipped with two post-its and a bundle of software: OpenRefine (aka Google Refine if you’re old school), Anaconda Python 3, and SQLite.

The overall approach was to give students datasets along with well-written, informative instructions, have them work through the problems embedded in those documents, and answer questions as they arose. We used our post-its to request additional help and to indicate that we were done with the task at hand. The instructors then reviewed the more complex sections and the answers to the problems with the whole group. The model worked well for a large group of attendees with varying skill levels.

The morning of the first day covered how to use OpenRefine to clean, cluster, and identify desirable sections of “messy” datasets (ones that may include extraneous information, need reformatting, etc.). OpenRefine is a web-based program with a straightforward graphical user interface, and it can save your steps as a script to reuse later.
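To give a feel for what "clustering" a messy column means, here is a minimal Python sketch of the general key-collision idea: normalize each value to a key and group values whose keys match. This is a simplified illustration, not OpenRefine's own implementation, and the `fingerprint` and `cluster` helpers and the sample names are made up for the example.

```python
import re
from collections import defaultdict


def fingerprint(value):
    """Normalize a string so near-duplicate spellings share one key (hypothetical helper)."""
    # Trim whitespace, lowercase, and drop punctuation.
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    # Keep unique words in alphabetical order so word order no longer matters.
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group raw values by fingerprint; keep only keys with more than one spelling."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


# Made-up messy column of institution names.
names = ["Rutgers University", "rutgers university ", "University, Rutgers", "Princeton"]
clusters = cluster(names)
print(clusters)
```

Each cluster lists spellings that probably refer to the same thing, which a person can then review and merge into one canonical value.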

While the Python introduction was largely a review for me, the instructors walked us through the basics: data types, variable assignment, slicing and indexing, mathematical operations, and data manipulation. Python and MATLAB are very similar languages with some notable differences, which are most apparent when indexing or slicing variables. One new tool that I discovered in this session was IPython notebooks, which emulate lab notebooks for code: you can include code, annotate sections with notes, keep plotted figures alongside the scripts used to create them, and easily share them with your advisor. The most compelling section demonstrated several matplotlib techniques to show the power of plotting in Python.
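For readers coming from MATLAB, the indexing difference looks roughly like this. The list of readings is made up for illustration; the MATLAB equivalents appear only in the comments.

```python
# A small list standing in for a column of measurements (made-up values).
readings = [10, 20, 30, 40, 50]

first = readings[0]          # Python counts from 0; MATLAB would write readings(1)
last = readings[-1]          # negative indices count back from the end
middle = readings[1:4]       # end index is excluded; MATLAB's readings(2:4) selects the same values
every_other = readings[::2]  # a step of 2 takes every second element

print(first, last, middle, every_other)
```

The two habits to relearn are counting from zero and remembering that the end of a slice is not included.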

The second day ended with us delving into database exploration and manipulation with SQL. Since SQL queries depend heavily on the order of operations, the problem sets for this section echoed the word problems we would tackle in basic math classes. While this was the material I was most excited about, I was a little underwhelmed and might have opted for an online tutorial instead, since I already have a foundation of programming experience. My takeaway was that SQL retains data integrity better than other data tools like Excel or MATLAB.
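One way to read the "order of operations" remark is that the clauses of a query are written in a fixed order, and rows are filtered before they are grouped and sorted. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `samples` table and its values are invented for the example and are not from the workshop materials.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, tissue TEXT, weight REAL)")
conn.executemany(
    "INSERT INTO samples (tissue, weight) VALUES (?, ?)",
    [("liver", 1.5), ("liver", 2.5), ("heart", 0.5), ("heart", 0.25), ("lung", 0.75)],
)

# WHERE filters individual rows first, GROUP BY then aggregates what is left,
# and ORDER BY sorts the aggregated result.
query = """
    SELECT tissue, COUNT(*) AS n, AVG(weight) AS mean_weight
    FROM samples
    WHERE weight > 0.3
    GROUP BY tissue
    ORDER BY mean_weight DESC
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)

conn.close()
```

Because the filter runs before the grouping, the 0.25 heart sample never enters the average, which is the kind of detail the word-problem-style exercises make you think through.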

By the end of the workshop, we had gained insight not only into the software but also into the personal styles of the two instructors. One of my favorite parts was the extra discussion where they demonstrated how they use Python in their own research and day-to-day lives. Knowledge is freely available on the internet and there are many programming tutorials, but this seminar offered something invaluable: the ability to ask questions of real people, stay motivated through a tutorial, and engage with peers who are attacking different problems with similar tools. The students who started with minimal or no prior programming experience gained the most, but even those of us with existing experience picked up a few tips and tricks to improve our data science practices.

If you’re interested in exploring the program and delving into our discussions from class, check out