2019-01-15

Teaching undergraduates important technologies for science and life

As a student, I've been taught many pieces of software to aid me on my academic journey. Since I have not attended any class specifically to learn software or technology, everything I have been taught was intended for an audience of high school biology or undergraduate psychology students. Unfortunately, I've had to disregard most of that instruction on ethical or practical grounds. Students are being taught skills that they will probably never use after they leave college, or even the class itself. I'm listing some of what I've been taught (or haven't), outlining better alternatives, and giving my reasons for why they are better.

Boards of Canada - An Eagle In Your Mind

Version control

Any time I take a look at the files of my peers, I cringe at the filenames. Usually there's a base file named Final Paper.docx and multiple copies named Final Paper v2.docx, Final Paper v3.docx or Final Paper v4.docx. Sometimes you find the strategy where the longest file name is the most recent one (Final Paper final.docx or Final Paper final final v3.docx). Why do people do this? Because it's easy, and they aren't aware of an alternative. This is actually an issue that programmers have dealt with and have largely solved. hg (Mercurial) and git are the most popular solutions, and I'd recommend that git be taught to undergraduates, due to its prevalence. You can view a history of all the edits, roll back to an earlier version, make experimental changes, and even maintain multiple branches with different changes in them (like different personal statements for different universities). Any time you commit, you have saved that version of the file for the future. The best part is that the files in the directory will be the files you need, with no old versions whose annoying, cluttered file names get in the way of productivity.

As an added benefit, collaborating and sharing is a breeze. That's the spirit of open source and science!

Here's a way to begin.
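As a rough illustration of the commit workflow described above, here is a minimal Python sketch that drives git from a script. It assumes git is installed; the file name final-paper.md, the commit messages, and the placeholder identity "Student" are all made up for the example. It saves two versions of one file and then reads the history back.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(repo, *args):
    """Run one git command inside repo and return its standard output."""
    # A placeholder identity is passed so commits work without any global config.
    cmd = ["git", "-c", "user.name=Student",
           "-c", "user.email=student@example.com", *args]
    done = subprocess.run(cmd, cwd=repo, check=True,
                          capture_output=True, text=True)
    return done.stdout


def paper_history():
    """Commit two drafts of one file; return the commit messages, newest first."""
    if shutil.which("git") is None:
        return None  # git is not installed on this machine
    with tempfile.TemporaryDirectory() as repo:
        paper = Path(repo) / "final-paper.md"
        git(repo, "init", "--quiet")
        paper.write_text("First draft.\n")
        git(repo, "add", "final-paper.md")
        git(repo, "commit", "--quiet", "-m", "First draft")
        # Edit the same file in place: no "v2" copy is needed.
        paper.write_text("Second draft, with a conclusion.\n")
        git(repo, "commit", "--quiet", "-am", "Add conclusion")
        return git(repo, "log", "--format=%s").splitlines()


if __name__ == "__main__":
    print(paper_history())
```

There is only ever one final-paper.md in the directory, yet both drafts remain recoverable from the history. In day-to-day use you would type the same git commands directly in a terminal rather than wrapping them in Python.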

Data management

So I lied. There's no software recommendation here. This is just a request to teach proper data management skills before people leave university. I think it would be beneficial for people to think about data in terms of tables and relations between tables. Spreadsheet software already primes people to think in tables, but few people use pivot tables, so the relational aspect is ignored. What are relations? It's when data is... umm... related to other data.

Let's take a person. Each person has a name and an age. Each person has only one age. That can be called a one to one relationship, and it is often saved in the same table (or spreadsheet). Depending on your data storage requirements, a person might have a single name (one to one) or multiple names (one to many). You can't really store one to many relations in a single table. You need a new table that stores each name next to the person's unique identifier so that the two tables can be joined later. There are also many to many relationships, where a student can have multiple classes and a class can have multiple students. Those need a third table in which each row links one student to one class.

Thinking about data this way helps to store it reliably and makes it easy to analyze later without sanitization (the worst part of data analysis). The easiest way to retain these concepts is to learn a system that utilizes them. Why not learn SQL?
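To make the person, name, and class relationships above concrete, here is a small sketch using SQL through Python's built-in sqlite3 module. The table names, column names, and sample rows are all invented for illustration; a real schema would depend on your data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    -- one to one: a person's age lives in the person table itself
    CREATE TABLE person (id INTEGER PRIMARY KEY, age INTEGER);
    -- one to many: each row is one name belonging to one person
    CREATE TABLE person_name (
        person_id INTEGER REFERENCES person(id),
        name TEXT
    );
    CREATE TABLE class (id INTEGER PRIMARY KEY, title TEXT);
    -- many to many: each row links one student to one class
    CREATE TABLE enrollment (
        person_id INTEGER REFERENCES person(id),
        class_id INTEGER REFERENCES class(id)
    );
""")
conn.executemany("INSERT INTO person VALUES (?, ?)", [(1, 19), (2, 21)])
conn.executemany("INSERT INTO person_name VALUES (?, ?)",
                 [(1, "Sam"), (1, "Samantha"), (2, "Alex")])
conn.executemany("INSERT INTO class VALUES (?, ?)",
                 [(1, "Biology"), (2, "Psychology")])
conn.executemany("INSERT INTO enrollment VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 2)])

# One to many: every name stored for person 1.
names_of_person_1 = [row[0] for row in conn.execute(
    "SELECT name FROM person_name WHERE person_id = 1 ORDER BY name")]

# Many to many: ages of everyone enrolled in Psychology, found by joining
# through the enrollment table.
psychology_ages = [row[0] for row in conn.execute("""
    SELECT person.age
    FROM person
    JOIN enrollment ON enrollment.person_id = person.id
    JOIN class ON class.id = enrollment.class_id
    WHERE class.title = 'Psychology'
    ORDER BY person.age
""")]

print(names_of_person_1)
print(psychology_ages)
```

The unique identifier (person.id) is what lets the separate tables be joined back together at analysis time.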

Statistical Analysis

Don't use Excel. Don't teach Excel. Excel is proprietary. Excel is not free. Excel cannot be improved by anyone but Microsoft. Excel cannot be easily automated. Excel cannot scale to large amounts of data. Excel is not reproducible. Excel is exclusionary (large licensing fees). Learning Excel is not a transferable life skill. Excel's visualizations look ugly, and its statistical functions have wrong defaults which are usually not changed. In my opinion, Excel goes against the right way of viewing data processing. I don't think pivot tables alleviate any of these concerns, and people who don't know them don't think about data relationally (or in a document format, or any way really).

SPSS is even more expensive (ridiculously), and is an even less transferable skill than Excel. It suffers from most of the same problems Excel does, but is a teeny bit better.

R should be taught to students as soon as possible, ideally during their first semester in university. While R has statistical advantages, learning R would also teach people how to program, which is an indispensable skill for the modern age. It opens up multiple possibilities, and even if students don't use it during their careers, they will be aware of what can be done and will be able to delegate tasks to professionals efficiently. R is open source, free, reproducible and easy to share and critique, which is exactly what science needs! Once you learn a teeny bit of R, you'll realize how much faster it is too! At the very least, people who don't use it for research can manage their finances with R.

Document writing

I was going to copy and paste the paragraph about Excel over here but I think the point has been made :) The .docx format is standardized only on paper (as Office Open XML), is in practice tied to Microsoft Word, and looks quite ugly to be honest. Sure, it's easy to make quick edits and style changes, but for longer pieces of work like papers and books, students need to use a format that's free, open source, version controllable, and more importantly (to me), typographically perfect. Markdown, LaTeX, and plain text come to mind. These are plain-text formats that can be written in any text editor and converted to a PDF document at will. They facilitate a focus on the content at hand rather than the appearance, which is left to typesetting software with years of refinement behind it. After writing your document in Markdown, you can convert it to any format you like, even a .docx, by using pandoc.

Writing this way separates the data, images, and text from each other, and makes it easy for version control to track changes. With git's --word-diff option, you can see precisely which changes occurred at which time, and since it's all text, you can just email the file to anybody in the world. They would be able to compile it into a document and read it for free, which is critical for science and its future.
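To give a feel for what a word-level diff looks like, here is a toy sketch built on Python's standard difflib module. It is not how git implements --word-diff; the function name word_diff and the sample sentences are made up, and the output only imitates git's [-removed-] and {+added+} markers.

```python
import difflib


def word_diff(old, new):
    """Mark removed words as [-word-] and added words as {+word+}."""
    a, b = old.split(), new.split()
    out = []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])  # unchanged words pass through as-is
        if tag in ("delete", "replace"):
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)


result = word_diff("The results were significant",
                   "The results were not significant")
print(result)  # The results were {+not+} significant
```

Because the document is plain text, a one-word change shows up as a one-word change, instead of the whole file being flagged as different.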


I think it's important to use future-proof technologies that are free and open source. That would allow individuals in countries with low purchasing power to engage in the scientific discussion and consume the same research with the same tools. We've probably missed a lot of Einsteins in Africa, India, and China due to a lack of resources, so let's lower the barrier to entry. There are numerous other benefits too! Don't play into the hands of companies by teaching or learning proprietary techniques; stay open for humanity :)