Excel is great for spreadsheet use. I can remember taking accounting in college, when I was not aware of Excel (before I had a computer – I know, it was a long time ago). We used sheets of paper that we had to adjust manually (erasers were invaluable). Some of the figures were adjusted so many times, that the paper would wear thin. Years later, on discovering Excel, I reflected how much more effective and efficient completing my accounting class exercises would have been if I had only known of its existence. Today, Excel is used for far more than accounting. It is used for data storage, data analysis, reporting, graphing, and a bunch of other stuff that it really is not well suited for.
For example, if you find yourself loading CSV files into excel, doing a lot of index/matches, copy and pastes, adding filters and then creating charts and graphs, there are definitely better ways of doing that then Excel. One is to use R instead. The learning curve is steep, but in the end, it will be a faster and more flexible way to get some of these things done. Also, since it is code, you can use it repeatably.
R is great for data analysis, reporting and graphing. In fact, I first discovered R for myself because I was looking for an alternative to Excel for data visualization. I am going to show a few steps to get you started. The first step is to load data. There are a number of packages for R out there that help to connect you to data sources. Here, I’m just going to use a CSV file.
This is the code to get a CSV file loaded: myFile <- read.csv(file.choose()) I use the file.choose() command to pop-up a file selector. You just choose the data file and the file is loaded. That’s it! If you are using an R IDE like RStudio, then you can view your table by finding it in the upper right-hand corner and clicking on it. You can also highlight the name of the file in the scripting window and click on the Run button (check out the RStudio IDE Cheat Sheet https://www.rstudio.com/wp-content/uploads/2016/01/rstudio-IDE-cheatsheet.pdf). The data will be returned in the Console window.
The table in R is actually called a data.frame and is a set of vectors & factors (vectors generally for numeric data and factors for categorical data). That’s important to understand, and it can also be helpful for a number of data functions. Generally, when I am exploring data, I just create a copy of the imported data. The reason is, if I make an error, I can just make another copy, instead of having to import the data again. To make a copy of a data.frame, you just use the following command: myFile2 <- myFile That’s it! Now you have a copy to work with.
From here, suppose that you want to sum a column. It’s a simple command: sum(myFile2$Sales) That all it takes – the file name, a $, and the column name. If you wanted to get the average of the column, just run the following command: mean(myFile2$Sales) The mean is the statistical reference to the average and since R is a statistical programming language, it makes sense for that to be the command. If you are looking at the mean, you most certainly will want the standard deviation as well: sd(myFile2$Sales)
Now what if you wanted a subset of the data that you imported. That can be handled by the subset command: subset(myFile2, Sales > 35, select = c(Product, Sales)) This one is a little more complicated. There are three sections. The first is the data.frame that was created earlier. The second is the subset, meaning the expression of the rows in the data to keep. The third is the columns to return. The columns are enclosed in parentheses after a “c”. The “c” is generally used when there is more than one thing in an element of a statement. It basically means to combine values into a vector or a list, but I like to think of it as “c” for collection, since we are not combining the columns, just returning them (and not the other columns that may be in the data.frame).
Here I’ve added a second column to return called Product, which would most certainly be stored as a factor. If you wanted to use the subset of the data.frame, you can create another data.frame: myFile3 <- subset(myFile2, Sales > 35, select = c(Product, Sales)) myFile3 will be a data.frame with one factor (Product), one vector(Sales) and will only include Sales where the value is greater than 35. Now, let’s say that you have loaded another CSV file as a data.frame called allProducts that contains the columns Product and Price and you wanted to add the Sales from myFile3.
Here is one way to do that: #Convert data.frames to data.tables and add a key.allProducts2 <- data.table(allProducts, key = “Product”) myFile3 <- data.table(myFile3, key = “Product”) #Join all data from allProducts2 and matching data (by Product) from #myFile3 using a “left” join.allProducts3 <- join(allProducts2, myFile3, type = “left”).
The first thing to see here is the “#”. That is to comment out the text. You want to use this a lot. That is so when you (or someone else) go back to your code, you will know what the code’s purpose is without having to look at the code itself. Second, the code is converted into a data.table, primarily so I add a key to each table that can be matched int he next command.
Finally, the “join” command joins all the data from one data.table and the matching data (using the “key” from the prior commands) from the other data.table. If you wanted only data that matched between the two data.tables, replace “left” with “inner”. If you wanted to return all, use “full”. This is based on the way SQL combines data.
If you want to graph the data, I have written a blog post to help you: https://daranjjohnson.com/blog/graphing-with-r/. If you want to go a little deeper and explore correlations in your data with R, I have a blog post on that as well: https://daranjjohnson.com/blog/correlation-of-data-attributes/.
R is really powerful, free (which I love), and, once you get the hang of it, really fun to work with. I hope you check it out, start moving away from Excel, and get to love R as much as I do.