It looks like you’ve found your way to this tutorial (of sorts). Today, we’ll be looking at some basic statistics, code and programming in the statistical programming language R. This will not be a comprehensive tutorial, but it will at least give you an idea of how R works. But before we go any further, let’s talk about what in God’s name R is and its purpose in the programming/coding world.
Q: What the hell is R? Great question! R is considered a statistical computing software package. It’s also open source – which means it’s free like Linux! Yay! It was created back in the 90s and has been gaining popularity to this day. Many academic institutions and professional settings use this program every day!
You can conduct anything from basic statistical analysis up to machine learning/neural networks, as well as produce cool graphs and plots within the software environment.
Under the hood, R itself is mostly written in C, FORTRAN and R.
Q: Cool, what do I need to get started then? To begin your epic journey into R, you’ll need the base software package which can be found on the Comprehensive R Archive Network (CRAN):
It currently has support for Windows, Mac (OS X) and some Linux distros (specifically Debian, Red Hat, SUSE and Ubuntu – so pick your poison!)
Just follow the prompts and you should be all good.
Q: I installed R and the environment isn’t too good… Is there an IDE we can use? Why yes there is! Which is actually what we’ll be using today. It is called “RStudio” and trust me, it is way better than just base R by itself:
This IDE is also free just like base R – so just choose the free version.
So, after successfully installing RStudio and R, you can open it up and it should look something like this:
I’ve changed my theme for the IDE to “cobalt” which can be done by going to Tools -> Global Options -> Appearance and changing the theme – so don’t freak out if everything is white for you.
Choose a theme that you think is cool!
So before I start explaining each panel, let’s open a blank script (which is arguably the most important panel). Go to the top left corner, and click on the drop down menu next to the page with a plus on it and choose “R Script”.
Wabam, we now have the script:
Quick summary of the panels:
Top Left: The R Script – the page you can actively edit your lines of code on (kind of like Microsoft Word or Eclipse if you’re familiar with Java)
Top Right: Global Environment and History – If you make any variables, objects or custom functions, they’ll show up here so you can keep track of them.
Bottom Right: Files, Plots, Packages, Help and Viewer – We’ll be focusing on the Plots tab today, which simply displays any graphics you generate in RStudio.
Bottom Left: Console – This is where the executed commands from the R Script show up so you can see what has been executed and in what order. Furthermore, any error messages will show up here to let you know you done goofed in your code (which happens to everyone – especially me!)
Q:
Alright, alright we’ve got the basic setup and we can now start doing some things in R.
Generally, R is used to manipulate and analyse data which is stored in a file (e.g. .xlsx, .csv etc.)
So today I think it would be cool to use the data collected from the Level1Techs forum-wide survey and generate some graphs with R.
You can grab the data set from here:
Save the file to a folder (your desktop is usually the easiest) so we can use it later.
Let’s get cookin’!
A good place to start when programming in any language is to get your bearings in relation to your system. In R, it’s no different. So we can use the following commands to get the info we need to start:
# Checks current Directory path
getwd()
# Setting the Working Directory
setwd("C:/Users/Kamakoda/Desktop/Level1 Techs Data")
# Checks contents of current Directory
dir()
You can put these three commands into your R script and execute them from the script by hitting CTRL + ENTER. The executed commands will then show up in the Console with their results (if you executed any functions in the script that is). We can use the results of dir() to find what the name of the file is to read it into R.
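If you'd rather not eyeball the whole dir() listing, here's a rough sketch of narrowing it down to just the CSV files. Keep in mind that demo_dir and survey.csv are made-up stand-ins created in a temporary folder (not the real survey files), so this is safe to run anywhere:

```r
# Hypothetical scratch folder and file, standing in for the Desktop folder and the survey CSV
demo_dir <- file.path(tempdir(), "level1techs_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "survey.csv"))

# setwd() hands back the previous working directory, so we can restore it later
old_wd <- setwd(demo_dir)

# list.files() is the same function as dir(); pattern keeps only names ending in .csv
csv_files <- list.files(pattern = "\\.csv$")
print(csv_files)

# Go back to where we started
setwd(old_wd)
```

Using file.path() to build paths saves you from worrying about forward vs. back slashes on Windows.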
PRO TIP 1: There is an easier way to set the working directory via the GUI. Just go to Session -> Set Working Directory -> Choose Directory and select the folder you’ve placed your files in (funnily enough, this gets sent to the console anyway, so you get to see it both ways!)
Anyway, let’s read the data into R so we can actually use the damn thing (finally!):
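# Reading the data file into R
# stringsAsFactors = TRUE keeps text columns such as GPU as factors, which summary() and plot() below rely on
# (since R 4.0.0, read.csv() reads text columns as plain character strings by default)
level1techs <- read.csv("Public Level 1 Techs Community Survey (Responses) - Form responses 1.csv", header = TRUE, stringsAsFactors = TRUE)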
You’ll notice I used <- before using the read.csv() command. This is simply to define an object within R so we can just store it for later use. The = sign is also acceptable, I just generally use <- to lessen ambiguity when I need to write equations and what not. Neat!
After running this command, we can now have a peek at our global environment and what do you know – there’s our data file! Huzzah! If we click on the little spreadsheet icon next to the name of the object in the Global Environment, we’ll get a nice look at our dataset in its glorious spreadsheet format.
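If you prefer checking the data from the Console instead of clicking around, something like the sketch below should do the trick. Since the real survey file isn't bundled here, it writes a tiny made-up CSV first (demo_csv, with invented GPU and RAM columns), so treat the values as placeholders only:

```r
# Write a tiny made-up CSV to a temporary file (placeholder data, not the survey)
demo_csv <- tempfile(fileext = ".csv")
writeLines(c("GPU,RAM", "AMD,16", "Nvidia,32", "Intel,8", "Nvidia,16"), demo_csv)

# Read it in the same way as the survey file, keeping text columns as factors
demo <- read.csv(demo_csv, header = TRUE, stringsAsFactors = TRUE)

str(demo)       # column names, types and a preview of the values
head(demo, 3)   # first three rows
dim(demo)       # number of rows and columns
```

str() is usually the quickest way to spot a column that got read in as the wrong type.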
All the Graphs
Alright, so after all the fiddly reading into R, we can move onto actually doing some things – graph things in particular. Let’s start by plotting the number of Level1Techs members using AMD or NVIDIA based GPUs. We’ll start by finding the summary of GPU types – by using the summary() function!
# Makin' some graphs - GPU counts of users
# Summary function for counts of GPU
summary(level1techs$GPU)
Looks like the Console has spit out:
AMD – 101
Intel – 17
NVIDIA – 121
Alright cool, let’s graph it!
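For the curious, here's a small sketch of what summary() is doing with a factor, plus a couple of related ways to get counts. The gpu vector is a made-up toy example, not the survey data:

```r
# Toy stand-in for level1techs$GPU (made-up values)
gpu <- factor(c("AMD", "Nvidia", "Nvidia", "Intel", "AMD", "Nvidia"))

# For a factor, summary() gives a named vector of counts, one entry per level
counts <- summary(gpu)
print(counts)

# table() gives the same counts as a table object
print(table(gpu))

# prop.table() turns counts into proportions that sum to 1
shares <- prop.table(table(gpu))
print(round(shares, 2))
```

Note that this only works as a count when the column is a factor; on a plain character column summary() just reports its length and type.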
# Plotting GPU Type vs. Count
plot(level1techs$GPU)
Huzzah! No errors to speak of and it worked! You’ll find that the functions within R are quite relaxed about what you feed them. This, however, is a double-edged sword. If you’re not careful and make a mistake in the code, R will often just run it anyway and spit out a result – regardless of whether it’s actually what you want. Luckily in this case it’s quite simple, so we can continue.
For the plot() function, all we really need to get going is to call the function and feed it some data, then it will generally do the rest. In this case, we called the level1techs object then used the $ to pull out the GPU column stored inside the object.
PRO TIP 2: plot() is a generic function, so what it draws depends on what you feed it. A factor like our GPU column gets a bar chart, while a set of numbers gets a scatter plot. There are many other graphs and entire packages dedicated to making graphics in R, but we’ll stick to the basics for now.
Anyway, this graph is… quite dull and boring. Let’s throw some spiciness into the mix:
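Here's a quick sketch of how plot() changes its behaviour depending on the input, using made-up toy data. The pdf(NULL) line just sends the plots to a throwaway device so the sketch also runs outside RStudio; drop it (and the dev.off() line) if you want to see the results in the Plots tab:

```r
# Throwaway graphics device so nothing is written to disk
pdf(NULL)

# Toy data (made up for illustration)
gpu <- factor(c("AMD", "Nvidia", "Nvidia", "Intel", "AMD", "Nvidia"))
numbers <- c(1, 4, 9, 16, 25)

plot(gpu)       # a factor gives a bar chart of counts
plot(numbers)   # a numeric vector gives a scatter plot against its index

# The explicit route to a bar chart: count with table(), then barplot()
# barplot() also returns the x position of the middle of each bar
mids <- barplot(table(gpu))
print(mids)

invisible(dev.off())
```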
# Plotting with Labels
plot(level1techs$GPU, main = "Count of Users per GPU Type",
     xlab = "GPU Type",
     ylab = "Count")
We’ve simply added some labels to make the graph a bit more presentable. It also lets us see what each axis is actually measuring. But it’s still lacking some flair and I’m a stickler for details. Let’s add some colour while also reordering the GPU count in descending order:
# Reordering the factor levels of GPUs – in the specific order we want!
GPU.ordered <- factor(level1techs$GPU, levels = c("Nvidia", "AMD", "Intel"))
# Plotting with Colours and Labels
plot(GPU.ordered, main = "Count of Users per GPU Type",
     xlab = "GPU Type",
     ylab = "Count",
     col = c("chartreuse3", "firebrick3", "royalblue"))
All we really did was make a new object and use the factor() function to put the GPU types in a specific order. Then we added colour by adding the col= argument and using the c() function to set the specific colours relative to their position (i.e. NVIDIA = chartreuse3, AMD = firebrick3, Intel = royalblue). Keep in mind, the c() function can be used outside of the col= argument, but we’ll worry about that another time.
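One thing to watch with factor(): the names you pass to levels= have to match the values in the data exactly, capital letters included, otherwise those entries quietly turn into NA. Here's a small sketch with made-up toy data that shows this, along with one way to get the descending order automatically instead of typing it by hand:

```r
# Toy data (made up for illustration)
gpu <- c("Nvidia", "AMD", "Nvidia", "Intel", "AMD", "Nvidia")

# Order the levels by how often each value appears, most common first
by_count <- factor(gpu, levels = names(sort(table(gpu), decreasing = TRUE)))
print(levels(by_count))

# Levels are case-sensitive: "NVIDIA" does not match "Nvidia",
# so every "Nvidia" entry becomes NA without any warning
mismatch <- factor(gpu, levels = c("NVIDIA", "AMD", "Intel"))
print(sum(is.na(mismatch)))
```

If a bar goes missing from your plot after reordering, a typo in levels= is the first thing to check.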
PRO TIP 3: There’s a file called the R colour palette which can be found from a quick google search. That’s how I chose these weirdly named colours haha
Finally let’s add some text so we can see the actual number of GPUs in use. Adding text to a base plot in R uses a coordinate system relative to the graph – it’s easier than it sounds:
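You can also look up colour names from inside R itself. A minimal sketch using the built-in colours() function (colors() works too):

```r
# The three colour names used in the plot
picked <- c("chartreuse3", "firebrick3", "royalblue")

# colours() returns every colour name R knows about
print(picked %in% colours())

# Search the list for names containing a word, e.g. "royal"
print(grep("royal", colours(), value = TRUE))

# col2rgb() shows the red/green/blue values behind each name
print(col2rgb(picked))
```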
# Plotting with Colours and Labels
plot(GPU.ordered, main = "Count of Users per GPU Type",
     xlab = "GPU Type",
     ylab = "Count",
     col = c("chartreuse3", "firebrick3", "royalblue"))
# Adding text to the plot
text(3, c(110, 100, 90),
     c("NVIDIA = 121", "AMD = 101", "Intel = 17"), font = 2)
And here we have a semi-good looking plot for the GPU distribution within the Level1Techs community! Hoorah!
If you made it this far, well done! Also, thank you for reading and checking out this basic tutorial (which probably still leaves much to be desired).
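The coordinates and counts in the text() call above were typed in by hand. As an alternative, here's a rough sketch that works them out from the data, so the labels sit on top of each bar. This is a different approach to the one used above, and the gpu vector is made-up toy data rather than the survey results:

```r
# Throwaway graphics device so the sketch runs anywhere; remove to see the plot in RStudio
pdf(NULL)

# Toy data (made up), with the levels already in the order we want
gpu <- factor(c(rep("Nvidia", 5), rep("AMD", 3), rep("Intel", 1)),
              levels = c("Nvidia", "AMD", "Intel"))
counts <- table(gpu)

# barplot() returns the x position of the middle of each bar
# ylim leaves some headroom above the tallest bar for the labels
mids <- barplot(counts, main = "Count of Users per GPU Type",
                xlab = "GPU Type", ylab = "Count",
                ylim = c(0, max(counts) * 1.2),
                col = c("chartreuse3", "firebrick3", "royalblue"))

# pos = 3 places each label just above its (x, y) point, i.e. above the bar
text(x = mids, y = as.vector(counts), labels = as.vector(counts),
     pos = 3, font = 2)

invisible(dev.off())
```

The upside is that the labels stay correct if the data changes, since nothing is hard-coded.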
I’ll be hanging around this thread if anyone has any issues – so feel free to post any questions!
------------ BONUS CHALLENGE ------------
How about you guys give making a graph a go? It’d be cool to see what you come up with! Ye be warned though, you’ll probably run into problems (many problems knowing R) but I’ll be here to help. So good luck!
R’s a solid choice for large datasets. The language makes data manipulation pretty effective and it’s also super lightweight in comparison to something like Excel (albeit confusing at times haha).
I definitely find Excel gets really slow and unresponsive once the data set gets over 400k rows. I was chatting with someone the other day who made some really cool graphs with R. I’m looking forward to learning it. Do you have any suggestions for more advanced learning?
For more advanced visualisation, I highly recommend the ggplot2 package. There’s lots of tutorials for it and I might make one for it in the future if I have time haha
Also, for data manipulation I’d recommend the packages dplyr and tidyr
Although, the best way to learn R is by simply doing stuff in it haha (i.e. attempt something - fail - then ask google how to fix it - profit)