Beginner's guide to R: Get your data into R
- 19 August, 2017 02:40
Once you've installed and configured R to your liking, it's time to start using it to work with data. Yes, you can type your data directly into R's interactive console. But for any kind of serious work, you're a lot more likely to already have data in a file somewhere, either locally or on the Web. Here are several ways to get data into R for further work.
[This story is part of Computerworld's "Beginner's guide to R." To read from the beginning, check out the introduction; there are links on that page to the other pieces in the series.]
If you just want to play with some test data to see how they load and what basic functions you can run, the default installation of R comes with several data sets. Type:
data()
into the R console and you'll get a listing of pre-loaded data sets. Not all of them are useful (body temperature series of two beavers?), but these do give you a chance to try analysis and plotting commands. And some online tutorials use these sample sets.
One of the less esoteric data sets is mtcars, data about various automobile models that come from Motor Trend magazine. (I'm not sure what year the data are from, but given that there are entries for the Valiant and Duster 360, I'm guessing they're not very recent; still, it's a bit more compelling than whether beavers have fevers.)
You'll get a printout of the entire data set if you type the name of the data set into the console, like so:
mtcars
There are better ways of examining a data set, which I'll get into later in this series. Also, R does have a print() function for printing with more options, but R beginners rarely seem to use it.
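For a quick look that doesn't print every row, base R has a few inspection functions. The following is a minimal sketch using the built-in mtcars data; the particular functions chosen here are illustrative, not the ones the series goes on to cover:

```r
# Dimensions of the built-in data set: number of rows, then columns.
dim(mtcars)

# Show only the first three rows instead of the full printout.
head(mtcars, 3)

# print() with an option: limit the output to 3 significant digits.
print(head(mtcars, 3), digits = 3)
```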
Existing local data
R has a function dedicated to reading comma-separated files. To import a local CSV file named filename.txt and store the data into one R variable named mydata, the syntax would be:
mydata <- read.csv("filename.txt")
(Aside: What's that <- where you expect to see an equals sign? It's the R assignment operator. I said R syntax was a bit quirky. More on this in the section on R syntax quirks.)
And if you're wondering what kind of object is created with this command, mydata is an extremely handy data type called a data frame -- basically a table of data. A data frame is organized with rows and columns, similar to a spreadsheet or database table.
The read.csv function assumes that your file has a header row, so row 1 is the name of each column. If that's not the case, you can add header=FALSE to the command:
mydata <- read.csv("filename.txt", header=FALSE)
In this case, R will read the first line as data, not column headers (and assign default column header names you can change later).
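To see the difference the header argument makes, here is a small self-contained sketch. It writes a made-up two-column file to a temporary location first (the file contents and the names csv_path, with_header, no_header and renamed are assumptions for illustration only):

```r
# Write a tiny sample CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".txt")
writeLines(c("name,score", "Ann,90", "Bob,85"), csv_path)

# Default behavior: the first line supplies the column names.
with_header <- read.csv(csv_path)

# header=FALSE: the first line is read as data and the columns get default names.
no_header <- read.csv(csv_path, header = FALSE)

names(with_header)
names(no_header)

# The default names can be replaced afterward.
renamed <- no_header
names(renamed) <- c("name", "score")
```

With header=FALSE the file yields three rows rather than two, because the line holding the column names is counted as data.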
If your data use another character to separate the fields, not a comma, R also has the more general read.table function. So if your separator is a tab, for instance, this would work:
mydata <- read.table("filename.txt", sep="\t", header=TRUE)
The command above also indicates there's a header row in the file with header=TRUE.
If, say, your separator is a character such as | you would change the separator part of the command to sep="|"
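As a rough illustration of a non-comma separator, the sketch below reads a pipe-delimited file. The sample rows, the temporary file and the names pipe_path and cities are invented for the example:

```r
# Create a small pipe-delimited sample file in a temporary location.
pipe_path <- tempfile(fileext = ".txt")
writeLines(c("city|population",
             "Springfield|30000",
             "Shelbyville|25000"), pipe_path)

# Tell read.table that fields are separated by | and that row 1 holds column names.
cities <- read.table(pipe_path, sep = "|", header = TRUE)

names(cities)
cities$population
```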
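Categories or values? Because of R's roots as a statistical tool, when you import non-numerical data, R may assume that character strings are statistical factors -- things like "poor," "average" and "good" -- or "success" and "failure." (That was the default behavior before R 4.0.0; from version 4.0.0 on, text columns are read in as plain character strings unless you ask for factors.)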
But your text columns may not be categories that you want to group and measure, just names of companies or employees. If you don't want your text data to be read in as factors, add stringsAsFactors=FALSE to read.table, like this:
mydata <- read.table("filename.txt", sep="\t", header=TRUE, stringsAsFactors=FALSE)
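One way to check how a text column was imported is to look at its class. The sketch below sets the factor option explicitly both ways so the result does not depend on the R version; the sample file and the names tsv_path, as_factors and as_text are assumptions made for the example:

```r
# Build a small tab-separated sample file in a temporary location.
tsv_path <- tempfile(fileext = ".txt")
writeLines(c("company\trating", "Acme\tgood", "Globex\tpoor"), tsv_path)

# Read the same file twice, once with text as factors and once as plain strings.
as_factors <- read.table(tsv_path, sep = "\t", header = TRUE,
                         stringsAsFactors = TRUE)
as_text <- read.table(tsv_path, sep = "\t", header = TRUE,
                      stringsAsFactors = FALSE)

class(as_factors$company)
class(as_text$company)
```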
If you'd prefer, RStudio allows you to use a series of menu clicks to load data instead of 'reading' data from the command line as just described. To do this, go to the Environment tab of RStudio's upper-right window (called the Workspace tab in older versions), find the menu option to "Import Dataset," then choose a local text file or URL.
As data are imported via menu clicks, the R command that RStudio generated from your menu clicks will appear in your console. You may want to save that data-reading command into a script file if you're using this for significant analysis work, so that others -- or you -- can reproduce that work.
The 3-minute YouTube video below, recorded by UCLA statistics grad student Miles Chen, shows an RStudio point-and-click data import.
Copying data snippets
If you've got just a small section of data already in a table -- a spreadsheet, say, or a Web HTML table -- you can control-C copy those data to your Windows clipboard and import them into R.
The command below handles clipboard data with a header row that's separated by tabs, and stores the data in a data frame (x):
x <- read.table(file = "clipboard", sep="\t", header=TRUE)
You can read more about using the Windows clipboard in R at the R For Dummies website.
On a Mac, calling pipe("pbpaste") will access data you've copied with command-c, so this will do the equivalent of the previous Windows command:
x <- read.table(pipe("pbpaste"), sep="\t", header=TRUE)
There are R packages that will read files from Excel, SPSS, SAS, Stata and various relational databases. I don't bother with the Excel package; it requires both Java and Perl, and in general I'd rather export a spreadsheet to CSV in hopes of not running into Microsoft special-character problems. For more info on other formats, see UCLA's How to input data into R, which discusses the foreign add-on package for importing several other statistical software file types.
If you'd like to try to connect R with a database, there are several dedicated packages such as RPostgreSQL, RMySQL, RMongo, RSQLite and RODBC. And, the popular dplyr package includes some database support.
read.csv() and read.table() work pretty much the same to access files from the Web as they do for local data.
Do you want Google Spreadsheets data in R? You don't have to download the spreadsheet to your local system as you do with a CSV. Instead, in your Google spreadsheet -- properly formatted with just one row for headers and then one row of data per line -- select File > Publish to the Web. (This will make the data public, although only to someone who has or stumbles upon the correct URL. Beware of this process, especially with sensitive data.)
Select the sheet with your data and click "Start publishing." You should see a box with the option to get a link to the published data. Change the format type from Web page to CSV and copy the link. Now you can read those data into R with a command such as:
mydata <- read.csv("http://bit.ly/10ER84j")
The command structure is the same for any file on the Web. For example, Pew Research Center data about mobile shopping are available as a CSV file for download. You can store the data in a variable called pew_data like this:
pew_data <- read.csv("http://bit.ly/11I3iuU")
It's important to make sure the file you're downloading is in an R-friendly format first: in other words, that it has a maximum of one header row, with each subsequent row having the equivalent of one data record. Even well-formed government data might include lots of blank rows followed by footnotes -- that's not what you want in an R data table if you plan on running statistical analysis functions on the file.
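When a file does carry a title line above the header or footnotes below the data, the skip and nrows arguments of read.csv can sometimes trim them off at import time. This is only a sketch under the assumption that you already know how many lines to skip and how many data rows to keep; the file contents and the names messy_path and clean are made up for illustration:

```r
# A made-up file with a title line, a header, two records, a blank line and a footnote.
messy_path <- tempfile(fileext = ".csv")
writeLines(c("Table 1: Example figures",
             "state,value",
             "Ohio,10",
             "Utah,20",
             "",
             "Note: values are illustrative"), messy_path)

# skip=1 drops the title line; nrows=2 stops before the blank line and footnote.
clean <- read.csv(messy_path, skip = 1, nrows = 2)

clean
```

If the number of data rows isn't known ahead of time, cleaning the file in a spreadsheet or text editor before importing may be the simpler route.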
Help with external data
R enthusiasts have created add-on packages to help other users download data into R with a minimum of fuss.
For instance, the financial analysis package Quantmod, developed by quantitative software analyst Jeffrey Ryan, makes it easy to not only pull in and analyze stock prices but graph them as well.
All you need are four short lines of code to install the Quantmod package, load it, retrieve a company's stock prices and then chart them using the barChart function. Type in and run the following in your R editor window or console for Apple data:
install.packages("quantmod")
library(quantmod)
getSymbols("AAPL")
barChart(AAPL)
Want to see just the last couple of weeks? You can use a command like this:
barChart(AAPL, subset='last 14 days')
chartSeries(AAPL, subset='last 14 days')
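Or grab a particular date range by putting a range string in square brackets after the object name. With dates picked purely as an example:
barChart(AAPL['2013-04-01::2013-04-12'])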
Quantmod is a very powerful financial analysis package, and you can read more about it on the Quantmod website.
There are many other packages with R interfaces to data sources such as twitteR for analyzing Twitter data; Quandl and rdatamarket for access to millions of data sets at Quandl and Data Market, respectively; and several for Google Analytics, including rga, RGoogleAnalytics and ganalytics.
Looking for a specific type of data to pull into R but don't know where to find it? You can try searching Quandl and Datamarket, where data can be downloaded in R format even without needing to install the site-specific packages mentioned above.
Removing unneeded data
If you're finished with variable x and want to remove it from your workspace, use the rm() remove function:
rm(x)
Saving your data
Once you've read in your data and set up your objects just the way you want them, you can save your work in several ways. It's a good idea to store your commands in a script file, so you can repeat your work if needed.
How best to save your commands? You can type them first into the RStudio script editor (top left window) instead of directly into the interactive console, so you can save the script file when you're finished. If you haven't been doing that, you can find a history of all the commands you've typed in the history tab in the top right window; select the ones you want and click the "to source" menu option to copy them into a file in the script window for saving.
You can also save your entire workspace. While you're in R, use the function:
save.image()
That stores your workspace to a file named .RData by default. This will ensure you don't lose all your work in the event of a power glitch or system reboot while you've stepped away.
When you close R, it asks if you want to save your workspace. If you say yes, the next time you start R that workspace will be loaded. That saved file will be named .RData as well. If you have different projects in different directories, each can have its own .RData workspace file.
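You can also save an individual R object for later loading with the save function, giving it the object and a file name (mydata.rda here is just a placeholder):
save(mydata, file="mydata.rda")
Reload it at any time with:
load("mydata.rda")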
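A full round trip of saving, removing and reloading an object might look like the sketch below. The object name scores, its values and the temporary file path are assumptions chosen so the example doesn't touch your real files:

```r
# A small object to save; the values are arbitrary.
scores <- c(90, 85, 77)

# Save it to a temporary file rather than the working directory.
rda_path <- tempfile(fileext = ".rda")
save(scores, file = rda_path)

# Remove the object, then confirm it is gone from the current environment.
rm(scores)
gone <- !exists("scores", inherits = FALSE)

# load() restores the object under its original name.
load(rda_path)
scores
```

Note that load() brings the object back under the name it had when saved, so it will overwrite any current object with that same name.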
Ready to do more with R? Download the free PDF Advanced Beginner's Guide to R.