(This article was first published on rOpenSci Blog - R, and kindly contributed to R-bloggers)
We've received a number of questions from our users about dealing with the finer details of data sources on the web. Whether you're reading data from local storage such as a csv file, a .Rdata store, or possibly a proprietary file format, you've most likely run into some issues in the past. Common problems include passing incorrect paths, files being too big for memory, or requiring several packages to read files in incompatible formats. Reading data from the web entails a whole other set of challenges. Although there are many ways to obtain data from the web, this post primarily deals with retrieving data from Application Programming Interfaces also known as APIs.
To demystify some of the issues around getting data from the web in R, here's some background on some of the common issues we get questions about:
REST APIs
First, let's cover the types of APIs we work with. We prefer, and almost exclusively get data from providers via, REST APIs. REST stands for Representational State Transfer. This isn't a specific set of rules, but more a set of principles followed. They are
Use HTTP methods explicitly. These are GET, PUT, POST, DELETE
Expose directory structure-like URIs.
Transfer XML, JavaScript Object Notation (JSON), or both.
Be stateless. This one we'll explain in more detail. One advantage of REST over other data transfer protocols is that it is stateless. state simply refers to the state of a request for data. In other protocols like SOAP the state is preserved on the server, while in REST, the state is all controlled by the client (the user). state can be conrolled by the user in REST calls via parameters passed like limit, page, etc. (see below). In short, REST APIs are nicer to build and use (in many cases).
Where do the data come from?
Where are data actually stored? Data are stored in some sort of database, sometimes relational column-row based database such as a SQL database (MySQL, PostgreSQL, etc.), or perhaps a NoSQL databases like CouchDB (NoSQL databases don't have a column-row setup, but store documents, like a text document, XML text, and even binary files). Nearly all of the data that can be obtained through one of our packages is likely stored in a relational database.
If there is an ability to search in the API, data providers often provide this by using an indexing tool like Solr or Elasticsearch to provide much richer search on their data.
Databases can be hosted by the institutions serving the data, but most often they are located on some cloud service like Amazon Web Services.
Data vs. metadata vs. data/metadata
Information providers differ in many ways, but one of them is what kind of information they provide. Some provide data directly in response to the request. GBIF's API is a good example of this. If you click on this URL http://api.gbif.org/v0.9/occurrence/search?taxonKey=1&limit=2 you will see JSON formatted data in your browser window.
Other providers give back metadata in response to requests. Dryad is a good example of this. If you search for a dataset on Dryad, you get back metadata describing the data. Another call is required to get the data itself.
Another set of providers only has, and thus can only provide, metadata. For example, our rmetadata R package interacts with a number of APIs for scholarly metadata. These are metadata describing scholarly works, and don't provide the content of the works they describe (e.g., full content of a journal article).
The issue of what is returned from the provider is separate from the above topic about the type of database - data can be returned from a SQL (column-row database) or a NoSQL database.
Data return limits
Data providers often impose limits on the number of requests within a certain time frame to avoid slowing down the system, and to aid in supporting simultaneous data requests.
Given these limits, data providers impose limits on how many records can be returned in a single query. This parameter is usually called limit. There is often a maximum number of requests you can request at any given time, and is an arbitary number set by the data provider. Because there is a limit, an additional parameter is often used called offset (or start). offset is the record to start at for the returned data. Let's say a data provider you want data from has a maximum limit value of 1000, and you want 3000 records. If you want to get them all, you'd have to make 3 different data requests, setting limit to 1000 for each request, and offset to 1 for the first request, 1001 for the second, and 2001 for the third request.
Another approach is with a set of parameters: page and page_size. For example, if you wanted 3000 records returned, but the limit is 1000 records per page, you could set page_size to 1000, and page to 1, 2, and 3 on three separate requests. Some combination of the parameters limit, offset (or start), page, and page_size are available in many of the functions in rOpenSci R packages.
In some functions in rOpenSci packages, we hide this detail from you and just expose the limit parameter to you so that if you set limit = 15000 and the data provider allows a max of 1000, we figure out the calls needed, and make those for you so a many calls appear as one to the user. We haven't been consistent across our packages in doing this method or letting the user manually do multiple calls.
Parameters
Parameters can vary widely among data providers. The ones mentioned above, limit, page, and start are relatively common among data providers. There is often a query parameter called query or simply q (Bond anyone?). If there is authentication (see below), there is often an api_key or key parameter. For searches of data that are geographically defined, there is often a geometry parameter (often accepts Well Known Text polygons), or bbox for bounding box (which accepts NW, SE, SW, and NE points defining the bounding box).
As you are using the API of a data provider through an R package we have written, we can set parameters differently, then just match them up internally in our code - so be aware that the parameter you see in the R function is not necessarily the same as the parameter name in the API. For example, if the parameter defined in the API is theQuickBrownFoxJumpsOverTheLazyDog, we'd likely use a parameter in the function that is much shorter and easier to remember, like foxdog as a shorthand.
HTTP codes
HTTP codes are a system for signaling the status of an HTTP request in a 3-digit number. APIs return such codes so that developers can quickly resolve a problem. We are starting to expose these HTTP codes in our R packages when something goes wrong, and mostly not showing them when the call worked as planned. Success is indicated by 2XX series codes, where 200 indicates standard successful request, whereas 206 indicates that the provider is only giving partial content back. Here are error codes you may get when something is wrong - it's good to be aware of them:
4XX series codes indicate a client error - or rather, a user error. Common codes include
400 Bad request - query specified incorrectly
401 Unauthorized - you need to provide authentication or what was provided was incorrect
403 Forbidden - you aren't allowed to access the resource
404 Not found - often indicates that youre query was not constructed correctly
5XX series codes indicate a server error, or a data provider error. Common ones include
501 Not implemented - implying future availability
503 Service unavailable - server unavailable, usually (hopefully) temporary
If you get any of the 4XX series errors using our R packages, check to make sure your query is correct, and if so, get in touch with us as we may need to fix a bug in our code. If you get a 5XX series code, it suggests that something is wrong on the provider's end. If this problem perists, contact the data provider directly or let us know.
Authentication
Most providers require authentication to limit access to paying customers, impose limits on users, and/or simply to track data usage. These limits also prevent a few heavy users from monopolizing the resource.
Authentication usually happens in one of three ways:
Username/password combo: You usually pass these in as parameters to a function call. This is generally not a great way to do authentication, but sometimes data sources require this. In our experience, this method is rarely used.
API key: Pass in an alphanumeric key as a parameter in a function call.
OAuth: In R, a web page opens from your R session where you have to enter some personal information, or simply get redirected back to R saying you're all set. This method is relatively easy, but is an extra step not required in the above two methods.
Wrap up
We hope this been useful for those that weren't familiar with these things previously. If you have questions about APIs, ask below, on Twitter, or email us.
To leave a comment for the author, please follow the link and comment on his blog: rOpenSci Blog - R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...