Post #2072: R, the second task.

Posted on December 23, 2024

 

This is my second post on learning the R computer language.  That, after a lifetime of using the SAS language to manipulate and analyze data files.

I’m learning R piecemeal, one task at a time.  My first task was to show the upward trend in the annual minimum temperature recorded for my location (Post #1970).  Today’s task is to make a pretty picture.  I want a choropleth (heat-map) showing income level by Census Block Group, for Fairfax County, VA.

I succeeded (below).

But it was not quite the thrill I thought it would be.


The bottom line

If you came here on the off chance that you, too, wanted to use R to produce a Census Block Group choropleth of income in Fairfax County, VA, then assuming you have installed R and RStudio and (see below) assuming you aren’t running Windows 7 or earlier, it’s as easy as:

install.packages("tidycensus")
library(tidycensus)

Your_Name_Here <- get_acs(
  geography = "block group",
  variables = "B19013_001",
  state = "VA",
  county = "059",
  year = 2020,
  geometry = TRUE
)

plot(Your_Name_Here["estimate"])


Kind of anti-climactic, really.  For something that I thought was hard to do.  Or, possibly, it actually is hard to do, but somebody’s already done all the hard work.

I more-or-less stumbled across this example on-line.   It worked.  That’s pretty much end-of-story.

By way of explanation:

The install.packages and library commands make the tidycensus package available to your R session.  If required, install.packages will automatically download and install the package from CRAN (the Google Play Store of the R world); if you’ve already installed it and it’s up to date, R just moves on to the next command.  library is what attaches the tidycensus package to your R program (script).

Tidycensus defines the “get_acs” command.  That reaches out and obtains your specified file from the Census Bureau.  (That’s via an API, and, optionally, you can get your very own API key from Census and list that in the program.)  In particular, this is asking for data from the American Community Survey, but you could ask for data from the decennial census instead (via get_decennial).

The important part is that this Census file brings its “geometry” with it.  That is, each line of the file — each geographic unit — in this case, each Census Block Group — comes with the detailed line-segment-by-line-segment description of its boundaries.  That description sits in a great big long variable-length text field at the end of the record.  (Including the geometry with the file increases the file size by a couple of orders of magnitude, which probably explains why it’s optional.)
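To make that concrete without downloading anything, here’s a toy sketch — my own made-up data, not Census data — showing that an “sf” data frame is just a regular data frame plus a geometry column, and that the same plot() call works on it.  It assumes the sf package (which tidycensus pulls in as a dependency) is installed:

```r
library(sf)

# Helper that builds a 1x1 square polygon starting at x -- a stand-in for a CBG boundary
sq <- function(x) st_polygon(list(rbind(c(x, 0), c(x + 1, 0), c(x + 1, 1), c(x, 1), c(x, 0))))

# Two pretend "block groups" with made-up income estimates
toy <- st_sf(
  GEOID    = c("a", "b"),
  estimate = c(50000, 75000),
  geometry = st_sfc(sq(0), sq(1))
)

class(toy)             # both "sf" and "data.frame"
plot(toy["estimate"])  # the same plot() call as the real example above
st_drop_geometry(toy)  # peeling off the geometry leaves an ordinary data frame
```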

(This job also brings a bit of data, but you have to study the arcana of Census files to know that B19013 is median household income.  I think _001 signifies the entire population.  Plus, that hardly matters.  You can merge your own CBG-level data values to this file and use those to make a CBG-level heat map.)
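If you’d rather look the codes up from inside R, tidycensus ships a load_variables function that downloads the Census variable dictionary.  A sketch (it needs a network connection the first time; cached afterward):

```r
library(tidycensus)

# Dictionary of variables for the 2020 5-year American Community Survey
v20 <- load_variables(2020, "acs5", cache = TRUE)

# Look up the income table used above; B19013_001 should come back
# described as median household income in the past 12 months
subset(v20, grepl("^B19013", name))
```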

Once you have the Census file, with the Census geometry on it, you can easily find something in R that will plot it as a map.  I think a lot of that happens natively in R because tidycensus will create your Census file as an “sf” (simple features) data frame, when you keep the geometry.  Because it’s that type of object, R then knows that you want to use it to draw a picture, and … apparently R handles the rest through some reasonable defaults, in its native plot command.

Or something.

I’m not quite sure.

Plus, there appear to be many R packages that will help you make prettier plots.  So if the simple plot command doesn’t do it for you, I’m sure there’s something that will.
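For example, the ggplot2 package understands sf data frames directly, through geom_sf.  Here’s a hedged sketch of a fancier version of the same map, reusing the Your_Name_Here file from above (the theme and color-scale choices are just my assumptions, not anything the plain plot() command does):

```r
library(ggplot2)

ggplot(Your_Name_Here) +
  geom_sf(aes(fill = estimate), color = NA) +  # color each block group by income
  scale_fill_viridis_c(option = "plasma") +    # a continuous color scale built into ggplot2
  labs(
    title = "Median household income by Census Block Group",
    subtitle = "Fairfax County, VA (2016-2020 ACS)",
    fill = "Income ($)"
  ) +
  theme_void()  # drop axes and gridlines; it's a map, not a chart
```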

This last bit pretty much sums up my take on R, so far, after a lifetime of programming in SAS.


And yet, it took me two days.

To produce and run that four-line program above.

My biggest mistake, as it turns out, was learning R on my trusty Windows 7 laptop.  That worked fine; it just required finding and installing outdated copies of R and RStudio.  (R is the language; RStudio is the interface you use to write and run programs (scripts) in R.)

This worked fine for straightforward data manipulations (read in from disk, calculate stuff, write out to disk).  I was able to pick up the rudiments of R this way, with zero barrier to entry.

But this didn’t work at all for this task.  I tried several R packages that promised to produce choropleths, only to face disappointment coupled with cryptic error messages.

Eventually it dawned on me that some of what R was trying to do, as a matter of routine, in 2024, was perhaps non-existent when Windows 7 was launched in 2009.

So that led to a big while-you’re-at-it: installing the latest version of R and RStudio on my glacially-slow Windows 10 laptop.  I shun that machine, and for good reason.  But everything R-related ran just hunky-dory.  If very slowly.

All R-related incompatibilities ceased.

The desired choropleth emerged.


Living the R lifestyle.

Below are the biggest things I’ve noticed in programming in R, compared to SAS.

To make sense of this, translate the R term “packages” as add-ons, or plug-ins, or extensions, or whatever rings a bell.  They are things that add functionality to a base piece of software.

First, there’s a whole sub-market catering to the SAS-to-R switchers.  Everything from excellent cheat sheets for R equivalents of common SAS tasks, to at least one R package (“procs”) that lets R mimic a few handy SAS procedures (freq, means, print, and some others).

Second, there are more than 20,000 R packages on CRAN.  The Comprehensive R Archive Network is like the Google Play Store of the R world.  It’s where all the interesting optional software is kept.  There’s some organization to all of that, but I’m not quite sure how much.  There’s an index, of sorts, but I haven’t used it yet.

Third, some chunk of that package-intensive computing just makes up for base R being not very useful.  A whole lot of example programs assume you’ve attached the “tidyverse” package, plausibly because a lot of the basic commands in tidyverse are routinely useful things that base R lacks.

Fourth, the whole “package” thing has no (or little) top-down organization.  Near as I can tell, nothing prevents different package writers from defining the same command or same operator differently.  As a SAS guy, that strikes me as a major quality-control problem just waiting to happen.  But the upshot is that the list of packages used (via library statements) is an integral part of a well-documented program.

Five, now all restaurants are Taco Bell: all files are spreadsheets.  By that I mean that R can only work on files that will fit into computer memory (RAM).  Whereas SAS can work on files of essentially unlimited size, but that’s by working disk-to-disk or tape-to-tape.  That has some odd spillovers to programming style, where R seems to favor making many little changes (like formulas in spreadsheet columns), where SAS favored one long data step, in which a complex series of calculations was carried out in one “pass” of an underlying data file.

Six, R names are case-sensitive.  As a SAS programmer, I sure wish they weren’t.  E.g., Var and var are two different names, of two different variables.  I’m stuck with having to respect that, if only for the next reason.

Seven, R does a dandy job of reading data out of spreadsheets.  By far the easiest way to import data into R is .csv or spreadsheet.  In both cases, the variable names “come with”, so you inherit the data and the names that the data creators used.

Eight, slang; that is, short- and long-form grammar for commands.  I’ve already come across two forms of the merge function, one of which spells it all out, and one of which is abbreviated.

Nine, R can only merge two files at once, natively.  I think that’s right.  The original (non-slang) form of the merge function makes that clear with its “x =, y =” terminology, which pretty clearly only accommodates two files.
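To make items eight and nine concrete, here’s a toy base-R sketch (made-up data frames, not Census files): the spelled-out form of merge() with its x = and y = arguments, and the usual workaround for more than two files, which is folding merge() over a list with Reduce():

```r
# Three made-up files sharing a common identifier
people <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cat"))
income <- data.frame(id = c(1, 2, 4), income = c(50000, 60000, 70000))
region <- data.frame(id = c(1, 2, 3), county = c("059", "059", "013"))

# Long form: x and y spelled out; only ids present in both survive (an inner join)
both <- merge(x = people, y = income, by = "id")

# merge() takes exactly two data frames, so a three-way match-merge
# is done pairwise, folding merge() across a list with Reduce()
all3 <- Reduce(function(a, b) merge(a, b, by = "id"), list(people, income, region))
```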


Conclusion

I don’t think I’m ever going to be a big fan of R.

But, R will do.  It’s good enough for doing all kinds of “serious” data set manipulations (e.g., match-merging files based on some common identifier or identifiers).

And it’s kind of like a lottery.  If somebody has already written a package that’s just spot-on for something you’re trying to do, then all you need is a few magic words, and voila.