Post #2072: R, the second task.

 

This is my second post on learning the R computer language.  That, after a lifetime of using the SAS language to manipulate and analyze data files.

I’m learning R piecemeal, one task at a time.  My first task was to show the upward trend in the annual minimum temperature recorded for my location (Post #1970).  Today’s task is to make a pretty picture.  I want a choropleth (heat map) showing income level by Census Block Group, for Fairfax County, VA.

I succeeded (below).

But it was not quite the thrill I thought it would be.


The bottom line

If you came here on the off chance that you, too, want to use R to produce a Census Block Group choropleth of income in Fairfax County, VA, then, assuming you have installed R and RStudio, and assuming (see below) you aren’t running Windows 7 or earlier, it’s as easy as:

install.packages("tidycensus")   # one-time download and install from CRAN
library(tidycensus)              # attach the package to this session


Your_Name_Here <- get_acs(
  geography = "block group",
  variables = "B19013_001",      # median household income
  state     = "VA",
  county    = "059",             # Fairfax County
  year      = 2020,
  geometry  = TRUE               # bring the boundary coordinates along
)


plot(Your_Name_Here["estimate"])


Kind of anti-climactic, really.  For something that I thought was hard to do.  Or, possibly, it actually is hard to do, but somebody’s already done all the hard work.

I more-or-less stumbled across this example on-line.   It worked.  That’s pretty much end-of-story.

By way of explanation:

The install.packages and library commands make the tidycensus package available to your R session.  If required, install.packages will automatically download and install the package from CRAN (the Google Play Store of the R world); if you’ve already installed it and it’s up to date, R just moves on to the next command.  library is what attaches the tidycensus package to your R program (script).
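For what it’s worth, a common idiom skips the re-install when the package is already present (a sketch; only the package name changes):

if (!requireNamespace("tidycensus", quietly = TRUE)) {
  install.packages("tidycensus")
}
library(tidycensus)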

Tidycensus defines the “get_acs” command.  That reaches out and obtains your specified file from the Census Bureau.  (That’s via an API, and, optionally, you can get your very own API key from Census and list that in the program.)  In particular, this is asking for data from the American Community Survey, but you could ask for data from the decennial census instead.
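If you do get a key, tidycensus has a helper that stores it for you (“YOUR_KEY_HERE” being the obvious placeholder):

census_api_key("YOUR_KEY_HERE", install = TRUE)   # saves the key for future R sessions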

The important part is that this Census file brings its “geometry” with it.  That is, each line of the file — each geographic unit — in this case, each Census Block Group — comes with the detailed line-segment-by-line-segment description of its boundaries.  That description sits in a great big long variable-length text field at the end of the record.  (Including the geometry with the file increases the file size by a couple of orders of magnitude, which probably explains why it’s optional.)
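If you’re curious what that costs, one rough check is to compare the object’s size with and without the geometry (a sketch; st_drop_geometry comes from the sf package, which tidycensus installs as a dependency):

object.size(Your_Name_Here)                        # with the boundary coordinates
object.size(sf::st_drop_geometry(Your_Name_Here))  # the data alone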

(This job also brings a bit of data, but you have to study the arcana of Census files to know that B19013 is median household income.  I think -001 signifies entire population.  Plus, that hardly matters.  You can merge your own CBG-level data values to this file and use those to make a CBG-level heat map.)
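As a sketch of that merge (my_cbg_data and my_value are hypothetical, standing in for your own block-group-level data, keyed on the Census GEOID):

my_map <- merge(Your_Name_Here, my_cbg_data, by = "GEOID")
plot(my_map["my_value"])   # the same kind of heat map, from your own variable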

Once you have the Census file, with the Census geometry on it, you can easily find something in R that will plot it as a map.  I think a lot of that happens natively in R because Tidycensus will create your Census file as an “sf” (simple features) data frame when you keep the geometry.  Because the object carries that class, R knows that you want to use it to draw a picture, and apparently handles the rest through some reasonable defaults, in its native plot command.

Or something.

I’m not quite sure.
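You can at least see the class tag that triggers the map-drawing behavior (a sketch, reusing the object from above):

class(Your_Name_Here)   # includes "sf"; that class is what plot() dispatches on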

Plus, there appear to be many R packages that will help you make prettier plots.  So if the simple plot command doesn’t do it for you, I’m sure there’s something that will.
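One common choice is the ggplot2 package, which draws sf objects directly via geom_sf.  A sketch, again reusing the object from above:

library(ggplot2)

ggplot(Your_Name_Here) +
  geom_sf(aes(fill = estimate)) +
  scale_fill_viridis_c() +
  labs(title = "Median household income by Census Block Group",
       fill = "Income ($)")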

This last bit pretty much sums up my take on R, so far, after a lifetime of programming in SAS.


And yet, it took me two days.

To produce and run that four-line program above.

My biggest mistake, as it turns out, was learning R on my trusty Windows 7 laptop.  That worked fine; it just required finding and installing outdated copies of R and RStudio.  (R is the language; RStudio is the interface you use to write and run programs (scripts) in R.)

But that setup didn’t work at all for this task.  I tried several R packages that promised to produce choropleths, only to face disappointment coupled with cryptic error messages.

Eventually it dawned on me that some of what R was trying to do, as a matter of routine, in 2024, was perhaps non-existent when Windows 7 was launched in 2009.

So that led to a big while-you’re-at-it: installing the latest versions of R and RStudio on my glacially slow Windows 10 laptop.  On that machine, everything R-related ran just hunky-dory.  If very slowly.

All R-related incompatibilities ceased.  The desired choropleth emerged.


Living the R lifestyle.

Below are the biggest things I’ve noticed in programming in R, compared to SAS.

To make sense of this, translate the R term “packages” as add-ons, or plug-ins, or extensions, or whatever rings a bell.  They are things that add functionality to a base piece of software.

First, there’s a whole sub-market catering to the SAS-to-R switchers.  Everything from excellent cheat sheets for R equivalents of common SAS tasks, to at least one R package (“procs”) that lets R mimic a few handy SAS procedures (freq, means, print, and some others).

Second, there are more than 20,000 R packages on CRAN.  The Comprehensive R Archive Network is the Google Play Store of the R world.  It’s where all the interesting optional software is kept.  There’s some organization to all of that, but I’m not quite sure how much.  There’s an index, of sorts, but I haven’t used it yet.

Third, some chunk of that package-intensive computing just makes up for base R being not very useful.  A whole lot of example programs assume you’ve attached the “tidyverse” package, plausibly because a lot of the basic commands in tidyverse are routinely useful things that base R lacks.

Fourth, the whole “package” thing has no (or little) top-down organization.  Near as I can tell, nothing prevents different package writers from defining the same command or same operator differently.  As a SAS guy, that strikes me as a major quality-control problem just waiting to happen.  But the upshot is that the list of packages used (via library or require statements) is an integral part of a well-documented program.

Five, now all restaurants are Taco Bell: all files are spreadsheets.  By that I mean that R can only work on files that will fit into computer memory (RAM), whereas SAS can work on files of essentially unlimited size, by working disk-to-disk or tape-to-tape.  That has some odd spillovers into programming style: R seems to favor many little changes (like formulas in spreadsheet columns), where SAS favored one long data step, in which a complex series of calculations was carried out in one “pass” of an underlying data file.
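To illustrate that spreadsheet-column flavor (a sketch, with a made-up data frame):

df <- data.frame(price = c(10, 20, 30), qty = c(3, 4, 5))

df$revenue <- df$price * df$qty   # one new column per statement ...
df$log_rev <- log(df$revenue)     # ... like adding formula columns in Excel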

Six, R names are case-sensitive.  As a SAS programmer, I sure wish they weren’t.  E.g., Var and var are two different names, of two different variables.  I’m stuck with having to respect that, if only because of the next reason.
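A two-line demonstration (sketch):

Var <- 1
var <- 2
identical(Var, var)   # FALSE: two different variables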

Seven, R does a dandy job of reading data out of spreadsheets.  By far the easiest ways to import data into R are .csv files and spreadsheets.  In both cases, the variable names “come with”, so you inherit the data and the names that the data creators used.

Eight, slang: commands come in short-form and long-form grammar.  I’ve already come across two forms of the merge function, one of which spells it all out, and one of which is abbreviated.

Nine, R can only merge two files at once, natively.  I think that’s right.  The original (non-slang) form of the merge statement makes that clear with its “x =, y =” terminology, which pretty clearly only accommodates two files.
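A sketch of the spelled-out form, with two made-up data frames:

left  <- data.frame(id = 1:3, a = c("x", "y", "z"))
right <- data.frame(id = 2:4, b = c(10, 20, 30))

merge(x = left, y = right, by = "id")                # inner join: keeps ids 2 and 3
merge(x = left, y = right, by = "id", all.x = TRUE)  # like a SAS left join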


Conclusion

I don’t think I’m ever going to be a big fan of R.

But, R will do.  It’s good enough for doing all kinds of “serious” data set manipulations (e.g., match-merging files based on some common identifier or identifiers).

And it’s kind of like a lottery.  If somebody has already written a package that’s just spot-on for something you’re trying to do, then all you need is a few magic words, and voila.

Post #1970: Learning R as a veteran SAS programmer.

 

Above, that’s a plot of the day of the year on which Fall first frost occurred, at Dulles Airport, VA, for the years between 1965 and 2022 (ish).

In theory, that’s the same data as in this plot, which I did some time ago:

The interesting thing is that I produced the first plot above using the computer language R.  The second plot came from an analysis of the same NOAA data, using SAS (then dumped out to Excel for plotting).

Two days ago, I knew nothing about R.  But as it turns out, once you learn the quirks of the language, R is pretty understandable for an old-school SAS programmer.  Just a few (maybe four) hours, spread over two days, and I’m up and running in R.

The proximate benefit is that I can cancel my $3400 annual license for SAS.  Or, more to the point, I can cancel it without feeling that I have totally abandoned all ability to do statistical analyses beyond what can be done in Excel, and without cutting myself off from my entire prior life as a health care data analyst.


Baseball is 90 per cent mental. The other half is physical.

The quote is from Yogi Berra.

At some level, all computer languages designed for data manipulation and statistical analysis do the exact same thing.  You sort a file, you select observations out of the file, you do some math on those observations, and you summarize your result in some way.

The logical flow of the R program that I used to create the first graph above is identical to that of the SAS program I had run to create the second graph.

  • Take the daily temperature data file from NOAA.
  • Restrict it to freezing days in the Fall.
  • Find the first such freezing day in each year.
  • Then tabulate or plot those first Fall frost dates, in some fashion.

It’s just a question of figuring out the basics of the language.   In my case, there were a few stumbling blocks.

Initial stumbling blocks

First, my computer’s out-of-date.  I use Windows 7 on a Toshiba laptop.  Microsoft no longer supports Windows 7.  Toshiba no longer sells laptops.  But I don’t want to change, because the existing setup runs well, particularly for number-crunching.

In order to run R on my computer, I had to do some minor updating of Windows, as directed by the instructions for installing R under Windows, as found on CRAN.  That went smoothly, after installing the Universal C runtime update from Microsoft.

To be clear, I avoid mucking about with the Windows operating system, if possible.  I’ve had too many bad experiences in the past, where updating Windows had undesirable consequences.  But this one — and the one below — were installed without apparent incident.

The next problem is that, natively, the R distribution puts you in the “console” when you run it, which is a combination command-line interpreter, line editor, and output window.  There’s really nothing you can do in the R console except test the syntax of a single line of code.

You type a line of code.  You hit return.  It executes.  But that’s it.  Up-arrow to recall earlier lines of code.  Results from what you executed get dumped right into the window where you type your line of code.

You can’t write what I would call “a program” in the console.  Turns out, you need another piece of software to enable you to write R programs.  So R is the language, but you need something in addition to R itself to write programs/scripts that run in R.

To write a program (script) in R, you need a script editor.  The common choice is RStudio, which is an IDE, an integrated development environment.  RStudio gives you a window in which to write programs (the script editor), in addition to that original R console window.  It then interfaces with your installation of R, and runs your script (program) when you tell it to.

For SAS programmers, it’s the logical equivalent of … whatever the screenshot below would be called:

The thing in which I write and run programs.  I think of it as “SAS”, but it’s not.  It’s just the (inter-)face of SAS, to me.  It’s software integrated with SAS (the statistical language) that allows me to write and run SAS programs.

So it is with RStudio.  Far as I can tell, having this or some close substitute is not optional, practically speaking.  Maybe something like this actually comes with the native R distribution, but if so, it did not pop up and present itself to me.

The most recent versions of RStudio will not run on Windows 7, but if you keep asking Google, you’ll eventually stumble across a page that has older versions of RStudio that will run under Windows 7.  I use Version 1.0.153 – © 2009-2017 RStudio, Inc, found on this page.  Why I arrived at that particular version, I am no longer entirely sure.  Even with that, the instructions pointed me to an October 2019 Windows security update that I had to install (and reboot) before Windows would allow the RStudio package to be installed.

Once you have R and RStudio installed on your machine, you can actually write a multi-line program in R, run it, debug it, and save it.

My first R program

To learn something, I think it’s helpful to have a task you actually want to do.  In this case, I have an old analysis of first frost dates that I had run in SAS.  That’s exactly the sort of thing I’d like to be able to run in R.  So let me replicate that.

A few points of interest for SAS programmers:

  • Comments start with #.
  • There is no explicit line terminator character, but you can separate multiple commands on the same line using a semicolon.  The stray semicolons below are just force-of-habit from writing SAS for so long.  They don’t affect the functioning of the R program.
  • But unlike SAS, comments have no terminator character, so if you don’t punctuate your comments, they get hard to read: you can’t tell where a sentence ends.  On the plus side, I think “single-quote-in-macro-comment” paranoia is something I can leave behind with SAS, so my long-standing habit of omitting apostrophes from comments should be obsolete.  (See the two-line sketch just after this list.)
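To illustrate both of those points (a sketch):

a <- 1; b <- 2   # two commands on one line, separated by a semicolon
a + b ;          # a stray trailing semicolon is legal and harmless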

So, here’s an R program to read in NOAA weather data, as stored on my hard drive, and plot the Fall first-frost dates.  (Note that line lengths got shortened when copied into the format below, causing some lines to wrap where they would not have in the R script itself.)

# This is an R program, termed an R script.
# It is being composed in the R script editor in Rstudio
# Near as I can tell, the R distribution itself only provides a command 
# line interpreter, e.g., a line editor.
# You need a script editor to be able to write a program aka script.


# Input data set is from NOAA, I think.
# It has daily temperature data from Dulles Airport going back to part-year 1963-ish


x <- read.csv("C://Direct Research Core//GARDEN//First frost date trend//3139732.csv", header = TRUE, sep=",")

# The code above creates the data frame x (think: SAS temporary file X) from the raw .csv input file ;
# Because the .csv has the column names (variable names) on the first row,
# this imports the data, using those names, via the header clause ;


# Note the awkward reference to the file as stored under Windows on my computer:
# every "\" in the actual Windows file path has to be flipped to a forward slash.
# A single "/" works; so does the doubled "//" used here, or an escaped "\\".
# A minor annoyance.

# Also note that R IS CASE SENSITIVE, so if a variable name was given as all caps,  
# that is how you must refer to that variable in the code ;

# Next, a crude PROC CONTENTS ; 
# This outputs a list of variables and their attributes to the console ;

str(x)

# Below, date (read in as character data) is converted to a numeric date value, which I call ndate. 
# That is the equivalent of converting a character string holding a date, to a numeric SAS date ; 
# Like a SAS date, this then lets you do arithmetic on the date ;

x$ndate <- as.Date(x$DATE)
x$month <- months(x$ndate)

# The funny nomenclature here, x$ndate and x$month, is to indicate that I want ;
# to create these as new columns in my data frame, as opposed to ... I'm not ;
# quite sure what, but if I just named it ndate, I think it would be a vector in 
# the active work session, in no way connected with the data set (data frame) x

# So this isn't like the SAS DATA AAA; SET BBB nomenclature. There, if you
# create a variable in normal (non-macro) SAS code, that variable is in 
# the data set you are creating.  You can't do calculations "outside of"
# the dataset that you're working on.

# But in R, the default is to create a variable in your temporary 
# workspace.  So if you want the variable to be in the new data set
# you have to tell R that by prefixing with the dataset (data frame) name.

# Next, where SAS or Excel typically provides a broad array of native functions ; 
# R is old-school and, even for fairly basic stuff, requires you to read in those ; 
# functions. In this case, after looking at on-line examples, I want base R's 
# "aggregate" function, plus the month() and year() functions from the lubridate 
# package. I believe that lubridate was either included with my R distribution, 
# or got installed somewhere along the way; it is not part of base R

# Or something;

# Ok, quick test of the data, show average low temp by month, entire dataset ; 


library(lubridate)
bymonth <- aggregate(x$TMIN~month(ndate),data=x,FUN=mean)
print(bymonth)

# The above is like running ;
# PROC SUMMARY DATA = X ; 
# CLASS MONTH ; * BUT CALCULATED ON THE FLY AS MONTH(NDATE) ; 
# OUTPUT OUT = BYMONTH MEAN = /AUTONAME ; 
# RUN ; 
# PROC PRINT DATA = BYMONTH; 
# RUN ;

# and sure enough, I get mean low temp by month ; 
# NOTE THAT UPPER and lower MATTERS HERE, unlike SAS ; 
# So NAME is not the same as name or Name ;

# THE TABLE FUNCTION IS A CRUDE PROC FREQ: IT COUNTS EACH DISTINCT VALUE ;
# IN A COLUMN, BUT WITH NO PERCENTAGES OR CUMULATIVES, SO THE RESULTS ARE ; 
# NOWHERE NEAR AS USEFUL AS A PROC FREQ OUTPUT IN SAS. THE DEFAULT HERE IS ;
# TO GIVE YOU A COUNT FOR ALL THE VALUES IN THE DATASET.


w = table(x$NAME)
print(w)
W = table(x$TMIN)
print(W)

# NOW CREATE NEW DATASETS FOR ANALYSIS, 
# TAKE ONLY THE DAYS AT FREEZING OR BELOW ;
# THEN SORT BY DATE ;


x2 <- subset(x, TMIN <= 32)
x3 <- x2[order(x2$ndate),]

# BUT THIS STILL HAS E.G., JANUARY FREEZING DAYS IN IT ; 
# AS SHOWN BY CALCULATING AND TABULATING THE MONTHS PRESENT ; 
# TEST BELOW WOULD HAVE VALUE 1 FOR JANUARY AND SO ON ;

test = month(x3$ndate) ;
W = table(test)
print(W)

# month is numeric ; 
# NOW RESTRICT TO MONTHS BETWEEN JULY AND DECEMBER ;

x4 <- subset(x3, month(ndate) > 6) 
x5 <- subset(x4, month(ndate) < 13)


# I'm sure there's a way to do that in one step, but I do not know it yet ;
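# (Presumably the one-step version is something like
#    x4 <- subset(x3, month(ndate) %in% 7:12)
# but the two-step version above is what actually ran.)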
# NOW FIND THE FIRST FROST DATE EACH YEAR AS THE MINIMUM OF THE DATES ; 
# REMAINING IN THE FILE, AT THIS POINT ;


frost <- aggregate(x5$ndate~year(ndate),data=x5,FUN=min)
print(frost)
str(frost)
print(frost$`x5$ndate`)

# IF YOU THOUGHT THE NAMING CONVENTIONS WERE AWKWARD, ON THE AGGREGATE STEP ; 
# THE DEFAULT VARIABLE NAME, OF THE THING THAT HOLDS THE VALUE YOU JUST ; 
# AGGREGATED, IS FILE$VARNAME. BUT TO REFER TO IT, YOU HAVE TO SURROUND A
# NAME OF THAT TYPE WITH BACKTICKS (`), AS ABOVE ;

# BELOW, CREATE A "NORMALLY NAMED" VARIABLE IN THE FROST DATASET ; 
# THEN CHUCK OUT EVERYTHING BEFORE 1965 AS REPRESENTING INCOMPLETE YEARS OF DATA ;

frost$ndate = frost$`x5$ndate` 
frost <- subset(frost, year(ndate) > 1964)

# CREATE THE JULIAN DAY, THAT IS, 1 TO 365 (366 IN LEAP YEARS) ;
frost$julian_day <- yday(frost$ndate)

print(frost$julian_day)

median(frost$julian_day)

# The latter computes and prints the median to the console ;

# answer is 290 ; 
# That's about October 17, more or less correct

plot(year(frost$ndate),frost$julian_day)

# FINALLY, TO RUN THIS SCRIPT/PROGRAM, HIGHLIGHT IT, THEN HIT "RUN" AT THE TOP OF THE 
# SCRIPT EDITOR WINDOW ;

An initial judgment

There are a lot of things about R that I find awkward, compared to SAS.  But so far, there are no show-stoppers.  What was PROC SORT in SAS is now an order command in R.  A SAS PROC SUMMARY statement becomes an aggregate command in R.  And so on.
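For instance, the sorting idiom, reusing names from the script above (a sketch):

sorted_asc  <- x2[order(x2$ndate), ]    # PROC SORT ; BY ndate ;
sorted_desc <- x2[order(-x2$TMIN), ]    # PROC SORT ; BY DESCENDING tmin ;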

I’m sure I’m going to miss SAS’s automatic treatment of missing values.  I’ll probably miss the SAS system of value labels at some point.
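That first one bites early: where SAS quietly skips missing values, base R propagates them unless told otherwise (a sketch):

vals <- c(1, 2, NA, 4)

mean(vals)                # NA: one missing value poisons the result
mean(vals, na.rm = TRUE)  # 2.33...: what SAS would have done automatically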

But just for messing about with data, R seems to do well enough for my purposes.

After holding (and paying for) my own SAS license for close to 30 years now, I’m finally giving that up.

I had been dreading learning a SAS replacement.  I figured I would be floundering around for weeks.  But R is intuitive enough, for a long-time SAS user, that it really doesn’t seem like it’s going to be any problem at all to pick up R as a language for data analysis.