Above, that’s a plot of the day of the year on which Fall first frost occurred, at Dulles Airport, VA, for the years between 1965 and 2022 (ish).
In theory, that’s the same data as this plot, that I did some time ago:
The interesting thing is that I ran the first plot above using the computer language R. The second plot came from analysis of the same NOAA data, using SAS (then dumped out to Excel for plotting).
Two days ago, I knew nothing about R. But as it turns out, once you learn the quirks of the language, R is pretty understandable for an old-school SAS programmer. Just a few (maybe four) hours, spread over two days, and I’m up and running in R.
The proximate benefit is that I can cancel my $3400 annual license for SAS. Or, more to the point, I can cancel that without feeling that I have totally abandoned all ability to do statistical analyses, beyond what can be done in Excel. And cut myself off from my entire prior life as a health care data analyst.
Baseball is 90 per cent mental. The other half is physical.
The quote is from Yogi Berra.
At some level, all computer languages designed for data manipulation and statistical analysis do the exact same thing. You sort a file, you select observations out of the file, you do some math on those observations, and you summarize your result in some way.
The logical flow of the R program that I used to create the first graph above is identical to that of the SAS program I had run to create the second graph.
- Take the daily temperature data file from NOAA
- Restrict it to freezing days in the Fall.
- Find the first such freezing day in each year.
- Then tabulate or plot those first Fall frost dates, in some fashion
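As a rough illustration, those four steps can be sketched in a few lines of base R. This is only a sketch on made-up data: the data frame `daily` and the names `fall_freeze` and `first_frost` are hypothetical stand-ins for the NOAA file, and base R's `format()` is used for the date parts so that nothing beyond a plain R install is assumed.

```r
# Toy stand-in for the NOAA daily file: made-up dates and low temperatures (deg F).
daily <- data.frame(
  DATE = c("2020-01-15", "2020-10-20", "2020-11-02", "2020-12-01",
           "2021-02-03", "2021-10-05", "2021-10-28", "2021-11-15"),
  TMIN = c(20, 31, 28, 25, 18, 40, 30, 27),
  stringsAsFactors = FALSE
)
daily$ndate <- as.Date(daily$DATE)
daily$year  <- as.integer(format(daily$ndate, "%Y"))
daily$month <- as.integer(format(daily$ndate, "%m"))

# Steps 1 and 2: keep freezing days (32 F or below) in July through December.
fall_freeze <- subset(daily, TMIN <= 32 & month >= 7)

# Step 3: the first such day in each year is the minimum date within that year.
first_frost <- aggregate(ndate ~ year, data = fall_freeze, FUN = min)

# Step 4: tabulate, expressing each date as a day of the year.
first_frost$day_of_year <- as.integer(format(first_frost$ndate, "%j"))
print(first_frost)
```

The January and February freezes drop out at the filtering step, which leaves one row per year holding the earliest Fall freeze.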
It’s just a question of figuring out the basics of the language. In my case, there were a few stumbling blocks.
Initial stumbling blocks
First, my computer’s out-of-date. I use Windows 7 on a Toshiba laptop. Microsoft no longer supports Windows 7. Toshiba no longer sells laptops. But I don’t want to change, because the existing setup runs well, particularly for number-crunching.
In order to run R on my computer, I had to do some minor updating of Windows, as directed by the instructions for installing R under Windows, as found on CRAN. That went smoothly, after installing the Universal C runtime update from Microsoft.
To be clear, I avoid mucking about with the Windows operating system, if possible. I’ve had too many bad experiences in the past, where updating Windows had undesirable consequences. But this one — and the one below — were installed without apparent incident.
The next problem is that, natively, the R distribution puts you in the “console” when you run it, which is a combination command-line interpreter/line editor/output file. There’s really nothing you can do in the R console except test the syntax of a single line of code.
You type a line of code. You hit return. It executes. But that’s it. Up-arrow to recall earlier lines of code. Results from what you executed get dumped right into the window where you type your line of code.
You can’t write what I would call “a program”, in the console. Turns out, you need another piece of software to enable you to write R programs. So R is the language, but you need something in addition to R itself, to write programs/scripts that run in R.
To write a program (script) in R, you need a script editor. The common choice is Rstudio, which is an IDE (an integrated development environment). Rstudio gives you a window in which to write programs (the Script Editor), in addition to that original R console window. It then interfaces with your installation of R, and runs your script (program) when you tell it to.
For SAS programmers, it’s the logical equivalent of … whatever the screenshot below would be called:
The thing in which I write and run programs. I think of it as “SAS”, but it’s not. It’s just the (inter-)face of SAS, to me. It’s software integrated with SAS (the statistical language) that allows me to write and run SAS programs.
So it is with Rstudio. Far as I can tell, having this or some close substitute is not optional, practically speaking. Maybe something like this actually comes with the native R distribution, but if so, it did not pop up and present itself to me.
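For what it's worth, base R does have a way to run a saved script without an IDE: the `source()` function, called from the console. The sketch below writes a tiny script to a temporary file (standing in for a file saved from any plain text editor) and then runs it. The variable names and the numbers are made up for illustration.

```r
# Write a two-line script to a temporary file, as any text editor could.
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "lows <- c(31, 28, 35)   # made-up low temperatures",
  "n_freezing <- sum(lows <= 32)"
), script_path)

# source() runs every line of the file in the current session.
source(script_path)
print(n_freezing)
```

That works, but it gives you none of the editing and debugging conveniences, so an IDE is still the practical choice.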
The most recent versions of Rstudio will not run on Windows 7, but if you keep asking Google, you’ll eventually stumble across a page that has older versions of Rstudio that will run under Windows 7. I use Version 1.0.153 – © 2009-2017 RStudio, Inc, found on this page. Why I arrived at that version, I am no longer entirely sure. Even with that, the instructions pointed me to an October 2019 Windows security update that I had to install (and reboot) before Windows would accept the Rstudio package to be installed.
Once you have R and Rstudio installed on your machine, you can actually write a multi-line program in R, run it, debug it, and save it.
My first R program
To learn something, I think it’s helpful to have a task you actually want to do. In this case, I have an old analysis of first frost dates, that I had run in SAS. That’s exactly the sort of thing I’d like to be able to run in R. So let me replicate that.
A few points of interest for SAS programmers:
- Comments start with #.
- There is no explicit line terminator character, but you can separate multiple commands on the same line using a semicolon. The stray semicolons below are just force-of-habit from writing SAS for so long. They don’t affect the functioning of the R program.
- But unlike SAS, if you don’t punctuate your comments, it makes them hard to read. You can’t tell where a sentence ends. On the plus side, I think “single-quote-in-Macro-comment” paranoia is something I can leave behind with SAS. So my long-standing habit of omitting apostrophes from comments should be obsolete.
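The semicolon behavior is easy to check with a toy example. This is a minimal sketch, with made-up variable names:

```r
# A comment starts with # and runs to the end of the line.

# Two statements on one line, separated by a semicolon.
a <- 1; b <- 2

# A trailing semicolon, SAS-style, is accepted and changes nothing.
total <- a + b ;

print(total)
```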
So, here’s an R program to read in NOAA weather data, as stored on my hard drive, and plot the Fall first-frost dates. (Note that line-wraps got shortened when copied to the format below, causing some lines below to wrap, when they would not have in the R script itself.)
    # This is an R program, termed an R script.
    # It is being composed in the R script editor in Rstudio
    # Near as I can tell, the R distribution itself only provides a command
    # line interpreter, i.e., a line editor.
    # You need a script editor to be able to write a program aka script.

    # Input data set is from NOAA, I think.
    # It has daily temperature data from Dulles Airport going back to part-year 1963-ish

    x <- read.csv("C://Direct Research Core//GARDEN//First frost date trend//3139732.csv", header = TRUE, sep=",")
    str(x)

    # The code above creates the data frame x (think, SAS file or SAS temporary file X), from the raw .csv input file ;
    # Because the .csv has the column names (variable names) on the first row,
    # this imports the data, using those names, via the header clause ;
    # note the awkward reference to the file as stored under Windows on my computer
    # every "\" in the actual file path/name needs to be replaced with "/" or "\\"
    # (the doubled "//" used above also works).
    # A minor annoyance.
    # Also note that R IS CASE SENSITIVE, so if a variable name was given as all caps,
    # that is how you must refer to that variable in the code ;

    # Next, a crude PROC CONTENTS ;
    # This outputs a list of variables and their attributes to the console ;

    str(x)

    # Below, date (read in as character data) is converted to a numeric date value which I call ndate.
    # That is the equivalent of converting a character string holding a date, to a numeric SAS date ;
    # Like a SAS date, this then lets you do arithmetic on the date ;

    x$ndate <- as.Date(x$DATE)
    x$month <- months(x$ndate)

    # The funny nomenclature here, x$ndate and x$month is to indicate that I want ;
    # to create these as new columns in my data frame, as opposed to ... I'm not ;
    # quite sure what, but if I just named it ndate, I think it would be a vector in
    # the active work session, in no way connected with the data set (data frame) x

    # So this isnt like the SAS DATA AAA; SET BBB nomenclature. There, if you
    # create a variable in normal (non-macro) SAS code, that variable is in
    # the data set you are creating. You cant do calculations "outside of"
    # the dataset that you're working on.

    # But in R, the default is to create a variable in your temporary
    # workspace. So if you want the variable to be in the new data set
    # you have to tell R that by prefixing with the dataset (data frame) name.

    # Next, where SAS or Excel typically provides a broad array of native functions ;
    # R is old-school and, even for fairly basic stuff, requires you to read in those ;
    # functions. In this case, after looking at on-line examples, I want to use
    # the month, year, and yday functions in library lubridate
    # (the aggregate function itself comes with base R). I believe that lubridate was
    # either included with my R distribution, or somehow R seamlessly finds it on the
    # internet and downloads it
    # Or something;
    # (If it is missing, install.packages("lubridate") fetches it) ;

    # Ok, quick test of the data, show average low temp by month, entire dataset ;

    library(lubridate)
    bymonth <- aggregate(x$TMIN~month(ndate),data=x,FUN=mean)
    print(bymonth)

    # The above is like running ;
    # PROC SUMMARY DATA = X ;
    # CLASS MONTH ; * BUT CALCULATED ON THE FLY AS MONTH(NDATE) ;
    # OUTPUT OUT = BYMONTH MEAN = /AUTONAME ;
    # RUN ;
    # PROC PRINT DATA = BYMONTH;
    # RUN ;
    # and sure enough, I get mean low temp by month ;

    # NOTE THAT UPPER and lower MATTERS HERE, unlike SAS ;
    # So NAME is not the same as name or Name ;

    # I AM NOT ENTIRELY SURE WHAT THE TABLE THING DOES. THE RESULTS ARE NOWHERE ;
    # NEAR AS USEFUL AS A PROC FREQ OUTPUT IN SAS. THE DEFAULT HERE SEEMS TO BE ;
    # TO GIVE YOU A LIST OF ALL THE VALUES IN THE DATASET.

    w = table(x$NAME)
    print(w)
    W = table(x$TMIN)
    print(W)

    # NOW CREATE NEW DATASETS FOR ANALYSIS,
    # TAKE ONLY THE DAYS AT FREEZING OR BELOW ;
    # THEN SORT BY DATE ;

    x2 <- subset(x, TMIN <= 32)
    x3 <- x2[order(x2$ndate),]

    # BUT THIS STILL HAS E.G., JANUARY FREEZING DAYS IN IT ;
    # AS SHOWN BY CALCULATING AND TABULATING THE MONTHS PRESENT ;
    # TEST BELOW WOULD HAVE VALUE 1 FOR JANUARY AND SO ON ;

    test = month(x3$ndate) ;
    W = table(test)
    print(W)

    # month is numeric ;
    # NOW RESTRICT TO MONTHS BETWEEN JULY AND DECEMBER ;

    x4 <- subset(x3, month(ndate) > 6)
    x5 <- subset(x4, month(ndate) < 13)

    # Im sure theres a way to do that in one step but I do not know it yet ;

    # NOW FIND THE FIRST FROST DATE EACH YEAR AS THE MINIMUM OF THE DATES ;
    # REMAINING IN THE FILE, AT THIS POINT ;

    frost <- aggregate(x5$ndate~year(ndate),data=x5,FUN=min)
    print(frost)
    str(frost)
    print(frost$`x5$ndate`)

    # IF YOU THOUGHT THE NAMING CONVENTIONS WERE AWKWARD, ON THE AGGREGATE STEP ;
    # THE DEFAULT VARIABLE NAME, OF THE THING THAT HOLDS THE VALUE YOU JUST ;
    # AGGREGATED, IS FILE$VARNAME. BUT TO REFER TO IT, YOU HAVE TO SURROUND THE
    # VARNAME OF THAT TYPE WITH BACKTICKS ;

    # BELOW, CREATE A "NORMALLY NAMED" VARIABLE IN THE FROST DATASET ;
    # THEN CHUCK OUT EVERYTHING BEFORE 1965 AS REPRESENTING INCOMPLETE YEARS OF DATA ;

    frost$ndate = frost$`x5$ndate`
    frost <- subset(frost, year(ndate) > 1964)

    # CREATE THE JULIAN DAY, THAT IS 1 TO 366 ;

    frost$julian_day <- yday(frost$ndate)
    print(frost$julian_day)
    median(frost$julian_day)

    # The latter computes and prints the median to the console ;
    # answer is 290 ;
    # Thats about October 17 (day 290 of a non-leap year), more or less correct

    plot(year(frost$ndate),frost$julian_day)

    # FINALLY, TO RUN THIS SCRIPT/PROGRAM, HIGHLIGHT IT, THEN HIT "RUN" AT THE TOP OF THE
    # SCRIPT EDITOR WINDOW ;
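Two of the rough spots flagged in the script's comments look like they have simple fixes: the two-step month restriction (`x4`, then `x5`) and the backtick-quoted column name that comes out of `aggregate`. The sketch below is an illustration, not a tested drop-in replacement. It uses made-up data in a small stand-in for `x3`, and it builds hypothetical helper columns `mon` and `yr` with base R's `format()` so that it runs without lubridate.

```r
# Made-up freezing days, standing in for the x3 data frame in the script.
x3 <- data.frame(
  ndate = as.Date(c("2020-01-15", "2020-10-20", "2020-11-02",
                    "2021-02-03", "2021-10-28", "2021-11-15")),
  TMIN  = c(20, 31, 28, 18, 30, 27)
)

# Base R stand-ins for lubridate's month() and year().
x3$mon <- as.integer(format(x3$ndate, "%m"))
x3$yr  <- as.integer(format(x3$ndate, "%Y"))

# One subset() call, with the two conditions joined by &, replaces the x4 / x5 pair.
x5 <- subset(x3, mon > 6 & mon < 13)

# Bare column names in the formula give output columns named yr and ndate,
# so no backtick-quoted `x5$ndate` is needed afterwards.
frost <- aggregate(ndate ~ yr, data = x5, FUN = min)
print(names(frost))
print(frost)
```

With lubridate loaded, the same one-step filter should work directly on the date, as in `subset(x3, month(ndate) > 6 & month(ndate) < 13)`.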
An initial judgment
There are a lot of things about R that I find awkward, compared to SAS. But so far, there are no stoppers. What was PROC SORT in SAS is now an order command in R. A SAS PROC SUMMARY statement becomes an aggregate command in R. And so on.
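To make the PROC SORT and PROC SUMMARY comparison concrete, here is a minimal sketch on made-up data. The data frame `temps` and its station labels are hypothetical.

```r
# Made-up daily lows for two stations (hypothetical labels A and B).
temps <- data.frame(
  station = c("B", "A", "B", "A"),
  month   = c(1, 2, 2, 1),
  TMIN    = c(22, 30, 27, 25)
)

# Roughly PROC SORT; BY station DESCENDING TMIN;
# order() returns row positions, which are then used to reorder the rows.
sorted <- temps[order(temps$station, -temps$TMIN), ]

# Roughly PROC SUMMARY with CLASS station and the MEAN of TMIN.
means <- aggregate(TMIN ~ station, data = temps, FUN = mean)

print(sorted)
print(means)
```

One difference from SAS is that `order()` does not sort anything by itself. It just hands back the row order, and the square brackets do the rearranging.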
I’m sure I’m going to miss SAS’s automatic treatment of missing values. I’ll probably miss the SAS system of value labels at some point.
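A small sketch of what that difference looks like in practice, on made-up numbers. In R a missing value is `NA`, and it propagates through arithmetic and comparisons unless you say otherwise, whereas SAS procedures skip missing values and SAS comparisons treat a missing value as smaller than any number. For value labels, the closest base R analogue I know of is a factor with labels. The variable names here are hypothetical.

```r
# Made-up lows with one missing reading.
tmin <- c(30, NA, 34)

# A single missing value makes the plain mean missing.
m_all <- mean(tmin)

# na.rm = TRUE drops missing values first, closer to what a SAS procedure does.
m_ok <- mean(tmin, na.rm = TRUE)

# A comparison against NA is itself NA, not TRUE or FALSE.
is_freezing <- tmin <= 32

# subset() drops rows where the condition is NA.
readings <- data.frame(tmin = tmin)
freezing <- subset(readings, tmin <= 32)

# A factor with labels plays a role similar to a SAS value label (format).
season <- factor(c(2, 1, 2), levels = c(1, 2), labels = c("Spring", "Fall"))

print(m_all)
print(m_ok)
print(is_freezing)
print(nrow(freezing))
print(table(season))
```

The practical upshot is that a filter like `TMIN <= 32` quietly excludes missing readings in R, where the same condition in SAS would have kept them.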
But just for messing about with data, R seems to do well enough for my purposes.
After holding (and paying for) my own SAS license for close to 30 years now, I’m finally giving that up.
I had been dreading learning a SAS replacement. I figured I would be floundering around for weeks. But R is intuitive enough, for a long-time SAS user, that it really doesn’t seem like it’s going to be any problem at all to pick up R as a language for data analysis.