In a previous post, I described using R to capture data from an online Jack Kirby bibliography and turn it into a graph. The process was reasonably simple, but those of you who were paying close attention may have realised that’s because I was only working with a small subset of the total bibliographical data.
The bibliography is set out like this:
The top line of each section gives a summary, with the date plus the total pages published during that period. On subsequent lines, each assignment is shown in detail, with publication name, issue number, publisher, story title and page count. All I had to do in order to graph Kirby’s career was to use the summary data. Now, though, I want to dig a little deeper, and that means working with the individual assignments.
First step, as before, is to fire up R, load the string manipulation package I’ll be using and download the text from the bibliography website.
library(stringr)

# Read all six pages and join every line into one long string.
# c() puts the pages end to end; handing them to paste() as separate
# arguments would pair lines up element-wise and recycle the shorter pages.
jkall <- paste(
  c(
    readLines("http://www.marvelmasterworks.com/resources/kirby_chronology.html"),
    readLines("http://www.marvelmasterworks.com/resources/kirby_chronology1.html"),
    readLines("http://www.marvelmasterworks.com/resources/kirby_chronology2.html"),
    readLines("http://www.marvelmasterworks.com/resources/kirby_chronology3.html"),
    readLines("http://www.marvelmasterworks.com/resources/kirby_chronology4.html"),
    readLines("http://www.marvelmasterworks.com/resources/kirby_chronology5.html")
  ),
  collapse = ""
)
Now I have a large text string stored in the jkall variable. What I want to do next is separate the assignments by date. I’m going to use the CC0000 colour code (which the bibliography uses to highlight the date, as in the example shown above) and either the closing paragraph tag or, in cases where the closing tag was left off, the next CC0000 code, as separators.
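# perl() switches on lookbehind and lookahead in the stringr version used here.
# stringr 1.0 and later deprecate perl(); there the pattern can be passed as a plain string.
x <- str_extract_all(jkall, perl("(?<=CC0000\\\">).*?(?=<[Pp]>|CC0000)"))

# str_extract_all() returns a list with one element per input string.
# jkall is a single string, so keep the first element.
x <- x[[1]]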
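To see what the lookbehind and lookahead are doing, here is a minimal sketch run against a made-up fragment of HTML. The sample string is only an assumption about the general shape of the markup, not a copy of the real page, and the sketch swaps in base R’s gregexpr() with perl = TRUE for stringr so that it runs without any packages.

```r
# Hypothetical sample: two dated sections. The first has no paragraph tag
# after it, so the next CC0000 code has to act as the separator.
sample_html <- paste0(
  '<FONT COLOR="#CC0000">JANUARY 1962 (2 pages)</FONT><BR>',
  'Example Comics # 1 - Example Publisher : First Story (2)',
  '<FONT COLOR="#CC0000">FEBRUARY 1962 (3 pages)</FONT><BR>',
  'Example Comics # 2 - Example Publisher : Second Story (3)<P>'
)

# Same idea as the pattern in the post: start just after CC0000">,
# stop just before a paragraph tag or the next CC0000.
pattern <- "(?<=CC0000\">).*?(?=<[Pp]>|CC0000)"
sections <- regmatches(sample_html, gregexpr(pattern, sample_html, perl = TRUE))[[1]]
print(sections)
```

Each element of sections starts with the date and carries that period’s assignments with it. In this sample the first element also keeps the opening fragment of the following FONT tag, which is the price of using the colour code as a fallback separator.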
Next, I want to separate the attributes of each assignment into their own fields, so the publication is separated from the issue number, which is separated from the publisher and so on. Looking again at the example above, there are some obvious separators: the ‘#’ character, the ‘-’ character and the ‘:’ character, all surrounded by space characters. I’m going to replace these with tabs. I’m also going to replace the brackets around the page counts with tabs, although this is a little bit more tricky. Some page counts are followed by a plus sign, like ‘(10+)’, or by a question mark. There are also a couple of instances of the string ‘(3-D)’ in the bibliography, which aren’t page counts. The following commands take all that into account.
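# Page counts: a bracket that opens with a digit not followed by '-',
# so '(10+)' is converted but '(3-D)' is left alone.
x2 <- gsub("\\(([0-9](?!-)[[:print:]]*?)\\)", "\t\\1\t", x, perl = TRUE)

# Field separators: ' : ', ' # ' and ' - ' each become a tab.
x2 <- gsub(" : ", "\t", x2)
x2 <- gsub(" # ", "\t", x2)
x2 <- gsub(" - ", "\t", x2)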
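As a quick check of those substitutions, here is a small sketch that applies them to two made-up lines. The helper name tabify and the sample lines are hypothetical, invented for illustration; only the four gsub() calls come from the commands above.

```r
# Hypothetical helper wrapping the four substitutions.
tabify <- function(lines) {
  out <- gsub("\\(([0-9](?!-)[[:print:]]*?)\\)", "\t\\1\t", lines, perl = TRUE)
  out <- gsub(" : ", "\t", out)
  out <- gsub(" # ", "\t", out)
  gsub(" - ", "\t", out)
}

# Made-up sample lines, not taken from the bibliography.
sample_lines <- c(
  "Example Comics # 12 - Example Publisher : A Story (10+)",
  "Example Comics (3-D) # 1 - Example Publisher : Another Story (8)"
)

# Split on the tabs and trim stray spaces to see the resulting fields.
fields <- lapply(strsplit(tabify(sample_lines), "\t", fixed = TRUE), trimws)
print(fields)
```

Both lines come out as five fields. The ‘(10+)’ ends up in the last field as ‘10+’, while ‘(3-D)’ stays attached to the publication name because the digit is followed by a hyphen.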
Next, I get rid of all the html tags and output to a text file.
x2 <- gsub("<.*?>", "", x2)
write.csv(x2, "x2.csv")
With a quick little bit of manipulation in a text editor, this gives me a tab-delimited file. The output I get is pretty good, but it isn’t perfect. Some columns which should only contain page counts have publisher names in them, or story titles. These errors are all due to little inconsistencies in the source: missing white space, or additional ‘#’ characters, or (in the case of things like the Silver Surfer Graphic Novel) lack of an issue number. At that point all it really makes sense to do is sigh, roll up your sleeves and do a bit of manual text editing.
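The manual editing can at least be pointed in the right direction. The sketch below is one possible way to flag rows whose page-count column doesn’t look like a page count. The rows, the temporary file and the column names are all made up for illustration, and it assumes five fields per row in the order described earlier.

```r
# Made-up rows standing in for the tab-delimited file. The third has no
# issue number, so every later field has shifted one column to the left.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "Example Comics\t12\tExample Publisher\tA Story\t10+",
  "Example Comics\t13\tExample Publisher\tAnother Story\t8",
  "Example Graphic Novel\tExample Publisher\tLong Story\t100"
), tmp)

# Column names are hypothetical labels, not headings from the source.
kirby <- read.delim(
  tmp, header = FALSE, quote = "", colClasses = "character",
  col.names = c("publication", "issue", "publisher", "title", "pages")
)

# A page count should be digits, optionally followed by '+' or '?'.
looks_ok <- grepl("^[0-9]+[+?]?$", kirby$pages)
suspect <- which(!looks_ok)
print(kirby[suspect, ])
```

Only the shifted third row is flagged here. A check like this won’t fix anything by itself, but it narrows down which lines need attention in the text editor.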
Once that’s over and done with, I end up with a spreadsheet containing all of Jack Kirby’s published assignments. It looks like this.
To be continued…