The Jack Kirby Dataset, part 1

In a previous post, I described using R to capture data from an online Jack Kirby bibliography and turn it into a graph. The process was reasonably simple, but those of you who were paying close attention may have realised that’s because I was only working with a small subset of the total bibliographical data.

The bibliography is set out like this:

The top line of each section gives a summary, with the date plus the total pages published during that period. On subsequent lines, each assignment is shown in detail, with publication name, issue number, publisher, story title and page count. All I had to do in order to graph Kirby’s career was to use the summary data. Now, though, I want to dig a little deeper, and that means working with the individual assignments.

First step, as before, is to fire up R, load the string manipulation package I’ll be using and download the text from the bibliography website.

library(stringr)
jkall <- paste(
 readLines( "" 
), readLines( "" 
), readLines( "" 
), readLines( "" 
), readLines( "" 
), readLines( "" 
), collapse="")

Now I have a large text string stored in the jkall variable. What I want to do next is separate the assignments by date. I’m going to use the CC0000 colour code (which the bibliography uses to highlight the date, as in the example shown above) and either the closing paragraph tag or, in cases where the closing tag was left off, the next CC0000 code, as separators.

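# perl() belongs to early stringr releases; current ones take the pattern directly.
# [Pp] matches either case of the paragraph tag ([P|p] would also match a literal "|").
x <- str_extract_all(jkall, "(?<=CC0000\\\">).*?(?=<[Pp]>|CC0000)")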
x <- x[[1]]
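
To see what that lookaround pattern pulls out, here is a minimal sketch run on a made-up two-section snippet rather than the real download. It assumes the colour attribute is quoted in the page source (as the escaped quote in the pattern implies) and uses base R's regmatches() and gregexpr() so it runs without extra packages. The names sample_html and sections are only illustrative.

```r
# Made-up snippet in the shape of the bibliography's HTML
# (quoted colour attribute assumed, to match the pattern used above)
sample_html <- paste0(
  '<font color="#CC0000"><u><b>Mar 1938</b></u></font> (2)<br>',
  'Wags # 64 : J B Powers (UK) - <b>The Count of Monte Cristo</b> (1)<P>',
  '<font color="#CC0000"><u><b>Apr 1938</b></u></font> (4)<br>',
  'Wags # 66 : J B Powers (UK)<P>'
)

# Everything after a CC0000"> marker, up to the next <P>/<p> or the next CC0000
pattern <- '(?<=CC0000">).*?(?=<[Pp]>|CC0000)'
sections <- regmatches(sample_html, gregexpr(pattern, sample_html, perl = TRUE))[[1]]

print(sections)
```

Each element of sections starts at a date heading and runs to the end of that date's assignments.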

Next, I want to split the attributes of each assignment into their own fields, so the publication is separated from the issue number, which is separated from the publisher and so on. Looking again at the example above, there are some obvious separators: the ‘#’, ‘-’ and ‘:’ characters, each surrounded by spaces. I’m going to replace these with tabs. I’m also going to replace the brackets around the page counts with tabs, although this is a little trickier. Some page counts are given with plus signs or question marks, like ‘(10+)’. There are also a couple of instances of the string ‘(3-D)’ in the bibliography, which aren’t page counts. The following commands take all that into account.

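# Page counts such as "(1)" or "(10+)" become tab-delimited fields;
# the (?!-) lookahead leaves "(3-D)" alone
x2 <- gsub("\\(([0-9](?!-)[[:print:]]*?)\\)", "\t\\1\t", x, perl=TRUE)
# The " : ", " # " and " - " separators become tabs
x2 <- gsub(" : ", "\t", x2)
x2 <- gsub(" # ", "\t", x2)
x2 <- gsub(" - ", "\t", x2)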
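
As a worked example, here are the same substitutions applied to two lines: the first is the Wags entry from the bibliography, the second is a made-up line included only to exercise the ‘(3-D)’ and ‘(10+)’ cases. The tag-stripping gsub() from the next step is included so the fields print cleanly; assignments, tabbed and fields are illustrative names.

```r
assignments <- c(
  "Wags # 64 : J B Powers (UK) - <b>The Count of Monte Cristo</b> (1)",
  # Made-up line, not from the bibliography
  "Example Comics # 1 : Example Publisher - <b>A (3-D) Story</b> (10+)"
)

# Same substitutions as above, in the same order
tabbed <- gsub("\\(([0-9](?!-)[[:print:]]*?)\\)", "\t\\1\t", assignments, perl = TRUE)
tabbed <- gsub(" : ", "\t", tabbed)
tabbed <- gsub(" # ", "\t", tabbed)
tabbed <- gsub(" - ", "\t", tabbed)
tabbed <- gsub("<.*?>", "", tabbed)

# Split on tabs and trim stray spaces to inspect the individual fields
fields <- lapply(strsplit(tabbed, "\t"), trimws)
print(fields)
```

The ‘(UK)’ and ‘(3-D)’ brackets survive because the pattern needs a digit straight after the opening bracket and refuses a digit followed by a hyphen.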

Next, I get rid of all the HTML tags and write the result out to a file.

x2 <- gsub("<.*?>", "", x2)
write.csv(x2, "x2.csv")

With a quick little bit of manipulation in a text editor, this gives me a tab-delimited file. The output I get is pretty good, but it isn’t perfect. Some columns which should only contain page counts have publisher names in them, or story titles. These things are all due to little inconsistencies: missing white space, or additional ‘#’ characters, or (in the case of things like the Silver Surfer Graphic Novel) lack of an issue number. At that point all it really makes sense to do is sigh, roll up your sleeves and do a bit of manual text editing.
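
Since each line is already tab-separated at this point, one possible shortcut (not the route taken above) is to write the lines out with writeLines() and read them straight back with read.delim(). The sketch below uses a two-line stand-in for x2 and a temporary file. The column names and the second, made-up entry are assumptions for illustration.

```r
# Stand-in for x2: the first line follows the Wags sample, the second is a
# made-up entry with no issue number, so it has one field fewer
rows <- c(
  "Wags\t64\tJ B Powers (UK)\tThe Count of Monte Cristo\t1",
  "Example Graphic Novel\tExample Publisher\tExample Story\t100"
)

out_file <- tempfile(fileext = ".tsv")
writeLines(rows, out_file)

# fill = TRUE pads short rows; quote = "" keeps apostrophes in titles intact
assignments <- read.delim(out_file, header = FALSE, quote = "", fill = TRUE,
                          col.names = c("publication", "issue", "publisher",
                                        "title", "pages"),
                          stringsAsFactors = FALSE)

# Rows whose page count is not a number are candidates for manual fixing
needs_fixing <- is.na(suppressWarnings(as.numeric(assignments$pages)))
print(assignments[needs_fixing, ])
```

This doesn't remove the manual clean-up, but it can narrow down which rows need it.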

Once that’s over and done with, I end up with a spreadsheet containing all of Jack Kirby’s published assignments. It looks like this.

To be continued…


Update: Smoothing Jack Kirby with R

As described in a previous post, I had made a chart of Jack Kirby’s career as a comics artist. The finished product looked like this:

After thinking about it for a while I decided that I want to make some changes. First of all I want to indicate on the chart that Kirby no longer made his living from comics after 1980 (he moved into animation work). Second, maybe a straightforward connect-the-points line graph wasn’t such a great idea. The line spikes up and down a lot, making it difficult to see an overall trend. Instead, I think I want to plot the points and add a trend line using ggplot’s statistical smoothing function.

Step one: indicating Kirby’s retirement. I need to type the information for that into R.

jkhighlight <- data.frame(list(start=c(1943,1961,1980),end=c(1945,1970,1995),
period=c("Military service","The Marvel Age","Retirement")))
jkhighlight$period <- factor(jkhighlight$period, 
levels=c("Military service","The Marvel Age","Retirement"))

(The last two lines above reorder the levels of the ‘period’ element so that it appears in chronological order in the key. Otherwise the levels default to alphabetical order: WWII, then retirement, then Marvel.)
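
A quick base R check of that ordering, using the same three labels (the variable names are illustrative):

```r
period <- c("Military service", "The Marvel Age", "Retirement")

# Default: levels are sorted alphabetically
default_levels <- levels(factor(period))

# Explicit levels: the order given is the order kept
chronological_levels <- levels(factor(period, levels = period))

print(default_levels)
print(chronological_levels)
```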

Now to replot everything as points.

ggplot(yrtotals, aes(x=year,y=pages)) + geom_point()

This gives us:

This doesn’t tell a very clear story, which is why I’m going to need the trend line. Before getting to that, let’s get rid of the grey background, clean up the axes and add a title.

# opts() and theme_text() are from older ggplot2; later releases use theme() and element_text()
last_plot() + theme_bw()
last_plot() + scale_y_continuous("Published comic art pages per year") + 
scale_x_continuous("Source:", breaks=seq(from=1940,to=1995,by=5)) + 
opts(axis.title.x = theme_text(hjust=1, vjust=0, size=8), 
title="Jack Kirby's career in comics", plot.title=theme_text(hjust=0.5, 
vjust=1, size=16))

And add the colour regions:

last_plot() + geom_rect(aes(NULL, NULL, 
xmin=start, xmax=end, fill=as.factor(period)), ymin=0, ymax=1300, 
data=jkhighlight, alpha=0.3) + scale_fill_manual("", values=c("red", 

Now we have:

Still not very obvious what’s actually going on here. So now it’s time to add the trend line.

last_plot() + coord_cartesian(ylim=c(0,1300),xlim=c(1937,1995))
last_plot() + stat_smooth(span=0.16)
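
For a feel of what span does: as far as I know, stat_smooth() falls back to a loess fit for a data set this small, and span sets the fraction of points used in each local fit. The sketch below uses base R's loess() on synthetic data (not Kirby's figures) to compare the span used above with loess's default of 0.75. The names toy, tight and loose are illustrative.

```r
set.seed(42)

# Synthetic yearly series: a slow wave plus noise
toy <- data.frame(year = 1938:1995)
toy$pages <- 500 + 300 * sin((toy$year - 1938) / 8) + rnorm(nrow(toy), sd = 100)

tight <- loess(pages ~ year, data = toy, span = 0.16)  # follows local bumps
loose <- loess(pages ~ year, data = toy, span = 0.75)  # broad trend only

# A smaller span hugs the points more closely, so its residuals are smaller
rss <- c(tight = sum(residuals(tight)^2), loose = sum(residuals(loose)^2))
print(rss)
```

A small span keeps short-lived peaks and dips visible, while a larger one flattens them into a broad trend.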

I’m much happier with the way the graph looks now.

Graphing Jack Kirby’s career using R

God bless Ray Owens, who compiled a web page itemising every page of Jack Kirby art by title, publication date etc. I decided to make a graph of the King’s entire career in comics, so I fired up R, rolled up my metaphorical sleeves and did the following:

First, I need to have the source code of the web pages loaded into R. The bibliography is split over six pages, so I use the following command to download all six and concatenate them into a single text string.

jkall <- paste(
   readLines( "" 
), readLines( "" 
), readLines( "" 
), readLines( "" 
), readLines( "" 
), readLines( "" 
), collapse="")
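
One thing to watch with this call: paste() joins its arguments element by element and recycles the shorter ones, so if the six pages have different numbers of lines, some lines could end up in the result more than once. The sketch below shows the effect on two made-up "pages" and a possible alternative that joins them end to end. page_a and page_b are stand-ins for readLines() results.

```r
# Two made-up pages with different line counts
page_a <- c("<p>a1</p>", "<p>a2</p>", "<p>a3</p>")
page_b <- c("<p>b1</p>")

# Element-wise paste: page_b's single line is recycled three times
recycled <- paste(page_a, page_b, collapse = "")

# Concatenating the vectors first keeps each line exactly once
end_to_end <- paste(c(page_a, page_b), collapse = "")

print(recycled)
print(end_to_end)
```

With real pages the same idea would be to wrap all six readLines() calls in c() before collapsing.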

This gives me one big character string with all the HTML code in it. The code looks something like this:

<font color=#CC0000><u><b>Mar 1938</b></u></font color> (2)<br>Wags # 64 
: J B Powers (UK) - <b>The Count of Monte Cristo</b> (1)<br>Wags # 65 : J
 B Powers (UK) - <b>The Count of Monte Cristo</b> (1)<P><font color=#CC0000>
<u><b>Apr 1938</b></u></font color> (4)<br>Wags # 66 : J B Powers (UK)...

And so on. The parts I’m interested in are the monthly totals, which are always preceded by that ‘CC0000’ colour code. So I dig out the first seventy characters after each instance of that code (I’ll need the functions from the stringr library to do this easily).

library(stringr)
raw <- str_extract_all(jkall,"CC0000.{70}")

I now have a list called raw whose first element, raw[[1]], is a 567-element character vector. It looks like this:

[1] "CC0000><u><b>Mar 1938</b></u></font color> (2)<br>Wags # 64 : J B Powers (UK" 
[2] "CC0000><u><b>Apr 1938</b></u></font color> (4)<br>Wags # 66 : J B Powers (UK"
[567] "CC0000><u><b>Nov 1995</b></u></font color> (2)<br>Dark Horse Presents # 103 "

I now need to pull out the monthly total (the figure in parentheses) and the year for each element, so it’s time to break out the regular expressions.

year <- str_extract(raw[[1]],"19[0-9]{2}")
pages <- str_extract(str_extract(raw[[1]],"\\([0-9]+\\.?\\+?[0-9]?\\)"), 

And I’ll want to store the year and page counts in a data frame.

jkdf <- data.frame(list(pages=as.numeric(pages),year=as.numeric(year)))
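
Here is a minimal sketch of the whole extraction on three sample strings, using base R so it runs without stringr. The first two strings are the ones shown above; the third is made up to exercise a fractional count. The outer pattern used to strip the brackets is an assumption, as is the helper first_match().

```r
raw_sample <- c(
  "CC0000><u><b>Mar 1938</b></u></font color> (2)<br>Wags # 64 : J B Powers (UK",
  "CC0000><u><b>Apr 1938</b></u></font color> (4)<br>Wags # 66 : J B Powers (UK",
  "CC0000><u><b>Jan 1995</b></u></font color> (0.5)<br>Example Title # 1"  # made up
)

# Hypothetical helper: first match of `pattern` in each string, NA if none
first_match <- function(pattern, text) {
  m <- regexpr(pattern, text, perl = TRUE)
  out <- rep(NA_character_, length(text))
  out[m != -1] <- regmatches(text, m)
  out
}

year  <- first_match("19[0-9]{2}", raw_sample)
total <- first_match("\\([0-9]+\\.?\\+?[0-9]?\\)", raw_sample)  # e.g. "(2)"
pages <- first_match("[0-9]+\\.?[0-9]*", total)                 # assumed outer pattern

jkdf_sample <- data.frame(pages = as.numeric(pages), year = as.numeric(year))
print(jkdf_sample)
```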

Okay. So what I have now looks like this:

    pages year
1     2.0 1938
2     4.0 1938
3     2.0 1938
4     8.0 1938
5     8.0 1938
562   3.0 1994
563   2.0 1994
564   2.0 1994
565   2.0 1994
566   0.5 1995
567   2.0 1995

What I want is to sum all the pages by year, ready for plotting on my graph. I create a summary data frame using the aggregate function and rename the elements for simplicity.

yrtotals <- aggregate(jkdf$pages,list(jkdf$year),sum)
names(yrtotals) <- c("year","pages")
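
To make the aggregation concrete, here it is on five of the rows printed above (three from 1938, two from 1995), alongside the formula form of aggregate(), which names the columns itself. The _sample and _formula names are illustrative.

```r
jkdf_sample <- data.frame(pages = c(2, 4, 2, 0.5, 2),
                          year  = c(1938, 1938, 1938, 1995, 1995))

# Same call as above, on the sample rows
yrtotals_sample <- aggregate(jkdf_sample$pages, list(jkdf_sample$year), sum)
names(yrtotals_sample) <- c("year", "pages")

# Formula interface: columns come back already named year and pages
yrtotals_formula <- aggregate(pages ~ year, data = jkdf_sample, FUN = sum)

print(yrtotals_sample)
```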

Now to plot the graph, using ggplot2.

ggplot(yrtotals,aes(x=year,y=pages)) + geom_line()

Et voila:

Doesn’t look terrible, but I want to clean it up a bit. First the axes and plot title.

last_plot() + theme_bw()
last_plot() + scale_y_continuous("Published comic art pages per year", expand=c(0,0),
 limits=c(0,1300), breaks=seq(from=0,to=1300,by=100)) + scale_x_continuous(
"Source:",breaks=seq(from=1940,to=1995,by=5)) + 
opts(axis.title.x = theme_text(hjust=1, vjust=0, size=8), 
title="Jack Kirby's career in comics", plot.title=theme_text(hjust=0.5, 
vjust=1, size=16))
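
The opts() and theme_text() calls above come from older ggplot2 and no longer exist in recent releases. As a rough sketch of how the same styling might look there, the block below uses theme(), element_text() and labs() on a toy data frame with made-up numbers standing in for yrtotals. It assumes ggplot2 is installed.

```r
library(ggplot2)

# Toy yearly totals standing in for yrtotals (made-up numbers)
yrtotals_toy <- data.frame(year = c(1940, 1950, 1960), pages = c(300, 600, 900))

p <- ggplot(yrtotals_toy, aes(x = year, y = pages)) +
  geom_line() +
  theme_bw() +
  scale_y_continuous("Published comic art pages per year", expand = c(0, 0),
                     limits = c(0, 1300), breaks = seq(from = 0, to = 1300, by = 100)) +
  scale_x_continuous("Source:", breaks = seq(from = 1940, to = 1995, by = 5)) +
  labs(title = "Jack Kirby's career in comics") +
  # theme() replaces opts(); element_text() replaces theme_text()
  theme(axis.title.x = element_text(hjust = 1, vjust = 0, size = 8),
        plot.title = element_text(hjust = 0.5, vjust = 1, size = 16))

print(p)
```

theme_bw() comes before theme() so that the custom text settings aren't overwritten by the complete theme.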

I think this is nearly done, but I want to add a couple of annotations. First, that sudden dip to fewer than 100 pages published in 1945 came about because Kirby was fighting in World War II. Second, that sustained burst of productivity in the 1960s is what we now know as the Marvel Age. I’m going to add some shaded rectangles to the graph to highlight those periods.

jkhighlight <- data.frame(list(start=c(1943,1961),end=c(1945,1970), 
period=c("Military service","The Marvel Age")))
last_plot() + geom_rect(aes(NULL, NULL, xmin=start, xmax=end, fill=period),
 ymin=0, ymax=1300, data=jkhighlight,alpha=0.3) + scale_fill_manual("",

Okay, done.