Using Goodreads and Bayes’s Theorem to Find the Next *Great* Read of Your Life

Over the past year or so, I have slowly built up a list of trusted reviewers (“Follows”) to follow on the reading social network Goodreads. Goodreads helpfully features the reviews, ratings, and shelves that my Follows have added for a book title at the top of each title’s page, under those of my Friends.

Since many of my Follows are active on the site and I have read many of their reviews, I have an intimate sense of the breadth and depth of their tastes, and how tough they can be even on classics and award-winners. This was the context for the moment late last year when I, through a series of casual clicks, came to the page of a novel I'd never heard of before, Paul Scott's THE JEWEL IN THE CROWN, and saw this:

That sends a pretty strong signal, doesn’t it? A signal that turned out to be 100% correct. THE JEWEL IN THE CROWN is one of the finest works of fiction that I’ve ever read.

Unfortunately, this is something like being dealt a full house on the first hand: an enjoyable surprise that is impossible to replicate through strategy. There aren’t, to my knowledge, any other novels to which 100% of three or more of my Follows have given five stars. But what about other novels that have received multiple five-star ratings from my Follows? Which of these several dozen strong contenders should I read next?

I was reading Chapter 8 of Nate Silver’s THE SIGNAL AND THE NOISE when it struck me that Bayes’s theorem could be useful here. In just a few hours on Goodreads, I could easily assemble a list of multi-five-star novels from my most trusted Follows, gather ratings data across all twenty Follows for those titles, and source personalized yet objective *x*-, *y*-, and *z*-values (Silver’s shorthand for the prior probability and the two conditional likelihoods in Bayes’s theorem). If I Silverized my book recommendations, could I outperform heuristics and intuition? Could I actually save time and money and choose better books for myself, ones I might otherwise have missed?

Since I read and rate mainly novels on Goodreads, I used the following rough calculation for my *x*-value:

That gave a prior probability of 11.22%, a higher number than I would have guessed. I flatter myself that I am a tougher critic than I actually am. Bayesian methods have already exposed one of my subconscious biases.
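As a sketch, the prior is just the share of my rated novels that earned five stars from me. The sixteen five-star novels come from my shelves; the total count below is a placeholder rather than my exact figure, since only the ratio (about 11.22%) matters here:

```python
# Sketch of the prior (x-value) calculation: the fraction of the novels
# I have rated that earned five stars. The total is a stand-in value.
five_star_novels = 16        # my five-star novels
total_novels_rated = 143     # placeholder; not my exact total

x = five_star_novels / total_novels_rated
print(f"Prior probability of a five-star read: {x:.2%}")
```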

I decided that six kinds of “events” were meaningful to this experiment: a one-star rating, a two-star rating, a three-star rating, a four-star rating, a five-star rating, and a five-star rating accompanied by a special favorites-type shelf (“top-20,” “bitchin!,” etc.), which I call a “5+.” Assembling good data to estimate my *y*-values (the probability of a novel that I rated five stars getting a rating of [1, 2, 3, 4, 5, or 5+] from one of my Follows) was relatively simple given how few novels I’ve given five stars over the years. I recorded and tallied up the ratings given to all of my five-star novels, across titles and Follows, and then it was a matter of simple division:
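In code, that division amounts to pooling every rating my Follows gave to any of my five-star novels and normalizing the tallies. The ratings list below is illustrative, not my actual data:

```python
from collections import Counter

# Sketch of the y-value tally: pool the Follows' ratings of my
# five-star novels, then divide each rating's count by the pool size.
ratings = ["5", "4", "4", "3", "5+", "5", "4", "3", "2"]  # hypothetical

counts = Counter(ratings)
total = len(ratings)
y = {rating: count / total for rating, count in counts.items()}
print(y)
```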

My *z*-values were a different story. For my *z*-values, I needed to estimate the probability of a book getting a rating of [1, 2, 3, 4, 5, or 5+] from one of my Follows if I had given it any rating other than a five. Accurate *z*-values for this experiment would have been easy for a Goodreads employee to pull, but as I am not a Goodreads employee I used much rougher calculations:

Now I had *y*- and *z*-values as follows:

| Rating | 5+ | 5 | 4 | 3 | 2 | 1 |
| --- | --- | --- | --- | --- | --- | --- |
| *y* | 4.1% | 26.5% | 36.7% | 25.5% | 4.1% | 3.1% |
| *z* | 2.4% | 9.2% | 41.9% | 29.4% | 11.8% | 5.3% |

It is a testament to the value of starting with good data (by which I mean the high signal-to-noise ratio in my Follows’ Goodreads output) that my fairly rough estimates yielded such sound-looking probabilities. As you can see, ratings of 5 or 5+ were much rarer in the more general *z* population of books than among the elite sixteen-book list I used to calculate my *y*-values. Ratings of 4 and 3 were slightly (4–5 percentage points) more frequent in the *z* population. Indeed, four-star ratings were by far the most common rating in each population, at more than one out of every three ratings, and three-star ratings made up roughly one-quarter of each. Two-star and one-star ratings have been relatively rare to date (except from Lobstergirl!), and slightly rarer among my five-star books than in the *z* population.

I browsed a few of my Follows’ shelves and easily found thirteen novels to which at least one of my Follows had given five stars. I added THE JEWEL IN THE CROWN to the list as a test for the model. I went to the Goodreads page of each of the fourteen novels, one by one, and recorded every Follow rating I saw in a separate cell on that novel’s row in my spreadsheet. I then copied and pasted this data into another worksheet (transposing from rows to columns) to do my calculations.

I did my calculations in Microsoft Excel, since Bayes’s theorem is simple algebra. You can see from this screenshot of my spreadsheet how much information each individual Bayes calculation revealed. As the model assessed each individual rating for a book, its prior (*x*-value) rose or fell:
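The update the spreadsheet performs is Silver’s form of Bayes’s theorem, posterior = xy / (xy + z(1 - x)), applied once per Follow rating, with each posterior becoming the prior for the next rating. A Python sketch using the *y*- and *z*-values from the table above:

```python
# Per-rating Bayesian update in Silver's x, y, z notation.
Y = {"5+": 0.041, "5": 0.265, "4": 0.367, "3": 0.255, "2": 0.041, "1": 0.031}
Z = {"5+": 0.024, "5": 0.092, "4": 0.419, "3": 0.294, "2": 0.118, "1": 0.053}

def update(x, rating):
    """One Bayes step: revise prior x after seeing a single Follow rating."""
    xy = x * Y[rating]
    return xy / (xy + Z[rating] * (1 - x))

def posterior(ratings, prior=0.1122):
    """Chain the update over a book's full list of Follow ratings."""
    x = prior
    for r in ratings:
        x = update(x, r)
    return x

# THE MAGIC MOUNTAIN's two ratings (one 5, one 5+):
print(f"{posterior(['5', '5+']):.0%}")  # about 38%
```

Note how much each rating type moves the needle: a 5 multiplies the odds far more than a 5+ does, because 5+ ratings are nearly as rare among the *z* population as among my five-star books.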

Particularly interesting on this chart is the comparison of my full house (THE JEWEL IN THE CROWN) to THE HEART IS A LONELY HUNTER and INFINITE JEST. Each of these books had exactly two 5 ratings and one 5+ rating. But the model docked David Foster Wallace’s book ten percentage points for some 4 and 3 ratings, and THE HEART IS A LONELY HUNTER’s extra 4 rating cost it three percentage points.

I took the liberty of averaging each book’s ratings, since a straight mean is so often the only kind of metascore we see in book review systems. Here is how my Bayesian-inspired method compared to simply averaging the available ratings among my trusted reviewers:

The first-draft model predicts that I will rate approximately three of the five bolded titles on my list five stars. If I had to bet, I would bet that REBECCA and INFINITE JEST will be the two to fall short by a star or two.

Although both methods picked the exact same bottom three (only in a different order), the model’s #5 pick, the gothic romance REBECCA, ranked at #11 in the metascore list. Conversely, THE MAGIC MOUNTAIN had a perfect score that landed it tied with THE JEWEL IN THE CROWN for #1 on the metascore list, but the Bayesian model was not convinced by only two ratings, even one 5 and one 5+, at least relative to more fully tested books on the list.

You see, a 38% in Bayes land is not the analogue of a 38% metascore such as you might see on Rotten Tomatoes. (The latter is more similar to a 1.9 average rating, which none of the books on my list received.) Indeed, the 38% chance that I will *love* THE MAGIC MOUNTAIN is still more than *three times* the chance that I will *love* a novel selected more casually (11.22% if you remember). However, THE MAGIC MOUNTAIN is less tested than other books on my list (it’s a rookie, not a pro), and thus considered a riskier bet in Bayes land.

What have we learned? That within the three- to five-star range among a small number of like-minded readers, the number of ratings likely matters as much as, or even more than, the score in predicting the quality of my future reading experiences. Currently, a three-star rating and a four-star rating have nearly the same effect in the model: a very slight decrease in probability. It would be better to get one 3 and two 5s than two 4s and one 5, for instance, even though both rating sets average out to 4.33.
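To see this concretely, here is Silver’s update rule, posterior = xy / (xy + z(1 - x)), applied to both rating sets, starting from the 11.22% prior and using the *y*- and *z*-values tabulated earlier:

```python
# Two rating sets with the same 4.33 average, run through the model.
Y = {"5+": 0.041, "5": 0.265, "4": 0.367, "3": 0.255, "2": 0.041, "1": 0.031}
Z = {"5+": 0.024, "5": 0.092, "4": 0.419, "3": 0.294, "2": 0.118, "1": 0.053}

def posterior(ratings, x=0.1122):
    for r in ratings:
        x = x * Y[r] / (x * Y[r] + Z[r] * (1 - x))
    return x

print(f"one 3, two 5s: {posterior(['3', '5', '5']):.0%}")  # about 48%
print(f"two 4s, one 5: {posterior(['4', '4', '5']):.0%}")  # about 22%
```

The two 5s more than pay back the penalty of the 3, because a 5 is nearly three times likelier among my five-star books than in the general population, while 4s and 3s barely discriminate at all.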

How should publishing professionals use data and methods like this? (Hopefully nobody believes anymore that they should or can influence a community like Goodreads through cybershilling—that just creates noise.) Instead, gatekeeper accountability comes to mind. Imagine a book editor, let’s call him Ed, who has time to read two manuscripts all the way through, or one manuscript all the way through and the opening chapters of four. He has just received five debut novel submissions from five different literary agents. What should he read and in what order?

First he should build a model like the one I used above. Instead of Goodreads reviewers, the literary agents are the ones whose judgment on a certain book or manuscript is an event to run through the model. In this scenario, each agent should be assigned his or her own *y*- and *z*-values. You can see immediately what is missing: data to source accurate *y*- and *z*-values for the agents. Let’s say that all five agents play ball and send Ed their ratings of hundreds of books on the 1, 2, 3, 4, 5, 5+ scale. He analyzes this data against his own ratings to estimate the sixty *y*- and *z*-values (twelve for each of the five agents). Each submitted manuscript is also rated by its repping agent. Although the variation in the *y*- and *z*-values alone should be enough to give Ed a clear starting point after running each manuscript through his model once, you can see what else is missing: who else has read these manuscripts, and what were their ratings?
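A sketch of what Ed’s model might look like, with each agent carrying her own *y* table (probability of a rating given the manuscript is a winner) and *z* table (probability given it isn’t). Every name and number below is invented for illustration:

```python
# Hypothetical per-agent likelihood tables; each rating event is
# weighed by the tables of the agent who made the call.
agents = {
    "agent_a": {"y": {"5": 0.40, "4": 0.35, "3": 0.25},
                "z": {"5": 0.10, "4": 0.40, "3": 0.50}},
    "agent_b": {"y": {"5": 0.30, "4": 0.45, "3": 0.25},
                "z": {"5": 0.15, "4": 0.35, "3": 0.50}},
}

def update(x, agent, rating):
    """Revise the prior x after one agent's rating, using her own tables."""
    y = agents[agent]["y"][rating]
    z = agents[agent]["z"][rating]
    return x * y / (x * y + z * (1 - x))

# A manuscript repped by agent_a (who rated it 5), also read by agent_b:
x = 0.05                       # hypothetical base rate for a standout debut
x = update(x, "agent_a", "5")
x = update(x, "agent_b", "4")
print(f"posterior: {x:.1%}")
```

The repping agent’s 5 is discounted or amplified by her personal track record, which is exactly the accountability the submission process currently lacks.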

I am starting to see why Nate Silver loves the Bayesian method of data analysis. The frequentist approach of surveying or A/B testing artificial samples not only lets one get away with fudging data, but encourages it. The Bayesian method requires good data, and the more the better. It encourages you to go out and get good data, and to create and capture accountability in your system. Agents and editors who have passed on manuscripts or authors that turned out to be artistically significant and/or commercially viable have never been compelled to go on the record with these departures in taste. The publishing industry’s collective track record for picking winners is poor (approximately 30% of published books earn out their advance). That number may never reach 90% or even 75%, but I think that a community of smart and passionate professionals can improve on 30% by using Bayesian analysis and providing the missing data. That means sixty agents will have to admit to the possibly embarrassing act of having passed on Kathryn Stockett’s THE HELP, but it also means discovering that a model like this can actually be quite forgiving, even as it reveals unexpected patterns in the data.

I know that *my* curiosity has been piqued, and I hope to run more personal reading experiments like the above as my time permits. I may try to analyze the Goodreads and Amazon community-wide metascores in depth. Are they meaningless? Do they contain any signal at all, and if so, what?

However, the very next experiment will be running my model again, including the next four novels in my queue: BLOOD MERIDIAN, THE DEATH OF THE HEART, REVOLUTIONARY ROAD, and THE STARS MY DESTINATION. How will books that I chose for myself stack up against the model’s choices in round one? What personal biases of mine will the model expose this time?

I won’t feel too bad no matter what I find out. Because next comes the fun part—testing the model—A.K.A. **reading**.