One Score to Rule Them All
Posted on Friday, December 9 2011 @ 11:53:32 Eastern
This member blog post was promoted to the GameRevolution homepage.
Sit down and grab some popcorn: this is a long one.
(Thanks to Bras for the article suggestion.)
Have I ever told you the story of Metacritic? Let me tell you the story of Metacritic.
Once upon a time, there lived a man in the basement of his parents’ one-story home. He did not shower. He did not shave. By day, he slept, the basement’s single window smothered with a throw pillow stolen from the above world. By night, he ate Cheetos, drank Mountain Dew, and laughed with unnatural volume in voice chat while smoking pot beneath a disconnected smoke alarm. [Ugh, is that you? ~Ed. Anthony]
He read video game reviews like this:
*open* scrollscrollscrollscrollscroll... 8/10 *close*
On one such night, after toking a particularly big spliff of some good sinsemilla, the man spoke aloud, to no one in particular, “Gee-suhs. It sure would be nice if these reviews didn’t have all that pointless garbage dumped between the game title and the score”.
And so, fueled by an overly functional high, the man pulled out his sharpest pair of scissors and went to work, clipping out the text from every video game review in all the major gaming magazines and his hard copy of the internet. What remained were neatly organized lists of scores for all games, each about as long as the man’s arm.
As the man surveyed his work, he noticed a most curious pattern: a large portion of review scores for each game were numbers between zero and one hundred. A thought came to mind. “Why, if I were to simply average the grades of these eighty game reviews, I would get--” he counted carefully on his fingers-- “One review instead of eighty! That’s eighty times less reading”! Hastily, he sketched out his vision, complete with annotations:
The man looked again at the context-stripped clippings that lay before him. Some of the paper scraps held numbers out of ten, out of five, and out of four, rather than out of one hundred. Others had no numbers whatsoever and displayed only arcane hieroglyphics, such as ‘B+’ and ‘C-’. A few scraps were blank: the entireties of those reviews had been contained in the texts that the man had excised from each publication.
“This is of no concern”, said the man. He then proceeded to cough for two minutes straight, finally recovering on his third attempt and continuing his monologue.
“For the numbers, I will multiply by ten, twenty, or twenty-five to force a score out of 100. To these mysterious symbols, I shall apply a transfiguration of my own design, one which transforms game review ‘B’s into academic ‘C’s and game review ‘C’s into academic ‘F-’s, as was surely the intent of the review publishers. And for any reviews that dare eschew the distillation of human experience into a two-character expression, I shall assign my own score based on my interpretation of the review text I removed in the first place”.
The man’s scraggly beard bristled into a grin. “But why stop there? As an arbiter of taste, I know that some of these numbers are, in fact, WORSE NUMBERS than others”. Here, he paused to waggle an accusatory finger at the offending parchments. “Rather than remove these numbers less worthy, I will instead assign hidden weightings to each of the newly-divined review scores. Then, and only then, shall I take the average of all transformed, weighted scores”.
Outside the basement window, a lazy evening breeze drifted by, but one that totally could have turned into a howling wind, like, at any second.
“AND LO, THIS SINGLE SCORE SHALL DETERMINE, BEFORE ALL OTHERS, THE TRUE MERIT THAT WHICH TO ANY GAME SHALL BE ASSIGNED”!
Then the man snorted a pile of coke the size of his fist and coded for 45 hours straight, and that’s basically where Metacritic came from.
Now let me tell you why it sucks.
Metacritic currently plays host to three hundred forty-three video game publications, both magazines and professional review websites, covering everything from major platforms to casual games/apps. The vast majority of publications are North/South American, European, and Australian, with Asia receiving roughly the same representation as Africa and Antarctica (i.e., zero) save for a couple hundred archived Weekly Famitsu reviews from Japan that stopped in 2006. [Maybe this is why they stopped archiving Famitsu. ~ Ed. Anthony]
My original goal in writing this article was to search for factors that separated high-scoring publications from low-scoring publications—could I use Metacritic’s data to determine conclusively whether certain video game review outlets were nothing more than corporate shills? To begin, I pruned away any publication with fewer than 100 reviews—arbitrary, to be sure, but I wanted a number that made me comfortable enough that the average review score wouldn’t be overly influenced by certain months in which all the AAA titles seem to be released (an article for a later day, perhaps). I then chopped this data set, currently at 285 publications, down further by removing all ‘inactive’ reviewers. Three recent, high-profile game releases that spanned all major platforms (The Elder Scrolls V: Skyrim, Batman: Arkham City, and The Legend of Zelda: Skyward Sword) were chosen to represent the active set. If a publication did not review at least one of these games, it was not included in the data set.
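For the spreadsheet-averse, the pruning above can be sketched in a few lines of Python. This is only an illustration of the two rules as described: the `Publication` record and its field names are made up for the example and are not how Metacritic stores anything.

```python
from dataclasses import dataclass, field

# Hypothetical record layout; the field names are illustrative only.
@dataclass
class Publication:
    name: str
    review_count: int
    reviewed_titles: set = field(default_factory=set)

# Thresholds and marker games taken from the paragraph above.
MIN_REVIEWS = 100
ACTIVE_TITLES = {
    "The Elder Scrolls V: Skyrim",
    "Batman: Arkham City",
    "The Legend of Zelda: Skyward Sword",
}

def is_included(pub):
    """Keep a publication with at least 100 reviews that covered
    at least one of the three marker games."""
    has_enough_reviews = pub.review_count >= MIN_REVIEWS
    is_active = bool(pub.reviewed_titles & ACTIVE_TITLES)
    return has_enough_reviews and is_active

def filter_publications(pubs):
    """Apply both pruning rules and return the surviving publications."""
    return [pub for pub in pubs if is_included(pub)]
```

Both rules are applied in one pass here; applying them one after the other, as in the text, gives the same survivors.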
What resulted was a list of one hundred sixteen publications, each with an average review score calculated by Metacritic (sans weighting). All together, the data set looked like this:
Here’s a quick quiz: can you guess where each of the following high-profile U.S. publications landed?
Metacritic has Game Revolution’s average score at 64, a full four points below the next closest publication. Of course, when you’re converting a ‘B’ grade to a 75/100 and a ‘C’ to a 50/100, it isn’t hard to see why. In fact, every score below an ‘A-’ is converted lower than its American academic equivalent; it is no wonder that publishers don’t send these guys review copies. Were Game Revolution to begin covering as much shovelware as other publications (yet another topic for a later day), I would expect that average to plummet well into the mid-fifties.
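To see how harsh that conversion is, here is a minimal sketch that runs the two converted scores quoted above back through a report card. Only the ‘B’ and ‘C’ conversions come from this article; the 90/80/70/60 cutoffs are an assumption standing in for the common American academic scale, and `academic_letter` is a name invented for the example.

```python
# Conversions quoted in this article; other letter grades are not listed here.
METACRITIC_CONVERSION = {"B": 75, "C": 50}

# Assumed academic cutoffs (a common 10-point US scale, not from the article).
ACADEMIC_CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]

def academic_letter(score):
    """Map an out-of-100 score to a letter using the assumed cutoffs."""
    for cutoff, letter in ACADEMIC_CUTOFFS:
        if score >= cutoff:
            return letter
    return "F"

for review_grade, converted in METACRITIC_CONVERSION.items():
    print(review_grade, "->", converted, "->", academic_letter(converted))
```

Under those assumed cutoffs, a review ‘B’ comes back as an academic ‘C’ and a review ‘C’ comes back as an ‘F’.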
So, any publication grading on a letter scale is out. But that still leaves 112 review websites and magazines just waiting to be analyzed. Since they all grade on a number scale, there should be no issue with comparing their scores against one another once everything is converted to the same numerical base. Right?
You are not going to believe this.
Among publications that score numerically, there are a wide variety of grading scales. Pictured below are the ones I encountered while poring through the remaining 112 publications:
That’s a lot of different numbers. But what if the numbers themselves didn’t actually matter? We know that Metacritic converts all of its scores to an ‘out of 100’ grade—doing so is the primary mechanism that allows the website to justify its averaging of disparate review grades into a single Metascore. So let’s apply that same conversion directly to the grading scales:
Now that’s a clearer picture. As long as a grading scale has the same number of grade intervals, or what we’ll call ‘degrees of differentiation’ here, the Metacritic conversion treats the scales as identical. This is because the intervals between each number in a given scale are identical. In fact, we can boil down any numerical grading scale that meets this criterion to just the degrees of differentiation. For instance, out of five half points and out of ten whole points both have 10 degrees of differentiation, while out of ten decimal and out of one hundred whole points both have 100 degrees. Metacritic’s methodology implies that there is no difference between these scales. Ten degree scales are the same as twenty degree scales are the same as one hundred degree scales; they’re all out of 100, so just average them!
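A quick sketch of that idea: describe a scale by its maximum and its step, convert every possible grade to an ‘out of 100’ score, and compare the results. The function name and the choice to leave out a score of zero (so the counts match the ‘degrees’ used here) are assumptions for illustration, and the conversion is plain proportional scaling, which is how the article describes Metacritic’s numeric conversion.

```python
from fractions import Fraction

def converted_grades(maximum, step):
    """List every grade above zero on an evenly spaced scale,
    converted proportionally to an out-of-100 score."""
    maximum, step = Fraction(maximum), Fraction(step)
    degrees = maximum / step  # degrees of differentiation
    if degrees.denominator != 1:
        raise ValueError("step must divide the maximum evenly")
    # Fractions keep the arithmetic exact (no 0.1 + 0.2 surprises).
    return [step * i * 100 / maximum for i in range(1, int(degrees) + 1)]

five_by_halves = converted_grades(5, "0.5")
ten_by_wholes = converted_grades(10, 1)
ten_by_tenths = converted_grades(10, "0.1")
hundred_by_wholes = converted_grades(100, 1)

print(len(five_by_halves), five_by_halves == ten_by_wholes)
print(len(ten_by_tenths), ten_by_tenths == hundred_by_wholes)
```

Once converted, the scale a grade came from is invisible: all that survives is how many steps the scale had.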
But what if this wasn’t the case? What if a publication’s average review grade was affected simply by the number of discrete intervals it placed below a perfect score? Imagine that I have five games of varying quality, and I want to express to my readership that each game is worse than the one above it. In order to find five unique points of grade differentiation, I would have to travel farther down the numerical part of a ten degree scale (100, 90, 80, 70, 60) than I would down a one hundred degree scale (100, 99, 98, 97, 96). Even if I held the exact same relative opinion about how much I enjoyed the games, my nominal grades would change solely based on the scale I was using.
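The five-game thought experiment works out as follows. This is a toy sketch of the hypothetical above, nothing more: `top_grades` is a made-up helper that hands out the highest distinct grades a scale allows, one per game.

```python
from statistics import mean

def top_grades(degrees, n_games):
    """Return the n highest distinct out-of-100 grades available on an
    evenly spaced scale with the given degrees of differentiation."""
    if 100 % degrees != 0:
        raise ValueError("this sketch only handles scales that divide 100 evenly")
    step = 100 // degrees
    return [100 - step * i for i in range(n_games)]

ten_degree = top_grades(10, 5)       # [100, 90, 80, 70, 60]
hundred_degree = top_grades(100, 5)  # [100, 99, 98, 97, 96]

# Same five games, same ranking, different averages: 80 versus 98.
print(mean(ten_degree), mean(hundred_degree))
```

Same games, same opinions, and an 18-point gap in the average, produced entirely by the ruler.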
An extreme hypothetical case, to be sure, but the question remains: is it possible that some publications score games lower than others only based on their grading scale, even if everyone is using the same equivalent numerical range?
Not only is it possible; it is the god-damned truth.
Ignoring the lone 5 degree entry, all of the above differences are statistically significant at the 90% confidence level. The difference in score between 20 degrees and 100 degrees is significant at the 95% confidence level. The difference in score between 10 degrees and 100 degrees is significant at the 99.9% confidence level.
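For anyone who wants to check a claim like this against their own numbers, one generic way to test a difference in average score between two groups of publications is a permutation test. To be clear, this is an assumption-laden sketch and not the method behind the figures above, which the article does not spell out; the function name, the resample count, and the seed are all arbitrary choices.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_resamples=2000, seed=0):
    """Two-sided permutation test on the difference in mean score.

    Repeatedly shuffles the group labels and reports the fraction of
    shuffles whose mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return hits / n_resamples
```

Fed the average scores of, say, the ten degree publications and the one hundred degree publications, a result below 0.10 would correspond to the 90% confidence level, below 0.05 to 95%, and below 0.001 to 99.9%.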
What does it all mean? It means that an 80/100 on a one hundred degree scale and an 80/100 on a ten degree scale are DIFFERENT grades. It means that if I am a game reviewer, I will score the exact same game an average of 4 points out of 100 less if I use a ten degree scale versus a one hundred degree scale. It means that by averaging together all review scores for a game and then stacking up each individual publication against that average, Metacritic is falsely portraying a fair score comparison where none actually exists.
Now, I’ll level with you: In the grand scheme of things, this is more a moral concern than a practical one. Because most major game review publications manage to review most major games, the Metascore is equally skewed for all of them. Generally, Metacritic serves its purpose—the better games get better scores and vice versa. What the above issue demonstrates is actually a portion of a much more fundamental error in the way that we as consumers regard and digest game reviews.
What I am about to tell you next has no accompanying reams of data or sucker punch statistics. It is not something I can derive or state to you with any level of mathematical confidence. It is a one hundred percent unsupported opinion, and all I can do in lieu of providing any numerical argument is to preface with two questions.
First question: Do these three scores represent the same quality of game?
Second question: What about these three?
A review cannot be objective because it is directly tied to a single human being’s experience at a specific point in time. A review should not be objective because entertainment is inherently a subjective experience. When I play a game, I could not care ****ing less if some evangelist has anointed it with the sweat from Christ’s armpit. I only care about how much fun the game is TO ME.
A review score can help me determine enjoyment to some extent, but it is ultimately supplemental to the actual written review. If some bro who loves hammering his dick into the ground with a cleat rates Dick Stompers 2012 a 9.5/10, I am going to be seriously misled if I purchase on the grade alone. I can’t stand having my dick pulverized by spiky feet! And if the reviewer is worth his salt, he will give me the subjective context I need in the review proper to understand where my tastes overlap with and differ from his own and how those differences will affect my enjoyment of the game in question.
I care if you are a veteran of the genre. I care if you played the prequel. I care if you love the IP. I care if it was the story or the mechanics that made you dock that game a point. Hell, I even care if you had a shitty time. When Daniel published his Warhammer 40k: Space Marine review, broken multiplayer and all, or Colin devoted a quarter of his Half-Life 2 review to telling me how much he hated Steam, I said, “**** yeah, journalism”. Don’t give me the idyllic future; give me the information I need to figure out as best I can what my experience with this game is most likely going to be.
I’ve played modestly-scored games that I’ve loved and skipped highly rated games I knew I would hate, and I owe it all to that pointless garbage dumped between the game title and the score. So for the love of God, get to know your reviewer and read the review. Metacritic might cover the broad strokes, but chances are the review of that one game made for you and you alone is not punctuated by a perfect score.