# Triple J Predictions

Posted on January 26, 2017 by Ross

As they say in the classics, “Prediction is very difficult, especially if it’s about the future.” Thus far I’ve been careful not to make any firm predictions and instead focused on elucidating the general factors that may or may not contribute to a song’s success. When you make a prediction there are two factors to consider. The first is the margin of error. This says, based on the factors you considered, that the answer is blah or bleh with a confidence of 90%. But the real killer is model uncertainty. The model is the set of assumptions you make in coming up with your predictions. These assumptions are usually difficult to test, so it is impossible to give a firm figure for the uncertainty. To take an extreme example, does anyone’s statistical model of the Hottest 100 include the possibility of alien invasion, in which case Triple J would be retasked to emergency broadcast?

So if I’m not going to make a prediction, what are we going to discuss today? Other people’s predictions. The first serious attempt I can find at predicting the Hottest 100 was the Warmest 100, in 2012 and again in 2013. Why bother trying to count up YouTube views and radio plays per month when you can go straight to the source? People freely post their ballots on Facebook, Twitter and the like. The Warmest 100 pulled 3,600 people’s votes from Facebook and used that sample to forecast the outcome. It was very successful, calling the top three spots in order and picking the correct top 10 (though 4 and 5 were out of order). The group returned in 2013, this time pulling in a lesser 1,779 ballots (fun fact: Captain Cook died in 1779. Well, not so fun for him, I guess). With a lower vote count they were not as accurate. The Warmest 100 of 2013 got the top spot correct, but not second or third, and managed to pick 7 out of the top 10.

That was to be the final year of the Warmest 100. After a discussion with Triple J management and changes to social media vote sharing, the group didn’t continue. This seems to be the reason that sharing your vote now takes the form of an image rather than a text list: to make the votes hard to collect. Nonetheless, I know of two serious efforts to predict the outcome of this year’s poll. We have Mr Whyte’s 100 Warm Tunas and an unnamed project by Mx u/ZestfullyGreen, which I shall refer to as the Harmonic 100, h/t Nathan Leivesley.

Both of these programs follow in the footsteps of the Warmest 100, collating people’s votes as posted on social media. Both actually read the text from the images using optical character recognition, which is super cool. Warm Tunas has pulled its data from Instagram, while the Harmonic 100 took its data from Twitter. Let us first discuss Warm Tunas. This is very much following the path blazed by the Warmest 100. It even has the same colour theme on its website. It has collected ballots from 7,788 people, which works out to about the same percentage of the total as the Warmest 100 in 2012. Pop quiz: does that imply that it’s going to be as accurate as its predecessor, or more?

….

….

The answer is more. It turns out that when the sample you are measuring is a small fraction of the total, what really matters is the absolute number of votes collected. So, for example, if you poll 2% of Taswegians and 1% of New South Welshmen, the result of the NSW poll will be more accurate. That’s because 2% of Taswegians is about 10,000 people, whereas 1% of NSW is about 75,000. The margin of error of a poll may be calculated as $Err \sim \frac{0.98}{\sqrt{N}},$ where $N$ is the number of people sampled. So for our example, the margin of error for the Tasmanian poll is 0.98/100 ≈ 1%, compared to about 0.4% for NSW. (These are 95% confidence intervals: you can be 95% sure that the true value lies within the margin of error of the sampled value.) For a real-life example, the opinion polls for political parties in the newspapers typically survey about a thousand people and so have a margin of error of about 3%.
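The rule of thumb above is easy to check numerically. Here is a minimal sketch (the population figures are approximate):

```python
import math

def margin_of_error(n: int) -> float:
    """Conservative 95% margin of error for a simple random sample:
    roughly 0.98 / sqrt(n)."""
    return 0.98 / math.sqrt(n)

# The worked examples from the text:
print(f"Tasmania, 2% sample (~10,000 people): {margin_of_error(10_000):.1%}")
print(f"NSW, 1% sample (~75,000 people):      {margin_of_error(75_000):.2%}")
print(f"Newspaper poll (~1,000 people):       {margin_of_error(1_000):.1%}")
```

Note that the larger NSW sample wins despite being a smaller *fraction* of its population: only the absolute count enters the formula.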

This means that Warm Tunas should be about 50% more accurate than the on-the-button Warmest 100 of 2012. It’s looking good. With 7,788 ballots, its margin of error should be about 1.1%. This is important because it has “Adore” by Amy Shark out ahead of the favourite, Flume, by a margin of just 1%. In other words, it’s too close to call.
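Since error shrinks like $1/\sqrt{N}$, the improvement over the 2012 effort can be estimated directly from the ballot counts; a quick sketch:

```python
import math

# Error shrinks like 1/sqrt(N), so the accuracy gain going from the
# Warmest 100's 3,600 ballots (2012) to Warm Tunas' 7,788 is the
# square root of the ratio of sample sizes.
improvement = math.sqrt(7788 / 3600)
error = 0.98 / math.sqrt(7788)
print(f"improvement ≈ {improvement:.2f}x, margin of error ≈ {error:.1%}")
```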

Let’s turn then to the other entrant, the Harmonic 100. This project pulls its data from Twitter, but has the ballots of only 600 people. Interestingly, though, it compensates by weighting the votes depending on how each Twitter user voted in the previous year. Each person gets a score in the following way: if you chose “Hoops” in 2015 you get a point. If you chose “King Kunta” you get 1/2 a point. If you chose the song that came 30th (“Say My Name”, solid choice), you get 1/30 of a point. You add up the points from your ten songs, and that’s how much each of your votes is worth this year. This system gives more weight to people with tastes closer to the Triple J average, under the assumption that your tastes from year to year will be a bellwether.
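As I understand the scheme, each voter’s weight is the sum of 1/rank over their previous-year picks. A minimal sketch (the ballot of ranks below is illustrative, not real data):

```python
def voter_weight(previous_year_ranks):
    """Sum of 1/rank over the songs this voter picked last year.
    A vote for #1 ("Hoops") is worth 1 point, #2 ("King Kunta") 1/2,
    and #30 ("Say My Name") 1/30."""
    return sum(1 / rank for rank in previous_year_ranks)

# A hypothetical voter whose ten 2015 picks landed at these positions:
ranks = [1, 2, 30, 45, 60, 72, 80, 88, 95, 100]
weight = voter_weight(ranks)
print(f"each of this voter's 2017 votes counts {weight:.2f} times")
```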

Because people don’t vote with the intention of trying to pick the highest-ranked songs, it is unclear whether the following analysis is applicable. Regardless, the field of expert prediction is fascinating. The best-known work is by Philip Tetlock, published in his popular-science book “Superforecasting”. His research finds that there are some people who are just really good at predicting things, even when they aren’t particularly expert in the field. For example, IARPA (a research funding program run by the US intelligence community) ran a competition where groups came up with systems for predicting geopolitical events. Dr Tetlock led a team that combined the predictions of around 300 forecasters based on how well those forecasters had done in their previous predictions. His team repeatedly won the competition, doing significantly better than CIA experts.

One of the most intriguing findings of the project, and the origin of the title of the book, is that the people who were the best forecasters stayed the best forecasters. Consider this: if forecasting were a result of both luck and skill, then there would be a group of pretty skillful people at the top, but who was the very best in a given year would depend on luck. The more it’s based on luck, the more the top of the field changes; conversely, the more it depends on skill, the less turnover there is in the top spots. Tetlock found that 70% of people persisted as ‘superforecasters’ from one year to the next.

Bringing this back to the Harmonic 100, by my rough estimation the votes of a person with good taste should count double those of a person with average taste. Assuming that taste is analogous to prediction skill, this should be equivalent to having the votes of 30% more people. (Homework exercise: where did this number come from?) That still only brings the total to 780 ballots. But because Mx u/ZestfullyGreen provides the predicted scores, we can use these to estimate the error to be about 4%. It gets a little tricky at this stage because I’m doing this at 2am after being out at the pub, and I’m not sure how to normalise into a percentage here. Pressing on in spite of this, the Harmonic 100 has Flume on 24% and Amy Shark on 20%. So again, within the margin of error.
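Putting the numbers together, taking the 30% boost and the 780 effective ballots at face value and reusing the same rule of thumb as before:

```python
import math

effective_ballots = 780          # 600 ballots, boosted ~30% by the weighting
error = 0.98 / math.sqrt(effective_ballots)
gap = 0.24 - 0.20                # Flume vs Amy Shark in the Harmonic 100
print(f"margin of error ≈ {error:.1%}, gap = {gap:.0%}")
# The gap is about the same size as the margin of error,
# which is why the race reads as too close to call.
```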

But this could all be bunk. In the end, everything comes back to model error. The key assumption in all this is that the sample is chosen from the population randomly, and this is just manifestly not true. The two programs draw on just a single social media platform each. Personally, I am on neither Instagram nor Twitter. And for that matter, I haven’t posted my ballot at all. Perhaps the people who do are full of themselves and have terrible taste. This is the sort of thing that is basically impossible to be sure of, and why I steer clear of the betting game.