The underlying statistical theory:
Joshua A. Becker
over 1 year ago
Great comments and questions by Austin! I'll take them in turn.
On the obviousness of comments being helpful: I agree, it's so intuitively compelling that communication between forecasters should improve accuracy!
However, previous controlled experiments using chatrooms have shown that while communication reliably improves individual accuracy, it does not reliably improve aggregate accuracy!
It turns out (according to current evidence) that the effect of communication is driven by statistical effects that emerge from the initial estimate distribution: social influence pulls individual estimates toward the group's central tendency, so whether the aggregate improves depends on where the truth sits relative to that distribution. In other words, it's hit-or-miss whether it actually helps. In a related example (not comments, but communication generally), the platform Estimize.com stopped allowing people to see the community estimates before providing their own independent estimate, after finding that it reduced accuracy.
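To make that concrete, here is a toy simulation (purely illustrative: the influence model and all parameters are invented for this comment, not taken from any study mentioned above). Each round, every estimate moves partway toward the group mean; the identical influence process helps or hurts the aggregate depending only on the initial distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(estimates, alpha=0.5, rounds=3):
    # Simple DeGroot-style influence: each round, every estimate
    # moves a fraction alpha of the way toward the group mean.
    x = np.array(estimates, dtype=float)
    for _ in range(rounds):
        x = (1 - alpha) * x + alpha * x.mean()
    return x

truth = 100.0

# Case 1: right-skewed estimates whose median undershoots the truth.
# The mean sits above the median, so influence pulls the median up,
# toward the truth (communication helps here).
helps = rng.lognormal(mean=np.log(60), sigma=0.8, size=500)

# Case 2: same shape, but the median already overshoots the truth.
# Influence still pulls the median up, now away from the truth.
hurts = rng.lognormal(mean=np.log(120), sigma=0.8, size=500)

for label, est in [("helps", helps), ("hurts", hurts)]:
    after = simulate(est)
    print(f"{label}: |median - truth| before = {abs(np.median(est) - truth):.1f}, "
          f"after = {abs(np.median(after) - truth):.1f}")
```

Nothing about the communication process differs between the two cases; only the starting distribution does, which is exactly the hit-or-miss pattern described above.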
However, as noted, I still believe comments can be helpful, which is why I want to prove it. My explanation for the previous findings is that those studies failed to capture 'real' deliberation.
What specific forecasting platform would you use? Ideally, I would like to work with an existing platform and I'm currently in discussions with one possible partner to see if that could work for these purposes. The risk of using an existing platform is that I can't quite get the experimental control needed, or that they decide that running experiments is not suited to their mission. As an alternative, I would use a custom-built platform that I am currently in the final stages of developing for use in a laboratory context (i.e. with participants recruited from a platform like Amazon Mechanical Turk).
How many participants do I expect to attract? This is honestly a bit difficult to predict. Based on my previous work in this area, I would aim to collect approximately 4,000 estimates. Depending on the design, which will be developed collaboratively with my partners, this could be either 4,000 people answering one question (unlikely) or 100 people answering 40 questions (much more likely).
These numbers seem feasible: a $10k prize pool on Metaculus for forecasting the Ukraine conflict attracted 500-3k estimates per question across 95 questions. This topic is unusually popular, however: a $20k prize pool on Metaculus for forecasting Our World in Data questions has attracted much smaller numbers of people, on the order of 30-50, for 30 questions.
The lesson here is that topic matters a lot. One advantage of my project is that we don't care what questions we forecast, so we are free to identify questions and topics that are likely to attract contributors.
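For a rough sense of whether 40 questions would be enough, here is a back-of-the-envelope power check. This is a sketch under assumptions I'm inventing here, namely that comment and no-comment conditions would be compared question-by-question with a paired t-test, and the candidate effect sizes; the actual design and analysis plan are still to be developed with partners.

```python
# Rough power check for a 40-question paired design.
# Illustrative only: the effect sizes below are assumptions,
# not estimates from prior data.
from statsmodels.stats.power import TTestPower

analysis = TTestPower()  # power for a one-sample / paired t-test

for d in (0.3, 0.5, 0.8):  # assumed standardized (Cohen's d) effects
    power = analysis.power(effect_size=d, nobs=40, alpha=0.05)
    print(f"d = {d}: power = {power:.2f}")

# Smallest effect detectable with 80% power given 40 questions:
d_min = analysis.solve_power(nobs=40, alpha=0.05, power=0.8)
print(f"minimum detectable effect at 80% power: d = {d_min:.2f}")
```

The takeaway is that with 40 questions, only moderate-to-large effects are reliably detectable, which is one more input into the design discussions.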
How would I recruit these participants? Even if I don't end up working with an existing platform to run the forecasts, I am still very optimistic that I can work with an existing community to recruit participants.
In the unlikely event I am not able to find any community partner to help with recruiting, I can use more general methods that I have used in the past, which involve actively promoting/advertising the opportunity in online fora. For example, my dissertation involved a financial forecasting study (unpublished) that attracted approximately 1,000 participants by sharing the opportunity in online discussion groups.
What practical recommendations will emerge? Well, that depends on the results. If we find that even these communities are driven by statistical effects rather than information sharing, we might want to recommend removing comment sections. However, comments are potentially about more than just accuracy, since they also have the ability to drive participation by creating a more engaging community. Therefore, effect sizes will be very important here, which is another important reason to study this in an ecologically valid context. That is: even if the laboratory results hold, the effects may be too small to warrant a change in practice.