The final part of our crowd results series covers the lessons we learnt from running a large-scale, year-long mass sensing experiment. We discuss five limitations of real-world forecasting experiments and offer our three top tips for setting questions to make the most of “crowd wisdom”.
Carrying out an experiment demanding participation from a crowd over an entire year is not easy. It requires constant monitoring of resonant issues and the ability to dynamically adapt to the ebbs and flows of engagement. Some of the issues are easier to resolve than others. Below, we highlight five of the challenges we encountered and offer some suggestions for overcoming them.
Sustaining engagement over longer periods is one of the key challenges for collective intelligence projects. It requires a keen awareness of participants’ motivations, which can change over time. While overall engagement with the crowd predictions challenge was very high, we tended to observe peaks of activity in response to media coverage, reminiscent of the activity patterns previously reported for citizen science projects that feature regular data releases. Sustained participation by individual forecasters over longer periods was much rarer and varied considerably between demographic groups and subject areas. For example, older forecasters participated in seven to eight questions on average, with frequent updates to their estimates and comments, while other groups were more piecemeal in their participation. These patterns of activity posed a challenge for subsequent analysis because of the need to correct for high variability. Strong media partnerships, tailored community management with regular personalised feedback, and a mixed incentive approach that adapts to the changing motivations of the crowd can help to mitigate some of these effects.
At the outset of the challenge we had hoped to attract a diverse crowd, and throughout we promoted the experiments as “open to everyone”. Overall, we covered a large range of ages and locations. However, we were disappointed by the gender balance and the underrepresentation of the youngest age group (18-24 years). We tried to correct this imbalance by including a wider variety of forecasting questions, including some with a health focus, which had previously been shown to track well with BBC Future’s female audience. Despite this, participation rates from women did not appear to follow a consistent pattern by topic area [1], and some of the Brexit-related questions actually attracted the highest proportions of female forecasters. Overall, though, both men and women participated more on non-Brexit questions than on Brexit questions. More targeted media campaigns, or varying the tone of communications about the challenge away from jargon-heavy terms such as “forecasting”, might have helped the challenge resonate more with the groups who were less represented.
Any guide to experimentation will highlight the importance of specifying research questions and hypotheses in advance. But experiments beyond the lab can sometimes demand a more agile approach. Originally we planned for the experiments to provide an insight into the effect of gender-based teams on the accuracy of crowd predictions. Our hypothesis, based on previous research, was that communication and information sharing between team members would differ according to the gender balance of teams. This was a high-risk design because it required consistent engagement and participation over longer periods, which ultimately proved too difficult, so we were left with insufficient data. Despite this, we were able to make several comparisons between the accuracy and forecasting behaviour of different demographic groups. Try to build redundancy into your research plan so that there is enough low-hanging fruit to make up for the potential data gaps or failures of more complex experimental designs. For example, we still managed to compare the information sharing patterns of different groups by analysing forecasters’ commenting behaviours. While this was different to the hypothesis we originally planned to test, it did reveal some interesting trends that could form the basis of future experiments. In particular, we found that older participants tended to post more comments alongside their forecasts and that these were more likely to receive upvotes from other participants.
Forecasting is steeped in its own language: new recruits have many terms to get to grips with as they learn the ropes, from Brier scores to scope sensitivity. The phrasing of questions can also feel clunky and unintuitive. There is often good reason for this, as it helps to eliminate any residual ambiguity about what exactly is being asked. Combined, however, all of this can make the forecasting world inaccessible to the average participant. It’s important to be aware of these communication barriers. All wider communications, particularly the early messaging about the project that is crucial for driving initial recruitment, should focus on the relevance of the project to everyday lives and on what the individuals taking part will gain from the experience.
In addition, forecasting requires a shift in the way that most of us approach thinking about the future. Crowd predictions encourage a more objective assessment of a situation through the lens of probability. This probabilistic approach, like much of the language around forecasting, is unintuitive and can cause disengagement even after successful recruitment. Regular communication to participants that explains terminology and links to available training and tips can help.
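To give a sense of what sits behind one of those jargon terms, the Brier score mentioned above is simply the mean squared difference between a forecaster’s stated probabilities and what actually happened, with lower scores being better. The minimal sketch below uses invented forecasts purely for illustration:

```python
# Minimal illustration of a Brier score (the example forecasts are invented).
# A forecast is a probability between 0 and 1; the outcome is 1 if the event
# happened and 0 if it did not. Lower scores indicate better forecasts.

def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and actual outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# A forecaster who said 80%, 60% and 10% on three questions, where the first
# two events happened and the third did not:
print(brier_score([0.8, 0.6, 0.1], [1, 1, 0]))  # 0.07
# A forecaster who hedged at 50% on every question scores worse:
print(brier_score([0.5, 0.5, 0.5], [1, 1, 0]))  # 0.25
```

The scoring rewards people who commit to well-calibrated probabilities rather than sitting on the fence, which is exactly the shift in thinking that many new participants find unintuitive at first.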
Teams from large consultancies and individual experts are often the first to be asked for their forecasts when decision-makers face uncertainty. AI-driven predictive models and betting markets both offer alternative sources of insight into the future. Throughout our experiments we struggled to find meaningful comparisons for the crowd forecasts. Recruiting experts who would be willing to take on the crowd was especially difficult. We suspect this was primarily driven by the potential reputational damage of being outperformed by the crowd.
In contrast, comparison with betting markets was often made difficult by the esoteric framing of our questions. In a few cases, we were able to compare our questions with the prediction market Smarkets to demonstrate that our forecasters were as good as those with “skin in the game”. Part of proving the complementary value of crowd predictions lies in comparing the method with other ways of anticipating the likelihood of future events. Doing this more systematically would have helped us to identify the circumstances and question types where crowd forecasting brings additional value to complement other ways of thinking about the future. After all, none of these methods is enough in isolation.
Selecting the topics and framing for forecasting questions is part art, part science. It requires both creativity and precision: questions need to capture the imagination while still ensuring that the result can be verified by a trusted source within the chosen timeframe. We were lucky to receive a lot of support from Good Judgment Open throughout the challenge, and we still managed to stumble along the way! As more policymakers, government agencies and companies worldwide become interested in harnessing the power of the crowd to make forecasts about the future, here are our top tips for getting the questions right.
At Nesta, we are continuing to experiment with collective intelligence methods ourselves and supporting others through our grants programme.
We would love to hear any reflections from those of you who took part or have been following the challenge from afar. Let us know by writing to [email protected], using the subject line Crowd Results. And for anyone who missed the chance, there are many ongoing forecasting challenges on platforms like Good Judgment Open and Metaculus where you can start honing your forecasting skills. Remember, 85% of our participants had no previous forecasting experience and they managed to get it right 70% of the time!
[1] Overall, the proportion of female forecasters varied between 18% and 40%, depending on the question.