In our new project, Arloesiadur, we are building a data platform to measure and understand innovation, and inform policy.
We have started a new innovation analytics project, supported by the Welsh Government, called Arloesiadur ('Innovation Tool' in Welsh). In this project, we are building a web platform that harnesses the vast memory and interactivity of the web to measure and understand innovation, and to inform policy.
What do the microscope, the census and the radar all have in common? They are measurement technologies that have transformed the way we see the world, and our ability to change it.
The web is another powerful measurement technology. It has a vast memory, capturing masses of highly detailed data about its users; and it is interconnected and interactive - it is easy to connect different datasets, republish data, and interact with it.[1] The result has been an explosion of massive, messy, fast-growing datasets – the vaunted big data revolution.[2]
Businesses and policymakers are using an expanding arsenal of analytics technologies to manage and create value from this data.[3] These tools can also help innovation policymakers whose job is to bolster new ideas and industries that drive growth.
Arloesiadur is a new Nesta project supported by the Welsh Government, in which we will create a data engine to automatically access, combine and analyse data to inform innovation policy in Wales. This engine will power an online platform where users can access and interact with the data.[4]
We think this is a great opportunity. Innovation analysts and policymakers have long struggled with existing data sources about economic activity, such as business and labour surveys.[5] This is in large part because of their interest in novelty (new ideas, businesses, industries, communities, places etc.) created by complex networks of people, organisations and knowledge. These important aspects of innovation are not well captured in existing, traditional datasets.
Where do the problems lie? How can we use the web to address them?
Official datasets have a problem capturing novelty because they are structured around infrequently updated Standard Industrial Classification (SIC) codes, which don’t include those industries born or recognised after the codes were agreed. They have a problem capturing complexity because they typically don’t measure relationships between businesses, or between businesses and other organisations such as universities or government. Researchers in the field of scientometrics use citation and authorship data in patents and papers to map some of these innovation networks, but this only covers those (relatively rare) industries that patent and publish.[6]
As a consequence, it is hard to use these data sources to answer many of the questions that matter to innovation policymakers.
Increasingly, we can mine the big memory of the web to fill existing innovation data gaps, and address some of these questions.
For example, we can download (scrape) the text in a company website and analyse it to determine its industry, even if that industry is not in the official classifications.[7] We can extract data from websites such as Meetup.com (a website to organise networking events), GitHub (a website where coders collaborate on software projects) or Twitter to measure and map innovation networks outside science and tech-based sectors.[8] Another advantage of web data is that it can be timelier than official datasets, which are generally released with a lag.
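To make the first idea concrete, here is a minimal, hypothetical sketch of classifying a company from its website text. It uses hand-picked keyword lists purely for illustration; the studies cited above would train a proper classifier on labelled website text rather than rely on a toy lookup like this.

```python
import re
from collections import Counter

# Hypothetical keyword lists per industry. Real work would learn these
# features from labelled examples, not hand-pick them.
INDUSTRY_KEYWORDS = {
    "video games": {"game", "gameplay", "console", "multiplayer", "studio"},
    "biotech": {"genome", "clinical", "protein", "assay", "therapeutics"},
}

def classify_site(page_text: str) -> str:
    """Assign the industry whose keywords appear most often in the text."""
    words = Counter(re.findall(r"[a-z]+", page_text.lower()))
    scores = {
        industry: sum(words[kw] for kw in keywords)
        for industry, keywords in INDUSTRY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

sample = "We are an independent studio making multiplayer console games."
print(classify_site(sample))  # video games
```

The point of the sketch is that the label comes from what the company says about itself on the web, so a firm in an industry too new to have a SIC code can still be assigned to it.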
An increasing number of studies show that, if one is careful, it is possible to mine the web to understand innovation in ways that were not possible before. Here are some examples of relevant Nesta work:
Innovation is complex. Successful innovations cut across industries and geographies. They combine knowledge, skills and money. They generate tangible outputs (that new product you just bought) and intangible ones (the knowledge that there is a market for that product, and that you could improve on it).
Supporting innovation often requires connecting communities, industries and locations. Understanding innovation requires breaking down the data silos that contain relevant information about it. It also requires tracking innovation over long periods of time - for example, because the impacts of innovation interventions are often realised not by the businesses that receive them, but by their spin-outs and the companies in their networks.
This is where the web’s connectivity comes into play, allowing us to merge, link and mash datasets, often automatically with Application Programming Interfaces (APIs), the data connectors that allow websites to talk with each other.[10]
We can use this connectivity to pragmatically fill gaps across datasets - for example, in Nesta’s games map we combined economic activity data from official sources with business counts based on web data to study video games clusters. We can also benchmark unproven (web) datasets against quality-assured (official) ones, many of which are now open or available via APIs.[11] The web is a machine for triangulation.
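A minimal sketch of both ideas - gap-filling and benchmarking - might look like the following. The region names and counts are invented, and in practice each mapping would come from an API call (an official statistics API and a web-scraped source) rather than a hard-coded dictionary.

```python
# Hypothetical region-level business counts: 'official' stands in for a
# statistics API, 'web' for counts derived from company websites.
official = {"Cardiff": 120, "Swansea": 45, "Newport": 30}
web      = {"Cardiff": 150, "Swansea": 40, "Wrexham": 25}

def merge_counts(a, b):
    """Outer-join two {region: count} mappings, filling gaps with 0."""
    regions = sorted(set(a) | set(b))
    return {r: (a.get(r, 0), b.get(r, 0)) for r in regions}

merged = merge_counts(official, web)

# A crude benchmark: where both sources report a figure, how closely
# do they agree? (1.0 would mean perfect agreement.)
overlap = [(o, w) for o, w in merged.values() if o and w]
agreement = sum(min(o, w) / max(o, w) for o, w in overlap) / len(overlap)

print(merged["Wrexham"])  # (0, 25): a gap in the official data, filled from the web
```

The merged table shows the triangulation at work: Wrexham appears only in the web source, Newport only in the official one, and the overlapping regions let us sanity-check the web data against the quality-assured baseline.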
The web is also interactive. This is relevant for innovation policy because it empowers users to find the information they are looking for in big, complex and messy datasets, and explore questions that the people who originally collected and analysed the data hadn’t thought of. Interactivity also helps to visualise complex innovation datasets in ways that are easier to understand for non-technical audiences.
Here are some examples of Nesta visualisations showing:
HEFCE is also exploring this space with its interactive maps of Higher Education data.
In Arloesiadur, we plan to obtain new insights about innovation from the vast memory of the web, and connect them in an interactive website that creates value for innovation policymakers.
One could imagine applications for these insights across the innovation policy cycle. For example, they could be used to...
We know, from economic history, that simply installing new digital technologies in an industry isn’t enough to reap their rewards. Other complementary innovations – new skills, organisations and processes – are needed to create value from those technologies.[12]
In addition to giving us new, striking and useful insights about innovation, we hope that Arloesiadur will teach us something about how innovation policy needs to change in order to adapt to the big data era, and about the tools we can use to accomplish this.
We will keep you posted about what we find.
This blog received valuable comments from Hasan Bakhshi, John Davies, James Gardiner and Giulio Quaggioto. Image by Fdecomite, via Flickr, CC by 2.0.
[1] Of course, this raises many privacy and data protection issues that have to be carefully assessed and managed. In that respect, it is not different from other measurement technologies like the census.
[2] The term big data refers to datasets with more volume, variety and velocity than was previously available. Big datasets can create big opportunities for innovation, but realising them requires new technologies, processes and skills. This is a good summary of big data opportunities, and this is a Nesta analysis of the skills angle.
[3] This McKinsey Global Institute review overviews big data related opportunities in several private sectors, and in government. This Nesta paper quantifies the impact of data analytics on business performance.
[4] We just started the project, so it still isn't clear what we will build. The final product could end up looking more like a “dashboard”, a “data analytics tool” or a “data application”. Given this, and our vision for Arloesiadur as a foundation that will be extended with new data sources and methods in the future, I have stuck to the wider “platform”, used in an informal sense.
[5] Some of these issues are picked up in last year’s interim report for Sir Charles Bean’s Review of Economic Statistics.
[6] This recent report (PDF) about the data landscape for science and innovation in the UK contains an excellent overview of data sources available for this.
[7] This paper uses this method to analyse R&D in green tech businesses.
[8] Alas, the memory of the web is not always reliable. Its data can be distorted by biases - for example (and anecdotally), businesses in digital media are heavier users of Twitter than those in biotech. It can mislead us during longitudinal analysis, say if changes in a website or in the behaviour of its users create discontinuities in a time series unrelated to the “real” behaviours we are interested in (here is a famous example). It isn’t complete either: online data sources often lack important variables such as company financials.
[9] We just published another blog post using the same method to analyse networking at Innovate UK’s 2015 conference.
[10] Connectivity brings its own problems. It can make datasets less stable, as when changes in a website automatically spread through the network of sites that use it to source data.
[11] This presentation shows some of our results when we benchmarked web and official data in the UK games project.
[12] The canonical reference for this is Paul David’s paper about the Dynamo and the Computer (PDF).