This is the first in a series of data blogs documenting our exploration of potential data sources for Arloesiadur, the innovation analytics platform we are developing for Welsh Government.
With Arloesiadur, we want to give innovation policymakers information relevant for decisions across the policy cycle, from agenda setting to monitoring.
But what data do we use? How do we analyse it? How do we present it?
To answer these questions, we are carrying out a series of short and focused data pilots, each of which will consider an area of interest to innovation policy makers, and explore a data source relevant for it using data analytics methods and outputs. We will be carrying out four of these pilots over the next few months. We will wrap up each of them by looking at how the data were gathered, how we analysed them, what information they hold and how they could fit with Arloesiadur.
Our first data pilot is looking at research networks:
We have explored these questions using an open dataset created by the UK Research Councils, the Gateway to Research. This blog discusses how we have collected and analysed the data. We discuss preliminary findings of our analysis of Welsh networks in this follow-up.
The Gateway to Research (GtR) has been developed by Research Councils UK (RCUK) to open up access to publicly funded research projects and their outputs. It includes research projects from the UK's seven research councils as well as data on Innovate UK and National Centre for the Replacement, Refinement and Reduction of Animals in Research (NC3Rs) funded projects, with data stretching back to 2006.
For each project, there is information about the level of funding received, the source of funding, the type of projects, the researchers and organisations involved, and the outcomes, including papers, spinouts, and follow-on funding, among other things.
Each individual entity in the database is uniquely identified, giving us the opportunity to link multiple aspects of the data together to paint a broad and detailed picture of the research landscape in the UK for projects funded through the UK's own research councils, as well as Innovate UK. This only scratches the surface of what the data actually contains; the GtR data dictionary provides far more detailed information on what is available. All of the data in GtR is subject to the Open Government Licence (OGL), allowing reuse by third parties.
The data can be accessed in a number of ways: interactively through the website, and programmatically through three different Application Programming Interfaces (APIs), details of which can be found here. We used the GTR-2 API to access all of the data about projects, individuals, funds, organisations, and publications. We stored all of these data locally for analysis. We wrote a simple Python wrapper around the API and a handful of scripts to access and store the data. We are open sourcing that software so that others can use it (making our results more reproducible).
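A minimal sketch of the "download everything and store it locally" step is shown below. It is illustrative rather than the wrapper described above: `fetch_page` is a hypothetical callable standing in for the HTTP request, and the `"items"` and `"total_pages"` keys are assumed names, not the actual GtR response schema.

```python
from typing import Callable, Dict, Iterator, List


def iter_records(fetch_page: Callable[[str, int], Dict], resource: str) -> Iterator[Dict]:
    """Yield every record of one resource type, one page at a time.

    `fetch_page(resource, page)` is a hypothetical callable that performs the
    request and returns a dict with "items" and "total_pages" keys. Both key
    names are assumptions made for this sketch.
    """
    page = 1
    while True:
        payload = fetch_page(resource, page)
        yield from payload["items"]
        if page >= payload["total_pages"]:
            break
        page += 1


def download_all(fetch_page: Callable[[str, int], Dict],
                 resources: List[str]) -> Dict[str, List[Dict]]:
    """Collect all records for each resource type into local lists."""
    return {name: list(iter_records(fetch_page, name)) for name in resources}
```

Passing the request function in as an argument keeps the paging logic separate from the network code, so it can be exercised with canned pages before being pointed at the live API.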
Even though the dataset isn’t big (1), it is complex: linking hundreds of thousands of research papers to specific authors, projects and organisations isn't a simple task that can be conducted in a spreadsheet. At the time of writing, the datasets consist of 65,568 projects, 502,145 publications, 51,962 individuals, 27,785 organisations and 206,206 outcomes. Our analyses were conducted in interactive notebooks (again using Python), making our methods as transparent as possible whilst allowing others interested in the data to reproduce, amend or otherwise build on our research.
To understand the research landscape in Wales, we needed to look at the data not just in network space (i.e. connections independent of a geographic location) but also in geographic space, where each datum is geolocated. To do this, we geocoded the data using Google’s geocoding Application Programming Interface (API). In other words, we used information about the location of an organisation (such as its address, or the name of the place where it is based) to estimate its geographical coordinates - latitude and longitude.
This process is not error-free: misspelled place names and incorrect postcodes can mean that an organisation cannot be geolocated, or that it is located inaccurately. This was a particular issue for organisations based outside of the UK (UK addresses generally contained postcodes that were matched reliably by the geocoder). Of the 27,786 organisations in the data we were able to geocode 20,219.
We used the latitude and longitude of the organisations to assign them to Local Authority Districts (LAD), generating counts of projects in each district, but also subsetting the data to look at the number of collaborations between Welsh universities and other organisations across Great Britain, by LAD (figure 1).
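Assigning a coordinate pair to a district is a point-in-polygon problem. The sketch below uses a plain ray-casting test on simple polygons to show the idea. It is an assumption-laden illustration: real district boundaries are far more detailed (and may have several parts), a GIS library would normally do this job, and the function names and the `districts` structure are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: count how many polygon edges a ray heading east crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_district(point: Point, districts: Dict[str, List[Point]]) -> Optional[str]:
    """Return the first district whose polygon contains the point, if any."""
    for name, polygon in districts.items():
        if point_in_polygon(point, polygon):
            return name
    return None


def count_by_district(locations: Iterable[Point],
                      districts: Dict[str, List[Point]]) -> Counter:
    """Count located records per district, skipping points outside all of them."""
    counts = Counter()
    for point in locations:
        district = assign_district(point, districts)
        if district is not None:
            counts[district] += 1
    return counts
```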
To better understand the interactions between businesses and academia, we needed to distinguish the organisations in GtR that were businesses from those that were academic or public sector organisations. To do this, we looked for keywords such as limited, company or corporation to identify businesses, and words such as university and council to identify other types of organisations. This approach identified 11,357 companies and 4,785 organisations of other types, leaving 9,492 organisations unclassified.
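The keyword rule can be sketched in a few lines. The keyword sets below contain only the examples named above (the full lists used in the analysis are not given here), and checking business keywords before the others is an assumption made for this sketch.

```python
import re

# Only the example keywords mentioned in the text; the real lists may be longer.
BUSINESS_KEYWORDS = {"limited", "company", "corporation"}
OTHER_KEYWORDS = {"university", "council"}


def classify_organisation(name: str) -> str:
    """Label an organisation name as 'company', 'other' or 'unclassified'."""
    # Split into lower-case words so that keywords match whole words only.
    tokens = set(re.findall(r"[a-z]+", name.lower()))
    # Assumption: business keywords take precedence when both kinds appear.
    if tokens & BUSINESS_KEYWORDS:
        return "company"
    if tokens & OTHER_KEYWORDS:
        return "other"
    return "unclassified"
```

Matching on whole words rather than substrings avoids, for example, treating "unlimited" as a hit for "limited".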
To increase our recall (the share of relevant organisations we were able to classify), we ran the unclassified organisations’ names and addresses through the Companies House Beta API. We recorded an organisation as a company only when the API returned a result with the same name and postcode as the organisation, which avoided false matches between organisations with similar names.
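The matching rule (same name and same postcode) might look something like the sketch below. The `"name"` and `"postcode"` keys are hypothetical field names rather than the Companies House response schema, and the light normalisation of case, punctuation and spacing is an assumption about what counts as "the same".

```python
import re
from typing import Dict, Iterable


def normalise_name(name: str) -> str:
    """Lower-case, replace punctuation with spaces and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(cleaned.split())


def normalise_postcode(postcode: str) -> str:
    """Remove whitespace and upper-case, so 'cf10 1aa' equals 'CF101AA'."""
    return re.sub(r"\s+", "", postcode).upper()


def is_registered_company(org: Dict[str, str],
                          candidates: Iterable[Dict[str, str]]) -> bool:
    """True if any search result shares both name and postcode with `org`."""
    name = normalise_name(org["name"])
    postcode = normalise_postcode(org["postcode"])
    if not name or not postcode:
        # Without both fields we cannot rule out a similar-name false match.
        return False
    return any(
        normalise_name(c["name"]) == name
        and normalise_postcode(c["postcode"]) == postcode
        for c in candidates
    )
```

Requiring both fields to agree trades some recall for precision: an organisation whose registered address differs from its address in GtR will stay unclassified.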
Running the data through the Companies House API created an additional 1,495 matches. The final number of classified companies was 12,850. 4,718 organisations were classified as ‘other’ and 7,997 organisations remained unclassified.
We are interested in measuring levels of research activity and collaboration in different academic disciplines. This will help us understand the comparative advantage of Welsh universities and R&D oriented businesses in different areas. We would also expect to find variation in research networks across disciplines - in terms of their structure, the Wales-based organisations that participate, as well as their partners elsewhere. By splitting our data into disciplinary networks we are able to look at this. Further, identifying disciplines makes it possible to look at collaborations across disciplines, an area of special interest for policymakers because of both its potential to generate innovations and the barriers it faces.
GtR contains research subject and topic information for 46% of the projects - these are grant awards and fellowships. Other project categories, such as Innovate UK funded projects, don’t have research subject data (2).
There are 82 unique research subjects (e.g. ICT, Climate and Climate Change, Materials Sciences) and 607 more detailed unique topics in the data. How can we reduce some of this complexity to produce a smaller set of research domains to analyse and report? Instead of using a pre-set taxonomy, we have identified research domains from the bottom up, based on the propensity of different topics to appear in the same projects, using a community detection algorithm (3).
In practice, this involves representing the research topics as a network where the nodes are topics, and the links between them the number of instances where they appear in the same projects. Topics that tend to appear in the same projects are “connected” and classified in the same research domain. Those that appear together rarely, if at all, are classified in different domains (4).
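Building that topic network amounts to counting, for every pair of topics, the number of projects in which both appear. A minimal sketch using only the standard library follows; the function name and the input format (one list of topic names per project) are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def cooccurrence_weights(
    project_topics: Iterable[List[str]],
) -> Dict[Tuple[str, str], int]:
    """Return {(topic_a, topic_b): number of projects containing both}."""
    weights: Counter = Counter()
    for topics in project_topics:
        # De-duplicate and sort so each pair has one canonical orientation.
        for pair in combinations(sorted(set(topics)), 2):
            weights[pair] += 1
    return dict(weights)


# Toy input: each inner list holds the topics attached to one project.
projects = [
    ["Complexity Science", "Statistics and Applied Probability"],
    ["Statistics and Applied Probability", "Complexity Science"],
    ["Climate and Climate Change", "Statistics and Applied Probability"],
    ["Climate and Climate Change"],
]
edges = cooccurrence_weights(projects)
```

The resulting weighted edge list is the input that a community detection algorithm would then partition into domains.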
The interactive “knowledge graph” above shows the outputs of this analysis. The nodes are research topics and the links represent their tendency to feature in the same projects; the size of a node represents its number of connections with other topics (a proxy for its “centrality” in the network) and its colour the research domain in which it has been classified. Our analysis identifies 7 groups which we have labelled “Arts and Humanities”, “Engineering and Technology”, “Environmental Sciences”, “Mathematics and Computing”, “Life Sciences”, “Physics” and “Social Sciences”. The categories look intuitive, and their contents make sense. The graph also captures relations between disciplines: for example, Arts and Humanities are more closely connected with Social Sciences than with STEM research domains.
We have also looked at the “brokerage” position of different topics in the graph in order to determine which are the ones that connect different research domains. Interestingly, this includes scientific topics with significant policy or social angles, such as “Climate and Climate Change” or “Medical Science and Disease”, interdisciplinary fields such as “Science and Technology Studies” or “Complexity Science” and topics that develop tools and knowledge relevant in many different areas, such as “Statistics and Applied Probability”, “Technology and method development” and “Instrumentation Engineering and Development”.
Although we think that further analysis of this knowledge graph and its evolution can yield interesting and policy relevant findings, for now we have focused on using its main output (the classification of research topics into domains) to classify research projects. The protocol to do this is quite simple: if a project has all its research topics within a single research domain (e.g. “Physics”) then we classify it into that domain. If it has a combination of research domains (e.g. “Physics” and “Mathematics and Computing”) we classify it in a “Mixed” category, which may be of particular interest insofar as it may capture multidisciplinary and interdisciplinary projects.
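This protocol translates almost directly into code. In the sketch below, `topic_domain` is a hypothetical lookup from topic name to research domain; returning `None` for projects with no classified topics is an assumption, since the protocol above does not say how such projects are handled.

```python
from typing import Dict, Iterable, Optional


def classify_project(topics: Iterable[str],
                     topic_domain: Dict[str, str]) -> Optional[str]:
    """Return the project's single domain, 'Mixed', or None if none is known."""
    # Collect the distinct domains of the topics we can look up.
    domains = {topic_domain[t] for t in topics if t in topic_domain}
    if not domains:
        return None  # assumption: projects without classified topics stay unlabelled
    if len(domains) == 1:
        return next(iter(domains))
    return "Mixed"


# Illustrative lookup with made-up topic names.
topic_domain = {
    "Topic A": "Physics",
    "Topic B": "Physics",
    "Topic C": "Mathematics and Computing",
}
```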
Though the configuration of research networks in network space is of interest, the geographic setup of these networks is also important. It shows not just how research networks collaborate, but where they collaborate (and where they don't). In a similar vein to the network space analysis, it can show us where conglomerates of research in distinct subjects are taking place, throwing light on previously unseen patterns of innovative behaviour.
We are able to do this because GtR contains address data for the organisations in its database. We take these address strings and, using Google's geocoding API (geocoding is the process of turning addresses into geographic coordinates), obtain the position of each organisation in Great Britain.
In an accompanying blog post we detail this geographic spread of Welsh universities, finding that, perhaps unsurprisingly, Cardiff University has the highest levels of collaboration across the UK of any of the Welsh universities, with 1,050 collaborative projects having taken place across 151 districts. We also find little distinct difference in geographic patterns between the universities' collaborations: for obvious reasons, Cardiff and Swansea both collaborate strongly along the M4 corridor; similarly, Bangor has collaborations stretching across the north of England and Aberystwyth has projects covering much of mid and west Wales. The separate blog post contains more details on the method we used to generate these results, as well as links to the code used in our analyses.
The accompanying blog presents some preliminary findings from an initial exploration of the data, focusing on Welsh research specialisations, collaborations and networks.
In terms of next steps, we feel that we have only scratched the surface of what is possible with this dataset. For instance, we haven’t analysed in detail data relating to outcomes (including publications) and how these differ across Wales and by subject.
We also need to consider whether Gateway to Research covers enough of the research ecosystem in Wales to be representative of wider trends. If not, we will need to consider adding new networks to our analyses, for example using data about European Commission research funding through the FP7 and Horizon 2020 programmes, as well as Microsoft's Academic Graph. What is certain is that this type of information can be of great value not just to policymakers in government but also to the wider higher education sector: the Higher Education Funding Council for Wales (HEFCW) and the English HEFCE are prime examples of this. We look forward to discussing these issues with policymakers and support agencies in Wales and elsewhere.
We are now moving to our next data pilot, where we examine the interaction of multiple networks of innovation in Wales (including the academic network examined here). This will include gathering and analysing data from a number of sources (including GitHub, Meetup, Twitter, official Welsh Government data and information from local sources in Wales, such as tech incubators).
1. Even after storing all of GtR, the size of the database was well within the storage capacities of any modern laptop.
2. Not all grant awards and research funded projects have subject data - it seems that departments in Life Sciences and Medicine are less likely to provide subject data than those in other areas.
3. One could use other methods here, such as topic modelling. We will explore what those methods yield in follow-up analyses.
4. We then use a community detection algorithm (the Louvain method) that decomposes this network into components whose nodes are densely connected with each other, and sparsely connected with other components in the network. In other words, the algorithm optimises the modularity of the network.
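To make "modularity" concrete, the sketch below computes the modularity score of a given partition of a weighted, undirected network. It evaluates the quantity that the Louvain method tries to maximise. It is not an implementation of the Louvain method itself, and the function name and input format are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, Tuple


def modularity(edge_weights: Dict[Tuple[Hashable, Hashable], float],
               community: Dict[Hashable, Hashable]) -> float:
    """Modularity of a partition of an undirected weighted graph.

    Assumes each edge is listed once and there are no self-loops.
    For each community c: (weight inside c) / m - (degree of c / 2m) ** 2,
    where m is the total edge weight; the score is the sum over communities.
    """
    total = sum(edge_weights.values())  # m
    if total == 0:
        return 0.0
    internal = defaultdict(float)  # edge weight falling inside each community
    degree = defaultdict(float)    # summed weighted degree of each community
    for (u, v), w in edge_weights.items():
        degree[community[u]] += w
        degree[community[v]] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    return sum(
        internal[c] / total - (degree[c] / (2 * total)) ** 2 for c in degree
    )


# Toy network: two triangles joined by a single bridging edge.
toy_edges = {
    ("a1", "a2"): 1, ("a2", "a3"): 1, ("a1", "a3"): 1,
    ("b1", "b2"): 1, ("b2", "b3"): 1, ("b1", "b3"): 1,
    ("a3", "b1"): 1,
}
two_groups = {n: n[0] for n in ["a1", "a2", "a3", "b1", "b2", "b3"]}
one_group = {n: "all" for n in two_groups}
```

Splitting the toy network into its two triangles scores higher than lumping all six nodes together, which is the sense in which densely connected groups of topics are treated as separate domains.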