The Open Jobs Observatory was created by Nesta, in partnership with the Department for Education. This first article describes how we identified jobs in green industries. The second article compares green job definitions, presents preliminary results from applying our methodology, and discusses the current policy climate surrounding the transition to a green economy.
In parallel with government intervention to stimulate the green economy, we have developed one of the first open methodologies for automatically identifying job advertisements in green industries. This effort comes at a time of busy policy action: the UK Government has recently committed to creating and supporting two million jobs in green industries by 2030 in its Ten Point Plan for a Green Industrial Revolution[1], and has created a Green Jobs Taskforce to facilitate this goal[2]. While there has been considerable effort to generate additional jobs in green industries, making these jobs tangible, for example by surfacing common job titles and the locations where green jobs are concentrated, is a critical next step in the green transition. Our methodology for identifying job advertisements as ‘green’ serves to address this lack of tangibility.
At the highest level, we took a supervised machine learning approach to identifying jobs in green industries. This meant that we manually labelled jobs as either green or not green and trained a classifier to label unseen jobs as belonging to either of those categories.
We chose to operationalise one official definition of jobs in green industries: the United Nations System of Environmental-Economic Accounting’s Environmental Goods and Services Sector (EGSS)[3]. The EGSS is made up of areas of the economy engaged in producing goods and services for environmental protection purposes, as well as those engaged in conserving and maintaining natural resources. There are 17 UK-specific activities associated with the EGSS, including (but not limited to): wastewater management, forest management, environmental consulting and in-house business activities that include waste and recycling[4]. Our methodology identifies both critical roles (e.g. a renewable energy engineer) and general roles (e.g. an accountant for a green energy company) within these sectors.
The set of job adverts which was used to train our model and then identify jobs in green industries comes from Nesta’s Open Jobs Observatory. The Observatory, which is in partnership with the Department for Education, provides free and up-to-date information on the skills requested in UK job adverts. The collection began in January 2021, and the Observatory now contains several million job adverts.
While our pipeline does a reasonable job of identifying jobs in green industries within our evaluation set, there are invariably some limitations to bear in mind. Firstly, this approach will not capture jobs in green industries whose adverts are vaguely worded and lack green-specific terminology. Secondly, our pipeline relies on the assumption that our training data is representative of all 17 EGSS activities. If there are too few labelled jobs in specific EGSS activities (such as environmental construction), the model will be less effective at identifying those jobs.
The article will now walk through, in greater detail, the steps we took to identify jobs in green industries within the Observatory using a supervised approach.
Our approach to identifying jobs in green industries can be broken down into three steps: extracting features from the advert text, training a classifier on those features, and applying the trained model to unseen adverts.
Before following this methodology, we first generated labelled data to train our classifier. We did this by manually reviewing a random sample of the job adverts in the database and labelling each job as green or not green accordingly. Jobs were labelled ‘green’ if they fell into any of the 17 EGSS activities, and ‘not green’ if they did not fall into any of the categories.
After we labelled the random sample of job adverts as green or not green, we manually generated a list of key phrases that mapped onto the 17 EGSS activities. For example, EGSS activity number 6, ‘production of renewable energy’, and its associated description were summarised as ‘renewable energy production’, ‘renewable heat’ and ‘biofuels’.
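To make this mapping concrete, a structure like the following could hold the seed phrases. The dictionary keys follow the ONS EGSS activity list, but the phrases shown here are illustrative rather than the full list used in the pipeline:

```python
# Illustrative seed phrases mapped onto EGSS activities (not the full list).
egss_seed_phrases = {
    "production_of_renewable_energy": [
        "renewable energy production", "renewable heat", "biofuels",
    ],
    "wastewater_management": ["wastewater treatment", "sewage treatment"],
    "forest_management": ["forestry", "woodland conservation"],
    # ... one entry for each of the 17 EGSS activities
}

# Flatten to a single keyword list for matching against advert text
seed_keywords = [p for phrases in egss_seed_phrases.values() for p in phrases]
```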
After developing the list of key phrases that mapped onto all EGSS activities, we ‘expanded’ those queries by identifying similar phrases using Word2Vec word embeddings. Word embeddings are a learned representation of text in which words with similar meanings are represented by similar numbers. This representation allowed us to perform mathematical operations (such as distance calculations) to identify similar words in the embedding space, generating additional key phrases related to EGSS activities beyond the initial keyword list. Following this process, we had approximately 230 key phrases and terms. This keyword list acted as the basis of one of the ‘features’ input to our classifier.
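A minimal sketch of this expansion step, assuming a gensim Word2Vec model trained on tokenised advert text (the `topn` and similarity cut-off values below are illustrative):

```python
from gensim.models import Word2Vec

# Assumed: `tokenised_adverts` is a list of token lists, one per job advert
model = Word2Vec(
    sentences=tokenised_adverts,
    vector_size=100,  # embedding dimensionality
    window=5,         # context window size
    min_count=5,      # ignore rare tokens
    workers=4,
)

# Expand each seed keyword with its nearest neighbours in embedding space
expanded_keywords = set(seed_keywords)
for keyword in seed_keywords:
    if keyword in model.wv:  # multi-word phrases need phrase detection first
        for neighbour, similarity in model.wv.most_similar(keyword, topn=10):
            if similarity > 0.7:  # illustrative similarity cut-off
                expanded_keywords.add(neighbour)
```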
Once we had generated our expanded list of key phrases and terms associated with EGSS activities, we turned our attention to the job adverts, namely to the raw job title and description text. We ‘preprocessed’ our text data so that the text was in a predictable and analysable form for the task at hand. The preprocessing steps we took included removing punctuation from the text, converting all text to lowercase, removing ‘stop words’ (i.e. uninformative words such as ‘the’, ‘a’ or ‘your’) and lemmatising terms. Lemmatisation is the simplification of inflected words by converting them to their canonical, dictionary forms.
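The article does not name the libraries used; the sketch below implements these preprocessing steps with NLTK, one common choice:

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("punkt")      # tokeniser models
nltk.download("stopwords")  # stop word lists
nltk.download("wordnet")    # lemmatiser dictionary

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> list[str]:
    """Lowercase, strip punctuation, drop stop words and lemmatise."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = nltk.word_tokenize(text)
    return [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]

preprocess("Your company is seeking engineers for wind turbines")
# -> ['company', 'seeking', 'engineer', 'wind', 'turbine']
```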
Once we had preprocessed our text data, we wanted to identify useful features that the model could use to determine whether a job was green or not. We identified two groups of features that could be helpful: keyword counts and the ‘relevance’ of each word in the text of the job adverts. For keyword counts, we counted the number of expanded EGSS green terms or phrases present in the preprocessed job title and job description, and normalised this count by the total number of words in the title and description of the advert. Meanwhile, we captured word relevance in job texts by representing the text data as matrices of Term Frequency-Inverse Document Frequency (TF-IDF) features. TF-IDF is a common information retrieval technique that weighs the frequency of a word (or term) against the inverse document frequency[5]. We also trimmed our text data to remove terms that appear in more than 60% or in fewer than 5% of all job advertisements. We did this to remove ‘noisy’ terms that were not helpful for distinguishing between jobs in green and non-green industries, such as ‘resume’, ‘seeking’ or ‘apply’.
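A sketch of both feature groups using scikit-learn, with the document-frequency cut-offs described above. Here `preprocessed_adverts` is assumed to be the list of preprocessed title-plus-description strings:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF features, trimming terms in more than 60% or fewer than 5% of adverts
vectorizer = TfidfVectorizer(max_df=0.6, min_df=0.05)
tfidf_features = vectorizer.fit_transform(preprocessed_adverts)

def keyword_density(text: str, keywords: set[str]) -> float:
    """Count expanded EGSS keyword occurrences, normalised by advert length."""
    n_words = max(len(text.split()), 1)
    return sum(text.count(kw) for kw in keywords) / n_words

densities = np.array(
    [[keyword_density(advert, expanded_keywords)] for advert in preprocessed_adverts]
)

# Combine both feature groups into a single input matrix
X = hstack([tfidf_features, densities])
```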
After representing our cleaned text data numerically, we were able to train a classifier using our labelled data to predict whether or not the job advert was likely to be for a position within a green industry. But first, we addressed the imbalance between the number of jobs in green versus non-green industries in the Observatory.
As there were far fewer adverts in the Observatory for jobs in green industries than in non-green industries, we oversampled jobs in green industries in our training data to generate an even class distribution. We did so by applying a data augmentation method called the Synthetic Minority Oversampling Technique (SMOTE)[6]. At a high level, this technique works by 1) selecting a random vectorised green job, 2) finding its k nearest green neighbours using the k Nearest Neighbours (kNN) algorithm and 3) creating a synthetic green job at a randomly chosen point between the selected job and one of those neighbours. kNN is a simple algorithm that finds the k labelled points (in our instance, green jobs) closest to a given point in the feature space; its key assumption is that ‘similar’ points lie close to each other.
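With the features and labels in hand, this step is a one-liner with the imbalanced-learn implementation of SMOTE (the `k_neighbors` and `random_state` values here are illustrative):

```python
from imblearn.over_sampling import SMOTE

# `X` is the combined feature matrix from above; `y` holds the 0/1 labels
# (1 = green). SMOTE synthesises new green examples until classes balance.
smote = SMOTE(k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
```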
Finally, once we had oversampled our training data, we were able to train our classifier. While we tested multiple different classifiers, we ultimately chose to deploy an Extreme Gradient Boosting[7] (XGBoost) model, owing to its superior performance on our evaluation set. The ‘gradient boosting’ element of the model refers to the fact that it is built from a series (or ensemble) of weak classification and regression decision trees. XGBoost differs from other gradient boosting algorithms in the splitting criterion it uses to build those trees: whereas others typically split on the mean squared error[8] or Gini impurity[9], XGBoost uses its own gain formula with stronger ‘regularisation’, which penalises overly complex trees. This means that XGBoost typically does a better job of not overfitting to the training data.
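A sketch of the training step; the hyperparameter values shown are placeholders, since the deployed model was tuned[10]:

```python
from xgboost import XGBClassifier

clf = XGBClassifier(
    n_estimators=300,       # number of boosted trees (illustrative)
    max_depth=6,
    learning_rate=0.1,
    reg_lambda=1.0,         # L2 regularisation on leaf weights
    eval_metric="logloss",
)
clf.fit(X_resampled, y_resampled)
```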
Once we had trained our tuned[10] XGBoost model to identify job adverts as green or not green, we used it to assess job adverts that the model had not yet seen. When we ran our pipeline on an evaluation set, we achieved a precision score of 93% and a recall score of 94% for the green class. Here, precision refers to the percentage of adverts the model labelled as green that were genuinely for jobs in green industries, while recall refers to the percentage of all genuinely green adverts that the model successfully identified. Ultimately, our methodology (when applied to the last three months of adverts collected) estimated that 3% of the job adverts were for positions in green industries.
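For completeness, these metrics can be computed with scikit-learn on a held-out evaluation set (`X_eval` and `y_eval` are assumed to be the evaluation features and labels):

```python
from sklearn.metrics import precision_score, recall_score

y_pred = clf.predict(X_eval)  # adverts the model has not seen before
print(f"Precision (green): {precision_score(y_eval, y_pred):.2f}")
print(f"Recall (green):    {recall_score(y_eval, y_pred):.2f}")
```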
After we identified job adverts that were likely to sit within green industries, we applied a clustering algorithm to embedded representations of unique green job titles in an effort to identify groups of common job titles. We labelled each group by taking the job titles whose cluster-membership probability exceeded a threshold and then deriving the top n phrases associated with those titles using TF-IDF.
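The article does not specify the clustering algorithm; the sketch below uses a Gaussian Mixture Model, which yields the soft cluster-membership probabilities described. `title_embeddings` (one vector per unique green job title) and `green_titles` (the corresponding strings) are assumed, and the component count and probability threshold are illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=20, random_state=42).fit(title_embeddings)
membership = gmm.predict_proba(title_embeddings)
labels, confidence = membership.argmax(axis=1), membership.max(axis=1)

# Name each cluster with the top TF-IDF phrases of its confident members
for cluster in range(20):
    titles = [t for t, l, c in zip(green_titles, labels, confidence)
              if l == cluster and c > 0.8]  # membership threshold (illustrative)
    if not titles:
        continue
    vec = TfidfVectorizer(ngram_range=(1, 2)).fit(titles)
    scores = np.asarray(vec.transform(titles).sum(axis=0)).ravel()
    top_phrases = vec.get_feature_names_out()[scores.argsort()[::-1][:3]]
    print(cluster, list(top_phrases))
```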
While our model does a reasonable job of classifying jobs within the repository as green or not, there are a number of methodological improvements to consider for future development: we could a) label more training data, b) treat the task as a multi-class problem and c) change our representation of the text.
Firstly, increasing the amount of data the classifier is trained on could provide additional information and improve the overall fit of the model. This is especially the case for jobs within EGSS activities that may be underrepresented in the labelled training data. Secondly, while we treated the problem as a binary classification task, there are 17 different activities (or ‘classes’) in the EGSS. We could therefore treat this problem as a multi-class task and train the model to predict which activity (or activities) are connected to a given job. This would provide additional specificity and more granular insight into the green economy. Finally, we could represent our text data in alternative ways beyond TF-IDF, such as using transformer models[11] to embed our job descriptions.
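As an illustration of that final suggestion, embedding adverts with the sentence-transformers library takes only a few lines; the checkpoint named here is one plausible choice, not necessarily what a future iteration would use:

```python
from sentence_transformers import SentenceTransformer

# `job_descriptions` is assumed: a list of raw advert description strings
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(job_descriptions, show_progress_bar=True)
# `embeddings` could replace the TF-IDF matrix as the classifier's input
```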
The aims of this work have been two-fold: to demonstrate the types of analysis that are possible using job adverts from the Open Jobs Observatory, and to start exploring the green economy via data science methodologies. Click here to read about the results of our methodology and how it relates to the current policy climate surrounding the transition to a green economy.
[1] HM Government, The Ten Point Plan for a Green Industrial Revolution, 2020, London, UK.
[2] HM Government, Green Jobs Taskforce, 2020, London, UK.
[3] Eurostat, Environmental Goods and Services Sector Accounts - Practical Guide (United Nations System of Environmental Economic Accounting: 2016).
[4] Office for National Statistics, UK Environmental Goods and Services Sector (EGSS) Methodology Annex (ONS: 2021).
[5] Wikipedia, “tf-idf”, https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition
[6] Wikipedia, “Data augmentation”, https://en.wikipedia.org/wiki/Data_augmentation#Synthetic_oversampling_techniques_for_traditional_machine_learning
[7] XGBoost, “XGBoost Documentation”, https://xgboost.readthedocs.io/en/latest/index.html
[8] Wikipedia, “Mean squared error”, https://en.wikipedia.org/wiki/Mean_squared_error
[9] Wikipedia, “Decision tree learning”, https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity
[10] Wikipedia, “Hyperparameter optimisation”, https://en.wikipedia.org/wiki/Hyperparameter_optimization
[11] Hugging Face, “Sentence Transformers”, https://huggingface.co/sentence-transformers