Resume Parser to extract technical skills

Jun 28, 2020

Resume parsing, or resume screening, is one application of artificial intelligence, specifically of natural language processing and text mining. As Marti Hearst, the mother of text mining, said, “Text has rich information, but it is encoded in a form that is difficult to decode”, and the same is true of resumes, which try to give a complete summary of an individual. Tools such as spaCy and NLTK are widely used to summarize text after preprocessing and could be applied to resumes, but because a resume is already a summary, summarizing it further does not seem useful.

A resume can be of two types:

1- Structured Resume: The content to be included and its position on the resume are both predefined.

2- Unstructured Resume: The applicant is free to draft the resume as they see fit, with no fixed position for its contents.

Parsing a structured resume is somewhat easier than parsing an unstructured one. In this work, we used text phrase matching to summarize any resume, structured or unstructured, by extracting the individual's technical skills. We created a dataset with software developers’ technical skills and education as its broad features. Words and phrases in the resume content are then matched against the dataset to extract the skills. The dataset consists of 5 features: software developer skill areas such as Frontend, Backend, Android Developer, and Machine Learning, which fall under the skill feature, and education, which consists of the names of academic degrees awarded by universities. Skills and education are extracted with this matching technique, and regular expressions are used to extract personal details.
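To make the phrase-matching idea concrete, here is a minimal sketch using only the Python standard library. It is not the code from this project: the `SKILL_DATASET` dictionary, its entries, and the `match_phrases` helper are hypothetical stand-ins for the Excel dataset and the matching step described above.

```python
import re

# Hypothetical stand-in for the dataset: each feature (column) maps to the
# phrases listed under it. The real dataset is an Excel file with its own entries.
SKILL_DATASET = {
    "Frontend": ["html", "css", "javascript", "react"],
    "Backend": ["django", "node.js", "sql"],
    "Machine Learning": ["machine learning", "scikit-learn", "tensorflow"],
    "Education": ["bachelor of engineering", "master of science"],
}


def match_phrases(text, dataset):
    """Return, per feature, the dataset phrases that occur in the text."""
    lowered = text.lower()
    found = {}
    for feature, phrases in dataset.items():
        hits = []
        for phrase in phrases:
            # Lookarounds act as word boundaries that also work for phrases
            # containing punctuation, such as "node.js".
            pattern = r"(?<!\w)" + re.escape(phrase.lower()) + r"(?!\w)"
            if re.search(pattern, lowered):
                hits.append(phrase)
        if hits:
            found[feature] = hits
    return found


resume_text = (
    "Bachelor of Engineering graduate. Built REST APIs with Django and SQL; "
    "UI in React, HTML and CSS."
)
print(match_phrases(resume_text, SKILL_DATASET))
```

Matching on lowercased text with explicit boundaries keeps short entries such as "sql" from matching inside longer words, which matters when the dataset contains many short skill names.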

The whole process is depicted in the diagram below.

**Figure 1: Architecture used for Technical Resume Parsing**

It accepts a resume as any text document (pdf, docx, doc, txt, or other formats), which is then processed in the following steps.

1- The Tika parser parses the submitted resume and writes the extracted content, excluding graphics, to a new text file.

2- The dataset is then loaded and any null values are removed.

3- Named Entity Recognition is performed: the email address and phone number are extracted from the resume using regular expressions, as they are crucial personal information during the hiring process.

4- Preprocessing, which includes capitalizing the text, converting the whole text to lowercase, and capitalizing the first letter of each word for phrase matching.

5- Academic qualifications are extracted using phrase matching.

6- Stopwords are removed, and the remaining words are joined into a string without stopwords.

7- Finally, skills are matched against the text left over after stopword removal.

8- All the extracted details (email, phone, academic qualification, and skills) are then displayed.
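Steps 3, 4, 6, and 7 can be sketched roughly as follows, assuming the resume text has already been extracted to a plain string. This is an illustrative sketch, not the project's code: the regular expressions, the tiny `STOPWORDS` set, and the function names are all assumptions chosen for the example, and a real parser would need broader patterns and a fuller stopword list such as the ones shipped with NLTK or spaCy.

```python
import re

# Illustrative patterns only; real resumes need broader phone formats.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

# Tiny hypothetical stopword list, for demonstration only.
STOPWORDS = {"a", "an", "and", "the", "in", "of", "with", "i", "have", "on", "at", "to"}


def extract_contacts(text):
    """Step 3: pull the first email address and phone number, if any."""
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    return {
        "email": email.group() if email else None,
        "phone": phone.group() if phone else None,
    }


def remove_stopwords(text):
    """Steps 4 and 6: lowercase, tokenize, drop stopwords, rejoin."""
    tokens = re.findall(r"[a-z0-9+#.]+", text.lower())
    tokens = [t.strip(".") for t in tokens]  # drop sentence-final periods
    return " ".join(t for t in tokens if t and t not in STOPWORDS)


def match_skills(cleaned_text, skills):
    """Step 7: keep the skills that appear as whole words or phrases."""
    padded = f" {cleaned_text} "
    return [s for s in skills if f" {s.lower()} " in padded]


resume = (
    "Contact: jane.doe@example.com, +1 555-123-4567. "
    "Experience in Python and Machine Learning with Flask."
)
cleaned = remove_stopwords(resume)
print(extract_contacts(resume))
print(match_skills(cleaned, ["Python", "Machine Learning", "Flask", "Java"]))
```

One plausible reason for extracting academic qualifications before stopword removal is that degree names often contain stopwords ("Bachelor of Engineering"), so they would no longer match once words like "of" are dropped.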

The code and the created dataset (Excel file here) for this work can be found here.

Written by Dipendra Pant