How I Extracted 588 Questions from a PDF File with Regex

Unleashing the potential of regular expressions

Introduction

Before I start, I'd like to mention that this article is meant to illustrate the potential of regular expressions, with keen attention to their style and syntax in JavaScript. Please note that the code samples in this article were executed in Node v14.8.1. To get started with Node, please check out the official installation guide.

To begin with, regular expressions (Regex or Regexp), as defined on MDN Web Docs, "are patterns used to match character combinations in strings".

The concept of Regexp reportedly dates back to the 1950s and came into common use with Unix text-processing utilities. Regexp syntax occurs in two main variants: the POSIX variant and the Perl variant.

Regexp uses a sequence of characters to find a pattern in a text. It is commonly employed in the find-and-replace functionality of text editors and in many forms of data validation, including but not limited to email validation, file type validation, and phone number validation.

Using Regex in JavaScript

JavaScript implements the greater part of the Perl variant of Regex syntax, and using Regex in JavaScript can take two approaches: one is using Regexp literals, the other is using JavaScript's RegExp() constructor. Both approaches are valid and both give the desired result. However, the RegExp() constructor is recommended if the pattern itself is not known in advance and has to be built dynamically at run time, for example from user input.

//using regexp constructor
const constructorPattern = new RegExp('<pattern>', '<flags>');

//using literal syntax
const literalPattern = /<pattern>/<flags>;

The placeholder <pattern> is the pattern to look for, whilst <flags> could be any combination of g, i, m, meaning global search, case-insensitive search and multi-line search respectively. To learn more about flags, check out the Codeguage guide.

To illustrate the concepts discussed so far, consider the following code sample:
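As a minimal sketch of where the constructor earns its keep, the snippet below builds a pattern from a string that is only known at run time. The escapeRegExp helper, the variable names and the sample strings are illustrative assumptions, not part of the project described in this article.

```javascript
// Hypothetical helper (an assumption for this sketch): escape characters
// that carry a special meaning in a regex so the input is matched literally.
function escapeRegExp(input) {
  return input.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Pretend this value only becomes available at run time, e.g. from user input.
const searchTerm = "D)";

// A regex literal cannot embed a variable, so the constructor is used instead.
const dynamicPattern = new RegExp(escapeRegExp(searchTerm), "g");

const options = "A) Accuracy B) Reliability C) Diligence D) Versatility";
const found = options.match(dynamicPattern);
console.log(found); // [ 'D)' ]
```

Without the escaping step, the closing parenthesis in the search term would make the constructor throw, because an unmatched ")" is not a valid pattern.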


//define a regex
const pattern = /r+/ig;

//text to run the regex against
const text =  "lord Valdemarr hates furrrry animals"

//perform the search
const result = text.match(pattern)
console.log(result); //[ 'r', 'rr', 'rrrr' ]

The code listing above can be read literally as

  1. define a regex pattern to match one or more occurrences of the letter r,
  2. run the pattern against the text,
  3. print the result. Since the letter r, as seen in lord, Valdemarr and furrrry, satisfies the condition, an array of the occurrences was returned.
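To see what the flags contribute, here is a small sketch that runs the same pattern with different flag combinations. The sample phrase and variable names are made up for illustration.

```javascript
const phrase = "Rural roads";

// g only: the search is case-sensitive, so the capital R is skipped.
const globalOnly = phrase.match(/r+/g);

// i and g together: every run of r is found, regardless of case.
const globalIgnoreCase = phrase.match(/r+/gi);

// i only: without g, match() stops at the first occurrence.
const firstOnly = phrase.match(/r+/i);

console.log(globalOnly);       // [ 'r', 'r' ]
console.log(globalIgnoreCase); // [ 'R', 'r', 'r' ]
console.log(firstOnly[0]);     // R
```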

match() is not the only regex method available in JavaScript. Others include exec(), which executes a search for a match in a string; test(), which checks whether a pattern exists in a text sample; search(); replace(); and many more.

Phew! The introduction got longer than I anticipated. I hope you were able to pick up a few ideas.
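A quick sketch of those other methods, reusing the sentence from the earlier listing. The variable names are only illustrative.

```javascript
const sentence = "lord Valdemarr hates furrrry animals";

// test(): does the pattern occur at all?
const hasMatch = /r+/i.test(sentence);

// search(): index of the first match, or -1 when there is none.
const firstIndex = sentence.search(/r+/i);

// exec(): the first match together with details such as its index.
const firstMatch = /r+/i.exec(sentence);

// replace(): collapse every run of r into a single r.
const collapsed = sentence.replace(/r+/gi, "r");

console.log(hasMatch);                        // true
console.log(firstIndex);                      // 2
console.log(firstMatch[0], firstMatch.index); // r 2
console.log(collapsed);                       // lord Valdemar hates fury animals
```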

My Regexp Adventure.

I had to work on a project wherein I was required to build a question bank for a computer-based test prototype. Unfortunately, or perhaps fortunately, the best source I could get the questions from was a PDF file, and I had a maximum of three days to extract the questions and parse them to SQL and JSON. A good 588 questions 😩!

At first, I gave typing the questions a thought, but it seemed impossible to finish in three days and it would be prone to a high rate of inaccuracies ... the clock was ticking, ticktock ... ticktock. I knew I had to come up with something, but what? The urge to quit was setting in.

Luckily for me, I remembered a project I built a few years ago, wherein I used Regex to parse nearly 1000 quotes (scraped from websites) to JSON. A quote is taken from the JSON collection of quotes every few seconds and rendered in a webpage. A nice fade-in and fade-out animation was added too for a smooth transition. You might want to check out the source or demo here.

Back to my story: I was banking on my previous knowledge of using Regex to parse quotes scraped from websites. But hey, "this ain't a website man, it's a pdf", I had to remind myself.

At some point, my thoughts were crippled when I discovered the text couldn't be easily copied from the pdf file. For reasons I don't understand, some characters went missing when copied. I was running out of time.

Well, in the end, I reached out to a friend who offered to write a script in Python to execute the extraction. He was back in a few minutes with a working solution, but I wanted something I could tweak to taste, and using Python isn't really my thing. Luckily for me, I found an npm package to extract the text from .pdf to .txt (text) format. With that out of the way, I quickly put together a few lines of JavaScript to get the job done and have the output written to a file.

extract-pdf.js


//import dependencies
const { extractText } = require('node-extract-text-from-file')
const fs = require('fs');
const path = require('path');

/*define a function to extract text from a pdf file
* the function takes three arguments
* the pdf file name => fileName
* the name of the file to write the extracted text to => targetName, defaults to the current UNIX timestamp in milliseconds
* the preferred file extension of the extracted text => extension, defaults to txt (text format)
*/
async function extract(fileName, targetName = Date.now(), extension = "txt") {
    const url = path.join(__dirname, fileName);
    const { text } = await extractText({ fromPath: url })
    fs.writeFile(`${targetName}.${extension}`, text, function (err) {
        if (err) throw err;
        console.log('Saved!');
    });
}

//function call, logging any extraction error instead of leaving the rejection unhandled
extract("my-pdf-source.pdf").catch(console.error);

The Questions have been extracted, now what?

Yes! I kept asking myself, "How do I write a regex to capture this?" It was the second day already, and it was noon 😕. I knew all I had to do was extract and parse the text with Regexp. I sat staring at my screen, trying out different patterns. Thank heavens VS Code supports regex in its search functionality, so I was able to quickly validate my approach. I was happy to have a working solution 🤗

let parser_01 = /\d{1,3}\.\s*(\w*\s)/;

let parser_02 = /\d{1,3}\.\s*(\w*\s).*(?=D\))/;

let parser_03 = /\d{1,3}\.\s*(\w*\s).*(?=D\)).*\s\|$/;

let parser_04 = /\d{1,3}\.\s*(\w*\s).*(?=D\)).*\s\|\s*$/;

//working solution
let parser_05 = /\d{1,3}\.\s*(\w*\s).* (?=D\)).*\s\|\s*$/gm;

You might want to ask how I came about the pattern. Let's go over the working solution together. The first thing to note is the format of my extracted text (image: text.png).

Remember the literal syntax of regex in JavaScript, /<pattern>/<flags>, as mentioned earlier. To define the pattern (regex), I noticed each question number was followed by a dot, as seen in 1., 456., 30. and more. Another thing is that the question number is between one and three characters long, so I wrote my first part.


let parser_05 = /\d{1,3}\.\s*/;

The listing above means: match text having digits \d, between one and three characters in length {1,3}, followed by a full stop (or dot) \.. Notice the dot is preceded by a backslash; the intent is to escape it, since an unescaped dot matches any character.

Going further, we have \s*, which can be read as "zero or more whitespace characters". \s, like the \d we saw earlier, is a character class in Regex. \d matches a digit; think of "d" as in digit. Other examples of character classes include \f, \n, \r, \t, \v and \w. So far I've only been able to match text starting with a number followed by a full stop and finally a space, as seen in 167.
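To make the first part concrete, here is a small sketch run against a made-up sample (the sample text and variable names are assumptions, not the real extracted file). It also shows why the dot needs escaping.

```javascript
// The first part of the pattern, with the g flag added to collect every match.
const numberPart = /\d{1,3}\.\s*/g;

const sampleText = "1. First question 30. Second question 456. Third question";
const questionNumbers = sampleText.match(numberPart);
console.log(questionNumbers); // [ '1. ', '30. ', '456. ' ]

// An escaped dot only matches a literal full stop...
const escapedDot = /^\d{1,3}\./;
// ...while an unescaped dot accepts any character after the digits.
const unescapedDot = /^\d{1,3}./;

console.log(escapedDot.test("12a"));   // false
console.log(unescapedDot.test("12a")); // true
```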

In the mid part of my regex, (\w*\s).*, I strengthened my grasp by matching any word \w* followed by a whitespace character \s and then any other characters .*. Here the . matches any single character except a line break, while *, + and many more belong to a class called quantifiers, as the MDN docs explain.
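The parentheses around \w*\s also form a capturing group, so the first word of each question is available separately from the full match. A minimal sketch, using a shortened sample line as an assumption:

```javascript
// First and middle parts of the pattern combined.
const midPart = /\d{1,3}\.\s*(\w*\s).*/;

const line = "11. Computer is free from tiresome and boardroom.";
const midMatch = line.match(midPart);

// Index 0 holds the whole match, index 1 holds the captured group.
console.log(midMatch[0]);                 // 11. Computer is free from tiresome and boardroom.
console.log(JSON.stringify(midMatch[1])); // "Computer "
```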

Well, the last part of my regex uses a lookahead assertion, (?=D\)).*, to match the patterns stated above if and only if they are followed by a D) (option d) and anything after it.

The downside of this is that it matches the whole text as just one match and not 588 results as expected. I had to add a step to prepend a pipe to each question number, \d{1,3}\. This allows consistent pattern matching, since the pipe | serves as the end of the previous question and the \d{1,3} serves as the beginning of the next.

This made me add \s\|\s*$ to the existing Regex. You might want to ask why I used a pipe to add consistency. I could have used anything, but I chose the pipe because it seems to stand out and is easy to read.

I put all this logic together in parse-extracted-text.js
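A lookahead checks what comes next without consuming it. The sketch below isolates that behaviour on a made-up options line (the sample strings and variable names are assumptions).

```javascript
const optionsLine = "A) Accuracy B) Reliability C) Diligence D) Versatility";

// The lookahead requires "D)" to follow, but "D)" is not part of the match.
const upToD = optionsLine.match(/.* (?=D\))/);
console.log(JSON.stringify(upToD[0])); // "A) Accuracy B) Reliability C) Diligence "

// Adding .* after the lookahead consumes "D)" and everything after it.
const withRest = optionsLine.match(/.* (?=D\)).*/);
console.log(withRest[0] === optionsLine); // true

// Without an option D there is no match at all.
const noOptionD = "A) Accuracy B) Reliability C) Diligence".match(/.* (?=D\))/);
console.log(noOptionD); // null
```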
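Here is a minimal sketch of the pipe idea on three made-up questions. The sample text is an assumption, and the trailing pipe for the last question is added by hand because no question number follows it.

```javascript
const parser = /\d{1,3}\.\s*(\w*\s).* (?=D\)).*\s\|\s*$/gm;

// Made-up questions standing in for the extracted text.
const rawText =
  "1. What is a CPU? A) Chip B) Disk C) Bus D) Cable " +
  "2. What is RAM? A) Memory B) Disk C) Bus D) Cable " +
  "3. What is a mouse? A) Input B) Output C) Storage D) Network";

// Put a pipe and a blank line in front of every question number, then close
// the last question with a pipe of its own.
const delimited = rawText.replace(/\d{1,3}\./g, "|\n\n$&") + " |";

// Each match is one question; trim() drops the line break that \s* picks up.
const questions = [...delimited.matchAll(parser)].map((match) => match[0].trim());

console.log(questions.length); // 3
console.log(questions[0]);     // 1. What is a CPU? A) Chip B) Disk C) Bus D) Cable |
```

Note that a number followed by a dot inside a question body (a decimal such as 3.5, say) would also receive a pipe, which is one way questions can end up truncated.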


//import file system `fs` and `path` to be able to read from and write to files
const fs = require('fs');
const path = require('path');

//compute the file path of the extracted text
const filePath = path.join(__dirname, "1650010783522.txt");

//the regex
const parser = /\d{1,3}\.\s*(\w*\s).* (?=D\)).*\s\|\s*$/gm;

//read the extracted text file
fs.readFile(filePath, "utf-8", (err, fileData) => {
    if (err) throw err;
    //insert a pipe and a blank line before every question number, giving a format like "11. Computer is free from tiresome and boardroom. We call it A) Accuracy B) Reliability C) Diligence D) Versatility |" to allow consistency
    const splitted = fileData.replace(/\d{1,3}\./gm, "|\n\n$&");

    //collect every question in the splitted text that matches the regex
    const result = [...splitted.matchAll(parser)];

    //introduce a temporary store to hold the parsed questions, keyed by position (starting at 1), before writing them to a file
    const json = {};
    result.forEach((elem, index) => {
        json[`${index + 1}`] = elem[0];
    });

    //write the JSON to a file named json.json and print "Saved!" when the process is completed
    fs.writeFile("json.json", JSON.stringify(json), function (err) {
        if (err) throw err;
        console.log('Saved!');
    });
});

To wrap it all up, I did a few data integrity checks to replace the questions that were not matched, or were truncated because they only loosely conformed to the pattern defined (image: json.png).

I parsed and refined the individual fields of the JSON result using the VS Code regex feature. You can check out Efi Shtain's guide on how to get started with this. Finally, I got what I wanted, and I stored the result in a separate file, question-bank-json.js (image: result.png).
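I did this step by hand in the editor, but for illustration the same split could be sketched in code. The snippet below is an assumption about how it might look rather than what I actually ran: toFields and fieldPattern are hypothetical names, and the field names mirror the ones used in json-to-sql.js later in this article.

```javascript
// Hypothetical pattern: question text, then options A) to D), then an optional pipe.
const fieldPattern =
  /^\d{1,3}\.\s*(.*?)\s*A\)\s*(.*?)\s*B\)\s*(.*?)\s*C\)\s*(.*?)\s*D\)\s*(.*?)\s*\|?\s*$/;

// Hypothetical helper: turn one matched question string into named fields,
// or return null when the string does not follow the expected layout.
function toFields(rawQuestion) {
  const match = rawQuestion.match(fieldPattern);
  if (!match) return null;
  const [, question, option_a, option_b, option_c, option_d] = match;
  return { question, option_a, option_b, option_c, option_d };
}

const sampleQuestion =
  "11. Computer is free from tiresome and boardroom. We call it A) Accuracy B) Reliability C) Diligence D) Versatility |";

const fields = toFields(sampleQuestion);
console.log(fields.question); // Computer is free from tiresome and boardroom. We call it
console.log(fields.option_d); // Versatility
```

Questions whose text spans several lines, or whose body contains something like "A)", would need extra care, so a manual review pass remains worthwhile.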

Epilogue

The third day came, and I did the last round of data integrity checks. I wrote the last function to take my JSON result to SQL. Note that I have left out the SQL table creation script.

json-to-sql.js

//import mysql
const mysql = require('mysql2');
const { questionBank } = require('./question-bank-json');
// create the connection to database
const database = mysql.createConnection({
    host: 'localhost',
    user: 'root',
    database: 'database',
    password: 'top-secret'
});

database.connect(err => {
    if (err) {
        console.error('error connecting: ' + err.stack);
        return;
    }
    console.log('connected as id ' + database.threadId);
});


//loop through the question bank
for (const elem of Object.keys(questionBank)) {
    const { question, option_a, option_b, option_c, option_d } = questionBank[elem];
    database
        .promise()
        .query("INSERT INTO <table_name> (`id`, `question`, `option_a`, `option_b`, `option_c`, `option_d`) VALUES (?, ?, ?, ?, ?, ?)", [elem, question, option_a, option_b, option_c, option_d])
        .then(() => {
            console.log(elem + " saved");
        })
        //report a failed insert instead of leaving the rejection unhandled
        .catch(err => console.error(elem + " failed: " + err.message));
}

I was glad I thought wild enough to give it a try. More importantly, I was glad it worked. If you learnt a thing from the article, consider sharing it on Twitter or following me. You might also want to follow me on LinkedIn for updates on related posts and adventures.
