Extract text from PDF with AWS Textract + NodeJS

Mar 12, 2021

Intro

A month ago I needed to build a backend that accepts a PDF, extracts all of its text, and searches it for keywords.

The first new concept for me was OCR (Optical Character Recognition). There is not much documentation on the web about OCR for NodeJS. The only open-source option I found was Tesseract (whose development was sponsored by Google for years). It is a powerful OCR engine, but in a NodeJS environment it was too slow to process a 50-page PDF.

After searching the web I found Textract, an AWS product that extracts all the text from long PDFs. It is easy to set up, but the documentation about it is poor.

Requirements

A working NodeJS project
An AWS account (the free tier is enough)

Set up AWS

IAM

If you have never created AWS permissions with the IAM console, follow this section.

Go to aws.amazon.com
In the search bar, type IAM
Go to Groups in the sidebar
Click Create group at the top of the page
Set a descriptive name for the group
Now set the policies for the group: search for and select AmazonS3FullAccess and AmazonTextractFullAccess
Create the group
Now go to Users and create a new user
Set a descriptive name for the user and select the access type; choose Programmatic access, because the user will be used through the SDK
On the next page, add the user to the group you created previously
Tags are not necessary
Download the .csv with the credentials and finish.

Install AWS dependencies

You will need some dependencies to make this work. Install them:

npm i @aws-sdk/client-textract
npm i @aws-sdk/client-s3
npm i @aws-sdk/node-http-handler

Set the AWS credentials in project

AWS SDK v3 simplifies the way credentials are set up in the project.

You only need to set the credentials in the .env file of your project, something like this:

AWS_ACCESS_KEY_ID="Your access key ID from the csv"
AWS_SECRET_ACCESS_KEY="Your secret access key from the csv"

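The SDK automatically reads these two environment variables and uses them globally. Note that it reads them from the process environment, not from the .env file itself, so something has to load the file first (for example, the dotenv package).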
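As a minimal sketch of that setup, the snippet below loads the .env file and fails early if a credential is missing. It assumes the dotenv package is installed (npm i dotenv), which is not in the dependency list above; missingAwsVars and loadAwsEnv are hypothetical helper names used only for illustration.

```javascript
// Variables the AWS SDK v3 default credential chain reads from the environment.
const REQUIRED_AWS_VARS = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"];

// Returns the names of the required variables that are missing or empty.
function missingAwsVars(env = process.env) {
  return REQUIRED_AWS_VARS.filter((name) => !env[name]);
}

// Loads .env into process.env and throws if a credential is missing.
function loadAwsEnv() {
  // Assumption: the dotenv package is installed (npm i dotenv).
  require("dotenv").config();
  const missing = missingAwsVars();
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
}
```

Calling loadAwsEnv() once at the top of the entry file, before any AWS client is created, should be enough.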

Now, the code

Send file to S3

First of all, you need to know that multi-page documents like PDFs must be uploaded to S3 before Textract can analyze them, so…

It is an easy step: create a bucket in the AWS Console and set its name in the project.

Then just create a command with PutObjectCommand and send it with the S3 client.

After saving the file in the bucket, you can start with the magic.

Textract divides the job into two steps:
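The upload can be sketched roughly like this. The bucket name, file path and region are placeholders you pass in, buildUploadParams and uploadPdf are hypothetical helper names, and ContentType is an optional extra I added for illustration.

```javascript
const fs = require("fs/promises");
const path = require("path");

// Builds the input for PutObjectCommand. The S3 key is the file name.
function buildUploadParams(bucket, filePath, body) {
  return {
    Bucket: bucket,
    Key: path.basename(filePath),
    Body: body,
    ContentType: "application/pdf", // optional, added for illustration
  };
}

// Reads the PDF from disk, uploads it and returns the S3 key.
async function uploadPdf(bucket, filePath, region) {
  // Loaded here so buildUploadParams can be used without the SDK installed.
  const { S3Client, PutObjectCommand } = require("@aws-sdk/client-s3");
  const client = new S3Client({ region });
  const body = await fs.readFile(filePath);
  const params = buildUploadParams(bucket, filePath, body);
  await client.send(new PutObjectCommand(params));
  return params.Key;
}
```

For example, uploadPdf("my-bucket", "./files/contract.pdf", "us-east-1") would store the file under the key contract.pdf, which is the name Textract needs later.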

Start Document Analysis
Get Document Analysis

Send document to analyze

Set the params to send: you need to indicate the S3Object, with the Bucket and Name of the file, and FeatureTypes. The last one tells Textract what kind of information the document may contain. This property is mandatory, with values such as TABLES and FORMS.

StartDocumentAnalysis starts the process of analyzing the document.

The response to this action is a JobId, which you will use in the next step.
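A minimal sketch of this step, assuming the file is already in the bucket. buildStartParams and startAnalysis are hypothetical helper names, and the region is a placeholder you pass in.

```javascript
// Builds the input for StartDocumentAnalysisCommand.
function buildStartParams(bucket, name, featureTypes = ["TABLES", "FORMS"]) {
  return {
    DocumentLocation: { S3Object: { Bucket: bucket, Name: name } },
    FeatureTypes: featureTypes, // mandatory
  };
}

// Starts the asynchronous analysis and returns the JobId.
async function startAnalysis(bucket, name, region) {
  // Loaded here so buildStartParams can be used without the SDK installed.
  const {
    TextractClient,
    StartDocumentAnalysisCommand,
  } = require("@aws-sdk/client-textract");
  const client = new TextractClient({ region });
  const command = new StartDocumentAnalysisCommand(buildStartParams(bucket, name));
  const { JobId } = await client.send(command);
  return JobId;
}
```

The call returns as soon as the job is queued, not when the analysis is done, so the JobId is all you get back at this point.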

Get results of analysis

Now we can request the job results.

In the params you can send JobId and NextToken.

The first one is mandatory and identifies the job.

The second is optional. If the file is long, every response will include a NextToken key until the last one, which comes back without a token.
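Requesting one page of results might look like the sketch below. The helper names are hypothetical. Because the job runs asynchronously, a response can come back with JobStatus set to IN_PROGRESS, in which case you wait and ask again; isJobFinished is a small helper for that check.

```javascript
// Builds the input for GetDocumentAnalysisCommand.
// NextToken is only included when there is one.
function buildGetParams(jobId, nextToken) {
  const params = { JobId: jobId };
  if (nextToken) params.NextToken = nextToken;
  return params;
}

// The job is still running while its status is IN_PROGRESS.
function isJobFinished(response) {
  return response.JobStatus !== "IN_PROGRESS";
}

// Fetches one page of results for the given job.
async function getAnalysisPage(jobId, nextToken, region) {
  // Loaded here so the helpers above can be used without the SDK installed.
  const {
    TextractClient,
    GetDocumentAnalysisCommand,
  } = require("@aws-sdk/client-textract");
  const client = new TextractClient({ region });
  const command = new GetDocumentAnalysisCommand(buildGetParams(jobId, nextToken));
  return client.send(command);
}
```

A finished job can also report FAILED, so it is worth checking that JobStatus is SUCCEEDED before reading the Blocks.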

How to request all data

Textract returns the data as JSON, split across several responses. A single response can be around 30,000 lines long, and for large documents one response is not enough to hold everything.

I wrote a while loop to get all the data.
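A loop of that kind could be sketched as follows. This is not the original code from this post, just one way to do it: getAllBlocks takes a fetchPage function (a hypothetical name) that returns one GetDocumentAnalysis response for a given token, and it keeps asking until no NextToken comes back. blocksToText is an extra helper that keeps only the LINE blocks, which carry the recognized text.

```javascript
// Collects the Blocks of every page. fetchPage(nextToken) must resolve to
// one GetDocumentAnalysis response ({ Blocks, NextToken, ... }).
async function getAllBlocks(fetchPage) {
  const blocks = [];
  let nextToken;
  do {
    const page = await fetchPage(nextToken);
    blocks.push(...(page.Blocks || []));
    nextToken = page.NextToken;
  } while (nextToken);
  return blocks;
}

// Joins the text of the LINE blocks, one line of the document per row.
function blocksToText(blocks) {
  return blocks
    .filter((block) => block.BlockType === "LINE")
    .map((block) => block.Text)
    .join("\n");
}

// Example wiring (assumes @aws-sdk/client-textract is installed and that
// the job has already finished):
// const { TextractClient, GetDocumentAnalysisCommand } = require("@aws-sdk/client-textract");
// const client = new TextractClient({ region: "us-east-1" });
// const blocks = await getAllBlocks((token) =>
//   client.send(new GetDocumentAnalysisCommand({ JobId: jobId, NextToken: token }))
// );
// const text = blocksToText(blocks);
```

With the full text in one string, the keyword search from the intro becomes a plain string search, for example text.includes("invoice").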

Conclusion

We have a powerful tool to extract text without much effort, and we can take advantage of it to build beautiful things. AWS Textract is built on computer vision and AI technologies, so its recognition should keep improving over time. I used it to recognize Spanish text and it works great. Go ahead and try it!

Thanks for reading!

If you have questions or see anything wrong, let me know! This is my first blog post.

Me on GitHub.