Extract text from PDF with AWS Textract + NodeJS

Nicolas Kobelt
3 min read · Mar 12, 2021

Intro

A month ago I needed to build a backend that accepts a PDF, extracts all of its text, and searches it for keywords.

The first new concept for me was OCR (Optical Character Recognition). There is not much documentation on the web about OCR for NodeJS. The only open-source option I found was Tesseract (long sponsored by Google). It is a powerful OCR engine, but in a NodeJS environment it was too slow to process a single 50-page PDF.

After searching the web I found Textract, an AWS product that extracts all the text from long PDFs. It is easy to set up, but poorly documented.

Requirements

  • A working NodeJS project
  • An AWS account (the free tier is enough)

Set up AWS

IAM

If you have never created AWS permissions with the IAM console, follow this section.

  • Go to aws.amazon.com
  • Type IAM in the search bar
  • Go to Groups in the sidebar
  • Click Create group at the top of the page
  • Give the group a descriptive name
  • Set the policies for the group: search for and select AmazonS3FullAccess and AmazonTextractFullAccess
  • Create the group
  • Go to Users and create a new user
  • Give the user a descriptive name and select Programmatic access as the access type, because the user will be used through the SDK
  • On the next page, add the user to the group you just created
  • Tags are not necessary
  • Download the .csv with the credentials and finish

Install AWS dependencies

You will need a few dependencies to make this work. Install them:

npm i @aws-sdk/client-textract
npm i @aws-sdk/client-s3
npm i @aws-sdk/node-http-handler

Set the AWS credentials in project

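Version 3 of the AWS SDK simplifies how credentials are set up in the project.

You only need to put the credentials in the .env file of your project, something like this:

AWS_ACCESS_KEY_ID=your-access-key-id-from-the-csv
AWS_SECRET_ACCESS_KEY=your-secret-access-key-from-the-csv

The SDK picks these up automatically from the environment. It does not read the .env file by itself, so the file has to be loaded into process.env (for example with the dotenv package) before a client is created.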
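A small sketch of a startup check for those two variables. The helper names are mine, not part of the SDK, and I am assuming the .env file has already been loaded into process.env (for example with require("dotenv").config(), a package this post does not install).

```javascript
// Names the AWS SDK looks up in the environment for credentials.
const REQUIRED_AWS_VARIABLES = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"];

// Returns the names of the required variables that are missing or empty.
function missingAwsVariables(env = process.env) {
  return REQUIRED_AWS_VARIABLES.filter((name) => !env[name]);
}

// Throws a readable error instead of failing later inside an SDK call.
function assertAwsEnvironment(env = process.env) {
  const missing = missingAwsVariables(env);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
}
```

Calling assertAwsEnvironment() once, before any client is created, makes a forgotten or misspelled key fail fast with a clear message.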

Now, the code

Send file to S3

First of all, you need to know that large files like multi-page PDFs must be uploaded to S3 before Textract can process them, so…

It is an easy step: create a bucket with the AWS Console and set its name in the project.

Then create a command with PutObjectCommand and send it with the S3 client.
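A minimal sketch of that upload, assuming the @aws-sdk/client-s3 package installed earlier. The bucket name, the region default and the file path are placeholders, and the helper names (buildPutObjectInput, uploadPdf) are mine, not part of the SDK.

```javascript
const fs = require("fs/promises");
const path = require("path");

// Builds the input for PutObjectCommand. The object key is the file name.
function buildPutObjectInput(bucket, filePath, body) {
  return {
    Bucket: bucket,
    Key: path.basename(filePath),
    Body: body,
    ContentType: "application/pdf",
  };
}

// Reads a local PDF and uploads it to the bucket.
// The region default is an assumption; use the region of your bucket.
async function uploadPdf(bucket, filePath, region = "us-east-1") {
  // Required here so the helper above works without the SDK being loaded.
  const { S3Client, PutObjectCommand } = require("@aws-sdk/client-s3");
  const s3 = new S3Client({ region });
  const body = await fs.readFile(filePath);
  const input = buildPutObjectInput(bucket, filePath, body);
  await s3.send(new PutObjectCommand(input));
  return input.Key; // Textract needs this name in the next step
}
```

For example, await uploadPdf("my-bucket", "./contract.pdf") would store the file under the key contract.pdf and return that key.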

Once the file is saved in the bucket, you can start with the magic.

Textract divides the job into two steps.

  • Start Document Analysis
  • Get Document Analysis

Send document to analyze

Set the params to send: you need to indicate the DocumentLocation, an S3Object with the Bucket and the Name of the file, plus FeatureTypes. The last one tells Textract what kind of information the document may contain. It is mandatory, and the available values include TABLES and FORMS.

StartDocumentAnalysis starts the process of analyzing the document.

The response of this action is a JobId, which you will use in the next step.
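A sketch of this step, assuming the @aws-sdk/client-textract package installed earlier. The helper names and the region default are placeholders of mine, not part of the SDK.

```javascript
// Builds the input for StartDocumentAnalysisCommand.
function buildStartAnalysisInput(bucket, name, featureTypes = ["TABLES", "FORMS"]) {
  return {
    DocumentLocation: { S3Object: { Bucket: bucket, Name: name } },
    FeatureTypes: featureTypes,
  };
}

// Starts the asynchronous analysis and resolves with the JobId.
// The region default is an assumption; it should match the bucket's region.
async function startAnalysis(bucket, name, region = "us-east-1") {
  // Required here so the helper above works without the SDK being loaded.
  const { TextractClient, StartDocumentAnalysisCommand } = require("@aws-sdk/client-textract");
  const textract = new TextractClient({ region });
  const input = buildStartAnalysisInput(bucket, name);
  const { JobId } = await textract.send(new StartDocumentAnalysisCommand(input));
  return JobId;
}
```

The call returns as soon as the job is accepted, not when it is finished, so the JobId is all you get back at this point.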

Get results of analysis

Now we can request the job results.

In the params you can send JobId and NextToken.

The first one is mandatory and identifies the job.

The second one is optional. If the file is long, you will normally get a NextToken key in every response until the last one, which comes without it.
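A sketch of a single request, again assuming @aws-sdk/client-textract. The helper names and the region default are mine. Keep in mind that the job runs asynchronously: the response carries a JobStatus field, and the blocks are only available once it is SUCCEEDED.

```javascript
// Builds the input for GetDocumentAnalysisCommand.
// NextToken is only included when continuing from a previous response.
function buildGetAnalysisInput(jobId, nextToken) {
  const input = { JobId: jobId };
  if (nextToken) {
    input.NextToken = nextToken;
  }
  return input;
}

// Requests one page of results for the given job.
// The region default is an assumption; use the region where the job was started.
async function getAnalysisPage(jobId, nextToken, region = "us-east-1") {
  // Required here so the helper above works without the SDK being loaded.
  const { TextractClient, GetDocumentAnalysisCommand } = require("@aws-sdk/client-textract");
  const textract = new TextractClient({ region });
  const input = buildGetAnalysisInput(jobId, nextToken);
  return textract.send(new GetDocumentAnalysisCommand(input));
}
```

The resolved object contains JobStatus, the Blocks of that page and, when more pages remain, NextToken.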

How to request all data

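Textract returns the data as JSON, one page of results per response. Each response carries up to 1,000 blocks, which in my case was a JSON of about 30,000 lines. For large documents a single response is not enough.

I wrote a while loop that keeps requesting pages until all the data is collected.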
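A sketch of such a loop, not necessarily the exact code I used. To keep it independent from the SDK, it receives a fetchPage function (a name of mine) that takes a NextToken and resolves with one GetDocumentAnalysis response. The polling interval and the two small helpers for text and keywords are also assumptions for illustration.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Collects the blocks of every page by following NextToken.
// While the job is still IN_PROGRESS it waits pollMs and asks again.
async function collectAllBlocks(fetchPage, pollMs = 5000) {
  const blocks = [];
  let nextToken; // undefined on the first request
  while (true) {
    const page = await fetchPage(nextToken);
    if (page.JobStatus === "IN_PROGRESS") {
      await sleep(pollMs);
      continue;
    }
    if (page.JobStatus === "FAILED") {
      throw new Error(`Textract job failed: ${page.StatusMessage ?? "no message"}`);
    }
    blocks.push(...(page.Blocks ?? []));
    if (!page.NextToken) {
      break; // last page reached
    }
    nextToken = page.NextToken;
  }
  return blocks;
}

// Joins the text of the LINE blocks, one detected line per output line.
function blocksToText(blocks) {
  return blocks
    .filter((block) => block.BlockType === "LINE")
    .map((block) => block.Text)
    .join("\n");
}

// Returns the keywords that appear in the text, ignoring case.
function findKeywords(text, keywords) {
  const haystack = text.toLowerCase();
  return keywords.filter((keyword) => haystack.includes(keyword.toLowerCase()));
}
```

With the real client, fetchPage could be something like (token) => textract.send(new GetDocumentAnalysisCommand({ JobId, NextToken: token })), where textract is a TextractClient and JobId comes from the previous step.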

Conclusion

We have a powerful tool to extract text without much effort, and we can take advantage of it to build beautiful things. AWS Textract is built on computer vision and AI technologies, so its recognition keeps improving. I used it to recognize Spanish text and it works great. Go ahead and try it!

Thanks for reading!

If you have questions or see anything wrong, let me know! This is my first blog post.

Me on GitHub.
