Extract text from PDF with AWS Textract + NodeJS
Intro
A month ago I needed to build a backend that accepts a PDF, extracts all of its text and searches it for keywords.
The first new concept for me was OCR (Optical Character Recognition). There is not much documentation on the web about OCR for NodeJS. The only open-source option I found was Tesseract (originally developed at HP and later sponsored by Google). It is a very powerful OCR engine, but in a NodeJS environment it was too slow to process a single 50-page PDF.
After searching the web I found Textract, an AWS product that extracts all the text from long PDFs. It is easy to set up, but its documentation is poor.
Requirements
- A working NodeJS project
- An AWS account (the free tier is enough)
Set up AWS
IAM
If you have never created AWS permissions with the IAM console, check out this section.
- Go to aws.amazon.com
- In the search bar type IAM
- Go to groups on the sidebar
- Click on create group at the top of the page
- Set a descriptive name for the group
- Now set the policies for the group: search for and select AmazonS3FullAccess and AmazonTextractFullAccess
- Create the group
- Now go to Users and create a new user
- Set a descriptive name for the user and select the Access Type, which will be Programmatic access because the user will be used with the SDK
- On the next page select the group created previously
- Tags are not necessary
- Download the .csv with the credentials and finish.
Install AWS dependencies
You will need some dependencies to make this work. Install them:
npm i @aws-sdk/client-textract
npm i @aws-sdk/client-s3
npm i @aws-sdk/node-http-handler
Set the AWS credentials in project
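Version 3 of the AWS SDK simplifies the way credentials are set up in the project.
You only need to set the credentials in the .env file of your project, something like this:
AWS_ACCESS_KEY_ID="Your access key ID from the csv"
AWS_SECRET_ACCESS_KEY="Your secret access key from the csv"
The SDK automatically reads these two environment variables and applies them to every client. It does not parse the .env file itself, so if your framework does not already load it, load it into the process environment first, for example with the dotenv package.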
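As a small sanity check before creating any client, the sketch below verifies that the variables the SDK looks for are present in the process environment. The helper name missingAwsEnv is hypothetical, and AWS_REGION is included on the assumption that the region is also configured through the environment rather than in the client constructor.

```javascript
// Names the AWS SDK v3 looks up in the process environment.
// AWS_REGION is listed on the assumption that the region is not passed
// to the client constructor instead.
const REQUIRED_ENV = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"];

// Hypothetical helper: returns the names that are missing or blank.
function missingAwsEnv(env = process.env) {
  return REQUIRED_ENV.filter((name) => !env[name] || env[name].trim() === "");
}

const missingSettings = missingAwsEnv();
if (missingSettings.length > 0) {
  console.warn(`Missing AWS settings: ${missingSettings.join(", ")}`);
}
```

Running this once at startup makes a forgotten or unloaded .env file show up as a clear warning instead of a credentials error on the first request.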
Now, the code
Send file to S3
First of all, you need to know that large files such as PDFs must be uploaded to S3 before Textract can use them, so…
This is an easy step: create a bucket with the AWS Console and set its name in the project.
Then just create a command with PutObjectCommand and send it with the S3 client.
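The upload step could be sketched roughly like this. It is not the original snippet from this post: the function names buildPutObjectParams and uploadPdf, the bucket name and the file path are all placeholders, and the client is assumed to pick up its region and credentials from the environment.

```javascript
// Builds the input for PutObjectCommand (hypothetical helper).
function buildPutObjectParams(bucket, key, body) {
  return {
    Bucket: bucket,
    Key: key, // object name, reused later as the Textract document Name
    Body: body,
    ContentType: "application/pdf",
  };
}

// Reads a local PDF and uploads it to the bucket under the given key.
async function uploadPdf(bucket, key, filePath) {
  // Required here so the helper above can be used without the SDK installed.
  const { S3Client, PutObjectCommand } = require("@aws-sdk/client-s3");
  const fs = require("fs/promises");

  const s3 = new S3Client({}); // region and credentials come from the environment
  const body = await fs.readFile(filePath);
  await s3.send(new PutObjectCommand(buildPutObjectParams(bucket, key, body)));
  return key;
}
```

A call such as uploadPdf("my-bucket", "report.pdf", "./report.pdf") would then leave the file in S3 under the key that the analysis step refers to.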
After saving the file in the bucket you can start with the magic.
Textract divides the job into two steps.
- Start Document Analysis
- Get Document Analysis
Send document to analyze
Set the params to send: you need to indicate the S3Object with the Bucket and Name of the file, plus FeatureTypes. The last one tells Textract what kind of information the document may contain. Sending this property is mandatory, and its values include TABLES and FORMS.
StartDocumentAnalysis starts the process of analyzing the document.
The response of this action is a JobId, a string that you will use in the next step.
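A minimal sketch of this step, assuming the file is already in S3. The helper names buildStartParams and startAnalysis are hypothetical, and the default FeatureTypes are simply the two values mentioned above.

```javascript
// Builds the input for StartDocumentAnalysisCommand (hypothetical helper).
// FeatureTypes is required, so an empty list is rejected early.
function buildStartParams(bucket, name, featureTypes = ["TABLES", "FORMS"]) {
  if (featureTypes.length === 0) {
    throw new Error("FeatureTypes must contain at least one value");
  }
  return {
    DocumentLocation: { S3Object: { Bucket: bucket, Name: name } },
    FeatureTypes: featureTypes,
  };
}

// Starts the asynchronous analysis job and resolves to its JobId.
async function startAnalysis(bucket, name) {
  // Required here so the helper above can be used without the SDK installed.
  const { TextractClient, StartDocumentAnalysisCommand } = require("@aws-sdk/client-textract");

  const textract = new TextractClient({}); // region and credentials from the environment
  const { JobId } = await textract.send(
    new StartDocumentAnalysisCommand(buildStartParams(bucket, name))
  );
  return JobId;
}
```

The call returns as soon as the job is accepted. The analysis itself runs in the background, which is why the results are requested in a separate step.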
Get results of analysis
Now we can request the job results.
In params you can send JobId and NextToken.
The first one is mandatory and identifies the job.
The second is optional. Normally, if the file is long, you will get a NextToken key in every response until the last one, where it is no longer set.
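Requesting a single page of results might look like the sketch below. The helper names buildGetParams and getAnalysisPage are hypothetical. NextToken is left out of the request when there is none, which is the case for the first call.

```javascript
// Builds the input for GetDocumentAnalysisCommand (hypothetical helper).
// NextToken is only included when a previous response provided one.
function buildGetParams(jobId, nextToken) {
  const params = { JobId: jobId };
  if (nextToken) {
    params.NextToken = nextToken;
  }
  return params;
}

// Fetches one page of results for the job.
async function getAnalysisPage(jobId, nextToken) {
  // Required here so the helper above can be used without the SDK installed.
  const { TextractClient, GetDocumentAnalysisCommand } = require("@aws-sdk/client-textract");

  const textract = new TextractClient({}); // region and credentials from the environment
  // The response carries JobStatus, Blocks and, when more pages exist, NextToken.
  return textract.send(new GetDocumentAnalysisCommand(buildGetParams(jobId, nextToken)));
}
```

Because the job runs asynchronously, the first calls may come back with a JobStatus of IN_PROGRESS. The results are complete once the status is SUCCEEDED.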
How to request all data
Textract returns the data as JSON, and a single response can hold around 30,000 lines. For large documents this is not enough.
I made a loop with a while statement to get all the data.
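One way to write such a loop is sketched below, here as a do...while rather than the exact loop from this post. The names collectAllBlocks, extractLines and fetchPage are hypothetical: fetchPage(nextToken) is assumed to resolve to a response shaped like the one from GetDocumentAnalysis, with Blocks and an optional NextToken.

```javascript
// Collects the Blocks of every page by following NextToken until it is unset.
async function collectAllBlocks(fetchPage) {
  const blocks = [];
  let nextToken; // undefined on the first request
  do {
    const page = await fetchPage(nextToken);
    blocks.push(...(page.Blocks ?? []));
    nextToken = page.NextToken;
  } while (nextToken);
  return blocks;
}

// Keeps only the text of LINE blocks, in the order Textract returned them.
function extractLines(blocks) {
  return blocks
    .filter((block) => block.BlockType === "LINE")
    .map((block) => block.Text);
}
```

In a real project fetchPage would wrap a GetDocumentAnalysisCommand call that sends the JobId together with the given NextToken. The lines returned by extractLines can then be joined into one string for the keyword search.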
Conclusion
We have a powerful tool to extract text without much effort, and we can take advantage of it to make beautiful things. AWS Textract is built with Computer Vision and AI technologies, so its recognition keeps improving. I used this technology to recognize Spanish texts and it works great. Go ahead and try it!
Thanks for reading!
If you have doubts or you see anything wrong, let me know! This is my first blog post.
Me on GitHub.