A month ago I needed to build a backend that accepts a PDF, extracts all of its text and searches it for keywords.
The first new concept for me was OCR (Optical Character Recognition). There is not much documentation on the web about OCR for NodeJS. The main open-source option is Tesseract (developed for years with Google's sponsorship); it is a powerful OCR engine, but in a NodeJS environment it was very slow at processing a 50-page PDF.
After searching the web I found Textract, an AWS product that extracts all the text from long PDFs. It is easy to set up, but the documentation is poor.
Requirements
- A working NodeJS project
- An AWS account (the free tier is enough)
Set up AWS
If you have never created AWS permissions with the IAM console, check out this section.
- Go to aws.amazon.com
- In the search bar, type IAM
- Go to Groups in the sidebar
- Click on Create group at the top of the page
- Set a descriptive name for the group
- Now set the policies for the group: search for and select the ones that grant access to Textract and S3
- Create the group
- Now go to Users and create a new user
- Set a descriptive name for the user and select the Access Type; this will be Programmatic access, because you will use it with the SDK
- On the next page, select the group created previously
- Tags are not necessary
- Download the .csv with the credentials and finish.
Install AWS dependencies
You will need some dependencies to make this work. Install them:
npm i @aws-sdk/client-textract
npm i @aws-sdk/client-s3
npm i @aws-sdk/node-http-handler
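Set the AWS credentials in the project
The AWS SDK v3 simplifies the way you set up credentials in the project.
You only need to put the credentials in the .env file of your project, something like this:
AWS_ACCESS_KEY_ID=your access key ID from the csv
AWS_SECRET_ACCESS_KEY=your secret access key from the csv
The SDK picks these variables up automatically and configures itself globally. Note that it reads them from the environment, not from the .env file itself, so the file has to be loaded into the environment first (for example with a package such as dotenv).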
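A missing variable usually shows up later as a confusing credentials error, so it can help to check for both at startup. The following is only a minimal sketch: `missingAwsVars` and `assertAwsCredentials` are hypothetical helper names, not part of the SDK, and it assumes the variables are already in `process.env`.

```javascript
// The two variables the SDK looks for when reading credentials from the environment.
const REQUIRED_VARS = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"];

// Returns the names of the required variables that are missing or empty in `env`.
function missingAwsVars(env = process.env) {
  return REQUIRED_VARS.filter((name) => !env[name]);
}

// Hypothetical startup check: throws a readable error if something is missing.
function assertAwsCredentials(env = process.env) {
  const missing = missingAwsVars(env);
  if (missing.length > 0) {
    throw new Error(`Missing AWS credentials: ${missing.join(", ")}`);
  }
}
```

Calling `assertAwsCredentials()` once, right after the .env file has been loaded, is enough.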
Now, the code
Send file to S3
First of all, you need to know that large files like PDFs must be uploaded to S3 before Textract can use them, so…
It is an easy step: create a bucket with the AWS Console and set its name in the project.
Then just create a PutObjectCommand and send it with the S3 client.
After saving the file in the bucket, you can start with the magic.
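The upload step can be sketched roughly like this. It is an illustration, not the exact code of the project: `buildPutObjectParams` and `uploadPdf` are hypothetical names, and the bucket, key and region are placeholders you have to supply.

```javascript
// Builds the input for PutObjectCommand. `bucket` and `key` are placeholders
// for your own bucket name and the object name you want in S3.
function buildPutObjectParams(bucket, key, body) {
  return {
    Bucket: bucket,
    Key: key,
    Body: body, // a Buffer or stream with the PDF contents
    ContentType: "application/pdf",
  };
}

// Uploads the PDF and returns the key, which Textract needs later.
// The SDK is required inside the function so the helper above can be
// used (and tested) without the package installed.
async function uploadPdf(bucket, key, body, region) {
  const { S3Client, PutObjectCommand } = require("@aws-sdk/client-s3");
  const s3 = new S3Client({ region });
  await s3.send(new PutObjectCommand(buildPutObjectParams(bucket, key, body)));
  return key;
}
```

For example, `await uploadPdf("my-bucket", "contract.pdf", fileBuffer, "us-east-1")`, where every argument is a made-up value.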
Textract divides the job into two steps:
- Start Document Analysis
- Get Document Analysis
Send document to analyze
Set the params to send: you need to indicate the S3Object with the Name of the file, and FeatureTypes. The last one tells Textract what kind of information the document may contain; sending this property is mandatory.
StartDocumentAnalysis starts the process of analyzing the document.
The response of this action is a JobId, which you will use in the next step.
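As a rough sketch of this step, the params and the call could look like the following. `buildStartParams` and `startAnalysis` are hypothetical names; `TABLES` and `FORMS` are feature types Textract accepts, but choosing both as the default here is my assumption.

```javascript
// Builds the input for StartDocumentAnalysisCommand.
// `bucket` and `name` point at the PDF already stored in S3.
function buildStartParams(bucket, name, featureTypes = ["TABLES", "FORMS"]) {
  return {
    DocumentLocation: { S3Object: { Bucket: bucket, Name: name } },
    FeatureTypes: featureTypes, // mandatory for document analysis
  };
}

// Starts the asynchronous job and returns its JobId.
// The SDK is required inside the function so the helper above can be
// used (and tested) without the package installed.
async function startAnalysis(bucket, name, region) {
  const { TextractClient, StartDocumentAnalysisCommand } = require("@aws-sdk/client-textract");
  const textract = new TextractClient({ region });
  const response = await textract.send(
    new StartDocumentAnalysisCommand(buildStartParams(bucket, name))
  );
  return response.JobId;
}
```

The job runs in the background, so the returned `JobId` is all you get at this point.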
Get results of analysis
Now we can request the job results.
There are two params you can send: JobId and NextToken.
The first one is mandatory and identifies the job.
The second is optional: if the file is long, every response includes a NextToken key until the last one, which does not.
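A single request for results could be sketched as below. `buildGetParams` and `getAnalysisPage` are hypothetical names, and the region is a placeholder.

```javascript
// Builds the input for GetDocumentAnalysisCommand.
// NextToken is only added when we actually have one from a previous response.
function buildGetParams(jobId, nextToken) {
  const params = { JobId: jobId };
  if (nextToken) {
    params.NextToken = nextToken;
  }
  return params;
}

// Fetches one page of results for the job.
// The SDK is required inside the function so the helper above can be
// used (and tested) without the package installed.
async function getAnalysisPage(jobId, nextToken, region) {
  const { TextractClient, GetDocumentAnalysisCommand } = require("@aws-sdk/client-textract");
  const textract = new TextractClient({ region });
  return textract.send(new GetDocumentAnalysisCommand(buildGetParams(jobId, nextToken)));
}
```

Each response carries a `JobStatus`, the `Blocks` found so far and, when there is more to read, a `NextToken`. While the job is still running the status is `IN_PROGRESS`, so you may need to wait and ask again.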
How to request all data
Textract returns the data as JSON, around 30,000 lines per response. For large documents a single response is not enough to hold everything.
So I wrote a while loop that keeps requesting until all the data has arrived.
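The loop can be sketched like this. It is a simplified version, not the original code: `collectAllBlocks` is a hypothetical name, and `getPage` is a function you pass in that takes the current token and resolves to one GetDocumentAnalysis response. The sketch also assumes the job has already finished.

```javascript
// Keeps requesting pages until a response comes back without NextToken,
// and returns every block from every page in order.
async function collectAllBlocks(getPage) {
  const blocks = [];
  let nextToken; // undefined on the first request
  let hasMore = true;

  while (hasMore) {
    const page = await getPage(nextToken);
    for (const block of page.Blocks ?? []) {
      blocks.push(block);
    }
    nextToken = page.NextToken;
    hasMore = Boolean(nextToken); // the last page has no NextToken
  }

  return blocks;
}
```

Passing the page fetcher in as a function keeps the loop independent from the SDK, so you can try it with fake pages before spending real Textract calls.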
We have a powerful tool to extract text without much effort, and we can take advantage of it to make beautiful things. AWS Textract is built on computer vision and AI technologies, so its recognition keeps improving. I used it to recognize Spanish texts and it works great. Go ahead and try it!
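To come back to the original goal of searching keywords, here is a small sketch of what can be done with the blocks once you have them. `findKeywords` is a hypothetical helper. It relies on the `BlockType`, `Text` and `Page` fields that Textract blocks carry, and only looks at `LINE` blocks so that the same text is not also counted word by word.

```javascript
// Returns one hit per (line, keyword) match, ignoring upper/lower case.
function findKeywords(blocks, keywords) {
  const wanted = keywords.map((keyword) => keyword.toLowerCase());
  const hits = [];

  for (const block of blocks) {
    // Skip PAGE, WORD and other block types, and blocks without text.
    if (block.BlockType !== "LINE" || !block.Text) continue;

    const text = block.Text.toLowerCase();
    for (const keyword of wanted) {
      if (text.includes(keyword)) {
        hits.push({ keyword, page: block.Page, text: block.Text });
      }
    }
  }

  return hits;
}
```

For example, `findKeywords(blocks, ["factura", "total"])` would list every line containing either word along with its page number.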
Thanks for reading!
If you have questions or see anything wrong, let me know! This is my first blog post.
Me on GitHub.