I want to extract all the text boxes and their coordinates from a PDF file. I would also like to extract text from a portion of a PDF page, selected by coordinates. Can anyone help me out?
Given a PDF file, the output should look something like:
489, 41, "Signature"
500, 52, "b"
630, 202, "a_g_i_r"
Customer #1
-----------------------------------------------
Hi,
I was wondering if anyone could recommend a program which can extract the starting (top left) coordinates (x,y) of each word in a PDF file (and the end if possible). Ideally output would be in a format that could be easily inserted into a database.
Customer #2
-----------------------------------------------
Customers sometimes need to extract text and its position from PDF pages. The positions are then used to locate specific values, for example to read an invoice number from a PDF file or to look up other information.
PDF Extractor SDK (PDF Parser SDK and Command Line) extracts many kinds of information from PDF files, including text content and text coordinates.
1. Download the trial version of PDF Extractor SDK (PDF Parser SDK and Command Line) from this web page:
https://veryutils.com/pdf-extractor-sdk-pdf-parser-sdk-and-command-line
2. After downloading it, unzip it to a folder.
3. Open a CMD window. If you don't know how to open one, see the following web page:
https://veryutils.com/blog/top-10-methods-to-run-a-command-line-window-in-windows-10/
4. pdfextract.exe is a command-line application. It supports the following command-line options:
D:\VeryPDF_PDFExtractTool>pdfextract.exe
pdfextract.exe version 3.0
Copyright 1996-2017 VeryPDF.com Inc.
Product Name: VeryPDF PDF Extract Tool Command Line
http://www.verypdf.com
http://www.verydoc.com
http://support.verypdf.com
Email: support@verypdf.com
Usage: pdfextract.exe [options] <PDF-file>
-f <int> : first page to print
-l <int> : last page to print
-opw <string> : owner password (for encrypted files)
-upw <string> : user password (for encrypted files)
-outfolder <string>: Set a folder to store extracted files
-layout : maintain original physical layout
-textfile : Extract text contents from PDF file
-textpos : Extract text and coordinates from PDF file
-nopgbrk : don't insert page breaks between pages
-h : print usage information
-help : print usage information
--help : print usage information
-? : print usage information
-$ <string> : input your license key
Example:
pdfextract.exe D:\in.pdf
pdfextract.exe -outfolder D:\out\ D:\in.pdf
pdfextract.exe -outfolder D:\out\ D:\in.pdf
pdfextract.exe -opw 123 -upw 456 -outfolder D:\out\ D:\in.pdf
pdfextract.exe -outfolder D:\out\ D:\in.pdf > out.log
pdfextract.exe -outfolder D:\out\ D:\in.pdf out.log
pdfextract.exe D:\in.pdf out.log
pdfextract.exe -textpos D:\in.pdf D:\out.txt
pdfextract.exe -textpos -nopgbrk D:\in.pdf D:\out.txt
pdfextract.exe -textfile D:\in.pdf D:\out.txt
pdfextract.exe -layout -textfile D:\in.pdf D:\out.txt
5. You can simply run the following command to extract all information from your PDF file:
pdfextract.exe -outfolder D:\VeryUtils\test\ D:\downloads\Test_in.pdf
6. You will find a "TextFileWithPosition.txt" file in the "D:\VeryUtils\test" folder. This text file contains the text content and coordinates of each word.
7. "PageContents.xml" is an XML file that contains the coordinates of each character.
8. Now you can write a simple PHP or Python application to read and parse the X/Y positions from these output files, and then process your PDF files accordingly.
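As a rough illustration of step 8, here is a minimal Python sketch. The exact layout of "TextFileWithPosition.txt" is not shown in this article, so the sketch assumes one word per line in the form x, y, "text" (the format requested in the customer question above). The record format, the file encoding, the coordinate origin, and the helper names build_command, parse_positions, words_in_region and extract_region are all assumptions made for illustration; adjust them after inspecting a real output file. Only the -outfolder option and the file name come from the steps above.

```python
import re
import subprocess
from pathlib import Path
from typing import Iterable, List, NamedTuple


class Word(NamedTuple):
    x: float
    y: float
    text: str


# ASSUMPTION: each record looks like  489, 41, "Signature"
# The real TextFileWithPosition.txt layout may differ; adapt this pattern.
LINE_RE = re.compile(r'^\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*,\s*"(.*)"\s*$')


def build_command(pdf_path: str, out_folder: str, exe: str = "pdfextract.exe") -> List[str]:
    """Build the argument list for the command shown in step 5."""
    return [exe, "-outfolder", out_folder, pdf_path]


def parse_positions(lines: Iterable[str]) -> List[Word]:
    """Parse lines in the assumed format; lines that do not match are skipped."""
    words = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            words.append(Word(float(match.group(1)), float(match.group(2)), match.group(3)))
    return words


def words_in_region(words: Iterable[Word], left: float, top: float,
                    right: float, bottom: float) -> List[Word]:
    """Keep words whose (x, y) point lies inside the rectangle, edges included.

    ASSUMPTION: x grows to the right and y grows downward from the top-left
    corner; check which origin the tool actually uses.
    """
    return [w for w in words if left <= w.x <= right and top <= w.y <= bottom]


def extract_region(pdf_path: str, out_folder: str, left: float, top: float,
                   right: float, bottom: float) -> List[Word]:
    """Run pdfextract.exe, then read the words found in one rectangle."""
    subprocess.run(build_command(pdf_path, out_folder), check=True)
    result_file = Path(out_folder) / "TextFileWithPosition.txt"
    # ASSUMPTION: the file is UTF-8 text; undecodable bytes are replaced.
    with open(result_file, encoding="utf-8", errors="replace") as handle:
        return words_in_region(parse_positions(handle), left, top, right, bottom)


# Demonstration on in-memory sample lines taken from the customer question.
sample = ['489, 41, "Signature"', '500, 52, "b"', '630, 202, "a_g_i_r"']
for word in words_in_region(parse_positions(sample), 400, 0, 600, 100):
    print(word.x, word.y, word.text)
```

On a machine where the tool is installed, a call such as extract_region("D:\\downloads\\Test_in.pdf", "D:\\VeryUtils\\test\\", 400, 0, 600, 100) would run the command from step 5 and return only the words inside that rectangle. The returned tuples can then be written to a database table with x, y and text columns.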
If you wish to extract more information from PDF files, such as hyperlinks, color spaces, attachments, bookmarks, pictures, embedded fonts, or forms, please feel free to contact us. We are glad to assist you.