The Text Extraction Command Line utility extracts text from many types of files. The extracted text can be combined into one file and/or split into several files. The resulting text files can easily be reused for indexing or any other purpose.

Supported formats for input files:
AZW, AZW3, CHM, DjVu, DOC, DOCX, EML, EPUB, FB2, FB3, HTML, LIT, MD, MHT, MOBI, ODP, ODS, ODT, PDB, PDF, PPT, PPTX, PRC, RTF, TCR, TXT, WPD, WRI, XLS, XLSX. The IFilter interface will be used for files with unknown extensions.

The utility runs from the command line without displaying any user interface, which makes it easy to integrate text processing into other applications.

Execution order of operations:
* Extract text from input file(s).
* Format text: remove spaces, linebreaks, etc. (if options are specified).
* Combine files into one file (if option is specified).
* Split text (if options are specified).
* Apply rules for pronunciation correction (if option is specified).
* Save output file(s).

The utility accepts command line parameters that control how text is extracted. The syntax is "all2text.exe [options ...]"; all parameters must be separated by spaces. Options can appear in any order as long as each one is paired with its related parameter. Run "all2text.exe -?" to get help on the command line syntax and parameters.
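When the utility is called from another program, building the argument list in code keeps each option next to its parameter. The following Python sketch is only an illustration: the function names, and the assumption that all2text.exe is on the PATH, are not part of the utility.

```python
import subprocess
from typing import List, Optional, Sequence


def build_command(
    input_masks: Sequence[str],
    output_folder: Optional[str] = None,
    encoding: Optional[str] = None,
    extra_options: Sequence[str] = (),
    exe: str = "all2text.exe",  # assumption: the utility is on PATH
) -> List[str]:
    """Return an argument list in which every option is followed by its parameter."""
    if not input_masks:
        raise ValueError("at least one input file or mask is required")
    args = [exe]
    for mask in input_masks:  # -f may be given several times
        args += ["-f", mask]
    if output_folder is not None:
        args += ["-v", output_folder]
    if encoding is not None:
        if encoding not in ("ansi", "utf8", "unicode"):
            raise ValueError(f"unsupported encoding: {encoding}")
        args += ["-e", encoding]
    args += list(extra_options)  # parameterless flags such as "-rm" or "-u"
    return args


def run(args: List[str]) -> int:
    """Run the utility (Windows only) and return its exit code."""
    return subprocess.run(args, check=False).returncode


command = build_command(
    [r"d:\Docs\*.doc", r"d:\Docs\*.rtf"], "d:\\Text\\", "utf8", ["-rm"]
)
print(command)
```

Passing a list to subprocess.run leaves the quoting of paths with spaces to Python, so the caller does not have to assemble a quoted command string by hand.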

Text Extraction Command Line Options:

Usage: all2text.exe [options ...]

-f [file_mask]
Sets the name of the input file or the mask for a group of input files. The command line may contain several [-f] options.

-v [folder_name]
Sets the output folder where text files are saved.

-p [text]
Sets the pattern for output file name (for example, "Text Document"). If absent, the input file name will be used as a pattern.
Use the %FileName% variable to insert the input file name to the output file name.
Use the %FirstLine% variable to insert the first line of text.
Use the %Header% variable to insert the chapter title.
Use the %Number% variable to change the position of the sequence number inside the output file name.
Warning! In a batch script, every percent sign (%) must be doubled. For example: -p %%FirstLine%%.
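If batch scripts are generated by another program, the doubling can be applied automatically. A minimal Python sketch (the helper name escape_for_batch is hypothetical):

```python
def escape_for_batch(pattern: str) -> str:
    """Double every percent sign so a .bat/.cmd script passes it through literally."""
    return pattern.replace("%", "%%")


pattern = "%Number% - %Header%"
# Line intended to be written into a batch file, not typed at the prompt.
line = (
    'all2text.exe -f "d:\\Book\\book.fb2" -v "d:\\Text\\" -c -p "'
    + escape_for_batch(pattern)
    + '"'
)
print(line)
```

The escaping is needed only inside batch files; a pattern typed directly at the command prompt, or passed as a program argument, keeps single percent signs.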

-out [file_name]
Sets the full name of the output file. Specifying this option is recommended only when the utility is used as part of other software.
For example, when the utility is used for custom document import, the external program runs it from the command line and passes the full name of the text file to create.

-i
Reads data from STDIN. The file format will be auto-detected from data. If the option is specified, the option [-f] is ignored.

-o
Writes text to STDOUT. If the option is specified, the options [-v] and [-p] are ignored.
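The [-i] and [-o] options allow the utility to be used as a filter from another program. The Python sketch below pipes document bytes through it; the function names are hypothetical and all2text.exe is assumed to be on the PATH. The output is returned as raw bytes because this page does not state which encoding is used for STDOUT.

```python
import subprocess
from typing import List, Sequence


def stdin_stdout_command(
    options: Sequence[str] = (), exe: str = "all2text.exe"
) -> List[str]:
    """Build a command that reads from STDIN (-i) and writes to STDOUT (-o)."""
    return [exe, "-i", "-o", *options]


def extract_text(data: bytes, options: Sequence[str] = ()) -> bytes:
    """Send raw document bytes to the utility and return the raw output bytes."""
    result = subprocess.run(
        stdin_stdout_command(options),
        input=data,              # written to the utility's STDIN
        stdout=subprocess.PIPE,  # captured from the utility's STDOUT
        check=True,              # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


print(stdin_stdout_command(["--remove-spaces", "--remove-linebreaks"]))
```

A caller would read a file in binary mode, pass its contents to extract_text, and decode the result with whatever encoding the output turns out to use.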

-u
Combines text files into one output file.

-b
Adds a sequence number before the output file name (when text is split).

-a
Adds a sequence number after the output file name (when text is split).

-n [integer]
Sets the starting sequence number for output files (when text is split). The default is 1.
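A program that collects the split files may want to predict their names. The Python sketch below reproduces the naming shown in this page's examples ("Document 20.txt", "Document 21.txt"). The layout used for [-b] (number, space, name) is an assumption, and the function name is hypothetical.

```python
from typing import List


def expected_names(
    pattern: str, count: int, start: int = 1, number_first: bool = False
) -> List[str]:
    """List the output file names expected for `count` parts."""
    names = []
    for number in range(start, start + count):
        if "%Number%" in pattern:
            # The pattern itself decides where the number goes.
            stem = pattern.replace("%Number%", str(number))
        elif number_first:
            stem = f"{number} {pattern}"  # assumed layout for -b
        else:
            stem = f"{pattern} {number}"  # layout shown in the examples for -a
        names.append(stem + ".txt")
    return names


print(expected_names("Document", 3, start=20))
# ['Document 20.txt', 'Document 21.txt', 'Document 22.txt']
```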

-e [encoding]
Sets the encoding for output files ("ansi", "utf8" or "unicode"). The default is "ansi".

-t [integer]
Splits text into parts of the given target size (in characters).

-k [keyword]
Splits text by a special keyword in the input file. The option is case-sensitive. The command line may contain several [-k] options.

-r [keyword]
Splits text by a keyword and removes it from the output files. The option is case-sensitive. The command line may contain several [-r] options.

-w
Splits text by two empty lines in succession.

-l
Splits text at lines in which all letters are capital.

-c
Splits text by a table of contents. The application extracts positions of chapter beginnings from the input file (if the file contains such information).

-toc
Generates a table of contents and splits text. The application splits the extracted text by keywords (like "chapter" or "volume").
If the option is used together with the option [-c], the application will try to extract a table of contents from the document; if it fails, a new table of contents will be generated.

-m [integer]
Sets the minimal size of text parts for splitting (as a number of characters).

-j [integer]
Ignores the chapter beginning if the size of the previous chapter is less than the specified value (in characters). The option is used together with the option [-c] or [-toc].
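The effect of [-j] can be pictured with a short Python sketch. It is an illustration of the rule described above, not the utility's actual code; in particular, measuring a chapter from the last kept beginning is an assumption, and the function name is hypothetical.

```python
from typing import List


def keep_chapter_starts(starts: List[int], min_size: int) -> List[int]:
    """Drop a chapter beginning when the chapter before it is shorter than min_size."""
    kept: List[int] = []
    for position in sorted(starts):
        # Keep the first beginning, then only those far enough from the last kept one.
        if not kept or position - kept[-1] >= min_size:
            kept.append(position)
    return kept


# Character offsets of chapter beginnings found in a document (made-up values).
print(keep_chapter_starts([0, 500, 2000, 2100, 5000], 1024))
# [0, 2000, 5000]
```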

-hh [text]
Inserts text in front of headings (for example: ## Chapter 1).

-d [file_name]
Uses a dictionary for pronunciation correction (*.BXD, *.DIC or *.REX). The command line may contain several [-d] options.
You can use the desktop application 'Balabolka' to edit a dictionary.

-if
Uses IFilter interface to extract text. If this fails, the default method will be used by the application.

-g [folder_name]
Sets the output folder where images from a document are saved.

-cvr [folder_name]
Sets the output folder where the book cover image is saved.

-x [file_type]
Sets the input file type. This defines the format of input documents that have unknown file name extensions. For example: -x doc.

-pwd [text]
Sets the password for the encrypted PDF files.

-? or -h
Prints the list of available command line options.

--remove-spaces or -rs
Removes excess spaces (two or more blank spaces in succession, no-break spaces).

--remove-hyphens or -rh
Removes hyphens at the ends of lines in the text.

--remove-linebreaks or -rl
Removes linebreaks inside paragraphs.

--remove-empty-lines or -rm
Removes empty lines.

--replace-empty-lines or -rp
Replaces several consecutive empty lines with one empty line.

--remove-square-brackets or -rsb
Removes text in [square brackets].

--remove-curly-brackets or -rcb
Removes text in {curly brackets}.

--remove-angle-brackets or -rab
Removes text in <angle brackets>.

--remove-round-brackets or -rrb
Removes text in (round brackets).

--remove-comments or -rc
Removes comments. Single-line comments start with // and continue until the end of the line. Multiline comments start with /* and end with */.
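The two comment styles can be illustrated with regular expressions. This Python sketch only shows what the description above means; it is not the utility's implementation. It is deliberately naive: for instance, it would also cut a line at the "//" inside a URL.

```python
import re

# /* ... */ may span several lines, so "." must also match newlines.
_BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
# // ... runs until the end of the line; the line break itself is kept.
_LINE_COMMENT = re.compile(r"//[^\n]*")


def strip_comments(text: str) -> str:
    """Remove multiline comments first, then single-line comments."""
    without_blocks = _BLOCK_COMMENT.sub("", text)
    return _LINE_COMMENT.sub("", without_blocks)


sample = "first // note\nsecond /* hidden\nstill hidden */ third"
print(repr(strip_comments(sample)))
# 'first \nsecond  third'
```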

--remove-page-numbers or -rpn
Removes page numbers (it may be useful for DjVu/PDF files).

--fix-ocr-errors or -ocr
Fixes OCR errors (for languages with Cyrillic alphabets only).

--fix-letter-spacing or -ls
Fixes letter-spacing in words (for example: s p a c e, _w_o_r_d).

--add-period or -ap
Adds a period if there is no punctuation after the last word of the paragraph.

--skip-summary or -ss
Skips a summary (also called "annotation"), when the application extracts text from FB2/FB3 files.

--skip-notes or -sn
Skips notes, when the application extracts text from DOCX/FB2/FB3/MD/ODT files.

--include-notes [integer] or -in [integer]
Includes notes inside text, when the application extracts text from DOCX/FB2/FB3/MD/ODT files.
Possible values for the integer parameter:
0 - removes links to notes from text,
1 - keeps default positions of notes inside text (this value is used by default),
2 - places notes at the end of sentences,
3 - places notes at the end of paragraphs.

--insert-note-begin [text] or -inb [text]
Inserts words at the beginning of notes, when notes are included inside text (for example: Editor's note.).
The option is used for DOCX/FB2/FB3/MD/ODT files.

--insert-note-end [text] or -ine [text]
Inserts words at the end of notes, when notes are included inside text (for example: End of note.).
The option is used for DOCX/FB2/FB3/MD/ODT files.

--csv-comma
Columns are separated by a comma, when the application extracts data from XLS/XLSX/ODS files (default delimiter for CSV files).

--csv-semicolon
Columns are separated by a semicolon, when the application extracts data from XLS/XLSX/ODS files.

--csv-space
Columns are separated by a blank space, when the application extracts data from XLS/XLSX/ODS files.

--csv-tab
Columns are separated by a tab, when the application extracts data from XLS/XLSX/ODS files.

--csv-double-quote
Uses double-quote characters if a field must be quoted (export from XLS/XLSX/ODS files).

--csv-single-quote
Uses single-quote characters if a field must be quoted (export from XLS/XLSX/ODS files).
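Text extracted from spreadsheets with these options can be read back with a standard CSV parser, provided the parser is given the same delimiter and quote character. A minimal Python sketch follows; the sample data is made up, the function name is hypothetical, and the way the utility escapes quote characters inside a field is not stated on this page.

```python
import csv
import io
from typing import List


def read_extracted_table(
    text: str, delimiter: str = ",", quotechar: str = '"'
) -> List[List[str]]:
    """Parse delimited text using the delimiter and quote character chosen at extraction."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar)
    return list(reader)


# Made-up output of an extraction run with --csv-semicolon --csv-single-quote.
sample = "name;note\n'Smith; J.';ok\n"
rows = read_extracted_table(sample, delimiter=";", quotechar="'")
print(rows)
# [['name', 'note'], ['Smith; J.', 'ok']]
```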

--eml-save [folder_name]
Extracts attachments from EML files and saves to a specified folder.

--eml-att
Extracts the list of attachments from EML files (names of files attached to the message).

--eml-cc
Extracts the header field "Cc" from EML files ("carbon copy"; it specifies additional recipients of the message).

--eml-date [date_format]
Extracts the header field "Date" from EML files (the local time and date when the message was composed and sent). The date format is defined by specifiers (such as "d", "m", "y", etc.). For example: "dd.mm.yyyy hh:nn:ss".

--eml-from
Extracts the header field "From" from EML files (the email address, and optionally the name of the author).

--eml-org
Extracts the header field "Organization" from EML files (the name of the organization through which the sender of the message has net access).

--eml-rt
Extracts the header field "Reply-To" from EML files (the address for replies to go to).

--eml-subj
Extracts the header field "Subject" from EML files (the subject of the message).

--eml-to
Extracts the header field "To" from EML files (the email address, and optionally the name of the message's recipient).

Text Extraction Command Line Examples:

Extract text from "book.doc" and save as "book.txt" to the output folder:
all2text.exe -f "d:\Docs\book.doc" -v "d:\Text\"

Alternatively, when only one input file is specified, the full output file name can be given:
all2text.exe -f "d:\Docs\book.doc" -out "d:\Text\book.txt"

Extract text from BOOK.DOC and save as "New Book.txt":
all2text.exe -f "d:\Docs\book.doc" -v "d:\Text\" -p "New Book"

Extract text from the Microsoft Word and RTF documents, remove empty lines and save text files in UTF-8 encoding:
all2text.exe -f "d:\Docs\*.doc" -f "d:\Docs\*.rtf" -v "d:\Text\" -e utf8 -rm

Extract text from all files in the specified folder, combine it, and save it as "Document.txt":
all2text.exe -f "d:\Docs\*.*" -v "d:\Text\" -p "Document" -u

Extract text from 1.DOC, split it into parts of about 100,000 characters each, and save them as text files "Document 20.txt", "Document 21.txt", etc.:
all2text.exe -f "d:\Docs\1.doc" -v "d:\Text\" -p "Document" -a -n 20 -t 100000

Extract text from BOOK.FB2, split the text into parts at the words "CHAPTER" and "CONTENTS", and save the parts as files named "Book 1.txt", "Book 2.txt", etc.:
all2text.exe -f "d:\Book\book.fb2" -v "d:\Text\" -p "Book" -k "CHAPTER" -k "CONTENTS"

Extract text from BOOK.EPUB, split the text into parts at "###", remove "###" from the text, and save each part as a new file:
all2text.exe -f "d:\Book\book.epub" -v "d:\Text\" -p "Book" -r "###"

Extract text from BOOK.FB2, split it by the table of contents, and save the files using chapter titles as file names. A chapter beginning is ignored if the previous chapter is shorter than 1024 characters:
all2text.exe -f "d:\Book\book.fb2" -v "d:\Text\" -p "%Number% - %Header%" -c -j 1024

Read text from STDIN, remove excess spaces and linebreaks inside paragraphs, replace repeated empty lines with a single one, and write the result to STDOUT:
all2text.exe -i -o --remove-spaces --remove-linebreaks --replace-empty-lines

Operating System Requirements:
* Microsoft Windows XP/Vista/7/8/10 and later systems.
* Supports both 32-bit and 64-bit systems.

Using Table of Contents:
The application can split text by a table of contents. There are two ways to get a table of contents:

- extract the table of contents from the document; this is available for the following formats: AZW, CHM, DOCX, EPUB, FB2, HTML, LIT, MHT, MOBI, ODT, PDB, PRC.
- generate a new table of contents from keywords found in the text (like "chapter" or "volume"); this is available for the following languages: English, Armenian, Belarusian, Bulgarian, Czech, French, German, Italian, Polish, Russian, Spanish, Ukrainian.

Be careful when using this feature: the result of text splitting may be unpredictable.

For example, HTML uses the <H1> to <H6> tags to define headings: <H1> marks the most important heading and <H6> the least important. The application uses every level of the <H> tag to generate a table of contents, so the result may contain many items.
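To estimate how many items an HTML document would produce, its headings can be listed before splitting. The Python sketch below uses the standard html.parser module; it only illustrates the idea of collecting every heading level and is not the utility's own logic. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class HeadingCollector(HTMLParser):
    """Collect (level, title) pairs for every h1..h6 element."""

    def __init__(self) -> None:
        super().__init__()
        self.headings: List[Tuple[int, str]] = []
        self._level = 0              # 0 means "not inside a heading"
        self._parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._level and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._parts).strip()))
            self._level = 0


def collect_headings(html: str) -> List[Tuple[int, str]]:
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


sample = "<h1>Book</h1><p>Intro</p><h2>Chapter 1</h2><h6>Footnote</h6>"
print(collect_headings(sample))
# [(1, 'Book'), (2, 'Chapter 1'), (6, 'Footnote')]
```

If the list is long, or contains minor headings such as the level 6 item in the sample, splitting by the table of contents will produce correspondingly many small files.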
