catdoc(1)		    General Commands Manual		     catdoc(1)

NAME
       catdoc - reads an MS-Word file and writes its content as plain text to
       standard output

SYNOPSIS
       catdoc [-vlu8btawxV] [-m number] [-s charset] [-d charset]
              [-f output-format] file

DESCRIPTION
       catdoc behaves much like cat(1), but it reads an MS-Word file and
       produces human-readable text on standard output. Optionally, it can use
       latex(1) escape sequences for characters that have special meaning in
       LaTeX. It also makes some effort to recognize MS-Word tables, although
       it never tries to write correct headers for the LaTeX tabular
       environment. Additional output formats, such as HTML, can be easily
       defined.

       catdoc doesn't attempt to extract formatting information other than
       tables from an MS-Word document, so the different output modes differ
       mainly in which characters are escaped and in how characters missing
       from the output charset are represented. See CHARACTER SUBSTITUTION
       below.

       catdoc uses an internal unicode(4) representation of text, so it is
       able to convert text when the charset of the source document doesn't
       match the charset of the target system. See CHARACTER SETS below.

       If no file names are supplied, catdoc processes its standard input
       unless it is a terminal. It is unlikely that somebody could type a
       Word document from the keyboard, so if catdoc is invoked without
       arguments and stdin is not redirected, it prints a brief usage message
       and exits. Processing of standard input (even among other files) can
       be forced by using a dash '-' as a file name.

       By default, catdoc wraps lines that are more than 72 chars long and
       separates paragraphs by blank lines. This behavior can be turned off
       with the -w switch. In wide mode catdoc prints each paragraph as one
       long line, suitable for import into word processors that perform word
       wrapping.
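
       The two layouts can be illustrated with a short Python sketch. It
       is not catdoc's own wrapping code: the function name render is
       hypothetical, textwrap stands in for catdoc's line breaking, and
       joining wide-mode paragraphs with a single newline is an assumption.

```python
import textwrap


def render(paragraphs, margin=72):
    """Lay out paragraphs roughly the way the manual describes.

    margin > 0: wrap each paragraph and separate paragraphs by a blank line.
    margin == 0: wide mode, one long line per paragraph (assumed separator:
    a single newline).
    """
    if margin == 0:
        return "\n".join(paragraphs)
    return "\n\n".join(textwrap.fill(p, width=margin) for p in paragraphs)


if __name__ == "__main__":
    text = ["word " * 40, "A second, short paragraph."]
    print(render(text))            # default layout, right margin 72
    print(render(text, margin=0))  # wide mode
```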

OPTIONS
       -a      -  shortcut for -f ascii. Produces ASCII	text as	output.	 Sepa-
	       rates table columns with	TAB

       -b      - process broken MS-Word file. Normally, catdoc checks whether
               the first 8 bytes of the file are the Microsoft OLE signature.
               If so, it processes the file; otherwise it just copies it to
               stdout. This check is intended to let catdoc be used as a
               filter for viewing all files with the .doc extension.

       -dcharset
               - specifies the destination charset name. The charset file has
               the format described in CHARACTER SETS below, should have a
               .txt extension, and should reside in the catdoc library
               directory (${exec_prefix}/lib/catdoc). By default, the current
               locale charset is used if langinfo support is compiled in.

       -fformat
	       - specifies output format as described in  CHARACTER  SUBSTITU-
	       TION  below.   catdoc comes with	two output formats - ascii and
	       tex. You	can add	your own if you	wish.

       -l      Causes catdoc to list the names of available charsets on
               stdout and exit successfully.

       -mnumber
	       Specifies right margin for text	(default 72).  -m 0 is equiva-
	       lent to -w

       -scharset
               Specifies the source charset (the one used in the Word
               document) if the Word document doesn't contain UTF-16 text.
               When reading RTF documents this is typically not necessary,
               because RTF documents contain an ansicpg specification. But
               Word can set it wrong (I've seen RTF documents in Russian
               where cp1252 was specified). In this case this option takes
               precedence over the charset specified in the document. The
               source_charset statement in the configuration file, however,
               has lower priority than the charset in the document.

       -t      - shortcut for -f tex. Converts all printable chars that have
               special meaning for LaTeX(1) into appropriate control
               sequences. Separates table columns by &.

       -u      - declares that the Word document contains a UNICODE (UTF-16)
               representation of text (as some Word-97 documents do). If
               catdoc fails to read a Word document correctly with the
               default charset, try this option.

       -8      - declares that the Word document is 8-bit, just in case
               catdoc recognizes the file format incorrectly.

       -w      disables	 word wrapping.	By default catdoc output is split into
	       lines not longer	than 72	(or  number, specified by -m   option)
	       characters  and	paragraphs  are	 separated by blank line. With
	       this option each	paragraph is one long line.

       -x      causes catdoc to	output unknown UNICODE	character  as  \xNNNN,
	       instead of question marks.

       -v      causes catdoc to	print some useless information about word doc-
	       ument structure to stdout before	actual start of	text.

       -V      outputs catdoc version
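
       The signature test mentioned under -b can be sketched in Python as
       follows. This is an illustration, not catdoc's source: the function
       name is hypothetical, and the byte values are the well-known header
       signature of OLE compound files.

```python
# Header signature of an OLE compound file (first 8 bytes of the file).
OLE_SIGNATURE = bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1])


def looks_like_ole(data: bytes) -> bool:
    """Return True if the data starts with the OLE signature."""
    return data[:8] == OLE_SIGNATURE


if __name__ == "__main__":
    import sys

    # Read the first 8 bytes of the file named on the command line.
    with open(sys.argv[1], "rb") as handle:
        head = handle.read(8)
    print("OLE container" if looks_like_ole(head) else "not an OLE container")
```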

CHARACTER SETS
       When processing an MS-Word file, catdoc uses information about two
       character sets, typically different: input and output. They are stored
       in plain text files in the catdoc library directory. Each line of a
       character set file should contain two whitespace-separated hexadecimal
       numbers: the 8-bit code in the character set and the 16-bit Unicode
       code. Anything from a hash mark to the end of the line is ignored, as
       are blank lines.
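
       A minimal Python sketch of a reader for this file format follows.
       The function name and the sample entries are illustrative (the
       entries mirror the published cp1252 mapping); lines with fewer than
       two fields are skipped, which is an assumption about how incomplete
       entries should be treated.

```python
def parse_charset(text):
    """Return a dict mapping 8-bit codes to Unicode code points."""
    table = {}
    for line in text.splitlines():
        # Drop everything from a hash mark to the end of the line.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # blank line, comment-only line, or incomplete entry
        # int(..., 16) accepts hexadecimal with or without a 0x prefix.
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


SAMPLE = """\
# illustrative excerpt, not a complete charset
0x41 0x0041   # LATIN CAPITAL LETTER A
0x80 0x20AC   # EURO SIGN

0x93 0x201C   # LEFT DOUBLE QUOTATION MARK
"""

table = parse_charset(SAMPLE)
print(sorted(hex(code) for code in table))
```

       For an output charset the same table would be used in the opposite
       direction, from Unicode code point to 8-bit code.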

       The catdoc distribution includes some of these character sets.
       Additional character set definitions, directly usable by catdoc, can
       be obtained from ftp.unicode.org. Charset files have a .txt suffix,
       which shouldn't be specified on the command line or in configuration
       files.

       Note that catdoc is distributed with Cyrillic charsets as default. If
       you are not Russian, you probably don't want them, and should
       reconfigure catdoc at compile time or in the runtime configuration
       file.

       When  dealing with documents with charsets other	than default, remember
       that Microsoft never uses ISO charsets. While letters  in,  say	cp1252
       are at the same position	as in ISO-8859-1, some punctuation signs would
       be lost,	if you specify ISO-8859-1 as input charset. If you use cp1252,
       catdoc  would deal with those signs as described	in CHARACTER SUBSTITU-
       TION below.

CHARACTER SUBSTITUTION
       catdoc converts	MS-Word	file into following internal Unicode represen-
       tation:

       1. Paragraphs are separated by ASCII Line Feed symbol (0x000A)

       2. Table	cells within row are separated by ASCII	Field Separator	symbol
	   (0x001C)

       3. Table	rows are separated by ASCII Record Separator (0x001E)

       4. All printable characters, including whitespace, are represented
           with their respective UNICODE codes.
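
       As a rough illustration of these separators, the Python sketch
       below turns table markup in the internal representation into
       TAB-separated text, as the -a option is described to do for
       columns. Ending each table row with a newline is an assumption,
       and the function name is hypothetical.

```python
CELL_SEP = "\x1c"  # ASCII Field Separator, between cells of a row
ROW_SEP = "\x1e"   # ASCII Record Separator, between table rows


def to_tab_separated(text):
    """Replace cell separators with TAB and row separators with newline."""
    return text.replace(CELL_SEP, "\t").replace(ROW_SEP, "\n")


if __name__ == "__main__":
    internal = "Name\x1cSize\x1e" + "a.doc\x1c12\x1e"
    print(to_tab_separated(internal), end="")
```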

       This UNICODE representation is subsequently converted into  8-bit  text
       in target character set using following four-step algorithm:

       1. List of special characters is	searched for given Unicode character.
	   If  found,  then appropriate	multi-character	sequence is output in-
	   stead of character.

       2. If there is an equivalent in target character	set, it	is output.

       3. Otherwise, the replacement list is searched and, if there is a
           multi-character substitution for this UNICODE char, it is output.

       4. If all above fails, "Unknown char" symbol (question mark) is output.

       Lists of	special	characters and list of substitution are	character set-
       independent, because special chars  should  be  escaped	regardless  of
       their  existence	 in  target character set  (usually, they are parts of
       US-ASCII, and therefore exist in	any  character	set)  and  replacement
       list is searched	only for those characters, which are not found in tar-
       get character set.
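
       The four-step lookup can be sketched in Python as shown below. The
       dictionaries stand in for the special character list, the target
       charset, and the replacement list; their contents and all names are
       made up for the example and are not taken from the files shipped
       with catdoc.

```python
def convert_char(cp, special, target, replacement, unknown=b"?"):
    """Convert one Unicode code point to bytes in the target charset."""
    if cp in special:            # 1. special characters are always escaped
        return special[cp]
    if cp in target:             # 2. direct equivalent in the target charset
        return bytes([target[cp]])
    if cp in replacement:        # 3. multi-character substitution
        return replacement[cp]
    return unknown               # 4. "unknown char" symbol


def convert(text, special, target, replacement):
    """Convert a whole string, one character at a time."""
    return b"".join(
        convert_char(ord(ch), special, target, replacement) for ch in text
    )


# Illustrative tables: a TeX-like escape, printable ASCII as the target
# charset, and one replacement entry.
special = {0x25: b"\\%"}
target = {cp: cp for cp in range(0x20, 0x7F)}
replacement = {0x00A9: b"(c)"}

result = convert("50% \u00a9 \u4e2d", special, target, replacement)
print(result)
```

       The percent sign is escaped even though it exists in the target
       charset, which is why the special character list is consulted
       first.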

       These lists are stored in catdoc	library	directory in files with	prefix
       of format name. These files have	following format:

       Each line is either a comment (starting with a hash mark) or contains
       a hexadecimal UNICODE value, separated by whitespace from the string
       that is substituted for it. If the string contains no whitespace it
       can be used as is; otherwise it should be enclosed in single or double
       quotes. Usual backslash sequences like '\n' and '\t' can be used in
       these strings.
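
       A minimal Python sketch of a reader for such a map file follows.
       Only the \n and \t sequences named above are interpreted; leaving
       every other backslash sequence untouched and skipping blank lines
       are assumptions, and the sample entries are invented for the
       example.

```python
import re

ESCAPES = {"n": "\n", "t": "\t"}


def unescape(value):
    """Expand the backslash sequences \\n and \\t; leave the rest as is."""
    return re.sub(r"\\([nt])", lambda m: ESCAPES[m.group(1)], value)


def parse_map(text):
    """Return a dict mapping Unicode code points to substitution strings."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comment (or blank line, skipped by assumption)
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # no substitution string given
        value = parts[1]
        # Strip one pair of matching single or double quotes, if present.
        if len(value) >= 2 and value[0] in "'\"" and value[-1] == value[0]:
            value = value[1:-1]
        mapping[int(parts[0], 16)] = unescape(value)
    return mapping


SAMPLE = r'''
# invented entries, for illustration only
00a9 "(c) "
2022 ' * '
000c "\n\n"
2026 ...
'''

mapping = parse_map(SAMPLE)
print(mapping)
```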

RUNTIME	CONFIGURATION
       Upon startup catdoc reads its system-wide configuration file ( catdocrc
       in catdoc library directory) and	then user-specific configuration  file
       ${HOME}/.catdocrc.

       These files can contain following directives:

       source_charset =	charset-name
	       Sets  default  source charset, which would be used if no	-s op-
	       tion specified. Consult configuration of	nearby	windows	 work-
	       station to find one you need.

       target_charset = charset-name
               Sets the default output charset. You probably know which one
               you use.

       charset_path = directory-list
               Colon-separated list of directories that are searched for
               charset files. This allows you to install additional charsets
               in your home directory. If the first directory component of a
               path is ~, it is replaced by the contents of the HOME
               environment variable. On the MS-DOS platform, if a directory
               name starts with %s, it is replaced with the directory of the
               executable file. An empty element in the list (i.e. two
               consecutive colons) is considered the current directory.

       map_path	= directory-list
	       colon-separated list of directories,  which  are	 searched  for
	       special	character  map and replacement map.  Same substitution
	       rules as	in charset_path	are applied.

       format = format name
               Output format that is used by default. catdoc comes with two
               formats, ascii and tex, but nothing prevents you from writing
               your own format (a set of two map files: a special character
               map and a replacement map).

       unknown_char = character specification
               Sets the character to output instead of an unknown Unicode
               character (default '?'). The character specification can have
               one of two forms: a character enclosed in single quotes or a
               hexadecimal code.

       use_locale = (yes|no)
               Enables or disables automatic selection of the output charset
               (default yes), based on system locale settings (if enabled at
               compile time). If automatic detection is enabled, then output
               charset settings in the configuration files (but not on the
               command line) are ignored, and the current system locale
               charset is used instead. There is no automatic choice of input
               charset based on locale language, because most modern Word
               files (since Word 97) are Unicode anyway.
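
       The directive syntax and the path rules above can be illustrated
       with a short Python sketch. It is not catdoc's parser: the function
       names are hypothetical, the sample values (charset names and
       directories) are only examples, and the MS-DOS %s rule is left out.

```python
def parse_rc(text):
    """Collect 'name = value' directives into a dict."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not a directive
        settings[name.strip()] = value.strip()
    return settings


def expand_path_list(value, home):
    """Split a colon-separated directory list and apply the path rules."""
    dirs = []
    for item in value.split(":"):
        if item == "":
            dirs.append(".")                # empty element: current directory
        elif item == "~" or item.startswith("~/"):
            dirs.append(home + item[1:])    # leading ~ becomes HOME
        else:
            dirs.append(item)
    return dirs


SAMPLE = """\
source_charset = cp1251
target_charset = koi8-r
use_locale = no
charset_path = ~/charsets::/usr/local/share/catdoc
"""

settings = parse_rc(SAMPLE)
search_dirs = expand_path_list(settings["charset_path"], "/home/alice")
print(search_dirs)
```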

BUGS
       Doesn't	handle fast-saves properly. Prints footnotes as	separate para-
       graphs at the end of file, instead of producing correct LaTeX commands.
       Cannot distinguish between empty	table cell and end of table row.

SEE ALSO
       xls2csv(1), catppt(1), cat(1), strings(1), utf(4), unicode(4)

AUTHOR
       V.B.Wagner <vitus@45.free.net>

MS-Word	reader		   Version @catdoc_version@		     catdoc(1)
