source: app/doc/utt.texinfo @ 261bf62

Last change on this file since 261bf62 was 261bf62, checked in by obrebski <obrebski@…>, 16 years ago

w utt.texinfo

git-svn-id: svn://atos.wmid.amu.edu.pl/utt@60 e293616e-ec6a-49c2-aa92-f4a8b91c5d16

[25ae32e]1\input texinfo   @c -*-texinfo-*-
2@documentencoding ISO-8859-2
3@c @documentlanguage pl
4
5@c %**start of header
6@setfilename utt.info
7@settitle UAM Text Tools v0.90
8@c %**end of header
9
10@copying
[261bf62]11This manual is for UAM Text Tools (version 0.90, October, 2008)
[25ae32e]12
[19760ef]13Copyright @copyright{}  2005, 2007  Tomasz Obrębski, Michał Stolarski, Justyna Walkowska, Paweł Konieczka.
[25ae32e]14
15Permission is granted to copy, distribute and/or modify this document
[261bf62]16under the terms of the GNU Free Documentation License, Version 1.2 or
17any later version published by the Free Software Foundation; with no
18Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.  A
19copy of the license is included in the section entitled ``GNU Free
20Documentation License''.
[25ae32e]21
22@c @quotation
23@c Permission is granted to ...
24@c No permission is granted until the document is completed.
25@c @end quotation
26@end copying
27
28
29@titlepage
30@title UAM Text Tools 0.90 - User Manual
31@subtitle edition 0.01, @today
32@subtitle status: prescript
33@author by Justyna Walkowska, Tomasz Obr@,{}ebski and Micha@l{} Stolarski
34@page
35@vskip 0pt plus 1filll
36@insertcopying
37@end titlepage
38
39@contents
40
41@c @paragraphindent none
42
43@iftex
44@parskip = 0.5@normalbaselineskip plus 3pt minus 1pt
45@end iftex
46
47@c @headings off
48@c @everyheading LEM(1) @| @| LEM(1)
49@everyfooting @today @c @| @thispage @|
50
51@ifnottex
52
53@node Top
54@top UTT - UAM Text Tools
55
56@insertcopying
57
58@menu
59* General information::                       
60* UTT file format::             
61* Configuration files::         
62* UTT components::
63* Auxiliary tools::
64* Usage examples::             
65* PMDBF dictionary::           
66@c * Examples::                   
67@c * Copyright::
68* GNU Free Documentation License::
69* Reporting bugs::                                   
70* Author::                     
71@end menu
72@end ifnottex
73
74
75@c ----------------------------------------------------------------------
76
77@node General information
78@chapter General information
79
80UAM Text Tools (UTT) is a package of language processing tools
81developed at Adam Mickiewicz University. Its functionality includes:
82
83@itemize @bullet
84
85@item
86tokenization
87@item
88dictionary-based morphological analysis
89@item
90heuristic morphological analysis of unknown words
91@item
92spelling correction
93@item
94pattern search
95@item
96sentence splitting
97@item
98generation of concordance tables
99@end itemize
100
101The toolkit is intended for processing raw (unannotated),
102unrestricted text for any purpose.
103
104The system is organized as a collection of command-line programs, each
105performing one operation, e.g. tokenization, lemmatization, spelling
106correction. The components are independent of one another; the
107unifying element is the uniform i/o file format.
108
109The components may be combined in various ways to provide various text
110processing services. New components supplied by the user may also be
111easily incorporated into the system, provided that they respect the i/o
112file format conventions.
113
114UTT component programs do not depend on any specific tagset or
115morphological description format.
116
117UTT is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by
118the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
119
120The Polex/PMDBF dictionary is licensed under the Creative Commons by-nc-sa License which prohibits commercial use. 
121
122
123List of contributors:
124
125@itemize
126@item Pawel Konieczka
127@item Tomasz Obrebski
128@item Michal Stolarski
129@item Marcin Walas
130@item Justyna Walkowska
[04ae414]131@item Pawel Werenski
[25ae32e]132@end itemize
133
134@c ----------------------------------------------------------------------
135@c ---------------------------------------------------------------------
136
137@node    UTT file format
138@chapter UTT file format
139
140A UTT file contains annotation of a text. It consists of a sequence of
141segments. Each segment explicitly refers to a continuous piece of the
142text and provides some information on it.
143
144@section Segment format
145
146A segment occupies one line of a UTT file and consists of
147space-separated fields:
148
149
150@quotation
151@sp 1
152[@var{start} [@var{length}]] @var{type} @var{form} [@var{annotation1} [@var{annotation2} ...]]
153@sp 1
154@end quotation
155
156@table @var
157
158@item @var{start}
159Non-negative integer value indicating the position in the source text where the
160segment starts.
161
162@item @var{length}
163Non-negative integer value indicating the length of the segment.
164
165@item @var{type}
166A sequence of non-blank characters (without spaces or digits, which could lead to @var{type} being misinterpreted as a @var{start} or @var{length} field).
167@var{type} reflects the main classification of segments -
168into words, numbers, punctuation marks, meta-text markers.
169@xref{tok output,,tok output}, for description of automatically recognized type markers.
170
171@item @var{form}
172This field contains the textual form of the segment or the special
173symbol @code{*} indicating that the form is not given (e.g. when the segment has been created artificially to mark something and is of length 0).
174
175The characters or character sequences that have special meaning in the
176@var{form} field are enumerated below.
177
178Characters with special meaning:
179
180@itemize
181@item @code{_} - space character
182@item @code{*} - undefined contents
183@end itemize
184
185Escape sequences:
186
187@itemize
188@item @code{\n} - new line
189@item @code{\t} - tabulation
190@item @code{\r} - carriage return 
191
192@item @code{\_} - the @code{_} character
193@item @code{\*} - the @code{*} character
194@item @code{\\} - the @code{\} character
195
196@c @item @code{\hh} - a character with hexadecimal code @code{hh} (used for non-printable characters)
197@end itemize
198
199@item @var{annotation1}
200@item @var{annotation2}
201@item ...
202Annotation fields have the following format:
203
204@var{longname} @code{:} @var{value}
205
206or
207
208@var{shortname} @var{value}
209
210where @var{longname} is a string of alphanumeric characters
211(isalnum() test), @var{shortname} - a single non-alphanumeric character
212(ispunct() test), and @var{value} is an arbitrary string of non-blank characters.
213
214@end table
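The special characters and escape sequences of the @var{form} field listed above can be decoded mechanically. The following Python sketch is an illustration only, not part of UTT: the name @code{decode_form} is hypothetical, and keeping an unlisted escape sequence verbatim is an assumption, since the manual does not define that case.

```python
# Escape sequences listed for the form field. Any other character after a
# backslash is kept verbatim (an assumption; the manual does not cover it).
ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "_": "_", "*": "*", "\\": "\\"}


def decode_form(form):
    """Return the text denoted by a form field, or None for a bare '*'."""
    if form == "*":
        return None  # undefined contents
    out = []
    i = 0
    while i < len(form):
        ch = form[i]
        if ch == "\\" and i + 1 < len(form):
            nxt = form[i + 1]
            out.append(ESCAPES.get(nxt, "\\" + nxt))
            i += 2
        elif ch == "_":
            out.append(" ")  # an unescaped underscore stands for a space
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For instance, the form @code{_} decodes to a single space, while @code{\_} decodes to a literal underscore.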
215
216
217Only two fields are mandatory: @var{type} and @var{form}. All other fields
218may be absent. In the case when only one number precedes the
219@var{type} field, it is interpreted as the @var{start} position.
220
221If the @var{length} field is omitted, the length of the segment is the
222length of the @var{form} field, except when the value of the
223@var{form} field is @code{*} -- in this case, the length is assumed to
224be 0.
225
226If the @var{start} field is also absent, the segment is assumed to directly
227follow the preceding one.
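The defaulting rules above can be summarized in a short Python sketch. It is an illustration under stated assumptions, not the parser used by the UTT programs: the function names are hypothetical, fields are assumed to be separated by blanks, and an escape sequence in the form is assumed to count as one character of text.

```python
import re

NUMBER = re.compile(r"[0-9]+")


def form_length(form):
    """Length of the text denoted by a form; an escape counts as one character."""
    return len(re.sub(r"\\.", "x", form))


def parse_segment(line, prev_end=0):
    """Parse one UTT segment line; prev_end is where the previous segment ended."""
    fields = line.split()
    numbers = []
    # At most two leading numeric fields: start, then length.
    while fields and len(numbers) < 2 and NUMBER.fullmatch(fields[0]):
        numbers.append(int(fields.pop(0)))
    if len(fields) < 2:
        raise ValueError("type and form are mandatory: %r" % line)
    seg_type, form, annotations = fields[0], fields[1], fields[2:]
    # A single number is the start; with no number the segment follows the
    # preceding one.
    start = numbers[0] if numbers else prev_end
    if len(numbers) == 2:
        length = numbers[1]
    elif form == "*":
        length = 0
    else:
        length = form_length(form)
    return {"start": start, "length": length, "type": seg_type,
            "form": form, "annotations": annotations}
```

A caller would pass @code{start + length} of the previous segment as @code{prev_end} when reading a file line by line.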
228
229@c Conventions:
230
231@c Annotation fields with predefined meaning:
232
233@c @itemize
234@c @item @code{!} - UTT components are allowed to modify the contents of
235@c the @var{form} field (e.g. spelling correction does this). If this happens the
236@c original form of the segment have to be placed in the @code{!}-field.
237@c @item @code{@@} - morphological description
238@c @item @code{=} - node identifier assignment (used in graph encoding)
239@c @item @code{<} - preceding/dominating node(s) (used in graph encoding)
240@c @item @code{>} - succeeding/subordinate node(s) (used in graph encoding)
241@c @end itemize
242
243Segments of length 0 may be used to mark file positions with some
244information. See e.g. BOS and EOS (beginning/end of sentence) markers
245in the example below.
246
247Example:
248
249sentence: @samp{Piszemy dobre progrumy.}
250
251@example
2520000 00 BOS *
[19760ef]2530000 07 W Piszemy lem:pisać,V
[25ae32e]2540007 01 S _
2550008 05 W dobre lem:dobry,ADJ
2560013 01 S _
2570014 08 W progrumy cor:programy lem:program,N
2580022 01 P .
2590023 00 EOS *
2600023 01 S _
2610024 00 BOS *
2620024 11 W Warszawiacy lem:Warszawiak,N
2630035 01 S _
[19760ef]2640036 03 W też
[25ae32e]2650039 01 P .
2660040 00 EOS *
267
268@end example
269
270@example
2710000 BOS *
[19760ef]2720000 W Piszemy lem:pisać,V
[25ae32e]2730007 S _
2740008 W dobre lem:dobry,ADJ
2750013 S _
2760014 W progrumy cor:programy lem:program,N
2770022 P .
2780023 EOS *
279@end example
280
281Position information may be provided only for some types of segments:
282
283@example
2840000 BOS *
[19760ef]285W Piszemy lem:pisać,V
[25ae32e]286S _
287W dobre lem:dobry,ADJ
288S _
289W progrumy cor:programy lem:program,N
290P .
291EOS *
292S _
2930024 BOS *
294W Warszawiacy lem:Warszawiak,N
295S _
[19760ef]296W też
[25ae32e]297P .
298EOS *
299@end example
300
301Position/length information may be provided only when necessary:
302
303@example
3040000 04 N *
3050000 N 12
306P .
307N 5
308S _
309W km
310@end example
311
312@section UTT File
313
314A UTT file consists of a sequence of segments.  The same text position
315may be covered by multiple segments. In consequence, ambiguous text
316segmentation and ambiguous annotation may be represented.
317
318There are two structural requirements a valid UTT-formatted file
319has to meet:
320
321@itemize @bullet
322
323@item
324segments have to be sorted with respect to the @var{start} field,
325
326@item
327for each
328segment ending at position @var{n}, either there must be a segment starting at
329position @var{n+1}, or position @var{n+1} is not covered by any segment; similarly
330for each segment starting at position @var{n}, either there must be a segment
331ending at position @var{n-1}, or the position @var{n-1} must not be covered
332by any segment.
333
334@end itemize
335
336A valid annotation for the text fragment
337@example
33812.5 km
339@end example
340
341may be
342
343@example
3440000 02 N 12
3450000 04 N 12.5
3460002 01 P .
3470003 01 N 5
3480004 01 S _
3490005 02 W km
350@end example
351
352but not
353
354@example
3550000 02 N 12
3560000 04 N 12.5
3570004 01 S _
3580005 02 W km
359@end example
360
[261bf62]361because in the latter example the first segment (starting at position
3620000, 2 characters long) ends at position @var{n}=0001, position
363@var{n+1}=0002 is covered by the second segment, and no segment starts
364at position 0002.
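The two structural requirements can be checked with a few lines of Python. The sketch below is an illustration, not a UTT tool: the function name is hypothetical, segments are given as (start, length) pairs, and zero-length segments are assumed to cover no position and are therefore ignored by the boundary check.

```python
def is_valid_utt(segments):
    """segments: list of (start, length) pairs in file order."""
    starts = [start for start, _ in segments]
    if starts != sorted(starts):
        return False  # requirement 1: sorted by start position
    # Half-open spans [start, end); zero-length markers cover no position.
    spans = [(s, s + n) for s, n in segments if n > 0]
    span_starts = {s for s, _ in spans}
    span_ends = {e for _, e in spans}

    def covered(pos):
        return any(s <= pos < e for s, e in spans)

    for s, e in spans:
        # e is the position right after the segment (n+1 in the text).
        if covered(e) and e not in span_starts:
            return False
        # s-1 is the position right before the segment (n-1 in the text).
        if s > 0 and covered(s - 1) and s not in span_ends:
            return False
    return True
```

Applied to the two annotations of @samp{12.5 km} above, the check accepts the first and rejects the second, because position 0002 is covered but no segment starts there.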
365
366
367@section Flattened UTT file
368
369The UTT file format has two variants: regular and flattened. The regular
370format was described above.  In the flattened format some of the
371end-of-line characters are replaced with line-feed characters.
372
373The flattened format is basically used to represent whole sentences as
374single lines of the input file (all intrasentential end-of-line
375characters are replaced with line-feed characters).
376
377This technical trick makes it possible to perform certain text
378processing operations on entire sentences with the use of such tools as
379@command{grep} (see @command{grp} component) or @command{sed} (see  @command{mar} component).
380
381The conversion between the two formats is performed by the tools:
382@command{fla} and @command{unfla}.
[25ae32e]383
384@section Character encoding
385
386The UTT component programs accept only 1-byte character encodings, such
[261bf62]387as ISO, ANSI, DOS.
[25ae32e]388
389
390@c @section Formats
391
392@c @unnumberedsubsubsec Basic format
393
394@c While processing large amounts of the overhead related with explicit
395@c ... of the start position and segment length becomes ... . Therefore,
396@c for efficiency reasons certain shortcuts are possible:
397
398@c @unnumberedsubsubsec Relative start position
399
400@c Start position may be given as relative distance from the last
401@c absolut position.
402
403@c @unnumberedsubsubsec Absent length
404
405@c Segment length may by omitted. Normally it can be restored by counting
406@c the length of the @emph{form field}. For segments with the special value
407@c @code{*} in the @emph{form field} length 0 is assumed.
408
409@c @unnumberedsubsubsec Absent length and start position
410
411@c Both start position and segment length may be omitted. In this format
412@c each segment is assumed to follow the previous one. This format is,
413@c therefore, suitable only for unambiguously tagged text
414@c (0-length markers can be still used.)
415
416
417@c @table @code
418@c @item AL
419@c @code{1234 03 W kot}
420@c @item RL
421@c @code{+56 03 W kot}
422@c @item A
423@c @code{1234 W kot}
424@c @item R
425@c @code{+56 W kot}
426@c @item 0
427@c @code{W kot}
428@c @end table
429
430
[19760ef]431@c [HOW TO GET POLISH FONTS IN DVI???]
[25ae32e]432
433@macro parhelp
434@item @b{@minus{}@minus{}help}, @b{@minus{}h}
435Print help.
436@end macro
437
438
439@macro parversion
440@item @b{@minus{}@minus{}version}, @b{@minus{}V}
441Print version information.
442@end macro
443
444@macro parinteractive
445@item @b{@minus{}@minus{}interactive, @minus{}i}
446This option toggles interactive mode, which is by default off. In the
447interactive mode the program does not buffer the output.
448@end macro
449
450
451@c @macro parfile
452@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
453@c Input file name.
454@c If this option is absent or equal to '@minus{}', the program
455@c reads from the standard input.
456@c @end macro
457
458
459@c @macro paroutput
460@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
461@c Regular output file name. To regular output the program sends segments
462@c which it successfully processed and copies those which were not
463@c subject to processing. If this option is absent or equal to
464@c '@minus{}', standard output is used.
465@c @end macro
466
467@c @macro parfail
468@c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}}
469@c Fail output file name. To fail output the program copies the segments
470@c it failed to process.  If this option is absent or equal to
471@c '@minus{}', standard output is used.
472@c @end macro
473
474
475@c @macro parcopy
476@c @item @b{@minus{}@minus{}copy, @minus{}c}
477@c Copy succesfully processed segments to regular output also in their
478@c original input form.
479@c @end macro
480
481
482@macro parinputfield
483@item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}}
484The field containing the input to the program. The default is the
485@var{form} field. The fields @var{start}, @var{length}, @var{type},
486and @var{form} are referred to as @code{1}, @code{2}, @code{3},
487@code{4}, respectively.
488@end macro
489
490
491@macro paroutputfield
492@item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}}
493The name of the field added by the program. The default is the name of the program.
494@end macro
495
496
497@macro pardictionary
498@item @b{@minus{}@minus{}dictionary=@var{filename}, @minus{}d @var{filename}}
499Dictionary file name.
500@end macro
501
502
503@macro parprocess
504@item @b{@minus{}@minus{}process=@var{type}, @minus{}p @var{type}}
505Process segments with the specified value in the @var{type} field.
506Multiple occurrences of this option are allowed and are interpreted as
507disjunction. If this option is absent, all segments are processed.
508@end macro
509
510
511@macro parselect
512@item @b{@minus{}@minus{}select=@var{fieldname}, @minus{}s @var{fieldname}}
513Select for processing only segments in which the field named
514@var{fieldname} is present. Multiple occurrences of this option are
515allowed and are interpreted as conjunction of conditions. If this
516option is absent, all segments are processed.
517@end macro
518
519
520@macro parunselect
521@item @b{@minus{}@minus{}unselect=@var{fieldname}, @minus{}S @var{fieldname}}
522Select for processing only segments in which the field @var{fieldname}
523is absent.  Multiple occurrences of this option are allowed and are
524interpreted as conjunction of conditions. If this option is absent,
525all segments are processed.
526@end macro
527
528
529@macro paroneline
530@item @b{@minus{}@minus{}one-line}
531This option makes the program print ambiguous annotation in one output
532line by generating multiple annotation fields. By default, when
533ambiguous annotation may be produced for a segment, the segment is
534duplicated and each of the annotations is added to a separate copy of
535the segment.
536@end macro
537
538
539@macro paronefield
540@item @b{@minus{}@minus{}one-field, @minus{}1}
541This option makes the program print ambiguous annotation in one
542annotation field. By default, when ambiguous annotation may be produced
543for a segment, the segment is duplicated and each of the
544annotations is added to a separate copy of the segment.
545
546This option is useful when working with @command{kot} or @command{con}.
547@end macro
548
549
550@c ---------------------------------------------------------------------
551@c CONFIGURATION FILES
552@c ---------------------------------------------------------------------
553
554@node    Configuration files
555@chapter Configuration files
556
557Values for all command line options accepted by a component
558may be set in configuration files. The default locations of the
559configuration files for a component named @command{@var{program}} are
560
561@example
[246900a]562        @file{/usr/local/etc/utt/@var{program}.conf}
[25ae32e]563@end example
564
565for the system-wide configuration file and
566
567@example
[246900a]568        @file{~/.utt/@var{program}.conf}
[25ae32e]569@end example
570
571for the user configuration file.
572
573@c The configuration file to load may be also specified with the
574@c @option{--config} option. Configuration file need not be provided.
575
576For each option, the value is set according to the following priority:
577
578@itemize
579@item command line
580@c @item configuration file indicated with @option{--config} option
581@item user configuration file (or configuration file indicated with the @option{--config} option)
582@item system-wide configuration file
583@end itemize
584
585Parameter values are specified in the following format:
586
587@var{parametername}=@var{value}
588
589where @var{parametername} is the short or long name of an option accepted by
590the program, or
591
592@var{parametername}
593
594if the option does not need arguments.
595
596You can add comments to configuration files using the @code{#} sign.
597
598If a program accepts multiple occurrences of an option (e.g. the @option{--select} option of @command{lem}), you can specify each occurrence on a separate line of the program's configuration file.
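The configuration file syntax and the priority order described above can be illustrated with a small Python sketch. It is not the code used by the UTT programs; the function names are hypothetical, and it assumes that a comment runs from the @code{#} sign to the end of the line, which the manual does not state explicitly.

```python
def parse_config(text):
    """Map each option name to the list of values given for it, in order."""
    options = {}
    for raw in text.splitlines():
        # Assumption: '#' starts a comment that runs to the end of the line.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, value = line.partition("=")
        # An option without '=' takes no argument; record it as True.
        options.setdefault(name.strip(), []).append(value.strip() if sep else True)
    return options


def effective(name, command_line, user, system):
    """Return the values of an option from the highest-priority source."""
    for source in (command_line, user, system):
        if name in source:
            return source[name]
    return None
```

Repeated options such as @code{select} accumulate in one list, so two @code{select=...} lines behave like two occurrences of the option on the command line.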
599
600@c The equal sign may be omitted.
601
602
603@quotation Tip
604If you have two (or more) frequently used sets of options for the same
605program (e.g. lem with the PMDBF dictionary and lem with a user dictionary),
606a good solution is to create two soft links to lem, called
607e.g. lemg and lemu, and specify their configuration in the files lemg.conf
608and lemu.conf, respectively.
609@end quotation
610
611@c ---------------------------------------------------------------------
612@c COMPONENTS
613@c ---------------------------------------------------------------------
614
615@node UTT components
616@chapter UTT components
617
618UTT components are of three types:
619
620@menu
621Sources: programs which read non-UTT data (e.g. raw text) and produce output
622in UTT format
623* tok::         a tokenizer
624
625Filters: programs which read and produce UTT-formatted data
626* lem::         a morphological analyzer
627* gue::         a morphological guesser
[261bf62]628* cor::         a simple spelling corrector
629* kor::         a more elaborate spelling corrector
[25ae32e]630* sen::         a sentencizer
631* ser::         a pattern search tool (marks matches)
[261bf62]632* mar::         a pattern search tool (introduces arbitrary markers into the text)
[25ae32e]633* grp::         a pattern search tool (selects sentences containing a match)
[261bf62]634@c * gph::         a word-graph annotation tool::
635@c * dgp::         a dependency parser
[25ae32e]636
637Sinks: programs which read UTT data and produce output in another format
638* kot::         an untokenizer
639* con::         a concordance table generator
640@end menu
641
642@c ---------------------------------------------------------------------
643@c TOK
644@c ---------------------------------------------------------------------
645
646@page
647@node tok
648@section tok - a tokenizer
649
650@c ----------------------------------------
651
652@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[19760ef]653@item @strong{Authors:}                 @tab Tomasz Obrębski
[25ae32e]654@item @strong{Component category:}      @tab source
[261bf62]655@item @strong{Input format:}            @tab raw text file
656@item @strong{Output format:}           @tab UTT regular
657@item @strong{Required annotation:}     @tab -
[25ae32e]658@end multitable
659
660
661@menu
662* tok description::
663* tok input::
664* tok output::
665* tok command line options::
666* tok example::
667@end menu
668
669@node tok description
670@subsection Description
671
672@code{tok} is a simple program which reads a text file and identifies
673tokens on the basis of their orthographic form.  The type of the token
674is printed as the @var{type} field.
675
676@node tok input
677@subsection Input
678
679Raw text.
680
681@node tok output
682@subsection Output
683
684UTT-file with four fields: @var{start}, @var{length}, @var{type}, and @var{form}. In the @var{type} field five types of tokens are distinguished:
685
686@itemize
687
688@item @code{W}
689(word)
690- continuous sequence of letters
691
692@item @code{N}
693(number)
694- continuous sequence of digits
695
696@item @code{S}
697(space)
698- continuous sequence of space characters
699
700@item @code{P}
701(punctuation mark)
702- single printable characters not belonging to any of the other classes
703
704@item @code{B}
705(unprintable character)
706- single unprintable character
707
708@end itemize
709
710
711
712@node tok command line options
713@subsection Command line options
714
715@table @code
716
717@item @b{@minus{}@minus{}help}, @b{@minus{}h}
718Print help.
719
720@item @b{@minus{}@minus{}version}, @b{@minus{}V}
721Print version information.
722
723@item @b{@minus{}@minus{}interactive, @minus{}i}
724This option toggles interactive mode, which is by default off. In the
725interactive mode the program does not buffer the output.
726
727@end table
728
729@node tok example
730@subsection Example
731
732Input:
733
734@example
735Piszemy dobre programy.
736@end example
737
738Output:
739
740@example
7410000 07 W Piszemy
7420007 01 S _
7430008 05 W dobre
7440013 01 S _
7450014 08 W programy
7460022 01 P .
7470023 01 S \n
748@end example
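The token classes listed in the Output subsection can be approximated in Python as follows. This is a minimal sketch and not the implementation of @command{tok}: the names are hypothetical, the character classes rely on Python's own notion of letters, digits and whitespace, and positions are counted in characters, which coincides with bytes only for 1-byte encodings.

```python
import re

# Letters | digits | whitespace run | any other single character.
TOKEN = re.compile(r"[^\W\d_]+|\d+|\s+|.", re.DOTALL)

# Characters that must be written specially in the form field.
FORM_ESCAPES = {" ": "_", "\n": "\\n", "\t": "\\t", "\r": "\\r",
                "_": "\\_", "*": "\\*", "\\": "\\\\"}


def token_type(text):
    first = text[0]
    if first.isalpha():
        return "W"
    if first.isdigit():
        return "N"
    if first.isspace():
        return "S"
    return "P" if first.isprintable() else "B"


def tokenize(text):
    """Return UTT segment lines (start, length, type, form) for raw text."""
    lines = []
    for m in TOKEN.finditer(text):
        tok = m.group()
        form = "".join(FORM_ESCAPES.get(ch, ch) for ch in tok)
        lines.append("%04d %02d %s %s" % (m.start(), len(tok), token_type(tok), form))
    return lines
```

Called on the input sentence of the example above followed by a newline, the sketch yields the same seven segments.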
749
750
751@c ---------------------------------------------------------------------
752@c SEN
753@c ---------------------------------------------------------------------
754
755@c @node sen - sentencizer
756@c @chapter sen - sentencizer
757
[19760ef]758@c Authors: Tomasz Obrębski
[25ae32e]759
760@c ---------------------------------------------------------------------
761@c LEM
762@c ---------------------------------------------------------------------
763
764@page
765@node lem
766@section lem - morphological analyzer
767
768@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[19760ef]769@item @strong{Authors:}                 @tab Tomasz Obrębski, Michał Stolarski
[25ae32e]770@item @strong{Component category:}      @tab filter
[261bf62]771@item @strong{Input format:}            @tab UTT regular
772@item @strong{Output format:}           @tab UTT regular
773@item @strong{Required annotation:}     @tab tok
[25ae32e]774@end multitable
775
776@menu
777* lem description::             
778* lem command line options::   
779* lem input::
780* lem output::
781* lem example::                 
782* lem dictionaries::           
783* lem hints::           
784@end menu
785
786@node lem description
787@subsection Description
788
789@command{lem} performs morphological analysis of a simple orthographic
790word, returning all its possible morphological annotations,
791disregarding the context.
792
793@c ----------------------------------------
794
795@node lem command line options
796@subsection Command line options
797
798@table @code
799@parhelp
800@parversion
801@parinteractive
802@c @parfile
803@c @paroutput
804@c @parfail
805@c @parcopy
806@parinputfield
807@paroutputfield
808@pardictionary
809@parprocess
810@parselect
811@parunselect
812@paroneline
813@paronefield
814@end table
815
816@c ----------------------------------------
817
818@node lem input
819@subsection Input
820
821@command{lem} reads a UTT file and processes the value of the @var{form} field
822(the input field may be changed with @option{--input-field} option).
823
824@node lem output
825@subsection Output
826
827@command{lem} adds a new annotation field, whose default name is @code{lem}.  In
828case of ambiguity, either the segment is duplicated (default),
829multiple @code{lem} fields are added (@option{--one-line}), or the ambiguous
830annotation is produced as the value of a single @code{lem} field (option
831@option{--one-field,-1}):
832
833@itemize @bullet
834
835@item
836unambiguous value format:
837
838@example
839   <lemma>,<descr>
840@end example
841
842@item
843ambiguous value format (@option{--one-field} option)
844
845
846@example
847   <lemma>,<descr>[,<descr>][;<lemma>,<descr>[,<descr>]]
848@end example
849
850(alternative descriptions for the same lemma are separated by commas,
851alternative lemmata are separated by semicolons.)
852
853@end itemize
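A consumer of @command{lem} output may need to split such a value back into lemmata and descriptions. The Python sketch below shows one way to do this; the function name is hypothetical, and it assumes that neither lemmata nor descriptions contain commas or semicolons.

```python
def parse_lem_value(value):
    """Split a lem field value into (lemma, [descriptions]) pairs."""
    readings = []
    for alternative in value.split(";"):      # alternative lemmata
        lemma, *descriptions = alternative.split(",")
        readings.append((lemma, descriptions))
    return readings
```

An unambiguous value yields a single pair with a single description, so the same function covers both formats.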
854
855@node lem example
856@subsection Example
857
858Input:
859
860@example
8610000 07 W Piszemy
8620007 01 S _
8630008 05 W dobre
8640013 01 S _
8650014 08 W programy
8660022 01 P .
8670023 01 S \n
868@end example
869
870Output (default):
871
872@example
[19760ef]8730000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
[25ae32e]8740007 01 S _
8750008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn
8760008 05 W dobre lem:dobry,ADJ/DpNsCnavGn
8770013 01 S _
8780014 08 W programy lem:program,N/GiNpCa
8790014 08 W programy lem:program,N/GiNpCn
8800014 08 W programy lem:program,N/GiNpCv
8810022 01 P .
8820023 01 S \n
883@end example
884
885Output (@option{--one-line} option):
886
887@example
[19760ef]8880000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
[25ae32e]8890007 01 S _
8900008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn lem:dobry,ADJ/DpNsCnavGn
8910013 01 S _
8920014 08 W programy lem:program,N/GiNpCa lem:program,N/GiNpCn lem:program,N/GiNpCv
8930022 01 P .
8940023 01 S \n
895@end example
896
897Output (@option{--one-field} option):
898
899@example
[19760ef]9000000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
[25ae32e]9010007 01 S _
9020008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn,ADJ/DpNsCnavGn
9030013 01 S _
9040014 08 W programy lem:program,N/GiNpCa,N/GiNpCn,N/GiNpCv
9050022 01 P .
9060023 01 S \n
907@end example
908
909@c ----------------------------------------
910
911@node lem dictionaries
912@subsection Dictionaries
913
914@command{lem} requires a dictionary. The dictionary may be provided in
915one of two formats: in text (source) format or in binary (fsa) format.
916
917@subsubheading Text format
918
919Dictionary entries have the following structure:
920
921@example
922<form>;<lemma>,<descr>[;<lemma>,<descr>]
923@end example
924
925@var{lemma} may be given explicitly or in the cut-add format:
926
927@example
928@code{[<cut1><add1>-]<cut2><add2>}
929@end example
930
931meaning: replace prefix of length @code{<cut1>} with
932string @code{<add1>}, replace suffix of length @code{<cut2>} with string
933@code{<add2>}. For example @code{3t} transforms @samp{kocie} into
[19760ef]934@samp{kot}, @code{3-4ały} transforms @samp{najbielsi} into @samp{biały}.
[25ae32e]935
936Each dictionary entry must be written on one line and must not contain blank characters.
937
938Examples:
939@example
940kot;0,N/GaNsCn
941kota;1,N/GaNsCg;1,N/GaNsCa
942kotu;1,N/GaNsCd
943kotem;2,N/GaNsCi
944kocie;3t,N/GaNsCl;3t,N/GaNsCv
[19760ef]945najbielsi;3-4ały,ADJ/DsNpCnGp
946najbielsze;3-5ały,ADJ/DsNpCnGaifn
[25ae32e]947najlepsi;dobry,ADJ/DsNpCnGp
948najlepsze;dobry,ADJ/DsNpCnGaifn
949@end example
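The cut-add notation used in the entries above can be expanded with a short Python function. This is a sketch for illustration, not code from @command{lem} or @command{compiledic}: the name @code{apply_cut_add} is hypothetical, and a lemma that does not match the cut-add pattern is assumed to be an explicit lemma.

```python
import re

# [<cut1><add1>-]<cut2><add2>
CUT_ADD = re.compile(r"(?:(\d+)([^-\d]*)-)?(\d+)(.*)")


def apply_cut_add(form, lemma):
    """Return the lemma of form, expanding the cut-add notation if used."""
    m = CUT_ADD.fullmatch(lemma)
    if m is None:
        return lemma  # explicit lemma, e.g. 'dobry'
    cut1, add1, cut2, add2 = m.groups()
    if cut1 is not None:
        # Replace a prefix of length cut1 with add1.
        form = add1 + form[int(cut1):]
    # Replace a suffix of length cut2 with add2.
    return form[:len(form) - int(cut2)] + add2
```

For example, @code{apply_cut_add("kocie", "3t")} returns @samp{kot}, and an explicit lemma such as @samp{dobry} is returned unchanged.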
950
951
952The mandatory file name extension for a text dictionary is @code{dic}. For large
953dictionaries it is preferable, however, to compile them into binary
954(fsa) format.
955
956@subsubheading Binary format
957
958The mandatory file name extension for a binary dictionary is @code{bin}. To
959compile a text dictionary into binary format, write:
960
961@example
962compiledic <dictionaryname>.dic
963@end example
964
965@subsubheading Polex/PMDBF dictionary
966
967A large-coverage morphological dictionary for Polish, Polex/PMDBF, is included in
968the distribution as the default dictionary of @command{lem}. It is
969located by default in:
970
[261bf62]971@file{$HOME/.local/share/utt/pl_PL.ISO-8859-2/lem.bin}
972
973in local installation or in
974
975@file{/usr/local/share/utt/pl_PL.ISO-8859-2/lem.bin}
976
977in system installation.
[25ae32e]978
979@node lem hints
980@subsection Hints
981
[261bf62]982@subsubheading Combining data from multiple dictionaries
[25ae32e]983
[261bf62]984@itemize
[25ae32e]985
[261bf62]986@item Apply <dict1>, then apply <dict2> to words which were not annotated.
[25ae32e]987
[261bf62]988@example
989lem -d <dict1> | lem -S lem -d <dict2>
990@end example
[25ae32e]991
[261bf62]992@item Add annotations from two dictionaries <dict1> and <dict2>.
[25ae32e]993
[261bf62]994@example
995lem -c -d <dict1> | lem -S lem -d <dict2>
996@end example
[25ae32e]997
[261bf62]998@end itemize
[25ae32e]999
1000
1001@c ---------------------------------------------------------------------
1002@c GUE
1003@c ---------------------------------------------------------------------
1004
1005@page
1006@node gue
1007@section gue - morphological guesser
1008
1009@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1010
[19760ef]1011@item @strong{Authors:}                 @tab Michał Stolarski, Tomasz Obrębski
[25ae32e]1012@item @strong{Component category:}      @tab filter
1013
1014@end multitable
1015
1016@menu
[261bf62]1017* gue description::   
[25ae32e]1018* gue command line options::   
1019* gue example::                 
1020* gue dictionaries::           
1021@end menu
1022
[261bf62]1023
1024@node gue description
1025@subsection Description
1026
1027@command{gue} guesses morphological descriptions of the form contained
1028in the @var{form} field.
1029
1030
[25ae32e]1031@node gue command line options
1032@subsection Command line options
1033
1034@table @code
1035
1036@parhelp
1037@parversion
1038@parinteractive
1039@c @parfile
1040@c @paroutput
1041@c @parfail
1042@c @parcopy
1043@parinputfield
1044@paroutputfield
1045@pardictionary
1046@parprocess
1047@parselect
1048@parunselect
1049@paroneline
1050@paronefield
1051
1052@item @b{@minus{}@minus{}delta=@var{n}}
1053Stop displaying answers after a drop in weight, that is, when the weight difference between two subsequent results exceeds the delta value (default=`0.2').
1054
1055
1056@item @b{@minus{}@minus{}cut-off=@var{n}}
1057Do not display answers whose weight is lower than the cut-off value (default=`200').
1058
1059
1060@item @b{@minus{}@minus{}guess_count=@var{n}, @minus{}n @var{n}}
Guess up to @var{n} descriptions (default=`0', which means `display all results').
1062
1063
1064
1065@end table
1066
1067@node gue example
1068@subsection Example
1069
@example
command: gue -n 2

input:
0000 07 W smerfny

output:
0000 07 W smerfny gue:,ADJ/CaDpGiNs
0000 07 W smerfny gue:,ADJ/CnvDpGaipNs
@end example
1080                                 
1081
1082@node gue dictionaries
1083@subsection Dictionaries
1084
1085@command{gue} requires a dictionary. For now, the dictionary must be provided in binary (fsa) format.
1086The fsa format is created by compiling text-format dictionaries.
1087
1088
1089
1090@subsubheading Text format
1091
1092Dictionary entries have the following structure:
1093
1094@example
1095@var{prefix}@code{*}@var{suffix}@code{;}@var{lemma}@code{,}@var{description}@code{:}@var{weight}
1096@end example
1097
1098@var{lemma} must be given in the cut-add format:
1099
1100@example
1101@code{[<cut1><add1>-]<cut2><add2>}
1102@end example
(no spaces in between): replace the prefix of length @var{cut1} with
the string @var{add1}, and replace the suffix of length @var{cut2} with
the string @var{add2}.
1106
1107
Example: @code{3-4ały} transforms @i{najbielsi} into @i{biały}.
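The cut-add transformation can be sketched in Python as follows. This is only an illustration, not the @command{gue} implementation: the name @code{apply_cut_add} is hypothetical, and the sketch assumes that the cut lengths are decimal numbers and that @var{add1} contains no hyphen.

```python
import re

# Hypothetical helper illustrating the cut-add notation described above.
# Assumptions: cut lengths are decimal integers, add1 contains no '-'.
CUT_ADD_RE = re.compile(r"(?:(\d+)([^-]*)-)?(\d+)(.*)")


def apply_cut_add(form, cut_add):
    """Turn a word form into its lemma using a cut-add specification."""
    match = CUT_ADD_RE.fullmatch(cut_add)
    if match is None:
        raise ValueError("not a cut-add specification: " + cut_add)
    cut1, add1, cut2, add2 = match.groups()
    if cut1 is not None:
        # Replace the prefix of length cut1 with add1.
        form = add1 + form[int(cut1):]
    # Replace the suffix of length cut2 with add2.
    return form[:len(form) - int(cut2)] + add2


print(apply_cut_add("najbielsi", "3-4ały"))  # biały
```

The optional prefix part is handled first, so a specification without a hyphen, such as @code{1a}, only rewrites the suffix.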
[25ae32e]1109
1110
@var{description} contains the part of speech and morphosyntactic information (@pxref{PMDBF dictionary}).
1112
1113@var{weight} is an integer value between 1 and 999 indicating the
1114likelihood of the guess.
1115
@example
*łkę;1a,N/GfNsCa
naj*elszy;3-4ały,ADJ/...:...
@end example
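A minimal Python sketch of how a text-format entry could be split into its fields is given below. The names @code{GueEntry} and @code{parse_entry} are hypothetical. The sketch assumes that the lemma contains no comma and the description contains no colon, and it treats the weight as optional because the first entry above has none. The weight @code{120} in the second call is made up for illustration.

```python
from typing import NamedTuple, Optional


class GueEntry(NamedTuple):
    prefix: str
    suffix: str
    lemma: str               # cut-add specification
    description: str
    weight: Optional[int]    # None when the entry gives no weight


def parse_entry(line):
    """Split one text-format entry into its fields."""
    pattern, _, rest = line.strip().partition(";")
    prefix, _, suffix = pattern.partition("*")
    lemma, _, rest = rest.partition(",")
    description, _, weight = rest.partition(":")
    return GueEntry(prefix, suffix, lemma, description,
                    int(weight) if weight else None)


print(parse_entry("*łkę;1a,N/GfNsCa"))
print(parse_entry("*łkę;1a,N/GfNsCa:120"))  # weight invented for the example
```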
1120
1121
1122@c ---------------------------------------------------------------------
1123@c COR
1124@c ---------------------------------------------------------------------
1125
1126@page
1127@node cor
1128@section cor - spelling corrector
1129
1130@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski, Michał Stolarski
[25ae32e]1132@item @strong{Component category:}      @tab filter
[261bf62]1133@item @strong{Input format:}            @tab UTT regular
1134@item @strong{Output format:}           @tab UTT regular
1135@item @strong{Required annotation:}     @tab tok
[25ae32e]1136@end multitable
1137
[261bf62]1138@menu
1139* cor description::
1140* cor command line options::   
1141* cor dictionaries::           
1142@end menu
1143
1144
1145@node cor description
1146@subsection Description
1147
The spelling corrector applies Kemal Oflazer's dynamic programming
algorithm @cite{oflazer96} to the FSA representation of the set of
word forms of the Polex/PMDBF dictionary. Given an incorrect
word form, it returns all word forms present in the dictionary whose
edit distance from that form does not exceed the threshold given as a parameter.
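The idea of returning all dictionary forms within a given edit distance can be illustrated with the brute-force Python sketch below. It is not Oflazer's algorithm: it scans a plain word list instead of traversing an automaton, and it uses the classic Levenshtein distance (insertions, deletions, substitutions). Whether @command{cor} also counts transpositions is not stated here, so that is left out. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def corrections(word, dictionary, max_distance=1):
    """Return the dictionary words within max_distance edits of word."""
    return [w for w in dictionary if edit_distance(word, w) <= max_distance]


words = ["odlot", "odlotowy", "odludek"]
print(corrections("odlat", words))  # ['odlot']
```

The automaton-based approach avoids comparing the input with every dictionary word, which is why the real tool works on the compiled dictionary rather than on a list.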
1153
1154
1155@node cor command line options
1156@subsection Command line options
1157
1158@table @code
1159
1160@parhelp
1161@parversion
1162@parinteractive
1163@c @parfile
1164@c @paroutput
1165@c @parfail
1166@c @parcopy
1167@parinputfield
1168@paroutputfield
1169@pardictionary
1170@parprocess
1171@parselect
1172@parunselect
1173@paroneline
1174@paronefield
1175
1176@item @b{@minus{}@minus{}distance=@var{int}, @minus{}n @var{int}}
1177Maximum edit distance (default='1').
1178
[261bf62]1179@c @item @b{@minus{}@minus{}replace, @minus{}r}
1180@c Replace original form with corrected form, place original form in the
1181@c cor field. This option has no effect in @option{--one-*} modes (default=off)
1182
[25ae32e]1183
1184@end table
1185
1186@node cor dictionaries
1187@subsection Dictionaries
1188
1189@command{cor} requires a dictionary. The dictionary has to be provided in binary (fsa) format.
1190The fsa format is created by compiling text-format dictionaries.
1191
1192@subsubheading Text format
1193
1194The @command{cor} dictionary is a list of words:
1195@example
1196odlot
1197odlotowy
1198odludek
1199@end example
1200
[261bf62]1201@subsubheading Binary format
1202
1203The mandatory file name extension for a binary dictionary is @code{bin}. To
1204compile a text dictionary into binary format, write:
1205
1206@example
1207compiledic <dictionaryname>.dic
1208@end example
1209
1210@c ---------------------------------------------------------------------
1211@c KOR
1212@c ---------------------------------------------------------------------
1213
1214@page
1215@node kor
1216@section kor - configurable spelling corrector
1217
1218[TODO]
1219
1220@c ---------------------------------------------------------------------
1221@c SEN
1222@c ---------------------------------------------------------------------
1223
[25ae32e]1224@page
1225@node sen
@section sen - a sentencizer
1227
1228@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1229
@item @strong{Authors:}                 @tab Tomasz Obrębski
[25ae32e]1231@item @strong{Component category:}      @tab filter
[261bf62]1232@item @strong{Input format:}            @tab UTT regular
1233@item @strong{Output format:}           @tab UTT regular
1234@item @strong{Required annotation:}     @tab tok
[25ae32e]1235
1236@end multitable
1237
1238
1239@menu
[261bf62]1240* sen description::
[25ae32e]1241@c * sen input::
1242@c * sen output::
1243* sen example::                 
1244@end menu
1245
[261bf62]1246@node sen description
1247@subsection Description
1248
1249@command{sen} detects sentence boundaries in UTT-formatted texts and marks them with special zero-length segments, in which the @var{type} field may contain the BOS (beginning of sentence) or EOS (end of sentence) annotation.
1250
[25ae32e]1251@node sen example
1252@subsection Example
1253
@example
command: sen

input:
0000 05 W Cześć
0005 01 P !
0006 01 S _
0007 02 W To
0009 01 S _
0010 02 W ja
0012 01 P .
0013 01 S \n

output:
0000 00 BOS *
0000 05 W Cześć
0005 01 P !
0006 00 EOS *
0006 00 BOS *
0006 01 S _
0007 02 W To
0009 01 S _
0010 02 W ja
0012 01 P .
0013 01 S \n
0014 00 EOS *
@end example
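Downstream programs can use the BOS and EOS markers to process the text sentence by sentence. The Python sketch below shows one possible way to do this. It assumes that every input line is non-empty and has at least four whitespace-separated fields (start, length, type, form). The names @code{Segment}, @code{parse_segment} and @code{sentences} are hypothetical.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    start: int
    length: int
    type: str
    form: str


def parse_segment(line):
    """Parse the first four fields of a UTT segment line."""
    start, length, seg_type, form = line.split()[:4]
    return Segment(int(start), int(length), seg_type, form)


def sentences(lines):
    """Group the forms found between BOS and EOS markers."""
    result, current = [], None
    for line in lines:
        seg = parse_segment(line)
        if seg.type == "BOS":
            current = []
        elif seg.type == "EOS":
            if current is not None:
                result.append(current)
            current = None
        elif current is not None:
            current.append(seg.form)
    return result


# The raw string keeps the two-character form \n exactly as it is in the file.
output = r"""0000 00 BOS *
0000 05 W Cześć
0005 01 P !
0006 00 EOS *
0006 00 BOS *
0006 01 S _
0007 02 W To
0009 01 S _
0010 02 W ja
0012 01 P .
0013 01 S \n
0014 00 EOS *""".splitlines()

print(sentences(output))
```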
1281
1282
1283@c ---------------------------------------------------------------------
1284@c GPH
1285@c ---------------------------------------------------------------------
1286
1287@c @node gph - graphizer
1288@c @chapter gph - graphizer
1289
@c Authors: Tomasz Obrębski
[25ae32e]1291
1292
1293
1294@c ---------------------------------------------------------------------
[261bf62]1295@c SER
[25ae32e]1296@c ---------------------------------------------------------------------
1297
1298@page
1299@node ser
1300@section ser - pattern search tool
1301
1302@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski
[25ae32e]1304@item @strong{Component category:}      @tab filter
[261bf62]1305@item @strong{Input format:}            @tab UTT regular
1306@item @strong{Output format:}           @tab UTT regular
1307@item @strong{Required annotation:}     @tab tok,  lem --one-field
[25ae32e]1308@end multitable
1309
1310@menu
[261bf62]1311* ser description::
[25ae32e]1312* ser command line options::   
1313* ser pattern::                 
1314* ser how ser works::           
1315* ser customization::           
1316* ser limitations::             
1317* ser requirements::           
1318@end menu
1319
1320
[261bf62]1321@node ser description
1322@subsection Description
1323
1324@command{ser} looks for patterns in UTT-formatted texts.
1325
1326
[25ae32e]1327@c ---------------------------------------------------------------------
1328@node ser command line options
1329@subsection Command line options
1330
1331@table @code
1332
1333@parhelp
1334@parversion
1335@c @parfile
1336@c @paroutput
1337@c @parinputfield
1338@c @paroutputfield
1339@parprocess
1340@parinteractive
1341
1342@item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}}
1343The search pattern.
1344
1345@item @b{@minus{}@minus{}morph=@var{field}}
1346The name of the annotation field containing the morphological
1347description (default @code{lem}).
1348
1349@item @b{@minus{}@minus{}flex}
1350Only print the generated flex source code.
1351
1352@item @b{@minus{}@minus{}macro=@var{filename}}
Read macro definitions from file @var{filename} rather than from
the default location. This option allows the set of terms to be redefined.
1355
1356@item @b{@minus{}@minus{}define=@var{filename}}
Append macro definitions from file @var{filename}. This option
allows the set of terms to be extended.
1359
1360@end table
1361
1362
1363@c ---------------------------------------------------------------------
1364@node ser pattern
1365@subsection Pattern
1366
1367The @command{ser} pattern is a regular expression over terms corresponding
1368to text segments or segment sequences. Predefined terms are:
1369
1370@table @code
1371
1372@item seg(@var{t},@var{f},@var{a})
1373a segment of type @var{t}, containing form @var{f} and annotation
1374@var{a}
1375
1376@item form(@var{f})
1377a segment containing form @var{f}
1378
1379@item field(@var{f})
1380a segment containing annotation field @var{f}
1381
1382@item space(@var{f})
1383a space segment of form @var{f}
1384
1385@item word(@var{f})
1386a word segment of form @var{f}
1387
1388@item punct(@var{f})
1389a punct segment of form @var{f}
1390
1391@item number(@var{f})
1392a number segment of form @var{f}
1393
1394@item lexeme(@var{f})
1395a word segment with lemma @var{f}
1396
1397@item cat(@var{c})
1398a word segment of category @var{c}
1399
1400@end table
1401
All arguments are optional. If an argument is omitted, an arbitrary
string of non-blank characters is assumed as the argument value. Term
arguments may be arbitrary character-level regular expressions. The
following special symbols can be used:
1406
1407@multitable {aaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1408@item @code{[@dots{}]}            @tab a character class
1409@item @code{[^@dots{}]}           @tab a negated character class
1410@item @code{|}                    @tab alternative
1411@item @code{*}                    @tab repetition, including zero times
1412@item @code{+}                    @tab repetition, at least one time
1413@item @code{?}                    @tab optionality
1414@item @code{@{@var{m},@var{n}@}}  @tab repetition from @var{m} to @var{n} times
1415@item @code{@{@var{m},@}}         @tab repetition @var{m} or more times
1416@item @code{@{@var{m}@}}          @tab repetition @var{m} times
@item @code{\@var{ddd}}           @tab the character with octal value @var{ddd}
1418@item @code{\x@var{hh}}           @tab the character with hexadecimal value @var{hh}
1419@item @code{( )}                  @tab parentheses, used to override precedence
1420@c @end multitable
1421
1422@c @multitable {aaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1423@item @code{.}    @tab a non-blank character
1424@item @code{\w}   @tab a letter
1425@item @code{\W}   @tab a non-blank character other than a letter
1426@item @code{\d}   @tab a digit
1427@item @code{\D}   @tab a non-blank character other than a digit
1428@item @code{\s}   @tab a space or tab character
1429@item @code{\S}   @tab a non-blank character (the same as @code{.})
1430@item @code{\l}   @tab a lowercase letter
1431@item @code{\L}   @tab an uppercase letter
1432@end multitable
1433
1434
1435@noindent The following characters:
1436@example
1437@verb{%  [   ]   ^   |   *   +   ?   {   }   ,   .   <   >   \ %}
1438@end example
1439must be escaped with a backslash, i.e. written as:
1440@example
1441@verb{% \[  \]  \^  \|  \*  \+  \?  \{  \}  \,  \.  \<  \>  \\ %}
1442@end example
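When a pattern argument is built from arbitrary text (for example, a word form taken from user input), the escaping rule above can be applied mechanically. The Python sketch below is one way to do it. The helper name @code{escape_literal} is hypothetical, and the sketch covers only the characters listed above, so parentheses and whitespace are left untouched.

```python
# Characters listed above as requiring a backslash in ser patterns.
SPECIAL = set("[]^|*+?{},.<>\\")


def escape_literal(text):
    """Backslash-escape every special character in a literal string."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in text)


print(escape_literal("np."))   # np\.
print(escape_literal("a+b"))   # a\+b
```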
1443
1444@quotation Note
The special symbols are borrowed from Perl with minor modifications
made for convenience, so the meaning of certain special characters and
sequences differs slightly from their usual meaning.
The meaning of the @code{.} special character is modified because of
the special function of spaces in UTT files (they are field
separators). Use @code{\s} to explicitly match a space or tab character.
1452@end quotation
1453
In the argument of the @code{cat} term, a special operator @code{<...>} may be
used. A category specification enclosed in angle brackets matches all
category descriptions which are consistent (non-contradictory) with the
specification. For example, @code{<N>} matches all noun descriptions, and
@code{<ADJ/Can>} matches all adjectives in the accusative or nominative case.
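One possible reading of ``consistent'' is sketched in Python below. It is an interpretation, not the actual @command{ser} code. It assumes that the part of speech must be identical, that an attribute missing from the description never causes a contradiction, and that each attribute value is a single character as in the PMDBF tag structure (values written as @code{<...>} are ignored). The function names are hypothetical, and the description @code{ADJ/CgDpGiNs} in the last call is made up for illustration.

```python
import re

# An attribute name (uppercase letters) followed by its one-character values.
ATTR_RE = re.compile(r"([A-Z]+)([a-z0-9?!*+-]+)")


def parse_tag(tag):
    """Split 'POS/AttrVals...' into (pos, {attribute: set of values})."""
    pos, _, attrs = tag.partition("/")
    return pos, {name: set(values) for name, values in ATTR_RE.findall(attrs)}


def consistent(spec, description):
    """True if description does not contradict spec (the text inside <...>)."""
    spec_pos, spec_attrs = parse_tag(spec)
    pos, attrs = parse_tag(description)
    if spec_pos != pos:
        return False
    # Every attribute present on both sides must share at least one value.
    return all(name not in attrs or bool(values & attrs[name])
               for name, values in spec_attrs.items())


print(consistent("ADJ/Can", "ADJ/CaDpGiNs"))     # True
print(consistent("ADJ/Can", "ADJ/CnvDpGaipNs"))  # True
print(consistent("ADJ/Can", "ADJ/CgDpGiNs"))     # False
```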
1459
1460
1461@*
1462@noindent @b{Examples of one-segment patterns:}
1463
1464@multitable {aaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1465@item @code{seg}            @tab any segment
1466@item @code{word}           @tab any word-form
1467@item @code{word(pomocy)}   @tab the word-form @samp{pomocy}
1468@item @code{word(naj.+)}    @tab a word-form beginning with @samp{naj}
1469@item @code{word(\L\l+)}    @tab a capitalized word-form
1470@item @code{punct}          @tab a punctuation character
1471@item @code{space(.*\\n.*)} @tab a space segment containing a newline character
@item @code{lexeme(pomoc)}  @tab any form of the lexeme @samp{pomoc}
@item @code{cat(N/.*)}      @tab a word whose category starts with @code{N/}
@item @code{cat(<N/Ca>)}    @tab a word whose category matches @code{N/Ca}
1475@end multitable
1476
1477@*
1478@noindent @b{Examples of multi-segment patterns:}
1479
1480@table @code
1481
1482@item (word(\L) punct(\.) space?)+ word(\L\l+)
1483a sequence of initials followed by a surname
1484
1485@item punct seg(W|S|N)* cat(<NPRO/Sr>) seg(W|S|N)* punct
a text fragment between two punctuation characters, containing an
occurrence of a relative pronoun
1488
1489@end table
1490
1491
1492@node ser how ser works
1493@subsection How ser works
1494
1495@node ser customization
1496@subsection Customization
1497
1498@c All predefined terms correspond to single segments,
1499
1500@example
[261bf62]1501define(`verbseq', `(cat(<V>) (space cat(<V>)))')
[25ae32e]1502@end example
1503
1504
1505the term @code{cat()} may not be used as a ... of
1506
1507@c See @command{m4} manual for further details on macro definition format.
1508
1509@node ser limitations
1510@subsection Limitations
1511
Do not use more than three attributes inside @code{<...>}.
[25ae32e]1513
1514@node ser requirements
1515@subsection Requirements
1516
1517In order to run @command{ser}, the following programs must be
1518installed in the system:
1519
1520@itemize
1521
1522@item @command{m4}
1523@item @command{grep}
1524@item @command{flex}
1525@item @command{gcc}
1526
1527@end itemize
1528
1529
1530@c ---------------------------------------------------------------------
[261bf62]1531@c GRP
[25ae32e]1532@c ---------------------------------------------------------------------
1533
1534@page
1535@node grp
1536@section grp - pattern search tool
1537
1538@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski
[25ae32e]1540@item @strong{Component category:}      @tab filter
[261bf62]1541@item @strong{Input format:}            @tab UTT flattened
1542@item @strong{Output format:}           @tab UTT flattened
1543@item @strong{Required annotation:}     @tab tok, sen, lem --one-field
[25ae32e]1544@end multitable
1545
1546
[261bf62]1547@menu
1548* grp description::
1549* grp command line options::   
1550* grp pattern::                 
1551* grp hints::   
1552@end menu
1553
1554
1555@node grp description
1556@subsection Description
1557
@command{grp} selects sentences containing an expression matching a
pattern. The pattern format is exactly the same as that accepted by
@command{ser}.

@command{grp} is intended mainly for speeding up the corpus search process.
It is extremely fast (processing speed is usually higher than the speed
of reading the corpus file from disk).
1565
1566@node grp command line options
1567@subsection Command line options
1568
1569@table @code
1570
1571@parhelp
1572@parversion
1573@parprocess
1574@parinteractive
1575
1576@item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}}
1577The search pattern.
1578
1579@item @b{@minus{}@minus{}morph=@var{field}}
1580The name of the annotation field containing the morphological
1581description (default @code{lem}).
1582
1583@item @b{@minus{}@minus{}command}
1584Only print the generated flex source code.
1585
1586@item @b{@minus{}@minus{}macro=@var{filename}}
Read macro definitions from file @var{filename} rather than from
the default location. This option allows the set of terms to be redefined.
1589
1590@item @b{@minus{}@minus{}define=@var{filename}}
Append macro definitions from file @var{filename}. This option
allows the set of terms to be extended.
1593
1594@end table
1595
1596
1597@node grp pattern
1598@subsection Pattern
1599
The pattern format is the same as for @command{ser} (@pxref{ser pattern}).
1601
1602@node grp hints
1603@subsection Hints
1604
The corpus search speed may be increased by combining @command{grp} with the
@command{lzop} compression tool (@command{grp} usually processes data faster
than it is read from disk, especially on slow laptop drives).
1608
1609@example
1610cat corpus | tok | sen | lem | grp -a p | lzop -7 > corpus.grp.lzo
1611@end example
1612
1613@example
1614lzop -cd corpus.grp.lzo | grp -a gP -e @var{EXPR} | ser -e @var{EXPR}
1615@end example
1616
1617
[261bf62]1618
[25ae32e]1619@c ---------------------------------------------------------------------
[261bf62]1620@c MAR
[25ae32e]1621@c ---------------------------------------------------------------------
[261bf62]1622
1623@page
1624@node mar
1625@section mar
1626
1627@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Marcin Walas, Tomasz Obrębski
1629@item @strong{Component category:}      @tab filter
1630@end multitable
1631
1632[TODO]
1633
1634@c ---------------------------------------------------------------------
1635@c KOT
[25ae32e]1636@c ---------------------------------------------------------------------
1637
[261bf62]1638
[25ae32e]1639@page
1640@node kot
1641@section kot - untokenizer
1642
[261bf62]1643@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski
1645@item @strong{Component category:}      @tab filter
1646@item @strong{Input format:}            @tab UTT regular
1647@item @strong{Output format:}           @tab text
1648@item @strong{Required annotation:}     @tab tok
1649@end multitable
[25ae32e]1650
1651
1652@menu
[261bf62]1653* kot description::
[25ae32e]1654* kot command line options::   
1655* kot usage examples::   
1656@end menu
1657
[261bf62]1658@node kot description
1659@subsection Description
1660
1661@command{kot} transforms a UTT formatted file back into raw text format.
1662
[25ae32e]1663@node kot command line options
1664@subsection Command line options
1665
1666@table @code
1667
1668@parhelp
1669
1670@c @item @b{@minus{}@minus{}version}, @b{@minus{}v}
1671
1672@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
1673
1674@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
1675
1676@c @item @b{@minus{}@minus{}interactive @minus{}i}
1677
1678@c @item @b{@minus{}@minus{}config=@var{filename}}
1679
1680@item
1681
1682@item @b{@minus{}@minus{}gap-fill=@var{string}, @minus{}g @var{string}}
1683print @var{string} between nonadjacent segments of the input file
1684
1685@item @b{@minus{}@minus{}spaces, @minus{}r}
1686retain the special characters @code{_}, @code{\t},
1687@code{\n}, @code{\r}, @code{\f} unexpanded in the output
1688
1689@end table
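The behaviour of the two options can be pictured with the Python sketch below. It is a simplified model, not the @command{kot} source. It assumes that zero-length segments (such as BOS and EOS markers) carry no text, that the special characters are expanded only in segments of type @code{S}, and that two segments are adjacent when the second one starts where the first one ends. The names @code{untokenize}, @code{gap_fill} and @code{keep_spaces} are hypothetical.

```python
import re

# Special characters used in space segments and the text they stand for.
ESCAPES = {"_": " ", "\\t": "\t", "\\n": "\n", "\\r": "\r", "\\f": "\f"}
ESCAPE_RE = re.compile(r"_|\\[tnrf]")


def untokenize(lines, gap_fill="", keep_spaces=False):
    """Rebuild raw text from UTT segment lines."""
    parts, position = [], None
    for line in lines:
        fields = line.split()
        start, length = int(fields[0]), int(fields[1])
        seg_type, form = fields[2], fields[3]
        if length == 0:
            continue  # zero-length markers carry no text
        if position is not None and start > position:
            parts.append(gap_fill)  # the segments are not adjacent
        if seg_type == "S" and not keep_spaces:
            form = ESCAPE_RE.sub(lambda m: ESCAPES[m.group(0)], form)
        parts.append(form)
        position = start + length
    return "".join(parts)


segments = [r"0000 05 W Cześć", r"0005 01 P !", r"0006 01 S _",
            r"0007 02 W To", r"0009 01 S _", r"0010 02 W ja",
            r"0012 01 P .", r"0013 01 S \n"]
print(untokenize(segments), end="")
```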
1690
1691@node kot usage examples
1692@subsection Usage examples
1693
1694@example
1695cat legia.txt | tok | kot       
1696@end example
1697
1698@example
1699cat legia.txt | tok | lem -1 | kot
1700@end example
1701
[261bf62]1702@c ---------------------------------------------------------------
1703@c CON
1704@c ---------------------------------------------------------------
1705
[25ae32e]1706
1707@page
1708@node con
1709@section con - concordance table generator
1710
1711@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1712@item @strong{Authors:}                 @tab Justyna Walkowska
1713@item @strong{Component category:}      @tab sink
[261bf62]1714@item @strong{Input format:}            @tab UTT regular
1715@item @strong{Output format:}           @tab text
1716@item @strong{Required annotation:}     @tab ser or mar
[25ae32e]1717@end multitable
1718@c
1719
1720@menu
[261bf62]1721* con description::
[25ae32e]1722* con command line options::
1723* con usage example::
1724* con hints::   
1725@end menu
1726
[261bf62]1727
1728@node con description
1729@subsection Description
1730
1731@command{con} generates a concordance table based on a pattern given to @command{ser}.
1732
1733
[25ae32e]1734@node con command line options
1735@subsection Command line options
1736
1737@table @code
1738
1739@parhelp
1740
1741@c @item @b{@minus{}@minus{}help}, @b{@minus{}h}
1742@c @item @b{@minus{}@minus{}version}, @b{@minus{}v}
1743@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
1744@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
1745@c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}} [???]
1746@c @item @b{@minus{}@minus{}copy, @minus{}c} [???]
1747@c @item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}}
1748@c @item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}}
1749@c @item @b{@minus{}@minus{}process=@var{class}, @minus{}p @var{class}}
1750@c @item @b{@minus{}@minus{}interactive @minus{}i}
1751@c @item @b{@minus{}@minus{}config=@var{filename}}
1752@c @item
1753@c @item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}}
1754@c search pattern
1755@c
1756@c @item @b{@minus{}@minus{}flex}
1757@c only print the generated flex source code
1758@c
1759@c @item @b{@minus{}@minus{}macro=@var{filename}}
1760@c read macrodefinitions from file @var{filename} rather than from
1761@c default location. This option allows to redefine the set of terms.
1762@c
1763@c @item @b{@minus{}@minus{}define=@var{filename}}
1764@c append macrodefinitions from file @var{filename}. This option
1765@c allows to extend the set of terms.
1766
1767@item @b{@minus{}@minus{}left @minus{}l}           
1768        Left context info (default='30c'). Example:
1769@example                         
1770                                 -l=5c: left context is 5 characters
1771                                 -l=5w: left context is 5 words
1772                                 -l=5s: left context is 5 non-empty input lines
1773                                 -l='\s*\S+\sr\S+BOS': left context starts with the given regex
1774@end example
1775
1776@item @b{@minus{}@minus{}right @minus{}r}           
1777        Right context info (default='30c').
1778@item @b{@minus{}@minus{}trim @minus{}t}           
1779        Clear incomplete words from output.
1780@item @b{@minus{}@minus{}white @minus{}w}           
        Do not convert whitespace characters into spaces.
1782@item @b{@minus{}@minus{}column @minus{}c}           
1783        Left column minimal width in characters (default = 0).
1784@item @b{@minus{}@minus{}ignore @minus{}i}           
1785        Ignore segment inconsistency in the input.
[261bf62]1786@item @b{@minus{}@minus{}bom}           
[25ae32e]1787        Beginning of selected segment (regex, default='[0-9]+ [0-9]+ BOM .*').
[261bf62]1788@item @b{@minus{}@minus{}eom}           
[25ae32e]1789        End of selected segment (regex, default='[0-9]+ [0-9]+ EOM .*').
1790@item @b{@minus{}@minus{}bod}           
1791        Selected segment beginning display string (default='[').
1792@item @b{@minus{}@minus{}eod}           
1793        Selected segment end display string (default=']').
1794
1795
1796
1797@end table
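The role of the character-based context and of the display strings can be illustrated with the Python sketch below. It works on a plain string and a match position rather than on UTT input, so it only models the default options (30 characters of context, @samp{[} and @samp{]} around the match). The function name @code{kwic_line} and the right-aligned padding of the left context are assumptions made for the example.

```python
def kwic_line(text, start, end, left=30, right=30, bod="[", eod="]"):
    """Format one concordance line with character-based context."""
    before = text[max(0, start - left):start]
    after = text[end:end + right]
    # Pad the left context so that the matches line up in a column.
    return before.rjust(left) + bod + text[start:end] + eod + after


print(kwic_line("To jest mój dom rodzinny", 12, 15, left=5, right=5))
```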
1798
1799@node con usage example
1800@subsection Usage example
1801@example
[261bf62]1802cat file.txt | tok | lem -1 | ser -e 'lexeme(dom)' | con 
[25ae32e]1803@end example
1804
1805
1806@node con hints
1807@subsection Hints
1808
1809@command{con} is a rather slow program. Do not pass large amounts of
1810redundant text through this program. @command{con} works fine in the following
1811sequence:
1812
1813@example
1814... | grp -e EXPR | ser -e EXPR | con
1815@end example
1816
1817
1818@c ---------------------------------------------------------------------
1819@c ---------------------------------------------------------------------
1820
1821@page
1822@node Auxiliary tools
1823@chapter Auxiliary tools
1824
1825@menu
1826* compiledic::         dictionary compiler
1827* fla::                UTT file flattener
1828* unfla::              UTT file unflattener
1829@end menu
1830
1831
1832@page
1833@node compiledic
1834@section compiledic - the dictionary compiler
1835
1836@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1837@item @strong{Authors:}                 @tab Michal Stolarski, Tomasz Obrebski
1838@item @strong{Component category:}      @tab additional tool
1839@end multitable
1840@c
1841
1842@command{compiledic} compiles dictionaries in text format (@code{.dic} extension) into binary
1843(FSA) format (@code{.bin} extension).
1844
1845Automaton representation of a dictionary is built using the AT&T tools:
1846@itemize
1847@item AT&T FSM Library,
1848@item AT&T Lextools.
1849@end itemize
1850
1851In order for the compiledic program to work you have to install the
1852above mentioned packages into your system.  They are freely available
1853for non-commercial use.
1854
1855Usage:
1856@example
1857        compiledic <dictionaryname>.dic
1858@end example
1859
1860The file <dictionaryname>.bin will be generated.
1861
Note: The program produces many temporary files, which are
stored in the current directory. They are deleted after successful
termination of the program.
1865
1866@c @menu
1867@c * con command line options::
1868@c * con usage example::
1869@c * con hints::   
1870@c @end menu
1871
1872
1873@page
1874@node fla
1875@section fla - the UTT file flattener
1876
1877@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski
[25ae32e]1879@item @strong{Component category:}      @tab filter
1880@end multitable
1881@c
1882
@command{fla} ``flattens'' a UTT file by merging the segments belonging
to one sentence into one line. Technically, end-of-line characters
('\n', ASCII code 10) are replaced with form-feed characters ('\f',
ASCII code 12).  The flattening makes it possible to process UTT files
sentence by sentence with such tools as @command{grep} or @command{sed}
(this is used in @command{grp} and @command{mar}).

Flattened files should have the suffix @code{.fla}, e.g.@: @file{thetext.utt.fla}.

Flattened files are still human-readable.
1893
1894Usage:
1895
1896@example
1897        fla [<bosregex>]
1898@end example
1899
The optional argument is a regular expression describing segments
which should be treated as sentence beginnings (the test is: the
segment contains a fragment matching the @code{<bosregex>}). By
default, segments containing a field @code{BOS} are searched for.
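The flattening and its inverse can be sketched in Python as follows. This is a simplified model of the transformation, not the code of @command{fla} or @command{unfla}. The function names are hypothetical, and the default regular expression @code{BOS} is an assumption standing in for the real default test.

```python
import re


def flatten(lines, bos_regex="BOS"):
    """Join the segments of each sentence into one line, separated by '\\f'."""
    bos = re.compile(bos_regex)
    sentences, current = [], []
    for line in lines:
        # A segment matching the regex starts a new sentence.
        if bos.search(line) and current:
            sentences.append("\f".join(current))
            current = []
        current.append(line)
    if current:
        sentences.append("\f".join(current))
    return sentences


def unflatten(flat_lines):
    """Restore one segment per line."""
    return [seg for line in flat_lines for seg in line.split("\f")]


segs = ["0000 00 BOS *", "0000 05 W Cześć", "0005 01 P !", "0006 00 EOS *",
        "0006 00 BOS *", "0006 01 S _", "0007 02 W To"]
flat = flatten(segs)
print(len(flat))                # 2
print(unflatten(flat) == segs)  # True
```

Because each sentence occupies exactly one line, a line-oriented tool such as @command{grep} selects or rejects whole sentences.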
1904@c @menu
1905@c * con command line options::
1906@c * con usage example::
1907@c * con hints::   
1908@c @end menu
1909
1910
1911
1912@page
1913@node unfla
1914@section unfla - the UTT file unflattener
1915
1916@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski
[25ae32e]1918@item @strong{Component category:}      @tab filter
1919@end multitable
1920
1921@command{unfla} transforms a flattened UTT file, produced by
1922@command{fla}, into the regular format by restoring end-of-line
1923characters.
1924
1925
1926
1927
1928@c ---------------------------------------------------------------------
1929@c USAGE EXAMPLES
1930@c ---------------------------------------------------------------------
1931
1932@node Usage examples
1933@chapter Usage examples
1934
1935@subsubheading Simple pipelines
1936
1937@enumerate
1938
@item tokenization

@example
cat text | tok > output1
@end example

@item morphological annotation (1)

simple dictionary-based lemmatization

@example
cat text | tok | lem > output1
@end example

@item morphological annotation (2)

1) perform dictionary-based lemmatization
2) guess descriptions for words which have no annotation

@example
cat text | tok | lem | gue -S lem > output2
@end example
1957
1958@item morphological annotation (3)
1959
1) perform dictionary-based lemmatization
2) try to correct words with no annotation
3) perform dictionary-based lemmatization of corrected words
4) guess descriptions for words which still have no annotation
1964
1965@example
1966cat text | tok | lem | cor -p W -S lem | lem -I cor | gue -p W -S lem
1967@end example
1968@item spelling correction
1969
1970
1971
1972@example
1973cat text | tok | lem --only-fail | cor -1 > output3
1974@end example
1975
1976@item Expression extraction
1977
1978Extraction of all occurrences of a verb followed by a form of the noun 'rozmowa'.
1979
1980@example
1981cat text | tok | lem -1 | ser -e 'cat(<V>) space lexeme(rozmowa)' -m | kot > output4
1982@end example
1983
1984@item A word in context
1985
Extraction of text fragments containing a form of the lexeme 'rozmowa' in
the context of 5 preceding and 5 following corpus segments.
1988
1989@example
1990cat text | tok | lem -1 | ser -e 'seg@{5@} lexeme(rozmowa) seg@{5@}' -m | kot > output
1991@end example
1992
1993@item generation of concordance table (1)
1994
1995@example
1996cat text | tok | lem -1 | ser -e 'cat(<V>) space lexeme(rozmowa)' | con
1997@end example
1998
10"
2000
2001@item generation of concordance table (2)
2002
2003The same as above but much faster
2004
2005@example
2006cat text | tok | lem -1 | \
2007grp -e 'cat(<V>) space lexeme(rozmowa)' | \
2008ser -e 'cat(<V>) space lexeme(rozmowa)' | \
2009con
2010@end example
2011
2"
2013
2014@item generation of concordance table (3)
2015
Usually, one searches the same corpus repeatedly. In
such a case it is advisable to transform the corpus data into the format
required by @command{grp} first, and then use the preprocessed data.

As @command{grp} (@command{grep}) processes data faster than it is
read from the disk drive, the search time may be shortened further by
using file compression techniques.  We suggest using @command{lzop}.
2023
2024@item the fastest way to search a large corpus
2025
2026step 1: preprocessing
2027
2028@example
2029cat corpus | tok | sen | lem -1 \
2030| grp -a p | lzop -7 > corpus.grp.lzo
2031@end example
2032
2033step 2: search
2034
2035@example
lzop -cd corpus.grp.lzo \
| grp -a gP -e 'cat(<V>) space lexeme(rozmowa)' \
| ser -e 'cat(<V>) space lexeme(rozmowa)' | con
2038@end example
2039
2040@end enumerate
2041
2042@subsubheading More complicated configurations
2043
2044
2045@example
2046mknod fifo1 p
2047mknod fifo2 p
2048mknod fifo3 p
2049mknod fifo4 p
2050mknod fifo5 p
2051
2052tok | lem -p W -e fifo1 > fifo2 &
2053cor -e fifo3 < fifo1 | lem > fifo4 &
2054gue < fifo3 > fifo5 &
2055sort -m fifo2 fifo4 fifo5
2056
2057rm fifo?
2058@end example
2059
2060
2061@c ---------------------------------------------------------------------
2062@c ---------------------------------------------------------------------
2063
2064@c ---------------------------------------------------------------------
2065@c PMDBF DICTIONARY
2066@c ---------------------------------------------------------------------
2067
2068@node PMDBF dictionary
2069@chapter PMDBF dictionary
2070
UTT components come with lexical data derived from the Polish
Morphological Database (PMDB).
2073
2074@menu
2075* PMDBF files::   
2076* PMDBF tag structure::                 
2077* PMDBF parts of speech::           
2078* PMDBF morphosyntactic attributes::           
2079@end menu
2080
2081@node PMDBF files
2082@section Files
2083
2084@node PMDBF tag structure
2085@section Tag structure
2086
@example
pos   = [[:upper:]]+
attr  = [[:upper:]]+
val   = [[:lower:][:digit:]?!*+-] | <[^>\n]+>
descr = pos ( / ( attr val + ) + ) ?
@end example
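
The grammar above can be transcribed into an extended regular
expression and used to check whether a string is a well-formed tag
description.  The following is a minimal sketch, assuming a
@command{grep} that supports @option{-E} and @option{-q}; the sample
tags are made up for illustration and are not taken from the
dictionary.

@example
# ERE transcription of the grammar.  The newline exclusion in val
# is dropped because grep works line by line.
val='([[:lower:][:digit:]?!*+-]|<[^>]+>)'
descr="^[[:upper:]]+(/([[:upper:]]+$val+)+)?\$"

report=$(for tag in ADV V/AiVpMdTrP3Ns N/CnNsGf N/ N/C v/Ai; do
    if printf '%s\n' "$tag" | grep -Eq "$descr"; then
        echo "$tag: well-formed"
    else
        echo "$tag: malformed"
    fi
done)
echo "$report"
@end example

The first three sample tags are reported as well-formed.  The last
three are rejected: @samp{N/} has a slash but no attributes,
@samp{N/C} has an attribute without a value, and @samp{v/Ai} starts
with a lower-case letter.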
2094
2095@node PMDBF parts of speech
2096@section Parts of speech
2097
@multitable {ADJPRP} { adjectival-passive-participle }
@item @code{N} @tab noun
@item @code{NPRO} @tab nominal-pronoun
@item @code{NV} @tab deverbal-noun
@item @code{V} @tab verb
@item @code{BYC} @tab the verb @i{by@'c} (``to be'')
@item @code{VNI} @tab non-inflected-verb
@item @code{ADJ} @tab adjective
@item @code{ADJPAP} @tab adjectival-passive-participle
@item @code{ADJPRP} @tab adjectival-present-participle
@item @code{ADJPP} @tab adjectival-past-participle
@item @code{ADJPRO} @tab adjectival-pronoun
@item @code{ADJNUM} @tab adjectival-numeral
@item @code{ADV} @tab adverb
@item @code{ADVANP} @tab adverbial-anterior-participle
@item @code{ADVPRP} @tab adverbial-present-participle
@item @code{ADVPRO} @tab adverbial-pronoun
@item @code{ADVNUM} @tab adverbial-numeral
@item @code{P} @tab preposition
@item @code{PPRO} @tab prep-noun-pronoun
@item @code{CONJ} @tab conjunction
@item @code{EXCL} @tab exclamation
@item @code{APP} @tab call
@item @code{ONO} @tab onomatopoeia
@item @code{PART} @tab particle
@item @code{NUMCRD} @tab cardinal-numeral
@item @code{NUMCOL} @tab collective-numeral
@item @code{NUMPAR} @tab partitive-numeral
@item @code{NUMORD} @tab ordinal-numeral
@end multitable
2128
2129@node PMDBF morphosyntactic attributes
2130@section Morphosyntactic attributes
2131
@multitable {Attr} {Val} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@c @headitem Attr @tab Val @tab Description
@item
@code{A} @tab @tab Aspect
@item
@tab @code{p} @tab perfective
@item
@tab @code{i} @tab imperfective
@item
@item
@code{V} @tab @tab Verb-Form
@item
@tab @code{b} @tab infinitive
@item
@tab @code{p} @tab personal
@item
@tab @code{i} @tab impersonal
@item
@item
@code{M} @tab @tab Mood
@item
@tab @code{d} @tab declarative
@item
@tab @code{c} @tab conditional
@item
@tab @code{i} @tab imperative
@item
@item
@code{T} @tab @tab Tense
@item
@tab @code{a} @tab past
@item
@tab @code{r} @tab present
@item
@tab @code{f} @tab future
@item
@item
@code{P} @tab @tab Person
@item
@tab @code{1} @tab 1
@item
@tab @code{2} @tab 2
@item
@tab @code{3} @tab 3
@item
@item
@code{D} @tab @tab Degree
@item
@tab @code{p} @tab positive
@item
@tab @code{c} @tab comparative
@item
@tab @code{s} @tab superlative
@item
@item
@code{N} @tab @tab Number
@item
@tab @code{s} @tab singular
@item
@tab @code{p} @tab plural
@item
@item
@code{C} @tab @tab Case
@item
@tab @code{n} @tab nominative
@item
@tab @code{g} @tab genitive
@item
@tab @code{d} @tab dative
@item
@tab @code{a} @tab accusative
@item
@tab @code{i} @tab instrumental
@item
@tab @code{l} @tab locative
@item
@tab @code{v} @tab vocative
@item
@item
@code{G} @tab @tab Gender
@item
@tab @code{p} @tab masculine-personal
@item
@tab @code{a} @tab masculine-animal
@item
@tab @code{i} @tab masculine-inanimate
@item
@tab @code{f} @tab feminine
@item
@tab @code{n} @tab neuter
@end multitable
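
With the table above, the attribute part of a tag can be split into
attribute/value pairs and the attribute letters replaced by their
names.  The following sketch assumes single-letter attribute names,
as listed in the table, and a @command{grep} that supports
@option{-o}; the sample tag is made up for illustration.  Values are
left undecoded to keep the example short.

@example
tag='V/AiVpMdTrP3Ns'
# Everything after the slash is the attribute part.
attrs=$(printf '%s\n' "$tag" | cut -d/ -f2)

# Put each attribute and its values on a line, then name the attribute.
decoded=$(printf '%s\n' "$attrs" |
    grep -oE '[[:upper:]][^[:upper:]]+' |
    while read -r pair; do
        attr=$(printf '%s\n' "$pair" | cut -c1)
        vals=$(printf '%s\n' "$pair" | cut -c2-)
        case $attr in
            (A) name=Aspect ;;
            (V) name=Verb-Form ;;
            (M) name=Mood ;;
            (T) name=Tense ;;
            (P) name=Person ;;
            (D) name=Degree ;;
            (N) name=Number ;;
            (C) name=Case ;;
            (G) name=Gender ;;
            (*) name=unknown ;;
        esac
        echo "$name=$vals"
    done)
echo "$decoded"
@end example

For the sample tag the script prints six lines, from
@samp{Aspect=i} to @samp{Number=s}.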
2223
2224
2225@c ---------------------------------------------------------------------
2226@c ---------------------------------------------------------------------
2227@c
2228@c @node Examples
2229@c @chapter Examples
2230
2231@c ----------------------------------------------------------------------
2232@c ----------------------------------------------------------------------
2233
2234@node    GNU Free Documentation License
2235@chapter GNU Free Documentation License
2236
2237@c The GNU Free Documentation License.
2238@center Version 1.2, November 2002
2239
2240@c This file is intended to be included within another document,
2241@c hence no sectioning command or @node.
2242
2243@display
2244Copyright @copyright{} 2000,2001,2002 Free Software Foundation, Inc.
224551 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA
2246
2247Everyone is permitted to copy and distribute verbatim copies
2248of this license document, but changing it is not allowed.
2249@end display
2250
2251@enumerate 0
2252@item
2253PREAMBLE
2254
2255The purpose of this License is to make a manual, textbook, or other
2256functional and useful document @dfn{free} in the sense of freedom: to
2257assure everyone the effective freedom to copy and redistribute it,
2258with or without modifying it, either commercially or noncommercially.
2259Secondarily, this License preserves for the author and publisher a way
2260to get credit for their work, while not being considered responsible
2261for modifications made by others.
2262
2263This License is a kind of ``copyleft'', which means that derivative
2264works of the document must themselves be free in the same sense.  It
2265complements the GNU General Public License, which is a copyleft
2266license designed for free software.
2267
2268We have designed this License in order to use it for manuals for free
2269software, because free software needs free documentation: a free
2270program should come with manuals providing the same freedoms that the
2271software does.  But this License is not limited to software manuals;
2272it can be used for any textual work, regardless of subject matter or
2273whether it is published as a printed book.  We recommend this License
2274principally for works whose purpose is instruction or reference.
2275
2276@item
2277APPLICABILITY AND DEFINITIONS
2278
2279This License applies to any manual or other work, in any medium, that
2280contains a notice placed by the copyright holder saying it can be
2281distributed under the terms of this License.  Such a notice grants a
2282world-wide, royalty-free license, unlimited in duration, to use that
2283work under the conditions stated herein.  The ``Document'', below,
2284refers to any such manual or work.  Any member of the public is a
2285licensee, and is addressed as ``you''.  You accept the license if you
2286copy, modify or distribute the work in a way requiring permission
2287under copyright law.
2288
2289A ``Modified Version'' of the Document means any work containing the
2290Document or a portion of it, either copied verbatim, or with
2291modifications and/or translated into another language.
2292
2293A ``Secondary Section'' is a named appendix or a front-matter section
2294of the Document that deals exclusively with the relationship of the
2295publishers or authors of the Document to the Document's overall
2296subject (or to related matters) and contains nothing that could fall
2297directly within that overall subject.  (Thus, if the Document is in
2298part a textbook of mathematics, a Secondary Section may not explain
2299any mathematics.)  The relationship could be a matter of historical
2300connection with the subject or with related matters, or of legal,
2301commercial, philosophical, ethical or political position regarding
2302them.
2303
2304The ``Invariant Sections'' are certain Secondary Sections whose titles
2305are designated, as being those of Invariant Sections, in the notice
2306that says that the Document is released under this License.  If a
2307section does not fit the above definition of Secondary then it is not
2308allowed to be designated as Invariant.  The Document may contain zero
2309Invariant Sections.  If the Document does not identify any Invariant
2310Sections then there are none.
2311
2312The ``Cover Texts'' are certain short passages of text that are listed,
2313as Front-Cover Texts or Back-Cover Texts, in the notice that says that
2314the Document is released under this License.  A Front-Cover Text may
2315be at most 5 words, and a Back-Cover Text may be at most 25 words.
2316
2317A ``Transparent'' copy of the Document means a machine-readable copy,
2318represented in a format whose specification is available to the
2319general public, that is suitable for revising the document
2320straightforwardly with generic text editors or (for images composed of
2321pixels) generic paint programs or (for drawings) some widely available
2322drawing editor, and that is suitable for input to text formatters or
2323for automatic translation to a variety of formats suitable for input
2324to text formatters.  A copy made in an otherwise Transparent file
2325format whose markup, or absence of markup, has been arranged to thwart
2326or discourage subsequent modification by readers is not Transparent.
2327An image format is not Transparent if used for any substantial amount
2328of text.  A copy that is not ``Transparent'' is called ``Opaque''.
2329
2330Examples of suitable formats for Transparent copies include plain
2331@sc{ascii} without markup, Texinfo input format, La@TeX{} input
2332format, @acronym{SGML} or @acronym{XML} using a publicly available
2333@acronym{DTD}, and standard-conforming simple @acronym{HTML},
2334PostScript or @acronym{PDF} designed for human modification.  Examples
2335of transparent image formats include @acronym{PNG}, @acronym{XCF} and
2336@acronym{JPG}.  Opaque formats include proprietary formats that can be
2337read and edited only by proprietary word processors, @acronym{SGML} or
2338@acronym{XML} for which the @acronym{DTD} and/or processing tools are
2339not generally available, and the machine-generated @acronym{HTML},
2340PostScript or @acronym{PDF} produced by some word processors for
2341output purposes only.
2342
2343The ``Title Page'' means, for a printed book, the title page itself,
2344plus such following pages as are needed to hold, legibly, the material
2345this License requires to appear in the title page.  For works in
2346formats which do not have any title page as such, ``Title Page'' means
2347the text near the most prominent appearance of the work's title,
2348preceding the beginning of the body of the text.
2349
2350A section ``Entitled XYZ'' means a named subunit of the Document whose
2351title either is precisely XYZ or contains XYZ in parentheses following
2352text that translates XYZ in another language.  (Here XYZ stands for a
2353specific section name mentioned below, such as ``Acknowledgements'',
2354``Dedications'', ``Endorsements'', or ``History''.)  To ``Preserve the Title''
2355of such a section when you modify the Document means that it remains a
2356section ``Entitled XYZ'' according to this definition.
2357
2358The Document may include Warranty Disclaimers next to the notice which
2359states that this License applies to the Document.  These Warranty
2360Disclaimers are considered to be included by reference in this
2361License, but only as regards disclaiming warranties: any other
2362implication that these Warranty Disclaimers may have is void and has
2363no effect on the meaning of this License.
2364
2365@item
2366VERBATIM COPYING
2367
2368You may copy and distribute the Document in any medium, either
2369commercially or noncommercially, provided that this License, the
2370copyright notices, and the license notice saying this License applies
2371to the Document are reproduced in all copies, and that you add no other
2372conditions whatsoever to those of this License.  You may not use
2373technical measures to obstruct or control the reading or further
2374copying of the copies you make or distribute.  However, you may accept
2375compensation in exchange for copies.  If you distribute a large enough
2376number of copies you must also follow the conditions in section 3.
2377
2378You may also lend copies, under the same conditions stated above, and
2379you may publicly display copies.
2380
2381@item
2382COPYING IN QUANTITY
2383
2384If you publish printed copies (or copies in media that commonly have
2385printed covers) of the Document, numbering more than 100, and the
2386Document's license notice requires Cover Texts, you must enclose the
2387copies in covers that carry, clearly and legibly, all these Cover
2388Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on
2389the back cover.  Both covers must also clearly and legibly identify
2390you as the publisher of these copies.  The front cover must present
2391the full title with all words of the title equally prominent and
2392visible.  You may add other material on the covers in addition.
2393Copying with changes limited to the covers, as long as they preserve
2394the title of the Document and satisfy these conditions, can be treated
2395as verbatim copying in other respects.
2396
2397If the required texts for either cover are too voluminous to fit
2398legibly, you should put the first ones listed (as many as fit
2399reasonably) on the actual cover, and continue the rest onto adjacent
2400pages.
2401
2402If you publish or distribute Opaque copies of the Document numbering
2403more than 100, you must either include a machine-readable Transparent
2404copy along with each Opaque copy, or state in or with each Opaque copy
2405a computer-network location from which the general network-using
2406public has access to download using public-standard network protocols
2407a complete Transparent copy of the Document, free of added material.
2408If you use the latter option, you must take reasonably prudent steps,
2409when you begin distribution of Opaque copies in quantity, to ensure
2410that this Transparent copy will remain thus accessible at the stated
2411location until at least one year after the last time you distribute an
2412Opaque copy (directly or through your agents or retailers) of that
2413edition to the public.
2414
2415It is requested, but not required, that you contact the authors of the
2416Document well before redistributing any large number of copies, to give
2417them a chance to provide you with an updated version of the Document.
2418
2419@item
2420MODIFICATIONS
2421
2422You may copy and distribute a Modified Version of the Document under
2423the conditions of sections 2 and 3 above, provided that you release
2424the Modified Version under precisely this License, with the Modified
2425Version filling the role of the Document, thus licensing distribution
2426and modification of the Modified Version to whoever possesses a copy
2427of it.  In addition, you must do these things in the Modified Version:
2428
2429@enumerate A
2430@item
2431Use in the Title Page (and on the covers, if any) a title distinct
2432from that of the Document, and from those of previous versions
2433(which should, if there were any, be listed in the History section
2434of the Document).  You may use the same title as a previous version
2435if the original publisher of that version gives permission.
2436
2437@item
2438List on the Title Page, as authors, one or more persons or entities
2439responsible for authorship of the modifications in the Modified
2440Version, together with at least five of the principal authors of the
2441Document (all of its principal authors, if it has fewer than five),
2442unless they release you from this requirement.
2443
2444@item
2445State on the Title page the name of the publisher of the
2446Modified Version, as the publisher.
2447
2448@item
2449Preserve all the copyright notices of the Document.
2450
2451@item
2452Add an appropriate copyright notice for your modifications
2453adjacent to the other copyright notices.
2454
2455@item
2456Include, immediately after the copyright notices, a license notice
2457giving the public permission to use the Modified Version under the
2458terms of this License, in the form shown in the Addendum below.
2459
2460@item
2461Preserve in that license notice the full lists of Invariant Sections
2462and required Cover Texts given in the Document's license notice.
2463
2464@item
2465Include an unaltered copy of this License.
2466
2467@item
2468Preserve the section Entitled ``History'', Preserve its Title, and add
2469to it an item stating at least the title, year, new authors, and
2470publisher of the Modified Version as given on the Title Page.  If
2471there is no section Entitled ``History'' in the Document, create one
2472stating the title, year, authors, and publisher of the Document as
2473given on its Title Page, then add an item describing the Modified
2474Version as stated in the previous sentence.
2475
2476@item
2477Preserve the network location, if any, given in the Document for
2478public access to a Transparent copy of the Document, and likewise
2479the network locations given in the Document for previous versions
2480it was based on.  These may be placed in the ``History'' section.
2481You may omit a network location for a work that was published at
2482least four years before the Document itself, or if the original
2483publisher of the version it refers to gives permission.
2484
2485@item
2486For any section Entitled ``Acknowledgements'' or ``Dedications'', Preserve
2487the Title of the section, and preserve in the section all the
2488substance and tone of each of the contributor acknowledgements and/or
2489dedications given therein.
2490
2491@item
2492Preserve all the Invariant Sections of the Document,
2493unaltered in their text and in their titles.  Section numbers
2494or the equivalent are not considered part of the section titles.
2495
2496@item
2497Delete any section Entitled ``Endorsements''.  Such a section
2498may not be included in the Modified Version.
2499
2500@item
2501Do not retitle any existing section to be Entitled ``Endorsements'' or
2502to conflict in title with any Invariant Section.
2503
2504@item
2505Preserve any Warranty Disclaimers.
2506@end enumerate
2507
2508If the Modified Version includes new front-matter sections or
2509appendices that qualify as Secondary Sections and contain no material
2510copied from the Document, you may at your option designate some or all
2511of these sections as invariant.  To do this, add their titles to the
2512list of Invariant Sections in the Modified Version's license notice.
2513These titles must be distinct from any other section titles.
2514
2515You may add a section Entitled ``Endorsements'', provided it contains
2516nothing but endorsements of your Modified Version by various
2517parties---for example, statements of peer review or that the text has
2518been approved by an organization as the authoritative definition of a
2519standard.
2520
2521You may add a passage of up to five words as a Front-Cover Text, and a
2522passage of up to 25 words as a Back-Cover Text, to the end of the list
2523of Cover Texts in the Modified Version.  Only one passage of
2524Front-Cover Text and one of Back-Cover Text may be added by (or
2525through arrangements made by) any one entity.  If the Document already
2526includes a cover text for the same cover, previously added by you or
2527by arrangement made by the same entity you are acting on behalf of,
2528you may not add another; but you may replace the old one, on explicit
2529permission from the previous publisher that added the old one.
2530
2531The author(s) and publisher(s) of the Document do not by this License
2532give permission to use their names for publicity for or to assert or
2533imply endorsement of any Modified Version.
2534
2535@item
2536COMBINING DOCUMENTS
2537
2538You may combine the Document with other documents released under this
2539License, under the terms defined in section 4 above for modified
2540versions, provided that you include in the combination all of the
2541Invariant Sections of all of the original documents, unmodified, and
2542list them all as Invariant Sections of your combined work in its
2543license notice, and that you preserve all their Warranty Disclaimers.
2544
2545The combined work need only contain one copy of this License, and
2546multiple identical Invariant Sections may be replaced with a single
2547copy.  If there are multiple Invariant Sections with the same name but
2548different contents, make the title of each such section unique by
2549adding at the end of it, in parentheses, the name of the original
2550author or publisher of that section if known, or else a unique number.
2551Make the same adjustment to the section titles in the list of
2552Invariant Sections in the license notice of the combined work.
2553
2554In the combination, you must combine any sections Entitled ``History''
2555in the various original documents, forming one section Entitled
2556``History''; likewise combine any sections Entitled ``Acknowledgements'',
2557and any sections Entitled ``Dedications''.  You must delete all
2558sections Entitled ``Endorsements.''
2559
2560@item
2561COLLECTIONS OF DOCUMENTS
2562
2563You may make a collection consisting of the Document and other documents
2564released under this License, and replace the individual copies of this
2565License in the various documents with a single copy that is included in
2566the collection, provided that you follow the rules of this License for
2567verbatim copying of each of the documents in all other respects.
2568
2569You may extract a single document from such a collection, and distribute
2570it individually under this License, provided you insert a copy of this
2571License into the extracted document, and follow this License in all
2572other respects regarding verbatim copying of that document.
2573
2574@item
2575AGGREGATION WITH INDEPENDENT WORKS
2576
2577A compilation of the Document or its derivatives with other separate
2578and independent documents or works, in or on a volume of a storage or
2579distribution medium, is called an ``aggregate'' if the copyright
2580resulting from the compilation is not used to limit the legal rights
2581of the compilation's users beyond what the individual works permit.
2582When the Document is included in an aggregate, this License does not
2583apply to the other works in the aggregate which are not themselves
2584derivative works of the Document.
2585
2586If the Cover Text requirement of section 3 is applicable to these
2587copies of the Document, then if the Document is less than one half of
2588the entire aggregate, the Document's Cover Texts may be placed on
2589covers that bracket the Document within the aggregate, or the
2590electronic equivalent of covers if the Document is in electronic form.
2591Otherwise they must appear on printed covers that bracket the whole
2592aggregate.
2593
2594@item
2595TRANSLATION
2596
2597Translation is considered a kind of modification, so you may
2598distribute translations of the Document under the terms of section 4.
2599Replacing Invariant Sections with translations requires special
2600permission from their copyright holders, but you may include
2601translations of some or all Invariant Sections in addition to the
2602original versions of these Invariant Sections.  You may include a
2603translation of this License, and all the license notices in the
2604Document, and any Warranty Disclaimers, provided that you also include
2605the original English version of this License and the original versions
2606of those notices and disclaimers.  In case of a disagreement between
2607the translation and the original version of this License or a notice
2608or disclaimer, the original version will prevail.
2609
2610If a section in the Document is Entitled ``Acknowledgements'',
2611``Dedications'', or ``History'', the requirement (section 4) to Preserve
2612its Title (section 1) will typically require changing the actual
2613title.
2614
2615@item
2616TERMINATION
2617
2618You may not copy, modify, sublicense, or distribute the Document except
2619as expressly provided for under this License.  Any other attempt to
2620copy, modify, sublicense or distribute the Document is void, and will
2621automatically terminate your rights under this License.  However,
2622parties who have received copies, or rights, from you under this
2623License will not have their licenses terminated so long as such
2624parties remain in full compliance.
2625
2626@item
2627FUTURE REVISIONS OF THIS LICENSE
2628
2629The Free Software Foundation may publish new, revised versions
2630of the GNU Free Documentation License from time to time.  Such new
2631versions will be similar in spirit to the present version, but may
2632differ in detail to address new problems or concerns.  See
2633@uref{http://www.gnu.org/copyleft/}.
2634
2635Each version of the License is given a distinguishing version number.
2636If the Document specifies that a particular numbered version of this
2637License ``or any later version'' applies to it, you have the option of
2638following the terms and conditions either of that specified version or
2639of any later version that has been published (not as a draft) by the
2640Free Software Foundation.  If the Document does not specify a version
2641number of this License, you may choose any version ever published (not
2642as a draft) by the Free Software Foundation.
2643@end enumerate
2644
2645@page
2646@heading ADDENDUM: How to use this License for your documents
2647
2648To use this License in a document you have written, include a copy of
2649the License in the document and put the following copyright and
2650license notices just after the title page:
2651
2652@smallexample
2653@group
2654  Copyright (C)  @var{year}  @var{your name}.
2655  Permission is granted to copy, distribute and/or modify this document
2656  under the terms of the GNU Free Documentation License, Version 1.2
2657  or any later version published by the Free Software Foundation;
2658  with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
2659  Texts.  A copy of the license is included in the section entitled ``GNU
2660  Free Documentation License''.
2661@end group
2662@end smallexample
2663
2664If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts,
2665replace the ``with@dots{}Texts.'' line with this:
2666
2667@smallexample
2668@group
2669    with the Invariant Sections being @var{list their titles}, with
2670    the Front-Cover Texts being @var{list}, and with the Back-Cover Texts
2671    being @var{list}.
2672@end group
2673@end smallexample
2674
2675If you have Invariant Sections without Cover Texts, or some other
2676combination of the three, merge those two alternatives to suit the
2677situation.
2678
2679If your document contains nontrivial examples of program code, we
2680recommend releasing these examples in parallel under your choice of
2681free software license, such as the GNU General Public License,
2682to permit their use in free software.
2683
2684@c Local Variables:
2685@c ispell-local-pdict: "ispell-dict"
2686@c End:
2687
2688
2689@c ---------------------------------------------------------------------
2690@c ---------------------------------------------------------------------
2691
2692@node    Reporting bugs
2693@chapter Reporting bugs
2694
2695Report bugs to <obrebski@@amu.edu.pl>.
2696
2697@c ---------------------------------------------------------------------
2698@c ---------------------------------------------------------------------
2699
2700@c @node    Copyright
2701@c @chapter Copyright
2702@c
2703@c Copyright 2004 by Tomasz Obrebski
2704@c This software is free for research and educational use.
2705
2706@c ---------------------------------------------------------------------
2707@c ---------------------------------------------------------------------
2708
2709@node    Author
2710@chapter Author
2711
2712
2713@bye