source: app/doc/utt.texinfo @ 04ae414

Last change on this file since 04ae414 was 04ae414, checked in by obrebski <obrebski@…>, 16 years ago

Removed Polish characters from one place in utt.texinfo (it did not
compile). THE PROBLEM WITH POLISH CHARACTERS IN TEXINFO STILL REMAINS!

git-svn-id: svn://atos.wmid.amu.edu.pl/utt@58 e293616e-ec6a-49c2-aa92-f4a8b91c5d16

1\input texinfo   @c -*-texinfo-*-
2@documentencoding ISO-8859-2
3@c @documentlanguage pl
4
5@c %**start of header
6@setfilename utt.info
7@settitle UAM Text Tools v0.90
8@c %**end of header
9
10@copying
11This manual is for UAM Text Tools (version 0.90, November, 2007)
12
13Copyright @copyright{}  2005, 2007  Tomasz Obrębski, Michał Stolarski, Justyna Walkowska, Paweł Konieczka.
14
15Permission is granted to copy, distribute and/or modify this document
16under the terms of the GNU Free Documentation License, Version 1.2
17or any later version published by the Free Software Foundation;
18with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
19Texts.  A copy of the license is included in the section entitled ``GNU Free Documentation License''.
20
21@c @quotation
22@c Permission is granted to ...
23@c No permission is granted until the document is completed.
24@c @end quotation
25@end copying
26
27
28@titlepage
29@title UAM Text Tools 0.90 - User Manual
30@subtitle edition 0.01, @today
31@subtitle status: prescript
32@author by Justyna Walkowska, Tomasz Obr@,{}ebski and Micha@l{} Stolarski
33@page
34@vskip 0pt plus 1filll
35@insertcopying
36@end titlepage
37
38@contents
39
40@c @paragraphindent none
41
42@iftex
43@parskip = 0.5@normalbaselineskip plus 3pt minus 1pt
44@end iftex
45
46@c @headings off
47@c @everyheading LEM(1) @| @| LEM(1)
48@everyfooting @today @c @| @thispage @|
49
50@ifnottex
51
52@node Top
53@top UTT - UAM Text Tools
54
55@insertcopying
56
57@menu
58* General information::                       
59* UTT file format::             
60* Configuration files::         
61* UTT components::
62* Auxiliary tools::
63* Usage examples::             
64* PMDBF dictionary::           
65@c * Examples::                   
66@c * Copyright::
67* GNU Free Documentation License::
68* Reporting bugs::                                   
69* Author::                     
70@end menu
71@end ifnottex
72
73
74@c ----------------------------------------------------------------------
75
76@node General information
77@chapter General information
78
79UAM Text Tools (UTT) is a package of language processing tools
80developed at Adam Mickiewicz University. Its functionality includes:
81
82@itemize @bullet
83
84@item
85tokenization
86@item
87dictionary-based morphological analysis
88@item
89heuristic morphological analysis of unknown words
90@item
91spelling correction
92@item
93pattern search
94@item
95sentence splitting
96@item
97generation of concordance tables
98@end itemize
99
100The toolkit is intended for processing raw (unannotated)
101unrestricted text for any conceivable purpose.
102
103The system is organized as a collection of command-line programs, each
104performing one operation, e.g. tokenization, lemmatization, spelling
105correction. The components are independent of one another; the
106unifying element is the uniform i/o file format.
107
108The components may be combined in various ways to provide various text
109processing services. New components supplied by the user may also be
110easily incorporated into the system provided that they respect the i/o
111file format conventions.
112
113UTT component programs do not depend on any specific tagset or
114morphological description format.
115
116UTT is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by
117the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
118
119The Polex/PMDBF dictionary is licensed under the Creative Commons by-nc-sa License which prohibits commercial use. 
120
121
122List of contributors:
123
124@itemize
125@item Pawel Konieczka
126@item Tomasz Obrebski
127@item Michal Stolarski
128@item Marcin Walas
129@item Justyna Walkowska
130@item Pawel Werenski
131@end itemize
132
133@c ----------------------------------------------------------------------
134@c ---------------------------------------------------------------------
135
136@node    UTT file format
137@chapter UTT file format
138
139A UTT file contains annotation of a text. It consists of a sequence of
140segments. Each segment explicitly refers to a continuous piece of the
141text and provides some information on it.
142
143@section Segment format
144
145A segment occupies one line of a UTT file and consists of
146space-separated fields:
147
148
149@quotation
150@sp 1
151[@var{start} [@var{length}]] @var{type} @var{form} [@var{annotation1} [@var{annotation2} ...]]
152@sp 1
153@end quotation
154
155@table @var
156
157@item @var{start}
158Non-negative integer value indicating the position in the source text where the
159segment starts.
160
161@item @var{length}
162Non-negative integer value indicating the length of the segment.
163
164@item @var{type}
165A sequence of non-blank characters that does not consist solely of digits (otherwise @var{type} could be misinterpreted as a @var{start} or @var{length} field).
166@var{type} reflects the main classification of segments -
167into words, numbers, punctuation marks, meta-text markers.
168@xref{tok output,,tok output}, for description of automatically recognized type markers.
169
170@item @var{form}
171This field contains the textual form of the segment or the special
172symbol @code{*} indicating that the form is not given (e.g. when the segment has been created artificially to mark something and is of length 0).
173
174The characters or character sequences that have special meaning in the
175@var{form} field are enumerated below.
176
177Characters with special meaning:
178
179@itemize
180@item @code{_} - space character
181@item @code{*} - undefined contents
182@end itemize
183
184Escape sequences:
185
186@itemize
187@item @code{\n} - new line
188@item @code{\t} - tabulation
189@item @code{\r} - carriage return 
190
191@item @code{\_} - the @code{_} character
192@item @code{\*} - the @code{*} character
193@item @code{\\} - the @code{\} character
194
195@c @item @code{\hh} - a character with hexadecimal code @code{hh} (used for non-printable characters)
196@end itemize
197
198@item @var{annotation1}
199@item @var{annotation2}
200@item ...
201Annotation fields have the following format:
202
203@var{longname} @code{:} @var{value}
204
205or
206
207@var{shortname} @var{value}
208
209where @var{longname} is a string of alphanumeric characters
210(isalnum() test), @var{shortname} - a single non-alphanumeric character
211(ispunct() test), and @var{value} is an arbitrary string of non-blank characters.
212
213@end table
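The special characters and escape sequences of the @var{form} field can be decoded mechanically. The following is a minimal Python sketch, not part of UTT: the function name @code{decode_form} is hypothetical, and keeping unknown escape sequences verbatim is an assumption, since the manual does not say how they are treated.

```python
import re

# Escape sequences listed for the form field (backslash + character).
_ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "_": "_", "*": "*", "\\": "\\"}


def decode_form(form):
    """Return the text denoted by a form field, or None for the bare '*' symbol."""
    if form == "*":
        return None  # '*' alone means "undefined contents"

    def replace(match):
        escaped = match.group(1)
        if escaped is not None:
            # Assumption: an unknown escape sequence is kept unchanged.
            return _ESCAPES.get(escaped, match.group(0))
        return " "  # an unescaped '_' stands for a space character

    return re.sub(r"\\(.)|_", replace, form, flags=re.DOTALL)
```

For instance, @code{decode_form("_")} yields a single space, while @code{decode_form("\\_")} yields a literal underscore.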
214
215
216Only two fields are mandatory: @var{type} and @var{form}. All other fields
217may be absent. In the case when only one number precedes the
218@var{type} field, it is interpreted as the @var{start} position.
219
220If the @var{length} field is omitted, the length of the segment is the
221length of the @var{form} field, except when the value of the
222@var{form} field is @code{*} -- in this case, the length is assumed to
223be 0.
224
225If the @var{start} field is also absent, the segment is assumed to directly
226follow the preceding one.
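The defaulting rules above can be illustrated with a small Python sketch that splits one segment line into its fields. It is not the parser used by the UTT programs. The names @code{Segment} and @code{parse_segment} are hypothetical, and counting each escape sequence in @var{form} as one character when the length is derived is an assumption.

```python
import re
from typing import NamedTuple, Tuple


class Segment(NamedTuple):
    start: int
    length: int
    type: str
    form: str
    annotations: Tuple[str, ...]


def parse_segment(line, prev_end=0):
    """Parse one UTT segment line; prev_end is where the preceding segment ended."""
    fields = line.split()
    numbers = []
    # At most two leading all-digit fields: start, then length.
    while len(numbers) < 2 and fields and re.fullmatch(r"[0-9]+", fields[0]):
        numbers.append(int(fields.pop(0)))
    if len(fields) < 2:
        raise ValueError("type and form fields are mandatory: %r" % line)
    seg_type, form, *annotations = fields

    # A single number is the start; no number means "follows the preceding one".
    start = numbers[0] if numbers else prev_end
    if len(numbers) == 2:
        length = numbers[1]
    elif form == "*":
        length = 0
    else:
        # Assumption: an escape sequence such as \n counts as one character.
        length = len(re.sub(r"\\.", "x", form))
    return Segment(start, length, seg_type, form, tuple(annotations))
```

A caller reading a whole file would pass @code{start + length} of each parsed segment as @code{prev_end} for the next line.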
227
228@c Conventions:
229
230@c Annotation fields with predefined meaning:
231
232@c @itemize
233@c @item @code{!} - UTT components are allowed to modify the contents of
234@c the @var{form} field (e.g. spelling correction does this). If this happens the
235@c original form of the segment have to be placed in the @code{!}-field.
236@c @item @code{@@} - morphological description
237@c @item @code{=} - node identifier assignment (used in graph encoding)
238@c @item @code{<} - preceding/dominating node(s) (used in graph encoding)
239@c @item @code{>} - succeeding/subordinate node(s) (used in graph encoding)
240@c @end itemize
241
242Segments of length 0 may be used to mark file positions with some
243information. See e.g. BOS and EOS (beginning/end of sentence) markers
244in the example below.
245
246Example:
247
248sentence: @samp{Piszemy dobre progrumy.}
249
250@example
2510000 00 BOS *
2520000 07 W Piszemy lem:pisać,V
2530007 01 S _
2540008 05 W dobre lem:dobry,ADJ
2550013 01 S _
2560014 08 W progrumy cor:programy lem:program,N
2570022 01 P .
2580023 00 EOS *
2590023 01 S _
2600024 00 BOS *
2610024 11 W Warszawiacy lem:Warszawiak,N
2620035 01 S _
2630036 03 W też
2640039 01 P .
2650040 00 EOS *
266
267@end example
268
269@example
2700000 BOS *
2710000 W Piszemy lem:pisać,V
2720007 S _
2730008 W dobre lem:dobry,ADJ
2740013 S _
2750014 W progrumy cor:programy lem:program,N
2760022 P .
2770023 EOS *
278@end example
279
280Position information may be provided only for some types of segments:
281
282@example
2830000 BOS *
284W Piszemy lem:pisać,V
285S _
286W dobre lem:dobry,ADJ
287S _
288W progrumy cor:programy lem:program,N
289P .
290EOS *
291S _
2920024 BOS *
293W Warszawiacy lem:Warszawiak,N
294S _
295W też
296P .
297EOS *
298@end example
299
300Position/length information may be provided only when necessary:
301
302@example
3030000 04 N *
3040000 N 12
305P .
306N 5
307S _
308W km
309@end example
310
311@section UTT File
312
313A UTT file consists of a sequence of segments.  The same text position
314may be covered by multiple segments. In consequence, ambiguous text
315segmentation and ambiguous annotation may be represented.
316
317There are two structural requirements a valid UTT-formatted file
318has to meet:
319
320@itemize @bullet
321
322@item
323segments have to be sorted with respect to the @var{start} field,
324
325@item
326for each
327segment ending at position @var{n}, either there must be a segment starting at
328position @var{n+1}, or position @var{n+1} is not covered by any segment; similarly
329for each segment starting at position @var{n}, either there must be a segment
330ending at position @var{n-1}, or the position @var{n-1} must not be covered
331by any segment.
332
333@end itemize
334
335A valid annotation for the text fragment
336@example
33712.5 km
338@end example
339
340may be
341
342@example
3430000 02 N 12
3440000 04 N 12.5
3450002 01 P .
3460003 01 N 5
3470004 01 S _
3480005 02 W km
349@end example
350
351but not
352
353@example
3540000 02 N 12
3550000 04 N 12.5
3560004 01 S _
3570005 02 W km
358@end example
359
360because in the latter example the first segment (starting at position 0000, 2 characters long) ends at position @var{n}=0001, the next position @var{n+1}=0002 is covered by the second segment, and no segment starts at position @var{n+1}=0002.
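The two structural requirements can be checked mechanically. The sketch below is a hedged Python illustration, not a tool shipped with UTT. The function name @code{check_utt_structure} is hypothetical, segments are given as (start, length) pairs, and ignoring zero-length segments in the boundary test is an assumption (they cover no text position).

```python
def check_utt_structure(segments):
    """Return a list of problems found in (start, length) pairs given in file order."""
    problems = []

    # Requirement 1: segments are sorted by start position.
    starts = [start for start, _ in segments]
    if starts != sorted(starts):
        problems.append("segments are not sorted by start position")

    # Inclusive (first, last) positions; zero-length markers are skipped (assumption).
    spans = [(s, s + n - 1) for s, n in segments if n > 0]
    covered = set()
    for first, last in spans:
        covered.update(range(first, last + 1))
    start_positions = {first for first, _ in spans}
    end_positions = {last for _, last in spans}

    # Requirement 2: segment boundaries must line up with each other.
    for first, last in spans:
        if last + 1 in covered and last + 1 not in start_positions:
            problems.append("position %d is covered but no segment starts there" % (last + 1))
        if first - 1 in covered and first - 1 not in end_positions:
            problems.append("position %d is covered but no segment ends there" % (first - 1))
    return problems
```

Applied to the two annotations of @samp{12.5 km} above, the first list of (start, length) pairs produces no problems, while the second one is reported because position 2 is covered but starts no segment.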
361
362@section Character encoding
363
364The UTT component programs accept only 1-byte character encodings, such
365as ISO, ANSI or DOS code pages. UTF-8 input will probably work as well, but this has not been tested yet.
366
367
368@c @section Formats
369
370@c @unnumberedsubsubsec Basic format
371
372@c While processing large amounts of the overhead related with explicit
373@c ... of the start position and segment length becomes ... . Therefore,
374@c for efficiency reasons certain shortcuts are possible:
375
376@c @unnumberedsubsubsec Relative start position
377
378@c Start position may be given as relative distance from the last
379@c absolut position.
380
381@c @unnumberedsubsubsec Absent length
382
383@c Segment length may by omitted. Normally it can be restored by counting
384@c the length of the @emph{form field}. For segments with the special value
385@c @code{*} in the @emph{form field} length 0 is assumed.
386
387@c @unnumberedsubsubsec Absent length and start position
388
389@c Both start position and segment length may be omitted. In this format
390@c each segment is assumed to follow the previous one. This format is,
391@c therefore, suitable only for unambiguously tagged text
392@c (0-length markers can be still used.)
393
394
395@c @table @code
396@c @item AL
397@c @code{1234 03 W kot}
398@c @item RL
399@c @code{+56 03 W kot}
400@c @item A
401@c @code{1234 W kot}
402@c @item R
403@c @code{+56 W kot}
404@c @item 0
405@c @code{W kot}
406@c @end table
407
408
409@c [JAK UZYSKAÆ POLSKIE CZCIONKI W DVI???]
410
411@macro parhelp
412@item @b{@minus{}@minus{}help}, @b{@minus{}h}
413Print help.
414@end macro
415
416
417@macro parversion
418@item @b{@minus{}@minus{}version}, @b{@minus{}V}
419Print version information.
420@end macro
421
422@macro parinteractive
423@item @b{@minus{}@minus{}interactive, @minus{}i}
424This option toggles interactive mode, which is by default off. In the
425interactive mode the program does not buffer the output.
426@end macro
427
428
429@c @macro parfile
430@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
431@c Input file name.
432@c If this option is absent or equal to '@minus{}', the program
433@c reads from the standard input.
434@c @end macro
435
436
437@c @macro paroutput
438@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
439@c Regular output file name. To regular output the program sends segments
440@c which it successfully processed and copies those which were not
441@c subject to processing. If this option is absent or equal to
442@c '@minus{}', standard output is used.
443@c @end macro
444
445@c @macro parfail
446@c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}}
447@c Fail output file name. To fail output the program copies the segments
448@c it failed to process.  If this option is absent or equal to
449@c '@minus{}', standard output is used.
450@c @end macro
451
452
453@c @macro parcopy
454@c @item @b{@minus{}@minus{}copy, @minus{}c}
455@c Copy succesfully processed segments to regular output also in their
456@c original input form.
457@c @end macro
458
459
460@macro parinputfield
461@item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}}
462The field containing the input to the program. The default is the
463@var{form} field. The fields @var{start}, @var{length}, @var{type},
464and @var{form} are referred to as @code{1}, @code{2}, @code{3},
465@code{4}, respectively.
466@end macro
467
468
469@macro paroutputfield
470@item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}}
471The name of the field added by the program. The default is the name of the program.
472@end macro
473
474
475@macro pardictionary
476@item @b{@minus{}@minus{}dictionary=@var{filename}, @minus{}d @var{filename}}
477Dictionary file name.
478@end macro
479
480
481@macro parprocess
482@item @b{@minus{}@minus{}process=@var{type}, @minus{}p @var{type}}
483Process segments with the specified value in the @var{type} field.
484Multiple occurrences of this option are allowed and are interpreted as
485disjunction. If this option is absent, all segments are processed.
486@end macro
487
488
489@macro parselect
490@item @b{@minus{}@minus{}select=@var{fieldname}, @minus{}s @var{fieldname}}
491Select for processing only segments in which the field named
492@var{fieldname} is present. Multiple occurrences of this option are
493allowed and are interpreted as conjunction of conditions. If this
494option is absent, all segments are processed.
495@end macro
496
497
498@macro parunselect
499@item @b{@minus{}@minus{}unselect=@var{fieldname}, @minus{}S @var{fieldname}}
500Select for processing only segments in which the field @var{fieldname}
501is absent.  Multiple occurrences of this option are allowed and are
502interpreted as conjunction of conditions. If this option is absent,
503all segments are processed.
504@end macro
505
506
507@macro paroneline
508@item @b{@minus{}@minus{}one-line}
509This option makes the program print ambiguous annotation in one output
510line by generating multiple annotation fields. By default, when
511ambiguous annotation may be produced for a segment, the segment is
512duplicated and each of the annotations is added to a separate copy of
513the segment.
514@end macro
515
516
517@macro paronefield
518@item @b{@minus{}@minus{}one-field, @minus{}1}
519This option makes the program print ambiguous annotation in one
520annotation field. By default, when ambiguous annotation may be produced
521for a segment, the segment is duplicated and each of the
522annotations is added to a separate copy of the segment.
523
524This option is useful when working with @command{kot} or @command{con}.
525@end macro
526
527
528@c ---------------------------------------------------------------------
529@c ---------------------------------------------------------------------
530
531@c @node Common command line options
532@c @chapter Common command line options
533
534@c @table @code
535
536@c @parhelp
537
538@c @item @b{@minus{}@minus{}help}, @b{@minus{}h}
539@c Print help.
540
541@c @item @b{@minus{}@minus{}version}, @b{@minus{}v}
542@c Print version information.
543
544@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
545@c Input file name.
546@c If this option is absent or equal to '@minus{}', the program
547@c reads from the standard input.
548
549@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
550@c Regular output file name. To regular output the program sends segments
551@c which it successfully processed and copies those which were not
552@c subject to processing. If this option is absent or equal to
553@c '@minus{}', standard output is used.
554
555@c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}}
556@c Fail output file name. To fail output the program copies the segments
557@c it failed to process.  If this option is absent or equal to
558@c '@minus{}', standard output is used.
559
560@c @item @b{@minus{}@minus{}only-fail}
561@c Discard segments which would normally be sent to regular
562@c output. Print only segments the program failed to process.
563
564@c @item @b{@minus{}@minus{}no-fail}
565@c Discard segments the program failed to process.
566@c (This and the previous option are functionally equivalent to,
567@c respectively, @option{-o /dev/null} and @option{-e /dev/null}, but
568@c make the programs run faster.)
569
570@c @item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}}
571@c The field containing the input to the program. The default is usually
572@c the @var{form} field (unless otherwise stated in the program
573@c description). The fields @var{position}, @var{length}, @var{tag}, and
574@c @var{form} are referred to as @code{1}, @code{2}, @code{3}, @code{4},
575@c respectively.
576
577@c @item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}}
578@c The name of the field added by the program. The default is the name of
579@c the program.
580
581@c @c @item @b{@minus{}@minus{}copy, @minus{}c}
582@c @c Copy processed segments to regular output.
583
584@c @item @b{@minus{}@minus{}dictionary=@var{filename}, @minus{}d @var{filename}}
585@c Dictionary file name.
586@c (This option is used by programs which use dictionary data.)
587
588@c @item @b{@minus{}@minus{}process=@var{tag}, @minus{}p @var{tag}}
589@c Process segments with the specified value in the @var{tag} field.
590@c Multiple occurences of this option are allowed and are interpreted as
591@c disjunction. If this option is absent, all segments are processed.
592
593@c @item @b{@minus{}@minus{}select=@var{fieldname}, @minus{}s @var{fieldname}}
594@c Select for processing only segments in which the field named
595@c @var{fieldname} is present. Multiple occurences of this option are
596@c allowed and are interpreted as conjunction of conditions. If this
597@c option is absent, all segments are processed.
598
599@c @item @b{@minus{}@minus{}unselect=@var{fieldname}, @minus{}S @var{fieldname}}
600@c Select for processing only segments in which the field @var{fieldname}
601@c is absent.  Multiple occurences of this option are allowed and are
602@c interpreted as conjunction of conditions. If this option is absent,
603@c all segments are processed.
604
605@c @item @b{@minus{}@minus{}interactive @minus{}i}
606@c This option toggles interactive mode, which is by default off. In the
607@c interactive mode the program does not buffer the output.
608
609@c @item @b{@minus{}@minus{}config=@var{filename}}
610@c Read configuration from file @file{@var{filename}}.
611
612@c @item @b{@minus{}@minus{}one @minus{}1}
613@c This option makes the program print ambiguous annotation in one output
614@c segment. By default when
615@c ambiguous new annotation is being produced for a segment, the segment
616@c is multiplicated and each of the annotations is added to separate copy
617@c of the segment.
618
619@c @end table
620
621@c ---------------------------------------------------------------------
622@c CONFIGURATION FILES
623@c ---------------------------------------------------------------------
624
625@node    Configuration files
626@chapter Configuration files
627
628Values for all command line options accepted by a component
629may be set in configuration files. The default locations of the
630configuration files for a component named @command{@var{program}} are
631
632@example
633        @file{/usr/local/etc/utt/@var{program}.conf}
634@end example
635
636for the system-wide configuration file and
637
638@example
639        @file{~/.utt/@var{program}.conf}
640@end example
641
642for the user configuration file.
643
644@c The configuration file to load may be also specified with the
645@c @option{--config} option. Configuration file need not be provided.
646
647For each option, the value is set according to the following priority:
648
649@itemize
650@item command line
651@c @item configuration file indicated with @option{--config} option
652@item user configuration file (or configuration file indicated with the @option{--config} option)
653@item system-wide configuration file
654@end itemize
655
656Parameter values are specified in the following format:
657
658@var{parametername}=@var{value}
659
660where @var{parametername} is the short or long name of an option accepted by
661the program, or
662
663@var{parametername}
664
665if the option does not need arguments.
666
667Comments in configuration files are introduced with the @code{#} sign.
668
669If a program accepts multiple occurrences of an option (e.g. the @option{--select} option of @command{lem}), you can specify them on separate lines of the program's configuration file.
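The priority rules and the file syntax can be summarised in a short Python sketch. It is only an illustration under stated assumptions, not the code the UTT programs use: the names @code{parse_conf} and @code{effective_options} are hypothetical, only whole-line @code{#} comments are recognised, and an option set at a higher priority level is assumed to replace all of its occurrences from lower levels.

```python
def parse_conf(text):
    """Parse 'name=value' and bare 'name' lines into {name: [values...]}."""
    options = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment (assumption: whole-line comments only)
        name, separator, value = line.partition("=")
        # An option without an argument is stored with the value None.
        options.setdefault(name.strip(), []).append(value.strip() if separator else None)
    return options


def effective_options(system, user, command_line):
    """Merge option dictionaries; later (higher-priority) layers win."""
    merged = {}
    for layer in (system, user, command_line):
        merged.update(layer)
    return merged
```

In this sketch an option repeated on two lines of one file, such as two @code{select} lines, is collected into a list with two values.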
670
671@c The equal sign may be omitted.
672
673
674@quotation Tip
675If you have two (or more) frequently used sets of options for the same
676program (e.g. lem with the PMDBF dictionary and lem with a user dictionary),
677a good solution is to create two soft links to lem, called
678e.g. lemg and lemu, and to specify their configuration in the files lemg.conf
679and lemu.conf respectively.
680@end quotation
681
682@c ---------------------------------------------------------------------
683@c COMPONENTS
684@c ---------------------------------------------------------------------
685
686@node UTT components
687@chapter UTT components
688
689UTT components are of three types:
690
691@menu
692Sources: programs which read non-UTT data (e.g. raw text) and produce output
693in UTT format
694* tok::         a tokenizer
695
696Filters: programs which read and produce UTT-formatted data
697@c * sen - the sentencizer::
698* lem::         a morphological analyzer
699* gue::         a morphological guesser
700* cor::         a spelling corrector
701* sen::         a sentence splitter
702@c * gph - the graphizer::
703* ser::         a pattern search tool (marks matches)
704* grp::         a pattern search tool (selects sentences containing a match)
705
706Sinks: programs which read UTT data and produce output in another format
707* kot::         an untokenizer
708* con::         a concordance table generator
709@end menu
710
711@c ---------------------------------------------------------------------
712@c TOK
713@c ---------------------------------------------------------------------
714
715@page
716@node tok
717@section tok - a tokenizer
718
719@c ----------------------------------------
720
721@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
722@item @strong{Authors:}                 @tab Tomasz Obrębski
723@item @strong{Component category:}      @tab source
724@end multitable
725
726
727@menu
728* tok description::
729* tok input::
730* tok output::
731* tok command line options::
732* tok example::
733@end menu
734
735@node tok description
736@subsection Description
737
738@code{tok} is a simple program which reads a text file and identifies
739tokens on the basis of their orthographic form.  The type of the token
740is printed as the @var{type} field.
741
742@node tok input
743@subsection Input
744
745Raw text.
746
747@node tok output
748@subsection Output
749
750UTT-file with four fields: @var{start}, @var{length}, @var{type}, and @var{form}. In the @var{type} field five types of tokens are distinguished:
751
752@itemize
753
754@item @code{W}
755(word)
756- continuous sequence of letters
757
758@item @code{N}
759(number)
760- continuous sequence of digits
761
762@item @code{S}
763(space)
764- continuous sequence of space characters
765
766@item @code{P}
767(punctuation mark)
768- single printable character not belonging to any of the other classes
769
770@item @code{B}
771(unprintable character)
772- single unprintable character
773
774@end itemize
775
776
777
778@node tok command line options
779@subsection Command line options
780
781@table @code
782
783@item @b{@minus{}@minus{}help}, @b{@minus{}h}
784Print help.
785
786@item @b{@minus{}@minus{}version}, @b{@minus{}V}
787Print version information.
788
789@item @b{@minus{}@minus{}interactive, @minus{}i}
790This option toggles interactive mode, which is by default off. In the
791interactive mode the program does not buffer the output.
792
793@end table
794
795@node tok example
796@subsection Example
797
798Input:
799
800@example
801Piszemy dobre programy.
802@end example
803
804Output:
805
806@example
8070000 07 W Piszemy
8080007 01 S _
8090008 05 W dobre
8100013 01 S _
8110014 08 W programy
8120022 01 P .
8130023 01 S \n
814@end example
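The token classes described for @code{tok} can be approximated in a few lines of Python. This is a toy sketch and not the @code{tok} implementation: the function names are hypothetical, character classes are taken from Python's Unicode-aware string methods, and positions are counted in characters rather than in bytes of a 1-byte encoding.

```python
import itertools

# Encoding of special characters in the form field.
_FORM_ESCAPES = {" ": "_", "\n": "\\n", "\t": "\\t", "\r": "\\r",
                 "_": "\\_", "*": "\\*", "\\": "\\\\"}


def token_class(char):
    """Classify one character as W, N, S, P or B."""
    if char.isalpha():
        return "W"
    if char.isdigit():
        return "N"
    if char.isspace():
        return "S"
    if char.isprintable():
        return "P"
    return "B"


def tokenize(text):
    """Return UTT segment lines (start, length, type, form) for raw text."""
    lines = []
    position = 0
    for cls, group in itertools.groupby(text, key=token_class):
        chars = list(group)
        # W, N and S tokens are runs of characters; P and B tokens are single characters.
        chunks = ["".join(chars)] if cls in "WNS" else chars
        for chunk in chunks:
            form = "".join(_FORM_ESCAPES.get(c, c) for c in chunk)
            lines.append("%04d %02d %s %s" % (position, len(chunk), cls, form))
            position += len(chunk)
    return lines
```

Calling @code{tokenize} on the example input followed by a newline character reproduces the seven output lines shown above.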
815
816
817@c ---------------------------------------------------------------------
818@c SEN
819@c ---------------------------------------------------------------------
820
821@c @node sen - sentencizer
822@c @chapter sen - sentencizer
823
824@c Authors: Tomasz Obrêbski
825
826@c ---------------------------------------------------------------------
827@c LEM
828@c ---------------------------------------------------------------------
829
830@page
831@node lem
832@section lem - morphological analyzer
833
834@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
835@item @strong{Authors:}                 @tab Tomasz Obrębski, Michał Stolarski
836@item @strong{Component category:}      @tab filter
837@end multitable
838
839@menu
840* lem description::             
841* lem command line options::   
842* lem input::
843* lem output::
844* lem example::                 
845* lem dictionaries::           
846* lem hints::           
847@end menu
848
849@node lem description
850@subsection Description
851
852@command{lem} performs morphological analysis of a simple orthographic
853word, returning all its possible morphological annotations,
854disregarding the context.
855
856@c ----------------------------------------
857
858@node lem command line options
859@subsection Command line options
860
861@table @code
862@parhelp
863@parversion
864@parinteractive
865@c @parfile
866@c @paroutput
867@c @parfail
868@c @parcopy
869@parinputfield
870@paroutputfield
871@pardictionary
872@parprocess
873@parselect
874@parunselect
875@paroneline
876@paronefield
877@end table
878
879@c ----------------------------------------
880
881@node lem input
882@subsection Input
883
884@command{lem} reads a UTT file and processes the value of the @var{form} field
885(the input field may be changed with @option{--input-field} option).
886
887@node lem output
888@subsection Output
889
890@command{lem} adds a new annotation field, whose default name is @code{lem}.  In
891case of ambiguity, either the segment is duplicated (default),
892multiple @code{lem} fields are added (@option{--one-line}), or the ambiguous
893annotation is produced as the value of a single @code{lem} field
894(@option{--one-field}, @option{-1}):
895
896@itemize @bullet
897
898@item
899unambiguous value format:
900
901@example
902   <lemma>,<descr>
903@end example
904
905@item
906ambiguous value format (@option{--one-field} option)
907
908
909@example
910   <lemma>,<descr>[,<descr>][;<lemma>,<descr>[,<descr>]]
911@end example
912
913(alternative descriptions for the same lemma are separated by commas,
914alternative lemmata are separated by semicolons.)
915
916@end itemize
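A value in the @option{--one-field} format can be split back into its readings with a short Python sketch. The function name @code{parse_lem_value} is hypothetical, and the sketch assumes that neither lemmata nor descriptions contain commas or semicolons.

```python
def parse_lem_value(value):
    """Split '<lemma>,<descr>[,<descr>][;<lemma>,...]' into (lemma, [descr, ...]) pairs."""
    readings = []
    for alternative in value.split(";"):      # alternative lemmata
        lemma, *descriptions = alternative.split(",")  # descriptions of one lemma
        readings.append((lemma, descriptions))
    return readings
```

An unambiguous value is simply the special case with one lemma and one description.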
917
918@node lem example
919@subsection Example
920
921Input:
922
923@example
9240000 07 W Piszemy
9250007 01 S _
9260008 05 W dobre
9270013 01 S _
9280014 08 W programy
9290022 01 P .
9300023 01 S \n
931@end example
932
933Output (default):
934
935@example
9360000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
9370007 01 S _
9380008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn
9390008 05 W dobre lem:dobry,ADJ/DpNsCnavGn
9400013 01 S _
9410014 08 W programy lem:program,N/GiNpCa
9420014 08 W programy lem:program,N/GiNpCn
9430014 08 W programy lem:program,N/GiNpCv
9440022 01 P .
9450023 01 S \n
946@end example
947
948Output (@option{--one-line} option):
949
950@example
9510000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
9520007 01 S _
9530008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn lem:dobry,ADJ/DpNsCnavGn
9540013 01 S _
9550014 08 W programy lem:program,N/GiNpCa lem:program,N/GiNpCn lem:program,N/GiNpCv
9560022 01 P .
9570023 01 S \n
958@end example
959
960Output (@option{--one-field} option):
961
962@example
9630000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
9640007 01 S _
9650008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn,ADJ/DpNsCnavGn
9660013 01 S _
9670014 08 W programy lem:program,N/GiNpCa,N/GiNpCn,N/GiNpCv
9680022 01 P .
9690023 01 S \n
970@end example
971
972@c ----------------------------------------
973
974@node lem dictionaries
975@subsection Dictionaries
976
977@command{lem} requires a dictionary. The dictionary may be provided in
978one of two formats: in text (source) format or in binary (fsa) format.
979
980@subsubheading Text format
981
982Dictionary entries have the following structure:
983
984@example
985<form>;<lemma>,<descr>[;<lemma>,<descr>]
986@end example
987
988@var{lemma} may be given explicitly or in the cut-add format:
989
990@example
991@code{[<cut1><add1>-]<cut2><add2>}
992@end example
993
994meaning: replace prefix of length @code{<cut1>} with
995string @code{<add1>}, replace suffix of length @code{<cut2>} with string
996@code{<add2>}. For example, @code{3t} transforms @samp{kocie} into
997@samp{kot}, and @code{3-4ały} transforms @samp{najbielsi} into @samp{biały}.
998
999Each dictionary entry must be written in one line and must not contain blank characters.
1000
1001Examples:
1002@example
1003kot;0,N/GaNsCn
1004kota;1,N/GaNsCg;1,N/GaNsCa
1005kotu;1,N/GaNsCd
1006kotem;2,N/GaNsCi
1007kocie;3t,N/GaNsCl;3t,N/GaNsCv
1008najbielsi;3-4ały,ADJ/DsNpCnGp
1009najbielsze;3-5ały,ADJ/DsNpCnGaifn
1010najlepsi;dobry,ADJ/DsNpCnGp
1011najlepsze;dobry,ADJ/DsNpCnGaifn
1012@end example
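The cut-add notation can be decoded as in the following Python sketch. It is an illustration, not the code used by @command{lem}: the function name @code{decode_lemma} is hypothetical, and treating any lemma field that does not match the cut-add pattern as an explicit lemma is an assumption.

```python
import re

# [<cut1><add1>-]<cut2><add2>: cut counts are digits, add strings are the rest.
_CUT_ADD = re.compile(r"(?:(\d+)([^\d-]*)-)?(\d+)(\D*)")


def decode_lemma(form, lemma_field):
    """Return the lemma for a word form, given an explicit or cut-add lemma field."""
    match = _CUT_ADD.fullmatch(lemma_field)
    if match is None:
        return lemma_field  # explicit lemma, e.g. 'dobry' (assumption)
    cut1, add1, cut2, add2 = match.groups()
    stem = form
    if cut1 is not None:
        # Replace a prefix of length cut1 with add1.
        stem = add1 + stem[int(cut1):]
    # Replace a suffix of length cut2 with add2 (a cut of 0 keeps the whole stem).
    return stem[:len(stem) - int(cut2)] + add2
```

For the entries above, @code{decode_lemma("kocie", "3t")} gives @samp{kot} and @code{decode_lemma("najbielsi", "3-4ały")} gives @samp{biały}.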
1013
1014
1015The mandatory file name extension for a text dictionary is @code{dic}. For large
1016dictionaries it is preferable, however, to compile them into binary
1017(fsa) format.
1018
1019@subsubheading Binary format
1020
1021The mandatory file name extension for a binary dictionary is @code{bin}. To
1022compile a text dictionary into binary format, write:
1023
1024@example
1025compiledic <dictionaryname>.dic
1026@end example
1027
1028@subsubheading Polex/PMDBF dictionary
1029
1030A large-coverage morphological dictionary for the Polish language, Polex/PMDBF, is included in
1031the distribution as the default dictionary for @command{lem}. By default it is
1032located in:
1033
1034@file{$HOME/.utt/pl/lem.bin}
1035
1036@node lem hints
1037@subsection Hints
1038
1039@c @subsubheading Combining data from multiple dictionaries
1040
1041@c @itemize
1042
1043@c @item Apply <dict1>, then apply <dict2> to words which were not annotatated.
1044
1045@c @example
1046@c lem -d <dict1> | lem -S lem -d <dict2>
1047@c @end example
1048
1049@c @item Add annotations from two dictionaries <dict1> and <dict2>.
1050
1051@c @example
1052@c lem -c -d <dict1> | lem -S lem -d <dict2>
1053@c @end example
1054
1055@c @end itemize
1056
1057
1058@c ---------------------------------------------------------------------
1059@c GUE
1060@c ---------------------------------------------------------------------
1061
1062@page
1063@node gue
1064@section gue - morphological guesser
1065
1066@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1067
@item @strong{Authors:}                 @tab Michał Stolarski, Tomasz Obrębski
1069@item @strong{Component category:}      @tab filter
1070
1071@end multitable
1072
@command{gue} guesses morphological descriptions of the form contained
in the @var{form} field.
1075
1076@menu
1077* gue command line options::   
1078* gue example::                 
1079* gue dictionaries::           
1080@end menu
1081
1082@node gue command line options
1083@subsection Command line options
1084
1085@table @code
1086
1087@parhelp
1088@parversion
1089@parinteractive
1090@c @parfile
1091@c @paroutput
1092@c @parfail
1093@c @parcopy
1094@parinputfield
1095@paroutputfield
1096@pardictionary
1097@parprocess
1098@parselect
1099@parunselect
1100@paroneline
1101@paronefield
1102
@item @b{@minus{}@minus{}delta=@var{n}}
Stop displaying answers after a fall in weight, that is, when the weight difference between two subsequent results exceeds the delta value (default=`0.2').


@item @b{@minus{}@minus{}cut-off=@var{n}}
Do not display answers whose weight is lower than the cut-off value (default=`200').


@item @b{@minus{}@minus{}guess_count=@var{n}, @minus{}n @var{n}}
Guess up to @var{n} descriptions (default=`0', which means `display all results').
1113
1114
1115
1116@end table
1117
1118@node gue example
1119@subsection Example
1120
@example
command: gue -n 2

input:
0000 07 W smerfny

output:
0000 07 W smerfny gue:,ADJ/CaDpGiNs
0000 07 W smerfny gue:,ADJ/CnvDpGaipNs
@end example
1131                                 
1132
1133@node gue dictionaries
1134@subsection Dictionaries
1135
1136@command{gue} requires a dictionary. For now, the dictionary must be provided in binary (fsa) format.
1137The fsa format is created by compiling text-format dictionaries.
1138
1139
1140
1141@subsubheading Text format
1142
1143Dictionary entries have the following structure:
1144
1145@example
1146@var{prefix}@code{*}@var{suffix}@code{;}@var{lemma}@code{,}@var{description}@code{:}@var{weight}
1147@end example
1148
1149@var{lemma} must be given in the cut-add format:
1150
1151@example
1152@code{[<cut1><add1>-]<cut2><add2>}
1153@end example
(no spaces in between): replace the prefix of length @var{cut1} with the
string @var{add1}, and replace the suffix of length @var{cut2} with the string
@var{add2}.


Example: @code{3-4ały} transforms @i{najbielsi} into @i{biały}.
1160
1161
@var{description} contains the part of speech and morphosyntactic information (@pxref{PMDBF dictionary}).
1163
1164@var{weight} is an integer value between 1 and 999 indicating the
1165likelihood of the guess.
1166
@example
*łkę;1a,N/GfNsCa
naj*elszy;3-5ały,ADJ/...:...
@end example
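
The entry structure can be made concrete with a small Python sketch. It is
only an illustration: the function names are hypothetical, the weight
@code{500} used in the usage note is invented, and entries without a
weight are assumed to be acceptable and are given the weight @code{None}.

```python
def parse_gue_entry(line):
    """Split 'prefix*suffix;lemma,description:weight' into its parts.

    Hypothetical helper, not part of UTT.
    """
    pattern, rest = line.split(";", 1)
    prefix, suffix = pattern.split("*", 1)
    lemma, rest = rest.split(",", 1)
    description, sep, weight = rest.rpartition(":")
    if not sep:                        # assumption: the weight may be missing
        return prefix, suffix, lemma, rest, None
    return prefix, suffix, lemma, description, int(weight)

def entry_matches(word, prefix, suffix):
    """True if word starts with prefix and ends with suffix without overlap."""
    return (len(word) >= len(prefix) + len(suffix)
            and word.startswith(prefix)
            and word.endswith(suffix))
```

For instance, @code{parse_gue_entry("naj*elsi;3-4ały,ADJ/DsNpCnGp:500")}
returns the prefix @samp{naj}, the suffix @samp{elsi}, the cut-add lemma
@samp{3-4ały}, the description and the weight, and
@code{entry_matches("najbielsi", "naj", "elsi")} is true.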
1171
1172
1173@c ---------------------------------------------------------------------
1174@c COR
1175@c ---------------------------------------------------------------------
1176
1177@page
1178@node cor
1179@section cor - spelling corrector
1180
1181@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski, Michał Stolarski
1183@item @strong{Component category:}      @tab filter
1184@end multitable
1185
The spelling corrector applies Kemal Oflazer's dynamic programming
algorithm @cite{oflazer96} to the FSA representation of the set of
word forms of the Polex/PMDBF dictionary. Given an incorrect
word form, it returns all word forms present in the dictionary whose
edit distance from that form does not exceed the threshold given as a parameter.

By default, @code{cor} replaces the contents of the @var{form} field
with the corrected value and places the old contents in the @code{cor}
field.
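
The notion of an edit distance threshold can be sketched in Python as
follows. This is a brute-force illustration over a plain word list, not
Oflazer's algorithm and not the way @command{cor} traverses the automaton;
it assumes the plain Levenshtein distance (insertions, deletions and
substitutions only), and both function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or keep
        prev = cur
    return prev[-1]

def corrections(word, dictionary, max_distance=1):
    """Return the dictionary words within max_distance edits of word."""
    return [w for w in dictionary if edit_distance(word, w) <= max_distance]
```

With the word list @samp{odlot}, @samp{odlotowy}, @samp{odludek} and the
default distance of 1, the misspelled form @samp{odlut} is corrected to
@samp{odlot} only.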
1195
1196
1197@menu
1198* cor command line options::   
1199* cor dictionaries::           
1200@end menu
1201
1202
1203@node cor command line options
1204@subsection Command line options
1205
1206@table @code
1207
1208@parhelp
1209@parversion
1210@parinteractive
1211@c @parfile
1212@c @paroutput
1213@c @parfail
1214@c @parcopy
1215@parinputfield
1216@paroutputfield
1217@pardictionary
1218@parprocess
1219@parselect
1220@parunselect
1221@paroneline
1222@paronefield
1223
1224@item @b{@minus{}@minus{}distance=@var{int}, @minus{}n @var{int}}
1225Maximum edit distance (default='1').
1226
1227
1228@end table
1229
1230@node cor dictionaries
1231@subsection Dictionaries
1232
1233@command{cor} requires a dictionary. The dictionary has to be provided in binary (fsa) format.
1234The fsa format is created by compiling text-format dictionaries.
1235
1236@subsubheading Text format
1237
1238The @command{cor} dictionary is a list of words:
1239@example
1240odlot
1241odlotowy
1242odludek
1243@end example
1244
1245@page
1246@node sen
1247@section sen - a sentensizer
1248
1249@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1250
@item @strong{Authors:}                 @tab Tomasz Obrębski
1252@item @strong{Component category:}      @tab filter
1253
1254@end multitable
1255
1256@command{sen} detects sentence boundaries in UTT-formatted texts and marks them with special zero-length segments, in which the @var{type} field may contain the BOS (beginning of sentence) or EOS (end of sentence) annotation.
1257
1258@menu
1259@c * sen input::
1260@c * sen output::
1261* sen example::                 
1262@end menu
1263
1264@node sen example
1265@subsection Example
1266
1267@example
1268command: sen
1269
1270input:
12710000 05 W Cze¶Ê
12720005 01 P !
12730006 01 S _
12740007 02 W To
12750009 01 S _
12760010 02 W ja
12770012 01 P .
12780013 01 S \n
1279
1280output:
12810000 00 BOS *
12820000 05 W Cze¶Ê
12830005 01 P !
12840006 00 EOS *
12850006 00 BOS *
12860006 01 S _
12870007 02 W To
12880009 01 S _
12890010 02 W ja
12900012 01 P .
12910013 01 S \n
12920014 00 EOS *
1293@end example
1294
1295
1296@c ---------------------------------------------------------------------
1297@c GPH
1298@c ---------------------------------------------------------------------
1299
1300@c @node gph - graphizer
1301@c @chapter gph - graphizer
1302
1303@c Authors: Tomasz Obrêbski
1304
1305
1306
1307@c SER
1308@c ---------------------------------------------------------------------
1309@c ---------------------------------------------------------------------
1310
1311@page
1312@node ser
1313@section ser - pattern search tool
1314
1315@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski
1317@item @strong{Component category:}      @tab filter
1318@end multitable
1319
1320@command{ser} looks for patterns in UTT-formatted texts.
1321
1322@menu
1323* ser command line options::   
1324* ser pattern::                 
1325* ser how ser works::           
1326* ser customization::           
1327* ser limitations::             
1328* ser requirements::           
1329@end menu
1330
1331
1332@c ---------------------------------------------------------------------
1333@node ser command line options
1334@subsection Command line options
1335
1336@table @code
1337
1338@parhelp
1339@parversion
1340@c @parfile
1341@c @paroutput
1342@c @parinputfield
1343@c @paroutputfield
1344@parprocess
1345@parinteractive
1346
1347@item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}}
1348The search pattern.
1349
1350@item @b{@minus{}@minus{}morph=@var{field}}
1351The name of the annotation field containing the morphological
1352description (default @code{lem}).
1353
1354@item @b{@minus{}@minus{}flex}
1355Only print the generated flex source code.
1356
@item @b{@minus{}@minus{}macro=@var{filename}}
Read macro definitions from file @var{filename} rather than from the
default location. This option makes it possible to redefine the set of terms.

@item @b{@minus{}@minus{}define=@var{filename}}
Append macro definitions from file @var{filename}. This option
makes it possible to extend the set of terms.
1364
1365@end table
1366
1367
1368@c ---------------------------------------------------------------------
1369@node ser pattern
1370@subsection Pattern
1371
1372The @command{ser} pattern is a regular expression over terms corresponding
1373to text segments or segment sequences. Predefined terms are:
1374
1375@table @code
1376
1377@item seg(@var{t},@var{f},@var{a})
1378a segment of type @var{t}, containing form @var{f} and annotation
1379@var{a}
1380
1381@item form(@var{f})
1382a segment containing form @var{f}
1383
1384@item field(@var{f})
1385a segment containing annotation field @var{f}
1386
1387@item space(@var{f})
1388a space segment of form @var{f}
1389
1390@item word(@var{f})
1391a word segment of form @var{f}
1392
1393@item punct(@var{f})
1394a punct segment of form @var{f}
1395
1396@item number(@var{f})
1397a number segment of form @var{f}
1398
1399@item lexeme(@var{f})
1400a word segment with lemma @var{f}
1401
1402@item cat(@var{c})
1403a word segment of category @var{c}
1404
1405@end table
1406
All arguments are optional. If an argument is omitted, an arbitrary
string of non-blank characters is assumed as the argument value. Term
arguments may be arbitrary character-level regular expressions. The
following special symbols can be used:
1411
1412@multitable {aaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1413@item @code{[@dots{}]}            @tab a character class
1414@item @code{[^@dots{}]}           @tab a negated character class
1415@item @code{|}                    @tab alternative
1416@item @code{*}                    @tab repetition, including zero times
1417@item @code{+}                    @tab repetition, at least one time
1418@item @code{?}                    @tab optionality
1419@item @code{@{@var{m},@var{n}@}}  @tab repetition from @var{m} to @var{n} times
1420@item @code{@{@var{m},@}}         @tab repetition @var{m} or more times
1421@item @code{@{@var{m}@}}          @tab repetition @var{m} times
@item @code{\@var{ddd}}           @tab the character with octal value @var{ddd}
1423@item @code{\x@var{hh}}           @tab the character with hexadecimal value @var{hh}
1424@item @code{( )}                  @tab parentheses, used to override precedence
1425@c @end multitable
1426
1427@c @multitable {aaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1428@item @code{.}    @tab a non-blank character
1429@item @code{\w}   @tab a letter
1430@item @code{\W}   @tab a non-blank character other than a letter
1431@item @code{\d}   @tab a digit
1432@item @code{\D}   @tab a non-blank character other than a digit
1433@item @code{\s}   @tab a space or tab character
1434@item @code{\S}   @tab a non-blank character (the same as @code{.})
1435@item @code{\l}   @tab a lowercase letter
1436@item @code{\L}   @tab an uppercase letter
1437@end multitable
1438
1439
1440@noindent The following characters:
1441@example
1442@verb{%  [   ]   ^   |   *   +   ?   {   }   ,   .   <   >   \ %}
1443@end example
1444must be escaped with a backslash, i.e. written as:
1445@example
1446@verb{% \[  \]  \^  \|  \*  \+  \?  \{  \}  \,  \.  \<  \>  \\ %}
1447@end example
1448
@quotation Note
The special symbols are borrowed from Perl with minor modifications
made for convenience, so the meaning of certain special characters and
sequences differs slightly from the common one. In particular, the
meaning of the @code{.} special character is modified because of the
special function of spaces in UTT files (they are field
separators). Use @code{\s} to match a space or tab character explicitly.
@end quotation
1458
In the argument of the @code{cat} term, a special operator @code{<@dots{}>} may be
used. A category specification enclosed in angle brackets matches all
category descriptions which are consistent (non-contradictory) with the
specification. For example, @code{<N>} matches all noun descriptions, and
@code{<ADJ/Can>} matches all adjectives in the accusative or nominative case.
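
One possible reading of ``consistent'' is sketched below in Python. This is
an interpretation, not the @command{ser} implementation: it assumes that
the part of speech must be identical, that an attribute contradicts the
specification only when both sides give values for it and share none of
them, and that every value is a single character. The function names are
hypothetical.

```python
import re

# An attribute name (uppercase) followed by its one-character values.
PAIR = re.compile(r"([A-Z]+)([a-z0-9?!*+-]+)")

def parse(descr):
    """Return (pos, mapping from attribute to its set of values)."""
    pos, _, attrs = descr.partition("/")
    return pos, dict((a, set(v)) for a, v in PAIR.findall(attrs))

def consistent(spec, descr):
    """True if descr does not contradict spec (given without the brackets)."""
    spec_pos, spec_attrs = parse(spec)
    pos, attrs = parse(descr)
    if spec_pos != pos:
        return False
    # Assumption: an attribute missing from descr never contradicts spec.
    return all(a not in attrs or bool(vals & attrs[a])
               for a, vals in spec_attrs.items())
```

Under these assumptions @code{consistent("ADJ/Can", "ADJ/DpNpCnavGaifn")}
holds, because the case values @code{a} and @code{n} occur in the
description, whereas @code{consistent("N/Ca", "N/GaNsCn")} does not.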
1464
1465
1466@*
1467@noindent @b{Examples of one-segment patterns:}
1468
1469@multitable {aaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1470@item @code{seg}            @tab any segment
1471@item @code{word}           @tab any word-form
1472@item @code{word(pomocy)}   @tab the word-form @samp{pomocy}
1473@item @code{word(naj.+)}    @tab a word-form beginning with @samp{naj}
1474@item @code{word(\L\l+)}    @tab a capitalized word-form
1475@item @code{punct}          @tab a punctuation character
1476@item @code{space(.*\\n.*)} @tab a space segment containing a newline character
1477@item @code{lexeme(pomoc)}  @tab any form of the lexeme 'pomoc'
@item @code{cat(N/.*)}      @tab a word whose category starts with @code{N/}
@item @code{cat(<N/Ca>)}    @tab a word whose category matches @code{N/Ca}
1480@end multitable
1481
1482@*
1483@noindent @b{Examples of multi-segment patterns:}
1484
1485@table @code
1486
1487@item (word(\L) punct(\.) space?)+ word(\L\l+)
1488a sequence of initials followed by a surname
1489
1490@item punct seg(W|S|N)* cat(<NPRO/Sr>) seg(W|S|N)* punct
a text fragment between two punctuation characters, containing an
occurrence of a relative pronoun
1493
1494@end table
1495
1496
1497@node ser how ser works
1498@subsection How ser works
1499
1500@node ser customization
1501@subsection Customization
1502
1503@c All predefined terms correspond to single segments,
1504
1505@example
1506define(`verbseq', `(cat(V) (space cat(V)))')
1507@end example
1508
1509
1510the term @code{cat()} may not be used as a ... of
1511
1512@c See @command{m4} manual for further details on macro definition format.
1513
1514@node ser limitations
1515@subsection Limitations
1516
1517more than 3 attributes in <>.
1518
1519@node ser requirements
1520@subsection Requirements
1521
1522In order to run @command{ser}, the following programs must be
1523installed in the system:
1524
1525@itemize
1526
1527@item @command{m4}
1528@item @command{grep}
1529@item @command{flex}
1530@item @command{gcc}
1531
1532@end itemize
1533
1534
1535@c GRP
1536@c ---------------------------------------------------------------------
1537@c ---------------------------------------------------------------------
1538
1539@page
1540@node grp
1541@section grp - pattern search tool
1542
1543@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski
1545@item @strong{Component category:}      @tab filter
1546@end multitable
1547
1548
@command{grp} selects sentences containing an expression matching a
pattern. The pattern format is exactly the same as that accepted by
@command{ser}.

@command{grp} is intended mainly for speeding up the corpus search process.
It is extremely fast (processing speed is usually higher than the speed
of reading the corpus file from disk).
1556
1557
1558
1559@c @menu
1560@c * ser command line options::   
1561@c * ser pattern::                 
1562@c * ser how ser works::           
1563@c * ser customization::           
1564@c * ser limitations::             
1565@c * ser requirements::           
1566@c @end menu
1567@menu
1568* grp command line options::   
1569* grp pattern::                 
1570* grp hints::   
1571@end menu
1572
1573@node grp command line options
1574@subsection Command line options
1575
1576@table @code
1577
1578@parhelp
1579@parversion
1580@c @parfile
1581@c @paroutput
1582@c @parinputfield
1583@c @paroutputfield
1584@parprocess
1585@parinteractive
1586
1587@item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}}
1588The search pattern.
1589
1590@item @b{@minus{}@minus{}morph=@var{field}}
1591The name of the annotation field containing the morphological
1592description (default @code{lem}).
1593
1594@item @b{@minus{}@minus{}command}
1595Only print the generated flex source code.
1596
@item @b{@minus{}@minus{}macro=@var{filename}}
Read macro definitions from file @var{filename} rather than from the
default location. This option makes it possible to redefine the set of terms.

@item @b{@minus{}@minus{}define=@var{filename}}
Append macro definitions from file @var{filename}. This option
makes it possible to extend the set of terms.
1604
1605@end table
1606
1607
1608@node grp pattern
1609@subsection Pattern
1610
The pattern format is the same as for @command{ser} (@pxref{ser pattern}).
1612
1613@node grp hints
1614@subsection Hints
1615
The corpus search speed may be increased by combining @command{grp} with the @command{lzop}
compression tool (@command{grp} usually processes data faster than it can be read from a
disk, especially from slow laptop drives).
1619
1620@example
1621cat corpus | tok | sen | lem | grp -a p | lzop -7 > corpus.grp.lzo
1622@end example
1623
1624@example
1625lzop -cd corpus.grp.lzo | grp -a gP -e @var{EXPR} | ser -e @var{EXPR}
1626@end example
1627
1628
1629@c ---------------------------------------------------------------------
1630@c kot
1631@c ---------------------------------------------------------------------
1632@c ---------------------------------------------------------------------
1633
1634@page
1635@node kot
1636@section kot - untokenizer
1637
Authors: Tomasz Obrębski
1639
1640@command{kot} is the opposite of @command{tok}. It changes UTT-formatted text into plain text.
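
The idea can be sketched in Python under several assumptions: each input
line consists of the fields start, length, type and form; zero-length
segments (such as sentence markers) contribute no text; and the special
sequences @code{_}, @code{\t}, @code{\n}, @code{\r}, @code{\f} are expanded
only in segments of type @code{S}. The function is hypothetical and only
approximates what @command{kot} does.

```python
# Special sequences used in space segments and the characters they stand for.
ESCAPES = (("\\n", "\n"), ("\\t", "\t"), ("\\r", "\r"), ("\\f", "\f"), ("_", " "))

def kot(lines, gap_fill=""):
    """Rebuild plain text from UTT segments (hypothetical sketch)."""
    out, expected = [], None
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue                       # not a complete segment
        start, length = int(fields[0]), int(fields[1])
        kind, form = fields[2], fields[3]
        if length == 0:
            continue                       # assumption: markers carry no text
        if expected is not None and start != expected:
            out.append(gap_fill)           # nonadjacent segments
        if kind == "S":
            for esc, char in ESCAPES:
                form = form.replace(esc, char)
        out.append(form)
        expected = start + length
    return "".join(out)
```

Given the segments @samp{To}, @samp{_}, @samp{ja}, @samp{.} and @samp{\n}
at consecutive positions, the sketch returns the text @samp{To ja.}
followed by a newline.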
1641
1642@menu
1643* kot command line options::   
1644* kot usage examples::   
1645@end menu
1646
1647@node kot command line options
1648@subsection Command line options
1649
1650@table @code
1651
1652@parhelp
1653
1654@c @item @b{@minus{}@minus{}version}, @b{@minus{}v}
1655
1656@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
1657
1658@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
1659
1660@c @item @b{@minus{}@minus{}interactive @minus{}i}
1661
1662@c @item @b{@minus{}@minus{}config=@var{filename}}
1663
1664@item
1665
1666@item @b{@minus{}@minus{}gap-fill=@var{string}, @minus{}g @var{string}}
1667print @var{string} between nonadjacent segments of the input file
1668
1669@item @b{@minus{}@minus{}spaces, @minus{}r}
1670retain the special characters @code{_}, @code{\t},
1671@code{\n}, @code{\r}, @code{\f} unexpanded in the output
1672
1673@end table
1674
1675@node kot usage examples
1676@subsection Usage examples
1677
1678@example
1679cat legia.txt | tok | kot       
1680@end example
1681
1682@example
1683cat legia.txt | tok | lem -1 | kot
1684@end example
1685
1686@c CON............................................................
1687@c ...............................................................
1688@c ...............................................................
1689
1690@page
1691@node con
1692@section con - concordance table generator
1693
1694@command{con} generates a concordance table based on a pattern given to @command{ser}.
1695
1696@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1697@item @strong{Authors:}                 @tab Justyna Walkowska
1698@item @strong{Component category:}      @tab sink
1699@end multitable
1700@c
1701
1702@menu
1703* con command line options::
1704* con usage example::
1705* con hints::   
1706@end menu
1707
1708@node con command line options
1709@subsection Command line options
1710
1711@table @code
1712
1713@parhelp
1714
1715@c @item @b{@minus{}@minus{}help}, @b{@minus{}h}
1716@c @item @b{@minus{}@minus{}version}, @b{@minus{}v}
1717@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
1718@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
1719@c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}} [???]
1720@c @item @b{@minus{}@minus{}copy, @minus{}c} [???]
1721@c @item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}}
1722@c @item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}}
1723@c @item @b{@minus{}@minus{}process=@var{class}, @minus{}p @var{class}}
1724@c @item @b{@minus{}@minus{}interactive @minus{}i}
1725@c @item @b{@minus{}@minus{}config=@var{filename}}
1726@c @item
1727@c @item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}}
1728@c search pattern
1729@c
1730@c @item @b{@minus{}@minus{}flex}
1731@c only print the generated flex source code
1732@c
1733@c @item @b{@minus{}@minus{}macro=@var{filename}}
1734@c read macrodefinitions from file @var{filename} rather than from
1735@c default location. This option allows to redefine the set of terms.
1736@c
1737@c @item @b{@minus{}@minus{}define=@var{filename}}
1738@c append macrodefinitions from file @var{filename}. This option
1739@c allows to extend the set of terms.
1740
1741@item @b{@minus{}@minus{}left @minus{}l}           
1742        Left context info (default='30c'). Example:
1743@example                         
1744                                 -l=5c: left context is 5 characters
1745                                 -l=5w: left context is 5 words
1746                                 -l=5s: left context is 5 non-empty input lines
1747                                 -l='\s*\S+\sr\S+BOS': left context starts with the given regex
1748@end example
1749
1750@item @b{@minus{}@minus{}right @minus{}r}           
1751        Right context info (default='30c').
1752@item @b{@minus{}@minus{}trim @minus{}t}           
1753        Clear incomplete words from output.
1754@item @b{@minus{}@minus{}white @minus{}w}           
        Do not convert whitespace characters into spaces.
1756@item @b{@minus{}@minus{}column @minus{}c}           
1757        Left column minimal width in characters (default = 0).
1758@item @b{@minus{}@minus{}ignore @minus{}i}           
1759        Ignore segment inconsistency in the input.
1760@item @b{@minus{}@minus{}bon}           
1761        Beginning of selected segment (regex, default='[0-9]+ [0-9]+ BOM .*').
1762@item @b{@minus{}@minus{}eob}           
1763        End of selected segment (regex, default='[0-9]+ [0-9]+ EOM .*').
1764@item @b{@minus{}@minus{}bod}           
1765        Selected segment beginning display string (default='[').
1766@item @b{@minus{}@minus{}eod}           
1767        Selected segment end display string (default=']').
1768
1769
1770
1771@end table
1772
1773@node con usage example
1774@subsection Usage example
1775@example
cat file.txt | tok | lem -1 | ser -e 'lexeme(dom)' | con
1777@end example
1778
1779
1780@node con hints
1781@subsection Hints
1782
1783@command{con} is a rather slow program. Do not pass large amounts of
1784redundant text through this program. @command{con} works fine in the following
1785sequence:
1786
1787@example
1788... | grp -e EXPR | ser -e EXPR | con
1789@end example
1790
1791
1792
1793@c ---------------------------------------------------------------------
1794@c ---------------------------------------------------------------------
1795
1796@page
1797@node Auxiliary tools
1798@chapter Auxiliary tools
1799
1800@menu
1801* compiledic::         dictionary compiler
1802* fla::                UTT file flattener
1803* unfla::              UTT file unflattener
1804@end menu
1805
1806
1807@page
1808@node compiledic
1809@section compiledic - the dictionary compiler
1810
1811@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1812@item @strong{Authors:}                 @tab Michal Stolarski, Tomasz Obrebski
1813@item @strong{Component category:}      @tab additional tool
1814@end multitable
1815@c
1816
1817@command{compiledic} compiles dictionaries in text format (@code{.dic} extension) into binary
1818(FSA) format (@code{.bin} extension).
1819
1820Automaton representation of a dictionary is built using the AT&T tools:
1821@itemize
1822@item AT&T FSM Library,
1823@item AT&T Lextools.
1824@end itemize
1825
1826In order for the compiledic program to work you have to install the
1827above mentioned packages into your system.  They are freely available
1828for non-commercial use.
1829
1830Usage:
1831@example
1832        compiledic <dictionaryname>.dic
1833@end example
1834
1835The file <dictionaryname>.bin will be generated.
1836
Remark: The program produces a lot of temporary files, which are
stored in the current directory. They are deleted after successful
termination of the program.
1840
1841@c @menu
1842@c * con command line options::
1843@c * con usage example::
1844@c * con hints::   
1845@c @end menu
1846
1847
1848@page
1849@node fla
1850@section fla - the UTT file flattener
1851
1852@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski
1854@item @strong{Component category:}      @tab filter
1855@end multitable
1856@c
1857
@command{fla} ``flattens'' a UTT file by merging the segments belonging
to one sentence into one line. Technically, the end-of-line characters
('\n', ASCII code 10) between them are replaced with form-feed characters ('\f',
ASCII code 12). Flattening makes it possible to process UTT files
sentence by sentence with tools such as @command{grep} or @command{sed}
(this is used in @command{grp} and @command{mar}).
1864
Flattened files should have the suffix @code{.fla}, e.g.@: @file{thetext.utt.fla}.

Flattened files are still human-readable.
1868
1869Usage:
1870
1871@example
1872        fla [<bosregex>]
1873@end example
1874
The optional argument is a regular expression describing segments
which should be treated as sentence beginnings (the test is whether the
segment contains a fragment matching @code{<bosregex>}). By
default, segments containing a field @code{BOS} are sought.
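
The transformation can be sketched in Python as follows. The two
functions are hypothetical stand-ins for @command{fla} and
@command{unfla}; they work on lists of lines rather than on files, and
the default pattern @code{BOS} is an assumption based on the description
above.

```python
import re

def fla(lines, bosregex="BOS"):
    """Join the segments of each sentence with form-feed characters."""
    out, current = [], []
    for line in lines:
        if re.search(bosregex, line) and current:
            out.append("\f".join(current))   # close the previous sentence
            current = []
        current.append(line)
    if current:
        out.append("\f".join(current))
    return out

def unfla(lines):
    """Inverse of fla: restore one segment per line."""
    return [seg for line in lines for seg in line.split("\f")]
```

Each element of the list returned by @code{fla} holds one sentence, so a
line-oriented search visits the text sentence by sentence, and
@code{unfla(fla(lines))} gives back the original list of segments.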
1879@c @menu
1880@c * con command line options::
1881@c * con usage example::
1882@c * con hints::   
1883@c @end menu
1884
1885
1886
1887@page
1888@node unfla
1889@section unfla - the UTT file unflattener
1890
1891@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski
1893@item @strong{Component category:}      @tab filter
1894@end multitable
1895
1896@command{unfla} transforms a flattened UTT file, produced by
1897@command{fla}, into the regular format by restoring end-of-line
1898characters.
1899
1900
1901
1902
1903@c ---------------------------------------------------------------------
1904@c USAGE EXAMPLES
1905@c ---------------------------------------------------------------------
1906
1907@node Usage examples
1908@chapter Usage examples
1909
1910@subsubheading Simple pipelines
1911
1912@enumerate
1913
1914@item tokenization
1915
1916cat text | tok > output1
1917
1918@item morphological annotation (1)
1919
1920simple dictionary based lemmatization
1921
1922cat text | tok | lem > output1
1923
1924@item morphological annotation (2)
1925
1) perform dictionary-based lemmatization
2) guess descriptions for words which have no annotation
1928
1929@example
1930cat text | tok | lem | gue -S lem > output2
1931@end example
1932
1933@item morphological annotation (3)
1934
19351) perform dictionary-based lemmatization
19362) try to correct words with no annotation
19373) perform dictionary-based lemmatization of corrected words
19384) guess descriptions for words which still have no annotation
1939
1940@example
1941cat text | tok | lem | cor -p W -S lem | lem -I cor | gue -p W -S lem
1942@end example
1943@item spelling correction
1944
1945
1946
1947@example
1948cat text | tok | lem --only-fail | cor -1 > output3
1949@end example
1950
1951@item Expression extraction
1952
1953Extraction of all occurrences of a verb followed by a form of the noun 'rozmowa'.
1954
1955@example
1956cat text | tok | lem -1 | ser -e 'cat(<V>) space lexeme(rozmowa)' -m | kot > output4
1957@end example
1958
1959@item A word in context
1960
Extraction of text fragments containing a form of the lexeme 'rozmowa' in
the context of 5 preceding and 5 following corpus segments.
1963
1964@example
1965cat text | tok | lem -1 | ser -e 'seg@{5@} lexeme(rozmowa) seg@{5@}' -m | kot > output
1966@end example
1967
1968@item generation of concordance table (1)
1969
1970@example
1971cat text | tok | lem -1 | ser -e 'cat(<V>) space lexeme(rozmowa)' | con
1972@end example
1973
Running time: 10 seconds.
1975
1976@item generation of concordance table (2)
1977
1978The same as above but much faster
1979
1980@example
1981cat text | tok | lem -1 | \
1982grp -e 'cat(<V>) space lexeme(rozmowa)' | \
1983ser -e 'cat(<V>) space lexeme(rozmowa)' | \
1984con
1985@end example
1986
Running time: 2 seconds.
1988
1989@item generation of concordance table (3)
1990
Usually, one searches the same corpus repeatedly. In
such a case it is advisable to first transform the corpus data into the format
required by @command{grp}, and then use the preprocessed data.

As @command{grp} (@command{grep}) processes data faster than it can be
read from the disk drive, the search time may be shortened further by
using file compression.  We suggest using @command{lzop}.
1998
1999@item the fastest way to search a large corpus
2000
2001step 1: preprocessing
2002
2003@example
2004cat corpus | tok | sen | lem -1 \
2005| grp -a p | lzop -7 > corpus.grp.lzo
2006@end example
2007
2008step 2: search
2009
2010@example
lzop -cd corpus.grp.lzo | grp -a gP -e 'cat(<V>) space lexeme(rozmowa)' \
| ser -e 'cat(<V>) space lexeme(rozmowa)' | con
2013@end example
2014
2015@end enumerate
2016
2017@subsubheading More complicated configurations
2018
2019
2020@example
2021mknod fifo1 p
2022mknod fifo2 p
2023mknod fifo3 p
2024mknod fifo4 p
2025mknod fifo5 p
2026
2027tok | lem -p W -e fifo1 > fifo2 &
2028cor -e fifo3 < fifo1 | lem > fifo4 &
2029gue < fifo3 > fifo5 &
2030sort -m fifo2 fifo4 fifo5
2031
2032rm fifo?
2033@end example
2034
2035
2036@c ---------------------------------------------------------------------
2037@c ---------------------------------------------------------------------
2038
2039@c ---------------------------------------------------------------------
2040@c PMDBF DICTIONARY
2041@c ---------------------------------------------------------------------
2042
2043@node PMDBF dictionary
2044@chapter PMDBF dictionary
2045
UTT components come with lexical data derived from the Polish
Morphological Database (PMDB).
2048
2049@menu
2050* PMDBF files::   
2051* PMDBF tag structure::                 
2052* PMDBF parts of speech::           
2053* PMDBF morphosyntactic attributes::           
2054@end menu
2055
2056@node PMDBF files
2057@section Files
2058
2059@node PMDBF tag structure
2060@section Tag structure
2061
2062pos = [[:upper:]]+
2063
2064attr = [[:upper:]]+
2065
2066val = [[:lower:][:digit:]?!*+-] | <[^>\n]+>
2067
2068descr = pos ( / ( attr val + ) + ) ?
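
The grammar above can be turned into a Python regular expression, as in
the following sketch. It is an ASCII approximation: the POSIX classes are
spelled out as @code{A-Z}, @code{a-z} and @code{0-9} because Python's
@code{re} module does not support them, and the function name
@code{parse_descr} is hypothetical.

```python
import re

# val = [[:lower:][:digit:]?!*+-] | <[^>\n]+>   (ASCII approximation)
VAL = r"(?:[a-z0-9?!*+-]|<[^>\n]+>)"
# descr = pos ( / ( attr val + ) + ) ?
DESCR = re.compile(r"^([A-Z]+)(?:/((?:[A-Z]+" + VAL + r"+)+))?$")
# one attribute together with all its values
PAIR = re.compile(r"([A-Z]+)(" + VAL + r"+)")

def parse_descr(descr):
    """Return (pos, list of (attr, values)) or None if descr is malformed."""
    m = DESCR.match(descr)
    if m is None:
        return None
    pos, attrs = m.group(1), m.group(2) or ""
    return pos, PAIR.findall(attrs)
```

For example, @code{parse_descr("N/GaNsCn")} splits the tag into the part
of speech @code{N} and the pairs @code{G}/@code{a}, @code{N}/@code{s} and
@code{C}/@code{n}, while a malformed tag such as @code{N/} is rejected.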

@node PMDBF parts of speech
@section Parts of speech

@multitable {ADJPRP} { adjectival-passive-participle }
@item @code{N} @tab noun
@item @code{NPRO} @tab nominal-pronoun
@item @code{NV} @tab deverbal-noun
@item @code{V} @tab verb
@item @code{BYC} @tab the verb @emph{by@'c} (``to be'')
@item @code{VNI} @tab non-inflected-verb
@item @code{ADJ} @tab adjective
@item @code{ADJPAP} @tab adjectival-passive-participle
@item @code{ADJPRP} @tab adjectival-present-participle
@item @code{ADJPP} @tab adjectival-past-participle
@item @code{ADJPRO} @tab adjectival-pronoun
@item @code{ADJNUM} @tab adjectival-numeral
@item @code{ADV} @tab adverb
@item @code{ADVANP} @tab adverbial-anterior-participle
@item @code{ADVPRP} @tab adverbial-present-participle
@item @code{ADVPRO} @tab adverbial-pronoun
@item @code{ADVNUM} @tab adverbial-numeral
@item @code{P} @tab preposition
@item @code{PPRO} @tab prep-noun-pronoun
@item @code{CONJ} @tab conjunction
@item @code{EXCL} @tab exclamation
@item @code{APP} @tab call
@item @code{ONO} @tab onomatopoeia
@item @code{PART} @tab particle
@item @code{NUMCRD} @tab cardinal-numeral
@item @code{NUMCOL} @tab collective-numeral
@item @code{NUMPAR} @tab partitive-numeral
@item @code{NUMORD} @tab ordinal-numeral
@end multitable

@node PMDBF morphosyntactic attributes
@section Morphosyntactic attributes

@multitable {Attr} {Val} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@c @headitem Attr @tab Val @tab Description
@item
@code{A} @tab @tab Aspect
@item
@tab @code{p} @tab perfective,
@item
@tab @code{i} @tab imperfective.
@item
@item
@code{V} @tab @tab Verb-Form
@item
@tab @code{b} @tab infinitive,
@item
@tab @code{p} @tab personal,
@item
@tab @code{i} @tab impersonal.
@item
@item
@code{M} @tab @tab Mood
@item
@tab @code{d} @tab declarative,
@item
@tab @code{c} @tab conditional,
@item
@tab @code{i} @tab imperative.
@item
@item
@code{T} @tab @tab Tense
@item
@tab @code{a} @tab past,
@item
@tab @code{r} @tab present,
@item
@tab @code{f} @tab future.
@item
@item
@code{P} @tab @tab Person
@item
@tab @code{1} @tab 1,
@item
@tab @code{2} @tab 2,
@item
@tab @code{3} @tab 3.
@item
@item
@code{D} @tab @tab Degree
@item
@tab @code{p} @tab positive,
@item
@tab @code{c} @tab comparative,
@item
@tab @code{s} @tab superlative.
@item
@item
@code{N} @tab @tab Number
@item
@tab @code{s} @tab singular,
@item
@tab @code{p} @tab plural.
@item
@item
@code{C} @tab @tab Case
@item
@tab @code{n} @tab nominative,
@item
@tab @code{g} @tab genitive,
@item
@tab @code{d} @tab dative,
@item
@tab @code{a} @tab accusative,
@item
@tab @code{i} @tab instrumental,
@item
@tab @code{l} @tab locative,
@item
@tab @code{v} @tab vocative.
@item
@item
@code{G} @tab @tab Gender
@item
@tab @code{p} @tab masculine-personal,
@item
@tab @code{a} @tab masculine-animal,
@item
@tab @code{i} @tab masculine-inanimate,
@item
@tab @code{f} @tab feminine,
@item
@tab @code{n} @tab neuter.
@end multitable
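
To read a description with the help of the two tables, it can be split
into its part of speech and its attributes.  The following sketch does
this with standard tools; the function name @code{split_descr} and the
sample tag are made up for illustration, and values written in angle
brackets are not handled.

@example
# split_descr DESCR: print the part of speech on the first line, then
# one ATTR=VALUES line per attribute.
split_descr() (
    printf '%s\n' "$1" | cut -d/ -f1
    printf '%s\n' "$1" | cut -s -d/ -f2 |
        sed 's/[[:upper:]][[:upper:]]*/ &=/g' | tr ' ' '\n' | sed '/^$/d'
)

split_descr 'N/GfNsCn'
@end example

@noindent
For @code{N/GfNsCn} this prints the four lines @code{N}, @code{G=f},
@code{N=s} and @code{C=n}: a noun with feminine gender, singular number
and nominative case.  Note that the same letter can name both a part of
speech and an attribute (@code{N} is noun before the slash and Number
after it), which is why the two parts are separated first.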

@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------
@c
@c @node Examples
@c @chapter Examples

@c ----------------------------------------------------------------------
@c ----------------------------------------------------------------------

@node    GNU Free Documentation License
@chapter GNU Free Documentation License

@c The GNU Free Documentation License.
@center Version 1.2, November 2002

@c This file is intended to be included within another document,
@c hence no sectioning command or @node.

@display
Copyright @copyright{} 2000,2001,2002 Free Software Foundation, Inc.
51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA

Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
@end display

@enumerate 0
@item
PREAMBLE

The purpose of this License is to make a manual, textbook, or other
functional and useful document @dfn{free} in the sense of freedom: to
assure everyone the effective freedom to copy and redistribute it,
with or without modifying it, either commercially or noncommercially.
Secondarily, this License preserves for the author and publisher a way
to get credit for their work, while not being considered responsible
for modifications made by others.

This License is a kind of ``copyleft'', which means that derivative
works of the document must themselves be free in the same sense.  It
complements the GNU General Public License, which is a copyleft
license designed for free software.

We have designed this License in order to use it for manuals for free
software, because free software needs free documentation: a free
program should come with manuals providing the same freedoms that the
software does.  But this License is not limited to software manuals;
it can be used for any textual work, regardless of subject matter or
whether it is published as a printed book.  We recommend this License
principally for works whose purpose is instruction or reference.

@item
APPLICABILITY AND DEFINITIONS

This License applies to any manual or other work, in any medium, that
contains a notice placed by the copyright holder saying it can be
distributed under the terms of this License.  Such a notice grants a
world-wide, royalty-free license, unlimited in duration, to use that
work under the conditions stated herein.  The ``Document'', below,
refers to any such manual or work.  Any member of the public is a
licensee, and is addressed as ``you''.  You accept the license if you
copy, modify or distribute the work in a way requiring permission
under copyright law.

A ``Modified Version'' of the Document means any work containing the
Document or a portion of it, either copied verbatim, or with
modifications and/or translated into another language.

A ``Secondary Section'' is a named appendix or a front-matter section
of the Document that deals exclusively with the relationship of the
publishers or authors of the Document to the Document's overall
subject (or to related matters) and contains nothing that could fall
directly within that overall subject.  (Thus, if the Document is in
part a textbook of mathematics, a Secondary Section may not explain
any mathematics.)  The relationship could be a matter of historical
connection with the subject or with related matters, or of legal,
commercial, philosophical, ethical or political position regarding
them.

The ``Invariant Sections'' are certain Secondary Sections whose titles
are designated, as being those of Invariant Sections, in the notice
that says that the Document is released under this License.  If a
section does not fit the above definition of Secondary then it is not
allowed to be designated as Invariant.  The Document may contain zero
Invariant Sections.  If the Document does not identify any Invariant
Sections then there are none.

The ``Cover Texts'' are certain short passages of text that are listed,
as Front-Cover Texts or Back-Cover Texts, in the notice that says that
the Document is released under this License.  A Front-Cover Text may
be at most 5 words, and a Back-Cover Text may be at most 25 words.

A ``Transparent'' copy of the Document means a machine-readable copy,
represented in a format whose specification is available to the
general public, that is suitable for revising the document
straightforwardly with generic text editors or (for images composed of
pixels) generic paint programs or (for drawings) some widely available
drawing editor, and that is suitable for input to text formatters or
for automatic translation to a variety of formats suitable for input
to text formatters.  A copy made in an otherwise Transparent file
format whose markup, or absence of markup, has been arranged to thwart
or discourage subsequent modification by readers is not Transparent.
An image format is not Transparent if used for any substantial amount
of text.  A copy that is not ``Transparent'' is called ``Opaque''.

Examples of suitable formats for Transparent copies include plain
@sc{ascii} without markup, Texinfo input format, La@TeX{} input
format, @acronym{SGML} or @acronym{XML} using a publicly available
@acronym{DTD}, and standard-conforming simple @acronym{HTML},
PostScript or @acronym{PDF} designed for human modification.  Examples
of transparent image formats include @acronym{PNG}, @acronym{XCF} and
@acronym{JPG}.  Opaque formats include proprietary formats that can be
read and edited only by proprietary word processors, @acronym{SGML} or
@acronym{XML} for which the @acronym{DTD} and/or processing tools are
not generally available, and the machine-generated @acronym{HTML},
PostScript or @acronym{PDF} produced by some word processors for
output purposes only.

The ``Title Page'' means, for a printed book, the title page itself,
plus such following pages as are needed to hold, legibly, the material
this License requires to appear in the title page.  For works in
formats which do not have any title page as such, ``Title Page'' means
the text near the most prominent appearance of the work's title,
preceding the beginning of the body of the text.

A section ``Entitled XYZ'' means a named subunit of the Document whose
title either is precisely XYZ or contains XYZ in parentheses following
text that translates XYZ in another language.  (Here XYZ stands for a
specific section name mentioned below, such as ``Acknowledgements'',
``Dedications'', ``Endorsements'', or ``History''.)  To ``Preserve the Title''
of such a section when you modify the Document means that it remains a
section ``Entitled XYZ'' according to this definition.

The Document may include Warranty Disclaimers next to the notice which
states that this License applies to the Document.  These Warranty
Disclaimers are considered to be included by reference in this
License, but only as regards disclaiming warranties: any other
implication that these Warranty Disclaimers may have is void and has
no effect on the meaning of this License.

@item
VERBATIM COPYING

You may copy and distribute the Document in any medium, either
commercially or noncommercially, provided that this License, the
copyright notices, and the license notice saying this License applies
to the Document are reproduced in all copies, and that you add no other
conditions whatsoever to those of this License.  You may not use
technical measures to obstruct or control the reading or further
copying of the copies you make or distribute.  However, you may accept
compensation in exchange for copies.  If you distribute a large enough
number of copies you must also follow the conditions in section 3.

You may also lend copies, under the same conditions stated above, and
you may publicly display copies.

@item
COPYING IN QUANTITY

If you publish printed copies (or copies in media that commonly have
printed covers) of the Document, numbering more than 100, and the
Document's license notice requires Cover Texts, you must enclose the
copies in covers that carry, clearly and legibly, all these Cover
Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on
the back cover.  Both covers must also clearly and legibly identify
you as the publisher of these copies.  The front cover must present
the full title with all words of the title equally prominent and
visible.  You may add other material on the covers in addition.
Copying with changes limited to the covers, as long as they preserve
the title of the Document and satisfy these conditions, can be treated
as verbatim copying in other respects.

If the required texts for either cover are too voluminous to fit
legibly, you should put the first ones listed (as many as fit
reasonably) on the actual cover, and continue the rest onto adjacent
pages.

If you publish or distribute Opaque copies of the Document numbering
more than 100, you must either include a machine-readable Transparent
copy along with each Opaque copy, or state in or with each Opaque copy
a computer-network location from which the general network-using
public has access to download using public-standard network protocols
a complete Transparent copy of the Document, free of added material.
If you use the latter option, you must take reasonably prudent steps,
when you begin distribution of Opaque copies in quantity, to ensure
that this Transparent copy will remain thus accessible at the stated
location until at least one year after the last time you distribute an
Opaque copy (directly or through your agents or retailers) of that
edition to the public.

It is requested, but not required, that you contact the authors of the
Document well before redistributing any large number of copies, to give
them a chance to provide you with an updated version of the Document.

@item
MODIFICATIONS

You may copy and distribute a Modified Version of the Document under
the conditions of sections 2 and 3 above, provided that you release
the Modified Version under precisely this License, with the Modified
Version filling the role of the Document, thus licensing distribution
and modification of the Modified Version to whoever possesses a copy
of it.  In addition, you must do these things in the Modified Version:

@enumerate A
@item
Use in the Title Page (and on the covers, if any) a title distinct
from that of the Document, and from those of previous versions
(which should, if there were any, be listed in the History section
of the Document).  You may use the same title as a previous version
if the original publisher of that version gives permission.

@item
List on the Title Page, as authors, one or more persons or entities
responsible for authorship of the modifications in the Modified
Version, together with at least five of the principal authors of the
Document (all of its principal authors, if it has fewer than five),
unless they release you from this requirement.

@item
State on the Title page the name of the publisher of the
Modified Version, as the publisher.

@item
Preserve all the copyright notices of the Document.

@item
Add an appropriate copyright notice for your modifications
adjacent to the other copyright notices.

@item
Include, immediately after the copyright notices, a license notice
giving the public permission to use the Modified Version under the
terms of this License, in the form shown in the Addendum below.

@item
Preserve in that license notice the full lists of Invariant Sections
and required Cover Texts given in the Document's license notice.

@item
Include an unaltered copy of this License.

@item
Preserve the section Entitled ``History'', Preserve its Title, and add
to it an item stating at least the title, year, new authors, and
publisher of the Modified Version as given on the Title Page.  If
there is no section Entitled ``History'' in the Document, create one
stating the title, year, authors, and publisher of the Document as
given on its Title Page, then add an item describing the Modified
Version as stated in the previous sentence.

@item
Preserve the network location, if any, given in the Document for
public access to a Transparent copy of the Document, and likewise
the network locations given in the Document for previous versions
it was based on.  These may be placed in the ``History'' section.
You may omit a network location for a work that was published at
least four years before the Document itself, or if the original
publisher of the version it refers to gives permission.

@item
For any section Entitled ``Acknowledgements'' or ``Dedications'', Preserve
the Title of the section, and preserve in the section all the
substance and tone of each of the contributor acknowledgements and/or
dedications given therein.

@item
Preserve all the Invariant Sections of the Document,
unaltered in their text and in their titles.  Section numbers
or the equivalent are not considered part of the section titles.

@item
Delete any section Entitled ``Endorsements''.  Such a section
may not be included in the Modified Version.

@item
Do not retitle any existing section to be Entitled ``Endorsements'' or
to conflict in title with any Invariant Section.

@item
Preserve any Warranty Disclaimers.
@end enumerate

If the Modified Version includes new front-matter sections or
appendices that qualify as Secondary Sections and contain no material
copied from the Document, you may at your option designate some or all
of these sections as invariant.  To do this, add their titles to the
list of Invariant Sections in the Modified Version's license notice.
These titles must be distinct from any other section titles.

You may add a section Entitled ``Endorsements'', provided it contains
nothing but endorsements of your Modified Version by various
parties---for example, statements of peer review or that the text has
been approved by an organization as the authoritative definition of a
standard.

You may add a passage of up to five words as a Front-Cover Text, and a
passage of up to 25 words as a Back-Cover Text, to the end of the list
of Cover Texts in the Modified Version.  Only one passage of
Front-Cover Text and one of Back-Cover Text may be added by (or
through arrangements made by) any one entity.  If the Document already
includes a cover text for the same cover, previously added by you or
by arrangement made by the same entity you are acting on behalf of,
you may not add another; but you may replace the old one, on explicit
permission from the previous publisher that added the old one.

The author(s) and publisher(s) of the Document do not by this License
give permission to use their names for publicity for or to assert or
imply endorsement of any Modified Version.

@item
COMBINING DOCUMENTS

You may combine the Document with other documents released under this
License, under the terms defined in section 4 above for modified
versions, provided that you include in the combination all of the
Invariant Sections of all of the original documents, unmodified, and
list them all as Invariant Sections of your combined work in its
license notice, and that you preserve all their Warranty Disclaimers.

The combined work need only contain one copy of this License, and
multiple identical Invariant Sections may be replaced with a single
copy.  If there are multiple Invariant Sections with the same name but
different contents, make the title of each such section unique by
adding at the end of it, in parentheses, the name of the original
author or publisher of that section if known, or else a unique number.
Make the same adjustment to the section titles in the list of
Invariant Sections in the license notice of the combined work.

In the combination, you must combine any sections Entitled ``History''
in the various original documents, forming one section Entitled
``History''; likewise combine any sections Entitled ``Acknowledgements'',
and any sections Entitled ``Dedications''.  You must delete all
sections Entitled ``Endorsements.''

@item
COLLECTIONS OF DOCUMENTS

You may make a collection consisting of the Document and other documents
released under this License, and replace the individual copies of this
License in the various documents with a single copy that is included in
the collection, provided that you follow the rules of this License for
verbatim copying of each of the documents in all other respects.

You may extract a single document from such a collection, and distribute
it individually under this License, provided you insert a copy of this
License into the extracted document, and follow this License in all
other respects regarding verbatim copying of that document.

@item
AGGREGATION WITH INDEPENDENT WORKS

A compilation of the Document or its derivatives with other separate
and independent documents or works, in or on a volume of a storage or
distribution medium, is called an ``aggregate'' if the copyright
resulting from the compilation is not used to limit the legal rights
of the compilation's users beyond what the individual works permit.
When the Document is included in an aggregate, this License does not
apply to the other works in the aggregate which are not themselves
derivative works of the Document.

If the Cover Text requirement of section 3 is applicable to these
copies of the Document, then if the Document is less than one half of
the entire aggregate, the Document's Cover Texts may be placed on
covers that bracket the Document within the aggregate, or the
electronic equivalent of covers if the Document is in electronic form.
Otherwise they must appear on printed covers that bracket the whole
aggregate.

@item
TRANSLATION

Translation is considered a kind of modification, so you may
distribute translations of the Document under the terms of section 4.
Replacing Invariant Sections with translations requires special
permission from their copyright holders, but you may include
translations of some or all Invariant Sections in addition to the
original versions of these Invariant Sections.  You may include a
translation of this License, and all the license notices in the
Document, and any Warranty Disclaimers, provided that you also include
the original English version of this License and the original versions
of those notices and disclaimers.  In case of a disagreement between
the translation and the original version of this License or a notice
or disclaimer, the original version will prevail.

If a section in the Document is Entitled ``Acknowledgements'',
``Dedications'', or ``History'', the requirement (section 4) to Preserve
its Title (section 1) will typically require changing the actual
title.

@item
TERMINATION

You may not copy, modify, sublicense, or distribute the Document except
as expressly provided for under this License.  Any other attempt to
copy, modify, sublicense or distribute the Document is void, and will
automatically terminate your rights under this License.  However,
parties who have received copies, or rights, from you under this
License will not have their licenses terminated so long as such
parties remain in full compliance.

@item
FUTURE REVISIONS OF THIS LICENSE

The Free Software Foundation may publish new, revised versions
of the GNU Free Documentation License from time to time.  Such new
versions will be similar in spirit to the present version, but may
differ in detail to address new problems or concerns.  See
@uref{http://www.gnu.org/copyleft/}.

Each version of the License is given a distinguishing version number.
If the Document specifies that a particular numbered version of this
License ``or any later version'' applies to it, you have the option of
following the terms and conditions either of that specified version or
of any later version that has been published (not as a draft) by the
Free Software Foundation.  If the Document does not specify a version
number of this License, you may choose any version ever published (not
as a draft) by the Free Software Foundation.
@end enumerate

@page
@heading ADDENDUM: How to use this License for your documents

To use this License in a document you have written, include a copy of
the License in the document and put the following copyright and
license notices just after the title page:

@smallexample
@group
  Copyright (C)  @var{year}  @var{your name}.
  Permission is granted to copy, distribute and/or modify this document
  under the terms of the GNU Free Documentation License, Version 1.2
  or any later version published by the Free Software Foundation;
  with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
  Texts.  A copy of the license is included in the section entitled ``GNU
  Free Documentation License''.
@end group
@end smallexample

If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts,
replace the ``with@dots{}Texts.'' line with this:

@smallexample
@group
    with the Invariant Sections being @var{list their titles}, with
    the Front-Cover Texts being @var{list}, and with the Back-Cover Texts
    being @var{list}.
@end group
@end smallexample

If you have Invariant Sections without Cover Texts, or some other
combination of the three, merge those two alternatives to suit the
situation.

If your document contains nontrivial examples of program code, we
recommend releasing these examples in parallel under your choice of
free software license, such as the GNU General Public License,
to permit their use in free software.

@c Local Variables:
@c ispell-local-pdict: "ispell-dict"
@c End:


@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------

@node    Reporting bugs
@chapter Reporting bugs

Report bugs to <obrebski@@amu.edu.pl>.

@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------

@c @node    Copyright
@c @chapter Copyright
@c
@c Copyright 2004 by Tomasz Obrebski
@c This software is free for research and educational use.

@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------

@node    Author
@chapter Author


@bye