source: doc/utt.texinfo @ 0a58b3f

Last change on this file since 0a58b3f was d6a59ca, checked in by Tomasz Obrebski <obrebski@…>, 11 years ago

Documentation fixes (utf8 works), fix in tre

[9ace5d2]1
[25ae32e]2\input texinfo   @c -*-texinfo-*-
[9ace5d2]3@c @documentencoding ISO-8859-2
[25ae32e]4@c @documentlanguage pl
5
6@c %**start of header
7@setfilename utt.info
8@settitle UAM Text Tools v0.90
[d6a59ca]9@documentencoding utf-8
[25ae32e]10@c %**end of header
11
12@copying
[261bf62]13This manual is for UAM Text Tools (version 0.90, October, 2008)
[25ae32e]14
[9ace5d2]15Copyright @copyright{}  2005, 2007  Tomasz Obrębski, Michał Stolarski, Justyna Walkowska, Paweł Konieczka.
[25ae32e]16
17Permission is granted to copy, distribute and/or modify this document
[261bf62]18under the terms of the GNU Free Documentation License, Version 1.2 or
19any later version published by the Free Software Foundation; with no
20Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.  A
copy of the license is included in the section entitled ``GNU Free
Documentation License''.
[25ae32e]23
24@c @quotation
25@c Permission is granted to ...
26@c No permission is granted until the document is completed.
27@c @end quotation
28@end copying
29
30@titlepage
31@title UAM Text Tools 0.90 - User Manual
32@subtitle edition 0.01, @today
33@subtitle status: prescript
[9ace5d2]34@author by Justyna Walkowska, Tomasz Obrębski and Michał Stolarski
[25ae32e]35@page
36@vskip 0pt plus 1filll
37@insertcopying
38@end titlepage
39
40@contents
41
42@c @paragraphindent none
43
44@iftex
[9ace5d2]45@tex
46% \usepackage[T1]{fontenc}
47% \usepackage[utf8]{inputenc}
48% \usepackage{times}
49@end tex
50
[25ae32e]51@parskip = 0.5@normalbaselineskip plus 3pt minus 1pt
52@end iftex
53@c @headings off
54@c @everyheading LEM(1) @| @| LEM(1)
55@everyfooting @today @c @| @thispage @|
56
57@ifnottex
58
59@node Top
60@top UTT - UAM Text Tools
61
62@insertcopying
63
64@menu
65* General information::                       
66* UTT file format::             
67* Configuration files::         
68* UTT components::
69* Auxiliary tools::
70* Usage examples::             
71* PMDBF dictionary::           
72@c * Examples::                   
73@c * Copyright::
74* GNU Free Documentation License::
75* Reporting bugs::                                   
76* Author::                     
77@end menu
78@end ifnottex
79
80
81@c ----------------------------------------------------------------------
82
83@node General information
84@chapter General information
85
86UAM Text Tools (UTT) is a package of language processing tools
87developed at Adam Mickiewicz University. Its functionality includes:
88
89@itemize @bullet
90
91@item
tokenization
[25ae32e]93@item
94dictionary-based morphological analysis
95@item
96heuristic morphological analysis of unknown words
97@item
spelling correction
[25ae32e]99@item
100pattern search
101@item
102sentence splitting
103@item
104generation of concordance tables
105@end itemize
106
The toolkit is intended for processing raw (unannotated),
unrestricted text for any purpose.

The system is organized as a collection of command-line programs, each
performing one operation, e.g. tokenization, lemmatization, spelling
correction. The components are independent of one another; the
unifying element is the uniform i/o file format.

The components may be combined in various ways to provide various text
processing services. New components supplied by the user may also be
easily incorporated into the system, provided that they respect the i/o
file format conventions.

UTT component programs do not depend on any specific tagset or
morphological description format.
122
123UTT is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by
124the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
125
126The Polex/PMDBF dictionary is licensed under the Creative Commons by-nc-sa License which prohibits commercial use. 
127
128
129List of contributors:
130
131@itemize
@item Paweł Konieczka
[9ace5d2]133@item Tomasz Obrębski
134@item Michał Stolarski
[25ae32e]135@item Marcin Walas
136@item Justyna Walkowska
[9ace5d2]137@item Paweł Wereński
[25ae32e]138@end itemize
139
140@c ----------------------------------------------------------------------
141@c ---------------------------------------------------------------------
142
143@node    UTT file format
144@chapter UTT file format
145
146A UTT file contains annotation of a text. It consists of a sequence of
147segments. Each segment explicitly refers to a continuous piece of the
148text and provides some information on it.
149
150@section Segment format
151
152A segment occupies one line of a UTT file and consists of
153space-separated fields:
154
155
156@quotation
157@sp 1
158[@var{start} [@var{length}]] @var{type} @var{form} [@var{annotation1} [@var{annotation2} ...]]
159@sp 1
160@end quotation
161
162@table @var
163
164@item @var{start}
165Non-negative integer value indicating the position in the source text where the
166segment starts.
167
168@item @var{length}
169Non-negative integer value indicating the length of the segment.
170
171@item @var{type}
A sequence of non-blank characters that contains no spaces or digits (digits could lead to @var{type} being misinterpreted as a @var{start} or @var{length} field).
173@var{type} reflects the main classification of segments -
174into words, numbers, punctuation marks, meta-text markers.
175@xref{tok output,,tok output}, for description of automatically recognized type markers.
176
177@item @var{form}
178This field contains the textual form of the segment or the special
symbol @code{*} indicating that the form is not given (e.g. when the segment has been created artificially to mark something and is of length 0).
180
181The characters or character sequences that have special meaning in the
182@var{form} field are enumerated below.
183
184Characters with special meaning:
185
186@itemize
187@item @code{_} - space character
188@item @code{*} - undefined contents
189@end itemize
190
191Escape sequences:
192
193@itemize
194@item @code{\n} - new line
195@item @code{\t} - tabulation
196@item @code{\r} - carriage return 
197
198@item @code{\_} - the @code{_} character
199@item @code{\*} - the @code{*} character
200@item @code{\\} - the @code{\} character
201
202@c @item @code{\hh} - a character with hexadecimal code @code{hh} (used for non-printable characters)
203@end itemize
204
205@item @var{annotation1}
206@item @var{annotation2}
207@item ...
208Annotation fields have the following format:
209
210@var{longname} @code{:} @var{value}
211
212or
213
214@var{shortname} @var{value}
215
216where @var{longname} is a string of alphanumeric characters
217(isalnum() test), @var{shortname} - a single non-alphanumeric character
218(ispunct() test), and @var{value} is an arbitrary string of non-blank characters.
219
220@end table
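The special characters and escape sequences of the @var{form} field can be decoded mechanically. The following Python function is only an illustrative sketch and is not part of UTT: the name @code{decode_form} is hypothetical, and treating an unescaped @code{*} inside a longer form as a literal character is an assumption.

```python
# Illustrative sketch, not part of UTT: decode a UTT form field.
# Escape sequences listed in the manual, keyed by the character after "\".
ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "_": "_", "*": "*", "\\": "\\"}


def decode_form(form):
    """Return the text denoted by a form field, or None for a bare '*'."""
    if form == "*":
        return None  # undefined contents
    out = []
    i = 0
    while i < len(form):
        ch = form[i]
        if ch == "\\" and i + 1 < len(form) and form[i + 1] in ESCAPES:
            out.append(ESCAPES[form[i + 1]])  # escape sequence
            i += 2
        elif ch == "_":
            out.append(" ")  # unescaped underscore stands for a space
            i += 1
        else:
            out.append(ch)  # ordinary character (assumption: includes '*')
            i += 1
    return "".join(out)
```

For instance, @code{decode_form("_")} yields a single space and @code{decode_form("a\\_b")} yields the three characters a, underscore, b.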
221
222
Only two fields are mandatory: @var{type} and @var{form}. All other fields
may be absent. When only one number precedes the
@var{type} field, it is interpreted as the @var{start} position.

If the @var{length} field is omitted, the length of the segment is the
length of the @var{form} field, except when the value of the
@var{form} field is @code{*}; in that case, the length is assumed to
be 0.
231
232If the @var{start} field is also absent, the segment is assumed to directly
233follow the preceding one.
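The rules for optional fields can be summarized by a small parser. This is a sketch in Python under stated assumptions, not the code used by the UTT programs: the names @code{Segment}, @code{parse_segment} and @code{parse_annotation} are hypothetical, and leading all-digit fields are taken to be @var{start} and @var{length}.

```python
# Illustrative sketch, not part of UTT: split one segment line into fields.
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    start: Optional[int]   # None when the field is absent
    length: Optional[int]  # None when the field is absent
    type: str
    form: str
    annotations: list      # list of (name, value) pairs


def parse_annotation(field):
    """Split 'longname:value' or 'Xvalue' (X = one punctuation character)."""
    name, sep, value = field.partition(":")
    if sep and name.isalnum():
        return name, value
    return field[0], field[1:]


def parse_segment(line):
    fields = line.split()
    numbers = []
    # At most two leading numeric fields: start, then length.
    while len(numbers) < 2 and fields and fields[0].isascii() and fields[0].isdigit():
        numbers.append(int(fields.pop(0)))
    if len(fields) < 2:
        raise ValueError("type and form are mandatory: %r" % line)
    seg_type, form, *rest = fields
    start = numbers[0] if numbers else None
    length = numbers[1] if len(numbers) == 2 else None
    return Segment(start, length, seg_type, form,
                   [parse_annotation(f) for f in rest])
```

A single leading number is stored as @var{start}, so @code{parse_segment("0000 BOS *")} gives a segment with start 0 and no explicit length.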
234
235@c Conventions:
236
237@c Annotation fields with predefined meaning:
238
239@c @itemize
240@c @item @code{!} - UTT components are allowed to modify the contents of
241@c the @var{form} field (e.g. spelling correction does this). If this happens the
242@c original form of the segment have to be placed in the @code{!}-field.
243@c @item @code{@@} - morphological description
244@c @item @code{=} - node identifier assignment (used in graph encoding)
245@c @item @code{<} - preceding/dominating node(s) (used in graph encoding)
246@c @item @code{>} - succeeding/subordinate node(s) (used in graph encoding)
247@c @end itemize
248
249Segments of length 0 may be used to mark file positions with some
250information. See e.g. BOS and EOS (beginning/end of sentence) markers
251in the example below.
252
253Example:
254
255sentence: @samp{Piszemy dobre progrumy.}
256
@example
0000 00 BOS *
0000 07 W Piszemy lem:pisać,V
0007 01 S _
0008 05 W dobre lem:dobry,ADJ
0013 01 S _
0014 08 W progrumy cor:programy lem:program,N
0022 01 P .
0023 00 EOS *
0023 01 S _
0024 00 BOS *
0024 11 W Warszawiacy lem:Warszawiak,N
0035 01 S _
0036 03 W też
0039 01 P .
0040 00 EOS *
@end example
275
@example
0000 BOS *
0000 W Piszemy lem:pisać,V
0007 S _
0008 W dobre lem:dobry,ADJ
0013 S _
0014 W progrumy cor:programy lem:program,N
0022 P .
0023 EOS *
@end example
286
Position information may be provided only for some segments:

@example
0000 BOS *
W Piszemy lem:pisać,V
S _
W dobre lem:dobry,ADJ
S _
W progrumy cor:programy lem:program,N
P .
EOS *
S _
0024 BOS *
W Warszawiacy lem:Warszawiak,N
S _
W też
P .
EOS *
@end example
306
307Position/length information may be provided only when necessary:
308
@example
0000 04 N *
0000 N 12
P .
N 5
S _
W km
@end example
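The defaulting rules for omitted @var{start} and @var{length} fields can be made explicit with a short Python sketch. It is an illustration only: the names @code{form_length} and @code{resolve_positions} are hypothetical, and counting an escape sequence such as @code{\n} as one character is an assumption not stated in this manual.

```python
# Illustrative sketch, not part of UTT: fill in omitted start/length fields.
def form_length(form):
    """Length of a form; an escape sequence counts as one character (assumption)."""
    count = i = 0
    while i < len(form):
        i += 2 if form[i] == "\\" and i + 1 < len(form) else 1
        count += 1
    return count


def resolve_positions(lines):
    """Yield (start, length, type, form) with implicit values made explicit."""
    next_start = 0
    for line in lines:
        fields = line.split()
        numbers = []
        while len(numbers) < 2 and fields and fields[0].isdigit():
            numbers.append(int(fields.pop(0)))
        seg_type, form = fields[0], fields[1]
        # No start field: the segment directly follows the preceding one.
        start = numbers[0] if numbers else next_start
        if len(numbers) == 2:
            length = numbers[1]
        elif form == "*":
            length = 0
        else:
            length = form_length(form)
        next_start = start + length
        yield start, length, seg_type, form
```

Applied to the six lines of the last example, the sketch assigns the start positions 0, 0, 2, 3, 4 and 5.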
317
318@section UTT File
319
A UTT file consists of a sequence of segments.  The same text position
may be covered by multiple segments. In consequence, ambiguous text
segmentation and ambiguous annotation may be represented.
323
324There are two structural requirements a valid UTT-formatted file
325has to meet:
326
327@itemize @bullet
328
329@item
segments have to be sorted with respect to the @var{start} field,
331
332@item
333for each
334segment ending at position @var{n}, either there must be a segment starting at
335position @var{n+1}, or position @var{n+1} is not covered by any segment; similarly
336for each segment starting at position @var{n}, either there must be a segment
337ending at position @var{n-1}, or the position @var{n-1} must not be covered
338by any segment.
339
340@end itemize
341
342A valid annotation for the text fragment
@example
12.5 km
@end example
346
347may be
348
@example
0000 02 N 12
0000 04 N 12.5
0002 01 P .
0003 01 N 5
0004 01 S _
0005 02 W km
@end example
357
358but not
359
@example
0000 02 N 12
0000 04 N 12.5
0004 01 S _
0005 02 W km
@end example
366
because in the latter example the first segment (starting at position
0000, 2 characters long) ends at position @var{n}=0001, while position
@var{n+1}=0002 is covered by the second segment and no segment starts at
that position.
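The two structural requirements can be checked mechanically. The Python sketch below is an illustration under stated assumptions, not a tool shipped with UTT: the name @code{check_utt_structure} is hypothetical, segments are given as (start, length) pairs, and zero-length marker segments are assumed to be exempt from the adjacency check.

```python
# Illustrative sketch, not part of UTT: check the two structural requirements.
def check_utt_structure(segments):
    """segments: (start, length) pairs in file order. Return a list of problems."""
    problems = []
    starts = [start for start, _ in segments]
    if starts != sorted(starts):
        problems.append("segments are not sorted by start position")

    # Inclusive (first, last) spans; zero-length markers are skipped (assumption).
    spans = [(start, start + length - 1) for start, length in segments if length > 0]
    start_set = {first for first, _ in spans}
    end_set = {last for _, last in spans}

    def covered(pos):
        return any(first <= pos <= last for first, last in spans)

    for first, last in spans:
        if last + 1 not in start_set and covered(last + 1):
            problems.append("segment ending at %d: nothing starts at %d" % (last, last + 1))
        if first - 1 not in end_set and covered(first - 1):
            problems.append("segment starting at %d: nothing ends at %d" % (first, first - 1))
    return problems
```

With the (start, length) pairs of the valid annotation of @samp{12.5 km} the function returns an empty list; with the pairs of the invalid one it reports the segment ending at position 1.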
371
372
373@section Flattened UTT file
374
The UTT file format has two variants: regular and flattened. The regular
format was described above.  In the flattened format some of the
end-of-line characters are replaced with line-feed characters.

The flattened format is mainly used to represent whole sentences as
single lines of the input file (all intrasentential end-of-line
characters are replaced with line-feed characters).

This technical trick makes it possible to perform certain text
processing operations on entire sentences with tools such as
@command{grep} (see the @command{grp} component) or @command{sed} (see the @command{mar} component).

The conversion between the two formats is performed by the tools
@command{fla} and @command{unfla}.
[25ae32e]389
390@section Character encoding
391
The UTT component programs accept only 1-byte character encodings, such
as ISO, ANSI, DOS.
[25ae32e]394
395
396@c @section Formats
397
398@c @unnumberedsubsubsec Basic format
399
400@c While processing large amounts of the overhead related with explicit
401@c ... of the start position and segment length becomes ... . Therefore,
402@c for efficiency reasons certain shortcuts are possible:
403
404@c @unnumberedsubsubsec Relative start position
405
406@c Start position may be given as relative distance from the last
407@c absolut position.
408
409@c @unnumberedsubsubsec Absent length
410
411@c Segment length may by omitted. Normally it can be restored by counting
412@c the length of the @emph{form field}. For segments with the special value
413@c @code{*} in the @emph{form field} length 0 is assumed.
414
415@c @unnumberedsubsubsec Absent length and start position
416
417@c Both start position and segment length may be omitted. In this format
418@c each segment is assumed to follow the previous one. This format is,
419@c therefore, suitable only for unambiguously tagged text
420@c (0-length markers can be still used.)
421
422
423@c @table @code
424@c @item AL
425@c @code{1234 03 W kot}
426@c @item RL
427@c @code{+56 03 W kot}
428@c @item A
429@c @code{1234 W kot}
430@c @item R
431@c @code{+56 W kot}
432@c @item 0
433@c @code{W kot}
434@c @end table
435
436
@c [HOW TO GET POLISH FONTS IN DVI???]
[25ae32e]438
439@macro parhelp
440@item @b{@minus{}@minus{}help}, @b{@minus{}h}
441Print help.
442@end macro
443
444
445@macro parversion
446@item @b{@minus{}@minus{}version}, @b{@minus{}V}
447Print version information.
448@end macro
449
450@macro parinteractive
451@item @b{@minus{}@minus{}interactive, @minus{}i}
452This option toggles interactive mode, which is by default off. In the
453interactive mode the program does not buffer the output.
454@end macro
455
456
457@c @macro parfile
458@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
459@c Input file name.
460@c If this option is absent or equal to '@minus{}', the program
461@c reads from the standard input.
462@c @end macro
463
464
465@c @macro paroutput
466@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
467@c Regular output file name. To regular output the program sends segments
468@c which it successfully processed and copies those which were not
469@c subject to processing. If this option is absent or equal to
470@c '@minus{}', standard output is used.
471@c @end macro
472
473@c @macro parfail
474@c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}}
475@c Fail output file name. To fail output the program copies the segments
476@c it failed to process.  If this option is absent or equal to
477@c '@minus{}', standard output is used.
478@c @end macro
479
480
481@c @macro parcopy
482@c @item @b{@minus{}@minus{}copy, @minus{}c}
483@c Copy succesfully processed segments to regular output also in their
484@c original input form.
485@c @end macro
486
487
488@macro parinputfield
489@item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}}
The field containing the input to the program. The default is the
@var{form} field. The fields @var{start}, @var{length}, @var{type},
and @var{form} are referred to as @code{1}, @code{2}, @code{3},
@code{4}, respectively.
494@end macro
495
496
497@macro paroutputfield
498@item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}}
499The name of the field added by the program. The default is the name of the program.
500@end macro
501
502
503@macro pardictionary
504@item @b{@minus{}@minus{}dictionary=@var{filename}, @minus{}d @var{filename}}
505Dictionary file name.
506@end macro
507
508
509@macro parprocess
510@item @b{@minus{}@minus{}process=@var{type}, @minus{}p @var{type}}
Process segments with the specified value in the @var{type} field.
Multiple occurrences of this option are allowed and are interpreted as
disjunction. If this option is absent, all segments are processed.
514@end macro
515
516
517@macro parselect
518@item @b{@minus{}@minus{}select=@var{fieldname}, @minus{}s @var{fieldname}}
Select for processing only segments in which the field named
@var{fieldname} is present. Multiple occurrences of this option are
allowed and are interpreted as conjunction of conditions. If this
option is absent, all segments are processed.
523@end macro
524
525
526@macro parunselect
527@item @b{@minus{}@minus{}unselect=@var{fieldname}, @minus{}S @var{fieldname}}
Select for processing only segments in which the field @var{fieldname}
is absent.  Multiple occurrences of this option are allowed and are
interpreted as conjunction of conditions. If this option is absent,
all segments are processed.
532@end macro
533
534
535@macro paroneline
536@item @b{@minus{}@minus{}one-line}
This option makes the program print ambiguous annotation in one output
line by generating multiple annotation fields. By default, when
ambiguous annotation may be produced for a segment, the segment is
duplicated and each of the annotations is added to a separate copy of
the segment.
542@end macro
543
544
545@macro paronefield
546@item @b{@minus{}@minus{}one-field, @minus{}1}
This option makes the program print ambiguous annotation in one
annotation field. By default, when ambiguous annotation may be produced
for a segment, the segment is duplicated and each of the
annotations is added to a separate copy of the segment.
551
552This option is useful when working with @command{kot} or @command{con}.
553@end macro
554
555
556@c ---------------------------------------------------------------------
557@c CONFIGURATION FILES
558@c ---------------------------------------------------------------------
559
560@node    Configuration files
561@chapter Configuration files
562
Values for all command line options accepted by a component
may be set in configuration files. The default locations of the
configuration files for a component named @command{@var{program}} are
566
567@example
[246900a]568        @file{/usr/local/etc/utt/@var{program}.conf}
[25ae32e]569@end example
570
571for system-wide configuration file and
572
573@example
[246900a]574        @file{~/.utt/@var{program}.conf}
[25ae32e]575@end example
576
577for user configuration file.
578
579@c The configuration file to load may be also specified with the
580@c @option{--config} option. Configuration file need not be provided.
581
582For each option, the value is set according to the following priority:
583
584@itemize
585@item command line
586@c @item configuration file indicated with @option{--config} option
587@item user configuration file (or configuration file indicated with the @option{--config} option)
588@item system-wide configuration file
589@end itemize
590
591Parameter values are specified in the following format:
592
593@var{parametername}=@var{value}
594
595where @var{parametername} is the short or long name of an option accepted by
596the program, or
597
598@var{parametername}
599
600if the option does not need arguments.
601
602You can introduce comments to configuration files using the # sign.
603
If a program accepts multiple occurrences of an option (e.g. the @option{--select} option of @command{lem}), you can specify them on separate lines of the program's configuration file.
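The configuration file syntax and the priority order can be illustrated with a short Python sketch. It does not reproduce the actual UTT option parser: the name @code{parse_conf} is hypothetical, the file contents in the usage note are made up, and treating everything after a @code{#} as a comment is an assumption.

```python
# Illustrative sketch, not part of UTT: read "name=value" / "name" lines
# and combine the three sources in priority order.
from collections import ChainMap


def parse_conf(text):
    """Return {option name: [values]}; options without arguments map to [True]."""
    options = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (assumption: anywhere on a line)
        if not line:
            continue
        name, sep, value = line.partition("=")
        options.setdefault(name.strip(), []).append(value.strip() if sep else True)
    return options


def effective_options(command_line, user_conf, system_conf):
    """The first mapping that defines an option wins: command line, user, system."""
    return ChainMap(command_line, user_conf, system_conf)
```

For example, if the system-wide file sets @code{dictionary} and @code{one-line}, the user file sets @code{select} twice and @code{dictionary} once, and the command line gives another @code{dictionary}, the effective @code{dictionary} is the command-line one, while @code{select} keeps both user values.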
605
606@c The equal sign may be omitted.
607
608
@quotation Tip
If you have two (or more) frequently used sets of options for the same
program (e.g. lem with the PMDBF dictionary and lem with a user dictionary),
a good solution is to create two soft links to lem, called
e.g. lemg and lemu, and specify their configuration in the files lemg.conf
and lemu.conf respectively.
@end quotation
616
617@c ---------------------------------------------------------------------
618@c COMPONENTS
619@c ---------------------------------------------------------------------
620
621@node UTT components
622@chapter UTT components
623
624UTT components are of three types:
625
626@menu
627Sources: programs which read non-UTT data (e.g. raw text) and produce output
628in UTT format
629* tok::         a tokenizer
630
631Filters: programs which read and produce UTT-formatted data
632* lem::         a morphological analyzer
633* gue::         a morphological guesser
[261bf62]634* cor::         a simple spelling corrector
* kor::         a more elaborate spelling corrector
* sen::         a sentence splitter
637* ser::         a pattern search tool (marks matches)
[261bf62]638* mar::         a pattern search tool (introduces arbitrary markers into the text)
[25ae32e]639* grp::         a pattern search tool (selects sentences containing a match)
[261bf62]640@c * gph::         a word-graph annotation tool::
641@c * dgp::         a dependency parser
[25ae32e]642
643Sinks: programs which read UTT data and produce output in another format
644* kot::         an untokenizer
645* con::         a concordance table generator
646@end menu
647
648@c ---------------------------------------------------------------------
649@c TOK
650@c ---------------------------------------------------------------------
651
652@page
653@node tok
654@section tok - a tokenizer
655
656@c ----------------------------------------
657
658@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[9ace5d2]659@item @strong{Authors:}                 @tab Tomasz Obrębski
[25ae32e]660@item @strong{Component category:}      @tab source
[261bf62]661@item @strong{Input format:}            @tab raw text file
662@item @strong{Output format:}           @tab UTT regular
663@item @strong{Required annotation:}     @tab -
[25ae32e]664@end multitable
665
666
667@menu
668* tok description::
669* tok input::
670* tok output::
671* tok command line options::
672* tok example::
673@end menu
674
675@node tok description
676@subsection Description
677
678@code{tok} is a simple program which reads a text file and identifies
679tokens on the basis of their orthographic form.  The type of the token
680is printed as the @var{type} field.
681
682@node tok input
683@subsection Input
684
685Raw text.
686
687@node tok output
688@subsection Output
689
690UTT-file with four fields: @var{start}, @var{length}, @var{type}, and @var{form}. In the @var{type} field five types of tokens are distinguished:
691
692@itemize
693
694@item @code{W}
695(word)
696- continuous sequence of letters
697
698@item @code{N}
699(number)
700- continuous sequence of digits
701
702@item @code{S}
703(space)
704- continuous sequence of space characters
705
706@item @code{P}
707(punctuation mark)
- single printable character not belonging to any of the other classes
709
710@item @code{B}
711(unprintable character)
712- single unprintable character
713
714@end itemize
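The five token classes can be illustrated with a minimal tokenizer sketch in Python. This is not the implementation of @command{tok}: the function name is hypothetical, Python's own character classes stand in for those used by the real program, positions are counted in characters (which equals bytes for 1-byte encodings), and unprintable characters are copied to the @var{form} field unchanged because this manual does not specify their representation.

```python
# Illustrative sketch, not the real tok: classify characters into W/N/S/P/B.
def tok(text):
    def kind(ch):
        if ch.isalpha():
            return "W"
        if ch.isdigit():
            return "N"
        if ch.isspace():
            return "S"
        if ch.isprintable():
            return "P"
        return "B"

    def encode(chunk):
        # Special characters of the form field, as listed in the file format chapter.
        table = {" ": "_", "\n": "\\n", "\t": "\\t", "\r": "\\r",
                 "_": "\\_", "*": "\\*", "\\": "\\\\"}
        return "".join(table.get(c, c) for c in chunk)

    segments = []
    i = 0
    while i < len(text):
        k = kind(text[i])
        j = i + 1
        if k in "WNS":  # these classes form continuous sequences
            while j < len(text) and kind(text[j]) == k:
                j += 1
        segments.append("%04d %02d %s %s" % (i, j - i, k, encode(text[i:j])))
        i = j
    return segments
```

Each returned string is one segment line with the four fields @var{start}, @var{length}, @var{type} and @var{form}.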
715
716
717
718@node tok command line options
719@subsection Command line options
720
721@table @code
722
723@item @b{@minus{}@minus{}help}, @b{@minus{}h}
724Print help.
725
726@item @b{@minus{}@minus{}version}, @b{@minus{}V}
727Print version information.
728
729@item @b{@minus{}@minus{}interactive, @minus{}i}
730This option toggles interactive mode, which is by default off. In the
731interactive mode the program does not buffer the output.
732
733@end table
734
735@node tok example
736@subsection Example
737
738Input:
739
@example
Piszemy dobre programy.
@end example
743
744Output:
745
@example
0000 07 W Piszemy
0007 01 S _
0008 05 W dobre
0013 01 S _
0014 08 W programy
0022 01 P .
0023 01 S \n
@end example
755
756
757@c ---------------------------------------------------------------------
758@c SEN
759@c ---------------------------------------------------------------------
760
761@c @node sen - sentencizer
762@c @chapter sen - sentencizer
763
[9ace5d2]764@c Authors: Tomasz Obrębski
[25ae32e]765
766@c ---------------------------------------------------------------------
767@c LEM
768@c ---------------------------------------------------------------------
769
770@page
771@node lem
772@section lem - morphological analyzer
773
774@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[9ace5d2]775@item @strong{Authors:}                 @tab Tomasz Obrębski, Michał Stolarski
[25ae32e]776@item @strong{Component category:}      @tab filter
[261bf62]777@item @strong{Input format:}            @tab UTT regular
778@item @strong{Output format:}           @tab UTT regular
779@item @strong{Required annotation:}     @tab tok
[25ae32e]780@end multitable
781
782@menu
783* lem description::             
784* lem command line options::   
785* lem input::
786* lem output::
787* lem example::                 
788* lem dictionaries::           
789* lem hints::           
790@end menu
791
792@node lem description
793@subsection Description
794
795@command{lem} performs morphological analysis of a simple orthographic
796word, returning all its possible morphological annotations,
797disregarding the context.
798
799@c ----------------------------------------
800
801@node lem command line options
802@subsection Command line options
803
804@table @code
805@parhelp
806@parversion
807@parinteractive
808@c @parfile
809@c @paroutput
810@c @parfail
811@c @parcopy
812@parinputfield
813@paroutputfield
814@pardictionary
815@parprocess
816@parselect
817@parunselect
818@paroneline
819@paronefield
820@end table
821
822@c ----------------------------------------
823
824@node lem input
825@subsection Input
826
@command{lem} reads a UTT file and processes the value of the @var{form} field
(the input field may be changed with the @option{--input-field} option).
829
830@node lem output
831@subsection Output
832
@command{lem} adds a new annotation field, whose default name is @code{lem}.  In
case of ambiguity, either the segment is duplicated (default),
multiple @code{lem} fields are added (@option{--one-line}), or the ambiguous
annotation is produced as the value of a single @code{lem} field
(@option{--one-field}, @option{-1}):
838
839@itemize @bullet
840
841@item
842unambiguous value format:
843
@example
   <lemma>,<descr>
@end example
847
848@item
849ambiguous value format (@option{--one-field} option)
850
851
@example
   <lemma>,<descr>[,<descr>][;<lemma>,<descr>[,<descr>]]
@end example
855
856(alternative descriptions for the same lemma are separated by commas,
857alternative lemmata are separated by semicolons.)
858
859@end itemize
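A program consuming the @option{--one-field} output has to split the value back into lemmata and descriptions. A minimal Python sketch follows; the name @code{parse_lem_value} is hypothetical, and the sketch assumes that lemmata and descriptions themselves contain neither commas nor semicolons.

```python
# Illustrative sketch, not part of UTT: split the value of a lem field.
def parse_lem_value(value):
    """Return a list of (lemma, [descriptions]) pairs."""
    readings = []
    for alternative in value.split(";"):           # alternative lemmata
        lemma, *descriptions = alternative.split(",")  # descriptions of one lemma
        readings.append((lemma, descriptions))
    return readings
```

An unambiguous value is simply the special case of one lemma with one description.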
860
861@node lem example
862@subsection Example
863
864Input:
865
@example
0000 07 W Piszemy
0007 01 S _
0008 05 W dobre
0013 01 S _
0014 08 W programy
0022 01 P .
0023 01 S \n
@end example
875
876Output (default):
877
@example
0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
0007 01 S _
0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn
0008 05 W dobre lem:dobry,ADJ/DpNsCnavGn
0013 01 S _
0014 08 W programy lem:program,N/GiNpCa
0014 08 W programy lem:program,N/GiNpCn
0014 08 W programy lem:program,N/GiNpCv
0022 01 P .
0023 01 S \n
@end example
890
891Output (@option{--one-line} option):
892
@example
0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
0007 01 S _
0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn lem:dobry,ADJ/DpNsCnavGn
0013 01 S _
0014 08 W programy lem:program,N/GiNpCa lem:program,N/GiNpCn lem:program,N/GiNpCv
0022 01 P .
0023 01 S \n
@end example
902
903Output (@option{--one-field} option):
904
@example
0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
0007 01 S _
0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn,ADJ/DpNsCnavGn
0013 01 S _
0014 08 W programy lem:program,N/GiNpCa,N/GiNpCn,N/GiNpCv
0022 01 P .
0023 01 S \n
@end example
914
915@c ----------------------------------------
916
917@node lem dictionaries
918@subsection Dictionaries
919
920@command{lem} requires a dictionary. The dictionary may be provided in
921one of two formats: in text (source) format or in binary (fsa) format.
922
923@subsubheading Text format
924
925Dictionary entries have the following structure:
926
@example
<form>;<lemma>,<descr>[;<lemma>,<descr>]
@end example
930
931@var{lemma} may be given explicitly or in the cut-add format:
932
@example
@code{[<cut1><add1>-]<cut2><add2>}
@end example
936
meaning: replace the prefix of length @code{<cut1>} with the
string @code{<add1>}, and replace the suffix of length @code{<cut2>} with the string
@code{<add2>}. For example, @code{3t} transforms @samp{kocie} into
@samp{kot}, and @code{3-4ały} transforms @samp{najbielsi} into @samp{biały}.

Each dictionary entry must be written on one line and must not contain blank characters.
943
944Examples:
945@example
946kot;0,N/GaNsCn
947kota;1,N/GaNsCg;1,N/GaNsCa
948kotu;1,N/GaNsCd
949kotem;2,N/GaNsCi
950kocie;3t,N/GaNsCl;3t,N/GaNsCv
[9ace5d2]951najbielsi;3-4ały,ADJ/DsNpCnGp
952najbielsze;3-5ały,ADJ/DsNpCnGaifn
[25ae32e]953najlepsi;dobry,ADJ/DsNpCnGp
954najlepsze;dobry,ADJ/DsNpCnGaifn
955@end example
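The cut-add notation used in these entries can be expanded with a few lines of Python. The sketch is an illustration, not the code of @command{lem} or @command{compdic}: the names @code{apply_cut_add} and @code{expand_entry} are hypothetical, and a lemma field that does not begin with a digit is assumed to be an explicit lemma.

```python
# Illustrative sketch, not part of UTT: expand cut-add lemma codes.
import re

# [<cut1><add1>-]<cut2><add2>
CUT_ADD = re.compile(r"(?:(\d+)([^-\d]*)-)?(\d+)(.*)")


def apply_cut_add(form, code):
    """Return the lemma obtained from form, or code itself if it is explicit."""
    match = CUT_ADD.fullmatch(code)
    if match is None:
        return code  # lemma given explicitly (assumption: no leading digit)
    cut1, add1, cut2, add2 = match.groups()
    stem = form
    if cut1 is not None:
        stem = add1 + stem[int(cut1):]  # replace the prefix
    keep = max(0, len(stem) - int(cut2))
    return stem[:keep] + add2           # replace the suffix


def expand_entry(entry):
    """Split '<form>;<lemma>,<descr>[;...]' into (form, [(lemma, descr), ...])."""
    form, *analyses = entry.split(";")
    result = []
    for analysis in analyses:
        code, descr = analysis.split(",", 1)
        result.append((apply_cut_add(form, code), descr))
    return form, result
```

For the entry @code{kocie;3t,N/GaNsCl;3t,N/GaNsCv} both analyses expand to the lemma @samp{kot}.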
956
957
958The mandatory file name extension for a text dictionary is @code{dic}. For large
959dictionaries it is preferable, however, to compile them into binary
960(fsa) format.
961
962@subsubheading Binary format
963
964The mandatory file name extension for a binary dictionary is @code{bin}. To
965compile a text dictionary into binary format, write:
966
@example
compdic <dictionaryname>.dic <dictionaryname>.bin
@end example
970
971@subsubheading Polex/PMDBF dictionary
972
A large-coverage morphological dictionary for the Polish language, Polex/PMDBF, is included in
the distribution as the default dictionary of @command{lem}. By default it is
located in

@file{$HOME/.local/share/utt/pl_PL.ISO-8859-2/lem.bin}

in a local installation, or in

@file{/usr/local/share/utt/pl_PL.ISO-8859-2/lem.bin}

in a system installation.
[25ae32e]984
985@node lem hints
986@subsection Hints
987
[261bf62]988@subsubheading Combining data from multiple dictionaries
[25ae32e]989
[261bf62]990@itemize
[25ae32e]991
@item Apply <dict1>, then apply <dict2> to words which were not annotated.

@example
lem -d <dict1> | lem -S lem -d <dict2>
@end example

@item Add annotations from two dictionaries <dict1> and <dict2>.

@example
lem -c -d <dict1> | lem -S lem -d <dict2>
@end example
[25ae32e]1003
[261bf62]1004@end itemize
[25ae32e]1005
1006
1007@c ---------------------------------------------------------------------
1008@c GUE
1009@c ---------------------------------------------------------------------
1010
1011@page
1012@node gue
1013@section gue - morphological guesser
1014
1015@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1016
[9ace5d2]1017@item @strong{Authors:}                 @tab Michał Stolarski, Tomasz Obrębski
[25ae32e]1018@item @strong{Component category:}      @tab filter
1019
1020@end multitable
1021
1022@menu
[261bf62]1023* gue description::   
[25ae32e]1024* gue command line options::   
1025* gue example::                 
1026* gue dictionaries::           
1027@end menu
1028
[261bf62]1029
1030@node gue description
1031@subsection Description
1032
@command{gue} guesses morphological descriptions of the form contained
in the @var{form} field.
1035
1036
[25ae32e]1037@node gue command line options
1038@subsection Command line options
1039
1040@table @code
1041
1042@parhelp
1043@parversion
1044@parinteractive
1045@c @parfile
1046@c @paroutput
1047@c @parfail
1048@c @parcopy
1049@parinputfield
1050@paroutputfield
1051@pardictionary
1052@parprocess
1053@parselect
1054@parunselect
1055@paroneline
1056@paronefield
1057
1058@item @b{@minus{}@minus{}delta=@var{n}}
1059Stop displaying answers after fall of weight, that is, when weight difference between 2 subsequent results is more than delta value (default=`0.2').
1060
1061
1062@item @b{@minus{}@minus{}cut-off=@var{n}}
1063Do not display answers whose weight is lower than the cut-off value (default=`200').
1064
1065
1066@item @b{@minus{}@minus{}guess_count=@var{n}, @minus{}n @var{n}}
1067Guess up to n descriptions  (default=`0', which means 'display all results').
1068
1069
1070
1071@end table
1072
1073@node gue example
1074@subsection Example
1075
1076@example
1077command: gue -n 2
1078
1079input:
10800000 07 W smerfny
1081
1082output:
10830000 07 W smerfny gue:,ADJ/CaDpGiNs
10840000 07 W smerfny gue:,ADJ/CnvDpGaipNs
1085@end example
1086                                 
1087
1088@node gue dictionaries
1089@subsection Dictionaries
1090
1091@command{gue} requires a dictionary. For now, the dictionary must be provided in binary (fsa) format.
1092The fsa format is created by compiling text-format dictionaries.
1093
1094
1095
1096@subsubheading Text format
1097
1098Dictionary entries have the following structure:
1099
1100@example
1101@var{prefix}@code{*}@var{suffix}@code{;}@var{lemma}@code{,}@var{description}@code{:}@var{weight}
1102@end example
1103
1104@var{lemma} must be given in the cut-add format:
1105
1106@example
1107@code{[<cut1><add1>-]<cut2><add2>}
1108@end example
1109(no spaces in between): replace the prefix of length @var{cut1} with the
1110string @var{add1}, replace the suffix of length @var{cut2} with the string
1111@var{add2}.
1112
1113
[9ace5d2]1114Example: @code{3-4ały} transforms @i{najbielsi} into @i{biały} (the 3-letter prefix @i{naj} is cut, and the 4-letter suffix @i{elsi} is replaced with @i{ały}).
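
The cut-add transformation can be illustrated with a short Python
sketch. It is not taken from the @command{gue} sources: the function
name @code{apply_cut_add} is hypothetical, and the sketch assumes that
@var{cut1} and @var{cut2} are decimal numbers, that @var{add1} contains
no hyphen, and that the added strings do not begin with a digit.

@example
import re


def apply_cut_add(form, code):
    # code has the shape [<cut1><add1>-]<cut2><add2>; the prefix part is optional
    match = re.fullmatch(r"(?:(\d+)([^-]*)-)?(\d+)(.*)", code)
    if match is None:
        raise ValueError("not a cut-add code: " + code)
    cut1, add1, cut2, add2 = match.groups()
    if cut1 is not None:
        # replace the prefix of length cut1 with add1
        form = add1 + form[int(cut1):]
    # replace the suffix of length cut2 with add2
    stem = form[:len(form) - int(cut2)]
    return stem + add2


print(apply_cut_add("najbielsi", "3-4ały"))
@end example

For the example given above the sketch prints @code{biały}. A code
without the prefix part, such as @code{1a}, only rewrites the suffix.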
[25ae32e]1115
1116
1117@var{description} contains the part of speech and morphosyntactic information (@pxref{PMDBF dictionary}).
1118
1119@var{weight} is an integer value between 1 and 999 indicating the
1120likelihood of the guess.
1121
[9ace5d2]1122@c @example
1123@c *łkę;1a,N/GfNsCa
1124@c naj*elszy;3-4ały,ADJ/...:...
1125@c @end example
[25ae32e]1126
1127
1128@c ---------------------------------------------------------------------
1129@c COR
1130@c ---------------------------------------------------------------------
1131
1132@page
1133@node cor
1134@section cor - spelling corrector
1135
1136@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[9ace5d2]1137@item @strong{Authors:}                 @tab Tomasz Obrębski, Michał Stolarski
[25ae32e]1138@item @strong{Component category:}      @tab filter
[261bf62]1139@item @strong{Input format:}            @tab UTT regular
1140@item @strong{Output format:}           @tab UTT regular
1141@item @strong{Required annotation:}     @tab tok
[25ae32e]1142@end multitable
1143
[261bf62]1144@menu
1145* cor description::
1146* cor command line options::   
1147* cor dictionaries::           
1148@end menu
1149
1150
1151@node cor description
1152@subsection Description
1153
[25ae32e]1154The spelling corrector applies Kemal Oflazer's dynamic programming
1155algorithm @cite{oflazer96} to the FSA representation of the set of
1156word forms of the Polex/PMDBF dictionary. Given an incorrect
1157word form, it returns all word forms present in the dictionary whose
1158edit distance from that form does not exceed the threshold given as the parameter.
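
The effect of the threshold can be sketched in Python as follows. This
is a brute-force illustration over a plain word list, not Oflazer's
FSA-based search used by @command{cor}: it computes the classic
Levenshtein distance (insertions, deletions and substitutions only) and
assumes that the threshold is inclusive, as the wording of the
distance option (``Maximum edit distance'') suggests. The function names
are hypothetical.

@example
def edit_distance(left, right):
    # classic Levenshtein distance, computed one row at a time
    previous = list(range(len(right) + 1))
    for i, a in enumerate(left, start=1):
        current = [i]
        for j, b in enumerate(right, start=1):
            cost = 0 if a == b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def corrections(form, words, distance=1):
    # keep the dictionary words lying within the given maximum distance
    return [word for word in words if edit_distance(form, word) <= distance]


words = ["odlot", "odlotowy", "odludek"]
print(corrections("odlut", words))  # prints ['odlot']
@end example

Scanning the whole word list is far slower than traversing an
automaton, which is one reason for compiling the dictionary into
binary form.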
1159
1160
1161@node cor command line options
1162@subsection Command line options
1163
1164@table @code
1165
1166@parhelp
1167@parversion
1168@parinteractive
1169@c @parfile
1170@c @paroutput
1171@c @parfail
1172@c @parcopy
1173@parinputfield
1174@paroutputfield
1175@pardictionary
1176@parprocess
1177@parselect
1178@parunselect
1179@paroneline
1180@paronefield
1181
1182@item @b{@minus{}@minus{}distance=@var{int}, @minus{}n @var{int}}
1183Maximum edit distance (default='1').
1184
[261bf62]1185@c @item @b{@minus{}@minus{}replace, @minus{}r}
1186@c Replace original form with corrected form, place original form in the
1187@c cor field. This option has no effect in @option{--one-*} modes (default=off)
1188
[25ae32e]1189
1190@end table
1191
1192@node cor dictionaries
1193@subsection Dictionaries
1194
1195@command{cor} requires a dictionary. The dictionary has to be provided in binary (fsa) format.
1196The fsa format is created by compiling text-format dictionaries.
1197
1198@subsubheading Text format
1199
1200The @command{cor} dictionary is a list of words:
1201@example
1202odlot
1203odlotowy
1204odludek
1205@end example
1206
[261bf62]1207@subsubheading Binary format
1208
1209The mandatory file name extension for a binary dictionary is @code{bin}. To
1210compile a text dictionary into binary format, write:
1211
1212@example
[d6a59ca]1213compdic <dictionaryname>.dic <dictionaryname>.bin
[261bf62]1214@end example
1215
1216@c ---------------------------------------------------------------------
1217@c KOR
1218@c ---------------------------------------------------------------------
1219
1220@page
1221@node kor
1222@section kor - configurable spelling corrector
1223
[9ace5d2]1224@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1225@item @strong{Authors:}                 @tab Paweł Werenski, Tomasz Obrębski, Michał Stolarski
1226@item @strong{Component category:}      @tab filter
1227@item @strong{Input format:}            @tab UTT regular
1228@item @strong{Output format:}           @tab UTT regular
1229@item @strong{Required annotation:}     @tab tok
1230@end multitable
1231
1232@menu
1233* kor description::
1234* kor command line options::
1235* kor weights definition file::   
1236* kor dictionaries::           
1237@end menu
1238
1239
1240@node kor description
1241@subsection Description
1242
1243The spelling corrector applies Paweł Werenski's dynamic programming
1244algorithm to the FSA representation of the set of word forms of the
1245Polex/PMDBF dictionary. The algorithm is an extension of K. Oflazer's
1246algorithm used by @command{cor}. In the extended version it is
1247possible to assign weights to individual edit operations.
1248
1249Given an incorrect word form, it returns all word forms
1250present in the dictionary whose edit distance from that form does not exceed the
1251threshold given as the parameter.
1252
1253
1254@node kor command line options
1255@subsection Command line options
1256
1257@table @code
1258
1259@parhelp
1260@parversion
1261@parinteractive
1262@c @parfile
1263@c @paroutput
1264@c @parfail
1265@c @parcopy
1266@parinputfield
1267@paroutputfield
1268@pardictionary
1269@parprocess
1270@parselect
1271@parunselect
1272@paroneline
1273@paronefield
1274
1275@item @b{@minus{}@minus{}distance=@var{int}, @minus{}n @var{int}}
1276Maximum edit distance (default='1').
1277
1278@item @b{@minus{}@minus{}weights=@var{filename}, @minus{}w @var{filename}}
1279Edit operations' weights file.
1280
1281@c @item @b{@minus{}@minus{}replace, @minus{}r}
1282@c Replace original form with corrected form, place original form in the
1283@c cor field. This option has no effect in @option{--one-*} modes (default=off)
1284
1285
1286@end table
1287
1288
1289@node kor weights definition file
1290@subsection Weights definition file
1291
1292Example:
1293
1294@example
1295
1296%stdcor 1
1297%xchg   1
1298ż  rz 0.5
1299ch h  0.5
1300u  ó  0.5
1301
1302@end example
1303
1304
1305The default weight is set to 1 (@code{%stdcor 1}), the weight of the exchange
1306operation is set to 1 (@code{%xchg 1}), and the three principal orthographic
1307errors are assigned the weight 0.5.
1308
1309An edit operation weight declaration such as
1310
1311@example
1312ż  rz 0.5
1313@end example
1314
1315works in both directions, i.e. ż->rz and rz->ż.
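
How such weights can enter an edit distance is sketched below in
Python. This is an illustration only, not the algorithm implemented in
@command{kor}: it is a plain dynamic programming table over two strings
rather than a search in an automaton, it assumes that the exchange
operation means swapping two adjacent characters, and the names
@code{parse_weights} and @code{weighted_distance} are hypothetical. The
sample data uses the three common Polish spelling confusions (ż/rz,
ch/h, u/ó).

@example
def parse_weights(text):
    # lines starting with % set named defaults, other lines declare rules
    settings = dict(stdcor=1.0, xchg=1.0)
    rules = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("%"):
            settings[fields[0][1:]] = float(fields[1])
        else:
            rules.append((fields[0], fields[1], float(fields[2])))
    return settings, rules


def weighted_distance(source, target, rules, stdcor=1.0, xchg=1.0):
    # every declared rule is usable in both directions
    pairs = []
    for left, right, weight in rules:
        pairs.append((left, right, weight))
        pairs.append((right, left, weight))
    rows, cols = len(source) + 1, len(target) + 1
    inf = float("inf")
    dist = [[inf] * cols for _ in range(rows)]
    dist[0][0] = 0.0
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                continue
            best = inf
            if i > 0:
                best = min(best, dist[i - 1][j] + stdcor)    # deletion
            if j > 0:
                best = min(best, dist[i][j - 1] + stdcor)    # insertion
            if i > 0 and j > 0:
                step = 0.0 if source[i - 1] == target[j - 1] else stdcor
                best = min(best, dist[i - 1][j - 1] + step)  # match or substitution
            if (i > 1 and j > 1 and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                best = min(best, dist[i - 2][j - 2] + xchg)  # adjacent swap
            for old, new, weight in pairs:
                # a declared rule rewrites a whole substring at its own weight
                if source[:i].endswith(old) and target[:j].endswith(new):
                    best = min(best, dist[i - len(old)][j - len(new)] + weight)
            dist[i][j] = best
    return dist[-1][-1]


WEIGHTS = """
%stdcor 1
%xchg   1
ż  rz 0.5
ch h  0.5
u  ó  0.5
"""

settings, rules = parse_weights(WEIGHTS)
print(weighted_distance("morze", "może", rules,
                        settings["stdcor"], settings["xchg"]))  # prints 0.5
@end example

Under these assumptions a typical orthographic error costs 0.5, so it
stays within the default maximum distance of 1 together with one more
such error, while two unrelated substitutions would not.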
1316
1317The default weights definition file for @code{kor} is:
1318
1319@example
1320$HOME/.local/share/utt/weights.kor
1321@end example
1322
1323or, if the above mentioned file is absent:
1324
1325@example
1326/usr/local/share/utt/weights.kor
1327@end example
1328
1329
1330@node kor dictionaries
1331@subsection Dictionaries
1332
1333@xref{cor dictionaries}.
[261bf62]1334
1335@c ---------------------------------------------------------------------
1336@c SEN
1337@c ---------------------------------------------------------------------
1338
[25ae32e]1339@page
1340@node sen
1341@section sen - a sentensizer
1342
1343@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1344
[9ace5d2]1345@item @strong{Authors:}                 @tab Tomasz Obrębski
[25ae32e]1346@item @strong{Component category:}      @tab filter
[261bf62]1347@item @strong{Input format:}            @tab UTT regular
1348@item @strong{Output format:}           @tab UTT regular
1349@item @strong{Required annotation:}     @tab tok
[25ae32e]1350
1351@end multitable
1352
1353
1354@menu
[261bf62]1355* sen description::
[25ae32e]1356@c * sen input::
1357@c * sen output::
1358* sen example::                 
1359@end menu
1360
[261bf62]1361@node sen description
1362@subsection Description
1363
1364@command{sen} detects sentence boundaries in UTT-formatted texts and marks them with special zero-length segments, in which the @var{type} field may contain the BOS (beginning of sentence) or EOS (end of sentence) annotation.
1365
[25ae32e]1366@node sen example
1367@subsection Example
1368
1369@example
1370command: sen
1371
1372input:
[9ace5d2]13730000 05 W Cześć
[25ae32e]13740005 01 P !
13750006 01 S _
13760007 02 W To
13770009 01 S _
13780010 02 W ja
13790012 01 P .
13800013 01 S \n
1381
1382output:
13830000 00 BOS *
[9ace5d2]13840000 05 W Cześć
[25ae32e]13850005 01 P !
13860006 00 EOS *
13870006 00 BOS *
13880006 01 S _
13890007 02 W To
13900009 01 S _
13910010 02 W ja
13920012 01 P .
13930013 01 S \n
13940014 00 EOS *
1395@end example
1396
1397
1398@c ---------------------------------------------------------------------
1399@c GPH
1400@c ---------------------------------------------------------------------
1401
1402@c @node gph - graphizer
1403@c @chapter gph - graphizer
1404
[9ace5d2]1405@c Authors: Tomasz Obrębski
[25ae32e]1406
1407
1408
1409@c ---------------------------------------------------------------------
[261bf62]1410@c SER
[25ae32e]1411@c ---------------------------------------------------------------------
1412
1413@page
1414@node ser
1415@section ser - pattern search tool
1416
1417@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[9ace5d2]1418@item @strong{Authors:}                 @tab Tomasz Obrębski
[25ae32e]1419@item @strong{Component category:}      @tab filter
[261bf62]1420@item @strong{Input format:}            @tab UTT regular
1421@item @strong{Output format:}           @tab UTT regular
1422@item @strong{Required annotation:}     @tab tok,  lem --one-field
[25ae32e]1423@end multitable
1424
1425@menu
[261bf62]1426* ser description::
[25ae32e]1427* ser command line options::   
1428* ser pattern::                 
1429* ser how ser works::           
1430* ser customization::           
1431* ser limitations::             
1432* ser requirements::           
1433@end menu
1434
1435
[261bf62]1436@node ser description
1437@subsection Description
1438
1439@command{ser} looks for patterns in UTT-formatted texts.
1440
1441
[25ae32e]1442@c ---------------------------------------------------------------------
1443@node ser command line options
1444@subsection Command line options
1445
1446@table @code
1447
1448@parhelp
1449@parversion
1450@c @parfile
1451@c @paroutput
1452@c @parinputfield
1453@c @paroutputfield
1454@parprocess
1455@parinteractive
1456
1457@item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}}
1458The search pattern.
1459
1460@item @b{@minus{}@minus{}morph=@var{field}}
1461The name of the annotation field containing the morphological
1462description (default @code{lem}).
1463
1464@item @b{@minus{}@minus{}flex}
1465Only print the generated flex source code.
1466
1467@item @b{@minus{}@minus{}macro=@var{filename}}
1468Read macro definitions from file @var{filename} rather than from
1469the default location. This option makes it possible to redefine the set of terms.
1470
1471@item @b{@minus{}@minus{}define=@var{filename}}
1472Append macro definitions from file @var{filename}. This option
1473makes it possible to extend the set of terms.
1474
1475@end table
1476
1477
1478@c ---------------------------------------------------------------------
1479@node ser pattern
1480@subsection Pattern
1481
1482The @command{ser} pattern is a regular expression over terms corresponding
1483to text segments or segment sequences. Predefined terms are:
1484
1485@table @code
1486
1487@item seg(@var{t},@var{f},@var{a})
1488a segment of type @var{t}, containing form @var{f} and annotation
1489@var{a}
1490
1491@item form(@var{f})
1492a segment containing form @var{f}
1493
1494@item field(@var{f})
1495a segment containing annotation field @var{f}
1496
1497@item space(@var{f})
1498a space segment of form @var{f}
1499
1500@item word(@var{f})
1501a word segment of form @var{f}
1502
1503@item punct(@var{f})
1504a punct segment of form @var{f}
1505
1506@item number(@var{f})
1507a number segment of form @var{f}
1508
1509@item lexeme(@var{f})
1510a word segment with lemma @var{f}
1511
1512@item cat(@var{c})
1513a word segment of category @var{c}
1514
1515@end table
1516
1517All arguments are optional. If an argument is omitted, an arbitrary
1518string of non-blank characters is assumed as the argument value. Term
1519arguments may be arbitrary character-level regular expressions. The
1520following special symbols can be used:
1521
1522@multitable {aaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1523@item @code{[@dots{}]}            @tab a character class
1524@item @code{[^@dots{}]}           @tab a negated character class
1525@item @code{|}                    @tab alternative
1526@item @code{*}                    @tab repetition, including zero times
1527@item @code{+}                    @tab repetition, at least one time
1528@item @code{?}                    @tab optionality
1529@item @code{@{@var{m},@var{n}@}}  @tab repetition from @var{m} to @var{n} times
1530@item @code{@{@var{m},@}}         @tab repetition @var{m} or more times
1531@item @code{@{@var{m}@}}          @tab repetition @var{m} times
1532@item @code{\@var{ddd}}           @tab the character with octal value @var{ddd}
1533@item @code{\x@var{hh}}           @tab the character with hexadecimal value @var{hh}
1534@item @code{( )}                  @tab parentheses, used to override precedence
1535@c @end multitable
1536
1537@c @multitable {aaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1538@item @code{.}    @tab a non-blank character
1539@item @code{\w}   @tab a letter
1540@item @code{\W}   @tab a non-blank character other than a letter
1541@item @code{\d}   @tab a digit
1542@item @code{\D}   @tab a non-blank character other than a digit
1543@item @code{\s}   @tab a space or tab character
1544@item @code{\S}   @tab a non-blank character (the same as @code{.})
1545@item @code{\l}   @tab a lowercase letter
1546@item @code{\L}   @tab an uppercase letter
1547@end multitable
1548
1549
1550@noindent The following characters:
1551@example
1552@verb{%  [   ]   ^   |   *   +   ?   {   }   ,   .   <   >   \ %}
1553@end example
1554must be escaped with a backslash, i.e. written as:
1555@example
1556@verb{% \[  \]  \^  \|  \*  \+  \?  \{  \}  \,  \.  \<  \>  \\ %}
1557@end example
1558
1559@quotation Note
1560The special symbols are borrowed from Perl with minor
1561modifications made for convenience, so
1562the meaning of certain special characters and sequences differs slightly
1563from their common meaning.
1564In particular, the meaning of the @code{.} special character is modified due to
1565the special function of spaces in UTT files (they are field
1566separators). Use @code{\s} to explicitly match a space or tab character.
1567@end quotation
1568
1569In the argument of the @code{cat} term a special operator @code{<...>} may be
1570used. A category specification enclosed in angle brackets matches all
1571category descriptions which are consistent (non-contradictory) with the
1572specification. For example, @code{<N>} matches all noun descriptions, and
1573@code{<ADJ/Can>} matches all adjectives in the accusative or nominative case.
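
The notion of consistency can be sketched in Python as follows. This
is an illustration under stated assumptions, not the code generated by
@command{ser}: it assumes that a description consists of a part of
speech, a slash, and attributes written as one capital letter followed
by its lowercase values (as in @code{ADJ/CnvDpGaipNs}), and that an
attribute absent from the description does not contradict the
specification. The function names are hypothetical.

@example
import re


def parse_description(text):
    # "ADJ/CnvDpGaipNs" gives "ADJ" plus a mapping: C to nv, D to p, G to aip, N to s
    pos, _, rest = text.partition("/")
    attributes = dict()
    for name, values in re.findall(r"([A-Z])([a-z0-9]*)", rest):
        attributes[name] = set(values)
    return pos, attributes


def consistent(spec, description):
    spec_pos, spec_attrs = parse_description(spec)
    pos, attrs = parse_description(description)
    if spec_pos != pos:
        return False
    for name, wanted in spec_attrs.items():
        # a shared attribute must have at least one value in common
        if name in attrs and not (wanted & attrs[name]):
            return False
    return True


print(consistent("ADJ/Can", "ADJ/CnvDpGaipNs"))  # prints True
print(consistent("ADJ/Can", "ADJ/CgDpGiNs"))     # prints False
@end example

In the first call the case values @code{a}, @code{n} of the
specification and @code{n}, @code{v} of the description share
@code{n}, so the two are consistent; in the second call they share no
case value.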
1574
1575
1576@*
1577@noindent @b{Examples of one-segment patterns:}
1578
1579@multitable {aaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1580@item @code{seg}            @tab any segment
1581@item @code{word}           @tab any word-form
1582@item @code{word(pomocy)}   @tab the word-form @samp{pomocy}
1583@item @code{word(naj.+)}    @tab a word-form beginning with @samp{naj}
1584@item @code{word(\L\l+)}    @tab a capitalized word-form
1585@item @code{punct}          @tab a punctuation character
1586@item @code{space(.*\\n.*)} @tab a space segment containing a newline character
1587@item @code{lexeme(pomoc)}  @tab any form of the lexeme 'pomoc'
1588@item @code{cat(N/.*)}      @tab a word whose category starts with @code{N/}
1589@item @code{cat(<N/Ca>)}    @tab a word whose category matches @code{N/Ca}
1590@end multitable
1591
1592@*
1593@noindent @b{Examples of multi-segment patterns:}
1594
1595@table @code
1596
1597@item (word(\L) punct(\.) space?)+ word(\L\l+)
1598a sequence of initials followed by a surname
1599
1600@item punct seg(W|S|N)* cat(<NPRO/Sr>) seg(W|S|N)* punct
1601a text fragment between two punctuation characters, containing an
1602occurrence of a relative pronoun
1603
1604@end table
1605
1606
1607@node ser how ser works
1608@subsection How ser works
1609
1610@node ser customization
1611@subsection Customization
1612
1613@c All predefined terms correspond to single segments,
1614
1615@example
[261bf62]1616define(`verbseq', `(cat(<V>) (space cat(<V>)))')
[25ae32e]1617@end example
1618
1619
1620the term @code{cat()} may not be used as a ... of
1621
1622@c See @command{m4} manual for further details on macro definition format.
1623
1624@node ser limitations
1625@subsection Limitations
1626
[261bf62]1627Do not use more than 3 attributes in <>.
[25ae32e]1628
1629@node ser requirements
1630@subsection Requirements
1631
1632In order to run @command{ser}, the following programs must be
1633installed in the system:
1634
1635@itemize
1636
1637@item @command{m4}
1638@item @command{grep}
1639@item @command{flex}
1640@item @command{gcc}
1641
1642@end itemize
1643
1644
1645@c ---------------------------------------------------------------------
[261bf62]1646@c GRP
[25ae32e]1647@c ---------------------------------------------------------------------
1648
1649@page
1650@node grp
1651@section grp - pattern search tool
1652
1653@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[9ace5d2]1654@item @strong{Authors:}                 @tab Tomasz Obrębski
[25ae32e]1655@item @strong{Component category:}      @tab filter
[261bf62]1656@item @strong{Input format:}            @tab UTT flattened
1657@item @strong{Output format:}           @tab UTT flattened
1658@item @strong{Required annotation:}     @tab tok, sen, lem --one-field
[25ae32e]1659@end multitable
1660
1661
[261bf62]1662@menu
1663* grp description::
1664* grp command line options::   
1665* grp pattern::                 
1666* grp hints::   
1667@end menu
1668
1669
1670@node grp description
1671@subsection Description
1672
[25ae32e]1673@code{grp} selects sentences containing an expression matching a
1674pattern. The pattern format is exactly the same as that accepted by
1675@code{ser}.
1676
1677@code{grp} is intended mainly for speeding up the corpus search process.
1678It is extremely fast (processing speed is usually higher than the speed
1679of reading the corpus file from disk).
1680
1681@node grp command line options
1682@subsection Command line options
1683
1684@table @code
1685
1686@parhelp
1687@parversion
1688@parprocess
1689@parinteractive
1690
1691@item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}}
1692The search pattern.
1693
1694@item @b{@minus{}@minus{}morph=@var{field}}
1695The name of the annotation field containing the morphological
1696description (default @code{lem}).
1697
1698@item @b{@minus{}@minus{}command}
1699Only print the generated flex source code.
1700
1701@item @b{@minus{}@minus{}macro=@var{filename}}
1702Read macro definitions from file @var{filename} rather than from
1703the default location. This option makes it possible to redefine the set of terms.
1704
1705@item @b{@minus{}@minus{}define=@var{filename}}
1706Append macro definitions from file @var{filename}. This option
1707makes it possible to extend the set of terms.
1708
1709@end table
1710
1711
1712@node grp pattern
1713@subsection Pattern
1714
1715(see @code{ser})
1716
1717@node grp hints
1718@subsection Hints
1719
1720The corpus search speed may be increased by combining @command{grp} with the @command{lzop}
1721compression tool (@command{grp} usually processes data faster than it is read from a
1722disk, especially for slow laptop drives).
1723
1724@example
[e28a625]1725cat corpus | tok | sen | lem -1 | fla | lzop -7 > corpus.grp.lzo
[25ae32e]1726@end example
1727
1728@example
[e28a625]1729lzop -cd corpus.grp.lzo | grp -e @var{EXPR} | unfla | ser -e @var{EXPR}
[25ae32e]1730@end example
1731
1732
[261bf62]1733
[25ae32e]1734@c ---------------------------------------------------------------------
[261bf62]1735@c MAR
[25ae32e]1736@c ---------------------------------------------------------------------
[261bf62]1737
1738@page
1739@node mar
1740@section mar
1741
1742@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[9ace5d2]1743@item @strong{Authors:}                 @tab Marcin Walas, Tomasz Obrębski
[e28a625]1744@item @strong{Input format:}            @tab UTT flattened
1745@item @strong{Output format:}           @tab UTT flattened
1746@item @strong{Required annotation:}     @tab tok, sen, lem -1
[261bf62]1747@end multitable
1748
[2d89d4b]1749@subsection Description
1750@code{mar} is a Perl script which matches a given pattern against UTT-formatted text
1751and tags the matching parts with any number of user-defined tags.
1752
1753@subsection Command line options
1754@table @code
1755@parhelp
1756@parversion
1757
1758@item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}}
1759The search pattern.
1760@item @b{@minus{}@minus{}action=@var{action}, @minus{}a @var{action} [p] [s] [P]}
1761Perform only the indicated actions, where:
1762@multitable {aaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1763@item @code{p}   @tab preprocess
1764@item @code{s}   @tab search
1765@item @code{P}   @tab postprocess
1766@end multitable
1767default: psP
1768
1769@item @b{@minus{}@minus{}command}
1770print generated sed command, then exit
1771
1772@item @b{@minus{}@minus{}help, @minus{}h}
1773print help, then exit
1774
1775@item @b{@minus{}@minus{}version, @minus{}v}
1776print version, then exit
1777@end table
1778@subsection Tokens in pattern
1779A @code{mar} pattern is a @code{ser} pattern (@pxref{ser pattern})
1780to which any number of matching tags may be added. Each tag is printed in exactly the place where
1781it was put in the pattern. A valid token starts with @@ followed by any number of alphanumeric
1782characters. For example, valid match tokens are: @@STARTMATCH @@ENDMATCH
1783
1784Matching tokens can be placed between, before or after any of the @code{ser} pattern terms. They do not have
1785to be paired. There can be any number of them in the pattern (zero or more). They do not have to be unique.
1786They can be placed one after another. For example:
1787
1788@multitable {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaa}
1789@item @code{@@BOM lexeme(pomoc)}  @tab place tag @b{BOM} before any form of the lexeme 'pomoc'
1790@item @code{@@MATCH lexeme(pomoc) @@MATCH}      @tab place tag @b{MATCH} before and after any form of the lexeme 'pomoc'
1791@item @code{cat(<ADJ>) @@MATCH lexeme(pomoc) @@MATCH}      @tab place tag @b{MATCH} before and after any form of the lexeme 'pomoc' which is preceded by an adjective
1792@item @code{cat(<ADJ>) @@TAG @@BOM lexeme(pomoc) @@EOM}      @tab place tags @b{TAG} and @b{BOM} before any form of the lexeme 'pomoc' which is preceded by an adjective, and tag @b{EOM} after it
1793@end multitable
1794
1795(See the help of @command{mar}, @samp{mar -h}, for more information.)
1796
1797@subsection How mar works
1798@code{mar} translates the given @code{ser} pattern into a regular expression with the @code{m4} macro processor. The regular expression is then turned into a @code{sed} command script, which is executed.
1799
1800You can see the translated sed script by using the @code{@minus{}@minus{}command} option.
1801@subsection Limitations
1802The complexity of computations performed by @code{mar} increases linearly with the number of placed tokens, so it is highly recommended not to place too many tokens.
1803@subsection Requirements
1804In order to run @code{mar}, the following programs must be installed in the system:
1805
1806@itemize
1807
1808@item @command{m4}
1809@item @command{grep}
1810@item @command{sed}
1811
1812@end itemize
1813
[261bf62]1814
[e28a625]1815
[261bf62]1816@c ---------------------------------------------------------------------
1817@c KOT
[25ae32e]1818@c ---------------------------------------------------------------------
1819
1820@page
1821@node kot
1822@section kot - untokenizer
1823
[261bf62]1824@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[9ace5d2]1825@item @strong{Authors:}                 @tab Tomasz Obrębski
[261bf62]1826@item @strong{Component category:}      @tab filter
1827@item @strong{Input format:}            @tab UTT regular
1828@item @strong{Output format:}           @tab text
1829@item @strong{Required annotation:}     @tab tok
1830@end multitable
[25ae32e]1831
1832
1833@menu
[261bf62]1834* kot description::
[25ae32e]1835* kot command line options::   
1836* kot usage examples::   
1837@end menu
1838
[261bf62]1839@node kot description
1840@subsection Description
1841
1842@command{kot} transforms a UTT formatted file back into raw text format.
1843
[25ae32e]1844@node kot command line options
1845@subsection Command line options
1846
1847@table @code
1848
1849@parhelp
1850
1851@c @item @b{@minus{}@minus{}version}, @b{@minus{}v}
1852
1853@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
1854
1855@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
1856
1857@c @item @b{@minus{}@minus{}interactive @minus{}i}
1858
1859@c @item @b{@minus{}@minus{}config=@var{filename}}
1860
1861@item
1862
1863@item @b{@minus{}@minus{}gap-fill=@var{string}, @minus{}g @var{string}}
1864print @var{string} between nonadjacent segments of the input file
1865
1866@item @b{@minus{}@minus{}spaces, @minus{}r}
1867retain the special characters @code{_}, @code{\t},
1868@code{\n}, @code{\r}, @code{\f} unexpanded in the output
1869
1870@end table
1871
1872@node kot usage examples
1873@subsection Usage examples
1874
1875@example
1876cat legia.txt | tok | kot       
1877@end example
1878
1879@example
1880cat legia.txt | tok | lem -1 | kot
1881@end example
1882
[261bf62]1883@c ---------------------------------------------------------------
1884@c CON
1885@c ---------------------------------------------------------------
1886
[25ae32e]1887
1888@page
1889@node con
1890@section con - concordance table generator
1891
1892@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1893@item @strong{Authors:}                 @tab Justyna Walkowska
1894@item @strong{Component category:}      @tab sink
[261bf62]1895@item @strong{Input format:}            @tab UTT regular
1896@item @strong{Output format:}           @tab text
1897@item @strong{Required annotation:}     @tab ser or mar
[25ae32e]1898@end multitable
1899@c
1900
1901@menu
[261bf62]1902* con description::
[25ae32e]1903* con command line options::
1904* con usage example::
1905* con hints::   
1906@end menu
1907
[261bf62]1908
1909@node con description
1910@subsection Description
1911
1912@command{con} generates a concordance table based on a pattern given to @command{ser}.
1913
1914
[25ae32e]1915@node con command line options
1916@subsection Command line options
1917
1918@table @code
1919
1920@parhelp
1921
1922@c @item @b{@minus{}@minus{}help}, @b{@minus{}h}
1923@c @item @b{@minus{}@minus{}version}, @b{@minus{}v}
1924@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
1925@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
1926@c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}} [???]
1927@c @item @b{@minus{}@minus{}copy, @minus{}c} [???]
1928@c @item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}}
1929@c @item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}}
1930@c @item @b{@minus{}@minus{}process=@var{class}, @minus{}p @var{class}}
1931@c @item @b{@minus{}@minus{}interactive @minus{}i}
1932@c @item @b{@minus{}@minus{}config=@var{filename}}
1933@c @item
1934@c @item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}}
1935@c search pattern
1936@c
1937@c @item @b{@minus{}@minus{}flex}
1938@c only print the generated flex source code
1939@c
1940@c @item @b{@minus{}@minus{}macro=@var{filename}}
1941@c read macrodefinitions from file @var{filename} rather than from
1942@c default location. This option allows to redefine the set of terms.
1943@c
1944@c @item @b{@minus{}@minus{}define=@var{filename}}
1945@c append macrodefinitions from file @var{filename}. This option
1946@c allows to extend the set of terms.
1947
1948@item @b{@minus{}@minus{}left @minus{}l}           
1949        Left context info (default='30c'). Example:
1950@example                         
1951                                 -l=5c: left context is 5 characters
1952                                 -l=5w: left context is 5 words
1953                                 -l=5s: left context is 5 non-empty input lines
1954                                 -l='\s*\S+\sr\S+BOS': left context starts with the given regex
1955@end example
1956
1957@item @b{@minus{}@minus{}right @minus{}r}           
1958        Right context info (default='30c').
1959@item @b{@minus{}@minus{}trim @minus{}t}           
1960        Clear incomplete words from output.
1961@item @b{@minus{}@minus{}white @minus{}w}           
1962        Do not change whitespace characters into spaces.
1963@item @b{@minus{}@minus{}column @minus{}c}           
1964        Left column minimal width in characters (default = 0).
1965@item @b{@minus{}@minus{}ignore @minus{}i}           
1966        Ignore segment inconsistency in the input.
[261bf62]1967@item @b{@minus{}@minus{}bom}           
[25ae32e]1968        Beginning of selected segment (regex, default='[0-9]+ [0-9]+ BOM .*').
[261bf62]1969@item @b{@minus{}@minus{}eom}           
[25ae32e]1970        End of selected segment (regex, default='[0-9]+ [0-9]+ EOM .*').
1971@item @b{@minus{}@minus{}bod}           
1972        Selected segment beginning display string (default='[').
1973@item @b{@minus{}@minus{}eod}           
1974        Selected segment end display string (default=']').
1975
1976
1977
1978@end table
1979
1980@node con usage example
1981@subsection Usage example
1982@example
[261bf62]1983cat file.txt | tok | lem -1 | ser -e 'lexeme(dom)' | con 
[25ae32e]1984@end example
1985
1986
1987@node con hints
1988@subsection Hints
1989
1990@command{con} is a rather slow program. Do not pass large amounts of
1991redundant text through this program. @command{con} works fine in the following
1992sequence:
1993
1994@example
1995... | grp -e EXPR | ser -e EXPR | con
1996@end example
1997
1998
1999@c ---------------------------------------------------------------------
2000@c ---------------------------------------------------------------------
2001
2002@page
2003@node Auxiliary tools
2004@chapter Auxiliary tools
2005
2006@menu
[d6a59ca]2007* compdic::            dictionary compiler
[25ae32e]2008* fla::                UTT file flattener
2009* unfla::              UTT file unflattener
2010@end menu
2011
2012
2013@page
[d6a59ca]2014@node compdic
2015@section compdic - the dictionary compiler
[25ae32e]2016
2017@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[9ace5d2]2018@item @strong{Authors:}                 @tab Michał Stolarski, Tomasz Obrębski
[25ae32e]2019@item @strong{Component category:}      @tab additional tool
2020@end multitable
2021@c
2022
@command{compdic} compiles dictionaries from text format (@code{.dic} extension) into binary
(FST) format (@code{.bin} extension).

The automaton representation of a dictionary is built with the OpenFst toolkit,
so OpenFst must be installed on your system for @command{compdic} to work.
[25ae32e]2029
2030Usage:
2031@example
[d6a59ca]2032        compdic <dictionaryname>.dic <dictionaryname>.bin
[25ae32e]2033@end example
2034
The command generates the file @file{<dictionaryname>.bin}.
2036
2037@c @menu
2038@c * con command line options::
2039@c * con usage example::
2040@c * con hints::   
2041@c @end menu
2042
2043
[e28a625]2044@c -------------------------------------------------------------------------------
2045@c FLA
2046@c -------------------------------------------------------------------------------
2047
[25ae32e]2048@page
2049@node fla
2050@section fla - the UTT file flattener
2051
2052@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[9ace5d2]2053@item @strong{Authors:}                 @tab Tomasz Obrębski
[e28a625]2054@item @strong{Input format:}            @tab UTT regular
2055@item @strong{Output format:}           @tab UTT flattened
2056@item @strong{Required annotation:}     @tab sen
[25ae32e]2057@end multitable
2058@c
2059
[e28a625]2060@menu
2061* fla description::
2062@c * fla command line options::
2063@c * fla usage example::
2064@end menu
2065
2066
2067@node fla description
2068@subsection Description
2069
@command{fla} ``flattens'' a UTT file by merging the segments that belong
to one sentence into a single line. Technically, end-of-line characters
('\n', ASCII code 10) within a sentence are replaced with form-feed
characters ('\f', ASCII code 12).  Flattening makes it possible to process
UTT files sentence by sentence with tools such as @command{grep} or
@command{sed} (it is used in @command{grp} and @command{mar}).
2076
Flattened files should have the suffix @code{.fla}, for example @file{thetext.utt.fla}.

Flattened files are still human-readable.
2080
2081Usage:
2082
2083@example
2084        fla [<bosregex>]
2085@end example
2086
The optional argument is a regular expression describing the segments
that should be treated as sentence beginnings (a segment qualifies if it
contains a fragment matching @code{<bosregex>}). By default, segments
containing the field @code{BOS} are used.
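
The flattening step can be sketched in Python as follows. This is only an illustration of the idea described above, not the actual implementation of @command{fla}: the function name, the sample segments and the use of the plain pattern BOS as the default test are assumptions.

```python
import re


def flatten(lines, bos_regex="BOS"):
    """Join the segments of each sentence into one line.

    A segment starts a new sentence when it contains a match for
    bos_regex. Within a sentence, segments are joined with the
    form-feed character; each sentence ends with a newline.
    Assumption: the default "BOS" stands in for fla's default test.
    """
    bos = re.compile(bos_regex)
    sentences = []
    current = []
    for line in lines:
        segment = line.rstrip("\n")
        # A sentence-beginning segment closes the sentence collected so far.
        if bos.search(segment) and current:
            sentences.append("\f".join(current))
            current = []
        current.append(segment)
    if current:
        sentences.append("\f".join(current))
    return [s + "\n" for s in sentences]


# Hypothetical input: two sentences, one segment per line.
segments = [
    "0000 00 BOS *\n",
    "0000 03 W Ala\n",
    "0003 00 BOS *\n",
    "0003 02 W ma\n",
]
for sentence in flatten(segments):
    print(repr(sentence))
```

Because every sentence ends up on a single line, a line-oriented tool can then select or reject whole sentences.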
2091
[e28a625]2092@c -------------------------------------------------------------------------------
2093@c UNFLA
2094@c -------------------------------------------------------------------------------
[25ae32e]2095
2096@page
2097@node unfla
2098@section unfla - the UTT file unflattener
2099
2100@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[9ace5d2]2101@item @strong{Authors:}                 @tab Tomasz Obrębski
[e28a625]2102@item @strong{Input format:}            @tab UTT flattened
2103@item @strong{Output format:}           @tab UTT regular
2104@item @strong{Required annotation:}     @tab -
[25ae32e]2105@end multitable
2106
[e28a625]2107@menu
2108* unfla description::
2109@c * fla command line options::
2110@c * fla usage example::
2111@end menu
2112
2113@node unfla description
2114@subsection Description
[25ae32e]2115@command{unfla} transforms a flattened UTT file, produced by
2116@command{fla}, into the regular format by restoring end-of-line
2117characters.
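
The reverse step is simple enough to sketch in a few lines of Python. As before, this is an illustration under the assumption that form-feed characters occur in a flattened file only as segment separators; the function name and the sample text are made up.

```python
def unflatten(text):
    """Restore one segment per line by turning form feeds into newlines."""
    return text.replace("\f", "\n")


# Hypothetical flattened sentence: two segments on one line.
flat = "0000 00 BOS *\f0000 03 W Ala\n"
print(unflatten(flat), end="")
```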
2118
2119
2120
2121
2122@c ---------------------------------------------------------------------
2123@c USAGE EXAMPLES
2124@c ---------------------------------------------------------------------
2125
2126@node Usage examples
2127@chapter Usage examples
2128
2129@subsubheading Simple pipelines
2130
2131@enumerate
2132
@item tokenization

@example
cat text | tok > output1
@end example

@item morphological annotation (1)

Simple dictionary-based lemmatization:

@example
cat text | tok | lem > output1
@end example
2142
2143@item morphological annotation (2)
2144
@enumerate
@item perform dictionary-based lemmatization
@item guess descriptions for words which have no annotation
@end enumerate
2147
2148@example
2149cat text | tok | lem | gue -S lem > output2
2150@end example
2151
2152@item morphological annotation (3)
2153
@enumerate
@item perform dictionary-based lemmatization
@item try to correct words with no annotation
@item perform dictionary-based lemmatization of the corrected words
@item guess descriptions for words which still have no annotation
@end enumerate
2158
2159@example
2160cat text | tok | lem | cor -p W -S lem | lem -I cor | gue -p W -S lem
2161@end example
2162@item spelling correction
2163
2164
2165
2166@example
[e28a625]2167cat text | tok | egrep ' W ' | lem | egrep -v 'lem:' | cor -1
[25ae32e]2168@end example
2169
2170@item Expression extraction
2171
2172Extraction of all occurrences of a verb followed by a form of the noun 'rozmowa'.
2173
2174@example
2175cat text | tok | lem -1 | ser -e 'cat(<V>) space lexeme(rozmowa)' -m | kot > output4
2176@end example
2177
2178@item A word in context
2179
Extraction of text fragments containing a form of the lexeme 'rozmowa' in
the context of the 5 preceding and 5 following corpus segments.
2182
2183@example
2184cat text | tok | lem -1 | ser -e 'seg@{5@} lexeme(rozmowa) seg@{5@}' -m | kot > output
2185@end example
2186
2187@item generation of concordance table (1)
2188
2189@example
2190cat text | tok | lem -1 | ser -e 'cat(<V>) space lexeme(rozmowa)' | con
2191@end example
2192
Approximate running time: 10 seconds.
2194
2195@item generation of concordance table (2)
2196
2197The same as above but much faster
2198
2199@example
2200cat text | tok | lem -1 | \
2201grp -e 'cat(<V>) space lexeme(rozmowa)' | \
2202ser -e 'cat(<V>) space lexeme(rozmowa)' | \
2203con
2204@end example
2205
Approximate running time: 2 seconds.
2207
2208@item generation of concordance table (3)
2209
Usually, one searches the same corpus repeatedly. In such a case it is
advisable to first transform the corpus data into the format required by
@command{grp}, and then use the preprocessed data.

As @command{grp} (@command{grep}) processes data faster than it can be
read from the disk drive, the search time can be shortened further by
compressing the file.  We suggest using the
@command{lzop} compressor/decompressor.
[25ae32e]2218
2219@item the fastest way to search a large corpus
2220
[e28a625]2221step 1: corpus preprocessing
[25ae32e]2222
2223@example
2224cat corpus | tok | sen | lem -1 \
[e28a625]2225| fla | lzop -7 > corpus.grp.lzo
[25ae32e]2226@end example
2227
2228step 2: search
2229
2230@example
lzop -cd corpus.grp.lzo | unfla | \
grp -e 'cat(<V>) space lexeme(rozmowa)' | \
ser -e 'cat(<V>) space lexeme(rozmowa)' | con
2233@end example
2234
2235@end enumerate
2236
[e28a625]2237@c @subsubheading More complicated configurations
[25ae32e]2238
2239
[e28a625]2240@c @example
2241@c mknod fifo1 p
2242@c mknod fifo2 p
2243@c mknod fifo3 p
2244@c mknod fifo4 p
2245@c mknod fifo5 p
2246
2247@c tok | lem -p W -e fifo1 > fifo2 &
2248@c cor -e fifo3 < fifo1 | lem > fifo4 &
2249@c gue < fifo3 > fifo5 &
2250@c sort -m fifo2 fifo4 fifo5
2251
2252@c rm fifo?
2253@c @end example
[25ae32e]2254
2255
2256@c ---------------------------------------------------------------------
2257@c ---------------------------------------------------------------------
2258
2259@c ---------------------------------------------------------------------
2260@c PMDBF DICTIONARY
2261@c ---------------------------------------------------------------------
2262
2263@node PMDBF dictionary
2264@chapter PMDBF dictionary
2265
UTT components come with lexical data derived from the Polish
Morphological Database (PMDB).
2268
2269@menu
2270* PMDBF files::   
2271* PMDBF tag structure::                 
2272* PMDBF parts of speech::           
2273* PMDBF morphosyntactic attributes::           
2274@end menu
2275
2276@node PMDBF files
2277@section Files
2278
2279@node PMDBF tag structure
2280@section Tag structure
2281
@example
pos   = [[:upper:]]+
attr  = [[:upper:]]+
val   = [[:lower:][:digit:]?!*+-] | <[^>\n]+>
descr = pos ( / ( attr val + ) + ) ?
@end example
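
As a rough illustration, the grammar above can be turned into a single regular expression and used to check tags. The Python sketch below assumes that the POSIX classes cover ASCII letters and digits only and that the spaces in the grammar are not part of a tag; the function name and the sample tags are hypothetical.

```python
import re

# Building blocks, mirroring the grammar (ASCII-only, as an assumption).
POS = r"[A-Z]+"
ATTR = r"[A-Z]+"
VAL = r"(?:[a-z0-9?!*+\-]|<[^>\n]+>)"

# descr = pos ( / ( attr val + ) + ) ?
DESCR = re.compile(POS + r"(?:/(?:" + ATTR + VAL + r"+)+)?")


def is_valid_descr(tag):
    """Return True if the whole tag conforms to the descr rule."""
    return DESCR.fullmatch(tag) is not None


# Made-up tags, used only to exercise the pattern.
for tag in ["N", "N/GfNsCn", "ADJ/Cna", "n/Gf", "N/", "N/G"]:
    print(tag, is_valid_descr(tag))
```

Each attribute may carry more than one value, which is why the tag with two case values in a row is accepted.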
2289
2290@node PMDBF parts of speech
2291@section Parts of speech
2292
2293@multitable {ADJPRP} { adjectival-passive-participle }
2294@item @code{N} @tab noun
2295@item @code{NPRO} @tab nominal-pronoun
2296@item @code{NV} @tab deverbal-noun
2297@item @code{V} @tab verb
2298@item @code{BYC} @tab byc
2299@item @code{VNI} @tab non-inflected-verb
2300@item @code{ADJ} @tab adjective
2301@item @code{ADJPAP} @tab adjectival-passive-participle
2302@item @code{ADJPRP} @tab adjectival-present-participle
2303@item @code{ADJPP} @tab adjectival-past-participle
2304@item @code{ADJPRO} @tab adjectival-pronoun
2305@item @code{ADJNUM} @tab adjectival-numeral
2306@item @code{ADV} @tab adverb
2307@item @code{ADVANP} @tab adverbial-anterior-participle
2308@item @code{ADVPRP} @tab adverbial-present-participle
2309@item @code{ADVPRO} @tab adverbial-pronoun
2310@item @code{ADVNUM} @tab  adverbial-numeral
2311@item @code{P} @tab preposition
2312@item @code{PPRO} @tab prep-noun-pronoun
2313@item @code{CONJ} @tab conjunction
2314@item @code{EXCL} @tab exclamation
2315@item @code{APP} @tab call
2316@item @code{ONO} @tab onomatopoeia
2317@item @code{PART} @tab particle
2318@item @code{NUMCRD} @tab cardinal-numeral
2319@item @code{NUMCOL} @tab collective-numeral
2320@item @code{NUMPAR} @tab partitive-numeral
2321@item @code{NUMORD} @tab ordinal-numeral
2322@end multitable
2323
2324@node PMDBF morphosyntactic attributes
2325@section Morphosyntactic attributes
2326
2327@multitable {Attr} {Val} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
2328@c @headitem Attr @tab Val @tab Description
2329@item
2330@code{A} @tab @tab Aspect
2331@item
2332@tab @code{p} @tab perfect
2333@item
2334@tab @code{i} @tab imperfect.
2335@item
2336@item
2337@code{V} @tab @tab Verb-Form
2338@item
2339@tab @code{b} @tab infinitive,
2340@item
2341@tab @code{p} @tab personal,
2342@item
2343@tab @code{i} @tab impersonal.
2344@item
2345@item
2346@code{M} @tab @tab Mood
2347@item
2348@tab @code{d} @tab declarative,
2349@item
2350@tab @code{c} @tab conditional,
2351@item
2352@tab @code{i} @tab imperative.
2353@item
2354@item
2355@code{T} @tab @tab Tense
2356@item
2357@tab @code{a} @tab past,
2358@item
2359@tab @code{r} @tab present,
2360@item
2361@tab @code{f} @tab future.
2362@item
2363@item
2364@code{P} @tab @tab Person
2365@item
2366@tab @code{1} @tab 1,
2367@item
2368@tab @code{2} @tab 2,
2369@item
2370@tab @code{3} @tab 3.
2371@item
2372@item
2373@code{D} @tab @tab Degree
2374@item
2375@tab @code{p} @tab positive,
2376@item
2377@tab @code{c} @tab comparative,
2378@item
2379@tab @code{s} @tab superlative.
2380@item
2381@item
2382@code{N} @tab @tab Number
2383@item
2384@tab @code{s} @tab singular,
2385@item
2386@tab @code{p} @tab plural.
2387@item
2388@item
2389@code{C} @tab @tab Case
2390@item
2391@tab @code{n} @tab nominative,
2392@item
2393@tab @code{g} @tab genitive,
2394@item
2395@tab @code{d} @tab dative,
2396@item
2397@tab @code{a} @tab accusative,
2398@item
@tab @code{i} @tab instrumental,
2400@item
2401@tab @code{l} @tab locative,
2402@item
2403@tab @code{v} @tab vocative.
2404@item
2405@code{G} @tab @tab Gender
2406@item
2407@tab @code{p} @tab masculine-personal,
2408@item
2409@tab @code{a} @tab masculine-animal,
2410@item
2411@tab @code{i} @tab masculine-inanimate,
2412@item
2413@tab @code{f} @tab feminine,
2414@item
2415@tab @code{n} @tab neuter.
2416@end multitable
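
The table can be used to expand a tag into readable attribute names. The following Python sketch shows one way to do this; the function name and the sample tags are hypothetical, and the order of attributes in real PMDBF tags may differ from the one used here.

```python
import re

# Attribute letters and value codes, copied from the table above.
ATTRIBUTES = {
    "A": ("Aspect", {"p": "perfect", "i": "imperfect"}),
    "V": ("Verb-Form", {"b": "infinitive", "p": "personal", "i": "impersonal"}),
    "M": ("Mood", {"d": "declarative", "c": "conditional", "i": "imperative"}),
    "T": ("Tense", {"a": "past", "r": "present", "f": "future"}),
    "P": ("Person", {"1": "1", "2": "2", "3": "3"}),
    "D": ("Degree", {"p": "positive", "c": "comparative", "s": "superlative"}),
    "N": ("Number", {"s": "singular", "p": "plural"}),
    "C": ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
                   "a": "accusative", "i": "instrumental", "l": "locative",
                   "v": "vocative"}),
    "G": ("Gender", {"p": "masculine-personal", "a": "masculine-animal",
                     "i": "masculine-inanimate", "f": "feminine",
                     "n": "neuter"}),
}


def decode(tag):
    """Split a tag into its part of speech and decoded attributes.

    Only the letter and digit values listed in ATTRIBUTES are decoded;
    an unknown attribute or value raises KeyError. Other value
    characters allowed by the tag grammar are skipped.
    """
    pos, _, attrs = tag.partition("/")
    decoded = []
    for attr, vals in re.findall(r"([A-Z]+)([a-z0-9]+)", attrs):
        name, values = ATTRIBUTES[attr]
        decoded.append((name, [values[v] for v in vals]))
    return pos, decoded


# "N/GfNsCn" is a made-up tag used only for illustration.
print(decode("N/GfNsCn"))
```

The same value letter means different things under different attributes (for instance, p under Aspect and under Number), so values are always looked up per attribute.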
2417
2418
2419@c ---------------------------------------------------------------------
2420@c ---------------------------------------------------------------------
2421@c
2422@c @node Examples
2423@c @chapter Examples
2424
2425@c ----------------------------------------------------------------------
2426@c ----------------------------------------------------------------------
2427
2428@node    GNU Free Documentation License
2429@chapter GNU Free Documentation License
2430
2431@c The GNU Free Documentation License.
2432@center Version 1.2, November 2002
2433
2434@c This file is intended to be included within another document,
2435@c hence no sectioning command or @node.
2436
2437@display
2438Copyright @copyright{} 2000,2001,2002 Free Software Foundation, Inc.
243951 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA
2440
2441Everyone is permitted to copy and distribute verbatim copies
2442of this license document, but changing it is not allowed.
2443@end display
2444
2445@enumerate 0
2446@item
2447PREAMBLE
2448
2449The purpose of this License is to make a manual, textbook, or other
2450functional and useful document @dfn{free} in the sense of freedom: to
2451assure everyone the effective freedom to copy and redistribute it,
2452with or without modifying it, either commercially or noncommercially.
2453Secondarily, this License preserves for the author and publisher a way
2454to get credit for their work, while not being considered responsible
2455for modifications made by others.
2456
2457This License is a kind of ``copyleft'', which means that derivative
2458works of the document must themselves be free in the same sense.  It
2459complements the GNU General Public License, which is a copyleft
2460license designed for free software.
2461
2462We have designed this License in order to use it for manuals for free
2463software, because free software needs free documentation: a free
2464program should come with manuals providing the same freedoms that the
2465software does.  But this License is not limited to software manuals;
2466it can be used for any textual work, regardless of subject matter or
2467whether it is published as a printed book.  We recommend this License
2468principally for works whose purpose is instruction or reference.
2469
2470@item
2471APPLICABILITY AND DEFINITIONS
2472
2473This License applies to any manual or other work, in any medium, that
2474contains a notice placed by the copyright holder saying it can be
2475distributed under the terms of this License.  Such a notice grants a
2476world-wide, royalty-free license, unlimited in duration, to use that
2477work under the conditions stated herein.  The ``Document'', below,
2478refers to any such manual or work.  Any member of the public is a
2479licensee, and is addressed as ``you''.  You accept the license if you
2480copy, modify or distribute the work in a way requiring permission
2481under copyright law.
2482
2483A ``Modified Version'' of the Document means any work containing the
2484Document or a portion of it, either copied verbatim, or with
2485modifications and/or translated into another language.
2486
2487A ``Secondary Section'' is a named appendix or a front-matter section
2488of the Document that deals exclusively with the relationship of the
2489publishers or authors of the Document to the Document's overall
2490subject (or to related matters) and contains nothing that could fall
2491directly within that overall subject.  (Thus, if the Document is in
2492part a textbook of mathematics, a Secondary Section may not explain
2493any mathematics.)  The relationship could be a matter of historical
2494connection with the subject or with related matters, or of legal,
2495commercial, philosophical, ethical or political position regarding
2496them.
2497
2498The ``Invariant Sections'' are certain Secondary Sections whose titles
2499are designated, as being those of Invariant Sections, in the notice
2500that says that the Document is released under this License.  If a
2501section does not fit the above definition of Secondary then it is not
2502allowed to be designated as Invariant.  The Document may contain zero
2503Invariant Sections.  If the Document does not identify any Invariant
2504Sections then there are none.
2505
2506The ``Cover Texts'' are certain short passages of text that are listed,
2507as Front-Cover Texts or Back-Cover Texts, in the notice that says that
2508the Document is released under this License.  A Front-Cover Text may
2509be at most 5 words, and a Back-Cover Text may be at most 25 words.
2510
2511A ``Transparent'' copy of the Document means a machine-readable copy,
2512represented in a format whose specification is available to the
2513general public, that is suitable for revising the document
2514straightforwardly with generic text editors or (for images composed of
2515pixels) generic paint programs or (for drawings) some widely available
2516drawing editor, and that is suitable for input to text formatters or
2517for automatic translation to a variety of formats suitable for input
2518to text formatters.  A copy made in an otherwise Transparent file
2519format whose markup, or absence of markup, has been arranged to thwart
2520or discourage subsequent modification by readers is not Transparent.
2521An image format is not Transparent if used for any substantial amount
2522of text.  A copy that is not ``Transparent'' is called ``Opaque''.
2523
2524Examples of suitable formats for Transparent copies include plain
2525@sc{ascii} without markup, Texinfo input format, La@TeX{} input
2526format, @acronym{SGML} or @acronym{XML} using a publicly available
2527@acronym{DTD}, and standard-conforming simple @acronym{HTML},
2528PostScript or @acronym{PDF} designed for human modification.  Examples
2529of transparent image formats include @acronym{PNG}, @acronym{XCF} and
2530@acronym{JPG}.  Opaque formats include proprietary formats that can be
2531read and edited only by proprietary word processors, @acronym{SGML} or
2532@acronym{XML} for which the @acronym{DTD} and/or processing tools are
2533not generally available, and the machine-generated @acronym{HTML},
2534PostScript or @acronym{PDF} produced by some word processors for
2535output purposes only.
2536
2537The ``Title Page'' means, for a printed book, the title page itself,
2538plus such following pages as are needed to hold, legibly, the material
2539this License requires to appear in the title page.  For works in
2540formats which do not have any title page as such, ``Title Page'' means
2541the text near the most prominent appearance of the work's title,
2542preceding the beginning of the body of the text.
2543
2544A section ``Entitled XYZ'' means a named subunit of the Document whose
2545title either is precisely XYZ or contains XYZ in parentheses following
2546text that translates XYZ in another language.  (Here XYZ stands for a
2547specific section name mentioned below, such as ``Acknowledgements'',
2548``Dedications'', ``Endorsements'', or ``History''.)  To ``Preserve the Title''
2549of such a section when you modify the Document means that it remains a
2550section ``Entitled XYZ'' according to this definition.
2551
2552The Document may include Warranty Disclaimers next to the notice which
2553states that this License applies to the Document.  These Warranty
2554Disclaimers are considered to be included by reference in this
2555License, but only as regards disclaiming warranties: any other
2556implication that these Warranty Disclaimers may have is void and has
2557no effect on the meaning of this License.
2558
2559@item
2560VERBATIM COPYING
2561
2562You may copy and distribute the Document in any medium, either
2563commercially or noncommercially, provided that this License, the
2564copyright notices, and the license notice saying this License applies
2565to the Document are reproduced in all copies, and that you add no other
2566conditions whatsoever to those of this License.  You may not use
2567technical measures to obstruct or control the reading or further
2568copying of the copies you make or distribute.  However, you may accept
2569compensation in exchange for copies.  If you distribute a large enough
2570number of copies you must also follow the conditions in section 3.
2571
2572You may also lend copies, under the same conditions stated above, and
2573you may publicly display copies.
2574
2575@item
2576COPYING IN QUANTITY
2577
2578If you publish printed copies (or copies in media that commonly have
2579printed covers) of the Document, numbering more than 100, and the
2580Document's license notice requires Cover Texts, you must enclose the
2581copies in covers that carry, clearly and legibly, all these Cover
2582Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on
2583the back cover.  Both covers must also clearly and legibly identify
2584you as the publisher of these copies.  The front cover must present
2585the full title with all words of the title equally prominent and
2586visible.  You may add other material on the covers in addition.
2587Copying with changes limited to the covers, as long as they preserve
2588the title of the Document and satisfy these conditions, can be treated
2589as verbatim copying in other respects.
2590
2591If the required texts for either cover are too voluminous to fit
2592legibly, you should put the first ones listed (as many as fit
2593reasonably) on the actual cover, and continue the rest onto adjacent
2594pages.
2595
2596If you publish or distribute Opaque copies of the Document numbering
2597more than 100, you must either include a machine-readable Transparent
2598copy along with each Opaque copy, or state in or with each Opaque copy
2599a computer-network location from which the general network-using
2600public has access to download using public-standard network protocols
2601a complete Transparent copy of the Document, free of added material.
2602If you use the latter option, you must take reasonably prudent steps,
2603when you begin distribution of Opaque copies in quantity, to ensure
2604that this Transparent copy will remain thus accessible at the stated
2605location until at least one year after the last time you distribute an
2606Opaque copy (directly or through your agents or retailers) of that
2607edition to the public.
2608
2609It is requested, but not required, that you contact the authors of the
2610Document well before redistributing any large number of copies, to give
2611them a chance to provide you with an updated version of the Document.
2612
2613@item
2614MODIFICATIONS
2615
2616You may copy and distribute a Modified Version of the Document under
2617the conditions of sections 2 and 3 above, provided that you release
2618the Modified Version under precisely this License, with the Modified
2619Version filling the role of the Document, thus licensing distribution
2620and modification of the Modified Version to whoever possesses a copy
2621of it.  In addition, you must do these things in the Modified Version:
2622
2623@enumerate A
2624@item
2625Use in the Title Page (and on the covers, if any) a title distinct
2626from that of the Document, and from those of previous versions
2627(which should, if there were any, be listed in the History section
2628of the Document).  You may use the same title as a previous version
2629if the original publisher of that version gives permission.
2630
2631@item
2632List on the Title Page, as authors, one or more persons or entities
2633responsible for authorship of the modifications in the Modified
2634Version, together with at least five of the principal authors of the
2635Document (all of its principal authors, if it has fewer than five),
2636unless they release you from this requirement.
2637
2638@item
2639State on the Title page the name of the publisher of the
2640Modified Version, as the publisher.
2641
2642@item
2643Preserve all the copyright notices of the Document.
2644
2645@item
2646Add an appropriate copyright notice for your modifications
2647adjacent to the other copyright notices.
2648
2649@item
2650Include, immediately after the copyright notices, a license notice
2651giving the public permission to use the Modified Version under the
2652terms of this License, in the form shown in the Addendum below.
2653
2654@item
2655Preserve in that license notice the full lists of Invariant Sections
2656and required Cover Texts given in the Document's license notice.
2657
2658@item
2659Include an unaltered copy of this License.
2660
2661@item
2662Preserve the section Entitled ``History'', Preserve its Title, and add
2663to it an item stating at least the title, year, new authors, and
2664publisher of the Modified Version as given on the Title Page.  If
2665there is no section Entitled ``History'' in the Document, create one
2666stating the title, year, authors, and publisher of the Document as
2667given on its Title Page, then add an item describing the Modified
2668Version as stated in the previous sentence.
2669
2670@item
2671Preserve the network location, if any, given in the Document for
2672public access to a Transparent copy of the Document, and likewise
2673the network locations given in the Document for previous versions
2674it was based on.  These may be placed in the ``History'' section.
2675You may omit a network location for a work that was published at
2676least four years before the Document itself, or if the original
2677publisher of the version it refers to gives permission.
2678
2679@item
2680For any section Entitled ``Acknowledgements'' or ``Dedications'', Preserve
2681the Title of the section, and preserve in the section all the
2682substance and tone of each of the contributor acknowledgements and/or
2683dedications given therein.
2684
2685@item
2686Preserve all the Invariant Sections of the Document,
2687unaltered in their text and in their titles.  Section numbers
2688or the equivalent are not considered part of the section titles.
2689
2690@item
2691Delete any section Entitled ``Endorsements''.  Such a section
2692may not be included in the Modified Version.
2693
2694@item
2695Do not retitle any existing section to be Entitled ``Endorsements'' or
2696to conflict in title with any Invariant Section.
2697
2698@item
2699Preserve any Warranty Disclaimers.
2700@end enumerate
2701
2702If the Modified Version includes new front-matter sections or
2703appendices that qualify as Secondary Sections and contain no material
2704copied from the Document, you may at your option designate some or all
2705of these sections as invariant.  To do this, add their titles to the
2706list of Invariant Sections in the Modified Version's license notice.
2707These titles must be distinct from any other section titles.
2708
2709You may add a section Entitled ``Endorsements'', provided it contains
2710nothing but endorsements of your Modified Version by various
2711parties---for example, statements of peer review or that the text has
2712been approved by an organization as the authoritative definition of a
2713standard.
2714
2715You may add a passage of up to five words as a Front-Cover Text, and a
2716passage of up to 25 words as a Back-Cover Text, to the end of the list
2717of Cover Texts in the Modified Version.  Only one passage of
2718Front-Cover Text and one of Back-Cover Text may be added by (or
2719through arrangements made by) any one entity.  If the Document already
2720includes a cover text for the same cover, previously added by you or
2721by arrangement made by the same entity you are acting on behalf of,
2722you may not add another; but you may replace the old one, on explicit
2723permission from the previous publisher that added the old one.
2724
2725The author(s) and publisher(s) of the Document do not by this License
2726give permission to use their names for publicity for or to assert or
2727imply endorsement of any Modified Version.
2728
2729@item
2730COMBINING DOCUMENTS
2731
2732You may combine the Document with other documents released under this
2733License, under the terms defined in section 4 above for modified
2734versions, provided that you include in the combination all of the
2735Invariant Sections of all of the original documents, unmodified, and
2736list them all as Invariant Sections of your combined work in its
2737license notice, and that you preserve all their Warranty Disclaimers.
2738
2739The combined work need only contain one copy of this License, and
2740multiple identical Invariant Sections may be replaced with a single
2741copy.  If there are multiple Invariant Sections with the same name but
2742different contents, make the title of each such section unique by
2743adding at the end of it, in parentheses, the name of the original
2744author or publisher of that section if known, or else a unique number.
2745Make the same adjustment to the section titles in the list of
2746Invariant Sections in the license notice of the combined work.
2747
2748In the combination, you must combine any sections Entitled ``History''
2749in the various original documents, forming one section Entitled
2750``History''; likewise combine any sections Entitled ``Acknowledgements'',
2751and any sections Entitled ``Dedications''.  You must delete all
2752sections Entitled ``Endorsements.''
2753
2754@item
2755COLLECTIONS OF DOCUMENTS
2756
2757You may make a collection consisting of the Document and other documents
2758released under this License, and replace the individual copies of this
2759License in the various documents with a single copy that is included in
2760the collection, provided that you follow the rules of this License for
2761verbatim copying of each of the documents in all other respects.
2762
2763You may extract a single document from such a collection, and distribute
2764it individually under this License, provided you insert a copy of this
2765License into the extracted document, and follow this License in all
2766other respects regarding verbatim copying of that document.
2767
2768@item
2769AGGREGATION WITH INDEPENDENT WORKS
2770
2771A compilation of the Document or its derivatives with other separate
2772and independent documents or works, in or on a volume of a storage or
2773distribution medium, is called an ``aggregate'' if the copyright
2774resulting from the compilation is not used to limit the legal rights
2775of the compilation's users beyond what the individual works permit.
2776When the Document is included in an aggregate, this License does not
2777apply to the other works in the aggregate which are not themselves
2778derivative works of the Document.
2779
2780If the Cover Text requirement of section 3 is applicable to these
2781copies of the Document, then if the Document is less than one half of
2782the entire aggregate, the Document's Cover Texts may be placed on
2783covers that bracket the Document within the aggregate, or the
2784electronic equivalent of covers if the Document is in electronic form.
2785Otherwise they must appear on printed covers that bracket the whole
2786aggregate.
2787
2788@item
TRANSLATION

Translation is considered a kind of modification, so you may
distribute translations of the Document under the terms of section 4.
Replacing Invariant Sections with translations requires special
permission from their copyright holders, but you may include
translations of some or all Invariant Sections in addition to the
original versions of these Invariant Sections.  You may include a
translation of this License, and all the license notices in the
Document, and any Warranty Disclaimers, provided that you also include
the original English version of this License and the original versions
of those notices and disclaimers.  In case of a disagreement between
the translation and the original version of this License or a notice
or disclaimer, the original version will prevail.

If a section in the Document is Entitled ``Acknowledgements'',
``Dedications'', or ``History'', the requirement (section 4) to Preserve
its Title (section 1) will typically require changing the actual
title.

@item
TERMINATION

You may not copy, modify, sublicense, or distribute the Document except
as expressly provided for under this License.  Any other attempt to
copy, modify, sublicense or distribute the Document is void, and will
automatically terminate your rights under this License.  However,
parties who have received copies, or rights, from you under this
License will not have their licenses terminated so long as such
parties remain in full compliance.

@item
FUTURE REVISIONS OF THIS LICENSE

The Free Software Foundation may publish new, revised versions
of the GNU Free Documentation License from time to time.  Such new
versions will be similar in spirit to the present version, but may
differ in detail to address new problems or concerns.  See
@uref{http://www.gnu.org/copyleft/}.

Each version of the License is given a distinguishing version number.
If the Document specifies that a particular numbered version of this
License ``or any later version'' applies to it, you have the option of
following the terms and conditions either of that specified version or
of any later version that has been published (not as a draft) by the
Free Software Foundation.  If the Document does not specify a version
number of this License, you may choose any version ever published (not
as a draft) by the Free Software Foundation.
@end enumerate

@page
@heading ADDENDUM: How to use this License for your documents

To use this License in a document you have written, include a copy of
the License in the document and put the following copyright and
license notices just after the title page:

@smallexample
@group
  Copyright (C)  @var{year}  @var{your name}.
  Permission is granted to copy, distribute and/or modify this document
  under the terms of the GNU Free Documentation License, Version 1.2
  or any later version published by the Free Software Foundation;
  with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
  Texts.  A copy of the license is included in the section entitled ``GNU
  Free Documentation License''.
@end group
@end smallexample

If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts,
replace the ``with@dots{}Texts.'' line with this:

@smallexample
@group
    with the Invariant Sections being @var{list their titles}, with
    the Front-Cover Texts being @var{list}, and with the Back-Cover Texts
    being @var{list}.
@end group
@end smallexample

If you have Invariant Sections without Cover Texts, or some other
combination of the three, merge those two alternatives to suit the
situation.

If your document contains nontrivial examples of program code, we
recommend releasing these examples in parallel under your choice of
free software license, such as the GNU General Public License,
to permit their use in free software.

@c Local Variables:
@c ispell-local-pdict: "ispell-dict"
@c End:


@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------

@node    Reporting bugs
@chapter Reporting bugs

Report bugs to <obrebski@@amu.edu.pl>.

@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------

@c @node    Copyright
@c @chapter Copyright
@c
@c Copyright 2004 by Tomasz Obrębski
@c This software is free for research and educational use.

@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------

@node    Author
@chapter Author


@bye