source: doc/utt.texinfo @ c21bdd6

Last change on this file since c21bdd6 was 9a36761, checked in by Mateusz Hromada <ruanda@…>, 16 years ago

Migration to new build system.

  • documentation moved and checked
\input texinfo   @c -*-texinfo-*-
@c @documentencoding ISO-8859-2
@documentencoding UTF-8
@c @documentlanguage pl

@c %**start of header
@setfilename utt.info
@settitle UAM Text Tools v0.90
@c %**end of header

@copying
This manual is for UAM Text Tools (version 0.90, October, 2008)

Copyright @copyright{}  2005, 2007  Tomasz Obrębski, Michał Stolarski, Justyna Walkowska, Paweł Konieczka.

Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.2 or
any later version published by the Free Software Foundation; with no
Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.  A
copy of the license is included in the section entitled ``GNU Free
Documentation License''.

@c @quotation
@c Permission is granted to ...
@c No permission is granted until the document is completed.
@c @end quotation
@end copying


@titlepage
@title UAM Text Tools 0.90 - User Manual
@subtitle edition 0.01, @today
@subtitle status: prescript
@author by Justyna Walkowska, Tomasz Obrębski and Michał Stolarski
@page
@vskip 0pt plus 1filll
@insertcopying
@end titlepage

@contents

@c @paragraphindent none

@iftex
@tex
% \usepackage[T1]{fontenc}
% \usepackage[utf8]{inputenc}
% \usepackage{times}
@end tex

@parskip = 0.5@normalbaselineskip plus 3pt minus 1pt
@end iftex
@c @headings off
@c @everyheading LEM(1) @| @| LEM(1)
@everyfooting @today @c @| @thispage @|

@ifnottex

@node Top
@top UTT - UAM Text Tools

@insertcopying

@menu
* General information::
* UTT file format::
* Configuration files::
* UTT components::
* Auxiliary tools::
* Usage examples::
* PMDBF dictionary::
@c * Examples::
@c * Copyright::
* GNU Free Documentation License::
* Reporting bugs::
* Author::
@end menu
@end ifnottex


@c ----------------------------------------------------------------------

@node General information
@chapter General information

UAM Text Tools (UTT) is a package of language processing tools
developed at Adam Mickiewicz University. Its functionality includes:

@itemize @bullet

@item
tokenization
@item
dictionary-based morphological analysis
@item
heuristic morphological analysis of unknown words
@item
spelling correction
@item
pattern search
@item
sentence splitting
@item
generation of concordance tables
@end itemize

The toolkit is intended for processing raw (unannotated),
unrestricted text.

The system is organized as a collection of command-line programs, each
performing one operation, e.g. tokenization, lemmatization, or spelling
correction. The components are independent of one another; what
unifies them is the common input/output file format.

The components may be combined in various ways to provide various text
processing services. New components supplied by the user may also be
easily incorporated into the system, provided that they respect the
input/output file format conventions.

UTT component programs do not depend on any specific tagset or
morphological description format.

UTT is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

The Polex/PMDBF dictionary is licensed under the Creative Commons by-nc-sa License, which prohibits commercial use.


List of contributors:

@itemize
@item Paweł Konieczka
@item Tomasz Obrębski
@item Michał Stolarski
@item Marcin Walas
@item Justyna Walkowska
@item Paweł Wereński
@end itemize

@c ----------------------------------------------------------------------
@c ---------------------------------------------------------------------

@node    UTT file format
@chapter UTT file format

A UTT file contains annotation of a text. It consists of a sequence of
segments. Each segment explicitly refers to a continuous piece of the
text and provides some information on it.

@section Segment format

A segment occupies one line of a UTT file and consists of
space-separated fields:


@quotation
@sp 1
[@var{start} [@var{length}]] @var{type} @var{form} [@var{annotation1} [@var{annotation2} ...]]
@sp 1
@end quotation

@table @var

@item @var{start}
Non-negative integer value indicating the position in the source text where the
segment starts.

@item @var{length}
Non-negative integer value indicating the length of the segment.

@item @var{type}
A sequence of non-blank characters. It should not consist of digits only, since it could then be misinterpreted as a @var{start} or @var{length} field.
@var{type} reflects the main classification of segments
into words, numbers, punctuation marks, and meta-text markers.
@xref{tok output,,tok output}, for a description of the automatically recognized type markers.

@item @var{form}
This field contains the textual form of the segment or the special
symbol @code{*} indicating that the form is not given (e.g. when the segment has been created artificially to mark something and is of length 0).

The characters or character sequences that have special meaning in the
@var{form} field are enumerated below.

Characters with special meaning:

@itemize
@item @code{_} - space character
@item @code{*} - undefined contents
@end itemize

Escape sequences:

@itemize
@item @code{\n} - new line
@item @code{\t} - tabulation
@item @code{\r} - carriage return

@item @code{\_} - the @code{_} character
@item @code{\*} - the @code{*} character
@item @code{\\} - the @code{\} character

@c @item @code{\hh} - a character with hexadecimal code @code{hh} (used for non-printable characters)
@end itemize

@item @var{annotation1}
@item @var{annotation2}
@item ...
Annotation fields have the following format:

@var{longname} @code{:} @var{value}

or

@var{shortname} @var{value}

where @var{longname} is a string of alphanumeric characters
(@code{isalnum()} test), @var{shortname} is a single punctuation character
(@code{ispunct()} test), and @var{value} is an arbitrary string of non-blank characters.

@end table
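
The conventions of the @var{form} and annotation fields can be sketched
in Python as follows. This is only an illustration, not code taken from
UTT: the names @code{decode_form} and @code{split_annotation} are
hypothetical, and copying a backslash unchanged when it is followed by
any other character is an assumption.

@verbatim
```python
import string

# Characters that may follow a backslash in the form field.
ESCAPED = {"n": "\n", "t": "\t", "r": "\r", "_": "_", "*": "*", "\\": "\\"}


def decode_form(form):
    """Return the text encoded in a form field, or None for '*'."""
    if form == "*":
        return None  # undefined contents
    out = []
    i = 0
    while i < len(form):
        ch = form[i]
        if ch == "\\" and i + 1 < len(form) and form[i + 1] in ESCAPED:
            out.append(ESCAPED[form[i + 1]])
            i += 2
        elif ch == "_":
            out.append(" ")  # unescaped underscore stands for a space
            i += 1
        else:
            out.append(ch)  # assumption: anything else is literal
            i += 1
    return "".join(out)


def split_annotation(field):
    """Split an annotation field into (name, value)."""
    name, sep, value = field.partition(":")
    if sep and name.isalnum():
        return name, value  # longname:value
    if field and field[0] in string.punctuation:
        return field[0], field[1:]  # shortname immediately followed by value
    raise ValueError("not an annotation field: %r" % field)
```
@end verbatim

For instance, @code{decode_form} maps the field @code{_} to a single
space, and @code{split_annotation} splits @code{lem:program,N} into the
name @code{lem} and the value @code{program,N}.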


Only two fields are mandatory: @var{type} and @var{form}. All other fields
may be absent. When only one number precedes the
@var{type} field, it is interpreted as the @var{start} position.

If the @var{length} field is omitted, the length of the segment is the
length of the @var{form} field, except when the value of the
@var{form} field is @code{*}; in this case, the length is assumed to
be 0.

If the @var{start} field is also absent, the segment is assumed to directly
follow the preceding one.
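
The defaulting rules above can be sketched as a small Python reader.
This is a hypothetical illustration rather than the parser used by the
UTT programs: the function names are invented here, and counting each
escape sequence in @var{form} as one character is an assumption.

@verbatim
```python
import re


def parse_segment(line, prev_end=0):
    """Parse one UTT segment line, filling in omitted start/length."""
    fields = line.split()
    numbers = []
    # Up to two leading all-digit fields are start and length.
    while fields and len(numbers) < 2 and re.fullmatch(r"[0-9]+", fields[0]):
        numbers.append(int(fields.pop(0)))
    if len(fields) < 2:
        raise ValueError("type and form are mandatory: %r" % line)
    seg_type, form, annotations = fields[0], fields[1], fields[2:]

    # No start given: the segment directly follows the preceding one.
    start = numbers[0] if numbers else prev_end
    if len(numbers) == 2:
        length = numbers[1]
    elif form == "*":
        length = 0
    else:
        # Assumption: an escape such as \n counts as one character.
        length = len(re.sub(r"\\.", "x", form))
    return {"start": start, "length": length, "type": seg_type,
            "form": form, "annotations": annotations}


def read_segments(lines):
    """Parse consecutive lines, tracking where the previous segment ended."""
    segments = []
    prev_end = 0
    for line in lines:
        seg = parse_segment(line, prev_end)
        prev_end = seg["start"] + seg["length"]
        segments.append(seg)
    return segments
```
@end verbatim

Applied to the lines @code{0000 N 12}, @code{P .}, @code{N 5}, the
reader assigns the start positions 0, 2 and 3.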

@c Conventions:

@c Annotation fields with predefined meaning:

@c @itemize
@c @item @code{!} - UTT components are allowed to modify the contents of
@c the @var{form} field (e.g. spelling correction does this). If this happens the
@c original form of the segment has to be placed in the @code{!}-field.
@c @item @code{@@} - morphological description
@c @item @code{=} - node identifier assignment (used in graph encoding)
@c @item @code{<} - preceding/dominating node(s) (used in graph encoding)
@c @item @code{>} - succeeding/subordinate node(s) (used in graph encoding)
@c @end itemize

Segments of length 0 may be used to mark file positions with some
information. See e.g. the BOS and EOS (beginning/end of sentence) markers
in the example below.

Example:

text: @samp{Piszemy dobre progrumy. Warszawiacy też.}

@example
0000 00 BOS *
0000 07 W Piszemy lem:pisać,V
0007 01 S _
0008 05 W dobre lem:dobry,ADJ
0013 01 S _
0014 08 W progrumy cor:programy lem:program,N
0022 01 P .
0023 00 EOS *
0023 01 S _
0024 00 BOS *
0024 11 W Warszawiacy lem:Warszawiak,N
0035 01 S _
0036 03 W też
0039 01 P .
0040 00 EOS *
@end example

The length field may be omitted:

@example
0000 BOS *
0000 W Piszemy lem:pisać,V
0007 S _
0008 W dobre lem:dobry,ADJ
0013 S _
0014 W progrumy cor:programy lem:program,N
0022 P .
0023 EOS *
@end example

Position information may be provided only for some types of segments:

@example
0000 BOS *
W Piszemy lem:pisać,V
S _
W dobre lem:dobry,ADJ
S _
W progrumy cor:programy lem:program,N
P .
EOS *
S _
0024 BOS *
W Warszawiacy lem:Warszawiak,N
S _
W też
P .
EOS *
@end example

Position/length information may be provided only when necessary:

@example
0000 04 N *
0000 N 12
P .
N 5
S _
W km
@end example

@section UTT File

A UTT file consists of a sequence of segments.  The same text position
may be covered by multiple segments. In consequence, ambiguous text
segmentation and ambiguous annotation may be represented.

There are two structural requirements a valid UTT-formatted file
has to meet:

@itemize @bullet

@item
segments have to be sorted with respect to the @var{start} field,

@item
for each
segment ending at position @var{n}, either there must be a segment starting at
position @var{n+1}, or position @var{n+1} must not be covered by any segment; similarly,
for each segment starting at position @var{n}, either there must be a segment
ending at position @var{n-1}, or position @var{n-1} must not be covered
by any segment.

@end itemize

A valid annotation for the text fragment
@example
12.5 km
@end example

may be

@example
0000 02 N 12
0000 04 N 12.5
0002 01 P .
0003 01 N 5
0004 01 S _
0005 02 W km
@end example

but not

@example
0000 02 N 12
0000 04 N 12.5
0004 01 S _
0005 02 W km
@end example

because in the latter example the first segment (starting at position
0000, 2 characters long) ends at position @var{n}=0001, position
@var{n+1}=0002 is covered by the second segment, and no segment starts
at position 0002.


@section Flattened UTT file

The UTT file format has two variants: regular and flattened. The regular
format was described above.  In the flattened format some of the
end-of-line characters are replaced with line-feed characters.

The flattened format is basically used to represent whole sentences as
single lines of the input file (all intrasentential end-of-line
characters are replaced with line-feed characters).

This technical trick makes it possible to perform certain text
processing operations on entire sentences with tools such as
@command{grep} (see the @command{grp} component) or @command{sed} (see the @command{mar} component).

The conversion between the two formats is performed by the tools
@command{fla} and @command{unfla}.

@section Character encoding

The UTT component programs accept only 1-byte character encodings, such
as ISO, ANSI, DOS.


@c @section Formats

@c @unnumberedsubsubsec Basic format

@c While processing large amounts of the overhead related with explicit
@c ... of the start position and segment length becomes ... . Therefore,
@c for efficiency reasons certain shortcuts are possible:

@c @unnumberedsubsubsec Relative start position

@c Start position may be given as relative distance from the last
@c absolute position.

@c @unnumberedsubsubsec Absent length

@c Segment length may be omitted. Normally it can be restored by counting
@c the length of the @emph{form field}. For segments with the special value
@c @code{*} in the @emph{form field} length 0 is assumed.

@c @unnumberedsubsubsec Absent length and start position

@c Both start position and segment length may be omitted. In this format
@c each segment is assumed to follow the previous one. This format is,
@c therefore, suitable only for unambiguously tagged text
@c (0-length markers can be still used.)


@c @table @code
@c @item AL
@c @code{1234 03 W kot}
@c @item RL
@c @code{+56 03 W kot}
@c @item A
@c @code{1234 W kot}
@c @item R
@c @code{+56 W kot}
@c @item 0
@c @code{W kot}
@c @end table


@c [HOW TO GET POLISH FONTS IN DVI???]

@macro parhelp
@item @b{@minus{}@minus{}help}, @b{@minus{}h}
Print help.
@end macro


@macro parversion
@item @b{@minus{}@minus{}version}, @b{@minus{}V}
Print version information.
@end macro

@macro parinteractive
@item @b{@minus{}@minus{}interactive, @minus{}i}
This option toggles interactive mode, which is off by default. In
interactive mode the program does not buffer the output.
@end macro


@c @macro parfile
@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
@c Input file name.
@c If this option is absent or equal to '@minus{}', the program
@c reads from the standard input.
@c @end macro


@c @macro paroutput
@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
@c Regular output file name. To regular output the program sends segments
@c which it successfully processed and copies those which were not
@c subject to processing. If this option is absent or equal to
@c '@minus{}', standard output is used.
@c @end macro

@c @macro parfail
@c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}}
@c Fail output file name. To fail output the program copies the segments
@c it failed to process.  If this option is absent or equal to
@c '@minus{}', standard output is used.
@c @end macro


@c @macro parcopy
@c @item @b{@minus{}@minus{}copy, @minus{}c}
@c Copy successfully processed segments to regular output also in their
@c original input form.
@c @end macro


@macro parinputfield
@item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}}
The field containing the input to the program. The default is the
@var{form} field. The fields @var{start}, @var{length}, @var{type},
and @var{form} are referred to as @code{1}, @code{2}, @code{3},
@code{4}, respectively.
@end macro


@macro paroutputfield
@item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}}
The name of the field added by the program. The default is the name of the program.
@end macro


@macro pardictionary
@item @b{@minus{}@minus{}dictionary=@var{filename}, @minus{}d @var{filename}}
Dictionary file name.
@end macro


@macro parprocess
@item @b{@minus{}@minus{}process=@var{type}, @minus{}p @var{type}}
Process segments with the specified value in the @var{type} field.
Multiple occurrences of this option are allowed and are interpreted as
disjunction. If this option is absent, all segments are processed.
@end macro


@macro parselect
@item @b{@minus{}@minus{}select=@var{fieldname}, @minus{}s @var{fieldname}}
Select for processing only segments in which the field named
@var{fieldname} is present. Multiple occurrences of this option are
allowed and are interpreted as conjunction of conditions. If this
option is absent, all segments are processed.
@end macro


@macro parunselect
@item @b{@minus{}@minus{}unselect=@var{fieldname}, @minus{}S @var{fieldname}}
Select for processing only segments in which the field @var{fieldname}
is absent.  Multiple occurrences of this option are allowed and are
interpreted as conjunction of conditions. If this option is absent,
all segments are processed.
@end macro


@macro paroneline
@item @b{@minus{}@minus{}one-line}
This option makes the program print ambiguous annotation in one output
line by generating multiple annotation fields. By default, when
ambiguous annotation is produced for a segment, the segment is
duplicated and each of the annotations is added to a separate copy of
the segment.
@end macro


@macro paronefield
@item @b{@minus{}@minus{}one-field, @minus{}1}
This option makes the program print ambiguous annotation in one
annotation field. By default, when ambiguous annotation is produced
for a segment, the segment is duplicated and each of the
annotations is added to a separate copy of the segment.

This option is useful when working with @command{kot} or @command{con}.
@end macro


@c ---------------------------------------------------------------------
@c CONFIGURATION FILES
@c ---------------------------------------------------------------------

@node    Configuration files
@chapter Configuration files

Values for all command line options accepted by a component
may be set in configuration files. The default locations of the
configuration files for a component named @command{@var{program}} are

@example
        @file{/usr/local/etc/utt/@var{program}.conf}
@end example

for the system-wide configuration file and

@example
        @file{~/.utt/@var{program}.conf}
@end example

for the user configuration file.

@c The configuration file to load may be also specified with the
@c @option{--config} option. Configuration file need not be provided.

For each option, the value is set according to the following priority
(highest first):

@itemize
@item command line
@c @item configuration file indicated with @option{--config} option
@item user configuration file (or configuration file indicated with the @option{--config} option)
@item system-wide configuration file
@end itemize

Parameter values are specified in the following format:

@var{parametername}=@var{value}

where @var{parametername} is the short or long name of an option accepted by
the program, or

@var{parametername}

if the option does not take an argument.

Comments in configuration files are introduced with the @code{#} sign.

If a program accepts multiple occurrences of an option (e.g. the @option{--select} option of @command{lem}), each occurrence may be given on a separate line of the program's configuration file.
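
The file format and the priority order can be illustrated with a short
Python sketch. It is not the option handling code of UTT: the function
names are hypothetical, a comment is assumed to run from @code{#} to
the end of the line, a higher-priority source is assumed to replace all
values of an option set at a lower level, and short and long option
names are not unified.

@verbatim
```python
def parse_conf(text):
    """Return a list of (name, value) pairs; value is None for flags."""
    settings = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        name, sep, value = line.partition("=")
        settings.append((name.strip(), value.strip() if sep else None))
    return settings


def effective_options(system, user, command_line):
    """Merge three lists of (name, value) pairs by priority."""
    merged = {}
    for source in (system, user, command_line):  # lowest priority first
        seen = set()
        for name, value in source:
            if name not in seen:
                merged[name] = []  # this source overrides lower levels
                seen.add(name)
            merged[name].append(value)  # repeated options accumulate
    return merged
```
@end verbatim

For example, if the user file contains two @code{select} lines and the
command line supplies one @code{select} option, only the command line
value remains in the merged result.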

@c The equal sign may be omitted.


@quotation Tip
If you have two (or more) frequently used sets of options for the same
program (e.g. lem with the PMDBF dictionary and lem with a user dictionary),
a good solution is to create two soft links to lem, called
e.g. lemg and lemu, and specify their configuration in the files lemg.conf
and lemu.conf respectively.
@end quotation

@c ---------------------------------------------------------------------
@c COMPONENTS
@c ---------------------------------------------------------------------

@node UTT components
@chapter UTT components

UTT components are of three types:

@menu
Sources: programs which read non-UTT data (e.g. raw text) and produce output
in UTT format
* tok::         a tokenizer

Filters: programs which read and produce UTT-formatted data
* lem::         a morphological analyzer
* gue::         a morphological guesser
* cor::         a simple spelling corrector
* kor::         a more elaborate spelling corrector
* sen::         a sentence splitter
* ser::         a pattern search tool (marks matches)
* mar::         a pattern search tool (introduces arbitrary markers into the text)
* grp::         a pattern search tool (selects sentences containing a match)
@c * gph::         a word-graph annotation tool::
@c * dgp::         a dependency parser

Sinks: programs which read UTT data and produce output in another format
* kot::         an untokenizer
* con::         a concordance table generator
@end menu

@c ---------------------------------------------------------------------
@c TOK
@c ---------------------------------------------------------------------

@page
@node tok
@section tok - a tokenizer

@c ----------------------------------------

@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski
@item @strong{Component category:}      @tab source
@item @strong{Input format:}            @tab raw text file
@item @strong{Output format:}           @tab UTT regular
@item @strong{Required annotation:}     @tab -
@end multitable


@menu
* tok description::
* tok input::
* tok output::
* tok command line options::
* tok example::
@end menu

@node tok description
@subsection Description

@code{tok} is a simple program which reads a text file and identifies
tokens on the basis of their orthographic form.  The type of the token
is printed as the @var{type} field.

@node tok input
@subsection Input

Raw text.

@node tok output
@subsection Output

A UTT file with four fields: @var{start}, @var{length}, @var{type}, and @var{form}. In the @var{type} field five types of tokens are distinguished:

@itemize

@item @code{W}
(word)
- continuous sequence of letters

@item @code{N}
(number)
- continuous sequence of digits

@item @code{S}
(space)
- continuous sequence of space characters

@item @code{P}
(punctuation mark)
- single printable character not belonging to any of the other classes

@item @code{B}
(unprintable character)
- single unprintable character

@end itemize



@node tok command line options
@subsection Command line options

@table @code

@item @b{@minus{}@minus{}help}, @b{@minus{}h}
Print help.

@item @b{@minus{}@minus{}version}, @b{@minus{}V}
Print version information.

@item @b{@minus{}@minus{}interactive, @minus{}i}
This option toggles interactive mode, which is off by default. In
interactive mode the program does not buffer the output.

@end table

@node tok example
@subsection Example

Input:

@example
Piszemy dobre programy.
@end example

Output:

@example
0000 07 W Piszemy
0007 01 S _
0008 05 W dobre
0013 01 S _
0014 08 W programy
0022 01 P .
0023 01 S \n
@end example
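
The classification into the five token types can be imitated with a few
lines of Python. The sketch below is not the implementation of
@command{tok}; it only mirrors the description above. The function
names, the use of Python's character class tests, and the four-digit
and two-digit zero padding (copied from the example output) are
assumptions.

@verbatim
```python
from itertools import groupby

# How characters are written in the form field.
ESCAPES = {"\n": "\\n", "\t": "\\t", "\r": "\\r",
           "_": "\\_", "*": "\\*", "\\": "\\\\", " ": "_"}


def char_type(ch):
    """Classify a single character into one of the five token types."""
    if ch.isalpha():
        return "W"
    if ch.isdigit():
        return "N"
    if ch.isspace():
        return "S"
    if ch.isprintable():
        return "P"
    return "B"


def tokenize(text):
    """Return UTT segment lines for the given raw text."""
    segments = []
    pos = 0
    for kind, run in groupby(text, key=char_type):
        run = "".join(run)
        # W, N and S tokens are maximal runs; P and B are single characters.
        pieces = [run] if kind in "WNS" else list(run)
        for piece in pieces:
            form = "".join(ESCAPES.get(c, c) for c in piece)
            segments.append("%04d %02d %s %s" % (pos, len(piece), kind, form))
            pos += len(piece)
    return segments
```
@end verbatim

Called on the example input followed by a newline character, the
function returns the seven lines of the example output.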


@c ---------------------------------------------------------------------
@c SEN
@c ---------------------------------------------------------------------

@c @node sen - sentencizer
@c @chapter sen - sentencizer

@c Authors: Tomasz Obrębski

@c ---------------------------------------------------------------------
@c LEM
@c ---------------------------------------------------------------------

@page
@node lem
@section lem - morphological analyzer

@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
@item @strong{Authors:}                 @tab Tomasz Obrębski, Michał Stolarski
@item @strong{Component category:}      @tab filter
@item @strong{Input format:}            @tab UTT regular
@item @strong{Output format:}           @tab UTT regular
@item @strong{Required annotation:}     @tab tok
@end multitable

@menu
* lem description::
* lem command line options::
* lem input::
* lem output::
* lem example::
* lem dictionaries::
* lem hints::
@end menu

@node lem description
@subsection Description

@command{lem} performs morphological analysis of a simple orthographic
word, returning all its possible morphological annotations,
disregarding the context.

@c ----------------------------------------

@node lem command line options
@subsection Command line options

@table @code
@parhelp
@parversion
@parinteractive
@c @parfile
@c @paroutput
@c @parfail
@c @parcopy
@parinputfield
@paroutputfield
@pardictionary
@parprocess
@parselect
@parunselect
@paroneline
@paronefield
@end table

@c ----------------------------------------

@node lem input
@subsection Input

@command{lem} reads a UTT file and processes the value of the @var{form} field
(the input field may be changed with the @option{--input-field} option).

@node lem output
@subsection Output

@command{lem} adds a new annotation field, whose default name is @code{lem}.  In
case of ambiguity, either the segment is duplicated (default),
multiple @code{lem} fields are added (@option{--one-line}), or the ambiguous
annotation is produced as the value of a single @code{lem} field
(@option{--one-field}, @option{-1}):

@itemize @bullet

@item
unambiguous value format:

@example
   <lemma>,<descr>
@end example

@item
ambiguous value format (@option{--one-field} option):


@example
   <lemma>,<descr>[,<descr>][;<lemma>,<descr>[,<descr>]]
@end example

(alternative descriptions for the same lemma are separated by commas,
alternative lemmata are separated by semicolons.)

@end itemize
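
A value in the @option{--one-field} format can be unpacked as shown in
the following Python sketch. The function names are hypothetical, and
the sketch assumes that neither lemmata nor descriptions contain commas
or semicolons.

@verbatim
```python
def parse_lem_value(value):
    """Split an ambiguous lem value into (lemma, [descriptions]) pairs."""
    readings = []
    for alternative in value.split(";"):  # alternative lemmata
        lemma, *descriptions = alternative.split(",")
        readings.append((lemma, descriptions))
    return readings


def expand(value):
    """Return one unambiguous '<lemma>,<descr>' value per reading."""
    return ["%s,%s" % (lemma, descr)
            for lemma, descriptions in parse_lem_value(value)
            for descr in descriptions]
```
@end verbatim

Expanding a one-field value in this way yields the values that appear
on separate segment copies in the default output mode.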

@node lem example
@subsection Example

Input:

@example
0000 07 W Piszemy
0007 01 S _
0008 05 W dobre
0013 01 S _
0014 08 W programy
0022 01 P .
0023 01 S \n
@end example

Output (default):

@example
0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
0007 01 S _
0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn
0008 05 W dobre lem:dobry,ADJ/DpNsCnavGn
0013 01 S _
0014 08 W programy lem:program,N/GiNpCa
0014 08 W programy lem:program,N/GiNpCn
0014 08 W programy lem:program,N/GiNpCv
0022 01 P .
0023 01 S \n
@end example

Output (@option{--one-line} option):

@example
0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
0007 01 S _
0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn lem:dobry,ADJ/DpNsCnavGn
0013 01 S _
0014 08 W programy lem:program,N/GiNpCa lem:program,N/GiNpCn lem:program,N/GiNpCv
0022 01 P .
0023 01 S \n
@end example

Output (@option{--one-field} option):

@example
0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1
0007 01 S _
0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn,ADJ/DpNsCnavGn
0013 01 S _
0014 08 W programy lem:program,N/GiNpCa,N/GiNpCn,N/GiNpCv
0022 01 P .
0023 01 S \n
@end example

@c ----------------------------------------

@node lem dictionaries
@subsection Dictionaries

@command{lem} requires a dictionary. The dictionary may be provided in
one of two formats: in text (source) format or in binary (fsa) format.

@subsubheading Text format

Dictionary entries have the following structure:

@example
<form>;<lemma>,<descr>[;<lemma>,<descr>]
@end example

@var{lemma} may be given explicitly or in the cut-add format:

@example
@code{[<cut1><add1>-]<cut2><add2>}
@end example

meaning: replace the prefix of length @code{<cut1>} with the
string @code{<add1>}, and replace the suffix of length @code{<cut2>} with the string
@code{<add2>}. For example, @code{3t} transforms @samp{kocie} into
@samp{kot}, and @code{3-4ały} transforms @samp{najbielsi} into @samp{biały}.

Each dictionary entry must be written on one line and must not contain blank characters.

Examples:
@example
kot;0,N/GaNsCn
kota;1,N/GaNsCg;1,N/GaNsCa
kotu;1,N/GaNsCd
kotem;2,N/GaNsCi
kocie;3t,N/GaNsCl;3t,N/GaNsCv
najbielsi;3-4ały,ADJ/DsNpCnGp
najbielsze;3-5ały,ADJ/DsNpCnGaifn
najlepsi;dobry,ADJ/DsNpCnGp
najlepsze;dobry,ADJ/DsNpCnGaifn
@end example
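
The cut-add notation can be made concrete with a short Python sketch.
It is an illustration only, not the code used by @command{lem} or
@command{compiledic}: the function names are hypothetical, and treating
every lemma that does not start with a digit as an explicit lemma is an
assumption based on the examples above.

@verbatim
```python
import re

# [<cut1><add1>-]<cut2><add2>
CUT_ADD = re.compile(r"(?:(\d+)([^-]*)-)?(\d+)(.*)")


def lemma_of(form, spec):
    """Return the lemma of form given an explicit or cut-add lemma spec."""
    match = CUT_ADD.fullmatch(spec)
    if match is None:
        return spec  # explicit lemma
    cut1, add1, cut2, add2 = match.groups()
    if cut1 is not None:
        form = add1 + form[int(cut1):]  # replace the prefix
    return form[:len(form) - int(cut2)] + add2  # replace the suffix


def analyses(entry):
    """Expand one text dictionary entry into (lemma, description) pairs."""
    form, *alternatives = entry.split(";")
    result = []
    for alternative in alternatives:
        spec, descr = alternative.split(",", 1)
        result.append((lemma_of(form, spec), descr))
    return result
```
@end verbatim

For the entry @code{kocie;3t,N/GaNsCl;3t,N/GaNsCv} the sketch yields
the lemma @samp{kot} with the two descriptions @code{N/GaNsCl} and
@code{N/GaNsCv}.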


The mandatory file name extension for a text dictionary is @code{dic}. For large
dictionaries it is preferable, however, to compile them into binary
(fsa) format.

@subsubheading Binary format

The mandatory file name extension for a binary dictionary is @code{bin}. To
compile a text dictionary into binary format, write:

@example
compiledic <dictionaryname>.dic
@end example

@subsubheading Polex/PMDBF dictionary

A large-coverage morphological dictionary for the Polish language, Polex/PMDBF, is included in
the distribution as the default dictionary of @command{lem}. It is
located by default in:

@file{$HOME/.local/share/utt/pl_PL.ISO-8859-2/lem.bin}

in a local installation or in

@file{/usr/local/share/utt/pl_PL.ISO-8859-2/lem.bin}

in a system installation.

@node lem hints
@subsection Hints

@subsubheading Combining data from multiple dictionaries

@itemize

@item Apply <dict1>, then apply <dict2> to words which were not annotated.

@example
lem -d <dict1> | lem -S lem -d <dict2>
@end example

@item Add annotations from two dictionaries <dict1> and <dict2>.

@example
lem -c -d <dict1> | lem -S lem -d <dict2>
@end example

@end itemize


1008@c ---------------------------------------------------------------------
1009@c GUE
1010@c ---------------------------------------------------------------------
1011
1012@page
1013@node gue
1014@section gue - morphological guesser
1015
1016@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1017
[9ace5d2]1018@item @strong{Authors:}                 @tab Michał Stolarski, Tomasz Obrębski
[25ae32e]1019@item @strong{Component category:}      @tab filter
1020
1021@end multitable
1022
1023@menu
[261bf62]1024* gue description::   
[25ae32e]1025* gue command line options::   
1026* gue example::                 
1027* gue dictionaries::           
1028@end menu
1029
[261bf62]1030
1031@node gue description
1032@subsection Description
1033
1034@command{gue} guesess morphological descriptions of the form contained
1035in the @var{form} field.
1036
1037
[25ae32e]1038@node gue command line options
1039@subsection Command line options
1040
1041@table @code
1042
1043@parhelp
1044@parversion
1045@parinteractive
1046@c @parfile
1047@c @paroutput
1048@c @parfail
1049@c @parcopy
1050@parinputfield
1051@paroutputfield
1052@pardictionary
1053@parprocess
1054@parselect
1055@parunselect
1056@paroneline
1057@paronefield
1058
1059@item @b{@minus{}@minus{}delta=@var{n}}
1060Stop displaying answers after fall of weight, that is, when weight difference between 2 subsequent results is more than delta value (default=`0.2').
1061
1062
1063@item @b{@minus{}@minus{}cut-off=@var{n}}
Do not display answers whose weight is lower than the cut-off value (default=`200').
1065
1066
1067@item @b{@minus{}@minus{}guess_count=@var{n}, @minus{}n @var{n}}
Guess up to @var{n} descriptions (default=`0', which means `display all results').
1069
1070
1071
1072@end table
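The interplay of the three options above can be pictured with a minimal Python sketch. It is not the @command{gue} implementation: the function name @code{select_guesses}, the pair representation of a guess and the order in which the three checks are applied are assumptions made for illustration, and the weight difference is compared with delta literally, on whatever scale the weights use.

```python
def select_guesses(guesses, delta, cut_off, guess_count=0):
    """Pick the guesses to display from (description, weight) pairs.

    Hypothetical helper: the real gue may apply the checks in another order.
    """
    ranked = sorted(guesses, key=lambda guess: guess[1], reverse=True)
    selected = []
    previous_weight = None
    for description, weight in ranked:
        if weight < cut_off:
            break  # --cut-off: everything from here on weighs too little
        if previous_weight is not None and previous_weight - weight > delta:
            break  # --delta: the weight fell too sharply
        if guess_count and len(selected) >= guess_count:
            break  # --guess_count: 0 means no limit
        selected.append((description, weight))
        previous_weight = weight
    return selected
```

With guesses weighted 900, 850, 400 and 150, a delta of 100 and a cut-off of 200, the sketch keeps only the first two, because the drop from 850 to 400 exceeds delta.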
1073
1074@node gue example
1075@subsection Example
1076
1077@example
1078command: gue -n 2
1079
1080input:
10810000 07 W smerfny
1082
1083output:
10840000 07 W smerfny gue:,ADJ/CaDpGiNs
10850000 07 W smerfny gue:,ADJ/CnvDpGaipNs
1086@end example
1087                                 
1088
1089@node gue dictionaries
1090@subsection Dictionaries
1091
1092@command{gue} requires a dictionary. For now, the dictionary must be provided in binary (fsa) format.
1093The fsa format is created by compiling text-format dictionaries.
1094
1095
1096
1097@subsubheading Text format
1098
1099Dictionary entries have the following structure:
1100
1101@example
1102@var{prefix}@code{*}@var{suffix}@code{;}@var{lemma}@code{,}@var{description}@code{:}@var{weight}
1103@end example
1104
1105@var{lemma} must be given in the cut-add format:
1106
1107@example
1108@code{[<cut1><add1>-]<cut2><add2>}
1109@end example
1110(no spaces in between): replace prefix of length @var{cut1} with
1111string @var{add1}, replace suffix of length @var{cat2} with string
1112@var{add2}.
1113
1114
Example: @code{3-4ały} transforms @i{najbielsi} into @i{biały}: the 3-character prefix @i{naj} is cut, and the 4-character suffix @i{elsi} is replaced with @i{ały}.
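The cut-add transformation can be sketched in a few lines of Python. This is an illustration only, not code taken from UTT: the name @code{apply_cut_add} is hypothetical, and the sketch assumes that @var{add1} contains no hyphen and that neither added string starts with a digit.

```python
import re

# Optional "<cut1><add1>-" part, then the mandatory "<cut2><add2>" part.
CUT_ADD = re.compile(r"^(?:(\d+)([^-]*)-)?(\d+)(.*)$")


def apply_cut_add(form, spec):
    """Turn a word form into its lemma using a cut-add specification."""
    match = CUT_ADD.match(spec)
    if match is None:
        raise ValueError("not a cut-add specification: " + spec)
    cut1, add1, cut2, add2 = match.groups()
    if cut1 is not None:
        # Replace the prefix of length cut1 with add1.
        form = add1 + form[int(cut1):]
    # Replace the suffix of length cut2 with add2.
    return form[:len(form) - int(cut2)] + add2
```

Calling @code{apply_cut_add("najbielsi", "3-4ały")} returns @code{biały}, as in the example above.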
[25ae32e]1116
1117
@var{description} contains the part of speech and morphosyntactic information (@pxref{PMDBF dictionary}).
1119
1120@var{weight} is an integer value between 1 and 999 indicating the
1121likelihood of the guess.
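As a rough illustration of the entry structure, the Python sketch below splits one text-format entry into its parts and checks whether it applies to a word form. The function names are hypothetical, and so is the reading of @code{*} as a placeholder for the part of the word between the prefix and the suffix; the description and weight used in the usage note are made up.

```python
def parse_entry(line):
    """Split 'prefix*suffix;lemma,description:weight' into its five parts."""
    pattern, _, rest = line.partition(";")
    prefix, _, suffix = pattern.partition("*")
    rest, _, weight = rest.rpartition(":")
    lemma, _, description = rest.partition(",")
    return prefix, suffix, lemma, description, int(weight)


def entry_applies(entry, form):
    """Check that the form starts with the prefix and ends with the suffix."""
    prefix, suffix = entry[0], entry[1]
    return (len(form) >= len(prefix) + len(suffix)
            and form.startswith(prefix)
            and form.endswith(suffix))
```

For a hypothetical entry @code{naj*elsi;3-4ały,ADJ/CnDpGpNp:500}, the sketch yields the prefix @code{naj}, the suffix @code{elsi}, the cut-add lemma @code{3-4ały} and the weight 500, and the entry applies to @i{najbielsi} but not to @i{bielsi}.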
1122
[9ace5d2]1123@c @example
1124@c *łkę;1a,N/GfNsCa
1125@c naj*elszy;3-4ały,ADJ/...:...
1126@c @end example
[25ae32e]1127
1128
1129@c ---------------------------------------------------------------------
1130@c COR
1131@c ---------------------------------------------------------------------
1132
1133@page
1134@node cor
1135@section cor - spelling corrector
1136
1137@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[9ace5d2]1138@item @strong{Authors:}                 @tab Tomasz Obrębski, Michał Stolarski
[25ae32e]1139@item @strong{Component category:}      @tab filter
[261bf62]1140@item @strong{Input format:}            @tab UTT regular
1141@item @strong{Output format:}           @tab UTT regular
1142@item @strong{Required annotation:}     @tab tok
[25ae32e]1143@end multitable
1144
[261bf62]1145@menu
1146* cor description::
1147* cor command line options::   
1148* cor dictionaries::           
1149@end menu
1150
1151
1152@node cor description
1153@subsection Description
1154
[25ae32e]1155The spelling corrector applies Kemal Oflazer's dynamic programming
1156algorithm @cite{oflazer96} to the FSA representation of the set of
1157word forms of the Polex/PMDBF dictionary. Given an incorrect
1158word form it returns all word forms present in the dictionary whose
1159edit distance is smaller than the threshold given as the parameter.
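The effect of the threshold can be illustrated with a brute-force Python sketch. It is deliberately not Oflazer's algorithm: instead of traversing an FSA it compares the input with every word of a plain list, and it uses the ordinary Levenshtein distance (insertions, deletions, substitutions). The function names, and the choice to treat the threshold as inclusive, are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[len(b)]


def corrections(form, words, max_distance=1):
    """Return the dictionary words within max_distance of the form."""
    return [word for word in words if edit_distance(form, word) <= max_distance]
```

With the word list @code{odlot}, @code{odlotowy}, @code{odludek} and a maximum distance of 1, the misspelled form @code{odlut} is corrected to @code{odlot} only.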
1160
1161
1162@node cor command line options
1163@subsection Command line options
1164
1165@table @code
1166
1167@parhelp
1168@parversion
1169@parinteractive
1170@c @parfile
1171@c @paroutput
1172@c @parfail
1173@c @parcopy
1174@parinputfield
1175@paroutputfield
1176@pardictionary
1177@parprocess
1178@parselect
1179@parunselect
1180@paroneline
1181@paronefield
1182
1183@item @b{@minus{}@minus{}distance=@var{int}, @minus{}n @var{int}}
1184Maximum edit distance (default='1').
1185
[261bf62]1186@c @item @b{@minus{}@minus{}replace, @minus{}r}
1187@c Replace original form with corrected form, place original form in the
1188@c cor field. This option has no effect in @option{--one-*} modes (default=off)
1189
[25ae32e]1190
1191@end table
1192
1193@node cor dictionaries
1194@subsection Dictionaries
1195
1196@command{cor} requires a dictionary. The dictionary has to be provided in binary (fsa) format.
1197The fsa format is created by compiling text-format dictionaries.
1198
1199@subsubheading Text format
1200
1201The @command{cor} dictionary is a list of words:
1202@example
1203odlot
1204odlotowy
1205odludek
1206@end example
1207
[261bf62]1208@subsubheading Binary format
1209
1210The mandatory file name extension for a binary dictionary is @code{bin}. To
1211compile a text dictionary into binary format, write:
1212
1213@example
1214compiledic <dictionaryname>.dic
1215@end example
1216
1217@c ---------------------------------------------------------------------
1218@c KOR
1219@c ---------------------------------------------------------------------
1220
1221@page
1222@node kor
1223@section kor - configurable spelling corrector
1224
[9ace5d2]1225@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1226@item @strong{Authors:}                 @tab Paweł Werenski, Tomasz Obrębski, Michał Stolarski
1227@item @strong{Component category:}      @tab filter
1228@item @strong{Input format:}            @tab UTT regular
1229@item @strong{Output format:}           @tab UTT regular
1230@item @strong{Required annotation:}     @tab tok
1231@end multitable
1232
1233@menu
1234* kor description::
1235* kor command line options::
1236* kor weights definition file::   
1237* kor dictionaries::           
1238@end menu
1239
1240
1241@node kor description
1242@subsection Description
1243
The spelling corrector applies Paweł Werenski's dynamic programming
algorithm to the FSA representation of the set of word forms of the
Polex/PMDBF dictionary. The algorithm is an extension of K. Oflazer's
algorithm used by @command{cor}. In the extended version it is
possible to assign weights to individual edit operations.
1249
1250Given an incorrect word form it returns all word forms
1251present in the dictionary whose edit distance is smaller than the
1252threshold given as the parameter.
1253
1254
1255@node kor command line options
1256@subsection Command line options
1257
1258@table @code
1259
1260@parhelp
1261@parversion
1262@parinteractive
1263@c @parfile
1264@c @paroutput
1265@c @parfail
1266@c @parcopy
1267@parinputfield
1268@paroutputfield
1269@pardictionary
1270@parprocess
1271@parselect
1272@parunselect
1273@paroneline
1274@paronefield
1275
1276@item @b{@minus{}@minus{}distance=@var{int}, @minus{}n @var{int}}
1277Maximum edit distance (default='1').
1278
1279@item @b{@minus{}@minus{}weights=@var{filename}, @minus{}w @var{filename}}
1280Edit operations' weights file.
1281
1282@c @item @b{@minus{}@minus{}replace, @minus{}r}
1283@c Replace original form with corrected form, place original form in the
1284@c cor field. This option has no effect in @option{--one-*} modes (default=off)
1285
1286
1287@end table
1288
1289
1290@node kor weights definition file
1291@subsection Weights definition file
1292
1293Example:
1294
1295@example
1296
1297%stdcor 1
1298%xchg   1
ż  rz 0.5
1300ch h  0.5
1301u  ó  0.5
1302
1303@end example
1304
1305
1306Default weight is set to 1 (@code{%stdcor 1}), the weight of exchange
1307operation is set to 1 (@code{%xchg 1}), the three principal orthographic
1308errors are assigned the weight 0.5.
1309
1310The edit operation weight declaration, such as
1311
1312@example
ż  rz 0.5
1314@end example
1315
works in both directions, i.e. ż->rz and rz->ż.
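How such declarations might influence the distance is sketched below in Python. This is an assumption-laden illustration, not Werenski's FSA-based algorithm: it compares two plain strings, reads @code{%stdcor} as the cost of a single insertion, deletion or substitution, reads @code{%xchg} as the cost of swapping two adjacent characters, and treats every other line as a bidirectional string replacement. The function names are hypothetical.

```python
def parse_weights(text):
    """Parse a weights definition into (stdcor, xchg, rules)."""
    stdcor, xchg, rules = 1.0, 1.0, []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "%stdcor":
            stdcor = float(parts[1])
        elif parts[0] == "%xchg":
            xchg = float(parts[1])
        else:
            left, right, weight = parts[0], parts[1], float(parts[2])
            # A declaration works in both directions.
            rules.append((left, right, weight))
            rules.append((right, left, weight))
    return stdcor, xchg, rules


def weighted_distance(s, t, stdcor, xchg, rules):
    """Weighted edit distance between s and t (dynamic programming)."""
    inf = float("inf")
    d = [[inf] * (len(t) + 1) for _ in range(len(s) + 1)]
    d[0][0] = 0.0
    for i in range(len(s) + 1):
        for j in range(len(t) + 1):
            if i > 0:
                d[i][j] = min(d[i][j], d[i - 1][j] + stdcor)      # deletion
            if j > 0:
                d[i][j] = min(d[i][j], d[i][j - 1] + stdcor)      # insertion
            if i > 0 and j > 0:
                cost = 0.0 if s[i - 1] == t[j - 1] else stdcor    # substitution
                d[i][j] = min(d[i][j], d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + xchg)    # exchange
            for left, right, weight in rules:
                if s[:i].endswith(left) and t[:j].endswith(right):
                    d[i][j] = min(d[i][j], d[i - len(left)][j - len(right)] + weight)
    return d[len(s)][len(t)]
```

Under these assumptions and the example weights above, @i{morze} and @i{może} are at distance 0.5, whereas @i{kot} and @i{kat} are at distance 1.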
1317
1318The default weights definition file for @code{kor} is:
1319
1320@example
1321$HOME/.local/share/utt/weights.kor
1322@end example
1323
1324or, if the above mentioned file is absent:
1325
1326@example
1327/usr/local/share/utt/weights.kor
1328@end example
1329
1330
1331@node kor dictionaries
1332@subsection Dictionaries
1333
1334see @command{cor}
[261bf62]1335
1336@c ---------------------------------------------------------------------
1337@c SEN
1338@c ---------------------------------------------------------------------
1339
[25ae32e]1340@page
1341@node sen
1342@section sen - a sentensizer
1343
1344@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1345
[9ace5d2]1346@item @strong{Authors:}                 @tab Tomasz Obrębski
[25ae32e]1347@item @strong{Component category:}      @tab filter
[261bf62]1348@item @strong{Input format:}            @tab UTT regular
1349@item @strong{Output format:}           @tab UTT regular
1350@item @strong{Required annotation:}     @tab tok
[25ae32e]1351
1352@end multitable
1353
1354
1355@menu
[261bf62]1356* sen description::
[25ae32e]1357@c * sen input::
1358@c * sen output::
1359* sen example::                 
1360@end menu
1361
[261bf62]1362@node sen description
1363@subsection Description
1364
1365@command{sen} detects sentence boundaries in UTT-formatted texts and marks them with special zero-length segments, in which the @var{type} field may contain the BOS (beginning of sentence) or EOS (end of sentence) annotation.
1366
[25ae32e]1367@node sen example
1368@subsection Example
1369
1370@example
1371command: sen
1372
1373input:
[9ace5d2]13740000 05 W Cześć
[25ae32e]13750005 01 P !
13760006 01 S _
13770007 02 W To
13780009 01 S _
13790010 02 W ja
13800012 01 P .
13810013 01 S \n
1382
1383output:
13840000 00 BOS *
[9ace5d2]13850000 05 W Cześć
[25ae32e]13860005 01 P !
13870006 00 EOS *
13880006 00 BOS *
13890006 01 S _
13900007 02 W To
13910009 01 S _
13920010 02 W ja
13930012 01 P .
13940013 01 S \n
13950014 00 EOS *
1396@end example
1397
1398
1399@c ---------------------------------------------------------------------
1400@c GPH
1401@c ---------------------------------------------------------------------
1402
1403@c @node gph - graphizer
1404@c @chapter gph - graphizer
1405
[9ace5d2]1406@c Authors: Tomasz Obrębski
[25ae32e]1407
1408
1409
1410@c ---------------------------------------------------------------------
[261bf62]1411@c SER
[25ae32e]1412@c ---------------------------------------------------------------------
1413
1414@page
1415@node ser
1416@section ser - pattern search tool
1417
1418@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[9ace5d2]1419@item @strong{Authors:}                 @tab Tomasz Obrębski
[25ae32e]1420@item @strong{Component category:}      @tab filter
[261bf62]1421@item @strong{Input format:}            @tab UTT regular
1422@item @strong{Output format:}           @tab UTT regular
1423@item @strong{Required annotation:}     @tab tok,  lem --one-field
[25ae32e]1424@end multitable
1425
1426@menu
[261bf62]1427* ser description::
[25ae32e]1428* ser command line options::   
1429* ser pattern::                 
1430* ser how ser works::           
1431* ser customization::           
1432* ser limitations::             
1433* ser requirements::           
1434@end menu
1435
1436
[261bf62]1437@node ser description
1438@subsection Description
1439
1440@command{ser} looks for patterns in UTT-formatted texts.
1441
1442
[25ae32e]1443@c ---------------------------------------------------------------------
1444@node ser command line options
1445@subsection Command line options
1446
1447@table @code
1448
1449@parhelp
1450@parversion
1451@c @parfile
1452@c @paroutput
1453@c @parinputfield
1454@c @paroutputfield
1455@parprocess
1456@parinteractive
1457
1458@item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}}
1459The search pattern.
1460
1461@item @b{@minus{}@minus{}morph=@var{field}}
1462The name of the annotation field containing the morphological
1463description (default @code{lem}).
1464
1465@item @b{@minus{}@minus{}flex}
1466Only print the generated flex source code.
1467
1468@item @b{@minus{}@minus{}macro=@var{filename}}
1469Read macrodefinitions from file @var{filename} rather than from
1470default location. This option allows to redefine the set of terms.
1471
1472@item @b{@minus{}@minus{}define=@var{filename}}
1473Append macrodefinitions from file @var{filename}. This option
1474allows to extend the set of terms.
1475
1476@end table
1477
1478
1479@c ---------------------------------------------------------------------
1480@node ser pattern
1481@subsection Pattern
1482
1483The @command{ser} pattern is a regular expression over terms corresponding
1484to text segments or segment sequences. Predefined terms are:
1485
1486@table @code
1487
1488@item seg(@var{t},@var{f},@var{a})
1489a segment of type @var{t}, containing form @var{f} and annotation
1490@var{a}
1491
1492@item form(@var{f})
1493a segment containing form @var{f}
1494
1495@item field(@var{f})
1496a segment containing annotation field @var{f}
1497
1498@item space(@var{f})
1499a space segment of form @var{f}
1500
1501@item word(@var{f})
1502a word segment of form @var{f}
1503
1504@item punct(@var{f})
1505a punct segment of form @var{f}
1506
1507@item number(@var{f})
1508a number segment of form @var{f}
1509
1510@item lexeme(@var{f})
1511a word segment with lemma @var{f}
1512
1513@item cat(@var{c})
1514a word segment of category @var{c}
1515
1516@end table
1517
1518All arguments are optional. If an argument is omitted, an arbitrary
1519string of non-blank characters is assumed as the argument value. Term
1520arguments may be arbitrary character-level regular expressions. The
following special symbols can be used:
1522
1523@multitable {aaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1524@item @code{[@dots{}]}            @tab a character class
1525@item @code{[^@dots{}]}           @tab a negated character class
1526@item @code{|}                    @tab alternative
1527@item @code{*}                    @tab repetition, including zero times
1528@item @code{+}                    @tab repetition, at least one time
1529@item @code{?}                    @tab optionality
1530@item @code{@{@var{m},@var{n}@}}  @tab repetition from @var{m} to @var{n} times
1531@item @code{@{@var{m},@}}         @tab repetition @var{m} or more times
1532@item @code{@{@var{m}@}}          @tab repetition @var{m} times
@item @code{\@var{ddd}}           @tab the character with octal value @var{ddd}
1534@item @code{\x@var{hh}}           @tab the character with hexadecimal value @var{hh}
1535@item @code{( )}                  @tab parentheses, used to override precedence
1536@c @end multitable
1537
1538@c @multitable {aaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1539@item @code{.}    @tab a non-blank character
1540@item @code{\w}   @tab a letter
1541@item @code{\W}   @tab a non-blank character other than a letter
1542@item @code{\d}   @tab a digit
1543@item @code{\D}   @tab a non-blank character other than a digit
1544@item @code{\s}   @tab a space or tab character
1545@item @code{\S}   @tab a non-blank character (the same as @code{.})
1546@item @code{\l}   @tab a lowercase letter
1547@item @code{\L}   @tab an uppercase letter
1548@end multitable
1549
1550
1551@noindent The following characters:
1552@example
1553@verb{%  [   ]   ^   |   *   +   ?   {   }   ,   .   <   >   \ %}
1554@end example
1555must be escaped with a backslash, i.e. written as:
1556@example
1557@verb{% \[  \]  \^  \|  \*  \+  \?  \{  \}  \,  \.  \<  \>  \\ %}
1558@end example
1559
1560@quotation Note
The special symbols are borrowed from Perl with minor modifications.
The meaning of certain special characters and sequences differs slightly
from their common meaning; this is motivated by convenience.
The meaning of the @code{.} special character is modified because of
the special function of spaces in UTT files (they are field
separators). Use @code{\s} to explicitly match a space or tab character.
1568@end quotation
1569
1570In the argument of the @code{cat} term a special operator <...> may be
1571used. A category specification enclosed in angle brackets matches all
1572category descriptions which are consistent (non-contradictory) with the
1573specification. For example @code{<N>} matches all noun descriptions,
@code{<ADJ/Can>} matches all adjectives in the accusative or nominative case.
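One way to read "consistent (non-contradictory)" is sketched below in Python. The sketch assumes that a description consists of a part of speech, a slash, and a sequence of attributes, each written as one uppercase letter followed by its admissible lowercase values (as in @code{ADJ/CnvDpGaipNs}); a specification then contradicts a description only if the parts of speech differ or some shared attribute has no value in common. The function names are hypothetical, and this is not the flex code generated by @command{ser}.

```python
import re


def parse_description(text):
    """Split 'ADJ/CnvDpGaipNs' into ('ADJ', attribute -> set of values)."""
    pos, _, attributes = text.partition("/")
    values = dict((name, set(vals))
                  for name, vals in re.findall(r"([A-Z])([a-z0-9]+)", attributes))
    return pos, values


def consistent(specification, description):
    """True if the description does not contradict the specification."""
    spec_pos, spec_values = parse_description(specification)
    pos, values = parse_description(description)
    if spec_pos != pos:
        return False
    for name, wanted in spec_values.items():
        # An attribute missing from the description cannot contradict.
        if name in values and not (wanted & values[name]):
            return False
    return True
```

Under this reading, the specification @code{ADJ/Can} is consistent with both @code{ADJ/CaDpGiNs} and @code{ADJ/CnvDpGaipNs}, but not with a description whose only case value is another one.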
1575
1576
1577@*
1578@noindent @b{Examples of one-segment patterns:}
1579
1580@multitable {aaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1581@item @code{seg}            @tab any segment
1582@item @code{word}           @tab any word-form
1583@item @code{word(pomocy)}   @tab the word-form @samp{pomocy}
1584@item @code{word(naj.+)}    @tab a word-form beginning with @samp{naj}
1585@item @code{word(\L\l+)}    @tab a capitalized word-form
1586@item @code{punct}          @tab a punctuation character
1587@item @code{space(.*\\n.*)} @tab a space segment containing a newline character
1588@item @code{lexeme(pomoc)}  @tab any form of the lexeme 'pomoc'
@item @code{cat(N/.*)}      @tab a word whose category starts with @code{N/}
@item @code{cat(<N/Ca>)}    @tab a word whose category matches @code{N/Ca}
1591@end multitable
1592
1593@*
1594@noindent @b{Examples of multi-segment patterns:}
1595
1596@table @code
1597
1598@item (word(\L) punct(\.) space?)+ word(\L\l+)
1599a sequence of initials followed by a surname
1600
1601@item punct seg(W|S|N)* cat(<NPRO/Sr>) seg(W|S|N)* punct
a text fragment between two punctuation characters, containing an
occurrence of a relative pronoun
1604
1605@end table
1606
1607
1608@node ser how ser works
1609@subsection How ser works
1610
1611@node ser customization
1612@subsection Customization
1613
1614@c All predefined terms correspond to single segments,
1615
1616@example
[261bf62]1617define(`verbseq', `(cat(<V>) (space cat(<V>)))')
[25ae32e]1618@end example
1619
1620
1621the term @code{cat()} may not be used as a ... of
1622
1623@c See @command{m4} manual for further details on macro definition format.
1624
1625@node ser limitations
1626@subsection Limitations
1627
[261bf62]1628Do not use more than 3 attributes in <>.
[25ae32e]1629
1630@node ser requirements
1631@subsection Requirements
1632
1633In order to run @command{ser}, the following programs must be
1634installed in the system:
1635
1636@itemize
1637
1638@item @command{m4}
1639@item @command{grep}
1640@item @command{flex}
1641@item @command{gcc}
1642
1643@end itemize
1644
1645
1646@c ---------------------------------------------------------------------
[261bf62]1647@c GRP
[25ae32e]1648@c ---------------------------------------------------------------------
1649
1650@page
1651@node grp
1652@section grp - pattern search tool
1653
1654@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[9ace5d2]1655@item @strong{Authors:}                 @tab Tomasz Obrębski
[25ae32e]1656@item @strong{Component category:}      @tab filter
[261bf62]1657@item @strong{Input format:}            @tab UTT flattened
1658@item @strong{Output format:}           @tab UTT flattened
1659@item @strong{Required annotation:}     @tab tok, sen, lem --one-field
[25ae32e]1660@end multitable
1661
1662
[261bf62]1663@menu
1664* grp description::
1665* grp command line options::   
1666* grp pattern::                 
1667* grp hints::   
1668@end menu
1669
1670
1671@node grp description
1672@subsection Description
1673
@code{grp} selects sentences containing an expression matching a
pattern. The pattern format is exactly the same as that accepted by
@code{ser}.

@code{grp} is intended mainly for speeding up the corpus search process.
It is extremely fast (processing speed is usually higher than the speed
of reading the corpus file from disk).
1681
1682@node grp command line options
1683@subsection Command line options
1684
1685@table @code
1686
1687@parhelp
1688@parversion
1689@parprocess
1690@parinteractive
1691
1692@item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}}
1693The search pattern.
1694
1695@item @b{@minus{}@minus{}morph=@var{field}}
1696The name of the annotation field containing the morphological
1697description (default @code{lem}).
1698
1699@item @b{@minus{}@minus{}command}
1700Only print the generated flex source code.
1701
1702@item @b{@minus{}@minus{}macro=@var{filename}}
Read macro definitions from file @var{filename} rather than from
the default location. This option allows the set of terms to be redefined.
1705
1706@item @b{@minus{}@minus{}define=@var{filename}}
Append macro definitions from file @var{filename}. This option
allows the set of terms to be extended.
1709
1710@end table
1711
1712
1713@node grp pattern
1714@subsection Pattern
1715
@xref{ser pattern}.
1717
1718@node grp hints
1719@subsection Hints
1720
The corpus search speed may be increased by combining @command{grp} with the @command{lzop}
compression tool (@command{grp} usually processes data faster than it is read from a
disk, especially for slow laptop drives).
1724
1725@example
[e28a625]1726cat corpus | tok | sen | lem -1 | fla | lzop -7 > corpus.grp.lzo
[25ae32e]1727@end example
1728
1729@example
[e28a625]1730lzop -cd corpus.grp.lzo | grp -e @var{EXPR} | unfla | ser -e @var{EXPR}
[25ae32e]1731@end example
1732
1733
[261bf62]1734
[25ae32e]1735@c ---------------------------------------------------------------------
[261bf62]1736@c MAR
[25ae32e]1737@c ---------------------------------------------------------------------
[261bf62]1738
1739@page
1740@node mar
1741@section mar
1742
1743@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[9ace5d2]1744@item @strong{Authors:}                 @tab Marcin Walas, Tomasz Obrębski
[e28a625]1745@item @strong{Input format:}            @tab UTT flattened
1746@item @strong{Output format:}           @tab UTT flattened
1747@item @strong{Required annotation:}     @tab tok, sen, lem -1
[261bf62]1748@end multitable
1749
[2d89d4b]1750@subsection Description
@code{mar} is a Perl script which matches a given pattern against UTT-formatted text
and marks the matching parts with any number of user-defined tags.
1753
1754@subsection Command line options
1755@table @code
1756@parhelp
1757@parversion
1758
1759@item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}}
1760The search pattern.
1761@item @b{@minus{}@minus{}action=@var{action}, @minus{}a @var{action} [p] [s] [P]}
Perform only the indicated actions, where:
1763@multitable {aaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1764@item @code{p}   @tab preprocess
1765@item @code{s}   @tab search
1766@item @code{P}   @tab postprocess
1767@end multitable
1768default: psP
1769
1770@item @b{@minus{}@minus{}command}
1771print generated sed command, then exit
1772
1773@item @b{@minus{}@minus{}help, @minus{}h}
1774print help, then exit
1775
1776@item @b{@minus{}@minus{}version, @minus{}v}
1777print version, then exit
1778@end table
1779@subsection Tokens in pattern
A @code{mar} pattern is a @code{ser} pattern (@pxref{ser pattern})
in which you can add any number of matching tags, which will be printed in exactly the place where
they were placed in the pattern. A valid token starts with @@ followed by any number of alphanumeric
characters. For example, valid match tokens are: @@STARTMATCH @@ENDMATCH
1784
Matching tokens can be placed between, before or after any of the @code{ser} pattern terms. They don't have
to be paired. There can be any number of them in the pattern (zero or more). They don't have to be unique.
They can be placed one after another. For example:
1788
1789@multitable {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaa}
1790@item @code{@@BOM lexeme(pomoc)}  @tab place tag @b{BOM} before any form of the lexeme 'pomoc'
1791@item @code{@@MATCH lexeme(pomoc) @@MATCH}      @tab place tag @b{MATCH} before and after any form of the lexeme 'pomoc'
@item @code{cat(<ADJ>) @@MATCH lexeme(pomoc) @@MATCH}      @tab place tag @b{MATCH} before and after any form of the lexeme 'pomoc' which is preceded by an adjective
@item @code{cat(<ADJ>) @@TAG @@BOM lexeme(pomoc) @@EOM}      @tab place tags @b{TAG} and @b{BOM} before any form of the lexeme 'pomoc' which is preceded by an adjective, and tag @b{EOM} after it
1794@end multitable
1795
See the @code{mar} help (@code{mar -h}) for more information.
1797
1798@subsection How mar works
@code{mar} uses the @code{m4} macro processor to translate the given @code{ser} pattern into a regular expression. The regular expression is then turned into a @code{sed} command script, which is executed.

You can see the translated @code{sed} script by using the @code{@minus{}@minus{}command} option.
1802@subsection Limitations
The complexity of the computations performed by @code{mar} increases linearly with the number of placed tokens, so it is highly recommended not to place too many tokens.
1804@subsection Requirements
1805In order to run @code{mar}, the following programs must be installed in the system:
1806
1807@itemize
1808
1809@item @command{m4}
1810@item @command{grep}
1811@item @command{sed}
1812
1813@end itemize
1814
[261bf62]1815
[e28a625]1816
[261bf62]1817@c ---------------------------------------------------------------------
1818@c KOT
[25ae32e]1819@c ---------------------------------------------------------------------
1820
1821@page
1822@node kot
1823@section kot - untokenizer
1824
[261bf62]1825@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[9ace5d2]1826@item @strong{Authors:}                 @tab Tomasz Obrębski
[261bf62]1827@item @strong{Component category:}      @tab filter
1828@item @strong{Input format:}            @tab UTT regular
1829@item @strong{Output format:}           @tab text
1830@item @strong{Required annotation:}     @tab tok
1831@end multitable
[25ae32e]1832
1833
1834@menu
[261bf62]1835* kot description::
[25ae32e]1836* kot command line options::   
1837* kot usage examples::   
1838@end menu
1839
[261bf62]1840@node kot description
1841@subsection Description
1842
1843@command{kot} transforms a UTT formatted file back into raw text format.
1844
[25ae32e]1845@node kot command line options
1846@subsection Command line options
1847
1848@table @code
1849
1850@parhelp
1851
1852@c @item @b{@minus{}@minus{}version}, @b{@minus{}v}
1853
1854@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
1855
1856@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
1857
1858@c @item @b{@minus{}@minus{}interactive @minus{}i}
1859
1860@c @item @b{@minus{}@minus{}config=@var{filename}}
1861
1862@item
1863
1864@item @b{@minus{}@minus{}gap-fill=@var{string}, @minus{}g @var{string}}
1865print @var{string} between nonadjacent segments of the input file
1866
1867@item @b{@minus{}@minus{}spaces, @minus{}r}
1868retain the special characters @code{_}, @code{\t},
1869@code{\n}, @code{\r}, @code{\f} unexpanded in the output
1870
1871@end table
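The behaviour described by these options can be approximated with a short Python sketch. It is only a model of what @command{kot} does, under several assumptions: segment lines have the layout shown in the @command{sen} example (start, length, type, form), zero-length segments carry no text, and the escaped characters are expanded in every form. The name @code{untokenize} and its parameters are hypothetical.

```python
# Escaped forms used in UTT segments and the characters they stand for.
SPECIAL = (("\\n", "\n"), ("\\t", "\t"), ("\\r", "\r"), ("\\f", "\f"), ("_", " "))


def untokenize(lines, gap_fill="", keep_spaces=False):
    """Rebuild raw text from UTT segment lines."""
    pieces = []
    expected_start = None
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # not a segment line
        start, length, form = int(fields[0]), int(fields[1]), fields[3]
        if length == 0:
            continue  # zero-length markers such as BOS/EOS carry no text
        if expected_start is not None and start != expected_start:
            pieces.append(gap_fill)  # nonadjacent segments (--gap-fill)
        if not keep_spaces:
            for escaped, real in SPECIAL:
                form = form.replace(escaped, real)  # skipped with --spaces
        pieces.append(form)
        expected_start = start + length
    return "".join(pieces)
```

Applied to the output of the @command{sen} example, the sketch reconstructs the text @samp{Cześć! To ja.} followed by a newline.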
1872
1873@node kot usage examples
1874@subsection Usage examples
1875
1876@example
1877cat legia.txt | tok | kot       
1878@end example
1879
1880@example
1881cat legia.txt | tok | lem -1 | kot
1882@end example
1883
[261bf62]1884@c ---------------------------------------------------------------
1885@c CON
1886@c ---------------------------------------------------------------
1887
[25ae32e]1888
1889@page
1890@node con
1891@section con - concordance table generator
1892
1893@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
1894@item @strong{Authors:}                 @tab Justyna Walkowska
1895@item @strong{Component category:}      @tab sink
[261bf62]1896@item @strong{Input format:}            @tab UTT regular
1897@item @strong{Output format:}           @tab text
1898@item @strong{Required annotation:}     @tab ser or mar
[25ae32e]1899@end multitable
1900@c
1901
1902@menu
[261bf62]1903* con description::
[25ae32e]1904* con command line options::
1905* con usage example::
1906* con hints::   
1907@end menu
1908
[261bf62]1909
1910@node con description
1911@subsection Description
1912
1913@command{con} generates a concordance table based on a pattern given to @command{ser}.
1914
1915
[25ae32e]1916@node con command line options
1917@subsection Command line options
1918
1919@table @code
1920
1921@parhelp
1922
1923@c @item @b{@minus{}@minus{}help}, @b{@minus{}h}
1924@c @item @b{@minus{}@minus{}version}, @b{@minus{}v}
1925@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
1926@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
1927@c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}} [???]
1928@c @item @b{@minus{}@minus{}copy, @minus{}c} [???]
1929@c @item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}}
1930@c @item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}}
1931@c @item @b{@minus{}@minus{}process=@var{class}, @minus{}p @var{class}}
1932@c @item @b{@minus{}@minus{}interactive @minus{}i}
1933@c @item @b{@minus{}@minus{}config=@var{filename}}
1934@c @item
1935@c @item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}}
1936@c search pattern
1937@c
1938@c @item @b{@minus{}@minus{}flex}
1939@c only print the generated flex source code
1940@c
1941@c @item @b{@minus{}@minus{}macro=@var{filename}}
1942@c read macrodefinitions from file @var{filename} rather than from
1943@c default location. This option allows to redefine the set of terms.
1944@c
1945@c @item @b{@minus{}@minus{}define=@var{filename}}
1946@c append macrodefinitions from file @var{filename}. This option
1947@c allows to extend the set of terms.
1948
1949@item @b{@minus{}@minus{}left @minus{}l}           
1950        Left context info (default='30c'). Example:
1951@example                         
1952                                 -l=5c: left context is 5 characters
1953                                 -l=5w: left context is 5 words
1954                                 -l=5s: left context is 5 non-empty input lines
1955                                 -l='\s*\S+\sr\S+BOS': left context starts with the given regex
1956@end example
1957
1958@item @b{@minus{}@minus{}right @minus{}r}           
1959        Right context info (default='30c').
1960@item @b{@minus{}@minus{}trim @minus{}t}           
1961        Clear incomplete words from output.
1962@item @b{@minus{}@minus{}white @minus{}w}           
        Do not convert whitespace characters into spaces.
1964@item @b{@minus{}@minus{}column @minus{}c}           
1965        Left column minimal width in characters (default = 0).
1966@item @b{@minus{}@minus{}ignore @minus{}i}           
1967        Ignore segment inconsistency in the input.
[261bf62]1968@item @b{@minus{}@minus{}bom}           
[25ae32e]1969        Beginning of selected segment (regex, default='[0-9]+ [0-9]+ BOM .*').
[261bf62]1970@item @b{@minus{}@minus{}eom}           
[25ae32e]1971        End of selected segment (regex, default='[0-9]+ [0-9]+ EOM .*').
1972@item @b{@minus{}@minus{}bod}           
1973        Selected segment beginning display string (default='[').
1974@item @b{@minus{}@minus{}eod}           
1975        Selected segment end display string (default=']').
1976
1977
1978
1979@end table
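The context specifications accepted by the left and right context options can be told apart as in the following Python sketch. It only classifies the argument, following the examples given for the left context option; the function name @code{parse_context} and the returned labels are hypothetical and do not reflect how @command{con} is implemented.

```python
import re

# Unit suffixes from the examples above and what they count.
UNITS = (("c", "characters"), ("w", "words"), ("s", "lines"))


def parse_context(specification):
    """Classify a context specification such as '30c', '5w', '5s' or a regex."""
    match = re.fullmatch(r"(\d+)([cws])", specification)
    if match is None:
        # Anything else is taken to be a regular expression.
        return ("regex", specification)
    return (dict(UNITS)[match.group(2)], int(match.group(1)))
```

For instance, @code{parse_context("30c")} yields the pair @code{("characters", 30)}, and a specification that is not a number followed by @code{c}, @code{w} or @code{s} is returned unchanged with the label @code{regex}.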
1980
1981@node con usage example
1982@subsection Usage example
1983@example
[261bf62]1984cat file.txt | tok | lem -1 | ser -e 'lexeme(dom)' | con 
[25ae32e]1985@end example
1986
1987
1988@node con hints
1989@subsection Hints
1990
1991@command{con} is a rather slow program. Do not pass large amounts of
1992redundant text through this program. @command{con} works fine in the following
1993sequence:
1994
1995@example
1996... | grp -e EXPR | ser -e EXPR | con
1997@end example
1998
1999
2000@c ---------------------------------------------------------------------
2001@c ---------------------------------------------------------------------
2002
2003@page
2004@node Auxiliary tools
2005@chapter Auxiliary tools
2006
2007@menu
2008* compiledic::         dictionary compiler
2009* fla::                UTT file flattener
2010* unfla::              UTT file unflattener
2011@end menu
2012
2013
2014@page
2015@node compiledic
2016@section compiledic - the dictionary compiler
2017
2018@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[9ace5d2]2019@item @strong{Authors:}                 @tab Michał Stolarski, Tomasz Obrębski
[25ae32e]2020@item @strong{Component category:}      @tab additional tool
2021@end multitable
2022@c
2023
2024@command{compiledic} compiles dictionaries in text format (@code{.dic} extension) into binary
2025(FSA) format (@code{.bin} extension).
2026
2027Automaton representation of a dictionary is built using the AT&T tools:
2028@itemize
2029@item AT&T FSM Library,
2030@item AT&T Lextools.
2031@end itemize
2032
For @command{compiledic} to work, both of the above packages must be
installed on your system.  They are freely available
for non-commercial use.
2036
2037Usage:
2038@example
2039        compiledic <dictionaryname>.dic
2040@end example
2041
2042The file <dictionaryname>.bin will be generated.
2043
Note: the program produces many temporary files, which are
stored in the current directory. They are deleted after successful
termination of the program.
2047
2048@c @menu
2049@c * con command line options::
2050@c * con usage example::
2051@c * con hints::   
2052@c @end menu
2053
2054
[e28a625]2055@c -------------------------------------------------------------------------------
2056@c FLA
2057@c -------------------------------------------------------------------------------
2058
[25ae32e]2059@page
2060@node fla
2061@section fla - the UTT file flattener
2062
2063@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[9ace5d2]2064@item @strong{Authors:}                 @tab Tomasz Obrębski
[e28a625]2065@item @strong{Input format:}            @tab UTT regular
2066@item @strong{Output format:}           @tab UTT flattened
2067@item @strong{Required annotation:}     @tab sen
[25ae32e]2068@end multitable
2069@c
2070
[e28a625]2071@menu
2072* fla description::
2073@c * fla command line options::
2074@c * fla usage example::
2075@end menu
2076
2077
2078@node fla description
2079@subsection Description
2080
@command{fla} ``flattens'' a UTT file by merging the segments belonging
to one sentence into one line. Technically, the end-of-line characters
('\n', ASCII code 10) separating these segments are replaced with
form-feed characters ('\f', ASCII code 12).  Flattening makes it
possible to process UTT files sentence by sentence with such tools as
@command{grep} or @command{sed} (this is used in @command{grp} and
@command{mar}).
2087
Flattened files should have the suffix @code{.fla}, e.g. @file{thetext.utt.fla}.

Flattened files are still human-readable.
2091
2092Usage:
2093
2094@example
2095        fla [<bosregex>]
2096@end example
2097
The optional argument is a regular expression describing the segments
which should be treated as sentence beginnings (the test is: the
segment contains a fragment matching @code{<bosregex>}). By
default, segments containing a @code{BOS} field are sought.
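
The flattening step described above can be sketched in Python. This is not the implementation of @command{fla}, only an illustration: the function name, the default pattern @code{BOS} and the sample segments in the usage note are assumptions made for the example.

```python
import re


def flatten(lines, bos_regex="BOS"):
    """Join the segments of each sentence into one line.

    Segments are separated by form feeds; a segment containing a
    fragment that matches bos_regex starts a new sentence.
    """
    pattern = re.compile(bos_regex)
    sentences = []
    for line in lines:
        # Segments seen before the first sentence beginning are kept
        # together on one leading line.
        if pattern.search(line) or not sentences:
            sentences.append([line])
        else:
            sentences[-1].append(line)
    return ["\f".join(segments) for segments in sentences]
```

For example, five segment lines forming two sentences, each opened by a hypothetical @code{BOS} segment, come out as two lines whose segments are separated by '\f'.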
2102
[e28a625]2103@c -------------------------------------------------------------------------------
2104@c UNFLA
2105@c -------------------------------------------------------------------------------
[25ae32e]2106
2107@page
2108@node unfla
2109@section unfla - the UTT file unflattener
2110
2111@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
[9ace5d2]2112@item @strong{Authors:}                 @tab Tomasz Obrębski
[e28a625]2113@item @strong{Input format:}            @tab UTT flattened
2114@item @strong{Output format:}           @tab UTT regular
2115@item @strong{Required annotation:}     @tab -
[25ae32e]2116@end multitable
2117
[e28a625]2118@menu
2119* unfla description::
2120@c * fla command line options::
2121@c * fla usage example::
2122@end menu
2123
2124@node unfla description
2125@subsection Description
[25ae32e]2126@command{unfla} transforms a flattened UTT file, produced by
2127@command{fla}, into the regular format by restoring end-of-line
2128characters.
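
The inverse operation is simple enough to sketch in a few lines of Python. As before, this is an illustration under assumptions (the function name is hypothetical), not the code of @command{unfla}: every form-feed character is taken to stand for one removed end-of-line character.

```python
def unflatten(flat_lines):
    """Return one segment per line by splitting flattened lines at form feeds."""
    segments = []
    for line in flat_lines:
        segments.extend(line.split("\f"))
    return segments
```

Applied to the output of a flattening step, this yields the original sequence of segment lines again.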
2129
2130
2131
2132
2133@c ---------------------------------------------------------------------
2134@c USAGE EXAMPLES
2135@c ---------------------------------------------------------------------
2136
2137@node Usage examples
2138@chapter Usage examples
2139
2140@subsubheading Simple pipelines
2141
2142@enumerate
2143
2144@item tokenization
2145
@example
cat text | tok > output1
@end example
2147
2148@item morphological annotation (1)
2149
Simple dictionary-based lemmatization:

@example
cat text | tok | lem > output1
@end example
2153
2154@item morphological annotation (2)
2155
1) perform dictionary-based lemmatization
2) guess descriptions for words which have no annotation
2158
2159@example
2160cat text | tok | lem | gue -S lem > output2
2161@end example
2162
2163@item morphological annotation (3)
2164
21651) perform dictionary-based lemmatization
21662) try to correct words with no annotation
21673) perform dictionary-based lemmatization of corrected words
21684) guess descriptions for words which still have no annotation
2169
2170@example
2171cat text | tok | lem | cor -p W -S lem | lem -I cor | gue -p W -S lem
2172@end example
2173@item spelling correction
2174
2175
2176
2177@example
[e28a625]2178cat text | tok | egrep ' W ' | lem | egrep -v 'lem:' | cor -1
[25ae32e]2179@end example
2180
2181@item Expression extraction
2182
2183Extraction of all occurrences of a verb followed by a form of the noun 'rozmowa'.
2184
2185@example
2186cat text | tok | lem -1 | ser -e 'cat(<V>) space lexeme(rozmowa)' -m | kot > output4
2187@end example
2188
2189@item A word in context
2190
Extraction of text fragments containing a form of the lexeme 'rozmowa' in
the context of 5 preceding and 5 following corpus segments.
2193
2194@example
2195cat text | tok | lem -1 | ser -e 'seg@{5@} lexeme(rozmowa) seg@{5@}' -m | kot > output
2196@end example
2197
2198@item generation of concordance table (1)
2199
2200@example
2201cat text | tok | lem -1 | ser -e 'cat(<V>) space lexeme(rozmowa)' | con
2202@end example
2203
220410"
2205
2206@item generation of concordance table (2)
2207
2208The same as above but much faster
2209
2210@example
2211cat text | tok | lem -1 | \
2212grp -e 'cat(<V>) space lexeme(rozmowa)' | \
2213ser -e 'cat(<V>) space lexeme(rozmowa)' | \
2214con
2215@end example
2216
22172"
2218
2219@item generation of concordance table (3)
2220
Usually, one searches the same corpus repeatedly. In such a case it is
advisable to first transform the corpus data into the format
required by @command{grp}, and then search the preprocessed data.

As @command{grp} (@command{grep}) processes data faster than it can be
read from the disk drive, the search time may be shortened further by
compressing the files.  We suggest using the
@command{lzop} compressor/decompressor.
[25ae32e]2229
2230@item the fastest way to search a large corpus
2231
[e28a625]2232step 1: corpus preprocessing
[25ae32e]2233
2234@example
2235cat corpus | tok | sen | lem -1 \
[e28a625]2236| fla | lzop -7 > corpus.grp.lzo
[25ae32e]2237@end example
2238
2239step 2: search
2240
2241@example
lzop -cd corpus.grp.lzo | unfla | \
grp -e 'cat(<V>) space lexeme(rozmowa)' | \
ser -e 'cat(<V>) space lexeme(rozmowa)' | \
con
2244@end example
2245
2246@end enumerate
2247
[e28a625]2248@c @subsubheading More complicated configurations
[25ae32e]2249
2250
[e28a625]2251@c @example
2252@c mknod fifo1 p
2253@c mknod fifo2 p
2254@c mknod fifo3 p
2255@c mknod fifo4 p
2256@c mknod fifo5 p
2257
2258@c tok | lem -p W -e fifo1 > fifo2 &
2259@c cor -e fifo3 < fifo1 | lem > fifo4 &
2260@c gue < fifo3 > fifo5 &
2261@c sort -m fifo2 fifo4 fifo5
2262
2263@c rm fifo?
2264@c @end example
[25ae32e]2265
2266
2267@c ---------------------------------------------------------------------
2268@c ---------------------------------------------------------------------
2269
2270@c ---------------------------------------------------------------------
2271@c PMDBF DICTIONARY
2272@c ---------------------------------------------------------------------
2273
2274@node PMDBF dictionary
2275@chapter PMDBF dictionary
2276
UTT components come with lexical data derived from the Polish
Morphological Database (PMDB).
2279
2280@menu
2281* PMDBF files::   
2282* PMDBF tag structure::                 
2283* PMDBF parts of speech::           
2284* PMDBF morphosyntactic attributes::           
2285@end menu
2286
2287@node PMDBF files
2288@section Files
2289
2290@node PMDBF tag structure
2291@section Tag structure
2292
@example
pos   = [[:upper:]]+
attr  = [[:upper:]]+
val   = [[:lower:][:digit:]?!*+-] | <[^>\n]+>
descr = pos ( / ( attr val + ) + ) ?
@end example
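
The grammar above translates almost directly into a regular expression. The Python sketch below is a hedged illustration: it approximates the POSIX classes @code{[:upper:]}, @code{[:lower:]} and @code{[:digit:]} with ASCII ranges, and the function name as well as the tags in the usage note are invented for the example.

```python
import re

# ASCII approximations of the POSIX classes used in the grammar.
POS = r"[A-Z]+"
ATTR = r"[A-Z]+"
VAL = r"(?:[a-z0-9?!*+-]|<[^>\n]+>)"

# descr = pos ( / ( attr val + ) + ) ?
DESCR = re.compile(POS + "(?:/(?:" + ATTR + VAL + "+)+)?")


def is_valid_description(tag):
    """Return True if the whole string is a well-formed description."""
    return DESCR.fullmatch(tag) is not None
```

With this sketch, strings such as @code{N} or @code{N/GfNsCn} are accepted, while @code{N/} (a slash with no attributes) and @code{N/G} (an attribute without a value) are rejected.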
2300
2301@node PMDBF parts of speech
2302@section Parts of speech
2303
2304@multitable {ADJPRP} { adjectival-passive-participle }
2305@item @code{N} @tab noun
2306@item @code{NPRO} @tab nominal-pronoun
2307@item @code{NV} @tab deverbal-noun
2308@item @code{V} @tab verb
2309@item @code{BYC} @tab byc
2310@item @code{VNI} @tab non-inflected-verb
2311@item @code{ADJ} @tab adjective
2312@item @code{ADJPAP} @tab adjectival-passive-participle
2313@item @code{ADJPRP} @tab adjectival-present-participle
2314@item @code{ADJPP} @tab adjectival-past-participle
2315@item @code{ADJPRO} @tab adjectival-pronoun
2316@item @code{ADJNUM} @tab adjectival-numeral
2317@item @code{ADV} @tab adverb
2318@item @code{ADVANP} @tab adverbial-anterior-participle
2319@item @code{ADVPRP} @tab adverbial-present-participle
2320@item @code{ADVPRO} @tab adverbial-pronoun
2321@item @code{ADVNUM} @tab  adverbial-numeral
2322@item @code{P} @tab preposition
2323@item @code{PPRO} @tab prep-noun-pronoun
2324@item @code{CONJ} @tab conjunction
2325@item @code{EXCL} @tab exclamation
2326@item @code{APP} @tab call
2327@item @code{ONO} @tab onomatopoeia
2328@item @code{PART} @tab particle
2329@item @code{NUMCRD} @tab cardinal-numeral
2330@item @code{NUMCOL} @tab collective-numeral
2331@item @code{NUMPAR} @tab partitive-numeral
2332@item @code{NUMORD} @tab ordinal-numeral
2333@end multitable
2334
2335@node PMDBF morphosyntactic attributes
2336@section Morphosyntactic attributes
2337
2338@multitable {Attr} {Val} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
2339@c @headitem Attr @tab Val @tab Description
2340@item
2341@code{A} @tab @tab Aspect
2342@item
2343@tab @code{p} @tab perfect
2344@item
2345@tab @code{i} @tab imperfect.
2346@item
2347@item
2348@code{V} @tab @tab Verb-Form
2349@item
2350@tab @code{b} @tab infinitive,
2351@item
2352@tab @code{p} @tab personal,
2353@item
2354@tab @code{i} @tab impersonal.
2355@item
2356@item
2357@code{M} @tab @tab Mood
2358@item
2359@tab @code{d} @tab declarative,
2360@item
2361@tab @code{c} @tab conditional,
2362@item
2363@tab @code{i} @tab imperative.
2364@item
2365@item
2366@code{T} @tab @tab Tense
2367@item
2368@tab @code{a} @tab past,
2369@item
2370@tab @code{r} @tab present,
2371@item
2372@tab @code{f} @tab future.
2373@item
2374@item
2375@code{P} @tab @tab Person
2376@item
2377@tab @code{1} @tab 1,
2378@item
2379@tab @code{2} @tab 2,
2380@item
2381@tab @code{3} @tab 3.
2382@item
2383@item
2384@code{D} @tab @tab Degree
2385@item
2386@tab @code{p} @tab positive,
2387@item
2388@tab @code{c} @tab comparative,
2389@item
2390@tab @code{s} @tab superlative.
2391@item
2392@item
2393@code{N} @tab @tab Number
2394@item
2395@tab @code{s} @tab singular,
2396@item
2397@tab @code{p} @tab plural.
2398@item
2399@item
2400@code{C} @tab @tab Case
2401@item
2402@tab @code{n} @tab nominative,
2403@item
2404@tab @code{g} @tab genitive,
2405@item
2406@tab @code{d} @tab dative,
2407@item
2408@tab @code{a} @tab accusative,
2409@item
@tab @code{i} @tab instrumental,
2411@item
2412@tab @code{l} @tab locative,
2413@item
2414@tab @code{v} @tab vocative.
2415@item
2416@code{G} @tab @tab Gender
2417@item
2418@tab @code{p} @tab masculine-personal,
2419@item
2420@tab @code{a} @tab masculine-animal,
2421@item
2422@tab @code{i} @tab masculine-inanimate,
2423@item
2424@tab @code{f} @tab feminine,
2425@item
2426@tab @code{n} @tab neuter.
2427@end multitable
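
The attribute table can be used to expand a tag into readable names. The following Python sketch copies the table into a dictionary and splits the attribute part of a tag according to the tag structure given earlier. It is an illustration only: the function name and the example tags are hypothetical, and uppercase letters are matched with an ASCII range.

```python
import re

# Attribute letters and value codes, copied from the table above.
ATTRIBUTES = {
    "A": ("Aspect", {"p": "perfect", "i": "imperfect"}),
    "V": ("Verb-Form", {"b": "infinitive", "p": "personal", "i": "impersonal"}),
    "M": ("Mood", {"d": "declarative", "c": "conditional", "i": "imperative"}),
    "T": ("Tense", {"a": "past", "r": "present", "f": "future"}),
    "P": ("Person", {"1": "1", "2": "2", "3": "3"}),
    "D": ("Degree", {"p": "positive", "c": "comparative", "s": "superlative"}),
    "N": ("Number", {"s": "singular", "p": "plural"}),
    "C": ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
                   "a": "accusative", "i": "instrumental", "l": "locative",
                   "v": "vocative"}),
    "G": ("Gender", {"p": "masculine-personal", "a": "masculine-animal",
                     "i": "masculine-inanimate", "f": "feminine",
                     "n": "neuter"}),
}

# One attribute name followed by one or more values.
ATTR_VALUES = re.compile(r"([A-Z]+)((?:[a-z0-9?!*+-]|<[^>\n]+>)+)")


def describe(tag):
    """Split a tag into its part of speech and a dictionary of named values."""
    pos, _, attrs = tag.partition("/")
    result = {}
    for attr, values in ATTR_VALUES.findall(attrs):
        name, legend = ATTRIBUTES.get(attr, (attr, {}))
        # A value is a single character or a bracketed <...> string;
        # unknown attributes and values are passed through unchanged.
        codes = re.findall(r"<[^>\n]+>|.", values)
        result[name] = [legend.get(code, code) for code in codes]
    return pos, result
```

For the made-up tag @code{N/GfNsCn} this gives the part of speech @code{N} together with Gender feminine, Number singular and Case nominative.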
2428
2429
2430@c ---------------------------------------------------------------------
2431@c ---------------------------------------------------------------------
2432@c
2433@c @node Examples
2434@c @chapter Examples
2435
2436@c ----------------------------------------------------------------------
2437@c ----------------------------------------------------------------------
2438
2439@node    GNU Free Documentation License
2440@chapter GNU Free Documentation License
2441
2442@c The GNU Free Documentation License.
2443@center Version 1.2, November 2002
2444
2445@c This file is intended to be included within another document,
2446@c hence no sectioning command or @node.
2447
2448@display
2449Copyright @copyright{} 2000,2001,2002 Free Software Foundation, Inc.
245051 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA
2451
2452Everyone is permitted to copy and distribute verbatim copies
2453of this license document, but changing it is not allowed.
2454@end display
2455
2456@enumerate 0
2457@item
2458PREAMBLE
2459
2460The purpose of this License is to make a manual, textbook, or other
2461functional and useful document @dfn{free} in the sense of freedom: to
2462assure everyone the effective freedom to copy and redistribute it,
2463with or without modifying it, either commercially or noncommercially.
2464Secondarily, this License preserves for the author and publisher a way
2465to get credit for their work, while not being considered responsible
2466for modifications made by others.
2467
2468This License is a kind of ``copyleft'', which means that derivative
2469works of the document must themselves be free in the same sense.  It
2470complements the GNU General Public License, which is a copyleft
2471license designed for free software.
2472
2473We have designed this License in order to use it for manuals for free
2474software, because free software needs free documentation: a free
2475program should come with manuals providing the same freedoms that the
2476software does.  But this License is not limited to software manuals;
2477it can be used for any textual work, regardless of subject matter or
2478whether it is published as a printed book.  We recommend this License
2479principally for works whose purpose is instruction or reference.
2480
2481@item
2482APPLICABILITY AND DEFINITIONS
2483
2484This License applies to any manual or other work, in any medium, that
2485contains a notice placed by the copyright holder saying it can be
2486distributed under the terms of this License.  Such a notice grants a
2487world-wide, royalty-free license, unlimited in duration, to use that
2488work under the conditions stated herein.  The ``Document'', below,
2489refers to any such manual or work.  Any member of the public is a
2490licensee, and is addressed as ``you''.  You accept the license if you
2491copy, modify or distribute the work in a way requiring permission
2492under copyright law.
2493
2494A ``Modified Version'' of the Document means any work containing the
2495Document or a portion of it, either copied verbatim, or with
2496modifications and/or translated into another language.
2497
2498A ``Secondary Section'' is a named appendix or a front-matter section
2499of the Document that deals exclusively with the relationship of the
2500publishers or authors of the Document to the Document's overall
2501subject (or to related matters) and contains nothing that could fall
2502directly within that overall subject.  (Thus, if the Document is in
2503part a textbook of mathematics, a Secondary Section may not explain
2504any mathematics.)  The relationship could be a matter of historical
2505connection with the subject or with related matters, or of legal,
2506commercial, philosophical, ethical or political position regarding
2507them.
2508
2509The ``Invariant Sections'' are certain Secondary Sections whose titles
2510are designated, as being those of Invariant Sections, in the notice
2511that says that the Document is released under this License.  If a
2512section does not fit the above definition of Secondary then it is not
2513allowed to be designated as Invariant.  The Document may contain zero
2514Invariant Sections.  If the Document does not identify any Invariant
2515Sections then there are none.
2516
2517The ``Cover Texts'' are certain short passages of text that are listed,
2518as Front-Cover Texts or Back-Cover Texts, in the notice that says that
2519the Document is released under this License.  A Front-Cover Text may
2520be at most 5 words, and a Back-Cover Text may be at most 25 words.
2521
2522A ``Transparent'' copy of the Document means a machine-readable copy,
2523represented in a format whose specification is available to the
2524general public, that is suitable for revising the document
2525straightforwardly with generic text editors or (for images composed of
2526pixels) generic paint programs or (for drawings) some widely available
2527drawing editor, and that is suitable for input to text formatters or
2528for automatic translation to a variety of formats suitable for input
2529to text formatters.  A copy made in an otherwise Transparent file
2530format whose markup, or absence of markup, has been arranged to thwart
2531or discourage subsequent modification by readers is not Transparent.
2532An image format is not Transparent if used for any substantial amount
2533of text.  A copy that is not ``Transparent'' is called ``Opaque''.
2534
2535Examples of suitable formats for Transparent copies include plain
2536@sc{ascii} without markup, Texinfo input format, La@TeX{} input
2537format, @acronym{SGML} or @acronym{XML} using a publicly available
2538@acronym{DTD}, and standard-conforming simple @acronym{HTML},
2539PostScript or @acronym{PDF} designed for human modification.  Examples
2540of transparent image formats include @acronym{PNG}, @acronym{XCF} and
2541@acronym{JPG}.  Opaque formats include proprietary formats that can be
2542read and edited only by proprietary word processors, @acronym{SGML} or
2543@acronym{XML} for which the @acronym{DTD} and/or processing tools are
2544not generally available, and the machine-generated @acronym{HTML},
2545PostScript or @acronym{PDF} produced by some word processors for
2546output purposes only.
2547
2548The ``Title Page'' means, for a printed book, the title page itself,
2549plus such following pages as are needed to hold, legibly, the material
2550this License requires to appear in the title page.  For works in
2551formats which do not have any title page as such, ``Title Page'' means
2552the text near the most prominent appearance of the work's title,
2553preceding the beginning of the body of the text.
2554
2555A section ``Entitled XYZ'' means a named subunit of the Document whose
2556title either is precisely XYZ or contains XYZ in parentheses following
2557text that translates XYZ in another language.  (Here XYZ stands for a
2558specific section name mentioned below, such as ``Acknowledgements'',
2559``Dedications'', ``Endorsements'', or ``History''.)  To ``Preserve the Title''
2560of such a section when you modify the Document means that it remains a
2561section ``Entitled XYZ'' according to this definition.
2562
2563The Document may include Warranty Disclaimers next to the notice which
2564states that this License applies to the Document.  These Warranty
2565Disclaimers are considered to be included by reference in this
2566License, but only as regards disclaiming warranties: any other
2567implication that these Warranty Disclaimers may have is void and has
2568no effect on the meaning of this License.
2569
2570@item
2571VERBATIM COPYING
2572
2573You may copy and distribute the Document in any medium, either
2574commercially or noncommercially, provided that this License, the
2575copyright notices, and the license notice saying this License applies
2576to the Document are reproduced in all copies, and that you add no other
2577conditions whatsoever to those of this License.  You may not use
2578technical measures to obstruct or control the reading or further
2579copying of the copies you make or distribute.  However, you may accept
2580compensation in exchange for copies.  If you distribute a large enough
2581number of copies you must also follow the conditions in section 3.
2582
2583You may also lend copies, under the same conditions stated above, and
2584you may publicly display copies.
2585
2586@item
2587COPYING IN QUANTITY
2588
2589If you publish printed copies (or copies in media that commonly have
2590printed covers) of the Document, numbering more than 100, and the
2591Document's license notice requires Cover Texts, you must enclose the
2592copies in covers that carry, clearly and legibly, all these Cover
2593Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on
2594the back cover.  Both covers must also clearly and legibly identify
2595you as the publisher of these copies.  The front cover must present
2596the full title with all words of the title equally prominent and
2597visible.  You may add other material on the covers in addition.
2598Copying with changes limited to the covers, as long as they preserve
2599the title of the Document and satisfy these conditions, can be treated
2600as verbatim copying in other respects.
2601
2602If the required texts for either cover are too voluminous to fit
2603legibly, you should put the first ones listed (as many as fit
2604reasonably) on the actual cover, and continue the rest onto adjacent
2605pages.
2606
2607If you publish or distribute Opaque copies of the Document numbering
2608more than 100, you must either include a machine-readable Transparent
2609copy along with each Opaque copy, or state in or with each Opaque copy
2610a computer-network location from which the general network-using
2611public has access to download using public-standard network protocols
2612a complete Transparent copy of the Document, free of added material.
2613If you use the latter option, you must take reasonably prudent steps,
2614when you begin distribution of Opaque copies in quantity, to ensure
2615that this Transparent copy will remain thus accessible at the stated
2616location until at least one year after the last time you distribute an
2617Opaque copy (directly or through your agents or retailers) of that
2618edition to the public.
2619
2620It is requested, but not required, that you contact the authors of the
2621Document well before redistributing any large number of copies, to give
2622them a chance to provide you with an updated version of the Document.
2623
2624@item
2625MODIFICATIONS
2626
2627You may copy and distribute a Modified Version of the Document under
2628the conditions of sections 2 and 3 above, provided that you release
2629the Modified Version under precisely this License, with the Modified
2630Version filling the role of the Document, thus licensing distribution
2631and modification of the Modified Version to whoever possesses a copy
2632of it.  In addition, you must do these things in the Modified Version:
2633
2634@enumerate A
2635@item
2636Use in the Title Page (and on the covers, if any) a title distinct
2637from that of the Document, and from those of previous versions
2638(which should, if there were any, be listed in the History section
2639of the Document).  You may use the same title as a previous version
2640if the original publisher of that version gives permission.
2641
2642@item
2643List on the Title Page, as authors, one or more persons or entities
2644responsible for authorship of the modifications in the Modified
2645Version, together with at least five of the principal authors of the
2646Document (all of its principal authors, if it has fewer than five),
2647unless they release you from this requirement.
2648
2649@item
2650State on the Title page the name of the publisher of the
2651Modified Version, as the publisher.
2652
2653@item
2654Preserve all the copyright notices of the Document.
2655
2656@item
2657Add an appropriate copyright notice for your modifications
2658adjacent to the other copyright notices.
2659
2660@item
2661Include, immediately after the copyright notices, a license notice
2662giving the public permission to use the Modified Version under the
2663terms of this License, in the form shown in the Addendum below.
2664
2665@item
2666Preserve in that license notice the full lists of Invariant Sections
2667and required Cover Texts given in the Document's license notice.
2668
2669@item
2670Include an unaltered copy of this License.
2671
2672@item
2673Preserve the section Entitled ``History'', Preserve its Title, and add
2674to it an item stating at least the title, year, new authors, and
2675publisher of the Modified Version as given on the Title Page.  If
2676there is no section Entitled ``History'' in the Document, create one
2677stating the title, year, authors, and publisher of the Document as
2678given on its Title Page, then add an item describing the Modified
2679Version as stated in the previous sentence.
2680
2681@item
2682Preserve the network location, if any, given in the Document for
2683public access to a Transparent copy of the Document, and likewise
2684the network locations given in the Document for previous versions
2685it was based on.  These may be placed in the ``History'' section.
2686You may omit a network location for a work that was published at
2687least four years before the Document itself, or if the original
2688publisher of the version it refers to gives permission.
2689
2690@item
2691For any section Entitled ``Acknowledgements'' or ``Dedications'', Preserve
2692the Title of the section, and preserve in the section all the
2693substance and tone of each of the contributor acknowledgements and/or
2694dedications given therein.
2695
2696@item
2697Preserve all the Invariant Sections of the Document,
2698unaltered in their text and in their titles.  Section numbers
2699or the equivalent are not considered part of the section titles.
2700
2701@item
2702Delete any section Entitled ``Endorsements''.  Such a section
2703may not be included in the Modified Version.
2704
2705@item
2706Do not retitle any existing section to be Entitled ``Endorsements'' or
2707to conflict in title with any Invariant Section.
2708
2709@item
2710Preserve any Warranty Disclaimers.
2711@end enumerate
2712
2713If the Modified Version includes new front-matter sections or
2714appendices that qualify as Secondary Sections and contain no material
2715copied from the Document, you may at your option designate some or all
2716of these sections as invariant.  To do this, add their titles to the
2717list of Invariant Sections in the Modified Version's license notice.
2718These titles must be distinct from any other section titles.
2719
2720You may add a section Entitled ``Endorsements'', provided it contains
2721nothing but endorsements of your Modified Version by various
2722parties---for example, statements of peer review or that the text has
2723been approved by an organization as the authoritative definition of a
2724standard.
2725
2726You may add a passage of up to five words as a Front-Cover Text, and a
2727passage of up to 25 words as a Back-Cover Text, to the end of the list
2728of Cover Texts in the Modified Version.  Only one passage of
2729Front-Cover Text and one of Back-Cover Text may be added by (or
2730through arrangements made by) any one entity.  If the Document already
2731includes a cover text for the same cover, previously added by you or
2732by arrangement made by the same entity you are acting on behalf of,
2733you may not add another; but you may replace the old one, on explicit
2734permission from the previous publisher that added the old one.
2735
2736The author(s) and publisher(s) of the Document do not by this License
2737give permission to use their names for publicity for or to assert or
2738imply endorsement of any Modified Version.
2739
2740@item
2741COMBINING DOCUMENTS
2742
2743You may combine the Document with other documents released under this
2744License, under the terms defined in section 4 above for modified
2745versions, provided that you include in the combination all of the
2746Invariant Sections of all of the original documents, unmodified, and
2747list them all as Invariant Sections of your combined work in its
2748license notice, and that you preserve all their Warranty Disclaimers.
2749
2750The combined work need only contain one copy of this License, and
2751multiple identical Invariant Sections may be replaced with a single
2752copy.  If there are multiple Invariant Sections with the same name but
2753different contents, make the title of each such section unique by
2754adding at the end of it, in parentheses, the name of the original
2755author or publisher of that section if known, or else a unique number.
2756Make the same adjustment to the section titles in the list of
2757Invariant Sections in the license notice of the combined work.
2758
2759In the combination, you must combine any sections Entitled ``History''
2760in the various original documents, forming one section Entitled
2761``History''; likewise combine any sections Entitled ``Acknowledgements'',
2762and any sections Entitled ``Dedications''.  You must delete all
2763sections Entitled ``Endorsements.''
2764
2765@item
2766COLLECTIONS OF DOCUMENTS
2767
2768You may make a collection consisting of the Document and other documents
2769released under this License, and replace the individual copies of this
2770License in the various documents with a single copy that is included in
2771the collection, provided that you follow the rules of this License for
2772verbatim copying of each of the documents in all other respects.
2773
2774You may extract a single document from such a collection, and distribute
2775it individually under this License, provided you insert a copy of this
2776License into the extracted document, and follow this License in all
2777other respects regarding verbatim copying of that document.
2778
2779@item
2780AGGREGATION WITH INDEPENDENT WORKS
2781
2782A compilation of the Document or its derivatives with other separate
2783and independent documents or works, in or on a volume of a storage or
2784distribution medium, is called an ``aggregate'' if the copyright
2785resulting from the compilation is not used to limit the legal rights
2786of the compilation's users beyond what the individual works permit.
2787When the Document is included in an aggregate, this License does not
2788apply to the other works in the aggregate which are not themselves
2789derivative works of the Document.
2790
2791If the Cover Text requirement of section 3 is applicable to these
2792copies of the Document, then if the Document is less than one half of
2793the entire aggregate, the Document's Cover Texts may be placed on
2794covers that bracket the Document within the aggregate, or the
2795electronic equivalent of covers if the Document is in electronic form.
2796Otherwise they must appear on printed covers that bracket the whole
2797aggregate.
2798
2799@item
TRANSLATION

Translation is considered a kind of modification, so you may
distribute translations of the Document under the terms of section 4.
Replacing Invariant Sections with translations requires special
permission from their copyright holders, but you may include
translations of some or all Invariant Sections in addition to the
original versions of these Invariant Sections.  You may include a
translation of this License, and all the license notices in the
Document, and any Warranty Disclaimers, provided that you also include
the original English version of this License and the original versions
of those notices and disclaimers.  In case of a disagreement between
the translation and the original version of this License or a notice
or disclaimer, the original version will prevail.

If a section in the Document is Entitled ``Acknowledgements'',
``Dedications'', or ``History'', the requirement (section 4) to Preserve
its Title (section 1) will typically require changing the actual
title.

@item
TERMINATION

You may not copy, modify, sublicense, or distribute the Document except
as expressly provided for under this License.  Any other attempt to
copy, modify, sublicense or distribute the Document is void, and will
automatically terminate your rights under this License.  However,
parties who have received copies, or rights, from you under this
License will not have their licenses terminated so long as such
parties remain in full compliance.

@item
FUTURE REVISIONS OF THIS LICENSE

The Free Software Foundation may publish new, revised versions
of the GNU Free Documentation License from time to time.  Such new
versions will be similar in spirit to the present version, but may
differ in detail to address new problems or concerns.  See
@uref{http://www.gnu.org/copyleft/}.

Each version of the License is given a distinguishing version number.
If the Document specifies that a particular numbered version of this
License ``or any later version'' applies to it, you have the option of
following the terms and conditions either of that specified version or
of any later version that has been published (not as a draft) by the
Free Software Foundation.  If the Document does not specify a version
number of this License, you may choose any version ever published (not
as a draft) by the Free Software Foundation.
@end enumerate

@page
@heading ADDENDUM: How to use this License for your documents

To use this License in a document you have written, include a copy of
the License in the document and put the following copyright and
license notices just after the title page:

@smallexample
@group
  Copyright (C)  @var{year}  @var{your name}.
  Permission is granted to copy, distribute and/or modify this document
  under the terms of the GNU Free Documentation License, Version 1.2
  or any later version published by the Free Software Foundation;
  with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
  Texts.  A copy of the license is included in the section entitled ``GNU
  Free Documentation License''.
@end group
@end smallexample

If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts,
replace the ``with@dots{}Texts.'' line with this:

@smallexample
@group
    with the Invariant Sections being @var{list their titles}, with
    the Front-Cover Texts being @var{list}, and with the Back-Cover Texts
    being @var{list}.
@end group
@end smallexample

If you have Invariant Sections without Cover Texts, or some other
combination of the three, merge those two alternatives to suit the
situation.

If your document contains nontrivial examples of program code, we
recommend releasing these examples in parallel under your choice of
free software license, such as the GNU General Public License,
to permit their use in free software.

@c Local Variables:
@c ispell-local-pdict: "ispell-dict"
@c End:


@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------

@node    Reporting bugs
@chapter Reporting bugs

Report bugs to <obrebski@@amu.edu.pl>.

@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------

@c @node    Copyright
@c @chapter Copyright
@c
@c Copyright 2004 by Tomasz Obrębski
@c This software is free for research and educational use.

@c ---------------------------------------------------------------------
@c ---------------------------------------------------------------------

@node    Author
@chapter Author


@bye