Timestamp:
10/22/08 11:53:31 (16 years ago)
Author:
obrebski <obrebski@…>
Branches:
master, help
Children:
e28a625
Parents:
839a0d5
git-author:
obrebski <obrebski@…> (10/22/08 11:53:31)
git-committer:
obrebski <obrebski@…> (10/22/08 11:53:31)
Message:

in utt.texinfo

git-svn-id: svn://atos.wmid.amu.edu.pl/utt@60 e293616e-ec6a-49c2-aa92-f4a8b91c5d16

File:
1 edited

  • app/doc/utt.texinfo

    r04ae414 r261bf62  
    99 
    1010@copying 
    11 This manual is for UAM Text Tools (version 0.90, November, 2007) 
     11This manual is for UAM Text Tools (version 0.90, October, 2008) 
    1212 
    1313Copyright @copyright{}  2005, 2007  Tomasz Obrębski, Michał Stolarski, Justyna Walkowska, Paweł Konieczka. 
    1414 
    1515Permission is granted to copy, distribute and/or modify this document 
    16 under the terms of the GNU Free Documentation License, Version 1.2 
    17 or any later version published by the Free Software Foundation; 
    18 with no Invariant Sections, no Front-Cover Texts, and no Back-Cover 
    19 Texts.  A copy of the license is included in the section entitled GNU Free Documentation License,,GNU Free Documentation License. 
     16under the terms of the GNU Free Documentation License, Version 1.2 or 
     17any later version published by the Free Software Foundation; with no 
     18Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.  A 
     19copy of the license is included in the section entitled GNU Free 
     20Documentation License,,GNU Free Documentation License. 
    2021 
    2122@c @quotation 
     
    358359@end example 
    359360 
    360 because in the latter example the first segment (starting at position 0000, 2 characters long) ends at position @var{n}=0001 which is covered by the second segment and no segment starts at position @var{n+2}=0002. 
     361because in the latter example the first segment (starting at position 
     3620000, 2 characters long) ends at position @var{n}=0001 which is 
     363covered by the second segment and no segment starts at position 
     364@var{n+2}=0002. 
     365 
     366 
     367@section Flattened UTT file 
     368 
     369The UTT file format has two variants: regular and flattened. The regular 
     370format was described above.  In the flattened format some of the 
     371end-of-line characters are replaced with line-feed characters. 
     372 
     373The flattened format is mainly used to represent whole sentences as 
     374single lines of the input file (all intrasentential end-of-line 
     375characters are replaced with line-feed characters). 
     376 
     377This technical trick makes it possible to perform certain text 
     378processing operations on entire sentences using such tools as 
     379@command{grep} (see the @command{grp} component) or @command{sed} (see the @command{mar} component). 
     380 
     381The conversion between the two formats is performed by the tools: 
     382@command{fla} and @command{unfla}. 
    361383 
    362384@section Character encoding 
    363385 
    364386The UTT component programs accept only 1-byte character encodings, such 
    365 as ISO, ANSI, DOS, UTF-8 (probably: not tested yet). 
     387as ISO, ANSI, DOS. 
    366388 
    367389 
     
    527549 
    528550@c --------------------------------------------------------------------- 
    529 @c --------------------------------------------------------------------- 
    530  
    531 @c @node Common command line options 
    532 @c @chapter Common command line options 
    533  
    534 @c @table @code 
    535  
    536 @c @parhelp 
    537  
    538 @c @item @b{@minus{}@minus{}help}, @b{@minus{}h} 
    539 @c Print help. 
    540  
    541 @c @item @b{@minus{}@minus{}version}, @b{@minus{}v} 
    542 @c Print version information. 
    543  
    544 @c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}} 
    545 @c Input file name. 
    546 @c If this option is absent or equal to '@minus{}', the program 
    547 @c reads from the standard input. 
    548  
    549 @c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}} 
    550 @c Regular output file name. To regular output the program sends segments 
    551 @c which it successfully processed and copies those which were not 
    552 @c subject to processing. If this option is absent or equal to 
    553 @c '@minus{}', standard output is used. 
    554  
    555 @c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}} 
    556 @c Fail output file name. To fail output the program copies the segments 
    557 @c it failed to process.  If this option is absent or equal to 
    558 @c '@minus{}', standard output is used. 
    559  
    560 @c @item @b{@minus{}@minus{}only-fail} 
    561 @c Discard segments which would normally be sent to regular 
    562 @c output. Print only segments the program failed to process. 
    563  
    564 @c @item @b{@minus{}@minus{}no-fail} 
    565 @c Discard segments the program failed to process. 
    566 @c (This and the previous option are functionally equivalent to, 
    567 @c respectively, @option{-o /dev/null} and @option{-e /dev/null}, but 
    568 @c make the programs run faster.) 
    569  
    570 @c @item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}} 
    571 @c The field containing the input to the program. The default is usually 
    572 @c the @var{form} field (unless otherwise stated in the program 
    573 @c description). The fields @var{position}, @var{length}, @var{tag}, and 
    574 @c @var{form} are referred to as @code{1}, @code{2}, @code{3}, @code{4}, 
    575 @c respectively. 
    576  
    577 @c @item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}} 
    578 @c The name of the field added by the program. The default is the name of 
    579 @c the program. 
    580  
    581 @c @c @item @b{@minus{}@minus{}copy, @minus{}c} 
    582 @c @c Copy processed segments to regular output. 
    583  
    584 @c @item @b{@minus{}@minus{}dictionary=@var{filename}, @minus{}d @var{filename}} 
    585 @c Dictionary file name. 
    586 @c (This option is used by programs which use dictionary data.) 
    587  
    588 @c @item @b{@minus{}@minus{}process=@var{tag}, @minus{}p @var{tag}} 
    589 @c Process segments with the specified value in the @var{tag} field. 
    590 @c Multiple occurrences of this option are allowed and are interpreted as 
    591 @c disjunction. If this option is absent, all segments are processed. 
    592  
    593 @c @item @b{@minus{}@minus{}select=@var{fieldname}, @minus{}s @var{fieldname}} 
    594 @c Select for processing only segments in which the field named 
    595 @c @var{fieldname} is present. Multiple occurrences of this option are 
    596 @c allowed and are interpreted as conjunction of conditions. If this 
    597 @c option is absent, all segments are processed. 
    598  
    599 @c @item @b{@minus{}@minus{}unselect=@var{fieldname}, @minus{}S @var{fieldname}} 
    600 @c Select for processing only segments in which the field @var{fieldname} 
    601 @c is absent.  Multiple occurrences of this option are allowed and are 
    602 @c interpreted as conjunction of conditions. If this option is absent, 
    603 @c all segments are processed. 
    604  
    605 @c @item @b{@minus{}@minus{}interactive @minus{}i} 
    606 @c This option toggles interactive mode, which is by default off. In the 
    607 @c interactive mode the program does not buffer the output. 
    608  
    609 @c @item @b{@minus{}@minus{}config=@var{filename}} 
    610 @c Read configuration from file @file{@var{filename}}. 
    611  
    612 @c @item @b{@minus{}@minus{}one @minus{}1} 
    613 @c This option makes the program print ambiguous annotation in one output 
    614 @c segment. By default when 
    615 @c ambiguous new annotation is being produced for a segment, the segment 
    616 @c is duplicated and each of the annotations is added to a separate copy 
    617 @c of the segment. 
    618  
    619 @c @end table 
    620  
    621 @c --------------------------------------------------------------------- 
    622551@c CONFIGURATION FILES 
    623552@c --------------------------------------------------------------------- 
     
    695624 
    696625Filters: programs which read and produce UTT-formatted data 
    697 @c * sen - the sentencizer:: 
    698626* lem::         a morphological analyzer 
    699627* gue::         a morphological guesser 
    700 * cor::         a spelling corrector 
     628* cor::         a simple spelling corrector 
     629* kor::         a more elaborate spelling corrector 
    701630* sen::         a sentencizer 
    702 @c * gph - the graphizer:: 
    703631* ser::         a pattern search tool (marks matches) 
     632* mar::         a pattern search tool (introduces arbitrary markers into the text) 
    704633* grp::         a pattern search tool (selects sentences containing a match) 
     634@c * gph::         a word-graph annotation tool:: 
     635@c * dgp::         a dependency parser 
    705636 
    706637Sinks: programs which read UTT data and produce output in another format 
     
    722653@item @strong{Authors:}                 @tab Tomasz Obrębski 
    723654@item @strong{Component category:}      @tab source 
     655@item @strong{Input format:}            @tab raw text file 
     656@item @strong{Output format:}           @tab UTT regular 
     657@item @strong{Required annotation:}     @tab - 
    724658@end multitable 
    725659 
     
    835769@item @strong{Authors:}                 @tab Tomasz Obrębski, Michał Stolarski 
    836770@item @strong{Component category:}      @tab filter 
     771@item @strong{Input format:}            @tab UTT regular 
     772@item @strong{Output format:}           @tab UTT regular 
     773@item @strong{Required annotation:}     @tab tok 
    837774@end multitable 
    838775 
     
    1032969located by default in: 
    1033970 
    1034 @file{$HOME/.utt/pl/lem.bin} 
     971@file{$HOME/.local/share/utt/pl_PL.ISO-8859-2/lem.bin} 
     972 
     973in local installation or in 
     974 
     975@file{/usr/local/share/utt/pl_PL.ISO-8859-2/lem.bin} 
     976 
     977in system installation. 
    1035978 
    1036979@node lem hints 
    1037980@subsection Hints 
    1038981 
    1039 @c @subsubheading Combining data from multiple dictionaries 
    1040  
    1041 @c @itemize 
    1042  
    1043 @c @item Apply <dict1>, then apply <dict2> to words which were not annotated. 
    1044  
    1045 @c @example 
    1046 @c lem -d <dict1> | lem -S lem -d <dict2> 
    1047 @c @end example 
    1048  
    1049 @c @item Add annotations from two dictionaries <dict1> and <dict2>. 
    1050  
    1051 @c @example 
    1052 @c lem -c -d <dict1> | lem -S lem -d <dict2> 
    1053 @c @end example 
    1054  
    1055 @c @end itemize 
     982@subsubheading Combining data from multiple dictionaries 
     983 
     984@itemize 
     985 
     986@item Apply <dict1>, then apply <dict2> to words which were not annotated. 
     987 
     988@example 
     989lem -d <dict1> | lem -S lem -d <dict2> 
     990@end example 
     991 
     992@item Add annotations from two dictionaries <dict1> and <dict2>. 
     993 
     994@example 
     995lem -c -d <dict1> | lem -S lem -d <dict2> 
     996@end example 
     997 
     998@end itemize 
    1056999 
    10571000 
     
    10711014@end multitable 
    10721015 
    1073 @command{gue} guesses morphological descriptions of the form contained 
    1074 in the @var{form} field. 
    1075  
    10761016@menu 
     1017* gue description::     
    10771018* gue command line options::     
    10781019* gue example::                  
    10791020* gue dictionaries::             
    10801021@end menu 
     1022 
     1023 
     1024@node gue description 
     1025@subsection Description 
     1026 
     1027@command{gue} guesses morphological descriptions of the form contained 
     1028in the @var{form} field. 
     1029 
    10811030 
    10821031@node gue command line options 
     
    11821131@item @strong{Authors:}                 @tab Tomasz Obrębski, Michał Stolarski 
    11831132@item @strong{Component category:}      @tab filter 
     1133@item @strong{Input format:}            @tab UTT regular 
     1134@item @strong{Output format:}           @tab UTT regular 
     1135@item @strong{Required annotation:}     @tab tok 
    11841136@end multitable 
     1137 
     1138@menu 
     1139* cor description:: 
     1140* cor command line options::     
     1141* cor dictionaries::             
     1142@end menu 
     1143 
     1144 
     1145@node cor description 
     1146@subsection Description 
    11851147 
    11861148The spelling corrector applies Kemal Oflazer's dynamic programming 
     
    11891151word form it returns all word forms present in the dictionary whose 
    11901152edit distance is smaller than the threshold given as the parameter. 
    1191  
    1192 By default @code{cor} replaces the contents of the @var{form} field 
    1193 with new corrected value, placing the old contents in the @code{cor} 
    1194 field. 
    1195  
    1196  
    1197 @menu 
    1198 * cor command line options::     
    1199 * cor dictionaries::             
    1200 @end menu 
    12011153 
    12021154 
     
    12251177Maximum edit distance (default='1'). 
    12261178 
     1179@c @item @b{@minus{}@minus{}replace, @minus{}r} 
     1180@c Replace original form with corrected form, place original form in the 
     1181@c cor field. This option has no effect in @option{--one-*} modes (default=off) 
     1182 
    12271183 
    12281184@end table 
     
    12431199@end example 
    12441200 
     1201@subsubheading Binary format 
     1202 
     1203The mandatory file name extension for a binary dictionary is @code{bin}. To 
     1204compile a text dictionary into binary format, write: 
     1205 
     1206@example 
     1207compiledic <dictionaryname>.dic 
     1208@end example 
     1209 
     1210@c --------------------------------------------------------------------- 
     1211@c KOR 
     1212@c --------------------------------------------------------------------- 
     1213 
     1214@page 
     1215@node kor 
     1216@section kor - configurable spelling corrector 
     1217 
     1218[TODO] 
     1219 
     1220@c --------------------------------------------------------------------- 
     1221@c SEN 
     1222@c --------------------------------------------------------------------- 
     1223 
    12451224@page 
    12461225@node sen 
     
    12511230@item @strong{Authors:}                 @tab Tomasz Obrębski 
    12521231@item @strong{Component category:}      @tab filter 
     1232@item @strong{Input format:}            @tab UTT regular 
     1233@item @strong{Output format:}           @tab UTT regular 
     1234@item @strong{Required annotation:}     @tab tok 
    12531235 
    12541236@end multitable 
    12551237 
    1256 @command{sen} detects sentence boundaries in UTT-formatted texts and marks them with special zero-length segments, in which the @var{type} field may contain the BOS (beginning of sentence) or EOS (end of sentence) annotation.  
    12571238 
    12581239@menu 
     1240* sen description:: 
    12591241@c * sen input:: 
    12601242@c * sen output:: 
    12611243* sen example::                  
    12621244@end menu 
     1245 
     1246@node sen description 
     1247@subsection Description 
     1248 
     1249@command{sen} detects sentence boundaries in UTT-formatted texts and marks them with special zero-length segments, in which the @var{type} field may contain the BOS (beginning of sentence) or EOS (end of sentence) annotation.  
    12631250 
    12641251@node sen example 
     
    13051292 
    13061293 
     1294@c --------------------------------------------------------------------- 
    13071295@c SER 
    1308 @c --------------------------------------------------------------------- 
    13091296@c --------------------------------------------------------------------- 
    13101297 
     
    13161303@item @strong{Authors:}                 @tab Tomasz Obrębski 
    13171304@item @strong{Component category:}      @tab filter 
     1305@item @strong{Input format:}            @tab UTT regular 
     1306@item @strong{Output format:}           @tab UTT regular 
     1307@item @strong{Required annotation:}     @tab tok,  lem --one-field 
    13181308@end multitable 
    13191309 
    1320 @command{ser} looks for patterns in UTT-formatted texts. 
    1321  
    13221310@menu 
     1311* ser description:: 
    13231312* ser command line options::     
    13241313* ser pattern::                  
     
    13281317* ser requirements::             
    13291318@end menu 
     1319 
     1320 
     1321@node ser description 
     1322@subsection Description 
     1323 
     1324@command{ser} looks for patterns in UTT-formatted texts. 
    13301325 
    13311326 
     
    15041499 
    15051500@example 
    1506 define(`verbseq', `(cat(V) (space cat(V)))') 
     1501define(`verbseq', `(cat(<V>) (space cat(<V>)))') 
    15071502@end example 
    15081503 
     
    15151510@subsection Limitations 
    15161511 
    1517 more than 3 attributes in <>. 
     1512Do not use more than 3 attributes in <>. 
    15181513 
    15191514@node ser requirements 
     
    15331528 
    15341529 
     1530@c --------------------------------------------------------------------- 
    15351531@c GRP 
    1536 @c --------------------------------------------------------------------- 
    15371532@c --------------------------------------------------------------------- 
    15381533 
     
    15441539@item @strong{Authors:}                 @tab Tomasz Obrębski 
    15451540@item @strong{Component category:}      @tab filter 
     1541@item @strong{Input format:}            @tab UTT flattened 
     1542@item @strong{Output format:}           @tab UTT flattened 
     1543@item @strong{Required annotation:}     @tab tok, sen, lem --one-field 
    15461544@end multitable 
    15471545 
    15481546 
    1549 @code{grp} selects sentences containing an expression matching a 
    1550 pattern. The pattern format is exactly the same as that accepted by 
    1551 @code{ser}. 
    1552  
    1553 @code{grp} is intended mainly for speeding up the corpus search process. 
    1554 It is extremely fast (processing speed is usually higher than the speed 
    1555 of reading the corpus file from disk).  
    1556  
    1557  
    1558  
    1559 @c @menu 
    1560 @c * ser command line options::     
    1561 @c * ser pattern::                  
    1562 @c * ser how ser works::            
    1563 @c * ser customization::            
    1564 @c * ser limitations::              
    1565 @c * ser requirements::             
    1566 @c @end menu 
    15671547@menu 
     1548* grp description:: 
    15681549* grp command line options::     
    15691550* grp pattern::                  
     
    15711552@end menu 
    15721553 
     1554 
     1555@node grp description 
     1556@subsection Description 
     1557 
     1558@code{grp} selects sentences containing an expression matching a 
     1559pattern. The pattern format is exactly the same as that accepted by 
     1560@code{ser}. 
     1561 
     1562@code{grp} is intended mainly for speeding up the corpus search process. 
     1563It is extremely fast (processing speed is usually higher than the speed 
     1564of reading the corpus file from disk).  
     1565 
    15731566@node grp command line options 
    15741567@subsection Command line options 
     
    15781571@parhelp 
    15791572@parversion 
    1580 @c @parfile 
    1581 @c @paroutput 
    1582 @c @parinputfield 
    1583 @c @paroutputfield 
    15841573@parprocess 
    15851574@parinteractive 
     
    16271616 
    16281617 
    1629 @c --------------------------------------------------------------------- 
    1630 @c kot 
    1631 @c --------------------------------------------------------------------- 
    1632 @c --------------------------------------------------------------------- 
     1618 
     1619@c --------------------------------------------------------------------- 
     1620@c MAR 
     1621@c --------------------------------------------------------------------- 
     1622 
     1623@page 
     1624@node mar 
     1625@section mar 
     1626 
     1627@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} 
     1628@item @strong{Authors:}                 @tab Marcin Walas, Tomasz Obrębski 
     1629@item @strong{Component category:}      @tab filter 
     1630@end multitable 
     1631 
     1632[TODO] 
     1633 
     1634@c --------------------------------------------------------------------- 
     1635@c KOT 
     1636@c --------------------------------------------------------------------- 
     1637 
    16331638 
    16341639@page 
     
    16361641@section kot - untokenizer 
    16371642 
    1638 Authors: Tomasz Obrębski 
    1639  
    1640 @command{kot} is the opposite of @command{tok}. It changes UTT-formatted text into plain text. 
     1643@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} 
     1644@item @strong{Authors:}                 @tab Tomasz Obrębski 
     1645@item @strong{Component category:}      @tab filter 
     1646@item @strong{Input format:}            @tab UTT regular 
     1647@item @strong{Output format:}           @tab text 
     1648@item @strong{Required annotation:}     @tab tok 
     1649@end multitable 
     1650 
    16411651 
    16421652@menu 
     1653* kot description:: 
    16431654* kot command line options::     
    16441655* kot usage examples::     
    16451656@end menu 
    16461657 
     1658@node kot description 
     1659@subsection Description 
     1660 
     1661@command{kot} transforms a UTT-formatted file back into raw text format. 
     1662 
    16471663@node kot command line options 
    16481664@subsection Command line options 
     
    16841700@end example 
    16851701 
    1686 @c CON............................................................ 
    1687 @c ............................................................... 
    1688 @c ............................................................... 
     1702@c --------------------------------------------------------------- 
     1703@c CON 
     1704@c --------------------------------------------------------------- 
     1705 
    16891706 
    16901707@page 
     
    16921709@section con - concordance table generator 
    16931710 
    1694 @command{con} generates a concordance table based on a pattern given to @command{ser}. 
    1695  
    16961711@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} 
    16971712@item @strong{Authors:}                 @tab Justyna Walkowska 
    16981713@item @strong{Component category:}      @tab sink 
     1714@item @strong{Input format:}            @tab UTT regular 
     1715@item @strong{Output format:}           @tab text 
     1716@item @strong{Required annotation:}     @tab ser or mar 
    16991717@end multitable 
    17001718@c 
    17011719 
    17021720@menu 
     1721* con description:: 
    17031722* con command line options:: 
    17041723* con usage example:: 
    17051724* con hints::     
    17061725@end menu 
     1726 
     1727 
     1728@node con description 
     1729@subsection Description 
     1730 
     1731@command{con} generates a concordance table based on a pattern given to @command{ser}. 
     1732 
    17071733 
    17081734@node con command line options 
     
    17581784@item @b{@minus{}@minus{}ignore @minus{}i}             
    17591785        Ignore segment inconsistency in the input. 
    1760 @item @b{@minus{}@minus{}bon}             
     1786@item @b{@minus{}@minus{}bom}             
    17611787        Beginning of selected segment (regex, default='[0-9]+ [0-9]+ BOM .*'). 
    1762 @item @b{@minus{}@minus{}eob}             
     1788@item @b{@minus{}@minus{}eom}             
    17631789        End of selected segment (regex, default='[0-9]+ [0-9]+ EOM .*'). 
    17641790@item @b{@minus{}@minus{}bod}             
     
    17741800@subsection Usage example 
    17751801@example 
    1776 cat file.txt | tok | lem -1 | ser -e 'lexeme(dom) | con'   
     1802cat file.txt | tok | lem -1 | ser -e 'lexeme(dom)' | con   
    17771803@end example 
    17781804 
     
    17881814... | grp -e EXPR | ser -e EXPR | con 
    17891815@end example 
    1790  
    17911816 
    17921817 