- Timestamp: 10/22/08 11:53:31 (16 years ago)
- Branches: master, help
- Children: e28a625
- Parents: 839a0d5
- git-author: obrebski <obrebski@…> (10/22/08 11:53:31)
- git-committer: obrebski <obrebski@…> (10/22/08 11:53:31)
- File: 1 edited
Legend:
- Unmodified lines are shown with a leading space
- Added lines are prefixed with +
- Removed lines are prefixed with -
app/doc/utt.texinfo
--- app/doc/utt.texinfo (r04ae414)
+++ app/doc/utt.texinfo (r261bf62)

@@ -9 +9 @@
 
 @copying
-This manual is for UAM Text Tools (version 0.90, November, 2007)
+This manual is for UAM Text Tools (version 0.90, October, 2008)
 
 Copyright @copyright{} 2005, 2007 Tomasz Obrębski, Michał Stolarski, Justyna Walkowska, Paweł Konieczka.
 
 Permission is granted to copy, distribute and/or modify this document
-under the terms of the GNU Free Documentation License, Version 1.2
-or any later version published by the Free Software Foundation;
-with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
-Texts. A copy of the license is included in the section entitled GNU Free Documentation License,,GNU Free Documentation License.
+under the terms of the GNU Free Documentation License, Version 1.2 or
+any later version published by the Free Software Foundation; with no
+Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A
+copy of the license is included in the section entitled GNU Free
+Documentation License,,GNU Free Documentation License.
 
 @c @quotation

@@ -358 +359 @@
 @end example
 
-because in the latter example the first segment (starting at position 0000, 2 characters long) ends at position @var{n}=0001 which is covered by the second segment and no segment starts at position @var{n+2}=0002.
+because in the latter example the first segment (starting at position
+0000, 2 characters long) ends at position @var{n}=0001 which is
+covered by the second segment and no segment starts at position
+@var{n+2}=0002.
+
+
+@section Flattened UTT file
+
+A UTT file format has two variants: regular and flattend. The regular
+format was described above. In the flattened format some of the
+end-of-line characters are replaced with line-feed characters.
+
+The flatten format is basically used to represent whole sentences as
+single lines of the input file (all intrasentential end-of-line
+characters are replaced with line-feed characters).
+
+This technical trick permits to perform certain text
+processing operations on entire sentences with the use of such tools as
+@command{grep} (see @command{grp} component) or @command{sed} (see @command{mar} component).
+
+The conversion between the two formats is performed by the tools:
+@command{fla} and @command{unfla}.
+
 
 @section Character encoding
 
 The UTT component programs accept only 1-byte character encoding, such
-as ISO, ANSI, DOS , UTF-8 (probably: not tested yet).
+as ISO, ANSI, DOS.
 
 

@@ -527 +549 @@
 
 @c ---------------------------------------------------------------------
-@c ---------------------------------------------------------------------
-
-@c @node Common command line options
-@c @chapter Common command line options
-
-@c @table @code
-
-@c @parhelp
-
-@c @item @b{@minus{}@minus{}help}, @b{@minus{}h}
-@c Print help.
-
-@c @item @b{@minus{}@minus{}version}, @b{@minus{}v}
-@c Print version information.
-
-@c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}}
-@c Input file name.
-@c If this option is absent or equal to '@minus{}', the program
-@c reads from the standard input.
-
-@c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}}
-@c Regular output file name. To regular output the program sends segments
-@c which it successfully processed and copies those which were not
-@c subject to processing. If this option is absent or equal to
-@c '@minus{}', standard output is used.
-
-@c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}}
-@c Fail output file name. To fail output the program copies the segments
-@c it failed to process. If this option is absent or equal to
-@c '@minus{}', standard output is used.
-
-@c @item @b{@minus{}@minus{}only-fail}
-@c Discard segments which would normally be sent to regular
-@c output. Print only segments the program failed to process.
-
-@c @item @b{@minus{}@minus{}no-fail}
-@c Discard segments the program failed to process.
-@c (This and the previous option are functionally equivalent to,
-@c respectively, @option{-o /dev/null} and @option{-e /dev/null}, but
-@c make the programs run faster.)
-
-@c @item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}}
-@c The field containing the input to the program. The default is usually
-@c the @var{form} field (unless otherwise stated in the program
-@c description). The fields @var{position}, @var{length}, @var{tag}, and
-@c @var{form} are referred to as @code{1}, @code{2}, @code{3}, @code{4},
-@c respectively.
-
-@c @item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}}
-@c The name of the field added by the program. The default is the name of
-@c the program.
-
-@c @c @item @b{@minus{}@minus{}copy, @minus{}c}
-@c @c Copy processed segments to regular output.
-
-@c @item @b{@minus{}@minus{}dictionary=@var{filename}, @minus{}d @var{filename}}
-@c Dictionary file name.
-@c (This option is used by programs which use dictionary data.)
-
-@c @item @b{@minus{}@minus{}process=@var{tag}, @minus{}p @var{tag}}
-@c Process segments with the specified value in the @var{tag} field.
-@c Multiple occurences of this option are allowed and are interpreted as
-@c disjunction. If this option is absent, all segments are processed.
-
-@c @item @b{@minus{}@minus{}select=@var{fieldname}, @minus{}s @var{fieldname}}
-@c Select for processing only segments in which the field named
-@c @var{fieldname} is present. Multiple occurences of this option are
-@c allowed and are interpreted as conjunction of conditions. If this
-@c option is absent, all segments are processed.
-
-@c @item @b{@minus{}@minus{}unselect=@var{fieldname}, @minus{}S @var{fieldname}}
-@c Select for processing only segments in which the field @var{fieldname}
-@c is absent. Multiple occurences of this option are allowed and are
-@c interpreted as conjunction of conditions. If this option is absent,
-@c all segments are processed.
-
-@c @item @b{@minus{}@minus{}interactive @minus{}i}
-@c This option toggles interactive mode, which is by default off. In the
-@c interactive mode the program does not buffer the output.
-
-@c @item @b{@minus{}@minus{}config=@var{filename}}
-@c Read configuration from file @file{@var{filename}}.
-
-@c @item @b{@minus{}@minus{}one @minus{}1}
-@c This option makes the program print ambiguous annotation in one output
-@c segment. By default when
-@c ambiguous new annotation is being produced for a segment, the segment
-@c is multiplicated and each of the annotations is added to separate copy
-@c of the segment.
-
-@c @end table
-
-@c ---------------------------------------------------------------------
 @c CONFIGURATION FILES
 @c ---------------------------------------------------------------------

@@ -695 +624 @@
 
 Filters: programs which read and produce UTT-formatted data
-@c * sen - the sentencizer::
 * lem:: a morphological analyzer
 * gue:: a morphological guesser
-* cor:: a spelling corrector
+* cor:: a simple spelling corrector
+* kor:: a more elaborated spelling corrector
 * sen:: a sentensizer
-@c * gph - the graphizer::
 * ser:: a pattern search tool (marks matches)
+* mar:: a pattern search tool (introduces arbitrary markers into the text)
 * grp:: a pattern search tool (selects sentences containing a match)
+@c * gph:: a word-graph annotation tool::
+@c * dgp:: a dependency parser
 
 Sinks: programs which read UTT data and produce output in another format

@@ -722 +653 @@
 @item @strong{Authors:} @tab Tomasz Obrębski
 @item @strong{Component category:} @tab source
+@item @strong{Input format:} @tab raw text file
+@item @strong{Output format:} @tab UTT regular
+@item @strong{Required annotation:} @tab -
 @end multitable
 

@@ -835 +769 @@
 @item @strong{Authors:} @tab Tomasz Obrębski, Michał Stolarski
 @item @strong{Component category:} @tab filter
+@item @strong{Input format:} @tab UTT regular
+@item @strong{Output format:} @tab UTT regular
+@item @strong{Required annotation:} @tab tok
 @end multitable
 

@@ -1032 +969 @@
 located by default in:
 
-@file{$HOME/.utt/pl/lem.bin}
+@file{$HOME/.local/share/utt/pl_PL.ISO-8859-2/lem.bin}
+
+in local installation or in
+
+@file{/usr/local/share/utt/pl_PL.ISO-8859-2/lem.bin}
+
+in system installation.
 
 @node lem hints
 @subsection Hints
 
-@c @subsubheading Combining data from multiple dictionaries
-
-@c @itemize
-
-@c @item Apply <dict1>, then apply <dict2> to words which were not annotatated.
-
-@c @example
-@c lem -d <dict1> | lem -S lem -d <dict2>
-@c @end example
-
-@c @item Add annotations from two dictionaries <dict1> and <dict2>.
-
-@c @example
-@c lem -c -d <dict1> | lem -S lem -d <dict2>
-@c @end example
-
-@c @end itemize
+@subsubheading Combining data from multiple dictionaries
+
+@itemize
+
+@item Apply <dict1>, then apply <dict2> to words which were not annotatated.
+
+@example
+lem -d <dict1> | lem -S lem -d <dict2>
+@end example
+
+@item Add annotations from two dictionaries <dict1> and <dict2>.
+
+@example
+lem -c -d <dict1> | lem -S lem -d <dict2>
+@end example
+
+@end itemize
 
 

@@ -1071 +1014 @@
 @end multitable
 
-@command{gue} guesess morphological descriptions of the form contained
-in the @var{form} field.
-
 @menu
+* gue description::
 * gue command line options::
 * gue example::
 * gue dictionaries::
 @end menu
+
+
+@node gue description
+@subsection Description
+
+@command{gue} guesess morphological descriptions of the form contained
+in the @var{form} field.
+
 
 @node gue command line options

@@ -1182 +1131 @@
 @item @strong{Authors:} @tab Tomasz Obrębski, Michał Stolarski
 @item @strong{Component category:} @tab filter
+@item @strong{Input format:} @tab UTT regular
+@item @strong{Output format:} @tab UTT regular
+@item @strong{Required annotation:} @tab tok
 @end multitable
+
+@menu
+* cor description::
+* cor command line options::
+* cor dictionaries::
+@end menu
+
+
+@node cor description
+@subsection Description
 
 The spelling corrector applies Kemal Oflazer's dynamic programming

@@ -1189 +1151 @@
 word form it returns all word forms present in the dictionary whose
 edit distance is smaller than the threshold given as the parameter.
-
-By default @code{cor} replaces the contents of the @var{form} field
-with new corrected value, placing the old contents in the @code{cor}
-field.
-
-
-@menu
-* cor command line options::
-* cor dictionaries::
-@end menu
 
 

@@ -1225 +1177 @@
 Maximum edit distance (default='1').
 
+@c @item @b{@minus{}@minus{}replace, @minus{}r}
+@c Replace original form with corrected form, place original form in the
+@c cor field. This option has no effect in @option{--one-*} modes (default=off)
+
 
 @end table

@@ -1243 +1199 @@
 @end example
 
+@subsubheading Binary format
+
+The mandatory file name extension for a binary dictionary is @code{bin}. To
+compile a text dictionary into binary format, write:
+
+@example
+compiledic <dictionaryname>.dic
+@end example
+
+@c ---------------------------------------------------------------------
+@c KOR
+@c ---------------------------------------------------------------------
+
+@page
+@node kor
+@section kor - configurable spelling corrector
+
+[TODO]
+
+@c ---------------------------------------------------------------------
+@c SEN
+@c ---------------------------------------------------------------------
+
 @page
 @node sen

@@ -1251 +1230 @@
 @item @strong{Authors:} @tab Tomasz Obrębski
 @item @strong{Component category:} @tab filter
+@item @strong{Input format:} @tab UTT regular
+@item @strong{Output format:} @tab UTT regular
+@item @strong{Required annotation:} @tab tok
 
 @end multitable
 
-@command{sen} detects sentence boundaries in UTT-formatted texts and marks them with special zero-length segments, in which the @var{type} field may contain the BOS (beginning of sentence) or EOS (end of sentence) annotation.
 
 @menu
+* sen description::
 @c * sen input::
 @c * sen output::
 * sen example::
 @end menu
+
+@node sen description
+@subsection Description
+
+@command{sen} detects sentence boundaries in UTT-formatted texts and marks them with special zero-length segments, in which the @var{type} field may contain the BOS (beginning of sentence) or EOS (end of sentence) annotation.
 
 @node sen example

@@ -1305 +1292 @@
 
 
+@c ---------------------------------------------------------------------
 @c SER
-@c ---------------------------------------------------------------------
 @c ---------------------------------------------------------------------
 

@@ -1316 +1303 @@
 @item @strong{Authors:} @tab Tomasz Obrębski
 @item @strong{Component category:} @tab filter
+@item @strong{Input format:} @tab UTT regular
+@item @strong{Output format:} @tab UTT regular
+@item @strong{Required annotation:} @tab tok, lem --one-field
 @end multitable
 
-@command{ser} looks for patterns in UTT-formatted texts.
-
 @menu
+* ser description::
 * ser command line options::
 * ser pattern::

@@ -1328 +1317 @@
 * ser requirements::
 @end menu
+
+
+@node ser description
+@subsection Description
+
+@command{ser} looks for patterns in UTT-formatted texts.
 
 

@@ -1504 +1499 @@
 
 @example
-define(`verbseq', `(cat(V) (space cat(V)))')
+define(`verbseq', `(cat(<V>) (space cat(<V>)))')
 @end example
 

@@ -1515 +1510 @@
 @subsection Limitations
 
-more than 3 attributes in <>.
+Do not use more than 3 attributes in <>.
 
 @node ser requirements

@@ -1533 +1528 @@
 
 
+@c ---------------------------------------------------------------------
 @c GRP
-@c ---------------------------------------------------------------------
 @c ---------------------------------------------------------------------
 

@@ -1544 +1539 @@
 @item @strong{Authors:} @tab Tomasz Obrębski
 @item @strong{Component category:} @tab filter
+@item @strong{Input format:} @tab UTT flattened
+@item @strong{Output format:} @tab UTT flattened
+@item @strong{Required annotation:} @tab tok, sen, lem --one-field
 @end multitable
 
 
-@code{gre} selects sentences containing an expression matching a
-pattern. The pattern format is exactly the same as that accepted by
-@code{ser}.
-
-@code{gre} is intended mainly for speeding up corpus search process.
-It is extremely fast (processing speed is usually higher then the speed
-of reading the corpus file from disk).
-
-
-
-@c @menu
-@c * ser command line options::
-@c * ser pattern::
-@c * ser how ser works::
-@c * ser customization::
-@c * ser limitations::
-@c * ser requirements::
-@c @end menu
 @menu
+* grp description::
 * grp command line options::
 * grp pattern::

@@ -1571 +1552 @@
 @end menu
+
+
+@node grp description
+@subsection Description
+
+@code{gre} selects sentences containing an expression matching a
+pattern. The pattern format is exactly the same as that accepted by
+@code{ser}.
+
+@code{gre} is intended mainly for speeding up corpus search process.
+It is extremely fast (processing speed is usually higher then the speed
+of reading the corpus file from disk).
+
 
 @node grp command line options
 @subsection Command line options

@@ -1578 +1571 @@
 @parhelp
 @parversion
-@c @parfile
-@c @paroutput
-@c @parinputfield
-@c @paroutputfield
 @parprocess
 @parinteractive

@@ -1627 +1616 @@
 
 
-@c ---------------------------------------------------------------------
-@c kot
-@c ---------------------------------------------------------------------
-@c ---------------------------------------------------------------------
+
+@c ---------------------------------------------------------------------
+@c MAR
+@c ---------------------------------------------------------------------
+
+@page
+@node mar
+@section mar
+
+@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
+@item @strong{Authors:} @tab Marcin Walas, Tomasz Obrębski
+@item @strong{Component category:} @tab filter
+@end multitable
+
+[TODO]
+
+@c ---------------------------------------------------------------------
+@c KOT
+@c ---------------------------------------------------------------------
+
+
 @page

@@ -1636 +1641 @@
 @section kot - untokenizer
 
-Authors: Tomasz Obrębski
-
-@command{kot} is the opposite of @command{tok}. It changes UTT-formatted text into plain text.
+@multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
+@item @strong{Authors:} @tab Tomasz Obrębski
+@item @strong{Component category:} @tab filter
+@item @strong{Input format:} @tab UTT regular
+@item @strong{Output format:} @tab text
+@item @strong{Required annotation:} @tab tok
+@end multitable
+
+
 @menu
+* kot description::
 * kot command line options::
 * kot usage examples::
 @end menu
 
+@node kot description
+@subsection Description
+
+@command{kot} transforms a UTT formatted file back into raw text format.
+
 @node kot command line options
 @subsection Command line options

@@ -1684 +1700 @@
 @end example
 
-@c CON............................................................
-@c ...............................................................
-@c ...............................................................
+@c ---------------------------------------------------------------
+@c CON
+@c ---------------------------------------------------------------
+
 
 @page

@@ -1692 +1709 @@
 @section con - concordance table generator
 
-@command{con} generates a concordance table based on a pattern given to @command{ser}.
-
 @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
 @item @strong{Authors:} @tab Justyna Walkowska
 @item @strong{Component category:} @tab sink
+@item @strong{Input format:} @tab UTT regular
+@item @strong{Output format:} @tab text
+@item @strong{Required annotation:} @tab ser or mar
 @end multitable
 @c
 
 @menu
+* con description::
 * con command line options::
 * con usage example::
 * con hints::
 @end menu
+
+
+@node con description
+@subsection Description
+
+@command{con} generates a concordance table based on a pattern given to @command{ser}.
+
 
 @node con command line options

@@ -1758 +1784 @@
 @item @b{@minus{}@minus{}ignore @minus{}i}
 Ignore segment inconsistency in the input.
-@item @b{@minus{}@minus{}bon}
+@item @b{@minus{}@minus{}bom}
 Beginning of selected segment (regex, default='[0-9]+ [0-9]+ BOM .*').
-@item @b{@minus{}@minus{}eob}
+@item @b{@minus{}@minus{}eom}
 End of selected segment (regex, default='[0-9]+ [0-9]+ EOM .*').
 @item @b{@minus{}@minus{}bod}

@@ -1774 +1800 @@
 @subsection Usage example
 @example
-cat file.txt | tok | lem -1 | ser -e 'lexeme(dom) | con'
+cat file.txt | tok | lem -1 | ser -e 'lexeme(dom)' | con
 @end example
 

@@ -1788 +1814 @@
 ... | grp -e EXPR | ser -e EXPR | con
 @end example
-
 
 