[25ae32e] | 1 | \input texinfo @c -*-texinfo-*- |
---|
| 2 | @documentencoding ISO-8859-2 |
---|
| 3 | @c @documentlanguage pl |
---|
| 4 | |
---|
| 5 | @c %**start of header |
---|
| 6 | @setfilename utt.info |
---|
| 7 | @settitle UAM Text Tools v0.90 |
---|
| 8 | @c %**end of header |
---|
| 9 | |
---|
| 10 | @copying |
---|
[261bf62] | 11 | This manual is for UAM Text Tools (version 0.90, October, 2008) |
---|
[25ae32e] | 12 | |
---|
[19760ef] | 13 | Copyright @copyright{} 2005, 2007 Tomasz Obrębski, Michał Stolarski, Justyna Walkowska, Paweł Konieczka. |
---|
[25ae32e] | 14 | |
---|
| 15 | Permission is granted to copy, distribute and/or modify this document |
---|
[261bf62] | 16 | under the terms of the GNU Free Documentation License, Version 1.2 or |
---|
| 17 | any later version published by the Free Software Foundation; with no |
---|
| 18 | Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A |
---|
| 19 | copy of the license is included in the section entitled ``GNU Free |
---|
| 20 | Documentation License''. |
---|
[25ae32e] | 21 | |
---|
| 22 | @c @quotation |
---|
| 23 | @c Permission is granted to ... |
---|
| 24 | @c No permission is granted until the document is completed. |
---|
| 25 | @c @end quotation |
---|
| 26 | @end copying |
---|
| 27 | |
---|
| 28 | |
---|
| 29 | @titlepage |
---|
| 30 | @title UAM Text Tools 0.90 - User Manual |
---|
| 31 | @subtitle edition 0.01, @today |
---|
| 32 | @subtitle status: prescript |
---|
| 33 | @author by Justyna Walkowska, Tomasz Obr@,{}ebski and Micha@l{} Stolarski |
---|
| 34 | @page |
---|
| 35 | @vskip 0pt plus 1filll |
---|
| 36 | @insertcopying |
---|
| 37 | @end titlepage |
---|
| 38 | |
---|
| 39 | @contents |
---|
| 40 | |
---|
| 41 | @c @paragraphindent none |
---|
| 42 | |
---|
| 43 | @iftex |
---|
| 44 | @parskip = 0.5@normalbaselineskip plus 3pt minus 1pt |
---|
| 45 | @end iftex |
---|
| 46 | |
---|
| 47 | @c @headings off |
---|
| 48 | @c @everyheading LEM(1) @| @| LEM(1) |
---|
| 49 | @everyfooting @today @c @| @thispage @| |
---|
| 50 | |
---|
| 51 | @ifnottex |
---|
| 52 | |
---|
| 53 | @node Top |
---|
| 54 | @top UTT - UAM Text Tools |
---|
| 55 | |
---|
| 56 | @insertcopying |
---|
| 57 | |
---|
| 58 | @menu |
---|
| 59 | * General information:: |
---|
| 60 | * UTT file format:: |
---|
| 61 | * Configuration files:: |
---|
| 62 | * UTT components:: |
---|
| 63 | * Auxiliary tools:: |
---|
| 64 | * Usage examples:: |
---|
| 65 | * PMDBF dictionary:: |
---|
| 66 | @c * Examples:: |
---|
| 67 | @c * Copyright:: |
---|
| 68 | * GNU Free Documentation License:: |
---|
| 69 | * Reporting bugs:: |
---|
| 70 | * Author:: |
---|
| 71 | @end menu |
---|
| 72 | @end ifnottex |
---|
| 73 | |
---|
| 74 | |
---|
| 75 | @c ---------------------------------------------------------------------- |
---|
| 76 | |
---|
| 77 | @node General information |
---|
| 78 | @chapter General information |
---|
| 79 | |
---|
| 80 | UAM Text Tools (UTT) is a package of language processing tools |
---|
| 81 | developed at Adam Mickiewicz University. Its functionality includes: |
---|
| 82 | |
---|
| 83 | @itemize @bullet |
---|
| 84 | |
---|
| 85 | @item |
---|
| 86 | tokenization |
---|
| 87 | @item |
---|
| 88 | dictionary-based morphological analysis |
---|
| 89 | @item |
---|
| 90 | heuristic morphological analysis of unknown words |
---|
| 91 | @item |
---|
| 92 | spelling correction |
---|
| 93 | @item |
---|
| 94 | pattern search |
---|
| 95 | @item |
---|
| 96 | sentence splitting |
---|
| 97 | @item |
---|
| 98 | generation of concordance tables |
---|
| 99 | @end itemize |
---|
| 100 | |
---|
| 101 | The toolkit is intended for processing raw (unannotated) |
---|
| 102 | unrestricted text for any conceivable purpose. |
---|
| 103 | |
---|
| 104 | The system is organized as a collection of command-line programs, each |
---|
| 105 | performing one operation, e.g. tokenization, lemmatization, spelling |
---|
| 106 | correction. The components are independent of one another, the |
---|
| 107 | unifying element being the uniform i/o file format. |
---|
| 108 | |
---|
| 109 | The components may be combined in various ways to provide various text |
---|
| 110 | processing services. New components supplied by the user may also be |
---|
| 111 | easily incorporated into the system provided that they respect the i/o |
---|
| 112 | file format conventions. |
---|
| 113 | |
---|
| 114 | UTT component programs do not depend on any specific tagset or |
---|
| 115 | morphological description format. |
---|
| 116 | |
---|
| 117 | UTT is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by |
---|
| 118 | the Free Software Foundation, either version 3 of the License, or (at your option) any later version. |
---|
| 119 | |
---|
| 120 | The Polex/PMDBF dictionary is licensed under the Creative Commons by-nc-sa License, which prohibits commercial use. |
---|
| 121 | |
---|
| 122 | |
---|
| 123 | List of contributors: |
---|
| 124 | |
---|
| 125 | @itemize |
---|
| 126 | @item Pawel Konieczka |
---|
| 127 | @item Tomasz Obrebski |
---|
| 128 | @item Michal Stolarski |
---|
| 129 | @item Marcin Walas |
---|
| 130 | @item Justyna Walkowska |
---|
[04ae414] | 131 | @item Pawel Werenski |
---|
[25ae32e] | 132 | @end itemize |
---|
| 133 | |
---|
| 134 | @c ---------------------------------------------------------------------- |
---|
| 135 | @c --------------------------------------------------------------------- |
---|
| 136 | |
---|
| 137 | @node UTT file format |
---|
| 138 | @chapter UTT file format |
---|
| 139 | |
---|
| 140 | A UTT file contains annotation of a text. It consists of a sequence of |
---|
| 141 | segments. Each segment explicitly refers to a continuous piece of the |
---|
| 142 | text and provides some information on it. |
---|
| 143 | |
---|
| 144 | @section Segment format |
---|
| 145 | |
---|
| 146 | A segment occupies one line of a UTT file and consists of |
---|
| 147 | space-separated fields: |
---|
| 148 | |
---|
| 149 | |
---|
| 150 | @quotation |
---|
| 151 | @sp 1 |
---|
| 152 | [@var{start} [@var{length}]] @var{type} @var{form} [@var{annotation1} [@var{annotation2} ...]] |
---|
| 153 | @sp 1 |
---|
| 154 | @end quotation |
---|
| 155 | |
---|
| 156 | @table @var |
---|
| 157 | |
---|
| 158 | @item @var{start} |
---|
| 159 | Non-negative integer value indicating the position in the source text where the |
---|
| 160 | segment starts. |
---|
| 161 | |
---|
| 162 | @item @var{length} |
---|
| 163 | Non-negative integer value indicating the length of the segment. |
---|
| 164 | |
---|
| 165 | @item @var{type} |
---|
| 166 | A sequence of non-blank characters (not consisting solely of digits, which could lead to @var{type} being misinterpreted as a @var{start} or @var{length} field). |
---|
| 167 | @var{type} reflects the main classification of segments - |
---|
| 168 | into words, numbers, punctuation marks, meta-text markers. |
---|
| 169 | @xref{tok output,,tok output}, for a description of the automatically recognized type markers. |
---|
| 170 | |
---|
| 171 | @item @var{form} |
---|
| 172 | This field contains the textual form of the segment or the special |
---|
| 173 | symbol @code{*} indicating that the form is not given (e.g. when the segment has been created artificially to mark something and is of length 0). |
---|
| 174 | |
---|
| 175 | The characters or character sequences that have special meaning in the |
---|
| 176 | @var{form} field are enumerated below. |
---|
| 177 | |
---|
| 178 | Characters with special meaning: |
---|
| 179 | |
---|
| 180 | @itemize |
---|
| 181 | @item @code{_} - space character |
---|
| 182 | @item @code{*} - undefined contents |
---|
| 183 | @end itemize |
---|
| 184 | |
---|
| 185 | Escape sequences: |
---|
| 186 | |
---|
| 187 | @itemize |
---|
| 188 | @item @code{\n} - new line |
---|
| 189 | @item @code{\t} - tabulation |
---|
| 190 | @item @code{\r} - carriage return |
---|
| 191 | |
---|
| 192 | @item @code{\_} - the @code{_} character |
---|
| 193 | @item @code{\*} - the @code{*} character |
---|
| 194 | @item @code{\\} - the @code{\} character |
---|
| 195 | |
---|
| 196 | @c @item @code{\hh} - a character with hexadecimal code @code{hh} (used for non-printable characters) |
---|
| 197 | @end itemize |
---|
| 198 | |
---|
| 199 | @item @var{annotation1} |
---|
| 200 | @item @var{annotation2} |
---|
| 201 | @item ... |
---|
| 202 | Annotation fields have the following format: |
---|
| 203 | |
---|
| 204 | @var{longname} @code{:} @var{value} |
---|
| 205 | |
---|
| 206 | or |
---|
| 207 | |
---|
| 208 | @var{shortname} @var{value} |
---|
| 209 | |
---|
| 210 | where @var{longname} is a string of alphanumeric characters |
---|
| 211 | (isalnum() test), @var{shortname} - a single non-alphanumeric character |
---|
| 212 | (ispunct() test), and @var{value} is an arbitrary string of non-blank characters. |
---|
| 213 | |
---|
| 214 | @end table |
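
The special characters and escape sequences of the form field listed above can be illustrated with a short Python sketch. It is not part of UTT: the name decode_form is hypothetical, and the sketch assumes that a form is undefined only when the whole field is a single asterisk, and that unknown escape sequences are kept verbatim.

```python
# Hypothetical helper (not part of UTT): decode the form field of a segment.
ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "_": "_", "*": "*", "\\": "\\"}


def decode_form(form):
    """Return the text denoted by a form field, or None if the form is '*'."""
    if form == "*":
        return None  # undefined contents
    out = []
    i = 0
    while i < len(form):
        ch = form[i]
        if ch == "\\" and i + 1 < len(form):
            nxt = form[i + 1]
            # Assumption: an unknown escape sequence is kept as written.
            out.append(ESCAPES.get(nxt, "\\" + nxt))
            i += 2
        elif ch == "_":
            out.append(" ")  # an unescaped underscore stands for a space
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For instance, decoding a lone underscore gives a single space, while the two-character sequence backslash-underscore gives a literal underscore.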
---|
| 215 | |
---|
| 216 | |
---|
| 217 | Only two fields are mandatory: @var{type} and @var{form}. All other fields |
---|
| 218 | may be absent. In the case when only one number precedes the |
---|
| 219 | @var{type} field, it is interpreted as the @var{start} position. |
---|
| 220 | |
---|
| 221 | If the @var{length} field is omitted, the length of the segment is the |
---|
| 222 | length of the @var{form} field, except when the value of the |
---|
| 223 | @var{form} field is @code{*} -- in this case, the length is assumed to |
---|
| 224 | be 0. |
---|
| 225 | |
---|
| 226 | If the @var{start} field is also absent, the segment is assumed to directly |
---|
| 227 | follow the preceding one. |
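
The rules for optional fields can be sketched in Python as follows. This is only an illustration, not the code used by the UTT components. All function names are hypothetical, and the sketch assumes that an escape sequence in the form field counts as one character of text, that the first segment starts at position 0 when no start is given, and that long annotation names consist of ASCII letters and digits.

```python
import re


def parse_segment(line):
    """Split one UTT line into (start, length, type, form, annotations)."""
    fields = line.split()
    start = length = None
    # Up to two leading all-digit fields are the start and the length.
    if fields and fields[0].isdigit():
        start = int(fields.pop(0))
        if fields and fields[0].isdigit():
            length = int(fields.pop(0))
    if len(fields) < 2:
        raise ValueError("type and form fields are mandatory: %r" % line)
    return start, length, fields[0], fields[1], fields[2:]


def form_length(form):
    """Length implied by a form field; '*' means an empty segment."""
    if form == "*":
        return 0
    # Assumption: every escape sequence stands for one character.
    return len(re.sub(r"\\.", "x", form))


def parse_annotation(field):
    """Return (name, value) for 'longname:value' or 'Xvalue' annotations."""
    m = re.fullmatch(r"([A-Za-z0-9]+):(.+)", field)
    if m:
        return m.group(1), m.group(2)
    return field[0], field[1:]  # single-character short name


def resolve_positions(lines):
    """Fill in omitted start and length values for a sequence of segments."""
    resolved = []
    next_start = 0  # assumption: numbering starts at 0
    for line in lines:
        start, length, seg_type, form, ann = parse_segment(line)
        if start is None:
            start = next_start  # directly follows the preceding segment
        if length is None:
            length = form_length(form)
        resolved.append((start, length, seg_type, form, ann))
        next_start = start + length
    return resolved
```

Applied to the lines "0000 BOS *", "W Piszemy", "S _", "W dobre", the sketch assigns the positions 0, 0, 7, 8 and the lengths 0, 7, 1, 5.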
---|
| 228 | |
---|
| 229 | @c Conventions: |
---|
| 230 | |
---|
| 231 | @c Annotation fields with predefined meaning: |
---|
| 232 | |
---|
| 233 | @c @itemize |
---|
| 234 | @c @item @code{!} - UTT components are allowed to modify the contents of |
---|
| 235 | @c the @var{form} field (e.g. spelling correction does this). If this happens the |
---|
| 236 | @c original form of the segment have to be placed in the @code{!}-field. |
---|
| 237 | @c @item @code{@@} - morphological description |
---|
| 238 | @c @item @code{=} - node identifier assignment (used in graph encoding) |
---|
| 239 | @c @item @code{<} - preceding/dominating node(s) (used in graph encoding) |
---|
| 240 | @c @item @code{>} - succeeding/subordinate node(s) (used in graph encoding) |
---|
| 241 | @c @end itemize |
---|
| 242 | |
---|
| 243 | Segments of length 0 may be used to mark file positions with some |
---|
| 244 | information. See e.g. BOS and EOS (beginning/end of sentence) markers |
---|
| 245 | in the example below. |
---|
| 246 | |
---|
| 247 | Example: |
---|
| 248 | |
---|
| 249 | sentence: @samp{Piszemy dobre progrumy.} |
---|
| 250 | |
---|
| 251 | @example |
---|
| 252 | 0000 00 BOS * |
---|
[19760ef] | 253 | 0000 07 W Piszemy lem:pisać,V |
---|
[25ae32e] | 254 | 0007 01 S _ |
---|
| 255 | 0008 05 W dobre lem:dobry,ADJ |
---|
| 256 | 0013 01 S _ |
---|
| 257 | 0014 08 W progrumy cor:programy lem:program,N |
---|
| 258 | 0022 01 P . |
---|
| 259 | 0023 00 EOS * |
---|
| 260 | 0023 01 S _ |
---|
| 261 | 0024 00 BOS * |
---|
| 262 | 0024 11 W Warszawiacy lem:Warszawiak,N |
---|
| 263 | 0035 01 S _ |
---|
[19760ef] | 264 | 0036 03 W też |
---|
[25ae32e] | 265 | 0039 01 P . |
---|
| 266 | 0040 00 EOS * |
---|
| 267 | |
---|
| 268 | @end example |
---|
| 269 | |
---|
| 270 | @example |
---|
| 271 | 0000 BOS * |
---|
[19760ef] | 272 | 0000 W Piszemy lem:pisać,V |
---|
[25ae32e] | 273 | 0007 S _ |
---|
| 274 | 0008 W dobre lem:dobry,ADJ |
---|
| 275 | 0013 S _ |
---|
| 276 | 0014 W progrumy cor:programy lem:program,N |
---|
| 277 | 0022 P . |
---|
| 278 | 0023 EOS * |
---|
| 279 | @end example |
---|
| 280 | |
---|
| 281 | Position information may be provided only for some types of segments: |
---|
| 282 | |
---|
| 283 | @example |
---|
| 284 | 0000 BOS * |
---|
[19760ef] | 285 | W Piszemy lem:pisać,V |
---|
[25ae32e] | 286 | S _ |
---|
| 287 | W dobre lem:dobry,ADJ |
---|
| 288 | S _ |
---|
| 289 | W progrumy cor:programy lem:program,N |
---|
| 290 | P . |
---|
| 291 | EOS * |
---|
| 292 | S _ |
---|
| 293 | 0024 BOS * |
---|
| 294 | W Warszawiacy lem:Warszawiak,N |
---|
| 295 | S _ |
---|
[19760ef] | 296 | W też |
---|
[25ae32e] | 297 | P . |
---|
| 298 | EOS * |
---|
| 299 | @end example |
---|
| 300 | |
---|
| 301 | Position/length information may be provided only when necessary: |
---|
| 302 | |
---|
| 303 | @example |
---|
| 304 | 0000 04 N * |
---|
| 305 | 0000 N 12 |
---|
| 306 | P . |
---|
| 307 | N 5 |
---|
| 308 | S _ |
---|
| 309 | W km |
---|
| 310 | @end example |
---|
| 311 | |
---|
| 312 | @section UTT File |
---|
| 313 | |
---|
| 314 | A UTT file consists of a sequence of segments. The same text position |
---|
| 315 | may be covered by multiple segments. In consequence, ambiguous text |
---|
| 316 | segmentation and ambiguous annotation may be represented. |
---|
| 317 | |
---|
| 318 | There are two structural requirements a valid UTT-formatted file |
---|
| 319 | has to meet: |
---|
| 320 | |
---|
| 321 | @itemize @bullet |
---|
| 322 | |
---|
| 323 | @item |
---|
| 324 | segments have to be sorted with respect to the @var{start} field, |
---|
| 325 | |
---|
| 326 | @item |
---|
| 327 | for each |
---|
| 328 | segment ending at position @var{n}, either there must be a segment starting at |
---|
| 329 | position @var{n+1}, or position @var{n+1} is not covered by any segment; similarly |
---|
| 330 | for each segment starting at position @var{n}, either there must be a segment |
---|
| 331 | ending at position @var{n-1}, or the position @var{n-1} must not be covered |
---|
| 332 | by any segment. |
---|
| 333 | |
---|
| 334 | @end itemize |
---|
| 335 | |
---|
| 336 | A valid annotation for the text fragment |
---|
| 337 | @example |
---|
| 338 | 12.5 km |
---|
| 339 | @end example |
---|
| 340 | |
---|
| 341 | may be |
---|
| 342 | |
---|
| 343 | @example |
---|
| 344 | 0000 02 N 12 |
---|
| 345 | 0000 04 N 12.5 |
---|
| 346 | 0002 01 P . |
---|
| 347 | 0003 01 N 5 |
---|
| 348 | 0004 01 S _ |
---|
| 349 | 0005 02 W km |
---|
| 350 | @end example |
---|
| 351 | |
---|
| 352 | but not |
---|
| 353 | |
---|
| 354 | @example |
---|
| 355 | 0000 02 N 12 |
---|
| 356 | 0000 04 N 12.5 |
---|
| 357 | 0004 01 S _ |
---|
| 358 | 0005 02 W km |
---|
| 359 | @end example |
---|
| 360 | |
---|
[261bf62] | 361 | because in the latter example the first segment (starting at position |
---|
| 362 | 0000, 2 characters long) ends at position @var{n}=0001 which is |
---|
| 363 | covered by the second segment and no segment starts at position |
---|
| 364 | @var{n+1}=0002. |
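
The two structural requirements can be checked mechanically. The Python sketch below is an illustration under stated assumptions, not a UTT tool: the function name validate is hypothetical, segments are given as (start, length) pairs in file order, and zero-length segments are assumed to be exempt from the second requirement because they cover no text position.

```python
def validate(segments):
    """Check a list of (start, length) pairs; return a list of error messages."""
    errors = []
    # Requirement 1: segments are sorted by start position.
    if any(a[0] > b[0] for a, b in zip(segments, segments[1:])):
        errors.append("segments are not sorted by start position")

    # Inclusive (first, last) positions of all non-empty segments.
    spans = [(s, s + n - 1) for s, n in segments if n > 0]
    starts = {first for first, _ in spans}
    ends = {last for _, last in spans}

    def covered(pos):
        return any(first <= pos <= last for first, last in spans)

    # Requirement 2: segment boundaries must agree with each other.
    for first, last in spans:
        if last + 1 not in starts and covered(last + 1):
            errors.append("no segment starts at position %d" % (last + 1))
        if first - 1 not in ends and covered(first - 1):
            errors.append("no segment ends at position %d" % (first - 1))
    return errors
```

For the valid annotation of the fragment above the sketch returns an empty list. For the invalid one it reports that no segment starts at position 2.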
---|
| 365 | |
---|
| 366 | |
---|
| 367 | @section Flattened UTT file |
---|
| 368 | |
---|
| 369 | The UTT file format has two variants: regular and flattened. The regular |
---|
| 370 | format was described above. In the flattened format some of the |
---|
| 371 | end-of-line characters are replaced with line-feed characters. |
---|
| 372 | |
---|
| 373 | The flattened format is mainly used to represent whole sentences as |
---|
| 374 | single lines of the input file (all intrasentential end-of-line |
---|
| 375 | characters are replaced with line-feed characters). |
---|
| 376 | |
---|
| 377 | This technical trick makes it possible to perform certain text |
---|
| 378 | processing operations on entire sentences with the use of such tools as |
---|
| 379 | @command{grep} (see @command{grp} component) or @command{sed} (see @command{mar} component). |
---|
| 380 | |
---|
| 381 | The conversion between the two formats is performed by the tools: |
---|
| 382 | @command{fla} and @command{unfla}. |
---|
[25ae32e] | 383 | |
---|
| 384 | @section Character encoding |
---|
| 385 | |
---|
| 386 | The UTT component programs accept only 1-byte character encodings, such |
---|
[261bf62] | 387 | as ISO, ANSI, DOS. |
---|
[25ae32e] | 388 | |
---|
| 389 | |
---|
| 390 | @c @section Formats |
---|
| 391 | |
---|
| 392 | @c @unnumberedsubsubsec Basic format |
---|
| 393 | |
---|
| 394 | @c While processing large amounts of the overhead related with explicit |
---|
| 395 | @c ... of the start position and segment length becomes ... . Therefore, |
---|
| 396 | @c for efficiency reasons certain shortcuts are possible: |
---|
| 397 | |
---|
| 398 | @c @unnumberedsubsubsec Relative start position |
---|
| 399 | |
---|
| 400 | @c Start position may be given as relative distance from the last |
---|
| 401 | @c absolut position. |
---|
| 402 | |
---|
| 403 | @c @unnumberedsubsubsec Absent length |
---|
| 404 | |
---|
| 405 | @c Segment length may by omitted. Normally it can be restored by counting |
---|
| 406 | @c the length of the @emph{form field}. For segments with the special value |
---|
| 407 | @c @code{*} in the @emph{form field} length 0 is assumed. |
---|
| 408 | |
---|
| 409 | @c @unnumberedsubsubsec Absent length and start position |
---|
| 410 | |
---|
| 411 | @c Both start position and segment length may be omitted. In this format |
---|
| 412 | @c each segment is assumed to follow the previous one. This format is, |
---|
| 413 | @c therefore, suitable only for unambiguously tagged text |
---|
| 414 | @c (0-length markers can be still used.) |
---|
| 415 | |
---|
| 416 | |
---|
| 417 | @c @table @code |
---|
| 418 | @c @item AL |
---|
| 419 | @c @code{1234 03 W kot} |
---|
| 420 | @c @item RL |
---|
| 421 | @c @code{+56 03 W kot} |
---|
| 422 | @c @item A |
---|
| 423 | @c @code{1234 W kot} |
---|
| 424 | @c @item R |
---|
| 425 | @c @code{+56 W kot} |
---|
| 426 | @c @item 0 |
---|
| 427 | @c @code{W kot} |
---|
| 428 | @c @end table |
---|
| 429 | |
---|
| 430 | |
---|
[19760ef] | 431 | @c [HOW TO GET POLISH FONTS IN DVI???] |
---|
[25ae32e] | 432 | |
---|
| 433 | @macro parhelp |
---|
| 434 | @item @b{@minus{}@minus{}help}, @b{@minus{}h} |
---|
| 435 | Print help. |
---|
| 436 | @end macro |
---|
| 437 | |
---|
| 438 | |
---|
| 439 | @macro parversion |
---|
| 440 | @item @b{@minus{}@minus{}version}, @b{@minus{}V} |
---|
| 441 | Print version information. |
---|
| 442 | @end macro |
---|
| 443 | |
---|
| 444 | @macro parinteractive |
---|
| 445 | @item @b{@minus{}@minus{}interactive, @minus{}i} |
---|
| 446 | This option toggles interactive mode, which is by default off. In the |
---|
| 447 | interactive mode the program does not buffer the output. |
---|
| 448 | @end macro |
---|
| 449 | |
---|
| 450 | |
---|
| 451 | @c @macro parfile |
---|
| 452 | @c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}} |
---|
| 453 | @c Input file name. |
---|
| 454 | @c If this option is absent or equal to '@minus{}', the program |
---|
| 455 | @c reads from the standard input. |
---|
| 456 | @c @end macro |
---|
| 457 | |
---|
| 458 | |
---|
| 459 | @c @macro paroutput |
---|
| 460 | @c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}} |
---|
| 461 | @c Regular output file name. To regular output the program sends segments |
---|
| 462 | @c which it successfully processed and copies those which were not |
---|
| 463 | @c subject to processing. If this option is absent or equal to |
---|
| 464 | @c '@minus{}', standard output is used. |
---|
| 465 | @c @end macro |
---|
| 466 | |
---|
| 467 | @c @macro parfail |
---|
| 468 | @c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}} |
---|
| 469 | @c Fail output file name. To fail output the program copies the segments |
---|
| 470 | @c it failed to process. If this option is absent or equal to |
---|
| 471 | @c '@minus{}', standard output is used. |
---|
| 472 | @c @end macro |
---|
| 473 | |
---|
| 474 | |
---|
| 475 | @c @macro parcopy |
---|
| 476 | @c @item @b{@minus{}@minus{}copy, @minus{}c} |
---|
| 477 | @c Copy succesfully processed segments to regular output also in their |
---|
| 478 | @c original input form. |
---|
| 479 | @c @end macro |
---|
| 480 | |
---|
| 481 | |
---|
| 482 | @macro parinputfield |
---|
| 483 | @item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}} |
---|
| 484 | The field containing the input to the program. The default is the |
---|
| 485 | @var{form} field. The fields @var{position}, @var{length}, @var{type}, |
---|
| 486 | and @var{form} are referred to as @code{1}, @code{2}, @code{3}, |
---|
| 487 | @code{4}, respectively. |
---|
| 488 | @end macro |
---|
| 489 | |
---|
| 490 | |
---|
| 491 | @macro paroutputfield |
---|
| 492 | @item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}} |
---|
| 493 | The name of the field added by the program. The default is the name of the program. |
---|
| 494 | @end macro |
---|
| 495 | |
---|
| 496 | |
---|
| 497 | @macro pardictionary |
---|
| 498 | @item @b{@minus{}@minus{}dictionary=@var{filename}, @minus{}d @var{filename}} |
---|
| 499 | Dictionary file name. |
---|
| 500 | @end macro |
---|
| 501 | |
---|
| 502 | |
---|
| 503 | @macro parprocess |
---|
| 504 | @item @b{@minus{}@minus{}process=@var{type}, @minus{}p @var{type}} |
---|
| 505 | Process segments with the specified value in the @var{type} field. |
---|
| 506 | Multiple occurrences of this option are allowed and are interpreted as |
---|
| 507 | disjunction. If this option is absent, all segments are processed. |
---|
| 508 | @end macro |
---|
| 509 | |
---|
| 510 | |
---|
| 511 | @macro parselect |
---|
| 512 | @item @b{@minus{}@minus{}select=@var{fieldname}, @minus{}s @var{fieldname}} |
---|
| 513 | Select for processing only segments in which the field named |
---|
| 514 | @var{fieldname} is present. Multiple occurrences of this option are |
---|
| 515 | allowed and are interpreted as conjunction of conditions. If this |
---|
| 516 | option is absent, all segments are processed. |
---|
| 517 | @end macro |
---|
| 518 | |
---|
| 519 | |
---|
| 520 | @macro parunselect |
---|
| 521 | @item @b{@minus{}@minus{}unselect=@var{fieldname}, @minus{}S @var{fieldname}} |
---|
| 522 | Select for processing only segments in which the field @var{fieldname} |
---|
| 523 | is absent. Multiple occurrences of this option are allowed and are |
---|
| 524 | interpreted as conjunction of conditions. If this option is absent, |
---|
| 525 | all segments are processed. |
---|
| 526 | @end macro |
---|
| 527 | |
---|
| 528 | |
---|
| 529 | @macro paroneline |
---|
| 530 | @item @b{@minus{}@minus{}one-line} |
---|
| 531 | This option makes the program print ambiguous annotation in one output |
---|
| 532 | line by generating multiple annotation fields. By default when |
---|
| 533 | ambiguous annotation may be produced for a segment, the segment is |
---|
| 534 | duplicated and each of the annotations is added to a separate copy of |
---|
| 535 | the segment. |
---|
| 536 | @end macro |
---|
| 537 | |
---|
| 538 | |
---|
| 539 | @macro paronefield |
---|
| 540 | @item @b{@minus{}@minus{}one-field, @minus{}1} |
---|
| 541 | This option makes the program print ambiguous annotation in one |
---|
| 542 | annotation field. By default when ambiguous annotation may be produced |
---|
| 543 | for a segment, the segment is duplicated and each of the |
---|
| 544 | annotations is added to a separate copy of the segment. |
---|
| 545 | |
---|
| 546 | This option is useful when working with @command{kot} or @command{con}. |
---|
| 547 | @end macro |
---|
| 548 | |
---|
| 549 | |
---|
| 550 | @c --------------------------------------------------------------------- |
---|
| 551 | @c CONFIGURATION FILES |
---|
| 552 | @c --------------------------------------------------------------------- |
---|
| 553 | |
---|
| 554 | @node Configuration files |
---|
| 555 | @chapter Configuration files |
---|
| 556 | |
---|
| 557 | Values for all command line options accepted by a component |
---|
| 558 | may be set in configuration files. The default locations of the |
---|
| 559 | configuration files for a component named @command{@var{program}} are |
---|
| 560 | |
---|
| 561 | @example |
---|
[246900a] | 562 | @file{/usr/local/etc/utt/@var{program}.conf} |
---|
[25ae32e] | 563 | @end example |
---|
| 564 | |
---|
| 565 | for the system-wide configuration file and |
---|
| 566 | |
---|
| 567 | @example |
---|
[246900a] | 568 | @file{~/.utt/@var{program}.conf} |
---|
[25ae32e] | 569 | @end example |
---|
| 570 | |
---|
| 571 | for the user configuration file. |
---|
| 572 | |
---|
| 573 | @c The configuration file to load may be also specified with the |
---|
| 574 | @c @option{--config} option. Configuration file need not be provided. |
---|
| 575 | |
---|
| 576 | For each option, the value is set according to the following priority: |
---|
| 577 | |
---|
| 578 | @itemize |
---|
| 579 | @item command line |
---|
| 580 | @c @item configuration file indicated with @option{--config} option |
---|
| 581 | @item user configuration file (or configuration file indicated with the @option{--config} option) |
---|
| 582 | @item system-wide configuration file |
---|
| 583 | @end itemize |
---|
| 584 | |
---|
| 585 | Parameter values are specified in the following format: |
---|
| 586 | |
---|
| 587 | @var{parametername}=@var{value} |
---|
| 588 | |
---|
| 589 | where @var{parametername} is the short or long name of an option accepted by |
---|
| 590 | the program, or |
---|
| 591 | |
---|
| 592 | @var{parametername} |
---|
| 593 | |
---|
| 594 | if the option does not need arguments. |
---|
| 595 | |
---|
| 596 | You can introduce comments to configuration files using the # sign. |
---|
| 597 | |
---|
| 598 | If a program accepts multiple occurrences of an option (e.g. the select option of @command{lem}), you can specify them on separate lines of the program's configuration file. |
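
The file format and the priority order described above can be sketched in Python. This is an illustration only. The names parse_config and effective_options are hypothetical, and the sketch makes three assumptions the manual does not confirm: the # sign starts a comment anywhere on a line, an option without an argument is stored as True, and a higher-priority source replaces (rather than extends) the values of a repeated option.

```python
def parse_config(text):
    """Parse configuration text into a dict: option name -> list of values."""
    options = {}
    for raw in text.splitlines():
        # Assumption: everything after '#' is a comment.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, value = line.partition("=")
        # An option given without '=' takes no argument; store True for it.
        entry = value.strip() if sep else True
        # Repeated options accumulate, one value per line.
        options.setdefault(name.strip(), []).append(entry)
    return options


def effective_options(command_line, user_conf, system_conf):
    """Merge option dicts; later sources in the loop have higher priority."""
    merged = {}
    for source in (system_conf, user_conf, command_line):
        merged.update(source)
    return merged
```

With a system-wide file that sets dictionary and interactive, a user file that sets dictionary and two select lines, and a command line that sets select once, the merged result takes dictionary from the user file, interactive from the system-wide file, and select from the command line.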
---|
| 599 | |
---|
| 600 | @c The equal sign may be omitted. |
---|
| 601 | |
---|
| 602 | |
---|
| 603 | @quotation Tip |
---|
| 604 | If you have two (or more) frequently used sets of options for the same |
---|
| 605 | program (e.g. @command{lem} with the PMDBF dictionary and @command{lem} with a user dictionary), |
---|
| 606 | a good solution is to create two soft links to @command{lem}, called |
---|
| 607 | e.g. @command{lemg} and @command{lemu}, and specify their configuration in the files @file{lemg.conf} |
---|
| 608 | and @file{lemu.conf}, respectively. |
---|
| 609 | @end quotation |
---|
| 610 | |
---|
| 611 | @c --------------------------------------------------------------------- |
---|
| 612 | @c COMPONENTS |
---|
| 613 | @c --------------------------------------------------------------------- |
---|
| 614 | |
---|
| 615 | @node UTT components |
---|
| 616 | @chapter UTT components |
---|
| 617 | |
---|
| 618 | UTT components are of three types: |
---|
| 619 | |
---|
| 620 | @menu |
---|
| 621 | Sources: programs which read non-UTT data (e.g. raw text) and produce output |
---|
| 622 | in UTT format |
---|
| 623 | * tok:: a tokenizer |
---|
| 624 | |
---|
| 625 | Filters: programs which read and produce UTT-formatted data |
---|
| 626 | * lem:: a morphological analyzer |
---|
| 627 | * gue:: a morphological guesser |
---|
[261bf62] | 628 | * cor:: a simple spelling corrector |
---|
| 629 | * kor:: a more elaborate spelling corrector |
---|
[25ae32e] | 630 | * sen:: a sentence splitter |
---|
| 631 | * ser:: a pattern search tool (marks matches) |
---|
[261bf62] | 632 | * mar:: a pattern search tool (introduces arbitrary markers into the text) |
---|
[25ae32e] | 633 | * grp:: a pattern search tool (selects sentences containing a match) |
---|
[261bf62] | 634 | @c * gph:: a word-graph annotation tool:: |
---|
| 635 | @c * dgp:: a dependency parser |
---|
[25ae32e] | 636 | |
---|
| 637 | Sinks: programs which read UTT data and produce output in another format |
---|
| 638 | * kot:: an untokenizer |
---|
| 639 | * con:: a concordance table generator |
---|
| 640 | @end menu |
---|
| 641 | |
---|
| 642 | @c --------------------------------------------------------------------- |
---|
| 643 | @c TOK |
---|
| 644 | @c --------------------------------------------------------------------- |
---|
| 645 | |
---|
| 646 | @page |
---|
| 647 | @node tok |
---|
| 648 | @section tok - a tokenizer |
---|
| 649 | |
---|
| 650 | @c ---------------------------------------- |
---|
| 651 | |
---|
| 652 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} |
---|
[19760ef] | 653 | @item @strong{Authors:} @tab Tomasz Obrębski |
---|
[25ae32e] | 654 | @item @strong{Component category:} @tab source |
---|
[261bf62] | 655 | @item @strong{Input format:} @tab raw text file |
---|
| 656 | @item @strong{Output format:} @tab UTT regular |
---|
| 657 | @item @strong{Required annotation:} @tab - |
---|
[25ae32e] | 658 | @end multitable |
---|
| 659 | |
---|
| 660 | |
---|
| 661 | @menu |
---|
| 662 | * tok description:: |
---|
| 663 | * tok input:: |
---|
| 664 | * tok output:: |
---|
| 665 | * tok command line options:: |
---|
| 666 | * tok example:: |
---|
| 667 | @end menu |
---|
| 668 | |
---|
| 669 | @node tok description |
---|
| 670 | @subsection Description |
---|
| 671 | |
---|
| 672 | @code{tok} is a simple program which reads a text file and identifies |
---|
| 673 | tokens on the basis of their orthographic form. The type of the token |
---|
| 674 | is printed as the @var{type} field. |
---|
| 675 | |
---|
| 676 | @node tok input |
---|
| 677 | @subsection Input |
---|
| 678 | |
---|
| 679 | Raw text. |
---|
| 680 | |
---|
| 681 | @node tok output |
---|
| 682 | @subsection Output |
---|
| 683 | |
---|
| 684 | A UTT file with four fields: @var{start}, @var{length}, @var{type}, and @var{form}. In the @var{type} field five types of tokens are distinguished: |
---|
| 685 | |
---|
| 686 | @itemize |
---|
| 687 | |
---|
| 688 | @item @code{W} |
---|
| 689 | (word) |
---|
| 690 | - continuous sequence of letters |
---|
| 691 | |
---|
| 692 | @item @code{N} |
---|
| 693 | (number) |
---|
| 694 | - continuous sequence of digits |
---|
| 695 | |
---|
| 696 | @item @code{S} |
---|
| 697 | (space) |
---|
| 698 | - continuous sequence of space characters |
---|
| 699 | |
---|
| 700 | @item @code{P} |
---|
| 701 | (punctuation mark) |
---|
| 702 | - single printable characters not belonging to any of the other classes |
---|
| 703 | |
---|
| 704 | @item @code{B} |
---|
| 705 | (unprintable character) |
---|
| 706 | - single unprintable character |
---|
| 707 | |
---|
| 708 | @end itemize |
---|
| 709 | |
---|
| 710 | |
---|
| 711 | |
---|
| 712 | @node tok command line options |
---|
| 713 | @subsection Command line options |
---|
| 714 | |
---|
| 715 | @table @code |
---|
| 716 | |
---|
| 717 | @item @b{@minus{}@minus{}help}, @b{@minus{}h} |
---|
| 718 | Print help. |
---|
| 719 | |
---|
| 720 | @item @b{@minus{}@minus{}version}, @b{@minus{}V} |
---|
| 721 | Print version information. |
---|
| 722 | |
---|
| 723 | @item @b{@minus{}@minus{}interactive, @minus{}i} |
---|
| 724 | This option toggles interactive mode, which is by default off. In the |
---|
| 725 | interactive mode the program does not buffer the output. |
---|
| 726 | |
---|
| 727 | @end table |
---|
| 728 | |
---|
| 729 | @node tok example |
---|
| 730 | @subsection Example |
---|
| 731 | |
---|
| 732 | Input: |
---|
| 733 | |
---|
| 734 | @example |
---|
| 735 | Piszemy dobre programy. |
---|
| 736 | @end example |
---|
| 737 | |
---|
| 738 | Output: |
---|
| 739 | |
---|
| 740 | @example |
---|
| 741 | 0000 07 W Piszemy |
---|
| 742 | 0007 01 S _ |
---|
| 743 | 0008 05 W dobre |
---|
| 744 | 0013 01 S _ |
---|
| 745 | 0014 08 W programy |
---|
| 746 | 0022 01 P . |
---|
| 747 | 0023 01 S \n |
---|
| 748 | @end example |
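
The token classes and the output layout shown above can be imitated with a few lines of Python. This sketch is not the tok implementation. The names TOKEN_RE, encode_form and tokenize are hypothetical, positions are character offsets (which equal byte offsets only for 1-byte encodings), and the way a run of several space characters is written in the form field is an assumption.

```python
import re

# One alternative per token class; X is any single remaining character.
TOKEN_RE = re.compile(
    r"(?P<W>[^\W\d_]+)|(?P<N>\d+)|(?P<S>\s+)|(?P<X>.)", re.DOTALL
)

# Characters that need special treatment in the form field.
ENCODE = {" ": "_", "\n": "\\n", "\t": "\\t", "\r": "\\r",
          "_": "\\_", "*": "\\*", "\\": "\\\\"}


def encode_form(text):
    """Write a token as a UTT form field."""
    return "".join(ENCODE.get(ch, ch) for ch in text)


def tokenize(text):
    """Return UTT lines with the fields start, length, type and form."""
    lines = []
    for m in TOKEN_RE.finditer(text):
        token = m.group()
        kind = m.lastgroup
        if kind == "X":
            # Single character: punctuation if printable, otherwise B.
            kind = "P" if token.isprintable() else "B"
        lines.append("%04d %02d %s %s"
                     % (m.start(), len(token), kind, encode_form(token)))
    return lines
```

Calling tokenize on the input sentence of this example, followed by a newline character, yields the seven lines of the output shown above.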
---|
| 749 | |
---|
| 750 | |
---|
| 751 | @c --------------------------------------------------------------------- |
---|
| 752 | @c SEN |
---|
| 753 | @c --------------------------------------------------------------------- |
---|
| 754 | |
---|
| 755 | @c @node sen - sentencizer |
---|
| 756 | @c @chapter sen - sentencizer |
---|
| 757 | |
---|
[19760ef] | 758 | @c Authors: Tomasz Obrębski |
---|
[25ae32e] | 759 | |
---|
| 760 | @c --------------------------------------------------------------------- |
---|
| 761 | @c LEM |
---|
| 762 | @c --------------------------------------------------------------------- |
---|
| 763 | |
---|
| 764 | @page |
---|
| 765 | @node lem |
---|
| 766 | @section lem - morphological analyzer |
---|
| 767 | |
---|
| 768 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} |
---|
[19760ef] | 769 | @item @strong{Authors:} @tab Tomasz Obrębski, Michał Stolarski |
---|
[25ae32e] | 770 | @item @strong{Component category:} @tab filter |
---|
[261bf62] | 771 | @item @strong{Input format:} @tab UTT regular |
---|
| 772 | @item @strong{Output format:} @tab UTT regular |
---|
| 773 | @item @strong{Required annotation:} @tab tok |
---|
[25ae32e] | 774 | @end multitable |
---|
| 775 | |
---|
| 776 | @menu |
---|
| 777 | * lem description:: |
---|
| 778 | * lem command line options:: |
---|
| 779 | * lem input:: |
---|
| 780 | * lem output:: |
---|
| 781 | * lem example:: |
---|
| 782 | * lem dictionaries:: |
---|
| 783 | * lem hints:: |
---|
| 784 | @end menu |
---|
| 785 | |
---|
| 786 | @node lem description |
---|
| 787 | @subsection Description |
---|
| 788 | |
---|
| 789 | @command{lem} performs morphological analysis of a simple orthographic |
---|
| 790 | word, returning all its possible morphological annotations, |
---|
| 791 | disregarding the context. |
---|
| 792 | |
---|
| 793 | @c ---------------------------------------- |
---|
| 794 | |
---|
| 795 | @node lem command line options |
---|
| 796 | @subsection Command line options |
---|
| 797 | |
---|
| 798 | @table @code |
---|
| 799 | @parhelp |
---|
| 800 | @parversion |
---|
| 801 | @parinteractive |
---|
| 802 | @c @parfile |
---|
| 803 | @c @paroutput |
---|
| 804 | @c @parfail |
---|
| 805 | @c @parcopy |
---|
| 806 | @parinputfield |
---|
| 807 | @paroutputfield |
---|
| 808 | @pardictionary |
---|
| 809 | @parprocess |
---|
| 810 | @parselect |
---|
| 811 | @parunselect |
---|
| 812 | @paroneline |
---|
| 813 | @paronefield |
---|
| 814 | @end table |
---|
| 815 | |
---|
| 816 | @c ---------------------------------------- |
---|
| 817 | |
---|
| 818 | @node lem input |
---|
| 819 | @subsection Input |
---|
| 820 | |
---|
| 821 | @command{lem} reads a UTT file and processes the value of the @var{form} field |
---|
| 822 | (the input field may be changed with the @option{--input-field} option). |
---|
| 823 | |
---|
| 824 | @node lem output |
---|
| 825 | @subsection Output |
---|
| 826 | |
---|
| 827 | @command{lem} adds a new annotation field, whose default name is @code{lem}. In |
---|
| 828 | case of ambiguity, either the segment is duplicated (default), |
---|
| 829 | multiple @code{lem} fields are added (@option{--one-line}), or an ambiguous |
---|
| 830 | annotation is produced as the value of a single @code{lem} field (option |
---|
| 831 | @option{--one-field,-1}): |
---|
| 832 | |
---|
| 833 | @itemize @bullet |
---|
| 834 | |
---|
| 835 | @item |
---|
| 836 | unambiguous value format: |
---|
| 837 | |
---|
| 838 | @example |
---|
| 839 | <lemma>,<descr> |
---|
| 840 | @end example |
---|
| 841 | |
---|
| 842 | @item |
---|
| 843 | ambiguous value format (@option{--one-field} option) |
---|
| 844 | |
---|
| 845 | |
---|
| 846 | @example |
---|
| 847 | <lemma>,<descr>[,<descr>][;<lemma>,<descr>[,<descr>]] |
---|
| 848 | @end example |
---|
| 849 | |
---|
| 850 | (alternative descriptions for the same lemma are separated by commas, |
---|
| 851 | alternative lemmata are separated by semicolons.) |
---|
| 852 | |
---|
| 853 | @end itemize |
---|
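The ambiguous value format can be unpacked with a few lines of code. The following Python sketch assumes that lemmata and descriptions contain neither commas nor semicolons. The function name @code{parse_lem_value} is hypothetical and not part of UTT.

```python
def parse_lem_value(value: str) -> list[tuple[str, list[str]]]:
    """Split a lem field value into (lemma, [descriptions]) pairs.

    Alternative lemmata are separated by semicolons; within one
    alternative, the first comma-separated item is the lemma and the
    remaining items are its alternative descriptions.
    """
    readings = []
    for alternative in value.split(";"):
        lemma, *descriptions = alternative.split(",")
        readings.append((lemma, descriptions))
    return readings


# An unambiguous value yields one lemma with one description.
single = parse_lem_value("program,N/GiNpCa")

# One lemma with three alternative descriptions.
ambiguous = parse_lem_value("program,N/GiNpCa,N/GiNpCn,N/GiNpCv")
```

The unambiguous format is the special case with one lemma and one description, so the same function covers both.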
| 854 | |
---|
| 855 | @node lem example |
---|
| 856 | @subsection Example |
---|
| 857 | |
---|
| 858 | Input: |
---|
| 859 | |
---|
| 860 | @example |
---|
| 861 | 0000 07 W Piszemy |
---|
| 862 | 0007 01 S _ |
---|
| 863 | 0008 05 W dobre |
---|
| 864 | 0013 01 S _ |
---|
| 865 | 0014 08 W programy |
---|
| 866 | 0022 01 P . |
---|
| 867 | 0023 01 S \n |
---|
| 868 | @end example |
---|
| 869 | |
---|
| 870 | Output (default): |
---|
| 871 | |
---|
| 872 | @example |
---|
[19760ef] | 873 | 0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1 |
---|
[25ae32e] | 874 | 0007 01 S _ |
---|
| 875 | 0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn |
---|
| 876 | 0008 05 W dobre lem:dobry,ADJ/DpNsCnavGn |
---|
| 877 | 0013 01 S _ |
---|
| 878 | 0014 08 W programy lem:program,N/GiNpCa |
---|
| 879 | 0014 08 W programy lem:program,N/GiNpCn |
---|
| 880 | 0014 08 W programy lem:program,N/GiNpCv |
---|
| 881 | 0022 01 P . |
---|
| 882 | 0023 01 S \n |
---|
| 883 | @end example |
---|
| 884 | |
---|
| 885 | Output (@option{--one-line} option): |
---|
| 886 | |
---|
| 887 | @example |
---|
[19760ef] | 888 | 0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1 |
---|
[25ae32e] | 889 | 0007 01 S _ |
---|
| 890 | 0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn lem:dobry,ADJ/DpNsCnavGn |
---|
| 891 | 0013 01 S _ |
---|
| 892 | 0014 08 W programy lem:program,N/GiNpCa lem:program,N/GiNpCn lem:program,N/GiNpCv |
---|
| 893 | 0022 01 P . |
---|
| 894 | 0023 01 S \n |
---|
| 895 | @end example |
---|
| 896 | |
---|
| 897 | Output (@option{--one-field} option): |
---|
| 898 | |
---|
| 899 | @example |
---|
[19760ef] | 900 | 0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1 |
---|
[25ae32e] | 901 | 0007 01 S _ |
---|
| 902 | 0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn,ADJ/DpNsCnavGn |
---|
| 903 | 0013 01 S _ |
---|
| 904 | 0014 08 W programy lem:program,N/GiNpCa,N/GiNpCn,N/GiNpCv |
---|
| 905 | 0022 01 P . |
---|
| 906 | 0023 01 S \n |
---|
| 907 | @end example |
---|
| 908 | |
---|
| 909 | @c ---------------------------------------- |
---|
| 910 | |
---|
| 911 | @node lem dictionaries |
---|
| 912 | @subsection Dictionaries |
---|
| 913 | |
---|
| 914 | @command{lem} requires a dictionary. The dictionary may be provided in |
---|
| 915 | one of two formats: in text (source) format or in binary (fsa) format. |
---|
| 916 | |
---|
| 917 | @subsubheading Text format |
---|
| 918 | |
---|
| 919 | Dictionary entries have the following structure: |
---|
| 920 | |
---|
| 921 | @example |
---|
| 922 | <form>;<lemma>,<descr>[;<lemma>,<descr>] |
---|
| 923 | @end example |
---|
| 924 | |
---|
| 925 | @var{lemma} may be given explicitly or in the cut-add format: |
---|
| 926 | |
---|
| 927 | @example |
---|
| 928 | @code{[<cut1><add1>-]<cut2><add2>} |
---|
| 929 | @end example |
---|
| 930 | |
---|
| 931 | meaning: replace prefix of length @code{<cut1>} with |
---|
| 932 | string @code{<add1>}, replace suffix of length @code{<cut2>} with string |
---|
| 933 | @code{<add2>}. For example, @code{3t} transforms @samp{kocie} into |
---|
[19760ef] | 934 | @samp{kot}, and @code{3-4ały} transforms @samp{najbielsi} into @samp{biały}. |
---|
[25ae32e] | 935 | |
---|
| 936 | Each dictionary entry must be written on one line and must not contain blank characters. |
---|
| 937 | |
---|
| 938 | Examples: |
---|
| 939 | @example |
---|
| 940 | kot;0,N/GaNsCn |
---|
| 941 | kota;1,N/GaNsCg;1,N/GaNsCa |
---|
| 942 | kotu;1,N/GaNsCd |
---|
| 943 | kotem;2,N/GaNsCi |
---|
| 944 | kocie;3t,N/GaNsCl;3t,N/GaNsCv |
---|
[19760ef] | 945 | najbielsi;3-4ały,ADJ/DsNpCnGp |
---|
| 946 | najbielsze;3-5ały,ADJ/DsNpCnGaifn |
---|
[25ae32e] | 947 | najlepsi;dobry,ADJ/DsNpCnGp |
---|
| 948 | najlepsze;dobry,ADJ/DsNpCnGaifn |
---|
| 949 | @end example |
---|
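The cut-add notation can be illustrated with a short Python sketch. It assumes that the cut lengths are decimal numbers, that the added strings contain no digits, and that any lemma field that does not start with a digit is an explicit lemma. The function name @code{apply_lemma_rule} is hypothetical. This is not the code used by @command{lem}.

```python
import re

# [<cut1><add1>-]<cut2><add2>: an optional prefix part, then the suffix part.
_CUT_ADD = re.compile(r"^(?:(\d+)([^\d-]*)-)?(\d+)(\D*)$")


def apply_lemma_rule(form: str, rule: str) -> str:
    """Return the lemma obtained by applying a dictionary lemma field to form."""
    match = _CUT_ADD.match(rule)
    if match is None:
        # Not in cut-add notation: the lemma is given explicitly.
        return rule
    cut1, add1, cut2, add2 = match.groups()
    if cut1 is not None:
        # Replace the prefix of length cut1 with add1.
        form = add1 + form[int(cut1):]
    # Replace the suffix of length cut2 with add2 (cut2 may be 0).
    return form[:len(form) - int(cut2)] + add2


lemma_kocie = apply_lemma_rule("kocie", "3t")              # "kot"
lemma_najbielsi = apply_lemma_rule("najbielsi", "3-4ały")  # "biały"
lemma_najlepsi = apply_lemma_rule("najlepsi", "dobry")     # explicit lemma
```

The suffix is cut with @code{len(form) - cut2} and not with a negative slice, so that a cut length of 0 (as in @code{kot;0,N/GaNsCn}) leaves the form unchanged.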
| 950 | |
---|
| 951 | |
---|
| 952 | The mandatory file name extension for a text dictionary is @code{dic}. For large |
---|
| 953 | dictionaries it is preferable, however, to compile them into binary |
---|
| 954 | (fsa) format. |
---|
| 955 | |
---|
| 956 | @subsubheading Binary format |
---|
| 957 | |
---|
| 958 | The mandatory file name extension for a binary dictionary is @code{bin}. To |
---|
| 959 | compile a text dictionary into binary format, write: |
---|
| 960 | |
---|
| 961 | @example |
---|
| 962 | compiledic <dictionaryname>.dic |
---|
| 963 | @end example |
---|
| 964 | |
---|
| 965 | @subsubheading Polex/PMDBF dictionary |
---|
| 966 | |
---|
| 967 | A large-coverage morphological dictionary for the Polish language, Polex/PMDBF, is included in |
---|
| 968 | the distribution as the default dictionary of @command{lem}. It is |
---|
| 969 | located by default in: |
---|
| 970 | |
---|
[261bf62] | 971 | @file{$HOME/.local/share/utt/pl_PL.ISO-8859-2/lem.bin} |
---|
| 972 | |
---|
| 973 | in a local installation, or in |
---|
| 974 | |
---|
| 975 | @file{/usr/local/share/utt/pl_PL.ISO-8859-2/lem.bin} |
---|
| 976 | |
---|
| 977 | in a system installation. |
---|
[25ae32e] | 978 | |
---|
| 979 | @node lem hints |
---|
| 980 | @subsection Hints |
---|
| 981 | |
---|
[261bf62] | 982 | @subsubheading Combining data from multiple dictionaries |
---|
[25ae32e] | 983 | |
---|
[261bf62] | 984 | @itemize |
---|
[25ae32e] | 985 | |
---|
[261bf62] | 986 | @item Apply <dict1>, then apply <dict2> to words which were not annotated. |
---|
[25ae32e] | 987 | |
---|
[261bf62] | 988 | @example |
---|
| 989 | lem -d <dict1> | lem -S lem -d <dict2> |
---|
| 990 | @end example |
---|
[25ae32e] | 991 | |
---|
[261bf62] | 992 | @item Add annotations from two dictionaries <dict1> and <dict2>. |
---|
[25ae32e] | 993 | |
---|
[261bf62] | 994 | @example |
---|
| 995 | lem -c -d <dict1> | lem -S lem -d <dict2> |
---|
| 996 | @end example |
---|
[25ae32e] | 997 | |
---|
[261bf62] | 998 | @end itemize |
---|
[25ae32e] | 999 | |
---|
| 1000 | |
---|
| 1001 | @c --------------------------------------------------------------------- |
---|
| 1002 | @c GUE |
---|
| 1003 | @c --------------------------------------------------------------------- |
---|
| 1004 | |
---|
| 1005 | @page |
---|
| 1006 | @node gue |
---|
| 1007 | @section gue - morphological guesser |
---|
| 1008 | |
---|
| 1009 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} |
---|
| 1010 | |
---|
[19760ef] | 1011 | @item @strong{Authors:} @tab Michał Stolarski, Tomasz Obrębski |
---|
[25ae32e] | 1012 | @item @strong{Component category:} @tab filter |
---|
| 1013 | |
---|
| 1014 | @end multitable |
---|
| 1015 | |
---|
| 1016 | @menu |
---|
[261bf62] | 1017 | * gue description:: |
---|
[25ae32e] | 1018 | * gue command line options:: |
---|
| 1019 | * gue example:: |
---|
| 1020 | * gue dictionaries:: |
---|
| 1021 | @end menu |
---|
| 1022 | |
---|
[261bf62] | 1023 | |
---|
| 1024 | @node gue description |
---|
| 1025 | @subsection Description |
---|
| 1026 | |
---|
| 1027 | @command{gue} guesses morphological descriptions of the form contained |
---|
| 1028 | in the @var{form} field. |
---|
| 1029 | |
---|
| 1030 | |
---|
[25ae32e] | 1031 | @node gue command line options |
---|
| 1032 | @subsection Command line options |
---|
| 1033 | |
---|
| 1034 | @table @code |
---|
| 1035 | |
---|
| 1036 | @parhelp |
---|
| 1037 | @parversion |
---|
| 1038 | @parinteractive |
---|
| 1039 | @c @parfile |
---|
| 1040 | @c @paroutput |
---|
| 1041 | @c @parfail |
---|
| 1042 | @c @parcopy |
---|
| 1043 | @parinputfield |
---|
| 1044 | @paroutputfield |
---|
| 1045 | @pardictionary |
---|
| 1046 | @parprocess |
---|
| 1047 | @parselect |
---|
| 1048 | @parunselect |
---|
| 1049 | @paroneline |
---|
| 1050 | @paronefield |
---|
| 1051 | |
---|
| 1052 | @item @b{@minus{}@minus{}delta=@var{n}} |
---|
| 1053 | Stop displaying answers after a fall of weight, that is, when the weight difference between two subsequent results is greater than the delta value (default=`0.2'). |
---|
| 1054 | |
---|
| 1055 | |
---|
| 1056 | @item @b{@minus{}@minus{}cut-off=@var{n}} |
---|
| 1057 | Do not display answers whose weight is lower than the cut-off value (default=`200'). |
---|
| 1058 | |
---|
| 1059 | |
---|
| 1060 | @item @b{@minus{}@minus{}guess_count=@var{n}, @minus{}n @var{n}} |
---|
| 1061 | Guess up to @var{n} descriptions (default=`0', which means `display all results'). |
---|
| 1062 | |
---|
| 1063 | |
---|
| 1064 | |
---|
| 1065 | @end table |
---|
| 1066 | |
---|
| 1067 | @node gue example |
---|
| 1068 | @subsection Example |
---|
| 1069 | |
---|
| 1070 | @example |
---|
| 1071 | command: gue -n 2 |
---|
| 1072 | |
---|
| 1073 | input: |
---|
| 1074 | 0000 07 W smerfny |
---|
| 1075 | |
---|
| 1076 | output: |
---|
| 1077 | 0000 07 W smerfny gue:,ADJ/CaDpGiNs |
---|
| 1078 | 0000 07 W smerfny gue:,ADJ/CnvDpGaipNs |
---|
| 1079 | @end example |
---|
| 1080 | |
---|
| 1081 | |
---|
| 1082 | @node gue dictionaries |
---|
| 1083 | @subsection Dictionaries |
---|
| 1084 | |
---|
| 1085 | @command{gue} requires a dictionary. For now, the dictionary must be provided in binary (fsa) format. |
---|
| 1086 | The fsa format is created by compiling text-format dictionaries. |
---|
| 1087 | |
---|
| 1088 | |
---|
| 1089 | |
---|
| 1090 | @subsubheading Text format |
---|
| 1091 | |
---|
| 1092 | Dictionary entries have the following structure: |
---|
| 1093 | |
---|
| 1094 | @example |
---|
| 1095 | @var{prefix}@code{*}@var{suffix}@code{;}@var{lemma}@code{,}@var{description}@code{:}@var{weight} |
---|
| 1096 | @end example |
---|
| 1097 | |
---|
| 1098 | @var{lemma} must be given in the cut-add format: |
---|
| 1099 | |
---|
| 1100 | @example |
---|
| 1101 | @code{[<cut1><add1>-]<cut2><add2>} |
---|
| 1102 | @end example |
---|
| 1103 | (no spaces in between): replace prefix of length @var{cut1} with |
---|
| 1104 | string @var{add1}, replace suffix of length @var{cut2} with string |
---|
| 1105 | @var{add2}. |
---|
| 1106 | |
---|
| 1107 | |
---|
[19760ef] | 1108 | Example: @code{3-4ały} transforms @i{najbielsi} into @i{biały}. |
---|
[25ae32e] | 1109 | |
---|
| 1110 | |
---|
| 1111 | @var{description} contains the part of speech and morphosyntactic information (@pxref{PMDBF dictionary}). |
---|
| 1112 | |
---|
| 1113 | @var{weight} is an integer value between 1 and 999 indicating the |
---|
| 1114 | likelihood of the guess. |
---|
| 1115 | |
---|
| 1116 | @example |
---|
[19760ef] | 1117 | *łkę;1a,N/GfNsCa |
---|
| 1118 | naj*elszy;3-5ały,ADJ/...:... |
---|
[25ae32e] | 1119 | @end example |
---|
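To illustrate how entries of this shape could be matched against a word form, here is a minimal Python sketch. It assumes that a form matches an entry when it starts with @var{prefix} and ends with @var{suffix}, and that a higher weight means a more likely guess. The names @code{GuessRule}, @code{parse_rule} and @code{matching_rules} are hypothetical, and so are the weights and the second entry in the sample data. The actual @command{gue} works on the compiled fsa dictionary, not on text entries.

```python
from typing import NamedTuple


class GuessRule(NamedTuple):
    prefix: str
    suffix: str
    lemma_rule: str   # lemma in cut-add notation
    description: str
    weight: int


def parse_rule(entry: str) -> GuessRule:
    """Parse one text entry: prefix*suffix;lemma,description:weight."""
    pattern, rest = entry.split(";", 1)
    prefix, suffix = pattern.split("*", 1)
    annotation, weight = rest.rsplit(":", 1)
    lemma_rule, description = annotation.split(",", 1)
    return GuessRule(prefix, suffix, lemma_rule, description, int(weight))


def matching_rules(form: str, rules) -> list:
    """Return the rules whose prefix and suffix fit form, best weight first."""
    hits = [rule for rule in rules
            if len(form) >= len(rule.prefix) + len(rule.suffix)
            and form.startswith(rule.prefix)
            and form.endswith(rule.suffix)]
    return sorted(hits, key=lambda rule: rule.weight, reverse=True)


# Hypothetical sample data: the weights 500 and 250 are invented.
RULES = [parse_rule(entry) for entry in (
    "*ę;1a,N/GfNsCa:250",
    "*łkę;1a,N/GfNsCa:500",
)]

guesses = matching_rules("półkę", RULES)
```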
| 1120 | |
---|
| 1121 | |
---|
| 1122 | @c --------------------------------------------------------------------- |
---|
| 1123 | @c COR |
---|
| 1124 | @c --------------------------------------------------------------------- |
---|
| 1125 | |
---|
| 1126 | @page |
---|
| 1127 | @node cor |
---|
| 1128 | @section cor - spelling corrector |
---|
| 1129 | |
---|
| 1130 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} |
---|
[19760ef] | 1131 | @item @strong{Authors:} @tab Tomasz Obrębski, Michał Stolarski |
---|
[25ae32e] | 1132 | @item @strong{Component category:} @tab filter |
---|
[261bf62] | 1133 | @item @strong{Input format:} @tab UTT regular |
---|
| 1134 | @item @strong{Output format:} @tab UTT regular |
---|
| 1135 | @item @strong{Required annotation:} @tab tok |
---|
[25ae32e] | 1136 | @end multitable |
---|
| 1137 | |
---|
[261bf62] | 1138 | @menu |
---|
| 1139 | * cor description:: |
---|
| 1140 | * cor command line options:: |
---|
| 1141 | * cor dictionaries:: |
---|
| 1142 | @end menu |
---|
| 1143 | |
---|
| 1144 | |
---|
| 1145 | @node cor description |
---|
| 1146 | @subsection Description |
---|
| 1147 | |
---|
[25ae32e] | 1148 | The spelling corrector applies Kemal Oflazer's dynamic programming |
---|
| 1149 | algorithm @cite{oflazer96} to the FSA representation of the set of |
---|
| 1150 | word forms of the Polex/PMDBF dictionary. Given an incorrect |
---|
| 1151 | word form, it returns all word forms present in the dictionary whose |
---|
| 1152 | edit distance from it does not exceed the threshold given as a parameter. |
---|
| 1153 | |
---|
| 1154 | |
---|
| 1155 | @node cor command line options |
---|
| 1156 | @subsection Command line options |
---|
| 1157 | |
---|
| 1158 | @table @code |
---|
| 1159 | |
---|
| 1160 | @parhelp |
---|
| 1161 | @parversion |
---|
| 1162 | @parinteractive |
---|
| 1163 | @c @parfile |
---|
| 1164 | @c @paroutput |
---|
| 1165 | @c @parfail |
---|
| 1166 | @c @parcopy |
---|
| 1167 | @parinputfield |
---|
| 1168 | @paroutputfield |
---|
| 1169 | @pardictionary |
---|
| 1170 | @parprocess |
---|
| 1171 | @parselect |
---|
| 1172 | @parunselect |
---|
| 1173 | @paroneline |
---|
| 1174 | @paronefield |
---|
| 1175 | |
---|
| 1176 | @item @b{@minus{}@minus{}distance=@var{int}, @minus{}n @var{int}} |
---|
| 1177 | Maximum edit distance (default='1'). |
---|
| 1178 | |
---|
[261bf62] | 1179 | @c @item @b{@minus{}@minus{}replace, @minus{}r} |
---|
| 1180 | @c Replace original form with corrected form, place original form in the |
---|
| 1181 | @c cor field. This option has no effect in @option{--one-*} modes (default=off) |
---|
| 1182 | |
---|
[25ae32e] | 1183 | |
---|
| 1184 | @end table |
---|
| 1185 | |
---|
| 1186 | @node cor dictionaries |
---|
| 1187 | @subsection Dictionaries |
---|
| 1188 | |
---|
| 1189 | @command{cor} requires a dictionary. The dictionary has to be provided in binary (fsa) format. |
---|
| 1190 | The fsa format is created by compiling text-format dictionaries. |
---|
| 1191 | |
---|
| 1192 | @subsubheading Text format |
---|
| 1193 | |
---|
| 1194 | The @command{cor} dictionary is a list of words: |
---|
| 1195 | @example |
---|
| 1196 | odlot |
---|
| 1197 | odlotowy |
---|
| 1198 | odludek |
---|
| 1199 | @end example |
---|
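The way such a word list is used together with the @option{--distance} threshold can be sketched in Python. This sketch uses a plain Levenshtein distance computed against every word in turn, as a simple stand-in. It is not Oflazer's algorithm and not the FSA traversal that @command{cor} performs. The function names @code{edit_distance} and @code{suggest} are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or keep
            ))
        previous = current
    return previous[-1]


def suggest(word: str, dictionary, max_distance: int = 1) -> list:
    """Return dictionary words within max_distance of word."""
    return [candidate for candidate in dictionary
            if edit_distance(word, candidate) <= max_distance]


WORDS = ["odlot", "odlotowy", "odludek"]
corrections = suggest("odlut", WORDS)
```

With the default threshold of 1, the misspelled form @samp{odlut} is matched only with @samp{odlot}. A larger threshold admits more distant words.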
| 1200 | |
---|
[261bf62] | 1201 | @subsubheading Binary format |
---|
| 1202 | |
---|
| 1203 | The mandatory file name extension for a binary dictionary is @code{bin}. To |
---|
| 1204 | compile a text dictionary into binary format, write: |
---|
| 1205 | |
---|
| 1206 | @example |
---|
| 1207 | compiledic <dictionaryname>.dic |
---|
| 1208 | @end example |
---|
| 1209 | |
---|
| 1210 | @c --------------------------------------------------------------------- |
---|
| 1211 | @c KOR |
---|
| 1212 | @c --------------------------------------------------------------------- |
---|
| 1213 | |
---|
| 1214 | @page |
---|
| 1215 | @node kor |
---|
| 1216 | @section kor - configurable spelling corrector |
---|
| 1217 | |
---|
| 1218 | [TODO] |
---|
| 1219 | |
---|
| 1220 | @c --------------------------------------------------------------------- |
---|
| 1221 | @c SEN |
---|
| 1222 | @c --------------------------------------------------------------------- |
---|
| 1223 | |
---|
[25ae32e] | 1224 | @page |
---|
| 1225 | @node sen |
---|
| 1226 | @section sen - sentencizer |
---|
| 1227 | |
---|
| 1228 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} |
---|
| 1229 | |
---|
[19760ef] | 1230 | @item @strong{Authors:} @tab Tomasz Obrębski |
---|
[25ae32e] | 1231 | @item @strong{Component category:} @tab filter |
---|
[261bf62] | 1232 | @item @strong{Input format:} @tab UTT regular |
---|
| 1233 | @item @strong{Output format:} @tab UTT regular |
---|
| 1234 | @item @strong{Required annotation:} @tab tok |
---|
[25ae32e] | 1235 | |
---|
| 1236 | @end multitable |
---|
| 1237 | |
---|
| 1238 | |
---|
| 1239 | @menu |
---|
[261bf62] | 1240 | * sen description:: |
---|
[25ae32e] | 1241 | @c * sen input:: |
---|
| 1242 | @c * sen output:: |
---|
| 1243 | * sen example:: |
---|
| 1244 | @end menu |
---|
| 1245 | |
---|
[261bf62] | 1246 | @node sen description |
---|
| 1247 | @subsection Description |
---|
| 1248 | |
---|
| 1249 | @command{sen} detects sentence boundaries in UTT-formatted texts and marks them with special zero-length segments, in which the @var{type} field may contain the BOS (beginning of sentence) or EOS (end of sentence) annotation. |
---|
| 1250 | |
---|
[25ae32e] | 1251 | @node sen example |
---|
| 1252 | @subsection Example |
---|
| 1253 | |
---|
| 1254 | @example |
---|
| 1255 | command: sen |
---|
| 1256 | |
---|
| 1257 | input: |
---|
[19760ef] | 1258 | 0000 05 W Cześć |
---|
[25ae32e] | 1259 | 0005 01 P ! |
---|
| 1260 | 0006 01 S _ |
---|
| 1261 | 0007 02 W To |
---|
| 1262 | 0009 01 S _ |
---|
| 1263 | 0010 02 W ja |
---|
| 1264 | 0012 01 P . |
---|
| 1265 | 0013 01 S \n |
---|
| 1266 | |
---|
| 1267 | output: |
---|
| 1268 | 0000 00 BOS * |
---|
[19760ef] | 1269 | 0000 05 W Cześć |
---|
[25ae32e] | 1270 | 0005 01 P ! |
---|
| 1271 | 0006 00 EOS * |
---|
| 1272 | 0006 00 BOS * |
---|
| 1273 | 0006 01 S _ |
---|
| 1274 | 0007 02 W To |
---|
| 1275 | 0009 01 S _ |
---|
| 1276 | 0010 02 W ja |
---|
| 1277 | 0012 01 P . |
---|
| 1278 | 0013 01 S \n |
---|
| 1279 | 0014 00 EOS * |
---|
| 1280 | @end example |
---|
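A downstream program can use the BOS and EOS markers to regroup segments into sentences. The following Python sketch assumes the field layout seen in the example (start, length, type, form). It does not show how @command{sen} finds the boundaries. The function name @code{split_sentences} is hypothetical.

```python
def split_sentences(lines) -> list:
    """Group segment forms into sentences delimited by BOS/EOS markers."""
    sentences = []
    current = None  # None while outside a sentence
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or incomplete lines
        seg_type, form = fields[2], fields[3]
        if seg_type == "BOS":
            current = []
        elif seg_type == "EOS":
            if current is not None:
                sentences.append(current)
            current = None
        elif current is not None:
            current.append(form)
    return sentences


# The sen output from the example above ("\\n" is a literal backslash + n).
SEN_OUTPUT = """\
0000 00 BOS *
0000 05 W Cześć
0005 01 P !
0006 00 EOS *
0006 00 BOS *
0006 01 S _
0007 02 W To
0009 01 S _
0010 02 W ja
0012 01 P .
0013 01 S \\n
0014 00 EOS *
"""

sentences = split_sentences(SEN_OUTPUT.splitlines())
```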
| 1281 | |
---|
| 1282 | |
---|
| 1283 | @c --------------------------------------------------------------------- |
---|
| 1284 | @c GPH |
---|
| 1285 | @c --------------------------------------------------------------------- |
---|
| 1286 | |
---|
| 1287 | @c @node gph - graphizer |
---|
| 1288 | @c @chapter gph - graphizer |
---|
| 1289 | |
---|
[19760ef] | 1290 | @c Authors: Tomasz Obrębski |
---|
[25ae32e] | 1291 | |
---|
| 1292 | |
---|
| 1293 | |
---|
| 1294 | @c --------------------------------------------------------------------- |
---|
[261bf62] | 1295 | @c SER |
---|
[25ae32e] | 1296 | @c --------------------------------------------------------------------- |
---|
| 1297 | |
---|
| 1298 | @page |
---|
| 1299 | @node ser |
---|
| 1300 | @section ser - pattern search tool |
---|
| 1301 | |
---|
| 1302 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} |
---|
[19760ef] | 1303 | @item @strong{Authors:} @tab Tomasz Obrębski |
---|
[25ae32e] | 1304 | @item @strong{Component category:} @tab filter |
---|
[261bf62] | 1305 | @item @strong{Input format:} @tab UTT regular |
---|
| 1306 | @item @strong{Output format:} @tab UTT regular |
---|
| 1307 | @item @strong{Required annotation:} @tab tok, lem --one-field |
---|
[25ae32e] | 1308 | @end multitable |
---|
| 1309 | |
---|
| 1310 | @menu |
---|
[261bf62] | 1311 | * ser description:: |
---|
[25ae32e] | 1312 | * ser command line options:: |
---|
| 1313 | * ser pattern:: |
---|
| 1314 | * ser how ser works:: |
---|
| 1315 | * ser customization:: |
---|
| 1316 | * ser limitations:: |
---|
| 1317 | * ser requirements:: |
---|
| 1318 | @end menu |
---|
| 1319 | |
---|
| 1320 | |
---|
[261bf62] | 1321 | @node ser description |
---|
| 1322 | @subsection Description |
---|
| 1323 | |
---|
| 1324 | @command{ser} looks for patterns in UTT-formatted texts. |
---|
| 1325 | |
---|
| 1326 | |
---|
[25ae32e] | 1327 | @c --------------------------------------------------------------------- |
---|
| 1328 | @node ser command line options |
---|
| 1329 | @subsection Command line options |
---|
| 1330 | |
---|
| 1331 | @table @code |
---|
| 1332 | |
---|
| 1333 | @parhelp |
---|
| 1334 | @parversion |
---|
| 1335 | @c @parfile |
---|
| 1336 | @c @paroutput |
---|
| 1337 | @c @parinputfield |
---|
| 1338 | @c @paroutputfield |
---|
| 1339 | @parprocess |
---|
| 1340 | @parinteractive |
---|
| 1341 | |
---|
| 1342 | @item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}} |
---|
| 1343 | The search pattern. |
---|
| 1344 | |
---|
| 1345 | @item @b{@minus{}@minus{}morph=@var{field}} |
---|
| 1346 | The name of the annotation field containing the morphological |
---|
| 1347 | description (default @code{lem}). |
---|
| 1348 | |
---|
| 1349 | @item @b{@minus{}@minus{}flex} |
---|
| 1350 | Only print the generated flex source code. |
---|
| 1351 | |
---|
| 1352 | @item @b{@minus{}@minus{}macro=@var{filename}} |
---|
| 1353 | Read macro definitions from file @var{filename} rather than from |
---|
| 1354 | the default location. This option makes it possible to redefine the set of terms. |
---|
| 1355 | |
---|
| 1356 | @item @b{@minus{}@minus{}define=@var{filename}} |
---|
| 1357 | Append macro definitions from file @var{filename}. This option |
---|
| 1358 | makes it possible to extend the set of terms. |
---|
| 1359 | |
---|
| 1360 | @end table |
---|
| 1361 | |
---|
| 1362 | |
---|
| 1363 | @c --------------------------------------------------------------------- |
---|
| 1364 | @node ser pattern |
---|
| 1365 | @subsection Pattern |
---|
| 1366 | |
---|
| 1367 | The @command{ser} pattern is a regular expression over terms corresponding |
---|
| 1368 | to text segments or segment sequences. Predefined terms are: |
---|
| 1369 | |
---|
| 1370 | @table @code |
---|
| 1371 | |
---|
| 1372 | @item seg(@var{t},@var{f},@var{a}) |
---|
| 1373 | a segment of type @var{t}, containing form @var{f} and annotation |
---|
| 1374 | @var{a} |
---|
| 1375 | |
---|
| 1376 | @item form(@var{f}) |
---|
| 1377 | a segment containing form @var{f} |
---|
| 1378 | |
---|
| 1379 | @item field(@var{f}) |
---|
| 1380 | a segment containing annotation field @var{f} |
---|
| 1381 | |
---|
| 1382 | @item space(@var{f}) |
---|
| 1383 | a space segment of form @var{f} |
---|
| 1384 | |
---|
| 1385 | @item word(@var{f}) |
---|
| 1386 | a word segment of form @var{f} |
---|
| 1387 | |
---|
| 1388 | @item punct(@var{f}) |
---|
| 1389 | a punct segment of form @var{f} |
---|
| 1390 | |
---|
| 1391 | @item number(@var{f}) |
---|
| 1392 | a number segment of form @var{f} |
---|
| 1393 | |
---|
| 1394 | @item lexeme(@var{f}) |
---|
| 1395 | a word segment with lemma @var{f} |
---|
| 1396 | |
---|
| 1397 | @item cat(@var{c}) |
---|
| 1398 | a word segment of category @var{c} |
---|
| 1399 | |
---|
| 1400 | @end table |
---|
| 1401 | |
---|
| 1402 | All arguments are optional. If an argument is omitted, an arbitrary |
---|
| 1403 | string of non-blank characters is assumed as the argument value. Term |
---|
| 1404 | arguments may be arbitrary character-level regular expressions. The |
---|
| 1405 | following special symbols can be used: |
---|
| 1406 | |
---|
| 1407 | @multitable {aaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} |
---|
| 1408 | @item @code{[@dots{}]} @tab a character class |
---|
| 1409 | @item @code{[^@dots{}]} @tab a negated character class |
---|
| 1410 | @item @code{|} @tab alternative |
---|
| 1411 | @item @code{*} @tab repetition, including zero times |
---|
| 1412 | @item @code{+} @tab repetition, at least one time |
---|
| 1413 | @item @code{?} @tab optionality |
---|
| 1414 | @item @code{@{@var{m},@var{n}@}} @tab repetition from @var{m} to @var{n} times |
---|
| 1415 | @item @code{@{@var{m},@}} @tab repetition @var{m} or more times |
---|
| 1416 | @item @code{@{@var{m}@}} @tab repetition @var{m} times |
---|
| 1417 | @item @code{\@var{ddd}} @tab the character with octal value @var{ddd} |
---|
| 1418 | @item @code{\x@var{hh}} @tab the character with hexadecimal value @var{hh} |
---|
| 1419 | @item @code{( )} @tab parentheses, used to override precedence |
---|
| 1420 | @c @end multitable |
---|
| 1421 | |
---|
| 1422 | @c @multitable {aaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} |
---|
| 1423 | @item @code{.} @tab a non-blank character |
---|
| 1424 | @item @code{\w} @tab a letter |
---|
| 1425 | @item @code{\W} @tab a non-blank character other than a letter |
---|
| 1426 | @item @code{\d} @tab a digit |
---|
| 1427 | @item @code{\D} @tab a non-blank character other than a digit |
---|
| 1428 | @item @code{\s} @tab a space or tab character |
---|
| 1429 | @item @code{\S} @tab a non-blank character (the same as @code{.}) |
---|
| 1430 | @item @code{\l} @tab a lowercase letter |
---|
| 1431 | @item @code{\L} @tab an uppercase letter |
---|
| 1432 | @end multitable |
---|
| 1433 | |
---|
| 1434 | |
---|
| 1435 | @noindent The following characters: |
---|
| 1436 | @example |
---|
| 1437 | @verb{% [ ] ^ | * + ? { } , . < > \ %} |
---|
| 1438 | @end example |
---|
| 1439 | must be escaped with a backslash, i.e. written as: |
---|
| 1440 | @example |
---|
| 1441 | @verb{% \[ \] \^ \| \* \+ \? \{ \} \, \. \< \> \\ %} |
---|
| 1442 | @end example |
---|
| 1443 | |
---|
| 1444 | @quotation Note |
---|
| 1445 | The special symbols are borrowed from Perl with minor |
---|
| 1446 | modifications made for convenience. |
---|
| 1447 | The meaning of certain special characters and sequences therefore differs slightly |
---|
| 1448 | from their common use. |
---|
| 1449 | In particular, the @code{.} special character matches only non-blank characters because of |
---|
| 1450 | the special function of spaces in UTT files (they are field |
---|
| 1451 | separators). Use @code{\s} to explicitly match a space or tab character. |
---|
| 1452 | @end quotation |
---|
| 1453 | |
---|
| 1454 | In the argument of the @code{cat} term a special operator <...> may be |
---|
| 1455 | used. A category specification enclosed in angle brackets matches all |
---|
| 1456 | category descriptions which are consistent (non-contradictory) with the |
---|
| 1457 | specification. For example @code{<N>} matches all noun descriptions, |
---|
| 1458 | @code{<ADJ/Can>} matches all adjectives in the accusative or nominative case. |
---|
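The notion of consistency can be made concrete with a small Python sketch. It assumes, based only on the descriptions shown in this manual, that a description has the form @code{POS/attributes}, where each attribute is one uppercase letter followed by one or more lowercase letters or digits listing its possible values. Under that assumption, a specification is consistent with a description when the parts of speech are equal and every attribute named in the specification is either absent from the description or shares at least one value with it. The function names are hypothetical, and @command{ser} itself compiles patterns to flex code and does not evaluate them this way.

```python
import re


def parse_description(descr: str):
    """Split 'ADJ/DpNpCnav' into ('ADJ', {'D': {'p'}, 'N': {'p'}, 'C': {'n','a','v'}})."""
    pos, _, attrs = descr.partition("/")
    values = {name: set(vals)
              for name, vals in re.findall(r"([A-Z])([a-z0-9]+)", attrs)}
    return pos, values


def is_consistent(spec: str, descr: str) -> bool:
    """Check whether a <...> specification is non-contradictory with descr."""
    spec_pos, spec_attrs = parse_description(spec.strip("<>"))
    pos, attrs = parse_description(descr)
    if spec_pos != pos:
        return False
    return all(name not in attrs or bool(vals & attrs[name])
               for name, vals in spec_attrs.items())


noun_ok = is_consistent("<N>", "N/GiNpCa")
adj_ok = is_consistent("<ADJ/Can>", "ADJ/DpNpCnavGaifn")
adj_genitive = is_consistent("<ADJ/Can>", "ADJ/DsNpCgGp")
```

In this sketch, @code{<ADJ/Can>} accepts the description @code{ADJ/DpNpCnavGaifn} because the case values @code{n} and @code{a} occur in it, and rejects a description whose only case value is @code{g}.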
| 1459 | |
---|
| 1460 | |
---|
| 1461 | @* |
---|
| 1462 | @noindent @b{Examples of one-segment patterns:} |
---|
| 1463 | |
---|
| 1464 | @multitable {aaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} |
---|
| 1465 | @item @code{seg} @tab any segment |
---|
| 1466 | @item @code{word} @tab any word-form |
---|
| 1467 | @item @code{word(pomocy)} @tab the word-form @samp{pomocy} |
---|
| 1468 | @item @code{word(naj.+)} @tab a word-form beginning with @samp{naj} |
---|
| 1469 | @item @code{word(\L\l+)} @tab a capitalized word-form |
---|
| 1470 | @item @code{punct} @tab a punctuation character |
---|
| 1471 | @item @code{space(.*\\n.*)} @tab a space segment containing a newline character |
---|
| 1472 | @item @code{lexeme(pomoc)} @tab any form of the lexeme 'pomoc' |
---|
| 1473 | @item @code{cat(N/.*)} @tab a word whose category starts with @code{N/} |
---|
| 1474 | @item @code{cat(<N/Ca>)} @tab a word whose category matches @code{N/Ca} |
---|
| 1475 | @end multitable |
---|
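The character-level symbols used in term arguments can be tried out by translating them into an ordinary Python regular expression. The sketch below is only an approximation for experimenting with form arguments. The letter ranges (ASCII plus Polish letters) are an assumption, bracket expressions receive no special treatment, and the function names are hypothetical. @command{ser} itself generates flex code and does not use Python.

```python
import re

_LOWER = "a-ząćęłńóśźż"   # assumed set of lowercase letters
_UPPER = "A-ZĄĆĘŁŃÓŚŹŻ"   # assumed set of uppercase letters

# ser escape sequences whose meaning differs from Python's re module.
_ESCAPES = {
    "w": f"[{_LOWER}{_UPPER}]",        # a letter
    "W": f"[^\\s{_LOWER}{_UPPER}]",    # non-blank, not a letter
    "d": "[0-9]",                      # a digit
    "D": "[^\\s0-9]",                  # non-blank, not a digit
    "s": "[ \\t]",                     # a space or tab
    "S": "\\S",                        # a non-blank character
    "l": f"[{_LOWER}]",                # a lowercase letter
    "L": f"[{_UPPER}]",                # an uppercase letter
}


def to_python_regex(argument: str) -> str:
    """Translate a ser term argument into a Python regular expression."""
    out = []
    chars = iter(argument)
    for char in chars:
        if char == "\\":
            escaped = next(chars, "")
            # Known classes are expanded; other escapes are kept as written.
            out.append(_ESCAPES.get(escaped, "\\" + escaped))
        elif char == ".":
            out.append("\\S")  # "." never matches a blank in ser patterns
        else:
            out.append(char)
    return "".join(out)


def form_matches(argument: str, form: str) -> bool:
    """Check whether a whole form matches a ser term argument."""
    return re.fullmatch(to_python_regex(argument), form) is not None


capitalized = form_matches(r"\L\l+", "Kowalski")
superlative = form_matches("naj.+", "najlepsi")
```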
| 1476 | |
---|
| 1477 | @* |
---|
| 1478 | @noindent @b{Examples of multi-segment patterns:} |
---|
| 1479 | |
---|
| 1480 | @table @code |
---|
| 1481 | |
---|
| 1482 | @item (word(\L) punct(\.) space?)+ word(\L\l+) |
---|
| 1483 | a sequence of initials followed by a surname |
---|
| 1484 | |
---|
| 1485 | @item punct seg(W|S|N)* cat(<NPRO/Sr>) seg(W|S|N)* punct |
---|
| 1486 | a text fragment between two punctuation characters, containing an |
---|
| 1487 | occurrence of a relative pronoun |
---|
| 1488 | |
---|
| 1489 | @end table |
---|
| 1490 | |
---|
| 1491 | |
---|
| 1492 | @node ser how ser works |
---|
| 1493 | @subsection How ser works |
---|
| 1494 | |
---|
| 1495 | @node ser customization |
---|
| 1496 | @subsection Customization |
---|
| 1497 | |
---|
| 1498 | @c All predefined terms correspond to single segments, |
---|
| 1499 | |
---|
| 1500 | @example |
---|
[261bf62] | 1501 | define(`verbseq', `(cat(<V>) (space cat(<V>)))') |
---|
[25ae32e] | 1502 | @end example |
---|
| 1503 | |
---|
| 1504 | |
---|
| 1505 | the term @code{cat()} may not be used as a ... of |
---|
| 1506 | |
---|
| 1507 | @c See @command{m4} manual for further details on macro definition format. |
---|
| 1508 | |
---|
| 1509 | @node ser limitations |
---|
| 1510 | @subsection Limitations |
---|
| 1511 | |
---|
[261bf62] | 1512 | Do not use more than three attributes inside the @code{<...>} operator. |
---|
[25ae32e] | 1513 | |
---|
| 1514 | @node ser requirements |
---|
| 1515 | @subsection Requirements |
---|
| 1516 | |
---|
| 1517 | In order to run @command{ser}, the following programs must be |
---|
| 1518 | installed in the system: |
---|
| 1519 | |
---|
| 1520 | @itemize |
---|
| 1521 | |
---|
| 1522 | @item @command{m4} |
---|
| 1523 | @item @command{grep} |
---|
| 1524 | @item @command{flex} |
---|
| 1525 | @item @command{gcc} |
---|
| 1526 | |
---|
| 1527 | @end itemize |
---|
| 1528 | |
---|
| 1529 | |
---|
| 1530 | @c --------------------------------------------------------------------- |
---|
[261bf62] | 1531 | @c GRP |
---|
[25ae32e] | 1532 | @c --------------------------------------------------------------------- |
---|
| 1533 | |
---|
| 1534 | @page |
---|
| 1535 | @node grp |
---|
| 1536 | @section grp - pattern search tool |
---|
| 1537 | |
---|
| 1538 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} |
---|
[19760ef] | 1539 | @item @strong{Authors:} @tab Tomasz Obrêbski |
---|
[25ae32e] | 1540 | @item @strong{Component category:} @tab filter |
---|
[261bf62] | 1541 | @item @strong{Input format:} @tab UTT flattened |
---|
| 1542 | @item @strong{Output format:} @tab UTT flattened |
---|
| 1543 | @item @strong{Required annotation:} @tab tok, sen, lem --one-field |
---|
[25ae32e] | 1544 | @end multitable |
---|
| 1545 | |
---|
| 1546 | |
---|
[261bf62] | 1547 | @menu |
---|
| 1548 | * grp description:: |
---|
| 1549 | * grp command line options:: |
---|
| 1550 | * grp pattern:: |
---|
| 1551 | * grp hints:: |
---|
| 1552 | @end menu |
---|
| 1553 | |
---|
| 1554 | |
---|
| 1555 | @node grp description |
---|
| 1556 | @subsection Description |
---|
| 1557 | |
---|
[25ae32e] | 1558 | @code{grp} selects sentences containing an expression matching a |
---|
| 1559 | pattern. The pattern format is exactly the same as that accepted by |
---|
| 1560 | @code{ser}. |
---|
| 1561 | |
---|
| 1562 | @code{grp} is intended mainly for speeding up the corpus search process. |
---|
| 1563 | It is extremely fast (processing speed is usually higher than the speed |
---|
| 1564 | of reading the corpus file from disk). |
---|
| 1565 | |
---|
| 1566 | @node grp command line options |
---|
| 1567 | @subsection Command line options |
---|
| 1568 | |
---|
| 1569 | @table @code |
---|
| 1570 | |
---|
| 1571 | @parhelp |
---|
| 1572 | @parversion |
---|
| 1573 | @parprocess |
---|
| 1574 | @parinteractive |
---|
| 1575 | |
---|
| 1576 | @item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}} |
---|
| 1577 | The search pattern. |
---|
| 1578 | |
---|
| 1579 | @item @b{@minus{}@minus{}morph=@var{field}} |
---|
| 1580 | The name of the annotation field containing the morphological |
---|
| 1581 | description (default @code{lem}). |
---|
| 1582 | |
---|
| 1583 | @item @b{@minus{}@minus{}command} |
---|
| 1584 | Only print the generated flex source code. |
---|
| 1585 | |
---|
| 1586 | @item @b{@minus{}@minus{}macro=@var{filename}} |
---|
| 1587 | Read macro definitions from file @var{filename} rather than from |
---|
| 1588 | the default location. This option makes it possible to redefine the set of terms. |
---|
| 1589 | |
---|
| 1590 | @item @b{@minus{}@minus{}define=@var{filename}} |
---|
| 1591 | Append macro definitions from file @var{filename}. This option |
---|
| 1592 | makes it possible to extend the set of terms. |
---|
| 1593 | |
---|
| 1594 | @end table |
---|
| 1595 | |
---|
| 1596 | |
---|
| 1597 | @node grp pattern |
---|
| 1598 | @subsection Pattern |
---|
| 1599 | |
---|
| 1600 | (see @code{ser}) |
---|
| 1601 | |
---|
| 1602 | @node grp hints |
---|
| 1603 | @subsection Hints |
---|
| 1604 | |
---|
| 1605 | The corpus search speed may be increased by combining @command{grp} with the @command{lzop} |
---|
| 1606 | compression tool (@command{grp} usually processes data faster than it is read from |
---|
| 1607 | disk, especially on slow laptop drives). |
---|
| 1608 | |
---|
| 1609 | @example |
---|
| 1610 | cat corpus | tok | sen | lem | grp -a p | lzop -7 > corpus.grp.lzo |
---|
| 1611 | @end example |
---|
| 1612 | |
---|
| 1613 | @example |
---|
| 1614 | lzop -cd corpus.grp.lzo | grp -a gP -e @var{EXPR} | ser -e @var{EXPR} |
---|
| 1615 | @end example |
---|
| 1616 | |
---|
| 1617 | |
---|
[261bf62] | 1618 | |
---|
[25ae32e] | 1619 | @c --------------------------------------------------------------------- |
---|
[261bf62] | 1620 | @c MAR |
---|
[25ae32e] | 1621 | @c --------------------------------------------------------------------- |
---|
[261bf62] | 1622 | |
---|
| 1623 | @page |
---|
| 1624 | @node mar |
---|
| 1625 | @section mar |
---|
| 1626 | |
---|
| 1627 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} |
---|
| 1628 | @item @strong{Authors:} @tab Marcin Walas, Tomasz Obrêbski |
---|
| 1629 | @item @strong{Component category:} @tab filter |
---|
| 1630 | @end multitable |
---|
| 1631 | |
---|
| 1632 | [TODO] |
---|
| 1633 | |
---|
| 1634 | @c --------------------------------------------------------------------- |
---|
| 1635 | @c KOT |
---|
[25ae32e] | 1636 | @c --------------------------------------------------------------------- |
---|
| 1637 | |
---|
[261bf62] | 1638 | |
---|
[25ae32e] | 1639 | @page |
---|
| 1640 | @node kot |
---|
| 1641 | @section kot - untokenizer |
---|
| 1642 | |
---|
[261bf62] | 1643 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} |
---|
| 1644 | @item @strong{Authors:} @tab Tomasz Obrêbski |
---|
| 1645 | @item @strong{Component category:} @tab filter |
---|
| 1646 | @item @strong{Input format:} @tab UTT regular |
---|
| 1647 | @item @strong{Output format:} @tab text |
---|
| 1648 | @item @strong{Required annotation:} @tab tok |
---|
| 1649 | @end multitable |
---|
[25ae32e] | 1650 | |
---|
| 1651 | |
---|
| 1652 | @menu |
---|
[261bf62] | 1653 | * kot description:: |
---|
[25ae32e] | 1654 | * kot command line options:: |
---|
| 1655 | * kot usage examples:: |
---|
| 1656 | @end menu |
---|
| 1657 | |
---|
[261bf62] | 1658 | @node kot description |
---|
| 1659 | @subsection Description |
---|
| 1660 | |
---|
| 1661 | @command{kot} transforms a UTT formatted file back into raw text format. |
---|
| 1662 | |
---|
[25ae32e] | 1663 | @node kot command line options |
---|
| 1664 | @subsection Command line options |
---|
| 1665 | |
---|
| 1666 | @table @code |
---|
| 1667 | |
---|
| 1668 | @parhelp |
---|
| 1669 | |
---|
| 1670 | @c @item @b{@minus{}@minus{}version}, @b{@minus{}v} |
---|
| 1671 | |
---|
| 1672 | @c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}} |
---|
| 1673 | |
---|
| 1674 | @c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}} |
---|
| 1675 | |
---|
| 1676 | @c @item @b{@minus{}@minus{}interactive @minus{}i} |
---|
| 1677 | |
---|
| 1678 | @c @item @b{@minus{}@minus{}config=@var{filename}} |
---|
| 1679 | |
---|
| 1680 | @item |
---|
| 1681 | |
---|
| 1682 | @item @b{@minus{}@minus{}gap-fill=@var{string}, @minus{}g @var{string}} |
---|
| 1683 | print @var{string} between nonadjacent segments of the input file |
---|
| 1684 | |
---|
| 1685 | @item @b{@minus{}@minus{}spaces, @minus{}r} |
---|
| 1686 | retain the special characters @code{_}, @code{\t}, |
---|
| 1687 | @code{\n}, @code{\r}, @code{\f} unexpanded in the output |
---|
| 1688 | |
---|
| 1689 | @end table |
---|
| 1690 | |
---|
| 1691 | @node kot usage examples |
---|
| 1692 | @subsection Usage examples |
---|
| 1693 | |
---|
| 1694 | @example |
---|
| 1695 | cat legia.txt | tok | kot |
---|
| 1696 | @end example |
---|
| 1697 | |
---|
| 1698 | @example |
---|
| 1699 | cat legia.txt | tok | lem -1 | kot |
---|
| 1700 | @end example |
---|
| 1701 | |
---|
[261bf62] | 1702 | @c --------------------------------------------------------------- |
---|
| 1703 | @c CON |
---|
| 1704 | @c --------------------------------------------------------------- |
---|
| 1705 | |
---|
[25ae32e] | 1706 | |
---|
| 1707 | @page |
---|
| 1708 | @node con |
---|
| 1709 | @section con - concordance table generator |
---|
| 1710 | |
---|
| 1711 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} |
---|
| 1712 | @item @strong{Authors:} @tab Justyna Walkowska |
---|
| 1713 | @item @strong{Component category:} @tab sink |
---|
[261bf62] | 1714 | @item @strong{Input format:} @tab UTT regular |
---|
| 1715 | @item @strong{Output format:} @tab text |
---|
| 1716 | @item @strong{Required annotation:} @tab ser or mar |
---|
[25ae32e] | 1717 | @end multitable |
---|
| 1718 | @c |
---|
| 1719 | |
---|
| 1720 | @menu |
---|
[261bf62] | 1721 | * con description:: |
---|
[25ae32e] | 1722 | * con command line options:: |
---|
| 1723 | * con usage example:: |
---|
| 1724 | * con hints:: |
---|
| 1725 | @end menu |
---|
| 1726 | |
---|
[261bf62] | 1727 | |
---|
| 1728 | @node con description |
---|
| 1729 | @subsection Description |
---|
| 1730 | |
---|
| 1731 | @command{con} generates a concordance table based on a pattern given to @command{ser}. |
---|
| 1732 | |
---|
| 1733 | |
---|
[25ae32e] | 1734 | @node con command line options |
---|
| 1735 | @subsection Command line options |
---|
| 1736 | |
---|
| 1737 | @table @code |
---|
| 1738 | |
---|
| 1739 | @parhelp |
---|
| 1740 | |
---|
| 1741 | @c @item @b{@minus{}@minus{}help}, @b{@minus{}h} |
---|
| 1742 | @c @item @b{@minus{}@minus{}version}, @b{@minus{}v} |
---|
| 1743 | @c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}} |
---|
| 1744 | @c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}} |
---|
| 1745 | @c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}} [???] |
---|
| 1746 | @c @item @b{@minus{}@minus{}copy, @minus{}c} [???] |
---|
| 1747 | @c @item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}} |
---|
| 1748 | @c @item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}} |
---|
| 1749 | @c @item @b{@minus{}@minus{}process=@var{class}, @minus{}p @var{class}} |
---|
| 1750 | @c @item @b{@minus{}@minus{}interactive @minus{}i} |
---|
| 1751 | @c @item @b{@minus{}@minus{}config=@var{filename}} |
---|
| 1752 | @c @item |
---|
| 1753 | @c @item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}} |
---|
| 1754 | @c search pattern |
---|
| 1755 | @c |
---|
| 1756 | @c @item @b{@minus{}@minus{}flex} |
---|
| 1757 | @c only print the generated flex source code |
---|
| 1758 | @c |
---|
| 1759 | @c @item @b{@minus{}@minus{}macro=@var{filename}} |
---|
| 1760 | @c read macrodefinitions from file @var{filename} rather than from |
---|
| 1761 | @c default location. This option allows to redefine the set of terms. |
---|
| 1762 | @c |
---|
| 1763 | @c @item @b{@minus{}@minus{}define=@var{filename}} |
---|
| 1764 | @c append macrodefinitions from file @var{filename}. This option |
---|
| 1765 | @c allows to extend the set of terms. |
---|
| 1766 | |
---|
| 1767 | @item @b{@minus{}@minus{}left @minus{}l} |
---|
| 1768 | Left context info (default='30c'). Example: |
---|
| 1769 | @example |
---|
| 1770 | -l=5c: left context is 5 characters |
---|
| 1771 | -l=5w: left context is 5 words |
---|
| 1772 | -l=5s: left context is 5 non-empty input lines |
---|
| 1773 | -l='\s*\S+\sr\S+BOS': left context starts with the given regex |
---|
| 1774 | @end example |
---|
| 1775 | |
---|
| 1776 | @item @b{@minus{}@minus{}right @minus{}r} |
---|
| 1777 | Right context info (default='30c'). |
---|
| 1778 | @item @b{@minus{}@minus{}trim @minus{}t} |
---|
| 1779 | Clear incomplete words from output. |
---|
| 1780 | @item @b{@minus{}@minus{}white @minus{}w} |
---|
| 1781 | Do not convert whitespace characters into spaces. |
---|
| 1782 | @item @b{@minus{}@minus{}column @minus{}c} |
---|
| 1783 | Left column minimal width in characters (default = 0). |
---|
| 1784 | @item @b{@minus{}@minus{}ignore @minus{}i} |
---|
| 1785 | Ignore segment inconsistency in the input. |
---|
[261bf62] | 1786 | @item @b{@minus{}@minus{}bom} |
---|
[25ae32e] | 1787 | Beginning of selected segment (regex, default='[0-9]+ [0-9]+ BOM .*'). |
---|
[261bf62] | 1788 | @item @b{@minus{}@minus{}eom} |
---|
[25ae32e] | 1789 | End of selected segment (regex, default='[0-9]+ [0-9]+ EOM .*'). |
---|
| 1790 | @item @b{@minus{}@minus{}bod} |
---|
| 1791 | Selected segment beginning display string (default='['). |
---|
| 1792 | @item @b{@minus{}@minus{}eod} |
---|
| 1793 | Selected segment end display string (default=']'). |
---|
| 1794 | |
---|
| 1795 | |
---|
| 1796 | |
---|
| 1797 | @end table |
---|
| 1798 | |
---|
| 1799 | @node con usage example |
---|
| 1800 | @subsection Usage example |
---|
| 1801 | @example |
---|
[261bf62] | 1802 | cat file.txt | tok | lem -1 | ser -e 'lexeme(dom)' | con |
---|
[25ae32e] | 1803 | @end example |
---|
| 1804 | |
---|
| 1805 | |
---|
| 1806 | @node con hints |
---|
| 1807 | @subsection Hints |
---|
| 1808 | |
---|
| 1809 | @command{con} is a rather slow program. Do not pass large amounts of |
---|
| 1810 | redundant text through this program. @command{con} works fine in the following |
---|
| 1811 | sequence: |
---|
| 1812 | |
---|
| 1813 | @example |
---|
| 1814 | ... | grp -e EXPR | ser -e EXPR | con |
---|
| 1815 | @end example |
---|
| 1816 | |
---|
| 1817 | |
---|
| 1818 | @c --------------------------------------------------------------------- |
---|
| 1819 | @c --------------------------------------------------------------------- |
---|
| 1820 | |
---|
| 1821 | @page |
---|
| 1822 | @node Auxiliary tools |
---|
| 1823 | @chapter Auxiliary tools |
---|
| 1824 | |
---|
| 1825 | @menu |
---|
| 1826 | * compiledic:: dictionary compiler |
---|
| 1827 | * fla:: UTT file flattener |
---|
| 1828 | * unfla:: UTT file unflattener |
---|
| 1829 | @end menu |
---|
| 1830 | |
---|
| 1831 | |
---|
| 1832 | @page |
---|
| 1833 | @node compiledic |
---|
| 1834 | @section compiledic - the dictionary compiler |
---|
| 1835 | |
---|
| 1836 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} |
---|
| 1837 | @item @strong{Authors:} @tab Michal Stolarski, Tomasz Obrebski |
---|
| 1838 | @item @strong{Component category:} @tab additional tool |
---|
| 1839 | @end multitable |
---|
| 1840 | @c |
---|
| 1841 | |
---|
| 1842 | @command{compiledic} compiles dictionaries in text format (@code{.dic} extension) into binary |
---|
| 1843 | (FSA) format (@code{.bin} extension). |
---|
| 1844 | |
---|
| 1845 | Automaton representation of a dictionary is built using the AT&T tools: |
---|
| 1846 | @itemize |
---|
| 1847 | @item AT&T FSM Library, |
---|
| 1848 | @item AT&T Lextools. |
---|
| 1849 | @end itemize |
---|
| 1850 | |
---|
| 1851 | In order for the compiledic program to work you have to install the |
---|
| 1852 | above mentioned packages into your system. They are freely available |
---|
| 1853 | for non-commercial use. |
---|
| 1854 | |
---|
| 1855 | Usage: |
---|
| 1856 | @example |
---|
| 1857 | compiledic <dictionaryname>.dic |
---|
| 1858 | @end example |
---|
| 1859 | |
---|
| 1860 | The file <dictionaryname>.bin will be generated. |
---|
| 1861 | |
---|
| 1862 | Note: The program produces a lot of temporary files which are |
---|
| 1863 | stored in the current directory. They are deleted after successful |
---|
| 1864 | termination of the program. |
---|
| 1865 | |
---|
| 1866 | @c @menu |
---|
| 1867 | @c * con command line options:: |
---|
| 1868 | @c * con usage example:: |
---|
| 1869 | @c * con hints:: |
---|
| 1870 | @c @end menu |
---|
| 1871 | |
---|
| 1872 | |
---|
| 1873 | @page |
---|
| 1874 | @node fla |
---|
| 1875 | @section fla - the UTT file flattener |
---|
| 1876 | |
---|
| 1877 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} |
---|
[19760ef] | 1878 | @item @strong{Authors:} @tab Tomasz Obrêbski |
---|
[25ae32e] | 1879 | @item @strong{Component category:} @tab filter |
---|
| 1880 | @end multitable |
---|
| 1881 | @c |
---|
| 1882 | |
---|
| 1883 | @command{fla} ``flattens'' a UTT file by merging segments belonging |
---|
| 1884 | to one sentence into one line. Technically, end-of-line characters |
---|
| 1885 | ('\n', ASCII code 10) are replaced with form-feed characters ('\f', |
---|
| 1886 | ASCII code 12). The flattening makes it possible to process UTT files |
---|
| 1887 | with such tools as @command{grep} or @command{sed} sentence by |
---|
| 1888 | sentence (used in @command{grp} and @command{mar}). |
---|
| 1889 | |
---|
| 1890 | Flattened files should have the suffix @code{.fla}, e.g., @file{thetext.utt.fla}. |
---|
| 1891 | |
---|
| 1892 | Flattened files are still human-readable. |
---|
| 1893 | |
---|
| 1894 | Usage: |
---|
| 1895 | |
---|
| 1896 | @example |
---|
| 1897 | fla [<bosregex>] |
---|
| 1898 | @end example |
---|
| 1899 | |
---|
| 1900 | The optional argument is a regular expression describing segments |
---|
| 1901 | which should be treated as sentence beginnings (the test is: the |
---|
| 1902 | segment contains a fragment matching the @code{<bosregex>}). By |
---|
| 1903 | default, segments containing a field @code{BOS} are sought. |
---|
| 1904 | @c @menu |
---|
| 1905 | @c * con command line options:: |
---|
| 1906 | @c * con usage example:: |
---|
| 1907 | @c * con hints:: |
---|
| 1908 | @c @end menu |
---|
| 1909 | |
---|
| 1910 | |
---|
| 1911 | |
---|
| 1912 | @page |
---|
| 1913 | @node unfla |
---|
| 1914 | @section unfla - the UTT file unflattener |
---|
| 1915 | |
---|
| 1916 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} |
---|
[19760ef] | 1917 | @item @strong{Authors:} @tab Tomasz Obrêbski |
---|
[25ae32e] | 1918 | @item @strong{Component category:} @tab filter |
---|
| 1919 | @end multitable |
---|
| 1920 | |
---|
| 1921 | @command{unfla} transforms a flattened UTT file, produced by |
---|
| 1922 | @command{fla}, into the regular format by restoring end-of-line |
---|
| 1923 | characters. |
---|
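The round trip performed by @command{fla} and @command{unfla} can be sketched in a few lines of Python. This is only an illustration, not the tools' actual implementation. It assumes that each segment occupies one input line, that a sentence starts at every segment matching the sentence-beginning regex (here the hypothetical default @code{\bBOS\b}), and that segments of one sentence are joined with the form-feed character. The function names and the sample segments are invented for the example.

```python
import re


def flatten(segments, bos_regex=r"\bBOS\b"):
    """Join the segments of each sentence into one line, separated by '\\f'.

    A new sentence is assumed to start at every segment matching bos_regex.
    """
    bos = re.compile(bos_regex)
    flat, current = [], []
    for segment in segments:
        if bos.search(segment) and current:
            flat.append("\f".join(current))
            current = []
        current.append(segment)
    if current:
        flat.append("\f".join(current))
    return flat


def unflatten(flat_lines):
    """Restore one segment per line by splitting on '\\f'."""
    return [segment for line in flat_lines for segment in line.split("\f")]


# Invented sample segments (start, length, type, form).
segments = [
    "0000 00 BOS *",
    "0000 03 W Ala",
    "0003 01 S _",
    "0004 02 W ma",
    "0006 00 BOS *",
    "0006 03 W Kot",
]
flat = flatten(segments)
restored = unflatten(flat)
```

Because every sentence ends up on a single line, line-oriented tools can then select or reject whole sentences, and splitting on the form-feed character restores the original segments.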
| 1924 | |
---|
| 1925 | |
---|
| 1926 | |
---|
| 1927 | |
---|
| 1928 | @c --------------------------------------------------------------------- |
---|
| 1929 | @c USAGE EXAMPLES |
---|
| 1930 | @c --------------------------------------------------------------------- |
---|
| 1931 | |
---|
| 1932 | @node Usage examples |
---|
| 1933 | @chapter Usage examples |
---|
| 1934 | |
---|
| 1935 | @subsubheading Simple pipelines |
---|
| 1936 | |
---|
| 1937 | @enumerate |
---|
| 1938 | |
---|
| 1939 | @item tokenization |
---|
| 1940 | |
---|
| 1941 | cat text | tok > output1 |
---|
| 1942 | |
---|
| 1943 | @item morphological annotation (1) |
---|
| 1944 | |
---|
| 1945 | simple dictionary based lemmatization |
---|
| 1946 | |
---|
| 1947 | cat text | tok | lem > output1 |
---|
| 1948 | |
---|
| 1949 | @item morphological annotation (2) |
---|
| 1950 | |
---|
| 1951 | 1) perform dictionary-based lemmatization |
---|
| 1952 | 2) guess descriptions for words which have no annotation |
---|
| 1953 | |
---|
| 1954 | @example |
---|
| 1955 | cat text | tok | lem | gue -S lem > output2 |
---|
| 1956 | @end example |
---|
| 1957 | |
---|
| 1958 | @item morphological annotation (3) |
---|
| 1959 | |
---|
| 1960 | 1) perform dictionary-based lemmatization |
---|
| 1961 | 2) try to correct words with no annotation |
---|
| 1962 | 3) perform dictionary-based lemmatization of corrected words |
---|
| 1963 | 4) guess descriptions for words which still have no annotation |
---|
| 1964 | |
---|
| 1965 | @example |
---|
| 1966 | cat text | tok | lem | cor -p W -S lem | lem -I cor | gue -p W -S lem |
---|
| 1967 | @end example |
---|
| 1968 | @item spelling correction |
---|
| 1969 | |
---|
| 1970 | |
---|
| 1971 | |
---|
| 1972 | @example |
---|
| 1973 | cat text | tok | lem --only-fail | cor -1 > output3 |
---|
| 1974 | @end example |
---|
| 1975 | |
---|
| 1976 | @item Expression extraction |
---|
| 1977 | |
---|
| 1978 | Extraction of all occurrences of a verb followed by a form of the noun 'rozmowa'. |
---|
| 1979 | |
---|
| 1980 | @example |
---|
| 1981 | cat text | tok | lem -1 | ser -e 'cat(<V>) space lexeme(rozmowa)' -m | kot > output4 |
---|
| 1982 | @end example |
---|
| 1983 | |
---|
| 1984 | @item A word in context |
---|
| 1985 | |
---|
| 1986 | Extraction of text fragments containing a form of the lexeme 'rozmowa' in |
---|
| 1987 | the context of 5 preceding and 5 following corpus segments. |
---|
| 1988 | |
---|
| 1989 | @example |
---|
| 1990 | cat text | tok | lem -1 | ser -e 'seg@{5@} lexeme(rozmowa) seg@{5@}' -m | kot > output |
---|
| 1991 | @end example |
---|
| 1992 | |
---|
| 1993 | @item generation of concordance table (1) |
---|
| 1994 | |
---|
| 1995 | @example |
---|
| 1996 | cat text | tok | lem -1 | ser -e 'cat(<V>) space lexeme(rozmowa)' | con |
---|
| 1997 | @end example |
---|
| 1998 | |
---|
| 1999 | 10" |
---|
| 2000 | |
---|
| 2001 | @item generation of concordance table (2) |
---|
| 2002 | |
---|
| 2003 | The same as above but much faster |
---|
| 2004 | |
---|
| 2005 | @example |
---|
| 2006 | cat text | tok | lem -1 | \ |
---|
| 2007 | grp -e 'cat(<V>) space lexeme(rozmowa)' | \ |
---|
| 2008 | ser -e 'cat(<V>) space lexeme(rozmowa)' | \ |
---|
| 2009 | con |
---|
| 2010 | @end example |
---|
| 2011 | |
---|
| 2012 | 2" |
---|
| 2013 | |
---|
| 2014 | @item generation of concordance table (3) |
---|
| 2015 | |
---|
| 2016 | Usually, one searches the same corpus repeatedly. In |
---|
| 2017 | such a case it is advisable to transform the corpus data into the format |
---|
| 2018 | required by @command{grp} first, and then use the preprocessed data. |
---|
| 2019 | |
---|
| 2020 | As @command{grp} (@command{grep}) processes data faster than it is |
---|
| 2021 | read from the disk drive, the search time may be shortened further by |
---|
| 2022 | using file compression techniques. We suggest using @command{lzop}. |
---|
| 2023 | |
---|
| 2024 | @item the fastest way to search a large corpus |
---|
| 2025 | |
---|
| 2026 | step 1: preprocessing |
---|
| 2027 | |
---|
| 2028 | @example |
---|
| 2029 | cat corpus | tok | sen | lem -1 \ |
---|
| 2030 | | grp -a p | lzop -7 > corpus.grp.lzo |
---|
| 2031 | @end example |
---|
| 2032 | |
---|
| 2033 | step 2: search |
---|
| 2034 | |
---|
| 2035 | @example |
---|
| 2036 | lzop -cd corpus.grp.lzo | grp -a gP -e 'cat(<V>) space lexeme(rozmowa)' \ |
---|
| 2037 | | ser -e 'cat(<V>) space lexeme(rozmowa)' | con |
---|
| 2038 | @end example |
---|
| 2039 | |
---|
| 2040 | @end enumerate |
---|
| 2041 | |
---|
| 2042 | @subsubheading More complicated configurations |
---|
| 2043 | |
---|
| 2044 | |
---|
| 2045 | @example |
---|
| 2046 | mknod fifo1 p |
---|
| 2047 | mknod fifo2 p |
---|
| 2048 | mknod fifo3 p |
---|
| 2049 | mknod fifo4 p |
---|
| 2050 | mknod fifo5 p |
---|
| 2051 | |
---|
| 2052 | tok | lem -p W -e fifo1 > fifo2 & |
---|
| 2053 | cor -e fifo3 < fifo1 | lem > fifo4 & |
---|
| 2054 | gue < fifo3 > fifo5 & |
---|
| 2055 | sort -m fifo2 fifo4 fifo5 |
---|
| 2056 | |
---|
| 2057 | rm fifo? |
---|
| 2058 | @end example |
---|
| 2059 | |
---|
| 2060 | |
---|
| 2061 | @c --------------------------------------------------------------------- |
---|
| 2062 | @c --------------------------------------------------------------------- |
---|
| 2063 | |
---|
| 2064 | @c --------------------------------------------------------------------- |
---|
| 2065 | @c PMDBF DICTIONARY |
---|
| 2066 | @c --------------------------------------------------------------------- |
---|
| 2067 | |
---|
| 2068 | @node PMDBF dictionary |
---|
| 2069 | @chapter PMDBF dictionary |
---|
| 2070 | |
---|
| 2071 | UTT components come with lexical data derived from Polish |
---|
| 2072 | Morphological Database (PMDB). |
---|
| 2073 | |
---|
| 2074 | @menu |
---|
| 2075 | * PMDBF files:: |
---|
| 2076 | * PMDBF tag structure:: |
---|
| 2077 | * PMDBF parts of speech:: |
---|
| 2078 | * PMDBF morphosyntactic attributes:: |
---|
| 2079 | @end menu |
---|
| 2080 | |
---|
| 2081 | @node PMDBF files |
---|
| 2082 | @section Files |
---|
| 2083 | |
---|
| 2084 | @node PMDBF tag structure |
---|
| 2085 | @section Tag structure |
---|
| 2086 | |
---|
| 2087 | pos = [[:upper:]]+ |
---|
| 2088 | |
---|
| 2089 | attr = [[:upper:]]+ |
---|
| 2090 | |
---|
| 2091 | val = [[:lower:][:digit:]?!*+-] | <[^>\n]+> |
---|
| 2092 | |
---|
| 2093 | descr = pos ( / ( attr val + ) + ) ? |
---|
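The grammar above can be turned into a small validator and parser. The following Python sketch is an illustration only, not part of UTT. It assumes that the POSIX classes @code{[:upper:]}, @code{[:lower:]} and @code{[:digit:]} are limited to ASCII, and the name @code{parse_descr} and the sample tags other than @code{NPRO/Sr} are made up for the example.

```python
import re

# val = [[:lower:][:digit:]?!*+-] | <[^>\n]+>   (ASCII classes assumed)
VAL = r"(?:[a-z0-9?!*+\-]|<[^>\n]+>)"

# descr = pos ( / ( attr val + ) + ) ?
DESCR = re.compile(rf"(?P<pos>[A-Z]+)(?:/(?P<attrs>(?:[A-Z]+{VAL}+)+))?")
PAIR = re.compile(rf"(?P<attr>[A-Z]+)(?P<vals>{VAL}+)")
ONE_VAL = re.compile(VAL)


def parse_descr(descr):
    """Return (pos, {attr: [val, ...]}) or None if descr is not well formed."""
    match = DESCR.fullmatch(descr)
    if match is None:
        return None
    attrs = {}
    for pair in PAIR.finditer(match.group("attrs") or ""):
        attrs[pair.group("attr")] = ONE_VAL.findall(pair.group("vals"))
    return match.group("pos"), attrs


relative_pronoun = parse_descr("NPRO/Sr")
noun = parse_descr("N/GfNsCn")   # illustrative tag
bare_verb = parse_descr("V")
malformed = parse_descr("N/")
```

An attribute may carry several values (the grammar allows @code{val+}), which is why each attribute maps to a list.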
| 2094 | |
---|
| 2095 | @node PMDBF parts of speech |
---|
| 2096 | @section Parts of speech |
---|
| 2097 | |
---|
| 2098 | @multitable {ADJPRP} { adjectival-passive-participle } |
---|
| 2099 | @item @code{N} @tab noun |
---|
| 2100 | @item @code{NPRO} @tab nominal-pronoun |
---|
| 2101 | @item @code{NV} @tab deverbal-noun |
---|
| 2102 | @item @code{V} @tab verb |
---|
| 2103 | @item @code{BYC} @tab byc |
---|
| 2104 | @item @code{VNI} @tab non-inflected-verb |
---|
| 2105 | @item @code{ADJ} @tab adjective |
---|
| 2106 | @item @code{ADJPAP} @tab adjectival-passive-participle |
---|
| 2107 | @item @code{ADJPRP} @tab adjectival-present-participle |
---|
| 2108 | @item @code{ADJPP} @tab adjectival-past-participle |
---|
| 2109 | @item @code{ADJPRO} @tab adjectival-pronoun |
---|
| 2110 | @item @code{ADJNUM} @tab adjectival-numeral |
---|
| 2111 | @item @code{ADV} @tab adverb |
---|
| 2112 | @item @code{ADVANP} @tab adverbial-anterior-participle |
---|
| 2113 | @item @code{ADVPRP} @tab adverbial-present-participle |
---|
| 2114 | @item @code{ADVPRO} @tab adverbial-pronoun |
---|
| 2115 | @item @code{ADVNUM} @tab adverbial-numeral |
---|
| 2116 | @item @code{P} @tab preposition |
---|
| 2117 | @item @code{PPRO} @tab prep-noun-pronoun |
---|
| 2118 | @item @code{CONJ} @tab conjunction |
---|
| 2119 | @item @code{EXCL} @tab exclamation |
---|
| 2120 | @item @code{APP} @tab call |
---|
| 2121 | @item @code{ONO} @tab onomatopoeia |
---|
| 2122 | @item @code{PART} @tab particle |
---|
| 2123 | @item @code{NUMCRD} @tab cardinal-numeral |
---|
| 2124 | @item @code{NUMCOL} @tab collective-numeral |
---|
| 2125 | @item @code{NUMPAR} @tab partitive-numeral |
---|
| 2126 | @item @code{NUMORD} @tab ordinal-numeral |
---|
| 2127 | @end multitable |
---|
| 2128 | |
---|
| 2129 | @node PMDBF morphosyntactic attributes |
---|
| 2130 | @section Morphosyntactic attributes |
---|
| 2131 | |
---|
| 2132 | @multitable {Attr} {Val} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} |
---|
| 2133 | @c @headitem Attr @tab Val @tab Description |
---|
| 2134 | @item |
---|
| 2135 | @code{A} @tab @tab Aspect |
---|
| 2136 | @item |
---|
| 2137 | @tab @code{p} @tab perfect |
---|
| 2138 | @item |
---|
| 2139 | @tab @code{i} @tab imperfect. |
---|
| 2140 | @item |
---|
| 2141 | @item |
---|
| 2142 | @code{V} @tab @tab Verb-Form |
---|
| 2143 | @item |
---|
| 2144 | @tab @code{b} @tab infinitive, |
---|
| 2145 | @item |
---|
| 2146 | @tab @code{p} @tab personal, |
---|
| 2147 | @item |
---|
| 2148 | @tab @code{i} @tab impersonal. |
---|
| 2149 | @item |
---|
| 2150 | @item |
---|
| 2151 | @code{M} @tab @tab Mood |
---|
| 2152 | @item |
---|
| 2153 | @tab @code{d} @tab declarative, |
---|
| 2154 | @item |
---|
| 2155 | @tab @code{c} @tab conditional, |
---|
| 2156 | @item |
---|
| 2157 | @tab @code{i} @tab imperative. |
---|
| 2158 | @item |
---|
| 2159 | @item |
---|
| 2160 | @code{T} @tab @tab Tense |
---|
| 2161 | @item |
---|
| 2162 | @tab @code{a} @tab past, |
---|
| 2163 | @item |
---|
| 2164 | @tab @code{r} @tab present, |
---|
| 2165 | @item |
---|
| 2166 | @tab @code{f} @tab future. |
---|
| 2167 | @item |
---|
| 2168 | @item |
---|
| 2169 | @code{P} @tab @tab Person |
---|
| 2170 | @item |
---|
| 2171 | @tab @code{1} @tab 1, |
---|
| 2172 | @item |
---|
| 2173 | @tab @code{2} @tab 2, |
---|
| 2174 | @item |
---|
| 2175 | @tab @code{3} @tab 3. |
---|
| 2176 | @item |
---|
| 2177 | @item |
---|
| 2178 | @code{D} @tab @tab Degree |
---|
| 2179 | @item |
---|
| 2180 | @tab @code{p} @tab positive, |
---|
| 2181 | @item |
---|
| 2182 | @tab @code{c} @tab comparative, |
---|
| 2183 | @item |
---|
| 2184 | @tab @code{s} @tab superlative. |
---|
| 2185 | @item |
---|
| 2186 | @item |
---|
| 2187 | @code{N} @tab @tab Number |
---|
| 2188 | @item |
---|
| 2189 | @tab @code{s} @tab singular, |
---|
| 2190 | @item |
---|
| 2191 | @tab @code{p} @tab plural. |
---|
| 2192 | @item |
---|
| 2193 | @item |
---|
| 2194 | @code{C} @tab @tab Case |
---|
| 2195 | @item |
---|
| 2196 | @tab @code{n} @tab nominative, |
---|
| 2197 | @item |
---|
| 2198 | @tab @code{g} @tab genitive, |
---|
| 2199 | @item |
---|
| 2200 | @tab @code{d} @tab dative, |
---|
| 2201 | @item |
---|
| 2202 | @tab @code{a} @tab accusative, |
---|
| 2203 | @item |
---|
| 2204 | @tab @code{i} @tab instrumental, |
---|
| 2205 | @item |
---|
| 2206 | @tab @code{l} @tab locative, |
---|
| 2207 | @item |
---|
| 2208 | @tab @code{v} @tab vocative. |
---|
| 2209 | @item |
---|
| 2210 | @item |
---|
| 2211 | @code{G} @tab @tab Gender |
---|
| 2212 | @item |
---|
| 2213 | @tab @code{p} @tab masculine-personal, |
---|
| 2214 | @item |
---|
| 2215 | @tab @code{a} @tab masculine-animal, |
---|
| 2216 | @item |
---|
| 2217 | @tab @code{i} @tab masculine-inanimate, |
---|
| 2218 | @item |
---|
| 2219 | @tab @code{f} @tab feminine, |
---|
| 2220 | @item |
---|
| 2221 | @tab @code{n} @tab neuter. |
---|
| 2222 | @end multitable |
---|
| 2223 | |
---|
| 2224 | |
---|
| 2225 | @c --------------------------------------------------------------------- |
---|
| 2226 | @c --------------------------------------------------------------------- |
---|
| 2227 | @c |
---|
| 2228 | @c @node Examples |
---|
| 2229 | @c @chapter Examples |
---|
| 2230 | |
---|
| 2231 | @c ---------------------------------------------------------------------- |
---|
| 2232 | @c ---------------------------------------------------------------------- |
---|
| 2233 | |
---|
| 2234 | @node GNU Free Documentation License |
---|
| 2235 | @chapter GNU Free Documentation License |
---|
| 2236 | |
---|
| 2237 | @c The GNU Free Documentation License. |
---|
| 2238 | @center Version 1.2, November 2002 |
---|
| 2239 | |
---|
| 2240 | @c This file is intended to be included within another document, |
---|
| 2241 | @c hence no sectioning command or @node. |
---|
| 2242 | |
---|
| 2243 | @display |
---|
| 2244 | Copyright @copyright{} 2000,2001,2002 Free Software Foundation, Inc. |
---|
| 2245 | 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA |
---|
| 2246 | |
---|
| 2247 | Everyone is permitted to copy and distribute verbatim copies |
---|
| 2248 | of this license document, but changing it is not allowed. |
---|
| 2249 | @end display |
---|
| 2250 | |
---|
| 2251 | @enumerate 0 |
---|
| 2252 | @item |
---|
| 2253 | PREAMBLE |
---|
| 2254 | |
---|
| 2255 | The purpose of this License is to make a manual, textbook, or other |
---|
| 2256 | functional and useful document @dfn{free} in the sense of freedom: to |
---|
| 2257 | assure everyone the effective freedom to copy and redistribute it, |
---|
| 2258 | with or without modifying it, either commercially or noncommercially. |
---|
| 2259 | Secondarily, this License preserves for the author and publisher a way |
---|
| 2260 | to get credit for their work, while not being considered responsible |
---|
| 2261 | for modifications made by others. |
---|
| 2262 | |
---|
| 2263 | This License is a kind of ``copyleft'', which means that derivative |
---|
| 2264 | works of the document must themselves be free in the same sense. It |
---|
| 2265 | complements the GNU General Public License, which is a copyleft |
---|
| 2266 | license designed for free software. |
---|
| 2267 | |
---|
| 2268 | We have designed this License in order to use it for manuals for free |
---|
| 2269 | software, because free software needs free documentation: a free |
---|
| 2270 | program should come with manuals providing the same freedoms that the |
---|
| 2271 | software does. But this License is not limited to software manuals; |
---|
| 2272 | it can be used for any textual work, regardless of subject matter or |
---|
| 2273 | whether it is published as a printed book. We recommend this License |
---|
| 2274 | principally for works whose purpose is instruction or reference. |
---|
| 2275 | |
---|
| 2276 | @item |
---|
| 2277 | APPLICABILITY AND DEFINITIONS |
---|
| 2278 | |
---|
| 2279 | This License applies to any manual or other work, in any medium, that |
---|
| 2280 | contains a notice placed by the copyright holder saying it can be |
---|
| 2281 | distributed under the terms of this License. Such a notice grants a |
---|
| 2282 | world-wide, royalty-free license, unlimited in duration, to use that |
---|
| 2283 | work under the conditions stated herein. The ``Document'', below, |
---|
| 2284 | refers to any such manual or work. Any member of the public is a |
---|
| 2285 | licensee, and is addressed as ``you''. You accept the license if you |
---|
| 2286 | copy, modify or distribute the work in a way requiring permission |
---|
| 2287 | under copyright law. |
---|
| 2288 | |
---|
| 2289 | A ``Modified Version'' of the Document means any work containing the |
---|
| 2290 | Document or a portion of it, either copied verbatim, or with |
---|
| 2291 | modifications and/or translated into another language. |
---|
| 2292 | |
---|
| 2293 | A ``Secondary Section'' is a named appendix or a front-matter section |
---|
| 2294 | of the Document that deals exclusively with the relationship of the |
---|
| 2295 | publishers or authors of the Document to the Document's overall |
---|
| 2296 | subject (or to related matters) and contains nothing that could fall |
---|
| 2297 | directly within that overall subject. (Thus, if the Document is in |
---|
| 2298 | part a textbook of mathematics, a Secondary Section may not explain |
---|
| 2299 | any mathematics.) The relationship could be a matter of historical |
---|
| 2300 | connection with the subject or with related matters, or of legal, |
---|
| 2301 | commercial, philosophical, ethical or political position regarding |
---|
| 2302 | them. |
---|
| 2303 | |
---|
| 2304 | The ``Invariant Sections'' are certain Secondary Sections whose titles |
---|
| 2305 | are designated, as being those of Invariant Sections, in the notice |
---|
| 2306 | that says that the Document is released under this License. If a |
---|
| 2307 | section does not fit the above definition of Secondary then it is not |
---|
| 2308 | allowed to be designated as Invariant. The Document may contain zero |
---|
| 2309 | Invariant Sections. If the Document does not identify any Invariant |
---|
| 2310 | Sections then there are none. |
---|
| 2311 | |
---|
| 2312 | The ``Cover Texts'' are certain short passages of text that are listed, |
---|
| 2313 | as Front-Cover Texts or Back-Cover Texts, in the notice that says that |
---|
| 2314 | the Document is released under this License. A Front-Cover Text may |
---|
| 2315 | be at most 5 words, and a Back-Cover Text may be at most 25 words. |
---|
| 2316 | |
---|
| 2317 | A ``Transparent'' copy of the Document means a machine-readable copy, |
---|
| 2318 | represented in a format whose specification is available to the |
---|
| 2319 | general public, that is suitable for revising the document |
---|
| 2320 | straightforwardly with generic text editors or (for images composed of |
---|
| 2321 | pixels) generic paint programs or (for drawings) some widely available |
---|
| 2322 | drawing editor, and that is suitable for input to text formatters or |
---|
| 2323 | for automatic translation to a variety of formats suitable for input |
---|
| 2324 | to text formatters. A copy made in an otherwise Transparent file |
---|
| 2325 | format whose markup, or absence of markup, has been arranged to thwart |
---|
| 2326 | or discourage subsequent modification by readers is not Transparent. |
---|
| 2327 | An image format is not Transparent if used for any substantial amount |
---|
| 2328 | of text. A copy that is not ``Transparent'' is called ``Opaque''. |
---|
| 2329 | |
---|
| 2330 | Examples of suitable formats for Transparent copies include plain |
---|
| 2331 | @sc{ascii} without markup, Texinfo input format, La@TeX{} input |
---|
| 2332 | format, @acronym{SGML} or @acronym{XML} using a publicly available |
---|
| 2333 | @acronym{DTD}, and standard-conforming simple @acronym{HTML}, |
---|
| 2334 | PostScript or @acronym{PDF} designed for human modification. Examples |
---|
| 2335 | of transparent image formats include @acronym{PNG}, @acronym{XCF} and |
---|
| 2336 | @acronym{JPG}. Opaque formats include proprietary formats that can be |
---|
| 2337 | read and edited only by proprietary word processors, @acronym{SGML} or |
---|
| 2338 | @acronym{XML} for which the @acronym{DTD} and/or processing tools are |
---|
| 2339 | not generally available, and the machine-generated @acronym{HTML}, |
---|
| 2340 | PostScript or @acronym{PDF} produced by some word processors for |
---|
| 2341 | output purposes only. |
---|
| 2342 | |
---|
| 2343 | The ``Title Page'' means, for a printed book, the title page itself, |
---|
| 2344 | plus such following pages as are needed to hold, legibly, the material |
---|
| 2345 | this License requires to appear in the title page. For works in |
---|
| 2346 | formats which do not have any title page as such, ``Title Page'' means |
---|
| 2347 | the text near the most prominent appearance of the work's title, |
---|
| 2348 | preceding the beginning of the body of the text. |
---|
| 2349 | |
---|
| 2350 | A section ``Entitled XYZ'' means a named subunit of the Document whose |
---|
| 2351 | title either is precisely XYZ or contains XYZ in parentheses following |
---|
| 2352 | text that translates XYZ in another language. (Here XYZ stands for a |
---|
| 2353 | specific section name mentioned below, such as ``Acknowledgements'', |
---|
| 2354 | ``Dedications'', ``Endorsements'', or ``History''.) To ``Preserve the Title'' |
---|
| 2355 | of such a section when you modify the Document means that it remains a |
---|
| 2356 | section ``Entitled XYZ'' according to this definition. |
---|
| 2357 | |
---|
| 2358 | The Document may include Warranty Disclaimers next to the notice which |
---|
| 2359 | states that this License applies to the Document. These Warranty |
---|
| 2360 | Disclaimers are considered to be included by reference in this |
---|
| 2361 | License, but only as regards disclaiming warranties: any other |
---|
| 2362 | implication that these Warranty Disclaimers may have is void and has |
---|
| 2363 | no effect on the meaning of this License. |
---|
| 2364 | |
---|
| 2365 | @item |
---|
| 2366 | VERBATIM COPYING |
---|
| 2367 | |
---|
| 2368 | You may copy and distribute the Document in any medium, either |
---|
| 2369 | commercially or noncommercially, provided that this License, the |
---|
| 2370 | copyright notices, and the license notice saying this License applies |
---|
| 2371 | to the Document are reproduced in all copies, and that you add no other |
---|
| 2372 | conditions whatsoever to those of this License. You may not use |
---|
| 2373 | technical measures to obstruct or control the reading or further |
---|
| 2374 | copying of the copies you make or distribute. However, you may accept |
---|
| 2375 | compensation in exchange for copies. If you distribute a large enough |
---|
| 2376 | number of copies you must also follow the conditions in section 3. |
---|
| 2377 | |
---|
| 2378 | You may also lend copies, under the same conditions stated above, and |
---|
| 2379 | you may publicly display copies. |
---|
| 2380 | |
---|
| 2381 | @item |
---|
| 2382 | COPYING IN QUANTITY |
---|
| 2383 | |
---|
| 2384 | If you publish printed copies (or copies in media that commonly have |
---|
| 2385 | printed covers) of the Document, numbering more than 100, and the |
---|
| 2386 | Document's license notice requires Cover Texts, you must enclose the |
---|
| 2387 | copies in covers that carry, clearly and legibly, all these Cover |
---|
| 2388 | Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on |
---|
| 2389 | the back cover. Both covers must also clearly and legibly identify |
---|
| 2390 | you as the publisher of these copies. The front cover must present |
---|
| 2391 | the full title with all words of the title equally prominent and |
---|
| 2392 | visible. You may add other material on the covers in addition. |
---|
| 2393 | Copying with changes limited to the covers, as long as they preserve |
---|
| 2394 | the title of the Document and satisfy these conditions, can be treated |
---|
| 2395 | as verbatim copying in other respects. |
---|
| 2396 | |
---|
| 2397 | If the required texts for either cover are too voluminous to fit |
---|
| 2398 | legibly, you should put the first ones listed (as many as fit |
---|
| 2399 | reasonably) on the actual cover, and continue the rest onto adjacent |
---|
| 2400 | pages. |
---|
| 2401 | |
---|
| 2402 | If you publish or distribute Opaque copies of the Document numbering |
---|
| 2403 | more than 100, you must either include a machine-readable Transparent |
---|
| 2404 | copy along with each Opaque copy, or state in or with each Opaque copy |
---|
| 2405 | a computer-network location from which the general network-using |
---|
| 2406 | public has access to download using public-standard network protocols |
---|
| 2407 | a complete Transparent copy of the Document, free of added material. |
---|
| 2408 | If you use the latter option, you must take reasonably prudent steps, |
---|
| 2409 | when you begin distribution of Opaque copies in quantity, to ensure |
---|
| 2410 | that this Transparent copy will remain thus accessible at the stated |
---|
| 2411 | location until at least one year after the last time you distribute an |
---|
| 2412 | Opaque copy (directly or through your agents or retailers) of that |
---|
| 2413 | edition to the public. |
---|
| 2414 | |
---|
| 2415 | It is requested, but not required, that you contact the authors of the |
---|
| 2416 | Document well before redistributing any large number of copies, to give |
---|
| 2417 | them a chance to provide you with an updated version of the Document. |
---|
| 2418 | |
---|
| 2419 | @item |
---|
| 2420 | MODIFICATIONS |
---|
| 2421 | |
---|
| 2422 | You may copy and distribute a Modified Version of the Document under |
---|
| 2423 | the conditions of sections 2 and 3 above, provided that you release |
---|
| 2424 | the Modified Version under precisely this License, with the Modified |
---|
| 2425 | Version filling the role of the Document, thus licensing distribution |
---|
| 2426 | and modification of the Modified Version to whoever possesses a copy |
---|
| 2427 | of it. In addition, you must do these things in the Modified Version: |
---|
| 2428 | |
---|
| 2429 | @enumerate A |
---|
| 2430 | @item |
---|
| 2431 | Use in the Title Page (and on the covers, if any) a title distinct |
---|
| 2432 | from that of the Document, and from those of previous versions |
---|
| 2433 | (which should, if there were any, be listed in the History section |
---|
| 2434 | of the Document). You may use the same title as a previous version |
---|
| 2435 | if the original publisher of that version gives permission. |
---|
| 2436 | |
---|
| 2437 | @item |
---|
| 2438 | List on the Title Page, as authors, one or more persons or entities |
---|
| 2439 | responsible for authorship of the modifications in the Modified |
---|
| 2440 | Version, together with at least five of the principal authors of the |
---|
| 2441 | Document (all of its principal authors, if it has fewer than five), |
---|
| 2442 | unless they release you from this requirement. |
---|
| 2443 | |
---|
| 2444 | @item |
---|
| 2445 | State on the Title page the name of the publisher of the |
---|
| 2446 | Modified Version, as the publisher. |
---|
| 2447 | |
---|
| 2448 | @item |
---|
| 2449 | Preserve all the copyright notices of the Document. |
---|
| 2450 | |
---|
| 2451 | @item |
---|
| 2452 | Add an appropriate copyright notice for your modifications |
---|
| 2453 | adjacent to the other copyright notices. |
---|
| 2454 | |
---|
| 2455 | @item |
---|
| 2456 | Include, immediately after the copyright notices, a license notice |
---|
| 2457 | giving the public permission to use the Modified Version under the |
---|
| 2458 | terms of this License, in the form shown in the Addendum below. |
---|
| 2459 | |
---|
| 2460 | @item |
---|
| 2461 | Preserve in that license notice the full lists of Invariant Sections |
---|
| 2462 | and required Cover Texts given in the Document's license notice. |
---|
| 2463 | |
---|
| 2464 | @item |
---|
| 2465 | Include an unaltered copy of this License. |
---|
| 2466 | |
---|
| 2467 | @item |
---|
| 2468 | Preserve the section Entitled ``History'', Preserve its Title, and add |
---|
| 2469 | to it an item stating at least the title, year, new authors, and |
---|
| 2470 | publisher of the Modified Version as given on the Title Page. If |
---|
| 2471 | there is no section Entitled ``History'' in the Document, create one |
---|
| 2472 | stating the title, year, authors, and publisher of the Document as |
---|
| 2473 | given on its Title Page, then add an item describing the Modified |
---|
| 2474 | Version as stated in the previous sentence. |
---|
| 2475 | |
---|
| 2476 | @item |
---|
| 2477 | Preserve the network location, if any, given in the Document for |
---|
| 2478 | public access to a Transparent copy of the Document, and likewise |
---|
| 2479 | the network locations given in the Document for previous versions |
---|
| 2480 | it was based on. These may be placed in the ``History'' section. |
---|
| 2481 | You may omit a network location for a work that was published at |
---|
| 2482 | least four years before the Document itself, or if the original |
---|
| 2483 | publisher of the version it refers to gives permission. |
---|
| 2484 | |
---|
| 2485 | @item |
---|
| 2486 | For any section Entitled ``Acknowledgements'' or ``Dedications'', Preserve |
---|
| 2487 | the Title of the section, and preserve in the section all the |
---|
| 2488 | substance and tone of each of the contributor acknowledgements and/or |
---|
| 2489 | dedications given therein. |
---|
| 2490 | |
---|
| 2491 | @item |
---|
| 2492 | Preserve all the Invariant Sections of the Document, |
---|
| 2493 | unaltered in their text and in their titles. Section numbers |
---|
| 2494 | or the equivalent are not considered part of the section titles. |
---|
| 2495 | |
---|
| 2496 | @item |
---|
| 2497 | Delete any section Entitled ``Endorsements''. Such a section |
---|
| 2498 | may not be included in the Modified Version. |
---|
| 2499 | |
---|
| 2500 | @item |
---|
| 2501 | Do not retitle any existing section to be Entitled ``Endorsements'' or |
---|
| 2502 | to conflict in title with any Invariant Section. |
---|
| 2503 | |
---|
| 2504 | @item |
---|
| 2505 | Preserve any Warranty Disclaimers. |
---|
| 2506 | @end enumerate |
---|
| 2507 | |
---|
| 2508 | If the Modified Version includes new front-matter sections or |
---|
| 2509 | appendices that qualify as Secondary Sections and contain no material |
---|
| 2510 | copied from the Document, you may at your option designate some or all |
---|
| 2511 | of these sections as invariant. To do this, add their titles to the |
---|
| 2512 | list of Invariant Sections in the Modified Version's license notice. |
---|
| 2513 | These titles must be distinct from any other section titles. |
---|
| 2514 | |
---|
| 2515 | You may add a section Entitled ``Endorsements'', provided it contains |
---|
| 2516 | nothing but endorsements of your Modified Version by various |
---|
| 2517 | parties---for example, statements of peer review or that the text has |
---|
| 2518 | been approved by an organization as the authoritative definition of a |
---|
| 2519 | standard. |
---|
| 2520 | |
---|
| 2521 | You may add a passage of up to five words as a Front-Cover Text, and a |
---|
| 2522 | passage of up to 25 words as a Back-Cover Text, to the end of the list |
---|
| 2523 | of Cover Texts in the Modified Version. Only one passage of |
---|
| 2524 | Front-Cover Text and one of Back-Cover Text may be added by (or |
---|
| 2525 | through arrangements made by) any one entity. If the Document already |
---|
| 2526 | includes a cover text for the same cover, previously added by you or |
---|
| 2527 | by arrangement made by the same entity you are acting on behalf of, |
---|
| 2528 | you may not add another; but you may replace the old one, on explicit |
---|
| 2529 | permission from the previous publisher that added the old one. |
---|
| 2530 | |
---|
| 2531 | The author(s) and publisher(s) of the Document do not by this License |
---|
| 2532 | give permission to use their names for publicity for or to assert or |
---|
| 2533 | imply endorsement of any Modified Version. |
---|
| 2534 | |
---|
| 2535 | @item |
---|
| 2536 | COMBINING DOCUMENTS |
---|
| 2537 | |
---|
| 2538 | You may combine the Document with other documents released under this |
---|
| 2539 | License, under the terms defined in section 4 above for modified |
---|
| 2540 | versions, provided that you include in the combination all of the |
---|
| 2541 | Invariant Sections of all of the original documents, unmodified, and |
---|
| 2542 | list them all as Invariant Sections of your combined work in its |
---|
| 2543 | license notice, and that you preserve all their Warranty Disclaimers. |
---|
| 2544 | |
---|
| 2545 | The combined work need only contain one copy of this License, and |
---|
| 2546 | multiple identical Invariant Sections may be replaced with a single |
---|
| 2547 | copy. If there are multiple Invariant Sections with the same name but |
---|
| 2548 | different contents, make the title of each such section unique by |
---|
| 2549 | adding at the end of it, in parentheses, the name of the original |
---|
| 2550 | author or publisher of that section if known, or else a unique number. |
---|
| 2551 | Make the same adjustment to the section titles in the list of |
---|
| 2552 | Invariant Sections in the license notice of the combined work. |
---|
| 2553 | |
---|
| 2554 | In the combination, you must combine any sections Entitled ``History'' |
---|
| 2555 | in the various original documents, forming one section Entitled |
---|
| 2556 | ``History''; likewise combine any sections Entitled ``Acknowledgements'', |
---|
| 2557 | and any sections Entitled ``Dedications''. You must delete all |
---|
| 2558 | sections Entitled ``Endorsements.'' |
---|
| 2559 | |
---|
| 2560 | @item |
---|
| 2561 | COLLECTIONS OF DOCUMENTS |
---|
| 2562 | |
---|
| 2563 | You may make a collection consisting of the Document and other documents |
---|
| 2564 | released under this License, and replace the individual copies of this |
---|
| 2565 | License in the various documents with a single copy that is included in |
---|
| 2566 | the collection, provided that you follow the rules of this License for |
---|
| 2567 | verbatim copying of each of the documents in all other respects. |
---|
| 2568 | |
---|
| 2569 | You may extract a single document from such a collection, and distribute |
---|
| 2570 | it individually under this License, provided you insert a copy of this |
---|
| 2571 | License into the extracted document, and follow this License in all |
---|
| 2572 | other respects regarding verbatim copying of that document. |
---|
| 2573 | |
---|
| 2574 | @item |
---|
| 2575 | AGGREGATION WITH INDEPENDENT WORKS |
---|
| 2576 | |
---|
| 2577 | A compilation of the Document or its derivatives with other separate |
---|
| 2578 | and independent documents or works, in or on a volume of a storage or |
---|
| 2579 | distribution medium, is called an ``aggregate'' if the copyright |
---|
| 2580 | resulting from the compilation is not used to limit the legal rights |
---|
| 2581 | of the compilation's users beyond what the individual works permit. |
---|
| 2582 | When the Document is included in an aggregate, this License does not |
---|
| 2583 | apply to the other works in the aggregate which are not themselves |
---|
| 2584 | derivative works of the Document. |
---|
| 2585 | |
---|
| 2586 | If the Cover Text requirement of section 3 is applicable to these |
---|
| 2587 | copies of the Document, then if the Document is less than one half of |
---|
| 2588 | the entire aggregate, the Document's Cover Texts may be placed on |
---|
| 2589 | covers that bracket the Document within the aggregate, or the |
---|
| 2590 | electronic equivalent of covers if the Document is in electronic form. |
---|
| 2591 | Otherwise they must appear on printed covers that bracket the whole |
---|
| 2592 | aggregate. |
---|
| 2593 | |
---|
| 2594 | @item |
---|
| 2595 | TRANSLATION |
---|
| 2596 | |
---|
| 2597 | Translation is considered a kind of modification, so you may |
---|
| 2598 | distribute translations of the Document under the terms of section 4. |
---|
| 2599 | Replacing Invariant Sections with translations requires special |
---|
| 2600 | permission from their copyright holders, but you may include |
---|
| 2601 | translations of some or all Invariant Sections in addition to the |
---|
| 2602 | original versions of these Invariant Sections. You may include a |
---|
| 2603 | translation of this License, and all the license notices in the |
---|
| 2604 | Document, and any Warranty Disclaimers, provided that you also include |
---|
| 2605 | the original English version of this License and the original versions |
---|
| 2606 | of those notices and disclaimers. In case of a disagreement between |
---|
| 2607 | the translation and the original version of this License or a notice |
---|
| 2608 | or disclaimer, the original version will prevail. |
---|
| 2609 | |
---|
| 2610 | If a section in the Document is Entitled ``Acknowledgements'', |
---|
| 2611 | ``Dedications'', or ``History'', the requirement (section 4) to Preserve |
---|
| 2612 | its Title (section 1) will typically require changing the actual |
---|
| 2613 | title. |
---|
| 2614 | |
---|
| 2615 | @item |
---|
| 2616 | TERMINATION |
---|
| 2617 | |
---|
| 2618 | You may not copy, modify, sublicense, or distribute the Document except |
---|
| 2619 | as expressly provided for under this License. Any other attempt to |
---|
| 2620 | copy, modify, sublicense or distribute the Document is void, and will |
---|
| 2621 | automatically terminate your rights under this License. However, |
---|
| 2622 | parties who have received copies, or rights, from you under this |
---|
| 2623 | License will not have their licenses terminated so long as such |
---|
| 2624 | parties remain in full compliance. |
---|
| 2625 | |
---|
| 2626 | @item |
---|
| 2627 | FUTURE REVISIONS OF THIS LICENSE |
---|
| 2628 | |
---|
| 2629 | The Free Software Foundation may publish new, revised versions |
---|
| 2630 | of the GNU Free Documentation License from time to time. Such new |
---|
| 2631 | versions will be similar in spirit to the present version, but may |
---|
| 2632 | differ in detail to address new problems or concerns. See |
---|
| 2633 | @uref{http://www.gnu.org/copyleft/}. |
---|
| 2634 | |
---|
| 2635 | Each version of the License is given a distinguishing version number. |
---|
| 2636 | If the Document specifies that a particular numbered version of this |
---|
| 2637 | License ``or any later version'' applies to it, you have the option of |
---|
| 2638 | following the terms and conditions either of that specified version or |
---|
| 2639 | of any later version that has been published (not as a draft) by the |
---|
| 2640 | Free Software Foundation. If the Document does not specify a version |
---|
| 2641 | number of this License, you may choose any version ever published (not |
---|
| 2642 | as a draft) by the Free Software Foundation. |
---|
| 2643 | @end enumerate |
---|
| 2644 | |
---|
| 2645 | @page |
---|
| 2646 | @heading ADDENDUM: How to use this License for your documents |
---|
| 2647 | |
---|
| 2648 | To use this License in a document you have written, include a copy of |
---|
| 2649 | the License in the document and put the following copyright and |
---|
| 2650 | license notices just after the title page: |
---|
| 2651 | |
---|
| 2652 | @smallexample |
---|
| 2653 | @group |
---|
| 2654 | Copyright (C) @var{year} @var{your name}. |
---|
| 2655 | Permission is granted to copy, distribute and/or modify this document |
---|
| 2656 | under the terms of the GNU Free Documentation License, Version 1.2 |
---|
| 2657 | or any later version published by the Free Software Foundation; |
---|
| 2658 | with no Invariant Sections, no Front-Cover Texts, and no Back-Cover |
---|
| 2659 | Texts. A copy of the license is included in the section entitled ``GNU |
---|
| 2660 | Free Documentation License''. |
---|
| 2661 | @end group |
---|
| 2662 | @end smallexample |
---|
| 2663 | |
---|
| 2664 | If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts, |
---|
| 2665 | replace the ``with@dots{}Texts.'' line with this: |
---|
| 2666 | |
---|
| 2667 | @smallexample |
---|
| 2668 | @group |
---|
| 2669 | with the Invariant Sections being @var{list their titles}, with |
---|
| 2670 | the Front-Cover Texts being @var{list}, and with the Back-Cover Texts |
---|
| 2671 | being @var{list}. |
---|
| 2672 | @end group |
---|
| 2673 | @end smallexample |
---|
| 2674 | |
---|
| 2675 | If you have Invariant Sections without Cover Texts, or some other |
---|
| 2676 | combination of the three, merge those two alternatives to suit the |
---|
| 2677 | situation. |
---|
| 2678 | |
---|
| 2679 | If your document contains nontrivial examples of program code, we |
---|
| 2680 | recommend releasing these examples in parallel under your choice of |
---|
| 2681 | free software license, such as the GNU General Public License, |
---|
| 2682 | to permit their use in free software. |
---|
| 2683 | |
---|
| 2684 | @c Local Variables: |
---|
| 2685 | @c ispell-local-pdict: "ispell-dict" |
---|
| 2686 | @c End: |
---|
| 2687 | |
---|
| 2688 | |
---|
| 2689 | @c --------------------------------------------------------------------- |
---|
| 2690 | @c --------------------------------------------------------------------- |
---|
| 2691 | |
---|
| 2692 | @node Reporting bugs |
---|
| 2693 | @chapter Reporting bugs |
---|
| 2694 | |
---|
| 2695 | Report bugs to @email{obrebski@@amu.edu.pl}. |
---|
| 2696 | |
---|
| 2697 | @c --------------------------------------------------------------------- |
---|
| 2698 | @c --------------------------------------------------------------------- |
---|
| 2699 | |
---|
| 2700 | @c @node Copyright |
---|
| 2701 | @c @chapter Copyright |
---|
| 2702 | @c |
---|
| 2703 | @c Copyright 2004 by Tomasz Obrebski |
---|
| 2704 | @c This software is free for research and educational use. |
---|
| 2705 | |
---|
| 2706 | @c --------------------------------------------------------------------- |
---|
| 2707 | @c --------------------------------------------------------------------- |
---|
| 2708 | |
---|
| 2709 | @node Author |
---|
| 2710 | @chapter Author |
---|
| 2711 | |
---|
| 2712 | |
---|
| 2713 | @bye |
---|