| 1 |  | 
|---|
| 2 | \input texinfo   @c -*-texinfo-*- | 
|---|
| 3 | @c @documentencoding ISO-8859-2 | 
|---|
| 4 | @documentencoding UTF-8 | 
|---|
| 5 | @c @documentlanguage pl | 
|---|
| 6 |  | 
|---|
| 7 | @c %**start of header | 
|---|
| 8 | @setfilename utt.info | 
|---|
| 9 | @settitle UAM Text Tools v0.90 | 
|---|
| 10 | @c %**end of header | 
|---|
| 11 |  | 
|---|
| 12 | @copying | 
|---|
| 13 | This manual is for UAM Text Tools (version 0.90, October, 2008) | 
|---|
| 14 |  | 
|---|
| 15 | Copyright @copyright{}  2005, 2007  Tomasz Obrębski, Michał Stolarski, Justyna Walkowska, Paweł Konieczka. | 
|---|
| 16 |  | 
|---|
| 17 | Permission is granted to copy, distribute and/or modify this document | 
|---|
| 18 | under the terms of the GNU Free Documentation License, Version 1.2 or | 
|---|
| 19 | any later version published by the Free Software Foundation; with no | 
|---|
| 20 | Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.  A | 
|---|
| 21 | copy of the license is included in the section entitled ``GNU Free | 
|---|
| 22 | Documentation License''. | 
|---|
| 23 |  | 
|---|
| 24 | @c @quotation | 
|---|
| 25 | @c Permission is granted to ... | 
|---|
| 26 | @c No permission is granted until the document is completed. | 
|---|
| 27 | @c @end quotation | 
|---|
| 28 | @end copying | 
|---|
| 29 |  | 
|---|
| 30 |  | 
|---|
| 31 | @titlepage | 
|---|
| 32 | @title UAM Text Tools 0.90 - User Manual | 
|---|
| 33 | @subtitle edition 0.01, @today | 
|---|
| 34 | @subtitle status: prescript | 
|---|
| 35 | @author by Justyna Walkowska, Tomasz Obrębski and Michał Stolarski | 
|---|
| 36 | @page | 
|---|
| 37 | @vskip 0pt plus 1filll | 
|---|
| 38 | @insertcopying | 
|---|
| 39 | @end titlepage | 
|---|
| 40 |  | 
|---|
| 41 | @contents | 
|---|
| 42 |  | 
|---|
| 43 | @c @paragraphindent none | 
|---|
| 44 |  | 
|---|
| 45 | @iftex | 
|---|
| 46 | @tex | 
|---|
| 47 | % \usepackage[T1]{fontenc} | 
|---|
| 48 | % \usepackage[utf8]{inputenc} | 
|---|
| 49 | % \usepackage{times} | 
|---|
| 50 | @end tex | 
|---|
| 51 |  | 
|---|
| 52 | @parskip = 0.5@normalbaselineskip plus 3pt minus 1pt | 
|---|
| 53 | @end iftex | 
|---|
| 54 | @c @headings off | 
|---|
| 55 | @c @everyheading LEM(1) @| @| LEM(1) | 
|---|
| 56 | @everyfooting @today @c @| @thispage @| | 
|---|
| 57 |  | 
|---|
| 58 | @ifnottex | 
|---|
| 59 |  | 
|---|
| 60 | @node Top | 
|---|
| 61 | @top UTT - UAM Text Tools | 
|---|
| 62 |  | 
|---|
| 63 | @insertcopying | 
|---|
| 64 |  | 
|---|
| 65 | @menu | 
|---|
| 66 | * General information:: | 
|---|
| 67 | * UTT file format:: | 
|---|
| 68 | * Configuration files:: | 
|---|
| 69 | * UTT components:: | 
|---|
| 70 | * Auxiliary tools:: | 
|---|
| 71 | * Usage examples:: | 
|---|
| 72 | * PMDBF dictionary:: | 
|---|
| 73 | @c * Examples:: | 
|---|
| 74 | @c * Copyright:: | 
|---|
| 75 | * GNU Free Documentation License:: | 
|---|
| 76 | * Reporting bugs:: | 
|---|
| 77 | * Author:: | 
|---|
| 78 | @end menu | 
|---|
| 79 | @end ifnottex | 
|---|
| 80 |  | 
|---|
| 81 |  | 
|---|
| 82 | @c ---------------------------------------------------------------------- | 
|---|
| 83 |  | 
|---|
| 84 | @node General information | 
|---|
| 85 | @chapter General information | 
|---|
| 86 |  | 
|---|
| 87 | UAM Text Tools (UTT) is a package of language processing tools | 
|---|
| 88 | developed at Adam Mickiewicz University. Its functionality includes: | 
|---|
| 89 |  | 
|---|
| 90 | @itemize @bullet | 
|---|
| 91 |  | 
|---|
| 92 | @item | 
|---|
| 93 | tokenization | 
|---|
| 94 | @item | 
|---|
| 95 | dictionary-based morphological analysis | 
|---|
| 96 | @item | 
|---|
| 97 | heuristic morphological analysis of unknown words | 
|---|
| 98 | @item | 
|---|
| 99 | spelling correction | 
|---|
| 100 | @item | 
|---|
| 101 | pattern search | 
|---|
| 102 | @item | 
|---|
| 103 | sentence splitting | 
|---|
| 104 | @item | 
|---|
| 105 | generation of concordance tables | 
|---|
| 106 | @end itemize | 
|---|
| 107 |  | 
|---|
| 108 | The toolkit is intended for processing raw (not annotated) | 
|---|
| 109 | unrestricted text for any conceivable purpose. | 
|---|
| 110 |  | 
|---|
| 111 | The system is organized as a collection of command-line programs, each | 
|---|
| 112 | performing one operation, e.g. tokenization, lemmatization, spelling | 
|---|
| 113 | correction. The components are independent of one another, the | 
|---|
| 114 | unifying element being the uniform i/o file format. | 
|---|
| 115 |  | 
|---|
| 116 | The components may be combined in various ways to provide various text | 
|---|
| 117 | processing services. Also new components supplied by the user may be | 
|---|
| 118 | easily incorporated into the system provided that they respect the i/o | 
|---|
| 119 | file format conventions. | 
|---|
| 120 |  | 
|---|
| 121 | UTT component programs do not depend on any specific tagset or | 
|---|
| 122 | morphological description format. | 
|---|
| 123 |  | 
|---|
| 124 | UTT is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by | 
|---|
| 125 | the Free Software Foundation, either version 3 of the License, or (at your option) any later version. | 
|---|
| 126 |  | 
|---|
| 127 | The Polex/PMDBF dictionary is licensed under the Creative Commons by-nc-sa License which prohibits commercial use. | 
|---|
| 128 |  | 
|---|
| 129 |  | 
|---|
| 130 | List of contributors: | 
|---|
| 131 |  | 
|---|
| 132 | @itemize | 
|---|
| 133 | @item Paweł Konieczka | 
|---|
| 134 | @item Tomasz Obrębski | 
|---|
| 135 | @item Michał Stolarski | 
|---|
| 136 | @item Marcin Walas | 
|---|
| 137 | @item Justyna Walkowska | 
|---|
| 138 | @item Paweł Wereński | 
|---|
| 139 | @end itemize | 
|---|
| 140 |  | 
|---|
| 141 | @c ---------------------------------------------------------------------- | 
|---|
| 142 | @c --------------------------------------------------------------------- | 
|---|
| 143 |  | 
|---|
| 144 | @node    UTT file format | 
|---|
| 145 | @chapter UTT file format | 
|---|
| 146 |  | 
|---|
| 147 | A UTT file contains annotation of a text. It consists of a sequence of | 
|---|
| 148 | segments. Each segment explicitly refers to a continuous piece of the | 
|---|
| 149 | text and provides some information on it. | 
|---|
| 150 |  | 
|---|
| 151 | @section Segment format | 
|---|
| 152 |  | 
|---|
| 153 | A segment occupies one line of a UTT file and consists of | 
|---|
| 154 | space-separated fields: | 
|---|
| 155 |  | 
|---|
| 156 |  | 
|---|
| 157 | @quotation | 
|---|
| 158 | @sp 1 | 
|---|
| 159 | [@var{start} [@var{length}]] @var{type} @var{form} [@var{annotation1} [@var{annotation2} ...]] | 
|---|
| 160 | @sp 1 | 
|---|
| 161 | @end quotation | 
|---|
| 162 |  | 
|---|
| 163 | @table @var | 
|---|
| 164 |  | 
|---|
| 165 | @item @var{start} | 
|---|
| 166 | Non-negative integer value indicating the position in the source text where the | 
|---|
| 167 | segment starts. | 
|---|
| 168 |  | 
|---|
| 169 | @item @var{length} | 
|---|
| 170 | Non-negative integer value indicating the length of the segment. | 
|---|
| 171 |  | 
|---|
| 172 | @item @var{type} | 
|---|
| 173 | A sequence of non-blank characters containing no digits (digits could lead to @var{type} being misinterpreted as a @var{start} or @var{length} field). | 
|---|
| 174 | @var{type} reflects the main classification of segments - | 
|---|
| 175 | into words, numbers, punctuation marks, meta-text markers. | 
|---|
| 176 | @xref{tok output,,tok output}, for description of automatically recognized type markers. | 
|---|
| 177 |  | 
|---|
| 178 | @item @var{form} | 
|---|
| 179 | This field contains the textual form of the segment or the special | 
|---|
| 180 | symbol @code{*} indicating that the form is not given (e.g. when the segment has been created artificially to mark something and is of length 0). | 
|---|
| 181 |  | 
|---|
| 182 | The characters or character sequences that have special meaning in the | 
|---|
| 183 | @var{form} field are enumerated below. | 
|---|
| 184 |  | 
|---|
| 185 | Characters with special meaning: | 
|---|
| 186 |  | 
|---|
| 187 | @itemize | 
|---|
| 188 | @item @code{_} - space character | 
|---|
| 189 | @item @code{*} - undefined contents | 
|---|
| 190 | @end itemize | 
|---|
| 191 |  | 
|---|
| 192 | Escape sequences: | 
|---|
| 193 |  | 
|---|
| 194 | @itemize | 
|---|
| 195 | @item @code{\n} - new line | 
|---|
| 196 | @item @code{\t} - tabulation | 
|---|
| 197 | @item @code{\r} - carriage return | 
|---|
| 198 |  | 
|---|
| 199 | @item @code{\_} - the @code{_} character | 
|---|
| 200 | @item @code{\*} - the @code{*} character | 
|---|
| 201 | @item @code{\\} - the @code{\} character | 
|---|
| 202 |  | 
|---|
| 203 | @c @item @code{\hh} - a character with hexadecimal code @code{hh} (used for non-printable characters) | 
|---|
| 204 | @end itemize | 
|---|
| 205 |  | 
|---|
| 206 | @item @var{annotation1} | 
|---|
| 207 | @item @var{annotation2} | 
|---|
| 208 | @item ... | 
|---|
| 209 | Annotation fields have the following format: | 
|---|
| 210 |  | 
|---|
| 211 | @var{longname} @code{:} @var{value} | 
|---|
| 212 |  | 
|---|
| 213 | or | 
|---|
| 214 |  | 
|---|
| 215 | @var{shortname} @var{value} | 
|---|
| 216 |  | 
|---|
| 217 | where @var{longname} is a string of alphanumeric characters | 
|---|
| 218 | (isalnum() test), @var{shortname} - a single non-alphanumeric character | 
|---|
| 219 | (ispunct() test), and @var{value} is an arbitrary string of non-blank characters. | 
|---|
| 220 |  | 
|---|
| 221 | @end table | 
|---|
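The special characters and escape sequences of the @var{form} field can be decoded mechanically. The following Python sketch is illustrative only and is not part of UTT; the function name @code{decode_form} is hypothetical, and keeping unknown escapes verbatim is an assumption, since the manual does not say how they are treated.

```python
# Escape sequences of the form field, as enumerated in the table above.
_ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "_": "_", "*": "*", "\\": "\\"}


def decode_form(form):
    """Return the text encoded by a form field, or None for a bare '*'."""
    if form == "*":
        return None  # the form is not given
    out = []
    i = 0
    while i < len(form):
        ch = form[i]
        if ch == "\\" and i + 1 < len(form):
            nxt = form[i + 1]
            # Assumption: an unknown escape is kept verbatim.
            out.append(_ESCAPES.get(nxt, "\\" + nxt))
            i += 2
        elif ch == "_":
            out.append(" ")  # an unescaped underscore stands for a space
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For instance, the form @code{_} decodes to a single space, while @code{\_} decodes to a literal underscore.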
| 222 |  | 
|---|
| 223 |  | 
|---|
| 224 | Only two fields are mandatory: @var{type} and @var{form}. All other fields | 
|---|
| 225 | may be absent. In the case when only one number precedes the | 
|---|
| 226 | @var{type} field, it is interpreted as the @var{start} position. | 
|---|
| 227 |  | 
|---|
| 228 | If the @var{length} field is omitted, the length of the segment is the | 
|---|
| 229 | length of the @var{form} field, except when the value of the | 
|---|
| 230 | @var{form} field is @code{*} -- in this case, the length is assumed to | 
|---|
| 231 | be 0. | 
|---|
| 232 |  | 
|---|
| 233 | If the @var{start} field is also absent, the segment is assumed to directly | 
|---|
| 234 | follow the preceding one. | 
|---|
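The field layout described above can be split with a few lines of code. The Python sketch below is an illustration under stated assumptions, not the parser used by the UTT components: the names @code{Segment}, @code{parse_segment} and @code{split_annotation} are hypothetical, and Python's @code{str.isalnum} and @code{string.punctuation} only approximate the C @code{isalnum()} and @code{ispunct()} tests.

```python
import re
import string
from typing import NamedTuple, Optional

_NUMBER = re.compile(r"[0-9]+")


class Segment(NamedTuple):
    start: Optional[int]   # None when the field is absent
    length: Optional[int]  # None when the field is absent
    type: str
    form: str
    annotations: list


def parse_segment(line):
    """Split one UTT line into its fields."""
    fields = line.split()
    numbers = []
    # At most two leading numeric fields: start, then length.
    while fields and len(numbers) < 2 and _NUMBER.fullmatch(fields[0]):
        numbers.append(int(fields.pop(0)))
    if len(fields) < 2:
        raise ValueError("type and form are mandatory: %r" % line)
    start = numbers[0] if numbers else None
    length = numbers[1] if len(numbers) == 2 else None
    return Segment(start, length, fields[0], fields[1], fields[2:])


def split_annotation(field):
    """Split an annotation field into (name, value)."""
    name, sep, value = field.partition(":")
    if sep and name.isalnum():
        return name, value            # longname:value
    if field and field[0] in string.punctuation:
        return field[0], field[1:]    # shortname followed by value
    raise ValueError("not an annotation field: %r" % field)
```

A single leading number is treated as @var{start}, in line with the rule stated above.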
| 235 |  | 
|---|
| 236 | @c Conventions: | 
|---|
| 237 |  | 
|---|
| 238 | @c Annotation fields with predefined meaning: | 
|---|
| 239 |  | 
|---|
| 240 | @c @itemize | 
|---|
| 241 | @c @item @code{!} - UTT components are allowed to modify the contents of | 
|---|
| 242 | @c the @var{form} field (e.g. spelling correction does this). If this happens the | 
|---|
| 243 | @c original form of the segment have to be placed in the @code{!}-field. | 
|---|
| 244 | @c @item @code{@@} - morphological description | 
|---|
| 245 | @c @item @code{=} - node identifier assignment (used in graph encoding) | 
|---|
| 246 | @c @item @code{<} - preceding/dominating node(s) (used in graph encoding) | 
|---|
| 247 | @c @item @code{>} - succeeding/subordinate node(s) (used in graph encoding) | 
|---|
| 248 | @c @end itemize | 
|---|
| 249 |  | 
|---|
| 250 | Segments of length 0 may be used to mark file positions with some | 
|---|
| 251 | information. See e.g. BOS and EOS (beginning/end of sentence) markers | 
|---|
| 252 | in the example below. | 
|---|
| 253 |  | 
|---|
| 254 | Example: | 
|---|
| 255 |  | 
|---|
| 256 | sentence: @samp{Piszemy dobre progrumy.} | 
|---|
| 257 |  | 
|---|
| 258 | @example | 
|---|
| 259 | 0000 00 BOS * | 
|---|
| 260 | 0000 07 W Piszemy lem:pisać,V | 
|---|
| 261 | 0007 01 S _ | 
|---|
| 262 | 0008 05 W dobre lem:dobry,ADJ | 
|---|
| 263 | 0013 01 S _ | 
|---|
| 264 | 0014 08 W progrumy cor:programy lem:program,N | 
|---|
| 265 | 0022 01 P . | 
|---|
| 266 | 0023 00 EOS * | 
|---|
| 267 | 0023 01 S _ | 
|---|
| 268 | 0024 00 BOS * | 
|---|
| 269 | 0024 11 W Warszawiacy lem:Warszawiak,N | 
|---|
| 270 | 0035 01 S _ | 
|---|
| 271 | 0036 03 W też | 
|---|
| 272 | 0039 01 P . | 
|---|
| 273 | 0040 00 EOS * | 
|---|
| 274 |  | 
|---|
| 275 | @end example | 
|---|
| 276 |  | 
|---|
| 277 | @example | 
|---|
| 278 | 0000 BOS * | 
|---|
| 279 | 0000 W Piszemy lem:pisać,V | 
|---|
| 280 | 0007 S _ | 
|---|
| 281 | 0008 W dobre lem:dobry,ADJ | 
|---|
| 282 | 0013 S _ | 
|---|
| 283 | 0014 W progrumy cor:programy lem:program,N | 
|---|
| 284 | 0022 P . | 
|---|
| 285 | 0023 EOS * | 
|---|
| 286 | @end example | 
|---|
| 287 |  | 
|---|
| 288 | Position information may be provided only for some types of segments: | 
|---|
| 289 |  | 
|---|
| 290 | @example | 
|---|
| 291 | 0000 BOS * | 
|---|
| 292 | W Piszemy lem:pisać,V | 
|---|
| 293 | S _ | 
|---|
| 294 | W dobre lem:dobry,ADJ | 
|---|
| 295 | S _ | 
|---|
| 296 | W progrumy cor:programy lem:program,N | 
|---|
| 297 | P . | 
|---|
| 298 | EOS * | 
|---|
| 299 | S _ | 
|---|
| 300 | 0024 BOS * | 
|---|
| 301 | W Warszawiacy lem:Warszawiak,N | 
|---|
| 302 | S _ | 
|---|
| 303 | W też | 
|---|
| 304 | P . | 
|---|
| 305 | EOS * | 
|---|
| 306 | @end example | 
|---|
| 307 |  | 
|---|
| 308 | Position/length information may be provided only when necessary: | 
|---|
| 309 |  | 
|---|
| 310 | @example | 
|---|
| 311 | 0000 04 N * | 
|---|
| 312 | 0000 N 12 | 
|---|
| 313 | P . | 
|---|
| 314 | N 5 | 
|---|
| 315 | S _ | 
|---|
| 316 | W km | 
|---|
| 317 | @end example | 
|---|
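The defaulting rules for omitted @var{start} and @var{length} fields can be applied in a single pass over a file. The Python sketch below is an illustration only: the names @code{form_length} and @code{resolve_positions} are hypothetical, counting each escape sequence as one character is an assumption, and so is starting at position 0 when the first segment has no @var{start} field.

```python
import re

_NUMBER = re.compile(r"[0-9]+")


def form_length(form):
    """Length implied by a form field when no length field is given."""
    if form == "*":
        return 0
    # Assumption: a backslash escape such as \n or \_ counts as one character.
    return len(re.sub(r"\\.", "x", form))


def resolve_positions(lines):
    """Return (start, length, type, form) tuples with all defaults filled in."""
    resolved = []
    next_start = 0  # assumed position of a first segment without a start field
    for line in lines:
        fields = line.split()
        numbers = []
        while fields and len(numbers) < 2 and _NUMBER.fullmatch(fields[0]):
            numbers.append(int(fields.pop(0)))
        seg_type, form = fields[0], fields[1]
        start = numbers[0] if numbers else next_start
        length = numbers[1] if len(numbers) == 2 else form_length(form)
        resolved.append((start, length, seg_type, form))
        next_start = start + length  # the next segment follows directly
    return resolved
```

Applied to the last example above, the sketch assigns position 2 to the @code{P} segment, 3 to @code{N 5}, 4 to the space and 5 to @code{km}.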
| 318 |  | 
|---|
| 319 | @section UTT File | 
|---|
| 320 |  | 
|---|
| 321 | A UTT file consists of a sequence of segments.  The same text position | 
|---|
| 322 | may be covered by multiple segments. In consequence, ambiguous text | 
|---|
| 323 | segmentation and ambiguous annotation may be represented. | 
|---|
| 324 |  | 
|---|
| 325 | There are two structural requirements a valid UTT-formatted file | 
|---|
| 326 | has to meet: | 
|---|
| 327 |  | 
|---|
| 328 | @itemize @bullet | 
|---|
| 329 |  | 
|---|
| 330 | @item | 
|---|
| 331 | segments have to be sorted with respect to the @var{start} field, | 
|---|
| 332 |  | 
|---|
| 333 | @item | 
|---|
| 334 | for each | 
|---|
| 335 | segment ending at position @var{n}, either there must be a segment starting at | 
|---|
| 336 | position @var{n+1}, or position @var{n+1} is not covered by any segment; similarly | 
|---|
| 337 | for each segment starting at position @var{n}, either there must be a segment | 
|---|
| 338 | ending at position @var{n-1}, or the position @var{n-1} must not be covered | 
|---|
| 339 | by any segment. | 
|---|
| 340 |  | 
|---|
| 341 | @end itemize | 
|---|
| 342 |  | 
|---|
| 343 | A valid annotation for the text fragment | 
|---|
| 344 | @example | 
|---|
| 345 | 12.5 km | 
|---|
| 346 | @end example | 
|---|
| 347 |  | 
|---|
| 348 | may be | 
|---|
| 349 |  | 
|---|
| 350 | @example | 
|---|
| 351 | 0000 02 N 12 | 
|---|
| 352 | 0000 04 N 12.5 | 
|---|
| 353 | 0002 01 P . | 
|---|
| 354 | 0003 01 N 5 | 
|---|
| 355 | 0004 01 S _ | 
|---|
| 356 | 0005 02 W km | 
|---|
| 357 | @end example | 
|---|
| 358 |  | 
|---|
| 359 | but not | 
|---|
| 360 |  | 
|---|
| 361 | @example | 
|---|
| 362 | 0000 02 N 12 | 
|---|
| 363 | 0000 04 N 12.5 | 
|---|
| 364 | 0004 01 S _ | 
|---|
| 365 | 0005 02 W km | 
|---|
| 366 | @end example | 
|---|
| 367 |  | 
|---|
| 368 | because in the latter example the first segment (starting at position | 
|---|
| 369 | 0000, 2 characters long) ends at position @var{n}=0001, while position | 
|---|
| 370 | @var{n+1}=0002 is covered by the second segment and no segment starts | 
|---|
| 371 | there. | 
|---|
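The two structural requirements can be checked programmatically. The Python sketch below is an illustration, not a UTT tool: the function name @code{is_valid_utt} is hypothetical, segments are given as (@var{start}, @var{length}) pairs with all defaults already resolved, and zero-length segments are ignored in the second test, which is an assumption the manual does not spell out.

```python
def is_valid_utt(segments):
    """Check the two structural requirements on (start, length) pairs."""
    starts = [start for start, _ in segments]
    if starts != sorted(starts):
        return False  # requirement 1: sorted by start position

    # Half-open spans [begin, end); zero-length markers cover nothing
    # and are skipped here (an assumption of this sketch).
    spans = [(s, s + n) for s, n in segments if n > 0]
    begins = {b for b, _ in spans}
    ends = {e for _, e in spans}

    def covered(pos):
        return any(b <= pos < e for b, e in spans)

    for b, e in spans:
        # The position right after the segment must either start a
        # segment or be covered by none.
        if e not in begins and covered(e):
            return False
        # The position right before the segment must either end a
        # segment or be covered by none.
        if b not in ends and covered(b - 1):
            return False
    return True
```

With the two annotations of @samp{12.5 km} shown above, the first passes the check and the second fails it, because position 0002 is covered but no segment starts there.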
| 372 |  | 
|---|
| 373 |  | 
|---|
| 374 | @section Flattened UTT file | 
|---|
| 375 |  | 
|---|
| 376 | The UTT file format has two variants: regular and flattened. The regular | 
|---|
| 377 | format was described above.  In the flattened format some of the | 
|---|
| 378 | end-of-line characters are replaced with line-feed characters. | 
|---|
| 379 |  | 
|---|
| 380 | The flattened format is mainly used to represent whole sentences as | 
|---|
| 381 | single lines of the input file (all intrasentential end-of-line | 
|---|
| 382 | characters are replaced with line-feed characters). | 
|---|
| 383 |  | 
|---|
| 384 | This technical trick makes it possible to perform certain text | 
|---|
| 385 | processing operations on entire sentences with the use of such tools as | 
|---|
| 386 | @command{grep} (see @command{grp} component) or @command{sed} (see  @command{mar} component). | 
|---|
| 387 |  | 
|---|
| 388 | The conversion between the two formats is performed by the tools: | 
|---|
| 389 | @command{fla} and @command{unfla}. | 
|---|
| 390 |  | 
|---|
| 391 | @section Character encoding | 
|---|
| 392 |  | 
|---|
| 393 | The UTT component programs accept only 1-byte character encodings, such | 
|---|
| 394 | as ISO, ANSI, DOS. | 
|---|
| 395 |  | 
|---|
| 396 |  | 
|---|
| 397 | @c @section Formats | 
|---|
| 398 |  | 
|---|
| 399 | @c @unnumberedsubsubsec Basic format | 
|---|
| 400 |  | 
|---|
| 401 | @c While processing large amounts of the overhead related with explicit | 
|---|
| 402 | @c ... of the start position and segment length becomes ... . Therefore, | 
|---|
| 403 | @c for efficiency reasons certain shortcuts are possible: | 
|---|
| 404 |  | 
|---|
| 405 | @c @unnumberedsubsubsec Relative start position | 
|---|
| 406 |  | 
|---|
| 407 | @c Start position may be given as relative distance from the last | 
|---|
| 408 | @c absolut position. | 
|---|
| 409 |  | 
|---|
| 410 | @c @unnumberedsubsubsec Absent length | 
|---|
| 411 |  | 
|---|
| 412 | @c Segment length may by omitted. Normally it can be restored by counting | 
|---|
| 413 | @c the length of the @emph{form field}. For segments with the special value | 
|---|
| 414 | @c @code{*} in the @emph{form field} length 0 is assumed. | 
|---|
| 415 |  | 
|---|
| 416 | @c @unnumberedsubsubsec Absent length and start position | 
|---|
| 417 |  | 
|---|
| 418 | @c Both start position and segment length may be omitted. In this format | 
|---|
| 419 | @c each segment is assumed to follow the previous one. This format is, | 
|---|
| 420 | @c therefore, suitable only for unambiguously tagged text | 
|---|
| 421 | @c (0-length markers can be still used.) | 
|---|
| 422 |  | 
|---|
| 423 |  | 
|---|
| 424 | @c @table @code | 
|---|
| 425 | @c @item AL | 
|---|
| 426 | @c @code{1234 03 W kot} | 
|---|
| 427 | @c @item RL | 
|---|
| 428 | @c @code{+56 03 W kot} | 
|---|
| 429 | @c @item A | 
|---|
| 430 | @c @code{1234 W kot} | 
|---|
| 431 | @c @item R | 
|---|
| 432 | @c @code{+56 W kot} | 
|---|
| 433 | @c @item 0 | 
|---|
| 434 | @c @code{W kot} | 
|---|
| 435 | @c @end table | 
|---|
| 436 |  | 
|---|
| 437 |  | 
|---|
| 438 | @c [HOW TO GET POLISH FONTS IN DVI???] | 
|---|
| 439 |  | 
|---|
| 440 | @macro parhelp | 
|---|
| 441 | @item @b{@minus{}@minus{}help}, @b{@minus{}h} | 
|---|
| 442 | Print help. | 
|---|
| 443 | @end macro | 
|---|
| 444 |  | 
|---|
| 445 |  | 
|---|
| 446 | @macro parversion | 
|---|
| 447 | @item @b{@minus{}@minus{}version}, @b{@minus{}V} | 
|---|
| 448 | Print version information. | 
|---|
| 449 | @end macro | 
|---|
| 450 |  | 
|---|
| 451 | @macro parinteractive | 
|---|
| 452 | @item @b{@minus{}@minus{}interactive, @minus{}i} | 
|---|
| 453 | This option toggles interactive mode, which is by default off. In the | 
|---|
| 454 | interactive mode the program does not buffer the output. | 
|---|
| 455 | @end macro | 
|---|
| 456 |  | 
|---|
| 457 |  | 
|---|
| 458 | @c @macro parfile | 
|---|
| 459 | @c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}} | 
|---|
| 460 | @c Input file name. | 
|---|
| 461 | @c If this option is absent or equal to '@minus{}', the program | 
|---|
| 462 | @c reads from the standard input. | 
|---|
| 463 | @c @end macro | 
|---|
| 464 |  | 
|---|
| 465 |  | 
|---|
| 466 | @c @macro paroutput | 
|---|
| 467 | @c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}} | 
|---|
| 468 | @c Regular output file name. To regular output the program sends segments | 
|---|
| 469 | @c which it successfully processed and copies those which were not | 
|---|
| 470 | @c subject to processing. If this option is absent or equal to | 
|---|
| 471 | @c '@minus{}', standard output is used. | 
|---|
| 472 | @c @end macro | 
|---|
| 473 |  | 
|---|
| 474 | @c @macro parfail | 
|---|
| 475 | @c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}} | 
|---|
| 476 | @c Fail output file name. To fail output the program copies the segments | 
|---|
| 477 | @c it failed to process.  If this option is absent or equal to | 
|---|
| 478 | @c '@minus{}', standard output is used. | 
|---|
| 479 | @c @end macro | 
|---|
| 480 |  | 
|---|
| 481 |  | 
|---|
| 482 | @c @macro parcopy | 
|---|
| 483 | @c @item @b{@minus{}@minus{}copy, @minus{}c} | 
|---|
| 484 | @c Copy succesfully processed segments to regular output also in their | 
|---|
| 485 | @c original input form. | 
|---|
| 486 | @c @end macro | 
|---|
| 487 |  | 
|---|
| 488 |  | 
|---|
| 489 | @macro parinputfield | 
|---|
| 490 | @item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}} | 
|---|
| 491 | The field containing the input to the program. The default is the | 
|---|
| 492 | @var{form} field. The fields @var{position}, @var{length}, @var{type}, | 
|---|
| 493 | and @var{form} are referred to as @code{1}, @code{2}, @code{3}, | 
|---|
| 494 | @code{4}, respectively. | 
|---|
| 495 | @end macro | 
|---|
| 496 |  | 
|---|
| 497 |  | 
|---|
| 498 | @macro paroutputfield | 
|---|
| 499 | @item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}} | 
|---|
| 500 | The name of the field added by the program. The default is the name of the program. | 
|---|
| 501 | @end macro | 
|---|
| 502 |  | 
|---|
| 503 |  | 
|---|
| 504 | @macro pardictionary | 
|---|
| 505 | @item @b{@minus{}@minus{}dictionary=@var{filename}, @minus{}d @var{filename}} | 
|---|
| 506 | Dictionary file name. | 
|---|
| 507 | @end macro | 
|---|
| 508 |  | 
|---|
| 509 |  | 
|---|
| 510 | @macro parprocess | 
|---|
| 511 | @item @b{@minus{}@minus{}process=@var{type}, @minus{}p @var{type}} | 
|---|
| 512 | Process segments with the specified value in the @var{type} field. | 
|---|
| 513 | Multiple occurrences of this option are allowed and are interpreted as | 
|---|
| 514 | disjunction. If this option is absent, all segments are processed. | 
|---|
| 515 | @end macro | 
|---|
| 516 |  | 
|---|
| 517 |  | 
|---|
| 518 | @macro parselect | 
|---|
| 519 | @item @b{@minus{}@minus{}select=@var{fieldname}, @minus{}s @var{fieldname}} | 
|---|
| 520 | Select for processing only segments in which the field named | 
|---|
| 521 | @var{fieldname} is present. Multiple occurrences of this option are | 
|---|
| 522 | allowed and are interpreted as conjunction of conditions. If this | 
|---|
| 523 | option is absent, all segments are processed. | 
|---|
| 524 | @end macro | 
|---|
| 525 |  | 
|---|
| 526 |  | 
|---|
| 527 | @macro parunselect | 
|---|
| 528 | @item @b{@minus{}@minus{}unselect=@var{fieldname}, @minus{}S @var{fieldname}} | 
|---|
| 529 | Select for processing only segments in which the field @var{fieldname} | 
|---|
| 530 | is absent.  Multiple occurrences of this option are allowed and are | 
|---|
| 531 | interpreted as conjunction of conditions. If this option is absent, | 
|---|
| 532 | all segments are processed. | 
|---|
| 533 | @end macro | 
|---|
| 534 |  | 
|---|
| 535 |  | 
|---|
| 536 | @macro paroneline | 
|---|
| 537 | @item @b{@minus{}@minus{}one-line} | 
|---|
| 538 | This option makes the program print ambiguous annotation in one output | 
|---|
| 539 | line by generating multiple annotation fields. By default when | 
|---|
| 540 | ambiguous annotation may be produced for a segment, the segment is | 
|---|
| 541 | duplicated and each of the annotations is added to a separate copy of | 
|---|
| 542 | the segment. | 
|---|
| 543 | @end macro | 
|---|
| 544 |  | 
|---|
| 545 |  | 
|---|
| 546 | @macro paronefield | 
|---|
| 547 | @item @b{@minus{}@minus{}one-field, @minus{}1} | 
|---|
| 548 | This option makes the program print ambiguous annotation in one | 
|---|
| 549 | annotation field. By default when ambiguous annotation may be produced | 
|---|
| 550 | for a segment, the segment is duplicated and each of the | 
|---|
| 551 | annotations is added to a separate copy of the segment. | 
|---|
| 552 |  | 
|---|
| 553 | This option is useful when working with @command{kot} or @command{con}. | 
|---|
| 554 | @end macro | 
|---|
| 555 |  | 
|---|
| 556 |  | 
|---|
| 557 | @c --------------------------------------------------------------------- | 
|---|
| 558 | @c CONFIGURATION FILES | 
|---|
| 559 | @c --------------------------------------------------------------------- | 
|---|
| 560 |  | 
|---|
| 561 | @node    Configuration files | 
|---|
| 562 | @chapter Configuration files | 
|---|
| 563 |  | 
|---|
| 564 | Values for all command line options accepted by a component | 
|---|
| 565 | may be set in configuration files. The default locations of the | 
|---|
| 566 | configuration files for a component named @command{@var{program}} are | 
|---|
| 567 |  | 
|---|
| 568 | @example | 
|---|
| 569 | @file{/usr/local/etc/utt/@var{program}.conf} | 
|---|
| 570 | @end example | 
|---|
| 571 |  | 
|---|
| 572 | for the system-wide configuration file and | 
|---|
| 573 |  | 
|---|
| 574 | @example | 
|---|
| 575 | @file{~/.utt/@var{program}.conf} | 
|---|
| 576 | @end example | 
|---|
| 577 |  | 
|---|
| 578 | for the user configuration file. | 
|---|
| 579 |  | 
|---|
| 580 | @c The configuration file to load may be also specified with the | 
|---|
| 581 | @c @option{--config} option. Configuration file need not be provided. | 
|---|
| 582 |  | 
|---|
| 583 | For each option, the value is set according to the following priority: | 
|---|
| 584 |  | 
|---|
| 585 | @itemize | 
|---|
| 586 | @item command line | 
|---|
| 587 | @c @item configuration file indicated with @option{--config} option | 
|---|
| 588 | @item user configuration file (or configuration file indicated with the @option{--config} option) | 
|---|
| 589 | @item system-wide configuration file | 
|---|
| 590 | @end itemize | 
|---|
| 591 |  | 
|---|
| 592 | Parameter values are specified in the following format: | 
|---|
| 593 |  | 
|---|
| 594 | @var{parametername}=@var{value} | 
|---|
| 595 |  | 
|---|
| 596 | where @var{parametername} is the short or long name of an option accepted by | 
|---|
| 597 | the program, or | 
|---|
| 598 |  | 
|---|
| 599 | @var{parametername} | 
|---|
| 600 |  | 
|---|
| 601 | if the option does not need arguments. | 
|---|
| 602 |  | 
|---|
| 603 | You can introduce comments to configuration files using the # sign. | 
|---|
| 604 |  | 
|---|
| 605 | If a program accepts multiple occurrences of an option (e.g. @command{lem}'s @option{--select} option), you can specify them on separate lines of the program's configuration file. | 
|---|
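The file format and the priority order described above can be summarized in code. The Python sketch below is illustrative and does not reproduce the option handling of the UTT programs: the names @code{parse_conf} and @code{effective_options} are hypothetical. It assumes that @code{#} never occurs inside a value, that an option without an argument is recorded as @code{True}, and that an option set at a higher-priority level replaces all its occurrences from lower levels.

```python
def parse_conf(text):
    """Parse configuration file text into {option name: [values]}."""
    options = {}
    for raw in text.splitlines():
        # Drop comments; assumes '#' never occurs inside a value.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, value = line.partition("=")
        # An option without '=' takes no argument; record it as True.
        entry = value.strip() if sep else True
        # Repeated options accumulate, one list item per line.
        options.setdefault(name.strip(), []).append(entry)
    return options


def effective_options(command_line, user_conf, system_conf):
    """Merge option dictionaries, highest priority first in the arguments."""
    merged = dict(system_conf)   # lowest priority
    merged.update(user_conf)     # overrides system-wide values
    merged.update(command_line)  # highest priority
    return merged
```

For example, a user file containing the two lines @code{select=cor} and @code{select=lem} yields the list of both values under the key @code{select}.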
| 606 |  | 
|---|
| 607 | @c The equal sign may be omitted. | 
|---|
| 608 |  | 
|---|
| 609 |  | 
|---|
| 610 | @quotation Tip | 
|---|
| 611 | If you have two (or more) frequently used sets of options for the same | 
|---|
| 612 | program (e.g. lem with PMDBF dictionary and lem with a user dictionary), | 
|---|
| 613 | a good solution is to create two soft links to lem, called | 
|---|
| 614 | e.g. lemg and lemu, and specify their configuration in files lemg.conf | 
|---|
| 615 | and lemu.conf respectively. | 
|---|
| 616 | @end quotation | 
|---|
| 617 |  | 
|---|
| 618 | @c --------------------------------------------------------------------- | 
|---|
| 619 | @c COMPONENTS | 
|---|
| 620 | @c --------------------------------------------------------------------- | 
|---|
| 621 |  | 
|---|
| 622 | @node UTT components | 
|---|
| 623 | @chapter UTT components | 
|---|
| 624 |  | 
|---|
| 625 | UTT components are of three types: | 
|---|
| 626 |  | 
|---|
| 627 | @menu | 
|---|
| 628 | Sources: programs which read non-UTT data (e.g. raw text) and produce output | 
|---|
| 629 | in UTT format | 
|---|
| 630 | * tok::         a tokenizer | 
|---|
| 631 |  | 
|---|
| 632 | Filters: programs which read and produce UTT-formatted data | 
|---|
| 633 | * lem::         a morphological analyzer | 
|---|
| 634 | * gue::         a morphological guesser | 
|---|
| 635 | * cor::         a simple spelling corrector | 
|---|
| 636 | * kor::         a more elaborate spelling corrector | 
|---|
| 637 | * sen::         a sentencizer | 
|---|
| 638 | * ser::         a pattern search tool (marks matches) | 
|---|
| 639 | * mar::         a pattern search tool (introduces arbitrary markers into the text) | 
|---|
| 640 | * grp::         a pattern search tool (selects sentences containing a match) | 
|---|
| 641 | @c * gph::         a word-graph annotation tool:: | 
|---|
| 642 | @c * dgp::         a dependency parser | 
|---|
| 643 |  | 
|---|
| 644 | Sinks: programs which read UTT data and produce output in another format | 
|---|
| 645 | * kot::         an untokenizer | 
|---|
| 646 | * con::         a concordance table generator | 
|---|
| 647 | @end menu | 
|---|
| 648 |  | 
|---|
| 649 | @c --------------------------------------------------------------------- | 
|---|
| 650 | @c TOK | 
|---|
| 651 | @c --------------------------------------------------------------------- | 
|---|
| 652 |  | 
|---|
| 653 | @page | 
|---|
| 654 | @node tok | 
|---|
| 655 | @section tok - a tokenizer | 
|---|
| 656 |  | 
|---|
| 657 | @c ---------------------------------------- | 
|---|
| 658 |  | 
|---|
| 659 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} | 
|---|
| 660 | @item @strong{Authors:}                 @tab Tomasz Obrębski | 
|---|
| 661 | @item @strong{Component category:}      @tab source | 
|---|
| 662 | @item @strong{Input format:}            @tab raw text file | 
|---|
| 663 | @item @strong{Output format:}           @tab UTT regular | 
|---|
| 664 | @item @strong{Required annotation:}     @tab - | 
|---|
| 665 | @end multitable | 
|---|
| 666 |  | 
|---|
| 667 |  | 
|---|
| 668 | @menu | 
|---|
| 669 | * tok description:: | 
|---|
| 670 | * tok input:: | 
|---|
| 671 | * tok output:: | 
|---|
| 672 | * tok command line options:: | 
|---|
| 673 | * tok example:: | 
|---|
| 674 | @end menu | 
|---|
| 675 |  | 
|---|
| 676 | @node tok description | 
|---|
| 677 | @subsection Description | 
|---|
| 678 |  | 
|---|
| 679 | @code{tok} is a simple program which reads a text file and identifies | 
|---|
| 680 | tokens on the basis of their orthographic form.  The type of the token | 
|---|
| 681 | is printed as the @var{type} field. | 
|---|
| 682 |  | 
|---|
| 683 | @node tok input | 
|---|
| 684 | @subsection Input | 
|---|
| 685 |  | 
|---|
| 686 | Raw text. | 
|---|
| 687 |  | 
|---|
| 688 | @node tok output | 
|---|
| 689 | @subsection Output | 
|---|
| 690 |  | 
|---|
| 691 | UTT-file with four fields: @var{start}, @var{length}, @var{type}, and @var{form}. In the @var{type} field five types of tokens are distinguished: | 
|---|
| 692 |  | 
|---|
| 693 | @itemize | 
|---|
| 694 |  | 
|---|
| 695 | @item @code{W} | 
|---|
| 696 | (word) | 
|---|
| 697 | - continuous sequence of letters | 
|---|
| 698 |  | 
|---|
| 699 | @item @code{N} | 
|---|
| 700 | (number) | 
|---|
| 701 | - continuous sequence of digits | 
|---|
| 702 |  | 
|---|
| 703 | @item @code{S} | 
|---|
| 704 | (space) | 
|---|
| 705 | - continuous sequence of space characters | 
|---|
| 706 |  | 
|---|
| 707 | @item @code{P} | 
|---|
| 708 | (punctuation mark) | 
|---|
| 709 | - single printable character not belonging to any of the other classes | 
|---|
| 710 |  | 
|---|
| 711 | @item @code{B} | 
|---|
| 712 | (unprintable character) | 
|---|
| 713 | - single unprintable character | 
|---|
| 714 |  | 
|---|
| 715 | @end itemize | 
|---|
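The five token classes can be illustrated with a minimal Python sketch. It is not the implementation of tok: the function names tokenize and render are hypothetical, offsets are counted in characters (the real tool may count bytes), and the display of a space as an underscore and of a newline as a backslash followed by n is copied from the example output in this manual.

```python
from itertools import groupby


def char_class(ch):
    """Map one character to a token type code (W, N, S, P or B)."""
    if ch.isalpha():
        return "W"
    if ch.isdigit():
        return "N"
    if ch.isspace():
        return "S"
    if ch.isprintable():
        return "P"
    return "B"


def tokenize(text):
    """Return (start, length, type, form) tuples for the given text."""
    tokens = []
    pos = 0
    for cls, group in groupby(text, key=char_class):
        chunk = "".join(group)
        # W, N and S tokens are runs; P and B tokens are single characters.
        pieces = [chunk] if cls in ("W", "N", "S") else list(chunk)
        for piece in pieces:
            tokens.append((pos, len(piece), cls, piece))
            pos += len(piece)
    return tokens


def render(token):
    """Format one token as a four-field line (assumed display conventions)."""
    start, length, cls, form = token
    shown = form.replace(" ", "_").replace("\n", "\\n")
    return f"{start:04d} {length:02d} {cls} {shown}"


for token in tokenize("Piszemy dobre programy.\n"):
    print(render(token))
```

Grouping adjacent characters by class keeps the sketch short; the order of the tests in char_class matters, because a newline is a space character but is not printable.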
| 716 |  | 
|---|
| 717 |  | 
|---|
| 718 |  | 
|---|
| 719 | @node tok command line options | 
|---|
| 720 | @subsection Command line options | 
|---|
| 721 |  | 
|---|
| 722 | @table @code | 
|---|
| 723 |  | 
|---|
| 724 | @item @b{@minus{}@minus{}help}, @b{@minus{}h} | 
|---|
| 725 | Print help. | 
|---|
| 726 |  | 
|---|
| 727 | @item @b{@minus{}@minus{}version}, @b{@minus{}V} | 
|---|
| 728 | Print version information. | 
|---|
| 729 |  | 
|---|
| 730 | @item @b{@minus{}@minus{}interactive, @minus{}i} | 
|---|
| 731 | This option toggles interactive mode, which is off by default. In | 
|---|
| 732 | interactive mode the program does not buffer its output. | 
|---|
| 733 |  | 
|---|
| 734 | @end table | 
|---|
| 735 |  | 
|---|
| 736 | @node tok example | 
|---|
| 737 | @subsection Example | 
|---|
| 738 |  | 
|---|
| 739 | Input: | 
|---|
| 740 |  | 
|---|
| 741 | @example | 
|---|
| 742 | Piszemy dobre programy. | 
|---|
| 743 | @end example | 
|---|
| 744 |  | 
|---|
| 745 | Output: | 
|---|
| 746 |  | 
|---|
| 747 | @example | 
|---|
| 748 | 0000 07 W Piszemy | 
|---|
| 749 | 0007 01 S _ | 
|---|
| 750 | 0008 05 W dobre | 
|---|
| 751 | 0013 01 S _ | 
|---|
| 752 | 0014 08 W programy | 
|---|
| 753 | 0022 01 P . | 
|---|
| 754 | 0023 01 S \n | 
|---|
| 755 | @end example | 
|---|
| 756 |  | 
|---|
| 757 |  | 
|---|
| 758 | @c --------------------------------------------------------------------- | 
|---|
| 759 | @c SEN | 
|---|
| 760 | @c --------------------------------------------------------------------- | 
|---|
| 761 |  | 
|---|
| 762 | @c @node sen - sentencizer | 
|---|
| 763 | @c @chapter sen - sentencizer | 
|---|
| 764 |  | 
|---|
| 765 | @c Authors: Tomasz Obrębski | 
|---|
| 766 |  | 
|---|
| 767 | @c --------------------------------------------------------------------- | 
|---|
| 768 | @c LEM | 
|---|
| 769 | @c --------------------------------------------------------------------- | 
|---|
| 770 |  | 
|---|
| 771 | @page | 
|---|
| 772 | @node lem | 
|---|
| 773 | @section lem - morphological analyzer | 
|---|
| 774 |  | 
|---|
| 775 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} | 
|---|
| 776 | @item @strong{Authors:}                 @tab Tomasz Obrębski, Michał Stolarski | 
|---|
| 777 | @item @strong{Component category:}      @tab filter | 
|---|
| 778 | @item @strong{Input format:}            @tab UTT regular | 
|---|
| 779 | @item @strong{Output format:}           @tab UTT regular | 
|---|
| 780 | @item @strong{Required annotation:}     @tab tok | 
|---|
| 781 | @end multitable | 
|---|
| 782 |  | 
|---|
| 783 | @menu | 
|---|
| 784 | * lem description:: | 
|---|
| 785 | * lem command line options:: | 
|---|
| 786 | * lem input:: | 
|---|
| 787 | * lem output:: | 
|---|
| 788 | * lem example:: | 
|---|
| 789 | * lem dictionaries:: | 
|---|
| 790 | * lem hints:: | 
|---|
| 791 | @end menu | 
|---|
| 792 |  | 
|---|
| 793 | @node lem description | 
|---|
| 794 | @subsection Description | 
|---|
| 795 |  | 
|---|
| 796 | @command{lem} performs morphological analysis of a simple orthographic | 
|---|
| 797 | word, returning all its possible morphological annotations, | 
|---|
| 798 | disregarding the context. | 
|---|
| 799 |  | 
|---|
| 800 | @c ---------------------------------------- | 
|---|
| 801 |  | 
|---|
| 802 | @node lem command line options | 
|---|
| 803 | @subsection Command line options | 
|---|
| 804 |  | 
|---|
| 805 | @table @code | 
|---|
| 806 | @parhelp | 
|---|
| 807 | @parversion | 
|---|
| 808 | @parinteractive | 
|---|
| 809 | @c @parfile | 
|---|
| 810 | @c @paroutput | 
|---|
| 811 | @c @parfail | 
|---|
| 812 | @c @parcopy | 
|---|
| 813 | @parinputfield | 
|---|
| 814 | @paroutputfield | 
|---|
| 815 | @pardictionary | 
|---|
| 816 | @parprocess | 
|---|
| 817 | @parselect | 
|---|
| 818 | @parunselect | 
|---|
| 819 | @paroneline | 
|---|
| 820 | @paronefield | 
|---|
| 821 | @end table | 
|---|
| 822 |  | 
|---|
| 823 | @c ---------------------------------------- | 
|---|
| 824 |  | 
|---|
| 825 | @node lem input | 
|---|
| 826 | @subsection Input | 
|---|
| 827 |  | 
|---|
| 828 | @command{lem} reads a UTT file and processes the value of the @var{form} field | 
|---|
| 829 | (the input field may be changed with the @option{--input-field} option). | 
|---|
| 830 |  | 
|---|
| 831 | @node lem output | 
|---|
| 832 | @subsection Output | 
|---|
| 833 |  | 
|---|
| 834 | @command{lem} adds a new annotation field, whose default name is @code{lem}.  If | 
|---|
| 835 | the analysis is ambiguous, either the segment is duplicated (default), | 
|---|
| 836 | multiple @code{lem} fields are added (@option{--one-line}), or a single @code{lem} | 
|---|
| 837 | field holding an ambiguous annotation is produced (@option{--one-field}, | 
|---|
| 838 | @option{-1}). The value formats are: | 
|---|
| 839 |  | 
|---|
| 840 | @itemize @bullet | 
|---|
| 841 |  | 
|---|
| 842 | @item | 
|---|
| 843 | unambiguous value format: | 
|---|
| 844 |  | 
|---|
| 845 | @example | 
|---|
| 846 | <lemma>,<descr> | 
|---|
| 847 | @end example | 
|---|
| 848 |  | 
|---|
| 849 | @item | 
|---|
| 850 | ambiguous value format (@option{--one-field} option) | 
|---|
| 851 |  | 
|---|
| 852 |  | 
|---|
| 853 | @example | 
|---|
| 854 | <lemma>,<descr>[,<descr>][;<lemma>,<descr>[,<descr>]] | 
|---|
| 855 | @end example | 
|---|
| 856 |  | 
|---|
| 857 | (alternative descriptions for the same lemma are separated by commas, | 
|---|
| 858 | alternative lemmata are separated by semicolons.) | 
|---|
| 859 |  | 
|---|
| 860 | @end itemize | 
|---|
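A value in either format can be split back into lemmata and descriptions. The following Python sketch shows one way to do it; the function name parse_lem_value is hypothetical, and the sketch assumes that lemmata and descriptions never contain a comma or a semicolon themselves.

```python
def parse_lem_value(value):
    """Split a lem field value into (lemma, [description, ...]) pairs."""
    readings = []
    # Alternative lemmata are separated by semicolons.
    for alternative in value.split(";"):
        # The first item is the lemma, the rest are its descriptions.
        lemma, *descriptions = alternative.split(",")
        readings.append((lemma, descriptions))
    return readings


print(parse_lem_value("dobry,ADJ/DpNpCnavGaifn,ADJ/DpNsCnavGn"))
```

An unambiguous value is simply the special case with one lemma and one description, so the same function handles both formats.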
| 861 |  | 
|---|
| 862 | @node lem example | 
|---|
| 863 | @subsection Example | 
|---|
| 864 |  | 
|---|
| 865 | Input: | 
|---|
| 866 |  | 
|---|
| 867 | @example | 
|---|
| 868 | 0000 07 W Piszemy | 
|---|
| 869 | 0007 01 S _ | 
|---|
| 870 | 0008 05 W dobre | 
|---|
| 871 | 0013 01 S _ | 
|---|
| 872 | 0014 08 W programy | 
|---|
| 873 | 0022 01 P . | 
|---|
| 874 | 0023 01 S \n | 
|---|
| 875 | @end example | 
|---|
| 876 |  | 
|---|
| 877 | Output (default): | 
|---|
| 878 |  | 
|---|
| 879 | @example | 
|---|
| 880 | 0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1 | 
|---|
| 881 | 0007 01 S _ | 
|---|
| 882 | 0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn | 
|---|
| 883 | 0008 05 W dobre lem:dobry,ADJ/DpNsCnavGn | 
|---|
| 884 | 0013 01 S _ | 
|---|
| 885 | 0014 08 W programy lem:program,N/GiNpCa | 
|---|
| 886 | 0014 08 W programy lem:program,N/GiNpCn | 
|---|
| 887 | 0014 08 W programy lem:program,N/GiNpCv | 
|---|
| 888 | 0022 01 P . | 
|---|
| 889 | 0023 01 S \n | 
|---|
| 890 | @end example | 
|---|
| 891 |  | 
|---|
| 892 | Output (@option{--one-line} option): | 
|---|
| 893 |  | 
|---|
| 894 | @example | 
|---|
| 895 | 0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1 | 
|---|
| 896 | 0007 01 S _ | 
|---|
| 897 | 0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn lem:dobry,ADJ/DpNsCnavGn | 
|---|
| 898 | 0013 01 S _ | 
|---|
| 899 | 0014 08 W programy lem:program,N/GiNpCa lem:program,N/GiNpCn lem:program,N/GiNpCv | 
|---|
| 900 | 0022 01 P . | 
|---|
| 901 | 0023 01 S \n | 
|---|
| 902 | @end example | 
|---|
| 903 |  | 
|---|
| 904 | Output (@option{--one-field} option): | 
|---|
| 905 |  | 
|---|
| 906 | @example | 
|---|
| 907 | 0000 07 W Piszemy lem:pisać,V/AiVpMdTrfNpP1 | 
|---|
| 908 | 0007 01 S _ | 
|---|
| 909 | 0008 05 W dobre lem:dobry,ADJ/DpNpCnavGaifn,ADJ/DpNsCnavGn | 
|---|
| 910 | 0013 01 S _ | 
|---|
| 911 | 0014 08 W programy lem:program,N/GiNpCa,N/GiNpCn,N/GiNpCv | 
|---|
| 912 | 0022 01 P . | 
|---|
| 913 | 0023 01 S \n | 
|---|
| 914 | @end example | 
|---|
| 915 |  | 
|---|
| 916 | @c ---------------------------------------- | 
|---|
| 917 |  | 
|---|
| 918 | @node lem dictionaries | 
|---|
| 919 | @subsection Dictionaries | 
|---|
| 920 |  | 
|---|
| 921 | @command{lem} requires a dictionary. The dictionary may be provided in | 
|---|
| 922 | one of two formats: in text (source) format or in binary (fsa) format. | 
|---|
| 923 |  | 
|---|
| 924 | @subsubheading Text format | 
|---|
| 925 |  | 
|---|
| 926 | Dictionary entries have the following structure: | 
|---|
| 927 |  | 
|---|
| 928 | @example | 
|---|
| 929 | <form>;<lemma>,<descr>[;<lemma>,<descr>] | 
|---|
| 930 | @end example | 
|---|
| 931 |  | 
|---|
| 932 | @var{lemma} may be given explicitly or in the cut-add format: | 
|---|
| 933 |  | 
|---|
| 934 | @example | 
|---|
| 935 | @code{[<cut1><add1>-]<cut2><add2>} | 
|---|
| 936 | @end example | 
|---|
| 937 |  | 
|---|
| 938 | meaning: replace the prefix of length @code{<cut1>} with the | 
|---|
| 939 | string @code{<add1>}, and replace the suffix of length @code{<cut2>} with the string | 
|---|
| 940 | @code{<add2>}. For example, @code{3t} transforms @samp{kocie} into | 
|---|
| 941 | @samp{kot}, and @code{3-4ały} transforms @samp{najbielsi} into @samp{biały}. | 
|---|
| 942 |  | 
|---|
| 943 | Each dictionary entry must be written in one line and must not contain blank characters. | 
|---|
| 944 |  | 
|---|
| 945 | Examples: | 
|---|
| 946 | @example | 
|---|
| 947 | kot;0,N/GaNsCn | 
|---|
| 948 | kota;1,N/GaNsCg;1,N/GaNsCa | 
|---|
| 949 | kotu;1,N/GaNsCd | 
|---|
| 950 | kotem;2,N/GaNsCi | 
|---|
| 951 | kocie;3t,N/GaNsCl;3t,N/GaNsCv | 
|---|
| 952 | najbielsi;3-4ały,ADJ/DsNpCnGp | 
|---|
| 953 | najbielsze;3-5ały,ADJ/DsNpCnGaifn | 
|---|
| 954 | najlepsi;dobry,ADJ/DsNpCnGp | 
|---|
| 955 | najlepsze;dobry,ADJ/DsNpCnGaifn | 
|---|
| 956 | @end example | 
|---|
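The cut-add notation used in these entries can be decoded mechanically. The Python sketch below is an illustration rather than the code used by lem: the name apply_cut_add is hypothetical, cut lengths are assumed to be decimal numbers, and any lemma field that does not start with a digit is treated as an explicitly given lemma.

```python
import re

# Optional "<cut1><add1>-" part, followed by the mandatory "<cut2><add2>" part.
CUT_ADD = re.compile(
    r"^(?:(?P<cut1>\d+)(?P<add1>[^-]*)-)?(?P<cut2>\d+)(?P<add2>.*)$"
)


def apply_cut_add(form, code):
    """Return the lemma of form according to a dictionary lemma field."""
    match = CUT_ADD.match(code)
    if match is None:
        return code  # the lemma is given explicitly
    cut1 = int(match["cut1"] or 0)
    add1 = match["add1"] or ""
    cut2 = int(match["cut2"])
    add2 = match["add2"]
    stem = form[cut1:]                          # drop the prefix
    stem = stem[: max(len(stem) - cut2, 0)]     # drop the suffix
    return add1 + stem + add2


print(apply_cut_add("kocie", "3t"))
print(apply_cut_add("najbielsi", "3-4ały"))
```

Storing only the cut lengths and the replacement strings lets many word forms share the same lemma field, which presumably helps when the dictionary is compiled into an automaton.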
| 957 |  | 
|---|
| 958 |  | 
|---|
| 959 | The mandatory file name extension for a text dictionary is @code{dic}. For large | 
|---|
| 960 | dictionaries it is preferable, however, to compile them into binary | 
|---|
| 961 | (fsa) format. | 
|---|
| 962 |  | 
|---|
| 963 | @subsubheading Binary format | 
|---|
| 964 |  | 
|---|
| 965 | The mandatory file name extension for a binary dictionary is @code{bin}. To | 
|---|
| 966 | compile a text dictionary into binary format, write: | 
|---|
| 967 |  | 
|---|
| 968 | @example | 
|---|
| 969 | compiledic <dictionaryname>.dic | 
|---|
| 970 | @end example | 
|---|
| 971 |  | 
|---|
| 972 | @subsubheading Polex/PMDBF dictionary | 
|---|
| 973 |  | 
|---|
| 974 | A large-coverage morphological dictionary for the Polish language, Polex/PMDBF, is included in | 
|---|
| 975 | the distribution as the default dictionary for @command{lem}. By default it is | 
|---|
| 976 | located in: | 
|---|
| 977 |  | 
|---|
| 978 | @file{$HOME/.local/share/utt/pl_PL.ISO-8859-2/lem.bin} | 
|---|
| 979 |  | 
|---|
| 980 | in a local installation, or in | 
|---|
| 981 |  | 
|---|
| 982 | @file{/usr/local/share/utt/pl_PL.ISO-8859-2/lem.bin} | 
|---|
| 983 |  | 
|---|
| 984 | in a system installation. | 
|---|
| 985 |  | 
|---|
| 986 | @node lem hints | 
|---|
| 987 | @subsection Hints | 
|---|
| 988 |  | 
|---|
| 989 | @subsubheading Combining data from multiple dictionaries | 
|---|
| 990 |  | 
|---|
| 991 | @itemize | 
|---|
| 992 |  | 
|---|
| 993 | @item Apply <dict1>, then apply <dict2> to words which were not annotated. | 
|---|
| 994 |  | 
|---|
| 995 | @example | 
|---|
| 996 | lem -d <dict1> | lem -S lem -d <dict2> | 
|---|
| 997 | @end example | 
|---|
| 998 |  | 
|---|
| 999 | @item Add annotations from two dictionaries <dict1> and <dict2>. | 
|---|
| 1000 |  | 
|---|
| 1001 | @example | 
|---|
| 1002 | lem -c -d <dict1> | lem -S lem -d <dict2> | 
|---|
| 1003 | @end example | 
|---|
| 1004 |  | 
|---|
| 1005 | @end itemize | 
|---|
| 1006 |  | 
|---|
| 1007 |  | 
|---|
| 1008 | @c --------------------------------------------------------------------- | 
|---|
| 1009 | @c GUE | 
|---|
| 1010 | @c --------------------------------------------------------------------- | 
|---|
| 1011 |  | 
|---|
| 1012 | @page | 
|---|
| 1013 | @node gue | 
|---|
| 1014 | @section gue - morphological guesser | 
|---|
| 1015 |  | 
|---|
| 1016 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} | 
|---|
| 1017 |  | 
|---|
| 1018 | @item @strong{Authors:}                 @tab Michał Stolarski, Tomasz Obrębski | 
|---|
| 1019 | @item @strong{Component category:}      @tab filter | 
|---|
| 1020 |  | 
|---|
| 1021 | @end multitable | 
|---|
| 1022 |  | 
|---|
| 1023 | @menu | 
|---|
| 1024 | * gue description:: | 
|---|
| 1025 | * gue command line options:: | 
|---|
| 1026 | * gue example:: | 
|---|
| 1027 | * gue dictionaries:: | 
|---|
| 1028 | @end menu | 
|---|
| 1029 |  | 
|---|
| 1030 |  | 
|---|
| 1031 | @node gue description | 
|---|
| 1032 | @subsection Description | 
|---|
| 1033 |  | 
|---|
| 1034 | @command{gue} guesses morphological descriptions of the form contained | 
|---|
| 1035 | in the @var{form} field. | 
|---|
| 1036 |  | 
|---|
| 1037 |  | 
|---|
| 1038 | @node gue command line options | 
|---|
| 1039 | @subsection Command line options | 
|---|
| 1040 |  | 
|---|
| 1041 | @table @code | 
|---|
| 1042 |  | 
|---|
| 1043 | @parhelp | 
|---|
| 1044 | @parversion | 
|---|
| 1045 | @parinteractive | 
|---|
| 1046 | @c @parfile | 
|---|
| 1047 | @c @paroutput | 
|---|
| 1048 | @c @parfail | 
|---|
| 1049 | @c @parcopy | 
|---|
| 1050 | @parinputfield | 
|---|
| 1051 | @paroutputfield | 
|---|
| 1052 | @pardictionary | 
|---|
| 1053 | @parprocess | 
|---|
| 1054 | @parselect | 
|---|
| 1055 | @parunselect | 
|---|
| 1056 | @paroneline | 
|---|
| 1057 | @paronefield | 
|---|
| 1058 |  | 
|---|
| 1059 | @item @b{@minus{}@minus{}delta=@var{n}} | 
|---|
| 1060 | Stop displaying answers after a drop in weight, that is, when the weight difference between two consecutive results exceeds the delta value (default=`0.2'). | 
|---|
| 1061 |  | 
|---|
| 1062 |  | 
|---|
| 1063 | @item @b{@minus{}@minus{}cut-off=@var{n}} | 
|---|
| 1064 | Do not display answers whose weight is lower than the cut-off value (default=`200'). | 
|---|
| 1065 |  | 
|---|
| 1066 |  | 
|---|
| 1067 | @item @b{@minus{}@minus{}guess_count=@var{n}, @minus{}n @var{n}} | 
|---|
| 1068 | Guess up to @var{n} descriptions (default=`0', which means `display all results'). | 
|---|
| 1069 |  | 
|---|
| 1070 |  | 
|---|
| 1071 |  | 
|---|
| 1072 | @end table | 
|---|
| 1073 |  | 
|---|
| 1074 | @node gue example | 
|---|
| 1075 | @subsection Example | 
|---|
| 1076 |  | 
|---|
| 1077 | @example | 
|---|
| 1078 | command: gue -n 2 | 
|---|
| 1079 |  | 
|---|
| 1080 | input: | 
|---|
| 1081 | 0000 07 W smerfny | 
|---|
| 1082 |  | 
|---|
| 1083 | output: | 
|---|
| 1084 | 0000 07 W smerfny gue:,ADJ/CaDpGiNs | 
|---|
| 1085 | 0000 07 W smerfny gue:,ADJ/CnvDpGaipNs | 
|---|
| 1086 | @end example | 
|---|
| 1087 |  | 
|---|
| 1088 |  | 
|---|
| 1089 | @node gue dictionaries | 
|---|
| 1090 | @subsection Dictionaries | 
|---|
| 1091 |  | 
|---|
| 1092 | @command{gue} requires a dictionary. For now, the dictionary must be provided in binary (fsa) format. | 
|---|
| 1093 | The fsa format is created by compiling text-format dictionaries. | 
|---|
| 1094 |  | 
|---|
| 1095 |  | 
|---|
| 1096 |  | 
|---|
| 1097 | @subsubheading Text format | 
|---|
| 1098 |  | 
|---|
| 1099 | Dictionary entries have the following structure: | 
|---|
| 1100 |  | 
|---|
| 1101 | @example | 
|---|
| 1102 | @var{prefix}@code{*}@var{suffix}@code{;}@var{lemma}@code{,}@var{description}@code{:}@var{weight} | 
|---|
| 1103 | @end example | 
|---|
| 1104 |  | 
|---|
| 1105 | @var{lemma} must be given in the cut-add format: | 
|---|
| 1106 |  | 
|---|
| 1107 | @example | 
|---|
| 1108 | @code{[<cut1><add1>-]<cut2><add2>} | 
|---|
| 1109 | @end example | 
|---|
| 1110 | (no spaces in between): replace the prefix of length @var{cut1} with the | 
|---|
| 1111 | string @var{add1}, and replace the suffix of length @var{cut2} with the string | 
|---|
| 1112 | @var{add2}. | 
|---|
| 1113 |  | 
|---|
| 1114 |  | 
|---|
| 1115 | Example: @code{3-4ały} transforms @i{najbielsi} into @i{biały}. | 
|---|
| 1116 |  | 
|---|
| 1117 |  | 
|---|
| 1118 | @var{description} contains the part of speech and morphosyntactic information (@pxref{PMDBF dictionary}). | 
|---|
| 1119 |  | 
|---|
| 1120 | @var{weight} is an integer value between 1 and 999 indicating the | 
|---|
| 1121 | likelihood of the guess. | 
|---|
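The entry structure can be made concrete with a small Python parser. This is only a sketch: the names GueEntry, parse_gue_entry and matches are hypothetical, the sample entry is invented for illustration and is not taken from a real dictionary, and reading the asterisk as a placeholder for the middle part of the word form is an assumption.

```python
from typing import NamedTuple


class GueEntry(NamedTuple):
    prefix: str
    suffix: str
    lemma: str        # in cut-add format
    description: str
    weight: int


def parse_gue_entry(line):
    """Parse one text-format entry: prefix*suffix;lemma,description:weight."""
    pattern, rest = line.split(";", 1)
    prefix, suffix = pattern.split("*", 1)
    lemma, rest = rest.split(",", 1)
    # The weight follows the last colon of the entry.
    description, weight = rest.rsplit(":", 1)
    return GueEntry(prefix, suffix, lemma, description, int(weight))


def matches(entry, form):
    """Check whether form starts with the prefix and ends with the suffix."""
    long_enough = len(form) >= len(entry.prefix) + len(entry.suffix)
    return long_enough and form.startswith(entry.prefix) and form.endswith(entry.suffix)


# Hypothetical entry, made up for this example.
entry = parse_gue_entry("naj*elsi;3-4ały,ADJ/DsNpCnGp:500")
print(entry, matches(entry, "najbielsi"))
```

The length check in matches prevents a short form from being accepted when the prefix and the suffix would have to overlap.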
| 1122 |  | 
|---|
| 1123 | @c @example | 
|---|
| 1124 | @c *ÅkÄ;1a,N/GfNsCa | 
|---|
| 1125 | @c naj*elszy;3-4ały,ADJ/...:... | 
|---|
| 1126 | @c @end example | 
|---|
| 1127 |  | 
|---|
| 1128 |  | 
|---|
| 1129 | @c --------------------------------------------------------------------- | 
|---|
| 1130 | @c COR | 
|---|
| 1131 | @c --------------------------------------------------------------------- | 
|---|
| 1132 |  | 
|---|
| 1133 | @page | 
|---|
| 1134 | @node cor | 
|---|
| 1135 | @section cor - spelling corrector | 
|---|
| 1136 |  | 
|---|
| 1137 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} | 
|---|
| 1138 | @item @strong{Authors:}                 @tab Tomasz Obrębski, Michał Stolarski | 
|---|
| 1139 | @item @strong{Component category:}      @tab filter | 
|---|
| 1140 | @item @strong{Input format:}            @tab UTT regular | 
|---|
| 1141 | @item @strong{Output format:}           @tab UTT regular | 
|---|
| 1142 | @item @strong{Required annotation:}     @tab tok | 
|---|
| 1143 | @end multitable | 
|---|
| 1144 |  | 
|---|
| 1145 | @menu | 
|---|
| 1146 | * cor description:: | 
|---|
| 1147 | * cor command line options:: | 
|---|
| 1148 | * cor dictionaries:: | 
|---|
| 1149 | @end menu | 
|---|
| 1150 |  | 
|---|
| 1151 |  | 
|---|
| 1152 | @node cor description | 
|---|
| 1153 | @subsection Description | 
|---|
| 1154 |  | 
|---|
| 1155 | The spelling corrector applies Kemal Oflazer's dynamic programming | 
|---|
| 1156 | algorithm @cite{oflazer96} to the FSA representation of the set of | 
|---|
| 1157 | word forms of the Polex/PMDBF dictionary. Given an incorrect | 
|---|
| 1158 | word form, it returns all word forms present in the dictionary whose | 
|---|
| 1159 | edit distance from it is within the threshold set with the @option{--distance} option. | 
|---|
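The effect of the distance threshold can be shown with a brute-force Python sketch. It is deliberately not Oflazer's algorithm: instead of traversing an automaton it compares the input with every entry of a plain word list. The set of edit operations (insertion, deletion, substitution, transposition of adjacent characters) and the reading of the threshold as an inclusive maximum are assumptions, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Edit distance allowing insertion, deletion, substitution and
    transposition of two adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)  # transposition
    return dist[-1][-1]


def suggestions(word, dictionary, max_distance=1):
    """Return the dictionary words within max_distance of word."""
    return [entry for entry in dictionary if edit_distance(word, entry) <= max_distance]


print(suggestions("odlto", ["odlot", "odlotowy", "odludek"]))
```

Scanning the whole list costs time proportional to the dictionary size for every query, which is presumably why the real tool searches a compiled automaton instead.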
| 1160 |  | 
|---|
| 1161 |  | 
|---|
| 1162 | @node cor command line options | 
|---|
| 1163 | @subsection Command line options | 
|---|
| 1164 |  | 
|---|
| 1165 | @table @code | 
|---|
| 1166 |  | 
|---|
| 1167 | @parhelp | 
|---|
| 1168 | @parversion | 
|---|
| 1169 | @parinteractive | 
|---|
| 1170 | @c @parfile | 
|---|
| 1171 | @c @paroutput | 
|---|
| 1172 | @c @parfail | 
|---|
| 1173 | @c @parcopy | 
|---|
| 1174 | @parinputfield | 
|---|
| 1175 | @paroutputfield | 
|---|
| 1176 | @pardictionary | 
|---|
| 1177 | @parprocess | 
|---|
| 1178 | @parselect | 
|---|
| 1179 | @parunselect | 
|---|
| 1180 | @paroneline | 
|---|
| 1181 | @paronefield | 
|---|
| 1182 |  | 
|---|
| 1183 | @item @b{@minus{}@minus{}distance=@var{int}, @minus{}n @var{int}} | 
|---|
| 1184 | Maximum edit distance (default='1'). | 
|---|
| 1185 |  | 
|---|
| 1186 | @c @item @b{@minus{}@minus{}replace, @minus{}r} | 
|---|
| 1187 | @c Replace original form with corrected form, place original form in the | 
|---|
| 1188 | @c cor field. This option has no effect in @option{--one-*} modes (default=off) | 
|---|
| 1189 |  | 
|---|
| 1190 |  | 
|---|
| 1191 | @end table | 
|---|
| 1192 |  | 
|---|
| 1193 | @node cor dictionaries | 
|---|
| 1194 | @subsection Dictionaries | 
|---|
| 1195 |  | 
|---|
| 1196 | @command{cor} requires a dictionary. The dictionary has to be provided in binary (fsa) format. | 
|---|
| 1197 | The fsa format is created by compiling text-format dictionaries. | 
|---|
| 1198 |  | 
|---|
| 1199 | @subsubheading Text format | 
|---|
| 1200 |  | 
|---|
| 1201 | The @command{cor} dictionary is a list of words: | 
|---|
| 1202 | @example | 
|---|
| 1203 | odlot | 
|---|
| 1204 | odlotowy | 
|---|
| 1205 | odludek | 
|---|
| 1206 | @end example | 
|---|
| 1207 |  | 
|---|
| 1208 | @subsubheading Binary format | 
|---|
| 1209 |  | 
|---|
| 1210 | The mandatory file name extension for a binary dictionary is @code{bin}. To | 
|---|
| 1211 | compile a text dictionary into binary format, write: | 
|---|
| 1212 |  | 
|---|
| 1213 | @example | 
|---|
| 1214 | compiledic <dictionaryname>.dic | 
|---|
| 1215 | @end example | 
|---|
| 1216 |  | 
|---|
| 1217 | @c --------------------------------------------------------------------- | 
|---|
| 1218 | @c KOR | 
|---|
| 1219 | @c --------------------------------------------------------------------- | 
|---|
| 1220 |  | 
|---|
| 1221 | @page | 
|---|
| 1222 | @node kor | 
|---|
| 1223 | @section kor - configurable spelling corrector | 
|---|
| 1224 |  | 
|---|
| 1225 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} | 
|---|
| 1226 | @item @strong{Authors:}                 @tab Paweł Werenski, Tomasz Obrębski, Michał Stolarski | 
|---|
| 1227 | @item @strong{Component category:}      @tab filter | 
|---|
| 1228 | @item @strong{Input format:}            @tab UTT regular | 
|---|
| 1229 | @item @strong{Output format:}           @tab UTT regular | 
|---|
| 1230 | @item @strong{Required annotation:}     @tab tok | 
|---|
| 1231 | @end multitable | 
|---|
| 1232 |  | 
|---|
| 1233 | @menu | 
|---|
| 1234 | * kor description:: | 
|---|
| 1235 | * kor command line options:: | 
|---|
| 1236 | * kor weights definition file:: | 
|---|
| 1237 | * kor dictionaries:: | 
|---|
| 1238 | @end menu | 
|---|
| 1239 |  | 
|---|
| 1240 |  | 
|---|
| 1241 | @node kor description | 
|---|
| 1242 | @subsection Description | 
|---|
| 1243 |  | 
|---|
| 1244 | The spelling corrector applies Paweł Werenski's dynamic programming | 
|---|
| 1245 | algorithm to the FSA representation of the set of word forms of the | 
|---|
| 1246 | Polex/PMDBF dictionary. The algorithm is an extension of K. Oflazer's | 
|---|
| 1247 | algorithm used by @command{cor}. In the extended version it is | 
|---|
| 1248 | possible to assign weights to individual edit operations. | 
|---|
| 1249 |  | 
|---|
| 1250 | Given an incorrect word form, it returns all word forms | 
|---|
| 1251 | present in the dictionary whose edit distance from it is within the | 
|---|
| 1252 | threshold set with the @option{--distance} option. | 
|---|
| 1253 |  | 
|---|
| 1254 |  | 
|---|
| 1255 | @node kor command line options | 
|---|
| 1256 | @subsection Command line options | 
|---|
| 1257 |  | 
|---|
| 1258 | @table @code | 
|---|
| 1259 |  | 
|---|
| 1260 | @parhelp | 
|---|
| 1261 | @parversion | 
|---|
| 1262 | @parinteractive | 
|---|
| 1263 | @c @parfile | 
|---|
| 1264 | @c @paroutput | 
|---|
| 1265 | @c @parfail | 
|---|
| 1266 | @c @parcopy | 
|---|
| 1267 | @parinputfield | 
|---|
| 1268 | @paroutputfield | 
|---|
| 1269 | @pardictionary | 
|---|
| 1270 | @parprocess | 
|---|
| 1271 | @parselect | 
|---|
| 1272 | @parunselect | 
|---|
| 1273 | @paroneline | 
|---|
| 1274 | @paronefield | 
|---|
| 1275 |  | 
|---|
| 1276 | @item @b{@minus{}@minus{}distance=@var{int}, @minus{}n @var{int}} | 
|---|
| 1277 | Maximum edit distance (default='1'). | 
|---|
| 1278 |  | 
|---|
| 1279 | @item @b{@minus{}@minus{}weights=@var{filename}, @minus{}w @var{filename}} | 
|---|
| 1280 | Edit operations' weights file. | 
|---|
| 1281 |  | 
|---|
| 1282 | @c @item @b{@minus{}@minus{}replace, @minus{}r} | 
|---|
| 1283 | @c Replace original form with corrected form, place original form in the | 
|---|
| 1284 | @c cor field. This option has no effect in @option{--one-*} modes (default=off) | 
|---|
| 1285 |  | 
|---|
| 1286 |  | 
|---|
| 1287 | @end table | 
|---|
| 1288 |  | 
|---|
| 1289 |  | 
|---|
| 1290 | @node kor weights definition file | 
|---|
| 1291 | @subsection Weights definition file | 
|---|
| 1292 |  | 
|---|
| 1293 | Example: | 
|---|
| 1294 |  | 
|---|
| 1295 | @example | 
|---|
| 1296 |  | 
|---|
| 1297 | %stdcor 1 | 
|---|
| 1298 | %xchg   1 | 
|---|
| 1299 | ż  rz 0.5 | 
|---|
| 1300 | ch h  0.5 | 
|---|
| 1301 | u  ó  0.5 | 
|---|
| 1302 |  | 
|---|
| 1303 | @end example | 
|---|
| 1304 |  | 
|---|
| 1305 |  | 
|---|
| 1306 | The default weight is set to 1 (@code{%stdcor 1}), the weight of the exchange | 
|---|
| 1307 | operation is set to 1 (@code{%xchg 1}), and the three principal orthographic | 
|---|
| 1308 | errors are assigned the weight 0.5. | 
|---|
| 1309 |  | 
|---|
| 1310 | The edit operation weight declaration, such as | 
|---|
| 1311 |  | 
|---|
| 1312 | @example | 
|---|
| 1313 | ż  rz 0.5 | 
|---|
| 1314 | @end example | 
|---|
| 1315 |  | 
|---|
| 1316 | works in both directions, i.e. ż->rz and rz->ż. | 
|---|
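A weights definition file of this shape can be read with a few lines of Python. The sketch below only parses the file and does not implement the weighted correction algorithm. The function name parse_weights is hypothetical, and the sketch assumes that fields are separated by whitespace, that lines starting with a percent sign set global values, and that every other non-empty line has exactly three fields.

```python
def parse_weights(text):
    """Parse a weights definition into (settings, pair_weights)."""
    settings = {}
    pairs = {}
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue  # skip empty lines
        if fields[0].startswith("%"):
            # Global value, for example "%stdcor 1".
            settings[fields[0][1:]] = float(fields[1])
        else:
            left, right, weight = fields
            # A declaration applies in both directions.
            pairs[(left, right)] = float(weight)
            pairs[(right, left)] = float(weight)
    return settings, pairs


sample = """
%stdcor 1
%xchg   1
ż  rz 0.5
ch h  0.5
u  ó  0.5
"""
settings, pairs = parse_weights(sample)
print(settings, pairs[("rz", "ż")])
```

Storing each pair under both orderings means a lookup never has to try the reversed key.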
| 1317 |  | 
|---|
| 1318 | The default weights definition file for @code{kor} is: | 
|---|
| 1319 |  | 
|---|
| 1320 | @example | 
|---|
| 1321 | $HOME/.local/share/utt/weights.kor | 
|---|
| 1322 | @end example | 
|---|
| 1323 |  | 
|---|
| 1324 | or, if the above mentioned file is absent: | 
|---|
| 1325 |  | 
|---|
| 1326 | @example | 
|---|
| 1327 | /usr/local/share/utt/weights.kor | 
|---|
| 1328 | @end example | 
|---|
| 1329 |  | 
|---|
| 1330 |  | 
|---|
| 1331 | @node kor dictionaries | 
|---|
| 1332 | @subsection Dictionaries | 
|---|
| 1333 |  | 
|---|
| 1334 | @xref{cor dictionaries}. | 
|---|
| 1335 |  | 
|---|
| 1336 | @c --------------------------------------------------------------------- | 
|---|
| 1337 | @c SEN | 
|---|
| 1338 | @c --------------------------------------------------------------------- | 
|---|
| 1339 |  | 
|---|
| 1340 | @page | 
|---|
| 1341 | @node sen | 
|---|
| 1342 | @section sen - sentencizer | 
|---|
| 1343 |  | 
|---|
| 1344 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} | 
|---|
| 1345 |  | 
|---|
| 1346 | @item @strong{Authors:}                 @tab Tomasz Obrębski | 
|---|
| 1347 | @item @strong{Component category:}      @tab filter | 
|---|
| 1348 | @item @strong{Input format:}            @tab UTT regular | 
|---|
| 1349 | @item @strong{Output format:}           @tab UTT regular | 
|---|
| 1350 | @item @strong{Required annotation:}     @tab tok | 
|---|
| 1351 |  | 
|---|
| 1352 | @end multitable | 
|---|
| 1353 |  | 
|---|
| 1354 |  | 
|---|
| 1355 | @menu | 
|---|
| 1356 | * sen description:: | 
|---|
| 1357 | @c * sen input:: | 
|---|
| 1358 | @c * sen output:: | 
|---|
| 1359 | * sen example:: | 
|---|
| 1360 | @end menu | 
|---|
| 1361 |  | 
|---|
| 1362 | @node sen description | 
|---|
| 1363 | @subsection Description | 
|---|
| 1364 |  | 
|---|
| 1365 | @command{sen} detects sentence boundaries in UTT-formatted texts and marks them with special zero-length segments, in which the @var{type} field may contain the BOS (beginning of sentence) or EOS (end of sentence) annotation. | 
|---|
| 1366 |  | 
|---|
| 1367 | @node sen example | 
|---|
| 1368 | @subsection Example | 
|---|
| 1369 |  | 
|---|
| 1370 | @example | 
|---|
| 1371 | command: sen | 
|---|
| 1372 |  | 
|---|
| 1373 | input: | 
|---|
| 1374 | 0000 05 W Cześć | 
|---|
| 1375 | 0005 01 P ! | 
|---|
| 1376 | 0006 01 S _ | 
|---|
| 1377 | 0007 02 W To | 
|---|
| 1378 | 0009 01 S _ | 
|---|
| 1379 | 0010 02 W ja | 
|---|
| 1380 | 0012 01 P . | 
|---|
| 1381 | 0013 01 S \n | 
|---|
| 1382 |  | 
|---|
| 1383 | output: | 
|---|
| 1384 | 0000 00 BOS * | 
|---|
| 1385 | 0000 05 W Cześć | 
|---|
| 1386 | 0005 01 P ! | 
|---|
| 1387 | 0006 00 EOS * | 
|---|
| 1388 | 0006 00 BOS * | 
|---|
| 1389 | 0006 01 S _ | 
|---|
| 1390 | 0007 02 W To | 
|---|
| 1391 | 0009 01 S _ | 
|---|
| 1392 | 0010 02 W ja | 
|---|
| 1393 | 0012 01 P . | 
|---|
| 1394 | 0013 01 S \n | 
|---|
| 1395 | 0014 00 EOS * | 
|---|
| 1396 | @end example | 
|---|
| 1397 |  | 
|---|
| 1398 |  | 
|---|
| 1399 | @c --------------------------------------------------------------------- | 
|---|
| 1400 | @c GPH | 
|---|
| 1401 | @c --------------------------------------------------------------------- | 
|---|
| 1402 |  | 
|---|
| 1403 | @c @node gph - graphizer | 
|---|
| 1404 | @c @chapter gph - graphizer | 
|---|
| 1405 |  | 
|---|
| 1406 | @c Authors: Tomasz Obrębski | 
|---|
| 1407 |  | 
|---|
| 1408 |  | 
|---|
| 1409 |  | 
|---|
| 1410 | @c --------------------------------------------------------------------- | 
|---|
| 1411 | @c SER | 
|---|
| 1412 | @c --------------------------------------------------------------------- | 
|---|
| 1413 |  | 
|---|
| 1414 | @page | 
|---|
| 1415 | @node ser | 
|---|
| 1416 | @section ser - pattern search tool | 
|---|
| 1417 |  | 
|---|
| 1418 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} | 
|---|
| 1419 | @item @strong{Authors:}                 @tab Tomasz Obrębski | 
|---|
| 1420 | @item @strong{Component category:}      @tab filter | 
|---|
| 1421 | @item @strong{Input format:}            @tab UTT regular | 
|---|
| 1422 | @item @strong{Output format:}           @tab UTT regular | 
|---|
| 1423 | @item @strong{Required annotation:}     @tab tok,  lem --one-field | 
|---|
| 1424 | @end multitable | 
|---|
| 1425 |  | 
|---|
| 1426 | @menu | 
|---|
| 1427 | * ser description:: | 
|---|
| 1428 | * ser command line options:: | 
|---|
| 1429 | * ser pattern:: | 
|---|
| 1430 | * ser how ser works:: | 
|---|
| 1431 | * ser customization:: | 
|---|
| 1432 | * ser limitations:: | 
|---|
| 1433 | * ser requirements:: | 
|---|
| 1434 | @end menu | 
|---|
| 1435 |  | 
|---|
| 1436 |  | 
|---|
| 1437 | @node ser description | 
|---|
| 1438 | @subsection Description | 
|---|
| 1439 |  | 
|---|
| 1440 | @command{ser} looks for patterns in UTT-formatted texts. | 
|---|
| 1441 |  | 
|---|
| 1442 |  | 
|---|
| 1443 | @c --------------------------------------------------------------------- | 
|---|
| 1444 | @node ser command line options | 
|---|
| 1445 | @subsection Command line options | 
|---|
| 1446 |  | 
|---|
| 1447 | @table @code | 
|---|
| 1448 |  | 
|---|
| 1449 | @parhelp | 
|---|
| 1450 | @parversion | 
|---|
| 1451 | @c @parfile | 
|---|
| 1452 | @c @paroutput | 
|---|
| 1453 | @c @parinputfield | 
|---|
| 1454 | @c @paroutputfield | 
|---|
| 1455 | @parprocess | 
|---|
| 1456 | @parinteractive | 
|---|
| 1457 |  | 
|---|
| 1458 | @item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}} | 
|---|
| 1459 | The search pattern. | 
|---|
| 1460 |  | 
|---|
| 1461 | @item @b{@minus{}@minus{}morph=@var{field}} | 
|---|
| 1462 | The name of the annotation field containing the morphological | 
|---|
| 1463 | description (default @code{lem}). | 
|---|
| 1464 |  | 
|---|
| 1465 | @item @b{@minus{}@minus{}flex} | 
|---|
| 1466 | Only print the generated flex source code. | 
|---|
| 1467 |  | 
|---|
| 1468 | @item @b{@minus{}@minus{}macro=@var{filename}} | 
|---|
| 1469 | Read macro definitions from file @var{filename} rather than from | 
|---|
| 1470 | the default location. This option makes it possible to redefine the set of terms. | 
|---|
| 1471 |  | 
|---|
| 1472 | @item @b{@minus{}@minus{}define=@var{filename}} | 
|---|
| 1473 | Append macro definitions from file @var{filename}. This option | 
|---|
| 1474 | makes it possible to extend the set of terms. | 
|---|
| 1475 |  | 
|---|
| 1476 | @end table | 
|---|
| 1477 |  | 
|---|
| 1478 |  | 
|---|
| 1479 | @c --------------------------------------------------------------------- | 
|---|
| 1480 | @node ser pattern | 
|---|
| 1481 | @subsection Pattern | 
|---|
| 1482 |  | 
|---|
| 1483 | The @command{ser} pattern is a regular expression over terms corresponding | 
|---|
| 1484 | to text segments or segment sequences. Predefined terms are: | 
|---|
| 1485 |  | 
|---|
| 1486 | @table @code | 
|---|
| 1487 |  | 
|---|
| 1488 | @item seg(@var{t},@var{f},@var{a}) | 
|---|
| 1489 | a segment of type @var{t}, containing form @var{f} and annotation | 
|---|
| 1490 | @var{a} | 
|---|
| 1491 |  | 
|---|
| 1492 | @item form(@var{f}) | 
|---|
| 1493 | a segment containing form @var{f} | 
|---|
| 1494 |  | 
|---|
| 1495 | @item field(@var{f}) | 
|---|
| 1496 | a segment containing annotation field @var{f} | 
|---|
| 1497 |  | 
|---|
| 1498 | @item space(@var{f}) | 
|---|
| 1499 | a space segment of form @var{f} | 
|---|
| 1500 |  | 
|---|
| 1501 | @item word(@var{f}) | 
|---|
| 1502 | a word segment of form @var{f} | 
|---|
| 1503 |  | 
|---|
| 1504 | @item punct(@var{f}) | 
|---|
| 1505 | a punctuation segment of form @var{f} | 
|---|
| 1506 |  | 
|---|
| 1507 | @item number(@var{f}) | 
|---|
| 1508 | a number segment of form @var{f} | 
|---|
| 1509 |  | 
|---|
| 1510 | @item lexeme(@var{f}) | 
|---|
| 1511 | a word segment with lemma @var{f} | 
|---|
| 1512 |  | 
|---|
| 1513 | @item cat(@var{c}) | 
|---|
| 1514 | a word segment of category @var{c} | 
|---|
| 1515 |  | 
|---|
| 1516 | @end table | 
|---|
| 1517 |  | 
|---|
| 1518 | All arguments are optional. If an argument is omitted, an arbitrary | 
|---|
| 1519 | string of non-blank characters is assumed as the argument value. Term | 
|---|
| 1520 | arguments may be arbitrary character-level regular expressions. The | 
|---|
| 1521 | following special symbols can be used: | 
|---|
| 1522 |  | 
|---|
| 1523 | @multitable {aaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} | 
|---|
| 1524 | @item @code{[@dots{}]}            @tab a character class | 
|---|
| 1525 | @item @code{[^@dots{}]}           @tab a negated character class | 
|---|
| 1526 | @item @code{|}                    @tab alternative | 
|---|
| 1527 | @item @code{*}                    @tab repetition, including zero times | 
|---|
| 1528 | @item @code{+}                    @tab repetition, at least one time | 
|---|
| 1529 | @item @code{?}                    @tab optionality | 
|---|
| 1530 | @item @code{@{@var{m},@var{n}@}}  @tab repetition from @var{m} to @var{n} times | 
|---|
| 1531 | @item @code{@{@var{m},@}}         @tab repetition @var{m} or more times | 
|---|
| 1532 | @item @code{@{@var{m}@}}          @tab repetition @var{m} times | 
|---|
| 1533 | @item @code{\@var{ddd}}           @tab the character with octal value @var{ddd} | 
|---|
| 1534 | @item @code{\x@var{hh}}           @tab the character with hexadecimal value @var{hh} | 
|---|
| 1535 | @item @code{( )}                  @tab parentheses, used to override precedence | 
|---|
| 1536 | @c @end multitable | 
|---|
| 1537 |  | 
|---|
| 1538 | @c @multitable {aaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} | 
|---|
| 1539 | @item @code{.}    @tab a non-blank character | 
|---|
| 1540 | @item @code{\w}   @tab a letter | 
|---|
| 1541 | @item @code{\W}   @tab a non-blank character other than a letter | 
|---|
| 1542 | @item @code{\d}   @tab a digit | 
|---|
| 1543 | @item @code{\D}   @tab a non-blank character other than a digit | 
|---|
| 1544 | @item @code{\s}   @tab a space or tab character | 
|---|
| 1545 | @item @code{\S}   @tab a non-blank character (the same as @code{.}) | 
|---|
| 1546 | @item @code{\l}   @tab a lowercase letter | 
|---|
| 1547 | @item @code{\L}   @tab an uppercase letter | 
|---|
| 1548 | @end multitable | 
|---|
| 1549 |  | 
|---|
| 1550 |  | 
|---|
| 1551 | @noindent The following characters: | 
|---|
| 1552 | @example | 
|---|
| 1553 | @verb{%  [   ]   ^   |   *   +   ?   {   }   ,   .   <   >   \ %} | 
|---|
| 1554 | @end example | 
|---|
| 1555 | must be escaped with a backslash, i.e. written as: | 
|---|
| 1556 | @example | 
|---|
| 1557 | @verb{% \[  \]  \^  \|  \*  \+  \?  \{  \}  \,  \.  \<  \>  \\ %} | 
|---|
| 1558 | @end example | 
|---|
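As an illustration of the escaping rule above, the following Python sketch backslash-escapes the listed characters in a literal string before it is placed in a term argument. The helper name @code{escape_term_argument} is hypothetical and is not part of UTT; only the character list is taken from this manual.

```python
# Characters this manual lists as needing a backslash inside a term argument.
SPECIAL_CHARS = set("[]^|*+?{},.<>\\")


def escape_term_argument(text: str) -> str:
    """Hypothetical helper: put a backslash before every listed special character."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


print(escape_term_argument("np."))  # np\.
print(escape_term_argument("a+b"))  # a\+b
```

The escaped string can then be used as a literal form, for example inside @code{word(@dots{})}.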
| 1559 |  | 
|---|
| 1560 | @quotation Note | 
|---|
| 1561 | The special symbols are borrowed from Perl with minor | 
|---|
| 1562 | modifications made for convenience, so the meaning of | 
|---|
| 1563 | certain special characters and sequences differs slightly | 
|---|
| 1564 | from their usual meaning. | 
|---|
| 1565 | The meaning of the @code{.} special character is modified due to | 
|---|
| 1566 | the special function of spaces in UTT files (they are field | 
|---|
| 1567 | separators). Use @code{\s} to explicitly match a space or tab character. | 
|---|
| 1568 | @end quotation | 
|---|
| 1569 |  | 
|---|
| 1570 | In the argument of the @code{cat} term, a special operator @code{<@dots{}>} may be | 
|---|
| 1571 | used. A category specification enclosed in angle brackets matches all | 
|---|
| 1572 | category descriptions which are consistent (non-contradictory) with the | 
|---|
| 1573 | specification. For example, @code{<N>} matches all noun descriptions, and | 
|---|
| 1574 | @code{<ADJ/Can>} matches all adjectives in the accusative or nominative case. | 
|---|
| 1575 |  | 
|---|
| 1576 |  | 
|---|
| 1577 | @* | 
|---|
| 1578 | @noindent @b{Examples of one-segment patterns:} | 
|---|
| 1579 |  | 
|---|
| 1580 | @multitable {aaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} | 
|---|
| 1581 | @item @code{seg}            @tab any segment | 
|---|
| 1582 | @item @code{word}           @tab any word-form | 
|---|
| 1583 | @item @code{word(pomocy)}   @tab the word-form @samp{pomocy} | 
|---|
| 1584 | @item @code{word(naj.+)}    @tab a word-form beginning with @samp{naj} | 
|---|
| 1585 | @item @code{word(\L\l+)}    @tab a capitalized word-form | 
|---|
| 1586 | @item @code{punct}          @tab a punctuation character | 
|---|
| 1587 | @item @code{space(.*\\n.*)} @tab a space segment containing a newline character | 
|---|
| 1588 | @item @code{lexeme(pomoc)}  @tab any form of the lexeme 'pomoc' | 
|---|
| 1589 | @item @code{cat(N/.*)}      @tab a word whose category starts with @code{N/} | 
|---|
| 1590 | @item @code{cat(<N/Ca>)}    @tab a word whose category matches @code{N/Ca} | 
|---|
| 1591 | @end multitable | 
|---|
| 1592 |  | 
|---|
| 1593 | @* | 
|---|
| 1594 | @noindent @b{Examples of multi-segment patterns:} | 
|---|
| 1595 |  | 
|---|
| 1596 | @table @code | 
|---|
| 1597 |  | 
|---|
| 1598 | @item (word(\L) punct(\.) space?)+ word(\L\l+) | 
|---|
| 1599 | a sequence of initials followed by a surname | 
|---|
| 1600 |  | 
|---|
| 1601 | @item punct seg(W|S|N)* cat(<NPRO/Sr>) seg(W|S|N)* punct | 
|---|
| 1602 | a text fragment between two punctuation characters, containing an | 
|---|
| 1603 | occurrence of a relative pronoun | 
|---|
| 1604 |  | 
|---|
| 1605 | @end table | 
|---|
| 1606 |  | 
|---|
| 1607 |  | 
|---|
| 1608 | @node ser how ser works | 
|---|
| 1609 | @subsection How ser works | 
|---|
| 1610 |  | 
|---|
| 1611 | @node ser customization | 
|---|
| 1612 | @subsection Customization | 
|---|
| 1613 |  | 
|---|
| 1614 | @c All predefined terms correspond to single segments, | 
|---|
| 1615 |  | 
|---|
| 1616 | @example | 
|---|
| 1617 | define(`verbseq', `(cat(<V>) (space cat(<V>)))') | 
|---|
| 1618 | @end example | 
|---|
| 1619 |  | 
|---|
| 1620 |  | 
|---|
| 1621 | the term @code{cat()} may not be used as a ... of | 
|---|
| 1622 |  | 
|---|
| 1623 | @c See @command{m4} manual for further details on macro definition format. | 
|---|
| 1624 |  | 
|---|
| 1625 | @node ser limitations | 
|---|
| 1626 | @subsection Limitations | 
|---|
| 1627 |  | 
|---|
| 1628 | Do not use more than 3 attributes in a @code{<@dots{}>} category specification. | 
|---|
| 1629 |  | 
|---|
| 1630 | @node ser requirements | 
|---|
| 1631 | @subsection Requirements | 
|---|
| 1632 |  | 
|---|
| 1633 | In order to run @command{ser}, the following programs must be | 
|---|
| 1634 | installed in the system: | 
|---|
| 1635 |  | 
|---|
| 1636 | @itemize | 
|---|
| 1637 |  | 
|---|
| 1638 | @item @command{m4} | 
|---|
| 1639 | @item @command{grep} | 
|---|
| 1640 | @item @command{flex} | 
|---|
| 1641 | @item @command{gcc} | 
|---|
| 1642 |  | 
|---|
| 1643 | @end itemize | 
|---|
| 1644 |  | 
|---|
| 1645 |  | 
|---|
| 1646 | @c --------------------------------------------------------------------- | 
|---|
| 1647 | @c GRP | 
|---|
| 1648 | @c --------------------------------------------------------------------- | 
|---|
| 1649 |  | 
|---|
| 1650 | @page | 
|---|
| 1651 | @node grp | 
|---|
| 1652 | @section grp - pattern search tool | 
|---|
| 1653 |  | 
|---|
| 1654 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} | 
|---|
| 1655 | @item @strong{Authors:}                 @tab Tomasz Obrębski | 
|---|
| 1656 | @item @strong{Component category:}      @tab filter | 
|---|
| 1657 | @item @strong{Input format:}            @tab UTT flattened | 
|---|
| 1658 | @item @strong{Output format:}           @tab UTT flattened | 
|---|
| 1659 | @item @strong{Required annotation:}     @tab tok, sen, lem --one-field | 
|---|
| 1660 | @end multitable | 
|---|
| 1661 |  | 
|---|
| 1662 |  | 
|---|
| 1663 | @menu | 
|---|
| 1664 | * grp description:: | 
|---|
| 1665 | * grp command line options:: | 
|---|
| 1666 | * grp pattern:: | 
|---|
| 1667 | * grp hints:: | 
|---|
| 1668 | @end menu | 
|---|
| 1669 |  | 
|---|
| 1670 |  | 
|---|
| 1671 | @node grp description | 
|---|
| 1672 | @subsection Description | 
|---|
| 1673 |  | 
|---|
| 1674 | @command{grp} selects sentences containing an expression matching a | 
|---|
| 1675 | pattern. The pattern format is exactly the same as that accepted by | 
|---|
| 1676 | @command{ser}. | 
|---|
| 1677 |  | 
|---|
| 1678 | @command{grp} is intended mainly for speeding up the corpus search process. | 
|---|
| 1679 | It is extremely fast (processing speed is usually higher than the speed | 
|---|
| 1680 | of reading the corpus file from disk). | 
|---|
| 1681 |  | 
|---|
| 1682 | @node grp command line options | 
|---|
| 1683 | @subsection Command line options | 
|---|
| 1684 |  | 
|---|
| 1685 | @table @code | 
|---|
| 1686 |  | 
|---|
| 1687 | @parhelp | 
|---|
| 1688 | @parversion | 
|---|
| 1689 | @parprocess | 
|---|
| 1690 | @parinteractive | 
|---|
| 1691 |  | 
|---|
| 1692 | @item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}} | 
|---|
| 1693 | The search pattern. | 
|---|
| 1694 |  | 
|---|
| 1695 | @item @b{@minus{}@minus{}morph=@var{field}} | 
|---|
| 1696 | The name of the annotation field containing the morphological | 
|---|
| 1697 | description (default @code{lem}). | 
|---|
| 1698 |  | 
|---|
| 1699 | @item @b{@minus{}@minus{}command} | 
|---|
| 1700 | Only print the generated flex source code. | 
|---|
| 1701 |  | 
|---|
| 1702 | @item @b{@minus{}@minus{}macro=@var{filename}} | 
|---|
| 1703 | Read macro definitions from file @var{filename} rather than from the | 
|---|
| 1704 | default location. This option makes it possible to redefine the set of terms. | 
|---|
| 1705 |  | 
|---|
| 1706 | @item @b{@minus{}@minus{}define=@var{filename}} | 
|---|
| 1707 | Append macro definitions from file @var{filename}. This option | 
|---|
| 1708 | makes it possible to extend the set of terms. | 
|---|
| 1709 |  | 
|---|
| 1710 | @end table | 
|---|
| 1711 |  | 
|---|
| 1712 |  | 
|---|
| 1713 | @node grp pattern | 
|---|
| 1714 | @subsection Pattern | 
|---|
| 1715 |  | 
|---|
| 1716 | See the description of the @command{ser} pattern (@pxref{ser pattern}). | 
|---|
| 1717 |  | 
|---|
| 1718 | @node grp hints | 
|---|
| 1719 | @subsection Hints | 
|---|
| 1720 |  | 
|---|
| 1721 | The corpus search speed may be increased by combining @command{grp} with the @command{lzop} | 
|---|
| 1722 | compression tool (@command{grp} usually processes data faster than it is read from a | 
|---|
| 1723 | disk, especially for slow laptop drives). | 
|---|
| 1724 |  | 
|---|
| 1725 | @example | 
|---|
| 1726 | cat corpus | tok | sen | lem -1 | fla | lzop -7 > corpus.grp.lzo | 
|---|
| 1727 | @end example | 
|---|
| 1728 |  | 
|---|
| 1729 | @example | 
|---|
| 1730 | lzop -cd corpus.grp.lzo | grp -e @var{EXPR} | unfla | ser -e @var{EXPR} | 
|---|
| 1731 | @end example | 
|---|
| 1732 |  | 
|---|
| 1733 |  | 
|---|
| 1734 |  | 
|---|
| 1735 | @c --------------------------------------------------------------------- | 
|---|
| 1736 | @c MAR | 
|---|
| 1737 | @c --------------------------------------------------------------------- | 
|---|
| 1738 |  | 
|---|
| 1739 | @page | 
|---|
| 1740 | @node mar | 
|---|
| 1741 | @section mar | 
|---|
| 1742 |  | 
|---|
| 1743 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} | 
|---|
| 1744 | @item @strong{Authors:}                 @tab Marcin Walas, Tomasz Obrębski | 
|---|
| 1745 | @item @strong{Input format:}            @tab UTT flattened | 
|---|
| 1746 | @item @strong{Output format:}           @tab UTT flattened | 
|---|
| 1747 | @item @strong{Required annotation:}     @tab tok, sen, lem -1 | 
|---|
| 1748 | @end multitable | 
|---|
| 1749 |  | 
|---|
| 1750 | @subsection Description | 
|---|
| 1751 | @code{mar} is a Perl script which matches a given pattern against UTT-formatted text | 
|---|
| 1752 | and marks the matching parts with any number of user-defined tags. | 
|---|
| 1753 |  | 
|---|
| 1754 | @subsection Command line options | 
|---|
| 1755 | @table @code | 
|---|
| 1756 | @parhelp | 
|---|
| 1757 | @parversion | 
|---|
| 1758 |  | 
|---|
| 1759 | @item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}} | 
|---|
| 1760 | The search pattern. | 
|---|
| 1761 | @item @b{@minus{}@minus{}action=@var{action}, @minus{}a @var{action} [p] [s] [P]} | 
|---|
| 1762 | Perform only the indicated actions, where: | 
|---|
| 1763 | @multitable {aaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} | 
|---|
| 1764 | @item @code{p}   @tab preprocess | 
|---|
| 1765 | @item @code{s}   @tab search | 
|---|
| 1766 | @item @code{P}   @tab postprocess | 
|---|
| 1767 | @end multitable | 
|---|
| 1768 | default: psP | 
|---|
| 1769 |  | 
|---|
| 1770 | @item @b{@minus{}@minus{}command} | 
|---|
| 1771 | Print the generated sed command, then exit. | 
|---|
| 1772 |  | 
|---|
| 1773 | @item @b{@minus{}@minus{}help, @minus{}h} | 
|---|
| 1774 | print help, then exit | 
|---|
| 1775 |  | 
|---|
| 1776 | @item @b{@minus{}@minus{}version, @minus{}v} | 
|---|
| 1777 | print version, then exit | 
|---|
| 1778 | @end table | 
|---|
| 1779 | @subsection Tokens in pattern | 
|---|
| 1780 | A @code{mar} pattern is based on @code{ser} patterns (@pxref{ser pattern}). It is a @code{ser} pattern | 
|---|
| 1781 | to which you can add any number of matching tags, which will be printed in exactly the places where | 
|---|
| 1782 | they appear in the pattern. A valid token starts with @@, followed by any number of alphanumeric | 
|---|
| 1783 | characters. For example, valid match tokens are @code{@@STARTMATCH} and @code{@@ENDMATCH}. | 
|---|
| 1784 |  | 
|---|
| 1785 | Matching tokens can be placed between, before or after any of the @code{ser} pattern terms. They don't have | 
|---|
| 1786 | to be paired. There can be any number of them in the pattern (zero or more). They don't have to be unique. | 
|---|
| 1787 | They can be placed one after another. For example: | 
|---|
| 1788 |  | 
|---|
| 1789 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaa} | 
|---|
| 1790 | @item @code{@@BOM lexeme(pomoc)}  @tab place tag @b{BOM} before any form of the lexeme 'pomoc' | 
|---|
| 1791 | @item @code{@@MATCH lexeme(pomoc) @@MATCH}      @tab place tag @b{MATCH} before and after any form of the lexeme 'pomoc' | 
|---|
| 1792 | @item @code{cat(<ADJ>) @@MATCH lexeme(pomoc) @@MATCH}      @tab place tag @b{MATCH} before and after any form of the lexeme 'pomoc' which is preceded by an adjective | 
|---|
| 1793 | @item @code{cat(<ADJ>) @@TAG @@BOM lexeme(pomoc) @@EOM}      @tab place tags @b{TAG} and @b{BOM} before any form of the lexeme 'pomoc' which is preceded by an adjective, and tag @b{EOM} after it | 
|---|
| 1794 | @end multitable | 
|---|
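The token syntax described above can be illustrated with a short Python sketch that separates the tags from the underlying @code{ser} pattern. It assumes that a tag is @code{@@} followed by one or more ASCII letters or digits and that terms are separated by whitespace. The function name @code{split_tags} is hypothetical, and the sketch does not reproduce how @code{mar} itself processes a pattern.

```python
import re

# Assumption: a tag is "@" followed by one or more ASCII letters or digits.
TAG_RE = re.compile(r"@[A-Za-z0-9]+")


def split_tags(pattern: str):
    """Return (tags in order of appearance, pattern with the tags removed)."""
    tags = TAG_RE.findall(pattern)
    # Replace each tag with a space, then collapse runs of whitespace.
    stripped = " ".join(TAG_RE.sub(" ", pattern).split())
    return tags, stripped


tags, ser_pattern = split_tags("cat(<ADJ>) @TAG @BOM lexeme(pomoc) @EOM")
print(tags)         # ['@TAG', '@BOM', '@EOM']
print(ser_pattern)  # cat(<ADJ>) lexeme(pomoc)
```

Removing the tags leaves a plain @code{ser} pattern, which matches the statement that a @code{mar} pattern is a @code{ser} pattern with tags added.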
| 1795 |  | 
|---|
| 1796 | (See @code{mar -h} for more information.) | 
|---|
| 1797 |  | 
|---|
| 1798 | @subsection How mar works | 
|---|
| 1799 | @code{mar} uses the @code{m4} macro processor to translate the given @code{ser} pattern into a regular expression. It then converts the regular expression into a @code{sed} command script, which is executed. | 
|---|
| 1800 |  | 
|---|
| 1801 | You can view the generated sed script by using the @code{@minus{}@minus{}command} option. | 
|---|
| 1802 | @subsection Limitations | 
|---|
| 1803 | The complexity of computations performed by @code{mar} increases linearly with the number of tokens placed in the pattern, so it is recommended not to use too many of them. | 
|---|
| 1804 | @subsection Requirements | 
|---|
| 1805 | In order to run @code{mar}, the following programs must be installed in the system: | 
|---|
| 1806 |  | 
|---|
| 1807 | @itemize | 
|---|
| 1808 |  | 
|---|
| 1809 | @item @command{m4} | 
|---|
| 1810 | @item @command{grep} | 
|---|
| 1811 | @item @command{sed} | 
|---|
| 1812 |  | 
|---|
| 1813 | @end itemize | 
|---|
| 1814 |  | 
|---|
| 1815 |  | 
|---|
| 1816 |  | 
|---|
| 1817 | @c --------------------------------------------------------------------- | 
|---|
| 1818 | @c KOT | 
|---|
| 1819 | @c --------------------------------------------------------------------- | 
|---|
| 1820 |  | 
|---|
| 1821 | @page | 
|---|
| 1822 | @node kot | 
|---|
| 1823 | @section kot - untokenizer | 
|---|
| 1824 |  | 
|---|
| 1825 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} | 
|---|
| 1826 | @item @strong{Authors:}                 @tab Tomasz Obrębski | 
|---|
| 1827 | @item @strong{Component category:}      @tab filter | 
|---|
| 1828 | @item @strong{Input format:}            @tab UTT regular | 
|---|
| 1829 | @item @strong{Output format:}           @tab text | 
|---|
| 1830 | @item @strong{Required annotation:}     @tab tok | 
|---|
| 1831 | @end multitable | 
|---|
| 1832 |  | 
|---|
| 1833 |  | 
|---|
| 1834 | @menu | 
|---|
| 1835 | * kot description:: | 
|---|
| 1836 | * kot command line options:: | 
|---|
| 1837 | * kot usage examples:: | 
|---|
| 1838 | @end menu | 
|---|
| 1839 |  | 
|---|
| 1840 | @node kot description | 
|---|
| 1841 | @subsection Description | 
|---|
| 1842 |  | 
|---|
| 1843 | @command{kot} transforms a UTT formatted file back into raw text format. | 
|---|
| 1844 |  | 
|---|
| 1845 | @node kot command line options | 
|---|
| 1846 | @subsection Command line options | 
|---|
| 1847 |  | 
|---|
| 1848 | @table @code | 
|---|
| 1849 |  | 
|---|
| 1850 | @parhelp | 
|---|
| 1851 |  | 
|---|
| 1852 | @c @item @b{@minus{}@minus{}version}, @b{@minus{}v} | 
|---|
| 1853 |  | 
|---|
| 1854 | @c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}} | 
|---|
| 1855 |  | 
|---|
| 1856 | @c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}} | 
|---|
| 1857 |  | 
|---|
| 1858 | @c @item @b{@minus{}@minus{}interactive @minus{}i} | 
|---|
| 1859 |  | 
|---|
| 1860 | @c @item @b{@minus{}@minus{}config=@var{filename}} | 
|---|
| 1861 |  | 
|---|
| 1862 | @item | 
|---|
| 1863 |  | 
|---|
| 1864 | @item @b{@minus{}@minus{}gap-fill=@var{string}, @minus{}g @var{string}} | 
|---|
| 1865 | print @var{string} between nonadjacent segments of the input file | 
|---|
| 1866 |  | 
|---|
| 1867 | @item @b{@minus{}@minus{}spaces, @minus{}r} | 
|---|
| 1868 | retain the special characters @code{_}, @code{\t}, | 
|---|
| 1869 | @code{\n}, @code{\r}, @code{\f} unexpanded in the output | 
|---|
| 1870 |  | 
|---|
| 1871 | @end table | 
|---|
| 1872 |  | 
|---|
| 1873 | @node kot usage examples | 
|---|
| 1874 | @subsection Usage examples | 
|---|
| 1875 |  | 
|---|
| 1876 | @example | 
|---|
| 1877 | cat legia.txt | tok | kot | 
|---|
| 1878 | @end example | 
|---|
| 1879 |  | 
|---|
| 1880 | @example | 
|---|
| 1881 | cat legia.txt | tok | lem -1 | kot | 
|---|
| 1882 | @end example | 
|---|
| 1883 |  | 
|---|
| 1884 | @c --------------------------------------------------------------- | 
|---|
| 1885 | @c CON | 
|---|
| 1886 | @c --------------------------------------------------------------- | 
|---|
| 1887 |  | 
|---|
| 1888 |  | 
|---|
| 1889 | @page | 
|---|
| 1890 | @node con | 
|---|
| 1891 | @section con - concordance table generator | 
|---|
| 1892 |  | 
|---|
| 1893 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} | 
|---|
| 1894 | @item @strong{Authors:}                 @tab Justyna Walkowska | 
|---|
| 1895 | @item @strong{Component category:}      @tab sink | 
|---|
| 1896 | @item @strong{Input format:}            @tab UTT regular | 
|---|
| 1897 | @item @strong{Output format:}           @tab text | 
|---|
| 1898 | @item @strong{Required annotation:}     @tab ser or mar | 
|---|
| 1899 | @end multitable | 
|---|
| 1900 | @c | 
|---|
| 1901 |  | 
|---|
| 1902 | @menu | 
|---|
| 1903 | * con description:: | 
|---|
| 1904 | * con command line options:: | 
|---|
| 1905 | * con usage example:: | 
|---|
| 1906 | * con hints:: | 
|---|
| 1907 | @end menu | 
|---|
| 1908 |  | 
|---|
| 1909 |  | 
|---|
| 1910 | @node con description | 
|---|
| 1911 | @subsection Description | 
|---|
| 1912 |  | 
|---|
| 1913 | @command{con} generates a concordance table based on a pattern given to @command{ser}. | 
|---|
| 1914 |  | 
|---|
| 1915 |  | 
|---|
| 1916 | @node con command line options | 
|---|
| 1917 | @subsection Command line options | 
|---|
| 1918 |  | 
|---|
| 1919 | @table @code | 
|---|
| 1920 |  | 
|---|
| 1921 | @parhelp | 
|---|
| 1922 |  | 
|---|
| 1923 | @c @item @b{@minus{}@minus{}help}, @b{@minus{}h} | 
|---|
| 1924 | @c @item @b{@minus{}@minus{}version}, @b{@minus{}v} | 
|---|
| 1925 | @c @item @b{@minus{}@minus{}file=@var{filename}, @minus{}f @var{filename}} | 
|---|
| 1926 | @c @item @b{@minus{}@minus{}output=@var{filename}, @minus{}o @var{filename}} | 
|---|
| 1927 | @c @item @b{@minus{}@minus{}fail=@var{filename}, @minus{}e @var{filename}} [???] | 
|---|
| 1928 | @c @item @b{@minus{}@minus{}copy, @minus{}c} [???] | 
|---|
| 1929 | @c @item @b{@minus{}@minus{}input-field=@var{fieldname}, @minus{}I @var{fieldname}} | 
|---|
| 1930 | @c @item @b{@minus{}@minus{}output-field=@var{fieldname}, @minus{}O @var{fieldname}} | 
|---|
| 1931 | @c @item @b{@minus{}@minus{}process=@var{class}, @minus{}p @var{class}} | 
|---|
| 1932 | @c @item @b{@minus{}@minus{}interactive @minus{}i} | 
|---|
| 1933 | @c @item @b{@minus{}@minus{}config=@var{filename}} | 
|---|
| 1934 | @c @item | 
|---|
| 1935 | @c @item @b{@minus{}@minus{}pattern=@var{pattern}, @minus{}e @var{pattern}} | 
|---|
| 1936 | @c search pattern | 
|---|
| 1937 | @c | 
|---|
| 1938 | @c @item @b{@minus{}@minus{}flex} | 
|---|
| 1939 | @c only print the generated flex source code | 
|---|
| 1940 | @c | 
|---|
| 1941 | @c @item @b{@minus{}@minus{}macro=@var{filename}} | 
|---|
| 1942 | @c read macrodefinitions from file @var{filename} rather than from | 
|---|
| 1943 | @c default location. This option allows to redefine the set of terms. | 
|---|
| 1944 | @c | 
|---|
| 1945 | @c @item @b{@minus{}@minus{}define=@var{filename}} | 
|---|
| 1946 | @c append macrodefinitions from file @var{filename}. This option | 
|---|
| 1947 | @c allows to extend the set of terms. | 
|---|
| 1948 |  | 
|---|
| 1949 | @item @b{@minus{}@minus{}left @minus{}l} | 
|---|
| 1950 | Left context info (default='30c'). Example: | 
|---|
| 1951 | @example | 
|---|
| 1952 | -l=5c: left context is 5 characters | 
|---|
| 1953 | -l=5w: left context is 5 words | 
|---|
| 1954 | -l=5s: left context is 5 non-empty input lines | 
|---|
| 1955 | -l='\s*\S+\sr\S+BOS': left context starts with the given regex | 
|---|
| 1956 | @end example | 
|---|
| 1957 |  | 
|---|
| 1958 | @item @b{@minus{}@minus{}right @minus{}r} | 
|---|
| 1959 | Right context info (default='30c'). | 
|---|
| 1960 | @item @b{@minus{}@minus{}trim @minus{}t} | 
|---|
| 1961 | Clear incomplete words from output. | 
|---|
| 1962 | @item @b{@minus{}@minus{}white @minus{}w} | 
|---|
| 1963 | Do not convert whitespace characters into spaces. | 
|---|
| 1964 | @item @b{@minus{}@minus{}column @minus{}c} | 
|---|
| 1965 | Left column minimal width in characters (default = 0). | 
|---|
| 1966 | @item @b{@minus{}@minus{}ignore @minus{}i} | 
|---|
| 1967 | Ignore segment inconsistency in the input. | 
|---|
| 1968 | @item @b{@minus{}@minus{}bom} | 
|---|
| 1969 | Beginning of selected segment (regex, default='[0-9]+ [0-9]+ BOM .*'). | 
|---|
| 1970 | @item @b{@minus{}@minus{}eom} | 
|---|
| 1971 | End of selected segment (regex, default='[0-9]+ [0-9]+ EOM .*'). | 
|---|
| 1972 | @item @b{@minus{}@minus{}bod} | 
|---|
| 1973 | Selected segment beginning display string (default='['). | 
|---|
| 1974 | @item @b{@minus{}@minus{}eod} | 
|---|
| 1975 | Selected segment end display string (default=']'). | 
|---|
| 1976 |  | 
|---|
| 1977 |  | 
|---|
| 1978 |  | 
|---|
| 1979 | @end table | 
|---|
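The context specifications shown for @code{-l} can be read as a number followed by a unit letter, with anything else treated as a regular expression. A minimal Python sketch of that reading follows. The name @code{parse_context} is hypothetical, and the actual parsing done by @command{con} may differ.

```python
import re

# Unit letters as described for the -l option in this manual.
UNITS = {"c": "characters", "w": "words", "s": "lines"}


def parse_context(spec: str):
    """Hypothetical parser: '<number><c|w|s>' gives a size, anything else a regex."""
    m = re.fullmatch(r"(\d+)([cws])", spec)
    if m:
        return int(m.group(1)), UNITS[m.group(2)]
    return spec, "regex"


print(parse_context("30c"))  # (30, 'characters')
print(parse_context("5w"))   # (5, 'words')
```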
| 1980 |  | 
|---|
| 1981 | @node con usage example | 
|---|
| 1982 | @subsection Usage example | 
|---|
| 1983 | @example | 
|---|
| 1984 | cat file.txt | tok | lem -1 | ser -e 'lexeme(dom)' | con | 
|---|
| 1985 | @end example | 
|---|
| 1986 |  | 
|---|
| 1987 |  | 
|---|
| 1988 | @node con hints | 
|---|
| 1989 | @subsection Hints | 
|---|
| 1990 |  | 
|---|
| 1991 | @command{con} is a rather slow program. Do not pass large amounts of | 
|---|
| 1992 | redundant text through this program. @command{con} works fine in the following | 
|---|
| 1993 | sequence: | 
|---|
| 1994 |  | 
|---|
| 1995 | @example | 
|---|
| 1996 | ... | grp -e EXPR | ser -e EXPR | con | 
|---|
| 1997 | @end example | 
|---|
| 1998 |  | 
|---|
| 1999 |  | 
|---|
| 2000 | @c --------------------------------------------------------------------- | 
|---|
| 2001 | @c --------------------------------------------------------------------- | 
|---|
| 2002 |  | 
|---|
| 2003 | @page | 
|---|
| 2004 | @node Auxiliary tools | 
|---|
| 2005 | @chapter Auxiliary tools | 
|---|
| 2006 |  | 
|---|
| 2007 | @menu | 
|---|
| 2008 | * compiledic::         dictionary compiler | 
|---|
| 2009 | * fla::                UTT file flattener | 
|---|
| 2010 | * unfla::              UTT file unflattener | 
|---|
| 2011 | @end menu | 
|---|
| 2012 |  | 
|---|
| 2013 |  | 
|---|
| 2014 | @page | 
|---|
| 2015 | @node compiledic | 
|---|
| 2016 | @section compiledic - the dictionary compiler | 
|---|
| 2017 |  | 
|---|
| 2018 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} | 
|---|
| 2019 | @item @strong{Authors:}                 @tab MichaÅ Stolarski, Tomasz ObrÄbski | 
|---|
| 2020 | @item @strong{Component category:}      @tab additional tool | 
|---|
| 2021 | @end multitable | 
|---|
| 2022 | @c | 
|---|
| 2023 |  | 
|---|
| 2024 | @command{compiledic} compiles dictionaries in text format (@code{.dic} extension) into binary | 
|---|
| 2025 | (FSA) format (@code{.bin} extension). | 
|---|
| 2026 |  | 
|---|
| 2027 | Automaton representation of a dictionary is built using the AT&T tools: | 
|---|
| 2028 | @itemize | 
|---|
| 2029 | @item AT&T FSM Library, | 
|---|
| 2030 | @item AT&T Lextools. | 
|---|
| 2031 | @end itemize | 
|---|
| 2032 |  | 
|---|
| 2033 | In order for the compiledic program to work you have to install the | 
|---|
| 2034 | above mentioned packages into your system.  They are freely available | 
|---|
| 2035 | for non-commercial use. | 
|---|
| 2036 |  | 
|---|
| 2037 | Usage: | 
|---|
| 2038 | @example | 
|---|
| 2039 | compiledic <dictionaryname>.dic | 
|---|
| 2040 | @end example | 
|---|
| 2041 |  | 
|---|
| 2042 | The file <dictionaryname>.bin will be generated. | 
|---|
| 2043 |  | 
|---|
| 2044 | Remark: The program produces a lot of temporary files, which are | 
|---|
| 2045 | stored in the current directory. They are deleted after successful | 
|---|
| 2046 | termination of the program. | 
|---|
| 2047 |  | 
|---|
| 2048 | @c @menu | 
|---|
| 2049 | @c * con command line options:: | 
|---|
| 2050 | @c * con usage example:: | 
|---|
| 2051 | @c * con hints:: | 
|---|
| 2052 | @c @end menu | 
|---|
| 2053 |  | 
|---|
| 2054 |  | 
|---|
| 2055 | @c ------------------------------------------------------------------------------- | 
|---|
| 2056 | @c FLA | 
|---|
| 2057 | @c ------------------------------------------------------------------------------- | 
|---|
| 2058 |  | 
|---|
| 2059 | @page | 
|---|
| 2060 | @node fla | 
|---|
| 2061 | @section fla - the UTT file flattener | 
|---|
| 2062 |  | 
|---|
| 2063 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} | 
|---|
| 2064 | @item @strong{Authors:}                 @tab Tomasz Obrębski | 
|---|
| 2065 | @item @strong{Input format:}            @tab UTT regular | 
|---|
| 2066 | @item @strong{Output format:}           @tab UTT flattened | 
|---|
| 2067 | @item @strong{Required annotation:}     @tab sen | 
|---|
| 2068 | @end multitable | 
|---|
| 2069 | @c | 
|---|
| 2070 |  | 
|---|
| 2071 | @menu | 
|---|
| 2072 | * fla description:: | 
|---|
| 2073 | @c * fla command line options:: | 
|---|
| 2074 | @c * fla usage example:: | 
|---|
| 2075 | @end menu | 
|---|
| 2076 |  | 
|---|
| 2077 |  | 
|---|
| 2078 | @node fla description | 
|---|
| 2079 | @subsection Description | 
|---|
| 2080 |  | 
|---|
| 2081 | @command{fla} ``flattens'' a UTT file by merging segments belonging | 
|---|
| 2082 | to one sentence into one line. Technically, end-of-line characters | 
|---|
| 2083 | ('\n', ASCII code 10) are replaced with form-feed characters ('\f', | 
|---|
| 2084 | ASCII code 12).  The flattening makes it possible to process UTT files | 
|---|
| 2085 | with such tools as @command{grep} or @command{sed} sentence by | 
|---|
| 2086 | sentence (used in @command{grp} and @command{mar}). | 
|---|
| 2087 |  | 
|---|
| 2088 | Flattened files should have the suffix @code{.fla}, e.g. @file{thetext.utt.fla}. | 
|---|
| 2089 |  | 
|---|
| 2090 | Flattened files are still human-readable. | 
|---|
| 2091 |  | 
|---|
| 2092 | Usage: | 
|---|
| 2093 |  | 
|---|
| 2094 | @example | 
|---|
| 2095 | fla [<bosregex>] | 
|---|
| 2096 | @end example | 
|---|
| 2097 |  | 
|---|
| 2098 | The optional argument is a regular expression describing segments | 
|---|
| 2099 | which should be treated as sentence beginnings (the test is: the | 
|---|
| 2100 | segment contains a fragment matching the @code{<bosregex>}). By | 
|---|
| 2101 | default, segments containing a field @code{BOS} are sought. | 
|---|
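The flattening described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual @command{fla} implementation: the function name @code{flatten} is hypothetical, the sample segment lines are made up, and the exact placement of form-feed characters in real @command{fla} output is not specified in this section.

```python
import re


def flatten(lines, bos_regex="BOS"):
    """Group segment lines into sentences and join each group with form feeds."""
    bos = re.compile(bos_regex)
    sentences, current = [], []
    for line in lines:
        # A segment matching the sentence-beginning regex starts a new group.
        if bos.search(line) and current:
            sentences.append(current)
            current = []
        current.append(line)
    if current:
        sentences.append(current)
    return ["\f".join(sentence) for sentence in sentences]


# Made-up segment lines, for illustration only.
segments = [
    "0000 00 BOS *",
    "0000 03 W Ala",
    "0003 01 P .",
    "0004 00 EOS *",
    "0004 00 BOS *",
    "0004 02 W Ma",
]
flat = flatten(segments)
print(len(flat))            # 2
print(flat[1].split("\f"))  # ['0004 00 BOS *', '0004 02 W Ma']
```

Because each sentence now occupies a single line, line-oriented tools see one sentence per record.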
| 2102 |  | 
|---|
| 2103 | @c ------------------------------------------------------------------------------- | 
|---|
| 2104 | @c UNFLA | 
|---|
| 2105 | @c ------------------------------------------------------------------------------- | 
|---|
| 2106 |  | 
|---|
| 2107 | @page | 
|---|
| 2108 | @node unfla | 
|---|
| 2109 | @section unfla - the UTT file unflattener | 
|---|
| 2110 |  | 
|---|
| 2111 | @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} | 
|---|
| 2112 | @item @strong{Authors:}                 @tab Tomasz Obrębski | 
|---|
| 2113 | @item @strong{Input format:}            @tab UTT flattened | 
|---|
| 2114 | @item @strong{Output format:}           @tab UTT regular | 
|---|
| 2115 | @item @strong{Required annotation:}     @tab - | 
|---|
| 2116 | @end multitable | 
|---|
| 2117 |  | 
|---|
| 2118 | @menu | 
|---|
| 2119 | * unfla description:: | 
|---|
| 2120 | @c * fla command line options:: | 
|---|
| 2121 | @c * fla usage example:: | 
|---|
| 2122 | @end menu | 
|---|
| 2123 |  | 
|---|
| 2124 | @node unfla description | 
|---|
| 2125 | @subsection Description | 
|---|
| 2126 | @command{unfla} transforms a flattened UTT file, produced by | 
|---|
| 2127 | @command{fla}, into the regular format by restoring end-of-line | 
|---|
| 2128 | characters. | 
|---|
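The reverse operation can be sketched in the same way. The function name @code{unflatten} is hypothetical, and the sketch assumes that segments within a flattened line are separated by form-feed characters ('\f', ASCII code 12), as described for @command{fla}.

```python
def unflatten(flat_lines):
    """Split each flattened line back into one segment per line."""
    return [segment for line in flat_lines for segment in line.split("\f")]


# Made-up flattened input, for illustration only.
flat = ["0000 00 BOS *\f0000 03 W Ala", "0004 00 BOS *"]
print(unflatten(flat))
# ['0000 00 BOS *', '0000 03 W Ala', '0004 00 BOS *']
```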
| 2129 |  | 
|---|
| 2130 |  | 
|---|
| 2131 |  | 
|---|
| 2132 |  | 
|---|
| 2133 | @c --------------------------------------------------------------------- | 
|---|
| 2134 | @c USAGE EXAMPLES | 
|---|
| 2135 | @c --------------------------------------------------------------------- | 
|---|
| 2136 |  | 
|---|
| 2137 | @node Usage examples | 
|---|
| 2138 | @chapter Usage examples | 
|---|
| 2139 |  | 
|---|
| 2140 | @subsubheading Simple pipelines | 
|---|
| 2141 |  | 
|---|
| 2142 | @enumerate | 
|---|
| 2143 |  | 
|---|
| 2144 | @item tokenization | 
|---|
| 2145 | @example | 
|---|
| 2146 | cat text | tok > output1 | 
|---|
| 2147 | @end example | 
|---|
| 2148 | @item morphological annotation (1) | 
|---|
| 2149 |  | 
|---|
| 2150 | simple dictionary based lemmatization | 
|---|
| 2151 | @example | 
|---|
| 2152 | cat text | tok | lem > output1 | 
|---|
| 2153 | @end example | 
|---|
| 2154 | @item morphological annotation (2) | 
|---|
| 2155 |  | 
|---|
| 2156 | 1) perform dictionary-based lemmatization | 
|---|
| 2157 | 4) guess descriptions for words which have no annotation | 
|---|
| 2158 |  | 
|---|
| 2159 | @example | 
|---|
| 2160 | cat text | tok | lem | gue -S lem > output2 | 
|---|
| 2161 | @end example | 
|---|
| 2162 |  | 
|---|
| 2163 | @item morphological annotation (3) | 
|---|
| 2164 |  | 
|---|
| 2165 | 1) perform dictionary-based lemmatization | 
|---|
| 2166 | 2) try to correct words with no annotation | 
|---|
| 2167 | 3) perform dictionary-based lemmatization of corrected words | 
|---|
| 2168 | 4) guess descriptions for words which still have no annotation | 
|---|
| 2169 |  | 
|---|
| 2170 | @example | 
|---|
| 2171 | cat text | tok | lem | cor -p W -S lem | lem -I cor | gue -p W -S lem | 
|---|
| 2172 | @end example | 
|---|
| 2173 | @item spelling correction | 
|---|
| 2174 |  | 
|---|
| 2175 |  | 
|---|
| 2176 |  | 
|---|
| 2177 | @example | 
|---|
| 2178 | cat text | tok | egrep ' W ' | lem | egrep -v 'lem:' | cor -1 | 
|---|
| 2179 | @end example | 
|---|
| 2180 |  | 
|---|
| 2181 | @item Expression extraction | 
|---|
| 2182 |  | 
|---|
| 2183 | Extraction of all occurrences of a verb followed by a form of the noun 'rozmowa'. | 
|---|
| 2184 |  | 
|---|
| 2185 | @example | 
|---|
| 2186 | cat text | tok | lem -1 | ser -e 'cat(<V>) space lexeme(rozmowa)' -m | kot > output4 | 
|---|
| 2187 | @end example | 
|---|
| 2188 |  | 
|---|
| 2189 | @item A word in context | 
|---|
| 2190 |  | 
|---|
| 2191 | Extraction of text fragments containing a form of the lexeme 'rozmowa' in | 
|---|
| 2192 | the context of 5 preceding and 5 following corpus segments. | 
|---|
| 2193 |  | 
|---|
| 2194 | @example | 
|---|
| 2195 | cat text | tok | lem -1 | ser -e 'seg@{5@} lexeme(rozmowa) seg@{5@}' -m | kot > output | 
|---|
| 2196 | @end example | 
|---|
| 2197 |  | 
|---|
| 2198 | @item generation of concordance table (1) | 
|---|
| 2199 |  | 
|---|
| 2200 | @example | 
|---|
| 2201 | cat text | tok | lem -1 | ser -e 'cat(<V>) space lexeme(rozmowa)' | con | 
|---|
| 2202 | @end example | 
|---|
| 2203 |  | 
|---|
| 2204 | Run time: 10 seconds. | 
|---|
| 2205 |  | 
|---|
| 2206 | @item generation of concordance table (2) | 
|---|
| 2207 |  | 
|---|
| 2208 | The same as above, but much faster. | 
|---|
| 2209 |  | 
|---|
| 2210 | @example | 
|---|
| 2211 | cat text | tok | lem -1 | \ | 
|---|
| 2212 | grp -e 'cat(<V>) space lexeme(rozmowa)' | \ | 
|---|
| 2213 | ser -e 'cat(<V>) space lexeme(rozmowa)' | \ | 
|---|
| 2214 | con | 
|---|
| 2215 | @end example | 
|---|
| 2216 |  | 
|---|
| 2217 | Run time: 2 seconds. | 
|---|
| 2218 |  | 
|---|
| 2219 | @item generation of concordance table (3) | 
|---|
| 2220 |  | 
|---|
| 2221 | Usually, one searches the same corpus repeatedly. In such a | 
|---|
| 2222 | case it is advisable to first transform the corpus data into the format | 
|---|
| 2223 | required by @command{grp}, and then search the preprocessed data. | 
|---|
| 2224 |  | 
|---|
| 2225 | As @command{grp} (@command{grep}) processes data faster than it is | 
|---|
| 2226 | read from the disk drive, the search time can be shortened further by | 
|---|
| 2227 | compressing the preprocessed data.  We suggest using the | 
|---|
| 2228 | @command{lzop} compressor/decompressor. | 
|---|
| 2229 |  | 
|---|
| 2230 | @item the fastest way to search a large corpus | 
|---|
| 2231 |  | 
|---|
| 2232 | step 1: corpus preprocessing | 
|---|
| 2233 |  | 
|---|
| 2234 | @example | 
|---|
| 2235 | cat corpus | tok | sen | lem -1 \ | 
|---|
| 2236 | | fla | lzop -7 > corpus.grp.lzo | 
|---|
| 2237 | @end example | 
|---|
| 2238 |  | 
|---|
| 2239 | step 2: search | 
|---|
| 2240 |  | 
|---|
| 2241 | @example | 
|---|
| 2242 | lzop -cd corpus.grp.lzo | unfla | grp -e 'cat(<V>) space lexeme(rozmowa)' \ | 
|---|
| 2243 | | ser -e 'cat(<V>) space lexeme(rozmowa)' | con | 
|---|
| 2244 | @end example | 
|---|
| 2245 |  | 
|---|
| 2246 | @end enumerate | 
|---|
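The two-step search above can be parameterized over the search
expression.  The following is only a sketch: it assembles the step 2
command line as a string and does not run any UTT program.  The function
name @code{search_pipeline} is hypothetical and not part of UTT, and the
sketch assumes, as in the examples above, that the same expression is
passed to both @command{grp} and @command{ser}.

```python
import shlex


def search_pipeline(corpus_path, expression):
    """Return the step 2 shell pipeline for a corpus preprocessed as in step 1."""
    # Quote the expression so that spaces, parentheses and <> reach the tools intact.
    quoted = shlex.quote(expression)
    stages = [
        f"lzop -cd {shlex.quote(corpus_path)}",  # decompress to standard output
        "unfla",                                  # restore the regular UTT format
        f"grp -e {quoted}",                       # grp stage, as in the examples above
        f"ser -e {quoted}",                       # ser stage with the same expression
        "con",                                    # build the concordance table
    ]
    return " | ".join(stages)


print(search_pipeline("corpus.grp.lzo", "cat(<V>) space lexeme(rozmowa)"))
```

Building the command in one place keeps the two copies of the expression
identical, which the hand-written pipelines have to ensure manually.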
| 2247 |  | 
|---|
| 2248 | @c @subsubheading More complicated configurations | 
|---|
| 2249 |  | 
|---|
| 2250 |  | 
|---|
| 2251 | @c @example | 
|---|
| 2252 | @c mknod fifo1 p | 
|---|
| 2253 | @c mknod fifo2 p | 
|---|
| 2254 | @c mknod fifo3 p | 
|---|
| 2255 | @c mknod fifo4 p | 
|---|
| 2256 | @c mknod fifo5 p | 
|---|
| 2257 |  | 
|---|
| 2258 | @c tok | lem -p W -e fifo1 > fifo2 & | 
|---|
| 2259 | @c cor -e fifo3 < fifo1 | lem > fifo4 & | 
|---|
| 2260 | @c gue < fifo3 > fifo5 & | 
|---|
| 2261 | @c sort -m fifo2 fifo4 fifo5 | 
|---|
| 2262 |  | 
|---|
| 2263 | @c rm fifo? | 
|---|
| 2264 | @c @end example | 
|---|
| 2265 |  | 
|---|
| 2266 |  | 
|---|
| 2267 | @c --------------------------------------------------------------------- | 
|---|
| 2268 | @c --------------------------------------------------------------------- | 
|---|
| 2269 |  | 
|---|
| 2270 | @c --------------------------------------------------------------------- | 
|---|
| 2271 | @c PMDBF DICTIONARY | 
|---|
| 2272 | @c --------------------------------------------------------------------- | 
|---|
| 2273 |  | 
|---|
| 2274 | @node PMDBF dictionary | 
|---|
| 2275 | @chapter PMDBF dictionary | 
|---|
| 2276 |  | 
|---|
| 2277 | UTT components come with lexical data derived from the Polish | 
|---|
| 2278 | Morphological Database (PMDB). | 
|---|
| 2279 |  | 
|---|
| 2280 | @menu | 
|---|
| 2281 | * PMDBF files:: | 
|---|
| 2282 | * PMDBF tag structure:: | 
|---|
| 2283 | * PMDBF parts of speech:: | 
|---|
| 2284 | * PMDBF morphosyntactic attributes:: | 
|---|
| 2285 | @end menu | 
|---|
| 2286 |  | 
|---|
| 2287 | @node PMDBF files | 
|---|
| 2288 | @section Files | 
|---|
| 2289 |  | 
|---|
| 2290 | @node PMDBF tag structure | 
|---|
| 2291 | @section Tag structure | 
|---|
| 2292 |  | 
|---|
| 2293 | pos = [[:upper:]]+ | 
|---|
| 2294 |  | 
|---|
| 2295 | attr = [[:upper:]]+ | 
|---|
| 2296 |  | 
|---|
| 2297 | val = [[:lower:][:digit:]?!*+-] | <[^>\n]+> | 
|---|
| 2298 |  | 
|---|
| 2299 | descr = pos ( / ( attr val + ) + ) ? | 
|---|
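The grammar above can be turned into a regular expression almost
literally.  The following Python sketch is an illustration, not part of
UTT: the function name @code{parse_descr} is hypothetical, the POSIX
classes are approximated by ASCII ranges, and the tag @code{N/GfNsCn}
is an example constructed from the grammar rather than taken from the
dictionary files.

```python
import re

# ASCII approximations of the POSIX classes used in the grammar (an assumption).
POS = r"[A-Z]+"
ATTR = r"[A-Z]+"
VAL = r"(?:[a-z0-9?!*+\-]|<[^>\n]+>)"

# descr = pos ( / ( attr val + ) + ) ?
DESCR = re.compile(rf"(?P<pos>{POS})(?:/(?P<attrs>(?:{ATTR}{VAL}+)+))?")
PAIR = re.compile(rf"({ATTR})((?:{VAL})+)")
VALUE = re.compile(VAL)


def parse_descr(descr):
    """Split a description into (pos, {attr: [values]}); raise ValueError if malformed."""
    match = DESCR.fullmatch(descr)
    if match is None:
        raise ValueError(f"not a valid description: {descr!r}")
    attrs = {}
    # Each pair is one attribute name followed by one or more values.
    for name, values in PAIR.findall(match.group("attrs") or ""):
        attrs[name] = VALUE.findall(values)
    return match.group("pos"), attrs


print(parse_descr("N/GfNsCn"))
print(parse_descr("ADV"))
```

Values are kept as lists because the grammar allows several values after
one attribute name.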
| 2300 |  | 
|---|
| 2301 | @node PMDBF parts of speech | 
|---|
| 2302 | @section Parts of speech | 
|---|
| 2303 |  | 
|---|
| 2304 | @multitable {ADJPRP} { adjectival-passive-participle } | 
|---|
| 2305 | @item @code{N} @tab noun | 
|---|
| 2306 | @item @code{NPRO} @tab nominal-pronoun | 
|---|
| 2307 | @item @code{NV} @tab deverbal-noun | 
|---|
| 2308 | @item @code{V} @tab verb | 
|---|
| 2309 | @item @code{BYC} @tab byc | 
|---|
| 2310 | @item @code{VNI} @tab non-inflected-verb | 
|---|
| 2311 | @item @code{ADJ} @tab adjective | 
|---|
| 2312 | @item @code{ADJPAP} @tab adjectival-passive-participle | 
|---|
| 2313 | @item @code{ADJPRP} @tab adjectival-present-participle | 
|---|
| 2314 | @item @code{ADJPP} @tab adjectival-past-participle | 
|---|
| 2315 | @item @code{ADJPRO} @tab adjectival-pronoun | 
|---|
| 2316 | @item @code{ADJNUM} @tab adjectival-numeral | 
|---|
| 2317 | @item @code{ADV} @tab adverb | 
|---|
| 2318 | @item @code{ADVANP} @tab adverbial-anterior-participle | 
|---|
| 2319 | @item @code{ADVPRP} @tab adverbial-present-participle | 
|---|
| 2320 | @item @code{ADVPRO} @tab adverbial-pronoun | 
|---|
| 2321 | @item @code{ADVNUM} @tab  adverbial-numeral | 
|---|
| 2322 | @item @code{P} @tab preposition | 
|---|
| 2323 | @item @code{PPRO} @tab prep-noun-pronoun | 
|---|
| 2324 | @item @code{CONJ} @tab conjunction | 
|---|
| 2325 | @item @code{EXCL} @tab exclamation | 
|---|
| 2326 | @item @code{APP} @tab call | 
|---|
| 2327 | @item @code{ONO} @tab onomatopoeia | 
|---|
| 2328 | @item @code{PART} @tab particle | 
|---|
| 2329 | @item @code{NUMCRD} @tab cardinal-numeral | 
|---|
| 2330 | @item @code{NUMCOL} @tab collective-numeral | 
|---|
| 2331 | @item @code{NUMPAR} @tab partitive-numeral | 
|---|
| 2332 | @item @code{NUMORD} @tab ordinal-numeral | 
|---|
| 2333 | @end multitable | 
|---|
| 2334 |  | 
|---|
| 2335 | @node PMDBF morphosyntactic attributes | 
|---|
| 2336 | @section Morphosyntactic attributes | 
|---|
| 2337 |  | 
|---|
| 2338 | @multitable {Attr} {Val} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} | 
|---|
| 2339 | @c @headitem Attr @tab Val @tab Description | 
|---|
| 2340 | @item | 
|---|
| 2341 | @code{A} @tab @tab Aspect | 
|---|
| 2342 | @item | 
|---|
| 2343 | @tab @code{p} @tab perfective, | 
|---|
| 2344 | @item | 
|---|
| 2345 | @tab @code{i} @tab imperfective. | 
|---|
| 2346 | @item | 
|---|
| 2347 | @item | 
|---|
| 2348 | @code{V} @tab @tab Verb-Form | 
|---|
| 2349 | @item | 
|---|
| 2350 | @tab @code{b} @tab infinitive, | 
|---|
| 2351 | @item | 
|---|
| 2352 | @tab @code{p} @tab personal, | 
|---|
| 2353 | @item | 
|---|
| 2354 | @tab @code{i} @tab impersonal. | 
|---|
| 2355 | @item | 
|---|
| 2356 | @item | 
|---|
| 2357 | @code{M} @tab @tab Mood | 
|---|
| 2358 | @item | 
|---|
| 2359 | @tab @code{d} @tab declarative, | 
|---|
| 2360 | @item | 
|---|
| 2361 | @tab @code{c} @tab conditional, | 
|---|
| 2362 | @item | 
|---|
| 2363 | @tab @code{i} @tab imperative. | 
|---|
| 2364 | @item | 
|---|
| 2365 | @item | 
|---|
| 2366 | @code{T} @tab @tab Tense | 
|---|
| 2367 | @item | 
|---|
| 2368 | @tab @code{a} @tab past, | 
|---|
| 2369 | @item | 
|---|
| 2370 | @tab @code{r} @tab present, | 
|---|
| 2371 | @item | 
|---|
| 2372 | @tab @code{f} @tab future. | 
|---|
| 2373 | @item | 
|---|
| 2374 | @item | 
|---|
| 2375 | @code{P} @tab @tab Person | 
|---|
| 2376 | @item | 
|---|
| 2377 | @tab @code{1} @tab 1, | 
|---|
| 2378 | @item | 
|---|
| 2379 | @tab @code{2} @tab 2, | 
|---|
| 2380 | @item | 
|---|
| 2381 | @tab @code{3} @tab 3. | 
|---|
| 2382 | @item | 
|---|
| 2383 | @item | 
|---|
| 2384 | @code{D} @tab @tab Degree | 
|---|
| 2385 | @item | 
|---|
| 2386 | @tab @code{p} @tab positive, | 
|---|
| 2387 | @item | 
|---|
| 2388 | @tab @code{c} @tab comparative, | 
|---|
| 2389 | @item | 
|---|
| 2390 | @tab @code{s} @tab superlative. | 
|---|
| 2391 | @item | 
|---|
| 2392 | @item | 
|---|
| 2393 | @code{N} @tab @tab Number | 
|---|
| 2394 | @item | 
|---|
| 2395 | @tab @code{s} @tab singular, | 
|---|
| 2396 | @item | 
|---|
| 2397 | @tab @code{p} @tab plural. | 
|---|
| 2398 | @item | 
|---|
| 2399 | @item | 
|---|
| 2400 | @code{C} @tab @tab Case | 
|---|
| 2401 | @item | 
|---|
| 2402 | @tab @code{n} @tab nominative, | 
|---|
| 2403 | @item | 
|---|
| 2404 | @tab @code{g} @tab genitive, | 
|---|
| 2405 | @item | 
|---|
| 2406 | @tab @code{d} @tab dative, | 
|---|
| 2407 | @item | 
|---|
| 2408 | @tab @code{a} @tab accusative, | 
|---|
| 2409 | @item | 
|---|
| 2410 | @tab @code{i} @tab instrumental, | 
|---|
| 2411 | @item | 
|---|
| 2412 | @tab @code{l} @tab locative, | 
|---|
| 2413 | @item | 
|---|
| 2414 | @tab @code{v} @tab vocative. | 
|---|
| 2415 | @item | 
|---|
| 2416 | @code{G} @tab @tab Gender | 
|---|
| 2417 | @item | 
|---|
| 2418 | @tab @code{p} @tab masculine-personal, | 
|---|
| 2419 | @item | 
|---|
| 2420 | @tab @code{a} @tab masculine-animal, | 
|---|
| 2421 | @item | 
|---|
| 2422 | @tab @code{i} @tab masculine-inanimate, | 
|---|
| 2423 | @item | 
|---|
| 2424 | @tab @code{f} @tab feminine, | 
|---|
| 2425 | @item | 
|---|
| 2426 | @tab @code{n} @tab neuter. | 
|---|
| 2427 | @end multitable | 
|---|
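Because value codes are reused across attributes (@code{p} means plural
under @code{N} but masculine-personal under @code{G}), a value can only
be read together with its attribute.  The Python sketch below shows one
way to look codes up; it is an illustration only, the names
@code{NOMINAL_ATTRIBUTES} and @code{expand} are hypothetical, and only
the Number, Case and Gender rows of the table are transcribed.

```python
# Attribute codes mapped to (attribute name, {value code: value name}).
# Only three rows of the table are transcribed here.
NOMINAL_ATTRIBUTES = {
    "N": ("Number", {"s": "singular", "p": "plural"}),
    "C": ("Case", {
        "n": "nominative", "g": "genitive", "d": "dative", "a": "accusative",
        "i": "instrumental", "l": "locative", "v": "vocative",
    }),
    "G": ("Gender", {
        "p": "masculine-personal", "a": "masculine-animal",
        "i": "masculine-inanimate", "f": "feminine", "n": "neuter",
    }),
}


def expand(pairs):
    """Translate (attribute code, value code) pairs into readable (name, value) pairs."""
    readable = []
    for attr, value in pairs:
        name, values = NOMINAL_ATTRIBUTES[attr]  # KeyError for untranscribed attributes
        readable.append((name, values[value]))
    return readable


print(expand([("C", "n"), ("G", "f"), ("N", "s")]))
```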
| 2428 |  | 
|---|
| 2429 |  | 
|---|
| 2430 | @c --------------------------------------------------------------------- | 
|---|
| 2431 | @c --------------------------------------------------------------------- | 
|---|
| 2432 | @c | 
|---|
| 2433 | @c @node Examples | 
|---|
| 2434 | @c @chapter Examples | 
|---|
| 2435 |  | 
|---|
| 2436 | @c ---------------------------------------------------------------------- | 
|---|
| 2437 | @c ---------------------------------------------------------------------- | 
|---|
| 2438 |  | 
|---|
| 2439 | @node    GNU Free Documentation License | 
|---|
| 2440 | @chapter GNU Free Documentation License | 
|---|
| 2441 |  | 
|---|
| 2442 | @c The GNU Free Documentation License. | 
|---|
| 2443 | @center Version 1.2, November 2002 | 
|---|
| 2444 |  | 
|---|
| 2445 | @c This file is intended to be included within another document, | 
|---|
| 2446 | @c hence no sectioning command or @node. | 
|---|
| 2447 |  | 
|---|
| 2448 | @display | 
|---|
| 2449 | Copyright @copyright{} 2000,2001,2002 Free Software Foundation, Inc. | 
|---|
| 2450 | 51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA | 
|---|
| 2451 |  | 
|---|
| 2452 | Everyone is permitted to copy and distribute verbatim copies | 
|---|
| 2453 | of this license document, but changing it is not allowed. | 
|---|
| 2454 | @end display | 
|---|
| 2455 |  | 
|---|
| 2456 | @enumerate 0 | 
|---|
| 2457 | @item | 
|---|
| 2458 | PREAMBLE | 
|---|
| 2459 |  | 
|---|
| 2460 | The purpose of this License is to make a manual, textbook, or other | 
|---|
| 2461 | functional and useful document @dfn{free} in the sense of freedom: to | 
|---|
| 2462 | assure everyone the effective freedom to copy and redistribute it, | 
|---|
| 2463 | with or without modifying it, either commercially or noncommercially. | 
|---|
| 2464 | Secondarily, this License preserves for the author and publisher a way | 
|---|
| 2465 | to get credit for their work, while not being considered responsible | 
|---|
| 2466 | for modifications made by others. | 
|---|
| 2467 |  | 
|---|
| 2468 | This License is a kind of ``copyleft'', which means that derivative | 
|---|
| 2469 | works of the document must themselves be free in the same sense.  It | 
|---|
| 2470 | complements the GNU General Public License, which is a copyleft | 
|---|
| 2471 | license designed for free software. | 
|---|
| 2472 |  | 
|---|
| 2473 | We have designed this License in order to use it for manuals for free | 
|---|
| 2474 | software, because free software needs free documentation: a free | 
|---|
| 2475 | program should come with manuals providing the same freedoms that the | 
|---|
| 2476 | software does.  But this License is not limited to software manuals; | 
|---|
| 2477 | it can be used for any textual work, regardless of subject matter or | 
|---|
| 2478 | whether it is published as a printed book.  We recommend this License | 
|---|
| 2479 | principally for works whose purpose is instruction or reference. | 
|---|
| 2480 |  | 
|---|
| 2481 | @item | 
|---|
| 2482 | APPLICABILITY AND DEFINITIONS | 
|---|
| 2483 |  | 
|---|
| 2484 | This License applies to any manual or other work, in any medium, that | 
|---|
| 2485 | contains a notice placed by the copyright holder saying it can be | 
|---|
| 2486 | distributed under the terms of this License.  Such a notice grants a | 
|---|
| 2487 | world-wide, royalty-free license, unlimited in duration, to use that | 
|---|
| 2488 | work under the conditions stated herein.  The ``Document'', below, | 
|---|
| 2489 | refers to any such manual or work.  Any member of the public is a | 
|---|
| 2490 | licensee, and is addressed as ``you''.  You accept the license if you | 
|---|
| 2491 | copy, modify or distribute the work in a way requiring permission | 
|---|
| 2492 | under copyright law. | 
|---|
| 2493 |  | 
|---|
| 2494 | A ``Modified Version'' of the Document means any work containing the | 
|---|
| 2495 | Document or a portion of it, either copied verbatim, or with | 
|---|
| 2496 | modifications and/or translated into another language. | 
|---|
| 2497 |  | 
|---|
| 2498 | A ``Secondary Section'' is a named appendix or a front-matter section | 
|---|
| 2499 | of the Document that deals exclusively with the relationship of the | 
|---|
| 2500 | publishers or authors of the Document to the Document's overall | 
|---|
| 2501 | subject (or to related matters) and contains nothing that could fall | 
|---|
| 2502 | directly within that overall subject.  (Thus, if the Document is in | 
|---|
| 2503 | part a textbook of mathematics, a Secondary Section may not explain | 
|---|
| 2504 | any mathematics.)  The relationship could be a matter of historical | 
|---|
| 2505 | connection with the subject or with related matters, or of legal, | 
|---|
| 2506 | commercial, philosophical, ethical or political position regarding | 
|---|
| 2507 | them. | 
|---|
| 2508 |  | 
|---|
| 2509 | The ``Invariant Sections'' are certain Secondary Sections whose titles | 
|---|
| 2510 | are designated, as being those of Invariant Sections, in the notice | 
|---|
| 2511 | that says that the Document is released under this License.  If a | 
|---|
| 2512 | section does not fit the above definition of Secondary then it is not | 
|---|
| 2513 | allowed to be designated as Invariant.  The Document may contain zero | 
|---|
| 2514 | Invariant Sections.  If the Document does not identify any Invariant | 
|---|
| 2515 | Sections then there are none. | 
|---|
| 2516 |  | 
|---|
| 2517 | The ``Cover Texts'' are certain short passages of text that are listed, | 
|---|
| 2518 | as Front-Cover Texts or Back-Cover Texts, in the notice that says that | 
|---|
| 2519 | the Document is released under this License.  A Front-Cover Text may | 
|---|
| 2520 | be at most 5 words, and a Back-Cover Text may be at most 25 words. | 
|---|
| 2521 |  | 
|---|
| 2522 | A ``Transparent'' copy of the Document means a machine-readable copy, | 
|---|
| 2523 | represented in a format whose specification is available to the | 
|---|
| 2524 | general public, that is suitable for revising the document | 
|---|
| 2525 | straightforwardly with generic text editors or (for images composed of | 
|---|
| 2526 | pixels) generic paint programs or (for drawings) some widely available | 
|---|
| 2527 | drawing editor, and that is suitable for input to text formatters or | 
|---|
| 2528 | for automatic translation to a variety of formats suitable for input | 
|---|
| 2529 | to text formatters.  A copy made in an otherwise Transparent file | 
|---|
| 2530 | format whose markup, or absence of markup, has been arranged to thwart | 
|---|
| 2531 | or discourage subsequent modification by readers is not Transparent. | 
|---|
| 2532 | An image format is not Transparent if used for any substantial amount | 
|---|
| 2533 | of text.  A copy that is not ``Transparent'' is called ``Opaque''. | 
|---|
| 2534 |  | 
|---|
| 2535 | Examples of suitable formats for Transparent copies include plain | 
|---|
| 2536 | @sc{ascii} without markup, Texinfo input format, La@TeX{} input | 
|---|
| 2537 | format, @acronym{SGML} or @acronym{XML} using a publicly available | 
|---|
| 2538 | @acronym{DTD}, and standard-conforming simple @acronym{HTML}, | 
|---|
| 2539 | PostScript or @acronym{PDF} designed for human modification.  Examples | 
|---|
| 2540 | of transparent image formats include @acronym{PNG}, @acronym{XCF} and | 
|---|
| 2541 | @acronym{JPG}.  Opaque formats include proprietary formats that can be | 
|---|
| 2542 | read and edited only by proprietary word processors, @acronym{SGML} or | 
|---|
| 2543 | @acronym{XML} for which the @acronym{DTD} and/or processing tools are | 
|---|
| 2544 | not generally available, and the machine-generated @acronym{HTML}, | 
|---|
| 2545 | PostScript or @acronym{PDF} produced by some word processors for | 
|---|
| 2546 | output purposes only. | 
|---|
| 2547 |  | 
|---|
| 2548 | The ``Title Page'' means, for a printed book, the title page itself, | 
|---|
| 2549 | plus such following pages as are needed to hold, legibly, the material | 
|---|
| 2550 | this License requires to appear in the title page.  For works in | 
|---|
| 2551 | formats which do not have any title page as such, ``Title Page'' means | 
|---|
| 2552 | the text near the most prominent appearance of the work's title, | 
|---|
| 2553 | preceding the beginning of the body of the text. | 
|---|
| 2554 |  | 
|---|
| 2555 | A section ``Entitled XYZ'' means a named subunit of the Document whose | 
|---|
| 2556 | title either is precisely XYZ or contains XYZ in parentheses following | 
|---|
| 2557 | text that translates XYZ in another language.  (Here XYZ stands for a | 
|---|
| 2558 | specific section name mentioned below, such as ``Acknowledgements'', | 
|---|
| 2559 | ``Dedications'', ``Endorsements'', or ``History''.)  To ``Preserve the Title'' | 
|---|
| 2560 | of such a section when you modify the Document means that it remains a | 
|---|
| 2561 | section ``Entitled XYZ'' according to this definition. | 
|---|
| 2562 |  | 
|---|
| 2563 | The Document may include Warranty Disclaimers next to the notice which | 
|---|
| 2564 | states that this License applies to the Document.  These Warranty | 
|---|
| 2565 | Disclaimers are considered to be included by reference in this | 
|---|
| 2566 | License, but only as regards disclaiming warranties: any other | 
|---|
| 2567 | implication that these Warranty Disclaimers may have is void and has | 
|---|
| 2568 | no effect on the meaning of this License. | 
|---|
| 2569 |  | 
|---|
| 2570 | @item | 
|---|
| 2571 | VERBATIM COPYING | 
|---|
| 2572 |  | 
|---|
| 2573 | You may copy and distribute the Document in any medium, either | 
|---|
| 2574 | commercially or noncommercially, provided that this License, the | 
|---|
| 2575 | copyright notices, and the license notice saying this License applies | 
|---|
| 2576 | to the Document are reproduced in all copies, and that you add no other | 
|---|
| 2577 | conditions whatsoever to those of this License.  You may not use | 
|---|
| 2578 | technical measures to obstruct or control the reading or further | 
|---|
| 2579 | copying of the copies you make or distribute.  However, you may accept | 
|---|
| 2580 | compensation in exchange for copies.  If you distribute a large enough | 
|---|
| 2581 | number of copies you must also follow the conditions in section 3. | 
|---|
| 2582 |  | 
|---|
| 2583 | You may also lend copies, under the same conditions stated above, and | 
|---|
| 2584 | you may publicly display copies. | 
|---|
| 2585 |  | 
|---|
| 2586 | @item | 
|---|
| 2587 | COPYING IN QUANTITY | 
|---|
| 2588 |  | 
|---|
| 2589 | If you publish printed copies (or copies in media that commonly have | 
|---|
| 2590 | printed covers) of the Document, numbering more than 100, and the | 
|---|
| 2591 | Document's license notice requires Cover Texts, you must enclose the | 
|---|
| 2592 | copies in covers that carry, clearly and legibly, all these Cover | 
|---|
| 2593 | Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on | 
|---|
| 2594 | the back cover.  Both covers must also clearly and legibly identify | 
|---|
| 2595 | you as the publisher of these copies.  The front cover must present | 
|---|
| 2596 | the full title with all words of the title equally prominent and | 
|---|
| 2597 | visible.  You may add other material on the covers in addition. | 
|---|
| 2598 | Copying with changes limited to the covers, as long as they preserve | 
|---|
| 2599 | the title of the Document and satisfy these conditions, can be treated | 
|---|
| 2600 | as verbatim copying in other respects. | 
|---|
| 2601 |  | 
|---|
| 2602 | If the required texts for either cover are too voluminous to fit | 
|---|
| 2603 | legibly, you should put the first ones listed (as many as fit | 
|---|
| 2604 | reasonably) on the actual cover, and continue the rest onto adjacent | 
|---|
| 2605 | pages. | 
|---|
| 2606 |  | 
|---|
| 2607 | If you publish or distribute Opaque copies of the Document numbering | 
|---|
| 2608 | more than 100, you must either include a machine-readable Transparent | 
|---|
| 2609 | copy along with each Opaque copy, or state in or with each Opaque copy | 
|---|
| 2610 | a computer-network location from which the general network-using | 
|---|
| 2611 | public has access to download using public-standard network protocols | 
|---|
| 2612 | a complete Transparent copy of the Document, free of added material. | 
|---|
| 2613 | If you use the latter option, you must take reasonably prudent steps, | 
|---|
| 2614 | when you begin distribution of Opaque copies in quantity, to ensure | 
|---|
| 2615 | that this Transparent copy will remain thus accessible at the stated | 
|---|
| 2616 | location until at least one year after the last time you distribute an | 
|---|
| 2617 | Opaque copy (directly or through your agents or retailers) of that | 
|---|
| 2618 | edition to the public. | 
|---|
| 2619 |  | 
|---|
| 2620 | It is requested, but not required, that you contact the authors of the | 
|---|
| 2621 | Document well before redistributing any large number of copies, to give | 
|---|
| 2622 | them a chance to provide you with an updated version of the Document. | 
|---|
| 2623 |  | 
|---|
| 2624 | @item | 
|---|
| 2625 | MODIFICATIONS | 
|---|
| 2626 |  | 
|---|
| 2627 | You may copy and distribute a Modified Version of the Document under | 
|---|
| 2628 | the conditions of sections 2 and 3 above, provided that you release | 
|---|
| 2629 | the Modified Version under precisely this License, with the Modified | 
|---|
| 2630 | Version filling the role of the Document, thus licensing distribution | 
|---|
| 2631 | and modification of the Modified Version to whoever possesses a copy | 
|---|
| 2632 | of it.  In addition, you must do these things in the Modified Version: | 
|---|
| 2633 |  | 
|---|
| 2634 | @enumerate A | 
|---|
| 2635 | @item | 
|---|
| 2636 | Use in the Title Page (and on the covers, if any) a title distinct | 
|---|
| 2637 | from that of the Document, and from those of previous versions | 
|---|
| 2638 | (which should, if there were any, be listed in the History section | 
|---|
| 2639 | of the Document).  You may use the same title as a previous version | 
|---|
| 2640 | if the original publisher of that version gives permission. | 
|---|
| 2641 |  | 
|---|
| 2642 | @item | 
|---|
| 2643 | List on the Title Page, as authors, one or more persons or entities | 
|---|
| 2644 | responsible for authorship of the modifications in the Modified | 
|---|
| 2645 | Version, together with at least five of the principal authors of the | 
|---|
| 2646 | Document (all of its principal authors, if it has fewer than five), | 
|---|
| 2647 | unless they release you from this requirement. | 
|---|
| 2648 |  | 
|---|
| 2649 | @item | 
|---|
| 2650 | State on the Title page the name of the publisher of the | 
|---|
| 2651 | Modified Version, as the publisher. | 
|---|
| 2652 |  | 
|---|
| 2653 | @item | 
|---|
| 2654 | Preserve all the copyright notices of the Document. | 
|---|
| 2655 |  | 
|---|
| 2656 | @item | 
|---|
| 2657 | Add an appropriate copyright notice for your modifications | 
|---|
| 2658 | adjacent to the other copyright notices. | 
|---|
| 2659 |  | 
|---|
| 2660 | @item | 
|---|
| 2661 | Include, immediately after the copyright notices, a license notice | 
|---|
| 2662 | giving the public permission to use the Modified Version under the | 
|---|
| 2663 | terms of this License, in the form shown in the Addendum below. | 
|---|
| 2664 |  | 
|---|
| 2665 | @item | 
|---|
| 2666 | Preserve in that license notice the full lists of Invariant Sections | 
|---|
| 2667 | and required Cover Texts given in the Document's license notice. | 
|---|
| 2668 |  | 
|---|
| 2669 | @item | 
|---|
| 2670 | Include an unaltered copy of this License. | 
|---|
| 2671 |  | 
|---|
| 2672 | @item | 
|---|
| 2673 | Preserve the section Entitled ``History'', Preserve its Title, and add | 
|---|
| 2674 | to it an item stating at least the title, year, new authors, and | 
|---|
| 2675 | publisher of the Modified Version as given on the Title Page.  If | 
|---|
| 2676 | there is no section Entitled ``History'' in the Document, create one | 
|---|
| 2677 | stating the title, year, authors, and publisher of the Document as | 
|---|
| 2678 | given on its Title Page, then add an item describing the Modified | 
|---|
| 2679 | Version as stated in the previous sentence. | 
|---|
| 2680 |  | 
|---|
| 2681 | @item | 
|---|
| 2682 | Preserve the network location, if any, given in the Document for | 
|---|
| 2683 | public access to a Transparent copy of the Document, and likewise | 
|---|
| 2684 | the network locations given in the Document for previous versions | 
|---|
| 2685 | it was based on.  These may be placed in the ``History'' section. | 
|---|
| 2686 | You may omit a network location for a work that was published at | 
|---|
| 2687 | least four years before the Document itself, or if the original | 
|---|
| 2688 | publisher of the version it refers to gives permission. | 
|---|
| 2689 |  | 
|---|
| 2690 | @item | 
|---|
| 2691 | For any section Entitled ``Acknowledgements'' or ``Dedications'', Preserve | 
|---|
| 2692 | the Title of the section, and preserve in the section all the | 
|---|
| 2693 | substance and tone of each of the contributor acknowledgements and/or | 
|---|
| 2694 | dedications given therein. | 
|---|
| 2695 |  | 
|---|
| 2696 | @item | 
|---|
| 2697 | Preserve all the Invariant Sections of the Document, | 
|---|
| 2698 | unaltered in their text and in their titles.  Section numbers | 
|---|
| 2699 | or the equivalent are not considered part of the section titles. | 
|---|
| 2700 |  | 
|---|
| 2701 | @item | 
|---|
| 2702 | Delete any section Entitled ``Endorsements''.  Such a section | 
|---|
| 2703 | may not be included in the Modified Version. | 
|---|
| 2704 |  | 
|---|
| 2705 | @item | 
|---|
| 2706 | Do not retitle any existing section to be Entitled ``Endorsements'' or | 
|---|
| 2707 | to conflict in title with any Invariant Section. | 
|---|
| 2708 |  | 
|---|
| 2709 | @item | 
|---|
| 2710 | Preserve any Warranty Disclaimers. | 
|---|
| 2711 | @end enumerate | 
|---|
| 2712 |  | 
|---|
| 2713 | If the Modified Version includes new front-matter sections or | 
|---|
| 2714 | appendices that qualify as Secondary Sections and contain no material | 
|---|
| 2715 | copied from the Document, you may at your option designate some or all | 
|---|
| 2716 | of these sections as invariant.  To do this, add their titles to the | 
|---|
| 2717 | list of Invariant Sections in the Modified Version's license notice. | 
|---|
| 2718 | These titles must be distinct from any other section titles. | 
|---|
| 2719 |  | 
|---|
| 2720 | You may add a section Entitled ``Endorsements'', provided it contains | 
|---|
| 2721 | nothing but endorsements of your Modified Version by various | 
|---|
| 2722 | parties---for example, statements of peer review or that the text has | 
|---|
| 2723 | been approved by an organization as the authoritative definition of a | 
|---|
| 2724 | standard. | 
|---|
| 2725 |  | 
|---|
| 2726 | You may add a passage of up to five words as a Front-Cover Text, and a | 
|---|
| 2727 | passage of up to 25 words as a Back-Cover Text, to the end of the list | 
|---|
| 2728 | of Cover Texts in the Modified Version.  Only one passage of | 
|---|
| 2729 | Front-Cover Text and one of Back-Cover Text may be added by (or | 
|---|
| 2730 | through arrangements made by) any one entity.  If the Document already | 
|---|
| 2731 | includes a cover text for the same cover, previously added by you or | 
|---|
| 2732 | by arrangement made by the same entity you are acting on behalf of, | 
|---|
| 2733 | you may not add another; but you may replace the old one, on explicit | 
|---|
| 2734 | permission from the previous publisher that added the old one. | 
|---|
| 2735 |  | 
|---|
| 2736 | The author(s) and publisher(s) of the Document do not by this License | 
|---|
| 2737 | give permission to use their names for publicity for or to assert or | 
|---|
| 2738 | imply endorsement of any Modified Version. | 
|---|
| 2739 |  | 
|---|
| 2740 | @item | 
|---|
| 2741 | COMBINING DOCUMENTS | 
|---|
| 2742 |  | 
|---|
| 2743 | You may combine the Document with other documents released under this | 
|---|
| 2744 | License, under the terms defined in section 4 above for modified | 
|---|
| 2745 | versions, provided that you include in the combination all of the | 
|---|
| 2746 | Invariant Sections of all of the original documents, unmodified, and | 
|---|
| 2747 | list them all as Invariant Sections of your combined work in its | 
|---|
| 2748 | license notice, and that you preserve all their Warranty Disclaimers. | 
|---|
| 2749 |  | 
|---|
| 2750 | The combined work need only contain one copy of this License, and | 
|---|
| 2751 | multiple identical Invariant Sections may be replaced with a single | 
|---|
| 2752 | copy.  If there are multiple Invariant Sections with the same name but | 
|---|
| 2753 | different contents, make the title of each such section unique by | 
|---|
| 2754 | adding at the end of it, in parentheses, the name of the original | 
|---|
| 2755 | author or publisher of that section if known, or else a unique number. | 
|---|
| 2756 | Make the same adjustment to the section titles in the list of | 
|---|
| 2757 | Invariant Sections in the license notice of the combined work. | 
|---|
| 2758 |  | 
|---|
| 2759 | In the combination, you must combine any sections Entitled ``History'' | 
|---|
| 2760 | in the various original documents, forming one section Entitled | 
|---|
| 2761 | ``History''; likewise combine any sections Entitled ``Acknowledgements'', | 
|---|
| 2762 | and any sections Entitled ``Dedications''.  You must delete all | 
|---|
| 2763 | sections Entitled ``Endorsements.'' | 
|---|
| 2764 |  | 
|---|
| 2765 | @item | 
|---|
| 2766 | COLLECTIONS OF DOCUMENTS | 
|---|
| 2767 |  | 
|---|
| 2768 | You may make a collection consisting of the Document and other documents | 
|---|
| 2769 | released under this License, and replace the individual copies of this | 
|---|
| 2770 | License in the various documents with a single copy that is included in | 
|---|
| 2771 | the collection, provided that you follow the rules of this License for | 
|---|
| 2772 | verbatim copying of each of the documents in all other respects. | 
|---|
| 2773 |  | 
|---|
| 2774 | You may extract a single document from such a collection, and distribute | 
|---|
| 2775 | it individually under this License, provided you insert a copy of this | 
|---|
| 2776 | License into the extracted document, and follow this License in all | 
|---|
| 2777 | other respects regarding verbatim copying of that document. | 
|---|
| 2778 |  | 
|---|
| 2779 | @item | 
|---|
| 2780 | AGGREGATION WITH INDEPENDENT WORKS | 
|---|
| 2781 |  | 
|---|
| 2782 | A compilation of the Document or its derivatives with other separate | 
|---|
| 2783 | and independent documents or works, in or on a volume of a storage or | 
|---|
| 2784 | distribution medium, is called an ``aggregate'' if the copyright | 
|---|
| 2785 | resulting from the compilation is not used to limit the legal rights | 
|---|
| 2786 | of the compilation's users beyond what the individual works permit. | 
|---|
| 2787 | When the Document is included in an aggregate, this License does not | 
|---|
| 2788 | apply to the other works in the aggregate which are not themselves | 
|---|
| 2789 | derivative works of the Document. | 
|---|
| 2790 |  | 
|---|
| 2791 | If the Cover Text requirement of section 3 is applicable to these | 
|---|
| 2792 | copies of the Document, then if the Document is less than one half of | 
|---|
| 2793 | the entire aggregate, the Document's Cover Texts may be placed on | 
|---|
| 2794 | covers that bracket the Document within the aggregate, or the | 
|---|
| 2795 | electronic equivalent of covers if the Document is in electronic form. | 
|---|
| 2796 | Otherwise they must appear on printed covers that bracket the whole | 
|---|
| 2797 | aggregate. | 
|---|
| 2798 |  | 
|---|
| 2799 | @item | 
|---|
| 2800 | TRANSLATION | 
|---|
| 2801 |  | 
|---|
| 2802 | Translation is considered a kind of modification, so you may | 
|---|
| 2803 | distribute translations of the Document under the terms of section 4. | 
|---|
| 2804 | Replacing Invariant Sections with translations requires special | 
|---|
| 2805 | permission from their copyright holders, but you may include | 
|---|
| 2806 | translations of some or all Invariant Sections in addition to the | 
|---|
| 2807 | original versions of these Invariant Sections.  You may include a | 
|---|
| 2808 | translation of this License, and all the license notices in the | 
|---|
| 2809 | Document, and any Warranty Disclaimers, provided that you also include | 
|---|
| 2810 | the original English version of this License and the original versions | 
|---|
| 2811 | of those notices and disclaimers.  In case of a disagreement between | 
|---|
| 2812 | the translation and the original version of this License or a notice | 
|---|
| 2813 | or disclaimer, the original version will prevail. | 
|---|
| 2814 |  | 
|---|
| 2815 | If a section in the Document is Entitled ``Acknowledgements'', | 
|---|
| 2816 | ``Dedications'', or ``History'', the requirement (section 4) to Preserve | 
|---|
| 2817 | its Title (section 1) will typically require changing the actual | 
|---|
| 2818 | title. | 
|---|
| 2819 |  | 
|---|
| 2820 | @item | 
|---|
| 2821 | TERMINATION | 
|---|
| 2822 |  | 
|---|
| 2823 | You may not copy, modify, sublicense, or distribute the Document except | 
|---|
| 2824 | as expressly provided for under this License.  Any other attempt to | 
|---|
| 2825 | copy, modify, sublicense or distribute the Document is void, and will | 
|---|
| 2826 | automatically terminate your rights under this License.  However, | 
|---|
| 2827 | parties who have received copies, or rights, from you under this | 
|---|
| 2828 | License will not have their licenses terminated so long as such | 
|---|
| 2829 | parties remain in full compliance. | 
|---|
| 2830 |  | 
|---|
| 2831 | @item | 
|---|
| 2832 | FUTURE REVISIONS OF THIS LICENSE | 
|---|
| 2833 |  | 
|---|
| 2834 | The Free Software Foundation may publish new, revised versions | 
|---|
| 2835 | of the GNU Free Documentation License from time to time.  Such new | 
|---|
| 2836 | versions will be similar in spirit to the present version, but may | 
|---|
| 2837 | differ in detail to address new problems or concerns.  See | 
|---|
| 2838 | @uref{http://www.gnu.org/copyleft/}. | 
|---|
| 2839 |  | 
|---|
| 2840 | Each version of the License is given a distinguishing version number. | 
|---|
| 2841 | If the Document specifies that a particular numbered version of this | 
|---|
| 2842 | License ``or any later version'' applies to it, you have the option of | 
|---|
| 2843 | following the terms and conditions either of that specified version or | 
|---|
| 2844 | of any later version that has been published (not as a draft) by the | 
|---|
| 2845 | Free Software Foundation.  If the Document does not specify a version | 
|---|
| 2846 | number of this License, you may choose any version ever published (not | 
|---|
| 2847 | as a draft) by the Free Software Foundation. | 
|---|
| 2848 | @end enumerate | 
|---|
| 2849 |  | 
|---|
| 2850 | @page | 
|---|
| 2851 | @heading ADDENDUM: How to use this License for your documents | 
|---|
| 2852 |  | 
|---|
| 2853 | To use this License in a document you have written, include a copy of | 
|---|
| 2854 | the License in the document and put the following copyright and | 
|---|
| 2855 | license notices just after the title page: | 
|---|
| 2856 |  | 
|---|
| 2857 | @smallexample | 
|---|
| 2858 | @group | 
|---|
| 2859 | Copyright (C)  @var{year}  @var{your name}. | 
|---|
| 2860 | Permission is granted to copy, distribute and/or modify this document | 
|---|
| 2861 | under the terms of the GNU Free Documentation License, Version 1.2 | 
|---|
| 2862 | or any later version published by the Free Software Foundation; | 
|---|
| 2863 | with no Invariant Sections, no Front-Cover Texts, and no Back-Cover | 
|---|
| 2864 | Texts.  A copy of the license is included in the section entitled ``GNU | 
|---|
| 2865 | Free Documentation License''. | 
|---|
| 2866 | @end group | 
|---|
| 2867 | @end smallexample | 
|---|
| 2868 |  | 
|---|
| 2869 | If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts, | 
|---|
| 2870 | replace the ``with@dots{}Texts.'' line with this: | 
|---|
| 2871 |  | 
|---|
| 2872 | @smallexample | 
|---|
| 2873 | @group | 
|---|
| 2874 | with the Invariant Sections being @var{list their titles}, with | 
|---|
| 2875 | the Front-Cover Texts being @var{list}, and with the Back-Cover Texts | 
|---|
| 2876 | being @var{list}. | 
|---|
| 2877 | @end group | 
|---|
| 2878 | @end smallexample | 
|---|
| 2879 |  | 
|---|
| 2880 | If you have Invariant Sections without Cover Texts, or some other | 
|---|
| 2881 | combination of the three, merge those two alternatives to suit the | 
|---|
| 2882 | situation. | 
|---|
| 2883 |  | 
|---|
| 2884 | If your document contains nontrivial examples of program code, we | 
|---|
| 2885 | recommend releasing these examples in parallel under your choice of | 
|---|
| 2886 | free software license, such as the GNU General Public License, | 
|---|
| 2887 | to permit their use in free software. | 
|---|
| 2888 |  | 
|---|
| 2889 | @c Local Variables: | 
|---|
| 2890 | @c ispell-local-pdict: "ispell-dict" | 
|---|
| 2891 | @c End: | 
|---|
| 2892 |  | 
|---|
| 2893 |  | 
|---|
| 2894 | @c --------------------------------------------------------------------- | 
|---|
| 2895 | @c --------------------------------------------------------------------- | 
|---|
| 2896 |  | 
|---|
| 2897 | @node    Reporting bugs | 
|---|
| 2898 | @chapter Reporting bugs | 
|---|
| 2899 |  | 
|---|
| 2900 | Report bugs to @email{obrebski@@amu.edu.pl}. | 
|---|
| 2901 |  | 
|---|
| 2902 | @c --------------------------------------------------------------------- | 
|---|
| 2903 | @c --------------------------------------------------------------------- | 
|---|
| 2904 |  | 
|---|
| 2905 | @c @node    Copyright | 
|---|
| 2906 | @c @chapter Copyright | 
|---|
| 2907 | @c | 
|---|
| 2908 | @c Copyright 2004 by Tomasz Obrębski | 
|---|
| 2909 | @c This software is free for research and educational use. | 
|---|
| 2910 |  | 
|---|
| 2911 | @c --------------------------------------------------------------------- | 
|---|
| 2912 | @c --------------------------------------------------------------------- | 
|---|
| 2913 |  | 
|---|
| 2914 | @node    Author | 
|---|
| 2915 | @chapter Author | 
|---|
| 2916 |  | 
|---|
| 2917 |  | 
|---|
| 2918 | @bye | 
|---|