- Timestamp: 10/29/08 11:17:16 (16 years ago)
- Branches: master, help
- Children: 91ed676
- Parents: 261bf62
- git-author: obrebski <obrebski@…> (10/29/08 11:17:16)
- git-committer: obrebski <obrebski@…> (10/29/08 11:17:16)
- File: 1 edited
--- app/doc/utt.texinfo (r261bf62)
+++ app/doc/utt.texinfo (re28a625)
@@ -367,5 +367,5 @@
 @section Flattened UTT file
 
-A UTT file format has two variants: regular and flatten d. The regular
+A UTT file format has two variants: regular and flattened. The regular
 format was described above. In the flattened format some of the
 end-of-line characters are replaced with line-feed characters.
@@ -1608,9 +1608,9 @@
 
 @example
-cat corpus | tok | sen | lem | grp -a p | lzop -7 > corpus.grp.lzo
-@end example
-
-@example
-lzop -cd corpus.grp.lzo | grp -a gP -e @var{EXPR} | ser -e @var{EXPR}
+cat corpus | tok | sen | lem -1 | fla | lzop -7 > corpus.grp.lzo
+@end example
+
+@example
+lzop -cd corpus.grp.lzo | grp -e @var{EXPR} | unfla | ser -e @var{EXPR}
 @end example
 
@@ -1627,8 +1627,12 @@
 @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
 @item @strong{Authors:} @tab Marcin Walas, Tomasz Obrębski
-@item @strong{Component category:} @tab filter
+@item @strong{Input format:} @tab UTT flattened
+@item @strong{Output format:} @tab UTT flattened
+@item @strong{Required annotation:} @tab tok, sen, lem -1
 @end multitable
 
 [TODO]
+
+(see mar's help 'mar -h' for some information)
 
 @c ---------------------------------------------------------------------
@@ -1871,4 +1875,8 @@
 
 
+@c -------------------------------------------------------------------------------
+@c FLA
+@c -------------------------------------------------------------------------------
+
 @page
 @node fla
@@ -1877,7 +1885,19 @@
 @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
 @item @strong{Authors:} @tab Tomasz Obrębski
-@item @strong{Component category:} @tab filter
+@item @strong{Input format:} @tab UTT regular
+@item @strong{Output format:} @tab UTT flattened
+@item @strong{Required annotation:} @tab sen
 @end multitable
 @c
+
+@menu
+* fla description::
+@c * fla command line options::
+@c * fla usage example::
+@end menu
+
+
+@node fla description
+@subsection Description
 
 @command{fla} ``flattens'' a utt file by merging segments belonging
@@ -1902,11 +1922,8 @@
 segment contains a fragment matching the @code{<bosregex>}). By
 default, segments containing a field @code{BOS} are seeked.
-@c @menu
-@c * con command line options::
-@c * con usage example::
-@c * con hints::
-@c @end menu
-
-
+
+@c -------------------------------------------------------------------------------
+@c UNFLA
+@c -------------------------------------------------------------------------------
 
 @page
@@ -1916,7 +1933,17 @@
 @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
 @item @strong{Authors:} @tab Tomasz Obrębski
-@item @strong{Component category:} @tab filter
+@item @strong{Input format:} @tab UTT flattened
+@item @strong{Output format:} @tab UTT regular
+@item @strong{Required annotation:} @tab -
 @end multitable
 
+@menu
+* unfla description::
+@c * fla command line options::
+@c * fla usage example::
+@end menu
+
+@node unfla description
+@subsection Description
 @command{unfla} transforms a flattened UTT file, produced by
 @command{fla}, into the regular format by restoring end-of-line
@@ -1971,5 +1998,5 @@
 
 @example
-cat text | tok | lem --only-fail | cor -1 > output3
+cat text | tok | egrep ' W ' | lem | egrep -v 'lem:' | cor -1
 @end example
 
@@ -2020,13 +2047,14 @@
 As @command{grp} (@command{grep}) processes data faster then it is
 read from the disk drive, the search time may be still shortened by
-using file compression techniques. We suggest usin @command{lzop}.
+using file compression techniques. We suggest using the
+@command{lzop} compressor/decompressor.
 
 @item the fastest way to search a large corpus
 
-step 1: preprocessing
+step 1: corpus preprocessing
 
 @example
 cat corpus | tok | sen | lem -1 \
-| grp -a p | lzop -7 > corpus.grp.lzo
+| fla | lzop -7 > corpus.grp.lzo
 @end example
 
@@ -2034,5 +2062,5 @@
 
 @example
-lzop -cd corpus.grp.lzo | grp -a gP -e 'cat(<V>) space
+lzop -cd corpus.grp.lzo | unfla | grp -e 'cat(<V>) space
 lexeme(rozmowa)' | ser -e 'cat(<V>) space lexeme(rozmowa)' | con
 @end example
@@ -2040,21 +2068,21 @@
 @end enumerate
 
-@subsubheading More complicated configurations
-
-
-@example
-mknod fifo1 p
-mknod fifo2 p
-mknod fifo3 p
-mknod fifo4 p
-mknod fifo5 p
-
-tok | lem -p W -e fifo1 > fifo2 &
-cor -e fifo3 < fifo1 | lem > fifo4 &
-gue < fifo3 > fifo5 &
-sort -m fifo2 fifo4 fifo5
-
-rm fifo?
-@end example
+@c @subsubheading More complicated configurations
+
+
+@c @example
+@c mknod fifo1 p
+@c mknod fifo2 p
+@c mknod fifo3 p
+@c mknod fifo4 p
+@c mknod fifo5 p
+
+@c tok | lem -p W -e fifo1 > fifo2 &
+@c cor -e fifo3 < fifo1 | lem > fifo4 &
+@c gue < fifo3 > fifo5 &
+@c sort -m fifo2 fifo4 fifo5
+
+@c rm fifo?
+@c @end example
 
 
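The pipelines touched by this changeset put a flattening step before line-oriented filtering and an unflattening step after it, so that a whole sentence can be matched as a single line. The sketch below illustrates only that general idea with standard tools. It is not the UTT implementation: `toy_fla` and `toy_unfla` are hypothetical stand-ins for `fla` and `unfla`, the one-segment-per-line input is invented, and the choice of a vertical tab (octal 013) as the in-sentence separator is an assumption.

```shell
# Toy stand-ins for fla/unfla. NOT the UTT tools; format and separator are assumptions.
# A sentence is assumed to start at any line containing the field BOS.

toy_fla() {
  # Join all lines of one sentence into a single output line,
  # using a vertical tab (octal 013) between the original lines.
  awk -v sep='\013' '
    /BOS/ && NR > 1 { printf "\n" }    # a new sentence closes the previous output line
    { printf "%s%s", ((NR == 1 || /BOS/) ? "" : sep), $0 }
    END { if (NR > 0) printf "\n" }    # terminate the last sentence
  '
}

toy_unfla() {
  # Restore the original line breaks.
  tr '\013' '\n'
}

# Usage: select whole sentences containing "dogs", then restore the regular layout.
printf 'BOS\nW dogs\nW bark\nBOS\nW cats\nW sleep\n' | toy_fla | grep 'dogs' | toy_unfla
```

Because each flattened line holds a complete sentence, an ordinary `grep` in the middle of the pipeline keeps or drops sentences as units, which is the property the revised examples appear to rely on.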