Changeset e28a625
- Timestamp: 10/29/08 11:17:16 (16 years ago)
- Branches: master, help
- Children: 91ed676
- Parents: 261bf62
- git-author: obrebski <obrebski@…> (10/29/08 11:17:16)
- git-committer: obrebski <obrebski@…> (10/29/08 11:17:16)
- Files: 10 edited
app/dist/files/README
--- ra4d0da5
+++ re28a625
@@ -18,4 +18,34 @@
 Installation
 **************
-Run utt_make_config.pl to create configuration files.
-Configuration files will be created in ~/.utt/
+
+1) unpack the UTT tar archive
+2) in the same directory, unpack the tar archives of all UTT dictionary modules you have
+3) run
+make install
+in the root directory of the installation
+4) add the bin directory to the PATH variable
+
+
+Requirements
+*************
+
+* File::HomeDir
+
+the Perl package File::HomeDir must be installed
+(to install the package, run 'perl -MCPAN -e shell' and write
+'install File::HomeDir' after the 'cpan>' prompt appears)
+
+* flex
+
+to run the ser component, flex must be installed in your system
+
+* ruby
+
+to run the tre component, ruby must be installed in your system
+
+* locale pl_PL.iso-8852-2
+
+the locales pl_PL.iso-8859-2 (pl_PL in short) must be installed
+and set while using UTT with the Polish module. The text you
+process with UTT must be encoded in iso-8859-2.
+
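The requirements this README hunk adds could be checked before running make install. The following is only a minimal sketch, assuming a POSIX shell; the helpers have_cmd and have_perl_module are hypothetical names used for illustration and are not part of UTT.

```shell
#!/bin/sh
# Hypothetical pre-install check for the requirements listed in the README.
# have_cmd and have_perl_module are illustrative helpers, not part of UTT.

have_cmd() {
    # Succeeds if the named command is found on PATH.
    command -v "$1" >/dev/null 2>&1
}

have_perl_module() {
    # Succeeds if perl can load the named module.
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# perl is needed for File::HomeDir, flex for ser, ruby for tre.
for cmd in perl flex ruby; do
    have_cmd "$cmd" || echo "missing command: $cmd"
done

have_perl_module File::HomeDir || echo "missing Perl module: File::HomeDir"

# The README asks for the pl_PL.iso-8859-2 locale with the Polish module.
# The spelling printed by 'locale -a' varies between systems
# (for example pl_PL.iso88592), so the match is deliberately loose.
locale -a 2>/dev/null | grep -i '^pl_PL' | grep -qi '8859-\{0,1\}2' \
    || echo "locale pl_PL.iso-8859-2 not found"
```

The script only reports what it cannot find and installs nothing. A system that exposes the locale under the bare name pl_PL would be reported as missing by this loose match, so the last check is a hint rather than a verdict.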
app/doc/utt.texinfo
--- r261bf62
+++ re28a625
@@ -367,5 +367,5 @@
 @section Flattened UTT file
 
-A UTT file format has two variants: regular and flatten d. The regular
+A UTT file format has two variants: regular and flattened. The regular
 format was described above. In the flattened format some of the
 end-of-line characters are replaced with line-feed characters.
@@ -1608,9 +1608,9 @@
 
 @example
-cat corpus | tok | sen | lem | grp -a p| lzop -7 > corpus.grp.lzo
-@end example
-
-@example
-lzop -cd corpus.grp.lzo | grp - a gP -e @var{EXPR}| ser -e @var{EXPR}
+cat corpus | tok | sen | lem -1 | fla | lzop -7 > corpus.grp.lzo
+@end example
+
+@example
+lzop -cd corpus.grp.lzo | grp -e @var{EXPR} | unfla | ser -e @var{EXPR}
 @end example
 
@@ -1627,8 +1627,12 @@
 @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
 @item @strong{Authors:} @tab Marcin Walas, Tomasz Obrębski
-@item @strong{Component category:} @tab filter
+@item @strong{Input format:} @tab UTT flattened
+@item @strong{Output format:} @tab UTT flattened
+@item @strong{Required annotation:} @tab tok, sen, lem -1
 @end multitable
 
 [TODO]
+
+(see mar's help 'mar -h' for some information)
 
 @c ---------------------------------------------------------------------
@@ -1871,4 +1875,8 @@
 
 
+@c -------------------------------------------------------------------------------
+@c FLA
+@c -------------------------------------------------------------------------------
+
 @page
 @node fla
@@ -1877,7 +1885,19 @@
 @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
 @item @strong{Authors:} @tab Tomasz Obrębski
-@item @strong{Component category:} @tab filter
+@item @strong{Input format:} @tab UTT regular
+@item @strong{Output format:} @tab UTT flattened
+@item @strong{Required annotation:} @tab sen
 @end multitable
 @c
+
+@menu
+* fla description::
+@c * fla command line options::
+@c * fla usage example::
+@end menu
+
+
+@node fla description
+@subsection Description
 
 @command{fla} ``flattens'' a utt file by merging segments belonging
@@ -1902,11 +1922,8 @@
 segment contains a fragment matching the @code{<bosregex>}). By
 default, segments containing a field @code{BOS} are seeked.
-@c @menu
-@c * con command line options::
-@c * con usage example::
-@c * con hints::
-@c @end menu
-
-
+
+@c -------------------------------------------------------------------------------
+@c UNFLA
+@c -------------------------------------------------------------------------------
 
 @page
@@ -1916,7 +1933,17 @@
 @multitable {aaaaaaaaaaaaaaaaaaaaaaaaa} {aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
 @item @strong{Authors:} @tab Tomasz Obrębski
-@item @strong{Component category:} @tab filter
+@item @strong{Input format:} @tab UTT flattened
+@item @strong{Output format:} @tab UTT regular
+@item @strong{Required annotation:} @tab -
 @end multitable
 
+@menu
+* unfla description::
+@c * fla command line options::
+@c * fla usage example::
+@end menu
+
+@node unfla description
+@subsection Description
 @command{unfla} transforms a flattened UTT file, produced by
 @command{fla}, into the regular format by restoring end-of-line
@@ -1971,5 +1998,5 @@
 
 @example
-cat text | tok | lem --only-fail | cor -1 > output3
+cat text | tok | egrep ' W ' | lem | egrep -v 'lem:' | cor -1
 @end example
 
@@ -2020,13 +2047,14 @@
 As @command{grp} (@command{grep}) processes data faster then it is
 read from the disk drive, the search time may be still shortened by
-using file compression techniques. We suggest usin @command{lzop}.
+using file compression techniques. We suggest using the
+@command{lzop} compressor/decompressor.
 
 @item the fastest way to search a large corpus
 
-step 1: preprocessing
+step 1: corpus preprocessing
 
 @example
 cat corpus | tok | sen | lem -1 \
-| grp -a p| lzop -7 > corpus.grp.lzo
+| fla | lzop -7 > corpus.grp.lzo
 @end example
 
@@ -2034,5 +2062,5 @@
 
 @example
-lzop -cd corpus.grp.lzo | grp -a gP-e 'cat(<V>) space
+lzop -cd corpus.grp.lzo | unfla | grp -e 'cat(<V>) space
 lexeme(rozmowa)' | ser -e 'cat(<V>) space lexeme(rozmowa)' | con
 @end example
@@ -2040,21 +2068,21 @@
 @end enumerate
 
-@subsubheading More complicated configurations
-
-
-@example
-mknod fifo1 p
-mknod fifo2 p
-mknod fifo3 p
-mknod fifo4 p
-mknod fifo5 p
-
-tok | lem -p W -e fifo1 > fifo2 &
-cor -e fifo3 < fifo1 | lem > fifo4 &
-gue < fifo3 > fifo5 &
-sort -m fifo2 fifo4 fifo5
-
-rm fifo?
-@end example
+@c @subsubheading More complicated configurations
+
+
+@c @example
+@c mknod fifo1 p
+@c mknod fifo2 p
+@c mknod fifo3 p
+@c mknod fifo4 p
+@c mknod fifo5 p
+
+@c tok | lem -p W -e fifo1 > fifo2 &
+@c cor -e fifo3 < fifo1 | lem > fifo4 &
+@c gue < fifo3 > fifo5 &
+@c sort -m fifo2 fifo4 fifo5
+
+@c rm fifo?
+@c @end example
 
 
app/src/common/cmdline_common.ggo
--- r25ae32e
+++ re28a625
@@ -2,9 +2,9 @@
 
 
-option "input" f "Input file" string no hidden
+option "input" f "Input file" string no
 
-option "output" o "Output file " string no hidden
+option "output" o "Output file for succesfully processed segments" string no
 
-option "fail" e "Output file for unsuccesfully processed segments " string no hidden
+option "fail" e "Output file for unsuccesfully processed segments " string no
 
 option "only-fail" - "Print only segments the program failed to process" flag off hidden
@@ -12,5 +12,5 @@
 option "no-fail" - "Print only segments the program processed" flag off hidden
 
-option "copy" c "Copy succesfully processed segments to standard output" flag off hidden
+option "copy" c "Copy succesfully processed segments to standard output" flag off
 
 option "process" p "Process segments with this tag" string no multiple
app/src/cor/Makefile
--- r13a8a67
+++ re28a625
@@ -1,4 +1,3 @@
-PAR=-Wno-deprecated -m32 -fpermissive
-# -static
+PAR=-Wno-deprecated -m32 -fpermissive -static
 PAR2=-c -Wno-deprecated -m32 -fpermissive
 LIB_PATH=../lib
app/src/cor/cmdline_cor.ggo
--- r25ae32e
+++ re28a625
@@ -5,4 +5,4 @@
 option "dictionary" d "Dictionary" string typestr="FILENAME" default="cor.bin" no
 option "distance" n "Maximal edit distance." int default="1" no
-option "replace" r "Replace original form with corrected form, place original form in the cor field. This option has no effect in single mode" flag off
+option "replace" r "Replace original form with corrected form, place original form in the cor field. This option has no effect in single mode" flag off hidden
 #option "single" - "Place all alternatives in the same line" flag off
app/src/gue/Makefile
--- r8d3e6ab
+++ re28a625
@@ -1,4 +1,3 @@
-PAR=-Wno-deprecated -O3 -fpermissive -m32
-#-static
+PAR=-Wno-deprecated -O3 -fpermissive -m32 -static
 PAR2=-c -Wno-deprecated -O3 -fpermissive -m32
 LIB_PATH=../lib
app/src/kor/Makefile
--- r13a8a67
+++ re28a625
@@ -1,4 +1,3 @@
-PAR=-Wno-deprecated -m32 -fpermissive
-# -static
+PAR=-Wno-deprecated -m32 -fpermissive -static
 PAR2=-c -Wno-deprecated -m32 -fpermissive
 LIB_PATH=../lib
app/src/lem/Makefile
--- r13a8a67
+++ re28a625
@@ -1,5 +1,4 @@
-PAR=-Wno-deprecated -m32 -O3 -fpermissive
-#-static
-PAR2=-c -Wno-deprecated -m32 -O3 -fpermissive
+PAR=-Wno-deprecated -m32 -O3 -fpermissive -static
+PAR2=-c -Wno-deprecated -m32 -O3 -fpermissive -static
 LIB_PATH=../lib
 COMMON_PATH=../common
lang/Makefile
--- ref85bd7
+++ re28a625
@@ -11,4 +11,5 @@
 export UTT_DIC_OUTPUT=${CUR_DIR}
 
+export LANG_MODULES=pl_PL.ISO-8852-2 pl_PL.UTF-8
 
 # path to dictionary compiler
@@ -32,2 +33,8 @@
 cd dist && make tarball; cd ${CUR_DIR};
 
+
+.PHONY: dist_tarball_pl_PL.ISO-8859-2
+dist_tarball:
+export DIC_LANG=pl_PL.ISO-8859-2 && \
+cd dist && make tarball; cd ${CUR_DIR};
+
lang/dist/tarball/Makefile
--- r9b57c4d
+++ re28a625
@@ -13,5 +13,6 @@
 _TARBALL_ROOT=$(DIR)/utt-$(_UTT_VER).$(_UTT_REL)
 _UTT_DIC_HOME=share/utt
-_TAR_FILE_NAME=utt.dic.$(_UTT_VER)_$(_UTT_REL)
+_TAR_FILE_NAME=utt.$(_UTT_VER)_$(_UTT_REL)
 
+
 #defualt task
@@ -21,5 +22,5 @@
 @echo Output directory for tarball: ${UTT_DIC_OUTPUT}
 mkdir -p ${_TARBALL_ROOT}/${_UTT_DIC_HOME}
-if test -n "${DIC_LANG}" -a -d ${UTT_DIC_BIN}/${DIC_LANG}; \
+if [[ -n "${DIC_LANG}" && -d ${UTT_DIC_BIN}/${DIC_LANG} ]]; \
 then \
 echo "Tworze dystrybucje ${DIC_LANG}"; \
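The last hunk swaps `test ... -a ...` for `[[ ... && ... ]]`. The `[[ ]]` form is a bash/ksh/zsh extension, and GNU make runs recipe lines with /bin/sh unless SHELL is set in the makefile. Whether this project sets SHELL elsewhere is not visible in the changeset. As a hedged sketch, the same condition can be written with two POSIX `[ ]` tests; dic_dir_ok is a hypothetical helper name, and its arguments only mirror DIC_LANG and UTT_DIC_BIN from the Makefile.

```shell
#!/bin/sh
# Sketch of the same check written with POSIX test syntax.
# dic_dir_ok is a hypothetical helper; its arguments mirror
# DIC_LANG and UTT_DIC_BIN from the Makefile.

dic_dir_ok() {
    dic_lang=$1
    dic_bin=$2
    # True only for a non-empty language name with an existing directory.
    [ -n "$dic_lang" ] && [ -d "$dic_bin/$dic_lang" ]
}

# Demonstration against a throwaway directory tree.
tmp=$(mktemp -d)
mkdir "$tmp/pl_PL.ISO-8859-2"

dic_dir_ok "pl_PL.ISO-8859-2" "$tmp" && echo "found"
dic_dir_ok "" "$tmp" || echo "empty language rejected"
dic_dir_ok "pl_PL.UTF-8" "$tmp" || echo "missing directory rejected"

rm -rf "$tmp"
```

Chaining two `[ ]` tests with `&&` also avoids the `-a` operator of `test`, which POSIX marks as obsolescent.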