cnreader

command module
v0.0.80
Published: Nov 19, 2023 License: Apache-2.0 Imports: 29 Imported by: 0

README

Chinese Text Reader

A command line app for Chinese text corpus indexing, translation memory, and HTML reader page generation.

Many HTML pages at chinesenotes.com and its family of sites are generated with Go templates from the lexical database and text corpus. This README gives instructions for use. It assumes that you have already cloned the project from GitHub.

The tool can create markup that can be used with JavaScript to bring up a dialog like the one shown below for each term in the corpus:

screenshot of vocabulary dialog

The app also compiles unigram and bigram indexes for full text searching of the corpus.

In addition, a translation memory is compiled from the dictionary, named entity database, and phrase memory.
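As a minimal illustration of what bigram indexing involves (a sketch only, not the app's actual implementation), overlapping two-character sequences can be extracted from a Chinese string by working on runes:

```go
package main

import "fmt"

// bigrams returns the overlapping two-character sequences in a string,
// operating on runes so that multi-byte Chinese characters are handled
// correctly. Each bigram becomes a key in a full text search index.
func bigrams(text string) []string {
	r := []rune(text)
	grams := make([]string, 0, len(r))
	for i := 0; i+1 < len(r); i++ {
		grams = append(grams, string(r[i:i+2]))
	}
	return grams
}

func main() {
	fmt.Println(bigrams("天上來")) // [天上 上來]
}
```

Indexing every overlapping bigram makes multi-character substring queries fast at the cost of a larger index.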

Setup

Install the Go SDK - Install Documentation

Make sure that the go executable is on your path. You may need to do something like:

$ export PATH=$PATH:/usr/local/go/bin

Quickstart

Download the app and the dictionary

go get github.com/alexamies/cnreader
go run github.com/alexamies/cnreader -download_dict

Set an environment variable to let the app know where its home is

export CNREADER_HOME=.

Supply Chinese text on the command line. Observe tokenization and matching to English equivalents

go run github.com/alexamies/cnreader -source_text="君不見黃河之水天上來"

Mark up a plain text file with HTML

go run github.com/alexamies/cnreader -source_file=testdata/sampletest.txt

The output will be written to output.html.

Basic Use

The software is contained in this project; related dictionary and corpus files are in the chinesenotes.com project.

Build the project

Clone and build the project

git clone https://github.com/alexamies/cnreader.git
cd cnreader
go build
Get sample dictionary and corpus files

Get the linguistic assets and set the CNREADER_HOME environment variable to their directory:

cd ..
git clone https://github.com/alexamies/chinesenotes.com.git
export CNREADER_HOME=$PWD/chinesenotes.com
cd cnreader
Generate word definition files

Generate an HTML page for the definition of each word and its usage in the corpus:

go run github.com/alexamies/cnreader -hwfiles
Analyze the corpus

Analyze the corpus, computing word frequencies and writing out docs to HTML:

cd $CNREADER_HOME/go/src/cnreader
./cnreader
Markup HTML pages containing Chinese text

To mark English equivalents for all Chinese words in all files listed in data/corpus/html-conversions.csv:

./cnreader -html
Mark up a list of files

To mark English equivalents for all Chinese words in the corpus file modern_articles.csv:

./cnreader -collection modern_articles.csv

Analyzing your own corpus

The cnreader program looks at the file $CNREADER_HOME/data/corpus/collections.csv and analyzes the lists of texts under there. To analyze your own corpus, create a new directory tree with your own collections.csv file and set the environment variable CNREADER_HOME to the top of that directory.
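For example, a minimal corpus tree might be set up as follows. The mycorpus name is only a placeholder, and the contents of collections.csv must follow the sample format in the chinesenotes.com project (not reproduced here):

```shell
# Create the directory layout cnreader expects under CNREADER_HOME.
mkdir -p mycorpus/data/corpus
# Create collections.csv, filling it in per the sample in chinesenotes.com.
touch mycorpus/data/corpus/collections.csv
# Point cnreader at the new corpus tree.
export CNREADER_HOME=$PWD/mycorpus
```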

Go API

The API documentation is given at

https://pkg.go.dev/github.com/alexamies/cnreader

Testing

Run unit tests with the command

go test ./... -cover

Run an integration test with the command

go test -integration ./...

Dataflow - Not Finished

The goal of the Dataflow job is to analyze the corpus to create two index files, one for term frequencies and the other for bigram frequencies. The file for term frequencies has tab-separated value entries like this (word_freq_doc.txt):

Term    Frequency  Collection     File                    IDF     Doc len
秘      3          jinshu.html    jinshu/jinshu058.html   0.8642  6520
堅冰    1          weishu.html    weishu/weishu130.html   2.1332  13168
...
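Rows of this file can be read with ordinary TSV parsing. The following Go sketch mirrors the six columns shown above; the TermFreq type and its field names are illustrative assumptions, not the app's own types:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// TermFreq mirrors one row of word_freq_doc.txt as shown above.
type TermFreq struct {
	Term       string
	Freq       int
	Collection string
	File       string
	IDF        float64
	DocLen     int
}

// parseLine splits one tab-separated row into a TermFreq.
func parseLine(line string) (TermFreq, error) {
	f := strings.Split(line, "\t")
	if len(f) != 6 {
		return TermFreq{}, fmt.Errorf("expected 6 fields, got %d", len(f))
	}
	freq, err := strconv.Atoi(f[1])
	if err != nil {
		return TermFreq{}, err
	}
	idf, err := strconv.ParseFloat(f[4], 64)
	if err != nil {
		return TermFreq{}, err
	}
	docLen, err := strconv.Atoi(f[5])
	if err != nil {
		return TermFreq{}, err
	}
	return TermFreq{f[0], freq, f[2], f[3], idf, docLen}, nil
}

func main() {
	tf, err := parseLine("秘\t3\tjinshu.html\tjinshu/jinshu058.html\t0.8642\t6520")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", tf)
}
```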

The file for bigram frequencies has entries like this (bigram_doc_freq.txt):

Bigram    Frequency  Collection       File                            IDF     Doc len
此初      1          shiji.html       shiji/shiji123.html              2.5626  4508
偶示      1          jiuwudaishi.html jiuwudaishi/jiuwudaishi030.html  3.7667  3850
...

where IDF is the inverse document frequency for a term w calculated as

idf(w) = log[(M + 1) / df(w)]

where M is the number of documents in the corpus and df(w) is the number of documents in which the term occurs. The IDF is used in computing the TF-IDF score, which is done at the time a user queries the corpus.

The corpus files are text files indexed by collection files with the example format given by testdata/testcollection.tsv.

To get setup with Dataflow, follow instructions at Create a Dataflow pipeline using Go

Create a GCP service account

export GOOGLE_APPLICATION_CREDENTIALS=${PWD}/dataflow-service-account.json

Create a GCS bucket to read text from

TEXT_BUCKET=[your GCS bucket]

Copy the files testdata/sampletest.txt and testdata/sampletest2.txt to the bucket:

gsutil cp testdata/sampletest*.txt gs://${TEXT_BUCKET}/testdata/

Set the location to read config, dictionary, and corpus files from:

export CNREADER_HOME=${PWD}

The GCP project:

PROJECT_ID=[your project id]

Run the pipeline locally

cd tfidf
CORPUS=cnreader
GEN=0
go run tfidf.go \
  --input gs://${TEXT_BUCKET} \
  --cnreader_home ${CNREADER_HOME} \
  --corpus_fn testdata/testcorpus.tsv \
  --corpus $CORPUS \
  --generation $GEN \
  --project $PROJECT_ID

Run the pipeline on Dataflow

DATAFLOW_REGION=us-central1
go run tfidf.go \
  --input gs://${TEXT_BUCKET} \
  --cnreader_home ${CNREADER_HOME} \
  --corpus_fn testdata/testcorpus.tsv \
  --corpus $CORPUS \
  --generation $GEN \
  --runner dataflow \
  --project $PROJECT_ID \
  --region $DATAFLOW_REGION \
  --staging_location gs://${TEXT_BUCKET}/binaries/

After saving the indexes, Firestore will need to generate its own indexes. The links for this can be found by running the following validation test:

cd ..
COLLECTION=testcollection.html
./cnreader --test_index_terms "而,不" \
  --project $PROJECT_ID \
  --collection ${COLLECTION}

Title index in Firestore

To update the document title index in Firestore

./cnreader --project $PROJECT_ID --titleindex 

Also, generate a file for the document index, needed for the web app:

./cnreader --titleindex 

Search the title index

./cnreader --project $PROJECT_ID --titlesearch "測試"

Run a full text search:

export TEXT_BUCKET=chinesenotes_tfidf
./cnreader --project $PROJECT_ID --find_docs "不見古人" --outfile results.csv

Indexing of Idioms

To index idioms use the command

./cnreader --project $PROJECT_ID --dict_index Idiom

Use the substring index for searching

./cnreader --project $PROJECT_ID --find_dict_substring 明 --find_dict_domain Idiom

Translation Memory Index

To index the translation memory use the command

./cnreader --project $PROJECT_ID --tmindex

Alternatively, use a Cloud Run Job. Push a Docker image to Google Cloud Artifact Registry using Cloud Build:

BUILD_ID=[your build id, eg 1234]
gcloud builds submit --config cloudbuild.yaml . \
  --substitutions=_IMAGE_TAG="$BUILD_ID"

Search the translation memory index

./cnreader --project $PROJECT_ID --tmsearch 柳暗名明

Documentation

Overview

Command line utility to analyze Chinese text, including corpus analysis, compilation of a full text search index, and HTML markup of files in reader style.

This utility is used to generate web pages for https://chinesenotes.com, https://ntireader.org, and https://hbreader.org

Quickstart:

Supply Chinese text on the command line. Observe tokenization and matching to English equivalents

go get github.com/alexamies/cnreader
go run github.com/alexamies/cnreader -download_dict
go run github.com/alexamies/cnreader -source_text="君不見黃河之水天上來"

Flags:

 -download_dict   Download the dictionary files from GitHub and save locally.
 -source_text     Analyze vocabulary for source input on the command line.
 -source_file     Analyze vocabulary for a source file and write to output.html.
 -collection      Enhance HTML markup and do vocabulary analysis for all the
                  files listed in the given collection.
 -html            Enhance HTML markup for all files listed in
                  data/corpus/html-conversion.csv.
 -hwfiles         Compute and write HTML entries for each headword, writing
                  the files to the web/words directory.
 -librarymeta     Write collection entries for the digital library.
 -tmindex         Compute and write the translation memory index.
 -titleindex      Build a flat index of document titles.

Follow instructions in the README.md file for setup.

Directories

Package for vocabulary analysis of a monolingual Chinese text corpus
Package for scanning the corpus collections
Package for generating HTML files. This includes HTML templates embedded in source for zero-config usage.
Library for document retrieval
The library package allows for analysis of multiple corpora together.
Package for ngram analysis
tfidf module
Package for translation memory index
