This repo contains all files required to deploy korp in our infrastructure. There are three parts to this mayhem:
- The configuration files expected by both korp-frontend and korp-backend, which is most of the content in this repository.
- The files required to generate our docker images for both the frontend and the backend (in `docker/`).
  - The `docker/docker-compose.yml` file includes instructions for the deployment of our images.
  - The `docker/deploy` script contains all commands required to perform a "korp upgrade" in the repo. These upgrades are for local altlab changes only. The versions of korp-frontend, korp-backend, and cwb are fixed, and updates to those are likely to require further changes in the code, so developers are expected to manually change versions when needed by modifying the docker files.
  - The image folders, `backend/` and `frontend/`. Each contains:
    - A `Dockerfile` with the instructions to make the images.
    - A set of files that have been customized and that need to replace other files in their respective korp codebases. Eventually we could instead have forks of the source repos, but the effort in rebasing and identifying repo divergences seems equivalent.
- The `corpora` configuration files and the scripts that can properly import `vrt` files into korp.
If you follow our instructions to upload a new corpus, later updates to the .vrt file require only the following (say, for example, for the bloomfield corpus):
- Copy the `bloomfield.vrt` file to the appropriate location. Check the folder mappings in `docker/docker-compose.yml` for up-to-date info, but it is likely to be just `altlab-itw:/data_local/application-data/korp-backend/vrt_files`. Ensure that the `korp` user has read access to this file. If you have `sudo` powers in `altlab-itw`, you can move the file from your local computer to the appropriate location:

      local$ scp bloomfield.vrt altlab.dev:
      local$ ssh altlab.dev
      altlab-gw$ scp bloomfield.vrt altlab-itw:
      altlab-gw$ ssh altlab-itw
      altlab-itw$ sudo -u korp cp -v bloomfield.vrt /data_local/application-data/korp-backend/vrt_files/

- Run the `update_corpus` script with the corpus name without the `vrt` extension. If you just ran the previous step:

      you@altlab-itw$ sudo -i -u korp
      korp@altlab-itw$ cd korp-config/docker/
      korp@altlab-itw$ docker-compose exec korp-backend bash /app/update_corpus.sh bloomfield

  Respond `y` and press enter when asked to confirm that you want to update this corpus.

There are some steps involved in the generation of a new corpus.
You will likely want to read and understand the CWB Corpus Encoding and Management Manual. But there are some details that are missing in this documentation:
- The backend is very sensitive to the use of characters that it may consider as escape characters. In particular, this means that you want to avoid the usage of spaces or slashes (`/`) in any p-attributes. We ask that you follow these conventions in `vrt` files:
  - Replace all occurrences of a space in a field (e.g. glosses) with the HTML entity string for a space. The frontend still shows that entity as a space, and extended search inserts these characters automatically when there is a space in a search item.
  - Replace most occurrences of a slash in a field (e.g. glosses) with the HTML entity string for `/`. Do not replace slashes in the word p-attribute (usually the first one). The frontend does not escape the HTML entity inside the text of the sentence.
  - Replace all other escape characters used by CWB (`<`, `>`, and `|`) with their HTML entity strings.

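The conventions above can be checked mechanically before a file is loaded. The following is a minimal sketch, not part of this repo: the function name `find_unescaped` is hypothetical, and it assumes the usual CWB vertical format (one token per line, p-attributes separated by tabs, XML lines starting with `<`). It only flags literal spaces and slashes and does not perform any replacement.

```shell
# Hypothetical helper (not part of this repo): list token lines in a vrt file
# that still contain characters the conventions above ask you to escape.
# Assumption: one token per line, tab-separated p-attributes, and
# structural (XML) lines starting with "<".
find_unescaped() {
  awk -F'\t' '
    /^</ { next }                      # skip XML tag lines (s-attributes)
    {
      for (i = 1; i <= NF; i++) {
        # a literal space is flagged in every field; a literal slash is
        # flagged everywhere except the first field (the word p-attribute)
        if (index($i, " ") > 0 || (i > 1 && index($i, "/") > 0)) {
          printf "%d: %s\n", NR, $0
          found = 1
          break
        }
      }
    }
    END { exit (found ? 1 : 0) }       # non-zero exit status if anything was flagged
  ' "$1"
}
```

For example, `find_unescaped bloomfield.vrt` prints each offending line prefixed with its line number and returns a non-zero status if any were found, so it can be used as a guard before running the import scripts.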
Once you have a .vrt file, you can continue the process.
- Create a `corpora/corpus_name.yaml` file (replace `corpus_name`). Unless you have new kinds of fields and a very different `vrt` file structure from the ones already used in altlab, you can:
  - Copy one of the existing `yaml` files, for example, `cp corpora/wolfart_ahenakew.yaml corpora/corpus_name.yaml`.
  - Change the `id`, `title`, and `description` fields.
  - Select a `folder` for the corpus in the `default` mode. Make sure the folder exists in the `modes/default.yaml` file.
  - If you want search to immediately work on this corpus, add the corpus to the list in `modes/default.yaml`. (One could also create different modes.)
- Make sure that the `backend/import_vrt.sh` file allows CWB to understand your specific `vrt` format:
  - If you are just following the `wolfart_ahenakew.vrt` format, there's nothing you need to do. Currently, the format is:

        -P word -P lemma -P analysis -P deps -P gloss -S sentence:0+id -S paragraph -S text:2+id+lang+title+author -S corpus:0+id -U ""

    This format means that there are 5 p-attributes (`word`, `lemma`, `analysis`, `deps`, `gloss`) and 4 XML-based s-attributes (`sentence`, `paragraph`, `text`, `corpus`). The `text` tag can be used recursively up to a nesting of 2; the others cannot. `sentence` and `corpus` XML tags can have an `id` attribute, while `text` tags can have `id`, `lang`, `title`, and `author` attributes.
  - If your format is different, add a special case for the script to handle it, following the example of these lines:

        if [ "$NAME" = "bloomfield" ]; then
          VRT_FORMAT_STRUCTURE="-P word -P lemma -P analysis -P deps -P gloss -S sentence:0+id -S paragraph -S text:2+id+title+author -S corpus:0+id+lang -U \"\""
        fi

  - Make sure that the attributes in `corpora/corpus_name.yaml` match the structure of the `vrt` file. The `yaml` file assumes mappings of the form `cwb_field_name: attribute_presentation_yaml`. The keys correspond to the name of the field in the CWB registry file for the corpus, and the values correspond to the file name of an attribute description in the `attributes/` folder of this repository, or are otherwise completely inlined.
- Commit your changes to this repo and push.
- Deploy your repo changes:

      you@local $ ssh altlab.dev
      you@altlab-gw $ ssh altlab-itw
      you@altlab-itw $ sudo -i -u korp
      korp@altlab-itw $ cd korp-config/docker/
      korp@altlab-itw $ ./deploy

- Load the new corpus `.vrt` into CWB and the korp backend using the appropriate script:

      korp@altlab-itw $ docker-compose exec korp-backend bash /app/first_load_corpus.sh corpus_name
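
As a sanity check before the load step, it may help to confirm that every token line in the `vrt` file has as many columns as the format string declares p-attributes (5 in the default format). The sketch below is not part of this repo: the function name `check_columns` is hypothetical, and it assumes tab-separated columns with XML lines starting with `<`.

```shell
# Hypothetical helper (not part of this repo): report token lines whose number
# of tab-separated columns differs from the expected number of p-attributes.
# Usage: check_columns FILE [EXPECTED]   (EXPECTED defaults to 5:
# word, lemma, analysis, deps, gloss)
check_columns() {
  awk -F'\t' -v expected="${2:-5}" '
    /^</ || NF == 0 { next }           # skip XML tag lines and empty lines
    NF != expected {
      printf "%d: expected %d columns, found %d\n", NR, expected, NF
      bad = 1
    }
    END { exit (bad ? 1 : 0) }         # non-zero exit status on any mismatch
  ' "$1"
}
```

Running `check_columns corpus_name.vrt` prints one line per mismatch; for a corpus with a different set of p-attributes, pass the expected count as the second argument.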