User:Dennisroczek/CDP

This is a draft and an idea collection for a third project by The Document Foundation (after LibreOffice and Document Liberation Project).

This project aims to make dictionaries easier to maintain, to make collaboration within the dictionary team easier, and to provide a central system for downloading up-to-date dictionaries. An initial request was posted by me to the l10n@global.libreoffice.org mailing list on 2 April 2016, see.

Current Situation
The proposal did not take off. Feel free to restart the discussion.

Naming
For now, I chose Collaboration Dictionary Project because nothing better came to my mind. Feel free to add proposals here in this section.

Maintainer

 * easy to use
 * no technical understanding needed to get new extensions packaged
 * no git / gerrit
 * no "wrong usage" of git, see Marco's en_GB case
 * centralized repository
 * ideally a base for many projects (Linux distributions, other application distributions, PEP's Trustwords, etc.)
 * no word lists maintained in duplicate by multiple maintainers who do not know each other
 * it is important to recognize that many dictionaries are maintained in sophisticated lexical databases that can export Hunspell word lists; these languages are unlikely to use this system as the primary repository for their word lists, but may want to use it (1) to export LO/Mozilla extensions and (2) to accept suggested words and corrections from the community; for such languages, easy import of word lists and easy export of suggestions/corrections are important
 * optimized compression (for zipped packages like LibreOffice's OXT)
 * correct use of Hunspell features (affix logic rather than simple word lists)
 * one nice feature: if a user suggests a new word with an affix flag ("foo/X"), the interface should show the user all words generated by the affix file before committing (foo, foos, fooing, foobar, etc.). This would be a big help to people just learning the affix file format.
 * simple distribution (if an API exists for the extensions center, a new release could be pushed automatically)
 * orphaned dictionaries could easily be taken over
 * creating such a complete project helps all languages (e.g. the Czech team is thinking about doing this on its own)
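
The affix preview idea from the list above could be sketched roughly as follows. This is only a minimal illustration in Python, not Hunspell's actual algorithm: it handles suffix (SFX) rules only, assumes single-character flags, and ignores prefixes, continuation flags, and cross products. The function names `parse_suffix_rules` and `expand_entry` are made up for this sketch.

```python
import re


def parse_suffix_rules(aff_text):
    """Collect SFX rules per flag from a (simplified) .aff file text."""
    rules = {}
    for line in aff_text.splitlines():
        parts = line.split()
        # Rule lines have five fields: SFX flag strip add condition.
        # Header lines (SFX flag Y/N count) have only four and are skipped.
        if len(parts) < 5 or parts[0] != "SFX":
            continue
        _, flag, strip, add, condition = parts[:5]
        strip = "" if strip == "0" else strip  # "0" means "nothing"
        add = "" if add == "0" else add
        rules.setdefault(flag, []).append((strip, add, condition))
    return rules


def expand_entry(entry, rules):
    """Return the stem plus every suffixed form an entry like 'foo/X' generates."""
    stem, _, flags = entry.partition("/")
    forms = [stem]
    for flag in flags:  # assumption: one character per flag
        for strip, add, condition in rules.get(flag, []):
            # The condition is a small pattern that must match the end of the stem.
            if not re.search(f"(?:{condition})$", stem):
                continue
            if strip and not stem.endswith(strip):
                continue
            base = stem[: len(stem) - len(strip)] if strip else stem
            form = base + add
            if form not in forms:
                forms.append(form)
    return forms


AFF = """\
SFX X Y 2
SFX X 0 s .
SFX X 0 ing .
SFX Y Y 1
SFX Y y ies [^aeiou]y
"""

rules = parse_suffix_rules(AFF)
print(expand_entry("foo/X", rules))
print(expand_entry("city/Y", rules))
```

An editing interface could show such a list to the contributor before the suggestion is committed; a real implementation would more likely call Hunspell itself to get the exact set of generated forms.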

Normal User

 * easy to use
 * no need to search for a dictionary for the wanted language (e.g. no searching for dictionaries in Hunspell's own code)
 * centralized repository
 * always up to date
 * a place to propose new words or corrections

Long Term

 * invite other projects to join
 * provide an editing interface for maintainers to validate/reject words; candidate words might come via periodic imports from the Crúbadán project linked below
 * allow filtering of candidate words by various criteria (kscanne has scripts for this)
 * allow maintainers to "tag" words by part-of-speech when validating; an important first step for the long-term development of a rich lexical database for more advanced NLP tools
 * convert word lists to other dictionary system (if technically possible)
 * convince other projects to use the same structure?
 * integration in applications for downloading new languages (or updates) on request
 * possibility of a thesaurus (long term?)
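
To make the idea of filtering candidate words more concrete, here is a small Python sketch. It is not based on kscanne's scripts; the criteria (minimum corpus frequency, minimum length, allowed alphabet, not already in the dictionary) and the function name `filter_candidates` are assumptions chosen purely for illustration.

```python
def filter_candidates(counts, alphabet, known=frozenset(), min_count=3, min_len=2):
    """Return candidate words that pass a few simple, configurable checks.

    counts   -- mapping of word -> frequency in some corpus
    alphabet -- set of characters allowed in the language
    known    -- words already present in the dictionary
    """
    accepted = []
    for word, count in counts.items():
        if count < min_count:  # too rare, likely a typo
            continue
        if len(word) < min_len:  # too short to be useful
            continue
        if word in known:  # nothing new for the maintainer
            continue
        if not set(word.lower()) <= alphabet:  # foreign characters or digits
            continue
        accepted.append(word)
    return sorted(accepted)


# Hypothetical sample data for a Scottish Gaelic word list.
counts = {"taigh": 12, "thaigh": 1, "x": 40, "hello9": 7, "agus": 30}
alphabet = set("abcdefghilmnoprstuàèìòù")
print(filter_candidates(counts, alphabet, known={"agus"}))
```

The words that survive such a filter would still need to be validated or rejected by a maintainer in the editing interface.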

Maintainers

 * open question of license (different licenses over different dictionaries)
 * Google Chrome doesn't accept every language

Server

 * Git, or something similar, that hosts the word lists
 * a BuildBot system that creates the dictionary extensions/packages/etc. on a weekly/fortnightly/monthly/quarterly schedule
 * automatic uploading of spelling dictionaries to the Dictionary Extensions host used by LibreOffice
 * conversion tools to import existing / mature projects
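
As a rough sketch of what a build step could do, the following Python snippet packs dictionary files into a zip archive with maximum DEFLATE compression and a fixed timestamp, so that repeated builds of unchanged word lists produce identical output. The file names are hypothetical, and a real OXT package needs additional metadata files that are not shown here.

```python
import io
import zipfile


def build_package(files):
    """Zip a mapping of file name -> bytes with maximum DEFLATE compression."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name in sorted(files):  # stable order of archive members
            # A fixed timestamp keeps the archive identical between builds.
            info = zipfile.ZipInfo(name, date_time=(1980, 1, 1, 0, 0, 0))
            info.compress_type = zipfile.ZIP_DEFLATED
            archive.writestr(info, files[name], compresslevel=9)
    return buffer.getvalue()


# Hypothetical input: the contents of a .dic and an .aff file.
files = {
    "gd_GB.dic": b"taigh\n" * 1000,
    "gd_GB.aff": b"SET UTF-8\n",
}
package = build_package(files)
print(len(package), "bytes")
```

A build bot could compare the resulting bytes with the previous release and upload only when something actually changed.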

Documentation
(for manual work while there is no automated system, but useful nevertheless) Possible topics:
 * How to create a word list
 * How to modify a word list
 * How to upload the word list to the repository
 * How third parties can download content in the repository
 * How to transform the data in the repository to use with other software
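
For the "how to create a word list" topic, a short Python sketch might help. It merges several raw word lists, removes duplicates, and writes them in the layout of a Hunspell .dic file, whose first line holds the (approximate) number of entries. The function name `make_dic` is made up for this example, and entries with and without flags ("foo" vs. "foo/X") are treated as different entries here.

```python
import unicodedata


def make_dic(*word_lists):
    """Merge word lists into the text of a Hunspell .dic file."""
    words = set()
    for word_list in word_lists:
        for raw in word_list:
            # Normalize so that "u" + combining grave and "ù" count as the same word.
            word = unicodedata.normalize("NFC", raw.strip())
            if word:
                words.add(word)
    ordered = sorted(words)
    # First line: entry count, then one entry per line.
    return "\n".join([str(len(ordered)), *ordered]) + "\n"


# Two hypothetical lists from maintainers who did not know about each other.
list_a = ["taigh", "agus ", ""]
list_b = ["agus", "cu\u0300"]
print(make_dic(list_a, list_b))
```

The same merge step would also address the problem of duplicated word lists maintained in parallel.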

Existing Scripts Collection

 * GitHub Project: Scripts to create LibreOffice, OpenOffice, and Mozilla dictionaries from word lists
 * GitHub Project: Data files and scripts for building Scottish Gaelic spell checkers
 * Proofing Tool GUI - easy editing of the Dictionary/Thesaurus/Hyphenation/Autocorrect files
 * Web-crawled word lists for more than 2000 languages under open licenses, an academic project by kscanne

Stuff to Check Myself

 * corpora
 * "My impression is that there is a Python library that takes word lists, and creates affix files from them." (can we find it?)