
Please direct any queries to Professor Qing Cai (qcai@psy.ecnu.edu.cn). If you use the database in your research, please cite it as follows:

Li, L., Yang, Y., Song, M., Fang, S.-Y., Zhang, M.-Y., Chen, Q.-R., & Cai, Q. (2022). CCLOWW: A grade-level Chinese children’s lexicon of written words. Behavior Research Methods. DOI: 10.3758/s13428-022-01890-9.

How to use

To generate a word/character list, enter ranges for the length, frequency, and/or contextual diversity (logCD) of the words/characters in the desired corpus.

Category: Select word (POS-tagged words), lemma (word forms without POS tags), or character.

Corpus: Select from all (the entire corpus), or the subcorpus of G2 (Grade 2 and below), G34 (Grades 3-4) or G56 (Grades 5-6).

length: (For word/lemma) The number of constituent characters.

Freq.million: The number of occurrences per million words/characters.

logCD: Contextual diversity, the log-transformed number of documents in which the word/character occurs.
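
The range filters described above can be sketched offline as follows. This is a minimal illustration, not the site's actual implementation: it assumes the downloaded data has been loaded as a list of dictionaries keyed by `length`, `Freq.million`, and `logCD`, that all bounds are inclusive, and that the rows are words or lemmas (characters have no `length`). The function names and the rows are hypothetical placeholders, not real CCLOWW values.

```python
from typing import Optional


def in_range(value: float, lo: Optional[float], hi: Optional[float]) -> bool:
    """Return True if value lies in [lo, hi]; a bound of None means no limit."""
    return (lo is None or value >= lo) and (hi is None or value <= hi)


def filter_entries(
    entries,
    *,
    min_length=None, max_length=None,
    min_freq=None, max_freq=None,
    min_logcd=None, max_logcd=None,
):
    """Keep entries whose length, Freq.million, and logCD all fall in range."""
    return [
        e for e in entries
        if in_range(e["length"], min_length, max_length)
        and in_range(e["Freq.million"], min_freq, max_freq)
        and in_range(e["logCD"], min_logcd, max_logcd)
    ]


# Invented placeholder rows for illustration only (not real CCLOWW data).
rows = [
    {"item": "w1", "length": 1, "Freq.million": 850.0, "logCD": 3.1},
    {"item": "w2", "length": 2, "Freq.million": 12.5, "logCD": 2.0},
    {"item": "w3", "length": 4, "Freq.million": 0.4, "logCD": 0.7},
]

# Words of 2 to 4 characters occurring at least once per million.
selected = filter_entries(rows, min_length=2, max_length=4, min_freq=1.0)
print([e["item"] for e in selected])  # prints ['w2']
```

Leaving a bound unset corresponds to leaving the matching Min or Max field empty in the search form.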

Other metrics:

POS: The syntactic category of words. See reference here.

rawFreq: The raw number of occurrences.

Zipf: A standardized frequency measure, calculated as log10(Freq.million)+3.
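
As a worked illustration of how these metrics relate, the sketch below derives Freq.million from a raw count and then applies the Zipf formula given above. The function names and the corpus size are assumptions made for the example; the database already provides these values precomputed.

```python
import math


def freq_per_million(raw_freq: int, corpus_size: int) -> float:
    """Occurrences per million tokens, given a raw count and the corpus size."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_freq * 1_000_000 / corpus_size


def zipf(freq_million: float) -> float:
    """Zipf value as defined on this page: log10(Freq.million) + 3."""
    if freq_million <= 0:
        raise ValueError("Freq.million must be positive to take its logarithm")
    return math.log10(freq_million) + 3


# Hypothetical numbers: 250 occurrences in a corpus of 5,000,000 tokens.
fpm = freq_per_million(250, 5_000_000)
print(fpm)        # prints 50.0
print(zipf(1.0))  # prints 3.0
```

Because adding 3 shifts the logarithm by a factor of 1,000, the Zipf value equals the base-10 logarithm of the frequency per billion tokens: an item occurring once per million has a Zipf value of 3, and each additional unit corresponds to a tenfold increase in frequency.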

Note. The available ranges differ across categories and subcorpora. For a full description of the metrics, please refer to:

Li, L., Yang, Y., Song, M., Fang, S.-Y., Zhang, M.-Y., Chen, Q.-R., & Cai, Q. (2022). CCLOWW: A grade-level Chinese children’s lexicon of written words. Behavior Research Methods. DOI: 10.3758/s13428-022-01890-9.

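Usage instructions

You can download the full character and word data directly, or search for and download only the items you need by setting ranges for the properties of characters, words, or lemmas.

Category: Choose from char (character), word (with POS tags), or lemma (without POS tags).

Corpus: Choose from all (the entire corpus), G2 (the Grade 2 and below subcorpus), G34 (the Grades 3-4 subcorpus), or G56 (the Grades 5-6 subcorpus).

length: The number of characters in a word. When Category is set to word or lemma, you can set a minimum (min) and maximum (max) word length. The range of word lengths varies by corpus.

Freq.million: The number of occurrences per million characters/words. You can set a minimum (min) and maximum (max) frequency. The range of frequencies varies by corpus.

logCD (contextual diversity): The log-transformed number of documents in which the character/word occurs. You can set a minimum (min) and maximum (max) contextual diversity. The range of contextual diversity varies by corpus.

Other metrics:

POS: Part of speech. For the meaning of the tags, see the reference here.

rawFreq: The absolute number of occurrences of the character/word.

Zipf: A standardized character/word frequency value, calculated as log10(Freq.million)+3.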
