To generate a word/character list, enter ranges for the length, frequency and/or contextual diversity of the words/characters in the desired corpus.
Category: Select from word (POS-tagged words), lemma (word forms without POS tags) or character.
Corpus: Select from all (the entire corpus), or the subcorpus of G2 (Grade 2 and below), G34 (Grades 3-4) or G56 (Grades 5-6).
length: (For word/lemma) The number of constituent characters.
Freq.million: The number of occurrences per million words/characters.
logCD: Contextual diversity, i.e. the log-transformed number of documents in which the word/character occurs.
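
The range filters described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the tool's actual implementation: the rows, the helper names `in_range` and `filter_rows`, and the choice of inclusive bounds are all assumptions made for the example.

```python
# Hypothetical rows; the column names follow the fields described above,
# but the items and values are invented for illustration only.
rows = [
    {"item": "A", "length": 1, "Freq.million": 1200.0, "logCD": 3.1},
    {"item": "B", "length": 2, "Freq.million": 85.5, "logCD": 2.4},
    {"item": "C", "length": 3, "Freq.million": 4.2, "logCD": 1.0},
]


def in_range(value, lo=None, hi=None):
    """Return True if value lies in [lo, hi]; None means unbounded (assumed inclusive)."""
    return (lo is None or value >= lo) and (hi is None or value <= hi)


def filter_rows(rows, length=(None, None), freq=(None, None), log_cd=(None, None)):
    """Keep rows whose length, Freq.million and logCD all fall in the given ranges."""
    return [
        r for r in rows
        if in_range(r["length"], *length)
        and in_range(r["Freq.million"], *freq)
        and in_range(r["logCD"], *log_cd)
    ]


# Two- or three-character items occurring at least 5 times per million.
selected = filter_rows(rows, length=(2, 3), freq=(5.0, None))
print([r["item"] for r in selected])
```

Leaving a bound as `None` corresponds to leaving the min or max field empty in this sketch.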
Other metrics:
POS: The syntactic category of words. See reference here.
rawFreq: The raw number of occurrences.
Zipf: A standardized frequency measure, calculated as log10(Freq.million)+3.
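
As a worked illustration of how the three frequency measures relate, the sketch below derives Freq.million from a raw count and then applies the Zipf formula given above. The function names and the example counts are hypothetical, and the sketch assumes that Freq.million is the raw count scaled by the total number of words/characters in the selected corpus.

```python
import math


def freq_per_million(raw_freq: int, corpus_size: int) -> float:
    """Occurrences per million tokens (corpus_size is an assumed total token count)."""
    return raw_freq * 1_000_000 / corpus_size


def zipf(freq_million: float) -> float:
    """Zipf value as defined above: log10(Freq.million) + 3."""
    return math.log10(freq_million) + 3


# Invented example: 250 occurrences in a corpus of 5,000,000 tokens.
fpm = freq_per_million(250, 5_000_000)
print(fpm, round(zipf(fpm), 3))
```

Under this formula an item occurring once per million has a Zipf value of 3, and each tenfold increase in frequency adds 1.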
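Users can download the full word and character data directly, or search for and download specific items by setting ranges on the features of characters, words or lemmas.
Category: Select from char (character), word (with POS tags) or lemma (without POS tags).
Corpus: Select from all (the entire corpus), G2 (subcorpus for Grade 2 and below), G34 (subcorpus for Grades 3-4) or G56 (subcorpus for Grades 5-6).
length: The number of characters that make up a word. When Category is word or lemma, a min and max word length can be set. The available range of word lengths varies by corpus.
Freq.million: The number of occurrences per million characters/words. A min and max frequency can be set. The available range of frequencies varies by corpus.
logCD: Contextual diversity, i.e. the log-transformed number of documents in which the character/word occurs. A min and max contextual diversity can be set. The available range varies by corpus.
Other metrics:
POS: Part of speech. For the meaning of the tags, see reference here.
rawFreq: The absolute number of occurrences of the character/word.
Zipf: A standardized character/word frequency value, calculated as log10(Freq.million)+3.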