
Computational linguistics is cool as hell. It may be the link between my inherent computer nerd self and a desire to learn languages. It is, at the very least, more interesting than politics (or, rather, it has pulled my attention away from politics, which is frustrating as hell right now). Let’s say I’ve added it as Life Option #123995.

Anyway, I’ve lately been working on some tools to help study written Chinese. My grasp of the spoken language is coming along nicely, but the written language is a good bit more difficult. One of the tools I wrote is an HTML parser for GB-2312 encoded pages that strips out everything but the characters and reports each character’s frequency count (it also parses out n-grams, but that’s another story altogether). I’ve fed it a few hundred pages and come up with 3824 distinct characters. The most frequent are:

 #  Count  Freq        #  Count  Freq        #  Count  Freq
 1  29630  3.5368%     2  16309  1.9467%     3  15855  1.8926%
 4  14019  1.6734%     5  11330  1.3524%     6  10547  1.2590%
 7   9467  1.1300%     8   9434  1.1261%     9   8930  1.0659%
10   7365  0.8791%    11   7158  0.8544%    12   6697  0.7994%

Thinking about it, though, I guess it sort of makes sense. After all, the definition of literacy in China is somewhere around 1500 characters, and it would seem reasonable that newspapers are written at a barely-literate level (as they are in this country).

Update: Yeah, my parser was totally fucked up and reporting bad statistics. It’s better now, and the updated graph reflects the correct data so far.

There are, of course, more pages to parse and more information to extract, but I thought that what I’ve found so far was pretty intriguing.
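For the curious, here is a minimal sketch of how a counter like this could be put together in Python. It is not the actual parser behind the numbers above; the names (`TextExtractor`, `hanzi_runs`, `count_chars`, `count_ngrams`, `frequency_table`) are made up for illustration, and it assumes that "character" means anything in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) and that n-grams should not span HTML tags or punctuation.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumption: a "character" is anything in the basic CJK Unified Ideographs block.
HANZI_RUN = re.compile(r"[\u4e00-\u9fff]+")


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # nesting depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def hanzi_runs(raw, encoding="gb2312"):
    """Decode a raw page and return its runs of consecutive Chinese characters."""
    parser = TextExtractor()
    # Undecodable bytes become U+FFFD, which the regex below ignores.
    parser.feed(raw.decode(encoding, errors="replace"))
    parser.close()
    runs = []
    for chunk in parser.chunks:
        runs.extend(HANZI_RUN.findall(chunk))
    return runs


def count_chars(runs):
    """Count how often each individual character occurs."""
    return Counter(ch for run in runs for ch in run)


def count_ngrams(runs, n=2):
    """Count n-grams within each run (never across tags or punctuation)."""
    return Counter(run[i:i + n] for run in runs for i in range(len(run) - n + 1))


def frequency_table(counts, top=10):
    """Return (item, count, percent of all counted items) for the top entries."""
    total = sum(counts.values())
    return [(item, c, 100.0 * c / total) for item, c in counts.most_common(top)]
```

To use it, read each saved page in binary mode, pass the bytes to `hanzi_runs`, and sum the resulting `Counter` objects across pages before calling `frequency_table`. Counting within runs rather than over the whole page keeps bigrams from straddling a comma or a tag boundary, which seemed like the more sensible default.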

5 Responses to “汉字 frequency”

    I really think you should just study Chinese full-time. It’s obvious you’re interested/obsessed. Once you’re fully proficient in both the spoken and written language, you will find plenty of opportunities in a wide array of jobs. The “nerd” in you will continue to flourish on its own without structure… isn’t that what makes a nerd a nerd???

    Newspapers in China seem much easier to read than those in Hong Kong or Taiwan… I think they’ve really cut back on the literary language, along with proverbs, etc…

    They have, particularly the literary forms… a lot of the old style of writing (and especially wenyan, which was seen as a tool that the bourgeoisie used to keep the workers uneducated) has been thrown out in the quest for universal literacy.

    I guess if you can’t make the education system better, you can just lower the standards. It seems to work for Florida! :)

    Yeah, wenyan is not used in daily life any more. However, high school students still need to learn about it (not in depth). They are expected to be able to read and understand simple wenyan, and they also get questions about wenyan in their university entrance examinations.

    Many people think Chinese is a hard language to learn, especially the different tones, but writing is much harder than speaking.