2008-04-20

Detecting English sentence boundaries with NLTK

Here is a simple rule-based algorithm for detecting where English sentences end:

Separating sentences

I tried to implement this kind of algorithm a year ago, and the problem turned out to be very tricky.
Take this example:

CELLULAR COMMUNICATIONS INC. sold 1,550,000 common shares at $21.75 each
yesterday, according to lead underwriter L.F. Rothschild & Co.

Treating the period right after "INC.", the one inside "$21.75", or the ones in "L.F." as the end of a sentence would be badly wrong.
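To make the failure concrete, here is a minimal sketch of the most naive approach, which treats every period as a sentence end. The function name `naive_split` is made up for this illustration, and the code is written for Python 3.

```python
def naive_split(text):
    """Treat every period as the end of a sentence (deliberately naive)."""
    pieces = [piece.strip() for piece in text.split(".")]
    # Put the period back on each non-empty piece.
    return [piece + "." for piece in pieces if piece]


sample = (
    "CELLULAR COMMUNICATIONS INC. sold 1,550,000 common shares at $21.75 each "
    "yesterday, according to lead underwriter L.F. Rothschild & Co."
)

for sentence in naive_split(sample):
    print(repr(sentence))
```

The single sentence comes back as five fragments, cut after "INC.", inside "$21.75", and twice inside "L.F.".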

The paper I had my eye on a year ago for solving this problem is the following.

Unsupervised Multilingual Sentence Boundary Detection

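The paper proposes a method that is not limited to English.
It detects sentence boundaries using only statistics gathered from a large amount of raw text.
The drawback, compared with rule-based approaches, is that carefully building the statistics beforehand and implementing the various special-case handling after the statistical step is tedious.
I got partway through an implementation last year, but stopped halfway when I came across word that it was apparently planned for NLTK, the natural language processing library for Python.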
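To give a feel for what "statistics from raw text" can mean, here is a very rough sketch, not the paper's actual scoring: it simply flags word types that almost always appear with a trailing period as abbreviation candidates. The function name, the thresholds `min_count` and `min_ratio`, and the tiny corpus are all assumptions made up for this illustration (Python 3).

```python
from collections import Counter


def abbreviation_candidates(text, min_count=2, min_ratio=0.9):
    """Return word types that nearly always occur with a trailing period."""
    with_period = Counter()
    total = Counter()
    for token in text.split():
        token = token.strip("\"'(),;:")
        ends_with_period = token.endswith(".")
        word = token.rstrip(".").lower()
        # Ignore empty tokens and tokens without letters (e.g. numbers).
        if not word or not any(ch.isalpha() for ch in word):
            continue
        total[word] += 1
        if ends_with_period:
            with_period[word] += 1
    return {
        word
        for word, count in total.items()
        if count >= min_count and with_period[word] / count >= min_ratio
    }


corpus = (
    "Mr. Smith met Mr. Jones at the office. "
    "Mr. Jones left the office early. The office was empty."
)
print(abbreviation_candidates(corpus))
```

Here "mr" is flagged because it never appears without a period, while "office" is not, since it usually appears without one. A real implementation needs much more than this, which is where the tedious part comes in.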

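So when I looked into it again after a long while, I found that the algorithm has already been implemented and released in NLTK!

I installed it right away and tried it out.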

Installation
% wget http://prdownloads.sourceforge.net/nltk/nltk-0.9.2.tar.gz
% tar zxvf nltk-0.9.2.tar.gz
% cd nltk-0.9.2
% sudo python setup.py install
% cd ..
% sudo mkdir /usr/share/nltk
% cd /usr/share/nltk
% sudo wget http://prdownloads.sourceforge.net/nltk/nltk-data-0.9.2.zip
% sudo unzip nltk-data-0.9.2.zip
% sudo chmod -R g+r data
% export NLTK_DATA=/usr/share/nltk/data
% python
>>> import nltk
>>> nltk.corpus.brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>>

If you get this far, the installation is complete.

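Preparing raw text data
For example, follow links from Google News (English edition) and keep pasting the body text of news articles into a file named news.txt.
For an experiment, about 1,000 lines was enough.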

Experiment!!
First, feed it the raw text to train it.
>>> from nltk.tokenize import PunktSentenceTokenizer
>>> p = PunktSentenceTokenizer()
>>> fp = open("news.txt")
>>> p.train(fp.read())
>>> fp.close()


Next, actually run the detection.
The text to test with is the following slightly nasty example:
The Finland-based company expects a weaker dollar and slower economic growth in the U.S. and parts of Europe to dampen the overall handset market this year. About half of Nokia's (NOK) sales are in dollars or currencies tied to it; a weaker dollar makes imports more expensive.

"What spooked us was its outlook for the industry in general," said Rick Franklin, equities analyst at Edward Jones.

Nokia reiterated projections that the industry shipments of handsets will grow 10% this year over last. In the first quarter, though, global shipments rose 17%, suggesting a slowdown in the remainder of the year.

For the quarter that ended March 31, Nokia earned $1.9 billion (1.2 euros), up 25% from the same quarter last year but short of an expected $2.3 billion. Overall sales rose 28% to $20.1 billion (12.6 billion euros), roughly in line with views.


>>> a=p.tokenize("""The Finland-based company expects
a weaker dollar and slower economic growth in the U.S. and parts of Europe to
dampen the overall handset market this year. About half of Nokia's (NOK) sales
are in dollars or currencies tied to it; a weaker dollar makes imports more
expensive.

"What spooked us was its outlook for the industry in general," said Rick Franklin,
equities analyst at Edward Jones.

Nokia reiterated projections that the industry shipments of handsets will grow 10%
this year over last. In the first quarter, though, global shipments rose 17%,
suggesting a slowdown in the remainder of the year.

For the quarter that ended March 31, Nokia earned $1.9 billion (1.2 euros), up 25%
from the same quarter last year but short of an expected $2.3 billion. Overall
sales rose 28% to $20.1 billion (12.6 billion euros), roughly in line with views.""")
>>> for x in a:
...     print(x)
...     print("-"*20)
...
The Finland-based company expects a weaker dollar and slower economic growth
in the U.S. and parts of Europe to dampen the overall handset market this year.
--------------------
About half of Nokia's (NOK) sales are in dollars or currencies tied to it;
a weaker dollar makes imports more expensive.
--------------------
"What spooked us was its outlook for the industry in general," said Rick Franklin,
equities analyst at Edward Jones.
--------------------
Nokia reiterated projections that the industry shipments of handsets will
grow 10% this year over last.
--------------------
In the first quarter, though, global shipments rose 17%, suggesting a slowdown
in the remainder of the year.
--------------------
For the quarter that ended March 31, Nokia earned $1.9 billion (1.2 euros), up
25% from the same quarter last year but short of an expected $2.3 billion.
--------------------
Overall sales rose 28% to $20.1 billion (12.6 billion euros), roughly in line
with views.
--------------------
>>>

You can see that it splits sentences properly at periods while avoiding splits at "U.S.", "1.2 euros", and the like.
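As a rough illustration of this behavior, and not of what NLTK does internally, the sketch below only considers a period followed by whitespace as a candidate boundary and skips candidates whose preceding word is a known abbreviation. The function `split_sentences`, the hand-written abbreviation set, and the shortened test text are assumptions for this example (Python 3).

```python
import re


def split_sentences(text, abbreviations):
    """Split at 'period + whitespace' unless the word before it is an abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"\.\s+", text):
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        if last_word in abbreviations:
            continue  # e.g. "U.S." in mid-sentence: not a boundary
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = (
    "Growth slowed in the U.S. and parts of Europe. "
    "Nokia earned $1.9 billion (1.2 euros). Sales rose 28%."
)
for sentence in split_sentences(text, {"u.s"}):
    print(sentence)
```

Decimal points such as the one in "$1.9" are never candidates because no whitespace follows them, and "U.S." is skipped because of the abbreviation set. In the trained tokenizer that knowledge comes from the raw text instead of a hand-written set.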

To improve accuracy, you need to feed it much, much more raw text.