sudachipy

Python version of Sudachi, the Japanese Morphological Analyzer

Rank: #2897 | Downloads: 1,830,841 (30 days) | Stars: 427 | Forks: 47

Description

SudachiPy

SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer.

It is not a pure Python implementation but a set of Python bindings for Sudachi.rs.

Binary wheels

We provide binary builds only for the x86_64 architecture, on macOS (10.14+), Windows, and Linux. The 32-bit x86 architecture is neither supported nor tested. Source builds on macOS appear to work on ARM-based (AArch64) Macs, but that architecture is also untested, and building from source requires installing the Rust toolchain and Cargo.

More information here.

TL;DR

$ pip install sudachipy sudachidict_core

$ echo "高輪ゲートウェイ駅" | sudachipy
高輪ゲートウェイ駅	名詞,固有名詞,一般,*,*,*	高輪ゲートウェイ駅
EOS

$ echo "高輪ゲートウェイ駅" | sudachipy -m A
高輪	名詞,固有名詞,地名,一般,*,*	高輪
ゲートウェイ	名詞,普通名詞,一般,*,*,*	ゲートウェー
駅	名詞,普通名詞,一般,*,*,*	駅
EOS

$ echo "空缶空罐空きカン" | sudachipy -a
空缶	名詞,普通名詞,一般,*,*,*	空き缶	空缶	アキカン	0
空罐	名詞,普通名詞,一般,*,*,*	空き缶	空罐	アキカン	0
空きカン	名詞,普通名詞,一般,*,*,*	空き缶	空きカン	アキカン	0
EOS

from sudachipy import Dictionary, SplitMode

tokenizer = Dictionary().create()

morphemes = tokenizer.tokenize("国会議事堂前駅")
print(morphemes[0].surface())  # '国会議事堂前駅'
print(morphemes[0].reading_form())  # 'コッカイギジドウマエエキ'
print(morphemes[0].part_of_speech())  # ['名詞', '固有名詞', '一般', '*', '*', '*']

morphemes = tokenizer.tokenize("国会議事堂前駅", SplitMode.A)
print([m.surface() for m in morphemes])  # ['国会', '議事', '堂', '前', '駅']

Setup

You need SudachiPy and a dictionary.

Step 1. Install SudachiPy

pip install sudachipy

Step 2. Get a Dictionary

You can install a dictionary as a Python package. Downloading the dictionary file may take a while (around 70 MB for the core edition).

pip install sudachidict_core

Alternatively, you can choose another dictionary edition. See the Dictionary Edition section for details.

Usage: As a command

SudachiPy provides a CLI command, sudachipy.

$ echo "外国人参政権" | sudachipy
外国人参政権	名詞,普通名詞,一般,*,*,*	外国人参政権
EOS
$ echo "外国人参政権" | sudachipy -m A
外国	名詞,普通名詞,一般,*,*,*	外国
人	接尾辞,名詞的,一般,*,*,*	人
参政	名詞,普通名詞,一般,*,*,*	参政
権	接尾辞,名詞的,一般,*,*,*	権
EOS
$ sudachipy tokenize -h
usage: sudachipy tokenize [-h] [-r file] [-m {A,B,C}] [-o file] [-s string]
                          [-a] [-d] [-v]
                          [file [file ...]]

Tokenize Text

positional arguments:
  file           text written in utf-8

optional arguments:
  -h, --help     show this help message and exit
  -r file        the setting file in JSON format
  -m {A,B,C}     the mode of splitting
  -o file        the output file
  -s string      sudachidict type
  -a             print all of the fields
  -d             print the debug information
  -v, --version  print sudachipy version

Note: The Debug option (-d) is disabled in version 0.6.*

Output

Columns are tab separated.

  • Surface
  • Part-of-Speech Tags (comma separated)
  • Normalized Form

When you add the -a option, it additionally outputs

  • Dictionary Form
  • Reading Form
  • Dictionary ID
    • 0 for the system dictionary
    • 1 and above for the user dictionaries
    • -1 if a word is Out-of-Vocabulary (not in the dictionary)
  • Synonym group IDs
  • (OOV) if a word is Out-of-Vocabulary (not in the dictionary)

$ echo "外国人参政権" | sudachipy -a
外国人参政権	名詞,普通名詞,一般,*,*,*	外国人参政権	外国人参政権	ガイコクジンサンセイケン	0	[]
EOS
$ echo "阿quei" | sudachipy -a
阿	名詞,普通名詞,一般,*,*,*	阿	阿		-1	[]	(OOV)
quei	名詞,普通名詞,一般,*,*,*	quei	quei		-1	[]	(OOV)
EOS
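
The column layout described above can also be consumed programmatically. The following is a minimal sketch and not part of SudachiPy: ParsedToken and parse_line are hypothetical names introduced for illustration. It assumes the columns appear in the order listed above and that trailing columns may be absent (for example, when -a is not given).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ParsedToken:
    """One line of sudachipy CLI output (hypothetical helper type)."""
    surface: str
    pos: List[str]
    normalized_form: str
    dictionary_form: Optional[str] = None
    reading_form: Optional[str] = None
    dictionary_id: Optional[int] = None
    synonym_group_ids: Optional[str] = None  # kept as raw text, e.g. "[]"
    is_oov: bool = False


def parse_line(line: str) -> Optional[ParsedToken]:
    """Parse one output line; return None for the EOS marker or a blank line."""
    line = line.rstrip("\r\n")
    if not line or line == "EOS":
        return None
    cols = line.split("\t")
    if len(cols) < 3:
        raise ValueError(f"expected at least 3 tab-separated columns: {line!r}")
    token = ParsedToken(cols[0], cols[1].split(","), cols[2])
    # The remaining columns are only printed with the -a option.
    if len(cols) > 3:
        token.dictionary_form = cols[3]
    if len(cols) > 4:
        token.reading_form = cols[4]
    if len(cols) > 5:
        token.dictionary_id = int(cols[5])
    if len(cols) > 6:
        token.synonym_group_ids = cols[6]
    # The "(OOV)" marker, when present, follows the synonym group column.
    token.is_oov = "(OOV)" in cols[7:]
    return token
```

Feeding each line of the command output to parse_line and discarding the None results yields one ParsedToken per morpheme. A dictionary_id of -1 and is_oov set to True both indicate an Out-of-Vocabulary word.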

Usage: As a Python package

API

See API reference page.

Example

from sudachipy import Dictionary, SplitMode

tokenizer_obj = Dictionary().create()
# Multi-granular Tokenization

# SplitMode.C is the default mode
[m.surface() for m in tokenizer_obj.tokenize("国家公務員", SplitMode.C)]
# => ['国家公務員']

[m.surface() for m in tokenizer_obj.tokenize("国家公務員", SplitMode.B)]
# => ['国家', '公務員']

[m.surface() for m in tokenizer_obj.tokenize("国家公務員", SplitMode.A)]
# => ['国家', '公務', '員']
# Morpheme information

m = tokenizer_obj.tokenize("食べ")[0]

m.surface() # => '食べ'
m.dictionary_form() # => '食べる'
m.reading_form() # => 'タベ'
m.part_of_speech() # => ['動詞', '一般', '*', '*', '下一段-バ行', '連用形-一般']
# Normalization

mode = SplitMode.C  # SplitMode.C is the default mode

tokenizer_obj.tokenize("附属", mode)[0].normalized_form()
# => '付属'
tokenizer_obj.tokenize("SUMMER", mode)[0].normalized_form()
# => 'サマー'
tokenizer_obj.tokenize("シュミレーション", mode)[0].normalized_form()
# => 'シミュレーション'

(With 20210802 core dictionary. The results may change when you use other versions)
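
The morpheme accessors shown above are enough to rebuild the three default columns of the command-line output from Python. The sketch below is illustrative only: format_morpheme and tokenize_to_lines are hypothetical helper names, not SudachiPy functions, and the layout simply follows the Output section of this README.

```python
def format_morpheme(m) -> str:
    """Return 'surface<TAB>part-of-speech tags<TAB>normalized form' for one morpheme."""
    return "\t".join([
        m.surface(),
        ",".join(m.part_of_speech()),  # tags are joined with commas, as in the CLI
        m.normalized_form(),
    ])


def tokenize_to_lines(text, mode=None):
    """Tokenize text and return CLI-style lines, ending with 'EOS'."""
    # Imported lazily so format_morpheme stays usable without SudachiPy installed.
    from sudachipy import Dictionary, SplitMode

    tokenizer = Dictionary().create()
    if mode is None:
        mode = SplitMode.C  # the default mode, as noted in the example above
    morphemes = tokenizer.tokenize(text, mode)
    return [format_morpheme(m) for m in morphemes] + ["EOS"]
```

With SudachiPy and a dictionary installed, print("\n".join(tokenize_to_lines("外国人参政権"))) should produce lines in the same shape as the command-line output. The exact tokens depend on the dictionary version.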

Dictionary Edition

There are three editions of the Sudachi Dictionary: small, core, and full. See WorksApplications/SudachiDict for details.

SudachiPy uses sudachidict_core by default.

Dictionaries can be installed as Python packages sudachidict_small, sudachidict_core, and sudachidict_full.

The dictionary files are not included in the packages themselves; they are downloaded upon installation.

Dictionary option: command line

You can specify the dictionary with the tokenize option -s.

$ pip install sudachidict_small
$ echo "外国人参政権" | sudachipy -s small
$ pip install sudachidict_full
$ echo "外国人参政権" | sudachipy -s full
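
The same options can be passed when driving the command from a script. The following is a rough sketch using only the -m, -s, and -a options documented above. build_command and run_sudachipy are hypothetical helper names, and the sketch assumes the sudachipy command is on the PATH and that the chosen dictionary package is installed.

```python
import subprocess
from typing import List, Optional


def build_command(mode: Optional[str] = None,
                  dict_type: Optional[str] = None,
                  all_fields: bool = False) -> List[str]:
    """Build the argument list for the sudachipy command."""
    cmd = ["sudachipy"]
    if mode is not None:
        if mode not in ("A", "B", "C"):
            raise ValueError(f"mode must be 'A', 'B', or 'C', got {mode!r}")
        cmd += ["-m", mode]
    if dict_type is not None:
        cmd += ["-s", dict_type]  # e.g. "small", "core", or "full"
    if all_fields:
        cmd.append("-a")
    return cmd


def run_sudachipy(text: str, **options) -> str:
    """Pipe text to sudachipy on stdin and return its standard output."""
    result = subprocess.run(
        build_command(**options),
        input=text + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if the command fails
    )
    return result.stdout
```

For example, run_sudachipy("外国人参政権", mode="A", dict_type="small") corresponds to echo "外国人参政権" | sudachipy -m A -s small.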

Dictionary option: Python package

You can specify the dictionary with the Dictionary() arguments config or dict.

class Dictionary(config=None, resource_dir=None, dict=None)
  1. config
    • You can specify the path to the setting file with config (see the Dictionary in The Setting File section for details).
    • If the dictionary file is specified in the setting file as systemDict, SudachiPy will use the dictionary.
  2. dict
    • You can also specify the dictionary type with dict.
    • The available arguments are small, core, full, or a path to the dictionary file.
    • If different dictionaries are specified with config and dict, the dictionary specified by dict overrides the one defined in the config.
from sudachipy import Dictionary

# default: sudachidict_core
tokenizer_obj = Dictionary().create()

# The dictionary given by the `systemDict` key in the config file (/path/to/sudachi.json) will be used
tokenizer_obj = Dictionary(config="/path/to/sudachi.json").create()

# The dictionary specified by `dict` will be used.
tokenizer_obj = Dictionary(dict="core").create()  # sudachidict_core (same as default)
tokenizer_obj = Dictionary(dict="small").create()  # sudachidict_small
tokenizer_obj = Dictionary(dict="full").create()  # sudachidict_full

# The dictionary specified by `dict` overrides those defined in the config.
# In the following code, `sudachidict_full` will be used regardless of a dictionary defined in the config file.
tokenizer_obj = Dictionary(config="/path/to/sudachi.json", dict="full").create()

Dictionary in The Setting File

Alternatively, if the dictionary file is specified in the setting file, sudachi.json, SudachiPy will use that file.

{
    "systemDict" : "relative/path/from/resourceDir/to/system.dic",
    ...
}

The default setting file is sudachi.json. You can specify your sudachi.json with the -r option.

$ sudachipy -r path/to/sudachi.json

User Dictionary

To use a user dictionary, user.dic, place sudachi.json anywhere you like and add a userDict value containing the relative path from sudachi.json to your user.dic.

{
    "userDict" : ["relative/path/to/user.dic"],
    ...
}

Then specify your sudachi.json with the -r option.

$ sudachipy -r path/to/sudachi.json
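
Because the userDict entry is relative to the location of sudachi.json, it can be convenient to generate the file from a script. The following is a minimal sketch using only the standard library. write_settings is a hypothetical helper, and any other keys your setup needs (the "..." in the snippet above) are assumed to be supplied through the extra argument.

```python
import json
import os


def write_settings(settings_path, user_dic_path, extra=None):
    """Write a sudachi.json whose userDict points at user_dic_path.

    The path is stored relative to the directory containing the setting file.
    """
    settings_dir = os.path.dirname(os.path.abspath(settings_path))
    relative = os.path.relpath(os.path.abspath(user_dic_path), settings_dir)

    settings = dict(extra or {})  # other keys are passed through unchanged
    settings["userDict"] = [relative.replace(os.sep, "/")]

    with open(settings_path, "w", encoding="utf-8") as f:
        json.dump(settings, f, ensure_ascii=False, indent=4)
    return settings
```

The resulting file can then be passed to the command with the -r option as shown above.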

You can build a user dictionary with the subcommand ubuild.

$ sudachipy ubuild -h
usage: sudachipy ubuild [-h] [-o file] [-d string] -s file file [file ...]

Build User Dictionary

positional arguments:
  file        source files with CSV format (one or more)

options:
  -h, --help  show this help message and exit
  -o file     output file (default: user.dic)
  -d string   description comment to be embedded on dictionary

required named arguments:
  -s file     system dictionary path

For the dictionary file format, please refer to this document (written in Japanese; an English version is not available yet).

Customized System Dictionary

$ sudachipy build -h
usage: sudachipy build [-h] [-o file] [-d string] -m file file [file ...]

Build Sudachi Dictionary

positional arguments:
  file        source files with CSV format (one of more)

optional arguments:
  -h, --help  show this help message and exit
  -o file     output file (default: system.dic)
  -d string   description comment to be embedded on dictionary

required named arguments:
  -m file     connection matrix file with MeCab's matrix.def format

To use your customized system.dic, place sudachi.json to anywhere you like, and o