wcwidth

Measures the displayed width of unicode strings in a terminal

|pypi_downloads| |codecov| |license|

============
Introduction
============

This library is mainly for CLI/TUI programs that carefully produce output for terminals.

Installation
------------

The stable version of this package is maintained on PyPI. Install or upgrade it using pip::

    pip install --upgrade wcwidth

Problem
-------

Python's string-formatting functions `textwrap.wrap()`_, `str.ljust()`_, `str.rjust()`_, and
`str.center()`_ **incorrectly** measure the displayed width of a string as equal to the number of
its codepoints.

Some examples of **incorrect results**:

.. code-block:: python

    >>> # result consumes 16 total cells, 11 expected,
    >>> 'コンニチハ'.rjust(11, 'X')
    'XXXXXXコンニチハ'

    >>> # result consumes 5 total cells, 6 expected ('é' is 'e' + combining acute accent)
    >>> 'cafe\u0301'.center(6, 'X')
    'caféX'
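
The gap between codepoint count and displayed cells can be reproduced with the standard library
alone. The sketch below is a rough approximation and not this library's algorithm: the helper
``approx_cells()`` is a hypothetical name, and it only accounts for East Asian Wide/Fullwidth
characters and combining marks.

.. code-block:: python

    import unicodedata


    def approx_cells(text):
        """Approximate the number of terminal cells used by text."""
        cells = 0
        for ch in text:
            if unicodedata.combining(ch):
                # A combining mark is drawn over the previous character.
                continue
            # Wide and Fullwidth characters occupy two cells, others one.
            cells += 2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
        return cells


    for sample in ('コンニチハ', 'cafe\u0301'):
        # len() counts codepoints, which is what str.rjust() and friends use.
        print(repr(sample), len(sample), approx_cells(sample))

Both samples have five codepoints, yet the first occupies ten cells and the second only four.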

Solution
--------

The lowest-level functions in this library, `wcwidth()`_ and `wcswidth()`_, copy the interface of
the POSIX.1-2001 and POSIX.1-2008 functions `wcwidth(3)`_ and `wcswidth(3)`_.
These functions return -1 when C0 or C1 control codes are present.

An easy-to-use `width()`_ function is provided as a wrapper of `wcswidth()`_ that is also capable of
measuring most terminal control codes and sequences, like colors, bold, tabstops, and horizontal
cursor movement.

Text justification is solved by the grapheme and sequence-aware functions `ljust()`_,
`rjust()`_, `center()`_, and `wrap()`_, serving as drop-in replacements for the Python standard
library functions of the same names.

The iterator functions `iter_graphemes()`_ and `iter_sequences()`_ allow for careful navigation of
grapheme and terminal control sequence boundaries.  `iter_graphemes_reverse()`_ and
`grapheme_boundary_before()`_ are useful for editing and searching complex unicode.  The
`clip()`_ function extracts substrings by display column positions, and `strip_sequences()`_ removes
terminal escape sequences from text altogether.

Discrepancies
-------------

You may find that terminal support *varies* for complex unicode sequences or codepoints.

A companion utility, `jquast/ucs-detect`_, was authored to gather and publish test results by
terminal emulator software and version as a `General Tabulated Summary`_. It covers wide
characters, language/grapheme clustering and complex script support, emoji and zero-width joiner
sequences, variation selectors, and regional indicators (flags).

========
Overview
========

This section gives a brief overview, through examples, of all the public API functions.

Full API Documentation at https://wcwidth.readthedocs.io/en/latest/api.html

wcwidth()
---------

Measures the width of a single codepoint:

.. code-block:: python

    >>> # '♀' narrow emoji
    >>> wcwidth.wcwidth('\u2640')
    1

Use function `wcwidth()`_ to determine the width of a *single unicode character*.

See specification_ of character measurements. Note that ``-1`` is returned for control codes.

wcswidth()
----------

Measures the width of a string, returning -1 for control codes.

.. code-block:: python

    >>> # '♀️' emoji w/vs-16
    >>> wcwidth.wcswidth('\u2640\ufe0f')
    2

Use function `wcswidth()`_ to determine the width of a *string of unicode characters*.

See specification_ of character measurements. Note that ``-1`` is returned if a control code occurs
anywhere in the string.

width()
-------

Use function `width()`_ to measure a string with improved handling of ``control_codes``.

.. code-block:: python

    >>> # same support as wcswidth(), e.g. a regional indicator flag:
    >>> wcwidth.width('\U0001F1FF\U0001F1FC')
    2
    >>> # but also supports SGR colored text, 'WARN', followed by SGR reset
    >>> wcwidth.width('\x1b[38;2;255;150;100mWARN\x1b[0m')
    4
    >>> # tabs,
    >>> wcwidth.width('\t', tabsize=4)
    4
    >>> # or, tab and all other control characters can be ignored
    >>> wcwidth.width('\t', control_codes='ignore')
    0
    >>> # "vertical" control characters are ignored
    >>> wcwidth.width('\n')
    0
    >>> # as well as sequences with "indeterminate" effects like Home + Clear
    >>> wcwidth.width('\x1b[H\x1b[2J')
    0
    >>> # or, raise ValueError for "indeterminate" effects using control_codes='strict'
    >>> wcwidth.width('\n', control_codes='strict')
    Traceback (most recent call last):
    ...
    ValueError: Vertical movement character 0xa at position 0

Use ``control_codes='ignore'`` for slightly improved performance when the input is known not to
contain any control characters or terminal sequences. Note that TAB (``'\t'``) is a control
character and is also ignored, so you may want to call `str.expandtabs()`_ first.

iter_sequences()
----------------

Iterates through text, segmented by terminal sequence:

.. code-block:: python

    >>> list(wcwidth.iter_sequences('hello'))
    [('hello', False)]
    >>> list(wcwidth.iter_sequences('\x1b[31mred\x1b[0m'))
    [('\x1b[31m', True), ('red', False), ('\x1b[0m', True)]

Use `iter_sequences()`_ to split text into segments of plain text and escape sequences. Each tuple
contains the segment string and a boolean indicating whether it is an escape sequence (``True``) or
text (``False``).
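
The segmentation can be pictured with a regular expression. The sketch below is a simplified
illustration and not this library's implementation: ``split_csi()`` is a hypothetical helper that
recognizes only CSI sequences (``ESC [`` followed by parameter, intermediate, and final bytes),
while real terminal output contains many other kinds of sequences.

.. code-block:: python

    import re

    # CSI: ESC [ , parameter bytes 0x30-0x3F, intermediate bytes 0x20-0x2F,
    # and one final byte 0x40-0x7E.
    CSI = re.compile(r'\x1b\[[0-?]*[ -/]*[@-~]')


    def split_csi(text):
        """Yield (segment, is_sequence) tuples for plain text and CSI sequences."""
        pos = 0
        for match in CSI.finditer(text):
            if match.start() > pos:
                # Plain text found between two sequences.
                yield text[pos:match.start()], False
            yield match.group(), True
            pos = match.end()
        if pos < len(text):
            # Trailing plain text after the last sequence.
            yield text[pos:], False


    print(list(split_csi('\x1b[31mred\x1b[0m')))

Joining only the segments flagged ``False`` gives the plain text with these sequences removed.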

iter_graphemes()
----------------

Use `iter_graphemes()`_ to iterate over *grapheme clusters* of a string.

.. code-block:: python

    >>> import wcwidth
    >>> # ok + Regional Indicator 'Z', 'W' (Zimbabwe)
    >>> list(wcwidth.iter_graphemes('ok\U0001F1FF\U0001F1FC'))
    ['o', 'k', '🇿🇼']

    >>> # cafe + combining acute accent
    >>> list(wcwidth.iter_graphemes('cafe\u0301'))
    ['c', 'a', 'f', 'é']

    >>> # ok + Emoji Man + ZWJ + Woman + ZWJ + Girl
    >>> list(wcwidth.iter_graphemes('ok\U0001F468\u200D\U0001F469\u200D\U0001F467'))
    ['o', 'k', '👨\u200d👩\u200d👧']

A grapheme cluster is what a user perceives as a single character, even if it is composed of
multiple Unicode codepoints. This function implements `Unicode Standard Annex #29`_ grapheme cluster
boundary rules.
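
To illustrate why codepoints and graphemes differ, the sketch below groups combining marks with
their base character using only the standard library. It is a deliberately naive stand-in under
stated assumptions: ``naive_clusters()`` is a hypothetical helper covering a small part of the
annex's rules, and it does not join emoji ZWJ sequences or regional indicator pairs.

.. code-block:: python

    import unicodedata


    def naive_clusters(text):
        """Group each combining mark with the character that precedes it."""
        clusters = []
        for ch in text:
            if clusters and unicodedata.combining(ch):
                # Non-zero combining class: attach to the previous cluster.
                clusters[-1] += ch
            else:
                clusters.append(ch)
        return clusters


    # Five codepoints, but four user-perceived characters.
    print(naive_clusters('cafe\u0301'))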

ljust()
-------

Use `ljust()`_ as replacement of `str.ljust()`_:

.. code-block:: python

    >>> 'コンニチハ'.ljust(11, '*')             # don't do this
    'コンニチハ******'
    >>> wcwidth.ljust('コンニチハ', 11, '*')    # do this!
    'コンニチハ*'

rjust()
-------

Use `rjust()`_ as replacement of `str.rjust()`_:

.. code-block:: python

    >>> 'コンニチハ'.rjust(11, '*')             # don't do this
    '******コンニチハ'
    >>> wcwidth.rjust('コンニチハ', 11, '*')    # do this!
    '*コンニチハ'

center()
--------

Use `center()`_ as replacement of `str.center()`_:

.. code-block:: python

    >>> 'cafe\u0301'.center(6, '*')             # don't do this
    'café*'
    >>> wcwidth.center('cafe\u0301', 6, '*')    # do this!
    '*café*'
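
The three justification functions share one idea: the padding is computed from displayed cells
rather than from the number of codepoints. A minimal sketch of that idea follows, assuming a rough
standard-library width estimate. ``approx_cells()`` and ``pad_cells()`` are hypothetical helpers,
and the side that receives an odd remainder when centering is an assumption of this sketch, not a
statement about this library.

.. code-block:: python

    import unicodedata


    def approx_cells(text):
        """Rough cell count: wide characters are 2, combining marks are 0."""
        return sum(
            0 if unicodedata.combining(ch)
            else 2 if unicodedata.east_asian_width(ch) in ('W', 'F')
            else 1
            for ch in text
        )


    def pad_cells(text, width, fillchar=' ', align='left'):
        """Pad text to width cells; align is 'left', 'right', or 'center'."""
        pad = max(0, width - approx_cells(text))
        if align == 'left':
            return text + fillchar * pad
        if align == 'right':
            return fillchar * pad + text
        # Assumption: an odd remainder goes to the right-hand side.
        left = pad // 2
        return fillchar * left + text + fillchar * (pad - left)


    print(pad_cells('コンニチハ', 11, '*', align='right'))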

wrap()
------

Use function `wrap()`_ to wrap text containing terminal sequences, Unicode grapheme
clusters, and wide characters to a given display width.

.. code-block:: python

    >>> from wcwidth import wrap
    >>> # Basic wrapping
    >>> wrap('hello world', 5)
    ['hello', 'world']

    >>> # Wrapping CJK text (each character is 2 cells wide)
    >>> wrap('コンニチハ', 4)
    ['コン', 'ニチ', 'ハ']

    >>> # Text with ANSI color sequences - SGR codes are propagated by default
    >>> # Each line ends with reset, next line starts with restored style
    >>> wrap('\x1b[1;31mhello world\x1b[0m', 5)
    ['\x1b[1;31mhello\x1b[0m', '\x1b[1;31mworld\x1b[0m']

clip()
------

Use `clip()`_ to extract a substring by column positions, preserving terminal sequences.

.. code-block:: python

    >>> from wcwidth import clip
    >>> # A wide character split by a clip boundary is replaced by fillchar (default ' ')
    >>> clip('中文字', 0, 3)
    '中 '
    >>> clip('中文字', 1, 5, fillchar='.')
    '.文.'

    >>> # SGR codes are propagated by default - result begins with active style
    >>> # and ends with reset if styles are active
    >>> clip('\x1b[1;31mHello world\x1b[0m', 6, 11)
    '\x1b[1;31mworld\x1b[0m'

    >>> # Disable SGR propagation to preserve original sequences as-is
    >>> clip('\x1b[31m中文\x1b[0m', 0, 3, propagate_sgr=False)
    '\x1b[31m中 \x1b[0m'
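
The handling of wide characters at clip boundaries can be sketched in a few lines. This is an
illustration under simplifying assumptions and not this library's implementation:
``clip_columns()`` is a hypothetical helper that ignores escape sequences and combining marks, and
treats only East Asian Wide/Fullwidth characters as two cells.

.. code-block:: python

    import unicodedata


    def clip_columns(text, start, end, fillchar=' '):
        """Return the part of text shown in columns start (inclusive) to end."""
        out = []
        col = 0
        for ch in text:
            wide = unicodedata.east_asian_width(ch) in ('W', 'F')
            left, right = col, col + (2 if wide else 1)
            col = right
            if right <= start:
                continue  # entirely before the clip region
            if left >= end:
                break  # entirely after the clip region
            if left >= start and right <= end:
                out.append(ch)  # fully visible
            else:
                # A wide character cut in half: fill its visible columns.
                out.append(fillchar * (min(right, end) - max(left, start)))
        return ''.join(out)


    print(repr(clip_columns('中文字', 1, 5, fillchar='.')))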

strip_sequences()
-----------------

Use `strip_sequences()`_ to remove all terminal escape sequences from text.

.. code-block:: python

    >>> from wcwidth import strip_sequences
    >>> strip_sequences('\x1b[31mred\x1b[0m')
    'red'

.. _ambiguous_width:

ambiguous_width
---------------

Some Unicode characters have "East Asian Ambiguous" (A) width. These characters display as 1 cell by
default, matching Western terminal contexts, but many CJK (Chinese, Japanese, Korean) environments
may prefer 2 cells.  This is often found as a boolean option, "Ambiguous width as wide", in
terminal emulator preferences.

By default, wcwidth treats ambiguous characters as narrow (width 1). For CJK environments where your
terminal is configured to display ambiguous characters as double-width, pass ``ambiguous_width=2``:

.. code-block:: python

    >>> # CIRCLED DIGIT ONE - ambiguous width
    >>> wcwidth.width('\u2460')
    1
    >>> wcwidth.width('\u2460', ambiguous_width=2)
    2

The ``ambiguous_width`` parameter is available on all width-measuring functions: `wcwidth()`_,
`wcswidth()`_, `width()`_, `ljust()`_, `rjust()`_, `center()`_, `wrap()`_, and `clip()`_.

**Terminal Detection**

The most reliable method to detect whether a terminal profile is set for "Ambiguous width as wide"
mode is to display an ambiguous character surrounded by a pair of Cursor Position Report (CPR)
queries, with the terminal in cooked or raw mode, then parse the responses for their ``(y, x)``
locations and measure the difference in ``x``.

Such code should also be careful to check whether it is attached to a terminal, and to handle
possible timeout, slow network, or non-response when working with "dumb terminals" like a CI build.
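
The procedure described above could be sketched as follows for Unix-like systems. This is a
minimal illustration under stated assumptions and not a tested detection routine: the names
``parse_cpr()``, ``query_cursor()``, and ``measure_ambiguous()`` are hypothetical, the terminal is
switched with ``tty.setcbreak()`` as a choice of this sketch, and it does not handle a cursor
sitting in the last column or unrelated keyboard input arriving during the query.

.. code-block:: python

    import os
    import re
    import select
    import sys
    import termios
    import tty

    CPR_RESPONSE = re.compile(rb'\x1b\[(\d+);(\d+)R')


    def parse_cpr(data):
        """Return (y, x) from a CPR reply of the form ESC [ y ; x R, else None."""
        match = CPR_RESPONSE.search(data)
        if match is None:
            return None
        return int(match.group(1)), int(match.group(2))


    def query_cursor(fd_in, fd_out, timeout):
        """Send a CPR query (CSI 6 n), waiting up to timeout seconds per byte."""
        os.write(fd_out, b'\x1b[6n')
        reply = b''
        while parse_cpr(reply) is None:
            ready, _, _ = select.select([fd_in], [], [], timeout)
            if not ready:
                return None  # no answer: dumb terminal, CI build, or slow link
            chunk = os.read(fd_in, 1)
            if not chunk:
                return None  # end of input
            reply += chunk
        return parse_cpr(reply)


    def measure_ambiguous(timeout=1.0):
        """Return the cells used by an ambiguous character, or None if unknown."""
        if not (sys.stdin.isatty() and sys.stdout.isatty()):
            return None
        fd_in, fd_out = sys.stdin.fileno(), sys.stdout.fileno()
        saved = termios.tcgetattr(fd_in)
        try:
            tty.setcbreak(fd_in)  # unbuffered input, reply is not echoed
            before = query_cursor(fd_in, fd_out, timeout)
            os.write(fd_out, '\u2460'.encode('utf-8'))  # CIRCLED DIGIT ONE
            after = query_cursor(fd_in, fd_out, timeout)
        finally:
            termios.tcsetattr(fd_in, termios.TCSADRAIN, saved)
        if before is None or after is None:
            return None
        return after[1] - before[1]

A result of ``2`` would suggest a value for the ``ambiguous_width`` parameter, while ``None`` means
the terminal could not be queried and the default should be kept.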

The `jquast/blessed`_ library provides a helper method for this, `Terminal.detect_ambiguous_width()`_:

.. code-block:: p