==================================================================================
inscriptis -- HTML to text conversion library, command line client and Web service
==================================================================================

.. image:: https://img.shields.io/pypi/pyversions/inscriptis   
   :target: https://badge.fury.io/py/inscriptis
   :alt: Supported python versions

.. image:: https://codecov.io/gh/weblyzard/inscriptis/branch/master/graph/badge.svg
   :target: https://codecov.io/gh/weblyzard/inscriptis/
   :alt: Coverage

.. image:: https://github.com/weblyzard/inscriptis/actions/workflows/python-package.yml/badge.svg
   :target: https://github.com/weblyzard/inscriptis/actions/workflows/python-package.yml
   :alt: Build status

.. image:: https://readthedocs.org/projects/inscriptis/badge/?version=latest
   :target: https://inscriptis.readthedocs.io/en/latest/?badge=latest
   :alt: Documentation status

.. image:: https://badge.fury.io/py/inscriptis.svg
   :target: https://badge.fury.io/py/inscriptis
   :alt: PyPI version

.. image:: https://pepy.tech/badge/inscriptis
   :target: https://pepy.tech/project/inscriptis
   :alt: PyPI downloads

.. image:: https://joss.theoj.org/papers/10.21105/joss.03557/status.svg
   :target: https://doi.org/10.21105/joss.03557


A Python-based HTML to text conversion library, command line client and Web
service with support for **nested tables**, a **subset of CSS** and, optionally,
**annotated output**.

Inscriptis is particularly well suited for applications that require high-performance, high-quality (i.e., layout-aware) text representations of HTML content, and will aid knowledge extraction and data science tasks conducted upon Web data.

Please take a look at the
`Rendering <https://github.com/weblyzard/inscriptis/blob/master/RENDERING.md>`_
document for a demonstration of inscriptis' conversion quality.

A Java port of inscriptis 1.x has been published by
`x28 <https://github.com/x28/inscriptis-java>`_.

This document provides a short introduction to Inscriptis. 

- The full documentation is built automatically and published on `Read the Docs <https://inscriptis.readthedocs.org/en/latest/>`_. 
- For a more general overview of *text extraction from HTML*, see this `blog post on different HTML to text conversion approaches, and criteria for selecting them <https://www.semanticlab.net/linux/big%20data/knowledge%20extraction/Extracting-text-from-HTML-with-Python/>`_.

.. contents:: Table of contents

Statement of need - why inscriptis?
===================================

1. Inscriptis provides a **layout-aware** conversion of HTML that more closely resembles the rendering obtained from standard Web browsers and, therefore, better preserves the spatial arrangement of text elements. 

   Conversion quality becomes a factor once you need to move beyond simple HTML snippets. Non-specialized approaches and less sophisticated libraries do not correctly interpret HTML semantics and, therefore, fail to properly convert constructs such as itemizations, enumerations, and tables.

   Beautiful Soup's ``get_text()`` function, for example, flattens the following HTML list (when written without whitespace between the tags) to the string ``firstsecond``.

   .. code-block:: HTML
   
      <ul>
        <li>first</li>
        <li>second</li>
      </ul>


   Inscriptis, in contrast, not only returns the correct output
   
   .. code-block::
   
      * first
      * second

   but also supports much more complex constructs such as nested tables, and interprets a subset of HTML attributes (e.g., ``align``, ``valign``) and CSS properties (e.g., ``display``, ``white-space``, ``margin-top``, ``vertical-align``, etc.) that determine the text alignment. Whenever the spatial alignment of text is relevant (e.g., for many knowledge extraction tasks, the computation of word embeddings and language models, and sentiment analysis), an accurate HTML to text conversion is essential.

2. Inscriptis supports `annotation rules <#annotation-rules>`_, i.e., user-provided mappings that allow for annotating the extracted text based on structural and semantic information encoded in HTML tags and attributes used for controlling structure and layout in the original HTML document. These rules might be used to

   - provide downstream knowledge extraction components with additional information that may be leveraged to improve their respective performance.
   - assist manual document annotation processes (e.g., for qualitative analysis or gold standard creation). ``Inscriptis`` supports multiple export formats such as XML, annotated HTML and the JSONL format that is used by the open source annotation tool `doccano <https://github.com/doccano/doccano>`_.
   - enable the use of ``Inscriptis`` for tasks such as content extraction (i.e., extracting task-specific relevant content from a Web page), which rely on information about the HTML document's structure.


Installation
============

At the command line::

    $ pip install inscriptis

Or, on older setups without pip (note that ``easy_install`` is deprecated in current setuptools releases)::

    $ easy_install inscriptis


Python library
==============

Embedding inscriptis into your code is easy, as outlined below:

.. code-block:: python
   
   import urllib.request
   from inscriptis import get_text
   
   url = "https://www.fhgr.ch"
   html = urllib.request.urlopen(url).read().decode('utf-8')
   
   text = get_text(html)
   print(text)
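
For a quick check without network access, the same call works on an HTML
string held in memory. The sketch below feeds ``get_text`` a list snippet like
the one from the statement of need; the ``snippet`` and ``items`` names are
illustrative, and because the exact indentation of the output may differ
between inscriptis versions, the example only looks at the stripped lines.

.. code-block:: python

   from inscriptis import get_text

   # Illustrative input: an unordered list with two items.
   snippet = "<ul><li>first</li><li>second</li></ul>"

   text = get_text(snippet)

   # Drop surrounding whitespace so the result does not depend on indentation.
   items = [line.strip() for line in text.splitlines() if line.strip()]
   print(items)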


Standalone command line client
==============================
The command line client converts HTML files or Web pages to the corresponding
text representation.


Command line parameters
-----------------------

The ``inscript`` command line client supports the following parameters::

    usage: inscript [-h] [-o OUTPUT] [-e ENCODING] [-i] [-d] [-l] [-a] [-r ANNOTATION_RULES] [-p POSTPROCESSOR] [--indentation INDENTATION]
                       [--table-cell-separator TABLE_CELL_SEPARATOR] [-v]
                       [input]

    Convert the given HTML document to text.

    positional arguments:
      input                 Html input either from a file or a URL (default:stdin).

    optional arguments:
      -h, --help            show this help message and exit
      -o OUTPUT, --output OUTPUT
                            Output file (default:stdout).
      -e ENCODING, --encoding ENCODING
                            Input encoding to use (default:utf-8 for files; detected server encoding for Web URLs).
      -i, --display-image-captions
                            Display image captions (default:false).
      -d, --deduplicate-image-captions
                            Deduplicate image captions (default:false).
      -l, --display-link-targets
                            Display link targets (default:false).
      -a, --display-anchor-urls
                            Display anchor URLs (default:false).
      -r ANNOTATION_RULES, --annotation-rules ANNOTATION_RULES
                            Path to an optional JSON file containing rules for annotating the retrieved text.
      -p POSTPROCESSOR, --postprocessor POSTPROCESSOR
                            Optional component for postprocessing the result (html, surface, xml).
      --indentation INDENTATION
                            How to handle indentation (extended or strict; default: extended).
      --table-cell-separator TABLE_CELL_SEPARATOR
                            Separator to use between table cells (default: three spaces).
      -v, --version         display version information

   

HTML to text conversion
-----------------------
Convert the given page to text and output the result to the screen::

  $ inscript https://www.fhgr.ch
   
Convert the file to text and save the output to ``fhgr.txt``::

  $ inscript fhgr.html -o fhgr.txt

Convert the file using strict indentation (i.e., minimize indentation and extra spaces) and save the output to ``fhgr-layout-optimized.txt``::

  $ inscript --indentation strict fhgr.html -o fhgr-layout-optimized.txt
   
Convert HTML provided via stdin and save the output to ``output.txt``::

  $ echo "<body><p>Make it so!</p></body>" | inscript -o output.txt 
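
The command line options have counterparts in the Python API. The following
sketch mirrors ``--indentation strict`` and ``-l`` by building a
``ParserConfig``. It assumes the ``inscriptis.css_profiles.CSS_PROFILES``
mapping and the ``css`` and ``display_links`` keyword arguments of
``ParserConfig`` as found in recent inscriptis releases, so please check them
against the version you have installed. The HTML string is made up for
illustration.

.. code-block:: python

   from inscriptis import get_text
   from inscriptis.css_profiles import CSS_PROFILES
   from inscriptis.model.config import ParserConfig

   # Illustrative input; any HTML string works here.
   html = '<body><p>Make it so! <a href="https://www.fhgr.ch">FHGR</a></p></body>'

   # Assumed mapping of command line options to configuration values:
   #   --indentation strict        -> the "strict" CSS profile
   #   -l / --display-link-targets -> display_links=True
   config = ParserConfig(css=CSS_PROFILES["strict"].copy(), display_links=True)

   text = get_text(html, config)
   print(text)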


HTML to annotated text conversion
---------------------------------
Convert and annotate HTML from a Web page using the provided annotation rules.

Download the example `annotation-profile.json <https://github.com/weblyzard/inscriptis/blob/master/examples/annotation/annotation-profile.json>`_ and save it to your working directory::

  $ inscript https://www.fhgr.ch -r annotation-profile.json

The annotation rules are specified in ``annotation-profile.json``:

.. code-block:: json

   {
    "h1": ["heading", "h1"],
    "h2": ["heading", "h2"],
    "b": ["emphasis"],
    "div#class=toc": ["table-of-contents"],
    "#class=FactBox": ["fact-box"],
    "#cite": ["citation"]
   }

The dictionary maps an HTML tag and/or attribute to the annotations
inscriptis should provide for them. In the example above, for instance, the tag
``h1`` yields the annotations ``heading`` and ``h1``, a ``div`` tag with a
``class`` that contains the value ``toc`` results in the annotation
``table-of-contents``, and all tags with a ``cite`` attribute are annotated with
``citation``.
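
To make the key format concrete, the sketch below splits a rule key into its
tag, attribute and value parts. It only illustrates the syntax described above
and is not inscriptis' own parser; ``RuleKey`` and ``parse_rule_key`` are
hypothetical names introduced for this example.

.. code-block:: python

   from typing import NamedTuple, Optional


   class RuleKey(NamedTuple):
       tag: Optional[str]        # None means "any tag"
       attribute: Optional[str]  # None means "no attribute constraint"
       value: Optional[str]      # None means "any attribute value"


   def parse_rule_key(key: str) -> RuleKey:
       """Split a key such as 'div#class=toc' into (tag, attribute, value)."""
       tag, _, attr_part = key.partition("#")
       attribute, _, value = attr_part.partition("=")
       return RuleKey(tag or None, attribute or None, value or None)


   for key in ("h1", "div#class=toc", "#class=FactBox", "#cite"):
       print(key, "->", parse_rule_key(key))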

Given these annotation rules, the HTML file

.. code-block:: HTML

   <h1>Chur</h1>
   <b>Chur</b> is the capital and largest town of the Swiss canton of the
   Grisons and lies in the Grisonian Rhine Valley.

yields the following JSONL output

.. code-block:: json

   {"text": "Chur\n\nChur is the capital and largest town of the Swiss canton
             of the Grisons and lies in the Grisonian Rhine Valley.",
    "label": [[0, 4, "heading"], [0, 4, "h1"], [6, 10, "emphasis"]]}

The provided list of labels contains all annotated text elements with their
start index, end index and the assigned label.
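
Because the indices are character offsets into ``text``, each labelled span
can be recovered by slicing. The sketch below does this with the standard
library for the record shown above, written as a single JSONL line;
``line``, ``record`` and ``spans`` are illustrative names.

.. code-block:: python

   import json

   # One JSONL line, as in the example above (split here only for readability).
   line = (
       r'{"text": "Chur\n\nChur is the capital and largest town of the Swiss '
       r'canton of the Grisons and lies in the Grisonian Rhine Valley.", '
       r'"label": [[0, 4, "heading"], [0, 4, "h1"], [6, 10, "emphasis"]]}'
   )

   record = json.loads(line)

   # Each label is [start, end, name]; the end index is exclusive,
   # which matches Python's slicing convention.
   spans = [
       (name, record["text"][start:end])
       for start, end, name in record["label"]
   ]
   print(spans)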


Annotation postprocessors
-------------------------
Annotation postprocessors enable the post processing of annotations to formats
that are suitab