justext
Heuristic based boilerplate removal tool
.. _jusText: http://code.google.com/p/justext/
.. _Python: http://www.python.org/
.. _lxml: http://lxml.de/
jusText
=======
.. image:: https://github.com/miso-belica/jusText/actions/workflows/run-tests.yml/badge.svg
  :target: https://github.com/miso-belica/jusText/actions/workflows/run-tests.yml
jusText is a tool for removing boilerplate content, such as navigation
links, headers, and footers, from HTML pages. It is
`designed <doc/algorithm.rst>`_ to preserve
mainly text containing full sentences, which makes it well suited for
creating linguistic resources such as Web corpora. You can
`try it online <http://nlp.fi.muni.cz/projects/justext/>`_.
This is a fork of the original (currently unmaintained) jusText_ code hosted
on Google Code.
Adaptations of the algorithm to other languages:
- `C++ <https://github.com/endredy/jusText>`_
- `Go <https://github.com/JalfResi/justext>`_
- `Java <https://github.com/wizenoze/justext-java>`_
Some libraries using jusText:
- `chirp <https://github.com/9b/chirp>`_
- `lazynlp <https://github.com/chiphuyen/lazynlp>`_
- `off-topic-memento-toolkit <https://github.com/oduwsdl/off-topic-memento-toolkit>`_
- `pears <https://github.com/PeARSearch/PeARS-orchard>`_
- `readability calculator <https://github.com/joaopalotti/readability_calculator>`_
- `sky <https://github.com/kootenpv/sky>`_
Some currently (Jan 2020) maintained alternatives:
- `dragnet <https://github.com/dragnet-org/dragnet>`_
- `html2text <https://github.com/Alir3z4/html2text>`_
- `inscriptis <https://github.com/weblyzard/inscriptis>`_
- `newspaper <https://github.com/codelucas/newspaper>`_
- `python-readability <https://github.com/buriy/python-readability>`_
- `trafilatura <https://github.com/adbar/trafilatura>`_
Installation
------------
Make sure you have Python_ 2.7+/3.5+ and `pip <https://pip.pypa.io/en/stable/>`_
(`Windows <http://docs.python-guide.org/en/latest/starting/install/win/>`_,
`Linux <http://docs.python-guide.org/en/latest/starting/install/linux/>`_) installed.
Then simply run:

.. code-block:: bash

    $ [sudo] pip install justext
Dependencies
------------
::

    lxml (version depends on your Python version)
Usage
-----
.. code-block:: bash

    $ python -m justext -s Czech -o text.txt http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
    $ python -m justext -s English -o plain_text.txt english_page.html
    $ python -m justext --help # for more info
Python API
----------
.. code-block:: python

    import requests
    import justext

    # Fetch the page and pass the raw bytes to jusText.
    response = requests.get("http://planet.python.org/")
    paragraphs = justext.justext(response.content, justext.get_stoplist("English"))

    # Print only the paragraphs that were not classified as boilerplate.
    for paragraph in paragraphs:
        if not paragraph.is_boilerplate:
            print(paragraph.text)
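
The ``Paragraph`` objects returned by ``justext.justext()`` carry more than
``text`` and ``is_boilerplate``. The following is a minimal sketch, not taken
from the upstream documentation: it runs on a made-up inline HTML string (so no
network access is needed) and prints the ``class_type``, ``is_heading`` flag and
``xpath`` of each paragraph. The sample markup and the ``SAMPLE_HTML`` name are
assumptions for illustration only; how each paragraph is classified depends on
the default thresholds and on the chosen stoplist.

.. code-block:: python

    import justext

    # Hypothetical sample page: a link-only menu, a heading, one longer
    # paragraph written in full sentences, and a short footer.
    SAMPLE_HTML = """
    <html><body>
      <div id="menu"><a href="/">Home</a> | <a href="/about">About</a></div>
      <h1>Sample article</h1>
      <p>This is the main text of the page and it is written in full
      sentences, so that the tool can see that it is not a part of the
      navigation. It has to be long enough to be kept, which is why there
      are two or three sentences in it and not just one.</p>
      <div id="footer">Copyright 2020</div>
    </body></html>
    """

    # Names of the stoplists bundled with the package, sorted for display.
    available = sorted(justext.get_stoplists())
    print(len(available), "stoplists available")

    # A unicode string can be passed directly instead of raw bytes.
    paragraphs = justext.justext(SAMPLE_HTML, justext.get_stoplist("English"))

    for paragraph in paragraphs:
        # class_type is the label assigned by the classifier;
        # is_boilerplate is True for every label other than "good".
        print(paragraph.class_type, paragraph.is_heading, paragraph.xpath)
        print("   ", paragraph.text[:60])

Printing ``class_type`` next to ``xpath`` makes it easier to see which part of
the DOM a paragraph came from and why it was kept or dropped.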
Testing
-------
Run the tests with:

.. code-block:: bash

    $ py.test-2.7 && py.test-3.5 && py.test-3.6 && py.test-3.7 && py.test-3.8 && py.test-3.9
Acknowledgements
----------------
.. _`Natural Language Processing Centre`: http://nlp.fi.muni.cz/en/nlpc
.. _`Masaryk University in Brno`: http://nlp.fi.muni.cz/en
.. _PRESEMT: http://presemt.eu/
.. _`Lexical Computing Ltd.`: http://lexicalcomputing.com/
.. _`PhD research`: http://is.muni.cz/th/45523/fi_d/phdthesis.pdf
This software has been developed at the `Natural Language Processing Centre`_ of
`Masaryk University in Brno`_ with financial support from PRESEMT_ and
`Lexical Computing Ltd.`_ It also relates to the `PhD research`_ of Jan Pomikálek.
.. :changelog:
Changelog for jusText
=====================
3.0.2 (2025-02-25)
------------------
- *BUG FIX:* Handle urllib imports in Python 2 and 3 correctly `#51 <https://github.com/miso-belica/jusText/pull/51>`_.
3.0.1 (2024-05-09)
------------------
- *BUG FIX:* Fix issue with new version of lxml `#48 <https://github.com/miso-belica/jusText/pull/48>`_.
3.0.0 (2021-10-21)
------------------
- *INCOMPATIBLE CHANGE:* Dropped support for Python 3.4 and below.
- *BUG FIX:* Don't join words separated only by ``<br>`` tag.
- *BUG FIX:* List available stop-lists alphabetically.
2.2.0 (2016-03-06)
------------------
- *INCOMPATIBLE CHANGE:* Stop words are case insensitive.
- *INCOMPATIBLE CHANGE:* Dropped support for Python 3.2.
- *BUG FIX:* Preserve new lines from original text in paragraphs.
2.1.1 (2014-05-27)
------------------
- *BUG FIX:* Function ``decode_html`` now respects parameter ``errors`` when falling to ``default_encoding`` `#9 <https://github.com/miso-belica/jusText/issues/9>`_.
2.1.0 (2014-01-25)
------------------
- *FEATURE:* Added XPath selector to the paragraphs. XPath selector is also available in detailed output as ``xpath`` attribute of ``<p>`` tag `#5 <https://github.com/miso-belica/jusText/pull/5>`_.
2.0.0 (2013-08-26)
------------------
- *FEATURE:* Added pluggable DOM preprocessor.
- *FEATURE:* Added support for Python 3.2+.
- *INCOMPATIBLE CHANGE:* Paragraphs are instances of
``justext.paragraph.Paragraph``.
- *INCOMPATIBLE CHANGE:* Script 'justext' removed in favour of
command ``python -m justext``.
- *FEATURE:* It's possible to enter a URI as the input document in the CLI.
- *FEATURE:* It is possible to pass unicode string directly.
1.2.0 (2011-08-08)
------------------
- *FEATURE:* Character counts used instead of word counts where possible in
order to make the algorithm work well in the language independent
mode (without a stoplist) for languages where counting words is
not easy (Japanese, Chinese, Thai, etc).
- *BUG FIX:* More robust parsing of meta tags containing the information about
used charset.
- *BUG FIX:* Corrected decoding of HTML entities ``&#128;`` to ``&#159;``.
1.1.0 (2011-03-09)
------------------
- First public release.