======================================================
smart_open — utils for streaming large files in Python
======================================================
|License|_ |CI|_ |Coveralls|_ |Version|_ |Python|_ |Downloads|_
.. |License| image:: https://img.shields.io/pypi/l/smart_open.svg
.. |CI| image:: https://github.com/piskvorky/smart_open/actions/workflows/python-package.yml/badge.svg?branch=develop&event=push
.. |Coveralls| image:: https://coveralls.io/repos/github/RaRe-Technologies/smart_open/badge.svg?branch=develop
.. |Version| image:: https://img.shields.io/pypi/v/smart-open.svg?logo=pypi&logoColor=white
.. |Python| image:: https://img.shields.io/pypi/pyversions/smart-open.svg?logo=python&logoColor=white
.. |Downloads| image:: https://pepy.tech/badge/smart-open/month
.. _License: https://github.com/piskvorky/smart_open/blob/master/LICENSE
.. _CI: https://github.com/piskvorky/smart_open/actions/workflows/python-package.yml
.. _Coveralls: https://coveralls.io/github/RaRe-Technologies/smart_open?branch=HEAD
.. _Version: https://pypi.org/project/smart-open/
.. _Python: https://pypi.org/project/smart-open/
.. _Downloads: https://pypistats.org/packages/smart-open
What?
=====
``smart_open`` is a Python 3 library for **efficient streaming of very large files** from/to storages such as S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem. It supports transparent, on-the-fly (de-)compression for a variety of different formats.
``smart_open`` is a drop-in replacement for Python's built-in ``open()``: it can do anything ``open`` can (100% compatible, falls back to native ``open`` wherever possible), plus lots of nifty extra stuff on top.
Why?
====
Working with large remote files, for example using Amazon's `boto3 <https://boto3.amazonaws.com/v1/documentation/api/latest/index.html>`_ Python library, is a pain.
``boto3``'s ``Object.upload_fileobj()`` and ``Object.download_fileobj()`` methods require gotcha-prone boilerplate to use successfully, such as constructing file-like object wrappers.
``smart_open`` shields you from that. It builds on boto3 and other remote storage libraries, but offers a **clean unified Pythonic API**. The result is less code for you to write and fewer bugs to make.
How?
=====
``smart_open`` is well-tested, well-documented, and has a simple Pythonic API:
.. _doctools_before_examples:
.. code-block:: python
>>> from smart_open import open
>>>
>>> # stream lines from an S3 object
>>> for line in open('s3://commoncrawl/robots.txt'):
... print(repr(line))
... break
'User-Agent: *\n'
>>> # stream from/to compressed files, with transparent (de)compression:
>>> for line in open('tests/test_data/1984.txt.gz', encoding='utf-8'):
... print(repr(line))
'It was a bright cold day in April, and the clocks were striking thirteen.\n'
'Winston Smith, his chin nuzzled into his breast in an effort to escape the vile\n'
'wind, slipped quickly through the glass doors of Victory Mansions, though not\n'
'quickly enough to prevent a swirl of gritty dust from entering along with him.\n'
>>> # can use context managers too:
>>> with open('tests/test_data/1984.txt.gz') as fin:
... with open('tests/test_data/1984.txt.bz2', 'w') as fout:
... for line in fin:
... fout.write(line)
74
80
78
79
>>> # can use any IOBase operations, like seek
>>> with open('s3://commoncrawl/robots.txt', 'rb') as fin:
... for line in fin:
... print(repr(line.decode('utf-8')))
... break
... offset = fin.seek(0) # seek to the beginning
... print(fin.read(4))
'User-Agent: *\n'
b'User'
>>> # stream from HTTP
>>> for line in open('http://example.com'):
... print(repr(line[:15]))
... break
'<!doctype html>'
.. _doctools_after_examples:
For more examples of URIs that ``smart_open`` accepts, see `help.txt <https://github.com/piskvorky/smart_open/blob/master/help.txt>`__ or ``help('smart_open')``.
Some examples::
s3://bucket/key
s3://access_key_id:secret_access_key@bucket/key
gs://bucket/blob
azure://bucket/blob
hdfs://path/file
./local/path/file.gz
file:///home/user/file.bz2
[ssh|scp|sftp]://username:password@host/path/file
Documentation
=============
The API reference can be viewed at `help.txt <https://github.com/piskvorky/smart_open/blob/master/help.txt>`__ or using ``help('smart_open')``.
Installation
------------
``smart_open`` supports a wide range of storage solutions. For all options, see the API reference.
Each individual solution has its own dependencies.
By default, ``smart_open`` does not install any dependencies in order to keep the installation size small.
You can install one or more of these dependencies explicitly using optional dependencies defined in
`pyproject.toml <https://github.com/piskvorky/smart_open/blob/master/pyproject.toml>`__:
.. code-block:: sh
pip install 'smart_open[s3,gcs,azure,http,webhdfs,ssh,zst]'
Or, if you don't mind installing a large number of third party libraries, you can install all dependencies using:
.. code-block:: sh
pip install 'smart_open[all]'
Built-in help
-------------
To view the API reference, use Python's built-in ``help`` function:
.. code-block:: python
help('smart_open')
or view `help.txt <https://github.com/piskvorky/smart_open/blob/master/help.txt>`__ in your browser.
More examples
-------------
For the sake of simplicity, the examples below assume you have all the dependencies installed, i.e. you have done:
.. code-block:: sh
pip install 'smart_open[all]'
.. code-block:: python
import os, boto3, botocore, azure.storage.blob
from smart_open import open
# stream content *into* S3 (write mode) using a custom client
# this client is thread-safe ref https://github.com/boto/boto3/blob/1.38.41/docs/source/guide/clients.rst?plain=1#L111
config = botocore.client.Config(
max_pool_connections=64,
tcp_keepalive=True,
retries={"max_attempts": 6, "mode": "adaptive"},
)
client = boto3.Session(
aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
).client("s3", config=config)
with open('s3://smart-open-py37-benchmark-results/test.txt', 'wb', transport_params={'client': client}) as fout:
bytes_written = fout.write(b'hello world!')
print(bytes_written)
# perform a single-part upload to S3 (saves billable API requests, and allows seek() before upload)
with open('s3://smart-open-py37-benchmark-results/test.txt', 'wb', transport_params={'multipart_upload': False}) as fout:
bytes_written = fout.write(b'hello world!')
print(bytes_written)
# now with tempfile.TemporaryFile instead of the default io.BytesIO (to reduce memory footprint)
import tempfile
with tempfile.TemporaryFile() as tmp, open('s3://smart-open-py37-benchmark-results/test.txt', 'wb', transport_params={'multipart_upload': False, 'writebuffer': tmp}) as fout:
bytes_written = fout.write(b'hello world!')
print(bytes_written)
# stream from HDFS
for line in open('hdfs://user/hadoop/my_file.txt', encoding='utf8'):
print(line)
# stream from WebHDFS
for line in open('webhdfs://host:port/user/hadoop/my_file.txt'):
print(line)
# stream content *into* HDFS (write mode):
with open('hdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
fout.write(b'hello world')
# stream content *into* WebHDFS (write mode):
with open('webhdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout:
fout.write(b'hello world')
# stream from a completely custom s3 server, like s3proxy:
for line in open('s3u://user:secret@host:port@mybucket/mykey.txt'):
print(line)
# Stream to Digital Ocean Spaces bucket providing credentials from boto3 profile
session = boto3.Session(profile_name='digitalocean')
client = session.client('s3', endpoint_url='https://ams3.digitaloceanspaces.com')
transport_params = {'client': client}
with open('s3://bucket/key.txt', 'wb', transport_params=transport_params) as fout:
fout.write(b'here we stand')
# stream from GCS
for line in open('gs://my_bucket/my_file.txt'):
print(line)
# stream content *into* GCS (write mode):
with open('gs://my_bucket/my_file.txt', 'wb') as fout:
fout.write(b'hello world')
# stream from Azure Blob Storage
connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
transport_params = {
'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
}
for line in open('azure://mycontainer/myfile.txt', transport_params=transport_params):
print(line)
# stream content *into* Azure Blob Storage (write mode):
connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING']
transport_params = {
'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
}
with open('azure://mycontainer/my_file.txt', 'wb', transport_params=transport_params) as fout:
fout.write(b'hello world')
Compression Handling
--------------------
The top-level ``compression`` parameter controls compression/decompression behavior when reading and writing.
The supported values for this parameter are:
- ``infer_from_extension`` (default behavior)
- ``disable``
- ``.bz2``
- ``.gz``
- ``.xz``
- ``.zst``
By default, ``smart_open`` automatically (de)compresses the file if the filename ends with one of these extensions.
See also ``smart_open.compression.get_supported_compression_types`` and ``smart_open.compression.register_compressor`` in `compression.py <https://github.com/piskvorky/smart_open/blob/master/smart_open/compression.py>`__.
.. code-block:: python
>>> from smart_open import open
>>> with open('tests/test_data/1984.txt.gz') as fin:
... print(fin.read(32))
It was a bright cold day in Apri
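The ``compression`` parameter can also be set explicitly, and new extensions can be hooked in via ``register_compressor``. Here is a minimal sketch; the file paths and the ``.gz2`` extension are made up for illustration:

```python
import gzip
import os
import tempfile

from smart_open import open, register_compressor

tmpdir = tempfile.mkdtemp()

# compression='disable' skips decompression and yields the raw bytes
gz_path = os.path.join(tmpdir, 'example.txt.gz')
with open(gz_path, 'wt') as fout:  # '.gz' extension => written gzip-compressed
    fout.write('hello world')
with open(gz_path, 'rb', compression='disable') as fin:
    print(fin.read(2))  # b'\x1f\x8b' -- the gzip magic bytes

# register a compressor callback for a custom '.gz2' extension;
# the callback receives the underlying binary stream and the binary mode
def _handle_gz2(file_obj, mode):
    return gzip.GzipFile(fileobj=file_obj, mode=mode)

register_compressor('.gz2', _handle_gz2)

gz2_path = os.path.join(tmpdir, 'example.txt.gz2')
with open(gz2_path, 'wt') as fout:
    fout.write('hello world')
with open(gz2_path, 'rt') as fin:
    print(fin.read())  # hello world
```

Registered compressors apply to every transport, so the same ``.gz2`` handling would also kick in for, say, an S3 key ending in ``.gz2``.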