Azure AI Inference client library for Python

Use the Inference client library (in preview) to:

Authenticate against the service
Get information about the AI model
Do chat completions
Get text embeddings
Get image embeddings

The Inference client library supports AI models deployed to the following services:

GitHub Models - Free-tier endpoint for AI models from different providers
Serverless API endpoints and Managed Compute endpoints - AI models from different providers deployed from Azure AI Foundry. See Overview: Deploy models, flows, and web apps with Azure AI Foundry.
Azure OpenAI Service - OpenAI models deployed from Azure AI Foundry. See What is Azure OpenAI Service?. Although we recommend you use the official OpenAI client library in your production code for this service, you can use the Azure AI Inference client library to easily compare the performance of OpenAI models to other models, using the same client library and Python code.

The Inference client library makes services calls using REST API version 2024-05-01-preview, as documented in Azure AI Model Inference API.

Product documentation | Samples | API reference documentation | Package (Pypi) | SDK source code

Reporting issues

To report an issue with the client library, or request additional features, please open a GitHub issue here. Mention the package name "azure-ai-inference" in the title or content.

Getting started

Prerequisites

Python 3.8 or later installed, including pip.
For GitHub models
- The AI model name, such as "gpt-4o" or "mistral-large"
- A GitHub personal access token. Create one here. You do not need to give any permissions to the token. The token is a string that starts with github_pat_.
For Serverless API endpoints or Managed Compute endpoints
- An Azure subscription.
- An AI Model from the catalog deployed through Azure AI Foundry.
- The endpoint URL of your model, in of the form https://<your-host-name>.<your-azure-region>.models.ai.azure.com, where your-host-name is your unique model deployment host name and your-azure-region is the Azure region where the model is deployed (e.g. eastus2).
- Depending on your authentication preference, you either need an API key to authenticate against the service, or Entra ID credentials.
For Azure OpenAI (AOAI) service
- An Azure subscription.
- An OpenAI Model from the catalog deployed through Azure AI Foundry.
- The endpoint URL of your model, in the form https://<your-resouce-name>.openai.azure.com/openai/deployments/<your-deployment-name>, where your-resource-name is your globally unique AOAI resource name, and your-deployment-name is your AI Model deployment name.
- Depending on your authentication preference, you either need an API key to authenticate against the service, or Entra ID credentials.
- An api-version. Latest preview or GA version listed in the Data plane - inference row in the API Specs table. At the time of writing, latest GA version was "2024-06-01".

Install the package

To install the Azure AI Inferencing package use the following command:

pip install azure-ai-inference

To update an existing installation of the package, use:

pip install --upgrade azure-ai-inference

If you want to install Azure AI Inferencing package with support for OpenTelemetry based tracing, use the following command:

pip install azure-ai-inference[opentelemetry]

Key concepts

Create and authenticate a client directly, using API key or GitHub token

The package includes two clients ChatCompletionsClient and EmbeddingsClient. Both can be created in the similar manner. For example, assuming endpoint, key and github_token are strings holding your endpoint URL, API key or GitHub token, this Python code will create and authenticate a synchronous ChatCompletionsClient:

from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

# For GitHub models
client = ChatCompletionsClient(
    endpoint="https://models.inference.ai.azure.com",
    credential=AzureKeyCredential(github_token),
    model="mistral-large" # Update as needed. Alternatively, you can include this is the `complete` call.
)

# For Serverless API or Managed Compute endpoints
client = ChatCompletionsClient(
    endpoint=endpoint,  # Of the form https://<your-host-name>.<your-azure-region>.models.ai.azure.com
    credential=AzureKeyCredential(key)
)

# For Azure OpenAI endpoint
client = ChatCompletionsClient(
    endpoint=endpoint,  # Of the form https://<your-resouce-name>.openai.azure.com/openai/deployments/<your-deployment-name>
    credential=AzureKeyCredential(key),
    api_version="2024-06-01",  # Azure OpenAI api-version. See https://aka.ms/azsdk/azure-ai-inference/azure-openai-api-versions
)

A synchronous client supports synchronous inference methods, meaning they will block until the service responds with inference results. For simplicity the code snippets below all use synchronous methods. The client offers equivalent asynchronous methods which are more commonly used in production.

To create an asynchronous client, Install the additional package aiohttp:

pip install aiohttp

and update the code above to import asyncio, and import ChatCompletionsClient from the azure.ai.inference.aio namespace instead of azure.ai.inference. For example:

import asyncio
from azure.ai.inference.aio import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

# For Serverless API or Managed Compute endpoints
client = ChatCompletionsClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(key)
)

Create and authenticate a client directly, using Entra ID

_Note: At the time of writing, only Managed Compute endpoints and Azure OpenAI endpoints support Entra ID authentication.

To use an Entra ID token credential, first install the azure-identity package:

pip install azure.identity

You will need to provide the desired credential type obtained from that package. A common selection is DefaultAzureCredential and it can be used as follows:

from azure.ai.inference import ChatCompletionsClient
from azure.identity import DefaultAzureCredential

# For Managed Compute endpoints
client = ChatCompletionsClient(
    endpoint=endpoint,
    credential=DefaultAzureCredential(exclude_interactive_browser_credential=False)
)

# For Azure OpenAI endpoint
client = ChatCompletionsClient(
    endpoint=endpoint,
    credential=DefaultAzureCredential(exclude_interactive_browser_credential=False),
    credential_scopes=["https://cognitiveservices.azure.com/.default"],
    api_version="2024-06-01",  # Azure OpenAI api-version. See https://aka.ms/azsdk/azure-ai-inference/azure-openai-api-versions
)

During application development, you would typically set up the environment for authentication using Entra ID by first Installing the Azure CLI, running az login in your console window, then entering your credentials in the browser window that was opened. The call to DefaultAzureCredential() will then succeed. Setting exclude_interactive_browser_credential=False in that call will enable launching a browser window if the user isn't already logged in.

Defining default settings while creating the clients

You can define default chat completions or embeddings configurations while constructing the relevant client. These configurations will be applied to all future service calls.

For example, here we create a ChatCompletionsClient using API key authentication, and apply two settings, temperature and max_tokens:

from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

# For Serverless API or Managed Compute endpoints
client = ChatCompletionsClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(key),
    temperature=0.5,
    max_tokens=1000
)

Default settings can be overridden in individual service calls.

Create and authenticate clients using `load_client`

If you are using Serverless API or Managed Compute endpoints, there is an alternative to creating a specific client directly. You can instead use the function load_client to return the relevant client (of types ChatCompletionsClient or EmbeddingsClient) based on the provided endpoint:

from azure.ai.inference import load_client
from azure.core.credentials import AzureKeyCredential

# For Serverless API or Managed Compute endpoints only.
# This will not work on GitHub Models end

azure-ai-inference

Description