Getting Started with Azure OpenAI Service

Author

Artan Ajredini

CEO & Cloud Architect

4 min read
28 April 2025

Introduction to Azure OpenAI Service

Azure OpenAI Service brings OpenAI's large language models (GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo, and the text embedding models) directly into your Azure environment. You get the same models as OpenAI's API, but with the enterprise controls that production workloads demand: regional data residency, private networking, Microsoft Entra ID (formerly Azure Active Directory) authentication, content filtering, and compliance certifications.

For organisations already running workloads on Azure, this is the natural starting point for AI features. With a regional deployment type, your prompts and completions are processed in the Azure region you select (global deployment types can route inference to other regions). You manage access through the same identity platform you already use, and the service integrates with Azure Monitor, Key Vault, Private Endpoints, and your existing CI/CD pipelines.

Azure OpenAI is not just OpenAI with a different URL. It is OpenAI's models wrapped in Azure's enterprise security, compliance, and networking model — the difference that matters when you are handling customer data in production.

Available models

  • GPT-4o: the most capable multimodal model. Accepts text and images as input. Best for complex reasoning, document analysis, and high-quality generation.
  • GPT-4 Turbo: large context window (128k tokens). Best for long documents, summarisation, and multi-turn conversations with extensive history.
  • GPT-3.5 Turbo: fast and cost-efficient. Best for simpler tasks, high-volume applications, and latency-sensitive use cases.
  • text-embedding-3-large / text-embedding-ada-002: convert text into vector embeddings for semantic search and RAG pipelines.
  • DALL-E 3: generates images from text prompts. Available in select regions.

Setting Up Azure OpenAI

Before you can make API calls, you need to provision an Azure OpenAI resource and deploy a model. The resource is the billing and access container; the deployment is the specific model instance your application will call.
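
To make the resource/deployment split concrete, the sketch below builds the REST URL that a chat completion request targets: the resource supplies the host name and the deployment supplies a path segment. The resource name `mycompany-openai`, the deployment name `gpt-4o`, and the API version are placeholder values, and the host suffix assumes the Azure public cloud.

```python
def chat_completions_url(resource_name: str, deployment_name: str, api_version: str) -> str:
    """Build the chat completions URL for one deployment of one resource."""
    # The resource (its custom subdomain) determines the host name.
    endpoint = f"https://{resource_name}.openai.azure.com"
    # The deployment name, not the underlying model name, appears in the path.
    return (
        f"{endpoint}/openai/deployments/{deployment_name}"
        f"/chat/completions?api-version={api_version}"
    )


url = chat_completions_url("mycompany-openai", "gpt-4o", "2024-10-21")
print(url)
```

The SDKs assemble this URL for you; seeing it spelled out mainly helps when reading logs or debugging 404 responses caused by a wrong deployment name.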

Step 1: Request access and provision the resource

Azure OpenAI has required an approved subscription for access. If the Azure portal asks you to request access, submit the form; approval typically takes 1–2 business days. Once you have access, create the resource via the portal, Azure CLI, or Bicep.

bicep
@description('Environment suffix such as dev or prod')
param environment string

resource openAIAccount 'Microsoft.CognitiveServices/accounts@2023-05-01' = {
  name: 'openai-${environment}'
  location: 'swedencentral'   // choose a region with your required model availability
  kind: 'OpenAI'
  sku: { name: 'S0' }
  properties: {
    publicNetworkAccess: 'Disabled'   // use Private Endpoint for production
    customSubDomainName: 'mycompany-openai'   // must be globally unique; required for Entra ID token auth
  }
}

Step 2: Deploy a model

Model deployments are separate from the resource. Each deployment has a name (which you reference in API calls), a model version, and a tokens-per-minute (TPM) capacity limit. Deploy through the Azure AI Foundry portal (formerly Azure AI Studio) or via Bicep.

bicep
resource gpt4oDeployment 'Microsoft.CognitiveServices/accounts/deployments@2023-05-01' = {
  parent: openAIAccount
  name: 'gpt-4o'
  properties: {
    model: {
      format: 'OpenAI'
      name: 'gpt-4o'
      version: '2024-08-06'
    }
  }
  sku: {
    name: 'Standard'
    capacity: 30   // 30K tokens per minute
  }
}
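
A rough way to sanity-check a capacity value is to convert it into requests per minute. The sketch below assumes capacity is expressed in units of 1,000 TPM (as in the deployment above) and that the request limit scales at about 6 requests per minute per 1,000 TPM; that ratio is an assumption that varies by model and deployment type, so confirm it against the current quota documentation. The average tokens per request is a hypothetical figure you would measure for your own workload.

```python
def requests_per_minute(capacity_units: int, avg_tokens_per_request: int,
                        rpm_per_1k_tpm: int = 6) -> int:
    """Estimate sustainable requests per minute for one deployment."""
    tpm = capacity_units * 1000                      # e.g. 30 units -> 30,000 TPM
    limit_by_tokens = tpm // avg_tokens_per_request  # token budget is the cap
    limit_by_requests = capacity_units * rpm_per_1k_tpm  # request budget is the cap
    # Whichever limit is lower is the one you hit first.
    return min(limit_by_tokens, limit_by_requests)


heavy = requests_per_minute(30, avg_tokens_per_request=1500)
light = requests_per_minute(30, avg_tokens_per_request=100)
print(heavy, light)
```

With large prompts the token budget runs out first; with many tiny requests the request budget does.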

Step 3: Store credentials in Key Vault

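Never hardcode an API key in your application code. Prefer managed identity, which removes the key entirely. If a component still needs a key or another secret, store it in Azure Key Vault and retrieve it at runtime using a managed identity: no secrets in environment variables, no secrets in source control. The endpoint URL is not a secret and can live in ordinary configuration.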

bicep
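// Assumed inputs: an existing vault using Azure RBAC authorization, and the app identity's principal ID
param keyVaultName string
param appIdentityPrincipalId string

resource keyVault 'Microsoft.KeyVault/vaults@2023-07-01' existing = {
  name: keyVaultName
}

// Grant the app's managed identity access to read Key Vault secrets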
resource kvSecretAccess 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  scope: keyVault
  name: guid(keyVault.id, appIdentityPrincipalId, 'Key Vault Secrets User')
  properties: {
    roleDefinitionId: subscriptionResourceId(
      'Microsoft.Authorization/roleDefinitions',
      '4633458b-17de-408a-b874-0445c86b69e6'  // Key Vault Secrets User
    )
    principalId: appIdentityPrincipalId
    principalType: 'ServicePrincipal'
  }
}
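
Once the role assignment is in place, the application can read a secret at runtime. The following is a minimal sketch, assuming the `azure-identity` and `azure-keyvault-secrets` packages are installed and the vault lives in the Azure public cloud. The vault name `kv-demo` and the secret name in the usage note are hypothetical.

```python
def key_vault_url(vault_name: str) -> str:
    """Return the vault URL for a vault in the Azure public cloud."""
    return f"https://{vault_name}.vault.azure.net"


def read_secret(vault_name: str, secret_name: str) -> str:
    """Fetch one secret value using the app's managed identity."""
    # Imported here so this sketch still loads where the Azure SDK is absent.
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    client = SecretClient(
        vault_url=key_vault_url(vault_name),
        credential=DefaultAzureCredential(),
    )
    return client.get_secret(secret_name).value


print(key_vault_url("kv-demo"))
```

A call such as `read_secret("kv-demo", "third-party-api-key")` would then return the secret value, provided the identity running the code holds the Key Vault Secrets User role on that vault.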

Making Your First API Call

The Azure OpenAI SDK is available for Python, .NET, JavaScript, and Java. The API is compatible with the OpenAI SDK: you point the client at your Azure endpoint, choose an API version, and pass your deployment name where the OpenAI API expects a model name.

Python

python
import os
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# Use managed identity (recommended for production)
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(),
    "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    azure_ad_token_provider=token_provider,
    api_version="2024-10-21"
)

response = client.chat.completions.create(
    model="gpt-4o",        # your deployment name
    messages=[
        { "role": "system", "content": "You are a helpful Azure cloud assistant." },
        { "role": "user",   "content": "Explain Azure Blob Storage in two sentences." }
    ],
    temperature=0.3,
    max_tokens=300
)

print(response.choices[0].message.content)

.NET / C#

csharp
using Azure.AI.OpenAI;
using Azure.Identity;
using OpenAI.Chat;   // SystemChatMessage and UserChatMessage live here

var endpoint = new Uri(Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT")!);

// Use managed identity — no API key required
var client = new AzureOpenAIClient(endpoint, new DefaultAzureCredential());
var chatClient = client.GetChatClient("gpt-4o");  // deployment name

var response = await chatClient.CompleteChatAsync(
    new SystemChatMessage("You are a helpful Azure cloud assistant."),
    new UserChatMessage("Explain Azure Blob Storage in two sentences.")
);

Console.WriteLine(response.Value.Content[0].Text);

Streaming responses

For user-facing applications, stream the response token by token rather than waiting for the full completion. This dramatically improves perceived responsiveness — users see text appearing immediately instead of waiting several seconds for a complete response.

python
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{ "role": "user", "content": "Write a short summary of Zero Trust security." }],
    stream=True
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
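
In practice you often want the full text as well, for example to log or cache it once streaming finishes. The sketch below is one way to do that: `collect_stream` is a hypothetical helper that forwards each fragment to a callback and returns the joined text. The fake chunks built with `SimpleNamespace` only mimic the shape of SDK chunk objects so the example runs without a network call.

```python
from types import SimpleNamespace


def collect_stream(chunks, on_text=None):
    """Forward each text fragment to on_text and return the full text."""
    parts = []
    for chunk in chunks:
        # Some chunks carry no choices (or no content); skip those.
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:
            parts.append(text)
            if on_text is not None:
                on_text(text)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for an SDK chunk; used only for this demonstration."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo = [SimpleNamespace(choices=[]), fake_chunk("Zero "), fake_chunk("Trust"), fake_chunk(None)]
full_text = collect_stream(demo, on_text=lambda t: print(t, end="", flush=True))
```

With a real stream you would pass the object returned by `client.chat.completions.create(..., stream=True)` in place of `demo`.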

Production Tips

Getting a prototype working is straightforward. Getting it reliable, cost-efficient, and safe in production requires a few more considerations.

Understand token limits and costs

Every API call consumes tokens — both the input (prompt + context) and the output (completion). Token usage directly drives cost and determines whether you hit rate limits. GPT-4o costs significantly more per token than GPT-3.5 Turbo — profile your use case before choosing a model.
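
As a rough illustration of how token counts turn into cost, the sketch below multiplies input and output tokens by per-1,000-token prices. The prices are placeholders, not real Azure rates; take actual figures from the pricing page for your region and model. The token counts would normally come from `response.usage.prompt_tokens` and `response.usage.completion_tokens`.

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Estimate the cost of one request from its token usage."""
    input_cost = (prompt_tokens / 1000) * input_price_per_1k
    output_cost = (completion_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost


# Placeholder prices for illustration only.
cost = estimate_cost(1200, 300, input_price_per_1k=0.005, output_price_per_1k=0.015)
print(f"{cost:.4f}")
```

Running the same calculation with each candidate model's prices over a sample of real requests gives a quick comparison before you commit to one.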

  • Use tiktoken (Python) or the Azure OpenAI tokenizer to estimate prompt size before sending requests.
  • Set max_tokens on every request — without it, the model may generate a very long (and expensive) response.
  • Cache responses for identical or near-identical prompts using Azure Cache for Redis.
  • Use GPT-3.5 Turbo for classification, extraction, and simple Q&A — reserve GPT-4o for tasks that genuinely need it.
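
The caching tip can be sketched as follows. This is a minimal illustration that uses an in-memory dictionary as a stand-in for Azure Cache for Redis and only handles exactly identical requests; `PromptCache` and `fake_model_call` are hypothetical names. Matching near-identical prompts would need extra normalisation or a semantic cache, which is not shown.

```python
import hashlib
import json


class PromptCache:
    """Exact-match response cache keyed by a hash of the request."""

    def __init__(self):
        self._store = {}  # stand-in for a Redis client

    @staticmethod
    def key_for(deployment, messages, temperature):
        # Serialise deterministically so equal requests hash equally.
        payload = json.dumps(
            {"deployment": deployment, "messages": messages, "temperature": temperature},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, deployment, messages, temperature, call):
        key = self.key_for(deployment, messages, temperature)
        if key not in self._store:
            self._store[key] = call()  # only pay for the first request
        return self._store[key]


cache = PromptCache()
calls = []


def fake_model_call():
    calls.append(1)
    return "Blob Storage is object storage."


msgs = [{"role": "user", "content": "Explain Azure Blob Storage."}]
first = cache.get_or_call("gpt-4o", msgs, 0.0, fake_model_call)
second = cache.get_or_call("gpt-4o", msgs, 0.0, fake_model_call)
```

Caching makes most sense at low temperature, where repeated answers to the same prompt are expected to be equivalent anyway.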

Write effective system prompts

The system prompt defines the model's persona, constraints, and output format. A well-written system prompt is the single most impactful way to improve consistency and reduce hallucinations.

python
system_prompt = """
You are a customer support assistant for NativeCloud, an Azure consulting company.

Rules:
- Answer only questions related to Azure and cloud infrastructure.
- If a question is outside your scope, say: "I can only help with Azure and cloud topics."
- Always be concise — maximum 3 sentences unless the user asks for detail.
- Never make up product names, prices, or features. If unsure, say so.
- Format lists using bullet points.
"""
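
A system prompt only takes effect if it is sent as the first message of every request. The sketch below shows one way to assemble the message list while keeping only the most recent turns so the prompt stays within budget. `build_messages` and the cut-off of six history messages are illustrative assumptions, not part of the SDK.

```python
def build_messages(system_prompt, history, user_input, max_history_messages=6):
    """Return messages with the system prompt first and trimmed history."""
    # Guard against 0: history[-0:] would return the whole list.
    recent = history[-max_history_messages:] if max_history_messages > 0 else []
    return [
        {"role": "system", "content": system_prompt.strip()},
        *recent,
        {"role": "user", "content": user_input},
    ]


history = [
    {"role": "user", "content": "What is AKS?"},
    {"role": "assistant", "content": "AKS is Azure's managed Kubernetes service."},
]
messages = build_messages("\nYou are a support assistant.\n", history, "How is it billed?")
```

The resulting list can be passed directly as the `messages` argument of `client.chat.completions.create`.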

Handle rate limits and errors gracefully

Azure OpenAI enforces tokens-per-minute (TPM) and requests-per-minute (RPM) limits per deployment. In production, implement exponential backoff with jitter when you receive a 429 (rate limit) response. Use multiple deployments or regions as fallback for high-availability applications.

python
import time, random
from openai import RateLimitError

def call_with_retry(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o",
                messages=messages
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
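
The fallback idea can be sketched in the same spirit. The example below tries a list of deployments in order and returns the first successful result. `DeploymentUnavailable` is a hypothetical error type standing in for whatever your retry wrapper raises once it gives up, and the callables are fakes so the example runs offline.

```python
class DeploymentUnavailable(Exception):
    """Hypothetical error: a deployment is rate limited or unreachable."""


def call_with_fallback(callers, messages):
    """Try each (name, callable) pair in order; return the first success."""
    last_error = None
    for name, call in callers:
        try:
            return name, call(messages)
        except DeploymentUnavailable as exc:
            last_error = exc  # remember why this one failed, then move on
    raise RuntimeError("all deployments failed") from last_error


def primary(messages):
    raise DeploymentUnavailable("primary region is rate limited")


def secondary(messages):
    return "answer from the secondary region"


used, answer = call_with_fallback(
    [("primary", primary), ("secondary", secondary)],
    [{"role": "user", "content": "ping"}],
)
```

In a real application each callable would wrap a client bound to one region's endpoint, with retries exhausted on that deployment before moving to the next.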

Content filtering

Azure OpenAI has built-in content filters that block harmful input and output across four categories: hate speech, violence, sexual content, and self-harm. In the Azure AI Foundry portal (formerly Azure AI Studio), you can configure custom filter thresholds and enable Prompt Shields to protect against jailbreak and indirect prompt injection attacks, which is especially important for customer-facing applications.
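
Application code should expect filtered responses. When the output filter triggers, the choice comes back with a `finish_reason` of `content_filter`; a blocked prompt is instead rejected with an HTTP 400 error, which the Python SDK surfaces as an exception. The sketch below handles the first case only. `ContentFiltered` and `text_or_raise` are hypothetical names, and the fake responses merely mimic the shape of SDK objects.

```python
from types import SimpleNamespace


class ContentFiltered(Exception):
    """Hypothetical error raised when the completion was filtered."""


def text_or_raise(response):
    """Return the completion text, or raise if the output was filtered."""
    choice = response.choices[0]
    if choice.finish_reason == "content_filter":
        raise ContentFiltered("completion was blocked by the content filter")
    return choice.message.content


def fake_response(text, finish_reason):
    """Stand-in for an SDK response; used only for this demonstration."""
    message = SimpleNamespace(content=text)
    choice = SimpleNamespace(message=message, finish_reason=finish_reason)
    return SimpleNamespace(choices=[choice])


ok_text = text_or_raise(fake_response("Blob Storage stores objects.", "stop"))

try:
    text_or_raise(fake_response(None, "content_filter"))
    was_filtered = False
except ContentFiltered:
    was_filtered = True
```

Catching the filtered case explicitly lets you show the user a neutral fallback message instead of an empty reply.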

Closing Thoughts

Azure OpenAI Service removes the gap between AI capability and enterprise requirements. You get GPT-4o and the full OpenAI model family with private networking, managed identity authentication, regional data residency, and the compliance certifications your organisation likely already requires.

Start by provisioning a resource, deploying a model (the examples here use GPT-4o; GPT-3.5 Turbo is a cheaper first choice), and making your first API call in Python or .NET. Then add managed identity and Key Vault for credential management, streaming for better UX, and a well-crafted system prompt. Those foundations will carry you from prototype to production.
