Generative AI with Azure OpenAI - Part 2
Develop Generative AI solutions with Azure OpenAI Service
In the first part, we learned how to create generative AI solutions using Azure OpenAI Service: what the service is, how to provision a resource, what Azure OpenAI Studio is, and the different types of models available in Azure OpenAI Service.
In this part, we will see how to deploy a model in Azure OpenAI Studio and test it in the playground.
Deploy generative AI models
Before you can make API calls and receive completions to prompts, you first need to deploy a model. When you create a new deployment, you indicate which base model to deploy. You can create any number of deployments in one or multiple Azure OpenAI resources, and there are several ways to deploy a base model.
Deploy using Azure OpenAI Studio
In Azure OpenAI Studio's Deployments page, you can create a new deployment by selecting a model name from the menu. The available base models come from the list on the Models page.
From the Deployments page in the Studio, you can also view information about all your deployments including deployment name, model name, model version, status, date created, and more.
Deploy using Azure CLI
You can also deploy a model using the Azure CLI. In the following example, replace these variables with your own values:
OAIResourceGroup: replace with your resource group name
MyOpenAIResource: replace with your resource name
MyModel: replace with a unique name for your deployment
gpt-35-turbo: replace with the base model you wish to deploy
az cognitiveservices account deployment create \
-g OAIResourceGroup \
-n MyOpenAIResource \
--deployment-name MyModel \
--model-name gpt-35-turbo \
--model-version "0301" \
--model-format OpenAI \
--sku-name "Standard" \
--sku-capacity 1
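After the command completes, you can verify that the deployment exists (a quick check, reusing the same resource names as above):

# List all deployments on the resource to confirm MyModel shows up
az cognitiveservices account deployment list \
-g OAIResourceGroup \
-n MyOpenAIResource \
--output table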
Deploy using the REST API
You can also deploy a model using the REST API; in the request body, you specify the base model you wish to deploy. Once the model is deployed, you can call the Completions operation: given a prompt, the model generates one or more predicted completions, and the service can also return the probabilities of alternative tokens at each position.
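As a rough sketch, creating a deployment is a PUT against the Azure management plane rather than your resource's own endpoint. The path, api-version, and body shape below follow the Cognitive Services deployments API, but verify them against the Microsoft Learn reference before use; the subscription ID is a placeholder:

# Deployment creation goes through management.azure.com and needs an Azure AD token, not the api-key header
curl -X PUT "https://management.azure.com/subscriptions/YOUR_SUBSCRIPTION_ID/resourceGroups/OAIResourceGroup/providers/Microsoft.CognitiveServices/accounts/MyOpenAIResource/deployments/MyModel?api-version=2023-05-01" \
-H "Authorization: Bearer $(az account get-access-token --query accessToken -o tsv)" \
-H "Content-Type: application/json" \
-d "{
\"sku\": {\"name\": \"Standard\", \"capacity\": 1},
\"properties\": {\"model\": {\"format\": \"OpenAI\", \"name\": \"gpt-35-turbo\", \"version\": \"0301\"}}
}"

The Completions request itself goes to your resource's own endpoint, as shown next.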
Example request
curl https://YOUR_RESOURCE_NAME.openai.azure.com/openai/deployments/YOUR_DEPLOYMENT_NAME/completions?api-version=2024-02-01 \
-H "Content-Type: application/json" \
-H "api-key: YOUR_API_KEY" \
-d "{
\"prompt\": \"Once upon a time\",
\"max_tokens\": 5
}"
Here are the details of the parameters used in the above request:
| Parameter | Type | Required? | Description |
| --- | --- | --- | --- |
| your-resource-name | string | Required | The name of your Azure OpenAI resource. |
| deployment-id | string | Required | The deployment name you chose when you deployed the model. |
| api-version | string | Required | The API version to use for this operation. This follows the YYYY-MM-DD format. |
| prompt | string or array | Optional | The prompt or prompts to generate completions for, encoded as a string or an array of strings. |
| max_tokens | integer | Optional | The maximum number of tokens to generate in the completion. |
Example response
{
  "id": "cmpl-4kGh7iXtjW4lc9eGhff6Hp8C7btdQ",
  "object": "text_completion",
  "created": 1646932609,
  "model": "ada",
  "choices": [
    {
      "text": ", a dark line crossed",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ]
}
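If you only want the generated text, you can pipe the response through jq (assuming jq is installed; the placeholders are the same as in the request above):

# -s silences the progress meter; jq -r prints the raw completion text
curl -s "https://YOUR_RESOURCE_NAME.openai.azure.com/openai/deployments/YOUR_DEPLOYMENT_NAME/completions?api-version=2024-02-01" \
-H "Content-Type: application/json" \
-H "api-key: YOUR_API_KEY" \
-d "{\"prompt\": \"Once upon a time\", \"max_tokens\": 5}" | jq -r '.choices[0].text'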
You can find more information about other parameters and other endpoints, such as embeddings, in the Microsoft Learn documentation.
Using prompts to get completions
Once the model is deployed, you can test how it completes prompts. A prompt is the text portion of a request that is sent to the deployed model's completions endpoint. Responses are referred to as completions, which can come in the form of text, code, or other formats.
Prompt Types
Prompts can be grouped into types of requests based on task.
| Task type | Prompt example | Completion example |
| --- | --- | --- |
| Classifying content | Tweet: I enjoyed the trip.<br>Sentiment: | Positive |
| Generating new content | List ways of traveling | 1. Bike<br>2. Car ... |
| Holding a conversation | A friendly AI assistant | See examples |
| Transformation (translation and symbol conversion) | English: Hello<br>French: | bonjour |
| Summarizing content | Provide a summary of the content<br>{text} | The content shares methods of machine learning. |
| Picking up where you left off | One way to grow tomatoes | is to plant seeds. |
| Giving factual responses | How many moons does Earth have? | One |
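Each task type uses the same completions endpoint; only the prompt changes. For example, the classification row above could be sent like this (placeholders as in the earlier request; max_tokens is kept small because only a one-word label is expected):

curl "https://YOUR_RESOURCE_NAME.openai.azure.com/openai/deployments/YOUR_DEPLOYMENT_NAME/completions?api-version=2024-02-01" \
-H "Content-Type: application/json" \
-H "api-key: YOUR_API_KEY" \
-d "{\"prompt\": \"Tweet: I enjoyed the trip.\nSentiment:\", \"max_tokens\": 1}"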
Several factors affect the quality of completions you'll get from a generative AI solution.
The way a prompt is engineered (see the prompt engineering guidance on Microsoft Learn)
The model parameters (covered next)
The data the model is trained on, which can be adapted through model fine-tuning with customization
You have more control over the completions returned by training a custom model than through prompt engineering and parameter adjustment.
You can start making calls to your deployed model using the REST API, the Python or C# client libraries, or from the Studio.
Test models in Studio's playgrounds
Playgrounds are useful interfaces in Azure OpenAI Studio that you can use to experiment with your deployed models without needing to develop your own client application. Azure OpenAI Studio offers multiple playgrounds with different parameter tuning options.
Completions playground
The Completions playground allows you to make calls to your deployed models through a text-in, text-out interface and to adjust parameters. You need to select the deployment name of your model under Deployments. Optionally, you can use the provided examples to get you started, and then you can enter your own prompts.
There are many parameters that you can adjust to change the performance of your model (a sample request using several of them follows this list):
Temperature: Controls randomness. Lowering the temperature means that the model produces more repetitive and deterministic responses. Increasing the temperature results in more unexpected or creative responses. Try adjusting temperature or Top P but not both.
Max length (tokens): Set a limit on the number of tokens per model response. The API supports a maximum of 4000 tokens shared between the prompt (including system message, examples, message history, and user query) and the model's response. One token is roughly four characters for typical English text.
Stop sequences: Make responses stop at a desired point, such as the end of a sentence or list. Specify up to four sequences where the model will stop generating further tokens in a response. The returned text won't contain the stop sequence.
Top probabilities (Top P): Similar to temperature, this controls randomness but uses a different method. Lowering Top P narrows the model’s token selection to likelier tokens. Increasing Top P lets the model choose from tokens with both high and low likelihood. Try adjusting temperature or Top P but not both.
Frequency penalty: Reduce the chance of repeating a token proportionally based on how often it has appeared in the text so far. This decreases the likelihood of repeating the exact same text in a response.
Presence penalty: Reduce the chance of repeating any token that has appeared in the text at all so far. This increases the likelihood of introducing new topics in a response.
Pre-response text: Insert text after the user’s input and before the model’s response. This can help prepare the model for a response.
Post-response text: Insert text after the model’s generated response to encourage further user input, as when modeling a conversation.
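Most of these playground parameters map directly onto fields in the completions request body. Here is a sketch with arbitrary values; note that temperature is set and Top P is left at its default, following the advice above to adjust one but not both:

# stop here is a blank line; frequency/presence penalties discourage repetition
curl "https://YOUR_RESOURCE_NAME.openai.azure.com/openai/deployments/YOUR_DEPLOYMENT_NAME/completions?api-version=2024-02-01" \
-H "Content-Type: application/json" \
-H "api-key: YOUR_API_KEY" \
-d "{
\"prompt\": \"List ways of traveling\",
\"max_tokens\": 100,
\"temperature\": 0.7,
\"stop\": [\"\\n\\n\"],
\"frequency_penalty\": 0.5,
\"presence_penalty\": 0.5
}"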
Chat playground
The Chat playground is based on a conversation-in, message-out interface. You can initialize the session with a system message to set up the chat context.
In the Chat playground, you're able to add few-shot examples. The term few-shot refers to providing a few examples to help the model learn what it needs to do. You can think of it in contrast to zero-shot, which refers to providing no examples.
In the Assistant setup, you can provide few-shot examples of what the user input may be, and what the assistant response should be. The assistant tries to mimic the responses you include here in tone, rules, and format you've defined in your system message.
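In API terms, the system message and few-shot examples become entries in the messages array sent to the chat completions endpoint. Here is a sketch using the same placeholders as before; the conversation content is made up, with the first user/assistant pair acting as the few-shot example and the final user message as the new query:

curl "https://YOUR_RESOURCE_NAME.openai.azure.com/openai/deployments/YOUR_DEPLOYMENT_NAME/chat/completions?api-version=2024-02-01" \
-H "Content-Type: application/json" \
-H "api-key: YOUR_API_KEY" \
-d "{
\"messages\": [
{\"role\": \"system\", \"content\": \"You are a friendly assistant that answers in one short sentence.\"},
{\"role\": \"user\", \"content\": \"How many moons does Earth have?\"},
{\"role\": \"assistant\", \"content\": \"Earth has one moon.\"},
{\"role\": \"user\", \"content\": \"How many moons does Mars have?\"}
],
\"max_tokens\": 50
}"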
The Chat playground, like the Completions playground, also includes the Temperature parameter. The Chat playground also supports other parameters not available in the Completions playground. These include:
Max response: Set a limit on the number of tokens per model response. The API supports a maximum of 4000 tokens shared between the prompt (including system message, examples, message history, and user query) and the model's response. One token is roughly four characters for typical English text.
Top P: Similar to temperature, this controls randomness but uses a different method. Lowering Top P narrows the model’s token selection to likelier tokens. Increasing Top P lets the model choose from tokens with both high and low likelihood. Try adjusting temperature or Top P but not both.
Past messages included: Select the number of past messages to include in each new API request. Including past messages helps give the model context for new user queries. Setting this number to 10 will include five user queries and five system responses.
The Current token count is viewable from the Chat playground. Since API calls are priced by token and it's possible to set a max response token limit, keep an eye on the current token count to make sure the conversation doesn't exceed the model's token limit.
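The API response also reports token consumption in a usage object, which you can inspect programmatically (a jq sketch, assuming jq is installed and the same placeholders as above):

# Show prompt, completion, and total token counts for a minimal chat request
curl -s "https://YOUR_RESOURCE_NAME.openai.azure.com/openai/deployments/YOUR_DEPLOYMENT_NAME/chat/completions?api-version=2024-02-01" \
-H "Content-Type: application/json" \
-H "api-key: YOUR_API_KEY" \
-d "{\"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}], \"max_tokens\": 10}" | jq '.usage'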
Conclusion
In this two-part series, we learned how to create generative AI solutions using Azure OpenAI Service: what the service is, how to provision a resource, what Azure OpenAI Studio is, and the different types of models available in the service. We also looked at how to deploy a model in Azure OpenAI Studio and how to use the playgrounds to send prompts (with various parameters) and get completions from the deployed model.
Understanding, learning, and using the capabilities of generative AI is going to be a key skill in the near future. Let's get ready for it with Azure OpenAI Service.