Working around Terraform’s Azure inequities – Part 2: Azure Shared Private Links

June 7, 2024

In this post we’ll go over how to work around Terraform to both deploy a shared private link between two resources (Azure AI Search & Azure OpenAI) and then automatically approve it.

Disclaimer: In this series you will find I am very critical of Terraform as a project. All opinions are my own and do not reflect those of my employer.

Shared Private Links (SPLs) are a way for two Azure resources to communicate over a Private Endpoint (PE), which routes traffic over an Azure Virtual Network (VNet), making it possible for the resources to never be exposed to the public internet. When you deploy a PE, you tie it to a Subnet within a VNet. Then, if you have another resource that needs to communicate with it directly, you create an SPL between them. One common use case is AI Search’s built-in vectorization: Azure AI Search uses an embedding model deployed in Azure OpenAI to automatically generate vectors for the content it ingests from blob storage. This is beneficial because the compute required is offloaded into the automatic routine that ingests your data into your vector DB, and it’s done with a model that both AI Search and your LLM application can use.

Let’s get started.

Azure AI Search -> Azure OpenAI Shared Private Link deployment

Relevant GH issue:

Solution:
No doubt due to exactly what I’m pointing out in this series, Microsoft finally had enough and created the azapi provider for Terraform, which lets you call the Azure Resource Manager (ARM) API directly in a more Terraform-esque way. Since ARM itself supports this resource, we can use azapi_resource to work around this one with the following:

resource "azapi_resource" "ai_search_shared_private_link_openai" {
  type      = "Microsoft.Search/searchServices/sharedPrivateLinkResources@2024-03-01-preview"
  name      = "Desired Shared Link Name Goes Here"
  parent_id = azurerm_search_service.ai_search.id
 
  schema_validation_enabled = false
 
  body = jsonencode({
    properties = {
      groupID               = "openai_account"
      privateLinkResourceId = azurerm_cognitive_account.openai.id
      requestMessage        = "Required to use OpenAI embedding models to vectorize data"
    }
  })
 
  response_export_values = ["*"]
}

And voila: whatever TF is doing to block the groupID of openai_account and throw an error is now circumvented. Azure does support this group type; it only fails because TF hasn’t kept up. (This will be a theme in your TF journey should you continue using it.)
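For completeness, the resource above only works once the azapi provider is declared next to azurerm. A minimal sketch of that declaration follows. The version constraint is an assumption: the jsonencode()'d body and jsondecode()'d output used in this post match the 1.x provider, while 2.x, as far as I know, takes and returns plain HCL objects instead.

```hcl
terraform {
  required_providers {
    # Assumption: 1.x, because body is passed as a jsonencode()'d string here.
    azapi = {
      source  = "Azure/azapi"
      version = "~> 1.0"
    }
    azurerm = {
      source = "hashicorp/azurerm"
    }
  }
}

# With no arguments, azapi can authenticate from an existing `az login` session.
provider "azapi" {}
```

A module can only have one required_providers block, so in an existing project the azapi entry would be merged into the block that is already there rather than pasted in as a second one.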

Approving a Shared Private link via Terraform

Before you can actually use the link, though, you have to approve it. We’re deploying this as Infrastructure-as-Code (IaC), so we’d love for that to be automatic, right? In general, I’m pretty against mixing & matching tools to compose a solution if I can help it. It complicates the local developer experience for, often, not much real gain. So, I wanted to see if/how I could do the approval entirely within Terraform, at apply time.

When a Shared Private Link gets created, it creates a new resource on the “target” resource (in this case, Azure OpenAI), which you’d normally have to find in the portal, select, then Approve. The best part is that this resource doesn’t have the exact same name as what you gave it on the consumer side (i.e. Azure AI Search). Instead, they append a GUID to it, which is fun. So, we first have to find it in Terraform, then go ahead and issue the PUT request to it.

Notice that in the creation of the link above we set response_export_values = ["*"], which makes the resource expose the API response as a JSON object in its output. We then use this to find the resource that was created on the target via some fancy Terraform function calling:

locals {
  # Name of the SPL's private endpoint connection on the OpenAI account (SPL name plus the appended GUID).
  new_spl_pe_name = one([
    for i in jsondecode(data.azapi_resource_list.open_ai_azure_search_private_link_endpoints.output).value :
    one(regex("/(${replace(azapi_resource.ai_search_shared_private_link_openai.name, ".", "\\.")}\\..+)", i.name))
    if strcontains(i.name, "/${azapi_resource.ai_search_shared_private_link_openai.name}.")
  ])
}
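To see what that expression does in isolation, here is the same logic run against made-up data. Everything prefixed with sample_ is hypothetical, and the shape of the connection names (account name, a slash, the SPL name, a dot, then a GUID) is an assumption inferred from the regex rather than something confirmed against a real list response. strcontains needs Terraform 1.5 or later.

```hcl
locals {
  # Hypothetical values standing in for the real SPL name and list response.
  sample_spl_name = "my-spl"
  sample_connections = [
    { name = "my-openai/my-spl.11111111-2222-3333-4444-555555555555" },
    { name = "my-openai/some-other-connection" },
  ]

  # Same shape as new_spl_pe_name, but reading from the sample list.
  sample_match = one([
    for i in local.sample_connections :
    # regex() with one capture group returns a one-element list; one() unwraps it.
    one(regex("/(${replace(local.sample_spl_name, ".", "\\.")}\\..+)", i.name))
    if strcontains(i.name, "/${local.sample_spl_name}.")
  ])
}

output "sample_match" {
  value = local.sample_match
}
```

With this sample data, sample_match evaluates to my-spl.11111111-2222-3333-4444-555555555555: the filter keeps only the first entry, and the capture group strips everything up to and including the slash.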

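Here I take the output of the azapi_resource_list data source, which lists the private endpoint connections on the OpenAI instance, and pick out the one whose name contains the name I gave the SPL. This finds the instance that has the GUID appended to it. Note that one() expects at most one match and will fail if several connections match. I then use this local in the next TF resource: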

resource "azapi_update_resource" "open_ai_azure_search_private_endpoint_approver" {
  depends_on = [data.azapi_resource_list.open_ai_azure_search_private_link_endpoints]

  type      = "Microsoft.CognitiveServices/accounts/privateEndpointConnections@2023-10-01-preview"
  parent_id = azurerm_cognitive_account.openai.id
  name      = local.new_spl_pe_name

  body = jsonencode({
    location = azurerm_cognitive_account.openai.location
    properties = {
      privateLinkServiceConnectionState = {
        status      = "Approved"
        description = "Auto-Approved - ${jsondecode(azapi_resource.ai_search_shared_private_link_openai.output).properties.requestMessage}"
      }
    }
  })
}

No surprise again, I have to resort to using the azapi provider to do this work. Here, though, I want to modify an existing resource, so I use the azapi_update_resource resource. By setting the Connection State to Approved, the link gets approved and ready for use. I also keep the description that was given to it initially but prepend “Auto-Approved” onto it.
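Both the new_spl_pe_name local and the approver’s depends_on reference data.azapi_resource_list.open_ai_azure_search_private_link_endpoints, which isn’t shown in this post. A minimal sketch of what it could look like follows, assuming it lists the private endpoint connections on the OpenAI account with the same API version the approver uses. The depends_on and response_export_values settings are guesses reconstructed from how the data source is used, not the original definition.

```hcl
# Sketch only: reconstructed from how the data source is referenced in this post.
data "azapi_resource_list" "open_ai_azure_search_private_link_endpoints" {
  type      = "Microsoft.CognitiveServices/accounts/privateEndpointConnections@2023-10-01-preview"
  parent_id = azurerm_cognitive_account.openai.id

  # Export the whole list response so jsondecode(...).value is available.
  response_export_values = ["*"]

  # Assumption: read only after the SPL exists, so its connection shows up in the list.
  depends_on = [azapi_resource.ai_search_shared_private_link_openai]
}
```

The explicit depends_on matters because nothing in the data source’s arguments refers to the SPL itself; without it, Terraform would be free to read the list before the link had been created.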

However, there’s one big caveat I found when doing this. The initial creation of the SPL can “complete” in TF, but not be ready for approval. In this case I was getting a 409 Conflict saying it wasn’t in a “terminal” state. I was never able to observe the actual state it reported, so I resorted to developing a TF module which ensures that an Azure resource has a provisioningState of Succeeded before the module “completes.” In my testing, this has solved the problem (at least so far). I don’t know if it’s just because the delay injected by running the module is enough to solve the race condition or if this is actually the right thing to be doing 😅 At any rate, the module looks like this:

variable "resource_id" {
  type = string
}

variable "sec_between_retries" {
  type    = number
  default = 3
}

variable "max_retries" {
  type    = number
  default = 10
}

resource "terraform_data" "wait_for_ready" {
  input = {
    sleep       = var.sec_between_retries
    retries     = var.max_retries
    resource_id = var.resource_id
  }

  triggers_replace = var.resource_id

  provisioner "local-exec" {
    quiet       = true
    interpreter = ["bash", "-c"]
    command     = <<EOF
# Poll the ARM REST API until the resource reports a provisioningState of Succeeded.
counter=0
max_retries=${self.input.retries}
# Reuse the current Azure CLI login to get a bearer token for ARM.
token=$(az account get-access-token --resource=https://management.azure.com --query accessToken --output tsv)
while [ "$counter" -lt "$max_retries" ]; do
  # -w appends the HTTP status code on its own line after the response body.
  response=$(curl -s -w "\n%%{http_code}" -H "Authorization: Bearer $token" "https://management.azure.com${self.input.resource_id}?api-version=2023-05-01")
  body=$(echo "$response" | head -n -1)
  status_code=$(echo "$response" | tail -n 1)
  if [ "$status_code" -eq 200 ]; then
      # Search the whole document for properties.provisioningState, dropping nulls.
      status=$(echo "$body" | jq -r ".. | .properties? | .provisioningState?" | grep -v null) || { echo "Error in parsing JSON."; exit 1; }
      if [ "$status" == "Succeeded" ]; then
        echo "Provisioning complete."
        exit 0
      else
        echo "Still waiting for provisioning. Current Status: $status"
      fi
  fi
  counter=$((counter+1))
  sleep ${self.input.sleep}
done

# Only reached when no attempt saw Succeeded; fail so the apply stops here.
if [ "$counter" -eq "$max_retries" ]; then
    echo "Max retries hit for ready check!"
    exit 1
fi
EOF
  }
}

The basic concept is to use Azure’s raw REST API to issue GET requests on the target resource and use jq to inspect the provisioningState property of the response. The script expects az, curl, and jq to be available on the PATH. I do this all in bash because, unfortunately, it’s the one cross-platform thing that was present at the customer; they do most of their script execution on their local machines in Git Bash and the actual pipelines run on Linux agents. I don’t like that it makes my world a bit unnatural (I can’t work in PowerShell, since I must remember to do my tf apply, etc. in a Git Bash window) but it is what it is. Feel free to modify as needed!

With this module in place, then, the previous azapi_update_resource changes simply to:

module "wait_for_search_link_ready" {
  source = "./modules/wait_for_ready"

  resource_id = "${azurerm_cognitive_account.openai.id}/privateEndpointConnections/${local.new_spl_pe_name}"
}

resource "azapi_update_resource" "open_ai_azure_search_private_endpoint_approver" {
  depends_on = [module.wait_for_search_link_ready, data.azapi_resource_list.open_ai_azure_search_private_link_endpoints]

...
}

Of note is the addition of the module block, of course, but also the change to the approver’s depends_on, which ensures that it waits for the waiter to complete.
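Since it’s still an open question whether the module works because of the provisioningState check or just because of the delay it adds, one way to test the delay theory is to swap the module for a plain time_sleep from the hashicorp/time provider. This is a different technique from the polling module, not a drop-in equivalent. The resource name is hypothetical, and the 30-second duration is an assumption that simply mirrors the module’s default budget (10 retries, 3 seconds apart).

```hcl
# Assumption: hashicorp/time has been added to the existing required_providers block.

# Fixed pause after the SPL is created; no polling, no bash, no jq.
resource "time_sleep" "after_spl_create" {
  depends_on = [azapi_resource.ai_search_shared_private_link_openai]

  # Mirrors the module defaults: 10 retries x 3 seconds.
  create_duration = "30s"
}

# The approver would then list time_sleep.after_spl_create in its depends_on
# in place of module.wait_for_search_link_ready.
```

If approval still hits the 409 with only the sleep in place, that would suggest the provisioningState check is doing real work; if it doesn’t, the simpler resource may be enough, at the cost of always waiting the full duration.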

This concludes this exciting episode of “Remind me why I’m still using this, again?” – until next time!