# azure
a
Has anyone run into this error when adding a nodepool to an existing AKS cluster before? There's no reason why this should be an issue, and replacing the entire cluster isn't feasible
Copy code
azure-native:containerservice:ManagedCluster (ml-main-prod):
    error: Code="BadRequest" Message="A new agent pool was introduced. Adding agent pools to an existing cluster is not allowed through managed cluster operations. For agent pool specific change, please use per agent pool operations: https://aka.ms/agent-pool-rest-api" Target="agentPoolProfiles"
b
it seems you’re defining the nodepools inline. Don’t do that; add each one as a distinct resource using https://www.pulumi.com/registry/packages/azure-native/api-docs/containerservice/agentpool/
this is a limitation of the Azure API
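something along these lines - rough sketch only, I'm guessing at your variable and pool names here, so adjust the settings to match your actual pool:
Copy code
from pulumi_azure_native import containerservice

# A user pool attached to the existing cluster as its own resource,
# instead of another entry in the cluster's agent_pool_profiles.
gpu_nodepool = containerservice.AgentPool(
    "gpu-nodepool",
    resource_name_=k8s_cluster.name,  # name of the managed cluster to attach to
    resource_group_name=resource_group.name,
    agent_pool_name="gpunodepool",
    mode="User",
    count=1,
    vm_size="standard_nc6s_v3",
    vnet_subnet_id=subnet1.id,
    type="VirtualMachineScaleSets",
)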
a
gotcha, thanks!
@billowy-army-68599 how do you generally handle the system pool? do you pass it inline, or add it as a separate resource as well? atm I am running into a few issues mixing the two, as pulumi tries to replace the whole cluster
b
the system pool must be defined inline
all other pools need to be distinct resources
generally, I create an AKS cluster with a single system pool and then don’t define any other pools inline
a
makes sense, I guess I'm just blocked on our current deployment because recreating the nodepools is forcing replacement of the entire cluster - any way around it?
b
you mean the inline nodepools?
do you have a before and after diff of your code?
a
Sure, this is the before:
Copy code
# Create a Kubernetes cluster
    k8s_cluster = containerservice.ManagedCluster(
        f"ml-main-{stack_name}",
        location=resource_group.location,
        resource_group_name=resource_group.name,
        agent_pool_profiles=[
            # System Node Pool
            containerservice.ManagedClusterAgentPoolProfileArgs(
                name="systempool",
                mode="System",
                os_disk_size_gb=30,
                count=1,
                os_type="Linux",
                vm_size="standard_b2pls_v2",
                vnet_subnet_id=subnet1.id,
                type="VirtualMachineScaleSets",
            ),
            containerservice.ManagedClusterAgentPoolProfileArgs(
                name="gpunodepool",
                mode="User",
                os_type="Ubuntu",
                scale_set_priority="Regular",
                vm_size="standard_nc6s_v3",  # GPU enabled VM
                node_labels={"gpu": "true"},
                vnet_subnet_id=subnet1.id,
                type="VirtualMachineScaleSets",
                node_taints=["gpu=true:NoSchedule"],
                **stack_gpu_autoscaler_settings[stack_name],
            ),
        ],
        dns_prefix=f"ml-main-{stack_name}",
        enable_rbac=True,
        linux_profile={
            "admin_username": "someAdmin",
            "ssh": {
                "publicKeys": [
                    {
                        "keyData": AKS_SSH_PUBKEY,
                    }
                ]
            },
        },
        service_principal_profile=containerservice.ManagedClusterServicePrincipalProfileArgs(
            client_id=app.application_id,
            secret=sp_password.value,
        ),
        network_profile=containerservice.ContainerServiceNetworkProfileArgs(
            network_plugin="azure",
            network_policy="azure",
            service_cidr="10.96.0.0/16",
            dns_service_ip="10.96.0.10",
        ),
    )
And after:
Copy code
# Create a Kubernetes cluster
    k8s_cluster = containerservice.ManagedCluster(
        f"ml-main-{stack_name}",
        location=resource_group.location,
        resource_group_name=resource_group.name,
        agent_pool_profiles=[
            # System Node Pool
            containerservice.ManagedClusterAgentPoolProfileArgs(
                name="systempool",
                mode="System",
                os_disk_size_gb=30,
                count=1,
                os_type="Linux",
                vm_size="standard_b2pls_v2",
                vnet_subnet_id=subnet1.id,
                type="VirtualMachineScaleSets",
            ),
        ],
        dns_prefix=f"ml-main-{stack_name}",
        enable_rbac=True,
        linux_profile={
            "admin_username": "someAdmin",
            "ssh": {
                "publicKeys": [
                    {
                        "keyData": AKS_SSH_PUBKEY,
                    }
                ]
            },
        },
        service_principal_profile=containerservice.ManagedClusterServicePrincipalProfileArgs(
            client_id=app.application_id,
            secret=sp_password.value,
        ),
        network_profile=containerservice.ContainerServiceNetworkProfileArgs(
            network_plugin="azure",
            network_policy="azure",
            service_cidr="10.96.0.0/16",
            dns_service_ip="10.96.0.10",
        ),
    )

gpu_nodepool = containerservice.AgentPool(
    "gpu_nodepool",
    resource_name_=k8s_cluster.name,
    resource_group_name=resource_group.name,
    agent_pool_name="gpunodepool",
    mode="User",
    os_type="Ubuntu",
    scale_set_priority="Regular",
    vm_size="standard_nc6s_v3",
    node_labels={"cpu": "true"},
    vnet_subnet_id=subnet1.id,
    type="VirtualMachineScaleSets",
    node_taints=["cpu=true:NoSchedule"],
    **stack_gpu_autoscaler_settings[stack_name]
)
just moved the GPU nodepool out, and it's replacing the whole cluster
b
Yep, you can’t modify those inline nodepools at all.
Even removing them will force a recreate, sadly. https://github.com/pulumi/pulumi-azure-native/issues/579
a
That's pretty frustrating, imo the docs should be updated to reflect this
Also, now that I have tried this, completely reverting the change is still forcing a recreate, even after deleting the nodepool manually and running a refresh
Any way I can find the exact reason why a recreate is being triggered?
b
again, this is a limitation on the azure side. I agree it’s frustrating, but there’s not a whole lot we can do. The azure API doesn’t even publish docs on this
a
Totally understand. Atm I'm just reverting everything to the way it was (no new nodepools, pulumi config is inline as before, azure infra matches the config) - just trying to figure out why pulumi refresh still prompts a recreate
seems like it's just a nodepool ordering thing from the API
I think I managed to find a workaround:
• Add opts=pulumi.ResourceOptions(ignore_changes=["agent_pool_profiles"]) to the cluster resource, and remove the inline user agent pools (see the sketch below)
• Create separate AgentPool resources
• Try pulumi up - this will inevitably fail because of "existing resources"
• Run pulumi import azure-native:containerservice:AgentPool <nodepoolResourceName> /subscriptions/<sid>/resourceGroups/<rg_name>/providers/Microsoft.ContainerService/managedClusters/<clusterName>/agentPools/<existingNodepoolName> for each existing nodepool
• Refresh, and you are good to go
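for reference, the cluster definition ends up looking roughly like this - just a sketch that reuses the variables from the snippets above and drops the profiles that didn't change, not the exact code:
Copy code
import pulumi
from pulumi_azure_native import containerservice

# Same cluster as before, but only the system pool stays inline, and pulumi is
# told to ignore agent_pool_profiles so the pools that now live as separate
# AgentPool resources don't show up as a diff that forces a replace.
k8s_cluster = containerservice.ManagedCluster(
    f"ml-main-{stack_name}",
    location=resource_group.location,
    resource_group_name=resource_group.name,
    dns_prefix=f"ml-main-{stack_name}",
    agent_pool_profiles=[
        containerservice.ManagedClusterAgentPoolProfileArgs(
            name="systempool",
            mode="System",
            count=1,
            os_disk_size_gb=30,
            os_type="Linux",
            vm_size="standard_b2pls_v2",
            vnet_subnet_id=subnet1.id,
            type="VirtualMachineScaleSets",
        ),
    ],
    opts=pulumi.ResourceOptions(ignore_changes=["agent_pool_profiles"]),
)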
b
that looks like a good approach!
a
fingers crossed it doesn't break anything that'll come back to bite me haha, but for now it looks okay
bumping this because I just noticed something funky wrt networking with the new AgentPool format. The nodepools are created successfully and they're in the same vnet/subnet, but the healthcheck for the new cpu nodepool doesn't work and the Ingress throws 502s for any services on the new nodepool. Port-forwarding to the services works locally, so I know they're running correctly. Do you do any networking config beyond setting the same vnet + subnet for the agentpools as the cluster?