b
This is really a network issue; has anything changed on your network?
d
No, this has always been running in GitHub Actions, and it has never had access to the cluster. In hindsight, I'm somewhat surprised it worked in the past, but it definitely did. My main suspicion is that something is causing Pulumi to try to refresh resources that were previously just using the saved state?
s
This sounds like a network issue to me too. You can confirm it by running a plain kubectl command against the cluster in a GitHub Action,
or:
`nc -v 172.18.3.244 6443`
172.18.x.x is private IP space, so your runners either need to exist in the same VPC or connect through something like a VPN. If you don't have that type of setup, it won't work and wouldn't have ever worked. I would double-check whether anything changed recently about your network, agents, or cluster, even things like whether the cluster used to have a publicly accessible API server.
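If it's easier to wire into an existing Node-based CI step, here's a minimal sketch of that same reachability probe in TypeScript; the file name and invocation are hypothetical, and the host/port are the ones from this thread:
```typescript
// Illustrative sketch only: the same TCP reachability probe as `nc -v`,
// runnable from a CI step (e.g. `npx ts-node probe.ts`; file name hypothetical).
import * as net from "net";

const host = "172.18.3.244"; // API server IP from this thread
const port = 6443;

const socket = net.connect({ host, port, timeout: 5000 }, () => {
    console.log(`connected to ${host}:${port}`);
    socket.end();
});
socket.on("timeout", () => {
    console.error(`timed out connecting to ${host}:${port}`);
    socket.destroy();
    process.exitCode = 1;
});
socket.on("error", (err) => {
    console.error(`failed to connect: ${err.message}`);
    process.exitCode = 1;
});
```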
d
> If you don't have that type of setup, it won't work and wouldn't have ever worked.

Exactly, I'm 99% certain that GitHub Actions never had access to the Kubernetes API. The fact remains that this did work, for several months, until a week or two ago. It's a private repo, so I can't share a link to the action run/PR, but it absolutely produced previews of changes to this cluster. The only possible explanation is that `pulumi preview` was previously not contacting the API server at all, and was just diffing the saved state against what running the stack produces (which, unexpectedly, seems to have been possible without contacting the API server). Recently, something changed that causes a preview to require the API server. The long-term solution is to set up a VPN for our runners, and I'm working on that; I was just curious whether anyone was aware of something that changed on the Pulumi side that now requires access to the API server.
g
Did `enableServerSideApply` become the default in the k8s provider? I'm wondering if that has anything to do with it. Also, do you have something defined in the `.kubeconfig` somewhere?
In your CI job, when you run `kubectl config current-context`, is there a context defined? If so, drill into that and see if it's defining a server that matches that host.
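For reference, `enableServerSideApply` is a provider-level input, so it can be pinned explicitly rather than left to whatever the default is. A minimal TypeScript sketch; the resource name and config key are assumptions, not from this thread:
```typescript
import * as pulumi from "@pulumi/pulumi";
import * as k8s from "@pulumi/kubernetes";

// Sketch: pin server-side apply explicitly so a change in the provider's
// default can't silently alter preview behavior. Config key is hypothetical.
const config = new pulumi.Config();

const provider = new k8s.Provider("on-prem", {
    kubeconfig: config.requireSecret("kubeconfig"),
    enableServerSideApply: false,
});
```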
s
and/or are you possibly using the `--refresh` flag?
d
Hm, the kubeconfig hasn't changed. It's generated as an output of another stack and passed explicitly to the provider via the `kubeconfig` option. I'm not setting the `--refresh` flag in CI. I tried setting it to `false` locally (it looks like `true` is the default) and running while disconnected from the VPN, but that shows the same error.
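That kind of setup (kubeconfig produced by one stack, consumed by another) looks roughly like this in TypeScript; the stack and output names here are hypothetical:
```typescript
import * as pulumi from "@pulumi/pulumi";
import * as k8s from "@pulumi/kubernetes";

// Sketch of the setup described above: the kubeconfig is an output of
// another stack and is passed explicitly to the provider.
// The stack and output names are hypothetical.
const clusterStack = new pulumi.StackReference("acme/cluster/on-prem");

const kubeconfig = clusterStack
    .requireOutput("kubeconfig")
    .apply(v => (typeof v === "string" ? v : JSON.stringify(v)));

const provider = new k8s.Provider("on-prem", { kubeconfig });
```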
g
Well, generally speaking, and considering `serverSideApply` is going to become the provider default in the future, not having access to a k8s cluster, even a mock one, is going to make the provider preview relatively useless going forward... I'm not sure what your specific setup looks like, but Pulumi is trying to reach out to a cluster to diff the resource changes.
are you using an explicit provider with ResourceOptions?
Did someone else operate on this state at one point, and now the cluster is persisted to it? E.g. someone ran a `pulumi update` with live cluster access, and the CI is trying to use the same state?
d
> considering `serverSideApply` is going to become the provider default in the future, not having access to a k8s cluster, even a mock one, is going to make the provider preview relatively useless going forward

Fair enough. I'm really looking forward to the patch functionality; that'll make some parts of our stack a little tidier. I'll get working on the VPN setup. That's definitely the way forward; I was just hoping for a flag I could toggle to work around this for a day or two. But no worries if not.
> are you using an explicit provider with ResourceOptions?

Yes, I'm only setting `kubeconfig` on the provider.
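Presumably each resource then picks that provider up through its ResourceOptions, something like the sketch below (the resource itself is made up for illustration):
```typescript
import * as k8s from "@pulumi/kubernetes";

// Sketch: the explicit provider is attached per-resource via ResourceOptions.
// `provider` is assumed to be the k8s.Provider built from the kubeconfig above.
declare const provider: k8s.Provider;

const ns = new k8s.core.v1.Namespace("apps", {
    metadata: { name: "apps" },
}, { provider });
```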
g
Something smells like someone ran a `pulumi update` against this stack/state while connected to this "VPN", and it got stored in the state as a refreshable resource, but it's not accessible from CI.
Perhaps the context names are the same, but the endpoints differ.
d
> someone ran a `pulumi update` with live cluster access, and the CI is trying to use the same state?

Yes, our normal workflow is:
1. Run `pulumi preview` from CI (this is the thing that's no longer working).
2. If the diff looks good, merge the PR.
3. Run `pulumi update` from a machine that has access to the VPN.
This has been working for some time; nothing about this has changed recently.
g
No, but perhaps Pulumi is now storing some additional metadata about this and future runs are detecting it.
I can only assume that your actual cluster is located at `172.18.3.244`?
d
Yep, that is the correct IP address for the control plane.
Anyway, since server-side apply is the way forward, and there's nothing obvious that can be done to work around it in the short term, I'll proceed with enabling the VPN for our CI runners.
s
Not sure if this is an option for you, but you could also consider creating runners in the same network space as your k8s cluster (you could even create them inside the cluster with actions-runner-controller)
g
we're using actions-runner-controller^
Are you using any Helm `Chart`/`Release` resources? I wonder if somehow Helm is invoking the live connection to check for something
I'm curious what the contents of the `kubeconfig` param you're passing in look like, or if it is too sensitive to share
smells a bit like this
this too
Despite dry run being enabled, the `Release` resource still tries to reach the cluster, live. And all `Release` does, AFAIK, is leverage the built-in Helm v3 bindings; perhaps those have changed their behavior as well.
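For concreteness, a `Release` bound to an explicit provider looks something like this sketch (the chart and values are hypothetical); the point is that previewing even this can reach the live cluster through the Helm bindings:
```typescript
import * as k8s from "@pulumi/kubernetes";

// Sketch: a Helm Release bound to the explicit provider. Even in a dry-run
// preview, diffing this resource may contact the live cluster, per the above.
declare const provider: k8s.Provider;

const ingress = new k8s.helm.v3.Release("ingress-nginx", {
    chart: "ingress-nginx",
    repositoryOpts: { repo: "https://kubernetes.github.io/ingress-nginx" },
    namespace: "ingress-nginx",
    values: { controller: { replicaCount: 2 } },
}, { provider });
```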
d
I'm using Helm `Release` resources, but they haven't changed recently. The `kubeconfig` is below (secrets snipped); afaik there's nothing unusual about it.
```json
{
  "apiVersion": "v1",
  "kind": "Config",
  "clusters": [
    {
      "name": "on-prem",
      "cluster": {
        "certificate-authority-data": "...secret...",
        "server": "<https://172.18.3.244:6443>"
      }
    }
  ],
  "users": [
    {
      "name": "pulumi",
      "user": {
        "client-certificate-data": "...secret...",
        "client-key-data": "...secret..."
      }
    }
  ],
  "contexts": [
    {
      "name": "pulumi@on-prem",
      "context": {
        "cluster": "on-prem",
        "user": "pulumi"
      }
    }
  ],
  "current-context": "pulumi@on-prem"
}
```
We haven't set up actions-runner-controller yet, but I'll add this to the list of advantages it has 🙂
g
So the server is defined in the config; that would make sense that it tries to compare the Release metadata against it. I'm not sure what process you use to generate that kubeconfig payload, but perhaps `server` was null or an empty string before.
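If it helps to picture it, the generating side might export something like this sketch; the actual generating stack isn't shown in this thread, so all of this is assumed:
```typescript
import * as pulumi from "@pulumi/pulumi";

// Sketch only: how a cluster stack might export the kubeconfig payload that
// the consuming stack feeds to its provider. If `server` here were ever null
// or empty, downstream behavior would differ, per the discussion above.
const apiServer = "https://172.18.3.244:6443"; // value seen in this thread

export const kubeconfig = pulumi.secret({
    apiVersion: "v1",
    kind: "Config",
    clusters: [{ name: "on-prem", cluster: { server: apiServer /* CA data elided */ } }],
    users: [{ name: "pulumi" /* client cert/key elided */ }],
    contexts: [{ name: "pulumi@on-prem", context: { cluster: "on-prem", user: "pulumi" } }],
    "current-context": "pulumi@on-prem",
});
```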
s
ahh don't use that env var (`PULUMI_K8S_DELETE_UNREACHABLE`) for this situation. It will remove the resources it can't reach from the state, meaning all of your k8s resources.
g
Hmm, yeah, maybe that's not what we want... interesting, though, that this is all of a sudden cropping up for them.
```
Previewing update (prd-bravo):
  pulumi:pulumi:Stack: (same)
    [urn=urn:pulumi:prd-bravo::k8s-aws-auth-config::pulumi:pulumi:Stack::k8s-aws-auth-config-prd-bravo]
    > kubernetes:core/v1:Namespace: (read)
        [id=kube-system]
        [urn=urn:pulumi:prd-bravo::k8s-aws-auth-config::kubernetes:core/v1:Namespace::kube-system]
        [provider=urn:pulumi:prd-bravo::k8s-aws-auth-config::pulumi:providers:kubernetes::eks_prd-bravo::8a15f8ed-f5ed-4537-a7bf-41d1aed334b8]
warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get "https://<REDACTED>.sk1.us-gov-west-1.eks.amazonaws.com1/openapi/v2?timeout=32s": dial tcp: lookup <REDACTED>.sk1.us-gov-west-1.eks.amazonaws.com1 on 10.0.0.2:53: no such host
error: Preview failed: failed to read resource state due to unreachable cluster. If the cluster has been deleted, you can edit the pulumi state to remove this resource or retry with the PULUMI_K8S_DELETE_UNREACHABLE environment variable set to true.
error: preview failed
Resources:
    2 unchanged
warning: A new version of Pulumi is available. To upgrade from version '3.53.0' to '3.55.0', visit https://p
```
I only mentioned it since I was able to repro by using a bogus hostname in my kubeconfig for this named context
and if I blank it out I definitely get a different error, as expected:
```
warning: configured Kubernetes cluster is unreachable: unable to load Kubernetes client configuration from kubeconfig file. Make sure you have: 

         • set up the provider as per https://www.pulumi.com/registry/packages/kubernetes/installation-configuration/
```
Whatever is generating `server` in the context is definitely injecting private IPs. You mentioned this value was coming from an output of another thing, perhaps another Pulumi state/project that is exporting the API server endpoint? As Mike indicated, did someone make this endpoint private (which is a good idea, of course), and now the output contains this IP instead of the publicly accessible hostname that "worked" before? Do you use EKS?
d
No, that's definitely the same IP it's always used. It's a bare-metal cluster running in an on-prem DC.
g
Very odd -- I mean, nothing stands out as obvious, since the code around `clusterUnreachable` hasn't changed in 2-4 years, and the only recent change made to that provider for that exception is the mention of the env var I put above. 🤷
```go
// We use this information to read the live version of a Kubernetes resource. This is sometimes
// then checkpointed (e.g., in the case of `refresh`). Specifically:
//
// * The return is formatted as a "checkpoint object", i.e., an object of the form
//   {inputs: {...}, live: {...}}. This is important both for `Diff` and for `Update`. See
//   comments in those methods for details.
//
```
The only thing I can think of is that something did an implicit refresh, which invalidated all of the checkpoints it had been relying on in the past.