2 months ago
    I built an EKS cluster using the EKS module v0.37.1 with RBAC and efs-csi PVC support. I then launched fluent-bit and prometheus on it, all from helm charts. Between the fluent-bit and prometheus installs I upgraded all my pulumi modules, which took eks to 0.41.0. Now whenever I try to run
    kubectl logs <pod>
    I get a long delay followed by an error
    Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)
    Reading on Stack Exchange etc., it seems that this means I have somehow hosed my RBAC. I'm considering taking off and nuking the whole site from orbit, as this cluster is not in production yet, but before I do I was hoping to understand how I have broken it. I have been comparing clusterRoles and clusterRoleBindings between this new broken cluster and the other one, which I built using the 0.37.1 code. I can't find any mention of the kube-apiserver-kubelet-client user from the error in there, but
    Name:         system:kubelet-api-admin
    Labels:       <|>
    Annotations:  <|>: true
      Resources      Non-Resource URLs  Resource Names  Verbs
      ---------      -----------------  --------------  -----
      nodes/log      []                 []              [*]
      nodes/metrics  []                 []              [*]
      nodes/proxy    []                 []              [*]
      nodes/spec     []                 []              [*]
      nodes/stats    []                 []              [*]
      nodes          []                 []              [get list watch proxy]
    Looks the same on both clusters. Does anyone have any pointers as to where I can look, or what causes this error?
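As a sanity check on the rule table above, here is a minimal Python sketch of how a list of RBAC rules could be matched against the request in the error (verb=get, resource=nodes, subresource=proxy). The `Rule` class and the `allows` helper are hypothetical names made up for illustration, and the matching is a simplification that ignores API groups, resource names and bindings. If it prints `True`, the role itself permits the request, which would suggest looking at the bindings or somewhere outside RBAC rather than at the rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical, simplified stand-in for one RBAC policy rule."""
    resources: tuple  # e.g. ("nodes/proxy",)
    verbs: tuple      # e.g. ("*",) or ("get", "list")


# Rules transcribed from the system:kubelet-api-admin output above.
KUBELET_API_ADMIN = [
    Rule(("nodes/log",), ("*",)),
    Rule(("nodes/metrics",), ("*",)),
    Rule(("nodes/proxy",), ("*",)),
    Rule(("nodes/spec",), ("*",)),
    Rule(("nodes/stats",), ("*",)),
    Rule(("nodes",), ("get", "list", "watch", "proxy")),
]


def allows(rules, verb, resource, subresource=""):
    """Return True if any rule matches both the verb and the resource.

    Simplified assumption: a subresource is addressed as
    "resource/subresource", and "*" matches anything.
    """
    target = f"{resource}/{subresource}" if subresource else resource
    for rule in rules:
        resource_ok = "*" in rule.resources or target in rule.resources
        verb_ok = "*" in rule.verbs or verb in rule.verbs
        if resource_ok and verb_ok:
            return True
    return False


# The request from the error message.
print(allows(KUBELET_API_ADMIN, "get", "nodes", "proxy"))
```

This only models the role's rules. Whether the user from the error is actually bound to the role is a separate question that the sketch does not cover.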
    Whilst this may sound a lot like a DNS problem, I have checked, and both DNS Hostnames and DNS Resolution are turned on in the VPC, and the working cluster is in the VPC too 😖
    Hmmm, a few pulumi refreshes and a pulumi up later, and the issue seems to have gone away. The pulumi up modified a ClusterIngressRule and the CloudFormation stack for the nodes, which is likely what fixed the cluster.