I built an EKS cluster using the EKS module v0.37.1, with RBAC and efs-csi PVC support, and then launched fluent-bit and Prometheus on it, all from Helm charts. Between the fluent-bit and Prometheus installs I upgraded all my Pulumi modules, which bumped eks to 0.41.0. Now whenever I try to run
kubectl logs <pod>
I get a long delay followed by an error:
Error from server (InternalError): Internal error occurred: Authorization error (user=kube-apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)
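As far as I can tell that user is the client identity the API server uses when it talks to the kubelets, and the kubelet checks it back against the cluster's RBAC, so one way to see whether RBAC is actually the thing denying it (rather than, say, networking) is to ask the authorizer directly via impersonation — this assumes whoever runs it has impersonation rights, e.g. cluster-admin:
# Ask the same question the kubelet's authorization webhook would ask.
# "yes" suggests RBAC is fine and the problem is elsewhere; "no" points at a missing binding.
kubectl auth can-i get nodes/proxy --as=kube-apiserver-kubelet-client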
Reading on Stack Exchange etc. it seems this means I have somehow hosed my RBAC. I'm considering taking off and nuking the whole site from orbit, as this cluster is not in production yet, but before I do I was hoping to understand how I broke it. I have been comparing ClusterRoles and ClusterRoleBindings between this new broken cluster and the other one, which I built using the 0.37.1 code. I can't find any mention of
kube-apiserver-kubelet-client
in there, but
Name:         system:kubelet-api-admin
Labels:       kubernetes.io/bootstrapping=rbac-defaults
Annotations:  rbac.authorization.kubernetes.io/autoupdate: true
PolicyRule:
  Resources      Non-Resource URLs  Resource Names  Verbs
  ---------      -----------------  --------------  -----
  nodes/log      []                 []              [*]
  nodes/metrics  []                 []              [*]
  nodes/proxy    []                 []              [*]
  nodes/spec     []                 []              [*]
  nodes/stats    []                 []              [*]
  nodes          []                 []              [get list watch proxy]
Looks the same on both clusters. Does anyone have any pointers as to where I can look, or what causes this error?
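For completeness, this is roughly the comparison I've been doing on both clusters, looking for whatever binds that role and whether kube-apiserver-kubelet-client shows up as a subject anywhere (the -o wide output includes the bound role and the subjects):
# Anything kubelet-related: the system:kubelet-api-admin role, its bindings, and that user.
kubectl get clusterrolebindings -o wide | grep -i kubelet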
Whilst this may sound a lot like https://eksctl.io/usage/troubleshooting/#kubectl-logs-and-kubectl-run-fails-with-authorization-error, I have checked and both DNS Hostnames and DNS Resolution are turned on in the VPC, and the working cluster is in the same VPC too 😖
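For anyone else checking the same thing, the two VPC attributes can be confirmed from the CLI (swap in the cluster's VPC ID):
# Both should report "Value": true, per the eksctl troubleshooting page above.
aws ec2 describe-vpc-attribute --vpc-id <vpc-id> --attribute enableDnsHostnames
aws ec2 describe-vpc-attribute --vpc-id <vpc-id> --attribute enableDnsSupport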
Hmmm, a few pulumi refreshes and a pulumi up later and the issue seems to have gone away. The pulumi up modified a ClusterIngressRule and the CloudFormation stack for the nodes, which is likely what fixed the cluster.
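For reference, the loop was nothing cleverer than refreshing state and re-applying:
# Reconcile Pulumi's state with what's actually in AWS, then re-apply the program;
# the up is what rewrote the ClusterIngressRule and the node CloudFormation stack.
pulumi refresh
pulumi up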