Amazon EKS CoreDNS Addon "Degraded" Error

Amazon EKS CoreDNS Addon state is "Degraded"

unchaptered

Oct 18, 2024

Contents

Overview, Problem, System Specs, Solution, Troubleshooting Process

Overview

This post walks through how I resolved the "Degraded" warning on the CoreDNS add-on in Amazon EKS.

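Grumbling
Honestly, in the last few months of working with EKS I hadn't hit anything that made me ask "why is it doing this?",
but this was one of the most baffling situations I've run into recently.

At work we use a Terraform module we built ourselves,
and a part that had worked without issue from May through August 2024 suddenly broke.
Googling didn't turn up a clear explanation, so I'm recording it here.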

Problem

When you enable the CoreDNS add-on on Amazon EKS, the add-on can end up in the "Degraded" state.
The result was the same whether I deployed with our in-house Terraform module or through the AWS Console.
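The add-on state can also be read from the command line instead of the console. Here is a minimal sketch, assuming the AWS CLI is installed and configured for the cluster's account and region. `addon_health` is a hypothetical helper name of my own, not something from the setup described in this post.

```shell
# Hypothetical helper: print the status and reported health issues of an EKS add-on.
# Usage: addon_health <cluster-name> [addon-name]   (add-on defaults to coredns)
addon_health() {
  cluster="$1"
  addon="${2:-coredns}"
  aws eks describe-addon \
    --cluster-name "$cluster" \
    --addon-name "$addon" \
    --query 'addon.{status: status, issues: health.issues}' \
    --output json
}
```

Calling `addon_health <EKS-CLUSTER>` (with your own cluster name in place of the placeholder) should report a status of `DEGRADED` while the problem is present.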

System Specs

Here are the exact specs of the system where I hit the problem.
I expect similar behavior even if some of the versions differ.

Amazon EKS version 1.30 / platform eks.12
Amazon EKS Addon
- Amazon VPC CNI: v1.18.1-eksbuild.3
- kube-proxy: v1.30.0-eksbuild.3
- CoreDNS: v1.11.1-eksbuild.8

Amazon EKS NodeGroup
- AMI: AL2023_x86_64_STANDARD
- Release: 1.30.4.20241011
- Replicas: 2
- Instance Type: t3.medium
- Disk Size: 20 GiB (gp3)
- Scaling: desired 2, minimum 2, maximum 2

Solution

To investigate the CoreDNS add-on error,
I checked the status of pod/coredns-** and found that the problem was in aws-vpc-cni.

So
I restarted every pod/aws-node under daemonset/aws-node.
After that, all application pods, including pod/coredns-**, were assigned IPs normally.
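One way to perform that restart is a rolling restart of the DaemonSet. This is only a sketch of how it could be done, not necessarily the exact commands I ran. `restart_aws_node` is a hypothetical helper name, and the 5 minute timeout is an arbitrary choice.

```shell
# Hypothetical helper: restart every aws-node pod and wait for the rollout to finish.
restart_aws_node() {
  kubectl rollout restart daemonset/aws-node -n kube-system &&
  kubectl rollout status daemonset/aws-node -n kube-system --timeout=5m
}
```

Run `restart_aws_node` against the affected cluster, then re-check the pods that were stuck in ContainerCreating.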

Troubleshooting Process

On the newly deployed EKS cluster,
I checked whether deploy/coredns had been deployed correctly in kube-system.
It turned out that only 1 of the 2 pod/coredns replicas was up.

Check the deployment

kubectl get deploy coredns -n kube-system

NAME      READY   UP-TO-DATE   AVAILABLE   AGE
coredns   1/2     2            1           91m

Inspect the deployment in detail

kubectl describe deploy coredns -n kube-system

Name:                   coredns
Namespace:              kube-system
CreationTimestamp:      Thu, 17 Oct 2024 16:32:51 +0900
Labels:                 eks.amazonaws.com/component=coredns
                        k8s-app=kube-dns
                        kubernetes.io/name=CoreDNS
Annotations:            deployment.kubernetes.io/revision: 2
Selector:               eks.amazonaws.com/component=coredns,k8s-app=kube-dns
Replicas:               2 desired | 2 updated | 2 total | 1 available | 1 unavailable

...

Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    False   ProgressDeadlineExceeded

...

Next, I looked at the events of the CoreDNS pod that had failed to start.
They showed that aws-cni could not assign an IP address to pod/coredns.

Find the name of the failing pod/coredns

kubectl get pods -n kube-system | grep core | grep -v Running

coredns-6558b6db9c-zlvmd   0/1     ContainerCreating   0          100m

Check the pod/coredns details

kubectl describe pod <POD_NAME> -n kube-system

kubectl describe pod coredns-6558b6db9c-zlvmd -n kube-system

...

Events:
  Type     Reason                  Age                    From     Message
  ----     ------                  ----                   ----     -------
  Warning  FailedCreatePodSandBox  4m39s (x382 over 87m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "<SANDBOX>": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
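
If the full describe output is too noisy, the warning events for a single pod can be pulled out directly. A small sketch, where `pod_warnings` is a hypothetical helper name and the namespace default of kube-system is my assumption:

```shell
# Hypothetical helper: list Warning events for one pod, sorted by last occurrence.
# Usage: pod_warnings <pod-name> [namespace]   (namespace defaults to kube-system)
pod_warnings() {
  pod="$1"
  ns="${2:-kube-system}"
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod,type=Warning" \
    --sort-by=.lastTimestamp
}
```

For the pod above, that would be `pod_warnings coredns-6558b6db9c-zlvmd`.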

I hadn't installed anything extra on the EKS cluster, but I still
checked the networking mode, MaxPods per node, and the number of running pods.
It turned out that, with Prefix Delegation enabled, only 9 of the 440 available pod slots were in use.

Check the networking mode

kubectl describe daemonset aws-node -n kube-system | grep ENABLE_PREFIX_DELEGATION

kubectl describe daemonset aws-node -n kube-system | grep WARM_PREFIX_TARGET

      ENABLE_PREFIX_DELEGATION:               true
      WARM_PREFIX_TARGET:                     1

Check the number of nodes and MaxPods per node

kubectl get nodes -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.pods}{"\n"}{end}'

ip-X-X-X-X.<region>.compute.internal 110
ip-X-X-X-X.<region>.compute.internal 110
ip-X-X-X-X.<region>.compute.internal 110
ip-X-X-X-X.<region>.compute.internal 110

Count the running pods

kubectl get pods -A | grep "Running" | wc -l

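The EKS NodeGroup still had room for 431 more pods,
so I couldn't understand why pods wouldn't start, and I tried running a different image.
The result convinced me that, regardless of the image, aws-cni was failing to assign an IP to any new pod.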
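
The 431 figure is simply the total allocatable pods (4 nodes x 110 = 440) minus the 9 running pods. As a rough sketch, the same arithmetic can be scripted. `total_allocatable` and `free_pod_slots` are hypothetical helper names of my own.

```shell
# Sum the second column (allocatable pods) of "node<TAB>pods" lines read from stdin.
total_allocatable() {
  awk '{ sum += $2 } END { print sum + 0 }'
}

# Free pod slots = total allocatable pods - pods currently running.
# Usage: free_pod_slots <total> <running>
free_pod_slots() {
  echo $(( $1 - $2 ))
}
```

Piping the output of the node jsonpath command above into `total_allocatable` gives the total, and `free_pod_slots 440 9` gives the remaining headroom.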

Run pod/nginx

kubectl run nginx --image=nginx:latest --port=80

Check the pod/nginx status

kubectl describe pod nginx

...

Events:
  Type     Reason                  Age                    From     Message
  ----     ------                  ----                   ----     -------
  Warning  FailedCreatePodSandBox  4m39s (x382 over 87m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "<SANDBOX>": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container

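With so few EC2 instances, RDS instances, and pods running, I thought IP exhaustion was unlikely,
but since the subnets the EKS nodes belong to are /24 (256 addresses, of which AWS reserves 5, leaving 251 usable),
I checked the remaining IPs in each subnet and found that most of them still had 188 to 230 free.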
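
The subnet check can be sketched from the command line as well, assuming the AWS CLI is configured. `usable_ips` and `subnet_free_ips` are hypothetical helper names, and the subnet IDs you pass are your own. AWS reserves 5 addresses in every subnet, which is what the first helper accounts for.

```shell
# Usable IPv4 addresses in an AWS subnet with the given prefix length.
# AWS reserves 5 addresses in every subnet.
usable_ips() {
  echo $(( (1 << (32 - $1)) - 5 ))
}

# Hypothetical helper: show the free-address count AWS reports for each subnet.
# Usage: subnet_free_ips <subnet-id> [<subnet-id> ...]
subnet_free_ips() {
  aws ec2 describe-subnets --subnet-ids "$@" \
    --query 'Subnets[].[SubnetId,CidrBlock,AvailableIpAddressCount]' \
    --output table
}
```

`usable_ips 24` gives the ceiling for a /24 subnet, and `subnet_free_ips` shows how much of it is still available.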

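Finally, I considered the possibility that aws-cni itself
was failing to recognize the subnets or the allocatable IP pool, so I checked its options.
However, every option was identical to the aws-node of another EKS cluster that was working fine,
and going through each option in github.com/aws/amazon-vpc-cni-k8s didn't reveal any significant difference either.

List the pod/aws-node pods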

kubectl get pod -n kube-system | grep "Running" | grep "aws-node"

kube-system   aws-node-gxhdl             2/2     Running             0          15h
kube-system   aws-node-q7nzd             2/2     Running             0          15h
kube-system   aws-node-xqmhw             2/2     Running             0          15h
kube-system   aws-node-z28hb             2/2     Running             0          15h

Check the pod/aws-node status

kubectl describe pod aws-node-gxhdl -n kube-system

      ADDITIONAL_ENI_TAGS:                    {}
      ANNOTATE_POD_IP:                        false
      AWS_VPC_CNI_NODE_PORT_SUPPORT:          true
      AWS_VPC_ENI_MTU:                        9001
      AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG:     false
      AWS_VPC_K8S_CNI_EXTERNALSNAT:           false
      AWS_VPC_K8S_CNI_LOGLEVEL:               DEBUG
      AWS_VPC_K8S_CNI_LOG_FILE:               /host/var/log/aws-routed-eni/ipamd.log
      AWS_VPC_K8S_CNI_RANDOMIZESNAT:          prng
      AWS_VPC_K8S_CNI_VETHPREFIX:             eni
      AWS_VPC_K8S_PLUGIN_LOG_FILE:            /var/log/aws-routed-eni/plugin.log
      AWS_VPC_K8S_PLUGIN_LOG_LEVEL:           DEBUG
      CLUSTER_NAME:                           <EKS-CLUSTER>
      DISABLE_INTROSPECTION:                  false
      DISABLE_METRICS:                        false
      DISABLE_NETWORK_RESOURCE_PROVISIONING:  false
      ENABLE_IPv4:                            true
      ENABLE_IPv6:                            false
      ENABLE_POD_ENI:                         false
      ENABLE_PREFIX_DELEGATION:               true
      ENABLE_SUBNET_DISCOVERY:                true
      NETWORK_POLICY_ENFORCING_MODE:          standard
      VPC_CNI_VERSION:                        v1.18.5
      VPC_ID:                                 <VPC-ID>
      WARM_ENI_TARGET:                        1
      WARM_PREFIX_TARGET:                     1
      MY_NODE_NAME:                            (v1:spec.nodeName)
      MY_POD_NAME:                            aws-node-fx4sw (v1:metadata.name)
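
Comparing these settings against a healthy cluster by eye is tedious, so here is a rough sketch of how the comparison could be scripted. It assumes both clusters are reachable as kubeconfig contexts and that the container inside the DaemonSet is named aws-node. `aws_node_env` and `diff_aws_node_env` are hypothetical helper names. Variables populated through a field reference, such as MY_NODE_NAME, show up with an empty value.

```shell
# Hypothetical helper: print the aws-node container's env as NAME=value lines.
# $1 is a kubeconfig context name.
aws_node_env() {
  kubectl --context "$1" get daemonset aws-node -n kube-system \
    -o jsonpath='{range .spec.template.spec.containers[?(@.name=="aws-node")].env[*]}{.name}={.value}{"\n"}{end}'
}

# Diff the settings of two clusters; no output means they match.
# Usage: diff_aws_node_env <context-a> <context-b>
diff_aws_node_env() {
  a=$(mktemp) && b=$(mktemp) || return 2
  aws_node_env "$1" | sort > "$a"
  aws_node_env "$2" | sort > "$b"
  diff "$a" "$b"
  rc=$?
  rm -f "$a" "$b"
  return "$rc"
}
```

In my case a comparison like this would have printed nothing, since every option matched the working cluster.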

Searching on keywords from the error message,
I found the following reply in the aws/amazon-vpc-cni-k8s (GitHub) issue "[EKS] Pods stuck in ContainerCreating status after upgrading to Kubernetes version 1.30".

I am not sure what could have led to this stage. But you can downgrade the addon the previous version, and restart the pods, and upgrade the addons again.

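That issue was a case where things broke while the EKS cluster and its add-ons were being updated at the same time,
and the suggested fix was to downgrade the add-on to the previous version, restart the pods, and then upgrade the add-on again.

After reading this, I restarted all of the pod/aws-node pods under daemonset/aws-node, which resolved the problem.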
