Page 1
Self-driving infrastructure
Xiang Li
xiang.li@coreos.com | Head of distributed systems
Page 2
Topics
● Cluster management systems
● Today’s problems with operating cluster management systems
● A self-driving approach
Page 3
Motivation: microservices
● Increased operational cost
  ○ many components
  ○ dynamic dependencies
  ○ fast deployment iteration
● Solution: automation
Page 4
Cluster management system
● Automation
  ○ Scheduling
  ○ Deployment
  ○ Healing
  ○ Discovery/load balancing
  ○ Scaling
Page 5
Scheduling
Scheduler
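The slide's diagram is not preserved in this text. As a rough illustration of what a scheduler does, here is a minimal sketch that assumes a filter-then-score policy: drop nodes that cannot fit the request, then prefer the node with the most free CPU. The Node type, the schedule function, and the policy itself are assumptions made for this example, not the Kubernetes scheduler's actual algorithm.

```go
package main

import "fmt"

// Node is a hypothetical machine with some free CPU (in millicores).
type Node struct {
	Name    string
	FreeCPU int
}

// schedule returns the index of the node with the most free CPU that can
// still fit the request, or -1 if no node fits.
func schedule(nodes []Node, requestCPU int) int {
	best := -1
	for i, n := range nodes {
		if n.FreeCPU < requestCPU {
			continue // filter: this node cannot fit the pod
		}
		if best == -1 || n.FreeCPU > nodes[best].FreeCPU {
			best = i // score: prefer the emptiest node
		}
	}
	return best
}

func main() {
	nodes := []Node{{"node1", 500}, {"node2", 2000}, {"node3", 1000}}
	for _, req := range []int{800, 800, 800} {
		i := schedule(nodes, req)
		if i < 0 {
			fmt.Println("unschedulable")
			continue
		}
		nodes[i].FreeCPU -= req // bind: reserve the capacity on the chosen node
		fmt.Printf("pod (%dm) -> %s\n", req, nodes[i].Name)
	}
}
```

A real scheduler weighs many more signals (memory, affinity, taints); the point here is only the filter, score, bind shape.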
Page 8
Discovery
color=yellow
Page 9
Discovery
color=yellow
Select color = yellow
Page 10
Load balancing
yellow.mycluster
Select color = yellow
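The discovery and load-balancing slides can be sketched as follows: pick the instances labeled color=yellow, then spread requests across them. This is a minimal sketch under stated assumptions. Pod, selectPods, and roundRobin are hypothetical names, matching is equality-only, and round-robin is just one possible balancing policy.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified stand-in for a running instance.
type Pod struct {
	Name   string
	Labels map[string]string
}

// selectPods returns the pods whose labels contain every key/value pair in
// selector (equality-based matching only).
func selectPods(pods []Pod, selector map[string]string) []Pod {
	var out []Pod
	for _, p := range pods {
		match := true
		for k, v := range selector {
			if p.Labels[k] != v {
				match = false
				break
			}
		}
		if match {
			out = append(out, p)
		}
	}
	return out
}

// roundRobin returns a function that cycles through the given backends.
// It assumes backends is non-empty.
func roundRobin(backends []Pod) func() Pod {
	i := 0
	return func() Pod {
		p := backends[i%len(backends)]
		i++
		return p
	}
}

func main() {
	pods := []Pod{
		{"a", map[string]string{"color": "yellow"}},
		{"b", map[string]string{"color": "blue"}},
		{"c", map[string]string{"color": "yellow"}},
	}
	yellow := selectPods(pods, map[string]string{"color": "yellow"})
	next := roundRobin(yellow) // stands in for the yellow.mycluster endpoint
	for n := 0; n < 3; n++ {
		fmt.Println(next().Name)
	}
}
```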
Page 11
Healing
Controller manager
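Healing by a controller can be pictured as a reconcile step: compare the desired number of replicas with what is actually running and create the difference. The sketch below assumes that simple model. The reconcile function and the pod names are hypothetical and are not the controller manager's real code.

```go
package main

import "fmt"

// reconcile compares the desired replica count with the pods that are
// actually running and returns how many to create (positive) or delete
// (negative).
func reconcile(desired int, running []string) int {
	return desired - len(running)
}

func main() {
	running := []string{"web-1", "web-2", "web-3"}
	desired := 3

	running = running[:2] // simulate a node failure taking web-3 away
	if diff := reconcile(desired, running); diff > 0 {
		for i := 0; i < diff; i++ {
			name := fmt.Sprintf("web-new-%d", i+1)
			running = append(running, name) // create a replacement
			fmt.Println("created replacement", name)
		}
	}
	fmt.Println("running:", running)
}
```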
Page 14
People love automation!
Page 16
I hate Kubernetes!
Page 17
I hate to OPERATE Kubernetes!
Page 18
Kubernetes Architecture
Page 19
Operating Kubernetes
● Installation
● Upgrade
● Healing
● Scaling
● Security
● Monitoring
● ...
Page 20
Installation
- SSH
- Install kubelet
  - $pkgmanager install kubelet
- Install container runtime
  - $pkgmanager install [docker|rkt]
- Start kubelet
  - systemctl start kubelet
Page 21
Installation - master
- SSH
- Install scheduler
- Install controller manager
- Install API server
- Configure them correctly
- Start them
Page 22
Installation - etcd
- SSH
- Install etcd
- Configure it correctly
- Start it
Page 23
Installation
[Diagram: kops, kube-up.sh, kube-aws, ... -> AWS, GCP API -> node1, node2, node3]
Page 24
Upgrade
- SSH
- Upgrade container runtime
- Upgrade kubelet
Page 25
Upgrade - master
- SSH
- Upgrade master components
Page 26
Upgrade - etcd
- SSH
- Upgrade etcd
Page 27
Upgrade
[Diagram: kops -> AWS, GCP API -> node1, node2, node3]
Page 28
Rollback
[Diagram: ??? -> AWS, GCP API -> node1, node2, node3]
Page 29
Healing
[Diagram: AWS, GCP API; node2, node3 (node1 is gone)]
Page 30
Healing
[Diagram: Create node -> AWS, GCP API -> node1’, node2, node3]
Page 31
Healing
[Diagram: Install, Config -> node1’; AWS, GCP API; node2, node3]
Page 32
Problems
A lot of manual/semi-manual work
No standard way to approach all the problems
Do it wrong, lose the cluster!
Page 33
Self hosting
// gcc source code
#include <stdio.h>
int main(int argc, char **argv) {
    compile_c(argv[1]);
}
[Diagram: gcc builds gcc]
Page 34
Self hosting
// golang source code
package main
import "os"
func main() {
    compile_go(os.Args[1:])
}
[Diagram: go builds go]
Page 35
Self hosting
Page 36
Self hosting
$ uname -s
minix
$ gcc linux.c
Page 38
Self hosting
Page 39
Self hosting
$ uname -s
linux
$ gcc linux.c
Page 41
Self-hosted Kubernetes?
Page 42
What is self-hosted Kubernetes?
● Kubernetes manages its own core components
● Core components are deployed as native API objects
Page 43
Self-hosted k8s Architecture
Page 44
Why Self-host Kubernetes?
● Operational expertise around app management in k8s extends to k8s itself
  ○ E.g. scaling
● Bootstrapping simplified
● Simplified cluster life cycle management
  ○ E.g. updates
● Upstream improvements in Kubernetes directly translate to improvements in managing Kubernetes
Page 45
Simplify Node Bootstrap
On-host requirements become:
● Kubelet
● Container Runtime (docker, rkt, …)
Page 46
Any Distro Node Bootstrap
● Install kubelet
  ○ $pkgmanager install kubelet
● Install container runtime
  ○ $pkgmanager install [docker|rkt]
● Write kubeconfig
  ○ scp kubeconfig user@host:/etc/kubernetes/kubeconfig
● Start kubelet
  ○ systemctl start kubelet
Page 47
Simplify k8s lifecycle management
Manage your cluster with only kubectl
Upgrading a self-hosted Kubernetes cluster:
$ kubectl apply -f kube-apiserver.yaml
$ kubectl apply -f kube-scheduler.yaml
$ kubectl apply -f kube-controller-manager.yaml
$ kubectl apply -f kube-proxy.yaml
Page 48
Launching a self-hosted cluster
Need an initial control plane to bootstrap a self-hosted cluster
Bootkube:
● Acts as a temporary control plane just long enough to be replaced by a self-hosted control plane.
● Runs only on the very first node, then is not needed again.
github.com/kubernetes-incubator/bootkube
Page 49
How Bootkube Works
Page 50
[Diagram: etcd, Kubelet]
Pages 51-53
[Diagram: Bootkube (API Server, Scheduler, Controller Manager); etcd, Kubelet]
Page 54
[Diagram: Bootkube API Server -> Create: Deployment, DaemonSet, Service, Secret; etcd, Kubelet]
Pages 55-56
[Diagram: Bootkube (API Server, Scheduler, Controller Manager); etcd, Kubelet; Pods: API Server, Scheduler, Controller Manager]
Pages 57-58
[Diagram: etcd, Kubelet; Pods: API Server, Scheduler, Controller Manager (Bootkube is gone)]
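The diagram sequence above can be read as four phases: a temporary control plane starts, it accepts the API objects that describe the real control plane, the self-hosted pods come up, and the temporary control plane exits. The following is a toy model of that hand-off, not Bootkube's implementation. The Cluster type and bootstrap function are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// Cluster is a toy model of which control planes are currently running.
type Cluster struct {
	TempControlPlane bool     // the temporary components are up
	Objects          []string // API objects submitted so far
	SelfHostedReady  bool     // self-hosted control-plane pods are running
}

// bootstrap walks through the phases shown on the slides and returns a
// human-readable log of what happened.
func bootstrap(c *Cluster) []string {
	var log []string

	c.TempControlPlane = true
	log = append(log, "start temporary API server, scheduler, controller manager")

	c.Objects = []string{"Deployment", "DaemonSet", "Service", "Secret"}
	log = append(log, fmt.Sprintf("create %v through the temporary API server", c.Objects))

	c.SelfHostedReady = true // a real tool would wait for the pods here
	log = append(log, "self-hosted control-plane pods are running")

	c.TempControlPlane = false
	log = append(log, "temporary control plane exits")
	return log
}

func main() {
	c := &Cluster{}
	for i, step := range bootstrap(c) {
		fmt.Printf("%d. %s\n", i+1, step)
	}
}
```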
Page 59
But wait! There’s more!
You can even self-host etcd!
https://coreos.com/blog/introducing-the-etcd-operator.html
https://github.com/coreos/etcd-operator
Page 60
How to bootstrap self-hosted etcd
Page 61
[Diagram: Bootkube (API Server, Scheduler, Controller Manager, etcd); Kubelet]
Page 62
[Diagram: as on page 61, plus Pods: API Server, Scheduler, Controller Manager, etcd operator]
Page 63
[Diagram: as on page 62, with the label "Seed node"]
Page 64
[Diagram: as on page 62, plus an etcd pod; label "Add Member"]
Page 65
[Diagram: as on page 64; label "Remove member"]
Page 66
[Diagram: Kubelet; Pods: API Server, Scheduler, Controller Manager, etcd operator, etcd (Bootkube is gone)]
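The "Add Member" and "Remove member" labels suggest a membership hand-off: grow the etcd cluster from the seed member to include a member running as a pod, then drop the seed. Below is a minimal sketch of that ordering over a plain list of member names. The pivot function and the member names are hypothetical, and no real etcd or etcd operator API is used.

```go
package main

import "fmt"

// pivot moves a one-member etcd cluster from the bootstrap seed member to a
// member running as a pod. The new member is added before the seed is
// removed, so the data always lives on at least one member.
func pivot(members []string, seed, selfHosted string) ([]string, error) {
	if len(members) != 1 || members[0] != seed {
		return nil, fmt.Errorf("expected only the seed member %q, got %v", seed, members)
	}

	members = append(members, selfHosted) // "Add Member"
	// A real system would wait here until the new member has caught up.

	var out []string
	for _, m := range members { // "Remove member"
		if m != seed {
			out = append(out, m)
		}
	}
	return out, nil
}

func main() {
	members, err := pivot([]string{"seed"}, "seed", "etcd-pod-0")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("members after pivot:", members)
}
```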
Page 67
Disaster Recovery
Node failure in HA deployments (Kubernetes)
Partial loss of control plane components (Kubernetes)
Power cycling the entire control plane (Kubernetes)
Permanent loss of control plane (External tool)
Page 68
Disaster Recovery
Permanent loss of control plane
● Similar situation to initial node bootstrap, but utilizing existing etcd state or etcd backup.
● Need to start a temporary replacement api-server
  ○ Could be binary, static pod, new tool, bootkube, etc.
● Recovery once etcd+api is available can be done via kubectl (as seen previously)
Page 69
Self-Driving Kubernetes
Page 70
Self driving
- A self-hosted cluster launched via Bootkube
- Upgraded via Kubernetes APIs and an Operator
- Automated: single-button or fully automatic
Page 71
Kubernetes Version Operator
Cluster is running v1.4.3 and configured to run v1.4.5
● API Server is v1.4.3
● Scheduler is v1.4.3
Differences from desired config
● API Server should be v1.4.5
● Scheduler should be v1.4.5
How to get there
● Upgrade all API server daemons to v1.4.5 safely, one-by-one
● Upgrade all Scheduler Deployments to v1.4.5
● Update status to v1.4.5
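The three blocks on this slide (current state, differences, how to get there) follow an observe, diff, act pattern. Here is a minimal sketch of that loop, assuming each component simply reports the version of each of its instances. Component, outdated, and upgrade are hypothetical names, and the health check a real operator would perform between steps is only marked by a comment.

```go
package main

import "fmt"

// Component is a hypothetical view of one control-plane component.
type Component struct {
	Name      string
	Instances []string // version reported by each running instance
}

// outdated returns the names of components that have at least one instance
// not running the desired version.
func outdated(cs []Component, desired string) []string {
	var names []string
	for _, c := range cs {
		for _, v := range c.Instances {
			if v != desired {
				names = append(names, c.Name)
				break
			}
		}
	}
	return names
}

// upgrade moves outdated instances to the desired version one at a time and
// returns how many instances it changed.
func upgrade(cs []Component, desired string) int {
	changed := 0
	for i := range cs {
		for j := range cs[i].Instances {
			if cs[i].Instances[j] != desired {
				cs[i].Instances[j] = desired // a real operator would wait for health here
				changed++
			}
		}
	}
	return changed
}

func main() {
	cluster := []Component{
		{"API Server", []string{"v1.4.3", "v1.4.3", "v1.4.3"}},
		{"Scheduler", []string{"v1.4.3"}},
	}
	desired := "v1.4.5"
	fmt.Println("differences:", outdated(cluster, desired))
	fmt.Println("instances upgraded:", upgrade(cluster, desired))
	fmt.Println("differences after:", outdated(cluster, desired))
}
```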
Page 72
The infrastructure
Workload driven
Automation driven
Easy to manage: self driving approach (Today’s topic)
Security focused
Page 73
Xiang Li
xiang.li@coreos.com
Thank you!