Kubernetes version: v1.22.2
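CNI overview
CNI (Container Network Interface) is a CNCF project made up of a specification (SPEC), a library (libcni), and a set of plugins (bridge, macvlan, and so on). CNI is concerned only with container network connectivity: it creates the corresponding network resources when a Pod is created and removes them when the Pod is deleted. Put simply, CNI defines a set of interfaces and a set of rules. The interfaces are called by the CRI implementation (the container runtime), and the rules are what CNI plugins must follow.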
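The CNI specification
1. Defines the format of the network configuration file
2. Defines the protocol the container runtime uses to make requests to CNI plugins
3. Defines the execution parameters of the CNI plugin binaries
4. Defines plugin delegation, that is, handing some functionality to other plugins such as an IPAM plugin or multus
5. Defines the data types of the results that CNI plugins return
To avoid ambiguity, here is the original English wording from the specification: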
- A format for administrators to define network configuration.
- A protocol for container runtimes to make requests to network plugins.
- A procedure for executing plugins based on a supplied configuration.
- A procedure for plugins to delegate functionality to other plugins.
- Data types for plugins to return their results to the runtime.
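To make the first item concrete, here is a minimal sketch of a network configuration list and of how a runtime might read it. The JSON keys (cniVersion, name, plugins, type, ipam) come from the CNI specification; the Go types NetConfList and PluginConf, the network name example-net, and the subnet are illustrative assumptions, not libcni's actual types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PluginConf holds the keys this sketch cares about. Real plugin
// configurations carry additional, plugin-specific keys.
type PluginConf struct {
	Type string `json:"type"` // name of the plugin binary, e.g. "bridge"
	IPAM struct {
		Type string `json:"type"` // name of the delegated IPAM plugin
	} `json:"ipam"`
}

// NetConfList mirrors the top-level layout of a .conflist file.
type NetConfList struct {
	CNIVersion string       `json:"cniVersion"`
	Name       string       `json:"name"`
	Plugins    []PluginConf `json:"plugins"`
}

// exampleConf is a hypothetical configuration list such as an
// administrator might place in /etc/cni/net.d.
const exampleConf = `{
  "cniVersion": "0.3.1",
  "name": "example-net",
  "plugins": [
    {"type": "bridge", "bridge": "cni0",
     "ipam": {"type": "host-local", "subnet": "10.22.0.0/16"}},
    {"type": "portmap", "capabilities": {"portMappings": true}}
  ]
}`

// ParseConfList decodes a configuration list and checks the two
// things every list needs: a name and at least one plugin.
func ParseConfList(data []byte) (*NetConfList, error) {
	var list NetConfList
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, fmt.Errorf("parsing network configuration: %w", err)
	}
	if list.Name == "" || len(list.Plugins) == 0 {
		return nil, fmt.Errorf("configuration needs a name and at least one plugin")
	}
	return &list, nil
}

func main() {
	list, err := ParseConfList([]byte(exampleConf))
	if err != nil {
		panic(err)
	}
	for i, p := range list.Plugins {
		fmt.Printf("plugin %d: type=%s ipam=%q\n", i, p.Type, p.IPAM.Type)
	}
}
```

The second plugin has no ipam section, so its IPAM type prints as an empty string; only plugins that allocate addresses delegate to an IPAM plugin.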
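Overview diagram:
Docker and CNI
Docker is integrated through a CRI implementation as well, but that implementation (dockershim) lives inside the Kubernetes repository itself and is still there as of Kubernetes v1.23.0.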
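// Path format of the network namespace file; %v is filled with the PID of the sandbox container
dockerNetNSFmt = "/proc/%v/ns/net"
netnsPath := fmt.Sprintf(dockerNetNSFmt, pid)
// Code that invokes the CNI binary: c.exec locates and runs the plugin, and pluginPath is the path of the binary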
pluginPath, err := c.exec.FindInPath(net.Network.Type, c.Path)
invoke.ExecPluginWithResult(ctx, pluginPath, newConf.Bytes, c.args("ADD", rt), c.exec)
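The call above ultimately executes the plugin binary. According to the CNI specification, the runtime passes the per-container parameters as environment variables (CNI_COMMAND, CNI_CONTAINERID, CNI_NETNS, CNI_IFNAME, CNI_PATH), writes the network configuration to the plugin's stdin, and reads the JSON result from stdout. The following is a minimal sketch of that protocol, not libcni's implementation: RuntimeArgs, PluginEnv, and RunPlugin are hypothetical names, and the container ID and PID are made up.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"os"
	"os/exec"
)

// RuntimeArgs is a hypothetical stand-in for the per-container parameters.
type RuntimeArgs struct {
	ContainerID string
	NetNS       string // e.g. /proc/<pid>/ns/net
	IfName      string // interface to create inside the container, e.g. eth0
	PluginDirs  string // where plugin binaries are searched, e.g. /opt/cni/bin
}

// PluginEnv builds the environment variables that the CNI specification
// defines for one plugin invocation.
func PluginEnv(command string, rt RuntimeArgs) []string {
	return []string{
		"CNI_COMMAND=" + command,
		"CNI_CONTAINERID=" + rt.ContainerID,
		"CNI_NETNS=" + rt.NetNS,
		"CNI_IFNAME=" + rt.IfName,
		"CNI_PATH=" + rt.PluginDirs,
	}
}

// RunPlugin executes the plugin binary with the configuration on stdin
// and returns whatever the plugin printed on stdout.
func RunPlugin(ctx context.Context, pluginPath string, conf []byte, env []string) ([]byte, error) {
	var stdout, stderr bytes.Buffer
	cmd := exec.CommandContext(ctx, pluginPath)
	cmd.Env = append(os.Environ(), env...)
	cmd.Stdin = bytes.NewReader(conf)
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return nil, fmt.Errorf("plugin %s failed: %w (stderr: %s)", pluginPath, err, stderr.String())
	}
	return stdout.Bytes(), nil
}

func main() {
	rt := RuntimeArgs{
		ContainerID: "example-container-id",
		NetNS:       "/proc/1234/ns/net",
		IfName:      "eth0",
		PluginDirs:  "/opt/cni/bin",
	}
	// Only print the environment; actually running a plugin needs root
	// privileges and an existing network namespace.
	for _, kv := range PluginEnv("ADD", rt) {
		fmt.Println(kv)
	}
}
```

Because a plugin is just an executable driven by environment variables and stdin, it can also be exercised by hand from a shell, which is handy when debugging a failing ADD.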
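containerd and CNI
Compared with Docker's CNI integration, containerd's is simpler and more standard. Note the direction of the calls: the CRI implementation calls CNI (libcni), and CNI in turn invokes the concrete network plugin.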
RunPodSandbox
sandbox.NetNSPath = sandbox.NetNS.GetPath()
c.setupPodNetwork(ctx, &sandbox)
c.netPlugin.Setup(ctx, id, path, opts...)
network.Attach(ctx, ns)
n.cni.AddNetworkList(ctx, n.config, ns.config(n.ifName))
c.addNetwork(ctx, list.Name, list.CNIVersion, net, result, rt)
invoke.ExecPluginWithResult(ctx, pluginPath, newConf.Bytes, c.args("ADD", rt), c.exec)
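AddNetworkList at the end of this chain works on a list of plugins. Per the CNI specification, the runtime runs the plugins in order for ADD, injects the list-level name and cniVersion into each plugin's configuration, and hands the previous plugin's result to the next one under the prevResult key. The sketch below models only that threading logic. BuildPluginConf, PluginFunc, and EchoPlugin are hypothetical names, and EchoPlugin stands in for executing a real binary.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PluginFunc is a hypothetical stand-in for executing one plugin binary:
// it receives the JSON configuration and returns the JSON result.
type PluginFunc func(conf []byte) ([]byte, error)

// BuildPluginConf copies one plugin entry and injects the list-level
// name and cniVersion plus the previous plugin's result, if any.
func BuildPluginConf(name, cniVersion string, plugin map[string]any, prevResult json.RawMessage) ([]byte, error) {
	conf := make(map[string]any, len(plugin)+3)
	for k, v := range plugin {
		conf[k] = v
	}
	conf["name"] = name
	conf["cniVersion"] = cniVersion
	if prevResult != nil {
		conf["prevResult"] = prevResult
	}
	return json.Marshal(conf)
}

// AddNetworkList runs ADD for every plugin in order and returns the
// result of the last plugin.
func AddNetworkList(name, cniVersion string, plugins []map[string]any, run PluginFunc) (json.RawMessage, error) {
	var prev json.RawMessage
	for _, p := range plugins {
		conf, err := BuildPluginConf(name, cniVersion, p, prev)
		if err != nil {
			return nil, err
		}
		res, err := run(conf)
		if err != nil {
			return nil, fmt.Errorf("plugin %v: %w", p["type"], err)
		}
		prev = res
	}
	return prev, nil
}

// EchoPlugin is a fake plugin: it reports its own type and whether it
// was given a prevResult.
func EchoPlugin(conf []byte) ([]byte, error) {
	var c map[string]json.RawMessage
	if err := json.Unmarshal(conf, &c); err != nil {
		return nil, err
	}
	_, hasPrev := c["prevResult"]
	return json.Marshal(map[string]any{"from": c["type"], "sawPrevResult": hasPrev})
}

func main() {
	plugins := []map[string]any{{"type": "calico"}, {"type": "portmap"}}
	res, err := AddNetworkList("example-net", "0.3.1", plugins, EchoPlugin)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(res))
}
```

For DEL the specification prescribes the reverse order, so a real implementation would walk the same list backwards.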
Calico CNI implementation
At the time of writing, the latest version of the CNI specification is v1.0.0, while Calico supports up to v0.3.1.
skel.PluginMain(cmdAdd, nil, cmdDel,
cniSpecVersion.PluginSupports("0.1.0", "0.2.0", "0.3.0", "0.3.1"),
"Calico CNI plugin "+version)
// 1) Call the configured IPAM plugin to get IP address(es)
// 2) Configure the Calico endpoint
// 3) Create the veth, configuring it on both the host and container namespace.
// Get the name of the host-side calixxx interface
desiredVethName := k8sconversion.NewConverter().VethNameForWorkload(epIDs.Namespace, epIDs.Pod)
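The host-side interface name has to be stable for a given Pod and must fit the 15-character limit Linux imposes on interface names. The sketch below shows one way to derive such a name: a prefix followed by the first 11 hex characters of the SHA-1 of "namespace.pod". To my understanding this is close to what libcalico-go does with the default "cali" prefix, but treat the exact hashing scheme as an assumption and check the source before relying on it. The function signature here, with an explicit prefix argument, is hypothetical.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// VethNameForWorkload derives a deterministic interface name from the
// Pod's namespace and name. The hashing scheme is an assumption made
// for illustration.
func VethNameForWorkload(prefix, namespace, pod string) string {
	sum := sha1.Sum([]byte(namespace + "." + pod))
	// 4-character prefix + 11 hex characters = 15, the maximum length
	// of a Linux interface name.
	return prefix + hex.EncodeToString(sum[:])[:11]
}

func main() {
	name := VethNameForWorkload("cali", "default", "lugl-deploy-busybox-66f4444d68-kbgdc")
	fmt.Println(name, len(name))
}
```

Hashing instead of truncating the Pod name keeps names unique even when two Pods share a long common prefix, as Pods from the same Deployment do.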
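The figure below highlights what the DoNetworking function does:
This includes setting up the veth pair, configuring routes, enabling forwarding, and so on.
Calico IPAM plugin
// Register the ADD and DEL callbacks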
skel.PluginMain(cmdAdd, nil, cmdDel,
cniSpecVersion.PluginSupports("0.1.0", "0.2.0", "0.3.0", "0.3.1"),
"Calico CNI IPAM "+version)
utils.ResolvePools(ctx, calicoClient, conf.IPAM.IPv4Pools, true)
v6pools, err := utils.ResolvePools(ctx, calicoClient, conf.IPAM.IPv6Pools, false)
calicoClient.IPAM().AutoAssign(ctx, assignArgs)
c.autoAssign(ctx, args.Num4, args.HandleID, args.Attrs, args.IPv4Pools, 4, hostname, args.MaxBlocksPerHost, args.HostReservedAttrIPv4s)
c.prepareAffinityBlocksForHost(ctx, requestedPools, version, host, rsvdAttr)
c.determinePools(ctx, requestedPools, version, *v3n, maxPrefixLen)
c.blockReaderWriter.getAffineBlocks(ctx, host, version, pools)
c.assignFromExistingBlock(ctx, b, rem, handleID, attrs, host, false)
b.autoAssign(num, handleID, host, attrs, affCheck)
attrIndex := b.findOrAddAttribute(handleID, attrs)
c.blockReaderWriter.updateBlock(ctx, block)
The resource objects involved are IPPool, IPAMBlock, BlockAffinity, WorkloadEndpoint, and IPAMHandle.
Understanding IPAMBlock
// Resource definition of IPAMBlock
[root@10 cni]# kubectl explain ipamblocks.crd.projectcalico.org.spec
KIND: IPAMBlock
VERSION: crd.projectcalico.org/v1
RESOURCE: spec <Object>
DESCRIPTION:
IPAMBlockSpec contains the specification for an IPAMBlock resource.
FIELDS:
affinity <string>
allocations <[]> -required-
attributes <[]Object> -required-
cidr <string> -required-
deleted <boolean>
strictAffinity <boolean> -required-
unallocated <[]integer> -required-
[root@10 cni]# kubectl get ipamblocks.crd.projectcalico.org 10-244-5-128-26 -o yaml
apiVersion: crd.projectcalico.org/v1
kind: IPAMBlock
metadata:
  name: 10-244-5-128-26
  uid: abb02b81-3dfc-4a30-a0d9-58f15f24c47d
spec:
  affinity: host:10.10.101.91-slave
  allocations:
  - null
  - 0
  - null
  - null
  - 1
  # ... (omitted)
  - 2
  - null
  - 3
  # ... (omitted)
  - null
  - 6
  - null
  - null
  - 7
  - null
  - null
  - null
  - 5
  - 4
  - null
  attributes:
  - handle_id: vxlan-tunnel-addr-10.10.101.91-slave
    secondary:
      node: 10.10.101.91-slave
      type: vxlanTunnelAddress
  - handle_id: k8s-pod-network.014688d18778c72045e9c0b05260c90dda84a7faea109d01adee487e6b40d70e
    secondary:
      namespace: kube-system
      node: 10.10.101.91-slave
      pod: calico-kube-controllers-cf4844b67-sddbj
      timestamp: 2021-11-28 03:18:53.22987459 +0000 UTC
  - handle_id: k8s-pod-network.2393551f10bd7b4371499f53ce28815222814db3ae291c5fe865faebc291070d
    secondary:
      namespace: default
      node: 10.10.101.91-slave
      pod: lugl-deploy-busybox-66f4444d68-kbgdc
      timestamp: 2021-12-15 10:19:14.495804833 +0000 UTC
  - handle_id: k8s-pod-network.0e34072f8b17070afd0d5163284ae42f95379432cacfd030476cdc7f7fce2652
    secondary:
      namespace: default
      node: 10.10.101.91-slave
      pod: lugl-deploy-busybox-66f4444d68-5k2cm
      timestamp: 2021-12-15 10:19:14.524535051 +0000 UTC
  cidr: 10.244.5.128/26
  deleted: false
  strictAffinity: false
  unallocated:
  - 49
  - 35
  - 52
  - 60
  - 8
  - 40
  - 54
  # ... (omitted)
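attributes lists every resource object that holds an IP from this IPAMBlock, and the list is ordered.
For example, to find the IP of the resource at attributes[3], look for the entry in the allocations array whose value is 3. If allocations[5] = 3, that resource's IP is the address at offset 5 from the start of the block's CIDR; for a block with CIDR 10.244.5.128/26 that would be 10.244.5.133.
Conversely, when a resource object is deleted, its allocation in the IPAMBlock has to be released as well: first remove the corresponding entry from the attributes array, then set allocations[i] to nil, and finally append i to unallocated. That completes the release of the IP.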
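The bookkeeping described above can be sketched in a few lines of Go. This is a simplified model, not Calico's implementation: the Block type and its methods are hypothetical, attributes are reduced to plain handle IDs, and each allocation gets its own attribute entry (the real code can share one attribute between several allocations via findOrAddAttribute). Blocks are assumed to be small, as in the /26 example.

```go
package main

import (
	"fmt"
	"net/netip"
)

// Block is a simplified, hypothetical model of an IPAMBlock spec.
type Block struct {
	CIDR        netip.Prefix
	Allocations []*int   // ordinal -> index into Attributes, nil when free
	Attributes  []string // handle IDs; the real resource stores richer objects
	Unallocated []int    // free ordinals, handed out from the front
}

// NewBlock creates an empty block in which every ordinal is free.
func NewBlock(cidr string) (*Block, error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return nil, err
	}
	size := 1 << (p.Addr().BitLen() - p.Bits())
	b := &Block{CIDR: p.Masked(), Allocations: make([]*int, size)}
	for i := 0; i < size; i++ {
		b.Unallocated = append(b.Unallocated, i)
	}
	return b, nil
}

// OrdinalToIP returns the address at the given offset from the block's
// first address.
func (b *Block) OrdinalToIP(ordinal int) netip.Addr {
	addr := b.CIDR.Addr()
	for i := 0; i < ordinal; i++ {
		addr = addr.Next()
	}
	return addr
}

// Assign takes the first free ordinal, records the handle in Attributes
// and points Allocations[ordinal] at it.
func (b *Block) Assign(handleID string) (netip.Addr, error) {
	if len(b.Unallocated) == 0 {
		return netip.Addr{}, fmt.Errorf("block %s is full", b.CIDR)
	}
	ordinal := b.Unallocated[0]
	b.Unallocated = b.Unallocated[1:]
	b.Attributes = append(b.Attributes, handleID)
	idx := len(b.Attributes) - 1
	b.Allocations[ordinal] = &idx
	return b.OrdinalToIP(ordinal), nil
}

// Release follows the three steps from the text: drop the attribute,
// clear Allocations[ordinal], append the ordinal to Unallocated.
func (b *Block) Release(ordinal int) {
	if ordinal < 0 || ordinal >= len(b.Allocations) || b.Allocations[ordinal] == nil {
		return
	}
	removed := *b.Allocations[ordinal]
	b.Attributes = append(b.Attributes[:removed], b.Attributes[removed+1:]...)
	b.Allocations[ordinal] = nil
	// Attribute indexes after the removed entry shift down by one.
	for _, a := range b.Allocations {
		if a != nil && *a > removed {
			*a--
		}
	}
	b.Unallocated = append(b.Unallocated, ordinal)
}

func main() {
	b, err := NewBlock("10.244.5.128/26")
	if err != nil {
		panic(err)
	}
	ip0, _ := b.Assign("vxlan-tunnel-addr-example")
	ip1, _ := b.Assign("k8s-pod-network.example")
	fmt.Println(ip0, ip1, b.OrdinalToIP(5))
	b.Release(0)
	fmt.Println(len(b.Attributes), b.Unallocated[len(b.Unallocated)-1])
}
```

Appending released ordinals to the end of the free list means a freshly released address is reused last, which gives stale references to the old IP time to disappear.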
kubelet configuration options
HairpinMode
func NewContainerRuntimeOptions() *config.ContainerRuntimeOptions {
	dockerEndpoint := ""
	if runtime.GOOS != "windows" {
		dockerEndpoint = "unix:///var/run/docker.sock"
	}
	return &config.ContainerRuntimeOptions{
		ContainerRuntime:          kubetypes.DockerContainerRuntime,
		DockerEndpoint:            dockerEndpoint,
		DockershimRootDirectory:   "/var/lib/dockershim",
		PodSandboxImage:           defaultPodSandboxImage,
		ImagePullProgressDeadline: metav1.Duration{Duration: 1 * time.Minute},
		CNIBinDir:                 "/opt/cni/bin",
		CNIConfDir:                "/etc/cni/net.d",
		CNICacheDir:               "/var/lib/cni/cache",
	}
}
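CNIConfDir matters because it is where the network configuration is discovered. When the directory holds several configuration files, the kubelet documentation for this era states that the file that comes first in lexicographic order is used. The sketch below shows that selection under the assumption that only .conf, .conflist, and .json files count as configuration; FilterConfNames and ConfFiles are hypothetical helper names, and other runtimes may apply their own rules.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// FilterConfNames keeps the file names that look like CNI configuration
// and returns them in lexicographic order.
func FilterConfNames(names []string) []string {
	var out []string
	for _, n := range names {
		switch filepath.Ext(n) {
		case ".conf", ".conflist", ".json":
			out = append(out, n)
		}
	}
	sort.Strings(out)
	return out
}

// ConfFiles lists the candidate configuration files in dir, best first.
func ConfFiles(dir string) ([]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	var names []string
	for _, e := range entries {
		if !e.IsDir() {
			names = append(names, e.Name())
		}
	}
	files := FilterConfNames(names)
	for i, n := range files {
		files[i] = filepath.Join(dir, n)
	}
	return files, nil
}

func main() {
	files, err := ConfFiles("/etc/cni/net.d")
	if err != nil {
		fmt.Println("cannot read the configuration directory:", err)
		return
	}
	if len(files) == 0 {
		fmt.Println("no CNI configuration found")
		return
	}
	fmt.Println("configuration that would be used:", files[0])
}
```

This ordering is why plugin installers usually prefix their file with a number such as 10-: a lower number wins over configuration left behind by another plugin.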
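Useful trivia
Q: How is the lo interface inside a container created?
A: The kernel registers a number of per-network-namespace initialization functions, and one of them sets up the lo device. Whenever a new network namespace is created, the kernel automatically creates an lo device inside it.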
static __net_init int loopback_net_init(struct net *net)
{
	struct net_device *dev;
	int err;
	dev = alloc_netdev(0, "lo", NET_NAME_UNKNOWN, loopback_setup);
	dev_net_set(dev, net);
	// Register the loopback device lo
	err = register_netdev(dev);
	net->loopback_dev = dev;
	return 0;
}
struct pernet_operations __net_initdata loopback_net_ops = {
	.init = loopback_net_init,
};
register_pernet_device(&loopback_net_ops);
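Q: What does the onlink parameter of a route entry do?
A: When a route is added with the onlink parameter, the kernel no longer checks whether the next hop is directly reachable; it treats the gateway as attached to the given link.
References: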
https://github.com/containernetworking/cni
https://github.com/containernetworking/cni/blob/master/SPEC.md