-
Notifications
You must be signed in to change notification settings - Fork 243
Description
While debugging flaky e2e tests I've noticed that /var/log/azure-vnet.log has logs which are obviously from two different azure-vnet binaries.
For example, I ran:
KEEP_VMSS=1 go test -v -timeout 90m -run '^Test_AzureLinuxV2_AzureCNI$'
Here are the first 10 lines from azure-vnet.log:
2025/11/27 15:09:12 [5672] [cni] reboot time 2025-11-27 15:08:22 +0000 UTC
2025/11/27 15:09:12 [5672] Connection to telemetry socket failed: dial unix /var/run/azure-vnet-telemetry.sock: connect: no such file or directory
2025/11/27 15:09:12 [5672] [Azure-Utils] pkill -f azure-vnet-telemetry
2025/11/27 15:09:12 [5672] [Telemetry] Starting telemetry service process :/opt/cni/bin/azure-vnet-telemetry args:[-d /opt/cni/bin]
2025/11/27 15:09:12 [5672] [Telemetry] Telemetry service started
2025/11/27 15:09:12 [5692] [cni] reboot time 2025-11-27 15:08:22 +0000 UTC
2025/11/27 15:09:12 [5706] [cni] reboot time 2025-11-27 15:08:22 +0000 UTC
2025/11/27 15:09:12 [5672] Connected to telemetry service
2025/11/27 15:09:12 [5672] [cni-net] Plugin azure-vnet version v1.1.8.
2025/11/27 15:09:12 [5672] [cni-net] Running on Linux version 5.15.186.1-1.cm2 (root@CBL-Mariner) (gcc (GCC) 11.2.0, GNU ld (GNU Binutils) 2.37) #1 SMP Tue Jul 29 02:54:43 UTC 2025
And here are more lines from further down the log file:
{"level":"info","ts":"2025-11-27T15:09:27.399Z","caller":"./main.go:93","msg":"Environment variable set","pid":7191,"component":"cni-main","CNI_COMMAND":"ADD"}
{"level":"info","ts":"2025-11-27T15:09:27.401Z","caller":"store/json.go:196","msg":"Acquiring process lock","pid":7191,"component":"cni-store"}
{"level":"info","ts":"2025-11-27T15:09:27.401Z","caller":"store/json.go:215","msg":"Acquired process lock with timeout value of","pid":7191,"component":"cni-store","timeout":30}
{"level":"info","ts":"2025-11-27T15:09:27.402Z","caller":"telemetry/telemetrybuffer.go:383","msg":"Connected to telemetry service","pid":7191,"component":"cni-main"}
{"level":"info","ts":"2025-11-27T15:09:27.402Z","caller":"network/network.go:166","msg":"Plugin Info","pid":7191,"component":"cni-net","name":"azure-vnet","version":"v1.6.32"}
As can be seen, the format of the top lines is unstructured text whereas the bottom is JSON. In addition, we can see that the top lines are from v1.1.8 whereas the bottom is v1.6.32.
Turns out that nowadays we use a DaemonSet named azure-cns with an init container named cni-installer to deploy the Azure CNI to cluster nodes. Therefore, when the user specifies --network-plugin=azure when creating a cluster, RP passes none to AB here:
| if [ "${NETWORK_PLUGIN}" = "azure" ]; then |
As a result, the CSE script doesn't install the plugin.
However, looks like in the e2e tests we still specify azure as the network plugin in tests which use the Azure CNI:
AgentBaker/e2e/scenario_test.go
Lines 102 to 114 in 79d2559
| func Test_AzureLinuxV2_AzureCNI(t *testing.T) { | |
| RunScenario(t, &Scenario{ | |
| Description: "azurelinuxv2 scenario on a cluster configured with Azure CNI", | |
| Config: Config{ | |
| Cluster: ClusterAzureNetwork, | |
| VHD: config.VHDAzureLinuxV2Gen2, | |
| BootstrapConfigMutator: func(nbc *datamodel.NodeBootstrappingConfiguration) { | |
| nbc.ContainerService.Properties.OrchestratorProfile.KubernetesConfig.NetworkPlugin = string(armcontainerservice.NetworkPluginAzure) | |
| nbc.AgentPoolProfile.KubernetesConfig.NetworkPlugin = string(armcontainerservice.NetworkPluginAzure) | |
| }, | |
| }, | |
| }) | |
| } |
As a result, we now have two competing mechanisms for installing the CNI plugin in the e2e tests. I suspect that this has led to all sorts of race conditions and weird behavior.