Have you ever wondered how the Docker NAT networking mechanism really works? How does it turn its seemingly magical capabilities into practical reality? A few weeks ago I found myself asking the same question, and today I’m ready to share my insights with you.
Let’s start from the beginning. Docker is a set of PaaS products that use OS-level virtualization to run software in containers: isolated, self-contained bundles of filesystem, software, configuration, and libraries. Additionally, to quote the official Docker documentation:
“Containers can communicate with each other through well-defined channels”
What is “well-defined”? What are the “channels” and how do they communicate in practice? In this detailed blog post, I will try to find the answers to these questions.
The Docker networking mechanism was built in a very similar way to that of the OS kernel. It uses many of the same concepts, benefits and capabilities, such as abstractions, but extends them to fit its own needs.
Before we dive deeper into the technical analysis, let’s make sure that you are familiar with some important definitions:
- Layer 2: the data link layer, a protocol layer that transfers frames between nodes on the same local area network segment. An example protocol at this layer is ARP, which resolves an IP address to the MAC address that owns it.
- Layer 3: the network layer, a routing layer that forwards packets between hosts, including hosts on different networks. Example protocols at this layer are IP and ICMP (used by the ping command).
- NAT: Network Address Translation provides a mapping from one IP address (or subnet) to another. Typically, NAT lets a large private network reach the Internet through a single public IP address. To achieve this, NAT maintains a set of translation rules (generally speaking, address and port translation, including masquerading).
- Bridge: a network bridge is a device (possibly a virtual one) that connects two or more interfaces into a single flat broadcast network. The bridge maintains a table, the forwarding information base, whose entries pair a MAC address with the interface behind which it was seen (for example, a record might look like MAC_1 → IF_1).
- Network Namespace: with namespacing, the kernel can logically separate its processes into multiple network “areas”. Each namespace looks like a standalone network, with its own stack, Ethernet devices, routes and firewall rules. The kernel creates a default namespace at boot, and unless stated otherwise every process is spawned/forked in it. Every child process inherits its namespace from its parent.
- Veth: a virtual Ethernet device is a virtual device that acts as a tunnel between network namespaces. Veth devices are created in interconnected pairs, and whatever is sent on one end is received directly on the other. This is a Linux kernel concept; on Windows, container networking works differently.
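To make the bridge definition more concrete, here is a minimal Python sketch of a forwarding information base. It is a toy model, not how the Linux bridge is implemented: the class name `LearningBridge` and the port and MAC labels are made up for illustration, and real bridges also age out entries and handle broadcast addresses.

```python
class LearningBridge:
    """Toy model of a bridge's forwarding information base (FIB)."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.fib = {}  # MAC address -> port on which it was last seen

    def receive(self, in_port, src_mac, dst_mac):
        """Learn the sender's port, then return the set of ports to forward to."""
        self.fib[src_mac] = in_port
        out_port = self.fib.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood to every port except the ingress one.
            return self.ports - {in_port}
        if out_port == in_port:
            # Destination sits behind the ingress port; nothing to forward.
            return set()
        return {out_port}


bridge = LearningBridge(["IF_1", "IF_2", "IF_3"])
first = bridge.receive("IF_1", "MAC_1", "MAC_2")   # MAC_2 not learned yet: flood
second = bridge.receive("IF_2", "MAC_2", "MAC_1")  # MAC_1 already learned: one port
print(sorted(first), sorted(second), bridge.fib)
```

The first frame is flooded because the bridge has not seen MAC_2 yet; the reply goes to a single port because the MAC_1 → IF_1 record already exists.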
Now that we are familiar with the basic terms, let’s start with our first observation. If you install the Docker daemon and run the following command:
ip link show
You might notice that something has changed:
But what exactly changed? What is this MAC address? (check here that it’s not a physical NIC interface). docker0 is a virtual bridge interface created by Docker, which picks an address and subnet for it from a predefined private range. By default, Docker containers are connected to this bridge and use the NAT rules created by Docker to communicate with the outside world. Remember the “channels” I mentioned above? Well, these channels are actually veth “tunnels” (a bidirectional connection between each container’s namespace and the docker0 bridge).
For example, let’s analyze a simple Debian container:
docker run --rm -it debian:stable-slim
When drilling down into its network setup and configuration, we learn that the container (like any other Docker container) runs in isolation, so it has its own network namespace:
I highlighted the crucial points above: the container has a layer 2 interface named eth0 and routes traffic for any IP address to the default gateway at 172.17.0.1, which is, unsurprisingly, our docker0 interface:
A corresponding veth (identifier vethad06c29) was created to transfer traffic between the container and the bridge:
ip link show | grep veth
Moreover, as you can see, docker0 manages the LAN subnet 172.17.X.X with default gateway 172.17.0.1 (our container’s IP above was 172.17.0.3, which falls inside that subnet). To conclude, we now understand how multiple containers can reach each other via the Docker bridge and veth tunneling:
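The subnet membership claim is easy to check with Python’s standard `ipaddress` module. This is only a sketch: it assumes the bridge subnet is 172.17.0.0/16, and the helper name `on_bridge_lan` and the outside address 192.0.2.1 (a documentation-only address) are made up for illustration.

```python
import ipaddress

# Assumed bridge subnet, plus the gateway and container addresses from the text.
docker0_subnet = ipaddress.ip_network("172.17.0.0/16")
gateway = ipaddress.ip_address("172.17.0.1")
container = ipaddress.ip_address("172.17.0.3")
outsider = ipaddress.ip_address("192.0.2.1")  # hypothetical external host


def on_bridge_lan(addr):
    """Return True if addr belongs to the docker0 subnet."""
    return addr in docker0_subnet


print(on_bridge_lan(gateway), on_bridge_lan(container), on_bridge_lan(outsider))
```

Addresses inside the subnet can be reached directly over the bridge; anything else has to go through the default gateway.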
Docker and Kernel Networking
How can containers transfer data to the kernel, and from there, to the outside world? Let’s take a closer look at the process as we cover two network manipulation techniques that Docker uses to achieve its external communication capability:
- Port Forwarding: publishes a container port on the host, so traffic arriving at the host port is forwarded to the container.
- Host Networking: removes the container’s network namespace isolation, so the container shares the Docker host’s network stack.
The examples below will demonstrate the communication pipeline for each use case.
Let’s review the current state of the kernel NAT rules. We’ll look only at the nat table, which is consulted when a packet creates a new connection.
sudo iptables -t nat -L -n
We can see five rule sections (a.k.a. chains): PREROUTING, INPUT, OUTPUT, POSTROUTING and DOCKER. We will focus only on PREROUTING, POSTROUTING and DOCKER:
- PREROUTING — rules altering packets before they come into the network stack (immediately after being received by an interface).
- POSTROUTING — rules altering packets before they go out from the network stack (right before leaving an interface).
- DOCKER — rules altering packets before they enter/leave the Docker bridge interface.
The PREROUTING chain hands packets to the DOCKER chain before they enter the network stack. Currently, the only rule in the DOCKER chain is RETURN (control goes back to the calling chain). The POSTROUTING chain states that packets with a source IP in the Docker subnet (e.g. 172.17.X.X), going to any destination IP, are handed to the MASQUERADE target, which replaces the source IP with the outgoing interface’s IP.
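The effect of the POSTROUTING masquerade rule can be modeled in a few lines of Python. This is a simplified sketch, not the kernel’s implementation: it ignores ports, interfaces and connection tracking, and the `Packet` class, the `masquerade` function and the host address 192.0.2.10 are hypothetical names and values chosen for illustration.

```python
import ipaddress
from dataclasses import dataclass, replace

DOCKER_SUBNET = ipaddress.ip_network("172.17.0.0/16")
HOST_IFACE_IP = "192.0.2.10"  # hypothetical IP of the host's outgoing interface


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


def masquerade(packet, out_iface_ip):
    """Rewrite the source IP if the packet originates from the Docker subnet."""
    if ipaddress.ip_address(packet.src) in DOCKER_SUBNET:
        return replace(packet, src=out_iface_ip)
    return packet  # rule does not match; packet is left untouched


from_container = masquerade(Packet("172.17.0.2", "198.51.100.7"), HOST_IFACE_IP)
from_host = masquerade(Packet("10.0.0.5", "198.51.100.7"), HOST_IFACE_IP)
print(from_container, from_host)
```

Only the packet that comes from the container subnet gets a new source address; the other one passes through unchanged.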
Now, let’s create a lightweight Python HTTP webserver container listening on port 5000, and publish that port:
docker run -p 5000:5000 --rm -it python:3.7-slim python3 -m http.server 5000 --bind=0.0.0.0
And again, let’s review the NAT rules:
sudo iptables -t nat -L -n
We can see two main differences (marked) from the original NAT configurations:
- POSTROUTING: a new MASQUERADE rule was added. MASQUERADE, in a nutshell (a future blog post will elaborate, stay tuned!), is like the SNAT (source NAT) target, but instead of rewriting the source IP to a static address given by the “to-source” option, it uses the IP of the outgoing interface, which is looked up dynamically. Back to our case: traffic from IP address 172.17.0.2 (the container IP) with dpt (destination port) 5000 will have its source rewritten to the interface IP.
- DOCKER: a new DNAT (destination NAT) rule was added. DNAT is commonly used to publish a service from an internal network on an external IP. The rule states that every packet from any IP with destination port 5000 will have its destination rewritten to the internal IP 172.17.0.2, port 5000.
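The DNAT rule can be sketched the same way. Again this is a toy model under stated assumptions: the `Packet` class, the `DNAT_RULES` table and the `docker_chain` function are made up for illustration, and a fall-through stands in for the RETURN target.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src_ip: str
    dst_ip: str
    dst_port: int


# Published port -> (container IP, container port), as in the rule described above.
DNAT_RULES = {5000: ("172.17.0.2", 5000)}


def docker_chain(packet):
    """Apply the first matching DNAT rule, or leave the packet as-is (RETURN)."""
    target = DNAT_RULES.get(packet.dst_port)
    if target is None:
        return packet
    container_ip, container_port = target
    return replace(packet, dst_ip=container_ip, dst_port=container_port)


published = docker_chain(Packet("127.0.0.1", "127.0.0.1", 5000))
other = docker_chain(Packet("127.0.0.1", "127.0.0.1", 8080))
print(published, other)
```

A packet for port 5000 ends up addressed to the container, while a packet for an unpublished port keeps its original destination.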
To test the rule’s behavior, we will send an HTTP request to our webserver:
curl -XGET 'http://localhost:5000'
The sniffing trace looks like the following:
As you can see, our HTTP request is divided into two sequential TCP streams, one on the loopback interface and the second on the docker0 bridge. The pipeline in detail:
1. The HTTP GET request goes down from the application layer to the transport layer (as a TCP request).
2. The destination IP is localhost (the resolver maps it to 127.0.0.1). The source IP is also localhost, and therefore the request is sent on the loopback interface.
3. The PREROUTING NAT rule matches any IP (including 127.0.0.1) on any interface (e.g. loopback) and hands the request to the DOCKER chain.
4. In the DOCKER chain there is a DNAT rule, which rewrites the packet destination to 172.17.0.2:5000.
5. The POSTROUTING rule masquerades packets with source IP 172.17.0.2 and port 5000 by changing the source IP to the interface IP. As a result of this modification, the packet goes out through the docker0 interface.
6. Now the packet has arrived at the docker0 interface (after steps 3 and 5 were applied to it). With the help of the veth tunnel, the gateway IP (172.17.0.1, which is effectively the packet source) establishes a TCP connection with the container IP (172.17.0.2).
7. The webserver is bound to 0.0.0.0 and listens on port 5000, so it receives the request on its eth0 interface and answers it.
8. From there, the response generally travels the same path in the opposite direction.
Let’s create a lightweight Python HTTP webserver container listening on port 5000 again. This time we change the Docker networking setup with the --net=host flag.
docker run --net=host --rm -it python:3.7-slim python3 -m http.server 5000 --bind=0.0.0.0
Again, let’s review the NAT rules:
sudo iptables -t nat -L -n
We can see two differences from the NAT configuration in the port forwarding example:
- POSTROUTING: the MASQUERADE rule for IP 172.17.0.2 is gone.
- DOCKER: the DNAT rule is gone as well.
The sniffing trace looks like the following:
We can see that the packet never leaves the loopback interface: the --net=host flag removes the container’s separate network namespace, so the container uses the same namespace (and the same loopback interface) as the host.
Let’s analyze this last case by reviewing the detailed pipeline:
Stages 1–3 remain as in the port forwarding pipeline.
4. The DOCKER chain only has a RETURN target for any IP, so no rule applies here and control returns to the calling chain.
5. The POSTROUTING rule masquerades only packets whose source IP is in subnet 172.17.0.0/16. Our packet’s source is the loopback interface IP (127.0.0.1), so it keeps that source.
6. We are left with source IP 127.0.0.1 and a webserver bound to 0.0.0.0 on port 5000 in the same network stack, so the packet stays on the loopback interface and the webserver behaves as if it were running directly on the host.
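The last step can be reproduced without Docker at all. The sketch below is an approximation of the host networking case, not the container setup itself: it starts a plain TCP server bound to 0.0.0.0 in the current network stack, connects to it through 127.0.0.1, and records the source IP the server sees. The function name `serve_once` and the use of an OS-chosen free port (instead of 5000) are assumptions made to keep the example self-contained.

```python
import socket
import threading


def serve_once(server, seen):
    """Accept one connection, record the peer's IP and send a short reply."""
    conn, peer = server.accept()
    with conn:
        seen["peer_ip"] = peer[0]
        conn.sendall(b"ok")


server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("0.0.0.0", 0))  # all interfaces; port 0 lets the OS pick a free port
server.listen(1)
port = server.getsockname()[1]

seen = {}
worker = threading.Thread(target=serve_once, args=(server, seen))
worker.start()

# Connect through the loopback address, as curl does with localhost.
with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
    reply = client.recv(16)

worker.join(timeout=5)
server.close()
print(seen["peer_ip"], reply)
```

Because client and server share one network stack, the server sees the loopback address as the source and no address translation is involved.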
I hope you found this detailed analysis interesting and insightful.
I will be more than happy to accept your feedback via [email protected]