Preface

To my girls: wife, daughter, and mom - for their patience and love

Who should read this book?

To be honest, I don't have vast practical experience in building distributed systems beyond the one most popular among web developers (hint: client-server). But a usual prerequisite for acquiring this practical experience is some non-basic knowledge of the topic. So this is the classical chicken-and-egg problem: to work on a distributed software product you are usually required to have some related projects in your portfolio. Writing this book is my way to close this gap.

That said, I assume that readers have developed web projects and already have some experience with vertical and horizontal scaling, containerization, data races, and classical algorithms and data structures. I will try my best to introduce all the vocabulary used.

I also do not devote much time to single-machine patterns, as I believe you have enough working practice with them.

Why yet another book on distributed systems?

I’ve read some excellent books devoted to distributed systems or related topics. Among them:

All the books mentioned above have their own view on which topics should be covered and at what level. Martin Kleppmann's outstanding book (highly recommended) obviously focuses on data-related topics.

Even though I had read the books above, I still had no clear mental picture or systematic view of the Raft algorithm, Lamport clocks, and other distributed buzzwords. So this book tries to cover the important topics (the choice is subjective, of course) and to serve as a practical guide for busy engineers striving to grasp core concepts in a timely manner.

True understanding comes with real doing, so practice through exercises is a necessary part of learning. I have chosen Rust1 as the language to use throughout the book, as I believe that Rust incorporates modern best practices and solid theoretical achievements. Experienced developers can dive in without further ado. But for those who prefer a more gradual introduction, I can recommend the following excellent free resources to start with:

Why should we learn about distributed systems?

I admit that you already know the answer if you've started reading the book. But for the sake of completeness let's list some points:

  • modern scalable products inevitably rely on the concepts discussed further. So if you want to get working experience in modern tech companies, or even build your own product with high load and fault tolerance in mind, you should grasp the basics2;
  • expecting your software to become distributed at some point in the future can radically improve your architecture decisions and even determine your preferred technology stack;
  • the emerging Web 3.0 is built around the idea of decentralization, which is tightly connected with distribution. So at the very least you will be able to understand the underlying technologies and explore them deeper.

The goal of the book follows from the points above: to get the reader familiar with selected topics on distributed systems, with a strong emphasis on practice.

Errata, typos, bugs and other feedback

I am not a native English speaker, so I will be glad to get any feedback on correct usage of English grammar and phrasing. Any suggestions, improvements, and fixes are also welcome. You can open merge requests at the book's GitHub repository or email me at {author's name}.{author's last name}@gmail.com.

Acknowledgments

I would like to say many kind words to the people who influenced me as a software engineer during my career. First of all, my friend Farit Sadykov, who helped me a lot to become a developer. Also my colleagues, especially Pavel Antonov, Nikolay Yurov, Vladimir Galkin, Pavel Gorbunov, Andrey Belov, Evgeny Orlov, Ilya Zhigalko, Slava Cheremushkin, Kirill Zhuravlev, Ilya Scherbakov, Dariusz Antoniuk, Ruud van Asseldonk, Michael Livshin.

Also many thanks to open source enthusiasts creating wonderful products, books, and articles, generously sharing knowledge with the rest of the world. We truly stand on the shoulders of giants.

Maksim Ryndin

Montenegro, 2022.

1

This book is not affiliated with the Rust Foundation

2

Being scalable is not only about software and infrastructure when adding more servers. It's also about your team being ready to grow with the growing traffic. So being prepared to go distributed is a must for every team member in terms of some introductory learning, automated pipelines, merge reviews, etc. Please watch the very informative presentation by Lisa Guo on scaling Instagram.

Introduction

The big idea is “messaging” <..> The key in making great and growable systems is much more to design how its modules communicate rather than what their internal properties and behaviors should be. Alan Kay, creator of Smalltalk1

What is distributed?

Distributed system is any system in which a number of independent interconnected computers can cooperate in order to achieve a common goal. Oxford Dictionary of Computer Science

If we have two virtual machines (VMs) with a web server on the first VM and a database server on the second one, do we consider such a system distributed? If the VMs are on the same physical node, is it a distributed system? If the web server and the database are running in separate containers and these containers are on different physical nodes, then this system is distributed. But then a container orchestrator moves both containers onto the same node. Does the system of two connected containers lose its distributed flavor?

There are many questions like these, and for the purposes of this book we will use the following definition of a distributed system.

A distributed system is software which has at least two components interacting with each other only over a network.

If not otherwise specified, we will call the components nodes, and a group of connected nodes is called a cluster.

Even if the network used for communication between the components is, at the moment, a loopback interface ("localhost"), it can become a real network in the next deployment.

Why do we need to make our software distributed?

At the very least, to make it highly available at the cost of increased complexity and management effort.

Overview of the book

TODO

2

In this case we have a complete graph, so every node has connections with the other n-1 nodes. That gives n*(n-1) connection endpoints, but as we count every edge twice, the total number of edges is n*(n-1)/2.

Network stack

Our definition of a distributed system mentions a network as a medium for communication. This medium is a very interesting and complex subject per se but here we only review some core concepts relevant to our study.

Physically, nodes (or hosts; think computers) in the network are connected by means of network interface controllers (NIC, also network adapter) - special hardware1 allowing a computer to connect to the network via a cable (Ethernet) or radio (WiFi)2. This physical layer is called the link layer in the TCP/IP stack3. Nodes at the link layer are identified by a Media Access Control (MAC) address - a unique hardware identifier of every network interface of the host4.

We can use the ifconfig utility (or the more modern ip addr and ip link show) to manage network interfaces. Let's describe the most common fields of its output:

  • <UP,BROADCAST,RUNNING,MULTICAST> means that the interface supports broadcasting and multicasting, is loaded (UP as opposed to DOWN), and is ready to accept connections (RUNNING). Other popular options are LOOPBACK (it is a loopback interface) and PROMISC (the interface is in promiscuous mode, capturing all packets in the network).
  • mtu 1500 - the Maximum Transmission Unit is 1500 bytes, so an IP packet larger than 1500 bytes will be fragmented.
  • ether bc:d0:74:ec:76:17 means the MAC is bc:d0:74:ec:76:17
  • inet 192.168.43.174 netmask 0xffffff00 broadcast 192.168.43.255 means IPv4 address 192.168.43.174 with network mask 255.255.255.0, so we can have hosts 192.168.43.1-254; 192.168.43.255 is the broadcast address and 192.168.43.0 is the network itself (see the sketch after this list).
  • inet6 fe80::88f:5ab4:1ad8:eed5 prefixlen 64 secured scopeid 0xe means IPv6 address fe80::88f:5ab4:1ad8:eed5
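To make the netmask arithmetic concrete, here is a minimal sketch (not part of the book's netstack code; the addresses just mirror the ifconfig output above) deriving the network address, the broadcast address, and the number of usable hosts from an IPv4 address and its mask:

use std::net::Ipv4Addr;

fn main() {
    let addr = Ipv4Addr::new(192, 168, 43, 174);
    let mask = Ipv4Addr::new(255, 255, 255, 0);

    let addr_bits = u32::from(addr);
    let mask_bits = u32::from(mask);

    // The network address keeps only the bits covered by the mask.
    let network = Ipv4Addr::from(addr_bits & mask_bits);
    // The broadcast address sets all host bits to 1.
    let broadcast = Ipv4Addr::from(addr_bits | !mask_bits);
    // All combinations of host bits except the network and broadcast addresses.
    let usable_hosts = 2u32.pow((!mask_bits).count_ones()) - 2;

    // Prints: network 192.168.43.0, broadcast 192.168.43.255, usable hosts 254
    println!("network {network}, broadcast {broadcast}, usable hosts {usable_hosts}");
}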

So we have a network of nodes (also called a LAN - local area network5). How do we interconnect networks (to get a WAN - wide area network) so that we can send a packet of data (called a datagram) between two hosts in different networks? Here a routing problem arises. Computers called routers6 relay datagrams from source to destination across network borders7. But how do we distinguish a host in one network from a host in another network? We need some addressing at the internet layer. IPv4 addressing assigns each host an IP address of the form xx.xx.xx.xx (4 octets of 8 bits each). But with 2^32 unique IP addresses we cannot address all hosts, so IPv6 (128 bits) was developed. IPv4 address exhaustion was partially mitigated by so-called Network Address Translation (NAT) and private IPs8, which results in a violation of the end-to-end principle9. You can check the current adoption of IPv6 with Google's data. The internet layer is independent of the network topology and of the physical way hosts are connected.

Within a LAN we need to establish a mapping between the IP addresses of hosts and their corresponding hardware addresses (MAC) to send a packet from one host to another. Such a mapping is called the ARP table (or ARP cache) and is filled at each host by Address Resolution Protocol (ARP) broadcast messages. In Linux/Unix the ARP table can be manipulated with the arp utility. For IPv6, the Neighbor Discovery Protocol handles this.

Routing is the mapping of IP addresses to interfaces and can be displayed with netstat -rn. You can trace a packet's path with tracepath.

After we have identified a route between two hosts, we can use a transport protocol to, at the very least, specify ports that connect specific application processes running on the hosts involved. Because the internet layer is only responsible for routing and not for the reliability of communication, transport-layer protocols can also offer reliability mechanisms: congestion control, preserving data ordering, retransmission of lost packets, and packet deduplication.

So we have a connection between hosts and we can exchange application data over the transport protocol.
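To see ports in action, here is a minimal sketch (not part of the book's code; port 7878 is an arbitrary choice) of a TCP echo server and a client talking to it over the loopback interface:

use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

fn main() -> std::io::Result<()> {
    // The server binds to a port: this is how the transport layer maps
    // incoming data to a specific application process on the host.
    let listener = TcpListener::bind("127.0.0.1:7878")?;

    thread::spawn(move || {
        for mut stream in listener.incoming().flatten() {
            let mut buf = [0u8; 1024];
            // Echo everything back until the client closes the connection.
            while let Ok(n) = stream.read(&mut buf) {
                if n == 0 {
                    break;
                }
                let _ = stream.write_all(&buf[..n]);
            }
        }
    });

    // The client connects to the same host and port.
    let mut client = TcpStream::connect("127.0.0.1:7878")?;
    client.write_all(b"ping over the loopback interface")?;
    let mut response = [0u8; 64];
    let n = client.read(&mut response)?;
    println!("echoed: {}", String::from_utf8_lossy(&response[..n]));
    Ok(())
}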

1

A network interface can also be virtual, allowing userspace applications to intercept raw packets (see the Universal TUN/TAP device driver). The famous loopback interface (127.0.0.1 for IPv4 and ::1 for IPv6) is also virtual.

2

We consider only packet-switched networks, where the same communication channel can be shared between several communication sessions by means of packetizing data. In contrast, in circuit-switched networks (for example, the public switched telephone network, PSTN) the channel capacity is completely used by a single communication session between two nodes.

3

Note that a common reference to the TCP/IP stack (also called the Internet protocol suite) includes not only the Internet Protocol and the Transmission Control Protocol (TCP) but also other protocols such as the User Datagram Protocol (UDP) and QUIC at the transport layer and Internet Protocol Security (IPsec) at the internet layer. The TCP/IP stack, as a practical network stack, predates the theoretical OSI model for general networks. Since the adoption of TCP/IP in ARPANET in 1983, several proprietary implementations of the stack for different operating systems emerged. But TCP/IP's popularity increased when the University of California at Berkeley open-sourced its implementation for BSD Unix, which became the famous BSD sockets (see Wikipedia Internet protocol suite). Among the alternatives to TCP/IP were IPX/SPX and AppleTalk.

4

Many network interface controllers allow overwriting their MAC. In this case the MAC should be unique within the network where the NIC lives. Moreover, software-defined MACs are used in VMs managed by hypervisors such as QEMU, and MAC randomization is gaining popularity as a way to avoid mobile device tracking (see Wikipedia MAC address)

5

The main criterion here is that the nodes are physically located nearby. See also Computer Networking Introduction: Ethernet and IP (Heavily Illustrated) by Ivan Velichko

6

Do not confuse routers with network bridges. Routers allow separate networks to communicate, while bridges join networks, making them a single network. While routers interconnect networks at the internet layer, a gateway is a more general term - gateways can interconnect a PSTN (public switched telephone network) and a VoIP network, acting as a VoIP gateway, or even the internet and satellites orbiting Earth, acting as an Internet-to-orbit gateway. A default gateway is the destination of IP packets whose destination IP doesn't belong to the local network.

7

When a packet crosses a border of networks it is called a hop. The time-to-live (TTL) in IPv4 (and hop limit in IPv6) prevents infinite cycling of an IP packet between networks as it is decreased by 1 at every router the packet visits.

8

Private IPs are defined to belong to the following subnets: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16. The number after the slash is the network prefix (CIDR notation) and denotes the number of bits common to all hosts in the network. So, for example, 192.168.0.0/16 means that the first 16 bits (192.168 in decimal) are the same for all hosts, and the remaining 16 bits of the total 32 define the host (2^16 possible addresses in the network). To be precise, the total number of possible host addresses should be decreased by two, as the all-zeroes host denotes the network itself, while the all-ones host is the broadcast address. So when you encounter an IP like 192.168.12.134, you already know that it is not reachable publicly (from the Internet) - it is some internal private host. For IPv6, private IP addresses are not desirable because its main goal is to provide globally unique addressing. So for a LAN a unique local unicast address is used: it starts with the prefix fd (8 bits), then 40 bits of a global ID (chosen randomly by the operator of the network), then 16 bits for a subnet, and the remaining 64 bits define the host. So with IPv6, local private IPs are essentially globally unique (provided that the 40 bits of the global ID are indeed random), and this global uniqueness of local addresses allows merging two networks without reassigning addresses or using NATs. For IPv6 there is no broadcast - there is only multicast, when packets are received only by the hosts which declared an interest in specific packets.

9

The end-to-end principle means that the end hosts participating in the communication (and not other parties) are themselves responsible for the reliability and security of the communication. The lower routing layer is connectionless, i.e. each packet goes its own route rather than a prearranged one. Interesting fact: the principle was pioneered by CYCLADES - an INRIA network started at the same time as the well-known ARPANET.

Practice: virtual interface

As we already discussed, at the link layer of TCP/IP we have network interfaces allowing us to interact with network devices. But these network devices can be completely virtual, allowing a userspace program to interact with a network1 at the link layer (a TAP device) or at the internet layer (a TUN device), which is used in different tunneling programs like VPNs. You can create such interfaces with console commands (like ip tuntap), but let's try to create a TAP device programmatically. There are C examples around, but let's do it in Rust from scratch2.

We create a Rust workspace (a collection of several packages) so that we can run further experiments with the network in later chapters (you can find all the code in the netstack repository).

$ mkdir netstack
$ cd netstack

In the netstack directory we create Cargo.toml with the contents:

[workspace]

members = [
    "virtual-interface",
    "bindings",
]

After that we create the libraries by running cargo new --lib virtual-interface and cargo new --lib bindings in the netstack directory. To create a virtual network device, we need to open the file /dev/net/tun for read-write access and configure it with ioctl - the Swiss Army knife of device management in Unix/Linux. Essentially, ioctl as a universal system call (please see man ioctl) was created to control different devices and to be extensible enough to manage new types of devices. To achieve this goal, ioctl accepts an open file descriptor (the device), a request code - an unsigned long number which is unique across devices - and, optionally, a pointer to some data structure related to the specific device. In our case of a network device (a TAP device), such a control structure is ifreq (interface request, see man 7 netdevice) and the request code is TUNSETIFF.

Rust provides the libc crate as raw FFI bindings to system libraries, so we could use ioctl and some constants from it, but it is missing the constants for TUN and TAP devices. Since we need to create some bindings ourselves anyway, let's use the bindgen crate to generate bindings to the system libraries at build time. Following the bindgen tutorial, we create a netstack/bindings/wrapper.h file where we place the necessary headers:

#include <sys/ioctl.h>
#include <linux/if.h>
#include <linux/if_tun.h>

and create bindings/build.rs with the contents

extern crate bindgen;

use std::env;
use std::path::PathBuf;

fn main() {
    // Tell cargo to invalidate the built crate whenever the wrapper changes
    println!("cargo:rerun-if-changed=wrapper.h");

    // The bindgen::Builder is the main entry point
    // to bindgen, and lets you build up options for
    // the resulting bindings.
    let bindings = bindgen::Builder::default()
        // The input header we would like to generate
        // bindings for.
        .header("wrapper.h")
        // Tell cargo to invalidate the built crate whenever any of the
        // included header files changed.
        .parse_callbacks(Box::new(bindgen::CargoCallbacks))
        // Finish the builder and generate the bindings.
        .generate()
        // Unwrap the Result and panic on failure.
        .expect("Unable to generate bindings");

    // Write the bindings to the $OUT_DIR/bindings.rs file.
    let out_path = PathBuf::from(env::var("OUT_DIR").unwrap());
    bindings
        .write_to_file(out_path.join("bindings.rs"))
        .expect("Couldn't write bindings!");
}

Run cargo build, and at target/debug/build/bindings-.../out/bindings.rs we can find the needed bindings - except TUNSETIFF, the request code required for ioctl (bindgen does not generate bindings for function-like macros). If we check the contents of /usr/include/linux/if_tun.h, we see that TUNSETIFF is a function-like macro #define TUNSETIFF _IOW('T', 202, int), so let's port the C macro code from /usr/include/asm-generic/ioctl.h where the participating macros live.

In netstack/bindings/src/lib.rs

include!(concat!(env!("OUT_DIR"), "/bindings.rs"));
use std::mem;
use std::os::raw;

// /usr/include/asm-generic/ioctl.h
const IOC_SIZEBITS: u8 = 14;
const IOC_NRBITS: u8 = 8;
const IOC_TYPEBITS: u8 = 8;
const IOC_NRSHIFT: u8 = 0;
const IOC_TYPESHIFT: u8 = IOC_NRSHIFT + IOC_NRBITS;
const IOC_SIZESHIFT: u8 = IOC_TYPESHIFT + IOC_TYPEBITS;
const IOC_DIRSHIFT: u8 = IOC_SIZESHIFT + IOC_SIZEBITS;
const IOC_WRITE: u32 = 1;

//#define TUNSETIFF     _IOW('T', 202, int)
const INT_SIZE: usize = mem::size_of::<raw::c_int>();
pub const TUNSETIFF: u32 = IOC_WRITE << IOC_DIRSHIFT
    | (b'T' as u32) << IOC_TYPESHIFT
    | 202 << IOC_NRSHIFT
    | (INT_SIZE as u32) << IOC_SIZESHIFT;

Let's add a file netstack/virtual-interface/src/tap.rs where the main things happen:

use std::ffi;
use std::fs::{File, OpenOptions};
use std::io;
use std::mem;
use std::os::unix::io::{FromRawFd, IntoRawFd};

use bindings;

const VIRTUAL_DEVICE: &str = "/dev/net/tun";

#[derive(Debug)]
pub enum VirtualInterfaceError {
    IoError(io::Error),
    IoctlError,
    DeviceNameTooLong,
    DeviceNameContainsNulByte(ffi::NulError),
}

impl From<io::Error> for VirtualInterfaceError {
    fn from(error: io::Error) -> Self {
        VirtualInterfaceError::IoError(error)
    }
}

impl From<ffi::NulError> for VirtualInterfaceError {
    fn from(error: ffi::NulError) -> Self {
        VirtualInterfaceError::DeviceNameContainsNulByte(error)
    }
}

pub struct VirtualInterface {
    device: File,
}

impl VirtualInterface {
    pub fn create(name: &str) -> Result<Self, VirtualInterfaceError> {
        // reserve 1 byte for '\0'
        if name.len() >= bindings::IFNAMSIZ as usize {
            return Err(VirtualInterfaceError::DeviceNameTooLong);
        }
        // We have to check that the device name has no zero bytes in the middle
        let device_name = ffi::CString::new(name)?.into_bytes_with_nul();
        let device = OpenOptions::new()
            .read(true)
            .write(true)
            .open(VIRTUAL_DEVICE)?;
        // ifreq is a structure to control network device (see man 7 netdevice)
        let mut ifr: bindings::ifreq = unsafe { mem::zeroed() };

        // create stack allocated array to hold the device name
        let mut name_buffer = [0_u8; bindings::IFNAMSIZ as usize];
        // and copy name bytes to it
        for (i, b) in device_name.into_iter().enumerate() {
            name_buffer[i] = b;
        }
        ifr.ifr_ifrn.ifrn_name = name_buffer;
        // IFF_TAP - tap device
        // IFF_NO_PI - no additional info for Ethernet package
        // IFF_TUN_EXCL - prevent creation of duplicates
        ifr.ifr_ifru.ifru_flags = (bindings::IFF_TAP | bindings::IFF_NO_PI | bindings::IFF_TUN_EXCL)
            as std::os::raw::c_short;

        let raw_fd = device.into_raw_fd();
        if unsafe { bindings::ioctl(raw_fd, bindings::TUNSETIFF as u64, &mut ifr as *mut _) } == -1
        {
            return Err(VirtualInterfaceError::IoctlError);
        }
        let device = unsafe { File::from_raw_fd(raw_fd) };
        Ok(Self { device })
    }

    pub fn device(&mut self) -> &mut File {
        &mut self.device
    }
}

and import it inside netstack/virtual-interface/src/lib.rs:

pub mod tap;

To test the code let’s start with an example netstack/virtual-interface/examples/read_loop.rs

use std::io::Read;
use virtual_interface::tap::VirtualInterface;

fn main() {
    let mut interface = VirtualInterface::create("dev0").unwrap();
    let mut buffer = [0; 4096];
    println!("starting read loop for device `dev0`");
    loop {
        let n = interface.device().read(&mut buffer[..]).unwrap();
        println!("The bytes: {:?}", &buffer[..n]);
    }
}

and build and run it:

$ cargo build --example read_loop
$ ./target/debug/examples/read_loop

Oh no, some ioctl error: thread 'main' panicked at 'called Result::unwrap() on an Err value: IoctlError'. Let's debug with strace.

$ strace ./target/debug/examples/read_loop
...
ioctl(3, TUNSETIFF, 0xffffce5f30e0)     = -1 EPERM (Operation not permitted)
...

So it complains about permissions. Yes, checking man 7 netdevice again

If an ioctl is marked as privileged, then using it requires an effective user ID of 0 or the CAP_NET_ADMIN capability. If this is not the case, EPERM will be returned.

So let's try with sudo ./target/debug/examples/read_loop and check in another terminal whether the interface was created:

$ ip link show dev0
dev0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether b2:24:f3:9a:71:c1 brd ff:ff:ff:ff:ff:ff

It works!3 But how does our binary know how to call libc functions and use libc structs? Rust dynamically links some system libraries, and libc is among them:

$ ldd ./target/debug/examples/read_loop
linux-vdso.so.1 (0x0000ffffb2199000)
libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffffb20c0000)
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffffb1f10000)
/lib/ld-linux-aarch64.so.1 (0x0000ffffb2160000)

So our netstack/bindings works as expected.

Now you may think that giving sudo access to the program is probably too much power, and you're right. Linux comes to the rescue with its capabilities4 system, which allows granting fine-grained pieces of superuser power. So we need to grant only the CAP_NET_ADMIN capability to our program. From man capabilities:

CAP_NET_ADMIN
    Perform various network-related operations:
    * interface configuration;
    * administration of IP firewall, masquerading, and accounting;
    * modify routing tables;
    * bind to any address for transparent proxying;
    * set type-of-service (TOS);
    * clear driver statistics;
    * set promiscuous mode;
    * enabling multicasting;

Note: if the filesystem with your compiled binary is mounted with the nosuid flag (for example, when the home directory is encrypted - you can check the flag for the home directory with the command mount | grep $USER), then you should copy the binary to some other directory whose filesystem does not have the nosuid flag.

$ sudo setcap cap_net_admin=eip target/debug/examples/read_loop

Now let's run our example binary. In another terminal run ip link show dev0 and check that the interface is down, so let's bring it up with sudo ip link set dev0 up. Now we can see some bytes in the console, and our next task is to decode them.
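As a teaser for the decoding, here is a minimal sketch (assuming the IFF_NO_PI flag we set above, so the buffer starts directly with the Ethernet header) that splits a raw frame into destination MAC, source MAC, and EtherType:

/// Minimal view of an Ethernet II header: 6 bytes of destination MAC,
/// 6 bytes of source MAC, 2 bytes of EtherType.
fn parse_ethernet(frame: &[u8]) -> Option<(&[u8], &[u8], u16)> {
    if frame.len() < 14 {
        return None;
    }
    let destination = &frame[0..6];
    let source = &frame[6..12];
    // EtherType is big-endian: 0x0806 is ARP, 0x0800 is IPv4, 0x86DD is IPv6.
    let ethertype = u16::from_be_bytes([frame[12], frame[13]]);
    Some((destination, source, ethertype))
}

In read_loop we could call parse_ethernet(&buffer[..n]) on every iteration to see, for example, ARP requests arriving on the interface.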

1

Actually, offloading the network stack to userspace in traditional heavy-kernel OSes (as opposed to so-called microkernel OSes) can bring a performance boost - see for example OpenOnload.

2

Much of the practice in this chapter is inspired by the excellent tutorial Let's code a TCP/IP stack, 1: Ethernet & ARP by Sami Niiranen.

4

Comprehensive overview of Linux capabilities and different related exploits can be found at Linux Capabilities by Carlos Polop.

3

We could create a TAP device with the command sudo ip tuntap add dev0 mode tap. The only difference from our programmatically created device is that our device is not persistent - it lives only while our process lives. To make our device persistent we could just add the flag IFF_PERSIST to ifru_flags.

Unreliable network

A network of interconnected computers is based on some physical layer (Ethernet cables, or radio in the case of wireless networks)1. The physical layer itself is subject to various disruptions, noise, and other problems. So we always have to assume that the network is unreliable and implement retry, ordering, and healthcheck policies in our services.

Moreover, network capacity is limited, and we should estimate in advance how much data the system under development is going to send through the network.

Network throughput is an amount of data that can be transferred through a network within a certain time period (Oxford Dictionary of the Internet, 4 ed.). Usually an average throughput is used: the number of transferred bits divided by the number of elapsed seconds.

We can measure network throughput in bits per second (bit/s or bps) or even in data packets per unit of time.

Stop&Think: How do you measure network throughput? How do you estimate the maximum throughput (also called bandwidth or channel capacity) of a network?2

Congestion control algorithms:

https://cloud.google.com/blog/products/networking/tcp-bbr-congestion-control-comes-to-gcp-your-internet-just-got-faster
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/tcp_bbr.c
https://cacm.acm.org/magazines/2017/2/212428-bbr-congestion-based-congestion-control/fulltext
https://dl.acm.org/doi/10.1145/52324.52356
https://github.com/google/bbr

Typical interservice communication within a corporate network often leads one to think that communication is almost instantaneous. But in the case of a worldwide internet connection spanning several networks, the time to deliver data increases significantly.

Network latency is the time taken for a given signal or data packet to propagate through a circuit or network from its source point to its destination Oxford Dictionary of Electronics and Electrical Engineering (5 ed.)

Network latency is usually measured from source to destination and back, constituting the so-called round-trip time (RTT), which is, roughly speaking, double the latency.3
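As a rough illustration, the duration of a TCP connect (one SYN/SYN-ACK exchange) approximates a single round trip. A minimal sketch (the endpoint is an arbitrary example, not a recommendation):

use std::net::{TcpStream, ToSocketAddrs};
use std::time::{Duration, Instant};

fn main() -> std::io::Result<()> {
    // Any reachable TCP service works; example.com:80 is just an illustration.
    let addr = "example.com:80".to_socket_addrs()?.next().unwrap();
    let started = Instant::now();
    // The TCP three-way handshake costs roughly one round trip.
    let _stream = TcpStream::connect_timeout(&addr, Duration::from_secs(5))?;
    println!("approximate RTT: {:?}", started.elapsed());
    Ok(())
}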

Caching, keep-alive connections, and geographical proximity to the users are among the usual options to decrease latency4.

A network partition happens when some nodes of a cluster in a distributed system are supposed to communicate but cannot due to a network failure.

Depending on the network usage profile and the targets for throughput, latency, packet loss and packet rate, the standard Linux networking stack can be partially or completely replaced with:

  • a user-space networking stack (no kernel at all) like DPDK5 and other packet processing frameworks6
  • kernel-executed user-defined network programs (eBPF)
  • raw sockets

So for a distributed system it is absolutely necessary to handle different network failures (which are quite common):

  • meet bandwidth requirements and handle network congestion when scaling;
  • deal with increased latencies;
  • have policies for network partitions: if a network partition creates two or more subclusters and such subclusters can behave independently (a situation known as split-brain7), how should the system evolve and finally merge the conflicting states?
  • implement backpressure
1

Most of us have heard about OSI model as a theoretical framework to discuss networking stack. Just for reference:

  • L1 - Physical layer (bits)
  • L2 - Data link (frames)
  • L3 - Network (packets)
  • L4 - Transport (segments/datagrams)
  • L5 - Session (data)
  • L6 - Presentation (data)
  • L7 - Application (data)
2

Also note that the maximum throughput is not actually achievable for application-specific data, as there is always overhead from latency, protocols, encryption, compression, packet retransmission, etc. The useful data throughput is sometimes called goodput (see also Wikipedia Goodput).

3

An actual latency measurement is complicated by the presence of several nodes along the packet's path, by queuing at gateways (several packets from different sources to the same destination are put into a waiting list before being sent), and by processing delays. Tools like the famous ping may use special control protocols (such as ICMP) which differ from the protocols you actually use for data (such as TCP), so the measurements are biased.

4

Analyzing your application stack and its CPU/IO profile can also help you choose an appropriate operating system, if that is an option - see Benchmarks: FreeBSD 13 vs. NetBSD 9.2 vs. OpenBSD 7 vs. DragonFlyBSD 6 vs. Linux

5

see also SPDK for a storage

6

see also https://github.com/ntop/PF_RING/tree/dev, https://github.com/luigirizzo/netmap

Create your own tool to measure network throughput

Here we will create our own Rust version of iperf3 to better understand the concepts around networking.

https://github.com/esnet/iperf
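As a starting point, here is a minimal sketch of the idea (not the final tool; port 5201 just mirrors iperf3's default): the client pushes bytes for a fixed time window, the server counts what it managed to read and divides by the elapsed time:

use std::env;
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::time::{Duration, Instant};

const PORT: u16 = 5201; // iperf3's default port, reused here for familiarity

fn server() -> std::io::Result<()> {
    let listener = TcpListener::bind(("0.0.0.0", PORT))?;
    let (mut stream, peer) = listener.accept()?;
    let mut buf = [0u8; 64 * 1024];
    let mut total: u64 = 0;
    let started = Instant::now();
    // Read until the sender closes the connection.
    loop {
        let n = stream.read(&mut buf)?;
        if n == 0 {
            break;
        }
        total += n as u64;
    }
    let secs = started.elapsed().as_secs_f64();
    let mbits = total as f64 * 8.0 / 1_000_000.0 / secs;
    println!("{peer}: received {total} bytes in {secs:.2}s = {mbits:.2} Mbit/s");
    Ok(())
}

fn client(host: &str) -> std::io::Result<()> {
    let mut stream = TcpStream::connect((host, PORT))?;
    let chunk = [0u8; 64 * 1024];
    let deadline = Instant::now() + Duration::from_secs(10);
    // Push zero-filled chunks for 10 seconds.
    while Instant::now() < deadline {
        stream.write_all(&chunk)?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    match env::args().nth(1).as_deref() {
        Some("server") => server(),
        Some(host) => client(host),
        None => {
            eprintln!("usage: throughput server | throughput <host>");
            Ok(())
        }
    }
}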

Wireshark to analyze packets

Write our own DPI (nDPI, goodbyeDPI)

Availability

In common practice, availability is understood as a metric rather than a property of the system, while in the theory of distributed systems (especially the CAP theorem1) it is treated as a property, which leads to confusion.

We will follow practical considerations and understand availability through the following concepts.

A service level indicator (SLI) is a carefully defined quantitative measure of some aspect of the level of service that is being provided. It is a metric, not a target.2

For example, it can be a ratio of successfully handled requests.

A service level objective (SLO) specifies a target level for the reliability of your service. The SLO is a target value for an SLI.2

A Service Level Agreement (SLA) declares your commitment to satisfy an SLO. The SLO also determines your error budget, i.e. the duration of allowed downtime within different time periods.
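For example, an SLO of 99.9% availability over a 30-day window leaves an error budget of 0.1%, i.e. roughly 30 * 24 * 60 * 0.001 ≈ 43 minutes of full downtime per month (or proportionally more time of partial degradation).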

Going distributed is just one of the ways to strengthen availability by adding scalability and redundancy.

While the network is a primary source of failures for distributed systems, we should also take into consideration other resources related to a single node:

  • Storage (disks get corrupted, drivers contain bugs, various filesystems have different write-ordering guarantees and are mounted with different policies3)
  • CPU (processes are stopped while a VM is migrated, system clocks may drift, CPU usage patterns may not align with branch prediction and cache sizes)
  • Memory (it can be corrupted or fragmented; what are the memory usage patterns?)

So we should define fault models for different resources in our distributed system4, i.e. a specification of how our application handles different failures.

Clever design (including API design), code quality assurance (including tests for configuration changes5), proper algorithms, performance tuning, application monitoring, best DevSecOps practices, and the collection of business and tech metrics significantly improve availability and should be considered first, before introducing the complexity of distribution.

1

Martin Kleppmann A Critique of the CAP Theorem

2

Google Cloud Architecture Framework Reliability principles

3

An eye-opening read for the author was Files are fraught with peril by Dan Luu - how difficult and error-prone it is to use the file API and how weak many filesystems are in terms of error handling. The general advice is to rely on robust and battle-tested products like SQLite (which handles many edge cases) if you need some local persistent storage.

4

See for an example the Design Document of TigerBeetle.

5

see Dan Luu’s Reading postmortems

Distributed vs Decentralized

Decentralized network is a dispersed network having several or multiple large hubs <..> and no dominant central hub. Oxford Dictionary of Media and Communication

Consider master-slave replication of a database. We have some slave databases serving only read requests and a master database server executing read and write queries and replicating all changes to the slaves. Clearly this system is distributed due to the replication link. If we have no mechanism for promoting a slave to master in case of the master's failure, then our system is not decentralized.

So decentralization increases fault tolerance by removing a single point of failure. But there is also a catch. Consider two network topologies: a mesh (in the case of full decentralization) and a star (with a central master).

Mesh vs Star

So in the case of n nodes in a mesh network topology we have n*(n-1)/2 connections, while in the case of the star network we have only n-1 connections. So heavily meshed distributed systems produce increasing network contention.
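For example, a fully meshed cluster of 100 nodes needs 100*99/2 = 4950 connections, while a star over the same nodes needs only 99.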

Taste of distributed

Exercises

  1. Due to growing traffic we decided to shard users' data by the first letter of the last name. Can we consider such a sharded database a distributed system? Explain your answer.

  2. Model packet loss in the network

  3. Implement interface for TUN device

Established systems

Successful software is composed not built. Roger Johansson, creator of ProtoActor

Implementing distributed systems the right way is hard. In most cases we can compose our software from well-established distributed systems. Of course, the goal of the book is to learn how to implement such systems from scratch. But we should be familiar with the other options. Some are listed below in order of decreasing technology lock-in.

  • Erlang/OTP
  • ETCD
  • Tendermint
  • Kubernetes

Battle-tested Erlang/OTP

Any sufficiently complicated concurrent program in another language contains an ad hoc informally-specified bug-ridden slow implementation of half of Erlang. Robert Virding, one of the creators of Erlang

Erlang remains a unique fusion of practical engineering and clever architectural vision. Created in the 1980s, it still has great potential to evolve in the modern web13,1.

We won't dive into the language per se or the pragmatic ideas of “let it fail”, supervision trees, and fault tolerance. You can read the very concise thesis, full of practical wisdom (highly recommended), by Joe Armstrong, one of the Erlang creators.

Let's focus more on Erlang's distribution capabilities. Erlang is compiled to run in a virtual machine called BEAM2. Elixir is another BEAM-runnable language, interoperable with Erlang.

Erlang's lightweight processes, while executing application logic, communicate with each other using signals. The most used signal type is a message3. Processes are scheduled for execution by BEAM, and you can run millions of them4. OTP is a collection of reusable components (abstractions for concurrent and distributed patterns such as client-server). So Erlang/OTP is a framework allowing you to quickly create complex concurrent and distributed systems.

BEAM (written in C) can be extended with dynamically loaded compiled modules containing Native Implemented Functions (NIFs). So you can write such a shared module in a language you prefer (supporting C interoperability, of course)5.
Erlang is primarily well suited for massive IO-bound loads, so for CPU-bound tasks you should use, for example, the above-mentioned NIFs6.

To form a cluster, every node (Erlang VM) should share the same cookie file with a secret. This is rather basic “security”, whose aim is to differentiate two or more clusters. Every Erlang VM is started with a predefined name, and then it's enough to register the remaining n-1 nodes on the first node. All nodes sharing the same cookie will propagate connections to each other, creating a fully connected mesh network. This default transitive connection propagation can be configured - a so-called hidden node, without transitive connections, can be used to create clusters with fewer connections7.

By default, communication between the nodes of a cluster is done over TCP. A separate daemon process (called EPMD - Erlang Port Mapper Daemon) starts (if not already running) on every host machine with Erlang VMs. EPMD uses DNS to resolve node names to IP addresses and maps node names to an IP address and port (a TCP socket). EPMD thus serves as a name resolver for the nodes on the host machine.

Erlang

You can also implement your own procedures for node discovery, including discovery under a container orchestrator.

The default Erlang distribution with cookies assumes a trusted network, so you should change the default communication mechanism in case of an untrusted network8. Moreover, a large number of nodes in a fully connected mesh communicating over a large and uncontrolled network can be prohibitively costly. The breaking point may range from 40 to 140 nodes9, depending on the load and the amount of global state that has to be synced over the cluster (such as the process namespace, or getting the current time, which requires a global lock to provide monotonically increasing time over the cluster). In such cases federated10 clusters and partitioning the global state into separate groups of nodes inside a cluster is the way to go11.

Erlang is actively modernized and continuously developed. So it's a solid foundation for a distributed system.

Erlang lessons to go distributed:

  • separation of concerns and modularity - you can configure your communication transport, the algorithm of node discovery, and the network topology;
  • a distributed system must be observable (Erlang has excellent tracing and monitoring tools allowing you to observe even specific Erlang processes);
  • communication is asynchronous, so no node has to wait for an acknowledgement that its message was received by another one;
  • message passing is location transparent (the code to send a message to a local Erlang process is the same as for sending to a process on another node in the cluster - at the cost of more RAM and time to memcpy, as every message is deeply copied);
  • maintaining global mutable data (the namespace of lightweight processes in the case of Erlang) and full connectivity severely limits scalability;
  • fail fast - correctness is more important than availability (depending, of course, on the startup time). The Erlang philosophy rejects defensive programming (trying to recover from all errors). So assertions in production code are beneficial - they declare the negative flow and don't allow the distributed system (which is designed to be fault tolerant) to behave incorrectly12.
13

For example, the Swedish fintech Klarna uses Erlang as its core platform, handling 2 million transactions per day - look at their job descriptions.

1

At one of my workplaces I almost convinced my CTO to write an IO-bound service for proxying media streams and unary requests in Erlang. For two weeks I read everything on Erlang and finally presented a workable version which was deployed to production. It had been working in production for about 3 weeks when the CTO had a look at the code. He was rather afraid of the completely non-imperative style of the code and of the inability to scale his Erlang team (which consisted of only me). So he gently asked me to rewrite the app in Go. For the brave among you: Adopting Erlang.

2

Alternative execution environments (including not only VMs) emerge periodically, but I haven't heard of any production-ready one. BEAM is really complicated to reinvent. See also Lumen.

3

There are a lot of holy wars around considering Erlang an actor language. See the Stack Exchange question and Robert Virding's email.

5

See Rustler

10

Federation is a technique to scale software by grouping its parts by feature. For example, you can federate a database into separate servers: one for sessions, a second for users, a third for sales, etc. The increased throughput comes at the cost of joining data and holding transactions on the application side.

Several nodes in a distributed system should agree on a value. More formally, a set of replicated state machines computes the same state by applying commands from a replicated log. And the goal of consensus is to synchronize the log for every node.

Distributed systems can be classified by different criteria, which have both theoretical (as constraints for models) and practical (as foundations for design decisions) importance. Let's list the most commonly used ones:

  1. By a communication time bound.

    A synchronous system assumes that there is some bound (known in advance) on the delivery time of messages between nodes.

    An asynchronous system assumes that there is no such bound (so it is impossible to tell whether a node has failed or is just processing slowly). An asynchronous system seems more suitable for nodes interacting over networks, but there is no consensus algorithm that is guaranteed to terminate in such a model (see below).

  2. By a fault mode.

    Non-faulty nodes do not crash or misbehave at all.

    Fail-stop nodes stop when they fail/crash and do not restart.

    Fail-recover nodes recover after crashes and behave correctly after the restart.

    Byzantine-faulty nodes may crash intentionally and, moreover, misbehave in order to prevent non-Byzantine nodes from reaching consensus.

  3. By reliability of communication (transport).

    Systems with a reliable transport, where sent messages are eventually delivered (e.g. stream-oriented protocols like TCP, HTTP/2, QUIC)

    Systems with an unreliable transport where some sent messages can be dropped while in transit (e.g. UDP)

All types under each criterion are listed from the most restrictive to the most general. As a consequence, if something holds for a more restrictive subtype, then it also holds for the more general one. Why is this reasoning important? Let's illustrate with some impossibility results.

Some impossibility results (when consensus is impossible to achieve):

  • FLP: in an asynchronous distributed system with reliable communication and fail-stop nodes, consensus is impossible (i.e. cannot be guaranteed to terminate). As a consequence, the statement also holds for asynchronous systems with fail-recover or Byzantine-faulty nodes, with either reliable or unreliable communication.
  • In the presence of an unreliable transport, consensus is impossible even for a synchronous system (TODO link?)
  • Consensus is impossible even in a synchronous distributed system when Byzantine faults are possible and one third or more of the processes are faulty (N <= 3F) - Lamport, Shostak, Pease (see the example after this list).
  • Atomic broadcast and its duality to consensus
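For example, to tolerate a single Byzantine node (F = 1) you need at least N = 4 nodes in total; with only N = 3, the faulty node can feed conflicting values to the two honest nodes and prevent them from agreeing.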

How do we deal with such impossibility results?

  • in theory - additional constraints are introduced: the network will be periodically synchronous for short intervals of time, and in such intervals of synchrony all undelivered messages will be delivered within some fixed time bound - see [DLS88] C. Dwork, N. A. Lynch, and L. J. Stockmeyer. Consensus in the presence of partial synchrony.
  • in practice - by using failure detectors. The simplest ones are a periodic healthcheck (when a communicating peer actively checks the node) and heartbeats (when the node itself periodically emits a signal) - see the sketch below.
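Below is a minimal sketch (the node names are made up) of a timeout-based heartbeat failure detector; note that it can be wrong - a suspected node may merely be slow, which is exactly the accuracy/completeness trade-off studied in the failure-detector literature:

use std::collections::HashMap;
use std::time::{Duration, Instant};

/// A naive heartbeat-based failure detector: a node is suspected
/// if we haven't heard from it within `timeout`.
struct FailureDetector {
    timeout: Duration,
    last_heartbeat: HashMap<String, Instant>,
}

impl FailureDetector {
    fn new(timeout: Duration) -> Self {
        Self { timeout, last_heartbeat: HashMap::new() }
    }

    /// Called whenever a heartbeat message arrives from `node`.
    fn heartbeat(&mut self, node: &str) {
        self.last_heartbeat.insert(node.to_string(), Instant::now());
    }

    /// A node is suspected when its last heartbeat is too old (or was never seen).
    fn suspected(&self, node: &str) -> bool {
        match self.last_heartbeat.get(node) {
            Some(seen) => seen.elapsed() > self.timeout,
            None => true,
        }
    }
}

fn main() {
    let mut detector = FailureDetector::new(Duration::from_millis(500));
    detector.heartbeat("node-1");
    std::thread::sleep(Duration::from_millis(600));
    assert!(detector.suspected("node-1")); // silent for too long
    assert!(detector.suspected("node-2")); // never seen at all
}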

Chandra: accuracy vs liveness of failure detectors; leases and leased clocks

TODO: Consider 2PC (2-phase commit) for a synchronous, non-faulty system with a reliable transport as a first step. If we introduce fail-stop nodes (e.g. the coordinator itself), then consensus with 2PC will halt.

Desired properties of consensus

  • safety - never return an incorrect result under non-Byzantine conditions (network problems, etc.); all nodes agree on the order of events
  • liveness - the consensus makes progress
  • support changes in cluster membership

Consensus protocols

  • with elected leader (Raft, Paxos, Tendermint, pBFT, VR, Nakamoto consensus - POW, POS, POA)

  • leaderless (EPaxos, Avalanche, IOTA)

  • BFT-tolerant (i.e. nodes never assume the leader is correct or honest): pBFT, Nakamoto, Tendermint

  • non-BFT-tolerant (require a trusted setup): Raft, VR, Paxos

Chandra: atomic broadcast is dual to consensus

Raft

Raft is now one of the most widely used consensus algorithms. For example, it is used in etcd (link to the established systems) and TiKV.

The animated introduction here should be checked first, and after that the article.

Key takeaways

  • while we could try to merge results from equal peers (like in Paxos?), it is much simpler to elect a leader and make it responsible for the state changes
  • a modular design: separate leader election, log replication, safety and membership changes

Consider Raft critique (e.g. openraft repo)

Distributed transactions

2-phase commit 3-phase commit

Viewstamped Replication

Used by Tigerbeetle.

Martin Kleppmann, A Critique of the CAP Theorem

latency-sensitivity framework

availability as a metric

operational and network latency

algorithm sensitivity to latency

communication patterns

  • unicast (a single sender to a single receiver)
  • broadcast (sender to all possible receivers)
  • multicast (sender to a selected group of receivers)
  • convergecast (a group of senders to a receiver)

Supermajority (2/3) vs simple majority

Network partitions can be modelled as large packet delays (up to the duration of the partition), provided that the duration of the partition is finite and lost packets are retransmitted an unbounded number of times1.

1

Martin Kleppmann A Critique of the CAP Theorem

ML training

Machine learning involves many matrix operations and is naturally parallelized not only within a single node but also over distributed nodes. Out-of-the-box solutions like Distributed training with TensorFlow transfer parameters (aka weights) from parameters nodes to worker nodes and back during model training. But large parameters matrices for sparse data heavily consume network bandwidth and severely decrease model training speed1.

Sparse data means data with many zero values. For example, an unpacked data batch X looks like

[[0, 0, .., 0, 9], 
 [5,0, .., 0, 34], 
 ...
 [0, .., 720, 14]]

where the total number of rows is the batch size m and each feature row contains about 2^20 values (yes, it's Twitter's scale).

So the first layer of the network requires 2^20 rows in its parameters matrix W consisting of float32 numbers (4 bytes each), and its number of columns depends on the next layer size, typically 50 - 400 in the case of Twitter. So this results in first-layer matrices of 50 * 2^20 * (4*2^-20) MB = 200 MB to 400 * 2^20 * (4*2^-20) MB = 1600 MB. When you have a lot of worker nodes each exchanging gigabytes of data with the parameters servers, your network bandwidth is in danger. While the first layer is huge, the other layers are considerably smaller.

Twitter approaches this scalability problem with a clever node organization2.

Assume we have K worker nodes, P parameters nodes, and n features in a dataset row. Then, if we split the input X by features (by columns) and, correspondingly, the weights matrix of the first (sparse) layer into P blocks of rows:

X * W = X1 * W1 + .. + XP * WP

then each worker node only needs to split its input into blocks Xi, while each parameters server i stores the ith partition Wi of the sparse first layer and computes the block product Xi * Wi.
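To make the identity concrete, here is a tiny sketch (with toy dimensions, not Twitter's, and dense Vec-based matrices instead of a real sparse format) checking that X * W equals the sum of the per-partition block products:

// Toy dense matrices as Vec<Vec<f32>>; real implementations use sparse formats.
fn matmul(a: &[Vec<f32>], b: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let (m, n, h) = (a.len(), b.len(), b[0].len());
    let mut out = vec![vec![0.0; h]; m];
    for i in 0..m {
        for k in 0..n {
            for j in 0..h {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}

fn add(a: &mut [Vec<f32>], b: &[Vec<f32>]) {
    for (ra, rb) in a.iter_mut().zip(b) {
        for (x, y) in ra.iter_mut().zip(rb) {
            *x += *y;
        }
    }
}

fn main() {
    // X: 2x4 (batch of 2, 4 features), W: 4x3 (first layer width 3), P = 2 partitions.
    let x = vec![vec![0.0, 0.0, 0.0, 9.0], vec![5.0, 0.0, 0.0, 34.0]];
    let w: Vec<Vec<f32>> = (0..4).map(|i| vec![i as f32, 1.0, -1.0]).collect();
    let p: usize = 2;
    let cols_per_part = 4 / p;

    let full = matmul(&x, &w);

    let mut summed = vec![vec![0.0; 3]; 2];
    for part in 0..p {
        let range = part * cols_per_part..(part + 1) * cols_per_part;
        // X_i: the columns of X for this partition; W_i: the matching rows of W.
        let xi: Vec<Vec<f32>> = x.iter().map(|row| row[range.clone()].to_vec()).collect();
        let wi: Vec<Vec<f32>> = w[range.clone()].to_vec();
        add(&mut summed, &matmul(&xi, &wi));
    }
    assert_eq!(full, summed); // X * W == X1*W1 + X2*W2
    println!("block decomposition matches: {:?}", full);
}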

Each worker node runs two processes with the following pseudocode:

Data unpacking and partitioning:

const m // batch size
const P // number of partitions of the sparse layer

fn process_input() {
    loop {
        batch = receive_input_batch(m) // batch of size m
        for i in 0..P {
            submatrix = unpack_submatrix_i_of_features(batch, i) // with m rows and n/P columns
            send_this_submatrix_to_parameters_server(submatrix, i) // send to server i
        }
    }
}

Reconstructing output layer of the first sparse layer and training rest layers:

const P // number of partitions of the sparse layer

fn train_rest_of_layers() {
    loop {
        output_layer = zeros() // accumulator for the product XW: m rows (batch size), columns = first layer output width
        for i in 0..P {
            output_layer += receive_output_submatrix_from_parameters_server(i) // product of block of X and block of W from parameters server i
        }
        calculate_other_layers_of_network(output_layer)
    }
}

Because the second and further layers are much smaller than the first “sparse” layer, each worker node has these layers locally.

Each parameters node i runs the following process:

const node_number = i
const K // number of workers

fn train_first_layer() {
    loop {
        Wi // block i of weights matrix for the first sparse layer
        Xi = receive_submatrix_of_features(i)
        block_i = calculate_block_product(Xi, Wi)
        for j in 0..K {
            send_output_layer_block_to_worker(block_i, j) // send to worker j
        }
    }
}

Twitter ML cluster

Such a cluster organization allows for the failure and restart of a worker node without losing the weights of the first layer (a failure of a parameters node is more damaging). Each worker node is stateless with respect to the sparse layer, and no transfer of the huge sparse matrix W occurs over the network3.

Blockchain

Permissioned vs permissionless

TODO: Focus on consensus (POW, POS, POH) as a lesson for the book purposes

TPS (transactions per second)

A permissionless blockchain as a solution to BFT1

A blockchain network (also called a cluster) is a p2p network of nodes (called validators in Solana) exchanging and maintaining a copy of a ledger (the blockchain), typically 10 - 50 validators and 1000-2000 nodes in total2.

  • distributed state transition requires consensus

A replicated state machine: a batch of state transitions (transactions) is recorded as a block. The first state is the genesis block. A state transition is done by applying a transaction. Several state transitions (i.e. transactions) are usually combined into a single block for time and space efficiency. When several competing chains exist, we need to choose one (called the canonical chain) via a consensus mechanism (PoW, PoS, PoA). Hence the notion of finality.
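To make the "chain" part tangible, here is a minimal sketch (using std's DefaultHasher purely for illustration - it is not a cryptographic hash) of blocks linked through the hash of their predecessor:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// NOTE: DefaultHasher is NOT cryptographic; real chains use SHA-256 or similar.
#[derive(Hash)]
struct Block {
    height: u64,
    prev_hash: u64,
    transactions: Vec<String>, // state transitions batched into the block
}

fn hash_block(block: &Block) -> u64 {
    let mut hasher = DefaultHasher::new();
    block.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // The genesis block has no predecessor.
    let genesis = Block { height: 0, prev_hash: 0, transactions: vec![] };
    let mut chain = vec![genesis];

    // Appending a block commits a batch of transactions and links it
    // to the previous block through its hash.
    let prev_hash = hash_block(chain.last().unwrap());
    chain.push(Block {
        height: 1,
        prev_hash,
        transactions: vec!["alice -> bob: 5".into()],
    });

    // Verification: recompute each predecessor's hash and compare.
    for pair in chain.windows(2) {
        assert_eq!(pair[1].prev_hash, hash_block(&pair[0]));
    }
    println!("chain of {} blocks is consistent", chain.len());
}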

Smart contracts are run on VM (Sealevel for Solana, EVM for Ethereum)

A client is a program communicating with a cluster via transactions (composed of several instructions in Solana terms).

Transactions are signed by the client by means of Public Key cryptography.

A dApp (decentralized app) is a UI client (typically written in some JavaScript flavor).

The node which appends new transactions to the ledger is an elected leader.

A token is the unit of payment for the computing resources of the cluster (running a smart contract or validating its output). You can invest some tokens (delegate a stake) into a validator and gain a reward. The validator takes some fee (commission) for its services (in Bitcoin, by including a transaction with 25 BTC out of nowhere3). Growing validators can offer a lower commission. This favors large validators becoming even larger, because a lower fee incentivizes investors to delegate their stakes at the lower commission4.

If a validator behaves maliciously, you lose your delegated tokens, and the validator also loses its delegated power and cannot earn as much as before.

You can use a wallet, which is a client allowing you to operate on your token account (send and receive). To make stakes, separate stake accounts are usually used. In Solana, the tokens on a stake account can only be delegated to a single validator, the whole amount at a time. To invest in several validators simultaneously you should use several stake accounts.

How do you get your initial tokens to start? Crypto exchanges (like Binance) allow you to buy some tokens in exchange for a usual currency like US dollars.

An account is typically identified by a public key.

Validators advertise their public keys.

Bitcoin Core can be used to build dApps:

  • with Bitcoin scripting
  • with some metaprotocols, which require trusted nodes. Or you should invent your own blockchain.

Buterin suggested EVM to

Classical consensus requires all nodes to participate, which prevents scalability. So either some nodes (called validators) are selected to validate transactions, or another kind of consensus is suggested (the probabilistic consensus of Bitcoin, called Nakamoto consensus, or the probabilistic Avalanche consensus), or the topology of the network is changed:

  • it is sharded by feature (DeFi blockchain, smart contracts etc) (Cosmos blockchain)
  • centralization - when there is central network to validate (Polkadot)
  • L2 (layer 2, also called sidechains) A Foray Into Blockchain Scalability

In the case of a validator pool with classic consensus we need some voting mechanism to allow for more even participation (quadratic voting, https://wiki.polkadot.network/docs/learn-phragmen)

Byzantine Fault Tolerance (BFT, Tendermint) and Byzantine Agreement-based (Algorand) => limit the number of validators

PoW and PoS tend to create concentrated pools, which contradicts decentralization.

There are several proposed decentralization indices like Nakamoto Index or Edinburgh Decentralisation Index. We will not delve deeply into specific methodologies but will name a few common subsystems being considered.

  • (full) nodes distribution by geography and ownership, cloud providers and data centers
  • hardware distribution (by geography, vendor, architecture)
  • value owners distribution (by accounts, geography, amount of wealth)
  • codebase diversity5
  • exchanges support
  • number of autonomous systems (AS)
  • wallets developers distribution
  • jurisdiction constraints (e.g. a ban on mining)
  • governance
  • code repository diversity (usually only GitHub)

Any of these subsystems can become a bottleneck, effectively reducing decentralization to a very low level (in fact, the Nakamoto index measures the minimum number of parties needed to control 51% of the resources in any subsystem6). 51% attacks are real: https://www.crypto51.app/

Permissionless (public) blockchains

  • Ethereum POS (POW earlier)
  • Bitcoin POW
  • Solana POH
  • Cosmos Tendermint
  • Avalanche ???

Permissioned blockchains to create CBDC (Central Bank Digital Currency) or interbanking communication - as SWIFT traditionally does (not about decentralization):

  • HyperledgerFabric by IBM/Linux Foundation
  • Corda by R3
  • atop Ethereum https://github.com/Consensys/quorum
  • with Polkadot Substrate
  • with Avalanche Evergreen

Problem - the growth of the blockchain data7, now in terms of hundreds of GBs. Hence the notion of a succinct blockchain (see the example of Mina).

Blockchain Trilemma by Vitalik Buterin - choose 2:

  • Security (how easily a node can join the network)
  • Decentralization (how many nodes are participating in the consensus)
  • Scalability (how many transactions per second (TPS) can the network handle)

DePINs (decentralized physical infrastructure)

Finality is a probabilistic property even for deterministic protocols as there always exists a chance for a hard fork (consensus can be coordinated off-chain like in the DAO hack case)

1

Our focus is on the distributed nature of blockchains, namely Distributed Ledger Technology (DLT), rather than on money-related issues. While being an innovative, advanced technology revolutionizing several aspects of life, blockchain projects sometimes suffer from ordinary security issues and try to solve them with traditional centralized methods - check https://www.web3isgoinggreat.com/ for a rich collection.

4

See, for example, Solana’s validators market shares and commissions https://solanabeach.io/validators

3

(https://ethereum.org/en/whitepaper/)

8

https://courses.cfte.education/blockchain-and-defi-projects/

6

See also Balaji S. Srinivasan Quantifying Decentralization

5

Implementation languages also matter: niche ones (like Haskell or OCaml) or too low level ones (like C) can also prevent wider adoption by developers.

7

see Blockchain Size https://bitinfocharts.com/bitcoin/ for Bitcoin, https://etherscan.io/chartsync/chaindefault for Ethereum

Starting with Ethereum1, almost all blockchains provide a Turing-complete computation environment to execute programs (so-called smart contracts2; in Solana they are called simply programs) on chain. The execution is usually handled by some VM (the most popular choice is a WASM-compatible VM3) built into the node. Smart contracts are written either in domain-specific languages like Solidity and Cairo, or in general-purpose languages like Rust (compiled to WASM or another ISA)4.

Because of the complexity of the whole domain and the public availability of contract source code, any overlooked bug can bring a disaster5.

If we create smart contracts which allow some accounts in the blockchain (based, for example, on an entrance fee) to vote on proposals, we get collective decision making, or a so-called DAO (Decentralized Autonomous Organization)6. DAOs are real7.

1

Bitcoin also has Bitcoin Script but it doesn’t provide loops as a control structure.

2

The concept of smart contract was proposed by Nick Szabo in 1994

3

Besides WASM-compatible VMs (like Wasmer), there are Sealevel (Solana's eBPF-compatible VM with parallel execution) and zkVMs like Risc0, to name a few.

4

Check https://defillama.com/languages to see the most popular languages. But also take into account the target chain and estimate computation costs (aka “gas fees”) as an execution of a smart contract is not free.

6

Despite the evident pros (a decentralized, automatic process) and cons (e.g. bugs in the smart contracts, a longer decision process), in the case of DAO governance with fungible tokens there is the problem of “whales” - large token holders who can shift the voting - and in the case of DAO governance with whitelisted accounts there is the risk of a Sybil attack. Whales can even use so-called flash loans (when tokens are borrowed and returned within one block, so no collateral is required) to momentarily gain voting power.

7

The US state of Wyoming even recognizes DAOs as LLCs to provide them with full legal status. See also Securities Law: Social & Community Tokens

Here is a list of references to articles, books, etc. that I've used to prepare this book.

https://www.practicalnetworking.net/series/nat/nat/ https://www.practicalnetworking.net/series/arp/address-resolution-protocol/ https://www.practicalnetworking.net/stand-alone/communication-through-multiple-switches/ https://www.practicalnetworking.net/series/arp/gratuitous-arp/ https://www.practicalnetworking.net/series/arp/arp-probe-arp-announcement/ https://www.practicalnetworking.net/stand-alone/vlans/ https://www.practicalnetworking.net/stand-alone/eigrp-metric/ https://www.practicalnetworking.net/series/packet-traveling/packet-traveling/ https://www.practicalnetworking.net/series/packet-traveling/packet-traveling-series-finale/ https://www.practicalnetworking.net/series/packet-traveling/host-to-host-through-a-router/ https://www.practicalnetworking.net/series/packet-traveling/host-to-host/ https://www.practicalnetworking.net/series/packet-traveling/host-to-host-through-a-switch/ https://www.practicalnetworking.net/stand-alone/ethernet-wiring/ https://habr.com/ru/post/684524/ https://habr.com/ru/post/675812/ https://habr.com/ru/post/686230/

http://book.itep.ru/4/6/bitcoin.htm http://book.itep.ru//4/6/blockchain.htm http://book.itep.ru/5/cicl_54.htm http://book.itep.ru/6/vlan_62.htm http://book.itep.ru/6/fwal_63.htm http://book.itep.ru/6/pgp_644.htm http://book.itep.ru/6/ipsec.htm http://book.itep.ru/10/sem_net.htm http://book.itep.ru/4/44/ip6_4411.htm http://book.itep.ru/4/44/tun_4412.htm http://book.itep.ru/4/44/nd.htm http://book.itep.ru/4/44/rtp_4492.htm http://book.itep.ru/4/44/rtc_4493.htm http://book.itep.ru/4/4/sip.htm http://book.itep.ru/4/44/rut_4411.htm

https://blog.twitter.com/engineering/en_us/topics/insights/2021/simple-scalable-graph-neural-networks https://blog.twitter.com/engineering/en_us/topics/insights/2021/fusing-elasticsearch-with-neural-networks-to-identify-data https://blog.twitter.com/engineering/en_us/topics/infrastructure/2021/processing-billions-of-events-in-real-time-at-twitter- https://blog.twitter.com/engineering/en_us/topics/insights/2021/graph-neural-networks-through-the-lens-of-differential-geometry- https://blog.twitter.com/engineering/en_us/topics/insights/2022/graph-machine-learning-with-missing-node-features

https://blog.twitter.com/engineering/en_us/topics/open-source/2020/hunting-a-linux-kernel-bug

https://ethereum.org/en/developers/docs/evm/ https://ethereum.github.io/yellowpaper/paper.pdf https://ethereum.org/en/developers/docs/ https://docs.soliditylang.org/en/v0.6.0/introduction-to-smart-contracts.html#the-ethereum-virtual-machine

https://habr.com/ru/company/yandex/blog/564510/

https://habr.com/ru/company/dsec/blog/278335/ https://habr.com/ru/post/168407/ https://habr.com/ru/post/430172/