Preface
To my girls: wife, daughter, and mom - for their patience and love
Who should read this book?
To be honest, I don’t have vast practical experience of building distributed systems beyond the most popular one among web developers (hint: client-server). But a usual prerequisite for acquiring this practical experience is a non-basic knowledge of the topic. It is the classical “chicken and egg” problem: to work on a distributed software product you are usually required to have some related projects in your portfolio. Writing this book is my way to close this gap.
Anyway, I assume that readers have developed web projects and already have some experience with vertical and horizontal scaling, containerization, data races, and classical algorithms and data structures. I will try my best to introduce all the vocabulary used.
I also do not devote much time to single-machine patterns, as I believe you already have enough working practice with them.
Why yet another book on distributed systems?
I’ve read some excellent books devoted to distributed systems or related topics. Among them:
- Mikito Takada Distributed systems: for fun and profit
- Brendan Burns Designing Distributed Systems
- Martin Kleppmann Designing Data-Intensive Applications
- Donne Martin System design primer
All the books mentioned above have their own view on which topics should be covered and at what level. Martin Kleppmann’s outstanding book (highly recommended) obviously focuses on data-related topics.
Even though I had read the books above, I still had no clear mental picture and no systematic view of the Raft algorithm, Lamport clocks, and other distributed buzzwords. So this book tries to cover important topics (the choice is subjective, of course) and to serve as a practical guide for busy engineers striving to grasp core concepts in a timely manner.
True understanding comes with real doing, so practice through exercises is a necessary part of learning. I have chosen Rust1 as the language to use throughout the book as I believe that Rust incorporates modern best practices and solid theoretical achievements. Experienced developers will dive in without further ado. But for those who prefer a more gradual introduction, I can recommend excellent free resources to start with, such as Easy Rust and Rust by Example (see the references).
Why should we learn about distributed systems?
I admit that you already know the answer if you’ve started reading the book. But for the sake of completeness let’s list some points:
- modern scalable products inevitably rely on the concepts discussed further. So if you want to get working experience in modern tech companies, or even build your own product with high load and fault tolerance in mind, you should grasp the basics2;
- expecting your software to become distributed at some point in the future can radically improve your architecture decisions and even determine your preferred stack of technologies;
- the emerging Web 3.0 is built around the idea of decentralization, which is tightly connected with distribution. So at the very least you will be able to understand the underlying technologies and explore them deeper.
The goal of the book follows from the points above - to get the reader familiar with selected topics on distributed systems with a strong emphasis on practice.
Errata, typos, bugs and other feedback
I am not a native English speaker, so I will be glad to get any feedback on the correct usage of English grammar and phrases. Any suggestions, improvements, and fixes are also welcome. You can open merge requests at the book’s GitHub repository or email me at {author’s name}.{author’s last name}@gmail.com.
Acknowledgments
I would like to say many kind words to the people who influenced me as a software engineer during my career. First of all, my friend Farit Sadykov, who helped me a lot to become a developer. Also my colleagues, especially Pavel Antonov, Nikolay Yurov, Vladimir Galkin, Pavel Gorbunov, Andrey Belov, Evgeny Orlov, Ilya Zhigalko, Slava Cheremushkin, Kirill Zhuravlev, Ilya Scherbakov, Dariusz Antoniuk, Ruud van Asseldonk, Michael Livshin.
Also many thanks to open source enthusiasts creating wonderful products, books, and articles, generously sharing knowledge with the rest of the world. We truly stand on the shoulders of giants.
Maksim Ryndin
Montenegro, 2022.
This book is not affiliated with the Rust Foundation.
Being scalable doesn’t only mean adding more servers to the software and infrastructure. It’s also about your team being ready to grow with growing traffic. So being prepared to go distributed is a must for every team member in terms of some introductory learning experience, automated pipelines, merge reviews, etc. Please watch the very informative presentation by Lisa Guo on scaling Instagram.
Introduction
The big idea is “messaging” <..> The key in making great and growable systems is much more to design how its modules communicate rather than what their internal properties and behaviors should be. Alan Kay, creator of Smalltalk1
What is distributed?
Distributed system is any system in which a number of independent interconnected computers can cooperate in order to achieve a common goal. Oxford Dictionary of Computer Science
If we have two virtual machines (VMs) with a web server on the first VM and a database server on the second one, do we consider such a system distributed? If the VMs are on the same physical node, is it a distributed system? If the web server and the database are running in separate containers and these containers are on different physical nodes, then this system is distributed. But then a container orchestrator moves both containers onto the same node. Does the system of two connected containers lose its distributed flavor?
There are many questions like these, and for the purposes of this book we will follow the next definition of a distributed system.
A distributed system is software which has at least two components interacting with each other only over a network.
If not specified otherwise, we will call the components nodes, and a group of connected nodes is called a cluster.
Even if the network used for communication between the components is currently a loopback interface (“localhost”), it can become a real network in the next deployment.
Why do we need to make our software distributed?
At least, to make it highly available at the cost of an increased complexity and managing effort.
Overview of the book
TODO
In this case we have a complete graph, so every node has connections with the other n-1 nodes. Summing the degrees gives n*(n-1), but as this counts every edge twice, the total number of edges is n*(n-1)/2.
Network stack
Our definition of a distributed system mentions a network as a medium for communication. This medium is a very interesting and complex subject per se but here we only review some core concepts relevant to our study.
Physically, nodes (or hosts - think computers) in the network are connected by means of network interface controllers (NIC, also network adapter) - special hardware1 allowing a computer to connect to the network via a cable (Ethernet) or radio (WiFi)2. This physical layer is called the link layer in the TCP/IP stack3. Nodes at the link layer are identified by a Media Access Control (MAC) address - a unique hardware identifier of every network interface of the host4.
We can use the ifconfig utility (or the more modern ip addr and ip link show) to manage network interfaces. Let’s describe the most common fields of its output:
- <UP,BROADCAST,RUNNING,MULTICAST> means that the interface supports broadcasting and multicasting, is loaded (UP as opposed to DOWN), and is ready to accept connections (RUNNING). Other popular flags are LOOPBACK (it is a loopback interface) and PROMISC (the interface is in promiscuous mode, capturing all packets in the network).
- mtu 1500 - the Maximum Transmission Unit is 1500 bytes, so an IP packet greater than 1500 bytes will be fragmented.
- ether bc:d0:74:ec:76:17 means the MAC is bc:d0:74:ec:76:17.
- inet 192.168.43.174 netmask 0xffffff00 broadcast 192.168.43.255 means the IPv4 address is 192.168.43.174 with network mask 255.255.255.0, so we can have hosts 192.168.43.1-254; 192.168.43.255 is the broadcast address and 192.168.43.0 is the network itself.
- inet6 fe80::88f:5ab4:1ad8:eed5 prefixlen 64 secured scopeid 0xe means the IPv6 address is fe80::88f:5ab4:1ad8:eed5.
So we have a network of nodes (also called a LAN - local area network5). How do we interconnect networks (to get a WAN - wide area network) so that we can send a packet with data (called a datagram) between two hosts from different networks? Here a routing problem arises. Computers called routers6 relay datagrams from source to destination across network borders7. But how do we distinguish a host in one network from a host in another network? We need some addressing at the internet layer. IPv4 addressing assigns each host an IP address of the form xx.xx.xx.xx (4 octets of 8 bits each). But with 2^32 unique IP addresses we cannot address all hosts, so IPv6 (128 bits) was developed. IPv4 address exhaustion was partially mitigated by the so-called Network Address Translation (NAT) and private IPs8, which results in a violation of the end-to-end principle9. You can check the current adoption of IPv6 with Google data. The internet layer is independent of the network topology and of the physical way hosts are connected.
Within a LAN we need to establish a mapping between the IP addresses of hosts and their corresponding hardware addresses (MAC) to send a packet from one host to another. Such a mapping is called the ARP table (or ARP cache) and is filled on each host with Address Resolution Protocol (ARP) broadcast messages. In Linux/Unix the ARP table can be manipulated with the arp utility. For IPv6, the Neighbor Discovery Protocol handles that.
Routing is the mapping of IP addresses to interfaces and can be displayed with netstat -rn. You can trace a packet’s path with tracepath.
After we have identified a route between two hosts, we can use a transport protocol to, at the very least, specify ports on the hosts to connect the specific application processes involved. Because the internet layer is only responsible for routing and not for the reliability of communication, transport-layer protocols can also offer reliability mechanisms: congestion control, preserving data ordering, eliminating packet loss, and packet deduplication.
So we have a connection between hosts and we can exchange application data over the transport protocol.
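To make this concrete, here is a minimal sketch of mine (not from the book’s repository) of two application processes exchanging data over TCP on the loopback interface; the port 7878 is an arbitrary choice:

use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

fn main() -> std::io::Result<()> {
    // The "server" process listens on a port; the OS identifies it by the (IP, port) pair.
    let listener = TcpListener::bind("127.0.0.1:7878")?;
    let server = thread::spawn(move || {
        let (mut socket, peer) = listener.accept().unwrap();
        let mut buf = [0u8; 64];
        let n = socket.read(&mut buf).unwrap();
        println!("server got {:?} from {}", &buf[..n], peer);
    });
    // The "client" process connects to the same (IP, port) pair and sends data.
    let mut client = TcpStream::connect("127.0.0.1:7878")?;
    client.write_all(b"hello over TCP")?;
    server.join().unwrap();
    Ok(())
}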
A network interface can also be virtual to allow userspace applications to intercept raw packets (see the Universal TUN/TAP device driver). The famous loopback interface (127.0.0.1 for IPv4 and ::1 for IPv6) is also virtual.
We consider only packet networks, where the same communication channel can be shared between several communication sessions by means of packetizing data. In contrast, in circuit networks (for example, the public switched telephone network, PSTN) the channel capacity is completely used by a single communication session between two nodes.
Note that the common reference to the TCP/IP stack (also called the Internet protocol suite) includes not only the Internet Protocol and the Transmission Control Protocol (TCP) but also other protocols such as the User Datagram Protocol (UDP) and QUIC at the transport layer and Internet Protocol Security (IPSec) at the internet layer. The TCP/IP stack as a practical network stack predates the theoretical OSI model for general networks. Since the adoption of TCP/IP in ARPANET in 1983, several proprietary implementations of the stack for different operating systems emerged. But the popularity of TCP/IP increased when the University of California at Berkeley open sourced its implementation for BSD Unix, which became the famous BSD sockets (see Wikipedia, Internet protocol suite). Among the alternatives to TCP/IP were IPX/SPX and AppleTalk.
Many network interface controllers allow overwriting their MAC. In this case the MAC should be unique within the network where the NIC lives. Moreover, software-defined MACs are used in VMs managed by hypervisors such as QEMU, and MAC randomization is gaining popularity to avoid mobile device tracking (see Wikipedia, MAC address).
The main criterion here is that the nodes are physically located nearby. See also Computer Networking Introduction: Ethernet and IP (Heavily Illustrated) by Ivan Velichko.
Do not confuse routers with network bridges. Routers allow separate networks to communicate, while bridges join networks, making them a single network. While routers interconnect networks at the internet layer, a gateway is a more general term - gateways can interconnect the PSTN (public switched telephone network) and a VoIP network acting as a VoIP gateway, or even the internet and satellites orbiting Earth acting as an Internet-to-orbit gateway. A default gateway is the destination of IP packets whose destination IP doesn’t belong to the local network.
When a packet crosses a border between networks, it is called a hop. The time-to-live (TTL) in IPv4 (and hop limit in IPv6) prevents infinite cycling of an IP packet between networks, as it is decreased by 1 at every router the packet visits.
Private IPs are defined to belong to the following subnets: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16. The number after the slash is the network prefix (CIDR notation) and denotes the number of leading bits common to all hosts in the network. So, for example, 192.168.0.0/16 means that the first 16 bits (192.168 in decimal) are the same for all hosts, and the remaining 16 bits of the total 32 define the host (2^16 possible addresses in the network). To be precise, the total number of usable addresses should be decreased by two, as the all-zeroes host part denotes the network itself while the all-ones host part is the broadcast address. So when you encounter the IP 192.168.12.134, you already know that it is not reachable publicly (from the Internet); it is some internal private host.
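As an illustration of the CIDR arithmetic above, here is a small sketch of mine (not from the netstack repository) that checks whether an IPv4 address belongs to a given prefix by comparing the leading bits:

use std::net::Ipv4Addr;

// Returns true if `addr` lies inside `network`/`prefix_len` (CIDR notation).
fn in_subnet(addr: Ipv4Addr, network: Ipv4Addr, prefix_len: u32) -> bool {
    // Build a mask with `prefix_len` leading ones, e.g. /16 -> 0xffff0000.
    let mask: u32 = if prefix_len == 0 { 0 } else { u32::MAX << (32 - prefix_len) };
    (u32::from(addr) & mask) == (u32::from(network) & mask)
}

fn main() {
    // 192.168.12.134 is inside 192.168.0.0/16, hence a private address.
    assert!(in_subnet(Ipv4Addr::new(192, 168, 12, 134), Ipv4Addr::new(192, 168, 0, 0), 16));
    assert!(!in_subnet(Ipv4Addr::new(8, 8, 8, 8), Ipv4Addr::new(10, 0, 0, 0), 8));
    println!("CIDR checks passed");
}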
For IPv6, private IP addresses are not desirable because its main goal is to provide globally unique addressing. So for LAN use, a unique local unicast address starts with the prefix fd (8 bits), then 40 bits of a global id (chosen randomly by the operator of the network), then 16 bits for a subnet, and the remaining 64 bits define the host. So with IPv6, local private IPs are essentially globally unique if the 40 bits of the global id are indeed random, and this global uniqueness of local addresses allows merging two networks without reassigning addresses or using NATs. In IPv6 there is no broadcast - there is only multicast, where packets are received only by hosts which declared an interest in specific packets.
The end-to-end principle means that the end hosts participating in the communication (and not other parties) are themselves responsible for communication reliability and security. The lower routing layer is connectionless, i.e. each packet goes its own route rather than a prearranged one. Interesting fact: the principle was pioneered by CYCLADES - an INRIA network started at the same time as the well-known ARPANET.
Practice: virtual interface
As we already discussed, at the link layer of TCP/IP we have network interfaces allowing us to interact with network devices. But these network devices can be completely virtual, allowing a userspace program to interact with the network1 at the link layer (TAP device) or the internet layer (TUN device), which is used in different tunneling programs like VPNs. You can create such interfaces with console commands (like ip tuntap), but let’s try to create a TAP device programmatically. There are C examples, but let’s do it in Rust from scratch2.
We create a Rust workspace (a collection of several packages) to allow some experiments with the network in further chapters (you can find all the code in the netstack repository).
$ mkdir netstack
$ cd netstack
In the netstack directory we create Cargo.toml with the contents:
[workspace]
members = [
"virtual-interface",
"bindings",
]
After that we create the libraries by running cargo new --lib virtual-interface and cargo new --lib bindings in the netstack directory. To create a virtual network device, we need to open the file /dev/net/tun for read-write access and configure it with ioctl - a Swiss Army knife of device management in Unix/Linux. Essentially, ioctl as a universal system call (please see man ioctl) was created to control different devices and to be extensible enough to manage new types of devices. To achieve this goal, ioctl accepts an open file descriptor (the device), a request code - an unsigned long number which is unique across devices - and it can accept a pointer to some data structure related to the specific device. In our case of a network device (a TAP device) such a control structure is ifreq (interface request, see man 7 netdevice) and the request code is TUNSETIFF.
Rust provides the libc crate as raw FFI bindings for system libraries, so we could use ioctl and some constants, but it is missing the constants for TUN and TAP devices. So, as we need to create some bindings ourselves anyway, let’s use the bindgen crate to generate bindings for system libraries at build time. Following the bindgen tutorial we create a netstack/bindings/wrapper.h file where we place the necessary headers:
#include <sys/ioctl.h>
#include <linux/if.h>
#include <linux/if_tun.h>
and create bindings/build.rs with the contents
extern crate bindgen;
use std::env;
use std::path::PathBuf;
fn main() {
// Tell cargo to invalidate the built crate whenever the wrapper changes
println!("cargo:rerun-if-changed=wrapper.h");
// The bindgen::Builder is the main entry point
// to bindgen, and lets you build up options for
// the resulting bindings.
let bindings = bindgen::Builder::default()
// The input header we would like to generate
// bindings for.
.header("wrapper.h")
// Tell cargo to invalidate the built crate whenever any of the
// included header files changed.
.parse_callbacks(Box::new(bindgen::CargoCallbacks))
// Finish the builder and generate the bindings.
.generate()
// Unwrap the Result and panic on failure.
.expect("Unable to generate bindings");
// Write the bindings to the $OUT_DIR/bindings.rs file.
let out_path = PathBuf::from(env::var("OUT_DIR").unwrap());
bindings
.write_to_file(out_path.join("bindings.rs"))
.expect("Couldn't write bindings!");
}
Run cargo build, and at target/debug/build/bindings-.../out/bindings.rs we can find the needed bindings - except TUNSETIFF, the request code required for ioctl (bindgen does not generate bindings for function-like macros). If we check the contents of /usr/include/linux/if_tun.h, we see that TUNSETIFF is a function-like macro: #define TUNSETIFF _IOW('T', 202, int). So let’s port the C macro code from /usr/include/asm-generic/ioctl.h, where the participating macros live.
In netstack/bindings/src/lib.rs
include!(concat!(env!("OUT_DIR"), "/bindings.rs"));
use std::mem;
use std::os::raw;
// /usr/include/asm-generic/ioctl.h
const IOC_SIZEBITS: u8 = 14;
const IOC_NRBITS: u8 = 8;
const IOC_TYPEBITS: u8 = 8;
const IOC_NRSHIFT: u8 = 0;
const IOC_TYPESHIFT: u8 = IOC_NRSHIFT + IOC_NRBITS;
const IOC_SIZESHIFT: u8 = IOC_TYPESHIFT + IOC_TYPEBITS;
const IOC_DIRSHIFT: u8 = IOC_SIZESHIFT + IOC_SIZEBITS;
const IOC_WRITE: u32 = 1;
//#define TUNSETIFF _IOW('T', 202, int)
const INT_SIZE: usize = mem::size_of::<raw::c_int>();
pub const TUNSETIFF: u32 = IOC_WRITE << IOC_DIRSHIFT
| (b'T' as u32) << IOC_TYPESHIFT
| 202 << IOC_NRSHIFT
| (INT_SIZE as u32) << IOC_SIZESHIFT;
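As a quick sanity check of my own (not part of the original bindings crate), we can assert that the computed constant matches 0x400454ca - the TUNSETIFF value commonly seen on Linux architectures that use the generic asm-generic/ioctl.h encoding:

#[cfg(test)]
mod tests {
    use super::TUNSETIFF;

    #[test]
    fn tunsetiff_matches_known_value() {
        // Assumption: the target architecture uses the generic ioctl encoding.
        assert_eq!(TUNSETIFF, 0x400454ca);
    }
}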
Let’s add the file netstack/virtual-interface/src/tap.rs where the main things happen:
use std::ffi;
use std::fs::{File, OpenOptions};
use std::io;
use std::mem;
use std::os::unix::io::{FromRawFd, IntoRawFd};
use bindings;
const VIRTUAL_DEVICE: &str = "/dev/net/tun";
#[derive(Debug)]
pub enum VirtualInterfaceError {
IoError(io::Error),
IoctlError,
DeviceNameTooLong,
DeviceNameContainsNulByte(ffi::NulError),
}
impl From<io::Error> for VirtualInterfaceError {
fn from(error: io::Error) -> Self {
VirtualInterfaceError::IoError(error)
}
}
impl From<ffi::NulError> for VirtualInterfaceError {
fn from(error: ffi::NulError) -> Self {
VirtualInterfaceError::DeviceNameContainsNulByte(error)
}
}
pub struct VirtualInterface {
device: File,
}
impl VirtualInterface {
pub fn create(name: &str) -> Result<Self, VirtualInterfaceError> {
// reserve 1 byte for '\0'
if name.len() >= bindings::IFNAMSIZ as usize {
return Err(VirtualInterfaceError::DeviceNameTooLong);
}
// We have to check that the device name has no zero bytes in the middle
let device_name = ffi::CString::new(name)?.into_bytes_with_nul();
let device = OpenOptions::new()
.read(true)
.write(true)
.open(VIRTUAL_DEVICE)?;
// ifreq is a structure to control network device (see man 7 netdevice)
let mut ifr: bindings::ifreq = unsafe { mem::zeroed() };
// create stack allocated array to hold the device name
let mut name_buffer = [0_u8; bindings::IFNAMSIZ as usize];
// and copy name bytes to it
for (i, b) in device_name.into_iter().enumerate() {
name_buffer[i] = b;
}
ifr.ifr_ifrn.ifrn_name = name_buffer;
// IFF_TAP - tap device
// IFF_NO_PI - no additional info for Ethernet package
// IFF_TUN_EXCL - prevent creation of duplicates
ifr.ifr_ifru.ifru_flags = (bindings::IFF_TAP | bindings::IFF_NO_PI | bindings::IFF_TUN_EXCL)
as std::os::raw::c_short;
let raw_fd = device.into_raw_fd();
if unsafe { bindings::ioctl(raw_fd, bindings::TUNSETIFF as u64, &mut ifr as *mut _) } == -1
{
return Err(VirtualInterfaceError::IoctlError);
}
let device = unsafe { File::from_raw_fd(raw_fd) };
Ok(Self { device })
}
pub fn device(&mut self) -> &mut File {
&mut self.device
}
}
and import it inside netstack/virtual-interface/src/lib.rs:
pub mod tap;
To test the code, let’s start with an example, netstack/virtual-interface/examples/read_loop.rs:
use std::io::Read;
use virtual_interface::tap::VirtualInterface;
fn main() {
let mut interface = VirtualInterface::create("dev0").unwrap();
let mut buffer = [0; 4096];
println!("starting read loop for device `dev0`");
loop {
let n = interface.device().read(&mut buffer[..]).unwrap();
println!("The bytes: {:?}", &buffer[..n]);
}
}
and build and run it:
$ cargo build --example read_loop
$ ./target/debug/examples/read_loop
Oh no, some ioctl error: thread 'main' panicked at 'called Result::unwrap() on an Err value: IoctlError'. Let’s debug with strace.
$ strace ./target/debug/examples/read_loop
...
ioctl(3, TUNSETIFF, 0xffffce5f30e0) = -1 EPERM (Operation not permitted)
...
So it complains about permissions. Indeed, checking man 7 netdevice again:
If an ioctl is marked as privileged, then using it requires an effective user ID of 0 or the CAP_NET_ADMIN capability. If this is not the case, EPERM will be returned.
So let’s try with sudo ./target/debug/examples/read_loop and check whether the interface is created in another terminal:
$ ip link show dev0
dev0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether b2:24:f3:9a:71:c1 brd ff:ff:ff:ff:ff:ff
It works!3 But how does our binary know how to call libc functions and use libc structs? Rust dynamically links some system libraries, and libc is among them:
$ ldd ./target/debug/examples/read_loop
linux-vdso.so.1 (0x0000ffffb2199000)
libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffffb20c0000)
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffffb1f10000)
/lib/ld-linux-aarch64.so.1 (0x0000ffffb2160000)
So our netstack/bindings crate works as expected.
Now you may think that giving sudo access to the program is probably too much power, and you’re right. Linux comes to the rescue with its capabilities4 system, allowing us to grant fine-grained pieces of superuser power. So we only need to grant the CAP_NET_ADMIN capability to our program. From man capabilities:
CAP_NET_ADMIN
Perform various network-related operations:
* interface configuration;
* administration of IP firewall, masquerading, and accounting;
* modify routing tables;
* bind to any address for transparent proxying;
* set type-of-service (TOS);
* clear driver statistics;
* set promiscuous mode;
* enabling multicasting;
Note: if the filesystem with your compiled binary is mounted with the nosuid flag (for example, your home directory is encrypted - you can check the flag for the home directory with the command mount | grep $USER), then you should copy the binary to a directory whose filesystem does not have the nosuid flag.
$ sudo setcap cap_net_admin=eip target/debug/examples/read_loop
Now let’s run our example binary. In another terminal run ip link show dev0 and check that the interface is down; let’s bring it up with sudo ip link set dev0 up. Now we can see some bytes in the console, and our next task is to decode them.
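Since the interface was created with IFF_TAP | IFF_NO_PI, every read returns a raw Ethernet frame. As a small taste of the decoding work ahead, here is a sketch of mine (not yet part of the netstack repository) that splits off the 14-byte Ethernet header:

// A parsed Ethernet II header: destination MAC, source MAC, EtherType.
fn parse_ethernet(frame: &[u8]) -> Option<([u8; 6], [u8; 6], u16)> {
    if frame.len() < 14 {
        return None; // too short to contain an Ethernet header
    }
    let mut dst = [0u8; 6];
    let mut src = [0u8; 6];
    dst.copy_from_slice(&frame[0..6]);
    src.copy_from_slice(&frame[6..12]);
    // EtherType is big-endian: 0x0806 = ARP, 0x0800 = IPv4, 0x86DD = IPv6.
    let ethertype = u16::from_be_bytes([frame[12], frame[13]]);
    Some((dst, src, ethertype))
}

In the read loop we could call parse_ethernet(&buffer[..n]) and print the EtherType to see, for example, ARP or IPv6 neighbor discovery traffic arriving on the freshly created dev0.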
Actually, offloading the network stack to userspace in traditional heavy-kernel OSes (as opposed to so-called microkernel OSes) can bring a performance boost - see for example OpenOnload.
Much of the practice in this chapter is inspired by the excellent tutorial Let’s code a TCP/IP stack, 1: Ethernet & ARP by Sami Niiranen.
A comprehensive overview of Linux capabilities and different related exploits can be found in Linux Capabilities by Carlos Polop.
We could create a TAP device with the command sudo ip tuntap add dev0 mode tap. The only difference from our programmatically created device is that ours is not persistent - it lives only while our process lives. To make our device persistent we could just add the flag IFF_PERSIST to ifru_flags.
Unreliable network
A network of interconnected computers is based on some physical layer (Ethernet cables, or radio in the case of wireless networks)1. The physical layer itself is subject to different disruptions, noise, and other problems. So we always have to assume that the network is unreliable and implement retry, ordering, and healthcheck policies in our services.
Moreover, network capacity is limited, and we should estimate ahead of time what amount of data the system under development is going to send through the network.
Network throughput is an amount of data that can be transferred through a network within a certain time period. Oxford Dictionary of the Internet (4 ed.). Usually an average throughput is used: the number of transferred bits divided by the number of elapsed seconds.
We can measure network throughput in bits per second (bit/s or bps) or even in data packets per unit of time.
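As a toy illustration (my own sketch, not the iperf-style tool we will build later), we can estimate an average throughput over the loopback interface by timing how long it takes to push a known number of bytes through a TCP connection; the port and buffer sizes are arbitrary:

use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:9000")?;
    // The receiver drains the socket and counts bytes until the sender closes it.
    let receiver = thread::spawn(move || {
        let (mut socket, _) = listener.accept().unwrap();
        let mut buf = vec![0u8; 64 * 1024];
        let mut total = 0usize;
        loop {
            let n = socket.read(&mut buf).unwrap();
            if n == 0 { break; }
            total += n;
        }
        total
    });
    // The sender pushes 100 MiB and measures the elapsed time.
    let mut sender = TcpStream::connect("127.0.0.1:9000")?;
    let chunk = vec![0u8; 64 * 1024];
    let start = Instant::now();
    for _ in 0..(100 * 1024 * 1024 / chunk.len()) {
        sender.write_all(&chunk)?;
    }
    drop(sender); // close the connection so the receiver sees EOF
    let received = receiver.join().unwrap();
    let secs = start.elapsed().as_secs_f64();
    println!("{} bytes in {:.2} s = {:.2} Mbit/s", received, secs, received as f64 * 8.0 / secs / 1e6);
    Ok(())
}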
Stop&Think: How do you measure a network throughput? How do you estimate a maximum throughput (also called bandwidth or channel capacity) of a network?2
Congestion control algorithms:
- https://cloud.google.com/blog/products/networking/tcp-bbr-congestion-control-comes-to-gcp-your-internet-just-got-faster
- https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/tcp_bbr.c
- https://cacm.acm.org/magazines/2017/2/212428-bbr-congestion-based-congestion-control/fulltext
- https://dl.acm.org/doi/10.1145/52324.52356
- https://github.com/google/bbr
Typical interservice communication within a corporate network often leads one to think that communication is almost instantaneous. But in the case of a worldwide internet connection spanning several networks, the time to deliver data increases significantly.
Network latency is the time taken for a given signal or data packet to propagate through a circuit or network from its source point to its destination Oxford Dictionary of Electronics and Electrical Engineering (5 ed.)
Network latency is usually measured from source to destination and back, constituting the so-called round-trip time (RTT), which is, roughly speaking, double the latency.3
Caching, keep-alive connections, and geographical proximity to the users are among the usual options to decrease latency4.
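A crude way to get a feel for latency (a sketch of mine; real tools like ping use ICMP, as noted below) is to time TCP connection establishment, which costs roughly one round trip:

use std::net::TcpStream;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    // example.com:80 is only an illustration; any reachable TCP endpoint works.
    let start = Instant::now();
    let _stream = TcpStream::connect("example.com:80")?;
    // The TCP three-way handshake takes roughly one RTT (plus DNS resolution here).
    println!("connect took {:?} (approximately one RTT)", start.elapsed());
    Ok(())
}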
Network partition happens if some nodes of a cluster in a distributed system cannot communicate due to a network failure but are supposed to.
Depending on the network usage profile and the targets for throughput, latency, packet loss, and packet rate, the standard Linux networking stack can be partially or completely replaced with:
- a userspace networking stack (no kernel at all) like DPDK5 and other packet processing frameworks6;
- kernel-executed user-defined network modules (eBPF);
- raw sockets.
So for a distributed system it is absolutely necessary to handle different network failures (which are quite common):
- meet bandwidth requirements and handle network congestion when scaling;
- deal with increased latencies;
- have policies for network partitions: if a network partition creates two or more subclusters and such subclusters can behave independently (a situation known as a split-brain7), how should the system evolve and finally merge conflicting states?
- implement backpressure.
Most of us have heard about the OSI model as a theoretical framework to discuss the networking stack. Just for reference:
- L1 - Physical layer (bits)
- L2 - Data link (frames)
- L3 - Network (packets)
- L4 - Transport (datagrams)
- L5 - Session (data)
- L6 - Presentation (data)
- L7 - Application (data)
Also note that the maximum throughput is not actually an upper limit for application-specific data throughput, as there is always the overhead of latency, protocols, encryption, compression, packet retransmission, etc. The useful data throughput is sometimes called goodput (see also Wikipedia, Goodput).
An actual latency measurement is complicated by the presence of several nodes on the packet’s way, queuing on gateways (several packets from different sources to the same destination are put in a waiting list to send), and processing delays. Tools like the famous ping can use special control protocols (such as ICMP) which differ from the protocols that you actually use for data (such as TCP), so measurements are biased.
Analyzing your application stack and CPU/IO profile can also help to choose an appropriate operating system, if that is possible - see Benchmarks: FreeBSD 13 vs. NetBSD 9.2 vs. OpenBSD 7 vs. DragonFlyBSD 6 vs. Linux.
see also SPDK for a storage
see also https://github.com/ntop/PF_RING/tree/dev, https://github.com/luigirizzo/netmap
Create your own tool to measure network throughput
Here we will create our own Rust version of iperf3 (https://github.com/esnet/iperf) to better understand the concepts around the network.
Use Wireshark to analyze packets.
Write our own DPI (nDPI, GoodbyeDPI).
Availability
Availability in common practice is understood as a metric rather than a property of the system, while in the theory of distributed systems (especially the CAP theorem1) it is more of a property, which leads to confusion.
We will follow practical considerations, and by availability we will understand the following concepts.
A service level indicator (SLI) is a carefully defined quantitative measure of some aspect of the level of service that is being provided. It is a metric, not a target.2
For example, it can be a ratio of successfully handled requests.
A service level objective (SLO) specifies a target level for the reliability of your service. The SLO is a target value for an SLI.2
A service level agreement (SLA) declares your commitment to satisfy the SLO. The SLO also determines your error budget, i.e. the duration of the allowed downtime within different time periods.
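For example (the numbers are mine, for illustration), a 99.9% monthly availability SLO leaves an error budget of about 43 minutes of downtime per 30-day month, while 99.99% leaves only about 4 minutes:

// Allowed downtime (in minutes) for a given SLO over a period of `days`.
fn error_budget_minutes(slo: f64, days: f64) -> f64 {
    let total_minutes = days * 24.0 * 60.0;
    (1.0 - slo) * total_minutes
}

fn main() {
    println!("99.9%  -> {:.1} min", error_budget_minutes(0.999, 30.0));  // ~43.2
    println!("99.99% -> {:.1} min", error_budget_minutes(0.9999, 30.0)); // ~4.3
}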
Going distributed is just one of the ways to strengthen availability by adding scalability and redundancy.
While the network is a primary source of failures for distributed systems, we should also take into consideration other resources related to a single node:
- Storage (disks get corrupted, drivers contain bugs, various filesystems have different ordering models and are mounted with different policies3)
- CPU (processes are stopped while a VM is migrated, system clocks may drift, CPU usage patterns should align with branch prediction and cache sizes)
- Memory (it can be corrupted or fragmented; what are the memory usage patterns?)
So we should define fault models for the different resources in our distributed system4, i.e. a specification of how our application handles different failures.
Clever design (including API design), code quality assurance (including tests of configuration changes5), proper algorithms, performance tuning, application monitoring, best DevSecOps practices, and collection of business and tech metrics significantly improve availability and should be considered first, before introducing the complexity of distribution.
Martin Kleppmann A Critique of the CAP Theorem
Google Cloud Architecture Framework Reliability principles
An eye-opening read for the author was Files are fraught with peril by Dan Luu - how difficult and error-prone it is to use the file API and how weak many filesystems are in terms of error handling. The general advice is to rely on already robust and battle-tested products like SQLite (which handles different edge cases) if you need some local persistent storage.
See, for example, the Design Document of TigerBeetle.
see Dan Luu’s Reading postmortems
Distributed vs Decentralized
Decentralized network is a dispersed network having several or multiple large hubs <..> and no dominant central hub. Oxford Dictionary of Media and Communication
Consider master-slave replication of a database. We have some slave databases serving only read requests and a master database server executing read and write queries and replicating all changes to the slaves. Clearly this system is distributed due to the replication link. If we have no mechanism for promoting a slave to master in case of the master’s failure, then our system is not decentralized.
So decentralization increases fault tolerance by removing a single point of failure. But there is also a catch. Consider two network topologies: mesh (in the case of full decentralization) and star (with a central master).
So in the case of n nodes in a mesh network topology we have n*(n-1)/2 connections[^graph], while in the case of the star network we have only n-1 connections. So heavily meshed distributed systems produce increasing network contention.
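A tiny sketch of mine comparing the number of connections for the two topologies:

// Number of links in a full mesh of n nodes: every pair of nodes is connected.
fn mesh_links(n: u64) -> u64 {
    n * (n - 1) / 2
}

// Number of links in a star of n nodes: every node connects only to the hub.
fn star_links(n: u64) -> u64 {
    n - 1
}

fn main() {
    for n in [5u64, 50, 500] {
        println!("n = {:3}: mesh = {:6}, star = {:3}", n, mesh_links(n), star_links(n));
    }
}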
Taste of distributed
Exercises
- Due to growing traffic we decided to shard user data by the first letter of the last name. Can we consider such a sharded database a distributed system? Explain your answer.
- Model packet loss in the network.
- Implement an interface for a TUN device.
Established systems
Successful software is composed not built. Roger Johansson, creator of ProtoActor
Implementing distributed systems the right way is hard. In most cases we can compose our software from well-established distributed systems. Of course, the goal of the book is to learn how to implement such systems from scratch, but we should be familiar with the other options. Some are listed below in order of decreasing technology lock-in.
- Erlang/OTP
- ETCD
- Tendermint
- Kubernetes
Battle-tested Erlang/OTP
Any sufficiently complicated concurrent program in another language contains an ad hoc informally-specified bug-ridden slow implementation of half of Erlang. Robert Virding, one of the creators of Erlang
Erlang remains a unique fusion of practical engineering and clever architectural vision. Created in the 1980s, it still has great potential to evolve in the modern web[^klarna]1.
We won’t dive into the language per se and the pragmatic ideas of “let it fail”, supervision trees, and fault tolerance. You can read the very concise thesis, full of practical wisdom (highly recommended), of Joe Armstrong, one of the Erlang creators.
Let’s focus more on Erlang’s distribution capabilities. Erlang is compiled to run in a virtual machine called BEAM2. Elixir is also a BEAM-runnable language with interoperability with Erlang.
Erlang lightweight processes, while executing application logic, communicate with each other using signals. Most used signal type is a message3. Processes are scheduled for execution by BEAM. You can run millions of processes4. OTP is a collection of reusable components (abstractions for concurrent and distributed patterns such as client-server). So Erlang/OTP is a framework allowing you to quickly create complex concurrent and distributed systems.
BEAM (written in C) can be extended with dynamically loaded compiled modules with Native Implemented Functions (NIFs). So you can write this shared module in a language you prefer (supporting C interoperability, of course)5.
Erlang is primarily well suited for massive IO-bound loads, so for CPU-bound tasks you should use, for example, the above-mentioned NIFs6.
To form a cluster, every node (Erlang VM) should share the same cookie file with a secret. It is rather basic “security” whose aim is to differentiate two or more clusters.
Every Erlang VM is started with a predefined name, and then it’s enough to register the remaining n-1 nodes on the first node. All nodes sharing the same cookie will propagate connections to each other, creating a fully connected mesh network. This default transitive connection propagation can be configured - a so-called hidden node without transitive connections can be used to create clusters with fewer connections7.
By default, communication between nodes of a cluster is done over TCP. Separate daemon process (called EPMD - Erlang Port Mapper Daemon) starts (if not already running) on every host machine with Erlang VMs. EPMD uses DNS to resolve node names to ip addresses and maps node names to ip+port (tcp socket). EPMD serves as a name resolver for nodes on the host machine.
You can also implement your own procedures for nodes discovery and discovery under container orchestrator.
The default Erlang distribution with cookies assumes a trusted network, so you should change the default communication mechanism in the case of an untrusted network8. Moreover, a large number of nodes in a fully connected mesh communicating over a large and uncontrolled network can be prohibitively costly. This break point may range from 40 to 140 nodes9, depending on the load and the amount of global state required to sync over the cluster (such as the process namespace, or getting the now time, which requires a global lock to provide monotonically increasing time over the cluster). In such cases federated10 clusters and partitioning of the global state into separate groups of nodes inside a cluster is the way to go11.
Erlang is actively modernized and continuously developed. So it’s a solid foundation for a distributed system.
Erlang lessons to go distributed:
- separation of concerns and modularity - you can configure your communication transport, the algorithm of node discovery, and the network topology;
- a distributed system must be observable (Erlang has excellent tracing and monitoring tools allowing you to observe even specific Erlang processes);
- communication is asynchronous, so no node has to wait for an acknowledgement that its message was received by another one;
- message passing is location transparent (the code to send a message to a local Erlang process is the same as for sending to a process on another node in the cluster - at the cost of more RAM and time to memcpy, as every message is deeply copied);
- maintaining global mutable data (the namespace of lightweight processes in the case of Erlang) and full connectivity severely limits scalability;
- fail fast - correctness is more important than availability (of course, it depends on the startup time). Erlang philosophy denies defensive programming (when you try to recover from all errors). So assertions in production code are beneficial - they declare the negative flow and don’t allow the distributed system (which is designed to be fault tolerant) to behave incorrectly12.
For example, the Swedish fintech Klarna uses Erlang as its core platform handling 2 million transactions per day - look at their job descriptions.
At one of my workplaces I almost convinced my CTO to write an IO-bound service for proxying media streams and unary requests in Erlang. For two weeks I read everything on Erlang and finally presented a workable version which was deployed to production. It worked in production for about 3 weeks until the CTO had a look at the code. He was rather afraid of the completely non-imperative style of the code and of the inability to scale his Erlang team (which consisted of only me). So he gently asked me to rewrite the app in Go. For those of you who are brave: Adopting Erlang.
Alternative execution environments (including not only VMs) emerge periodically, but I haven’t heard of any production-ready one. BEAM is really complicated to reinvent. See also Lumen.
There are a lot of holy wars around considering Erlang an actor language. See the Stack Exchange question and Robert Virding’s email.
See Rustler
See also epmdless, Using TLS for Erlang Distribution
Federation is a technique to scale software by grouping its parts by feature. For example, you can federate a database into separate servers: one for sessions, a second for users, a third for sales, etc. The increased throughput comes at the cost of joining data and holding transactions on the application side.
See Scaling Reliably: Improving the Scalability of the Erlang Distributed Actor Platform, Scaling Erlang Cluster to 10,000 Nodes, Stackoverflow question
See also Joran Dirk Greef TigerStyle! (Or How To Design Safer Systems in Less Time)
Several nodes in a distributed system should agree on a value. More formally, a set of replicated state machines computes the same state by applying commands from a replicated log. And the consensus goal is to synchronize the log for every node.
Distributed systems can be classified by different criteria which have both theoretical (as constraints for models) and practical (as foundations for design decisions) importance. Let’s list the most used ones:
- By a communication time bound.
A synchronous system assumes that there is some bound (known in advance) on the delivery time of messages between nodes.
An asynchronous system assumes that there is no such bound (it is impossible to say whether a node has failed or is just processing slowly). An asynchronous system seems to be more suitable for nodes interacting via networks, but there is no known consensus algorithm for such a model (see below).
- By a fault mode.
Non-faulty nodes do not crash or misbehave at all.
Fail-stop nodes stop when they fail/crash and do not restart.
Fail-recover nodes recover after crashes and behave correctly after the restart.
Byzantine-faulty nodes may crash intentionally and, moreover, misbehave to hamper non-Byzantine nodes from achieving consensus.
- By reliability of communication (transport).
Systems with a reliable transport, where sent messages are finally delivered (e.g. stream-oriented protocols like TCP, HTTP/2, QUIC).
Systems with an unreliable transport, where some sent messages can be dropped while in transit (e.g. UDP).
All types under each criterion are listed from the most restrictive to the most general. As a consequence, if something holds for a more restrictive subtype, then it also holds for a more general one. Why is this reasoning important? Let’s illustrate with some impossibility results.
Some impossibility results (when consensus is impossible to achieve):
- FLP: in an asynchronous distributed system with reliable communication and fail-stop nodes, consensus is impossible. As a consequence, the statement also holds for asynchronous systems with fail-recover or Byzantine-faulty nodes and a reliable communication (or an unreliable one).
- In the presence of an unreliable transport, consensus is impossible even for a synchronous system (TODO link?)
- Consensus is impossible even in a synchronous distributed system when Byzantine faults are possible and one third or more of the processes fail (N <= 3F): for example, N = 3 nodes cannot tolerate even a single Byzantine node, while N = 4 tolerates F = 1 (Lamport, Shostak, Pease).
- Atomic broadcast and its duality to consensus.
How to deal with such impossibility results?
- in theory - additional constraints are introduced: the network will be periodically synchronous for short intervals of time, and in such intervals of synchrony all undelivered messages will be delivered under some fixed time bound - see [DLS88] C. Dwork, N. A. Lynch, and L. J. Stockmeyer, Consensus in the presence of partial synchrony;
- in practice - by using failure detectors. The simplest ones are periodic healthchecks (when a communicating peer probes the node) and heartbeats (when the node itself emits them); a sketch follows below.
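Here is a minimal sketch of mine (with arbitrary timeouts) of a heartbeat-based failure detector: a node records the last time it heard from each peer and suspects a peer once the silence exceeds a timeout:

use std::collections::HashMap;
use std::time::{Duration, Instant};

// A naive heartbeat failure detector: peers are suspected after `timeout`
// without a heartbeat. Real detectors trade off accuracy against completeness.
struct FailureDetector {
    last_heartbeat: HashMap<String, Instant>,
    timeout: Duration,
}

impl FailureDetector {
    fn new(timeout: Duration) -> Self {
        Self { last_heartbeat: HashMap::new(), timeout }
    }

    // Called whenever a heartbeat message arrives from `peer`.
    fn on_heartbeat(&mut self, peer: &str) {
        self.last_heartbeat.insert(peer.to_string(), Instant::now());
    }

    // Peers we have not heard from within the timeout window.
    fn suspected(&self) -> Vec<&str> {
        self.last_heartbeat
            .iter()
            .filter(|(_, at)| at.elapsed() > self.timeout)
            .map(|(peer, _)| peer.as_str())
            .collect()
    }
}

fn main() {
    let mut fd = FailureDetector::new(Duration::from_millis(100));
    fd.on_heartbeat("node-1");
    fd.on_heartbeat("node-2");
    std::thread::sleep(Duration::from_millis(150));
    fd.on_heartbeat("node-2"); // node-2 keeps sending heartbeats
    println!("suspected: {:?}", fd.suspected()); // ["node-1"]
}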
Chandra: accuracy vs liveness of failure detectors; leases and leased clocks.
TODO Consider 2PC (2-phase commit) for a synchronous non-faulty system with a reliable transport as a first step. If we introduce fail-stop nodes (e.g. the coordinator itself), then consensus will halt with 2PC.
Desired properties of consensus:
- safety - never return an incorrect result under non-Byzantine conditions (network problems etc.); all nodes agree on the order of events;
- liveness - the consensus makes progress;
- support for changes in cluster membership.
Consensus protocols:
- with an elected leader (Raft, Paxos, Tendermint, pBFT, VR, Nakamoto consensus - PoW, PoS, PoA)
- leaderless (EPaxos, Avalanche, IOTA)
- BFT-tolerant (e.g. nodes never assume the leader is correct or honest): pBFT, Nakamoto, Tendermint
- non-BFT-tolerant (require a trusted setup): Raft, VR, Paxos
Chandra: atomic broadcast is dual to consensus.
Raft
Raft is now one of the most widely used consensus algorithms. For example, it is used in etcd (see the chapter on established systems) and TiKV.
The animated introduction should be checked first, and after that the article.
Key takeaways
- rather than trying to merge results from equal peers (like in Paxos), it is much simpler to elect a leader and make it responsible for the state changes
- a modular design: separate leader election, log replication, safety, and membership changes
Consider the Raft critique (e.g. the openraft repo).
Distributed transactions
2-phase commit 3-phase commit
Viewstamped Replication
Used by TigerBeetle.
Martin Kleppmann A Critique of the CAP Theorem
latency-sensitivity framework
availability as a metric
operational and network latency
algorithm sensitivity to latency
communication patterns:
- unicast (one sender to one receiver)
- broadcast (a sender to all possible receivers)
- multicast (a sender to a selected group of receivers)
- convergecast (a group of senders to a single receiver)
Supermajority (2/3) vs simple majority
Network partitions can be modelled as large packet delays (up to the duration of the partition), provided that the duration of the partition is finite and lost packets are retransmitted an unbounded number of times1.
Martin Kleppmann A Critique of the CAP Theorem
ML training
Machine learning involves many matrix operations and is naturally parallelized not only within a single node but also over distributed nodes. Out-of-the-box solutions like Distributed training with TensorFlow transfer parameters (aka weights) from parameter nodes to worker nodes and back during model training. But large parameter matrices for sparse data heavily consume network bandwidth and severely decrease model training speed1.
Sparse data means having many null values. For example, an unpacked data batch X looks like
[[0, 0, .., 0, 9],
[5, 0, .., 0, 34],
...
[0, .., 720, 14]]
where the total number of rows is the batch size m and each feature row has about 2^20 values (yes, it’s Twitter’s scale).
So the first layer of the network requires 2^20 rows in its parameter matrix W consisting of float32 numbers (4 bytes each), and its number of columns depends on the next layer size, typically 50 - 400 in the case of Twitter. This results in first-layer matrices of 50 * 2^20 * (4 * 2^-20) MB = 200 MB to 400 * 2^20 * (4 * 2^-20) MB = 1600 MB. When you have a lot of worker nodes, each exchanging gigabytes of data with the parameter servers, your network bandwidth is in danger. While the first layer is huge, the other layers are considerably smaller.
Twitter approaches this scalability problem with a clever organization of nodes2.
Assume we have K worker nodes, P parameter nodes, and n features in a dataset row.
Then, if we break the weight matrix of the first (sparse) layer into P submatrices by features (by columns):
X * W = X_1 * W_1 + .. + X_P * W_P
then each worker node only has to work with the block of input X_i and the block of the weight matrix W_i.
Each parameter server i is responsible for the i-th partition W_i of the sparse first layer.
Each worker node runs two processes with the following pseudocode:
Data unpacking and partitioning:
const m // batch size
const P // number of partitions of the sparse layer
fn process_input() {
loop {
batch = receive_input_batch(m) // batch of size m
for i in 0..P {
submatrix = unpack_submatrix_i_of_features(batch, i) // with m rows and n/P columns
send_this_submatrix_to_parameters_server(submatrix, i) // send to server i
}
}
}
Reconstructing the output of the first sparse layer and training the rest of the layers:
const P // number of partitions of the sparse layer
fn train_rest_of_layers() {
loop {
output_layer // the product XW with m rows (batch size) and as many columns as the first layer has outputs
for i in 0..P {
output_layer += receive_output_submatrix_from_parameters_server(i) // product of block of X and block of W from parameters server i
}
calculate_other_layers_of_network(output_layer)
}
}
Because the second and further layers are much smaller than the first “sparse” layer, each worker node has these layers locally.
Each parameter node i runs the following process:
const node_number = i
const K // number of workers
fn train_first_layer() {
loop {
Wi // block i of weights matrix for the first sparse layer
Xi = receive_submatrix_of_features(i)
block_i = calculate_block_product(Xi, Wi)
for j in 0..K {
send_output_layer_block_to_worker(block_i, j) // send to worker j
}
}
}
Such a cluster organization allows for the failure and restart of a worker node without loss of the weights of the first layer (the failure of a parameter node is more damaging). Each worker node is stateless, and no transfer of the huge sparse matrix W occurs over the network3.
Blockchain
Permissioned vs permissionless
TODO: Focus on consensus (POW, POS, POH) as a lesson for the book purposes
TPS (transactions per second)
A permissionless blockchain as a solution to BFT1
A blockchain network (also called a cluster) is a p2p network of nodes (called validators in Solana) exchanging a copy of a ledger (the blockchain). Typically 10 - 50 validators, totaling 1000 - 2000 nodes2.
- distributed state transition requires consensus
Replicated state machine: a batch of state transitions (transactions) is recorded as a block. The first state is the genesis block. A state transition is done by applying a transaction. Several state transitions (i.e. transactions) are usually combined into a single block for time and space efficiency. When several competing chains exist, we need to choose one (called the canonical chain) via a consensus mechanism (PoW, PoS, PoA). There is also the notion of finality.
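To make the chain-of-blocks idea tangible, here is a toy sketch of mine (real chains use cryptographic hashes such as SHA-256 rather than Rust's DefaultHasher) where each block commits to the previous block's hash, so tampering with history invalidates all later blocks:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Hash)]
struct Block {
    height: u64,
    prev_hash: u64,            // hash of the previous block - this forms the chain
    transactions: Vec<String>, // a batch of state transitions
}

fn hash_block(block: &Block) -> u64 {
    let mut hasher = DefaultHasher::new();
    block.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // The genesis block has no predecessor.
    let genesis = Block { height: 0, prev_hash: 0, transactions: vec!["genesis".into()] };
    let block1 = Block {
        height: 1,
        prev_hash: hash_block(&genesis),
        transactions: vec!["alice -> bob: 5".into()],
    };
    let block2 = Block {
        height: 2,
        prev_hash: hash_block(&block1),
        transactions: vec!["bob -> carol: 2".into()],
    };
    // A validator can verify the chain by recomputing the hashes.
    assert_eq!(block1.prev_hash, hash_block(&genesis));
    assert_eq!(block2.prev_hash, hash_block(&block1));
    println!("chain of 3 blocks verified");
}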
Smart contracts are run on VM (Sealevel for Solana, EVM for Ethereum)
A client is a program communicating with a cluster via transactions (composed of several instructions in Solana terms).
Transactions are signed by the client by means of public key cryptography.
A dApp (decentralized app) is a UI client (typically written in some JavaScript flavor).
The node which appends a new transaction to the ledger is an elected leader.
A token is a unit of payment for the computing resources of the cluster (running a smart contract or validating its output). You can invest some tokens in a validator (delegate a stake) to gain a reward. The validator takes some fee (commission) for its services (in Bitcoin, by including a transaction with 25 BTC out of nowhere3). Growing validators can offer a lower commission. This favors large validators becoming even larger, because a lower fee incentivizes investors to delegate their stakes at a lower commission4.
If a validator behaves maliciously, you lose your delegated tokens, and the validator also loses its delegated power and cannot earn as much as before.
You can use a wallet, which is a client allowing you to operate on your token account (send and receive). To make stakes, separate stake accounts are usually used. In Solana, tokens on a stake account can only be invested in one validator, the whole amount at a time. To invest in several validators simultaneously you should use several stake accounts.
How do you get your initial tokens to start? Crypto exchanges (like Binance) allow you to buy some tokens in exchange for a usual currency like US dollars.
An account is typically identified by a public key.
Validators advertise their public keys.
Bitcoin Core can be used to build dApps:
- with Bitcoin scripting
- with some metaprotocols requiring trusted nodes. Or you should invent your own blockchain.
Buterin suggested EVM to
Classical consensus requires all nodes to participate, which prevents scalability. So either some nodes (called validators) are selected to validate transactions, or another kind of consensus is suggested (the probabilistic consensus of Bitcoin called Nakamoto consensus, the probabilistic Avalanche consensus), or the topology of the network is changed:
- it is sharded by feature (a DeFi blockchain, smart contracts, etc.), as in Cosmos
- centralization - there is a central network to validate (Polkadot)
- L2 (layer 2, also called sidechains) - see A Foray Into Blockchain Scalability
In the case of a validator pool with classic consensus, we need some voting mechanism to allow for more even participation (quadratic voting, https://wiki.polkadot.network/docs/learn-phragmen).
Byzantine Fault Tolerance (BFT, Tendermint) and Byzantine-Agreement-based (Algorand) approaches limit the number of validators.
PoW and PoS tend to create concentrated pools, which contradicts decentralization.
There are several proposed decentralization indices like Nakamoto Index or Edinburgh Decentralisation Index. We will not delve deeply into specific methodologies but will name a few common subsystems being considered.
- (full) nodes distribution by geography and ownership, cloud providers and data centers
- hardware distribution (by geography, vendor, architecture)
- value owners distribution (by accounts, geography, amount of wealth)
- codebase diversity5
- exchanges support
- number of autonomous systems (AS)
- wallets developers distribution
- jurisdiction constraints (e.g. a ban on mining)
- governance
- code repository diversity (usually only GitHub)
Any of these subsystems can become a bottleneck, effectively reducing decentralization to a very low level (actually the Nakamoto index measures the minimum number of parties needed to control 51% of the resources in any subsystem6). 51% attacks are real: https://www.crypto51.app/
Permissionless (public) blockchains
- Ethereum POS (POW earlier)
- Bitcoin POW
- Solana POH
- Cosmos Tendermint
- Avalanche ???
Permissioned blockchains are used to create CBDCs (Central Bank Digital Currency) and for interbank communication (as SWIFT traditionally does); they are not about decentralization:
- HyperledgerFabric by IBM/Linux Foundation
- Corda by R3
- atop Ethereum https://github.com/Consensys/quorum
- with Polkadot Substrate
- with Avalanche Evergreen
A problem is the growth of blockchain data7, measured in hundreds of GBs. Hence the notion of a succinct blockchain (for example, Mina).
Blockchain Trilemma by Vitalik Buterin - choose 2:
- Security (how easily a node can join the network)
- Decentralization (how many nodes are participating in the consensus)
- Scalability (how many transactions per second (TPS) can the network handle)
DePINs (decentralized physical infrastructure)
Finality is a probabilistic property even for deterministic protocols as there always exists a chance for a hard fork (consensus can be coordinated off-chain like in the DAO hack case)
Our focus is on the distributed nature of blockchains, namely Distributed Ledger Technology (DLT), rather than on money-related issues. While being an innovative, advanced technology revolutionizing several aspects of life, blockchain projects sometimes suffer from ordinary security issues and try to solve them with traditional centralized methods - check https://www.web3isgoinggreat.com/ for a rich collection.
See, for example, Solana’s validators market shares and commissions https://solanabeach.io/validators
(https://ethereum.org/en/whitepaper/)
https://courses.cfte.education/blockchain-and-defi-projects/
See also Balaji S. Srinivasan Quantifying Decentralization
Implementation languages also matter: niche ones (like Haskell or OCaml) or too low level ones (like C) can also prevent wider adoption by developers.
see Blockchain Size
https://bitinfocharts.com/bitcoin/ for Bitcoin, https://etherscan.io/chartsync/chaindefault for Ethereum
Starting with Ethereum1, almost all blockchains provide a Turing-complete computation environment to execute programs (so-called smart contracts2; in Solana they are simply called programs) on chain. The execution is usually handled by some VM (the most popular choice is a WASM-compatible VM3) which is built into the node. Smart contracts are written either in domain-specific languages like Solidity and Cairo or in general-purpose languages like Rust (compiled to wasm or another ISA)4.
Because of the complexity of the whole domain and the public availability of contract sources, any overlooked bug can bring a disaster5.
If we create smart contracts which allow some accounts in the blockchain (selected, for example, based on an entrance fee) to vote on proposals, we get collective decision making, or a so-called DAO (Decentralized Autonomous Organization)6. DAOs are real7.
Bitcoin also has Bitcoin Script but it doesn’t provide loops as a control structure.
The concept of smart contract was proposed by Nick Szabo in 1994
Besides WASM-compatible VMs (like Wasmer), there are Sealevel (Solana’s eBPF-compatible VM with parallel execution) and zkVMs (like RISC Zero), to name a few.
Check https://defillama.com/languages to see the most popular languages. But also take into account the target chain and estimate computation costs (aka “gas fees”), as executing a smart contract is not free.
Despite evident pros (a decentralized, automated process) and cons (e.g. bugs in the smart contracts, a longer decision process), DAO governance with fungible tokens suffers from the problem of “whales” - large token holders who can swing a vote - while DAO governance with whitelisted accounts carries the risk of a Sybil attack. Voters can even use so-called flash loans (tokens borrowed and returned within one block, so no collateral is required) to gain voting power momentarily.
The US state of Wyoming even recognizes DAOs as LLCs, giving them full legal status. See also Securities Law: Social & Community Tokens.
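To make the voting mechanics tangible (including the “whale” problem from the footnote above), here is a toy, purely illustrative Rust sketch of token-weighted DAO voting. It runs off-chain, is not tied to any real smart-contract framework, and all account names and balances are made up.

```rust
use std::collections::{HashMap, HashSet};

/// One proposal with token-weighted yes/no tallies.
struct Proposal {
    description: String,
    yes: u64,
    no: u64,
    voters: HashSet<String>, // each account may vote only once
}

struct Dao {
    /// Voting power of each account, e.g. its governance-token balance.
    token_balances: HashMap<String, u64>,
    proposals: Vec<Proposal>,
}

impl Dao {
    fn propose(&mut self, description: &str) -> usize {
        self.proposals.push(Proposal {
            description: description.to_string(),
            yes: 0,
            no: 0,
            voters: HashSet::new(),
        });
        self.proposals.len() - 1
    }

    fn vote(&mut self, proposal: usize, account: &str, in_favor: bool) -> Result<(), String> {
        let weight = *self
            .token_balances
            .get(account)
            .ok_or_else(|| "unknown account".to_string())?;
        let p = &mut self.proposals[proposal];
        if !p.voters.insert(account.to_string()) {
            return Err("already voted".to_string());
        }
        if in_favor { p.yes += weight; } else { p.no += weight; }
        Ok(())
    }

    /// Simple majority of the token weight that actually voted.
    fn passed(&self, proposal: usize) -> bool {
        let p = &self.proposals[proposal];
        p.yes > p.no
    }
}

fn main() {
    let mut dao = Dao {
        token_balances: HashMap::from([
            ("alice".into(), 60), // a "whale" dominating the vote
            ("bob".into(), 25),
            ("carol".into(), 15),
        ]),
        proposals: Vec::new(),
    };
    let id = dao.propose("Fund the grants program");
    dao.vote(id, "alice", true).unwrap();
    dao.vote(id, "bob", false).unwrap();
    dao.vote(id, "carol", false).unwrap();
    println!("proposal passed: {}", dao.passed(id)); // true: the whale decides
}
```

With 60 of the 100 tokens, alice alone decides the outcome no matter how the other members vote - the whale scenario in a nutshell.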
Here is the list of references - articles, books, and other resources - I used to prepare this book.
- Mikito Takada Distributed systems: for fun and profit
- Brendan Burns Designing Distributed Systems
- Martin Kleppmann Designing Data-Intensive Applications
- Wikipedia Web 3.0
- Easy Rust
- Rust by example
- System design primer
- Lisa Guo on scaling Instagram
- Re: prototypes vs classes was: Re: Sun’s HotSpot
- Martin Kleppmann A Critique of the CAP Theorem
- Martin Logan, Eric Merritt, and Richard Carlsson Erlang and OTP in Action
- https://github.com/tsloughter/epmdless
- https://contactchanaka.medium.com/erlang-cluster-peer-discovery-on-kubernetes-aa2ed15663f9
- Joe Armstrong Making reliable distributed systems in the presence of software errors
- https://www.erlang.org
- https://stackoverflow.com/questions/43173196/what-is-the-maximum-practical-number-of-nodes-in-an-erlang-system
- https://stackoverflow.com/questions/5044574/how-scalable-is-distributed-erlang
- Why we used Pony to write Wallaroo
- Adopting Erlang
- BEAM book
- https://stackoverflow.com/questions/32846615/what-is-the-best-way-of-doing-computationally-intensive-tasks-in-erlang-w-o-scal
- Timmo Verlaan No(de) discovery without DNS & EPMD - Code BEAM STO
- Robert on anything
- Phil Trinder et al Scaling Reliably: Improving the Scalability of the Erlang Distributed Actor Platform
- Scaling Erlang Cluster to 10,000 Nodes
- https://softwareengineering.stackexchange.com/questions/277464/is-erlang-really-an-actor-model-language
- Fallacies of distributed computing
- Fred Hebert Lessons Learned while Working on Large-Scale Server Software
- Distributed training with TensorFlow
- Distributed training of sparse ML models — Part 1: Network bottlenecks
- Distributed training of sparse ML models — Part 2: Optimized strategies
- Distributed training of sparse ML models — Part 3: Observed speedups
- What is blockchain technology?
- Mario Zupan How to build a blockchain in Rust
- Pascal Akunne A guide to blockchain consensus protocols
- Ethereum
- Satoshi Nakamoto Bitcoin: A Peer-to-Peer Electronic Cash System
- Chase Barker Getting Started with Solana Development
- Solana Core Concepts (Community Video)
- Solana Docs
- https://chorus.one/networks/solana/
- Felix Lutsch The Basics of Staking
- The Launch of Chorus Ventures
- Nick Szabo The Idea of Smart Contracts
- Eddie Xie, Yuanjun Yang How we built Twitter’s highly reliable ads pacing service
- Взлетит или нет — две разные точки зрения на Web3
- etcd
- Etcd official website
- Wikipedia Network throughput
- Wikipedia Network congestion
- Wikipedia Goodput
- Wikipedia Measuring network throughput
- Wikipedia Latency (engineering)
- Wikipedia Network bridge
- Wikipedia Network delay
- Alex Diaconu Navigating the 8 fallacies of distributed computing
- Kubernetes: Why Use It, How It Works, Options and Alternatives
- Kubernetes Nodes: Components and Basic Operations
- Clustering and Network Partitions
- Benchmarks: FreeBSD 13 vs. NetBSD 9.2 vs. OpenBSD 7 vs. DragonFlyBSD 6 vs. Linux
- Urbit
- Ethereum Whitepaper
- Urbit for Normies
- Confidential Computing: Hardware-Based Trusted Execution for Applications and Data
- A Technical Analysis of Confidential Computing v1.2
- Brian “Beej Jorgensen” Hall Beej’s Guide to Network Programming
- Wikipedia Network interface controller
- Wikipedia Internet protocol suite
- Wikipedia Routing protocol
- Wikipedia IPsec
- Wikipedia Network address translation
- Wikipedia Private network
- RFC 1918 Address Allocation for Private Internets
- Wikipedia Classless Inter-Domain Routing
- Wikipedia Subnetwork
- Wikipedia Unique local address
- RFC 4193 Unique Local IPv6 Unicast Addresses
- Wikipedia UDP hole punching
- Wikipedia TCP hole punching
- Wikipedia Network switch
- Wikipedia End-to-end principle
- Wikipedia Connectionless communication
- Wikipedia CYCLADES
- Wikipedia Port forwarding
- Wikipedia Internet Protocol
- Wikipedia Interactive Connectivity Establishment
- Wikipedia IPv4 address exhaustion
- Wikipedia Maximum transmission unit
- Wikipedia Address Resolution Protocol
- Wikipedia Dynamic Host Configuration Protocol
- Wikipedia Default gateway
- Wikipedia Path MTU Discovery
- Wikipedia Internet Control Message Protocol
- Wikipedia Tunneling protocol
- Wikipedia Network packet
- Wikipedia Router (computing)
- Wikipedia Packet switching
- Wikipedia Circuit switching
- Wikipedia QUIC
- Wikipedia HTTP/3
- Wikipedia HTTP/2
- Wikipedia Time to live
- Wikipedia Neighbor Discovery Protocol
- Wikipedia MAC address
- Wikipedia VoIP gateway
- Wikipedia Gateway (telecommunications)
- Wikipedia MAC spoofing
- Wikipedia IP address
- Wikipedia Berkeley sockets
- Wikipedia Ethernet
- Wikipedia Virtual network interface
- Sami Niiranen OpenVPN puts packets inside your packets
- Universal TUN/TAP device driver
- FAQ: What is the difference between bridging and routing?
- Yury Pitsishin Demystifying ifconfig and network interfaces in Linux
- Ifconfig Command - Explained in Detail
- Private IP Address Ranges
- Getting IPv6 private addressing right
- Robert Greiner CAP Theorem: Revisited
- Wikipedia Public key infrastructure
- Wikipedia Virtual private network
- Wikipedia Multicast
- Wikipedia IP multicast
- Автономный способ обхода DPI и эффективный способ обхода блокировок сайтов по IP-адресу
- Let’s code a TCP/IP stack, 1: Ethernet & ARP
- Let’s code a TCP/IP stack, 2: IPv4 & ICMPv4
- Let’s code a TCP/IP stack, 3: TCP Basics & Handshake
- Routing traffic through custom tap device en-route to internet on linux
- Two tap device can’t communicate over bridge
- Get the IP address of a network interface in C using SIOCGIFADDR
- How to connect a tap interface to the internet?
- How to read data from tap interface?
- Graham Smith Understanding TUN TAP Interfaces
- What is BGP hijacking?
- What is routing? | IP routing
- Wikipedia ioctl
- IPv6 on Linux tun/tap: NDP not working
- Herman J. Radtke III Working with C unions in Rust FFI
- Using libc::ioctl to read interface flags
- what is the meaning of this macro _IOR(MY_MACIG, 0, int)?
- Ivan Ristić Bulletproof TLS Guide
- Ioctl Numbers
- TUN/TAP interface (on Linux)
- The bindgen User Guide
- Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman Linux Device Drivers
- Wikipedia IPX/SPX
- Wikipedia AppleTalk
- Linux Capabilities: Why They Exist and How They Work
- Linux Capabilities In Practice
- Raw capture capabilities (CAP_NET_RAW, CAP_NET_ADMIN) not working outside /usr/bin and friends for packet capture program using libpcap
- Carlos Polop HackTricks
- Неубиваемый P2P-интернет
- Еще одна «критическая» «уязвимость» «VPN» и почему Port Fail — ерунда
- Shodan собирал IPv6-адреса NTP-клиентов и сканировал их в ответ
- Как IPv6 помогает роутеры ломать
- TCP Congestion Control или Почему скорость прыгает
- Moxie Marlinspike My first impressions of web3
- Cryptocurrency wallets and keys — an introduction to digital asset custody and security
- VPN with Overlapping Networks
- Linux programmatically up/down an interface kernel
- Using ip, what does LOWER_UP mean?
- What is SOCK_DGRAM and SOCK_STREAM?
- What is a generic socket and how does it relate to a network device?
- Семенов Ю.А. 4.4.11 Протоколы маршрутизации (обзор, таблицы маршрутизации, вектор расстояния)
- Семенов Ю.А. 4.4.11.1 Внутренний протокол маршрутизации RIP
- “Tell me everything that happens when you type google.com into a web browser”
- How Data moves through the Internet - Networking Fundamentals
- Что происходит, когда вводишь url, или как работает интернет
- Validator FAQs
- Design Document
- Ed Harmoush What happens when…
- Ed Harmoush Key Players
- Ed Harmoush ARP in 5 Minutes
- Ed Harmoush Traditional ARP
- Ed Harmoush Proxy ARP
- Ed Harmoush What is Subnetting? - Subnetting Mastery - Part 1 of 7
- Ed Harmoush Drawing the Cheat Sheet - Subnetting Mastery - Part 2 of 7
- Ed Harmoush How to solve ANY Subnetting Problems in 60 seconds or less - Subnetting Mastery - Part 3 of 7
- Ed Harmoush Practice Examples - Subnetting Mastery - Part 4 of 7
- Protocol Buffers - Google’s data interchange format
- Scapy
- Cap’n Proto
- FlatBuffers
- Improving Facebook’s performance on Android with FlatBuffers
- FlexBuffers
- Serde
- Сравнение гетерогенных блокчейнов (Cosmos, Polkadot, Avalanche)
- Wikipedia Атака Сивиллы
- Wikipedia Reed’s law
- Как работает Wi-fi. Часть 1. История беспроводных сетей
- Как работает Wi-fi. Часть 2. Физический уровень
- What Everyone Gets Wrong About Blockchains
- Что такое слои блокчейна L0, L1, L2 и L3 и зачем они нужны
- About Staking on Ethereum
- Заблуждения программистов о времени
- Urbit Reference
- Opentelemetry Concepts
- TCP — плохой вариант для дата-центров. Встречайте новый протокол Homa
- Snap: a Microkernel Approach to Host Networking
- Mastering Ethereum, by Andreas M. Antonopoulos, Gavin Wood
- Running A Full Node
- Andy Pavlo Databases in 2022: A Year in Review
- Blockchain Fails to Gain Traction in the Enterprise
- Fast and Transparent Interbank Reconciliation Powered by Distributed Ledger Technology
- Что такое блокчейн-мост и какие риски он в себе таит?
- VPN Gate, или — неубиваемая Великим Китайским Файрволом распределённая сеть VPN
- Melissa E. O’Neill PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation
- Random Number Generation (Entropy)
- Understanding random number generators, and their limitations, in Linux
- How does a true random number generator collect entropy data?
- Nick Sullivan Ensuring Randomness with Linux’s Random Number Generator
- Igor Adamenko Basics of the Internet
- Igor Adamenko TCP & UDP, or the two pillars of the Internet
- Igor Adamenko Internet addressing
- Igor Adamenko Where are the borders in the Internet?
- Igor Adamenko How DNS servers and resolvers work
- Myths about /dev/urandom
- SSU2 — транспортный протокол I2P нового поколения на базе UDP
- VictoriaMetrics Key concepts
- Distributed SQL: An Alternative to Database Sharding
- Marc Brooker Exponential Backoff And Jitter
- What Happens After Finality in ETH2?
- Ognyan Chikov The Process of Creating Decentralized Apps (dApps)
- Dan Luu Files are hard
- Dan Luu Files are fraught with peril
- Dan Luu Reading postmortems
- Dan Luu Notes on concurrency bugs
- The Beacon Chain Ethereum 2.0 explainer you need to read first
- TigerBeetle - A Million Financial Transactions per Second in Zig
- Joran Dirk Greef TigerStyle! (Or How To Design Safer Systems in Less Time)
- Interledger Community Call - 25 November 2020
- Martin Thompson Evolution of Financial Exchange Architectures
- Как концептуально работает Tornado Cash, который «забанили» власти США
- Перехват трафика как вектор атаки на пользователей блокчейн-проектов
- Preethi Kasireddy How does Ethereum work, anyway?
- Preethi Kasireddy How does the NEW Ethereum work?
- Distributed validator technology
- Ethereum Contract ABI Specification. Взаимодействие с контрактом
- Anton Ilinchik, Vishal Varshney All you need to know about timeouts
- Konstantinos Tsiaras Study and Resource Analysis of Ethereum Execution Client Bootstrapping
- Proof-of-Work vs Proof-of-Stake
- How Ouroboros Samasika Upholds Mina’s Goals of Decentralization
- Brad Cohn, Evan Shapiro, and Emre Tekişalp Mina: Economics and Monetary Policy
- Jeff Hodges Notes on Distributed Systems for Young Bloods
- Blockchain and Digital Asset projects worldwide in 2023
- Polkadot documentation
- Polkadot light paper
- Substrate Documentation
- Dr. Gavin Wood POLKADOT: VISION FOR A HETEROGENEOUS MULTI-CHAIN FRAMEWORK
- Alvaro Videla Failure modes in distributed systems
- Vaidehi Joshi Modes of Failure (Part 1)
- Vaidehi Joshi Modes of Failure (Part 2)
- Leslie Lamport Time, Clocks, and the Ordering of Events in a Distributed System
- Wikipedia Vector clock
- Bryan Fink Why Vector Clocks are Easy
- Justin Sheehy Why Vector Clocks Are Hard
- Giuseppe DeCandia et al Dynamo: Amazon’s Highly Available Key-value Store
- Distributed Computing Manifesto
- Jonathan Ellis Why Cassandra Doesn’t Need Vector Clocks
- Carlos Baquero, Nuno Preguiça Why Logical Clocks are Easy
- Barbara Simons, Jennifer Lundelius Welch and Nancy Lynch An Overview of Clock Synchronization
- Shihui Song Survey on Scalable Failure Detectors
- Wikipedia Lamport timestamp
- Tushar Deepak Chandra and Sam Toueg Unreliable Failure Detectors for Reliable Distributed Systems
- A Foray Into Blockchain Scalability
- Ivan Velichko Computer Networking Introduction: Ethernet and IP (Heavily Illustrated)
- Ivan Velichko Bridge vs. Switch: What I Learned From a Data Center Tour
- Ivan Velichko Illustrated introduction to Linux iptables
- Thomas Graf Why is the kernel community replacing iptables with BPF?
- Jeremy Colvin What is Kube-Proxy and why move from iptables to eBPF?
- Manish Sampat When (And When Not) to Use eBPF
- Proof of Useful Work
- Nakamoto Coefficient: An Accurate Indicator for Blockchain Decentralization?
- Jan Camenisch Chain Key Cryptography: The Scientific Breakthrough Behind the Internet Computer
- The DFINITY Team The Internet Computer for Geeks
- Marek Majkowski Select is fundamentally broken
- Marek Majkowski Epoll is fundamentally broken 1/2
- Marek Majkowski Epoll is fundamentally broken 2/2
- Rob Habermeier Polkadot v1.0: Sharding and Economic Security
- SPARTAN Research Bitcoin Layers: Tapestry of a Trustless Financial Era
- Balaji S. Srinivasan Quantifying Decentralization
- Dimitris Karakostas, Aggelos Kiayias, and Christina Ovezik SoK: A Stratified Approach to Blockchain Decentralization
- Ivan Velichko How Container Networking Works: a Docker Bridge Network From Scratch
- Ivan Velichko Networking Lab: Ethernet Broadcast Domains
- Diego Ongaro and John Ousterhout In Search of an Understandable Consensus Algorithm
- Jim Zhang Consensus Algorithms: PoA, IBFT or Raft?
- Jeff Chase Distributed Systems, Failures, and Consensus
- Pierre Krieger Everything I know about networking
- A Brief Tour of FLP Impossibility
- Michael Fischer, Nancy Lynch, Michael Paterson Impossibility of Distributed Consensus with One Faulty Process
- Wikipedia Quorum (distributed computing)
- Consensus Protocols: Two-Phase Commit
- Isak Toivanen and Maximilian Vorbrodt Analyzing the Performance of Linux Networking Approaches for Packet Processing
- Sebastian Gallenmüller et al. Comparison of frameworks for high-performance packet IO
- AF_XDP
- Andrew Poelstra A Treatise on Altcoins
- A Deep Dive Into Blockchain Scalability
- DePIN: What are Decentralized Physical Infrastructure Networks?
- Vitalik Buterin DAOs, DACs, DAs and More: An Incomplete Terminology Guide
- Adam J. Kerpelman What is a DAO and What is it For?
- Reuben Bramanathan Securities Law: Social & Community Tokens
- Solidity
- Bridging and Finality: An Introduction
- Bridging and Finality: Ethereum
- https://www.practicalnetworking.net/series/nat/nat/
- https://www.practicalnetworking.net/series/arp/address-resolution-protocol/
- https://www.practicalnetworking.net/stand-alone/communication-through-multiple-switches/
- https://www.practicalnetworking.net/series/arp/gratuitous-arp/
- https://www.practicalnetworking.net/series/arp/arp-probe-arp-announcement/
- https://www.practicalnetworking.net/stand-alone/vlans/
- https://www.practicalnetworking.net/stand-alone/eigrp-metric/
- https://www.practicalnetworking.net/series/packet-traveling/packet-traveling/
- https://www.practicalnetworking.net/series/packet-traveling/packet-traveling-series-finale/
- https://www.practicalnetworking.net/series/packet-traveling/host-to-host-through-a-router/
- https://www.practicalnetworking.net/series/packet-traveling/host-to-host/
- https://www.practicalnetworking.net/series/packet-traveling/host-to-host-through-a-switch/
- https://www.practicalnetworking.net/stand-alone/ethernet-wiring/
- https://habr.com/ru/post/684524/
- https://habr.com/ru/post/675812/
- https://habr.com/ru/post/686230/
- http://book.itep.ru/4/6/bitcoin.htm
- http://book.itep.ru//4/6/blockchain.htm
- http://book.itep.ru/5/cicl_54.htm
- http://book.itep.ru/6/vlan_62.htm
- http://book.itep.ru/6/fwal_63.htm
- http://book.itep.ru/6/pgp_644.htm
- http://book.itep.ru/6/ipsec.htm
- http://book.itep.ru/10/sem_net.htm
- http://book.itep.ru/4/44/ip6_4411.htm
- http://book.itep.ru/4/44/tun_4412.htm
- http://book.itep.ru/4/44/nd.htm
- http://book.itep.ru/4/44/rtp_4492.htm
- http://book.itep.ru/4/44/rtc_4493.htm
- http://book.itep.ru/4/4/sip.htm
- http://book.itep.ru/4/44/rut_4411.htm
- https://blog.twitter.com/engineering/en_us/topics/insights/2021/simple-scalable-graph-neural-networks
- https://blog.twitter.com/engineering/en_us/topics/insights/2021/fusing-elasticsearch-with-neural-networks-to-identify-data
- https://blog.twitter.com/engineering/en_us/topics/infrastructure/2021/processing-billions-of-events-in-real-time-at-twitter-
- https://blog.twitter.com/engineering/en_us/topics/insights/2021/graph-neural-networks-through-the-lens-of-differential-geometry-
- https://blog.twitter.com/engineering/en_us/topics/insights/2022/graph-machine-learning-with-missing-node-features
- https://blog.twitter.com/engineering/en_us/topics/open-source/2020/hunting-a-linux-kernel-bug
- https://ethereum.org/en/developers/docs/evm/
- https://ethereum.github.io/yellowpaper/paper.pdf
- https://ethereum.org/en/developers/docs/
- https://docs.soliditylang.org/en/v0.6.0/introduction-to-smart-contracts.html#the-ethereum-virtual-machine
- https://habr.com/ru/company/yandex/blog/564510/
- https://habr.com/ru/company/dsec/blog/278335/
- https://habr.com/ru/post/168407/
- https://habr.com/ru/post/430172/