Preface
To my girls: wife, daughter, and mom - for their patience and love
Who should read this book?
To be honest, I don’t have vast practical experience of building distributed systems beyond the most popular one among web developers (hint: client-server). But a usual prerequisite for acquiring this practical experience is a non-basic knowledge of the topic. It is the classical “chicken and egg” problem: to work on a distributed software product you are usually required to have some related projects in your portfolio. Writing this book is my way to close this gap.
Anyway, I assume that readers have developed web projects and already have some experience with vertical and horizontal scaling, containerization, data races, and classical algorithms and data structures. I will try my best to introduce all the vocabulary used.
I also do not devote much time to single-machine patterns, as I believe you already have enough working practice with them.
Why yet another book on distributed systems?
I’ve read some excellent books devoted to distributed systems or related topics. Among them:
- Mikito Takada Distributed systems: for fun and profit
- Brendan Burns Designing Distributed Systems
- Martin Kleppmann Designing Data-Intensive Applications
- Donne Martin System design primer
All the books mentioned above have their own view on which topics should be covered and at what level. Martin Kleppmann’s outstanding book (highly recommended) obviously focuses on data-related topics.
Even though I had read the books above, I still had no clear mental picture and no systematic view of the Raft algorithm, Lamport clocks, and other distributed buzzwords. So this book tries to cover important topics (the choice is subjective, of course) and to serve as a practical guide for busy engineers striving to grasp core concepts in a timely manner.
True understanding comes with real doing, so practice through exercises is a necessary part of learning. I have chosen Rust1 as the language to use throughout the book as I believe that Rust incorporates modern best practices and solid theoretical achievements. Experienced developers will dive in without further ado. But for those who prefer a more gradual introduction, I can recommend excellent free resources to start with, such as Easy Rust and Rust by Example (see the references).
Why should we learn about distributed systems?
I admit that you already know the answer if you’ve started reading the book. But for the sake of completeness let’s list some points:
- modern scalable products inevitably rely on the concepts discussed further. So if you want to get working experience in modern tech companies, or even build your own product with high load and fault tolerance in mind, you should grasp the basics2;
- expecting your software to become distributed at some point in the future can radically improve your architecture decisions and even determine your preferred stack of technologies;
- the emerging Web 3.0 is built around the idea of decentralization, which is tightly connected with distribution. So at the very least you will be able to understand the underlying technologies and explore them deeper.
The goal of the book follows from the points above - to get the reader familiar with selected topics on distributed systems with a strong emphasis on practice.
Errata, typos, bugs and other feedback
I am not a native English speaker, so I will be glad to get any feedback on the correct usage of English grammar and phrases. Any suggestions, improvements, and fixes are also welcome. You can open merge requests at the book’s GitHub repository or email me at {author’s name}.{author’s last name}@gmail.com.
Acknowledgments
I would like to say many kind words to the people who influenced me as a software engineer during my career. First of all, my friend Farit Sadykov, who helped me a lot to become a developer. Also my colleagues, especially Pavel Antonov, Nikolay Yurov, Vladimir Galkin, Pavel Gorbunov, Andrey Belov, Evgeny Orlov, Ilya Zhigalko, Slava Cheremushkin, Kirill Zhuravlev, Ilya Scherbakov, Dariusz Antoniuk, Ruud van Asseldonk, Michael Livshin.
Also many thanks to open source enthusiasts creating wonderful products, books, and articles, generously sharing knowledge with the rest of the world. We truly stand on the shoulders of giants.
Maksim Ryndin
Montenegro, 2022.
This book is not affiliated with the Rust Foundation.
Being scalable doesn’t only mean adding more servers to the software and infrastructure. It’s also about your team being ready to grow with growing traffic. So being prepared to go distributed is a must for every team member in terms of some introductory learning experience, automated pipelines, merge reviews, etc. Please watch the very informative presentation by Lisa Guo on scaling Instagram.
Introduction
The big idea is “messaging” <..> The key in making great and growable systems is much more to design how its modules communicate rather than what their internal properties and behaviors should be. Alan Kay, creator of Smalltalk1
What is distributed?
Distributed system is any system in which a number of independent interconnected computers can cooperate in order to achieve a common goal. Oxford Dictionary of Computer Science
If we have two virtual machines (VMs) with a web server on the first VM and a database server on the second one, do we consider such a system distributed? If the VMs are on the same physical node, is it a distributed system? If the web server and the database are running in separate containers and these containers are on different physical nodes, then this system is distributed. But then a container orchestrator moves both containers onto the same node. Does the system of two connected containers lose its distributed flavor?
There are many questions like these, and for the purposes of this book we will follow the next definition of a distributed system.
A distributed system is software which has at least two components interacting with each other only over a network.
If not specified otherwise, we will call the components nodes, and a group of connected nodes is called a cluster.
Even if the network used for communication between the components is currently a loopback interface (“localhost”), it can become a real network in the next deployment.
Why do we need to make our software distributed?
At least, to make it highly available at the cost of an increased complexity and managing effort.
Overview of the book
TODO
In this case we have a complete graph, so every node has connections with the other n-1 nodes. Summing the degrees gives n*(n-1), but as this counts every edge twice, the total number of edges is n*(n-1)/2.
Network stack
Our definition of a distributed system mentions a network as a medium for communication. This medium is a very interesting and complex subject per se but here we only review some core concepts relevant to our study.
Physically, nodes (or hosts - think computers) in the network are connected by means of network interface controllers (NIC, also network adapter) - special hardware1 allowing a computer to connect to the network via a cable (Ethernet) or radio (WiFi)2. This physical layer is called the link layer in the TCP/IP stack3. Nodes at the link layer are identified by a Media Access Control (MAC) address - a unique hardware identifier of every network interface of the host4.
We can use the ifconfig utility (or the more modern ip addr and ip link show) to manage network interfaces. Let’s describe the most common fields of its output:
- <UP,BROADCAST,RUNNING,MULTICAST> means that the interface supports broadcasting and multicasting, is loaded (UP as opposed to DOWN), and is ready to accept connections (RUNNING). Other popular flags are LOOPBACK (it is a loopback interface) and PROMISC (the interface is in promiscuous mode, capturing all packets in the network).
- mtu 1500 - the Maximum Transmission Unit is 1500 bytes, so an IP packet greater than 1500 bytes will be fragmented.
- ether bc:d0:74:ec:76:17 means the MAC is bc:d0:74:ec:76:17.
- inet 192.168.43.174 netmask 0xffffff00 broadcast 192.168.43.255 means the IPv4 address is 192.168.43.174 with network mask 255.255.255.0, so we can have hosts 192.168.43.1-254; 192.168.43.255 is the broadcast address and 192.168.43.0 is the network itself.
- inet6 fe80::88f:5ab4:1ad8:eed5 prefixlen 64 secured scopeid 0xe means the IPv6 address is fe80::88f:5ab4:1ad8:eed5.
So we have a network of nodes (also called a LAN - local area network5). How do we interconnect networks (to get a WAN - wide area network) so that we can send a packet with data (called a datagram) between two hosts from different networks? Here a routing problem arises. Computers called routers6 relay datagrams from source to destination across network borders7. But how do we distinguish a host in one network from a host in another network? We need some addressing at the internet layer. IPv4 addressing assigns each host an IP address of the form xx.xx.xx.xx (4 octets of 8 bits each). But with 2^32 unique IP addresses we cannot address all hosts, so IPv6 (128 bits) was developed. IPv4 address exhaustion was partially mitigated by the so-called Network Address Translation (NAT) and private IPs8, which results in a violation of the end-to-end principle9. You can check the current adoption of IPv6 with Google data. The internet layer is independent of the network topology and of the physical way hosts are connected.
Within a LAN we need to establish a mapping between the IP addresses of hosts and their corresponding hardware addresses (MAC) to send a packet from one host to another. Such a mapping is called the ARP table (or ARP cache) and is filled on each host with Address Resolution Protocol (ARP) broadcast messages. In Linux/Unix the ARP table can be manipulated with the arp utility. For IPv6, the Neighbor Discovery Protocol handles that.
Routing is the mapping of IP addresses to interfaces and can be displayed with netstat -rn. You can trace a packet’s path with tracepath.
After we have identified a route between two hosts, we can use a transport protocol to, at the very least, specify ports on the hosts to connect the specific application processes involved. Because the internet layer is only responsible for routing and not for the reliability of communication, transport-layer protocols can also offer reliability mechanisms: congestion control, preserving data ordering, eliminating packet loss, and packet deduplication.
So we have a connection between hosts and we can exchange application data over the transport protocol.
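To make this concrete, here is a minimal sketch of mine (not from the book’s repository) of two application processes exchanging data over TCP on the loopback interface; the port 7878 is an arbitrary choice:

use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

fn main() -> std::io::Result<()> {
    // The "server" process listens on a port; the OS identifies it by the (IP, port) pair.
    let listener = TcpListener::bind("127.0.0.1:7878")?;
    let server = thread::spawn(move || {
        let (mut socket, peer) = listener.accept().unwrap();
        let mut buf = [0u8; 64];
        let n = socket.read(&mut buf).unwrap();
        println!("server got {:?} from {}", &buf[..n], peer);
    });
    // The "client" process connects to the same (IP, port) pair and sends data.
    let mut client = TcpStream::connect("127.0.0.1:7878")?;
    client.write_all(b"hello over TCP")?;
    server.join().unwrap();
    Ok(())
}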
A network interface can also be virtual to allow userspace applications to intercept raw packets (see the Universal TUN/TAP device driver). The famous loopback interface (127.0.0.1 for IPv4 and ::1 for IPv6) is also virtual.
We consider only packet networks, where the same communication channel can be shared between several communication sessions by means of packetizing data. In contrast, in circuit networks (for example, the public switched telephone network, PSTN) the channel capacity is completely used by a single communication session between two nodes.
Note that the common reference to the TCP/IP stack (also called the Internet protocol suite) includes not only the Internet Protocol and the Transmission Control Protocol (TCP) but also other protocols such as the User Datagram Protocol (UDP) and QUIC at the transport layer and Internet Protocol Security (IPSec) at the internet layer. The TCP/IP stack as a practical network stack predates the theoretical OSI model for general networks. Since the adoption of TCP/IP in ARPANET in 1983, several proprietary implementations of the stack for different operating systems emerged. But the popularity of TCP/IP increased when the University of California at Berkeley open sourced its implementation for BSD Unix, which became the famous BSD sockets (see Wikipedia, Internet protocol suite). Among the alternatives to TCP/IP were IPX/SPX and AppleTalk.
Many network interface controllers allow overwriting their MAC. In this case the MAC should be unique within the network where the NIC lives. Moreover, software-defined MACs are used in VMs managed by hypervisors such as QEMU, and MAC randomization is gaining popularity to avoid mobile device tracking (see Wikipedia, MAC address).
The main criterion here is that the nodes are physically located nearby. See also Computer Networking Introduction: Ethernet and IP (Heavily Illustrated) by Ivan Velichko.
Do not confuse routers with network bridges. Routers allow separate networks to communicate, while bridges join networks, making them a single network. While routers interconnect networks at the internet layer, a gateway is a more general term - gateways can interconnect the PSTN (public switched telephone network) and a VoIP network acting as a VoIP gateway, or even the internet and satellites orbiting Earth acting as an Internet-to-orbit gateway. A default gateway is the destination of IP packets whose destination IP doesn’t belong to the local network.
When a packet crosses a border between networks, it is called a hop. The time-to-live (TTL) in IPv4 (and hop limit in IPv6) prevents infinite cycling of an IP packet between networks, as it is decreased by 1 at every router the packet visits.
Private IPs are defined to belong to the following subnets: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16. The number after the slash is the network prefix (CIDR notation) and denotes the number of leading bits common to all hosts in the network. So, for example, 192.168.0.0/16 means that the first 16 bits (192.168 in decimal) are the same for all hosts, and the remaining 16 bits of the total 32 define the host (2^16 possible addresses in the network). To be precise, the total number of usable addresses should be decreased by two, as the all-zeroes host part denotes the network itself while the all-ones host part is the broadcast address. So when you encounter the IP 192.168.12.134, you already know that it is not reachable publicly (from the Internet); it is some internal private host.
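As an illustration of the CIDR arithmetic above, here is a small sketch of mine (not from the netstack repository) that checks whether an IPv4 address belongs to a given prefix by comparing the leading bits:

use std::net::Ipv4Addr;

// Returns true if `addr` lies inside `network`/`prefix_len` (CIDR notation).
fn in_subnet(addr: Ipv4Addr, network: Ipv4Addr, prefix_len: u32) -> bool {
    // Build a mask with `prefix_len` leading ones, e.g. /16 -> 0xffff0000.
    let mask: u32 = if prefix_len == 0 { 0 } else { u32::MAX << (32 - prefix_len) };
    (u32::from(addr) & mask) == (u32::from(network) & mask)
}

fn main() {
    // 192.168.12.134 is inside 192.168.0.0/16, hence a private address.
    assert!(in_subnet(Ipv4Addr::new(192, 168, 12, 134), Ipv4Addr::new(192, 168, 0, 0), 16));
    assert!(!in_subnet(Ipv4Addr::new(8, 8, 8, 8), Ipv4Addr::new(10, 0, 0, 0), 8));
    println!("CIDR checks passed");
}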
For IPv6, private IP addresses are not desirable because its main goal is to provide globally unique addressing. So for LAN use, a unique local unicast address starts with the prefix fd (8 bits), then 40 bits of a global id (chosen randomly by the operator of the network), then 16 bits for a subnet, and the remaining 64 bits define the host. So with IPv6, local private IPs are essentially globally unique if the 40 bits of the global id are indeed random, and this global uniqueness of local addresses allows merging two networks without reassigning addresses or using NATs. In IPv6 there is no broadcast - there is only multicast, where packets are received only by hosts which declared an interest in specific packets.
The end-to-end principle means that the end hosts participating in the communication (and not other parties) are themselves responsible for communication reliability and security. The lower routing layer is connectionless, i.e. each packet goes its own route rather than a prearranged one. Interesting fact: the principle was pioneered by CYCLADES - an INRIA network started at the same time as the well-known ARPANET.
Practice: virtual interface
As we already discussed, at the link layer of TCP/IP we have network interfaces allowing us to interact with network devices. But these network devices can be completely virtual, allowing a userspace program to interact with the network1 at the link layer (TAP device) or the internet layer (TUN device), which is used in different tunneling programs like VPNs. You can create such interfaces with console commands (like ip tuntap), but let’s try to create a TAP device programmatically. There are C examples, but let’s do it in Rust from scratch2.
We create a Rust workspace (a collection of several packages) to allow some experiments with the network in further chapters (you can find all the code in the netstack repository).
$ mkdir netstack
$ cd netstack
In the netstack directory we create Cargo.toml with the contents:
[workspace]
members = [
"virtual-interface",
"bindings",
]
After that we create the libraries by running cargo new --lib virtual-interface and cargo new --lib bindings in the netstack directory. To create a virtual network device, we need to open the file /dev/net/tun for read-write access and configure it with ioctl - a Swiss Army knife of device management in Unix/Linux. Essentially, ioctl as a universal system call (please see man ioctl) was created to control different devices and to be extensible enough to manage new types of devices. To achieve this goal, ioctl accepts an open file descriptor (the device), a request code - an unsigned long number which is unique across devices - and it can accept a pointer to some data structure related to the specific device. In our case of a network device (a TAP device) such a control structure is ifreq (interface request, see man 7 netdevice) and the request code is TUNSETIFF.
Rust provides the libc crate as raw FFI bindings for system libraries, so we could use ioctl and some constants, but it is missing the constants for TUN and TAP devices. So, as we need to create some bindings ourselves anyway, let’s use the bindgen crate to generate bindings for system libraries at build time. Following the bindgen tutorial we create a netstack/bindings/wrapper.h file where we place the necessary headers:
#include <sys/ioctl.h>
#include <linux/if.h>
#include <linux/if_tun.h>
and create bindings/build.rs with the contents
extern crate bindgen;
use std::env;
use std::path::PathBuf;
fn main() {
// Tell cargo to invalidate the built crate whenever the wrapper changes
println!("cargo:rerun-if-changed=wrapper.h");
// The bindgen::Builder is the main entry point
// to bindgen, and lets you build up options for
// the resulting bindings.
let bindings = bindgen::Builder::default()
// The input header we would like to generate
// bindings for.
.header("wrapper.h")
// Tell cargo to invalidate the built crate whenever any of the
// included header files changed.
.parse_callbacks(Box::new(bindgen::CargoCallbacks))
// Finish the builder and generate the bindings.
.generate()
// Unwrap the Result and panic on failure.
.expect("Unable to generate bindings");
// Write the bindings to the $OUT_DIR/bindings.rs file.
let out_path = PathBuf::from(env::var("OUT_DIR").unwrap());
bindings
.write_to_file(out_path.join("bindings.rs"))
.expect("Couldn't write bindings!");
}
Run cargo build, and at target/debug/build/bindings-.../out/bindings.rs we can find the needed bindings - except TUNSETIFF, the request code required for ioctl (bindgen does not generate bindings for function-like macros). If we check the contents of /usr/include/linux/if_tun.h, we see that TUNSETIFF is a function-like macro: #define TUNSETIFF _IOW('T', 202, int). So let’s port the C macro code from /usr/include/asm-generic/ioctl.h, where the participating macros live.
In netstack/bindings/src/lib.rs
include!(concat!(env!("OUT_DIR"), "/bindings.rs"));
use std::mem;
use std::os::raw;
// /usr/include/asm-generic/ioctl.h
const IOC_SIZEBITS: u8 = 14;
const IOC_NRBITS: u8 = 8;
const IOC_TYPEBITS: u8 = 8;
const IOC_NRSHIFT: u8 = 0;
const IOC_TYPESHIFT: u8 = IOC_NRSHIFT + IOC_NRBITS;
const IOC_SIZESHIFT: u8 = IOC_TYPESHIFT + IOC_TYPEBITS;
const IOC_DIRSHIFT: u8 = IOC_SIZESHIFT + IOC_SIZEBITS;
const IOC_WRITE: u32 = 1;
//#define TUNSETIFF _IOW('T', 202, int)
const INT_SIZE: usize = mem::size_of::<raw::c_int>();
pub const TUNSETIFF: u32 = IOC_WRITE << IOC_DIRSHIFT
| (b'T' as u32) << IOC_TYPESHIFT
| 202 << IOC_NRSHIFT
| (INT_SIZE as u32) << IOC_SIZESHIFT;
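As a quick sanity check of my own (not part of the original bindings crate), we can assert that the computed constant matches 0x400454ca - the TUNSETIFF value commonly seen on Linux architectures that use the generic asm-generic/ioctl.h encoding:

#[cfg(test)]
mod tests {
    use super::TUNSETIFF;

    #[test]
    fn tunsetiff_matches_known_value() {
        // Assumption: the target architecture uses the generic ioctl encoding.
        assert_eq!(TUNSETIFF, 0x400454ca);
    }
}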
Let’s add the file netstack/virtual-interface/src/tap.rs where the main things happen:
use std::ffi;
use std::fs::{File, OpenOptions};
use std::io;
use std::mem;
use std::os::unix::io::{FromRawFd, IntoRawFd};
use bindings;
const VIRTUAL_DEVICE: &str = "/dev/net/tun";
#[derive(Debug)]
pub enum VirtualInterfaceError {
IoError(io::Error),
IoctlError,
DeviceNameTooLong,
DeviceNameContainsNulByte(ffi::NulError),
}
impl From<io::Error> for VirtualInterfaceError {
fn from(error: io::Error) -> Self {
VirtualInterfaceError::IoError(error)
}
}
impl From<ffi::NulError> for VirtualInterfaceError {
fn from(error: ffi::NulError) -> Self {
VirtualInterfaceError::DeviceNameContainsNulByte(error)
}
}
pub struct VirtualInterface {
device: File,
}
impl VirtualInterface {
pub fn create(name: &str) -> Result<Self, VirtualInterfaceError> {
// reserve 1 byte for '\0'
if name.len() >= bindings::IFNAMSIZ as usize {
return Err(VirtualInterfaceError::DeviceNameTooLong);
}
// We have to check that the device name has no zero bytes in the middle
let device_name = ffi::CString::new(name)?.into_bytes_with_nul();
let device = OpenOptions::new()
.read(true)
.write(true)
.open(VIRTUAL_DEVICE)?;
// ifreq is a structure to control network device (see man 7 netdevice)
let mut ifr: bindings::ifreq = unsafe { mem::zeroed() };
// create stack allocated array to hold the device name
let mut name_buffer = [0_u8; bindings::IFNAMSIZ as usize];
// and copy name bytes to it
for (i, b) in device_name.into_iter().enumerate() {
name_buffer[i] = b;
}
ifr.ifr_ifrn.ifrn_name = name_buffer;
// IFF_TAP - tap device
// IFF_NO_PI - no additional info for Ethernet package
// IFF_TUN_EXCL - prevent creation of duplicates
ifr.ifr_ifru.ifru_flags = (bindings::IFF_TAP | bindings::IFF_NO_PI | bindings::IFF_TUN_EXCL)
as std::os::raw::c_short;
let raw_fd = device.into_raw_fd();
if unsafe { bindings::ioctl(raw_fd, bindings::TUNSETIFF as u64, &mut ifr as *mut _) } == -1
{
return Err(VirtualInterfaceError::IoctlError);
}
let device = unsafe { File::from_raw_fd(raw_fd) };
Ok(Self { device })
}
pub fn device(&mut self) -> &mut File {
&mut self.device
}
}
and import it inside netstack/virtual-interface/src/lib.rs:
pub mod tap;
To test the code, let’s start with an example, netstack/virtual-interface/examples/read_loop.rs:
use std::io::Read;
use virtual_interface::tap::VirtualInterface;
fn main() {
let mut interface = VirtualInterface::create("dev0").unwrap();
let mut buffer = [0; 4096];
println!("starting read loop for device `dev0`");
loop {
let n = interface.device().read(&mut buffer[..]).unwrap();
println!("The bytes: {:?}", &buffer[..n]);
}
}
and build and run it:
$ cargo build --example read_loop
$ ./target/debug/examples/read_loop
Oh no, some ioctl error: thread 'main' panicked at 'called Result::unwrap() on an Err value: IoctlError'. Let’s debug with strace.
$ strace ./target/debug/examples/read_loop
...
ioctl(3, TUNSETIFF, 0xffffce5f30e0) = -1 EPERM (Operation not permitted)
...
So it complains about permissions. Indeed, checking man 7 netdevice again:
If an ioctl is marked as privileged, then using it requires an effective user ID of 0 or the CAP_NET_ADMIN capability. If this is not the case, EPERM will be returned.
So let’s try with sudo ./target/debug/examples/read_loop and check whether the interface is created in another terminal:
$ ip link show dev0
dev0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether b2:24:f3:9a:71:c1 brd ff:ff:ff:ff:ff:ff
It works!3 But how does our binary know how to call libc functions and use libc structs? Rust dynamically links some system libraries, and libc is among them:
$ ldd ./target/debug/examples/read_loop
linux-vdso.so.1 (0x0000ffffb2199000)
libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000ffffb20c0000)
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffffb1f10000)
/lib/ld-linux-aarch64.so.1 (0x0000ffffb2160000)
So our netstack/bindings crate works as expected.
Now you may think that giving sudo access to the program is probably too much power, and you’re right. Linux comes to the rescue with its capabilities4 system, allowing us to grant fine-grained pieces of superuser power. So we only need to grant the CAP_NET_ADMIN capability to our program. From man capabilities:
CAP_NET_ADMIN
Perform various network-related operations:
* interface configuration;
* administration of IP firewall, masquerading, and accounting;
* modify routing tables;
* bind to any address for transparent proxying;
* set type-of-service (TOS);
* clear driver statistics;
* set promiscuous mode;
* enabling multicasting;
Note: if the filesystem with your compiled binary is mounted with the nosuid flag (for example, your home directory is encrypted - you can check the flag for the home directory with the command mount | grep $USER), then you should copy the binary to a directory whose filesystem does not have the nosuid flag.
$ sudo setcap cap_net_admin=eip target/debug/examples/read_loop
Now let’s run our example binary. In another terminal run ip link show dev0 and check that the interface is down; let’s bring it up with sudo ip link set dev0 up. Now we can see some bytes in the console, and our next task is to decode them.
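Since the interface was created with IFF_TAP | IFF_NO_PI, every read returns a raw Ethernet frame. As a small taste of the decoding work ahead, here is a sketch of mine (not yet part of the netstack repository) that splits off the 14-byte Ethernet header:

// A parsed Ethernet II header: destination MAC, source MAC, EtherType.
fn parse_ethernet(frame: &[u8]) -> Option<([u8; 6], [u8; 6], u16)> {
    if frame.len() < 14 {
        return None; // too short to contain an Ethernet header
    }
    let mut dst = [0u8; 6];
    let mut src = [0u8; 6];
    dst.copy_from_slice(&frame[0..6]);
    src.copy_from_slice(&frame[6..12]);
    // EtherType is big-endian: 0x0806 = ARP, 0x0800 = IPv4, 0x86DD = IPv6.
    let ethertype = u16::from_be_bytes([frame[12], frame[13]]);
    Some((dst, src, ethertype))
}

In the read loop we could call parse_ethernet(&buffer[..n]) and print the EtherType to see, for example, ARP or IPv6 neighbor discovery traffic arriving on the freshly created dev0.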
Actually, offloading the network stack to userspace in traditional heavy-kernel OSes (as opposed to so-called microkernel OSes) can bring a performance boost - see for example OpenOnload.
Much of the practice in this chapter is inspired by the excellent tutorial Let’s code a TCP/IP stack, 1: Ethernet & ARP by Sami Niiranen.
A comprehensive overview of Linux capabilities and different related exploits can be found in Linux Capabilities by Carlos Polop.
We could create a TAP device with the command sudo ip tuntap add dev0 mode tap. The only difference from our programmatically created device is that ours is not persistent - it lives only while our process lives. To make our device persistent we could just add the flag IFF_PERSIST to ifru_flags.
Unreliable network
A network of interconnected computers is based on some physical layer (Ethernet cables, or radio in the case of wireless networks)1. The physical layer itself is subject to different disruptions, noise, and other problems. So we always have to assume that the network is unreliable and implement retry, ordering, and healthcheck policies in our services.
Moreover, network capacity is limited, and we should estimate ahead of time what amount of data the system under development is going to send through the network.
Network throughput is an amount of data that can be transferred through a network within a certain time period. Oxford Dictionary of the Internet (4 ed.). Usually an average throughput is used: the number of transferred bits divided by the number of elapsed seconds.
We can measure network throughput in bits per second (bit/s or bps) or even in data packets per unit of time.
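As a toy illustration (my own sketch, not the iperf-style tool we will build later), we can estimate an average throughput over the loopback interface by timing how long it takes to push a known number of bytes through a TCP connection; the port and buffer sizes are arbitrary:

use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:9000")?;
    // The receiver drains the socket and counts bytes until the sender closes it.
    let receiver = thread::spawn(move || {
        let (mut socket, _) = listener.accept().unwrap();
        let mut buf = vec![0u8; 64 * 1024];
        let mut total = 0usize;
        loop {
            let n = socket.read(&mut buf).unwrap();
            if n == 0 { break; }
            total += n;
        }
        total
    });
    // The sender pushes 100 MiB and measures the elapsed time.
    let mut sender = TcpStream::connect("127.0.0.1:9000")?;
    let chunk = vec![0u8; 64 * 1024];
    let start = Instant::now();
    for _ in 0..(100 * 1024 * 1024 / chunk.len()) {
        sender.write_all(&chunk)?;
    }
    drop(sender); // close the connection so the receiver sees EOF
    let received = receiver.join().unwrap();
    let secs = start.elapsed().as_secs_f64();
    println!("{} bytes in {:.2} s = {:.2} Mbit/s", received, secs, received as f64 * 8.0 / secs / 1e6);
    Ok(())
}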
Stop&Think: How do you measure a network throughput? How do you estimate a maximum throughput (also called bandwidth or channel capacity) of a network?2
Congestion control algorithms:
- https://cloud.google.com/blog/products/networking/tcp-bbr-congestion-control-comes-to-gcp-your-internet-just-got-faster
- https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ipv4/tcp_bbr.c
- https://cacm.acm.org/magazines/2017/2/212428-bbr-congestion-based-congestion-control/fulltext
- https://dl.acm.org/doi/10.1145/52324.52356
- https://github.com/google/bbr
Typical interservice communication within a corporate network often leads one to think that communication is almost instantaneous. But in the case of a worldwide internet connection spanning several networks, the time to deliver data increases significantly.
Network latency is the time taken for a given signal or data packet to propagate through a circuit or network from its source point to its destination Oxford Dictionary of Electronics and Electrical Engineering (5 ed.)
Network latency is usually measured from source to destination and back, constituting the so-called round-trip time (RTT), which is, roughly speaking, double the latency.3
Caching, keep-alive connections, and geographical proximity to the users are among the usual options to decrease latency4.
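A crude way to get a feel for latency (a sketch of mine; real tools like ping use ICMP, as noted below) is to time TCP connection establishment, which costs roughly one round trip:

use std::net::TcpStream;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    // example.com:80 is only an illustration; any reachable TCP endpoint works.
    let start = Instant::now();
    let _stream = TcpStream::connect("example.com:80")?;
    // The TCP three-way handshake takes roughly one RTT (plus DNS resolution here).
    println!("connect took {:?} (approximately one RTT)", start.elapsed());
    Ok(())
}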
Network partition happens if some nodes of a cluster in a distributed system cannot communicate due to a network failure but are supposed to.
Depending on the network usage profile and the targets for throughput, latency, packet loss, and packet rate, the standard Linux networking stack can be partially or completely replaced with:
- a userspace networking stack (no kernel at all) like DPDK5 and other packet processing frameworks6;
- kernel-executed user-defined network modules (eBPF);
- raw sockets.
So for a distributed system it is absolutely necessary to handle different network failures (which are quite common):
- meet bandwidth requirements and handle network congestion when scaling;
- deal with increased latencies;
- have policies for network partitions: if a network partition creates two or more subclusters and such subclusters can behave independently (a situation known as a split-brain7), how should the system evolve and finally merge conflicting states?
- implement backpressure.
Most of us have heard about the OSI model as a theoretical framework to discuss the networking stack. Just for reference:
- L1 - Physical layer (bits)
- L2 - Data link (frames)
- L3 - Network (packets)
- L4 - Transport (datagrams)
- L5 - Session (data)
- L6 - Presentation (data)
- L7 - Application (data)
Also note that the maximum throughput is not actually an upper limit for application-specific data throughput, as there is always the overhead of latency, protocols, encryption, compression, packet retransmission, etc. The useful data throughput is sometimes called goodput (see also Wikipedia, Goodput).
An actual latency measurement is complicated by the presence of several nodes on the packet’s way, queuing on gateways (several packets from different sources to the same destination are put in a waiting list to send), and processing delays. Tools like the famous ping can use special control protocols (such as ICMP) which differ from the protocols that you actually use for data (such as TCP), so measurements are biased.
Analyzing your application stack and CPU/IO profile can also help to choose an appropriate operating system, if that is possible - see Benchmarks: FreeBSD 13 vs. NetBSD 9.2 vs. OpenBSD 7 vs. DragonFlyBSD 6 vs. Linux.
see also SPDK for a storage
see also https://github.com/ntop/PF_RING/tree/dev, https://github.com/luigirizzo/netmap
Create your own tool to measure network throughput
Here we will create our own Rust version of iperf3 (https://github.com/esnet/iperf) to better understand the concepts around the network.
Use Wireshark to analyze packets.
Write our own DPI (nDPI, GoodbyeDPI).
Availability
Availability in common practice is understood as a metric rather than a property of the system, while in the theory of distributed systems (especially the CAP theorem1) it is more of a property, which leads to confusion.
We will follow practical considerations, and by availability we will understand the following concepts.
A service level indicator (SLI) is a carefully defined quantitative measure of some aspect of the level of service that is being provided. It is a metric, not a target.2
For example, it can be a ratio of successfully handled requests.
A service level objective (SLO) specifies a target level for the reliability of your service. The SLO is a target value for an SLI.2
A service level agreement (SLA) declares your commitment to satisfy the SLO. The SLO also determines your error budget, i.e. the duration of the allowed downtime within different time periods.
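For example (the numbers are mine, for illustration), a 99.9% monthly availability SLO leaves an error budget of about 43 minutes of downtime per 30-day month, while 99.99% leaves only about 4 minutes:

// Allowed downtime (in minutes) for a given SLO over a period of `days`.
fn error_budget_minutes(slo: f64, days: f64) -> f64 {
    let total_minutes = days * 24.0 * 60.0;
    (1.0 - slo) * total_minutes
}

fn main() {
    println!("99.9%  -> {:.1} min", error_budget_minutes(0.999, 30.0));  // ~43.2
    println!("99.99% -> {:.1} min", error_budget_minutes(0.9999, 30.0)); // ~4.3
}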
Going distributed is just one of the ways to strengthen availability by adding scalability and redundancy.
While the network is a primary source of failures for distributed systems, we should also take into consideration other resources related to a single node:
- Storage (disks get corrupted, drivers contain bugs, various filesystems have different ordering models and are mounted with different policies3)
- CPU (processes are stopped while a VM is migrated, system clocks may drift, CPU usage patterns should align with branch prediction and cache sizes)
- Memory (it can be corrupted or fragmented; what are the memory usage patterns?)
So we should define fault models for the different resources in our distributed system4, i.e. a specification of how our application handles different failures.
Clever design (including API design), code quality assurance (including tests of configuration changes5), proper algorithms, performance tuning, application monitoring, best DevSecOps practices, and collection of business and tech metrics significantly improve availability and should be considered first, before introducing the complexity of distribution.
Martin Kleppmann A Critique of the CAP Theorem
Google Cloud Architecture Framework Reliability principles
An eye-opening read for the author was Files are fraught with peril by Dan Luu - how difficult and error-prone it is to use the file API and how weak many filesystems are in terms of error handling. The general advice is to rely on already robust and battle-tested products like SQLite (which handles different edge cases) if you need some local persistent storage.
See, for example, the Design Document of TigerBeetle.
see Dan Luu’s Reading postmortems
Distributed vs Decentralized
Decentralized network is a dispersed network having several or multiple large hubs <..> and no dominant central hub. Oxford Dictionary of Media and Communication
Consider master-slave replication of a database. We have some slave databases serving only read requests and a master database server executing read and write queries and replicating all changes to the slaves. Clearly this system is distributed due to the replication link. If we have no mechanism for promoting a slave to master in case of the master’s failure, then our system is not decentralized.
So decentralization increases fault tolerance by removing a single point of failure. But there is also a catch. Consider two network topologies: mesh (in the case of full decentralization) and star (with a central master).
So in the case of n nodes in a mesh network topology we have n*(n-1)/2 connections[^graph], while in the case of the star network we have only n-1 connections. So heavily meshed distributed systems produce increasing network contention.
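A tiny sketch of mine comparing the number of connections for the two topologies:

// Number of links in a full mesh of n nodes: every pair of nodes is connected.
fn mesh_links(n: u64) -> u64 {
    n * (n - 1) / 2
}

// Number of links in a star of n nodes: every node connects only to the hub.
fn star_links(n: u64) -> u64 {
    n - 1
}

fn main() {
    for n in [5u64, 50, 500] {
        println!("n = {:3}: mesh = {:6}, star = {:3}", n, mesh_links(n), star_links(n));
    }
}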
Taste of distributed
Exercises
- Due to growing traffic we decided to shard user data by the first letter of the last name. Can we consider such a sharded database a distributed system? Explain your answer.
- Model packet loss in the network.
- Implement an interface for a TUN device.
Established systems
Successful software is composed not built. Roger Johansson, creator of ProtoActor
Implementing distributed systems the right way is hard. In most cases we can compose our software from well-established distributed systems. Of course, the goal of the book is to learn how to implement such systems from scratch, but we should be familiar with the other options. Some are listed below in order of decreasing technology lock-in.
- Erlang/OTP
- ETCD
- Tendermint
- Kubernetes
Battle-tested Erlang/OTP
Any sufficiently complicated concurrent program in another language contains an ad hoc informally-specified bug-ridden slow implementation of half of Erlang. Robert Virding, one of the creators of Erlang
Erlang remains a unique fusion of practical engineering and clever architectural vision. Created in the 1980s, it still has great potential to evolve in the modern web[^klarna]1.
We won’t dive into the language per se and the pragmatic ideas of “let it fail”, supervision trees, and fault tolerance. You can read the very concise thesis, full of practical wisdom (highly recommended), of Joe Armstrong, one of the Erlang creators.
Let’s focus more on Erlang’s distribution capabilities. Erlang is compiled to run in a virtual machine called BEAM2. Elixir is also a BEAM-runnable language with interoperability with Erlang.
Erlang lightweight processes, while executing application logic, communicate with each other using signals. Most used signal type is a message3. Processes are scheduled for execution by BEAM. You can run millions of processes4. OTP is a collection of reusable components (abstractions for concurrent and distributed patterns such as client-server). So Erlang/OTP is a framework allowing you to quickly create complex concurrent and distributed systems.
BEAM (written in C) can be extended with dynamically loaded compiled modules with Native Implemented Functions (NIFs). So you can write this shared module in a language you prefer (supporting C interoperability, of course)5.
Erlang is primarily well suited for massive IO-bound loads, so for CPU-bound tasks you should use, for example, the above-mentioned NIFs6.
To form a cluster, every node (Erlang VM) should share the same cookie file with a secret. It is rather basic “security” whose aim is to differentiate two or more clusters.
Every Erlang VM is started with a predefined name, and then it’s enough to register the remaining n-1 nodes on the first node. All nodes sharing the same cookie will propagate connections to each other, creating a fully connected mesh network. This default transitive connection propagation can be configured - a so-called hidden node without transitive connections can be used to create clusters with fewer connections7.
By default, communication between nodes of a cluster is done over TCP. Separate daemon process (called EPMD - Erlang Port Mapper Daemon) starts (if not already running) on every host machine with Erlang VMs. EPMD uses DNS to resolve node names to ip addresses and maps node names to ip+port (tcp socket). EPMD serves as a name resolver for nodes on the host machine.
You can also implement your own procedures for nodes discovery and discovery under container orchestrator.
The default Erlang distribution with cookies assumes a trusted network, so you should change the default communication mechanism in the case of an untrusted network8. Moreover, a large number of nodes in a fully connected mesh communicating over a large and uncontrolled network can be prohibitively costly. This break point may range from 40 to 140 nodes9, depending on the load and the amount of global state required to sync over the cluster (such as the process namespace, or getting the now time, which requires a global lock to provide monotonically increasing time over the cluster). In such cases federated10 clusters and partitioning of the global state into separate groups of nodes inside a cluster is the way to go11.
Erlang is actively modernized and continuously developed. So it’s a solid foundation for a distributed system.
Erlang lessons to go distributed:
- separation of concerns and modularity - you can configure your communication transport, the algorithm of node discovery, and the network topology;
- a distributed system must be observable (Erlang has excellent tracing and monitoring tools allowing you to observe even specific Erlang processes);
- communication is asynchronous, so no node has to wait for an acknowledgement that its message was received by another one;
- message passing is location transparent (the code to send a message to a local Erlang process is the same as for sending to a process on another node in the cluster - at the cost of more RAM and time to memcpy, as every message is deeply copied);
- maintaining global mutable data (the namespace of lightweight processes in the case of Erlang) and full connectivity severely limits scalability;
- fail fast - correctness is more important than availability (of course, it depends on the startup time). Erlang philosophy denies defensive programming (when you try to recover from all errors). So assertions in production code are beneficial - they declare the negative flow and don’t allow the distributed system (which is designed to be fault tolerant) to behave incorrectly12.
For example, the Swedish fintech Klarna uses Erlang as its core platform handling 2 million transactions per day - look at their job descriptions.
At one of my workplaces I almost convinced my CTO to write an IO-bound service for proxying media streams and unary requests in Erlang. For two weeks I read everything on Erlang and finally presented a workable version which was deployed to production. It worked in production for about 3 weeks until the CTO had a look at the code. He was rather afraid of the completely non-imperative style of the code and of the inability to scale his Erlang team (which consisted of only me). So he gently asked me to rewrite the app in Go. For those of you who are brave: Adopting Erlang.
Alternative execution environments (including not only VMs) emerge periodically, but I haven’t heard of any production-ready one. BEAM is really complicated to reinvent. See also Lumen.
There are a lot of holy wars around considering Erlang an actor language. See the Stack Exchange question and Robert Virding’s email.
See Rustler
See also epmdless, Using TLS for Erlang Distribution
Federation is a technique to scale software by grouping its parts by feature. For example, you can federate a database into separate servers: one for sessions, a second for users, a third for sales, etc. The increased throughput comes at the cost of joining data and holding transactions on the application side.
See Scaling Reliably: Improving the Scalability of the Erlang Distributed Actor Platform, Scaling Erlang Cluster to 10,000 Nodes, Stackoverflow question
See also Joran Dirk Greef TigerStyle! (Or How To Design Safer Systems in Less Time)
Several nodes in a distributed system should agree on a value. More formally, a set of replicated state machines computes the same state by applying commands from a replicated log. And the consensus goal is to synchronize the log for every node.
Distributed systems can be classified by different criteria which have both theoretical (as constraints for models) and practical (as foundations for design decisions) importance. Let’s list the most used ones:
- By a communication time bound.
A synchronous system assumes that there is some bound (known in advance) on the delivery time of messages between nodes.
An asynchronous system assumes that there is no such bound (it is impossible to say whether a node has failed or is just processing slowly). An asynchronous system seems to be more suitable for nodes interacting via networks, but there is no known consensus algorithm for such a model (see below).
- By a fault mode.
Non-faulty nodes do not crash or misbehave at all.
Fail-stop nodes stop when they fail/crash and do not restart.
Fail-recover nodes recover after crashes and behave correctly after the restart.
Byzantine-faulty nodes may crash intentionally and, moreover, misbehave to hamper non-Byzantine nodes from achieving consensus.
- By reliability of communication (transport).
Systems with a reliable transport, where sent messages are finally delivered (e.g. stream-oriented protocols like TCP, HTTP/2, QUIC).
Systems with an unreliable transport, where some sent messages can be dropped while in transit (e.g. UDP).
All types under each criterion are listed from the most restrictive to the most general. As a consequence, if something holds for a more restrictive subtype, then it also holds for a more general one. Why is this reasoning important? Let’s illustrate with some impossibility results.
Some impossibility results (when consensus is impossible to achieve):
- FLP: in an asynchronous distributed system with reliable communication and fail-stop nodes, consensus is impossible. As a consequence, the statement also holds for asynchronous systems with fail-recover or Byzantine-faulty nodes and a reliable communication (or an unreliable one).
- In the presence of an unreliable transport, consensus is impossible even for a synchronous system (TODO link?)
- Consensus is impossible even in a synchronous distributed system when Byzantine faults are possible and one third or more of the processes fail (N <= 3F): for example, N = 3 nodes cannot tolerate even a single Byzantine node, while N = 4 tolerates F = 1 (Lamport, Shostak, Pease).
- Atomic broadcast and its duality to consensus.
How to deal with such impossibility results?
- in theory - additional constraints are introduced: the network will be periodically synchronous for short intervals of time, and in such intervals of synchrony all undelivered messages will be delivered under some fixed time bound - see [DLS88] C. Dwork, N. A. Lynch, and L. J. Stockmeyer, Consensus in the presence of partial synchrony;
- in practice - by using failure detectors. The simplest ones are periodic healthchecks (when a communicating peer probes the node) and heartbeats (when the node itself emits them); a sketch follows below.
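Here is a minimal sketch of mine (with arbitrary timeouts) of a heartbeat-based failure detector: a node records the last time it heard from each peer and suspects a peer once the silence exceeds a timeout:

use std::collections::HashMap;
use std::time::{Duration, Instant};

// A naive heartbeat failure detector: peers are suspected after `timeout`
// without a heartbeat. Real detectors trade off accuracy against completeness.
struct FailureDetector {
    last_heartbeat: HashMap<String, Instant>,
    timeout: Duration,
}

impl FailureDetector {
    fn new(timeout: Duration) -> Self {
        Self { last_heartbeat: HashMap::new(), timeout }
    }

    // Called whenever a heartbeat message arrives from `peer`.
    fn on_heartbeat(&mut self, peer: &str) {
        self.last_heartbeat.insert(peer.to_string(), Instant::now());
    }

    // Peers we have not heard from within the timeout window.
    fn suspected(&self) -> Vec<&str> {
        self.last_heartbeat
            .iter()
            .filter(|(_, at)| at.elapsed() > self.timeout)
            .map(|(peer, _)| peer.as_str())
            .collect()
    }
}

fn main() {
    let mut fd = FailureDetector::new(Duration::from_millis(100));
    fd.on_heartbeat("node-1");
    fd.on_heartbeat("node-2");
    std::thread::sleep(Duration::from_millis(150));
    fd.on_heartbeat("node-2"); // node-2 keeps sending heartbeats
    println!("suspected: {:?}", fd.suspected()); // ["node-1"]
}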
Chandra: accuracy vs liveness of failure detectors; leases and leased clocks.
TODO Consider 2PC (2-phase commit) for a synchronous non-faulty system with a reliable transport as a first step. If we introduce fail-stop nodes (e.g. the coordinator itself), then consensus will halt with 2PC.
Desired properties of consensus:
- safety - never return an incorrect result under non-Byzantine conditions (network problems etc.); all nodes agree on the order of events;
- liveness - the consensus makes progress;
- support for changes in cluster membership.
Consensus protocols:
- with an elected leader (Raft, Paxos, Tendermint, pBFT, VR, Nakamoto consensus - PoW, PoS, PoA)
- leaderless (EPaxos, Avalanche, IOTA)
- BFT-tolerant (e.g. nodes never assume the leader is correct or honest): pBFT, Nakamoto, Tendermint
- non-BFT-tolerant (require a trusted setup): Raft, VR, Paxos
Chandra: atomic broadcast is dual to consensus.
Raft
Raft is now one of the most widely used consensus algorithms. For example, it is used in etcd (see the chapter on established systems) and TiKV.
The animated introduction should be checked first, and after that the article.
Key takeaways
- rather than trying to merge results from equal peers (like in Paxos), it is much simpler to elect a leader and make it responsible for the state changes
- a modular design: separate leader election, log replication, safety, and membership changes
Consider the Raft critique (e.g. the openraft repo).
Distributed transactions
2-phase commit 3-phase commit
Viewstamped Replication
Used by TigerBeetle.
Martin Kleppmann A Critique of the CAP Theorem
latency-sensitivity framework
availability as a metric
operational and network latency
algorithm sensitivity to latency
communication patterns:
- unicast (one sender to one receiver)
- broadcast (a sender to all possible receivers)
- multicast (a sender to a selected group of receivers)
- convergecast (a group of senders to a single receiver)
Supermajority (2/3) vs simple majority
Network partitions can be modelled as large packet delays (up to the duration of the partition), provided that the duration of the partition is finite and lost packets are retransmitted an unbounded number of times1.
Martin Kleppmann A Critique of the CAP Theorem
ML training
Machine learning involves many matrix operations and is naturally parallelized not only within a single node but also over distributed nodes. Out-of-the-box solutions like Distributed training with TensorFlow transfer parameters (aka weights) from parameter nodes to worker nodes and back during model training. But large parameter matrices for sparse data heavily consume network bandwidth and severely decrease model training speed1.
Sparse data means having many null values. For example, an unpacked data batch X looks like
[[0, 0, .., 0, 9],
[5, 0, .., 0, 34],
...
[0, .., 720, 14]]
where the total number of rows is the batch size m and each feature row has about 2^20 values (yes, it’s Twitter’s scale).
So the first layer of the network requires 2^20 rows in its parameter matrix W consisting of float32 numbers (4 bytes each), and its number of columns depends on the next layer size, typically 50 - 400 in the case of Twitter. This results in first-layer matrices of 50 * 2^20 * (4 * 2^-20) MB = 200 MB to 400 * 2^20 * (4 * 2^-20) MB = 1600 MB. When you have a lot of worker nodes, each exchanging gigabytes of data with the parameter servers, your network bandwidth is in danger. While the first layer is huge, the other layers are considerably smaller.
Twitter approaches this scalability problem with a clever organization of nodes2.
Assume we have K worker nodes, P parameter nodes, and n features in a dataset row.
Then, if we break the weight matrix of the first (sparse) layer into P submatrices by features (by columns):
X * W = X_1 * W_1 + .. + X_P * W_P
then each worker node only has to work with the block of input X_i and the block of the weight matrix W_i.
Each parameter server i is responsible for the i-th partition W_i of the sparse first layer.
Each worker node runs two processes with the following pseudocode:
Data unpacking and partitioning:
const m // batch size
const P // number of partitions of the sparse layer
fn process_input() {
loop {
batch = receive_input_batch(m) // batch of size m
for i in 0..P {
submatrix = unpack_submatrix_i_of_features(batch, i) // with m rows and n/P columns
send_this_submatrix_to_parameters_server(submatrix, i) // send to server i
}
}
}
Reconstructing the output of the first sparse layer and training the rest of the layers:
const P // number of partitions of the sparse layer
fn train_rest_of_layers() {
loop {
output_layer // the product XW with m rows (batch size) and as many columns as the first layer has outputs
for i in 0..P {
output_layer += receive_output_submatrix_from_parameters_server(i) // product of block of X and block of W from parameters server i
}
calculate_other_layers_of_network(output_layer)
}
}
Because the second and further layers are much smaller than the first “sparse” layer, each worker node has these layers locally.
Each parameter node i runs the following process:
const node_number = i
const K // number of workers
fn train_first_layer() {
loop {
Wi // block i of weights matrix for the first sparse layer
Xi = receive_submatrix_of_features(i)
block_i = calculate_block_product(Xi, Wi)
for j in 0..K {
send_output_layer_block_to_worker(block_i, j) // send to worker j
}
}
}
Such a cluster organization allows for the failure and restart of a worker node without loss of the weights of the first layer (the failure of a parameter node is more damaging). Each worker node is stateless, and no transfer of the huge sparse matrix W occurs over the network3.
Blockchain
Permissioned vs permissionless
TODO: Focus on consensus (POW, POS, POH) as a lesson for the book purposes
TPS (transactions per second)
A permissionless blockchain as a solution to BFT1
A blockchain network (also called a cluster) is a p2p network of nodes (called validators in Solana) exchanging a copy of a ledger (the blockchain). Typically 10 - 50 validators, totaling 1000 - 2000 nodes2.
- distributed state transition requires consensus
Replicated state machine: a batch of state transitions (transactions) is recorded as a block. The first state is the genesis block. A state transition is done by applying a transaction. Several state transitions (i.e. transactions) are usually combined into a single block for time and space efficiency. When several competing chains exist, we need to choose one (called the canonical chain) via a consensus mechanism (PoW, PoS, PoA). There is also the notion of finality.
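To make the chain-of-blocks idea tangible, here is a toy sketch of mine (real chains use cryptographic hashes such as SHA-256 rather than Rust's DefaultHasher) where each block commits to the previous block's hash, so tampering with history invalidates all later blocks:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Hash)]
struct Block {
    height: u64,
    prev_hash: u64,            // hash of the previous block - this forms the chain
    transactions: Vec<String>, // a batch of state transitions
}

fn hash_block(block: &Block) -> u64 {
    let mut hasher = DefaultHasher::new();
    block.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // The genesis block has no predecessor.
    let genesis = Block { height: 0, prev_hash: 0, transactions: vec!["genesis".into()] };
    let block1 = Block {
        height: 1,
        prev_hash: hash_block(&genesis),
        transactions: vec!["alice -> bob: 5".into()],
    };
    let block2 = Block {
        height: 2,
        prev_hash: hash_block(&block1),
        transactions: vec!["bob -> carol: 2".into()],
    };
    // A validator can verify the chain by recomputing the hashes.
    assert_eq!(block1.prev_hash, hash_block(&genesis));
    assert_eq!(block2.prev_hash, hash_block(&block1));
    println!("chain of 3 blocks verified");
}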
Smart contracts are run on VM (Sealevel for Solana, EVM for Ethereum)
A client is a program communicating with a cluster via transactions (composed of several instructions in Solana terms).
Transactions are signed by the client by means of public key cryptography.
A dApp (decentralized app) is a UI client (typically written in some JavaScript flavor).
The node which appends a new transaction to the ledger is an elected leader.
A token is a unit of payment for the computing resources of the cluster (running a smart contract or validating its output). You can invest some tokens in a validator (delegate a stake) to gain a reward. The validator takes some fee (commission) for its services (in Bitcoin, by including a transaction with 25 BTC out of nowhere3). Growing validators can offer a lower commission. This favors large validators becoming even larger, because a lower fee incentivizes investors to delegate their stakes at a lower commission4.
If a validator behaves maliciously, you lose your delegated tokens, and the validator also loses its delegated power and cannot earn as much as before.
You can use a wallet, which is a client allowing you to operate on your token account (send and receive). To make stakes, separate stake accounts are usually used. In Solana, tokens on a stake account can only be invested in one validator, the whole amount at a time. To invest in several validators simultaneously you should use several stake accounts.
How do you get your initial tokens to start? Crypto exchanges (like Binance) allow you to buy some tokens in exchange for a usual currency like US dollars.
An account is typically identified by a public key.
Validators advertise their public keys.
Bitcoin Core can be used to build dApps:
- with Bitcoin scripting
- with some metaprotocols requiring trusted nodes. Or you should invent your own blockchain.
Buterin suggested EVM to
Classical consensus requires all nodes to participate, which prevents scalability. So either some nodes (called validators) are selected to validate transactions, or another kind of consensus is suggested (the probabilistic consensus of Bitcoin called Nakamoto consensus, the probabilistic Avalanche consensus), or the topology of the network is changed:
- it is sharded by feature (a DeFi blockchain, smart contracts, etc.), as in Cosmos
- centralization - there is a central network to validate (Polkadot)
- L2 (layer 2, also called sidechains) - see A Foray Into Blockchain Scalability
In the case of a validator pool with classic consensus, we need some voting mechanism to allow for more even participation (quadratic voting, https://wiki.polkadot.network/docs/learn-phragmen).
Byzantine Fault Tolerance (BFT, Tendermint) and Byzantine-Agreement-based (Algorand) approaches limit the number of validators.
PoW and PoS tend to create concentrated pools, which contradicts decentralization.
There are several proposed decentralization indices like Nakamoto Index or Edinburgh Decentralisation Index. We will not delve deeply into specific methodologies but will name a few common subsystems being considered.
- (full) nodes distribution by geography and ownership, cloud providers and data centers
- hardware distribution (by geography, vendor, architecture)
- value owners distribution (by accounts, geography, amount of wealth)
- codebase diversity5
- exchanges support
- number of autonomous systems (AS)
- wallets developers distribution
- jurisdiction constraints (e.g. a ban on mining)
- governance
- code repository diversity (usually only GitHub)
Any of these subsystems can become a bottleneck, effectively reducing decentralization to a very low level (actually the Nakamoto index measures the minimum number of parties needed to control 51% of the resources in any subsystem6). 51% attacks are real: https://www.crypto51.app/
Permissionless (public) blockchains
- Ethereum POS (POW earlier)
- Bitcoin POW
- Solana POH
- Cosmos Tendermint
- Avalanche ???
Permissioned blockchains are used to create CBDCs (Central Bank Digital Currency) and for interbank communication (as SWIFT traditionally does); they are not about decentralization:
- HyperledgerFabric by IBM/Linux Foundation
- Corda by R3
- atop Ethereum https://github.com/Consensys/quorum
- with Polkadot Substrate
- with Avalanche Evergreen
A problem is the growth of blockchain data7, measured in hundreds of GBs. Hence the notion of a succinct blockchain (for example, Mina).
Blockchain Trilemma by Vitalik Buterin - choose 2:
- Security (how easily a node can join the network)
- Decentralization (how many nodes are participating in the consensus)
- Scalability (how many transactions per second (TPS) can the network handle)
DePINs (decentralized physical infrastructure)
Finality is a probabilistic property even for deterministic protocols as there always exists a chance for a hard fork (consensus can be coordinated off-chain like in the DAO hack case)
Our focus is on the distributed nature of blockchains, namely Distributed Ledger Technology (DLT), rather than on money-related issues. While being an innovative, advanced technology revolutionizing several aspects of life, blockchain projects sometimes suffer from ordinary security issues and try to solve them with traditional centralized methods - check https://www.web3isgoinggreat.com/ for a rich collection.
See, for example, Solana’s validators market shares and commissions https://solanabeach.io/validators
(https://ethereum.org/en/whitepaper/)
https://courses.cfte.education/blockchain-and-defi-projects/
See also Balaji S. Srinivasan Quantifying Decentralization
Implementation languages also matter: niche ones (like Haskell or OCaml) or too low level ones (like C) can also prevent wider adoption by developers.
see Blockchain Size
https://bitinfocharts.com/bitcoin/ for Bitcoin, https://etherscan.io/chartsync/chaindefault for Ethereum
Starting with Ethereum1, almost all blockchains provide a Turing-complete computation environment to execute programs (so-called smart contracts2; in Solana they are simply called programs) on chain. The execution is usually handled by some VM (the most popular choice is a WASM-compatible VM3) which is built into the node. Smart contracts are written either in domain-specific languages like Solidity and Cairo or in general-purpose languages like Rust (compiled to wasm or another ISA)4.
Because of the complexity of the whole domain and the public availability of contract sources, any overlooked bug can bring a disaster5.
If we create smart contracts which allow some accounts in the blockchain (selected, for example, based on an entrance fee) to vote on proposals, we get collective decision making, or a so-called DAO (Decentralized Autonomous Organization)6. DAOs are real7.
Bitcoin also has Bitcoin Script but it doesn’t provide loops as a control structure.
The concept of smart contract was proposed by Nick Szabo in 1994
Besides WASM-compatible VMs (like Wasmer), there are Sealevel (Solana’s eBPF-compatible VM with parallel execution) and zkVMs (like RISC Zero), to name a few.
Check https://defillama.com/languages to see the most popular languages. But also take into account the target chain and estimate computation costs (aka “gas fees”), as executing a smart contract is not free.
Despite evident pros (a decentralized, automated process) and cons (e.g. bugs in the smart contracts, a longer decision process), DAO governance with fungible tokens suffers from the problem of “whales” - large token holders who can swing a vote - while DAO governance with whitelisted accounts carries the risk of a Sybil attack. Voters can even use so-called flash loans (tokens borrowed and returned within one block, so no collateral is required) to gain voting power momentarily.
The US state of Wyoming even recognizes DAOs as LLCs, giving them full legal status. See also Securities Law: Social & Community Tokens.
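To make the voting mechanics tangible (including the “whale” problem from the footnote above), here is a toy, purely illustrative Rust sketch of token-weighted DAO voting. It runs off-chain, is not tied to any real smart-contract framework, and all account names and balances are made up.

```rust
use std::collections::{HashMap, HashSet};

/// One proposal with token-weighted yes/no tallies.
struct Proposal {
    description: String,
    yes: u64,
    no: u64,
    voters: HashSet<String>, // each account may vote only once
}

struct Dao {
    /// Voting power of each account, e.g. its governance-token balance.
    token_balances: HashMap<String, u64>,
    proposals: Vec<Proposal>,
}

impl Dao {
    fn propose(&mut self, description: &str) -> usize {
        self.proposals.push(Proposal {
            description: description.to_string(),
            yes: 0,
            no: 0,
            voters: HashSet::new(),
        });
        self.proposals.len() - 1
    }

    fn vote(&mut self, proposal: usize, account: &str, in_favor: bool) -> Result<(), String> {
        let weight = *self
            .token_balances
            .get(account)
            .ok_or_else(|| "unknown account".to_string())?;
        let p = &mut self.proposals[proposal];
        if !p.voters.insert(account.to_string()) {
            return Err("already voted".to_string());
        }
        if in_favor { p.yes += weight; } else { p.no += weight; }
        Ok(())
    }

    /// Simple majority of the token weight that actually voted.
    fn passed(&self, proposal: usize) -> bool {
        let p = &self.proposals[proposal];
        p.yes > p.no
    }
}

fn main() {
    let mut dao = Dao {
        token_balances: HashMap::from([
            ("alice".into(), 60), // a "whale" dominating the vote
            ("bob".into(), 25),
            ("carol".into(), 15),
        ]),
        proposals: Vec::new(),
    };
    let id = dao.propose("Fund the grants program");
    dao.vote(id, "alice", true).unwrap();
    dao.vote(id, "bob", false).unwrap();
    dao.vote(id, "carol", false).unwrap();
    println!("proposal passed: {}", dao.passed(id)); // true: the whale decides
}
```

With 60 of the 100 tokens, alice alone decides the outcome no matter how the other members vote - the whale scenario in a nutshell.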
Here is the list of references - articles, books, and other resources - I used to prepare this book.
- Mikito Takada Distributed systems: for fun and profit
- Brendan Burns Designing Distributed Systems
- Martin Kleppmann Designing Data-Intensive Applications
- Wikipedia Web 3.0
- Easy Rust
- Rust by example
- System design primer
- Lisa Guo on scaling Instagram
- Re: prototypes vs classes was: Re: Sun’s HotSpot
- Martin Kleppmann A Critique of the CAP Theorem
- Martin Logan, Eric Merritt, and Richard Carlsson Erlang and OTP in Action
- https://github.com/tsloughter/epmdless
- https://contactchanaka.medium.com/erlang-cluster-peer-discovery-on-kubernetes-aa2ed15663f9
- Joe Armstrong Making reliable distributed systems in the presence of software errors
- https://www.erlang.org
- https://stackoverflow.com/questions/43173196/what-is-the-maximum-practical-number-of-nodes-in-an-erlang-system
- https://stackoverflow.com/questions/5044574/how-scalable-is-distributed-erlang
- Why we used Pony to write Wallaroo
- Adopting Erlang
- BEAM book
- https://stackoverflow.com/questions/32846615/what-is-the-best-way-of-doing-computationally-intensive-tasks-in-erlang-w-o-scal
- Timmo Verlaan No(de) discovery without DNS & EPMD - Code BEAM STO
- Robert on anything
- Phil Trinder et al Scaling Reliably: Improving the Scalability of the Erlang Distributed Actor Platform
- Scaling Erlang Cluster to 10,000 Nodes
- https://softwareengineering.stackexchange.com/questions/277464/is-erlang-really-an-actor-model-language
- Fallacies of distributed computing
- Fred Hebert Lessons Learned while Working on Large-Scale Server Software
- Distributed training with TensorFlow
- Distributed training of sparse ML models — Part 1: Network bottlenecks
- Distributed training of sparse ML models — Part 2: Optimized strategies
- Distributed training of sparse ML models — Part 3: Observed speedups
- What is blockchain technology?
- Mario Zupan How to build a blockchain in Rust
- Pascal Akunne A guide to blockchain consensus protocols
- Ethereum
- Satoshi Nakamoto Bitcoin: A Peer-to-Peer Electronic Cash System
- Chase Barker Getting Started with Solana Development
- Solana Core Concepts (Community Video)
- Solana Docs
- https://chorus.one/networks/solana/
- Felix Lutsch The Basics of Staking
- The Launch of Chorus Ventures
- Nick Szabo The Idea of Smart Contracts
- Eddie Xie, Yuanjun Yang How we built Twitter’s highly reliable ads pacing service
- Взлетит или нет — две разные точки зрения на Web3
- etcd
- Etcd official website
- Wikipedia Network throughput
- Wikipedia Network congestion
- Wikipedia Goodput
- Wikipedia Measuring network throughput
- Wikipedia Latency (engineering)
- Wikipedia Network bridge
- Wikipedia Network delay
- Alex Diaconu Navigating the 8 fallacies of distributed computing
- Kubernetes: Why Use It, How It Works, Options and Alternatives
- Kubernetes Nodes: Components and Basic Operations
- Clustering and Network Partitions
- Benchmarks: FreeBSD 13 vs. NetBSD 9.2 vs. OpenBSD 7 vs. DragonFlyBSD 6 vs. Linux
- Urbit
- Ethereum Whitepaper
- Urbit for Normies
- Confidential Computing: Hardware-Based Trusted Execution for Applications and Data
- A Technical Analysis of Confidential Computing v1.2
- Brian “Beej Jorgensen” Hall Beej’s Guide to Network Programming
- Wikipedia Network interface controller
- Wikipedia Internet protocol suite
- Wikipedia Routing protocol
- Wikipedia IPsec
- Wikipedia Network address translation
- Wikipedia Private network
- RFC 1918 Address Allocation for Private Internets
- Wikipedia Classless Inter-Domain Routing
- Wikipedia Subnetwork
- Wikipedia Unique local address
- RFC 4193 Unique Local IPv6 Unicast Addresses
- Wikipedia UDP hole punching
- Wikipedia TCP hole punching
- Wikipedia Network switch
- Wikipedia End-to-end principle
- Wikipedia Connectionless communication
- Wikipedia CYCLADES
- Wikipedia Port forwarding
- Wikipedia Internet Protocol
- Wikipedia Interactive Connectivity Establishment
- Wikipedia IPv4 address exhaustion
- Wikipedia Maximum transmission unit
- Wikipedia Address Resolution Protocol
- Wikipedia Dynamic Host Configuration Protocol
- Wikipedia Default gateway
- Wikipedia Path MTU Discovery
- Wikipedia Internet Control Message Protocol
- Wikipedia Tunneling protocol
- Wikipedia Network packet
- Wikipedia Router (computing)
- Wikipedia Packet switching
- Wikipedia Circuit switching
- Wikipedia QUIC
- Wikipedia HTTP/3
- Wikipedia HTTP/2
- Wikipedia Time to live
- Wikipedia Neighbor Discovery Protocol
- Wikipedia MAC address
- Wikipedia VoIP gateway
- Wikipedia Gateway (telecommunications)
- Wikipedia MAC spoofing
- Wikipedia IP address
- Wikipedia Berkeley sockets
- Wikipedia Ethernet
- Wikipedia Virtual network interface
- Sami Niiranen OpenVPN puts packets inside your packets
- Universal TUN/TAP device driver
- FAQ: What is the difference between bridging and routing?
- Yury Pitsishin Demystifying ifconfig and network interfaces in Linux
- Ifconfig Command - Explained in Detail
- Private IP Address Ranges
- Getting IPv6 private addressing right
- Robert Greiner CAP Theorem: Revisited
- Wikipedia Public key infrastructure
- Wikipedia Virtual private network
- Wikipedia Multicast
- Wikipedia IP multicast
- Автономный способ обхода DPI и эффективный способ обхода блокировок сайтов по IP-адресу
- Let’s code a TCP/IP stack, 1: Ethernet & ARP
- Let’s code a TCP/IP stack, 2: IPv4 & ICMPv4
- Let’s code a TCP/IP stack, 3: TCP Basics & Handshake
- Routing traffic through custom tap device en-route to internet on linux
- Two tap device can’t communicate over bridge
- Get the IP address of a network interface in C using SIOCGIFADDR
- How to connect a tap interface to the internet?
- How to read data from tap interface?
- Graham Smith Understanding TUN TAP Interfaces
- What is BGP hijacking?
- What is routing? | IP routing
- Wikipedia ioctl
- IPv6 on Linux tun/tap: NDP not working
- Herman J. Radtke III Working with C unions in Rust FFI
- Using libc::ioctl to read interface flags
- what is the meaning of this macro _IOR(MY_MACIG, 0, int)?
- Ivan Ristić Bulletproof TLS Guide
- Ioctl Numbers
- TUN/TAP interface (on Linux)
- The bindgen User Guide
- Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman Linux Device Drivers
- Wikipedia IPX/SPX
- Wikipedia AppleTalk
- Linux Capabilities: Why They Exist and How They Work
- Linux Capabilities In Practice
- Raw capture capabilities (CAP_NET_RAW, CAP_NET_ADMIN) not working outside /usr/bin and friends for packet capture program using libpcap
- Carlos Polop HackTricks
- Неубиваемый P2P-интернет
- Еще одна «критическая» «уязвимость» «VPN» и почему Port Fail — ерунда
- Shodan собирал IPv6-адреса NTP-клиентов и сканировал их в ответ
- Как IPv6 помогает роутеры ломать
- TCP Congestion Control или Почему скорость прыгает
- Moxie Marlinspike My first impressions of web3
- Cryptocurrency wallets and keys — an introduction to digital asset custody and security
- VPN with Overlapping Networks
- Linux programmatically up/down an interface kernel
- Using ip, what does LOWER_UP mean?
- What is SOCK_DGRAM and SOCK_STREAM?
- What is a generic socket and how does it relate to a network device?
- Семенов Ю.А. 4.4.11 Протоколы маршрутизации (обзор, таблицы маршрутизации, вектор расстояния)
- Семенов Ю.А. 4.4.11.1 Внутренний протокол маршрутизации RIP
- “Tell me everything that happens when you type google.com into a web browser”
- How Data moves through the Internet - Networking Fundamentals
- Что происходит, когда вводишь url, или как работает интернет
- Validator FAQs
- Design Document
- Ed Harmoush What happens when…
- Ed Harmoush Key Players
- Ed Harmoush ARP in 5 Minutes
- Ed Harmoush Traditional ARP
- Ed Harmoush Proxy ARP
- Ed Harmoush What is Subnetting? - Subnetting Mastery - Part 1 of 7
- Ed Harmoush Drawing the Cheat Sheet - Subnetting Mastery - Part 2 of 7
- Ed Harmoush How to solve ANY Subnetting Problems in 60 seconds or less - Subnetting Mastery - Part 3 of 7
- Ed Harmoush Practice Examples - Subnetting Mastery - Part 4 of 7
- Protocol Buffers - Google’s data interchange format
- Scapy
- Cap’n Proto
- FlatBuffers
- Improving Facebook’s performance on Android with FlatBuffers
- FlexBuffers
- Serde
- Сравнение гетерогенных блокчейнов (Cosmos, Polkadot, Avalanche)
- Wikipedia Атака Сивиллы
- Wikipedia Reed’s law
- Как работает Wi-fi. Часть 1. История беспроводных сетей
- Как работает Wi-fi. Часть 2. Физический уровень
- What Everyone Gets Wrong About Blockchains
- Что такое слои блокчейна L0, L1, L2 и L3 и зачем они нужны
- About Staking on Ethereum
- Заблуждения программистов о времени
- Urbit Reference
- Opentelemetry Concepts
- TCP — плохой вариант для дата-центров. Встречайте новый протокол Homa
- Snap: a Microkernel Approach to Host Networking
- Mastering Ethereum, by Andreas M. Antonopoulos, Gavin Wood
- Running A Full Node
- Andy Pavlo Databases in 2022: A Year in Review
- Blockchain Fails to Gain Traction in the Enterprise
- Fast and Transparent Interbank Reconciliation Powered by Distributed Ledger Technology
- Что такое блокчейн-мост и какие риски он в себе таит?
- VPN Gate, или — неубиваемая Великим Китайским Файрволом распределённая сеть VPN
- Melissa E. O’Neill PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation
- Random Number Generation (Entropy)
- Understanding random number generators, and their limitations, in Linux
- How does a true random number generator collect entropy data?
- Nick Sullivan Ensuring Randomness with Linux’s Random Number Generator
- Igor Adamenko Basics of the Internet
- Igor Adamenko TCP & UDP, or the two pillars of the Internet
- Igor Adamenko Internet addressing
- Igor Adamenko Where are the borders in the Internet?
- Igor Adamenko How DNS servers and resolvers work
- Myths about /dev/urandom
- SSU2 — транспортный протокол I2P нового поколения на базе UDP
- VictoriaMetrics Key concepts
- Distributed SQL: An Alternative to Database Sharding
- Marc Brooker Exponential Backoff And Jitter
- What Happens After Finality in ETH2?
- Ognyan Chikov The Process of Creating Decentralized Apps (dApps)
- Dan Luu Files are hard
- Dan Luu Files are fraught with peril
- Dan Luu Reading postmortems
- Dan Luu Notes on concurrency bugs
- The Beacon Chain Ethereum 2.0 explainer you need to read first
- TigerBeetle - A Million Financial Transactions per Second in Zig
- Joran Dirk Greef TigerStyle! (Or How To Design Safer Systems in Less Time)
- Interledger Community Call - 25 November 2020
- Martin Thompson Evolution of Financial Exchange Architectures
- Как концептуально работает Tornado Cash, который «забанили» власти США
- Перехват трафика как вектор атаки на пользователей блокчейн-проектов
- Preethi Kasireddy How does Ethereum work, anyway?
- Preethi Kasireddy How does the NEW Ethereum work?
- Distributed validator technology
- Ethereum Contract ABI Specification. Взаимодействие с контрактом
- Anton Ilinchik, Vishal Varshney All you need to know about timeouts
- Konstantinos Tsiaras Study and Resource Analysis of Ethereum Execution Client Bootstrapping
- Proof-of-Work vs Proof-of-Stake
- How Ouroboros Samasika Upholds Mina’s Goals of Decentralization
- Brad Cohn, Evan Shapiro, and Emre Tekişalp Mina: Economics and Monetary Policy
- Jeff Hodges Notes on Distributed Systems for Young Bloods
- Blockchain and Digital Asset projects worldwide in 2023
- Polkadot documentation
- Polkadot light paper
- Substrate Documentation
- Dr. Gavin Wood POLKADOT: VISION FOR A HETEROGENEOUS MULTI-CHAIN FRAMEWORK
- Alvaro Videla Failure modes in distributed systems
- Vaidehi Joshi Modes of Failure (Part 1)
- Vaidehi Joshi Modes of Failure (Part 2)
- Leslie Lamport Time, Clocks, and the Ordering of Events in a Distributed System
- Wikipedia Vector clock
- Bryan Fink Why Vector Clocks are Easy
- Justin Sheehy Why Vector Clocks Are Hard
- Giuseppe DeCandia et al Dynamo: Amazon’s Highly Available Key-value Store
- Distributed Computing Manifesto
- Jonathan Ellis Why Cassandra Doesn’t Need Vector Clocks
- Carlos Baquero, Nuno Preguiça Why Logical Clocks are Easy
- Barbara Simons, Jennifer Lundelius Welch and Nancy Lynch An Overview of Clock Synchronization
- Shihui Song Survey on Scalable Failure Detectors
- Wikipedia Lamport timestamp
- Tushar Deepak Chandra and Sam Toueg Unreliable Failure Detectors for Reliable Distributed Systems
- A Foray Into Blockchain Scalability
- Ivan Velichko Computer Networking Introduction: Ethernet and IP (Heavily Illustrated)
- Ivan Velichko Bridge vs. Switch: What I Learned From a Data Center Tour
- Ivan Velichko Illustrated introduction to Linux iptables
- Thomas Graf Why is the kernel community replacing iptables with BPF?
- Jeremy Colvin What is Kube-Proxy and why move from iptables to eBPF?
- Manish Sampat When (And When Not) to Use eBPF
- Proof of Useful Work
- Nakamoto Coefficient: An Accurate Indicator for Blockchain Decentralization?
- Jan Camenisch Chain Key Cryptography: The Scientific Breakthrough Behind the Internet Computer
- The DFINITY Team The Internet Computer for Geeks
- Marek Majkowski Select is fundamentally broken
- Marek Majkowski Epoll is fundamentally broken 1/2
- Marek Majkowski Epoll is fundamentally broken 2/2
- Rob Habermeier Polkadot v1.0: Sharding and Economic Security
- SPARTAN Research Bitcoin Layers: Tapestry of a Trustless Financial Era
- Balaji S. Srinivasan Quantifying Decentralization
- Dimitris Karakostas, Aggelos Kiayias, and Christina Ovezik SoK: A Stratified Approach to Blockchain Decentralization
- Ivan Velichko How Container Networking Works: a Docker Bridge Network From Scratch
- Ivan Velichko Networking Lab: Ethernet Broadcast Domains
- Diego Ongaro and John Ousterhout In Search of an Understandable Consensus Algorithm
- Jim Zhang Consensus Algorithms: PoA, IBFT or Raft?
- Jeff Chase Distributed Systems, Failures, and Consensus
- Pierre Krieger Everything I know about networking
- A Brief Tour of FLP Impossibility
- Michael Fischer, Nancy Lynch, Michael Paterson Impossibility of Distributed Consensus with One Faulty Process
- Wikipedia Quorum (distributed computing)
- Consensus Protocols: Two-Phase Commit
- Isak Toivanen and Maximilian Vorbrodt Analyzing the Performance of Linux Networking Approaches for Packet Processing
- Sebastian Gallenmüller et al. Comparison of frameworks for high-performance packet IO
- AF_XDP
- Andrew Poelstra A Treatise on Altcoins
- A Deep Dive Into Blockchain Scalability
- DePIN: What are Decentralized Physical Infrastructure Networks?
- Vitalik Buterin DAOs, DACs, DAs and More: An Incomplete Terminology Guide
- Adam J. Kerpelman What is a DAO and What is it For?
- Reuben Bramanathan Securities Law: Social & Community Tokens
- Solidity
- Bridging and Finality: An Introduction
- Bridging and Finality: Ethereum
- https://www.practicalnetworking.net/series/nat/nat/
- https://www.practicalnetworking.net/series/arp/address-resolution-protocol/
- https://www.practicalnetworking.net/stand-alone/communication-through-multiple-switches/
- https://www.practicalnetworking.net/series/arp/gratuitous-arp/
- https://www.practicalnetworking.net/series/arp/arp-probe-arp-announcement/
- https://www.practicalnetworking.net/stand-alone/vlans/
- https://www.practicalnetworking.net/stand-alone/eigrp-metric/
- https://www.practicalnetworking.net/series/packet-traveling/packet-traveling/
- https://www.practicalnetworking.net/series/packet-traveling/packet-traveling-series-finale/
- https://www.practicalnetworking.net/series/packet-traveling/host-to-host-through-a-router/
- https://www.practicalnetworking.net/series/packet-traveling/host-to-host/
- https://www.practicalnetworking.net/series/packet-traveling/host-to-host-through-a-switch/
- https://www.practicalnetworking.net/stand-alone/ethernet-wiring/
- https://habr.com/ru/post/684524/
- https://habr.com/ru/post/675812/
- https://habr.com/ru/post/686230/
- http://book.itep.ru/4/6/bitcoin.htm
- http://book.itep.ru//4/6/blockchain.htm
- http://book.itep.ru/5/cicl_54.htm
- http://book.itep.ru/6/vlan_62.htm
- http://book.itep.ru/6/fwal_63.htm
- http://book.itep.ru/6/pgp_644.htm
- http://book.itep.ru/6/ipsec.htm
- http://book.itep.ru/10/sem_net.htm
- http://book.itep.ru/4/44/ip6_4411.htm
- http://book.itep.ru/4/44/tun_4412.htm
- http://book.itep.ru/4/44/nd.htm
- http://book.itep.ru/4/44/rtp_4492.htm
- http://book.itep.ru/4/44/rtc_4493.htm
- http://book.itep.ru/4/4/sip.htm
- http://book.itep.ru/4/44/rut_4411.htm
- https://blog.twitter.com/engineering/en_us/topics/insights/2021/simple-scalable-graph-neural-networks
- https://blog.twitter.com/engineering/en_us/topics/insights/2021/fusing-elasticsearch-with-neural-networks-to-identify-data
- https://blog.twitter.com/engineering/en_us/topics/infrastructure/2021/processing-billions-of-events-in-real-time-at-twitter-
- https://blog.twitter.com/engineering/en_us/topics/insights/2021/graph-neural-networks-through-the-lens-of-differential-geometry-
- https://blog.twitter.com/engineering/en_us/topics/insights/2022/graph-machine-learning-with-missing-node-features
- https://blog.twitter.com/engineering/en_us/topics/open-source/2020/hunting-a-linux-kernel-bug
- https://ethereum.org/en/developers/docs/evm/
- https://ethereum.github.io/yellowpaper/paper.pdf
- https://ethereum.org/en/developers/docs/
- https://docs.soliditylang.org/en/v0.6.0/introduction-to-smart-contracts.html#the-ethereum-virtual-machine
- https://habr.com/ru/company/yandex/blog/564510/
- https://habr.com/ru/company/dsec/blog/278335/
- https://habr.com/ru/post/168407/
- https://habr.com/ru/post/430172/