Goral generated with Imagine.art

System requirements

Memory: about 30 MB RSS. The actual requirement may differ, as it depends on the amount of data and on the scrape and push intervals (see below for each service)
Binary size is around 15 MB
Platforms: Linux (x86-64 or aarch64). Other platforms, namely macOS (ARM chips) and Windows, should also work, but they are covered only by unit tests.

Installation

You can download the Goral binary with

curl --proto '=https' --tlsv1.2 -sSf https://maksimryndin.github.io/goral/download.sh | sh

or by downloading a prebuilt binary from the releases page manually

wget https://github.com/maksimryndin/goral/releases/download/0.1.16/goral-0.1.16-x86_64-unknown-linux-gnu.tar.gz
tar -xzf goral-0.1.16-x86_64-unknown-linux-gnu.tar.gz
cd goral-0.1.16-x86_64-unknown-linux-gnu/
shasum -a 256 -c sha256_checksum.txt

or from source

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
git clone --depth 1 --branch 0.1.16 https://github.com/maksimryndin/goral
cd goral
RUSTFLAGS='-C target-feature=+crt-static' cargo build --release --target <target triple>

To run the binary

goral -c config.toml --id myhost

where myhost is an identifier of the machine (no more than 8 chars) and an example config.toml can be

[general]
service_account_credentials_path = "/etc/goral_service_account.json"
messenger.specific.chat_id = "-000000000000"
messenger.url = "https://api.telegram.org/bot12345678:XXXXXxxxxx-XXXddxxss-XXX/sendMessage"

[healthcheck]
spreadsheet_id = "<spreadsheet_id_1>"
messenger.specific.chat_id = "-111111111111"
messenger.url = "https://api.telegram.org/bot12345678:XXXXXxxxxx-XXXddxxss-XXX/sendMessage"
messenger.send_google_append_error = false
autotruncate_at_usage_percent = 90
[[healthcheck.liveness]]
name = "backend"
type = "Http"
endpoint = "http://127.0.0.1:8080"

[logs]
spreadsheet_id = "<spreadsheet_id_2>"
messenger.url = "https://discord.com/api/webhooks/123456789/xxxxx-XXXXX"
autotruncate_at_usage_percent = 90

[metrics]
spreadsheet_id = "<spreadsheet_id_3>"
messenger.specific.token = "xoxb-123-456-XXXXXX"
messenger.specific.channel = "XXXXXXXX"
messenger.url = "https://slack.com/api/chat.postMessage"
autotruncate_at_usage_percent = 90
[[target]]
endpoint = "http://127.0.0.1:8080/metrics"
name = "backend"

[system]
spreadsheet_id = "<spreadsheet_id_4>"
messenger.url = "https://discord.com/api/webhooks/101010101/xxxxx-XXXXX"
autotruncate_at_usage_percent = 90
mounts = ["/", "/var"]
process_names = ["goral", "mybackend"]

[kv]
spreadsheet_id = "<spreadsheet_id_5>"
port = 50000

See also Services and Recommended deployment.

Setup

To use Goral you need to have a Google account and obtain a service account:

1. Create a project (we suggest creating a separate project for Goral for security reasons)
2. Enable Sheets from the products page
3. Create a service account with Editor role
4. Create a private key in JSON (three dots menu in Actions column -> Manage keys -> Add Key) for it (the private key should be downloaded by your browser)
5. Create a spreadsheet (where the scraped data will be stored) and share the spreadsheet with the service account email (you can take client_email from the downloaded private key) as an Editor
6. Extract the spreadsheet id from the spreadsheet url

Steps 5-6 can be repeated for each service (it is recommended to have separate spreadsheets for usability and separation of concerns).
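The id extraction from the last step can be sketched in Python. This is a minimal sketch, assuming the usual Google Sheets URL shape https://docs.google.com/spreadsheets/d/<spreadsheet_id>/edit...; the function name extract_spreadsheet_id is illustrative and not part of Goral.

```python
import re

# Assumption: the id is the path segment that follows "/spreadsheets/d/".
_SPREADSHEET_URL = re.compile(r"/spreadsheets/d/([a-zA-Z0-9_-]+)")


def extract_spreadsheet_id(url):
    """Return the spreadsheet id embedded in a Google Sheets URL."""
    match = _SPREADSHEET_URL.search(url)
    if match is None:
        raise ValueError("no spreadsheet id found in %r" % url)
    return match.group(1)
```

The returned value is what goes into spreadsheet_id in config.toml.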

Note: you can also install the Google Sheets app on your phone to have access to the data.

Notifications are sent to messengers with three levels:

🟢 (INFO)
🟡 (WARN)
🔴 (ERROR)

and are prefixed with the id (the argument of the --id flag). Messengers are configured separately for every service so that you can use e.g. Discord for General service and Telegram for Healthcheck.

Slack

App creation

Follow the quickstart guide and the posting guide.

Example configuration:

messenger.specific.token = "xoxb-slack-token"
messenger.specific.channel = "CHRISHWFH2"
messenger.url = "https://slack.com/api/chat.postMessage"

Rate limit

Telegram

Bot creation

Create a bot
Create a private group for notifications to be sent to
Add your bot to the group
Obtain a chat_id following the accepted answer

Example configuration:

messenger.specific.chat_id = "-100129008371"
messenger.url = "https://api.telegram.org/bot<bot token>/sendMessage"

Rate limit

Note: for Telegram all info-level messages are sent without notification so the phone doesn't vibrate or make any sound.

Services

A workhorse abstraction of Goral over a sheet is an appendable log - just a table which grows via addition of key-value records.

A sheet managed by Goral has a title <log to collect data on>@<host_id>@<service> <creation datetime>. You can change the title, column names and other elements of the sheet, but be aware that Goral will continue to append data in the same order as when it created the sheet, provided the form of the data hasn't changed (either <log to collect data on> or its keys). Creation datetimes of sheets always differ by some number of seconds (they are jittered), even if the sheets were created at the same time, in order to prevent conflicts in sheet titles.
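To make the title layout concrete, here is a small Python sketch that splits a title into its parts. It assumes the service name contains no spaces and treats the creation datetime as an opaque string, since its exact format is not specified here; the example title in the usage note is hypothetical.

```python
from typing import NamedTuple


class SheetTitle(NamedTuple):
    log_name: str   # <log to collect data on>
    host_id: str    # the value passed via --id
    service: str
    created: str    # creation datetime, kept as an opaque string


def parse_sheet_title(title):
    """Split '<log>@<host_id>@<service> <creation datetime>' into its parts."""
    parts = title.split("@", 2)
    if len(parts) != 3:
        raise ValueError("not a Goral-style sheet title: %r" % title)
    log_name, host_id, rest = parts
    # Assumption: the first space separates the service from the datetime.
    service, _, created = rest.partition(" ")
    return SheetTitle(log_name, host_id, service, created)
```

For instance, parse_sheet_title("orders@myhost@kv 2023-12-09 09:50:46").log_name gives "orders".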

Commented lines (starting with #) for all configurations below are optional and example values are their defaults. Every service has a messenger configuration (see Setup). It is recommended to have several messengers and different chats/channels for each messenger and take into account their rate limits when configuring a service.

Storage quota

Every service (except General) has an autotruncate_at_usage_percent configuration - the limit of the usage share for a service. Any Google spreadsheet can contain at most 10_000_000 cells, so if a service takes more than autotruncate_at_usage_percent of 10_000_000 cells, Goral will truncate old data by either removing old rows or removing sheets with the same name (<log to collect data on>) under the service. For every service the cleanup truncates the surplus (actual usage - limit) plus 10% of the limit.
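The truncation arithmetic can be worked through with a short Python sketch. It only restates the rule above (surplus plus 10% of the limit); the function name and the integer rounding are assumptions, not Goral's actual implementation.

```python
MAX_CELLS = 10_000_000  # Google Sheets limit per spreadsheet


def cells_to_truncate(used_cells, limit_percent):
    """Number of cells a cleanup would free for one service."""
    limit = MAX_CELLS * limit_percent // 100
    if used_cells <= limit:
        return 0  # within the limit, nothing to truncate
    surplus = used_cells - limit
    return surplus + limit // 10  # surplus plus 10% of the limit
```

With a 20% limit (2_000_000 cells) and 2_100_000 cells in use, the cleanup frees 100_000 + 200_000 = 300_000 cells.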

When providing your own limits, remember to keep the sum of limits at no more than 100% for all services related to the same spreadsheet. Default settings conservatively assume that you write all the data to the same spreadsheet. If KV service is turned on, its default truncation limit is set to 100% (meaning nothing is truncated as a safe default), so you will get a warning that the sum of limits is over 100% if you use the same spreadsheet for other services.

Also if a spreadsheet includes other sheets not managed by Goral, take into account their usage.
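A quick way to check your own limits before starting Goral is to sum them per spreadsheet. The sketch below is an illustrative helper, not part of Goral; the percentages in the usage note are the defaults listed in the service sections of this document.

```python
from collections import defaultdict


def overcommitted_spreadsheets(services):
    """services maps a service name to (spreadsheet_id, limit_percent).

    Returns the spreadsheets whose summed limits exceed 100%.
    """
    totals = defaultdict(int)
    for spreadsheet_id, percent in services.values():
        totals[spreadsheet_id] += percent
    return {sid: total for sid, total in totals.items() if total > 100}
```

With the defaults (healthcheck 10, metrics 20, logs 30, system 20) on one spreadsheet the sum is 80%, which fits; adding kv with its default of 100 to the same spreadsheet brings it to 180%.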

General

General service is responsible for a reserved communication channel and for important notifications about Goral itself. It also periodically checks (every 72 hours at the moment) for new releases of Goral and notifies you about them. Its configuration:

Basic configuration

[general]
service_account_credentials_path = "/path/to/service_account.json"
messenger.url = "<messenger api url for sending messages>"

Full configuration

[general]
# log_level = "info"
service_account_credentials_path = "/path/to/service_account.json"
messenger.url = "<messenger api url for sending messages>"
# graceful_timeout_secs = 5

Configuration of General service is a minimum configuration for Goral.

Healthcheck

Healthcheck service with a configuration

Basic configuration

[healthcheck]
spreadsheet_id = "<spreadsheet_id>"
[[healthcheck.liveness]]
type = "Http"
endpoint = "http://127.0.0.1:9898"

Full configuration

[healthcheck]
spreadsheet_id = "<spreadsheet_id>"
# messenger.url = "<messenger api url for sending messages>"
# push_interval_secs = 20
# autotruncate_at_usage_percent = 10
[[healthcheck.liveness]]
# name = "http://127.0.0.1:9898" # by default the endpoint itself is used as a name
# initial_delay_secs = 0
# period_secs = 5
type = "Http"
endpoint = "http://127.0.0.1:9898"
# timeout_ms = 3000 # should be less than or equal to period_secs
[[healthcheck.liveness]]
# name = "ls -lha" # by default the command itself is used as a name
# initial_delay_secs = 0
# period_secs = 5
type = "Command"
command = ["ls", "-lha"]
# timeout_ms = 3000 # should be less than or equal to period_secs
[[healthcheck.liveness]]
# name = "[::1]:9898" # by default the tcp socket addr itself is used as a name
# initial_delay_secs = 0
# period_secs = 5
type = "Tcp"
endpoint = "[::1]:9898"
# timeout_ms = 3000 # should be less than or equal to period_secs
[[healthcheck.liveness]]
# name = "http://[::1]:50050" # by default the endpoint itself is used as a name
# initial_delay_secs = 0
# period_secs = 5
type = "Grpc"
endpoint = "http://[::1]:50050"
# timeout_ms = 3000 # should be less than or equal to period_secs

will create a sheet for every liveness probe.

Liveness probes follow the same rules as for k8s. Namely:

you should choose among HTTP GET (any status >=200 and <400), gRPC, TCP (Goral can open socket) and command (successful exit)
you can specify an initial delay, a period of a probe and a timeout on a probe
misconfiguration of a liveness probe is considered a failure
for gRPC, the Health service should be configured on the app side (see also gRPC health checking protocol). Only the http scheme is supported at the moment. If you need a tls check, you can use a command probe with a grpc health probe and specify proper certificates.

Goral saves the probe time, a status (true for alive) and a text output (for HTTP GET - a response text, for a command - its stdout output, for all probes - an error text). Each probe is saved in a separate sheet. If an output is larger than 1024 bytes, it is truncated and you get permanent warnings in the logs of Goral. So keep the output size of your healthcheck reasonable (healthcheck responses shouldn't be heavy). For command healthchecks it is recommended to assign a name to a liveness probe or to wrap your command in a script, so that small changes in command arguments preserve the same sheet for data (otherwise Goral will create a new sheet since the title has changed).
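The HTTP rule and the output limit described above can be illustrated with a small Python sketch. It is not Goral's code: the function name, the returned tuple and the UTF-8 decoding are assumptions made for illustration.

```python
MAX_OUTPUT_BYTES = 1024  # outputs above this size are truncated


def evaluate_http_probe(status_code, body):
    """Return (alive, text, truncated) for one HTTP GET probe result."""
    alive = 200 <= status_code < 400          # any status >=200 and <400
    truncated = len(body) > MAX_OUTPUT_BYTES  # this case produces a warning
    text = body[:MAX_OUTPUT_BYTES].decode("utf-8", errors="replace")
    return alive, text, truncated
```

A 302 redirect therefore counts as alive, while a 404 or a 500 does not.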

If a messenger is configured, then any healthcheck change (healthy -> unhealthy and vice versa) is sent via the messenger. With many endpoints and short liveness periods there is a risk of hitting the rate limit of a messenger.
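To estimate how many messages a flapping endpoint would produce, the transition rule can be sketched as below. This is an illustrative model only; in particular it assumes the very first probe result does not trigger a notification, which this document does not state.

```python
def state_changes(statuses):
    """Return (index, new_state) for every healthy <-> unhealthy transition."""
    changes = []
    previous = None
    for index, alive in enumerate(statuses):
        if previous is not None and alive != previous:
            changes.append((index, "healthy" if alive else "unhealthy"))
        previous = alive
    return changes
```

Counting the transitions per minute across all probes gives a rough figure to compare with the messenger rate limit.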

Metrics

Metrics service scrapes endpoints that expose Prometheus metrics. The maximum response body is set to 65536 bytes.

Basic configuration

[metrics]
spreadsheet_id = "<spreadsheet_id>"
[[target]]
endpoint = "<prometheus metrics endpoint1>"

Full configuration

[metrics]
spreadsheet_id = "<spreadsheet_id>"
# messenger.url = "<messenger api url for sending messages>"
# push_interval_secs = 30
# scrape_interval_secs = 10
# scrape_timeout_ms = 3000
# autotruncate_at_usage_percent = 20
[[target]]
endpoint = "<prometheus metrics endpoint1>"
name = "app1"
[[target]]
endpoint = "<prometheus metrics endpoint2>"
name = "app2"

For every scrape target and every metric Metrics service creates a separate sheet. When several targets are scraped, their sheet names start with <name>: to distinguish several instances of the same app.
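The per-metric split can be pictured with the following Python sketch, which groups samples of a Prometheus text response by a "<name>:<metric>" key. It is a deliberately simplified parser (no escaping rules, no histogram handling) and the exact sheet naming beyond the <name>: prefix is an assumption.

```python
import re
from collections import defaultdict

# metric name, optional {labels}, value, optional integer timestamp
_SAMPLE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)(?:\s+-?\d+)?$")


def group_samples(target_name, exposition):
    """Group samples by '<target name>:<metric name>' (one group per sheet)."""
    sheets = defaultdict(list)
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot read
        metric, labels, value = match.groups()
        sheets[target_name + ":" + metric].append((labels or "", float(value)))
    return dict(sheets)
```

Each key of the result corresponds to one sheet, so a target exposing many distinct metrics produces many sheets.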

If there is an error while scraping, it is sent via a configured messenger or via a default messenger of General service.

Note: if the observed app restarts, then its metric counters are reset. Goral just scrapes metrics as-is without trying to merge them. For metrics data this is usually acceptable. If you need a more reliable way to collect data, consider using KV Log, as it uses a synchronous push strategy and allows you to set up a merge strategy on the app side.

If a messenger is configured, then any scraping error is sent via the messenger (even if it is the same error, it is sent each time). With many scrape endpoints and short scrape intervals there is a risk of hitting a messenger rate limit.

Logs

Logs service with the following configuration:

Basic configuration

[logs]
spreadsheet_id = "<spreadsheet_id>"

Full configuration

[logs]
spreadsheet_id = "<spreadsheet_id>"
# messenger.url = "<messenger api url for sending messages>"
# push_interval_secs = 5
# autotruncate_at_usage_percent = 30
# filter_if_contains = []
# drop_if_contains = []

will create a single sheet with columns datetime, level, log_line. A log line is truncated to 50 000 chars as it is a Google Sheets limit for a cell. Goral tries to extract a log level and datetime from a log line. If it fails to extract a log level then N/A is displayed. If it fails to extract datetime, then the current system time is used.

For logs collection Goral reads its stdin. It is a portable way to collect the stdout of another process without privileged access. Using pipes, you can build more sophisticated preprocessing of logs. There is a caveat: if we make a simple pipe like instrumented_app | goral, then when instrumented_app terminates Goral will not see any more input and will stop reading. This can be solved with named pipes (for Windows there should be a similar way, as it also supports named pipes).

Create a named pipe, say instrumented_app_logs_pipe, with the command mkfifo instrumented_app_logs_pipe (it creates a pipe file in the current directory; you can choose a more appropriate place)
A named pipe stays open for the reader only while there is at least one writer, so create a fake one: while true; do sleep 365d; done >instrumented_app_logs_pipe &
Start Goral with its usual command args and the pipe: goral -c config.toml --id "host" <instrumented_app_logs_pipe
Start your app with instrumented_app | tee instrumented_app_logs_pipe - you will see your logs in stdout as before and they will also be copied to the named pipe which is read by Goral.

With this named pipe approach, restarts of instrumented_app don't stop Goral from reading its stdin for logs. Just be sure to recreate the fake writer automatically when the host system restarts. See also the Deployment section for an example.

As there may be a huge amount of logs, it is recommended to reduce the volume by specifying arrays of substrings (case sensitive) in filter_if_contains (e.g. ["info", "warn", "error"]) and drop_if_contains, and/or to have a separate spreadsheet for logs collection, as a huge amount of logs may hinder the use of Google Sheets due to the constant UI updates.
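A plausible reading of the two options is sketched below in Python: a line is kept if it contains at least one filter_if_contains substring (or the list is empty) and none of the drop_if_contains substrings. The evaluation order and the empty-list behaviour are assumptions; check them against your Goral version.

```python
def keep_log_line(line, filter_if_contains, drop_if_contains):
    """Decide whether a log line would be stored (case-sensitive matching)."""
    if filter_if_contains and not any(s in line for s in filter_if_contains):
        return False  # none of the required substrings is present
    return not any(s in line for s in drop_if_contains)
```

Because matching is case sensitive, ["error"] does not match a line that only contains "ERROR"; list both spellings if your app mixes them.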

System

System service configuration:

Basic configuration

[system]
spreadsheet_id = "<spreadsheet_id>"

Full configuration

[system]
spreadsheet_id = "<spreadsheet_id>"
# push_interval_secs = 20
# autotruncate_at_usage_percent = 20 # approx 2 days of data under default settings
# scrape_interval_secs = 10
# scrape_timeout_ms = 3000
# messenger.url = "<messenger api url for sending messages>"
# mounts = ["/"]
# names = ["goral"]

With this configuration System service will create the following sheets:

basic: with general information about the system (boot time, memory, cpus, swap, a number of processes)
network: for every network interface a number of bytes read/written since the previous measurement, total read/written
top_disk_read - process which has read the most from disk during the last second
top_disk_write - process which has written the most to disk during the last second
top_cpu - process with the most cpu time during the last second
top_memory - process with the most memory usage during the last second
top_open_files (for Linux only) - among the processes with the same user as Goral (!) - process with the most opened files (files in a Linux sense - includes also sockets, pipes etc)
for every process with a name containing one of the substrings in names - a sheet with process info. Note: the first match (case sensitive) is used, so choose a unique name for your binary accordingly.
for every mount in mounts - disk usage and free space.
ssh - for Linux systems ssh access log is monitored. There is a status field with the following values:
- rejected - ssh user is correct but the key or password is wrong. Also a catchall reason for other unsuccessful connections.
- invalid_user - an invalid (nonexistent) ssh user was used.
- timeout - a timeout on ssh connection happened.
- wrong_params - no matching key exchange method found or an invalid format of the key
- connected - a successful ssh connection is established (by default there is a rule with a warning notification for this event)
- terminated - an ssh session (established earlier with connected) is terminated

System service doesn't require root privileges to collect the telemetry. For a process, the cpu usage percent may be more than 100% in the case of a multi-threaded process running on several cores. memory_used by a process is its resident set size.

If there is an error while collecting system info, it is sent via a configured messenger or via a default messenger of General service.

KV Log

Goral allows you to append key-value data to a Google spreadsheet. Let's take an example. You provide a SaaS for wholesalers and have several services, let's say "Orders" and "Marketing Campaigns"¹. Your client asks you for billing data for each of the services in a spreadsheet format at the end of the month. You turn on the KV Goral service with the following configuration:

Basic configuration

[kv]
spreadsheet_id = "<spreadsheet_id>"
port = 49152 # port from the range 49152-65535

Full configuration

[kv]
spreadsheet_id = "<spreadsheet_id>"
port = 49152 # port from the range 49152-65535
# autotruncate_at_usage_percent = 100
# messenger.url = "<messenger api url for sending messages>"

Such a configuration runs a server process in the Goral daemon listening at the specified port (localhost only for security reasons). From your app you periodically make a batch POST request to localhost:<port>/api/kv with a json body:

{
  "rows": [
    {
        "datetime": "2023-12-09T09:50:46.136945094Z", /* an RFC 3339 and ISO 8601 date and time string */
        "log_name": "orders", /* validated against regex ^[a-zA-Z_][a-zA-Z0-9_]*$ */
        "data": [["donuts", 10], ["chocolate bars", 3]] /* first datarow in "orders" log defines order of column headers for a sheet (if it should be created) */
    },
    {
        "datetime": "2023-12-09T09:50:46.136945094Z",
        "log_name": "orders",
        "data": [["chocolate bars", 3], ["donuts", 0]] /* you should provide the same keys (but the order is not important) for all datarows which go to the same sheet, otherwise a separate sheet is created */
    },
    {
        "datetime": "2023-12-09T09:50:46.136945094Z",
        "log_name": "campaigns",
        "data": [["name", "10% discount for buying 3 donuts"], ["is active", true], ["credits", -6], ["datetime", "2023-12-11 09:19:32.827321506"]], /* datatypes for values are string, integer (unsigned 32-bits), boolean, float (64 bits) and datetime string (in common formats without tz) */
    }
  ]
}

Goral KV service responds with 200 OK for a successful append. Goral accepts every batch and creates corresponding sheets in the configured spreadsheet for "Orders" and "Marketing Campaigns". At the end of the month you have all the billing data neatly collected. For an even more interactive setup you can share spreadsheet access with your client (so that they can follow the whole process online) and configure a messenger for alerts and notifications (see the section Rules), adding your client to the chat. Unlike other Goral services, this KV api is synchronous - if Goral responds successfully, then the sheets are already created and the data is saved.
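A client for this endpoint could look like the Python sketch below, using only the standard library. The helper names make_row and post_rows are illustrative, and the Content-Type header is an assumption; the URL path, the body layout and the log_name regex come from the description above.

```python
import json
import re
import urllib.request
from datetime import datetime, timezone

LOG_NAME = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")


def make_row(log_name, data, when=None):
    """Build one element of "rows"; data is a dict of column -> value."""
    if not LOG_NAME.match(log_name):
        raise ValueError("invalid log_name: %r" % log_name)
    when = when or datetime.now(timezone.utc)
    if when.tzinfo is None:
        raise ValueError("datetime must be timezone-aware for RFC 3339")
    return {
        "datetime": when.isoformat(),
        "log_name": log_name,
        "data": [[key, value] for key, value in data.items()],
    }


def post_rows(port, rows, timeout=10.0):
    """POST a batch to the local KV service and return the HTTP status."""
    body = json.dumps({"rows": rows}).encode("utf-8")
    request = urllib.request.Request(
        "http://localhost:%d/api/kv" % port,
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

For example, post_rows(50000, [make_row("orders", {"donuts": 10, "chocolate bars": 3})]) sends a single-row batch to a KV service configured with port = 50000.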

If there is an error while appending data, it is sent only via the default messenger of General service. The configured messenger is only used for notifications according to the configured rules.

Note: KV accepts unsigned 32-bits so the integers range is [0, 4_294_967_295]. As Google Sheets only accepts f64 (doubles), then the lossless conversion from integers to floats is only valid for unsigned 32-bits. Goral internally uses lossy conversion from unsigned 64-bits integers to f64 for those cases where rounding errors are acceptable, such as system measurements (disk space, memory etc.). If rounding errors are acceptable in your case, you should use floats for values exceeding the above mentioned range.
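The range rule can be applied on the client side before sending a value. The following Python sketch keeps integers inside the unsigned 32-bit range as they are and converts everything else to a float, accepting the rounding error; the function name is illustrative.

```python
U32_MAX = 4_294_967_295


def to_kv_number(value):
    """Keep u32-range integers exact, send anything outside as a float."""
    if 0 <= value <= U32_MAX:
        return value
    return float(value)  # lossy for magnitudes above 2**53
```

The loss becomes visible for large values: float(2**53 + 1) is stored as 9007199254740992.0, one less than the original integer.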

Note: Appending to the log is not idempotent, i.e. if you retry the same request, then the same set of rows will be added twice. If it is absolutely important to avoid any duplication (which may happen in case of some unexpected failure or retry logic), then it is recommended to add a unique idempotency key to the data, to be able to filter duplicates in the spreadsheet and remove them manually.
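One way to follow this recommendation is sketched below in Python: every data row gets an extra column holding a unique key, and duplicates can later be recognised by that column. The column name idempotency_key and both helpers are hypothetical; Goral itself does not interpret the key.

```python
import uuid


def with_idempotency_key(data, key=None):
    """Prepend a unique key column to a list of [column, value] pairs."""
    return [["idempotency_key", key or uuid.uuid4().hex]] + list(data)


def drop_duplicates(rows):
    """Keep the first row for every idempotency key (e.g. on exported data)."""
    seen = set()
    unique = []
    for row in rows:
        key = dict(row)["idempotency_key"]
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

Generate the key once per logical record and reuse it on retries, so that a retried request carries the same key as the original one.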

Note: for KV service the autotruncation mechanism is turned off by default (autotruncate_at_usage_percent = 100). It means that you should either set that value to some percent below 100 or clean up old data manually.

Note: For every append operation Goral uses 2 Google api method calls, so under the quota limit of 300 requests per minute we have 5 requests per second, or 2 append operations per second (not considering other Goral services which use the same quota). That's why it is strongly recommended to use a batched approach (say, send rows in batches every 10 seconds or so), otherwise you can exhaust the Google api quota quickly (especially when other Goral services run). The exponential backoff algorithm is not applied to requests induced by the KV service (in contrast to other Goral services). So retries are on the client app side, and you may expect the http response status code 429: Too many requests if you generate an excessive load, which can also impact other Goral services. In any case Goral puts KV requests in a message queue with a capacity of 1, so any concurrent request will wait until the previous one is handled.
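The quota arithmetic and the batched approach can be sketched on the client side as follows. This is a minimal single-threaded Python sketch; RowBatcher and its parameters are hypothetical, and send stands for whatever function posts a batch to the KV service.

```python
import time


def max_appends_per_second(quota_per_minute=300, calls_per_append=2):
    """300 requests/min -> 5 requests/s -> 2 whole appends/s."""
    return quota_per_minute // 60 // calls_per_append


class RowBatcher:
    """Accumulate rows and hand them to send() at most once per interval."""

    def __init__(self, send, interval_secs=10.0, clock=time.monotonic):
        self._send = send
        self._interval = interval_secs
        self._clock = clock
        self._rows = []
        self._last_flush = clock()

    def add(self, row):
        self._rows.append(row)
        if self._clock() - self._last_flush >= self._interval:
            self.flush()

    def flush(self):
        self._last_flush = self._clock()
        if not self._rows:
            return
        rows, self._rows = self._rows, []
        self._send(rows)  # one request for the whole batch
```

Note that the batcher flushes only when add is called, so call flush once more on shutdown; handling of 429 responses and retries is left to the send function.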

¹ Another notable use case is to log console errors from the frontend - catch JS exceptions, accumulate them in a batch at your backend and send the batch to the Goral KV service.

Rules

Every service automatically creates a rules sheet (for some services reasonable defaults are provided) which allows you to set notifications (via configured messengers) on the data as it is collected. Datetime for rules is optional and is set only for default rules. You choose a log name (everything up to the first @ in a sheet title) and a key (a column header of a sheet), select a condition which is checked, and an action: either send info/warn/error message or skip the rule's match from any further rules processing.

Rules are fetched from a rules sheet by every Goral service dynamically every 30-46 seconds (the period is jittered to prevent hitting the Google quota). By default warnings about rules update errors are suppressed (to reduce unnecessary message noise). You can turn them on for a specific service with the setting:

messenger.send_rules_update_error = true

Recommended deployment

Goral follows a fail-fast approach - if something violates assumptions (marked with assert:), the safest thing is to panic and restart. It doesn't behave like that for recoverable errors, of course. For example, if Goral cannot send a message via a messenger, it will try to report it via General service notifications and its logs, and will continue working and collecting data. If it cannot connect to Google API, it will retry first.

For some services like healthcheck you may want to suppress sending of Google API errors with

messenger.send_google_append_error = false

to minimize unactionable notifications. Failure to append rows to a spreadsheet doesn't impact rules notifications and healthchecks as they are triggered before an append operation.

There is also a case of fatal errors (e.g. MissingToken error for Google API which usually means that system time has skewed). In that case only someone outside can help. And in case of such panics Goral first tries to notify you via a messenger configured for General service to let you know immediately.

Linux

Following Erlang's idea of supervision trees, we recommend running Goral as a systemd service under Linux for automatic restarts in case of panics and other issues.

Install Goral
Make it system-wide available as an executable

sudo mv goral /usr/local/bin/goral

An example systemd configuration (can be created with sudo systemctl edit --force --full goral.service):

[Unit]
Description=Goral observability daemon
After=network.target
StartLimitIntervalSec=0

[Service]
Type=simple
ExecStart=/usr/local/bin/goral -c /etc/goral.toml --id myhost
Restart=always
User=myuser

[Install]
WantedBy=multi-user.target

Note: User=myuser will limit top_open_files for System to processes only under myuser. So if you monitor open file descriptors of some process, make sure to run both it and Goral under the same user.

Create a service account file (e.g. at /etc/goral_service_account.json) and a config file /etc/goral.toml
Enable restart after boot: sudo systemctl enable goral
Start the service: sudo systemctl start goral

Then to check errors in Goral's log (if any error is reported via a messenger):

sudo journalctl --no-pager -u goral -g ERROR

If you plan to use System service then you should not containerize Goral to get the actual system data. Goral implements a graceful shutdown (its duration is configured) for SIGINT (Ctrl+C) and SIGTERM signals to safely send all the data in process to the permanent spreadsheet storage.
