valeriansaliou/vigil

Vigil

Microservices Status Page. Monitors a distributed infrastructure and sends alerts (Slack, SMS, etc.).

Vigil is an open-source Status Page you can host on your infrastructure, used to monitor all your servers and apps, and visible to your users (on a domain of your choice, eg. status.example.com).

It is useful in microservices contexts to monitor both apps and backends. If a node goes down in your infrastructure, you receive a status change notification in a Slack channel, by Email, by Twilio SMS and/or over XMPP.

Tested at Rust version: rustc 1.60.0 (7737e0b5c 2022-04-04)

🇭đŸ‡ē Crafted in Budapest, Hungary.

👉 See a live demo of Vigil on Crisp Status Page.

📰 The Vigil project was announced in a post on my personal journal.

Who uses it?

Crisp, Meili, miragespace, Redsmin, Image-Charts, Pikomit

👋 You use Vigil and you want to be listed there? Contact me.

Features

  • Monitors your infrastructure services automatically
  • Notifies you when a service gets down or gets back up via a configured channel:
    • Email
    • Twilio (SMS)
    • Slack
    • Zulip
    • Telegram
    • Pushover
    • Gotify
    • XMPP
    • Matrix
    • Webhook
  • Generates a status page, that you can host on your domain for your public users (eg. https://status.example.com)

How does it work?

Vigil monitors all your infrastructure services. You first need to configure target services to be monitored, and then Vigil does the rest for you.

There are three kinds of services Vigil can monitor:

  • HTTP / TCP / ICMP services: Vigil frequently probes an HTTP, TCP or ICMP target and checks for reachability
  • Application services: Install the Vigil Reporter library eg. on your NodeJS app and get reports when your app gets down, as well as when the host server system is overloaded
  • Local services: Install a slave Vigil Local daemon to monitor services that cannot be reached by the Vigil master server (eg. services that are on a different LAN)

It is recommended to configure Vigil, Vigil Reporter or Vigil Local to send frequent probe checks, so that you are notified quickly when a service goes down (and can thus reduce unexpected downtime on your services).

Hosted alternative to Vigil

Vigil needs to be hosted on your own systems, and maintained on your end. If you do not feel like managing yet another service, you may use Crisp Status instead.

Crisp Status is a direct port of Vigil to the Crisp customer support platform.

Crisp Status hosts your status page on Crisp systems, and is able to do what Vigil does (and even more!). Crisp Status is integrated with other Crisp products (eg. Crisp Chatbox & Crisp Helpdesk). It warns your users over chatbox and helpdesk if your status page reports as dead for an extended period of time.

As an example of a status page running Crisp Status, check out Enrich Status Page.

How to use it?

Installation

Vigil is built in Rust. To install it, either download a version from the Vigil releases page, use cargo install or pull the source code from master.

Install from Cargo:

If you prefer managing vigil via Rust's Cargo, install it directly via cargo install:

Ensure that your $PATH is properly configured to source the Crates binaries, and then run Vigil using the vigil command.

Install from source:

The last option is to pull the source code from Git and compile Vigil via cargo:

You can find the built binaries in the ./target/release directory.

Install libssl-dev (ie. OpenSSL headers) and libstrophe-dev (ie. XMPP library headers; only if you need the XMPP notifier) before you compile Vigil. SSL dependencies are required for the HTTPS probes and email notifications.

Install from Docker Hub:

You might find it convenient to run Vigil via Docker. You can find the pre-built Vigil image on Docker Hub as valeriansaliou/vigil.

The pre-built Docker image may not be the latest version of Vigil available.

First, pull the valeriansaliou/vigil image:

Then, seed it with a configuration file and run it (replace /path/to/your/vigil/config.cfg with the path to your configuration file):

In the configuration file, ensure that:

  • server.inet is set to 0.0.0.0:8080 (this lets Vigil be reached from outside the container)
  • assets.path is set to ./res/assets/ (this refers to an internal path in the container, as the assets are contained there)

Vigil will be reachable from http://localhost:8080.

Configuration

Use the sample config.cfg configuration file and adjust it to your own environment.

Available configuration options are commented below, with allowed values:

[server]

  • log_level (type: string, allowed: debug, info, warn, error, default: error) — Verbosity of logging, set it to error in production
  • inet (type: string, allowed: IPv4 / IPv6 + port, default: [::1]:8080) — Host and TCP port the Vigil public status page should listen on
  • workers (type: integer, allowed: any number, default: 4) — Number of workers for the Vigil public status page to run on
  • reporter_token (type: string, allowed: secret token, default: no default) — Reporter secret token (ie. secret password)

[assets]

  • path (type: string, allowed: UNIX path, default: ./res/assets/) — Path to Vigil assets directory

[branding]

  • page_title (type: string, allowed: any string, default: Status Page) — Status page title
  • page_url (type: string, allowed: URL, no default) — Status page URL
  • company_name (type: string, allowed: any string, no default) — Company name (ie. your company)
  • icon_color (type: string, allowed: hexadecimal color code, no default) — Icon color (ie. your icon background color)
  • icon_url (type: string, allowed: URL, no default) — Icon URL, the icon should be your squared logo, used as status page favicon (PNG format recommended)
  • logo_color (type: string, allowed: hexadecimal color code, no default) — Logo color (ie. your logo primary color)
  • logo_url (type: string, allowed: URL, no default) — Logo URL, the logo should be your full-width logo, used as status page header logo (SVG format recommended)
  • website_url (type: string, allowed: URL, no default) — Website URL to be used in status page header
  • support_url (type: string, allowed: URL, no default) — Support URL to be used in status page header (ie. where users can contact you if something is wrong)
  • custom_html (type: string, allowed: HTML, default: empty) — Custom HTML to include in status page head (optional)

[metrics]

  • poll_interval (type: integer, allowed: seconds, default: 120) — Interval for which to probe nodes in poll mode
  • poll_retry (type: integer, allowed: seconds, default: 2) — Interval after which to try probe for a second time nodes in poll mode (only when the first check fails)
  • poll_http_status_healthy_above (type: integer, allowed: HTTP status code, default: 200) — HTTP status above which poll checks to HTTP replicas reports as healthy
  • poll_http_status_healthy_below (type: integer, allowed: HTTP status code, default: 400) — HTTP status under which poll checks to HTTP replicas reports as healthy
  • poll_delay_dead (type: integer, allowed: seconds, default: 10) — Delay after which a node in poll mode is to be considered dead (ie. check response delay)
  • poll_delay_sick (type: integer, allowed: seconds, default: 5) — Delay after which a node in poll mode is to be considered sick (ie. check response delay)
  • poll_parallelism (type: integer, allowed: any number, default: 4) — Maximum number of poll threads to be ran simultaneously (in case you are monitoring a lot of nodes and/or slow-replying nodes, increasing parallelism will help)
  • push_delay_dead (type: integer, allowed: seconds, default: 20) — Delay after which a node in push mode is to be considered dead (ie. time after which the node did not report)
  • push_system_cpu_sick_above (type: float, allowed: system CPU loads, default: 0.90) — System load indice for CPU above which to consider a node in push mode sick (ie. UNIX system load)
  • push_system_ram_sick_above (type: float, allowed: system RAM loads, default: 0.90) — System load indice for RAM above which to consider a node in push mode sick (ie. percent RAM used)
  • script_interval (type: integer, allowed: seconds, default: 300) — Interval for which to probe nodes in script mode
  • script_parallelism (type: integer, allowed: any number, default: 2) — Maximum number of script executor threads to be ran simultaneously (in case you are running a lot of scripts and/or long-running scripts, increasing parallelism will help)
  • local_delay_dead (type: integer, allowed: seconds, default: 40) — Delay after which a node in local mode is to be considered dead (ie. time after which the node did not report)

[plugins]

[plugins.rabbitmq]

  • api_url (type: string, allowed: URL, no default) — RabbitMQ API URL (ie. http://127.0.0.1:15672)
  • auth_username (type: string, allowed: username, no default) — RabbitMQ API authentication username
  • auth_password (type: string, allowed: password, no default) — RabbitMQ API authentication password
  • virtualhost (type: string, allowed: virtual host, no default) — RabbitMQ virtual host hosting the queues to be monitored
  • queue_ready_healthy_below (type: integer, allowed: any number, no default) — Maximum number of payloads in RabbitMQ queue with status ready to consider node healthy.
  • queue_nack_healthy_below (type: integer, allowed: any number, no default) — Maximum number of payloads in RabbitMQ queue with status nack to consider node healthy.
  • queue_ready_dead_above (type: integer, allowed: any number, no default) — Threshold on the number of payloads in RabbitMQ queue with status ready above which node should be considered dead (stalled queue)
  • queue_nack_dead_above (type: integer, allowed: any number, no default) — Threshold on the number of payloads in RabbitMQ queue with status nack above which node should be considered dead (stalled queue)
  • queue_loaded_retry_delay (type: integer, allowed: milliseconds, no default) — Re-check queue if it reports as loaded after delay; this avoids false-positives if your systems usually take a bit of time to process pending queue payloads (if any)

[notify]

  • startup_notification (type: boolean, allowed: true, false, default: true) — Whether to send startup notification or not (stating that systems are healthy)
  • reminder_interval (type: integer, allowed: seconds, no default) — Interval at which downtime reminder notifications should be sent (if any)
  • reminder_backoff_function (type string, allowed: none, linear, square, cubic, default: none) — If enabled, the downtime reminder interval will get larger as reminders are sent. The value will be reminder_interval × pow(N, x) with N being the number of reminders sent since the service went down, and x being the specified growth factor.
  • reminder_backoff_limit (type: integer, allowed: any number, default: 3) — Maximum value for the downtime reminder backoff counter (if a backoff function is enabled).

[notify.email]

  • to (type: string, allowed: email address, no default) — Email address to which to send emails
  • from (type: string, allowed: email address, no default) — Email address from which to send emails
  • smtp_host (type: string, allowed: hostname, IPv4, IPv6, default: localhost) — SMTP host to connect to
  • smtp_port (type: integer, allowed: TCP port, default: 587) — SMTP TCP port to connect to
  • smtp_username (type: string, allowed: any string, no default) — SMTP username to use for authentication (if any)
  • smtp_password (type: string, allowed: any string, no default) — SMTP password to use for authentication (if any)
  • smtp_encrypt (type: boolean, allowed: true, false, default: true) — Whether to encrypt SMTP connection with STARTTLS or not
  • reminders_only (type: boolean, allowed: true, false, default: false) — Whether to send emails only for downtime reminders or everytime

[notify.twilio]

  • to (type: array[string], allowed: phone numbers, no default) — List of phone numbers to which to send text messages
  • service_sid (type: string, allowed: any string, no default) — Twilio service identifier (ie. Service Sid)
  • account_sid (type: string, allowed: any string, no default) — Twilio account identifier (ie. Account Sid)
  • auth_token (type: string, allowed: any string, no default) — Twilio authentication token (ie. Auth Token)
  • reminders_only (type: boolean, allowed: true, false, default: false) — Whether to send text messages only for downtime reminders or everytime

[notify.slack]

  • hook_url (type: string, allowed: URL, no default) — Slack hook URL (ie. https://hooks.slack.com/[..])
  • mention_channel (type: boolean, allowed: true, false, default: false) — Whether to mention channel when sending Slack messages (using @channel, which is handy to receive a high-priority notification)
  • reminders_only (type: boolean, allowed: true, false, default: false) — Whether to send Slack messages only for downtime reminders or everytime

[notify.zulip]

  • bot_email (type: string, allowed: any string, no default) — The bot mail address as given by the Zulip interface
  • bot_api_key (type: string, allowed: any string, no default) — The bot API key as given by the Zulip interface
  • channel (type: string, allowed: any string, no default) — The name of the channel to send notifications to
  • api_url (type: string, allowed: URL, no default) — The API endpoint url (eg. https://domain.zulipchat.com/api/v1/)
  • reminders_only (type: boolean, allowed: true, false, default: false) — Whether to send messages only for downtime reminders or everytime

[notify.telegram]

  • bot_token (type: string, allowed: any strings, no default) — Telegram bot token
  • chat_id (type: string, allowed: any strings, no default) — Chat identifier where you want Vigil to send messages. Can be group chat identifier (eg. "@foo") or user chat identifier (eg. "123456789")

[notify.pushover]

  • app_token (type: string, allowed: any string, no default) — Pushover application token (you need to create a dedicated Pushover application to get one)
  • user_keys (type: array[string], allowed: any strings, no default) — List of Pushover user keys (ie. the keys of your Pushover target users for notifications)
  • reminders_only (type: boolean, allowed: true, false, default: false) — Whether to send Pushover notifications only for downtime reminders or everytime

[notify.gotify]

  • app_url (type: string, allowed: URL, no default) - Gotify endpoint without trailing slash (eg. https://push.gotify.net)
  • app_token (type: string, allowed: any string, no default) — Gotify application token
  • reminders_only (type: boolean, allowed: true, false, default: false) — Whether to send Gotify notifications only for downtime reminders or everytime

[notify.xmpp]

Notice: the XMPP notifier requires libstrophe (libstrophe-dev package on Debian) to be available when compiling Vigil, with the feature notifier-xmpp enabled upon Cargo build.

  • to (type: string, allowed: Jabber ID, no default) — Jabber ID (JID) to which to send messages
  • from (type: string, allowed: Jabber ID, no default) — Jabber ID (JID) from which to send messages
  • xmpp_password (type: string, allowed: any string, no default) — XMPP account password to use for authentication
  • reminders_only (type: boolean, allowed: true, false, default: false) — Whether to send messages only for downtime reminders or everytime

[notify.matrix]

  • homeserver_url (type: string, allowed: URL, no default) — Matrix server where the account has been created (eg. https://matrix.org)
  • access_token (type: string, allowed: any string, no default) — Matrix access token from a previously created session (eg. Element Web access token)
  • room_id (type: string, allowed: any string, no default) — Matrix room ID to which to send messages (eg. !abc123:matrix.org)
  • reminders_only (type: boolean, allowed: true, false, default: false) — Whether to send messages only for downtime reminders or everytime

[notify.webhook]

  • hook_url (type: string, allowed: URL, no default) — Web Hook URL (eg. https://domain.com/webhooks/[..])

[probe]

[[probe.service]]

  • id (type: string, allowed: any unique lowercase string, no default) — Unique identifier of the probed service (not visible on the status page)
  • label (type: string, allowed: any string, no default) — Name of the probed service (visible on the status page)

[[probe.service.node]]

  • id (type: string, allowed: any unique lowercase string, no default) — Unique identifier of the probed service node (not visible on the status page)
  • label (type: string, allowed: any string, no default) — Name of the probed service node (visible on the status page)
  • mode (type: string, allowed: poll, push, script, local, no default) — Probe mode for this node (ie. poll is direct HTTP, TCP or ICMP poll to the URLs set in replicas, while push is for Vigil Reporter nodes, script is used to execute a shell script and local is for Vigil Local nodes)
  • replicas (type: array[string], allowed: TCP, ICMP or HTTP URLs, default: empty) — Node replica URLs to be probed (only used if mode is poll)
  • scripts (type: array[string], allowed: shell scripts as source code, default: empty) — Shell scripts to be executed on the system as a Vigil sub-process; they are handy to build custom probes (only used if mode is script)
  • http_headers (type: map[string, string], allowed: any valid header name and value, default: empty) — HTTP headers to add to HTTP requests (eg. http_headers = { "Authorization" = "Bearer xxxx" })
  • http_method (type string, allowed: GET, HEAD, POST, PUT, PATCH, no default) — HTTP method to use when polling the endpoint (omitting this will default to using HEAD or GET depending on the http_body_healthy_match configuration value)
  • http_body (type string, allowed: any string, no default) — Body to send in the HTTP request when polling an endpoint (this only works if http_method is set to POST, PUT or PATCH)
  • http_body_healthy_match (type: string, allowed: regular expressions, no default) — HTTP response body for which to report node replica as healthy (if the body does not match, the replica will be reported as dead, even if the status code check passes; the check uses a GET rather than the usual HEAD if this option is set)
  • rabbitmq_queue (type: string, allowed: RabbitMQ queue names, no default) — RabbitMQ queue associated to node, which to check against for pending payloads via RabbitMQ API (this helps monitor unacked payloads accumulating in the queue)
  • rabbitmq_queue_nack_healthy_below (type: integer, allowed: any number, no default) — Maximum number of payloads in RabbitMQ queue associated to node, with status nack to consider node healthy (this overrides the global plugins.rabbitmq.queue_nack_healthy_below)
  • rabbitmq_queue_nack_dead_above (type: integer, allowed: any number, no default) — Threshold on the number of payloads in RabbitMQ queue associated to node, with status nack above which node should be considered dead (stalled queue, this overrides the global plugins.rabbitmq.queue_nack_dead_above)

Run Vigil

Vigil can be run as such:

./vigil -c /path/to/config.cfg

Usage recommendations

Consider the following recommendations when using Vigil:

  • Vigil should be hosted on a safe, separate server. This server should run on a different physical machine and network than your monitored infrastructure servers.
  • Make sure to whitelist the Vigil server public IP (both IPv4 and IPv6) on your monitored HTTP services; this applies if you use a bot protection service that challenges bot IPs, eg. Distil Networks or Cloudflare. Vigil will see the HTTP service as down if a bot challenge is raised.

What do status variants look like?

Vigil has three status variants: healthy (no issue ongoing), sick (services under high load) and dead (outage):

Healthy status variant

Sick status variant

Dead status variant

What do alerts look like?

When a monitored backend or app goes down in your infrastructure, Vigil can let you know by Slack, Twilio SMS, Email and XMPP:

You can also get nice realtime down and up alerts on, eg., your iPhone and Apple Watch:

What do Webhook payloads look like?

If you are using the Webhook notifier in Vigil, you will receive a JSON-formatted payload with alert details upon any status change; plus reminders if notify.reminder_interval is configured.

Here is an example of a Webhook payload:

Webhook notifications can be tested with eg. Webhook.site, before you integrate them with your custom endpoint.

You can use those Webhook payloads to build custom notifiers for any destination. For instance, if you are using Microsoft Teams but not Slack, you may write a tiny PHP script that receives Webhooks from Vigil and forwards a notification to Microsoft Teams. This can be handy; while Vigil only implements convenience notifiers for some selected channels, the Webhook notifier allows you to extend beyond that.
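
As a rough illustration of such a custom notifier, the Python sketch below turns a received Webhook body into a one-line message that could then be forwarded to another chat system. The field names it reads (`type`, `status`, `replicas`, `page.title`) are assumptions, as is the function name; check them against a real payload (for instance one captured with Webhook.site) before relying on them. Every field is read defensively so that a missing key does not crash the forwarder.

```python
import json


def format_alert(raw_body: bytes) -> str:
    """Build a short, human-readable message from a Webhook request body."""
    payload = json.loads(raw_body.decode("utf-8"))

    # All keys below are assumed names; fall back to neutral values if absent.
    kind = payload.get("type", "unknown")
    status = payload.get("status", "unknown")
    replicas = payload.get("replicas") or []
    page = payload.get("page") or {}
    title = page.get("title", "Status Page")

    listed = ", ".join(replicas) if replicas else "none listed"
    return f"{title}: status {kind} ({status}); replicas: {listed}"
```

The returned string can be posted to whichever channel you need; the forwarding step itself depends on the target service and is left out here.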

How can I create script probes?

Vigil lets you create custom probes written as shell scripts, passed in the Vigil configuration as a list of scripts to be executed for a given node.

Those scripts can be used by advanced Vigil users when their monitoring use case requires scripting, ie. when push and poll probes are not enough.

The script should report the replica health through its shell return code, where:

  • rc=0: healthy
  • rc=1: sick
  • rc=2 and higher: dead

As scripts are usually multi-line, script contents can be passed as a literal string, enclosed between '''.

As an example, the following script configuration always returns as sick:

Note that scripts are executed in a system shell run by a Vigil-owned sub-process. Make sure that Vigil runs as a UNIX user with limited privileges. Running Vigil as root would let any configured script perform root-level actions on the machine, which is not recommended.

How can I integrate Vigil Reporter in my code?

Vigil Reporter is used to actively submit health information to Vigil from your apps. Apps are best monitored via application probes, which are able to report detailed system information such as CPU and RAM load. This lets Vigil show if an application host system is under high load.
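
To illustrate how reported load could translate into a status, here is a small Python sketch built on the push-mode thresholds from the [metrics] configuration section (`push_delay_dead`, `push_system_cpu_sick_above`, `push_system_ram_sick_above`). The function name is hypothetical and the strict `>` comparisons are an assumption; Vigil's real logic may differ in its details.

```python
def classify_push(
    cpu_load: float,
    ram_load: float,
    seconds_since_report: float,
    cpu_sick_above: float = 0.90,  # push_system_cpu_sick_above default
    ram_sick_above: float = 0.90,  # push_system_ram_sick_above default
    delay_dead: float = 20.0,      # push_delay_dead default, in seconds
) -> str:
    """Return 'healthy', 'sick' or 'dead' for a replica in push mode."""
    # A replica that stopped reporting for too long is considered dead.
    if seconds_since_report > delay_dead:
        return "dead"
    # A replica that still reports, but under high load, is considered sick.
    if cpu_load > cpu_sick_above or ram_load > ram_sick_above:
        return "sick"
    return "healthy"
```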

Vigil Reporter Libraries

  • NodeJS: node-vigil-reporter
  • TypeScript: ts-vigil-reporter
  • Python: py-vigil-reporter
  • Golang: go-vigil-reporter
  • Rust: rs-vigil-reporter
  • Dart: dart-vigil-reporter

👉 Cannot find the library for your programming language? Build your own and be referenced here! (contact me)

Vigil Reporter HTTP API

In case you need to manually report node metrics to the Vigil endpoint, use the following HTTP configuration (adjust it to yours):

1ī¸âƒŖ Report a replica

Endpoint URL:

HTTP POST https://status.example.com/reporter/<probe_id>/<node_id>/

Where:

  • node_id: The parent node of the reporting replica
  • probe_id: The parent probe of the node

Request headers:

  • Add an Authorization header with a Basic authentication where the password is your configured reporter_token.
  • Set the Content-Type to application/json; charset=utf-8, and ensure you submit the request data as UTF-8.

Request data:

Adjust the request data to your replica context and send it as HTTP POST:

Where:

  • replica: The replica unique identifier (eg. the server LAN IP)
  • interval: The push interval (in seconds)
  • load.cpu: The general CPU load, from 0.00 to 1.00 (can be more than 1.00 if the CPU is overloaded)
  • load.ram: The general RAM load, from 0.00 to 1.00

2ī¸âƒŖ Flush a replica

Endpoint URL:

HTTP DELETE https://status.example.com/reporter/<probe_id>/<node_id>/<replica_id>/

Where:

  • node_id: The parent node of the reporting replica
  • probe_id: The parent probe of the node
  • replica_id: The replica unique identifier (eg. the server LAN IP)

Request headers:

  • Add an Authorization header with a Basic authentication where the password is your configured reporter_token.

How can I monitor services on a different LAN using Vigil Local?

Vigil Local is an (optional) slave daemon that you can use to report internal service health to your Vigil-powered status page master server. It is designed to be used behind a firewall, and to monitor hosts bound to a local loop or LAN network, that are not available to your main Vigil status page.

Vigil Local monitors local poll and script replicas, and reports their status to Vigil on a periodic basis.

You can read more on Vigil Local on its repository, and follow the setup instructions.

🚸 Troubleshoot Issues

ICMP replicas always report as dead

On Linux systems, non-privileged users cannot create raw sockets, which Vigil's ICMP probing system requires. It means that, by default, all ICMP probe attempts will fail silently, as if the host being probed was always down.

This can easily be fixed by allowing Vigil to create raw sockets:

Note that HTTP and TCP probes do not require those raw socket capabilities.

đŸ”Ĩ Report A Vulnerability

If you find a vulnerability in Vigil, you are more than welcome to report it directly to @valeriansaliou by sending an encrypted email to [email protected]. Do not report vulnerabilities in public GitHub issues, as they may be exploited by malicious people to target production servers running an unpatched Vigil server.

âš ī¸ You must encrypt your email using @valeriansaliou's GPG public key: 🔑 valeriansaliou.gpg.pub.asc.

Issues

Collection of the latest Issues

mjarkk

When I start the Docker container with the example config, I get an error:

After some debugging I discovered I needed to change

To:

But it took me a long time to discover what exactly was wrong, as the error is unclear. It would be nice if the error message showed some more context as to which line is incorrect.

This address is also used in the example config, and I expected I could copy it to get started: https://github.com/valeriansaliou/vigil/blob/b666b2785af85ae09ee3e7b490f86f1efbd35a0e/config.cfg#L7-L12

kauron

It seems that src/notifier/xmpp.rs is outdated w.r.t. the time and libstrophe libraries.

With my limited Rust knowledge, I created a patch, but there are still errors that escape my comprehension. Can anyone help me complete this fix?

Patch

Error

  • Vigil version: 1.22.5
  • Apply patch and run: cargo build --frozen --release --all-features
  • Rust version: 1.59.0
  • Error:

tekurinui

Loving vigil, great project.

Feature request, it would be great to be able to override metrics for a node.

benstadin

Is it possible to turn off SSL certificate verification via the config file? This would be helpful for development.

Also, changing the certificate doesn't seem to make a difference. I'm using the following Dockerfile but still get an SSL error:

(DEBUG) - prober poll result was not received for http target: #404 (error: error sending request for url (#404): error trying to connect: error:1416F086:SSL routines:tls_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:1916: (unable to get local issuer certificate))

The Dockerfile:

sparanoid

enhancement

It seems all nodes are sorted alphabetically. Currently I can hack the order by prefixing numbers like:

But this also affects readability in notifications:

It would be great if there were a way to sort them manually, or to just keep the order defined in the config.

sparanoid

When dealing with a replica that has redirections:

In Vigil config:

The http_body_healthy_match will fail since the body has changed:

It would be great if there were an option to follow redirections.

mathieudebrito

Hi guys !

I use poll_http_status_healthy_below & poll_http_status_healthy_above to check for a 401 to be the right response status for my-website.com. (my-website.com is protected behind an .htaccess, that's why I look for a 401)

Here is my config file :

Vigil logs are :

And my status page shows red, as you can imagine. Is it a bug on your side? Or did I miss a specific setting? I followed the documentation.

Best regards, Sincerely

gbonnefille

Is there a way to check the validity of an SSL/TLS certificate?

Motivation: I use Let's Encrypt but, sometimes, the renewal goes wrong and... the certificate expires. Having a probe for that would help a lot.

Please note that ideally this verification should be distinct enough from HTTPS polling, as someone may want to register a URL returning an error code (404, 401...). But as we only wish to validate the certificate, this should be enough.

Eijebong

Right now there's basically nothing checked on startup and it will panic when vigil starts polling (so basically immediately) if something's wrong.

It'd be nice to have a verification step first so we could include stuff like checking that there's no duplicated id in the config file. (Totally didn't open that because I made this mistake twice already...)

verymilan

The Crisp port shows a screenshot of a big fat announcement. I partially could imitate that by putting custom_html to the top, but that wouldn't be the same. I would also like to trigger the notifier with announcements and track maintenances/updates on maintenances.

Vigil is pretty cool but this is clearly missing... maybe Vigil could look for local Markdown files as a workaround for the missing backend?

bduron

Hey @valeriansaliou and contributors 👋

I was wondering if I missed a config option to get more details in my slack notification messages, namely :

  • Status changed to: healthy -> know which nodes got back up specifically
  • When a subset of the down nodes are up again, display which ones got back up (partially)
    • eg. Status is still: dead, Nodes: A, B, C, but nodes D, E, F got back up
  • When one node of a probe.service is down, know exactly which one instead of all the service nodes

Thank you so much for this great software!

CSP197

enhancement

Hello!

Currently, if there are 10 replicas on the Vigil status page and 3 of them are dead, Vigil declares a "Partial Service Outage". I would like to ask whether an outage threshold value could be introduced, which would be the minimum ratio needed before declaring a "Service Outage".

This value could be set in the config.cfg file:

outage_threshold = 0.5 //or 50?

This value would represent the minimum ratio of the # of dead replicas to the # of total replicas needed to declare a state of "Service Outage".

valeriansaliou

enhancement

Due to the replacement of fastping-rs with ping, ICMPv6 support might have been dropped on some platforms.

A fork of the unmaintained ping library, and a full rework into a cleaner library would be much needed, adding support for ICMP IPv6 on all platforms.

blissend

enhancement

First of all, thanks so much for vigil. I have tried so many alternatives and vigil is the only one that works consistently and works best.

I do have a feature request, however. It would be nice to be able to mark or show somehow which replica is which on the web page. It doesn't have to give away sensitive info; it could be a label or anything really.

If this is by design or already possible, let me know and just ignore me :D I think it may be useful sometimes to know in automated workflows what got added/removed, or perhaps you don't want an alert but are curious which replica has unusual latency (I'm also not sure why some replicas show latency and others none in that popup).

L1Cafe

bug

I'm not very experienced with Caddy, because for now, I've always resorted to simply using Nginx, however I'm finding some difficulties when setting up Caddy and Vigil.

To check that it wasn't Caddy's fault, I have also tried to reverse proxy other webpages, and everything seemed to work perfectly fine.

And to be perfectly clear, serving Vigil over port 8080 (HTTP only) also works perfectly fine, both dialing the IP as well as using the hostname (although most web browsers refuse to connect because HSTS is enforced, I checked this using cURL which doesn't observe HSTS).

I use the following settings in vigil.cfg to serve on port 8080:

Initially, my Caddyfile looked like this:

(I replaced my actual domain with hostname.com).

According to Caddy's documentation, it passes on all original headers unmodified to the upstream server. Since I suspected there may be an issue with the Host header, I modified the Caddyfile as so:

However, it unfortunately still does not work, and I'm not sure at this point whether or not this is my fault, Caddy's, or Vigil's.

The relevant fragment of logs from Caddy looks like this:

L1Cafe

enhancement
Comments: 0

While I do try to stick to UTC as a timezone, I think Vigil would do a better job if it had the possibility of changing timezones.

As this isn't in the documentation, I assumed there's no possibility to change timezones.

A few ideas here:

  • Timezone should change depending on the visitor (not sure if there's a standard way to do this, but perhaps https://stackoverflow.com/questions/6939685/get-client-time-zone-from-browser that should help).
  • There should be a way to set up a different timezone for notifications (if the administrator lives in Beijing, they probably don't care about UTC notifications).
  • Lastly, if timezone cannot be detected, it should resort to UTC (as opposed to resorting to the administrator's timezone).

mikepruett3

bug
Comments: 2

I disabled IPv6 on my server, and after rebooting vigil does not want to start, giving the following error...

My vigil config file:

As far as I can tell there is no way in the configuration to specify only using IPv4 ICMP polling, so I am stuck at this point. Any help would be appreciated.

frdmn

enhancement
Comments: 2

Hi 👋

Great project!

It would be great if the last state of a probe/replica is stored somewhere persistent to make sure it doesn't trigger an obsolete notification (which was already sent when the service actually went down) again upon next start.

Versions

v1.23.0 - Apr 11, 2022

  • The probing of poll and script replicas can now be parallelized, making the whole process more scalable on large status pages; parallelism can be changed via configuration options: metrics.poll_parallelism and metrics.script_parallelism (@jasquat — submitted in PR #114).
  • Bump dependencies to latest versions.

v1.22.5 - Feb 11, 2022

  • Improved the poll retry mechanism, by increasing the hold delay between attempts, and fixing a counting issue, where you would expect 3 total poll attempts for a retry value of 2 (configured with the metrics.poll_retry option).

v1.22.4 - Feb 04, 2022

  • Fixed an issue where the downtime reminder notifications backoff count would not be reset if status was going from dead to sick.
  • Moved code to Rust 2021 Edition.
  • Not distributing Intel 32 bits builds anymore.
  • Bump dependencies to latest versions.

v1.22.3 - Feb 03, 2022

  • Added the ability to configure a downtime reminder notifications backoff function and limit via configuration options: reminder_backoff_function and reminder_backoff_limit in notify (@Eijebong — submitted in PR #103).
  • Improved the code style with some refactorings.
  • Bump dependencies to latest versions.

v1.22.2 - Aug 29, 2021

  • Added a configuration validator, ran when Vigil starts, that makes sure there is no duplicate service or node identifier.
  • Fixed a typo in the main template, where a node probe mode label would not show correctly for push probes (wrongly showing as poll instead).
  • Bump dependencies to latest versions.

v1.22.1 - Jul 11, 2021

  • Fixed non-working Gotify notifier due to extraneous slash being added in the URL.

v1.22.0 - Jun 30, 2021

  • Added a Zulip notifier (@Eijebong — submitted in PR #87).
  • Added the ability to customize HTTP probe method, headers and body via configuration options: http_method, http_headers and http_body in probe.service.node (@Eijebong — submitted in PR #86).
  • Now logging HTTP probe failure trace (@tglman — submitted in PR #85).

v1.21.2 - Jun 21, 2021

  • The RabbitMQ queue monitoring options defined as plugins.rabbitmq.queue_* can now be partly overridden per-node with: probe.service.node.rabbitmq_queue_nack_healthy_below and probe.service.node.rabbitmq_queue_nack_dead_above.

v1.21.1 - Mar 24, 2021

  • Reworked the Matrix notifier introduced in v1.20.0, which was not working as intended (fixed panic upon notification dispatch).
  • Made notifier-matrix a default build feature, as the Matrix notifier now depends on reqwest (matrix-sdk and tokio were removed).
  • Removed the configuration options: notify.matrix.username, notify.matrix.password and notify.matrix.device_id, as notify.matrix.access_token suffice.

v1.21.0 - Mar 22, 2021

  • Added a Reporter HTTP API DELETE route to flush a replica from Vigil (@NikoGrano — submitted in PR #78).

v1.20.2 - Mar 12, 2021

  • Fixed a regression introduced in v1.20.0 after moving from rocket to actix, where an empty server.reporter_token could not be authenticated against by HTTP clients (due to the new authentication middleware considering an empty password as invalid).

v1.20.1 - Mar 12, 2021

  • Added the ability to reference to a probe group in the URL, which is useful when there are a lot of nodes (please update your templates and stylesheets, though this is backwards compatible) (@shinkhouse — submitted in PR #75).

v1.20.0 - Mar 06, 2021

  • Added a Matrix notifier (@wolf4ood — submitted in PR #74).
  • Moved HTTP server from rocket to actix, meaning Vigil now builds on Rust stable (@tglman — submitted in PR #72).

v1.19.0 - Dec 09, 2020

  • Added a Gotify notifier (@zllovesuki — submitted in PR #65).
  • Auto-trim Twilio notifier messages as to avoid SMS fragmentation on large downtime reports (most SMS receivers and networks support up to 1600 characters by re-building message segments).

v1.18.0 - Jul 25, 2020

  • Added a new local prober type, which allows a Vigil Local daemon to report node status that Vigil cannot monitor directly due to network restrictions (eg. different LAN, firewall, etc.).

v1.17.0 - Jun 29, 2020

  • Added a new script prober type, which allows custom shell scripts to extend Vigil monitoring capabilities (for instance, you may build scripts for live end-to-end testing, DNS response checks, etc.).
  • Bump dependencies to latest versions.

v1.16.0 - Apr 21, 2020

  • Moved the ICMP probing library from fastping-rs to ping, as fastping-rs did not behave well on large Vigil setups (it could cause panics due to spawning too many threads at the same time on large poll events).
  • Disabled HTTP Keep-Alive on rocket, as to prevent worker resources to be exhausted by external HTTP clients.

v1.15.1 - Apr 16, 2020

  • Startup alerts are now marked as such (in the alert message).

v1.15.0 - Apr 16, 2020

  • Vigil now sends a notification whenever it is booting up (it sets its status to healthy). This can be disabled via the notify.startup_notification option.
  • Added a Telegram notifier (@michaeldel — submitted in PR #54).
  • Release script is now able to produce a statically-linked build for more targets (x86_64, i686 and armv7).
  • Bump dependencies to latest versions.

v1.14.3 - Jan 14, 2020

  • Fix replica URL parsing for IPv6 URLs for icmp and tcp protocols. Previously, an IPv6 URL formatted as eg. tcp://[::1]:80 would be passed as eg. tuple ('[::1]', 80) to the probe, which is not resolvable and would incur a failure. The correct tuple format is now being passed, which is eg. ('::1', 80).
  • Implement a stricter replica URL parsing from configuration in ReplicaURL, for icmp and tcp protocols. Extraneous non-used URL port and segments (where applicable) are now considered as invalid.

v1.14.2 - Jan 13, 2020

  • Improve the ICMP probe introduced in v1.14.0, by batching all pings for a single replica in the same poll cycle.
  • Report the true latency for ICMP probe replicas, instead of the ICMP timeout (the worst observed RTT is picked up).

v1.14.1 - Jan 13, 2020

  • Fix issues with the ICMP probe introduced in v1.14.0 related to probing IPv6 hosts.
  • Change the behavior of the ICMP prober if an hostname is provided for the replica; all resolved addresses are now health-checked in sequential order (as ICMP is used to check if an host is up or down, we need to check all IPs provided in the DNS response — unlike upper layer application-level TCP and HTTP checks where a single randomly-picked address is checked, as this is sufficient to deem a replica as healthy or dead).

v1.14.0 - Jan 13, 2020

  • Add an ICMP poll probe type, which sends ICMP pings to check if a target host is reachable. Those poll replicas can be configured aside regular TCP and HTTP hosts, using the following URL pattern: #404 (eg. #404).

v1.13.0 - Jan 03, 2020

  • Bump dependencies to latest versions, if possible (time could not be updated, as rocket depends on the time::Tm type that's no more in time v0.2).
  • The Docker build for Vigil is now statically-linked via MUSL, which makes the resulting Docker image much smaller, down from ~40MB to ~4MB (see #45 — many thanks to @cristicbz).

v1.12.1 - Oct 22, 2019

  • Add plugins.rabbitmq.queue_nack_dead_above to be alerted when a RabbitMQ queue might be stalled due to rejected payloads (eg. a sub-system at the consumer level may be failing and is NACK-ing payloads back to the queue).

v1.12.0 - Oct 21, 2019

  • Add plugins.rabbitmq.queue_ready_dead_above to be alerted when a RabbitMQ queue might be stalled (ie. messages are not being passed to consumers).

v1.11.1 - Aug 06, 2019

  • Add notify.slack.mention_channel to control whether notification messages are sent with the Slack @channel mention keyword or not (using @channel sends a high-priority notification to channel members, which stands out from regular channel messages).
  • Colors have been added to healthy, sick and dead status labels in Pushover notifications.

v1.11.0 - Aug 06, 2019

  • Added a Pushover notifier (Pushover is a generic push notification service that's handy to receive alerts in a centralized application on the phone or desktop).

v1.10.1 - Aug 05, 2019

  • Fixed a bug in the Webhook notifier added in v1.10.0, where it would try to dispatch a Webhook notification even if no Webhook endpoint was configured (resulting in a logged failure).

v1.10.0 - Jul 02, 2019

  • Added a Webhook notifier (can be used to forward status change notifications to other systems eg. Microsoft Teams).
  • The date shown in the status page footer is now dynamic (showing the current year date; frozen when the Vigil process is started).

Information - Updated May 03, 2022

Stars: 1.1K
Forks: 89
Issues: 35
