author Steve Kemp <steve@steve.org.uk> 2012-11-12 21:00:16 +0000
committer Steve Kemp <steve@steve.org.uk> 2012-11-12 21:00:16 +0000
commit 6334b9cdfc47bd85b2ce236572e08406324d25cd (patch)
tree bd0bd3cc279d8377efde2affc8dc223bfb858ca2 /README
Initial dump of code.
Diffstat (limited to 'README')
1 files changed, 141 insertions, 0 deletions
diff --git a/README b/README
new file mode 100644
index 0000000..553edc9
--- /dev/null
+++ b/README
@@ -0,0 +1,141 @@
+ We have an existing monitoring solution which suffers from several problems:
+ * It is hard to scale, because all tests are executed upon one machine.
+ * It is over-engineered, hard to modify, and suffers from threading-related issues.
+ Steve proposes we throw this away and replace it with something that is
+ both simpler in implementation, and easier to modify. We'll keep in mind the
+ aim of allowing multiple monitoring stations - although we note that we will
+ need to update firewalls to allow probes from more hosts than our single current
+ one.
+ The core design is based upon a work queue. I envisage two parts to the system:
+ * A parser that reads a list of hosts and tests to apply. These
+ tests are broken down into individual jobs, serialized to JSON,
+ and stored in a queue.
+ * An arbitrary number of monitoring hosts, which pull jobs from the
+ work queue and execute them.
+ Because we have an existing tool deployed, sentinel, which has a
+ reasonably well-defined configuration file, I propose that the new
+ solution be 100% compatible with it.
+ This means we must accept lines of the following form:
+LINN_HOSTS is and and and and and and and and
+LINN_SSH_HOSTS must run ssh on 22 otherwise '*Managed client*: "[Goto Redmine]":https://managed.bytemark.co.uk/projects/linn ssh failure'.
+http://acerecords.co.uk/ must run http with status 200 otherwise '*Managed client*: "[Goto Redmine]":https://managed.bytemark.co.uk/projects/acerecords/wiki/Wiki HTTP failure'.
+http://acerecords.co.uk/ must run http with content 'Ace Records' otherwise '*Managed client*: "[Goto Redmine]":https://managed.bytemark.co.uk/projects/acerecords/wiki/Wiki HTTP failure'.
+ In brief we accept four distinct kinds of line:
+ 1. Comments
+ ------------
+ Comments are lines that are blank or which begin with the comment-character ("#").
+ 2. Macro Definitions
+ ---------------------
+ There are two types of macros:
+ FOO is and and
+ FOO are fetched from https://admin.bytemark.co.uk/network/monitor_ips/routers.
+ We accept both of these easily, with the caveat that macro-names must match
+ the regular expression ^[A-Z_]+$.
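The macro rules above can be sketched as follows. This is a minimal illustration, not the project's parser: it assumes macro names are one or more upper-case letters or underscores (as names like LINN_SSH_HOSTS suggest), that hosts are separated by " and ", and that a trailing full stop is optional. The function names and the example hosts are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Assumption: a macro name is one or more upper-case letters or underscores.
MACRO_NAME = re.compile(r"[A-Z_]+")


def is_macro_name(token: str) -> bool:
    """Return True if the whole token is a syntactically valid macro name."""
    return MACRO_NAME.fullmatch(token) is not None


def parse_macro(line: str) -> Optional[Tuple[str, List[str]]]:
    """Parse 'NAME is host1 and host2' into (name, [hosts]).

    Returns None for anything that is not a literal macro definition.
    The 'are fetched from URL' form is deliberately not handled here.
    """
    name, sep, rest = line.strip().partition(" is ")
    if not sep or not is_macro_name(name):
        return None
    # Assumption: a trailing full stop, if present, is not part of a host.
    hosts = [h.strip() for h in rest.rstrip(". ").split(" and ") if h.strip()]
    return name, hosts
```

Using fullmatch rather than a "$" anchor avoids accepting a name that is followed by a newline.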
+ 3. Service Tests
+ -----------------
+ Service tests are best explained by several examples:
+ SWITCHES must run ssh otherwise 'Bytemark networking infrastructure: switch'.
+ mirror.bytemark.co.uk must run ftp on 21 otherwise 'Bytemark Mirror: FTP failure'.
+ The general case is:
+ hostname|macro must run XXX [on NN] otherwise 'alert'.
+ If we restrict ourselves to saying that every test must be named after the
+ service it probes, then we can handle them all in a generic way.
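One way to recognise the general case is a single regular expression. The sketch below is an assumption about how such a parser might look, not the actual implementation: the function name and the returned keys are hypothetical, and lines carrying extra clauses such as "with status 200" are intentionally rejected rather than interpreted.

```python
import re
from typing import Dict, Optional

# hostname|macro must run XXX [on NN] otherwise 'alert'.
SERVICE_TEST = re.compile(
    r"^(?P<target>\S+)\s+must\s+run\s+(?P<service>\S+)"
    r"(?:\s+on\s+(?P<port>\d+))?"
    r"\s+otherwise\s+'(?P<alert>.*)'\.?\s*$"
)


def parse_service_test(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Split a service-test line into its parts, or return None if it does not match."""
    match = SERVICE_TEST.match(line.strip())
    if match is None:
        return None
    return {
        "target": match.group("target"),    # hostname or macro name
        "service": match.group("service"),  # e.g. "ssh", "ftp"
        "port": match.group("port"),        # None when "on NN" is omitted
        "alert": match.group("alert"),
    }
```

Because the service name is captured rather than enumerated, a new test type needs no change to the pattern.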
+ 4. ping tests
+ -------------
+ Ping tests are of the form:
+ FOO must ping otherwise 'alert text'.
+ example.vm.bytemark.co.uk must ping otherwise 'alert text'.
+ These are a simplification of the service tests, because the only real difference
+ is that we write "must ping" rather than "must run XXX".
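To illustrate how small the difference is, a ping line can be parsed into the same shape as any other test, with the test type fixed to "ping" and no port. This is only a sketch; the function name and the returned keys are hypothetical.

```python
import re
from typing import Dict, Optional

# hostname|macro must ping otherwise 'alert'.
PING_TEST = re.compile(
    r"^(?P<target>\S+)\s+must\s+ping\s+otherwise\s+'(?P<alert>.*)'\.?\s*$"
)


def parse_ping_test(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Parse a ping-test line, or return None if the line is not a ping test."""
    match = PING_TEST.match(line.strip())
    if match is None:
        return None
    return {
        "target": match.group("target"),
        "service": "ping",  # implied by "must ping"
        "port": None,       # ping tests have no port
        "alert": match.group("alert"),
    }
```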
+There are two parts to our system:
+ a. Parser.
+ b. Worker.
+The parser will read the named configuration file(s), parse them, and submit
+to our queue a JSON-encoded piece of data for each test we must run.
+The worker will pull down these tests, and execute them.
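The flow between the two parts might be sketched like this. The example uses Python's in-process queue.Queue purely as a stand-in for whatever shared queue the monitoring hosts would really use, and the function names and the execute callback are hypothetical.

```python
import json
import queue
from typing import Callable, Dict, Iterable, List, Tuple


def submit_jobs(work_queue: "queue.Queue[str]", jobs: Iterable[Dict[str, str]]) -> None:
    """Parser side: serialize each job to JSON and add it to the queue."""
    for job in jobs:
        work_queue.put(json.dumps(job))


def run_worker(
    work_queue: "queue.Queue[str]",
    execute: Callable[[Dict[str, str]], bool],
) -> List[Tuple[str, bool]]:
    """Worker side: pull jobs until the queue is empty and run each one.

    `execute` is a placeholder for the real probe; it returns True on success.
    """
    results = []
    while True:
        try:
            payload = work_queue.get_nowait()
        except queue.Empty:
            break
        job = json.loads(payload)
        results.append((job["target_host"], execute(job)))
    return results
```

Because each job is a self-contained JSON string, any number of workers can drain the same queue without knowing about the configuration file.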
+Sample JSON looks like this:
+ {"target_host":"","test_type":"ssh","test_port":"22","test_alert":"*Managed client*: \"[Goto Redmine]\":https://managed.bytemark.co.uk/projects/wellinformed/wiki/Wiki ssh failure"}
+You'll see that the JSON-encoded data is merely a hash, with the following keys:
+ target_host: The host that will be probed.
+ test_port: The port number that will be queried, e.g. "22", or "222" for SSH probes.
+ test_type: The type of test we're running: "ssh", "http", "ftp", "imap", etc.
+ test_alert: The text of the alert we'll raise, on failure.
+There are only two extra fields that we expect to set in the normal course of events:
+ http_text: Expected HTTP/HTTPS contents.
+ http_status: Expected HTTP/HTTPS response code.
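A job with these fields could be built as in the sketch below. The function name is hypothetical, and two choices are assumptions: the port is stored as a string because the sample JSON does so, and http_status is stored as a string by analogy. The optional HTTP fields are only emitted when supplied.

```python
import json
from typing import Optional, Union


def encode_job(
    target_host: str,
    test_type: str,
    test_port: Union[int, str],
    test_alert: str,
    http_status: Optional[Union[int, str]] = None,
    http_text: Optional[str] = None,
) -> str:
    """Return one test as a JSON-encoded hash suitable for the queue."""
    job = {
        "target_host": target_host,
        "test_type": test_type,
        "test_port": str(test_port),  # the sample JSON stores the port as a string
        "test_alert": test_alert,
    }
    # Assumption: the HTTP-only fields are omitted entirely for other test types.
    if http_status is not None:
        job["http_status"] = str(http_status)
    if http_text is not None:
        job["http_text"] = http_text
    return json.dumps(job)
```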