author | Steve Kemp <steve@steve.org.uk> | 2012-11-12 21:00:16 +0000 |
---|---|---|
committer | Steve Kemp <steve@steve.org.uk> | 2012-11-12 21:00:16 +0000 |
commit | 6334b9cdfc47bd85b2ce236572e08406324d25cd (patch) | |
tree | bd0bd3cc279d8377efde2affc8dc223bfb858ca2 /README |
Initial dump of code.
Diffstat (limited to 'README')
-rw-r--r-- | README | 141 |
1 files changed, 141 insertions, 0 deletions
@@ -0,0 +1,141 @@
+
+
+About
+-----
+
+  We have an existing monitoring solution which suffers from several problems:
+
+    * It is hard to scale, because all tests are executed upon one machine.
+
+    * It is over-engineered, hard to modify, and suffers from threading-related issues.
+
+
+Proposal
+--------
+
+  Steve proposes we throw this away and replace it with something that is
+  both simpler in implementation, and easier to modify.  We'll keep in mind the
+  aim of allowing multiple monitoring stations - although we note that we will
+  need to update firewalls to allow probes from more hosts than our single current
+  one.
+
+  The core design is based upon a work queue.  I envisage two parts to the system:
+
+    * A parser that reads a list of hosts and tests to apply.  These
+      tests are broken down into individual jobs, serialized to JSON,
+      and stored in a queue.
+
+    * An arbitrary number of monitoring hosts, which pull jobs from the
+      work queue and execute them.
+
+
+
+
+
+Implementation
+--------------
+
+  Because we have an existing tool deployed, sentinel, which has a
+  reasonably well-defined configuration file, I propose that the new
+  solution will be 100% compatible with it.
+
+  This means we must accept lines of the following form:
+
+--
+
+LINN_HOSTS is 89.16.185.172 and 46.43.50.217 and 89.16.185.171 and 89.16.185.173 and 89.16.185.174 and 46.43.50.216 and 46.43.50.212 and 46.43.50.217 and 89.16.185.171.
+
+LINN_SSH_HOSTS must run ssh on 22 otherwise '*Managed client*: "[Goto Redmine]":https://managed.bytemark.co.uk/projects/linn ssh failure'.
+
+http://acerecords.co.uk/ must run http with status 200 otherwise '*Managed client*: "[Goto Redmine]":https://managed.bytemark.co.uk/projects/acerecords/wiki/Wiki HTTP failure'.
+http://acerecords.co.uk/ must run http with content 'Ace Records' otherwise '*Managed client*: "[Goto Redmine]":https://managed.bytemark.co.uk/projects/acerecords/wiki/Wiki HTTP failure'.
+--
+
+  In brief we accept four distinct kinds of line:
+
+
+
+  1.  Comments
+  ------------
+  Comments are lines that are blank or which begin with the comment-character ("#").
+
+
+  2.  Macro Definitions
+  ---------------------
+  There are two types of macros:
+
+    FOO is 1.2.3.4 and 2.3.4.5 and 4.5.6.6.
+    FOO are fetched from https://admin.bytemark.co.uk/network/monitor_ips/routers.
+
+  We accept both of these easily, with the caveat that macro-names must match
+  the regular expression ^[A-Z_]+$.
+
+
+  3.  Service Tests
+  -----------------
+  Service tests are best explained by several examples:
+
+    SWITCHES must run ssh otherwise 'Bytemark networking infrastructure: switch'.
+    mirror.bytemark.co.uk must run ftp on 21 otherwise 'Bytemark Mirror: FTP failure'.
+
+  The general case is:
+
+    hostname|macro must run XXX [on NN] otherwise 'alert'.
+
+  If we restrict ourselves to saying that every test must be named by the service
+  then we can generalize them.
+
+
+
+  4. Ping Tests
+  -------------
+  Ping tests are of the form:
+
+    FOO must ping otherwise 'alert text'.
+    example.vm.bytemark.co.uk must ping otherwise 'alert text'.
+
+  These are a simplification of the service tests, because the only real difference
+  is that we write "must ping" rather than "must run XXX".
+
+
+
+
+Behaviour
+---------
+
+There are two parts to our system:
+
+
+  a. Parser.
+
+  b. Worker.
+
+The parser will read the named configuration file(s), parse them, and submit
+to our queue a JSON-encoded piece of data for each test we must run.
+
+The worker will pull down these tests, and execute them.
+
+Sample JSON looks like this:
+
+  {"target_host":"46.43.37.199","test_type":"ssh","test_port":"22","test_alert":"*Managed client*: \"[Goto Redmine]\":https://managed.bytemark.co.uk/projects/wellinformed/wiki/Wiki ssh failure"}
+
+
+You'll see that the JSON-encoded data is merely a hash, with the following
+keys:
+
+  target_host: The host that will be probed.
+
+  test_port: The port number that will be queried, e.g. "22", or "222" for SSH probes.
+
+  test_type: The type of test we're running: "ssh", "http", "ftp", "imap", etc.
+
+  test_alert: The text of the alert we'll raise, on failure.
+
+There are only two extra fields that we expect to set in the normal course of events:
+
+  http_text: Expected HTTP/HTTPS contents.
+  http_status: Expected HTTP/HTTPS response code.
+
+
+
+
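The parser half of the design can be sketched in a few lines of Python. This is only an illustration of the line grammar and JSON keys described in the README, not the code from this commit. The names parse_config, MACRO_RE and TEST_RE are hypothetical, and three behaviours are assumptions: ping tests are given a test_type of "ping", test_port is left out when the line has no "on NN" clause, and the "are fetched from" macro form is rejected rather than fetched.

```python
import json
import re

# Macro definition of the form: "FOO is 1.2.3.4 and 2.3.4.5."
MACRO_RE = re.compile(r"^([A-Z_]+)\s+is\s+(.+?)\.$")

# Service test or ping test:
#   <target> must run <type> [on NN] [with status N | with content 'T'] otherwise '<alert>'.
#   <target> must ping otherwise '<alert>'.
TEST_RE = re.compile(
    r"^(?P<target>\S+)\s+must\s+(?:run\s+(?P<type>\S+)|(?P<ping>ping))"
    r"(?:\s+on\s+(?P<port>\d+))?"
    r"(?:\s+with\s+status\s+(?P<status>\d+))?"
    r"(?:\s+with\s+content\s+'(?P<content>[^']*)')?"
    r"\s+otherwise\s+'(?P<alert>.*)'\.$"
)

# Optional JSON keys and the regex groups that supply them.
OPTIONAL_KEYS = (
    ("test_port", "port"),
    ("http_status", "status"),
    ("http_text", "content"),
)


def parse_config(text):
    """Return a list of JSON-encoded jobs, one per host per test."""
    macros = {}
    jobs = []
    for raw in text.splitlines():
        line = raw.strip()
        # 1. Comments: blank lines, or lines starting with "#".
        if not line or line.startswith("#"):
            continue
        # 2. Macro definitions ("FOO is a and b." only; assumption).
        macro = MACRO_RE.match(line)
        if macro:
            macros[macro.group(1)] = re.split(r"\s+and\s+", macro.group(2))
            continue
        # 3 and 4. Service tests and ping tests share one pattern.
        test = TEST_RE.match(line)
        if not test:
            raise ValueError("unrecognised line: " + line)
        # A macro name expands to its hosts; anything else is one target.
        target = test.group("target")
        for host in macros.get(target, [target]):
            job = {
                "target_host": host,
                "test_type": test.group("type") or "ping",  # assumption
                "test_alert": test.group("alert"),
            }
            # Optional keys are only set when the line supplies them.
            for key, group in OPTIONAL_KEYS:
                if test.group(group) is not None:
                    job[key] = test.group(group)
            jobs.append(json.dumps(job))
    return jobs
```

Calling parse_config() on the text of a configuration file yields one JSON string per job, ready to be pushed onto whatever queue is chosen. A macro with N hosts produces N jobs, which is what lets several workers share the load.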