README



About
-----

  We have an existing monitoring solution which suffers from several problems:

    * It is hard to scale, because all tests are executed upon one machine.

    * It is over-engineered, hard to modify, and suffers from threading-related issues.


Proposal
--------

  Steve proposes we throw this away and replace it with something that is
 both simpler in implementation and easier to modify.  We'll keep in mind the
 aim of allowing multiple monitoring stations - although we note that we will
 need to update firewalls to allow probes from more hosts than our single current
 one.

  The core design is based upon a work queue.  I envisage two parts to the system:

    * A parser that reads a list of hosts and tests to apply.  These
      tests are broken down into individual jobs, serialized to JSON,
      and stored in a queue.

    * An arbitrary number of monitoring hosts, which pull jobs from the
      work queue and execute them.
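
  The two parts above can be sketched as follows.  This is only an
 illustration: it uses Python's in-process queue.Queue as a stand-in for the
 real work queue (which this document does not name), and the function names
 submit_jobs and drain_queue are hypothetical.

```python
import json
import queue


def submit_jobs(work_queue, jobs):
    """Parser side: serialize each job to JSON and store it in the queue."""
    for job in jobs:
        work_queue.put(json.dumps(job))


def drain_queue(work_queue, execute):
    """Worker side: pull jobs until the queue is empty, executing each one."""
    results = []
    while True:
        try:
            raw = work_queue.get_nowait()
        except queue.Empty:
            return results
        # Each queue entry is a JSON string, so decode it before running it.
        results.append(execute(json.loads(raw)))
```

  With several monitoring hosts the queue would have to be reachable over the
 network; the in-process queue here only demonstrates the division of labour.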


Implementation
--------------

  Because we have an existing tool deployed, sentinel, which has a
 reasonably well-defined configuration file, I propose that the new
 solution will be 100% compatible with it.

  This means we must accept lines of the following form:

--

LINN_HOSTS is 89.16.185.172 and 46.43.50.217 and 89.16.185.171 and 89.16.185.173 and 89.16.185.174 and 46.43.50.216 and 46.43.50.212 and 46.43.50.217 and 89.16.185.171.

LINN_HOSTS must run ssh on 22 otherwise '*Managed client*: "[Goto Redmine]":https://managed.bytemark.co.uk/projects/linn ssh failure'.

http://acerecords.co.uk/ must run http with status 200 otherwise '*Managed client*: "[Goto Redmine]":https://managed.bytemark.co.uk/projects/acerecords/wiki/Wiki HTTP failure'.

http://acerecords.co.uk/ must run http with content 'Ace Records' otherwise '*Managed client*: "[Goto Redmine]":https://managed.bytemark.co.uk/projects/acerecords/wiki/Wiki HTTP failure'.
--

  In brief we accept four distinct kinds of line:


  1. Comments
  ------------
  Comments are lines that are blank or which begin with the comment-character ("#").
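
  A minimal sketch of that check, assuming leading whitespace before the
  comment-character is tolerated (the text above does not say either way);
  the name is_comment is hypothetical:

```python
def is_comment(line):
    """Return True for blank lines and for lines that begin with '#'."""
    stripped = line.strip()  # assumption: surrounding whitespace is ignored
    return stripped == "" or stripped.startswith("#")
```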


  2. Macro Definitions
  ---------------------
  There are two types of macros:

     FOO is 1.2.3.4 and 2.3.4.5 and 4.5.6.6.
     FOO are fetched from https://admin.bytemark.co.uk/network/monitor_ips/routers.

  We accept both of these easily, with the caveat that macro-names must match
  the regular expression ^[0-9A-Z_]+$.
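
  Both forms could be recognised roughly as below.  This is a sketch, not the
  real parser: it assumes a macro-name is one or more upper-case letters,
  digits or underscores, that hosts are separated by " and ", and that the
  trailing "." is optional.  The name parse_macro and the returned keys are
  hypothetical.

```python
import re

# The "fetched from" form must be tried first, because the literal
# pattern would otherwise swallow it as a host list.
FETCHED = re.compile(r"^([0-9A-Z_]+)\s+(?:is|are)\s+fetched\s+from\s+(\S+?)\.?$")
LITERAL = re.compile(r"^([0-9A-Z_]+)\s+(?:is|are)\s+(.+?)\.?$")


def parse_macro(line):
    """Return a dict describing the macro, or None if the line is not one."""
    line = line.strip()
    m = FETCHED.match(line)
    if m:
        return {"name": m.group(1), "source": m.group(2)}
    m = LITERAL.match(line)
    if m:
        hosts = [h.strip() for h in m.group(2).split(" and ")]
        return {"name": m.group(1), "hosts": hosts}
    return None
```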


  3. Service Tests
  -----------------
  Service tests are best explained by several examples:

     SWITCHES must run ssh otherwise 'Bytemark networking infrastructure: switch'.
     mirror.bytemark.co.uk must run ftp on 21 otherwise 'Bytemark Mirror: FTP failure'.

  The general case is:

     hostname|macro must run XXX [on NN] otherwise 'alert'.

  If we restrict ourselves to saying that every test must be named by the service that is
  under test then we can generalize them.  This means we'll invoke the ftp-handler for the line:
  
     foo.vm must run ftp otherwise 'alert text'.

  The bar-handler for the line:

     example.vm.bytemark.co.uk must run bar otherwise 'alert text'.

  The JSON which we serialize will also have "test_type":"ftp" and "test_type":"bar", respectively.
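
  As an illustration, the general case could be turned into a job like this.
  It is a sketch only: it handles just the "must run XXX [on NN] otherwise"
  shape, not the "with status" or "with content" variants, and it does not
  expand macros.  The key names follow the JSON used elsewhere in this
  document; the name parse_service_test is hypothetical.

```python
import re

SERVICE_TEST = re.compile(
    r"^(\S+)\s+must\s+run\s+(\S+)(?:\s+on\s+(\d+))?\s+otherwise\s+'(.*)'\.?$"
)


def parse_service_test(line):
    """Return the job for a service-test line, or None if it does not match."""
    m = SERVICE_TEST.match(line.strip())
    if m is None:
        return None
    target, test_type, port, alert = m.groups()
    job = {"target_host": target, "test_type": test_type, "test_alert": alert}
    if port is not None:
        # Kept as a string, since the port is given as text in the config.
        job["test_port"] = port
    return job
```

  Because the test name is copied straight into test_type, supporting a new
  service only needs a new handler, not a change to the parser.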


  4. Ping Tests
  --------------
  Ping tests are of the form:

     FOO must ping otherwise 'alert text'.
     example.vm.bytemark.co.uk must ping otherwise 'alert text'.

  These are a simplification of the service tests, because the only real difference
  is that we write "must ping" rather than "must run ping".
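
  One way to exploit that similarity, sketched here under the assumption that
  ping lines may simply be rewritten before the service-test parsing runs
  (the name normalise_ping is hypothetical):

```python
import re


def normalise_ping(line):
    """Rewrite 'X must ping ...' as 'X must run ping ...'; leave other lines alone."""
    return re.sub(r"^(\s*\S+\s+must)\s+ping\b", r"\1 run ping", line)
```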


Behaviour
---------

There are two parts to our system:


  a.  Parser.
  b.  Worker.

The parser will read the named configuration file(s), parse them, and submit
to our queue a JSON-encoded piece of data for each test we must run.

The worker will pull down these tests, and execute them.

Sample JSON looks like this:

  {"target_host":"46.43.37.199","test_type":"ssh","test_port":"22","test_alert":"*Managed client*: \"[Goto Redmine]\":https://managed.bytemark.co.uk/projects/wellinformed/wiki/Wiki ssh failure"}


You'll see that the JSON-encoded data is merely a hash, with the following
keys:

   target_host:  The host that will be probed.

   test_port:  The port number that will be queried, e.g. "22" or "222" for SSH probes.

   test_type:  The type of test we're running: "ssh", "http", "ftp", "imap", etc.

   test_alert:  The text of the alert we'll raise, on failure.

There are only two extra fields that we expect to set in the normal course of events:

   http_text:    Expected HTTP/HTTPS contents.
   http_status:  Expected HTTP/HTTPS response code.
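
A worker could use these keys to pick a handler, roughly as sketched below.
The registry, the decorator and the handler bodies are all hypothetical
placeholders: real handlers would make network connections, whereas these
only return a description of the probe they would perform.

```python
import json

# Hypothetical registry mapping a test_type to the function that runs it.
HANDLERS = {}


def handles(test_type):
    """Decorator that registers a function as the handler for one test_type."""
    def register(func):
        HANDLERS[test_type] = func
        return func
    return register


@handles("ssh")
def ssh_probe(job):
    # Placeholder: a real handler would connect to the host and port.
    return {"probe": "ssh", "host": job["target_host"], "port": job["test_port"]}


@handles("http")
def http_probe(job):
    # http_status and http_text are optional, so only copy those present.
    expected = {k: job[k] for k in ("http_status", "http_text") if k in job}
    return {"probe": "http", "host": job["target_host"], "expected": expected}


def dispatch(raw_job):
    """Decode one JSON job and hand it to the handler for its test_type."""
    job = json.loads(raw_job)
    handler = HANDLERS.get(job["test_type"])
    if handler is None:
        raise ValueError("no handler for test_type {!r}".format(job["test_type"]))
    return handler(job)
```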

TODO:  The DNS-test will also use different fields.