diff options
| author | Steve Kemp <steve@steve.org.uk> | 2017-08-08 08:43:38 +0300 | 
|---|---|---|
| committer | Steve Kemp <steve@steve.org.uk> | 2017-08-08 08:43:38 +0300 | 
| commit | 53810efe8a7f92c47b5b6bab9de4afb52ce380b2 (patch) | |
| tree | 29f618e2cd4010d3e6584c833db214ecf9b61301 /lib/custodian | |
| parent | 21870b8cf59f0450ef09c10890f64bb5b88c7645 (diff) | |
Ignore bogus DNS results.
We've had a problem for the past few weeks (?) where we see
false DNS errors when making http/https requests with `curb`/`libcurl`.
To resolve these issues properly we're going to have to rewrite
the code to avoid the current gem.  However that is considerable work
because of the hole we've back ourself into - wanting to test both
IPv4 and IPv6 "properly".  We'll have to duplicate that work if
we use `net/http`, or even mroe so if we use `open3` and exec
`curl -4|-6 ..`
For the moment this commit changes how things are handled to deal
with the issue we see - which doesn't solve the problem but will
mask it.
When custodian runs a test it will return a status-code:
* Custodian::TestResult::TEST_FAILED
  * The test failed, such that an alert should be raised.
* Custodian::TestResult::TEST_PASSED
  * The test succeeded, such that any previous alert should be cleared.
* Custodian::TestResult::TEST_SKIPPED
  * Nothing should be done.
As the failure we see is very very specific - an exception is thrown
of the type `Curl::Err::HostResolutionError` - we can catch that
and return `TEST_SKIPPED`.  That means that there will be no
(urgent) alert.
Obviously the potential risk of swallowing all DNS-failures is that
a domain might expire and we'd never know.  So we'll do a little
better than merely skipping the test if there are DNS failures:
* If we see a DNS failure.
  * Then we try to lookup the host as an A & AAAA record.
    * If that succeeds we decide the issue was bogus.
    * If that fails then the host legitimately doesn't resolve so we raise an alert.
To recap:
* If a host fails normally - bogus status-code, or missing text - we behave as we did in the past.
* Only in the case of a DNS-error from curb/curl do we go down this horrid path.
   * Where we try to confirm the error, and swallow it if false.
This closes #13.
Diffstat (limited to 'lib/custodian')
| -rw-r--r-- | lib/custodian/protocoltest/http.rb | 75 | 
1 files changed, 74 insertions, 1 deletions
diff --git a/lib/custodian/protocoltest/http.rb b/lib/custodian/protocoltest/http.rb index a3f34cc..71388c8 100644 --- a/lib/custodian/protocoltest/http.rb +++ b/lib/custodian/protocoltest/http.rb @@ -337,8 +337,80 @@ module Custodian            rescue Curl::Err::TooManyRedirectsError              errors << "#{protocol_msg}: More than 10 redirections."            rescue Curl::Err::HostResolutionError => x -            # Log the DNS error-message. + +            # +            # We've been beset by a series of false-alerts in the recent +            # past which have all occurred at this point: +            # +            #   * We get bogus errors in resolving DNS from curb/libcurl. +            # +            #   * These errors go away on retry. +            # +            #   * But the retry isn't fast enough to outrace the +            #     supression-time of our alerts. +            # +            # For the moment we're going to _temporarily_ ignore these +            # errors. +            # +            #  * If a host has Connection-Refused, the wrong status-cde +            #    or similar failure it will be handled as normal. +            # +            #  * If the host has genuinely lost DNS then we're going to +            #    raise an alert, but if it is this false-error then we +            #    will silently disable this test-run. +            # +            # + +            #  IP addresses we found for the host +            ips = [] + +            # Get the hostname we're connecting to. +            u = URI.parse(test_url) +            target = u.host + +            # +            #  Resolve the target to an IP, unless it is already an address. +            # +            if (target =~ /^([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)$/) || +               (target =~ /^([0-9a-f:]+)$/) +              ips.push(target) +            else + +              # +              # OK if it didn't look like an IP address then attempt to +              # look it up, as both IPv4 and IPv6. +              # +              begin +                timeout(30) do + +                  Resolv::DNS.open do |dns| + +                    ress = dns.getresources(target, Resolv::DNS::Resource::IN::A) +                    ress.map { |r| ips.push(r.address.to_s) } + +                    ress = dns.getresources(target, Resolv::DNS::Resource::IN::AAAA) +                    ress.map { |r| ips.push(r.address.to_s) } +                  end +                end +              rescue Timeout::Error => _e +                # NOP +              end +            end + +            # +            #  At this point we either have: +            # +            #   "ips" containing entries - because the hostname resolved +            # +            #   "ips" being empty because the DNS failure was genuine +            # +            return Custodian::TestResult::TEST_SKIPPED unless ips.empty? + +            # +            # Log the DNS error-message, as this is genuine. +            #              resolution_errors << "#{protocol_msg}: #{x.class}: #{x.message}\n  #{x.backtrace.join("\n  ")}." +            rescue => x              errors << "#{protocol_msg}: #{x.class}: #{x.message}\n  #{x.backtrace.join("\n  ")}."            end @@ -361,6 +433,7 @@ module Custodian          # uh-oh! Resolution failed on both protocols!          if resolution_errors.length > 1            errors << "DNS Error when resolving host - #{resolution_errors.join(',')}" +          end          if !errors.empty?  | 
