summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorSteve Kemp <steve@steve.org.uk>2017-08-08 08:43:38 +0300
committerSteve Kemp <steve@steve.org.uk>2017-08-08 08:43:38 +0300
commit53810efe8a7f92c47b5b6bab9de4afb52ce380b2 (patch)
tree29f618e2cd4010d3e6584c833db214ecf9b61301
parent21870b8cf59f0450ef09c10890f64bb5b88c7645 (diff)
Ignore bogus DNS results.
We've had a problem for the past few weeks (?) where we see false DNS errors when making http/https requests with `curb`/`libcurl`. To resolve these issues properly we're going to have to rewrite the code to avoid the current gem. However that is considerable work because of the hole we've back ourself into - wanting to test both IPv4 and IPv6 "properly". We'll have to duplicate that work if we use `net/http`, or even mroe so if we use `open3` and exec `curl -4|-6 ..` For the moment this commit changes how things are handled to deal with the issue we see - which doesn't solve the problem but will mask it. When custodian runs a test it will return a status-code: * Custodian::TestResult::TEST_FAILED * The test failed, such that an alert should be raised. * Custodian::TestResult::TEST_PASSED * The test succeeded, such that any previous alert should be cleared. * Custodian::TestResult::TEST_SKIPPED * Nothing should be done. As the failure we see is very very specific - an exception is thrown of the type `Curl::Err::HostResolutionError` - we can catch that and return `TEST_SKIPPED`. That means that there will be no (urgent) alert. Obviously the potential risk of swallowing all DNS-failures is that a domain might expire and we'd never know. So we'll do a little better than merely skipping the test if there are DNS failures: * If we see a DNS failure. * Then we try to lookup the host as an A & AAAA record. * If that succeeds we decide the issue was bogus. * If that fails then the host legitimately doesn't resolve so we raise an alert. To recap: * If a host fails normally - bogus status-code, or missing text - we behave as we did in the past. * Only in the case of a DNS-error from curb/curl do we go down this horrid path. * Where we try to confirm the error, and swallow it if false. This closes #13.
-rw-r--r--lib/custodian/protocoltest/http.rb75
1 files changed, 74 insertions, 1 deletions
diff --git a/lib/custodian/protocoltest/http.rb b/lib/custodian/protocoltest/http.rb
index a3f34cc..71388c8 100644
--- a/lib/custodian/protocoltest/http.rb
+++ b/lib/custodian/protocoltest/http.rb
@@ -337,8 +337,80 @@ module Custodian
rescue Curl::Err::TooManyRedirectsError
errors << "#{protocol_msg}: More than 10 redirections."
rescue Curl::Err::HostResolutionError => x
- # Log the DNS error-message.
+
+ #
+ # We've been beset by a series of false-alerts in the recent
+ # past which have all occurred at this point:
+ #
+ # * We get bogus errors in resolving DNS from curb/libcurl.
+ #
+ # * These errors go away on retry.
+ #
+ # * But the retry isn't fast enough to outrace the
+ # supression-time of our alerts.
+ #
+ # For the moment we're going to _temporarily_ ignore these
+ # errors.
+ #
+ # * If a host has Connection-Refused, the wrong status-cde
+ # or similar failure it will be handled as normal.
+ #
+ # * If the host has genuinely lost DNS then we're going to
+ # raise an alert, but if it is this false-error then we
+ # will silently disable this test-run.
+ #
+ #
+
+ # IP addresses we found for the host
+ ips = []
+
+ # Get the hostname we're connecting to.
+ u = URI.parse(test_url)
+ target = u.host
+
+ #
+ # Resolve the target to an IP, unless it is already an address.
+ #
+ if (target =~ /^([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)$/) ||
+ (target =~ /^([0-9a-f:]+)$/)
+ ips.push(target)
+ else
+
+ #
+ # OK if it didn't look like an IP address then attempt to
+ # look it up, as both IPv4 and IPv6.
+ #
+ begin
+ timeout(30) do
+
+ Resolv::DNS.open do |dns|
+
+ ress = dns.getresources(target, Resolv::DNS::Resource::IN::A)
+ ress.map { |r| ips.push(r.address.to_s) }
+
+ ress = dns.getresources(target, Resolv::DNS::Resource::IN::AAAA)
+ ress.map { |r| ips.push(r.address.to_s) }
+ end
+ end
+ rescue Timeout::Error => _e
+ # NOP
+ end
+ end
+
+ #
+ # At this point we either have:
+ #
+ # "ips" containing entries - because the hostname resolved
+ #
+ # "ips" being empty because the DNS failure was genuine
+ #
+ return Custodian::TestResult::TEST_SKIPPED unless ips.empty?
+
+ #
+ # Log the DNS error-message, as this is genuine.
+ #
resolution_errors << "#{protocol_msg}: #{x.class}: #{x.message}\n #{x.backtrace.join("\n ")}."
+
rescue => x
errors << "#{protocol_msg}: #{x.class}: #{x.message}\n #{x.backtrace.join("\n ")}."
end
@@ -361,6 +433,7 @@ module Custodian
# uh-oh! Resolution failed on both protocols!
if resolution_errors.length > 1
errors << "DNS Error when resolving host - #{resolution_errors.join(',')}"
+
end
if !errors.empty?