Handle self-signed certs expiration in package tests #18881
Conversation
…icates that are either expired or will expire within the year
…in package tests and fails the test step under certain conditions
```bash
use_elastic_package

echo "--- [${package_name}] Check TLS certificate expiry"
```
I assumed this was the right place to put the check: this way it runs once for each package, immediately before its tests, at least that's how I understand the pipeline structure.
This way only packages with expired or close-to-expiry certificates will fail, and not the entire job.
Question: does it make sense to bump package changelogs in this case? I did not do it because this doesn't fix a bug in the package's code, it only modifies the setup used to test it. I'm happy to revisit this though.
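For context, a minimal sketch of what such a per-package hook could look like, assuming certificates are discoverable by extension under the package folder; the function name, paths, and threshold are illustrative, not the PR's actual script:

```bash
# Sketch of a per-package TLS expiry check run before the package's tests.
# Assumes test certificates live under packages/<name>/ as .pem/.crt files;
# a real script would also skip private-key PEM files.
check_tls_certificate_expiry() {
  local package_name="$1"
  local failed=0
  echo "--- [${package_name}] Check TLS certificate expiry"
  while IFS= read -r cert; do
    # openssl x509 -checkend exits non-zero when the certificate is already
    # expired or will expire within the given number of seconds (~6 months here).
    if ! openssl x509 -checkend $((180 * 24 * 3600)) -noout -in "$cert" >/dev/null; then
      echo "Certificate ${cert} is expired or expires within ~6 months"
      failed=1
    fi
  done < <(find "packages/${package_name}" -type f \( -name '*.pem' -o -name '*.crt' \))
  return "$failed"
}
```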
efd6 left a comment:
LGTM, but please update the proposed commit message to be the text that will be included as the commit message (no Markdown since git log is not Markdown-aware).
🚀 Benchmarks report

Package

| Data stream | Previous EPS | New EPS | Diff (%) | Result |
|---|---|---|---|---|
| deprecation | 8264.46 | 6024.1 | -2240.36 (-27.11%) | 💔 |
| server | 6493.51 | 5347.59 | -1145.92 (-17.65%) | 💔 |
| slowlog | 4132.23 | 3367 | -765.23 (-18.52%) | 💔 |
To see the full report, comment with `/test benchmark fullreport`.
mrodm left a comment:
@elastic/ecosystem WDYT about adding this new check in the CI?
| echo " openssl req -x509 -newkey rsa:2048 -keyout <key> -out <cert> \\" | ||
| echo " -subj '<subject>' [-addext 'subjectAltName=DNS:<hostname>'] -days 3650 -noenc" | ||
| echo " .buildkite/scripts/update-test-cert.sh <cert>" | ||
| exit 1 |
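For reference, the remediation the echoed hint describes could be run as follows for a hypothetical service certificate; the subject and hostname are placeholders, and `-noenc` requires OpenSSL 3.x:

```bash
# Renew a self-signed test certificate valid for ~10 years.
# '/CN=elasticsearch' and the DNS name are illustrative placeholders.
openssl req -x509 -newkey rsa:2048 -keyout cert.key -out cert.pem \
  -subj '/CN=elasticsearch' -addext 'subjectAltName=DNS:elasticsearch' \
  -days 3650 -noenc
```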
Wondering if this could create an XML file following the JUnit format, so it can then be reported as a failure and follow the flaky-tests automation to create GitHub issues with this error.
An example:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<testsuites>
<testsuite name="system" tests="20" failures="1">
<!--test suite for system tests-->
<testcase name="system test: default" classname="aws.cloudtrail" time="105.619672145"></testcase>
<testcase name="system test: data_granularity" classname="aws.ec2_metrics" time="717.594866274"></testcase>
<testcase name="system test: default" classname="aws.ec2_metrics" time="1022.428637031"></testcase>
<testcase name="system test: default" classname="aws.securityhub_findings_full_posture" time="39.192594525"></testcase>
<testcase name="system test: default" classname="aws.cloudfront_logs" time="109.69616598"></testcase>
<testcase name="system test: default" classname="aws.elb_logs" time="100.8600316"></testcase>
<testcase name="system test: default" classname="aws.securityhub_insights" time="36.60892882"></testcase>
<testcase name="system test: default" classname="aws.waf" time="108.81979344"></testcase>
<testcase name="system test: default" classname="aws.s3access" time="106.250401394"></testcase>
<testcase name="system test: default" classname="aws.apigateway_logs" time="194.243282454"></testcase>
<testcase name="system test: default" classname="aws.ec2_logs" time="103.762434038"></testcase>
<testcase name="system test: default" classname="aws.firewall_logs" time="107.836685465"></testcase>
<testcase name="system test: default" classname="aws.guardduty" time="44.097627039"></testcase>
<testcase name="system test: default" classname="aws.inspector" time="39.243202065"></testcase>
<testcase name="system test: default" classname="aws.route53_resolver_logs" time="103.599879859"></testcase>
<testcase name="system test: default" classname="aws.securityhub_findings" time="42.964267548"></testcase>
<testcase name="system test: default" classname="aws.config" time="643.350771599">
<failure>test case failed: could not find the expected hits in logs-aws.config-28774 data stream</failure>
</testcase>
<testcase name="system test: default" classname="aws.emr_logs" time="111.049148092"></testcase>
<testcase name="system test: default" classname="aws.redshift" time="812.692963297"></testcase>
<testcase name="system test: default" classname="aws.vpcflow" time="105.466743095"></testcase>
</testsuite>
</testsuites>
```
I like that! I'm thinking maybe I could shift the responsibility of this script into the tests, something like a "beforeEach" hook; this way we get the actual failure in JUnit natively.
WDYT?
These files are generated by elastic-package via the elastic-package test commands, one file per test type (static, system, pipeline...). There is no support for hooks there. And moreover, as this is detected as part of the CI scripts, before any elastic-package command related to testing, elastic-package cannot do anything here.
If you meant to add this functionality to elastic-package, I'm not sure this should belong there. I think elastic-package should be agnostic to the files being present in the docker deployer folder. WDYT @elastic/ecosystem ?
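(For context, those per-test-type files come from elastic-package's own reporting flags; a typical invocation looks like the sketch below, with reports landing under `build/test-results/` — the path is my understanding of the default, not something stated in this PR.)

```bash
# Ask elastic-package to write xUnit (JUnit-style) reports to files,
# one XML file per test type (static, system, pipeline, ...).
elastic-package test system --report-format xUnit --report-output file
```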
Yeah, I understand your concerns. Adding the check for expired certs within elastic-package is where I was going next, since there are currently 14 different certificates that we would need to monitor. It felt like approaching this from the test runner's perspective would give us a scalable solution and require less maintenance.
This condition needs to be caught before a test runs, though, because once the test runs, the root cause (the log entry that mentions certificate expiry) will only be logged in the container logs, and those are not included in Buildkite artifacts at the moment. This means that automated investigation tools such as the PR Detective won't have the necessary data to flag it properly, and we are left with a vague error message (assertions failed, test timed out).
I'll see what I can do to generate a JUnit file in the current script; if you have any other pointers, I'm definitely interested :D
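For what it's worth, a minimal sketch of emitting such a JUnit failure from a shell check; the report path, suite name, and class name are assumptions, not the final layout:

```bash
# Write a one-testcase JUnit report flagging an expired/expiring certificate,
# so CI tooling (flaky-test automation, PR Detective) can pick it up.
write_junit_failure() {
  local package_name="$1" cert="$2"
  local report="build/test-results/tls-expiry-${package_name}.xml"
  mkdir -p "$(dirname "$report")"
  cat > "$report" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<testsuites>
  <testsuite name="tls-cert-expiry" tests="1" failures="1">
    <testcase name="certificate expiry check" classname="${package_name}">
      <failure>certificate ${cert} is expired or close to expiry</failure>
    </testcase>
  </testsuite>
</testsuites>
EOF
}
```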
Sweet! Generating the JUnit report makes the root cause visible to the PR Detective, so that's nice.
| echo "Renew the certificates above, then sync test configs:" | ||
| echo " openssl req -x509 -newkey rsa:2048 -keyout <key> -out <cert> \\" | ||
| echo " -subj '<subject>' [-addext 'subjectAltName=DNS:<hostname>'] -days 3650 -noenc" | ||
| echo " .buildkite/scripts/update-test-cert.sh <cert>" |
This file does not exist in the PR.
Oh! Nice catch, thank you, will fix that.
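For illustration only, a hypothetical sketch of what a `.buildkite/scripts/update-test-cert.sh` could do, assuming the renewed certificate simply replaces every same-named copy under `packages/`; the matching logic and paths are made up here, not the PR's actual behavior:

```bash
#!/bin/bash
# Hypothetical update-test-cert.sh: copy a renewed certificate over every
# same-named copy in package test setups. Matching by file name is an
# assumption for this sketch.
set -euo pipefail

new_cert="$1"
find packages -type f -name "$(basename "$new_cert")" | while read -r target; do
  cp "$new_cert" "$target"
  echo "updated ${target}"
done
```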
…certification expiry
…h a bunch of expired and soon-to-be-expired certs in package tests
… github.com:elastic/integrations into fix/expired-self-signed-certs-in-integrations-tests
💔 Build Failed
Failed CI Steps

History
TL;DR
Remediation
Investigation details

Root Cause

Current evidence is insufficient to identify a concrete code/test/dependency failure in …

What can be ruled out:
Evidence
Verification
Follow-up

If the rerun still omits failure details, please attach the raw Buildkite job log (including pre-teardown output) so the root cause can be pinned to a specific file/test.

Note: 🔒 Integrity filter blocked 1 item. The following item was blocked because it doesn't meet the GitHub integrity level.
To allow these resources, lower the required integrity level:

```yaml
tools:
  github:
    min-integrity: approved # merged | approved | unapproved | none
```

What is this? | From workflow: PR Buildkite Detective
Summary of changes
Re-generate expired self-signed certificates used to test integration packages and add a monitoring script in the CI that will run before every package is tested to flag expired and close-to-expiry certificates.
The CI will fail the test step for a given package if its certificate expires in 6 months or less.
The CI will log a warning in the test step for a given package if the certificate expires in more than 6 months but less than a year.
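A sketch of that two-tier policy using `openssl x509 -checkend`; the threshold constants are my approximations of 6 and 12 months, not the script's exact values:

```bash
# Fail at <= ~6 months, warn between ~6 and ~12 months, per the policy above.
SIX_MONTHS=$((182 * 24 * 3600))
ONE_YEAR=$((365 * 24 * 3600))
if ! openssl x509 -checkend "$SIX_MONTHS" -noout -in "$cert" >/dev/null; then
  echo "FAIL: ${cert} is expired or expires within 6 months"
  exit 1
elif ! openssl x509 -checkend "$ONE_YEAR" -noout -in "$cert" >/dev/null; then
  echo "WARN: ${cert} expires in less than a year"
fi
```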
Here's how the diagnosis was done for the `aws.config` integration:

- `aws.config` tests were regularly failing (timing out) in #17491 and in "[cisco-ftd]: Parse user authentication rejection reasons" #18828
- `elastic-package up -d; elastic-package test system -d config` from the `packages/aws` folder showed the same behavior locally
- `openssl x509 -enddate -noout -in $CERT_FILE_PATH` confirmed the test certificate had expired (see the transcript below)
- After regenerating the certificate, `elastic-package up -d; elastic-package test system -d config` from `packages/aws` passes almost immediately instead of timing out after 10 minutes
Proposed commit message

Re-generate expired and close-to-expiry self-signed certificates for the aws.config and cybereason integrations. Add a monitoring script in the CI that runs before every integration package is tested, flags expired and close-to-expiry certificates, and fails the test with a comprehensive JUnit report.
Checklist
- [ ] I have added an entry to my package's `changelog.yml` file.

How to test this PR locally
- `cd packages/aws`
- `elastic-package stack up -d`
- `elastic-package test system -d config` so it only runs the config tests

If you want to reproduce the failure, try with a commit from `main` such as bebb005.

Related pipeline failures
https://buildkite.com/elastic/integrations/builds/42437/canvas?sid=019dfe60-06c0-4b86-a3f2-b404855cfa40&tab=output