
Handle self-signed certs expiration in package tests#18881

Open
Niceplace wants to merge 12 commits into main from
fix/expired-self-signed-certs-in-integrations-tests

Conversation


@Niceplace Niceplace commented May 7, 2026

Summary of changes

Re-generate expired self-signed certificates used to test integration packages and add a monitoring script in the CI that will run before every package is tested to flag expired and close-to-expiry certificates.

The CI will fail the test step for a given package if any of its certificates expires in 6 months or less.
The CI will log a warning in the test step for a given package if a certificate expires in more than 6 months but less than a year.
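As a rough sketch, both thresholds can be checked with openssl's -checkend option. The thresholds match the PR description, but the function name and messages below are assumptions, not the PR's actual script:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the CI gate: fail if a certificate is expired
# or expires within 6 months, warn if it expires within a year.
check_cert_expiry() {
  local cert="$1"
  local six_months=$(( 6 * 30 * 24 * 3600 ))
  local one_year=$(( 365 * 24 * 3600 ))
  # -checkend N exits non-zero if the cert expires within N seconds
  if ! openssl x509 -noout -checkend "${six_months}" -in "${cert}" >/dev/null; then
    echo "FAIL: ${cert} is expired or expires within 6 months"
    return 1
  elif ! openssl x509 -noout -checkend "${one_year}" -in "${cert}" >/dev/null; then
    echo "WARN: ${cert} expires within a year"
  else
    echo "OK: ${cert}"
  fi
}
```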

Here's how the diagnosis was done for the aws.config integration:

  1. aws.config tests were regularly failing (timing out) in #17491 (cisco-ftd: Parse user authentication rejection reasons #18828)
  2. Local execution with elastic-package stack up -d; elastic-package test system -d config from the packages/aws folder showed the same behavior locally
  3. Container log investigation showed errors related to x509
elastic-agent-1  | {"log.level":"error","@timestamp":"2026-05-07T16:33:55.327Z","message":"Error dialing x509: certificate has expired or is not yet valid: current time 2026-05-07T16:33:55Z is after 2026-05-06T06:27:43Z","component.id":"cel-default","component.type":"cel","component.binary":"filebeat","component.dataset":"elastic_agent.filebeat","log.source":"cel-default","log.origin":{"file.line":39,"file.name":"transport/logging.go","function":"github.com/elastic/elastic-agent-libs/transport/httpcommon.(*HTTPTransportSettings).RoundTripper.LoggingDialer.func2"},"id":"cel-aws.config-569bd14e-3f72-4aed-835d-c40421343ac3","network.transport":"tcp","server.address":"config.xxxx.amazonaws.com:443","log.logger":"input.cel","service.name":"filebeat","input_source":"https://config.xxxx.amazonaws.com/","input_url":"https://config.xxxx.amazonaws.com/","ecs.version":"1.6.0","ecs.version":"1.6.0"}
  4. Checked the certificate expiry date with openssl x509 -enddate -noout -in $CERT_FILE_PATH
  5. Re-generated the certificate with an expiry 10 years from now
  6. Ran the same test again (elastic-package stack up -d; elastic-package test system -d config from packages/aws) and it passed almost immediately instead of timing out after 10 minutes
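The expiry check and re-generation steps above can be reproduced with plain openssl. The subject, SAN, and filenames below are placeholders, not the PR's actual values:

```shell
# Re-generate a self-signed certificate valid for 10 years
# (illustrative subject and filenames; -addext needs OpenSSL 1.1.1+):
openssl req -x509 -newkey rsa:2048 \
  -keyout server.key -out server.crt \
  -subj '/CN=localhost' \
  -addext 'subjectAltName=DNS:localhost' \
  -days 3650 -nodes

# Verify the new expiry date, now roughly 10 years out:
openssl x509 -enddate -noout -in server.crt
```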

Proposed commit message

Re-generate expired and close-to-expiry self-signed certificates for the aws.config and cybereason integrations. Add a monitoring script in the CI that runs before every integration package is tested to flag expired and close-to-expiry certificates and fail the test with a comprehensive JUnit report.

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.
  • I have verified that any added dashboard complies with Kibana's Dashboard good practices

How to test this PR locally

  1. From the root of the repo: cd packages/aws
  2. Run elastic-package stack up -d
  3. Run elastic-package test system -d config so it only runs the config tests
  4. All tests pass

If you want to reproduce the failure, try with a commit from main such as bebb005

Related pipeline failures

https://buildkite.com/elastic/integrations/builds/42437/canvas?sid=019dfe60-06c0-4b86-a3f2-b404855cfa40&tab=output

Niceplace added 2 commits May 7, 2026 15:54
…icates that are either expired or will expire within the year
…in package tests and fails the test step under certain conditions
@Niceplace Niceplace requested review from a team as code owners May 7, 2026 20:11
@Niceplace Niceplace added the bugfix Pull request that fixes a bug issue label May 7, 2026

use_elastic_package

echo "--- [${package_name}] Check TLS certificate expiry"
Contributor Author
I assumed this was the right place to put the check; this way it will run once for each package immediately before the tests, at least that's how I understand the pipeline structure.

This way only packages with expired or close-to-expiry certificates will fail, and not the entire job.

@Niceplace

Question: does it make sense to bump package changelogs in this case? I did not do it because this doesn't fix a bug in the package's code, it only modifies the setup used to test it. I'm happy to revisit this though.


@Niceplace Niceplace added Integration:aws AWS and removed bugfix Pull request that fixes a bug issue labels May 7, 2026

@efd6 efd6 left a comment


LGTM, but please update the proposed commit message to be the text that will be included as the commit message (no Markdown since git log is not Markdown-aware).


elastic-vault-github-plugin-prod Bot commented May 8, 2026

🚀 Benchmarks report

Package elasticsearch 👍(2) 💚(1) 💔(3)

Data stream | Previous EPS | New EPS | Diff (%)           | Result
deprecation | 8264.46      | 6024.1  | -2240.36 (-27.11%) | 💔
server      | 6493.51      | 5347.59 | -1145.92 (-17.65%) | 💔
slowlog     | 4132.23      | 3367    | -765.23 (-18.52%)  | 💔

To see the full report comment with /test benchmark fullreport


@mrodm mrodm left a comment


@elastic/ecosystem WDYT about adding this new check in the CI?

echo " openssl req -x509 -newkey rsa:2048 -keyout <key> -out <cert> \\"
echo " -subj '<subject>' [-addext 'subjectAltName=DNS:<hostname>'] -days 3650 -noenc"
echo " .buildkite/scripts/update-test-cert.sh <cert>"
exit 1
Collaborator

Wondering if this could create an XML file following the JUnit format, so it can then be reported as a failure and follow the flaky-tests automation to create GitHub issues with this error.
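A minimal sketch of how the check script could emit such a file from the shell side; the output path, suite name, and variable defaults are assumptions, not the PR's implementation:

```shell
#!/usr/bin/env bash
# Write a one-failure JUnit XML file that the CI's artifact/report
# tooling could pick up. All names and paths below are illustrative.
package_name="${package_name:-aws}"
cert_file="${cert_file:-certs/server.crt}"
junit_file="build/test-results/check-certificates-${package_name}.xml"

mkdir -p "$(dirname "${junit_file}")"
cat > "${junit_file}" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<testsuites>
  <testsuite name="check-certificates" tests="1" failures="1">
    <testcase name="certificate expiry" classname="${package_name}">
      <failure>certificate ${cert_file} is expired or close to expiry</failure>
    </testcase>
  </testsuite>
</testsuites>
EOF
```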

Collaborator

Some example:

<?xml version="1.0" encoding="UTF-8"?>
<testsuites>
  <testsuite name="system" tests="20" failures="1">
    <!--test suite for system tests-->
    <testcase name="system test: default" classname="aws.cloudtrail" time="105.619672145"></testcase>
    <testcase name="system test: data_granularity" classname="aws.ec2_metrics" time="717.594866274"></testcase>
    <testcase name="system test: default" classname="aws.ec2_metrics" time="1022.428637031"></testcase>
    <testcase name="system test: default" classname="aws.securityhub_findings_full_posture" time="39.192594525"></testcase>
    <testcase name="system test: default" classname="aws.cloudfront_logs" time="109.69616598"></testcase>
    <testcase name="system test: default" classname="aws.elb_logs" time="100.8600316"></testcase>
    <testcase name="system test: default" classname="aws.securityhub_insights" time="36.60892882"></testcase>
    <testcase name="system test: default" classname="aws.waf" time="108.81979344"></testcase>
    <testcase name="system test: default" classname="aws.s3access" time="106.250401394"></testcase>
    <testcase name="system test: default" classname="aws.apigateway_logs" time="194.243282454"></testcase>
    <testcase name="system test: default" classname="aws.ec2_logs" time="103.762434038"></testcase>
    <testcase name="system test: default" classname="aws.firewall_logs" time="107.836685465"></testcase>
    <testcase name="system test: default" classname="aws.guardduty" time="44.097627039"></testcase>
    <testcase name="system test: default" classname="aws.inspector" time="39.243202065"></testcase>
    <testcase name="system test: default" classname="aws.route53_resolver_logs" time="103.599879859"></testcase>
    <testcase name="system test: default" classname="aws.securityhub_findings" time="42.964267548"></testcase>
    <testcase name="system test: default" classname="aws.config" time="643.350771599">
      <failure>test case failed: could not find the expected hits in logs-aws.config-28774 data stream</failure>
    </testcase>
    <testcase name="system test: default" classname="aws.emr_logs" time="111.049148092"></testcase>
    <testcase name="system test: default" classname="aws.redshift" time="812.692963297"></testcase>
    <testcase name="system test: default" classname="aws.vpcflow" time="105.466743095"></testcase>
  </testsuite>
</testsuites>

Contributor Author

I like that! I'm thinking maybe I could shift the responsibility of this script into the tests themselves, something like a "beforeEach" hook; this way we get the actual failure in JUnit natively.

WDYT ?

Collaborator

These files are generated by elastic-package via the elastic-package test commands, one file per test type (static, system, pipeline...). There is no support for hooks there. Moreover, since this is detected by the CI scripts before any elastic-package command related to testing runs, elastic-package cannot do anything here.

If you meant to add this functionality to elastic-package, I'm not sure this should belong there. I think elastic-package should be agnostic to the files being present in the docker deployer folder. WDYT @elastic/ecosystem ?

Contributor Author

@Niceplace Niceplace May 8, 2026

Yeah, I understand your concerns. Adding the check for expired certs within elastic-package is where I was going next, since there are currently 14 different certificates that we would need to monitor. It felt like approaching this from the test runner's perspective would give us a scalable solution and require less maintenance.

This condition needs to be caught before a test runs, though, because once the test runs, the root cause (the log entry that mentions certificate expiry) will only appear in the container logs, and those are not included in Buildkite artifacts at the moment. This means that automated investigation tools such as the PR Detective won't have the necessary data to flag it properly, and we are left with a vague error message (assertions failed, test timed out).

I'll see what I can do to generate a JUnit file in the current script; if you have any other pointers, I'm definitely interested :D

Contributor Author

Sweet! So generating the JUnit report makes the root cause visible to the PR Detective, so that's nice.

echo "Renew the certificates above, then sync test configs:"
echo " openssl req -x509 -newkey rsa:2048 -keyout <key> -out <cert> \\"
echo " -subj '<subject>' [-addext 'subjectAltName=DNS:<hostname>'] -days 3650 -noenc"
echo " .buildkite/scripts/update-test-cert.sh <cert>"
Collaborator

This file does not exist in the PR.

Contributor Author

Oh! Nice catch, thank you, will fix that.



elasticmachine commented May 8, 2026

💔 Build Failed

Failed CI Steps

History

Contributor

github-actions Bot commented May 8, 2026

TL;DR

Check integrations jamf_pro failed, but the captured Buildkite log only contains teardown/artifact-upload output and does not include the actual failing command output. The cert pre-check added in this PR is not the blocker for jamf_pro (it exits cleanly for that package), so the immediate action is to rerun the step with full process_package logs visible.

Remediation

  • Re-run Check integrations jamf_pro and capture the full process_package section output (the current log is truncated to post-failure teardown).
  • Inspect/download build/test-results/*.xml and package test logs from the rerun to identify the exact failing test/assertion.
  • If needed, reproduce locally from this PR head:
    ./.buildkite/scripts/test_one_package.sh packages/jamf_pro origin/main 29e052fb1dc085948e8913b83d184bc30a65e532
Investigation details

Root Cause

Current evidence is insufficient to identify a concrete code/test/dependency failure in jamf_pro because the provided step log does not include the failing process_package output.

What can be ruled out:

  • The new TLS cert gate is invoked before tests in .buildkite/scripts/test_one_package.sh (L29-L31) and would fail early if certs were invalid.
  • For packages/jamf_pro, .buildkite/scripts/check_certificates.sh path scan (L58-L63) finds no cert files and exits 0, so this job’s failure occurred after that pre-check.

Evidence

  • Build: https://buildkite.com/elastic/integrations/builds/42589
  • Job/step: Check integrations jamf_pro
  • Key log excerpt:
    • /tmp/gh-aw/buildkite-logs/integrations-check-integrations-jamf_pro.txt:97 --- [jamf_pro] failed
    • /tmp/gh-aw/buildkite-logs/integrations-check-integrations-jamf_pro.txt:100 Error: The command exited with status 1
    • /tmp/gh-aw/buildkite-logs/integrations-check-integrations-jamf_pro.txt contains only teardown/upload lines after failure, not the preceding failing command output.

Verification

  • Ran cert pre-check on PR head commit 29e052fb1dc085948e8913b83d184bc30a65e532:
    bash .buildkite/scripts/check_certificates.sh packages/jamf_pro
    Output: No certificate files found under packages/jamf_pro — nothing to check.
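The early-exit behavior described above (no certificate files found, exit 0) can be sketched as follows; the file-name patterns and message are assumptions about check_certificates.sh, not its actual contents:

```shell
#!/usr/bin/env bash
# Hypothetical pre-check path: if a package ships no certificate
# files, there is nothing to validate and the gate passes cleanly.
scan_package_certs() {
  local package_dir="$1"
  local certs
  certs="$(find "${package_dir}" -type f \( -name '*.pem' -o -name '*.crt' \) 2>/dev/null)"
  if [ -z "${certs}" ]; then
    echo "No certificate files found under ${package_dir} — nothing to check."
    return 0
  fi
  # Otherwise, list the certs for the expiry checks to process.
  printf '%s\n' "${certs}"
}
```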

Follow-up

If the rerun still omits failure details, please attach the raw Buildkite job log (including pre-teardown output) so root cause can be pinned to a specific file/test.

Note

🔒 Integrity filter blocked 1 item

The following item was blocked because it doesn't meet the GitHub integrity level.

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

From workflow: PR Buildkite Detective
