[fix](json) Fix strict json validation#63309
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
There was a problem hiding this comment.
Review result: request changes.
Critical checkpoint conclusions:
- Goal/test: The PR targets strict json_valid parsing and strict typed jsonb_extract_* semantics, with BE unit tests and a jsonb_p0 regression test. The added tests cover the core one-row cases but miss a multi-row constant-argument execution path.
- Scope: The change is mostly focused, but the new typed numeric BE implementations need to preserve FunctionJsonbExtract's const-column contract.
- Concurrency/lifecycle/config/compatibility: No new concurrency, special lifecycle, config, storage format, or FE-BE protocol compatibility concern was identified.
- Parallel paths: FE now sends typed functions directly to BE; this correctly requires BE implementations for those functions, but the const JSON + const path path is incomplete for numeric typed extractors.
- Error handling/data correctness: One data correctness issue found: scalar_vector sizes the output from the unwrapped const path column rather than the input block row count.
- Tests: Existing tests do not exercise typed numeric extraction over multi-row input with both arguments constant.
- Observability/performance: No additional observability requirement or obvious performance blocker found.
User focus: No additional user-provided review focus was present.
| const ColumnString::Offsets& roffsets, const NullMap* r_null_map, | ||
| Container& res, NullMap& null_map) { | ||
| size_t size = roffsets.size(); | ||
| res.resize(size); |
There was a problem hiding this comment.
This sizes the result from the unwrapped path column, which is only one row when both the JSON argument and path argument are ColumnConst. FunctionJsonbExtract::execute_impl checks jsonb_data_const first and calls this scalar_vector branch even when path_const[0] is also true, so a query that evaluates jsonb_extract_int(jsonb_parse('{"a":1}'), '$.a') over a multi-row block can build a nested result column of size 1 and pair it with a null map sized to input_rows_count. That can either trip column-size invariants or return only the first row. Please handle the const/const case explicitly, or pass input_rows_count into this branch and fill that many result rows.
### What problem does this PR solve?
Issue Number: DORIS-25567
Related PR: None
Problem Summary: Fix json_valid accepting incomplete JSON input or JSON with trailing non-whitespace content. DORIS-25568 is documented as intentional behavior because typed jsonb_extract_* keeps the historical jsonb_extract plus cast rewrite and follows normal cast rules.
### Release note
Fix json_valid to reject incomplete or trailing JSON input.
### Check List (For Author)
- Test: Unit Test / FE build
- ./run-be-ut.sh --run --filter=JsonBinaryValueTest.TestValidation:FunctionJsonbTEST.JsonValidStrictTest
- ./build.sh --fe
- Added regression-test/suites/jsonb_p0/test_json_valid_strict.groovy, but local regression run did not complete because the worktree FE metadata/port state prevented a stable cluster startup.
- Behavior changed: Yes. json_valid rejects incomplete/trailing JSON input. typed jsonb_extract_* behavior is unchanged and documented as intentional.
- Does this need documentation: No
ad33600 to
fdf8418
Compare
|
/review |
There was a problem hiding this comment.
I found one blocking performance regression in the JSONB parser change. The fix goal is clear and the tests cover the new strict json_valid cases, but the implementation adds a full-document raw_json pre-pass on every JSONB parse path, which doubles traversal work for large JSON payloads used by json_valid, jsonb_parse, casts, and ingestion.
Critical checkpoint conclusions:
- Goal/test: the PR aims to make json_valid reject incomplete/trailing JSON; BE unit and regression coverage were added for representative invalid suffix cases.
- Scope: the change is small, but the raw_json pre-pass is broader than needed for the goal.
- Concurrency/lifecycle/config/compatibility: no new concurrency, lifecycle, config, storage-format, or FE-BE protocol compatibility concerns found.
- Parallel paths: the shared JsonbParser path covers json_valid and JSONB parsing/casts, but that also means parser overhead affects all those paths.
- Error handling/data correctness: Status propagation is preserved; the strict validation intent is sound.
- Tests: added tests cover trailing non-whitespace content, but not the parser performance impact. No additional user-provided focus points were present.
| } | ||
|
|
||
| std::string_view raw_json; | ||
| simdjson::error_code raw_res = doc.raw_json().get(raw_json); |
There was a problem hiding this comment.
This raw_json() call consumes the entire document, then doc.rewind() makes the code parse and serialize the same JSON again. JsonbParser::parse is shared by json_valid, jsonb_parse, casts to JSONB, and other ingestion paths, so large object/array inputs now pay an extra full simdjson traversal just to detect trailing content. The later code already consumes the root value and checks doc.at_end(), while top-level scalar getters use simdjson's root APIs that disallow trailing content and the new is_null() check covers the not/partial-null case. Please avoid the whole-document raw_json() pre-pass and enforce the remaining scalar/null validation in the existing parse pass so JSONB parsing stays single-pass.
### What problem does this PR solve? Issue Number: None Related PR: apache#63309 Problem Summary: Avoid the extra raw_json pre-pass in strict JSONB parsing, and keep const JSON plus const path extract results sized to the input block. ### Release note None ### Check List (For Author) - Test: Unit Test - ./run-be-ut.sh --run --filter=JsonBinaryValueTest.TestValidation:FunctionJsonbTEST.JsonbParseTest:FunctionJsonbTEST.JsonbParseErrorToNullTest:FunctionJsonbTEST.JsonValidStrictTest:FunctionJsonbTEST.JsonExtractConstConstMultiRow - build-support/check-format.sh - build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN (failed due existing complexity/NOLINT diagnostics and local toolchain stddef.h lookup) - Behavior changed: Yes. Const JSON plus const path non-string JSONB extract returns a correctly sized repeated result. - Does this need documentation: No
|
/review |
There was a problem hiding this comment.
No additional blocking issues found in this review.
Critical checkpoint conclusions:
- Goal and tests: the PR tightens JSON parsing so incomplete/trailing-content inputs are rejected, documents the existing typed JSON extract rewrite semantics, and fixes the const JSON plus const path result sizing path. Added BE unit coverage and a regression case target the changed behavior.
- Scope: changes are small and focused on JSONB parsing/extraction and related tests/comments.
- Concurrency/lifecycle: no new concurrency, locking, static initialization, or special lifecycle management introduced.
- Compatibility/configuration/persistence: no new config, persistence, storage format, or FE-BE protocol changes. Behavior change is limited to stricter JSON text parsing.
- Parallel paths: current final diff handles the previously raised const/const non-string JSON extract path; string/JSONB extract already uses the vector_vector_v2 path sized by input_rows_count.
- Error handling: new parser failures return non-OK Status; no ignored new Status paths found in production code.
- Test coverage: BE unit tests and regression coverage were added. I attempted
./run-be-ut.sh --run --filter=JsonBinaryValueTest.TestValidation:FunctionJsonbTEST.JsonbParseTest:FunctionJsonbTEST.JsonbParseErrorToNullTest:FunctionJsonbTEST.JsonValidStrictTest:FunctionJsonbTEST.JsonExtractConstConstMultiRow, but it timed out after 120s during first-time submodule setup before executing the tests. - Observability/performance: no new observability need; the final parser implementation avoids the earlier whole-document raw_json pre-pass performance concern.
- User focus: no additional user-provided review focus was specified.
What problem does this PR solve?
Related PR: None
Problem Summary: Fix json_valid accepting incomplete JSON input or JSON with trailing non-whitespace content.
jsonb_extract_*keeps the historical jsonb_extract plus cast rewrite and follows normal cast rules.Release note
Fix json_valid to reject incomplete or trailing JSON input.
Check List (For Author)