inspectjson-go

Parse JSON with imperfect syntax and capture metadata about byte offsets.

  • Decode human-crafted, error-prone JSON often found on web pages.
  • Reference byte and line+column offsets of JSON structures, keys, and values.
  • Sanitize the syntax of JSON for any strict decoder implementation.
  • Describe invalid syntax and suggest replacements.

This is implemented as a custom tokenizer based on the official JSON standards, but with configurable behaviors for edge cases of non-compliant syntax.

Usage

Import the module and refer to the code's documentation (pkg.go.dev).

import "github.com/dpb587/inspectjson-go/inspectjson"

Some sample use cases and starter snippets can be found in the examples directory.

examples$ go run ./parse-value <<<'{"n":true}'
inspectjson.ObjectValue{
  BeginToken: inspectjson.BeginObjectToken{
    SourceOffsets: &cursorio.TextOffsetRange{
      From: cursorio.TextOffset{Byte: 0, LineColumn: cursorio.TextLineColumn{0, 0}},
      Until: cursorio.TextOffset{Byte: 1, LineColumn: cursorio.TextLineColumn{0, 1}},
    },
  },
  EndToken: inspectjson.EndObjectToken{
    SourceOffsets: &cursorio.TextOffsetRange{
      From: cursorio.TextOffset{Byte: 9, LineColumn: cursorio.TextLineColumn{0, 9}},
      Until: cursorio.TextOffset{Byte: 10, LineColumn: cursorio.TextLineColumn{0, 10}},
    },
  },
  Members: map[string]inspectjson.ObjectMember{
    "n": inspectjson.ObjectMember{
      Name: inspectjson.StringValue{
        SourceOffsets: &cursorio.TextOffsetRange{
          From: cursorio.TextOffset{Byte: 1, LineColumn: cursorio.TextLineColumn{0, 1}},
          Until: cursorio.TextOffset{Byte: 4, LineColumn: cursorio.TextLineColumn{0, 4}},
        },
        Value: "n",
      },
      Value: inspectjson.BooleanValue{
        SourceOffsets: &cursorio.TextOffsetRange{
          From: cursorio.TextOffset{Byte: 5, LineColumn: cursorio.TextLineColumn{0, 5}},
          Until: cursorio.TextOffset{Byte: 9, LineColumn: cursorio.TextLineColumn{0, 9}},
        },
        Value: true,
      },
    },
  },
  ReplacedMembers: nil,
}
examples$ go run ./tokenize-offsets <<<'{"n":true}'
L1C1:L1C2;0x0:0x1           	begin-object      	{
L1C2:L1C5;0x1:0x4           	string            	"n"
L1C5:L1C6;0x4:0x5           	name-separator    	:
L1C6:L1C10;0x5:0x9          	true              	true
L1C10:L1C11;0x9:0xa         	end-object        	}
examples$ go run ./tokenize-log-lax <<<$'[01,TRUE,"hello\tworld",]//test'
L1C2:L1C3;0x1:0x2           	LaxNumberTrimLeadingZero	"0" -> ""
L1C5:L1C9;0x4:0x8           	LaxLiteralCaseInsensitive	"TRUE" -> "true"
L1C16:L1C17;0xf:0x10        	LaxStringEscapeMissingEscape	"\t" -> "\\t"
L1C23:L1C24;0x16:0x17       	LaxIgnoreExtraComma	"," -> ""
L1C25:L1C31;0x18:0x1e       	LaxIgnoreLineComment	"//test" -> ""
examples$ go run ./tokenize-sanitize <<<$'[01,TRUE,"hello\tworld",]//test'
[1,true,"hello\tworld"]

More complex usage can be seen in importers such as rdfkit-go.

Parser

Given an io.Reader, parse and return a Value. The Value interface is implemented by the grammar value types (e.g. BooleanValue, ObjectValue), which include fields for source offsets, scalar values, and other tokenization metadata, such as start/end delimiters.

value, err := inspectjson.Parse(
  os.Stdin,
  inspectjson.TokenizerConfig{}.
    SetLax(true).
    SetSourceOffsets(true),
)
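
The concrete value types can then be inspected directly. A minimal sketch, using the ObjectValue and ObjectMember fields shown in the example output above:

if obj, ok := value.(inspectjson.ObjectValue); ok {
  for name, member := range obj.Members {
    // report where each member's name appears in the source
    fmt.Printf("%q begins at %v\n", name, member.Name.SourceOffsets.From)
  }
}
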
Parser Options

Use ParserConfig to chain any of the following customizations and pass it as an extra argument, as in the sketch after this list. The tokenizer options may also be used. Defaults are marked with (default).

  • KeepReplacedObjectMembers
    • SetKeepReplacedObjectMembers(false) (default) - a previously-encountered member is dropped (i.e. the last member wins).
    • SetKeepReplacedObjectMembers(true) - replaced members are appended to the ReplacedMembers field.
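
A minimal sketch passing both configs, assuming ParserConfig follows the same chaining pattern as TokenizerConfig:

value, err := inspectjson.Parse(
  os.Stdin,
  inspectjson.TokenizerConfig{}.SetLax(true),
  inspectjson.ParserConfig{}.SetKeepReplacedObjectMembers(true),
)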

Tokenizer

Given an io.Reader, iterate over each Token. The Token interface is implemented by the grammar syntax types (e.g. LiteralTrueToken, BeginObjectToken), which include a field for source offsets and, for tokens with variable content, the decoded content.

tokenizer := inspectjson.NewTokenizer(
  os.Stdin,
  inspectjson.TokenizerConfig{}.
    SetLax(true).
    SetSourceOffsets(true),
)

for {
  token, err := tokenizer.Next()
  if err != nil {
    if errors.Is(err, io.EOF) {
      break
    }

    panic(err)
  }

  switch tt := token.(type) {
  case inspectjson.BeginArrayToken:
  // ...
  }
}

The contents of a token will be the decoded string representation for its type (including the effects of any syntax recovery). For example, the contents of a StringToken may include literal newlines and characters decoded from \uXXXX (UTF-16) escape sequences.
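
For instance, a sketch printing decoded string contents inside the token loop; the Value field on StringToken is assumed here by analogy with StringValue:

switch tt := token.(type) {
case inspectjson.StringToken:
  // decoded content; escapes such as \n and \uXXXX are already resolved
  fmt.Printf("string: %q\n", tt.Value)
}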

Tokenizer Options

Use TokenizerConfig to chain any of the following customizations and pass it as an extra argument, as in the sketch after this list. Defaults are marked with (default).

  • EmitWhitespace(bool)
    • SetEmitWhitespace(false) (default) - no whitespace tokens will be returned.
    • SetEmitWhitespace(true) - whitespace tokens will be returned.
  • Lax(bool)
    • SetLax(false) (default) - requires adherence to JSON syntax.
    • SetLax(true) - allow all recoverable syntax errors.
  • Multistream(bool)
    • SetMultistream(false) (default) - once a value has been completed, EOF is expected.
    • SetMultistream(true) - values will continue to be tokenized until EOF.
  • SourceOffsets(bool)
    • SetSourceOffsets(false) (default) - no offset data is included in tokens.
    • SetSourceOffsets(true) - capture byte and text line+column offsets for each token.
  • SourceInitialOffset(TextOffset) - use a non-zero initial offset (and enable capture of offset data).
  • SyntaxBehavior(SyntaxBehavior, bool) - allow or disallow a specific behavior.
  • SyntaxRecoveryHook(SyntaxRecoveryHookFunc)
    • SetSyntaxRecoveryHook(nil) (default) - syntax recovery will be handled silently.
    • SetSyntaxRecoveryHook(f) - for each recovered syntax occurrence, f will be invoked.
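
For example, a sketch combining several of the setters named above:

tokenizer := inspectjson.NewTokenizer(
  os.Stdin,
  inspectjson.TokenizerConfig{}.
    SetEmitWhitespace(true).
    SetLax(true).
    SetMultistream(true).
    SetSourceOffsets(true),
)
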
Tokenizer Reader

To use the tokenizer as a sanitization pipeline for a generic JSON decoder, create an io.Reader from it.

decoder := json.NewDecoder(inspectjson.NewTokenizerReader(tokenizer))
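
A more complete sketch of the pipeline, decoding lax input with encoding/json (the input literal is illustrative; imports: encoding/json, fmt, strings):

tokenizer := inspectjson.NewTokenizer(
  strings.NewReader(`[01,TRUE,] //test`),
  inspectjson.TokenizerConfig{}.SetLax(true),
)

// the reader yields the sanitized byte stream, so a strict decoder succeeds
var v interface{}
if err := json.NewDecoder(inspectjson.NewTokenizerReader(tokenizer)).Decode(&v); err != nil {
  panic(err)
}

fmt.Println(v) // [1 true]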

Syntax

Several SyntaxBehavior constants describe optional tokenization behaviors which may be configured via tokenizer options. The following describe behaviors for common human mistakes and non-standard encoders; a sketch of toggling individual behaviors follows below.

  • LaxIgnoreBlockComment - ignore /* block */ comments.
  • LaxIgnoreLineComment - ignore // line comments (which continue until the end of the line).
  • LaxStringEscapeInvalidEscape - convert, for example, \z (invalid) into \\z.
  • LaxStringEscapeMissingEscape - convert, for example, a literal tab (U+0009) into \t.
  • LaxNumberTrimLeadingZero - trim invalid leading zeros of a number.
  • LaxLiteralCaseInsensitive - allow case-insensitive literals, such as True.
  • LaxIgnoreExtraComma - ignore any repetitive or trailing commas within arrays or objects.
  • LaxIgnoreTrailingSemicolon - ignore any semicolon after a value.

Additionally, the following warning may be observed if the recovery hook is used.

  • WarnStringUnicodeReplacementChar - invalid Unicode sequence was replaced with U+FFFD.
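
Rather than enabling everything with SetLax(true), individual behaviors can be toggled. A sketch, assuming SetSyntaxBehavior is the setter form of the SyntaxBehavior(SyntaxBehavior, bool) option above:

tokenizer := inspectjson.NewTokenizer(
  os.Stdin,
  inspectjson.TokenizerConfig{}.
    SetSyntaxBehavior(inspectjson.LaxIgnoreLineComment, true).
    SetSyntaxBehavior(inspectjson.LaxIgnoreExtraComma, true),
)
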
Recovery Hook

When SyntaxRecoveryHook is used, each recovered syntax occurrence results in a SyntaxRecovery being emitted, which includes metadata about the source offsets, source runes, value start, and replacement runes, as applicable.
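
A sketch of observing recoveries; the hook signature is assumed here to receive the SyntaxRecovery value (see SyntaxRecoveryHookFunc in the package documentation):

tokenizer := inspectjson.NewTokenizer(
  os.Stdin,
  inspectjson.TokenizerConfig{}.
    SetLax(true).
    SetSourceOffsets(true).
    SetSyntaxRecoveryHook(func(r inspectjson.SyntaxRecovery) {
      // log each recovery with its source/replacement metadata (assumed signature)
      fmt.Fprintf(os.Stderr, "recovered: %+v\n", r)
    }),
)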

License

MIT License
