jsonextract

package module
v1.5.3 Latest Latest
Published: Jun 21, 2021 License: MIT Imports: 12 Imported by: 3

README

jsonextract

jsonextract is a Go library for extracting JSON and JavaScript objects from any source. It can be used for data extraction tasks like web scraping.

If any text looks like a JavaScript object or is close to looking like JSON, it will be converted to valid JSON.

Extractor program

There's a small extractor program that uses this library to get data from URLs and files.

If you want to give it a try, you can just go-get it:

go get -u github.com/xarantolus/jsonextract/cmd/jsonx

You can use it both on files and URLs like this:

jsonx reader_test.go

or like this:

jsonx "https://stackoverflow.com/users/5728357/xarantolus?tab=topactivity"

It is also possible to only extract objects with certain keys by passing them along:

jsonx "https://www.youtube.com/watch?v=ap-BkkrRg-o" videoId title channelId

Another example:

jsonx "https://www.youtube.com/playlist?list=PLBQ5P5txVQr9_jeZLGa0n5EIYvsOJFAnY" videoId title
Examples

There are examples in the examples subdirectory.

The string example shows how to use the package to get all JSON objects/arrays in a string. It uses a strings.Reader for that.

The stackoverflow-chart example shows how to extract the reputation chart data of a StackOverflow user. Extracted data is then used to draw the same chart using Go:

[Figure: comparison of the chart from StackOverflow and the scraped and redrawn result]

Another real-world use-case is the yt-live example which extracts video info about the current live stream of a YouTube channel. The example illustrates how simple and powerful this library can be.

Other examples for the Objects method can be found in the documentation.

Supported notations

This software supports not just extracting normal JSON, but also other JavaScript notation.

This means that text like the following, which is definitely not valid JSON, can also be extracted to an object:

<script>
var x = {
	// Keys without quotes are valid in JavaScript, but not in JSON
	key: "value",
	num: 295.2,

	// Comments are removed while processing

	// Mixing normal and quoted keys is possible
	"obj": {
		"quoted": 325,
		'other quotes': true,
		unquoted: 'test', // This trailing comma will be removed
	},

	// JSON doesn't support all these number formats
	"dec": +21,
	"hex": 0x15,
	"oct": 0o25,
	"bin": 0b10101,
	bigint: 21n,

	// NaN will be converted to null. Infinity values are however not supported
	"num2": NaN,

	// No matter the sign, NaN becomes null
	"num3": -NaN,

	// Undefined will be interpreted as null
	"udef": undefined,

	`lastvalue`: `multiline strings are
no problem`
}
</script>

results in

{"key":"value","num":295.2,"obj":{"quoted":325,"other quotes":true,"unquoted":"test"},"dec":21,"hex":21,"oct":21,"bin":21,"bigint":21,"num2":null,"num3":null,"udef":null,"lastvalue":"multiline strings are\nno problem"}
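To make the number handling above more concrete, here is a minimal sketch of how the non-JSON integer literals from the example could be normalized to decimal using only the Go standard library. The helper name toDecimal is hypothetical and this is not the package's own implementation; it only reproduces the documented result (every literal becomes 21).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// toDecimal is a hypothetical helper (not part of jsonextract) that turns a
// JavaScript integer literal into its decimal string form.
func toDecimal(lit string) (string, error) {
	// BigInt literals end in "n"; drop the suffix before parsing.
	lit = strings.TrimSuffix(lit, "n")
	// With base 0, ParseInt picks the base from the 0x, 0o or 0b prefix
	// and accepts a leading sign.
	n, err := strconv.ParseInt(lit, 0, 64)
	if err != nil {
		return "", err
	}
	return strconv.FormatInt(n, 10), nil
}

func main() {
	for _, lit := range []string{"+21", "0x15", "0o25", "0b10101", "21n"} {
		dec, err := toDecimal(lit)
		if err != nil {
			fmt.Println(lit, "error:", err)
			continue
		}
		fmt.Println(lit, "=>", dec)
	}
}
```

Note that strconv's base 0 mode also accepts underscores and leading-zero octal literals, so this sketch is more permissive than the package itself.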
Notes
  • The functions take an io.Reader, but the underlying JS lexer uses ioutil.ReadAll, so the whole input is read into memory. That means that this won't work well on inputs that are larger than memory.
  • It is possible to craft input that forces the parser to backtrack a lot, which takes more time. One such input is a long run of opening array brackets [ that are never closed; after more than a few thousand of them, extraction gets noticeably slow.
  • When extracting objects from JavaScript files using Reader, you can end up with many arrays that look like [0], [1], ["i"], which is a result of indices being used in the script. You have to filter these out yourself.
  • While this package supports most number formats, some don't work because the lexer doesn't support them. One of those is underscores in numbers: in JavaScript, 2175 can be written as 2_175 or 0x8_7_f, but that doesn't work here (normal hex numbers do, however). Another example is numbers with a leading zero; they are rejected by the lexer because it's not clear whether they should be interpreted as octal or decimal.
  • Also unsupported are the float values Inf, +Inf, -Inf and other infinity values. While NaN is converted to null (as NaN is not valid JSON), infinity values don't have an appropriate JSON representation.
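The note about index-like arrays such as [0], [1] and ["i"] says you have to filter them out yourself. One possible filter is sketched below with encoding/json only. The function name isIndexLike and the rule it applies (exactly one element that is a number or a string) are assumptions made for illustration; adjust the rule to whatever noise your input actually produces.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// isIndexLike is a hypothetical filter (not part of jsonextract). It reports
// whether data is a JSON array with exactly one element that is a number or
// a string, such as [0] or ["i"].
func isIndexLike(data []byte) bool {
	var elems []json.RawMessage
	if err := json.Unmarshal(data, &elems); err != nil || len(elems) != 1 {
		return false
	}
	var v interface{}
	if err := json.Unmarshal(elems[0], &v); err != nil {
		return false
	}
	switch v.(type) {
	case float64, string:
		return true
	}
	return false
}

func main() {
	for _, in := range []string{`[0]`, `["i"]`, `[1,2]`, `[{"a":1}]`, `{"a":1}`} {
		fmt.Printf("%-10s skip=%v\n", in, isIndexLike([]byte(in)))
	}
}
```

Inside a callback, you would return nil early whenever such a filter reports true and only process the remaining values.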
Changelog
  • v1.5.3: Support signed +NaN and -NaN by converting them to null, just like the normal NaN
  • v1.5.2: Objects now behaves as documented and only matches the first option found. This lets you cascade options from the most keys to the fewest, which helps when there is some overlap.
  • v1.5.1: Objects now also goes through all child elements of a matched element
  • v1.5.0: Objects now terminates early if all callback functions are satisfied. To indicate this you can return ErrStop from an ObjectOption's callback, which will make sure that this function is not called again.
  • v1.4.4: Objects: Add Required field that controls if an error should be returned if the object was not found. This makes it more convenient to use this function as no further checks are needed after extracting objects.
  • v1.4.3: Convert floats with trailing dots to valid JSON, e.g. a 1. is converted to 1.0 as the first one isn't valid JSON
  • v1.4.2: Fix crash found using go-fuzz
  • v1.4.1: Transform NaN inputs to null
  • v1.4.0: Add Objects method for easily decoding smaller subsets of large nested structures
  • v1.3.1: Support more number formats by transforming them to decimal numbers, which are valid in JSON
  • v1.3.0: Return to non-streaming version that worked with all objects, the streaming version seemed to skip certain parts and thus wasn't very great
  • v1.2.0: Fork the JS lexer and make it use the underlying streaming lexer that was already in that package. That's a bit faster and prevents many unnecessary resets. This also makes it possible to extract from very large files with a small memory footprint.
  • v1.1.11: No longer stop the lexer from reading too much, as that didn't work that well
  • v1.1.10: Stops the JS lexer from reading all data from input at once, prevents expensive resets
  • v1.1.9: JS Regex patterns are now returned as strings
  • v1.1.8: Fix bug where template literals were interpreted the wrong way when certain escape sequences were present
  • v1.1.7: More efficient extraction when a trailing comma is found
  • v1.1.6: Always return the correct error
  • v1.1.5: Small clarification on the callback
  • v1.1.4: Support trailing commas in arrays and objects
  • v1.1.3: Many small internal changes
  • v1.1.2: Also support JS template strings
  • v1.1.1: Also turn single-quoted strings into valid JSON
  • v1.1.0: Now supports anything that looks like JSON, which also includes JavaScript object declarations
  • v1.0.0: Initial version, supports only JSON
Thanks

Thanks to everyone who made the parse package possible. Without it, creating this extractor would have been a lot harder.

Contributing

Please feel free to open issues for anything that doesn't seem right, even small stuff.

License

This is free as in freedom software. Do whatever you like with it.

Documentation

Overview

Package jsonextract implements functions for finding and extracting any valid JavaScript object (not just JSON) from an io.Reader.

This is an example of valid input for this package:

<script>
var x = {
	// Keys without quotes are valid in JavaScript, but not in JSON
	key: "value",
	num: 295.2,

	// Comments are removed while processing

	// Mixing normal and quoted keys is possible
	"obj": {
		"quoted": 325,
		'other quotes': true,
		unquoted: 'test', // This trailing comma will be removed
	},

	// JSON doesn't support all these number formats
	"dec": +21,
	"hex": 0x15,
	"oct": 0o25,
	"bin": 0b10101,
	bigint: 21n,

	// NaN will be converted to null. Infinity values are however not supported
	"num2": NaN,

	// Undefined will be interpreted as null
	"udef": undefined,

	`lastvalue`: `multiline strings are
no problem`
}
</script>

The input will be searched for anything that looks like JavaScript notation. Found objects and arrays are converted to JSON, which can then be used for decoding into Go structures.
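As a sketch of the last step, decoding the converted JSON into Go structures, the snippet below uses encoding/json on a shortened copy of the JSON that this kind of input is converted to. The struct and the decode helper are hypothetical names chosen for this illustration. A pointer field is used for num2 so that the null produced from NaN is distinguishable from zero.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// converted is a shortened copy of the JSON produced for the example input
// (NaN and undefined have become null).
const converted = `{"key":"value","num":295.2,"obj":{"quoted":325,"unquoted":"test"},"hex":21,"num2":null,"udef":null,"lastvalue":"multiline strings are\nno problem"}`

// example mirrors some keys of the converted object.
type example struct {
	Key string  `json:"key"`
	Num float64 `json:"num"`
	Obj struct {
		Quoted   int    `json:"quoted"`
		Unquoted string `json:"unquoted"`
	} `json:"obj"`
	Hex  int      `json:"hex"`
	Num2 *float64 `json:"num2"` // stays nil because the value is null
	Last string   `json:"lastvalue"`
}

// decode unmarshals converted JSON bytes into an example value.
func decode(data []byte) (example, error) {
	var e example
	err := json.Unmarshal(data, &e)
	return e, err
}

func main() {
	e, err := decode([]byte(converted))
	if err != nil {
		panic(err)
	}
	fmt.Println(e.Key, e.Num, e.Obj.Quoted, e.Hex, e.Num2 == nil)
	fmt.Printf("%q\n", e.Last)
}
```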

Objects is a high-level function for easily extracting certain objects no matter their position within any other object. Reader is a lower-level function that gives you more control over how you process objects and arrays.


Variables

var ErrCallbackNeverCalled = errors.New("callback never called")

ErrCallbackNeverCalled is returned from the Objects method if the callback of a required ObjectOption was never satisfied, which means that the callback never returned ErrStop.

var ErrStop = errors.New("stop processing json")

ErrStop can be returned from a JSONCallback function to indicate that processing should stop. When used with Reader, it will stop processing. When used with Objects, the callback function will never be called again (e.g. after it received the required data).

Functions

func Objects added in v1.4.0

func Objects(r io.Reader, o []ObjectOption) (err error)

Objects extracts all nested objects and passes them to appropriate callback functions. You can define which keys must be present for an object to be passed to your function.

This method will check not just top-level object keys, but also those of all child objects.

If multiple options would match, only the first one will be processed. This allows you to cascade options to first extract objects with the most keys, then those with fewer (which is useful if there are overlapping keys).

If a required option is not matched, ErrCallbackNeverCalled will be returned.

Arrays/Slices will not cause a callback as they don't have keys, but objects in them will be matched.

Example (MultipleList)

This example shows how to extract both a single object and a list of other objects.

// Define all structs we need for extraction
type ytVideo struct {
	VideoID string `json:"videoId"`
	Title   struct {
		Runs []struct {
			Text string `json:"text"`
		} `json:"runs"`
	} `json:"title"`
}

type ytPlaylist struct {
	URLCanonical string `json:"urlCanonical"`
	Title        string `json:"title"`
}

// This is where our data should end up
var (
	videoList []ytVideo
	playlist  ytPlaylist
)

// This file contains the HTML response of a YouTube playlist.
// One could also extract directly from a response body
f, err := os.Open("testdata/playlist.html")
if err != nil {
	panic(err)
}
defer f.Close()

err = Objects(f, []ObjectOption{
	{
		// All videos have a "videoId" and a "title" key
		Keys: []string{"videoId", "title"},
		// We use a more specialized callback to append to videoList
		Callback: func(b []byte) error {
			var vid ytVideo

			// Decode the given object. It has at least the Keys defined above
			err := json.Unmarshal(b, &vid)
			if err != nil {
				// if that didn't work, we skip the object
				return nil
			}

			// Check if anything required is missing
			if len(vid.Title.Runs) == 0 || vid.VideoID == "" {
				return nil
			}

			// Seems like we got the info we wanted, we can now store it
			videoList = append(videoList, vid)

			// ... and continue with the next object
			return nil
		},
	},
	{
		// Here we want to extract a playlist info object
		Keys: []string{"title", "urlCanonical"},
		Callback: Unmarshal(&playlist, func() bool {
			return playlist.Title != "" && playlist.URLCanonical != ""
		}),
	},
})
if err != nil {
	panic(err)
}

fmt.Printf("The %q playlist has %d videos\n", playlist.Title, len(videoList))
Output:

The "Starship" playlist has 10 videos
Example (NestedObjects)

This example shows how to extract nested objects.

// Test input
var input = strings.NewReader(`
	<script>
	var x = {
		"id": 339750489,
		// This comment makes the input invalid JSON
		"node_id": "MDEwOlJlcG9zaXRvcnkzMzk3NTA0ODk=",
		"name": "jsonextract",
		"full_name": "xarantolus/jsonextract",
		"private": false,
		"owner": {
			"login": "xarantolus",
			"id": 32465636,
			"node_id": "MDQ6VXNlcjMyNDY1NjM2",
			"avatar_url": "https://avatars.githubusercontent.com/u/32465636?v=4",
			"gravatar_id": "",
			"html_url": "https://github.com/xarantolus",
			"type": "User",
			"site_admin": false
		},
		"html_url": "https://github.com/xarantolus/jsonextract",
		"description": "Go package for finding and extracting any valid JavaScript object (not just JSON) from an io.Reader",
		"open_issues_count": 0,
		"license": {
			"key": "mit",
			"name": "MIT License",
			"spdx_id": "MIT",
			"url": "https://api.github.com/licenses/mit",
			"node_id": "MDc6TGljZW5zZTEz"
		},
	}
	</script>`)

// The "license" object has this structure
type License struct {
	Key    string `json:"key"`
	Name   string `json:"name"`
	SpdxID string `json:"spdx_id"`
	URL    string `json:"url"`
	NodeID string `json:"node_id"`
}

// ... and the "owner" object has this one
type Owner struct {
	Login      string `json:"login"`
	ID         int    `json:"id"`
	NodeID     string `json:"node_id"`
	AvatarURL  string `json:"avatar_url"`
	GravatarID string `json:"gravatar_id"`
	HTMLURL    string `json:"html_url"`
	Type       string `json:"type"`
	SiteAdmin  bool   `json:"site_admin"`
}

// We want to extract these two different objects that are nested within
// the whole JSON-like structure
var (
	license License
	owner   Owner
)

// Use Objects to extract all objects and match them to their keys
err := Objects(input, []ObjectOption{
	{
		// A valid license object has these keys
		Keys: []string{"key", "name", "spdx_id", "node_id"},
		// Unmarshal decodes the object to license until the function verifies that correct data was found
		// If there were multiple objects matching the keys, one could select the one that is wanted
		Callback: Unmarshal(&license, func() bool {
			// Return true if all fields we want have valid values
			return license.Key != "" && license.Name != ""
		}),
		// If this value is not present in the JSON data, the Objects call will return an error
		Required: true,
	},
	{
		// The owner object mostly has different keys, the overlap with "node_id"
		// doesn't matter because all listed keys must be present anyways
		Keys: []string{"login", "id", "html_url", "node_id"},
		Callback: Unmarshal(&owner, func() bool {
			return owner.Login != "" && owner.HTMLURL != ""
		}),
		Required: true,
	},
})
if err != nil {
	panic(err)
}

fmt.Printf("%s has published their package under the %s\n", owner.Login, license.Name)
Output:

xarantolus has published their package under the MIT License

func Reader

func Reader(reader io.Reader, callback JSONCallback) (err error)

Reader reads all JSON and JavaScript objects from the input and calls callback for each of them.

Errors returned from the callback will stop the method. The error will be returned, except if it is ErrStop which will cause the method to return nil.

Please note that the reader must return UTF-8 bytes for this to work correctly.
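The error handling described for Reader (any callback error stops processing and is returned, except the stop sentinel, which results in nil) follows a common Go pattern. The sketch below illustrates it with a local sentinel errStop and a hypothetical driver forEach that loops over pre-split strings; it does not call the package and does no extraction itself.

```go
package main

import (
	"errors"
	"fmt"
)

// errStop is a local stand-in for the package's ErrStop sentinel.
var errStop = errors.New("stop processing json")

// forEach is a hypothetical driver, not the package's Reader. It calls cb
// for each item, stops at the first error, and maps errStop to nil.
func forEach(items []string, cb func([]byte) error) error {
	for _, it := range items {
		if err := cb([]byte(it)); err != nil {
			if errors.Is(err, errStop) {
				return nil
			}
			return err
		}
	}
	return nil
}

func main() {
	items := []string{`{"a":1}`, `{"b":2}`, `{"c":3}`}
	seen := 0
	err := forEach(items, func(b []byte) error {
		seen++
		if seen == 2 {
			// We have what we need; ask the driver to stop.
			return errStop
		}
		return nil
	})
	fmt.Println(seen, err) // prints: 2 <nil>
}
```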

Types

type JSONCallback

type JSONCallback func([]byte) error

JSONCallback is the callback function passed to Reader and ObjectOptions.

Extracted JSON is passed to it as bytes; which values it receives depends on the function it is used with (Reader or Objects).

If this function returns an error, processing will stop and return that error. You can return ErrStop to make sure the function will not be called again.

func Unmarshal added in v1.4.0

func Unmarshal(pointer interface{}, verify func() bool) JSONCallback

Unmarshal returns a callback function that can be used with the Objects method for decoding one element. After verify returns true, the object will no longer be changed.

Please note that any Unmarshal errors will be ignored, which means that if you don't pass a pointer or your struct field types don't match the ones in the data, you will not be notified about the error.
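To illustrate the behaviour described above, here is a minimal sketch of a callback constructor with the same documented semantics: decode into the pointer, ignore decoding errors, and stop once verify reports success. The names unmarshalUntil and errStop are hypothetical, and signalling completion through a stop sentinel is an assumption based on the changelog entry for v1.5.0, not a copy of the package's source.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// errStop is a local stand-in for the package's ErrStop sentinel.
var errStop = errors.New("stop processing json")

// unmarshalUntil is a hypothetical re-creation of the documented behaviour:
// it decodes each object into pointer and returns errStop once verify
// returns true, so that the caller stops invoking the callback.
func unmarshalUntil(pointer interface{}, verify func() bool) func([]byte) error {
	return func(b []byte) error {
		// Decoding errors are deliberately ignored, as documented.
		_ = json.Unmarshal(b, pointer)
		if verify() {
			return errStop
		}
		return nil
	}
}

type license struct {
	Key  string `json:"key"`
	Name string `json:"name"`
}

func main() {
	var lic license
	cb := unmarshalUntil(&lic, func() bool {
		return lic.Key != "" && lic.Name != ""
	})
	inputs := []string{
		`{"key":"","name":"incomplete"}`,
		`{"key":"mit","name":"MIT License"}`,
		`{"key":"other","name":"never decoded"}`,
	}
	for _, in := range inputs {
		if err := cb([]byte(in)); errors.Is(err, errStop) {
			break // the value is final, stop calling the callback
		}
	}
	fmt.Println(lic.Key, lic.Name)
}
```

Because decoding errors are swallowed, a verify function that checks the fields you rely on is the only signal that the data was decoded correctly.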

type ObjectOption added in v1.4.0

type ObjectOption struct {
	// Keys defines a filter for objects. Only objects where these keys are present will be passed to Callback.
	// If this is not set, all objects will be passed to the callback.
	Keys []string

	// Callback receives JSON bytes for all objects that have all keys defined by Keys.
	// Returning ErrStop will stop extraction without error. Other errors will be returned.
	Callback JSONCallback

	// Required sets whether ErrCallbackNeverCalled should be returned if the callback function for this ObjectOption is not called
	Required bool
}

ObjectOption defines filters and callbacks for the Objects method.

Directories

Path Synopsis
cmd
examples
