classify

module

v0.0.6 Latest Latest Go to latest Published: Jun 19, 2022 License: GPL-3.0

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/olesho/classify

Links

Open Source Insights

README ¶

Classify

Classify is an efficient tool for extraction of structured field sequences from HTML/XML data sources. It finds repetitive patterns and returns sequence of fields or XPaths to extract those fields.

When is this useful

Classify allows to create scraping/parsing pattern for specific data source and do it quickly. See usage section below.

Requirements

Go 1.13+

Installation

go get github.com/olesho/classify/sequence
cd classify/bin/fields
go install

Usage

Sample HTML input:

<html>
    <body>
        <section> Some Ad </section>
        <section> 
            <h1> Data </h1> 
            <div>
                <h3> Title 1 </h3>
                <p> Some text 1 </p>
                <img src="/src1"> image1 </img>
            </div>
            <div>
                <h3> Title 2 </h3>
                <p> Some text 2 </p>
                <img src="/src2"> image2 </img>
            </div>
            <div>
                <h3> Title 3 </h3>
                <p> Some text 3 </p>
                <img src="/src3"> image3 </img>
            </div>
        </section>
        <section> 
            <h2> Some Menu </h2>
            <ul>
                <li>Menu Item 1</li>
                <li>Menu Item 2</li>
                <li>Menu Item 3</li>
            </ul>
        </section>
    </body>
</html>

As text:

Run: fields

Title 1
Some text 1
/src1
image1
--------------------------
Title 2
Some text 2
/src2
image2
--------------------------
Title 3
Some text 3
/src3
image3
--------------------------

As JSON:

fields -json

{
  "fields": [
    [
      "Title 1",
      "Some text 1",
      "/src1",
      "image1"
    ],
    [
      "Title 2",
      "Some text 2",
      "/src2",
      "image2"
    ],
    [
      "Title 3",
      "Some text 3",
      "/src3",
      "image3"
    ]
  ],
  "stats": {
    "groups_count": 3,
    "group_fields_count": 4
  }
}

As CSV:

fields -csv

Title 1,Some text 1,/src1,image1
Title 2,Some text 2,/src2,image2
Title 3,Some text 3,/src3,image3

As XPath pattern

XPath pattern generation is still being developed: fields -xpath

/html/body/section/div/h3/@text=
/html/body/section/div/p/@text=
/html/body/section/div/img
/html/body/section/div/@text=

Multiple groups of data:

Data is being extracted and ranked descending by volume with indexes 0, 1, 2 ... In case you need to access other groups put the index N like: fields N

From example above fields -csv 1 will produce following output:

Menu Item 1
Menu Item 2
Menu Item 3

Menu items take less volume and are ranked after main data block.

Extracting from other sources:

Input data could be any URL or file. curl -s YOUR_URL_HERE | fields Keep in mind that some websites use dynamic content extensively. So CURLed version might differ significantly from the one you see in the browser. You might want to use: chromium --dump-dom 'YOUR_URL_HERE' | fields or google-chrome --dump-dom 'YOUR_URL_HERE' | fields instead of CURL.

Shell

For debugging purposes there is a command line tool:

cd bin/cshell
go build .
./cshell

Directories ¶

Path	Synopsis
arena classify project classify.go	classify project classify.go
bin
api
cshell
fields
comparator
sequence

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL