wikiparse


README

go-wikiparse

If you're like me, you enjoy playing with lots of textual data and scouring the internet for sources of it.

MediaWiki's dumps are a pretty awesome chunk of it that's fun to work with.

Installation

go get github.com/dustin/go-wikiparse

Usage

The parser takes any io.Reader as a source, assuming it's a complete XML dump, and lets you pull wikiparse.Page objects out of it. The dumps typically arrive as bzip2 files, so I make my programs open the file and set up a bzip2 reader over it. You don't need to do that if you want to read from stdin, though. Here's a complete example that emits page titles from a decompressed stream on stdin:

package main

import (
	"fmt"
	"os"

	"github.com/dustin/go-wikiparse"
)

func main() {
	p, err := wikiparse.NewParser(os.Stdin)
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error setting up parser: %v\n", err)
		os.Exit(1)
	}

	for err == nil {
		var page *wikiparse.Page
		page, err = p.Next()
		if err == nil {
			fmt.Println(page.Title)
		}
	}
}

Example invocation:

bzcat enwiki-20120211-pages-articles.xml.bz2 | ./sample
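
If you'd rather open the compressed dump yourself instead of piping it through bzcat, here's a minimal sketch using the standard library's compress/bzip2 (the filename is just an example):

package main

import (
	"compress/bzip2"
	"fmt"
	"os"

	"github.com/dustin/go-wikiparse"
)

func main() {
	// Open the compressed dump and decompress it on the fly.
	f, err := os.Open("enwiki-20120211-pages-articles.xml.bz2")
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error opening dump: %v\n", err)
		os.Exit(1)
	}
	defer f.Close()

	p, err := wikiparse.NewParser(bzip2.NewReader(f))
	if err != nil {
		fmt.Fprintf(os.Stderr, "Error setting up parser: %v\n", err)
		os.Exit(1)
	}

	for {
		page, err := p.Next()
		if err != nil {
			break
		}
		fmt.Println(page.Title)
	}
}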

Geographical Information

Because it's interesting to me, I wrote a parser for the WikiProject Geographical coordinates found on many pages. Use this on a page's content to find out whether it's a place. Then go there.
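
For example, something like this (a sketch; reportPlace is a hypothetical helper, page is a wikiparse.Page you've already pulled from a parser, and fmt is assumed imported):

// reportPlace prints the coordinates on a page, if any.
// ErrNoCoordFound simply means the page isn't a place.
func reportPlace(page *wikiparse.Page) {
	if len(page.Revisions) == 0 {
		return
	}
	coord, err := wikiparse.ParseCoords(page.Revisions[0].Text)
	if err == nil {
		fmt.Printf("%s is at lat=%v, lon=%v\n", page.Title, coord.Lat, coord.Lon)
	}
}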

Pressure679

github.com/pressure679/WikiPagerankDB/tools.go: I honestly copy/pasted the XML types for Wikipedia's XML dumps, but the code also taught me a bunch about Go and about using Go effectively with XML.

Documentation

Overview

Package wikiparse is a library for understanding the Wikipedia XML dump format.

The dumps are available from the wikimedia group here:

http://dumps.wikimedia.org/

In particular, I've worked mostly with the enwiki dumps from here:

http://dumps.wikimedia.org/enwiki/

See the example programs in subpackages for an idea of how I've made use of these things.

Index

Constants

This section is empty.

Variables

var ErrNoCoordFound = errors.New("no coord data found")

ErrNoCoordFound is returned from ParseCoords when no coordinate data is found.

Functions

func FindFiles

func FindFiles(text string) []string

FindFiles finds all the File references from within an article body.

This includes references inside comments, since many of the ones I found were commented out.

func FindLinks

func FindLinks(text string) []string

FindLinks finds all the links from within an article body.

func URLForFile

func URLForFile(name string) string

URLForFile gets the wikimedia URL for the given named file.
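
A sketch combining the three functions above (listMedia is a hypothetical helper; text is an article body, and fmt is assumed imported):

// listMedia prints a wikimedia URL for every file referenced in the
// article body, followed by every page the body links to.
func listMedia(text string) {
	for _, name := range wikiparse.FindFiles(text) {
		fmt.Println(wikiparse.URLForFile(name))
	}
	for _, link := range wikiparse.FindLinks(text) {
		fmt.Println(link)
	}
}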

Types

type Contributor

type Contributor struct {
	ID       uint64 `xml:"id"`
	Username string `xml:"username"`
}

A Contributor is a user who contributed a revision.

type Coord

type Coord struct {
	Lon, Lat float64
}

Coord is a longitude/latitude pair from a coordinate match.

func ParseCoords

func ParseCoords(text string) (Coord, error)

ParseCoords parses geographical coordinates as specified in http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Geographical_coordinates

type IndexEntry

type IndexEntry struct {
	StreamOffset int64
	PageOffset   int
	ArticleName  string
}

An IndexEntry is an individual article from the index.

func (IndexEntry) String

func (i IndexEntry) String() string

type IndexReader

type IndexReader struct {
	// contains filtered or unexported fields
}

An IndexReader is a wikipedia multistream index reader.

func NewIndexReader

func NewIndexReader(r io.Reader) *IndexReader

NewIndexReader gets a wikipedia index reader.

func (*IndexReader) Next

func (ir *IndexReader) Next() (IndexEntry, error)

Next gets the next entry from the index stream.

This assumes the numbers were meant to be incremental.
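
A sketch of walking an index (dumpIndex is a hypothetical helper; r should yield the already-decompressed index lines):

// dumpIndex prints every entry in a multistream index.
func dumpIndex(r io.Reader) {
	ir := wikiparse.NewIndexReader(r)
	for {
		e, err := ir.Next()
		if err != nil {
			break // presumably io.EOF once the index is exhausted
		}
		fmt.Printf("stream offset %d, page %d: %s\n",
			e.StreamOffset, e.PageOffset, e.ArticleName)
	}
}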

type IndexSummaryReader

type IndexSummaryReader struct {
	// contains filtered or unexported fields
}

IndexSummaryReader gets offsets and counts from an index.

If you don't want to know the individual articles, just how many and where, this is for you.

func NewIndexSummaryReader

func NewIndexSummaryReader(r io.Reader) (rv *IndexSummaryReader, err error)

NewIndexSummaryReader gets a new IndexSummaryReader from the given stream of index lines.

func (*IndexSummaryReader) Next

func (isr *IndexSummaryReader) Next() (offset int64, count int, err error)

Next gets the next offset and count from the index summary reader.

Note that the last entry returns io.EOF as its error, but still carries a valid offset and count.
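
Given that quirk, here's a sketch of a loop that doesn't drop the final entry (summarize is a hypothetical helper):

// summarize prints one line per stream in the index.  The final entry
// arrives together with io.EOF, so it is printed before breaking out.
func summarize(r io.Reader) error {
	isr, err := wikiparse.NewIndexSummaryReader(r)
	if err != nil {
		return err
	}
	for {
		offset, count, err := isr.Next()
		if err != nil && err != io.EOF {
			return err
		}
		fmt.Printf("%d pages in the stream at offset %d\n", count, offset)
		if err == io.EOF {
			return nil
		}
	}
}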

type IndexedParseSource

type IndexedParseSource interface {
	OpenIndex() (io.ReadCloser, error)
	OpenData() (ReadSeekCloser, error)
}

An IndexedParseSource provides access to a multistream xml dump and its index.

This is typically downloaded as two files, but a seekable interface such as HTTP with range requests can also serve.
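
Since *os.File satisfies both io.ReadCloser and ReadSeekCloser, a file-backed implementation can be tiny. A sketch (fileSource is a hypothetical type, not part of the package):

// fileSource serves an IndexedParseSource from two local files.
type fileSource struct {
	indexPath, dataPath string
}

func (s fileSource) OpenIndex() (io.ReadCloser, error) {
	return os.Open(s.indexPath)
}

func (s fileSource) OpenData() (wikiparse.ReadSeekCloser, error) {
	return os.Open(s.dataPath)
}

Such a value could then be handed to NewIndexedParserFromSrc (below).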

type Page

type Page struct {
	Title     string     `xml:"title"`
	ID        uint64     `xml:"id"`
	Redir     Redirect   `xml:"redirect"`
	Revisions []Revision `xml:"revision"`
	Ns        uint64     `xml:"ns"`
}

A Page in the wiki.

type Parser

type Parser interface {
	// Get the next page from the parser
	Next() (*Page, error)
	// Get the toplevel site info from the stream
	SiteInfo() SiteInfo
}

A Parser emits wiki pages.

func NewIndexedParser

func NewIndexedParser(indexfn, datafn string, numWorkers int) (Parser, error)

NewIndexedParser gets an indexed/parallel wikipedia dump parser from the given index and data files.
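
For instance, something like this (the filenames are illustrative, not prescribed by the package; use the index and multistream data files you actually downloaded, and note the runtime import):

p, err := wikiparse.NewIndexedParser(
	"enwiki-20120211-pages-articles-multistream-index.txt",
	"enwiki-20120211-pages-articles-multistream.xml.bz2",
	runtime.NumCPU()) // one worker per CPU is a reasonable starting point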

func NewIndexedParserFromSrc

func NewIndexedParserFromSrc(src IndexedParseSource, numWorkers int) (Parser, error)

NewIndexedParserFromSrc creates a Parser that can parse multiple pages concurrently from a single source.

func NewParser

func NewParser(r io.Reader) (Parser, error)

NewParser gets a wikipedia dump parser reading from the given reader.

type ReadSeekCloser

type ReadSeekCloser interface {
	io.ReadSeeker
	io.Closer
}

ReadSeekCloser is io.ReadSeeker + io.Closer.

type Redirect

type Redirect struct {
	Title string `xml:"title,attr"`
}

A Redirect to another Page.

type Revision

type Revision struct {
	ID          uint64      `xml:"id"`
	Timestamp   string      `xml:"timestamp"`
	Contributor Contributor `xml:"contributor"`
	Comment     string      `xml:"comment"`
	Text        string      `xml:"text"`
}

A Revision to a page.

type SiteInfo

type SiteInfo struct {
	SiteName   string `xml:"sitename"`
	Base       string `xml:"base"`
	Generator  string `xml:"generator"`
	Case       string `xml:"case"`
	Namespaces []struct {
		Key   string `xml:"key,attr"`
		Case  string `xml:"case,attr"`
		Value string `xml:",chardata"`
	} `xml:"namespaces>namespace"`
}

SiteInfo is the top-level site info describing basic dump properties.
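
A sketch of reading it off a parser (printSiteInfo is a hypothetical helper; fmt is assumed imported):

// printSiteInfo dumps the basic dump properties and the namespace table.
func printSiteInfo(p wikiparse.Parser) {
	si := p.SiteInfo()
	fmt.Printf("%s (%s), generated by %s\n", si.SiteName, si.Base, si.Generator)
	for _, ns := range si.Namespaces {
		fmt.Printf("  namespace %s: %s\n", ns.Key, ns.Value)
	}
}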

Directories

Path Synopsis

tools
    cbload     Load a wikipedia dump into CouchBase
    couchload  Load a wikipedia dump into CouchDB
    esload     Load a wikipedia dump into ElasticSearch
    traverse   Sample program that finds all the geo data in wikipedia pages.
