wikipedia

package
v0.0.0-...-c5d5a31 Latest
Warning

This package is not in the latest version of its module.

Published: Nov 21, 2020 License: Apache-2.0 Imports: 21 Imported by: 0

Documentation

Overview

Package wikipedia fetches Wikipedia articles



Constants

This section is empty.

Variables

var Available = map[language.Tag]struct{}{}/* 295 elements not displayed */

Available is the set of all languages that Wikipedia supports; see https://en.wikipedia.org/wiki/List_of_Wikipedias. Wiktionary and Wikiquote each have a separate list:

https://en.wiktionary.org/wiki/Wiktionary:List_of_languages
https://en.wikiquote.org/wiki/Wikiquote:Other_language_Wikiquotes

We sort their table by # of Articles descending.

var CirrusURL, _ = url.Parse("https://dumps.wikimedia.org/other/cirrussearch/current/")

CirrusURL is the URL for the cirrus Wikipedia dump files.

var WikiDataURL, _ = url.Parse("https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2")

WikiDataURL points to the Wikidata entity dump. It comes from a different URL (a smaller file) than the cirrus dumps, and the cirrus link is formatted differently.

Functions

func Languages

func Languages(supported []language.Tag) ([]language.Tag, []language.Tag)

Languages verifies languages based on Wikipedia's supported languages. An empty slice of supported languages implies you support every language available.
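
The documentation does not say what the two returned slices hold. The sketch below assumes they are the supported and the unsupported languages, and it uses plain string codes and a three-entry stand-in map instead of language.Tag and the real Available map so that it needs only the standard library. The name partition is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// available is a tiny stand-in for the package's Available map, keyed by
// plain language codes instead of language.Tag (an assumption for brevity).
var available = map[string]struct{}{
	"en": {}, "de": {}, "fr": {},
}

// partition splits the requested codes into those found in available and
// those that are not. An empty request means "every available language".
func partition(requested []string) (supported, unsupported []string) {
	if len(requested) == 0 {
		for code := range available {
			supported = append(supported, code)
		}
		sort.Strings(supported) // map iteration order is not stable
		return supported, nil
	}
	for _, code := range requested {
		if _, ok := available[code]; ok {
			supported = append(supported, code)
		} else {
			unsupported = append(unsupported, code)
		}
	}
	return supported, unsupported
}

func main() {
	ok, bad := partition([]string{"en", "xx"})
	fmt.Println(ok, bad)
}
```

Returning the rejected codes alongside the accepted ones lets a caller report a misconfigured language list instead of silently dropping it.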

Types

type Aliases

type Aliases map[string][]Text

Aliases holds the alternative names for an Item

func (*Aliases) Scan

func (a *Aliases) Scan(value interface{}) error

Scan unmarshals jsonb data
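
A Scan method like this usually implements the database/sql Scanner interface, so a jsonb column can be read straight into the map. The sketch below is one common way to write it, not necessarily the package's own code: it mirrors the documented Aliases and Text types locally and assumes the driver hands the column over as []byte (string is accepted as a convenience).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Text mirrors the documented Text type.
type Text struct {
	Text     string `json:"value,omitempty"`
	Language string `json:"language,omitempty"`
}

// Aliases mirrors the documented Aliases type.
type Aliases map[string][]Text

// Scan decodes a jsonb column value into the map. A NULL column resets
// the map to nil; any other non-JSON type is reported as an error.
func (a *Aliases) Scan(value interface{}) error {
	switch v := value.(type) {
	case nil:
		*a = nil
		return nil
	case []byte:
		return json.Unmarshal(v, a)
	case string:
		return json.Unmarshal([]byte(v), a)
	default:
		return fmt.Errorf("aliases: cannot scan type %T", value)
	}
}

func main() {
	var a Aliases
	err := a.Scan([]byte(`{"en":[{"language":"en","value":"NYC"}]}`))
	fmt.Println(a["en"][0].Text, err)
}
```

The same pattern would apply to the Labels, Descriptions and Claims Scan methods, since each is documented as unmarshalling jsonb data.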

type Award

type Award struct {
	Item []Wikidata `json:"item,omitempty"`
	Date []DateTime `json:"date,omitempty" property:"P585"`
}

Award is an award someone won

type City

type City struct {
	Item  []Wikidata `json:"item,omitempty"`
	Start []DateTime `json:"start,omitempty" property:"P580"`
	End   []DateTime `json:"end,omitempty" property:"P582"`
}

City is a geographical city. NOTE: This data structure is duplicative. It could be folded into a more general type together with the Country, Interment, Military, etc. structs. Perhaps just add all the qualifiers to our Wikidata structure?

type Claims

type Claims struct {
	Country              []Country    `json:"country,omitempty"`
	Image                []string     `json:"image,omitempty"`
	BirthPlace           []Wikidata   `json:"birthplace,omitempty"`
	Sex                  []Wikidata   `json:"sex,omitempty"`
	Father               []Wikidata   `json:"father,omitempty"`
	Mother               []Wikidata   `json:"mother,omitempty"`
	Spouse               []Spouse     `json:"spouse,omitempty"`
	CountryOfCitizenship []Wikidata   `json:"country_of_citizenship,omitempty"` // country of residence
	Instance             []Wikidata   `json:"instance,omitempty"`
	Capital              []City       `json:"capital,omitempty"`
	Currency             []Wikidata   `json:"currency,omitempty"`
	Flag                 []string     `json:"flag,omitempty"`
	Teams                []Team       `json:"teams,omitempty"` // sports teams
	Education            []Education  `json:"education,omitempty"`
	Occupation           []Wikidata   `json:"occupation,omitempty"`
	Signature            []string     `json:"signature,omitempty"`
	Interment            []Interment  `json:"interment,omitempty"` // burial/ashes location
	Genre                []Wikidata   `json:"genre,omitempty"`
	Religion             []Wikidata   `json:"religion,omitempty"`
	Awards               []Award      `json:"awards,omitempty"`
	Ethnicity            []Wikidata   `json:"ethnicity,omitempty"`
	Military             []Military   `json:"military,omitempty"` // military branch
	RecordLabel          []Wikidata   `json:"record_label,omitempty"`
	Position             []Wikidata   `json:"position,omitempty"` // e.g. position on team...forward, center, etc..
	MusicBrainz          []string     `json:"musicbrainz,omitempty"`
	Partner              []Spouse     `json:"partner,omitempty"`
	Origin               []Wikidata   `json:"origin,omitempty"`         // country of origin
	DeathCause           []Wikidata   `json:"cause_of_death,omitempty"` // there is also P1196 "manner of death"
	Members              []Member     `json:"members,omitempty"`
	Residence            []Wikidata   `json:"residence,omitempty"`
	Hand                 []Wikidata   `json:"hand,omitempty"` // left or right-handed
	Coordinate           []Coordinate `json:"coordinate,omitempty"`
	Birthday             []DateTime   `json:"birthday,omitempty"`
	Death                []DateTime   `json:"death,omitempty"`
	Start                []DateTime   `json:"start,omitempty"`
	Publication          []DateTime   `json:"publication,omitempty"`
	Sport                []Wikidata   `json:"sport,omitempty"`
	Drafted              []Wikidata   `json:"drafted,omitempty"`
	GivenName            []Wikidata   `json:"given_name,omitempty"`
	Influences           []Wikidata   `json:"influences,omitempty"`
	Location             []Wikidata   `json:"location,omitempty"`
	Website              []string     `json:"website,omitempty"`
	Population           []Population `json:"population,omitempty"`
	Instrument           []Instrument `json:"instrument,omitempty"` // Jimi Hendrix Fender Stratocaster
	Participant          []Wikidata   `json:"participant,omitempty"`
	Nominations          []Nomination `json:"nominations,omitempty"`
	Languages            []Wikidata   `json:"languages,omitempty"` // languages spoken and/or written proficiency
	BirthName            []Text       `json:"birth_name,omitempty"`
	Spotify              []string     `json:"spotify,omitempty"`
	USDA                 []string     `json:"usda,omitempty"`
	Twitter              []string     `json:"twitter,omitempty"`
	Instagram            []string     `json:"instagram,omitempty"`
	Facebook             []string     `json:"facebook,omitempty"`
	YouTube              []string     `json:"youtube,omitempty"`
	WorkStart            []DateTime   `json:"work_start,omitempty"` // better name??? P571 is similar tag
	Height               []Quantity   `json:"height,omitempty"`
	Weight               []Quantity   `json:"weight,omitempty"`
	Siblings             []Wikidata   `json:"siblings,omitempty"`
}

Claims are the formatted and condensed version of the Wikidata claims

func (*Claims) Scan

func (c *Claims) Scan(value interface{}) error

Scan unmarshals jsonb data. See http://www.booneputney.com/development/gorm-golang-jsonb-value-copy/

type Coordinate

type Coordinate struct {
	Latitude  []float64  `json:"latitude,omitempty"`
	Longitude []float64  `json:"longitude,omitempty"`
	Altitude  []float64  `json:"altitude,omitempty"`
	Precision []float64  `json:"precision,omitempty"`
	Globe     []Wikidata `json:"globe,omitempty"`
}

Coordinate is a Wikipedia coordinate

type Country

type Country struct {
	Item  []Wikidata `json:"item,omitempty"`
	Start []DateTime `json:"start,omitempty" property:"P580"`
	End   []DateTime `json:"end,omitempty" property:"P582"`
}

Country is a geographical country

type DateTime

type DateTime struct {
	Value    string   `json:"value,omitempty"`
	Calendar Wikidata `json:"calendar,omitempty"`
}

DateTime is the raw, unformatted version of a datetime. Note: Wikidata only uses the Gregorian and Julian calendars.

type Definition

type Definition struct {
	Part     string    `json:"part,omitempty"`
	Meaning  string    `json:"meaning,omitempty"`
	Synonyms []Synonym `json:"synonyms,omitempty"`
}

Definition is a single definition and synonyms

type Descriptions

type Descriptions map[string]Text

Descriptions holds the descriptions for an Item

func (*Descriptions) Scan

func (d *Descriptions) Scan(value interface{}) error

Scan unmarshals jsonb data

type Education

type Education struct {
	Item   []Wikidata `json:"item,omitempty"`
	Start  []DateTime `json:"start,omitempty" property:"P580"`
	End    []DateTime `json:"end,omitempty" property:"P582"`
	Degree []Wikidata `json:"degree,omitempty" property:"P512"`
	Major  []Wikidata `json:"major,omitempty" property:"P812"`
}

Education represents the education of a person

type Fetcher

type Fetcher interface {
	Setup() error
	Fetch(query string, lang language.Tag) ([]*Item, error)
}

Fetcher outlines the methods used to retrieve Wikipedia snippets
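
Both JiveData and PostgreSQL below provide Setup and Fetch methods with these signatures. For tests it can help to have a third, in-memory implementation. The sketch below mirrors the interface locally, with two simplifications that are assumptions of the example: lang is a plain string rather than language.Tag, and Item is cut down to two fields. memoryFetcher and errNotFound are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// Item is a trimmed stand-in for the package's Item type.
type Item struct {
	Title string
	Text  string
}

// Fetcher mirrors the documented interface, using a string language code.
type Fetcher interface {
	Setup() error
	Fetch(query string, lang string) ([]*Item, error)
}

var errNotFound = errors.New("no items found")

// memoryFetcher serves items from a map keyed by lang + "/" + query.
type memoryFetcher struct {
	items map[string][]*Item
}

// Setup makes sure the backing map exists.
func (m *memoryFetcher) Setup() error {
	if m.items == nil {
		m.items = make(map[string][]*Item)
	}
	return nil
}

// Fetch returns the stored items or errNotFound.
func (m *memoryFetcher) Fetch(query, lang string) ([]*Item, error) {
	items, ok := m.items[lang+"/"+query]
	if !ok {
		return nil, errNotFound
	}
	return items, nil
}

func main() {
	var f Fetcher = &memoryFetcher{items: map[string][]*Item{
		"en/Go": {{Title: "Go", Text: "Go is a programming language."}},
	}}
	if err := f.Setup(); err != nil {
		panic(err)
	}
	items, err := f.Fetch("Go", "en")
	if err != nil {
		panic(err)
	}
	fmt.Println(items[0].Title)
}
```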

type File

type File struct {
	URL *url.URL

	Base string
	Dir  string
	ABS  string
	Type FileType
	// contains filtered or unexported fields
}

File is a wikipedia/wikidata dump file

func CirrusLinks

func CirrusLinks(supported []language.Tag, fileTypes []FileType) ([]*File, error)

CirrusLinks finds the latest cirrus links available from Wikipedia, e.g. enwiki-20171009-cirrussearch-content.json.gz. Note: Cirrus files are Wikimedia's Elasticsearch-formatted dumps. The cirrussearch files for Wikipedia include the wikibase_item and have a layout closer to the API than the dumps found at https://dumps.wikimedia.org/enwiki/latest/.

func NewFile

func NewFile(u *url.URL, ft FileType, l language.Tag) *File

NewFile returns a new file and sets the URL and Base.

func (*File) Download

func (f *File) Download() error

Download downloads a wikipedia/wikidata dump file

func (*File) Parse

func (f *File) Parse(truncate int) error

Parse parses a wikipedia/wikidata dump file and sends it to Dumper

func (*File) SetABS

func (f *File) SetABS(dir string) *File

SetABS sets the absolute path for a file

func (*File) SetDumper

func (f *File) SetDumper(d dumper) *File

SetDumper sets the Dumper for a file

type FileType

type FileType string

FileType is a type of Wikipedia file

const (
	// WikidataFT is a Wikidata file type
	WikidataFT FileType = "wikidata"
	// WikipediaFT is a Wikipedia file type
	WikipediaFT FileType = "wikipedia"
	// WikiquoteFT is a Wikiquote file type
	WikiquoteFT FileType = "wikiquote"
	// WiktionaryFT is a Wiktionary file type
	WiktionaryFT FileType = "wiktionary"
)
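
Since FileType is a plain string type, any string converts to it without a check. When the value comes from a flag or a config file, a small validating helper can catch typos early. The sketch below redeclares the constants locally; parseFileType is a hypothetical helper, not part of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// FileType and its constants mirror the documented declarations.
type FileType string

const (
	WikidataFT   FileType = "wikidata"
	WikipediaFT  FileType = "wikipedia"
	WikiquoteFT  FileType = "wikiquote"
	WiktionaryFT FileType = "wiktionary"
)

// parseFileType lowercases s and accepts it only if it matches one of
// the four known file types.
func parseFileType(s string) (FileType, error) {
	switch ft := FileType(strings.ToLower(s)); ft {
	case WikidataFT, WikipediaFT, WikiquoteFT, WiktionaryFT:
		return ft, nil
	default:
		return "", fmt.Errorf("unknown file type %q", s)
	}
}

func main() {
	ft, err := parseFileType("Wikiquote")
	fmt.Println(ft, err)
}
```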

type Instrument

type Instrument struct {
	Item         []Wikidata `json:"item,omitempty"`
	Manufacturer []Wikidata `json:"manufacturer,omitempty" property:"P176"`
}

Instrument is a musical instrument (guitar, drums, etc)

type Interment

type Interment struct {
	Item  []Wikidata `json:"item,omitempty"`
	Start []DateTime `json:"start,omitempty" property:"P580"`
	End   []DateTime `json:"end,omitempty" property:"P582"`
}

Interment is the place a person was buried

type Item

type Item struct {
	Wikipedia
	*Wikidata
	Wikiquote  Wikiquote  `json:"wikiquote"`
	Wiktionary Wiktionary `json:"wiktionary"`
}

Item contains the complete wiki info for a person, thing, or word.
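
Item embeds both Wikipedia and *Wikidata, and each of those has an ID field. In Go, a field that is promoted from two embedded types at the same depth cannot be selected without qualification, and the embedded pointer may be nil. The sketch below shows both points with cut-down local types; the ID value is a placeholder and wikidataID is a hypothetical helper.

```go
package main

import "fmt"

// Cut-down stand-ins for the documented types.
type Wikipedia struct {
	ID    string
	Title string
}

type Wikidata struct {
	ID string
}

type Item struct {
	Wikipedia
	*Wikidata
}

// wikidataID returns the Wikidata ID, or "" when no Wikidata record
// is attached to the item.
func wikidataID(it Item) string {
	if it.Wikidata == nil {
		return ""
	}
	return it.Wikidata.ID
}

func main() {
	it := Item{Wikipedia: Wikipedia{ID: "Q000", Title: "Example"}}
	fmt.Println(it.Title)        // promoted from the embedded Wikipedia
	fmt.Println(it.Wikipedia.ID) // it.ID alone would not compile: ambiguous
	fmt.Println(wikidataID(it) == "")
}
```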

type JiveData

type JiveData struct {
	HTTPClient *http.Client
	Key        string
}

JiveData is a Wikipedia data provider

func (*JiveData) Fetch

func (j *JiveData) Fetch(query string, lang language.Tag) ([]*Item, error)

Fetch retrieves Wikipedia Items from Jive Data

func (*JiveData) Setup

func (j *JiveData) Setup() error

Setup performs setup actions

type Labels

type Labels map[string]Text

Labels holds the labels for an Item

func (*Labels) Scan

func (l *Labels) Scan(value interface{}) error

Scan unmarshals jsonb data

type Member

type Member struct {
	Item  []Wikidata `json:"item,omitempty"`
	Start []DateTime `json:"start,omitempty" property:"P580"`
	End   []DateTime `json:"end,omitempty" property:"P582"`
	Date  []DateTime `json:"date,omitempty" property:"P585"` // some don't have start/end time just a point-in-time.
}

Member is a part of a group (band, etc)

type Military

type Military struct {
	Item  []Wikidata `json:"item,omitempty"`
	Start []DateTime `json:"start,omitempty" property:"P580"`
	End   []DateTime `json:"end,omitempty" property:"P582"`
}

Military is a person's history in the military

type Nomination

type Nomination struct {
	Item []Wikidata `json:"item,omitempty"`
	For  []Wikidata `json:"for,omitempty" property:"P1686"`
	Date []DateTime `json:"date,omitempty" property:"P585"`
}

Nomination is a nomination for an award

type Population

type Population struct {
	Value []Quantity `json:"value,omitempty"`
	Date  []DateTime `json:"date,omitempty" property:"P585"`
}

Population is a point-in-time value of a country's population

type PostgreSQL

type PostgreSQL struct {
	*sql.DB
}

PostgreSQL contains our client and database info

func (*PostgreSQL) Dump

func (p *PostgreSQL) Dump(ft FileType, lang language.Tag, rows chan interface{}) error

Dump creates a temporary table and dumps rows via our transaction
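
The rows parameter suggests a producer/consumer split: a parser sends decoded rows on the channel while Dump writes them to the database. The sketch below shows only that channel pattern, with no database involved. The dump function, the row type, and the insert callback are all hypothetical stand-ins, and the assumption is that the producer closes the channel when it is finished.

```go
package main

import "fmt"

// row is a hypothetical record type; the real Dump accepts interface{}.
type row struct{ Title string }

// dump receives rows until the channel is closed and returns how many
// were handled. It stops at the first error returned by insert.
func dump(rows chan interface{}, insert func(interface{}) error) (int, error) {
	n := 0
	for r := range rows {
		if err := insert(r); err != nil {
			return n, err
		}
		n++
	}
	return n, nil
}

func main() {
	rows := make(chan interface{})
	go func() {
		defer close(rows) // closing the channel ends the consumer's loop
		for _, t := range []string{"Go", "Rust", "Swift"} {
			rows <- row{Title: t}
		}
	}()
	n, err := dump(rows, func(r interface{}) error {
		if _, ok := r.(row); !ok {
			return fmt.Errorf("unexpected row type %T", r)
		}
		return nil
	})
	fmt.Println(n, err)
}
```

In a real implementation the insert step would run inside the transaction, and an early return on error would also need to signal the producer to stop sending.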

func (*PostgreSQL) Fetch

func (p *PostgreSQL) Fetch(query string, lang language.Tag) ([]*Item, error)

Fetch retrieves an Item from PostgreSQL. See https://www.wikidata.org/w/api.php

func (*PostgreSQL) Setup

func (p *PostgreSQL) Setup() error

Setup creates our functions

type Quantity

type Quantity struct {
	Amount string   `json:"amount,omitempty"`
	Unit   Wikidata `json:"unit,omitempty"`
}

Quantity is a Wikipedia quantity
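
Amount is a string because Wikidata encodes quantity amounts as signed decimal strings such as "+1.85". The sketch below shows one way to get a number out of it; the Float method is hypothetical, and Unit is simplified to a string here instead of the Wikidata type.

```go
package main

import (
	"fmt"
	"strconv"
)

// Quantity is a simplified stand-in for the documented type.
type Quantity struct {
	Amount string
	Unit   string
}

// Float parses Amount as a float64. strconv.ParseFloat accepts the
// leading "+" sign, so no trimming is needed.
func (q Quantity) Float() (float64, error) {
	return strconv.ParseFloat(q.Amount, 64)
}

func main() {
	q := Quantity{Amount: "+1.85", Unit: "metre"}
	v, err := q.Float()
	if err != nil {
		panic(err)
	}
	fmt.Println(v, q.Unit)
}
```

Converting to float64 can lose precision for very long decimals, which is one reason to keep the original string around.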

type Spouse

type Spouse struct {
	Item  []Wikidata `json:"item,omitempty"`
	Start []DateTime `json:"start,omitempty" property:"P580"`
	End   []DateTime `json:"end,omitempty" property:"P582"`       // do we also need P585 as we do for Partner?
	Place []Wikidata `json:"location,omitempty" property:"P2842"` // AKA Location P276
}

Spouse represents a person's spouse or partner

type Synonym

type Synonym struct {
	Language string `json:"language,omitempty"`
	Word     string `json:"word,omitempty"`
}

Synonym is a Wiktionary link to another word

type Team

type Team struct {
	Item     []Wikidata `json:"item,omitempty"`
	Start    []DateTime `json:"start,omitempty" property:"P580"`
	End      []DateTime `json:"end,omitempty" property:"P582"`
	Position []Wikidata `json:"position,omitempty" property:"P413"`
	Number   []string   `json:"number,omitempty" property:"P1618"`
}

Team represents a team on which a person played

type Text

type Text struct {
	Text     string `json:"value,omitempty"`
	Language string `json:"language,omitempty"`
}

Text is a language and value

type Wikidata

type Wikidata struct {
	ID           string `json:"id,omitempty"`
	Labels       `json:"labels,omitempty"`
	Aliases      `json:"aliases,omitempty"`
	Descriptions `json:"descriptions,omitempty"`
	*Claims      `json:"claims,omitempty"`
}

Wikidata is a Wikidata item

func (*Wikidata) UnmarshalJSON

func (w *Wikidata) UnmarshalJSON(b []byte) error

UnmarshalJSON formats and extracts only the info we need from claims

type Wikipedia

type Wikipedia struct {
	ID           string   `json:"wikibase_item"`
	Language     string   `json:"language"`
	OutgoingLink []string `json:"outgoing_link,omitempty"`
	Popularity   float64  `json:"popularity_score,omitempty"`
	Title        string   `json:"title"`
	Text         string   `json:"text"`
	// contains filtered or unexported fields
}

Wikipedia holds the summary text of an article

func (*Wikipedia) UnmarshalJSON

func (w *Wikipedia) UnmarshalJSON(data []byte) error

UnmarshalJSON truncates the text

type Wikiquote

type Wikiquote struct {
	ID       string   `json:"wikibase_item,omitempty"`
	Language string   `json:"language,omitempty"`
	Source   string   `json:"source_text,omitempty"` // "text" isn't parseable
	Quotes   []string `json:"quotes,omitempty"`
}

Wikiquote holds the summary text of an article. Another option is the XML dump: https://dumps.wikimedia.org/enwikiquote/20180201/enwikiquote-20180201-pages-articles-multistream.xml.bz2

func (*Wikiquote) UnmarshalJSON

func (w *Wikiquote) UnmarshalJSON(data []byte) error

UnmarshalJSON extracts the raw quotes from the source_text

type Wiktionary

type Wiktionary struct {
	Title    string `json:"title"`
	Language string `json:"language,omitempty"`
	Source   string `json:"source_text,omitempty"` // "text" isn't parseable
	// Etymology     string // origin of the word...not implemented yet
	// Pronunciation string // not implemented yet
	Definitions []*Definition `json:"definitions,omitempty"`
}

Wiktionary holds the structure for a word and its definition(s)

func (*Wiktionary) UnmarshalJSON

func (w *Wiktionary) UnmarshalJSON(data []byte) error

UnmarshalJSON extracts the raw info needed from the source_text

Directories

Path Synopsis
cmd
dumper
Dumper downloads and dumps wikipedia/wikidata/wikiquotes data to a postgresql database.
