go-conllu

CoNLL-U parser written in Go. Converts CoNLL-U files to in-memory structs.

The Computational Natural Language Learning - U format (CoNLL-U) is used by the Universal Dependencies project to represent natural language annotations. go-conllu parses CoNLL-U files and exposes the data as in-memory Go structs.
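For orientation, a word line in a CoNLL-U file carries ten tab-separated columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS and MISC. The sketch below splits one such line using only the standard library. The helper splitColumns and the sample line are illustrative assumptions; they are not part of go-conllu and do not show how the package parses internally.

```go
package main

import (
	"fmt"
	"strings"
)

// columnNames lists the ten CoNLL-U columns in file order.
var columnNames = []string{
	"ID", "FORM", "LEMMA", "UPOS", "XPOS",
	"FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
}

// splitColumns is a hypothetical helper (not part of go-conllu) that maps
// each column name to its raw string value for a single word line.
func splitColumns(line string) (map[string]string, error) {
	fields := strings.Split(line, "\t")
	if len(fields) != len(columnNames) {
		return nil, fmt.Errorf("expected %d columns, got %d", len(columnNames), len(fields))
	}
	cols := make(map[string]string, len(fields))
	for i, name := range columnNames {
		cols[name] = fields[i]
	}
	return cols, nil
}

func main() {
	// Made-up sample line; "_" marks an unspecified value.
	line := "1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_"
	cols, err := splitColumns(line)
	if err != nil {
		panic(err)
	}
	for _, name := range columnNames {
		fmt.Printf("%-6s %s\n", name, cols[name])
	}
}
```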

⚙️ Installation

go get github.com/nuvi/go-conllu

🚀 Quick Start

package main

import (
	"fmt"
	"log"

	conllu "github.com/nuvi/go-conllu"
)

func main() {
	sentences, err := conllu.ParseFile("../../test_data/en_ewt-ud-train.small.conllu")
	if err != nil {
		log.Fatal(err)
	}

	for _, sentence := range sentences {
		for _, token := range sentence.Tokens {
			fmt.Println(token)
		}
		fmt.Println()
	}
}

Issues

All issues should be submitted via the issues tab on GitHub. Please provide the code and data we need to reproduce the issue.

💬 Contact

Feel free to reach out to the maintainers on Twitter with questions or comments.

Transitive Dependencies

None, and we plan to keep it that way.

👏 Contributing

We love help! Contribute by forking the repo and opening pull requests. Please ensure that your code passes the existing tests and linting processes, and write new tests to test your changes if applicable.

All pull requests should be submitted to the "master" branch.

go test
go fmt

Documentation

Types

type Dep

type Dep struct {
	Head   float64
	Deprel string
}

Dep is a representation of a single part of the enhanced dependency graph.
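As an illustration of how a raw DEPS column could map onto this struct, the sketch below turns a string such as "2:nsubj|8.1:conj:and" into a slice of Dep values. It assumes the CoNLL-U convention of "|"-separated head:deprel pairs with "_" for an empty column. The helper parseDeps is hypothetical and the Dep type is redeclared locally so the example runs on its own; go-conllu's actual parsing code may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Dep is a local copy of the struct documented above.
type Dep struct {
	Head   float64
	Deprel string
}

// parseDeps is a hypothetical helper, not part of go-conllu.
// It splits on "|" and then on the first ":" only, because a relation
// may itself contain colons (for example "conj:and").
func parseDeps(field string) ([]Dep, error) {
	if field == "_" || field == "" {
		return nil, nil // no enhanced dependencies
	}
	var deps []Dep
	for _, pair := range strings.Split(field, "|") {
		head, rel, ok := strings.Cut(pair, ":")
		if !ok {
			return nil, fmt.Errorf("malformed DEPS entry %q", pair)
		}
		h, err := strconv.ParseFloat(head, 64)
		if err != nil {
			return nil, fmt.Errorf("bad head in %q: %w", pair, err)
		}
		deps = append(deps, Dep{Head: h, Deprel: rel})
	}
	return deps, nil
}

func main() {
	deps, err := parseDeps("2:nsubj|8.1:conj:and")
	if err != nil {
		panic(err)
	}
	for _, d := range deps {
		fmt.Printf("head=%v deprel=%s\n", d.Head, d.Deprel)
	}
}
```

A float64 head leaves room for decimal IDs such as 8.1, which CoNLL-U uses for empty nodes.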

type MorphologicalFeature

type MorphologicalFeature struct {
	Feature string
	Value   string
}

MorphologicalFeature is a feature from the universal feature inventory (https://universaldependencies.org/u/feat/index.html) or from a defined language-specific extension (https://universaldependencies.org/ext-feat-index.html).
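To make the Feature/Value pairing concrete, here is a minimal sketch that reads a FEATS column such as "Number=Sing|Person=3" into a slice and looks up one feature by name. The helpers parseFeats and featureValue are hypothetical, and the type is redeclared locally so the example is self-contained; this is not a description of how go-conllu itself builds the slice.

```go
package main

import (
	"fmt"
	"strings"
)

// MorphologicalFeature is a local copy of the struct documented above.
type MorphologicalFeature struct {
	Feature string
	Value   string
}

// parseFeats is a hypothetical helper, not part of go-conllu.
// It assumes "|"-separated Name=Value pairs and "_" for no features.
func parseFeats(field string) ([]MorphologicalFeature, error) {
	if field == "_" || field == "" {
		return nil, nil
	}
	var feats []MorphologicalFeature
	for _, pair := range strings.Split(field, "|") {
		name, value, ok := strings.Cut(pair, "=")
		if !ok {
			return nil, fmt.Errorf("malformed FEATS entry %q", pair)
		}
		feats = append(feats, MorphologicalFeature{Feature: name, Value: value})
	}
	return feats, nil
}

// featureValue returns the value of the named feature, if present.
func featureValue(feats []MorphologicalFeature, name string) (string, bool) {
	for _, f := range feats {
		if f.Feature == name {
			return f.Value, true
		}
	}
	return "", false
}

func main() {
	feats, err := parseFeats("Number=Sing|Person=3|Tense=Pres")
	if err != nil {
		panic(err)
	}
	if v, ok := featureValue(feats, "Number"); ok {
		fmt.Println("Number:", v)
	}
}
```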

type Sentence

type Sentence struct {
	Tokens []Token
}

Sentence represents a sentence of parsed CoNLL-U tokens.

func Parse

func Parse(r io.Reader) ([]Sentence, error)

Parse parses CoNLL-U data via the io.Reader interface and returns all of the tokens found, grouped into sentences. Parse doesn't close the reader when finished; that must be done manually.

func ParseFile

func ParseFile(filepath string) ([]Sentence, error)

ParseFile opens, reads, and parses a file in CoNLL-U format and returns all of the tokens found, grouped into sentences. ParseFile is a convenience wrapper for the Parse() function when working with files on disk.

type Token

type Token struct {
	// Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens;
	// may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0)
	ID float64

	Form string // Word form or punctuation symbol

	Lemma string // Lemma or stem of word form

	UPOS string // Universal part-of-speech tag

	XPOS string // Language-specific part-of-speech tag; empty if not available

	// List of morphological features, which are described on the type; nil if not available
	Feats []MorphologicalFeature

	// Head of the current word, which is either the id of the head token for this word, or 0 if none
	// https://universaldependencies.org/format.html#syntactic-annotation
	Head float64

	// Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one
	Deprel string

	// Enhanced dependency graph in the form of a list of head-deprel pairs. See Dep type for more information; nil if none.
	// Dependencies that are shared between the basic and the enhanced dependency representations must be repeated in the Deps field
	Deps []Dep

	// Any other annotation, represented as a list separated by "|". Nil if none.
	// https://universaldependencies.org/format.html#miscellaneous
	Misc []string
}

Token represents a single token, e.g. "hello" or "goodbye", and holds all associated annotations. https://universaldependencies.org/format.html#conll-u-format
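The ID, Head and Deprel fields together describe the basic dependency tree. As a rough sketch of how they relate, the example below resolves each token's Head to the form of its head word and labels the root (Head equal to 0). The Token type here is a trimmed local copy with only the fields needed, the helper describeHeads is hypothetical, and the two-word sentence is made up for illustration.

```go
package main

import "fmt"

// Token is a trimmed local copy of the documented struct,
// keeping only the fields this example needs.
type Token struct {
	ID     float64
	Form   string
	Head   float64
	Deprel string
}

// describeHeads is a hypothetical helper, not part of go-conllu.
// It returns one line per token of the form "word --deprel--> head".
func describeHeads(tokens []Token) []string {
	// Index word forms by ID so heads can be looked up directly.
	forms := make(map[float64]string, len(tokens))
	for _, t := range tokens {
		forms[t.ID] = t.Form
	}

	lines := make([]string, 0, len(tokens))
	for _, t := range tokens {
		head := "ROOT" // Head == 0 means the token attaches to the root
		if t.Head != 0 {
			head = forms[t.Head]
		}
		lines = append(lines, fmt.Sprintf("%s --%s--> %s", t.Form, t.Deprel, head))
	}
	return lines
}

func main() {
	sentence := []Token{
		{ID: 1, Form: "Dogs", Head: 2, Deprel: "nsubj"},
		{ID: 2, Form: "bark", Head: 0, Deprel: "root"},
	}
	for _, line := range describeHeads(sentence) {
		fmt.Println(line)
	}
}
```

With the real package, the same loop would presumably run over sentence.Tokens from each parsed Sentence.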

Directories

Path Synopsis
examples