cui2vec

package module
v0.0.0-...-d05e622 Latest Latest
Published: Feb 14, 2020 License: MIT Imports: 18 Imported by: 6

README

cui2vec

Package cui2vec implements utilities for working with cui2vec embeddings and for mapping CUIs to text.

Paper (author not affiliated with this code): https://arxiv.org/pdf/1804.01486.pdf

Documentation: https://godoc.org/github.com/hscells/cui2vec

Example: See cmd/cui2vec/main.go

Data

Pre-trained embeddings (model) can be downloaded from https://figshare.com/s/00d69861786cd0156d81.

A version of these pre-trained embeddings with pre-computed distances is included in the testdata folder.


Example structure of the mapping file:

ICUI,title
5,(131)i-maa
39,dipalmitoylphosphatidylcholine
96,1-methyl-3-isobutylxanthine
107,"1-(n-methylglycine)-8-l-isoleucine-angiotensin ii"
139,"16,16-dimethyl-pge2"
151,"17 beta hydroxy 5 beta androstan 3 one"
167,17-ketosteroids
172,18-hydroxycorticosterone
173,"18 hydroxydesoxycorticosterone"

One way this can be constructed is detailed in:

Jimmy, Zuccon G., Koopman B. (2018) Choices in Knowledge-Base Retrieval for Consumer Health Search.
In: Pasi G., Piwowarski B., Azzopardi L., Hanbury A. (eds) Advances in Information Retrieval. ECIR 2018.
Lecture Notes in Computer Science, vol 10772. Springer, Cham

Command-line

The command-line utility can be installed with:

go install github.com/hscells/cui2vec/cmd/cui2vec

Usage: cui2vec [--cui CUI] [--model MODEL] [--type TYPE] [--skipfirst] [--numcuis NUMCUIS] [--mapping MAPPING] [--verbose]

Options:
  --cui CUI
  --model MODEL
  --type TYPE
  --skipfirst
  --numcuis NUMCUIS, -n NUMCUIS
  --mapping MAPPING
  --verbose, -v
  --help, -h             display this help and exit
  --version              display version and exit

Pre-computing distances

A tool, pcdvec, is included that pre-computes the distances between CUIs. This compresses the embeddings and speeds up finding similar CUIs.

Install the tool with:

go get github.com/hscells/cui2vec/cmd/pcdvec

Usage: pcdvec --cui CUI [--output OUTPUT] [--skipfirst]

Options:
  --cui CUI
  --output OUTPUT, -o OUTPUT
  --skipfirst
  --help, -h             display this help and exit
  --version              display version and exit

Documentation

Overview

Package cui2vec implements utilities for working with cui2vec Embeddings and for mapping CUIs to text.

Constants

This section is empty.

Variables

This section is empty.

Functions

func CUI2Int

func CUI2Int(cui string) (int, error)

CUI2Int converts a string CUI into an integer.

func Cosine

func Cosine(x, y []float64) (float64, error)

Cosine returns the cosine similarity between two vectors.

func Int2CUI

func Int2CUI(val int) string

Int2CUI converts an integer value to a CUI.
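
The two conversions can be pictured as a round trip. The sketch below assumes the common UMLS convention that a CUI is the letter C followed by seven zero-padded digits (for example C0000005); this format, and the helper names cuiToInt and intToCUI, are assumptions for illustration and may differ from what CUI2Int and Int2CUI actually accept or produce.

```go
package main

import (
	"fmt"
	"strconv"
)

// cuiToInt strips the assumed leading 'C' and parses the remaining digits.
func cuiToInt(cui string) (int, error) {
	if len(cui) < 2 || cui[0] != 'C' {
		return 0, fmt.Errorf("invalid CUI %q", cui)
	}
	return strconv.Atoi(cui[1:])
}

// intToCUI formats an integer as 'C' plus seven zero-padded digits.
func intToCUI(v int) string {
	return fmt.Sprintf("C%07d", v)
}

func main() {
	n, err := cuiToInt("C0000005")
	if err != nil {
		panic(err)
	}
	fmt.Println(n, intToCUI(n))
}
```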

Types

type AliasMapping

type AliasMapping map[string][]string

func LoadCUIAliasMapping

func LoadCUIAliasMapping(path string) (AliasMapping, error)

type Concept

type Concept struct {
	CUI   string
	Value float64
}

Concept is a CUI that has a similarity score in relation to a target CUI.

func Softmax

func Softmax(z []Concept) []Concept

Softmax normalises a slice of concepts.

type Embeddings

type Embeddings interface {
	LoadModel(r io.Reader) error
	Similar(cui string) ([]Concept, error)
}

Embeddings is a complete cui2vec file loaded into memory.

type Mapping

type Mapping map[string]string

func LoadCUIFrequencyMapping

func LoadCUIFrequencyMapping(path string) (Mapping, error)

func LoadCUIMapping

func LoadCUIMapping(path string) (Mapping, error)

LoadCUIMapping loads a mapping of cui to most common title.

Mapping of cuis->title is constructed as per: Jimmy, Zuccon G., Koopman B. (2018) Choices in Knowledge-Base Retrieval for Consumer Health Search. In: Pasi G., Piwowarski B., Azzopardi L., Hanbury A. (eds) Advances in Information Retrieval. ECIR 2018. Lecture Notes in Computer Science, vol 10772. Springer, Cham

File must reflect this.

func (Mapping) Invert

func (m Mapping) Invert() Mapping

type PrecomputedEmbeddings

type PrecomputedEmbeddings struct {
	Matrix [][]int
	Cols   int
}

PrecomputedEmbeddings is a type of cui2vec container where the distances between CUIs have been pre-computed. It contains a sparse Matrix where the rows are CUIs and the columns are the distances to other CUIs. Each row is formatted in the form [CUI, score, CUI, score, ...]. Each CUI must be converted back to a string, and each score must be re-normalised from an int back to a float (taken care of by the Similar method).

func NewPrecomputedEmbeddings

func NewPrecomputedEmbeddings(r io.Reader) (*PrecomputedEmbeddings, error)

func (*PrecomputedEmbeddings) LoadModel

func (v *PrecomputedEmbeddings) LoadModel(r io.Reader) error

LoadModel reads a model from disk into memory. The pre-computed distances file is a single, continuous byte sequence. Its first four bytes encode a single Uint32 giving the number of rows in the matrix, which is used to create a fixed-size sparse matrix. The `Cols` attribute of the `PrecomputedEmbeddings` type then determines how many four-byte Uint32 numbers are read at a time to populate the columns of each row.

func (*PrecomputedEmbeddings) Similar

func (v *PrecomputedEmbeddings) Similar(cui string) ([]Concept, error)

Similar matches a given input CUI to the `Cols`-closest CUIs in the cui2vec embedding space. Each row in the matrix is encoded as (CUI, score) pairs, and this method decodes them, converting each int value in the matrix into either a string CUI or a re-normalised softmax score float64.
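
To make the row layout concrete, the sketch below decodes one row of the form [CUI, score, CUI, score, ...] into concepts. Several details are assumptions for illustration only: the scoreScale constant (the actual int-to-float conversion is not documented here), the C plus seven digits CUI format, and treating a (0, 0) pair as the start of zero padding. The helper name decodeRow is hypothetical.

```go
package main

import "fmt"

// concept mirrors the documented Concept struct for this standalone sketch.
type concept struct {
	CUI   string
	Value float64
}

// scoreScale is a hypothetical fixed-point factor; the package's real
// conversion from int scores back to floats may differ.
const scoreScale = 1e6

// decodeRow walks the row two ints at a time: first the CUI, then its score.
func decodeRow(row []int) []concept {
	out := make([]concept, 0, len(row)/2)
	for i := 0; i+1 < len(row); i += 2 {
		if row[i] == 0 && row[i+1] == 0 {
			break // assumed zero padding at the end of a short row
		}
		out = append(out, concept{
			CUI:   fmt.Sprintf("C%07d", row[i]),
			Value: float64(row[i+1]) / scoreScale,
		})
	}
	return out
}

func main() {
	for _, c := range decodeRow([]int{5, 250000, 39, 750000, 0, 0}) {
		fmt.Printf("%s %.2f\n", c.CUI, c.Value)
	}
}
```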

func (*PrecomputedEmbeddings) WriteModel

func (v *PrecomputedEmbeddings) WriteModel(w io.Writer) error

WriteModel writes a pre-computed distance matrix to disk. The write begins with a four-byte sequence to be parsed as a Uint32 representing the size of the matrix. Each value of the matrix is then written one by one in a continuous byte sequence, where each element in the matrix is encoded as a four-byte sequence to be parsed as a Uint32. Elements of the matrix are written row-by-row, and each row is exactly `Cols` wide. If there are fewer than `Cols` elements in a row, the row is padded with zeros.

type SimResponse

type SimResponse struct {
	V []Concept
}

type UncompressedEmbeddings

type UncompressedEmbeddings struct {
	SkipFirst  bool
	Comma      rune
	Embeddings map[string][]float64
}

func NewUncompressedEmbeddings

func NewUncompressedEmbeddings(r io.Reader, skipFirst bool, comma rune) (*UncompressedEmbeddings, error)

func (*UncompressedEmbeddings) LoadModel

func (v *UncompressedEmbeddings) LoadModel(r io.Reader) error

LoadModel loads a cui2vec pre-trained model into memory. The pre-trained file from:

https://arxiv.org/pdf/1804.01486.pdf

which was downloaded from:

https://figshare.com/s/00d69861786cd0156d81

is a CSV file. The skipFirst parameter determines whether the first line of the file should be skipped.

func (*UncompressedEmbeddings) Similar

func (v *UncompressedEmbeddings) Similar(cui string) ([]Concept, error)

Similar computes CUIs that are similar to an input CUI. The distance function used is Cosine similarity. The CUIs are then run through Softmax and sorted.

type VecClient

type VecClient struct {
	// contains filtered or unexported fields
}

func NewVecClient

func NewVecClient(addr string) (*VecClient, error)

func (*VecClient) Sim

func (c *VecClient) Sim(cui string) ([]Concept, error)

func (*VecClient) Vec

func (c *VecClient) Vec(cui string) ([]float64, error)

type VecResponse

type VecResponse struct {
	V []float64
}

Directories

Path Synopsis
cmd
