OpenDataLink

Version: v0.0.0-...-3c9b827
Published: Apr 20, 2022 License: MIT

README

Introduction

Open Data Link is a search engine for open data. It supports the following search methods:

  • Semantic keyword search over metadata
  • Similar dataset search using semantic similarity of metadata
  • Joinable table search
  • Unionable table search

System overview

There are three main components:

  1. The crawler script downloads datasets and metadata from Socrata.
  2. sketch_columns and process_metadata create data sketches and metadata embedding vectors, respectively.
  3. The server builds indices on the data columns and metadata and serves the frontend.

Development guide

Run crawler

scripts/download_socrata_datasets.sh [app token file]

Sketch dataset columns

Create the column_sketches table:

sqlite3 opendatalink.sqlite < sql/create_column_sketches_table.sql

Run sketch_columns to sketch (minhash) dataset columns and store them in the column_sketches table:

go run cmd/sketch_columns/main.go
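
To make the idea of a minhash sketch concrete, here is a minimal, self-contained sketch of the technique. It is not the code in cmd/sketch_columns: the signature length (numHashes), the seeded FNV-1a hash family, and the function names sketch and estimateJaccard are all assumptions chosen for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"math"
)

// numHashes is a hypothetical signature length; the real tool may differ.
const numHashes = 128

// sketch returns a MinHash signature for a column's values. Slot i holds
// the minimum, over all values, of a hash seeded with i.
func sketch(values []string) []uint64 {
	sig := make([]uint64, numHashes)
	for i := range sig {
		sig[i] = math.MaxUint64
	}
	var seed [8]byte
	for _, v := range values {
		for i := range sig {
			// Prefix the value with the slot index to get a distinct hash per slot.
			binary.LittleEndian.PutUint64(seed[:], uint64(i))
			h := fnv.New64a()
			h.Write(seed[:])
			h.Write([]byte(v))
			if x := h.Sum64(); x < sig[i] {
				sig[i] = x
			}
		}
	}
	return sig
}

// estimateJaccard estimates the Jaccard similarity of two value sets as
// the fraction of signature slots on which their sketches agree.
func estimateJaccard(a, b []uint64) float64 {
	same := 0
	for i := range a {
		if a[i] == b[i] {
			same++
		}
	}
	return float64(same) / float64(len(a))
}

func main() {
	cities := sketch([]string{"Toronto", "Ottawa", "Montreal", "Calgary"})
	others := sketch([]string{"Toronto", "Ottawa", "Vancouver", "Halifax"})
	fmt.Printf("estimated Jaccard similarity: %.2f\n", estimateJaccard(cities, others))
}
```

Because a signature depends only on the set of distinct values, duplicate rows and row order do not change it. Comparing signatures instead of full columns is what makes overlap estimation cheap enough for searching across many tables.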

Build fastText database

curl -O https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
unzip crawl-300d-2M.vec.zip
go run cmd/build_fasttext/main.go < crawl-300d-2M.vec

This will create the fasttext.sqlite database.

Process metadata

Create the metadata and metadata_vectors tables:

sqlite3 opendatalink.sqlite < sql/create_metadata_tables.sql

Run process_metadata:

go run cmd/process_metadata/main.go

This will create metadata embedding vectors for each dataset and save them in the metadata_vectors table. The metadata is saved in the metadata table.
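
The wordemb package is described as creating embedding vectors for text by averaging word vectors. A minimal sketch of that idea follows; the in-memory map stands in for the fastText database, and the function name embed, the whitespace tokenization, and the lowercasing are assumptions rather than the package's actual behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// embed averages the vectors of the words in text that appear in vectors.
// Words without a vector are skipped. It returns false if no word matched.
// (Hypothetical helper; the real wordemb API may look different.)
func embed(text string, vectors map[string][]float32, dim int) ([]float32, bool) {
	sum := make([]float32, dim)
	n := 0
	for _, w := range strings.Fields(strings.ToLower(text)) {
		v, ok := vectors[w]
		if !ok || len(v) != dim {
			continue
		}
		for i := range sum {
			sum[i] += v[i]
		}
		n++
	}
	if n == 0 {
		return nil, false
	}
	for i := range sum {
		sum[i] /= float32(n)
	}
	return sum, true
}

func main() {
	// Toy 2-dimensional vectors; real fastText vectors have 300 dimensions.
	vectors := map[string][]float32{
		"bike":  {1, 0},
		"lanes": {0, 1},
	}
	emb, ok := embed("Bike lanes map", vectors, 2)
	fmt.Println(emb, ok) // prints "[0.5 0.5] true"
}
```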

Start server

go run cmd/server/main.go

Configuring database paths

The server, sketch_columns, and process_metadata look for databases named opendatalink.sqlite and fasttext.sqlite in the current directory by default. Alternate paths can be specified in the OPENDATALINK_DB and FASTTEXT_DB environment variables.

Directories

Path                    Synopsis
cmd/build_fasttext      Command build_fasttext builds a fastText SQLite database.
cmd/metadata_index      Command metadata_index is a command-line interface for testing the metadata embedding index.
cmd/process_metadata    Command process_metadata creates metadata embedding vectors and stores the metadata and the vectors in the Open Data Link database.
cmd/profile             Serves the Open Data Link frontend with CPU profiling enabled.
cmd/server              Command server serves the Open Data Link frontend.
cmd/sketch_columns      Command sketch_columns sketches dataset columns and stores the sketches in the Open Data Link database.
internal/database       Package database provides a wrapper of sql.DB for working with the Open Data Link database.
internal/server         Package server defines the Server type for serving the Open Data Link frontend.
internal/vec32          Package vec32 provides operations on float32 vectors.
internal/wordemb        Package wordemb creates embedding vectors for text by averaging word vectors.
