mwsapi

v0.0.0-...-0e8dd81
Published: Oct 9, 2019 License: GPL-3.0

mws-api

A Go library and set of tools to set up, query, and maintain a MathWebSearch and Temasearch instance.

Overview

  • cmd: Implementation of commands
    • cmd/temaquery: Queries a joined MathWebSearch + ElasticSearch setup
    • cmd/mwsapid: HTTP Daemon serving temasearch queries
    • cmd/mwsquery: Queries a (plain) MathWebSearch instance for MathWebSearch Queries
    • cmd/elasticquery: Queries a (plain) Elasticsearch instance for Tema Queries
    • cmd/elasticsync: Creates and maintains an Elasticsearch instance for use with Temasearch
  • connection: Contains Connection Code to MathWebSearch and ElasticSearch
  • engine: Underlying code used by commands above
    • engine/elasticsync: Creates and maintains an Elasticsearch instance for use with Temasearch
    • engine/elasticengine: ElasticSearch only queries
    • engine/mwsengine: MathWebSearch queries
    • engine/temaengine: TemaSearch Queries
  • query: Implements query parsing and serializing
  • result: Implements result parsing and serializing
  • utils: General utility functions

Processes

In the following we describe the basic functionality of the programs within this repository and how they work together. This documentation is intended to serve as an entry point, and thus does not describe all implementation details. The most detailed reference is always the source code.

In principle, the source code for all commands is found in the appropriate cmd/ subdirectory. Executables can be built either with the standard go build ./cmd/$CMD or simply with make $CMD. The binaries will be placed in the root directory.

MathWebSearch Query

The program in cmd/mwsquery can run plain MathWebSearch Queries. Queries are defined by the Query Struct.

Each Query consists of a list of MathWebSearch Expressions. A MathWebSearch Expression is a ContentMathML expression in XML Syntax with additional support for Query variables. For this one can use <mws:qvar> tags to specify universal variables. The text content of a <mws:qvar> is considered its name and qvars with the same name will match to the same expressions.

Each Expression can be given as an argument to the mwsquery executable. For example:

./mwsquery '<mws:qvar>x</mws:qvar>'

Normal results are returned as JSON to STDOUT. The results are defined by the Result Struct.

All queries are paginated -- by default they return the first 10 results. The parameters -from and -size can be used to customize the result set.
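
As a rough illustration of how -from and -size select a window of results, the following sketch computes the bounds of one page. The helper name pageBounds and its handling of negative values are assumptions made for this example and are not taken from the library; only the default size of 10 comes from the text above.

```go
package main

import "fmt"

// defaultSize mirrors the documented default of 10 results per page.
const defaultSize = 10

// pageBounds is a hypothetical helper. It returns the half-open range
// [start, end) of results to return for a result set of the given total.
// Assumption: a negative from is treated as 0 and a negative size falls
// back to the default.
func pageBounds(total, from, size int) (start, end int) {
	if from < 0 {
		from = 0
	}
	if size < 0 {
		size = defaultSize
	}
	start = from
	if start > total {
		start = total
	}
	end = start + size
	if end > total {
		end = total
	}
	return start, end
}

func main() {
	// Third page of a 25-element result set with the default page size.
	start, end := pageBounds(25, 20, defaultSize)
	fmt.Printf("returning results %d up to (not including) %d\n", start, end)
}
```

A page that starts beyond the end of the result set is simply empty in this sketch; the real tools may report such a request differently.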

Sometimes it is only important how many results are returned, not the results themselves. For this purpose the -count flag can be provided.

Additionally, instead of returning the full results, it is sometimes desirable to return only the id of each formula found. This can be useful for debugging and for use inside a full TemaSearch scenario. To achieve this, the -ids flag can be provided.

Elasticsearch Synchronization

The program in cmd/elasticsync creates and maintains an Elasticsearch Index for use with Temasearch.

A Temasearch Index is a set of JSON objects conforming to the Temasearch Harvest Element Schema. In the following we call each such document a Harvest Element. A group of elements belonging to a single .harvest file (which itself belongs to one source file) is usually contained in one line of an Elasticsearch index file ending in .json. For backward compatibility, an additional document containing legacy meta-information should be stored between consecutive lines of items in the index. These .json files are stored within one folder on disk.

In order to make this index queryable, it needs to be kept in sync with an appropriate Elasticsearch index. To achieve this, one could in principle perform the following process to synchronize the disk with the index:

  • Delete all existing indexed documents from Elasticsearch (if any)
  • Read each .json file from disk and then
  • add the documents contained inside of it to Elasticsearch

This approach does not scale well with large datasets. Deleting the entire database, only to add the same content back, is too slow.

Instead we split the Temasearch index into its different files and treat each file separately. We call each file a segment. To synchronize an updated on-disk index into Elasticsearch, we roughly do the following:

  • Mark all existing segments in the database as 'untouched'
  • For each segment of the on-disk index to be added:
    • compute a hash of the segment
    • check if this segment with the same name is already stored in the database by comparing the hash
      • if yes, we do not need to do anything as it has not changed
      • if no, we remove the old segment documents (if any) and add the new documents belonging to this hash
    • mark the segment as 'touched' within this synchronization process
  • Delete the documents belonging to any segment still marked as 'untouched'

This process is far more efficient -- it only updates documents in the database that have actually changed. However, it requires that two separate Elasticsearch indexes are maintained. The first index -- called tema by convention -- contains the TemaSearch Index Documents and is the more obvious one. The second index is called tema-segments and contains a list of known segments as well as their hashes. Hashes are computed using SHA256.
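
The segment-based process above can be sketched in a few lines of Go. This is a simplified illustration, not the code in engine/elasticsync: the store type is an in-memory stand-in for the tema and tema-segments indexes, all names are hypothetical, and a set of touched segments replaces the explicit 'untouched' marking, which has the same effect.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// store is an in-memory stand-in (assumption) for the two Elasticsearch indexes.
type store struct {
	hashes map[string]string   // segment name -> hash, like "tema-segments"
	docs   map[string][]string // segment name -> documents, like "tema"
}

func newStore() *store {
	return &store{hashes: map[string]string{}, docs: map[string][]string{}}
}

// hashSegment returns the hex-encoded SHA256 digest of a segment file.
func hashSegment(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

// syncSegments updates the store to match the segments on disk and reports
// how many segments were (re-)indexed and how many were removed.
func (s *store) syncSegments(disk map[string][]byte) (updated, removed int) {
	touched := make(map[string]bool, len(disk))
	for name, content := range disk {
		hash := hashSegment(content)
		if old, ok := s.hashes[name]; !ok || old != hash {
			// New or changed segment: replace its documents (one per line).
			s.docs[name] = strings.Split(string(content), "\n")
			s.hashes[name] = hash
			updated++
		}
		touched[name] = true
	}
	// Segments that were not touched no longer exist on disk.
	for name := range s.hashes {
		if !touched[name] {
			delete(s.hashes, name)
			delete(s.docs, name)
			removed++
		}
	}
	return updated, removed
}

func main() {
	s := newStore()
	updated, removed := s.syncSegments(map[string][]byte{
		"a.json": []byte(`{"text":"first"}`),
		"b.json": []byte(`{"text":"second"}`),
	})
	fmt.Println("initial sync:", updated, "updated,", removed, "removed")

	updated, removed = s.syncSegments(map[string][]byte{
		"a.json": []byte(`{"text":"first"}`),
	})
	fmt.Println("second sync:", updated, "updated,", removed, "removed")
}
```

Unchanged segments cost one hash computation and one lookup, which is what makes repeated synchronization cheap compared to rebuilding the whole index.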

Elasticsearch Query

The program in cmd/elasticquery can run queries against the elasticsearch part of Temasearch. Queries are defined by the Query Struct and consist of two parameters:

  • Some text to search the index for
  • A list of mathwebsearch ids that were found by normal MathWebSearch

A query may have both text and ids to search for, but it must not be empty. These can be provided to elasticquery using the text and ids parameters. For example:

./elasticquery -text "Hello" -ids 1,2,3
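
The rule that a query may carry text, ids, or both, but must not be empty, can be expressed as a small validation step. The sketch below is illustrative only: the elasticQuery type, the comma-separated id parsing and the error value are assumptions for this example and do not reproduce the library's Query struct.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// elasticQuery is a hypothetical stand-in for the library's Query struct.
type elasticQuery struct {
	Text string
	IDs  []int64
}

var errEmptyQuery = errors.New("query needs text, ids, or both")

// parseIDs turns a comma-separated list such as "1,2,3" into numbers.
// An empty string yields no ids.
func parseIDs(s string) ([]int64, error) {
	if strings.TrimSpace(s) == "" {
		return nil, nil
	}
	parts := strings.Split(s, ",")
	ids := make([]int64, 0, len(parts))
	for _, part := range parts {
		id, err := strconv.ParseInt(strings.TrimSpace(part), 10, 64)
		if err != nil {
			return nil, fmt.Errorf("invalid id %q: %w", part, err)
		}
		ids = append(ids, id)
	}
	return ids, nil
}

// newElasticQuery builds a query and rejects it if it is entirely empty.
func newElasticQuery(text, ids string) (elasticQuery, error) {
	parsed, err := parseIDs(ids)
	if err != nil {
		return elasticQuery{}, err
	}
	if text == "" && len(parsed) == 0 {
		return elasticQuery{}, errEmptyQuery
	}
	return elasticQuery{Text: text, IDs: parsed}, nil
}

func main() {
	q, err := newElasticQuery("Hello", "1,2,3")
	fmt.Println(q, err)
}
```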

Normal results are returned as JSON to STDOUT. The results are defined by the Result Struct.

All queries are paginated -- by default they return the first 10 results. The parameters -from and -size can be used to customize the result set.

Sometimes it is only important how many results are returned, not the results themselves. For this purpose the -count flag can be provided.

Internally, each query consists of two phases:

  • The Document Phase. This initially queries elasticsearch to find all matching documents.
  • The Highlight Phase. For each returned document, elasticsearch is queried again to highlight matching segments.

A normal query runs both phases. For debugging, it is possible to run only the document phase by passing the -document-phase-only flag.
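
To make the control flow of the two phases concrete, here is a minimal sketch. It does not talk to Elasticsearch: the phases type, the document type and the fake in-memory implementations are all assumptions introduced for illustration, and the real engine may structure this differently.

```go
package main

import (
	"fmt"
	"strings"
)

// document is a hypothetical, reduced view of a matching document.
type document struct {
	ID       string
	Snippets []string
}

// phases bundles the two steps as function values so that the control
// flow can be shown without a running Elasticsearch instance.
type phases struct {
	findDocuments func(text string) ([]document, error)
	highlight     func(doc document, text string) ([]string, error)
}

// run executes the document phase and, unless documentPhaseOnly is set,
// the highlight phase once per returned document.
func (p phases) run(text string, documentPhaseOnly bool) ([]document, error) {
	docs, err := p.findDocuments(text)
	if err != nil {
		return nil, fmt.Errorf("document phase: %w", err)
	}
	if documentPhaseOnly {
		return docs, nil
	}
	for i := range docs {
		snippets, err := p.highlight(docs[i], text)
		if err != nil {
			return nil, fmt.Errorf("highlight phase for %s: %w", docs[i].ID, err)
		}
		docs[i].Snippets = snippets
	}
	return docs, nil
}

// fakePhases returns in-memory implementations used only for this demo.
func fakePhases() phases {
	index := map[string]string{"doc-1": "hello math world", "doc-2": "unrelated"}
	return phases{
		findDocuments: func(text string) ([]document, error) {
			var docs []document
			for _, id := range []string{"doc-1", "doc-2"} {
				if strings.Contains(index[id], text) {
					docs = append(docs, document{ID: id})
				}
			}
			return docs, nil
		},
		highlight: func(doc document, text string) ([]string, error) {
			marked := strings.ReplaceAll(index[doc.ID], text, "<b>"+text+"</b>")
			return []string{marked}, nil
		},
	}
}

func main() {
	docs, err := fakePhases().run("math", false)
	fmt.Println(docs, err)
}
```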

API Daemon

The program in cmd/mwsapid implements an HTTP Daemon that can answer all the queries above. It can be configured using the command line arguments:

Usage of mwsapid:
  -elastic-host string
        Host to use for elasticsearch
  -elastic-port int
        Port to use for elasticsearch (default 9200)
  -mws-host string
        Host to use for mathwebsearch. If omitted, disable mathwebsearch support
  -mws-port int
        Port to use for mathwebsearch (default 8080)
  -host string
        Host to listen on (default "localhost")
  -port int
        Port to listen on for queries (default 3000)

General Structure

The server supports three kinds of requests -- which are described in more detail below. When using POST requests, all parameters should be encoded using JSON in the body.

For all requests, the server will respond with a JSON object in the body. By default, this corresponds to a simple application/json response. However, when the URL-parameter callback is provided, a JSONP response is sent instead.
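
The switch between a plain JSON and a JSONP response could look roughly like the following. This is a sketch under assumptions, not the handler used by mwsapid: the function names are hypothetical, and the content type used for JSONP and the exact wrapping syntax are guesses.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// writeResponse encodes v as JSON. If the "callback" URL parameter is set,
// the JSON is wrapped in a call to that function (JSONP) instead.
// Note: a production server should validate the callback name before
// echoing it back; this sketch does not.
func writeResponse(w http.ResponseWriter, r *http.Request, status int, v interface{}) {
	body, err := json.Marshal(v)
	if err != nil {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusInternalServerError)
		w.Write([]byte(`"failed to encode response"`))
		return
	}
	if callback := r.URL.Query().Get("callback"); callback != "" {
		w.Header().Set("Content-Type", "application/javascript")
		w.WriteHeader(status)
		fmt.Fprintf(w, "%s(%s);", callback, body)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	w.Write(body)
}

// render runs writeResponse against an in-memory recorder for the given URL.
func render(target string) (contentType, body string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, target, nil)
	writeResponse(rec, req, http.StatusOK, map[string]bool{"ok": true})
	return rec.Header().Get("Content-Type"), rec.Body.String()
}

func main() {
	fmt.Println(render("/"))
	fmt.Println(render("/?callback=handle"))
}
```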

Furthermore, the server makes use of the following status codes:

| Code | Description | Meaning |
| --- | --- | --- |
| 200 | OK | Request succeeded and the body will contain the desired response. |
| 400 | Bad Request | Malformed request; this occurs if some parameters are out of range or missing. |
| 404 | Not Found | The given request was not found or is not supported by the server. |
| 405 | Method Not Allowed | The request method (i.e. POST or GET) is not allowed for the given request. |
| 500 | Internal Server Error | Something went wrong when trying to answer the query. |

When responding with a non-200 status code, the body will always contain a JSON string with a detailed error message. This message is not intended for end users, instead it should be used by developers to debug the issue at hand.

Search Result Serialization

Result

Search results are represented using the Result struct as follows:

| Field | Type | Optional | Description |
| --- | --- | --- | --- |
| kind | string | no | Type of response returned. One of "mwsd", "tema", "elastic", "elastic-document" or "". The elastic ones are currently unused. |
| took | number | yes | Time it took to run the query, in nanoseconds. |
| stats | Dict<string, number> | yes | Time in nanoseconds spent in specific phases. Component names may change in the future. |
| total | number | no | Total number of results for the query, regardless of how many are returned. |
| from | number | no | 0-based index this set of results starts at. |
| size | number | no | Number of results returned. |
| variables | Array<QueryVariable> | yes | Query Variables found within the original query. See detailed description below. |
| ids | Array<number> | yes | Internal result ids, when requested. |
| hits | Array<Hit> | yes | The list of matching hits. See detailed description below. |

Hit

A hit is represented using the Hit struct as follows:

| Field | Type | Optional | Description |
| --- | --- | --- | --- |
| id | string | yes | (Possibly internal) id of this hit. |
| url | string | yes | URL of this hit. |
| xpath | string | yes | XPath of the query term to the formula referred to by this hit. |
| element | HarvestElement | yes | Harvest element (aka <mws:data> element) belonging to this hit. See below for detailed description. |
| score | number | yes | Score of this Hit as determined by ElasticSearch. |
| snippets | Array<string> | yes | Snippets that caused this hit to gain the score. TemaSearch only. |
| math_ids | Array<MathFormula> | no | Formulae found within this hit. See detailed description below. |

HarvestElement

An <mws:data> element is represented using the HarvestElement struct as follows:

| Field | Type | Optional | Description |
| --- | --- | --- | --- |
| metadata | any | no | Metadata belonging to this element. When it represents a valid JSON object, that JSON object is returned. When empty, an empty JSON object. Otherwise a string representing the XML. |
| segment | string | yes | Name of the segment (i.e. .harvest file) this element comes from. |
| text | string | no | Text contained in this document. Might contain replaced math ids. |
| mws_id | Dict<number, Dict<string, MathFormula>> | yes | Information about each math identifier within this document, and a map from the internal mws ids to the appropriate formula. See below for detailed description. |
| mws_ids | Array<number> | yes | List of math identifiers in this document. Corresponds to the keys of mws_id. |
| math | Dict<string, string> | no | Source code of replaced math elements within this document. |

MathFormula

A math formula is represented using the MathFormula struct as follows:

| Field | Type | Optional | Description |
| --- | --- | --- | --- |
| source | string | yes | MathML element representing the entire formula. |
| durl | string | yes | Document URL this formula is contained in. |
| url | string | yes | Local formula id. |
| xpath | string | yes | XPath from the formula to the matching subterm. |
| subterm | string | yes | MathML element representing the matching subterm. |
| subst | Dict<string, string> | yes | MathML elements representing values for the substituted terms. |

QueryVariable

A query variable is represented using the QueryVariable struct as follows:

| Field | Type | Optional | Description |
| --- | --- | --- | --- |
| name | string | no | Name of this Query Variable. |
| xpath | string | no | XPath from the root of the query to the variable itself. |

Status Request

The Status Handler is called by running a GET request on /. It takes no parameters and returns a StatusResponse with the following structure:

| Field | Type | Optional | Description |
| --- | --- | --- | --- |
| name | string | no | Name of this server. Always mwsapid. |
| tagline | string | no | Server tagline. |
| engines | Dict<string, bool> | no | Supported "engines" or routes. |

Example Response:

{
  "name": "mwsapid",
  "tagline": "You know, for math",
  "engines": {
    "mws": true,
    "tema": false
  }
}

MathWebSearch Request

The MWS Handler is called by running a POST request on /mws/. It takes parameters of type MWSAPIRequest.

| Field | Type | Optional | Description |
| --- | --- | --- | --- |
| expressions | Array<string> | yes | List of MathWebSearch expressions. Each should be the body of a single "mws:expression" tag, using the "mws" and "m" predefined namespaces. |
| mwsids | boolean | yes | When true, do not return MathWebSearch results, but only their IDs. |
| count | boolean | yes | When true, return only the count of results, not the results themselves. |
| from | number | yes | Used for pagination. 0-based index to start the result set at. Defaults to 0, must be >= 0. |
| size | number | yes | Used for pagination. Maximum number of results to return. Defaults to 10, must be between 0 and 100 inclusive. |

For example, when the server is running on localhost at port 3000, the following curl command could be used to make a simple request:

curl -d '{"expressions":["<mws:qvar>x</mws:qvar>"]}' -H "Content-Type: application/json" -X POST http://localhost:3000/mws/

If the count parameter is true, the response will be a single JSON number. Otherwise, the server will return a Result struct; see the section above for details on its structure. This behaviour is identical to that of the mwsquery executable. Example responses can be found in the cmd/mwsquery/cmd/testdata folder.
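
The same request can be made from Go. The sketch below is a hedged example rather than an official client: the type and function names are hypothetical, only a subset of the Result fields is decoded, count is assumed to be false, and the server address matches the curl example above. Non-200 responses are decoded as the JSON error string described earlier.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// mwsRequest mirrors the documented request fields (names are assumptions).
type mwsRequest struct {
	Expressions []string `json:"expressions,omitempty"`
	MWSIDs      bool     `json:"mwsids,omitempty"`
	From        int      `json:"from"`
	Size        int      `json:"size"`
}

// result decodes only a subset of the documented Result fields.
type result struct {
	Kind  string            `json:"kind"`
	Total int               `json:"total"`
	From  int               `json:"from"`
	Size  int               `json:"size"`
	Hits  []json.RawMessage `json:"hits,omitempty"`
}

// decodeResponse turns a status code and body into a result or an error.
func decodeResponse(status int, body []byte) (*result, error) {
	if status != http.StatusOK {
		// Error bodies are documented to be a JSON string.
		var msg string
		if err := json.Unmarshal(body, &msg); err != nil {
			msg = string(body)
		}
		return nil, fmt.Errorf("mwsapid returned %d: %s", status, msg)
	}
	var res result
	if err := json.Unmarshal(body, &res); err != nil {
		return nil, fmt.Errorf("decoding result: %w", err)
	}
	return &res, nil
}

// queryMWS posts the request to the /mws/ route of the given server.
func queryMWS(baseURL string, req mwsRequest) (*result, error) {
	payload, err := json.Marshal(req)
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(baseURL+"/mws/", "application/json", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return decodeResponse(resp.StatusCode, body)
}

func main() {
	res, err := queryMWS("http://localhost:3000", mwsRequest{
		Expressions: []string{"<mws:qvar>x</mws:qvar>"},
		Size:        10,
	})
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	fmt.Printf("%d total results, %d hits returned\n", res.Total, len(res.Hits))
}
```

Keeping the decoding step separate from the HTTP call makes it possible to check it against the example responses in the testdata folder without a running server.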

TemaSearch Request

Not yet documented.

Docker

For convenience, a Dockerfile serving the API daemon is provided. It can be found at the automated build mathwebsearch/mwsapi on DockerHub. It can be run as follows:

docker run mathwebsearch/mwsapi

It serves the API Daemon (see above) on port 3000 by default and can be customized using the following environment variables:

  • MWSAPID_HOST: Host to listen on for requests. Defaults to "0.0.0.0".

  • MWSAPID_PORT: Port to listen on for requests. Defaults to 3000.

  • MWSAPID_MWS_HOST: Host to expect MathWebSearch Daemon on. If omitted, MathWebSearch support is disabled.

  • MWSAPID_MWS_PORT: Port to expect MathWebSearch Daemon on. Defaults to 8080.

  • MWSAPID_ELASTIC_HOST: Host to expect Elasticsearch Daemon on. If omitted, TemaSearch support is disabled.

  • MWSAPID_ELASTIC_PORT: Port to expect Elasticsearch Daemon on. Defaults to 9200.
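
For readers who want to mirror this configuration in their own tooling, the following sketch reads the variables above and applies the documented defaults. It is not how the Docker image itself processes them; the config type and helper names are assumptions made for this example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// config holds the settings listed above (type and field names are assumptions).
type config struct {
	Host        string
	Port        int
	MWSHost     string // empty disables MathWebSearch support
	MWSPort     int
	ElasticHost string // empty disables TemaSearch support
	ElasticPort int
}

// lookupFunc has the same shape as os.LookupEnv so tests can supply a map.
type lookupFunc func(key string) (string, bool)

func stringVar(lookup lookupFunc, key, def string) string {
	if v, ok := lookup(key); ok && v != "" {
		return v
	}
	return def
}

func intVar(lookup lookupFunc, key string, def int) (int, error) {
	v, ok := lookup(key)
	if !ok || v == "" {
		return def, nil
	}
	n, err := strconv.Atoi(v)
	if err != nil {
		return 0, fmt.Errorf("%s: %w", key, err)
	}
	return n, nil
}

// loadConfig reads all variables and falls back to the documented defaults.
func loadConfig(lookup lookupFunc) (config, error) {
	cfg := config{
		Host:        stringVar(lookup, "MWSAPID_HOST", "0.0.0.0"),
		MWSHost:     stringVar(lookup, "MWSAPID_MWS_HOST", ""),
		ElasticHost: stringVar(lookup, "MWSAPID_ELASTIC_HOST", ""),
	}
	var err error
	if cfg.Port, err = intVar(lookup, "MWSAPID_PORT", 3000); err != nil {
		return config{}, err
	}
	if cfg.MWSPort, err = intVar(lookup, "MWSAPID_MWS_PORT", 8080); err != nil {
		return config{}, err
	}
	if cfg.ElasticPort, err = intVar(lookup, "MWSAPID_ELASTIC_PORT", 9200); err != nil {
		return config{}, err
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.LookupEnv)
	if err != nil {
		fmt.Println("invalid configuration:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```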

Furthermore, a Docker Image for elasticsync also exists. See MathWebSearch/tema-elasticsync for details.

License

GPL3, see LICENSE.
