Searchzin
A simple search engine implementation.
Motivation
Study purposes, mostly for understanding the implementation details of how
search engines are made, performance trade-offs and structure.
Description
The idea is to make a isomorphic application from the UI to the database system.
Usage
The application can be deployed using either docker or the binary released in
github.
./searchzin -c <path-to-config>.yml
After that you can look into http://localhost:8080
to see the search page.
Configuration
The configuration can be made by either the configuration file located by
default in /etc/searchzin/config.yml
, or providing configuration keys in the
form -C key=value
, the second form overrides the first.
Configuration defaults:
port: 8080 # Service port
path:
log: /var/log/searchzin # Log directory
data: /var/lib/searchzin # Data directory
Development
All the project structure is made in golang, using the
gin
framework.
Dependencies are managed using dep
.
Most of the project toolchain is managed by the
Makefile
,
the important targets are:
install
: Install needed dependencies and git hooks
readme
: Performs README.md
inclusion of files
lint
: Performs linting and formatting of the code
test
: Well, compile and run unit tests
build
: Creates a linux
distributable folder in dist
run
: Runs the code using go run
run-dev
: Creates and runs a docker container
release
: Creates a release docker image
publish
: Publishes the docker image on dockerhub using the contents of the
VERSION
file as the version
publish-latest
: Publishes the docker image on dockerhub with the latest
tag
- (TODO)
watch
: Performs lint
and test
on file modification
- (TODO)
func-test
: Performs functional tests inside the features
folder
Architecture
There are 6 main components to this search engine:
- Document database
- Index database
- Indexing service
- Query executor
- Query planner
- Query parser
Each component has a clear responsability in the system, and all of them work
togheter to respond to queries and document indexing requests.
Document Database
It's responsible to store and give id's to newly created documents. The
constraints are:
- Stores documents and their
id
s
- Enables
id
generation with no collisions for persistence
- Efective document storing algorithm, being optimized for fast reads and fast
enought writes
- Aware of the underlying storage unit, being it
ssd
or hdd
- Aware of the underlying linux page size and file caching strategy
Index database
Stores a reverse-index of "terms" and documents
- Stores
terms
to document set relations
- Enable
key
manipulation strategies for queries with keyword approximation
- Optimized for low density keys with lots of documents
- Aware of the underlying linux page size to easily fit and be loaded in-memory
Indexing service
Given a new document understands it and saves both on the index database and the
document database.
- Knows which fields are indexed and how
- Knows the document structure and can related that to the indexes
Query parser
Parses the user input and transforms it into a query plan using a tree-like
data structure.
- Parses a string given by the user and turns it into a graph
- The DSL will be similar to
lucene
's
- Support for ANSI SQL queries
- Support for document key lookups
Query planner
Given a query tree, optimizes it being aware of the restrictions and the
environment in which it will be executed.
- Remove redundant results, making them available to all the steps that need
- Aware of index size to sort which effective retrievals will be done first
- Returns an ordered list of query nodes to be executed
Query executor
After having a structured plan the query then retrieves effective data from the
index
database, this step is performed by the executor.
- Knows how to query the index database
- Joins the results given by it in a ordered fashion
- Retrieves the documents
- Stores the query results in a file to be queried later using "cold" storage
Query language
This query language is heavily based on lucene's, to simplify design and
understand what tradeoffs were made.
Test scenario
The current test scenario that will be used will be indexing podcasts by name,
content and tags.
The base usage can be found in searchzin-example
.
License
searchzin is available under the MIT license.