lafzi

Version: v0.0.0-...-d52e2b0
Published: Aug 5, 2022 | License: MIT | Imports: 8 | Imported by: 0

README

Go-Lafzi

Go-Lafzi is a Go package for searching Arabic text using its transliteration (phonetic search). It is loosely based on research by Istiadi (2012) and several related papers.

It uses indexed trigrams for approximate string matching, and ranks the search results by Longest Common Subsequence (LCS), computed with the Myers diff algorithm. The indexes are stored in an SQLite database, which brings several advantages:

  • Indexing and lookup are fast: around 3 seconds to index the entire Al-Quran and around 90 ms per lookup. For more detail, check out the code in sample/quran.
  • The storage can be safely used concurrently.
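
To illustrate the trigram matching idea described above, here is a minimal sketch in Go. It is not the package's actual implementation: the helper names trigrams and sharedTrigrams and the hand-written transliterations are assumptions made for illustration, and an in-memory map stands in for the SQLite index.

```go
package main

import (
	"fmt"
	"strings"
)

// trigrams returns the set of distinct three-character substrings of s.
// It works on runes so that multi-byte characters are never split.
func trigrams(s string) map[string]struct{} {
	runes := []rune(strings.ToLower(s))
	set := make(map[string]struct{})
	for i := 0; i+3 <= len(runes); i++ {
		set[string(runes[i:i+3])] = struct{}{}
	}
	return set
}

// sharedTrigrams counts the distinct trigrams of query that also occur in text.
// A higher count suggests that text is a better candidate for the query.
func sharedTrigrams(query, text string) int {
	q, t := trigrams(query), trigrams(text)
	n := 0
	for tg := range q {
		if _, ok := t[tg]; ok {
			n++
		}
	}
	return n
}

func main() {
	// Hypothetical transliterations, written by hand for illustration only.
	docs := []string{"bismillahirrahmanirrahim", "malikiyaumiddin"}
	for _, d := range docs {
		fmt.Printf("%s: %d shared trigrams\n", d, sharedTrigrams("rahman", d))
	}
}
```

A misspelled query such as "rohman" still shares two trigrams ("hma" and "man") with the first string, which is what makes the matching approximate.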

Usage

For example, suppose we want to find the word "rahman" within Surah Al-Fatiha:

package main

import (
	"encoding/json"
	"fmt"

	"github.com/hablullah/go-lafzi"
)

var arabicTexts = []string{
	"بِسْمِ اللَّهِ الرَّحْمَـٰنِ الرَّحِيمِ",
	"الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ",
	"الرَّحْمَـٰنِ الرَّحِيمِ",
	"مَالِكِ يَوْمِ الدِّينِ",
	"إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ",
	"اهْدِنَا الصِّرَاطَ الْمُسْتَقِيمَ",
	"صِرَاطَ الَّذِينَ أَنْعَمْتَ عَلَيْهِمْ غَيْرِ الْمَغْضُوبِ عَلَيْهِمْ وَلَا الضَّالِّينَ",
}

func main() {
	// Open storage
	storage, err := lafzi.OpenStorage("sample.lafzi")
	checkError(err)

	// Prepare documents
	var docs []lafzi.Document
	for i, arabicText := range arabicTexts {
		docs = append(docs, lafzi.Document{
			ID:     i + 1,
			Arabic: arabicText,
		})
	}

	// Save documents to storage
	err = storage.AddDocuments(docs...)
	checkError(err)

	// Search in storage
	results, err := storage.Search("rahman")
	checkError(err)

	// Print search results as indented JSON
	bt, err := json.MarshalIndent(&results, "", "  ")
	checkError(err)
	fmt.Println(string(bt))
}

func checkError(err error) {
	if err != nil {
		panic(err)
	}
}

This gives the following results:

[
  {
    "DocumentID": 1,
    "Confidence": 1
  },
  {
    "DocumentID": 3,
    "Confidence": 1
  }
]
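
The Confidence values come from the ranking step. As a rough sketch of how an LCS-based score could work, the Go code below computes the longest common subsequence with a plain dynamic-programming table and divides its length by the query length. Both choices are assumptions for illustration: the package uses the Myers diff algorithm, and its exact scoring formula is not described here. The names lcsLength and score are hypothetical.

```go
package main

import "fmt"

// lcsLength returns the length of the longest common subsequence of a and b.
// It uses the classic dynamic-programming recurrence with two rolling rows,
// not Myers' algorithm.
func lcsLength(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for i := 1; i <= len(ra); i++ {
		for j := 1; j <= len(rb); j++ {
			switch {
			case ra[i-1] == rb[j-1]:
				curr[j] = prev[j-1] + 1
			case prev[j] >= curr[j-1]:
				curr[j] = prev[j]
			default:
				curr[j] = curr[j-1]
			}
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// score is a hypothetical confidence measure: the fraction of query
// characters that appear, in order, somewhere in doc.
func score(query, doc string) float64 {
	n := len([]rune(query))
	if n == 0 {
		return 0
	}
	return float64(lcsLength(query, doc)) / float64(n)
}

func main() {
	// Hand-written transliteration, used only for illustration.
	doc := "arrahmanirrahim"
	for _, q := range []string{"rahman", "rohman"} {
		fmt.Printf("%s: %.2f\n", q, score(q, doc))
	}
}
```

Under this assumed formula, a query that appears in full within the document scores 1, and a query with one wrong letter scores slightly lower.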

Resources

All resources mentioned here are also available in the doc folder. This guards against the case where a university decides to close public access to the research. For example, the paper by Istiadi was publicly available back in 2014, but as of 2022 it can only be downloaded by members of its university.

Note that the algorithm implemented in this package is not exactly the same as in these papers. I ignored some of them, e.g. the papers on cross-verse search in the Qur'an, which I believe are not really useful for general Arabic texts. I also changed many parts to make the implementation easier and to improve performance in testing.

  • Istiadi, Muhammad Abrar. "Sistem pencarian ayat al-quran berbasis kemiripan fonetis." (2012). (PDF, link)
  • Zafran, Aidil, Moch Arif Bijaksana, and Kemas M. Lhaksmana. "Truncated query of phonetic search for al qur’an." 2019 7th International Conference on Information and Communication Technology (ICoICT). IEEE, 2019. (PDF, link)
  • Rifaldi, Eki, Moch Arif Bijaksana, and Kemas Muslim Lhaksamana. "Sistem Pencarian Lintas Ayat Al-Qur'an Berdasarkan Kesamaan Fonetis." Indonesia Journal on Computing (Indo-JC) 4.2 (2019): 177-188. (PDF, link)
  • Rasyad, Naufal, Moch Arif Bijaksana, and Kemas Muslim Lhaksmana. "Pencarian Potongan Ayat Al-Qur'an dengan Perbedaan Bunyi pada Tanda Berhenti Berdasarkan Kemiripan Fonetis." Jurnal Linguistik Komputasional 2.2 (2019): 56-61. (PDF, link)
  • Satriady, Wildhan, Moch Arif Bijaksana, and Kemas M. Lhaksmana. "Quranic Latin Query Correction as a Search Suggestion." Procedia Computer Science 157 (2019): 183-190. (PDF, link)
  • Octavia, Agni, Moch Arif Bijaksana, and Kemas Muslim Lhaksmana. "Verse Search System for Sound Differences in the Qur’an Based on the Text of Phonetic Similarities." Jurnal Sisfokom (Sistem Informasi dan Komputer) 9.3 (2020): 317-322. (PDF, link)
  • Fitriani, Intan Khairunnisa, Moch Arif Bijaksana, and Kemas Muslim Lhaksmana. "Qur’an Search System for Handling Cross Verse Based on Phonetic Similarity." Jurnal Sisfokom (Sistem Informasi dan Komputer) 10.1 (2021): 46-51. (PDF, link)
  • Purwita, Naila Iffah, et al. "Typo handling in searching of Quran verse based on phonetic similarities." Register: Jurnal Ilmiah Teknologi Sistem Informasi 6.2 (2020): 130-140. (PDF, link)
  • Cendikia, Putri, Moch Arif Bijaksana, and Kemas M. Lhaksmana. "Pencarian Ayat Al-Qur'an Yang Tidak Utuh Berdasarkan Kemiripan Fonetis." eProceedings of Engineering 7.2 (2020). (PDF, link)
  • Elder, Robert. "Myers Diff Algorithm - Code & Interactive Visualization." (2017) (archive, link)

License

Go-Lafzi is distributed under the MIT license.

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Document

type Document struct {
	ID     int
	Arabic string
}

Document is the Arabic document that will be indexed.

type Result

type Result struct {
	DocumentID int
	Confidence float64
}

Result contains the ID of a matching document and its confidence level.

type Storage

type Storage struct {
	// contains filtered or unexported fields
}

Storage is the container for the reverse indexes of the Arabic documents that will be searched later. It uses sqlite3 as its database engine.

func OpenStorage

func OpenStorage(path string) (*Storage, error)

OpenStorage opens the reverse index database at the specified path.

func (*Storage) AddDocuments

func (st *Storage) AddDocuments(docs ...Document) error

AddDocuments saves the documents to the storage and indexes them.

func (*Storage) DeleteDocuments

func (st *Storage) DeleteDocuments(ids ...int) error

DeleteDocuments removes the documents with the specified IDs from the storage.

func (*Storage) Search

func (st *Storage) Search(query string) ([]Result, error)

Search looks for suitable documents using the specified query.

func (*Storage) SetMinConfidence

func (st *Storage) SetMinConfidence(f float64)

SetMinConfidence sets the minimum confidence score for search results. The default is 40%.
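
To make the effect of a minimum confidence concrete, here is a small self-contained sketch in Go. It does not call the package. The local Result type only mirrors the documented fields, filterByConfidence is a hypothetical helper, and the sketch assumes that 40% corresponds to 0.4 on the same 0 to 1 scale as the Confidence values in the README output.

```go
package main

import (
	"fmt"
	"sort"
)

// Result mirrors the fields of the package's Result type so that this
// sketch compiles on its own.
type Result struct {
	DocumentID int
	Confidence float64
}

// filterByConfidence is a hypothetical helper. It keeps the results whose
// confidence is at least threshold and sorts them from most to least confident.
func filterByConfidence(results []Result, threshold float64) []Result {
	kept := make([]Result, 0, len(results))
	for _, r := range results {
		if r.Confidence >= threshold {
			kept = append(kept, r)
		}
	}
	sort.SliceStable(kept, func(i, j int) bool {
		return kept[i].Confidence > kept[j].Confidence
	})
	return kept
}

func main() {
	// Made-up results, used only for illustration.
	results := []Result{
		{DocumentID: 1, Confidence: 0.8},
		{DocumentID: 2, Confidence: 0.35},
		{DocumentID: 3, Confidence: 1},
	}
	// 0.4 is assumed to be the equivalent of the documented 40% default.
	for _, r := range filterByConfidence(results, 0.4) {
		fmt.Printf("document %d (confidence %.2f)\n", r.DocumentID, r.Confidence)
	}
}
```

Raising the threshold trades recall for precision: fewer loosely matching documents are returned, but slightly misspelled queries are more likely to return nothing.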

Directories

Path Synopsis
internal
lcs
sample
