cpaml

v0.0.0-...-55f9b32
Published: Feb 27, 2020 License: MIT Imports: 3 Imported by: 0

README

cpaml

Copy-pasted and modified strings lookup

This library can be used to look up spammers and abusers on social networks or dating websites. A typical use case is a spammer who reuses the same copy-pasted message but changes the email address or phone number after being banned:

The quick brown fox jumps over the lazy dog then once again runs away and calls 1234567890

And then:

The quick brown fox jumps over the lazy dog then once again runs away and calls gmail@gmail.com

So you want a precise match that still allows for variations.

Unlike other approximate text matching approaches, the "similarity" reported by this library is exact-match similarity: 40% similarity means that 40% of the text matched exactly. So even 10% may be a sign of a copy-pasted spam message that you may want to review, and 50% may be used safely to ban, delete, or flag a message automatically.
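To make the idea of exact-match similarity concrete, here is a minimal sketch. It is not the library's implementation, which this page does not show. It assumes that texts are split into overlapping byte substrings of length k (kmers) and that similarity is the share of the query's kmers found verbatim in a sample. The names kmers and similarity are hypothetical.

```go
package main

import "fmt"

// kmers returns the set of all overlapping byte substrings of length k in s.
// Assumption: the real library may tokenize or normalize text differently.
func kmers(s string, k int) map[string]struct{} {
	set := make(map[string]struct{})
	for i := 0; i+k <= len(s); i++ {
		set[s[i:i+k]] = struct{}{}
	}
	return set
}

// similarity returns the percentage (0-100) of the query's kmers that occur
// verbatim in the sample. Queries shorter than k have no kmers and score 0.
func similarity(sample, query string, k int) uint {
	if k <= 0 || len(query) < k {
		return 0
	}
	total := len(query) - k + 1
	known := kmers(sample, k)
	matched := 0
	for i := 0; i+k <= len(query); i++ {
		if _, ok := known[query[i:i+k]]; ok {
			matched++
		}
	}
	return uint(matched * 100 / total)
}

func main() {
	sample := "The quick brown fox jumps over the lazy dog then once again runs away and calls 1234567890"
	query := "The quick brown fox jumps over the lazy dog then once again runs away and calls gmail@gmail.com"
	// Only the kmers that touch the changed contact detail fail to match.
	fmt.Println(similarity(sample, query, 13))
}
```

In this sketch the shared text dominates the score, and the changed phone number or email only removes the kmers that overlap it.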

The sample index is kept in memory, so this may not work for large databases. ~25000 message samples take ~300 MB of RAM.

Usage example:

    // Create an index with a kmer length of 13.
    spamIndex := cpaml.Init(13)

    // Register known spam samples under caller-chosen IDs.
    spamIndex.AddToSet("480f89e6fc3ffdfbf7cb2c518ab45f54",
        "The quick brown fox jumps over the lazy dog then once again runs away and calls 1234567890")
    spamIndex.AddToSet("ab0f8abe6fc3ffdfbf7cb2c518ab4fab",
        "The quick brown fox jumps over the lazy dog then once again runs away and calls gmal@gmail.com")

    // Look up a new message: id is the best-matching sample, sim is 0-100.
    id, sim := spamIndex.LookupSimilar("The quick brown fox jumps over the lazy dog then once again runs away and calls gmal@yahoo.com")

LookupSimilar(t string) returns the ID of the best-matching sample and the calculated "similarity" (0-100).
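One way to act on the returned similarity is sketched below. It uses the thresholds suggested earlier in this README (10% for review, 50% for automatic action). The Action type and the decide function are hypothetical helpers, not part of cpaml, and the cut-offs should be tuned to your own data.

```go
package main

import "fmt"

// Action is a hypothetical moderation decision; cpaml does not define it.
type Action string

const (
	Pass   Action = "pass"
	Review Action = "review"
	Flag   Action = "flag"
)

// decide maps a similarity score (0-100), such as the second value returned
// by LookupSimilar, to a moderation action using the README's rough thresholds.
func decide(sim uint) Action {
	switch {
	case sim >= 50:
		return Flag // safe to ban, delete, or flag automatically
	case sim >= 10:
		return Review // possibly copy-pasted, worth a manual look
	default:
		return Pass
	}
}

func main() {
	for _, sim := range []uint{3, 10, 42, 50, 97} {
		fmt.Printf("similarity %d -> %s\n", sim, decide(sim))
	}
}
```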

PS. You may want to call the garbage collector after each index sync: runtime.GC()

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Cpaml

type Cpaml struct {
	// contains filtered or unexported fields
}

func Init

func Init(kmerLength int) *Cpaml

* Provide the kmer length. The right value depends on your use case; 13 works well for English user messages and comments. It does not work as well with short strings (100 characters or fewer).
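A rough way to see why short strings are a weak fit is sketched below, assuming that kmers are overlapping substrings of length k (the package internals are not shown here). The helpers kmerCount and survivingKmers are hypothetical and only do the arithmetic: they count how many kmers a text has and how many stay intact after a single character is changed.

```go
package main

import "fmt"

// kmerCount returns how many overlapping kmers a text of n bytes contains.
func kmerCount(n, k int) int {
	if k <= 0 || n < k {
		return 0
	}
	return n - k + 1
}

// survivingKmers returns how many kmers are untouched when the single byte
// at position pos (0-based) is changed in a text of n bytes.
func survivingKmers(n, k, pos int) int {
	total := kmerCount(n, k)
	if total == 0 || pos < 0 || pos >= n {
		return total
	}
	// kmers start at offsets 0..n-k; those covering pos start in [pos-k+1, pos].
	lo := pos - k + 1
	if lo < 0 {
		lo = 0
	}
	hi := pos
	if hi > n-k {
		hi = n - k
	}
	return total - (hi - lo + 1)
}

func main() {
	// A 20-byte text has 8 kmers of length 13; one edit in the middle breaks all of them.
	fmt.Println(kmerCount(20, 13), survivingKmers(20, 13, 10))
	// A 500-byte text has 488 kmers; the same edit breaks only 13 of them.
	fmt.Println(kmerCount(500, 13), survivingKmers(500, 13, 250))
}
```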

func (*Cpaml) AddToIndex

func (c *Cpaml) AddToIndex(id string, t string) (bool, bool)

* Add sample to index

func (*Cpaml) AddToSet

func (c *Cpaml) AddToSet(id string, t string) (bool, bool)

* Add a sample to the index if it has not been added yet. The first return value is true if the sample was added. The second return value is true if the string cannot be added because it has a high kmer/length ratio, meaning it is repeated multiple times.
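The two boolean results can be handled as in the sketch below. To stay self-contained it does not import cpaml. The sampleAdder interface mirrors the documented AddToSet signature, and fakeSet is a hypothetical stand-in whose rejection rule (empty text) is invented purely for illustration.

```go
package main

import "fmt"

// sampleAdder mirrors the documented AddToSet signature, so a *cpaml.Cpaml
// value should satisfy it.
type sampleAdder interface {
	AddToSet(id string, t string) (bool, bool)
}

// fakeSet is a hypothetical stand-in for the real index, used only here.
type fakeSet struct{ seen map[string]bool }

func (f *fakeSet) AddToSet(id, t string) (bool, bool) {
	if t == "" {
		return false, true // illustrative rejection rule, not cpaml's
	}
	if f.seen[id] {
		return false, false // already present, nothing added
	}
	f.seen[id] = true
	return true, false
}

// describeAdd interprets the two return values of AddToSet.
func describeAdd(s sampleAdder, id, text string) string {
	added, rejected := s.AddToSet(id, text)
	switch {
	case rejected:
		return "rejected"
	case added:
		return "added"
	default:
		return "already indexed"
	}
}

func main() {
	set := &fakeSet{seen: map[string]bool{}}
	fmt.Println(describeAdd(set, "id-1", "some spam text"))
	fmt.Println(describeAdd(set, "id-1", "some spam text"))
	fmt.Println(describeAdd(set, "id-2", ""))
}
```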

func (*Cpaml) GetStats

func (c *Cpaml) GetStats() Stats

func (*Cpaml) IsInIndex

func (c *Cpaml) IsInIndex(id string) bool

func (*Cpaml) LookupSimilar

func (c *Cpaml) LookupSimilar(t string) (string, uint)

Returns the sample ID as given to AddToIndex and the similarity (0-100).

func (*Cpaml) RemoveFromIndex

func (c *Cpaml) RemoveFromIndex(idx uint, id string)

* Remove a sample from the index. Using RemoveStale() instead is recommended.

func (*Cpaml) RemoveStale

func (c *Cpaml) RemoveStale(isForRemove func(id string) bool) int

* Remove unused (inactive) samples from the index. The closure must return true for an unused sample ID.
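A periodic sync built around RemoveStale might look like the sketch below, which also follows the README's suggestion to call runtime.GC() afterwards. The staleRemover interface mirrors the documented method signature. fakeIndex, syncIndex, and the active map are hypothetical, and the sketch assumes the returned int is the number of removed samples, which the documentation does not state.

```go
package main

import (
	"fmt"
	"runtime"
)

// staleRemover mirrors the documented RemoveStale signature, so a
// *cpaml.Cpaml value should satisfy it.
type staleRemover interface {
	RemoveStale(isForRemove func(id string) bool) int
}

// fakeIndex is a hypothetical stand-in for the real index, used only here.
type fakeIndex struct{ ids map[string]struct{} }

func (f *fakeIndex) RemoveStale(isForRemove func(id string) bool) int {
	removed := 0
	for id := range f.ids {
		if isForRemove(id) {
			delete(f.ids, id) // deleting during range is safe in Go
			removed++
		}
	}
	return removed
}

// syncIndex drops every sample whose ID is no longer active, then asks the
// runtime to collect the freed memory.
func syncIndex(idx staleRemover, active map[string]bool) int {
	removed := idx.RemoveStale(func(id string) bool {
		return !active[id] // true means "unused, remove it"
	})
	runtime.GC()
	return removed
}

func main() {
	idx := &fakeIndex{ids: map[string]struct{}{"a": {}, "b": {}, "c": {}}}
	removed := syncIndex(idx, map[string]bool{"a": true})
	fmt.Println(removed, len(idx.ids)) // prints "2 1"
}
```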

type Stats

type Stats struct {
	NofSamples      int
	NofKmersIndexed int
}

type TextInIndex

type TextInIndex struct {
	Id       string
	NofKmers uint
}
