proto

package

v0.0.0-...-9be3a58 | Published: May 22, 2017 | License: BSD-3-Clause | Imports: 1 | Imported by: 1

Repository

github.com/Qihoo360/poseidon

Documentation ¶

Overview ¶

Package proto is a generated protocol buffer package.

It is generated from these files:

poseidon_if.proto

It has these top-level messages:

DocGzMeta
DocId
DocIdList
FastPForCompressedDocIdList
InvertedIndex
FastPForCompressedInvertedIndex
PdzCompressedInvertedIndex
InvertedIndexGzMeta

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

This section is empty.

Types ¶

type DocGzMeta ¶

type DocGzMeta struct {
	Path   string `protobuf:"bytes,1,opt,name=path" json:"path,omitempty"`
	Offset uint64 `protobuf:"varint,2,opt,name=offset" json:"offset,omitempty"`
	Length uint32 `protobuf:"varint,3,opt,name=length" json:"length,omitempty"`
}

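The raw data is stored in HDFS as gz-compressed files. Every 128 lines of raw data are grouped together and called a Document. Assuming 2 GB per HDFS file, one file holds roughly 100,000 compressed Documents. The DocGzMeta struct describes a Document's metadata: the file path, the offset within the file, and the compressed length.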

func (*DocGzMeta) ProtoMessage ¶

func (*DocGzMeta) ProtoMessage()

func (*DocGzMeta) Reset ¶

func (m *DocGzMeta) Reset()

func (*DocGzMeta) String ¶

func (m *DocGzMeta) String() string

type DocId ¶

type DocId struct {
	DocId    uint64 `protobuf:"varint,1,opt,name=docId" json:"docId,omitempty"`
	RowIndex uint32 `protobuf:"varint,2,opt,name=rowIndex" json:"rowIndex,omitempty"`
}

func (*DocId) ProtoMessage ¶

func (*DocId) ProtoMessage()

func (*DocId) Reset ¶

func (m *DocId) Reset()

func (*DocId) String ¶

func (m *DocId) String() string

type DocIdList ¶

type DocIdList struct {
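	// The Document IDs associated with this token, sorted by docId in ascending order.
	// To suit protobuf's varint compression, the values are stored as deltas.
	// Delta encoding: each stored value equals its original value minus the previous original value.
	// For example:
	// if the original docId list is: 1,3,4,7,9,115,120,121,226
	// then the data actually stored is: 1,2,1,3,2,106,5,1,105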
	DocIds []*DocId `protobuf:"bytes,1,rep,name=docIds" json:"docIds,omitempty"`
}

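A token may appear in multiple documents. Because each document consists of multiple lines of raw data, each association needs two pieces of information to describe it: docId and rowIndex.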

func (*DocIdList) GetDocIds ¶

func (m *DocIdList) GetDocIds() []*DocId

func (*DocIdList) ProtoMessage ¶

func (*DocIdList) ProtoMessage()

func (*DocIdList) Reset ¶

func (m *DocIdList) Reset()

func (*DocIdList) String ¶

func (m *DocIdList) String() string

type FastPForCompressedDocIdList ¶

type FastPForCompressedDocIdList struct {
	DocList []uint64 `protobuf:"varint,1,rep,name=docList" json:"docList,omitempty"`
	RowList []uint32 `protobuf:"varint,2,rep,name=rowList" json:"rowList,omitempty"`
}

A compressed docIdList, compressed with the FastPFOR algorithm. After decompression the two arrays have the same length.
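
Because the two arrays are equal in length once decompressed, entry i of each can be paired back into a (docId, rowIndex) record. The sketch below shows only that pairing step and assumes the FastPFOR decompression has already happened elsewhere; `zipDocIds` and the local `DocId` type are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// DocId mirrors the generated DocId message for this self-contained sketch.
type DocId struct {
	DocId    uint64
	RowIndex uint32
}

// zipDocIds pairs already-decompressed docList and rowList entries by position.
// It rejects inputs of different lengths, since they should always match.
func zipDocIds(docList []uint64, rowList []uint32) ([]DocId, error) {
	if len(docList) != len(rowList) {
		return nil, errors.New("docList and rowList have different lengths")
	}
	out := make([]DocId, len(docList))
	for i := range docList {
		out[i] = DocId{DocId: docList[i], RowIndex: rowList[i]}
	}
	return out, nil
}

func main() {
	ids, err := zipDocIds([]uint64{10, 42}, []uint32{3, 7})
	fmt.Println(ids, err)
}
```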

func (*FastPForCompressedDocIdList) ProtoMessage ¶

func (*FastPForCompressedDocIdList) ProtoMessage()

func (*FastPForCompressedDocIdList) Reset ¶

func (m *FastPForCompressedDocIdList) Reset()

func (*FastPForCompressedDocIdList) String ¶

func (m *FastPForCompressedDocIdList) String() string

type FastPForCompressedInvertedIndex ¶

type FastPForCompressedInvertedIndex struct {
	Index map[string]*FastPForCompressedDocIdList `` /* 130-byte string literal not displayed */
}

func (*FastPForCompressedInvertedIndex) GetIndex ¶

func (m *FastPForCompressedInvertedIndex) GetIndex() map[string]*FastPForCompressedDocIdList

func (*FastPForCompressedInvertedIndex) ProtoMessage ¶

func (*FastPForCompressedInvertedIndex) ProtoMessage()

func (*FastPForCompressedInvertedIndex) Reset ¶

func (m *FastPForCompressedInvertedIndex) Reset()

func (*FastPForCompressedInvertedIndex) String ¶

func (m *FastPForCompressedInvertedIndex) String() string

type InvertedIndex ¶

type InvertedIndex struct {
	Index map[string]*DocIdList `` /* 130-byte string literal not displayed */
}

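The Token->DocIds inverted index table structure. After compression, this index data ends up occupying 2 TB of space per day. hashid = hash64(token) % 10 billion; duplicates (collisions) do no harm. Tokenization runs directly on HDFS and produces an intermediate data file (sorted by hashid, 10 billion rows in total) whose rows have the form: hashid token list<DocId>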

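Index file creation process:

N := 200 // take N=200; roughly every 200 tokens are grouped into one InvertedIndex object
for i := 0; ; i++ {
	1. Take the tokens whose hashid falls in the range [ i*N, (i+1)*N ), together with their DocId lists
	2. Build one InvertedIndex object, serialize it, gz-compress it, and append it to the HDFS file
	3. Record the 4-tuple: <hdfspath, i, offset, length>
}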

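Of the 4-tuple recorded in step 3 above, the hdfspath and hashid fields can be derived by rule, so only offset and length need to be recorded.
In total about 50 million (= total number of tokens / N) records are needed, 8 bytes each, for roughly 400 MB overall. This file can be stored in HDFS and loaded into a cache (NoSQL) at load time.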

func (*InvertedIndex) GetIndex ¶

func (m *InvertedIndex) GetIndex() map[string]*DocIdList

func (*InvertedIndex) ProtoMessage ¶

func (*InvertedIndex) ProtoMessage()

func (*InvertedIndex) Reset ¶

func (m *InvertedIndex) Reset()

func (*InvertedIndex) String ¶

func (m *InvertedIndex) String() string

type InvertedIndexGzMeta ¶

type InvertedIndexGzMeta struct {
	Offset uint64 `protobuf:"varint,1,opt,name=offset" json:"offset,omitempty"`
	Length uint32 `protobuf:"varint,2,opt,name=length" json:"length,omitempty"`
	Path   string `protobuf:"bytes,3,opt,name=path" json:"path,omitempty"`
}

Stored in NoSQL, with Key=int(hashid/N).
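
A minimal sketch of the key derivation, assuming N=200 as in the InvertedIndex description and using an in-memory map in place of the NoSQL store. `metaKey`, the example path, and the sample values are hypothetical.

```go
package main

import "fmt"

// N is the number of hashids covered by one InvertedIndex block (assumed 200).
const N = 200

// InvertedIndexGzMeta mirrors the generated struct for this self-contained sketch.
type InvertedIndexGzMeta struct {
	Offset uint64
	Length uint32
	Path   string
}

// metaKey computes the store key for a hashid: integer division by N.
func metaKey(hashid uint64) uint64 {
	return hashid / N
}

func main() {
	// An in-memory map stands in for the NoSQL store.
	store := map[uint64]InvertedIndexGzMeta{
		metaKey(400): {Offset: 0, Length: 512, Path: "/example/index.gz"},
	}
	// hashids 400 and 599 share a block; 600 falls into the next one.
	for _, hashid := range []uint64{400, 599, 600} {
		meta, ok := store[metaKey(hashid)]
		fmt.Println(hashid, metaKey(hashid), ok, meta.Path)
	}
}
```

All hashids in the range [i*N, (i+1)*N) map to the same key i, so a single lookup locates the compressed block that holds every token in that range.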

func (*InvertedIndexGzMeta) ProtoMessage ¶

func (*InvertedIndexGzMeta) ProtoMessage()

func (*InvertedIndexGzMeta) Reset ¶

func (m *InvertedIndexGzMeta) Reset()

func (*InvertedIndexGzMeta) String ¶

func (m *InvertedIndexGzMeta) String() string

type PdzCompressedInvertedIndex ¶

type PdzCompressedInvertedIndex struct {
	Index map[string]string `` /* 130-byte string literal not displayed */
}

Pdz is a compression algorithm designed for Protobuf repeated fields. For implementation details, see pdz_compress.go.

func (*PdzCompressedInvertedIndex) GetIndex ¶

func (m *PdzCompressedInvertedIndex) GetIndex() map[string]string

func (*PdzCompressedInvertedIndex) ProtoMessage ¶

func (*PdzCompressedInvertedIndex) ProtoMessage()

func (*PdzCompressedInvertedIndex) Reset ¶

func (m *PdzCompressedInvertedIndex) Reset()

func (*PdzCompressedInvertedIndex) String ¶

func (m *PdzCompressedInvertedIndex) String() string

Source Files ¶

poseidon_if.pb.go
