twitterdata

module
v0.0.0-...-50c602e Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jan 22, 2025 License: Apache-2.0

README

Twitter data in Dgraph

This benchmark provides code to download twitter data using streaming APIs and output is written in json files. Then, the data is post-processed and converted into live loader acceptable json files.

Downloading twitter data

Output

Generates as many files as number of workers, and each line has one tweet.

Run

mkdir json
go run postprocess/pp.go -i json -o pp

Post Processing

This step cleans up the data downloaded from twitter. This step also assigns an external id to each tweet and user (author or user mentioned in a tweet).

mkdir pp
go run postprocess/pp.go 10 json pp

Run Live Loader

Setup Dgraph

docker run --rm -it -p 5080:5080 -p 6080:6080 -p 8080:8080 -p 9080:9080 -p 8000:8000 --name dgraph dgraph/dgraph dgraph zero
docker exec -it dgraph dgraph alpha --lru_mb 2048 --zero localhost:5080
docker exec -it dgraph dgraph-ratel

(Note, all data would be lost when container stops)

Setup Schema

Use dgraph ratel to setup following schema -

user_id: string @index(exact) @upsert .
user_name: string @index(hash) .
screen_name: string @index(term) .

id_str: string @index(exact) @upsert .
created_at: dateTime @index(hour) .
hashtags: [string] @index(exact) .

author: uid @count @reverse .
mention: uid @reverse .

Run

dgraph live -f pp -x xidmap --zero localhost:5080 -c 1

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL