cc-hosts

v0.0.0-...-a677503 Published: Jun 2, 2025 License: MIT

Common Crawl regularly releases host- and domain-level graphs for visualising the crawl data.

To use that data you can use the Java SDK, which notes that "the webgraphs are usually multiple Gigabytes in size and require for processing a sufficient Java heap size (Java option -Xmx)".

Depending on the application, keeping all the data in memory may not be cost-efficient.

The web graph is static, sorted data. Because it never changes and is already ordered, we can get single-digit-second latency by reading it directly from cold storage such as S3, without loading it into memory first.

With this project you can:

graph LR;
    Gateway>API Gateway] --> Lambda;
    Lambda -->S3[(AWS S3)];