mime-scraper

command module

Version: v0.0.0-...-6ed0812 | Published: Mar 7, 2023 | License: MIT | Imports: 6 | Imported by: 0

Repository

github.com/singhayushh/mime-scraper

README

Mime Scraper

Scrapes MIME type metadata and exports it as JSON.

Description

This project is a web scraping script that collects data from the Complete MIME Types List and exports it to a JSON file.

What's the problem we are trying to solve?

For one of my internal projects, users could upload any file and store it in a custom cloud database. The files had to be identified by their MIME types, and each file type needed some basic info and a suitable icon.

How can Mime Scraper help?

Mime Scraper creates a JSON file of the MIME types it scrapes. Each entry has the following 5 fields:

name: name of the file type
mimetype: MIME type of the file
extension: file extension associated with the type
icon: left empty so that you can populate it from whichever icon theme you use. (I've used Tabler icons; you may request the final file with the Tabler icons integrated.)
info: URL for more information about the MIME type, usually a Wikipedia article.

The idea

The idea was to write a fast script that scrapes and parses data quickly into an exportable format. I've used Go for its speed, together with the colly scraping framework, which I also chose for its speed. Each scraped record is parsed into a custom struct type and appended to a list of that type. Once scraping completes, the list is written to the mimetypes.json file using the os package.

Project structure

I've used the standard Go project layout. The code resides in the following directory on my Unix system: $GOPATH/src/github.com/singhayushh/mime-scraper

Getting started

Prerequisites

Software needed

Go

Knowledge needed

HTML and XPath
Mimetypes

Installing

Create a directory named singhayushh under $GOPATH/src/github.com.
Clone the repo inside this directory.
cd into the cloned directory and run go get to install the modules listed in the go.mod file.
Run go run main.go, or build an executable.