cascadia
The Go Cascadia package implements CSS selectors for HTML. This command line tool started as a thin wrapper around that package, and is growing into a better way to test CSS selectors without writing Go code:
Usage
$ cascadia
cascadia wrapper
built on 2017-08-01
Command line interface to go cascadia CSS selectors package
Options:
-h, --help display help information
-i, --in *The html/xml file to read from (or stdin)
-o, --out *The output file (or stdout)
-c, --css *CSS selectors
-p, --piece sub CSS selectors within -css to split that block up into pieces
format: PieceName=[RAW:]selector_string
RAW: will return the selected as-is; else the text will be returned
-d, --delimiter[= ] delimiter for pieces csv output
-w, --wrap-html wrap up the output with html tags
-b, --base base href tag used in the wrapped up html
-q, --quiet be quiet
Its output has two modes, single selection mode and block selection mode, depending on whether the --piece parameter is given on the command line or not.
- The single selection mode will output the selection as HTML source, while
- The block selection mode will output HTML text in a tsv/csv table form
For details about the concept of blocks and pieces, check out andrew-d/goscrape (in fact, cascadia was initially developed just for it, so that I would not need to tweak Go code, then build and run it, just to test block and piece selectors). Here is its explanation:
- Inside each page, there's 1 or more blocks - some logical method of splitting up a page into subcomponents.
- Inside each block, you define some number of pieces of data that you wish to extract. Each piece consists of a name, a selector, and what data to extract from the current block.
This all sounds rather complicated, but in practice it's quite simple. See the next section for details.
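As a rough illustration of the block and piece idea, the sketch below builds a small, made-up HTML page in which each div.item is one block and the name and price inside it are two pieces. The file contents, class names, and piece names are assumptions for this example only; the flags are the ones listed under Usage. The last command runs only if cascadia is found on the PATH.

```shell
# A hypothetical page with two blocks (div.item), each holding two pieces.
page=$(mktemp)
cat > "$page" <<'EOF'
<html><body>
<div class="item"><h2>Apple</h2><span class="price">1.20</span></div>
<div class="item"><h2>Pear</h2><span class="price">0.95</span></div>
</body></html>
EOF

# -c selects the blocks; each -p names a piece and gives the selector
# that is evaluated inside every block.
if command -v cascadia >/dev/null 2>&1; then
  cascadia -i "$page" -o -c 'div.item' -p Name=h2 -p Price=span.price
fi
```

Given how the block selection mode is described, this should print a header line with the piece names followed by one row per block.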
Examples
Single selection mode
All three options, -i, -o and -c, are required. By default it reads from stdin and outputs to stdout:
$ echo '<input type="radio" name="Sex" value="F" />' | tee /tmp/cascadia.xml | cascadia -i -o -c 'input[name=Sex][value=M]'
0 elements for 'input[name=Sex][value=M]':
Either the input or the output can be followed by a file name:
$ cascadia -i /tmp/cascadia.xml -o -c 'input[name=Sex][value=F]'
1 elements for 'input[name=Sex][value=F]':
<input type="radio" name="Sex" value="F"/>
Of course, any number of selections is allowed:
$ echo '<table border="0" cellpadding="0" cellspacing="0" style="table-layout: fixed; width: 100%; border: 0 dashed; border-color: #FFFFFF"><tr style="height:64px">aaa</tr></table>' | cascadia -i -o -c 'table[border="0"][cellpadding="0"][cellspacing="0"], tr[style=height\:64px]'
2 elements for 'table[border="0"][cellpadding="0"][cellspacing="0"], tr[style=height\:64px]':
<table border="0" cellpadding="0" cellspacing="0" style="table-layout: fixed; width: 100%; border: 0 dashed; border-color: #FFFFFF"><tbody><tr style="height:64px"></tr></tbody></table>
<tr style="height:64px"></tr>
Block selection mode
The single selection mode outputs the selection as HTML source. If you want the text content instead, use the block selection mode:
$ echo '<div class="container"><p align="justify"><b>Name: </b>John Doe</p></div>' | tee /tmp/cascadia.xml | cascadia -i -o -c 'div > p'
1 elements for 'div > p':
<p align="justify"><b>Name: </b>John Doe</p>
$ cat /tmp/cascadia.xml | cascadia -i -o -c 'div' --piece SelText='p'
SelText
Name: John Doe
Note that the block selection mode can output HTML as well, by prefixing the piece selector with RAW:. It just outputs text by default:
$ cat /tmp/cascadia.xml | cascadia -i -o -c 'div' --piece SelText='RAW:p'
SelText
<p align="justify"><b>Name: </b>John Doe</p>
The real power of block selection mode lies in its ability to produce tsv/csv tables without any Go programming:
$ curl --silent https://news.ycombinator.com | cascadia -i -o -c 'tr.athing' -p No=span.rank -p Title='td.title > a' -p Site=span.sitestr
No Title Site
1. Onedrive is slow on Linux but fast with a "Windows" user-agent (2016) microsoft.com
2. Starting today, users of Firefox can also enjoy Netflix on Linux netflix.com
3. Research Debt distill.pub
...
27. USPS Informed Delivery - Digital Images of Front of Mailpieces usps.com
28. Performance bugs - the dark matter of programming bugs forwardscattering.org
29. Most items of clothing have complicated international journeys bbc.co.uk
30. High-performance employees need quieter work spaces qz.com
It is a poor man's scraper tool if text is the only thing needed. For scraping beyond text, go one step further and use andrew-d/goscrape (or my goscrape instead, which has some enhancements to it).
Again, if text is the only thing needed, then cascadia might already be enough. Here is how to scrape Hacker News across several pages:
$ curl --silent 'https://news.ycombinator.com/news?p=[1-3]' | cascadia -i -o -c 'tr.athing' -p No=span.rank -p Title='td.title > a' -p Site=span.sitestr
No Title Site
1. Starting today, users of Firefox can also enjoy Netflix on Linux netflix.com
2. Onedrive is slow on Linux but fast with a "Windows" user-agent (2016) microsoft.com
3. Research Debt distill.pub
...
27. Yes I Still Want to Be Doing This at 56 (2012) thecodist.com
28. Performance bugs - the dark matter of programming bugs forwardscattering.org
29. USPS Informed Delivery - Digital Images of Front of Mailpieces usps.com
30. High-performance employees need quieter work spaces qz.com
31. Most items of clothing have complicated international journeys bbc.co.uk
32. Telstra's Gigabit Class LTE Network cellularinsights.com
...
58. The New Laptop Ban Adds to Travelers' Lack of Privacy and Security eff.org
59. QEMU: user-to-root privesc inside VM via bad translation caching chromium.org
60. Startups that debuted at Y Combinator W17 Demo Day 2 techcrunch.com
61. The Cracking Monolith: Forces That Call for Microservices semaphoreci.com
62. Amsterdam Airport Launches API Platform schiphol.nl
...
88. Founder Stories: Leah Culver of Breaker (YC W17) ycombinator.com
89. Find out what you, or someone on your team, did on the last working day github.com
90. PSD2 - a directive that will change banking in Europe evry.com
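The [1-3] part of the URL is expanded by curl itself (curl calls this URL globbing), so the pages are fetched in order and cascadia receives them as one stream. A minimal sketch of the same expansion spelled out by hand follows; page_urls is a hypothetical helper written for this example, not part of cascadia or curl.

```shell
# Print one URL per page number, from first to last inclusive.
page_urls() {
  base=$1
  p=$2
  last=$3
  while [ "$p" -le "$last" ]; do
    printf '%s%s\n' "$base" "$p"
    p=$((p + 1))
  done
}

# Lists the same three pages that news?p=[1-3] stands for.
page_urls 'https://news.ycombinator.com/news?p=' 1 3

# With network access, the list could be handed to curl, for example:
#   page_urls 'https://news.ycombinator.com/news?p=' 1 3 | xargs curl --silent |
#     cascadia -i -o -c 'tr.athing' -p No=span.rank -p Title='td.title > a' -p Site=span.sitestr
```

Keeping the URL in single quotes stops the shell from treating ? and [ ] as filename wildcards before curl sees them.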
By default it uses tab (\t) as the field delimiter, so the output is in .tsv format. To change to .csv, add -d , to the command line.
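Because the default output is tab-separated, it can be post-processed with standard tools. The sketch below works on a few made-up rows shaped like the output above; the sample data and the to_csv helper are assumptions for illustration. It pulls out a single column with cut, then converts the table to CSV with every field quoted, which keeps commas inside titles from being mistaken for delimiters.

```shell
# Made-up rows in the same tab-separated shape as the output above.
tsv=$(mktemp)
printf 'No\tTitle\tSite\n1.\tResearch Debt\tdistill.pub\n2.\tHello, world\texample.com\n' > "$tsv"

# Print only the Title column (field 2).
cut -f 2 "$tsv"

# Hypothetical helper: read TSV on stdin, write CSV with every field
# wrapped in double quotes and embedded quotes doubled.
to_csv() {
  awk -F '\t' -v OFS=',' '{
    for (i = 1; i <= NF; i++) {
      gsub(/"/, "\"\"", $i)
      $i = "\"" $i "\""
    }
    print
  }'
}

to_csv < "$tsv"
```

The last row comes out as "2.","Hello, world","example.com", so the comma in the title stays inside its field.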
Twitter Search
Block selection mode is a poor man's web scraping tool, and it is very simple to use. Here is another practical example: Twitter searching. We all know that you have to pay for the Twitter Search API and it only serves Tweets from the past week. With cascadia, you can search the tweets for free, and get the latest content as well.
Here is how I watch for Toronto/GTA's Gas Price Alert from GasBuddyDan, without getting all his other tweets:
$ cascadia -i 'https://twitter.com/search?q=%22Gas%20Price%20Alert%22%20%23GTA%20from%3AGasBuddyDan&src=typd' -o -c 'div.stream div.original-tweet div.content' --piece Time='small.time' --piece Tweet='div.js-tweet-text-container > p'
Time Tweet
Jul 31
Gas Price Alert #Toronto #GTA #Hamilton #Ottawa #LdnOnt #Barrie #Kitchener #Niagara #Windsor N/C Tues and to a 2ct/l HIKE gor Wednesday
Jul 6
Gas Price Alert #Toronto #GTA #LdnOnt #Hamilton #Ottawa #Barrie #KW to see a 1 ct/l drop @ for Friday July 7
May 30
Gas Price Alert #Toronto #GTA #Ottawa #LdnOnt #Hamilton #KW #Barrie #Windsor prices won't change Wednesday but will DROP 1 ct/l Thursday
May 15
Gas Price Alert #Toronto #GTA #Barrie #Hamilton #LdnOnt #Ottawa #KW #Windsor NO CHANGE @ except gas bar shenanigans for Tues & Wednesday
Mar 7
Gas Price Alert #Toronto #GTHA #LdnOnt #Ottawa #Barrie #KW #Windsor to see a 1 cent a litre HIKE Wed March 8 (to 107.9 in the #GTA)
Reconstruct the separated pages
Many web sites annoyingly split one document into several small pieces so that they can show it to you on different web pages, with different ads. However, I'd like to view it on one page with no ads. At least that is what I had been hoping for all along, but I didn't have an easy way of doing it until now, with cascadia.
With cascadia, no more programming is necessary. All we need to do is pass some command line parameters, and the magic will happen. There are many sites that break things into several small pieces; the following two are ones I did just the other day.
The first one is spread across 23 pages! Twenty-three! I would have just given up if I didn't have cascadia, but with it, it is so simple:
curl --silent 'http://www.chinadmd.com/file/prrxtuivvxsxxwwaexuuwovp_[1-23].html' | cascadia -i -o -c div.panel-body -p 'Book=div.tofu-txt' > /tmp/book.txt
The first page is here, and all 23 pages are collected here. I collected them as plain text because the HTML was just wrapping plain text, so there is no need for HTML; plain text is good enough.
Collecting as HTML is no trouble either. Here is another example:
curl --silent 'http://www.shangxueedu.com/shuxue/ksdg/20170113_162_[1-6].html' | cascadia -i -o -c div.m-post -p 'Book=RAW:div.post-con' --wrap-html | tee /tmp/book.html
The fifth page is here, and all pages are collected here. Please check them out.
More On CSS Selector
I'm not an expert on CSS selectors at all, but the following resources are the ones I found most helpful.
- CSS Selectors Cheat Sheet. I think it's very good, because it's usage oriented and very practical, i.e., it arranges the selectors according to their purposes. If that's too dry for you, check out the next one.
- The 30 CSS Selectors You Must Memorize. It only lists the selectors that are important, but it gives concrete examples and explanations.
- CSS Selector Reference from w3schools. This is the one I refer to most often, because the list is comprehensive, and there is also an online CSS Selector Tester that really helped me learn and understand.