Reading offline in batch mode
I was looking for a way to read in batches. By that I mean fetching some online content, archiving it easily, going offline, and then starting to read.
Along the way I take notes, summarize, and list the open questions raised by what I read, then repeat the whole process.
The majority of documents I read are: blog posts, documentation, mailing lists, Wikipedia pages, newspaper articles, links people recommend, news.ycombinator.com articles, lobste.rs articles, and Stack Overflow posts.
I'm sure there are multiple existing solutions for this. I haven't looked at all of them, so I'm just writing down what I did.
These batches accumulate over time, so some indexing [1] is required to keep them searchable (we'll have a look at that too).
There are 4 things I'm using for this:
- snaplinks firefox extension
- maf firefox extension
- orgmode
- org-maff-lib.sh (a Bash library for manipulating MAFF files; it depends on sqlite3, xmlstarlet, and html2text)
Let's take them one by one. The snaplinks extension lets you select a number of links on a page and open them in new tabs. The maf extension creates a MAFF archive (one type of web archive) that stores all the pages open in your browser's tabs, including the JavaScript, CSS, and images for each of them. Orgmode is used to view the contents of a MAFF archive and open the pages inside it. Since orgmode is also a great note-taking tool, you can summarize those pages and take notes while reading. The last on the list is org-maff-lib.sh, a Bash library that lets you manipulate these MAFF files.
A MAFF archive is easy to make: open a page in Firefox and select a few links, which snaplinks opens in new tabs.
Then save all the tabs as a MAFF archive.
If you look at the example.maff you've just created, it's just a zip archive with the web pages in it.
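Since a MAFF file is a plain zip archive, standard zip tools can inspect it. Here is a minimal sketch that lists the per-page directories in an archive. The name maff_page_dirs is made up for illustration and is not part of org-maff-lib.sh. It assumes each saved page sits in its own top-level directory (such as 1449828414312_509).

```shell
# Hypothetical helper (not part of org-maff-lib.sh): print the top-level
# directories of a MAFF archive, one per saved page.
maff_page_dirs() {
    # unzip -Z1 lists the names of the archive members, one per line;
    # keep only the first path component and de-duplicate.
    unzip -Z1 "$1" | cut -d/ -f1 | sort -u
}
```

It would be called as `maff_page_dirs example.maff`. Any file stored at the top level of the archive would show up in the output as well, so treat it as a quick inspection aid only.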
After you're done with this, you can easily pull the contents of the archive into orgmode with the maff_export function shown further down.
In the resulting listing, each org-mode link points to the MAFF file's path on disk and then to the file inside it. For example, an org-mode link to one of the web pages in the archive looks like this:
[[shell:firefox --new-tab "jar:file:///tmp/example/example.maff!/1449828414312_509/index.html"][Lobsters]]
This works because Firefox supports jar: URLs. (Chrome, on the other hand, can't yet open or save MAFF files and can't open jar: URLs.)
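The shape of such a URL is "jar:file://", then the absolute path of the archive, then "!/", then the path of the page inside the archive. A small sketch of assembling one follows. The helper name maff_jar_url is an assumption for illustration and does not exist in org-maff-lib.sh.

```shell
# Hypothetical helper: build the jar: URL for a page inside a MAFF archive.
#   $1 - absolute path of the .maff file
#   $2 - path of the page inside the archive
maff_jar_url() {
    case $1 in
        /*) ;;  # absolute path, fine
        *)  echo "maff_jar_url: need an absolute path" >&2; return 1 ;;
    esac
    printf 'jar:file://%s!/%s\n' "$1" "$2"
}
```

Calling `maff_jar_url /tmp/example/example.maff 1449828414312_509/index.html` prints the same URL that appears in the org-mode link above. The absolute-path check is there because a relative path would not produce a valid file: URL.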
The org-maff-lib.sh Bash script is composed of a few functions.
The maff_export function reads the contents of a MAFF file and outputs them as an org-mode-compatible tree listing.
You can then open all the files from orgmode.
maff_export() {
    # jar: URLs need an absolute path to the archive.
    MAFF_FILE=$(readlink -f "$1")
    # One .rdf file per saved page; column 4 of `unzip -l` is the file name.
    RDF_FILES=$(unzip -qq -l "$MAFF_FILE" "*.rdf" | awk '{print $4}')
    echo "*** MAFF file $MAFF_FILE"
    for rdf in $RDF_FILES; do
        # Each page lives in a directory named like 1449828414312_509.
        DIR_NAME=$(echo "$rdf" | perl -ne 'm{(\d+_\d+)} && print "$1"')
        RDF_CONTENT=$(unzip -q -c "$MAFF_FILE" "$rdf")
        # Replace characters that would break the org-mode link syntax.
        TITLE=$(echo "$RDF_CONTENT" \
            | xmlstarlet sel -t -v '//MAF:title/@RDF:resource' \
            | perl -pne '$x=chr(34); $y=chr(39); s{[,$x$y]}{ }g; s{--}{}g; s{[\|\t\]\[]}{ }g;')
        ORIG_URL=$(echo "$RDF_CONTENT" | xmlstarlet sel -t -v '//MAF:originalurl/@RDF:resource')
        INDEX_FILE=$(echo "$RDF_CONTENT" | xmlstarlet sel -t -v '//MAF:indexfilename/@RDF:resource')
        FULL_JAR_URL="jar:file://$MAFF_FILE!/$DIR_NAME/$INDEX_FILE"
        if [[ $FULL_JAR_URL =~ \.pdf$ ]]; then
            # PDFs are extracted to /tmp/maff1 and opened in okular.
            # unzip -p writes only the file data to stdout.
            OUTFILE=$(echo "$ORIG_URL" | md5sum | cut -d' ' -f1).pdf
            cmd="mkdir -p /tmp/maff1/ 2>/dev/null;unzip -p $MAFF_FILE \"$DIR_NAME/$INDEX_FILE\" > \"/tmp/maff1/$OUTFILE\" ; nohup bash -c 'okular \"/tmp/maff1/$OUTFILE\" & 2>/dev/null >/dev/null' 2>/dev/null >/dev/null & exit 0"
            echo "**** TODO [[shell:$cmd][PDF-$TITLE]]"
        else
            cmd="firefox --new-tab \"$FULL_JAR_URL\""
            echo "**** TODO [[shell:$cmd][$TITLE]]"
        fi
    done
}
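The perl one-liner applied to the title is the least obvious part of maff_export: it replaces commas, both kinds of quotes, pipes, tabs, and square brackets with spaces and removes double hyphens, so the title can sit safely inside an org-mode link. As a rough illustration, the same clean-up can be sketched in pure Bash. The function name org_safe_title is hypothetical and not part of the library.

```shell
# Hypothetical pure-Bash equivalent of the perl title clean-up in maff_export.
org_safe_title() {
    local t=$1 c
    # comma, double quote, single quote, pipe, tab, square brackets -> space
    for c in ',' '"' "'" '|' $'\t' '[' ']'; do
        t=${t//"$c"/ }
    done
    # remove double hyphens
    printf '%s\n' "${t//--/}"
}
```

For example, `org_safe_title 'a,b'` prints `a b`. Quoting the pattern as "$c" makes Bash treat `[` and `]` literally instead of as glob syntax.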
This function creates the SQLite table used to index the contents of all pages contained in the MAFF files. It is an FTS3 virtual table, so it provides a full-text index. The path of the database file is taken from the DB variable.
maff_table_create() {
    QUERY_CREATE="
    CREATE VIRTUAL TABLE maff USING fts3 (
        file    CHAR(600),
        path    CHAR(600),
        title   TEXT,
        body    TEXT,
        origurl CHAR(1000),
        sha1    CHAR(200)
    );
    "
    touch "$DB"
    # Errors (for example, the table already existing) are silenced.
    echo "$QUERY_CREATE" | sqlite3 "$DB" 2>/dev/null
}
This function uses html2text to extract the contents of each HTML file contained in the MAFF archive, and then inserts it in the sqlite DB. The conversion to text is only done as a pre-processing step in order to index the contents for full-text search.
# pdf2txt is in package python-pdfminer
# html2text is in package html2text
maff_table_index() {
    QUERY='INSERT INTO maff ("file","path","title","body","origurl","sha1") '
    SQ="'"
    MAFF_FILE=$(readlink -f "$1")
    RDF_FILES=$(unzip -qq -l "$MAFF_FILE" "*.rdf" | awk '{print $4}')
    echo "*** MAFF file $MAFF_FILE"
    for rdf in $RDF_FILES; do
        DIR_NAME=$(echo "$rdf" | perl -ne 'm{(\d+_\d+)} && print "$1"')
        RDF_CONTENT=$(unzip -q -c "$MAFF_FILE" "$rdf")
        TITLE=$(echo "$RDF_CONTENT" \
            | xmlstarlet sel -t -v '//MAF:title/@RDF:resource' \
            | perl -pne '$x=chr(34); $y=chr(39); s{[,$x$y]}{ }g; s{--}{}g; s{[\|\t\]\[]}{ }g;')
        ORIG_URL=$(echo "$RDF_CONTENT" | xmlstarlet sel -t -v '//MAF:originalurl/@RDF:resource')
        INDEX_FILE=$(echo "$RDF_CONTENT" | xmlstarlet sel -t -v '//MAF:indexfilename/@RDF:resource')
        # Only HTML pages are indexed.
        if [[ $INDEX_FILE =~ \.html$ ]]; then
            CONTENT_RAW=$(unzip -qq -c "$MAFF_FILE" "$DIR_NAME/$INDEX_FILE")
            # Flatten the page to one line of text and strip the characters
            # that would break the SQL string literal.
            CONTENT=$(printf '%s\n' "$CONTENT_RAW" | timeout 3 html2text \
                | perl -pne '$x=chr(34);$y=chr(39); s{(\n|[,$x$y])}{ }g; s{--}{}g;')
            SHA1=$(printf '%s\n' "$CONTENT_RAW" | sha1sum | awk '{print $1}')
            # The file path and URL are not sanitized above, so double any
            # single quotes they contain.
            Q1="$QUERY VALUES ('${MAFF_FILE//$SQ/$SQ$SQ}','$DIR_NAME/$INDEX_FILE','$TITLE','$CONTENT','${ORIG_URL//$SQ/$SQ$SQ}','$SHA1');"
            echo "$TITLE"
            printf '%s\n' "$Q1" | sqlite3 "$DB"
        fi
    done
}
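The indexing function builds its INSERT statement by string concatenation, which is why quotes are stripped from the title and body beforehand. An alternative, shown here only as a sketch, is to keep the text intact and escape it the standard SQL way, by doubling single quotes. The helper name sql_quote is an assumption and is not part of org-maff-lib.sh.

```shell
# Hypothetical helper: wrap a string in single quotes for use as an SQL
# string literal, doubling any single quote inside it.
sql_quote() {
    local s=$1 q="'"
    printf "'%s'\n" "${s//$q/$q$q}"
}
```

With it, a value could be spliced into a statement as `VALUES ($(sql_quote "$TITLE"), ...)`. Holding the quote character in a variable avoids backslash-escaping inside the parameter expansion, which is easy to get wrong.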
This function allows you to do full-text searches on all documents stored in the MAFF files.
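maff_table_search() {
    MATCH=$1
    # Build one org-mode link per matching document.
    QUERY="
    SELECT ('[[shell:firefox --new-tab \"jar:file://' || file || '!/' || path || '\"][' || title || ']]') AS link
    FROM maff
    WHERE body MATCH '$MATCH'
    GROUP BY origurl;
    "
    # Quoting the query keeps characters such as * in the search term
    # from being glob-expanded by the shell.
    echo "$QUERY" | sqlite3 "$DB"
}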
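To get a feel for what the MATCH operator accepts, here is a throwaway sketch that runs against an in-memory database instead of the real index. The table name docs, the two rows, and the function name fts_demo are all made up for illustration. It assumes a sqlite3 binary built with FTS3 support.

```shell
# Illustration only: query a temporary in-memory FTS3 table with two
# made-up rows. The search term is spliced into the SQL unescaped, so
# this is for experimenting, not for untrusted input.
fts_demo() {
    local match=$1
    sqlite3 :memory: "
        CREATE VIRTUAL TABLE docs USING fts3 (title TEXT, body TEXT);
        INSERT INTO docs (title, body) VALUES ('Page A', 'reading offline in batches');
        INSERT INTO docs (title, body) VALUES ('Page B', 'indexing archives with sqlite');
        SELECT title FROM docs WHERE body MATCH '$match';
    "
}
```

`fts_demo offline` prints `Page A`, and a prefix query such as `fts_demo 'index*'` prints `Page B`. The same kinds of search terms can be passed to maff_table_search.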
This function counts how many documents are present in a MAFF file.
maff_count() {
    MAFF_FILE=$(readlink -f "$1")
    # One .rdf file per saved page. Counting the listing lines directly
    # also yields 0 for an archive that contains no pages.
    COUNT=$(unzip -qq -l "$MAFF_FILE" "*.rdf" 2>/dev/null | wc -l)
    echo $COUNT
}
Footnotes:
[1] Recoll is an interesting (and more generic) solution for indexing files.
EDIT: Looks like someone else had a similar approach. They were also writing about MAFF archives in 2011 and mentioned the need to index the contents of the archives.
EDIT: Improved the search function; it now only selects distinct documents that match the query.