I was looking for a way to read in batches, by which I mean: get some online content, archive it easily, go offline, and start reading.

Along the way, I make notes, summarize, and keep a list of open questions about what I've read, then repeat the whole process.

The majority of documents I read are blog posts, documentation, mailing lists, Wikipedia pages, newspaper articles, links people recommend, news.ycombinator.com articles, lobste.rs articles, and Stack Overflow posts.

I'm sure there are multiple existing solutions for this; I haven't read all of them, so I'm just writing down what I did.

These batches also accumulate as time goes by, and some indexing[1] is required to make them searchable (we'll have a look at that too).

There are four things I'm using for this:

- the Snap Links browser extension
- the Mozilla Archive Format (MAF) extension
- org-mode
- org-maff-lib.sh

Let's take them one by one. The Snap Links extension allows you to select a number of links on a page and open them in new tabs. The MAF extension creates a MAFF archive (one type of web archive) that stores all the pages open in your browser's tabs, including the JavaScript, CSS, and images for each of them. Org-mode is used to view the contents of a MAFF archive and open the pages inside it; since org-mode is also a great note-taking tool, you can summarize those pages and take notes while reading. The last on the list, org-maff-lib.sh, is a Bash library that allows you to manipulate these MAFF files.

A MAFF archive is easy to make: you just open up a page in Firefox and select a few links.

Then just save all the tabs as a MAFF archive.

If you look at the example.maff file you've just created, it's just a ZIP archive with the web pages inside it.
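You can check this with unzip; each saved tab sits in its own numbered directory holding an index.rdf metadata file, the page itself, and its support files (the listing below is illustrative):

unzip -Z1 example.maff
# 1449828414312_509/index.rdf
# 1449828414312_509/index.html
# 1449828414312_509/index_files/
# ...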

After you're done with this, you can easily get all the contents of the archive into org-mode.
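Concretely, that amounts to one call to maff_export from org-maff-lib.sh (the function is shown below); the org file name here is an assumption, use whatever file holds your reading list:

source org-maff-lib.sh
maff_export /tmp/example/example.maff >> ~/reading.org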

The org-mode links produced this way point to the MAFF file path on disk and then to the file inside it. For example, an org-mode link to one of the web pages in the archive looks like this:

[[shell:firefox --new-tab "jar:file:///tmp/example/example.maff!/1449828414312_509/index.html"][Lobsters]]

This works because Firefox has support for jar URLs (Chrome, on the other hand, can't yet open or save MAFF files and can't open jar URLs).

The org-maff-lib.sh Bash script is composed of a few functions.

The maff_export function pulls all the metadata out of a MAFF file and prints an org-mode tree listing all of its contents; you can then open each of the files from org-mode.

maff_export() {
    # jar: URLs need an absolute path, so resolve the argument first.
    MAFF_FILE=$(readlink -f "$1")
    # Every saved tab has an RDF metadata file; list them all.
    RDF_FILES=$(unzip -qq -l "$MAFF_FILE" "*.rdf" | awk '{print $4}')
    echo "*** MAFF file $MAFF_FILE"
    for rdf in $RDF_FILES; do
        # Tab directories are named <timestamp>_<id>, e.g. 1449828414312_509.
        DIR_NAME=$(echo "$rdf" | perl -ne 'm{(\d+_\d+)} && print "$1"')
        RDF_CONTENT=$(unzip -q -c "$MAFF_FILE" "$rdf")
        # Extract the title, original URL and index file name; strip the
        # characters that would break the org-mode link syntax.
        TITLE=$(     echo "$RDF_CONTENT" | xmlstarlet sel -t -v '//MAF:title/@RDF:resource' | perl -pne '$x=chr(34); $y=chr(39); s{[,$x$y]}{ }g; s{--}{}g; s{[\|\t\]\[]}{ }g;')
        ORIG_URL=$(  echo "$RDF_CONTENT" | xmlstarlet sel -t -v '//MAF:originalurl/@RDF:resource')
        INDEX_FILE=$(echo "$RDF_CONTENT" | xmlstarlet sel -t -v '//MAF:indexfilename/@RDF:resource')

        FULL_JAR_URL="jar:file://$MAFF_FILE!/$DIR_NAME/$INDEX_FILE"

        if [[ $FULL_JAR_URL =~ \.pdf$ ]]; then
            # PDFs are extracted to /tmp and opened with okular instead of
            # through a jar URL (unzip -p writes the raw file, no headers).
            OUTFILE=$(echo "$ORIG_URL" | md5sum | cut -d' ' -f1).pdf
            cmd="mkdir -p /tmp/maff1/; unzip -p \"$MAFF_FILE\" \"$DIR_NAME/$INDEX_FILE\" > \"/tmp/maff1/$OUTFILE\"; nohup okular \"/tmp/maff1/$OUTFILE\" >/dev/null 2>&1 & exit 0"
            echo "**** TODO [[shell:$cmd][PDF-$TITLE]]"
        else
            cmd="firefox --new-tab \"$FULL_JAR_URL\""
            echo "**** TODO [[shell:$cmd][$TITLE]]"
        fi
    done
}

This function creates an SQLite table used to index the contents of all pages stored in the MAFF files. It uses a full-text (FTS3) index; the $DB variable is expected to hold the path to the database file.

maff_table_create() {
    QUERY_CREATE="
    CREATE VIRTUAL TABLE maff USING fts3 (
        file   CHAR(600),
        path   CHAR(600),
        title  TEXT,
        body   TEXT,
        origurl CHAR(1000),
        sha1   CHAR(200)
    );
    "
    touch "$DB"
    # Ignore the error if the table already exists.
    echo "$QUERY_CREATE" | sqlite3 "$DB" 2>/dev/null
}
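A minimal usage sketch; the database location is an assumption, point $DB wherever you want the index to live:

DB=~/maff.db        # assumed location of the index database
maff_table_create
echo '.schema maff' | sqlite3 "$DB"   # confirm the virtual table exists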

This function uses html2text to extract the text of each HTML file contained in the MAFF archive and then inserts it into the SQLite DB. The conversion to text is done only as a pre-processing step, in order to index the contents for full-text search.

# pdf2txt   is in package python-pdfminer
# html2text is in package html2text
maff_table_index() {
    QUERY='INSERT INTO maff ("file","path","title","body","origurl","sha1") '
    MAFF_FILE=$(readlink -f "$1")
    RDF_FILES=$(unzip -qq -l "$MAFF_FILE" "*.rdf" | awk '{print $4}')
    echo "*** MAFF file $MAFF_FILE"
    for rdf in $RDF_FILES; do
        DIR_NAME=$(echo "$rdf" | perl -ne 'm{(\d+_\d+)} && print "$1"')
        RDF_CONTENT=$(unzip -q -c "$MAFF_FILE" "$rdf")
        TITLE=$(     echo "$RDF_CONTENT" | xmlstarlet sel -t -v '//MAF:title/@RDF:resource' | perl -pne '$x=chr(34); $y=chr(39); s{[,$x$y]}{ }g; s{--}{}g; s{[\|\t\]\[]}{ }g;')
        ORIG_URL=$(  echo "$RDF_CONTENT" | xmlstarlet sel -t -v '//MAF:originalurl/@RDF:resource')
        INDEX_FILE=$(echo "$RDF_CONTENT" | xmlstarlet sel -t -v '//MAF:indexfilename/@RDF:resource')
        if [[ $INDEX_FILE =~ \.html$ ]]; then
            # Disable globbing while the page contents pass through the shell.
            set -f
            CONTENT_RAW=$(unzip -qq -c "$MAFF_FILE" "$DIR_NAME/$INDEX_FILE")
            # Strip markup plus the characters that would break the SQL below;
            # cap html2text at 3 seconds for pathological pages.
            CONTENT=$(echo "$CONTENT_RAW" | timeout 3 html2text | perl -pne '$x=chr(34);$y=chr(39); s{(\n|[,$x$y])}{ }g; s{--}{}g;' )
            SHA1=$(echo "$CONTENT_RAW" | sha1sum | awk '{print $1}')
            Q1=$QUERY
            Q1="$Q1 VALUES ('$MAFF_FILE','$DIR_NAME/$INDEX_FILE','$TITLE','$CONTENT','$ORIG_URL','$SHA1');"
            echo "$TITLE"
            echo "$Q1" | sqlite3 "$DB"
            set +f
        fi
    done
}
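With the table in place, indexing a whole batch is just a loop over the archives (the directory name is illustrative):

DB=~/maff.db
for f in ~/archives/*.maff; do
    maff_table_index "$f"
done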

This function lets you run full-text searches across all documents stored in the MAFF files; it prints the matches as org-mode links.

maff_table_search() {
    MATCH=$1
    # Build an org-mode shell: link for every matching document;
    # GROUP BY origurl returns a single entry per distinct page.
    QUERY="
    SELECT ('[[shell:firefox --new-tab \"jar:file://' || file || '!/' || path || '\"][' || title || ']]') AS link
    FROM maff
    WHERE body MATCH '$MATCH'
    GROUP BY origurl;
    "
    set -f
    echo "$QUERY" | sqlite3 "$DB"
    set +f
}
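For example, to pull every archived page mentioning sqlite into your reading list as org-mode links (file names are illustrative):

DB=~/maff.db
maff_table_search "sqlite" >> ~/reading.org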

This function counts how many documents are present in a MAFF file.

maff_count() {
    MAFF_FILE=$(readlink -f "$1")
    # One RDF metadata file per saved page.
    RDF_FILES=$(unzip -qq -l "$MAFF_FILE" "*.rdf" | awk '{print $4}')
    # grep -c . instead of wc -l, so an empty listing counts as 0, not 1.
    COUNT=$(echo "$RDF_FILES" | grep -c .)
    echo "$COUNT"
}
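Combined with the loop above, it can total up a whole directory of archives (the directory name is illustrative):

TOTAL=0
for f in ~/archives/*.maff; do
    TOTAL=$((TOTAL + $(maff_count "$f")))
done
echo "$TOTAL pages archived"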

Footnotes:

[1] Recoll is an interesting (and more generic) solution for indexing files.

EDIT [2015-12-18] It looks like someone else had a similar approach: they were also writing about MAFF archives in 2011 and mentioned the need to index the contents of the archives.

EDIT [2015-12-18] Improved the search function; it now only selects distinct documents that match the query.