Monthly Archives: November 2016

A taste of tag soup

TagSoup is a library for parsing non-compliant HTML.

To explain why you might want this, let’s start by considering the following table.

<table id="test" border="1">

Notice the missing closing td tag.

In a browser, this still renders as a table, border and all.

Though I don’t have actual statistics, judging by how often I make mistakes in my own HTML, I think it is reasonable to suspect that the HTML you are trying to parse may not be compliant.

The problem is that a strict XML parser will refuse to parse this. What we need is a parser lenient enough to get something useful out of malformed HTML. Remote-controlling a browser session would be a novel solution, but it brings overheads: it is slower, harder to set up and harder to code against. Let’s see how far we can get with some common XML parsers.

For example, the popular HaXml produces the following error when asked to parse a document with a mismatched closing tag.

Prelude Text.XML.HaXml.Parse> xmlParse "" "<b>hello world</i></b>"
*** Exception: in element tag b,
tag <b> terminated by </i>
  at file   at line 1 col 15

This library is way too strict to parse bad HTML. Let’s try another. This time we will try the package known on Hackage simply as xml.

Prelude Text.XML.Light> parseXML "<b>hello world</i></b>"
[Elem
   (Element{elName =
              QName{qName = "b", qURI = Nothing, qPrefix = Nothing},
            elAttribs = [],
            elContent =
              [Text
                 (CData{cdVerbatim = CDataText, cdData = "hello world</i>",
                        cdLine = Just 1})],
            elLine = Just 1})]

Slightly better, but as you can see, the stray closing i tag is treated as text, just like “hello world”.

Now let’s try TagSoup.

What is TagSoup? It is a library for parsing and re-rendering HTML.

To install it, run:

cabal install tagsoup

Once that is done, you can start scraping. First, though, you might want to read up on the library by searching online for “tag soup haskell”.

Also, if you do not know what “tag soup” is, you might want to read the Wikipedia page on it.

With that out of the way, first import TagSoup:

import Text.HTML.TagSoup

Now we can try to parse the broken HTML again.

Prelude Text.HTML.TagSoup> parseTags "<b>hello world</i></b>"
[TagOpen "b" [],TagText "hello world",TagClose "i",TagClose "b"]

…Now we’re getting somewhere.
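Because the result is just a list of tags, it can be edited and rendered back to HTML. The sketch below is one crude way to do that, assuming you want to drop every closing tag with a given name. The function name dropClosing is hypothetical; parseTags, renderTags and isTagCloseName come from the tagsoup package.

```haskell
import Text.HTML.TagSoup (isTagCloseName, parseTags, renderTags)

-- Hypothetical helper: parse the input, remove every closing tag with the
-- given name, and render the remaining tags back to a String.
-- Note that this removes legitimate closing tags of that name as well,
-- so it is only a sketch of the parse/edit/render cycle.
dropClosing :: String -> String -> String
dropClosing name = renderTags . filter (not . isTagCloseName name) . parseTags
```

In GHCi, dropClosing "i" "<b>hello world</i></b>" should give back "<b>hello world</b>".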

As a small demonstration of the library, let’s extract just the text from the HTML document.

Prelude Text.HTML.TagSoup> concatMap fromTagText $ filter isTagText $ parseTags "<b>hello world</i></b>"
"hello world"
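The same extraction can be wrapped up as a reusable function. This is a minimal sketch: the name extractText is made up for illustration, while innerText is a helper from the tagsoup package that concatenates the text of all TagText entries.

```haskell
import Text.HTML.TagSoup (innerText, parseTags)

-- Hypothetical helper: parse the (possibly broken) HTML and keep only
-- the text nodes, concatenated in document order.
extractText :: String -> String
extractText = innerText . parseTags
```

Calling extractText "<b>hello world</i></b>" in GHCi should again give "hello world".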

So here you see a library which handles broken HTML documents far more forgivingly than the usual XML libraries.
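To come back to the table with the missing closing td tag, here is a sketch of how the cell contents might be pulled out. The markup in the comment is an assumed example written for this illustration, not the original table, and cellTexts is a hypothetical name. It relies on partitions, isTagOpenName, isTagCloseName and innerText from the tagsoup package, and it does not handle nested tables.

```haskell
import Text.HTML.TagSoup
  (innerText, isTagCloseName, isTagOpenName, parseTags, partitions)

-- Hypothetical helper: return the text of each table cell.
-- partitions starts a new chunk at every opening td tag, so a cell with a
-- missing closing tag simply ends where the next cell begins. Within a
-- chunk we stop at the first closing td tag, if there is one.
--
-- Assumed example input:
--   <table id="test" border="1"><tr><td>one<td>two</td></tr></table>
cellTexts :: String -> [String]
cellTexts html =
  [ innerText (takeWhile (not . isTagCloseName "td") cell)
  | cell <- partitions (isTagOpenName "td") (parseTags html)
  ]
```

For the assumed input in the comment, cellTexts should return ["one","two"], even though the first cell is never closed.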

The cross-compatibility/targeted-system trade-off

Why do some people use Android and others iOS?

Why do some people like Windows and others macOS?

Why do some people like to write apps in Java while others prefer .NET?

Why do some people like web apps while others like native apps?

You will find that apps made for one system often aren’t available in the same way on other systems. One reason for this is that some environments are more cross-compatible than others. This compatibility, however, comes at a price.

Cross-platform code is often slower because it has to run through an interpreter. JavaScript in web applications is one example. Even Java bytecode has to be interpreted or compiled by a virtual machine before it runs.

Abstraction layers often provide only a subset of the functionality of the systems they target. Someone developing for Windows using .NET might be able to add context menus to File Explorer, but someone using Java may not do that, because they are trying to create a consistent experience across systems, some of which do not allow custom context menus. Also, web apps do not currently allow as much data to be stored locally as native apps do. A cross-platform developer might have to support a wider array of hardware, such as monitors with different resolutions, so they might settle for a simpler UI for the sake of development speed.

The more targeted developer can make more assumptions about RAM availability. While developers in the past have tried putting “Recommended Hardware Requirements” on the box, it is hard to expect users to check them. Even for experienced software buyers, such requirements greatly complicate the question of which software is available to them.

In summary:

  • If each individual who uses your app will use it a lot of the time, use native, as the experience can be custom fit to the platform.
  • If you want your app to be fast, use native.
  • Otherwise, if neither of the first two conditions applies and you want to target a large audience, use an environment which is more cross-platform.
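Purely as an illustration, the three rules above can be written down as a small decision function. Every name here is hypothetical, and the Undecided case is an addition of this sketch for situations the list does not cover.

```haskell
-- All names below are hypothetical; the function only restates the list above.
data Platform = Native | CrossPlatform | Undecided
  deriving (Eq, Show)

data AppProfile = AppProfile
  { heavyPerUserUse :: Bool -- each user will use the app a lot of the time
  , speedCritical   :: Bool -- the app needs to be fast
  , largeAudience   :: Bool -- the goal is to reach a large audience
  }

-- Guards are tried top to bottom, so the cross-platform rule is only
-- reached when neither of the first two conditions holds.
choosePlatform :: AppProfile -> Platform
choosePlatform p
  | heavyPerUserUse p = Native
  | speedCritical p   = Native
  | largeAudience p   = CrossPlatform
  | otherwise         = Undecided -- not covered by the list; an assumption
```

For example, choosePlatform (AppProfile False False True) evaluates to CrossPlatform, while setting either of the first two fields to True gives Native.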