Add-on:Parsedom for xbmc plugins

From Official Kodi Wiki

Parsedom for xbmc plugins.


Author: TheCollective

Type: Add-on library/module
Repo:

Summary: Parsedom for xbmc plugins.

Testing/Status

Integration and unit tests are run continuously by Jenkins

http://tc.tobiasussing.dk/jenkins/view/Common%20Functions/

Developers

This DOM parser is a fast replacement for Beautiful Soup.

The module also provides a few other useful functions.

Setup

To use the parsedom functions edit your addon.xml like this.

 <requires>
   <import addon="xbmc.python" version="2.0"/>
   <import addon="script.module.parsedom" version="0.9.0"/> <!-- Add this -->
 </requires>

Then add the following to your Python file.

 import xbmc, xbmcvfs, xbmcaddon
 import CommonFunctions
 common = CommonFunctions.CommonFunctions()
 common.plugin = "PluginName" # Name used as prefix in log output; replace with your add-on's name

Debugging

To enable debugging, set the following values in default.py.

common.dbg = True # Default
common.dbglevel = 3 # Default

Whenever you debug your own code you should also debug in the cache. Otherwise you should remember to DISABLE it.

parseDOM(self, html, name = "", attrs = {}, ret = False)

  • html(string or list) - String to parse, or list of strings to parse.
  • name(string) - Element to match ( for instance "span" )
  • attrs(dictionary) - Dictionary with attributes you want matched in the element ( for instance { "id": "span3", "class": "oneclass.*anotherclass", "attribute": "a random tag" } )
  • ret(string or False) - Attribute in element to return value of. If not set (or False), returns the content of the DOM element.

returns list
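
The example value "oneclass.*anotherclass" suggests that attribute values are matched as regular expressions rather than as literal strings. The sketch below illustrates that assumption with Python's re module alone. The attr_matches helper is hypothetical and not part of CommonFunctions, and the library's exact matching rules may differ.

```python
import re

def attr_matches(pattern, value):
    """Return True if the whole attribute value matches the regex pattern.

    Hypothetical helper: it only illustrates regex-style attribute
    matching and is not how parseDOM is implemented.
    """
    return re.fullmatch(pattern, value) is not None

# A pattern with ".*" accepts anything between the two class names.
print(attr_matches("oneclass.*anotherclass", "oneclass middle anotherclass"))  # True
# A plain string still matches itself.
print(attr_matches("span3", "span3"))  # True
# A value missing the second class name does not match.
print(attr_matches("oneclass.*anotherclass", "oneclass"))  # False
```

If this assumption holds, characters such as "." or "?" in an attribute value would need escaping when a literal match is wanted.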

Getting element content.

 link_html = "<a href='bla.html'>Link Test</a>"
 ret = common.parseDOM(link_html, "a")
 print repr(ret) # Prints ['Link Test']

Getting an element attribute.

 link_html = "<a href='bla.html'>Link Test</a>"
 ret = common.parseDOM(link_html, "a", ret = "href")
 print repr(ret) # Prints ['bla.html']

Get element with matching attribute.

 link_html = "<a href='bla1.html' id='link1'>Link Test1</a><a href='bla2.html' id='link2'>Link Test2</a><a href='bla3.html' id='link3'>Link Test3</a>"
 ret1 = common.parseDOM(link_html, "a", attrs = { "id": "link1" }, ret = "href")
 ret2 = common.parseDOM(link_html, "a", attrs = { "id": "link2" })
 ret3 = common.parseDOM(link_html, "a", attrs = { "id": "link3" }, ret = "id")
 print repr(ret1) # Prints ['bla1.html']
 print repr(ret2) # Prints ['Link Test2']
 print repr(ret3) # Prints ['link3']

When scraping sites it is prudent to scrape in steps, since real websites are often complicated.

Take this example where you want to get all the user uploads.

<div id="content">
 <div id="sidebar">
  <div id="latest">
   <a href="/video?8wxOVn99FTE">Miley Cyrus - When I Look At You</a><br />
   <a href="/video?46">Puppet theater</a><br />
   <a href="/video?98">VBLOG #42</a><br />
   <a href="/video?11">Fourth upload</a><br />
  </div>
 </div>
 <div id="user">
  <div id="uploads">
   <a href="/video?12">First upload</a><br />
   <a href="/video?23">Second upload</a><br />
   <a href="/video?34">Third upload</a><br />
   <a href="/video?41">Fourth upload</a><br />
  </div>
 </div>
</div>

The first step is to limit your search to the correct area.

Always look for the innermost DOM element that contains the needed data.

 ret = common.parseDOM(html, "div", attrs = { "id": "uploads" })

The variable ret now contains

  ['<a href="/video?12">First upload</a><br />
  <a href="/video?23">Second upload</a><br />
  <a href="/video?34">Third upload</a><br />
  <a href="/video?41">Fourth upload</a><br />']

Now extract the video URLs from that result.

 videos = common.parseDOM(ret, "a", ret = "href")
 print repr(videos) # Prints ['/video?12', '/video?23', '/video?34', '/video?41']
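
The two-step approach above (narrow down to the innermost container, then extract) can also be tried outside Kodi. The following is a minimal sketch that uses Python 3's standard html.parser instead of parseDOM. The LinksInDiv class and the shortened sample page are hypothetical, and this is not how CommonFunctions works internally.

```python
from html.parser import HTMLParser

class LinksInDiv(HTMLParser):
    """Collect href values of <a> tags inside the <div> with a given id."""

    def __init__(self, div_id):
        super().__init__()
        self.div_id = div_id
        self.depth = 0      # 0 while outside the target div
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth:
                self.depth += 1          # nested div inside the target
            elif attrs.get("id") == self.div_id:
                self.depth = 1           # step 1: entered the target div
        elif tag == "a" and self.depth and "href" in attrs:
            self.hrefs.append(attrs["href"])  # step 2: collect the link

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

html = """
<div id="content">
 <div id="latest"><a href="/video?98">VBLOG #42</a><br /></div>
 <div id="uploads">
  <a href="/video?12">First upload</a><br />
  <a href="/video?23">Second upload</a><br />
 </div>
</div>
"""

parser = LinksInDiv("uploads")
parser.feed(html)
print(parser.hrefs)  # ['/video?12', '/video?23']
```

The link in the "latest" block is skipped because it sits outside the target div, which is the point of limiting the search area first.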


_fetchPage(self, dict)

WARNING: This call will be changed from "_fetchPage" to "fetchPage" in the future.

Fetches a page from the internet.

Returns dict.

A "status" value of 200 indicates success, and 500 an error.

The dict returned contains

  • status: 200 or 500
  • content: HTML content
  • new_url: Redirect url
  • header: Header information

result = common._fetchPage({"url": "http://www.example.com/index.html"})
if result["status"] == 200:
   print "content: %s" %result["content"]

result = common._fetchPage({"url": "http://www.example.com/doesnotexist.html"})
if result["status"] == 500:
   print "redirect url: %s" %result["new_url"]
   print "header: %s" %result["header"]
   print "content: %s" %result["content"]
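
Callers will usually want to branch on the status once and then work with the content. The following is a minimal sketch of such a check, assuming the result is a dict with the "status", "content", "new_url" and "header" keys used in the examples above. The content_or_none helper and the hand-written dicts are hypothetical stand-ins, so the sketch runs without Kodi or network access.

```python
def content_or_none(result):
    """Return the page content when the fetch succeeded, else None.

    `result` is assumed to be a dict shaped like the one _fetchPage
    is described as returning.
    """
    if result.get("status") == 200:
        return result.get("content")
    return None

# Hand-written stand-ins for what _fetchPage might return.
ok = {"status": 200, "content": "<html>hi</html>", "new_url": "", "header": ""}
failed = {"status": 500, "content": "", "new_url": "", "header": ""}

print(content_or_none(ok))      # <html>hi</html>
print(content_or_none(failed))  # None
```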

log(self, str, level = 0)

Sends the string to the xbmc.log function if the level provided is less than the level set in dbglevel.

Returns None

 import CommonFunctions
 common = CommonFunctions.CommonFunctions()
 common.plugin = "PluginName"
 common.dbg = True
 common.dbglevel = 3
 def helloWorld():
   common.log("Ran this")
   common.log("Ignored this", 4)
   common.log("Ran this as well", 2)

 helloWorld()

Will give the following output in the xbmc.log

[PluginName] helloWorld : 'Ran this'
[PluginName] helloWorld : 'Ran this as well'
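
The filtering rule can be written down as a small predicate. This is a sketch of the rule as described above (a message is logged only when its level is lower than dbglevel), not the library's code. The should_log name is hypothetical, and treating dbg = False as "log nothing" is an assumption.

```python
def should_log(level, dbg=True, dbglevel=3):
    """Return True when a message of this level would be logged.

    Assumption: dbg switches logging on or off as a whole, and a
    message passes only if its level is strictly below dbglevel.
    """
    return dbg and level < dbglevel

for message, level in [("Ran this", 0), ("Ignored this", 4), ("Ran this as well", 2)]:
    if should_log(level):
        print(message)
# Prints "Ran this" and "Ran this as well"
```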

openFile(self, filepath, options = "w")

Opens a binary or text file handle

Returns filehandle

file = common.openFile("myfile.txt", "wb")
file.write("my data")
file.close()

getUserInput(self, title = "Input", default="", hidden=False)

This function opens the on-screen keyboard for user input.

Returns string

search = common.getUserInput("Artist", "") # Will ask the user to write an Artist to search for

def_search = common.getUserInput("Artist", "Miley Cyrus") # Will default to Miley Cyrus if the user doesn't enter another artist.

getUserInputNumbers(self, title = "Input", default="", hidden=False)

Warning: This will return string in the future.

This function opens the on-screen numeric keypad for user input.

Returns int

pin = common.getUserInputNumbers("Userpin", "") # Will ask the user to enter a pin.

def_pin = common.getUserInputNumbers("Userpin", "1234") # Will default to 1234 if the user doesn't enter another pin.
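
Because the return type is announced to change from int to string, calling code may want to accept both. The following is a minimal sketch of one way to do that. The as_int helper is hypothetical and not part of CommonFunctions.

```python
def as_int(value, fallback=None):
    """Convert keypad input to int, whether it arrives as int or str."""
    try:
        return int(value)
    except (TypeError, ValueError):
        # Empty or non-numeric input, or None
        return fallback

print(as_int(1234))    # 1234
print(as_int("1234"))  # 1234
print(as_int(""))      # None
```

Note that converting to int drops leading zeros, so a pin such as "0123" is better kept as a string if the zeros matter.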

getParameters(self, dict)

Converts the request url passed on by xbmc to the plugin into a dict of key-value pairs

returns dict

params = common.getParameters(sys.argv[2]) # sys.argv[2] would be something like "?path=/root/favorites&login=true"
print repr(params) # Prints '{ "path": "/root/favorites", "login": "true" }'
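
For testing outside Kodi, a comparable conversion can be sketched with Python 3's standard library. This is not the getParameters implementation. The parse_plugin_params name is hypothetical, and urllib's handling of percent-escapes and repeated keys may differ from the library's.

```python
from urllib.parse import parse_qsl

def parse_plugin_params(query):
    """Turn a query string such as "?a=1&b=2" into a dict of strings."""
    return dict(parse_qsl(query.lstrip("?")))

params = parse_plugin_params("?path=/root/favorites&login=true")
print(params)  # {'path': '/root/favorites', 'login': 'true'}
```

As in the example above, every value comes back as a string, so "true" has to be compared as text rather than as a boolean.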

replaceHtmlCodes(self, str)

WARNING: This call will be changed from "replaceHtmlCodes" to "replaceHTMLCodes" in the future.

Replaces HTML character codes (such as &amp;amp;) with the characters they represent.

Returns string

clean_string = common.replaceHtmlCodes("&amp;&quot;&hellip;&gt;&lt;&#39;")
print clean_string # Prints &"...><'
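
For comparison, Python 3's standard library offers html.unescape for the same kind of conversion. The sketch below uses it on the same input; it is a stand-in for experimenting outside Kodi, not the replaceHtmlCodes implementation, and the two may differ in detail. For instance, html.unescape turns &amp;hellip; into the single ellipsis character rather than three dots.

```python
import html

raw = "&amp;&quot;&hellip;&gt;&lt;&#39;"
clean = html.unescape(raw)
print(clean)  # &"…><'
```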

stripTags(self, str)

Removes all DOM elements.

Returns string

clean_string = common.stripTags("I want this text <img src= alt='without this'>")
print clean_string # Prints "I want this text"
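
A common way to get this effect is a regular expression that removes everything between "<" and ">". The following is a minimal sketch under that assumption. The strip_tags function is hypothetical and the library's own implementation may handle whitespace or malformed markup differently.

```python
import re

def strip_tags(text):
    """Remove anything that looks like a tag, i.e. "<...>"."""
    return re.sub(r"<[^>]*>", "", text)

clean = strip_tags("I want this text <img src= alt='without this'>")
print(repr(clean))  # 'I want this text '
```

This simple pattern leaves the space before the removed tag in place, so a strip() call may be needed afterwards.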

makeAscii(self, str)

This function implements a horrible hack to work around Python 2.4's terrible unicode handling.

Returns string

clean_string = common.makeAscii("test נלה מהי test")
print clean_string # Prints "test   test"
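
The effect shown above (non-ASCII characters dropped, ASCII characters and spaces kept) can be reproduced in Python 3 with an encode/decode round trip. This is a sketch of the behaviour only. The make_ascii function is hypothetical and is not the Python 2.4 workaround the library uses.

```python
def make_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode("ascii", "ignore").decode("ascii")

# Same sample as above, with the Hebrew letters written as escapes.
sample = "test \u05e0\u05dc\u05d4 \u05de\u05d4\u05d9 test"
print(repr(make_ascii(sample)))  # 'test   test'
```

The three spaces around the removed words survive, which is why the result contains a run of spaces between the two "test" words.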