
Use Scrapy + Splash Return Html

I'm trying to figure out Scrapy and Splash. As an exercise, I tried to make Splash click on the button on the following JavaScript-heavy website: http://thestlbrowns.com/ and then

Solution 1:

The Splash response does contain some hints:

{'description': 'Error happened while executing Lua script',
 'error': 400,
 'info': {'error': "bad argument #2 to 'assert' (string expected, got table)",
          'line_number': 8,
          'message': 'Lua error: [string "..."]:8: bad argument #2 to \'assert\' (string expected, got table)',
          'source': '[string "..."]',
          'type': 'LUA_ERROR'},
 'type': 'ScriptError'}
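
When debugging from Python, it can help to reduce a payload like this to one line. The following is a minimal sketch, assuming the error arrives as a dict shaped like the one above; the helper name summarize_splash_error is hypothetical and not part of Splash or Scrapy.

```python
def summarize_splash_error(payload):
    """Return a one-line summary of a Splash ScriptError payload.

    Hypothetical helper: it only assumes the dict layout shown above.
    """
    info = payload.get("info", {})
    return "{} at line {}: {}".format(
        info.get("type", payload.get("type", "unknown")),
        info.get("line_number", "?"),
        info.get("error", payload.get("description", "")),
    )


# The payload from the answer above, copied verbatim.
payload = {
    'description': 'Error happened while executing Lua script',
    'error': 400,
    'info': {'error': "bad argument #2 to 'assert' (string expected, got table)",
             'line_number': 8,
             'message': 'Lua error: [string "..."]:8: bad argument #2 to \'assert\' (string expected, got table)',
             'source': '[string "..."]',
             'type': 'LUA_ERROR'},
    'type': 'ScriptError',
}

print(summarize_splash_error(payload))
```

The line number in the summary refers to the Lua script, which is how the failing assert(splash:runjs(...)) call can be located.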

If you try your script in Splash's web interface (it is your friend!), you get the same error, which comes from this line:

assert(splash:runjs("$('#title.play-ball > a:first-child').click()"))

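splash:runjs returns an ok, err pair, and when the JavaScript fails, err is a table rather than a string, which is what assert complains about. If you change that Lua script a bit to catch the error (by the way, I believe you meant .title.play-ball > a:first-child, because there's no element with id="title"):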

function main(splash)
     local url = splash.args.url
     assert(splash:go(url))
     assert(splash:wait(1))

     -- click the link and wait a little (1 second);
     -- keep the error instead of asserting on it
     local ok, err = splash:runjs("$('.title.play-ball > a:first-child').click()")
     assert(splash:wait(1))

     -- return result as a JSON object
     return {
         html = splash:html(),
         error = err
         -- we don't need screenshot or network activity
         --png = splash:png(),
         --har = splash:har(),
     }
 end

and running it in the web interface, you get an "error" object in the response, which shows:

error: Object
    js_error: "ReferenceError: Can't find variable: $"
    js_error_message: "Can't find variable: $"
    js_error_type: "ReferenceError"
    message: "JS error: \"ReferenceError: Can't find variable: $\""
    splash_method: "runjs"
    type: "JS_ERROR"

It appears that $ is not defined on that website. You can use it in the Chrome console, most likely because the console provides its own $ helper, but with Splash you apparently need to load jQuery (or something similar) yourself, usually with splash:autoload. For example:

function main(splash)
     assert(splash:autoload("https://code.jquery.com/jquery-3.1.1.min.js"))
     local url = splash.args.url
     assert(splash:go(url))
     assert(splash:wait(1))

     -- click the link and wait a little (1 second);
     -- keep the error instead of asserting on it
     local ok, err = splash:runjs("$('.title.play-ball > a:first-child').click()")
     assert(splash:wait(1))

     -- return result as a JSON object
     return {
         html = splash:html(),
         error = err
         -- we don't need screenshot or network activity
         --png = splash:png(),
         --har = splash:har(),
     }
 end

Note that this JavaScript code still did not work for me with Splash: the screenshot did not show the "History" part.

However, the following script, run in the web interface, did show "History" (in the PNG screenshot, which is commented out here):

function main(splash)
     -- no need to load jQuery when you use splash:select
     --assert(splash:autoload("https://code.jquery.com/jquery-3.1.1.min.js"))
     local url = splash.args.url
     assert(splash:go(url))
     assert(splash:wait(15))

     local element = splash:select('.title.play-ball > a:first-child')
     local bounds = element:bounds()
     assert(element:mouse_click{x=bounds.width/2, y=bounds.height/2})
     assert(splash:wait(5))

     -- return result as a JSON object
     return {
         html = splash:html(),
         -- we don't need screenshot or network activity
         --png = splash:png(),
         --har = splash:har(),
     }
 end

Indeed, Splash 2.3 has helpers for this kind of interaction (e.g., clicking on an element). See, for example, splash:select and element:mouse_click.

Also note that I increased the wait() values (from 1 second to 15 and 5 seconds).
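
Outside the web interface, a script like this can be sent to Splash's execute endpoint as a JSON POST body. The following is a rough sketch using only the standard library; the address http://localhost:8050 and the function name build_execute_request are assumptions, and the timeout value is just an illustration chosen to leave room for the longer waits. The request is only built here, not sent.

```python
import json
from urllib.request import Request

# Assumption: a Splash instance listening locally on its usual port.
SPLASH_EXECUTE = "http://localhost:8050/execute"


def build_execute_request(lua_source, url, timeout=60):
    """Build (but do not send) a JSON POST request for Splash's execute endpoint.

    Hypothetical helper. `url` ends up in splash.args.url inside the Lua script;
    `timeout` is passed along as Splash's render timeout argument.
    """
    body = json.dumps({
        "lua_source": lua_source,
        "url": url,
        "timeout": timeout,
    }).encode("utf-8")
    return Request(
        SPLASH_EXECUTE,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder script; in practice this would be the full Lua script above.
lua_script = "function main(splash) return {html = splash:html()} end"
request = build_execute_request(lua_script, "http://thestlbrowns.com/")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would perform the call; the JSON response should then contain the html key that the Lua script returns. Whether a given timeout is accepted depends on how the Splash server is configured.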


Solution 2:

You need to "quote" (URL-encode) your script before you pass it to Splash:

from urllib.parse import quote

script = """Your script"""
script = quote(script)
# 'Your%20script'
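
Quoting matters when the script travels in the query string of a GET request. As a sketch under that assumption, the snippet below builds such a URL with urlencode, using quote for every value; the address http://localhost:8050 and the function name build_execute_url are hypothetical choices for illustration.

```python
from urllib.parse import quote, urlencode

# Assumption: a Splash instance listening locally on its usual port.
SPLASH_EXECUTE = "http://localhost:8050/execute"


def build_execute_url(lua_source, target_url):
    """Return a GET URL for Splash's execute endpoint with quoted arguments.

    Hypothetical helper: quote_via=quote makes urlencode percent-encode
    spaces as %20 (instead of '+') in both the script and the target URL.
    """
    query = urlencode(
        {"lua_source": lua_source, "url": target_url},
        quote_via=quote,
    )
    return SPLASH_EXECUTE + "?" + query


print(build_execute_url("Your script", "http://thestlbrowns.com/"))
```

For long scripts, sending the arguments as a JSON POST body avoids the quoting step altogether.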
