Amazon Android App Store Free App of the Day RSS Feed in Django

I’ve been working off and on with Django for a little over a year now, but haven’t actually published anything yet. Most of the projects I’ve tackled are rather large, so I have nothing to really show for it. Last night, however, I came up with an idea for the perfect small Django project that would be simple to implement and actually quite useful.

Amazon’s Android App Store has the concept of a Free App of the Day. The idea is that if you check their marketplace every day, they’ll reward you with the opportunity to get a normally paid app for free. However, this of course requires you to remember to check it each day, something I’m rather bad at. What I’m not bad at, however, is visiting my RSS reader of choice at least once a day. So the only question is how to get the free app into an RSS format. (Spoiler alert: you can find the RSS page at http://rss.dougwarren.org/AmazonFreeAppFeed/)

Django to the Rescue

This gave me the opportunity to try out and showcase some technologies I’ve been wanting to use for a while. In a previous project I parsed HTML with StoneSoup, but it’s not very well supported and lxml.html has been benchmarked[1] as being superior in every way. I’ve also talked a lot to co-workers about how easily virtualenv keeps your Python projects separate and gets rid of dependency conflicts. And I’ve been meaning to look into celery for asynchronous task execution. (In particular, I liked the idea of using celery as a cron[2], as I didn’t like django-command-extension’s runjobs system. However, for the sake of finishing this in a single evening, I cut that last dependency. I left the code in place to come back to it in the future, but for now this is powered by cron.) Finally, of course, it would be my first public-facing Django project, albeit a very trivial and minimal one.

Installing Software


The first step is to take care of the actual installation of software. I’ll make the project root, set up virtualenv, and install Django:

[dwarren@thebigwave ~]$ cd /home/dougwarren/
[dwarren@thebigwave dougwarren]$ mkdir rss
[dwarren@thebigwave dougwarren]$ cd rss
[dwarren@thebigwave rss]$ virtualenv -p /usr/local/bin/python2.7 --no-site-packages .
[dwarren@thebigwave rss]$ . bin/activate
(rss)[dwarren@thebigwave rss]$ vi requirements.txt

The (rss) in the prompt is a reminder of which virtualenv environment is currently active. In my .bash_profile I have several aliases that call deactivate before activating another virtualenv, so I can quickly switch from environment to environment.

Add the following packages to the requirements file:

django
django-extensions
requests
lxml
ipython

Now I’ll install the software and set up our version control system. I’ve been using git lately, but I may spend some time with Mercurial soon to get a better understanding of the pros and cons of each.

(rss)[dwarren@thebigwave rss]$ pip install -r requirements.txt
(rss)[dwarren@thebigwave rss]$ git init .
(rss)(master) [dwarren@thebigwave rss]$ vi .gitignore

Into the .gitignore file I’ll add the list of files and directories that should not be added to source control:

bin/
include/
lib/
db/
migrations/
share/
*.pyc

Starting out with Django

Next, now that there’s a base of the project, I’ll check in what’s there and start a new project called ‘apps’. This is actually a point that I wish could be different. From above, I have a directory structure that looks like:

/home/dougwarren/rss
                /bin
                /lib
                /src
                /share

Now, django-admin won’t start a project in an existing directory, so I have to have another subdirectory off of rss for the project. I’d rather install the project in the rss directory and have the apps off of it. If anyone knows an easy way to accomplish this, let me know!

(rss)(master) [dwarren@thebigwave rss]$ git add .gitignore requirements.txt
(rss)(master) [dwarren@thebigwave rss]$ git commit -m 'initial commit'
(rss)(master) [dwarren@thebigwave rss]$ django-admin.py startproject apps
(rss)(master) [dwarren@thebigwave rss]$ cd apps
(rss)(master) [dwarren@thebigwave apps]$ django-admin.py startapp amazonfeed
(rss)(master) [dwarren@thebigwave apps]$ chmod a+x manage.py
(rss)(master) [dwarren@thebigwave apps]$ mkdir db
(rss)(master) [dwarren@thebigwave apps]$ git add * amazonfeed/*
(rss)(master) [dwarren@thebigwave apps]$ git commit -m 'initial django baseline'
(rss)(master) [dwarren@thebigwave apps]$ vi settings.py

Exploring Django projects and apps

The previous commands have made a new directory, ‘apps’, off of rss. Inside it are all of the Django files for the site that I’m creating. Of particular note are manage.py and settings.py. manage.py is a Python script used to interact with the Django internals. One of the first steps I take is making it directly executable, because it’s a lot easier to type ./manage.py than python manage.py all the time. settings.py holds the Django-specific settings for this installation. In particular, I’m going to make the following changes:

# Django settings for apps project.
from os.path import abspath, dirname, basename, join

DEBUG = False
TEMPLATE_DEBUG = DEBUG

ROOT_PATH = abspath(dirname(__file__))
PROJECT_NAME = basename(ROOT_PATH)

ADMINS = (
    ('Doug Warren', 'rss@dougwarren.org'),
)

MANAGERS = ADMINS

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3', # Add 'postgresql_psycopg2', 'postgresql', 'mysql', 'sqlite3' or 'oracle'.
        'NAME': join(ROOT_PATH, 'db', 'rss.db'),# Or path to database file if using sqlite3.
        'USER': '',                      # Not used with sqlite3.
        'PASSWORD': '',                  # Not used with sqlite3.
        'HOST': '',                      # Set to empty string for localhost. Not used with sqlite3.
        'PORT': '',                      # Set to empty string for default. Not used with sqlite3.
    }
}
...
INSTALLED_APPS = (
    'django.contrib.sites',
    'amazonfeed',
)

Two things to note here. First, I always try to make my projects as relocatable as possible, so I never hardcode a path if I can avoid it at all. Deriving ROOT_PATH from __file__ and building paths with join() lets me make a development copy of the same project on the same machine just by checking it out of git. For one of my other projects I run three different versions on the same VPS; the only difference is the Apache config. (See below for an example.)
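To make the relocatable idea concrete, here is a small sketch using only the standard library. The helper names database_path and project_name, and the two checkout locations, are made up for this illustration; the point is simply that the same code yields a different database file for each checkout.

```python
from os.path import abspath, basename, dirname, join


def database_path(settings_file):
    """Return the SQLite path for the checkout that contains settings_file.

    Mirrors the ROOT_PATH / join() approach from settings.py; the function
    name is hypothetical and only exists for this illustration.
    """
    root_path = abspath(dirname(settings_file))
    return join(root_path, 'db', 'rss.db')


def project_name(settings_file):
    """Return the name of the directory that holds settings_file."""
    return basename(abspath(dirname(settings_file)))


if __name__ == '__main__':
    # Two hypothetical checkouts of the same repository on one machine.
    for checkout in ('/srv/prod/apps/settings.py', '/srv/dev/apps/settings.py'):
        print(project_name(checkout), database_path(checkout))
```

Each checkout ends up with its own db/rss.db next to its own settings.py, without either path being written down anywhere.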

Now, for the second point, I’m going to contradict the first. The Django sites app ships with a default site called ‘example.com’, and the Django syndication code uses the sites app to get the atom link. So the site will need to be set to the proper domain. I’ll handle that after the database has been created.

Adding Model Data

The previous commands have created the app amazonfeed. I’ve added it to the INSTALLED_APPS list so Django knows it exists, and I’ve defined a database that Django can use. Now to add the code that describes the tables in the database.

(rss)(master) [dwarren@thebigwave apps]$ cd amazonfeed
(rss)(master) [dwarren@thebigwave amazonfeed]$ vi models.py
from django.db import models
from django.db.models import ForeignKey, DateField, CharField, URLField, TextField

class Vendor(models.Model):
    name = CharField(max_length=128)

    def __unicode__(self):
        return self.name

class App(models.Model):
    name = CharField(max_length=128)
    url = URLField(verify_exists=False)
    image_location = URLField(verify_exists=False)
    vendor = ForeignKey(Vendor)
    description = TextField()
    date = DateField(auto_now=True)

    def __unicode__(self):
        return self.name

I’ve kept things fairly simple here but still somewhat normalized. The main entry for the free app is the App, and I keep track of its name, the URL where it can be found, the image shown for the app, who makes it, what Amazon has to say about it, and when it was last seen. The vendor data lives in another table, so if one vendor is blessed enough to have multiple free apps of the day I won’t duplicate that data. I probably should have taken the App ID out of the URL field, but I’m not certain at this point how portable that will be in the future. I also originally thought of writing my own markup (hence storing the image_location), but at this point I just display the description taken from the app’s own page. The date field is the date that the App was last seen by the scraper: because it uses auto_now, the date is updated any time the model is saved. Finally, for both the vendor and the app itself, the __unicode__ special method, which is used to display a representation of the object, simply returns the name.
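The relationships described above can be sketched outside of Django with the standard library’s sqlite3 module. This is not the schema Django generates for these models; the table layout and the helper names get_or_create_vendor and record_sighting are assumptions made for illustration. It shows one vendor row being shared by several apps, and the last-seen date being refreshed when an app turns up again.

```python
import sqlite3
from datetime import date


def make_db():
    """Create an in-memory database with a vendor table and an app table."""
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE vendor (id INTEGER PRIMARY KEY, name TEXT UNIQUE)')
    conn.execute(
        'CREATE TABLE app (id INTEGER PRIMARY KEY, name TEXT UNIQUE, '
        'vendor_id INTEGER REFERENCES vendor(id), last_seen TEXT)'
    )
    return conn


def get_or_create_vendor(conn, name):
    """Return the id of the vendor called name, inserting it if needed."""
    row = conn.execute('SELECT id FROM vendor WHERE name = ?', (name,)).fetchone()
    if row is not None:
        return row[0]
    return conn.execute('INSERT INTO vendor (name) VALUES (?)', (name,)).lastrowid


def record_sighting(conn, app_name, vendor_name, seen_on):
    """Insert the app, or refresh last_seen if it is already stored."""
    vendor_id = get_or_create_vendor(conn, vendor_name)
    updated = conn.execute(
        'UPDATE app SET last_seen = ? WHERE name = ?',
        (seen_on.isoformat(), app_name),
    ).rowcount
    if updated == 0:
        conn.execute(
            'INSERT INTO app (name, vendor_id, last_seen) VALUES (?, ?, ?)',
            (app_name, vendor_id, seen_on.isoformat()),
        )


if __name__ == '__main__':
    conn = make_db()
    # Made-up app and vendor names, purely for the demonstration.
    record_sighting(conn, 'Game A', 'Some Studio', date(2012, 1, 1))
    record_sighting(conn, 'Game B', 'Some Studio', date(2012, 1, 2))
    record_sighting(conn, 'Game A', 'Some Studio', date(2012, 1, 3))
    print(conn.execute('SELECT name, last_seen FROM app ORDER BY last_seen DESC').fetchall())
```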

Now that the database models have been described, I can create the actual database. This is done through the manage.py script discussed previously. ‘syncdb’ will scan through all of the models in the INSTALLED_APPS and create tables for them. If tables already exist, it will not alter them, even if the models have changed. There is a widely used third-party app called South that will handle migrations for you, however. (This is a slight failing of Django in my opinion: we’re in the second decade of the 21st century and database migrations still aren’t considered part of the core project.)
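The “will not alter them” behaviour is worth seeing once. The sketch below imitates it with plain sqlite3 and CREATE TABLE IF NOT EXISTS. It is an analogy rather than syncdb’s actual implementation, and the sync helper and table name are invented for the example.

```python
import sqlite3


def sync(conn, table, columns):
    """Create table with the given column definitions only if it is missing."""
    conn.execute('CREATE TABLE IF NOT EXISTS {0} ({1})'.format(table, ', '.join(columns)))


def column_names(conn, table):
    """Return the column names currently present in table."""
    return [row[1] for row in conn.execute('PRAGMA table_info({0})'.format(table))]


conn = sqlite3.connect(':memory:')
sync(conn, 'app', ['id INTEGER PRIMARY KEY', 'name TEXT'])
# The "model" gains a field, but the table already exists, so nothing changes.
sync(conn, 'app', ['id INTEGER PRIMARY KEY', 'name TEXT', 'description TEXT'])
print(column_names(conn, 'app'))
```

The second call is silently a no-op, which is exactly the situation where a migration tool (or deleting and recreating the database) becomes necessary.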

(rss)(master) [dwarren@thebigwave amazonfeed]$ cd ..
(rss)(master) [dwarren@thebigwave apps]$ ./manage.py syncdb

I mentioned previously that the sites table will contain the example.com entry. I’ll need to change that to the site where the app is running, or some of the fields in the feed won’t come out right. In developing other Django apps I’ve frequently had to delete and recreate the database, so I prize repeatability. For this next task I could have specified a fixture to add a second site and updated the SITE_ID entry in settings.py to 2. Instead, I’ll use a small script that changes the existing site with ID 1:

(rss)(master) [dwarren@thebigwave apps]$ vi sites.py
from django.contrib.sites.models import Site
my_site = Site.objects.get(pk=1)
my_site.domain = 'rss.dougwarren.org'
my_site.name = "Doug's RSS Feeds"
my_site.save()

To execute it I turn back to manage.py, whose shell subcommand lets me run Python code with all of the paths correctly set. (See below for doing so outside of manage.py.)

(rss)(master) [dwarren@thebigwave apps]$ ./manage.py shell < sites.py
(rss)(master) [dwarren@thebigwave apps]$ cd amazonfeed
(rss)(master) [dwarren@thebigwave amazonfeed]$ vi tasks.py

Scrape the Page

As mentioned in the introduction, I originally intended to use celery as a crontab, hence the name of the scraper being tasks. I may still do so in a follow-up. For now, however, it will just contain a single function that is called when the script is run:

from lxml.html import fromstring, tostring
import requests
import re
from amazonfeed.models import Vendor, App

free_app_location = 'http://www.amazon.com/mobile-apps/b/ref=topnav_storetab_mas?node=2350149011'

def getfreeappdata():
    """ Get the data on the current free app and insert it into the database """
    r = requests.get(free_app_location)

    amazon_url = 'http://www.amazon.com{0}'
    html = fromstring(r.content)

    app_html = html.cssselect('span.fad-widget-footer-title a')[0]
    app_name = app_html.text
    app_url = amazon_url.format(app_html.get('href'))

    vendor_html = html.cssselect('span.fad-widget-footer-vendor')[0]
    # Strip only a leading "by " so vendor names that contain "by " stay intact
    vendor_name = re.sub(r'^\s*by\s+', '', vendor_html.text)

    image_html = html.cssselect('div.fad-widget-large-artwork img')[0]
    image_location = image_html.get('src')

    description_request = requests.get(app_url)

    description_html = fromstring(description_request.content)
    description = description_html.cssselect('div.aplus')[0]

    # Create Django objects
    vendor = Vendor.objects.get_or_create(name=vendor_name)[0]

    # Check to see if this app already exists
    app_query = App.objects.filter(name=app_name)

    # Update the time on the current app
    if app_query.count() != 0:
        app = app_query[0]
    else:
        # Or create a new one
        app = App(name=app_name,
                url = app_url,
                image_location=image_location,
                vendor=vendor,
                description=tostring(description),
                )

    app.save()

if __name__ == '__main__':
    getfreeappdata()

The code is fairly self-explanatory. The first third gets the page from Amazon, the middle third isolates the variables that I care about, and the final portion creates the database objects. A few things to note, though. First, there’s absolutely no error detection or recovery going on. Scrapers are nasty, dirty things. If the site changes too much, the scraper will fail, and when it does fail I want to be notified by e-mail right away. I don’t want to try to recover from the unexpected here: the unexpected is that a third party changed the page out from under me, and that’s not going to be recoverable. Second, I’m looking ahead a bit to the time when Amazon lists the same app a second time. When that happens, the existing record is saved again, so its date field gets updated and it sorts as the newest entry again.
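One small refinement that keeps the fail-loudly policy but makes the resulting error easier to read is to name the selector that stopped matching. This is a sketch rather than part of the scraper above: first_or_fail and ScrapeError are hypothetical names, and the helper accepts any list of matches, such as the result of a cssselect() call.

```python
class ScrapeError(Exception):
    """Raised when the page no longer has the structure the scraper expects."""


def first_or_fail(matches, selector):
    """Return the first match, or raise an error that names the selector."""
    if not matches:
        raise ScrapeError(
            'nothing matched {0!r}; the page layout has probably changed'.format(selector)
        )
    return matches[0]


if __name__ == '__main__':
    print(first_or_fail(['first', 'second'], 'span.example'))
    try:
        first_or_fail([], 'span.example')
    except ScrapeError as error:
        print(error)
```

The exception is still left uncaught in real use, so the job dies just as before; the only difference from a bare [0] is that the traceback says which selector broke instead of reporting an IndexError.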

The next step is getting this task to be run at a scheduled time, so I’ll add it to the crontab.

(rss)(master) [dwarren@thebigwave amazonfeed]$ crontab -e
VIRTUAL_ENV=/home/dougwarren/rss
PATH=/home/dougwarren/rss/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/home/dwarren/bin
PYTHONPATH=/home/dougwarren/rss:/home/dougwarren/rss/apps
DJANGO_SETTINGS_MODULE=apps.settings

0 1 * * * python /home/dougwarren/rss/apps/amazonfeed/tasks.py

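The only thing to note here is that the PYTHONPATH specifies both the apps directory and the root of the virtualenv. It seemed odd to me at first that both are required (I would have thought the apps directory alone would be sufficient), but they are: DJANGO_SETTINGS_MODULE is apps.settings, which can only be imported from the directory that contains apps, while tasks.py imports amazonfeed.models, which can only be imported from inside apps. I scheduled the cron job to fire at 1 AM PST, as the free apps seem to rotate at midnight PST.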
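The reason both PYTHONPATH entries matter can be checked with a throwaway directory layout: the apps package (needed for apps.settings) is only visible from the directory above it, while a bare amazonfeed import is only visible from inside apps. The sketch below is written for Python 3’s importlib rather than the Python 2.7 used in this project, and the helper names build_layout and can_find are made up for the demonstration.

```python
import os
import shutil
import tempfile
from importlib.machinery import PathFinder


def build_layout(root):
    """Create root/apps/ (with settings.py) and root/apps/amazonfeed/ as packages."""
    app_dir = os.path.join(root, 'apps', 'amazonfeed')
    os.makedirs(app_dir)
    for path in (
        os.path.join(root, 'apps', '__init__.py'),
        os.path.join(root, 'apps', 'settings.py'),
        os.path.join(app_dir, '__init__.py'),
    ):
        open(path, 'w').close()


def can_find(top_level_name, search_dir):
    """Report whether a top-level module or package is importable from search_dir."""
    return PathFinder.find_spec(top_level_name, [search_dir]) is not None


root = tempfile.mkdtemp()  # stands in for the virtualenv root
try:
    build_layout(root)
    apps_dir = os.path.join(root, 'apps')
    # The 'apps' package (needed for apps.settings) is looked up from each directory.
    apps_from_root = can_find('apps', root)
    apps_from_apps_dir = can_find('apps', apps_dir)
    # So is a bare 'amazonfeed' import.
    feed_from_root = can_find('amazonfeed', root)
    feed_from_apps_dir = can_find('amazonfeed', apps_dir)
finally:
    shutil.rmtree(root)

print(apps_from_root, apps_from_apps_dir, feed_from_root, feed_from_apps_dir)
```

Neither directory alone satisfies both imports, which is why the crontab lists the two of them.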

Django Feed

Now that the model has created the database, and the script has populated the database, it’s time to read the database and output an RSS feed:

(rss)(master) [dwarren@thebigwave amazonfeed]$ vi feeds.py
from django.contrib.syndication.views import Feed
from amazonfeed.models import Vendor, App
from amazonfeed.tasks import free_app_location

class AmazonFeed(Feed):
    title = 'Latest Free App of the Day'
    description = 'Latest Amazon Free App of the Day'
    link = free_app_location

    def items(self):
        return App.objects.order_by('-date')[:5]

    def link(self, obj):
        return free_app_location

    def item_link(self, item):
        return item.url

    def item_title(self, item):
        return "{0} by {1}".format(item, item.vendor)

    def item_description(self, item):
        return item.description

This will return the 5 latest Apps sorted by date descending, along with where they can be found and the description of each App. You’ll note that in the title the __unicode__() representation is being used to print both the App and the Vendor. The next step is to map it into the URL scheme:
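To show the rough shape of what a feed like this contains, here is a standard-library sketch. It is not Django’s syndication code: the build_rss function and the dictionary keys are assumptions for illustration, it needs Python 3 for encoding='unicode', and a real feed carries more elements than this.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, description, apps, limit=5):
    """Build a minimal RSS 2.0 document from the most recently seen apps.

    apps is a list of dicts with 'name', 'vendor', 'url', 'description'
    and an ISO-formatted 'date' (hypothetical keys for this sketch).
    """
    rss = ET.Element('rss', version='2.0')
    channel = ET.SubElement(rss, 'channel')
    ET.SubElement(channel, 'title').text = title
    ET.SubElement(channel, 'link').text = link
    ET.SubElement(channel, 'description').text = description

    # Newest first, in the spirit of order_by('-date')[:5].
    latest = sorted(apps, key=lambda app: app['date'], reverse=True)[:limit]
    for app in latest:
        item = ET.SubElement(channel, 'item')
        ET.SubElement(item, 'title').text = '{0} by {1}'.format(app['name'], app['vendor'])
        ET.SubElement(item, 'link').text = app['url']
        ET.SubElement(item, 'description').text = app['description']
    return ET.tostring(rss, encoding='unicode')


if __name__ == '__main__':
    # Placeholder data; example.com is not a real app location.
    sample = [{'name': 'Example App', 'vendor': 'Example Vendor',
               'url': 'http://example.com/app', 'description': 'A description.',
               'date': '2012-01-01'}]
    print(build_rss('Latest Free App of the Day', 'http://example.com/',
                    'Latest Amazon Free App of the Day', sample))
```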

(rss)(master) [dwarren@thebigwave amazonfeed]$ cd ..
(rss)(master) [dwarren@thebigwave apps]$ vi urls.py
from django.conf.urls.defaults import patterns, include, url
from apps.amazonfeed.feeds import AmazonFeed

urlpatterns = patterns('',
    # Anchor both patterns so each one matches exactly one path
    (r'^$', AmazonFeed()),
    (r'^AmazonFreeAppFeed/$', AmazonFeed()),
)

I’ve published /AmazonFreeAppFeed/ as the canonical URL for the project, but for now I’m also allowing / to access it as well. Mostly because I wanted to do this entire project without having to define any templates or write any HTML.
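It helps to remember how URL patterns are matched. As far as I understand Django’s resolver, each pattern is tried in order as a regular-expression search against the request path (with the leading slash removed), and the first match wins. The sketch below imitates that with the re module to compare an empty pattern with anchored ones. The first_match helper is hypothetical and this is not Django’s own code.

```python
import re


def first_match(patterns, path):
    """Return the first pattern that matches path, or None.

    Imitates first-match-wins URL resolution using re.search.
    """
    for pattern in patterns:
        if re.search(pattern, path):
            return pattern
    return None


unanchored = [r'', r'AmazonFreeAppFeed/$']
anchored = [r'^$', r'^AmazonFreeAppFeed/$']

for path in ('', 'AmazonFreeAppFeed/', 'anything/else/'):
    print(repr(path), repr(first_match(unanchored, path)), repr(first_match(anchored, path)))
```

An empty pattern matches every path, so anything listed after it is never consulted; with anchors, only the two intended paths resolve and everything else falls through.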

At this point everything is done on the Django end. The only thing left is to get the web server to serve the pages. I use Apache, and I have a small WSGI template that I replicate over and over:

(rss)(master) [dwarren@thebigwave rss]$ sudo vi /etc/httpd/conf/httpd.conf
<VirtualHost *:80>
        DocumentRoot "/home/dougwarren/rss/apps"
        ServerName rss.dougwarren.org
        Alias /static/ /home/dougwarren/rss/apps/static-final/
        WSGIScriptAlias / /home/dougwarren/rss/apps/rss.wsgi
        WSGIProcessGroup wynter
</VirtualHost>

I set a wildcard DNS entry on the dougwarren.org domain pointing to the address of my VPS. As such, whenever I wish, I can add new hostnames to Apache and start serving content from them. The last thing to do is to set up the wsgi file that was specified above:

(rss)(master) [dwarren@thebigwave apps]$ vi rss.wsgi
import sys
import site
import os

cur_path = os.path.dirname(__file__)
sys.path.append(cur_path)
base_path = os.path.abspath(os.path.join(cur_path,".."))

sys.path.append(base_path)
prev_sys_path = list(sys.path)

# add the site-packages of our virtualenv as a site dir
site.addsitedir(os.path.join(base_path,'lib','python2.7','site-packages'))
site.addsitedir(os.path.join(base_path,'src'))

# reorder sys.path so new directories from the addsitedir show up first
new_sys_path = [p for p in sys.path if p not in prev_sys_path]
for item in new_sys_path:
    sys.path.remove(item)
sys.path[:0] = new_sys_path

# import from down here to pull in possible virtualenv django install
from django.core.handlers.wsgi import WSGIHandler
os.environ['DJANGO_SETTINGS_MODULE'] = 'apps.settings'
application = WSGIHandler()

This wsgi file is similar in spirit to the changes made to settings.py and the crontab. It’s based on a snippet I found online somewhere, but I lost the attribution at some point. If you know where it’s from, please let me know so I can credit it. Again, no actual paths are hardcoded; everything is relative to where the wsgi file lives.
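The sys.path shuffle in the middle of that file is easier to follow as a pure function. This sketch assumes the path lists contain no duplicate entries, and the function name prioritise_new_entries and the example directories are made up for illustration.

```python
def prioritise_new_entries(before, after):
    """Return after with the entries that are not in before moved to the front.

    before is the path list captured prior to site.addsitedir(); after is
    the list once the virtualenv directories have been appended.
    """
    new_entries = [p for p in after if p not in before]
    old_entries = [p for p in after if p in before]
    return new_entries + old_entries


if __name__ == '__main__':
    # Hypothetical directories standing in for the real sys.path contents.
    before = ['/usr/lib/python2.7', '/srv/project/apps']
    after = before + ['/srv/project/lib/python2.7/site-packages']
    print(prioritise_new_entries(before, after))
```

Putting the virtualenv’s site-packages first means its copy of Django is found before any system-wide installation.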

The only thing left to do is restart Apache, and commit the changes!

(rss)(master) [dwarren@thebigwave apps]$ sudo /etc/rc.d/init.d/httpd restart
(rss)(master) [dwarren@thebigwave apps]$ git add settings.py urls.py rss.wsgi sites.py amazonfeed/models.py amazonfeed/feeds.py amazonfeed/tasks.py
(rss)(master) [dwarren@thebigwave apps]$ git commit -m 'Final commit'

Well, that’s it! The Amazon Free Apps of the day are now parsed and I hopefully won’t miss any in the future, and if you point your RSS reader at http://rss.dougwarren.org/AmazonFreeAppFeed/ maybe you won’t either.

  1.  Python HTML Parser Performance
  2.  Django/Celery Quickstart (or, how I learned to stop using cron and love celery)

Categories: Django, python
  1. Joe
    June 18, 2012 6:05 am | #1

    Hi. Thanks so much for making this. I use it every day. Recently though I dont think it has been working correctly. It is returning multiple different apps per day.

    • June 19, 2012 7:04 am | #2

      Looks like there was a change to the data, I made some changes that should have fixed it. It’ll take a day or so to be sure though.
