Parse Twitter card/OpenGraph data out of HTML pages
A page can provide a number of HTML meta tags that tell Facebook and Twitter how to render information about a link: Twitter Cards for Twitter and OpenGraph tags for Facebook. Here's some sample HTML from the <head> of an Imgur page:
<meta name="twitter:site" content="@imgur" />
<meta name="twitter:domain" content="imgur.com" />
<meta name="twitter:app:id:googleplay" content="com.imgur.mobile" />
<meta name="twitter:title" content="Can't place the building here"/>
<meta property="og:title" content="Can't place the building here"/>
<meta property="author" content="Imgur" />
<meta property="article:author" content="Imgur" />
<meta property="article:publisher" content="https://www.facebook.com/imgur">
<meta property="og:type" content="article" />
<meta property="og:image" content="http://i.imgur.com/puEqa4C.jpg?fb" />
<meta property="og:image:width" content="600" />
<meta property="og:image:height" content="315" />
<meta name="twitter:card" content="summary_large_image"/>
<meta name="twitter:image" content="https://i.imgur.com/puEqa4C.jpg"/>
<meta property="og:description" content="Imgur: The most awesome images on the Internet."/>
<meta name="twitter:description" content="Imgur: The most awesome images on the Internet."/>
As you can see, several of these tags could be useful, including the title and a link to an actual image (which could be downloaded and embedded). Imgur provides both for Twitter and OpenGraph, but other sites only provide some of them. For example, the New York Times doesn't have an image for OpenGraph:
<meta property="og:url" content="https://www.nytimes.com/2013/05/19/opinion/sunday/douthat-loneliness-and-suicide.html?nytmobile=0">
<meta name="twitter:card" value="summary">
<meta name="twitter:site" value="@nytimes">
<meta property="twitter:url" content="https://www.nytimes.com/2013/05/19/opinion/sunday/douthat-loneliness-and-suicide.html?nytmobile=0">
<meta property="twitter:title" content="Opinion | All the Lonely People">
<meta property="og:title" content="Opinion | All the Lonely People">
<meta property="twitter:description" content="Is there a link between suicide and weakened social ties?">
<meta property="twitter:image" content="https://cdn1.nyt.com/images/2014/11/01/opinion/douthat-circular/douthat-circular-thumbLarge-v4.jpg">
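Since sites like the two above only fill in some of the tags, one option is to look up each field under both vocabularies and take whichever is present. The following is a minimal sketch of that fallback, assuming the tags have already been collected into a plain dict. The `pick_preview_fields` name, the `FIELD_SOURCES` table, and the OpenGraph-first ordering are all illustrative choices, not something this issue prescribes.

```python
from typing import Dict, Optional

# Hypothetical preference order: OpenGraph first, Twitter card second.
FIELD_SOURCES = {
    "title": ("og:title", "twitter:title"),
    "description": ("og:description", "twitter:description"),
    "image": ("og:image", "twitter:image"),
    "url": ("og:url", "twitter:url"),
}


def pick_preview_fields(tags: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Return one value per field, falling back across vocabularies."""
    preview = {}
    for field, candidates in FIELD_SOURCES.items():
        # Take the first candidate key that has a non-empty value, else None.
        preview[field] = next((tags[k] for k in candidates if tags.get(k)), None)
    return preview


# Tags taken from the New York Times sample above (no og:image present).
nyt_tags = {
    "og:url": "https://www.nytimes.com/2013/05/19/opinion/sunday/douthat-loneliness-and-suicide.html?nytmobile=0",
    "og:title": "Opinion | All the Lonely People",
    "twitter:title": "Opinion | All the Lonely People",
    "twitter:description": "Is there a link between suicide and weakened social ties?",
    "twitter:image": "https://cdn1.nyt.com/images/2014/11/01/opinion/douthat-circular/douthat-circular-thumbLarge-v4.jpg",
}
preview = pick_preview_fields(nyt_tags)
```

With the New York Times tags, the image and description come from the Twitter card while the title and URL come from OpenGraph.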
If the Content-Type is text/html or similar, and there's no site-specific parser (e.g. YouTube), parse the HTML and look for these tags.
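The check described here could look roughly like the sketch below. It assumes the Content-Type header value is available as a string and reads "or similar" as XHTML only. The `is_html` and `should_parse_cards` helpers and the `has_site_parser` flag are hypothetical names for illustration.

```python
from typing import Optional

# Assumption: "or similar" is read here as XHTML; extend the set as needed.
HTML_MEDIA_TYPES = {"text/html", "application/xhtml+xml"}


def is_html(content_type: Optional[str]) -> bool:
    """True if the Content-Type header value names an HTML media type."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in HTML_MEDIA_TYPES


def should_parse_cards(content_type: Optional[str], has_site_parser: bool) -> bool:
    """Parse card tags only for HTML pages without a site-specific parser."""
    return is_html(content_type) and not has_site_parser
```

For example, `should_parse_cards("text/html; charset=utf-8", False)` is true, while an `image/png` response or a URL already handled by a site-specific parser is skipped.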
I recommend BeautifulSoup 4 for HTML parsing. Initially, maybe just support either OpenGraph or Twitter cards (you pick), then add support for the other later.
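As a starting point, here is a rough sketch of the extraction with BeautifulSoup 4, using the built-in `html.parser` backend. It is only one way to do it. The `extract_card_tags` name and the choice to keep the first occurrence of each key are assumptions. It reads the key from either `property` or `name` and the value from either `content` or `value`, because the New York Times sample mixes those attributes.

```python
from bs4 import BeautifulSoup

# Only keys in these two vocabularies are kept.
PREFIXES = ("og:", "twitter:")


def extract_card_tags(html):
    """Collect og:* and twitter:* meta tags into a dict (first occurrence wins)."""
    soup = BeautifulSoup(html, "html.parser")
    tags = {}
    for meta in soup.find_all("meta"):
        # Sites are inconsistent: accept property= or name=, content= or value=.
        key = meta.get("property") or meta.get("name")
        value = meta.get("content") or meta.get("value")
        if not key or not value:
            continue
        key = key.strip().lower()
        if key.startswith(PREFIXES) and key not in tags:
            tags[key] = value.strip()
    return tags


sample = """<head>
<meta name="twitter:card" value="summary">
<meta property="og:title" content="Opinion | All the Lonely People">
<meta property="twitter:description" content="Is there a link between suicide and weakened social ties?">
<meta property="author" content="Imgur" />
</head>"""
tags = extract_card_tags(sample)
```

Tags outside the two vocabularies, such as `author`, are ignored. Supporting only one vocabulary at first would just mean shrinking `PREFIXES` to a single entry.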