Twitter 101: Store Tweets with MongoDB

Twitter has been changing their API and features so quickly that even planning out a structure to store tweets can be quite difficult.

Luckily, MongoDB’s schema-less documents make this much less of a problem than it would be with a relational database.

To start things off, let’s get a few prerequisites checked off our list …

For this intro you will need:

  • PHP installed with the MongoDB driver (1.0.9 or greater recommended).
  • The cURL extension for PHP.
  • A basic knowledge of querying and inserting with PHP and MongoDB.

If you have not done so, see our earlier post on the subject: MongoDB+PHP: Install and Connect

Getting Started

Twitter’s API is freely available. While some operations require OAuth or xAuth, fetching a basic public “user_timeline” does not require authentication (although it is subject to rate limiting).

Much more information can be found here, but for our purposes we will be accessing a basic public timeline for a user.

To do so, grab the contents of the URL below, replacing “wsj” (for @wsj) with the user of your choice.

http://api.twitter.com/1/statuses/user_timeline.json?screen_name=wsj

This returns a JSON array containing the requested user’s last 20 tweets (20 is the default). Each tweet also embeds the user’s information, which fits nicely into MongoDB’s document structure.

Example JSON Response
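
As a rough illustration, a heavily trimmed response might decode as shown below. The field names (id, text, created_at, user.screen_name) follow Twitter’s API, but every value is invented for this sketch, and a real response carries many more fields per tweet.

```php
<?php

// Hypothetical, heavily trimmed sample of a user_timeline response.
// Field names follow the Twitter API; all values are made up.
$json = '[
  {
    "id": 21947795900,
    "text": "First sample tweet",
    "created_at": "Wed Aug 18 14:05:00 +0000 2010",
    "user": { "id": 12345, "screen_name": "wsj" }
  },
  {
    "id": 21947795800,
    "text": "Second sample tweet",
    "created_at": "Wed Aug 18 13:55:00 +0000 2010",
    "user": { "id": 12345, "screen_name": "wsj" }
  }
]';

// true = decode JSON objects as associative arrays
$tweets = json_decode($json, true);

foreach ($tweets as $tweet) {
    // The user's details are nested inside every tweet
    echo $tweet["user"]["screen_name"] . ": " . $tweet["text"] . "\n";
}
```

Because the user details travel with each tweet, every array element can be stored as one self-contained MongoDB document.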

Importing Twitter Into MongoDB

Since we get back JSON from Twitter, inserting into MongoDB is pretty straightforward …

  • Connect to MongoDB.
  • Convert the JSON into a PHP array (the PHP Mongo functions accept PHP arrays.)
  • Loop through each tweet.
  • Insert into MongoDB.

PHP Code Example

<?php

// Connect to Mongo and set DB and Collection
$mongo = new Mongo();
$db = $mongo->twitter;
$collection = $db->tweets;

//Set Twitter API options
$screen_name = "learnmongo";

// Create call to twitter API
$twitter = "http://api.twitter.com/1/statuses/user_timeline.json?";
$twitter .= "screen_name=" . $screen_name;

// Search for tweets
$curl = curl_init($twitter);

// Connect and retrieve tweets via curl
curl_setopt($curl,CURLOPT_RETURNTRANSFER,1);
$usertimeline = curl_exec($curl);
curl_close($curl);

// Convert JSON to a PHP array (true = associative arrays rather than objects)
$usertimeline = json_decode($usertimeline, true);

// Loop array and create separate documents for each tweet
foreach ($usertimeline as $id => $item) {
   $collection->insert($item);
}

?>

You will now have a number of documents in your collection, each one representing a separate tweet.
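
The example above assumes the request succeeds. As a hedged sketch, the decoding step could be wrapped in a small helper that rejects anything other than a list of tweets before inserting. The function name decode_timeline is hypothetical, and treating a response with an “error” key as a failure is an assumption about how Twitter reports problems such as rate limiting.

```php
<?php

// Hypothetical helper: turn the raw curl_exec() result into a list of tweets,
// or an empty array if the response is unusable.
function decode_timeline($raw)
{
    // curl_exec() returns false on failure
    if ($raw === false || $raw === "") {
        return array();
    }

    $decoded = json_decode($raw, true);

    // json_decode() returns null for invalid JSON
    if (!is_array($decoded)) {
        return array();
    }

    // Assumption: an error response is a JSON object with an "error" key
    if (isset($decoded["error"])) {
        return array();
    }

    return $decoded;
}

// Usage with canned strings instead of a live request
$ok   = decode_timeline('[{"id":1,"text":"hello"}]');
$bad  = decode_timeline('not json');
$fail = decode_timeline('{"error":"Rate limit exceeded"}');

echo count($ok) . " " . count($bad) . " " . count($fail) . "\n";
```

Returning an empty array means the insert loop simply does nothing when the response is not usable.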

Only Inserting New Tweets

That’s all well and good, but what if you only want to insert tweets that are newer than the ones you already have in your MongoDB collection?

First you will need to query your collection for the newest tweet you have stored. Luckily MongoDB makes this very easy for us … just sort on the tweet “id” field in descending order and take the first document.

A plain findOne() is not enough here: it returns the first document in natural order, which is not necessarily the newest tweet.

Find Most Recent Tweet

<?php

// Connect to Mongo and set DB and Collection
$mongo = new Mongo();
$db = $mongo->twitter;
$collection = $db->tweets;

// Get the highest tweet ID stored so far (null if the collection is empty)
$cursor = $collection->find()->sort(array("id" => -1))->limit(1);
$obj = $cursor->getNext();
$mongo_since_id = $obj ? $obj["id"] : null;

//Set Twitter API options
$screen_name = "learnmongo";
$since_id = $mongo_since_id;

?>

Now tie both pieces of code together, and you have a PHP page to query and capture all the latest tweets for a user in MongoDB.
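
One way to combine the two pieces is to build the request URL from a parameter array, adding since_id only when a previous tweet exists. This is a minimal sketch; build_timeline_url is a hypothetical helper name, not part of any library.

```php
<?php

// Hypothetical helper: build the user_timeline URL, with since_id optional
function build_timeline_url($screen_name, $since_id = null)
{
    $params = array("screen_name" => $screen_name);

    // Only ask for newer tweets when we already have one stored
    if ($since_id !== null) {
        $params["since_id"] = $since_id;
    }

    // http_build_query() URL-encodes each value; "&" is passed explicitly
    // so the result does not depend on the arg_separator.output ini setting
    return "http://api.twitter.com/1/statuses/user_timeline.json?"
        . http_build_query($params, "", "&");
}

echo build_timeline_url("learnmongo") . "\n";
echo build_timeline_url("learnmongo", "21947795900") . "\n";
```

Passing the ID as a string keeps it intact even where PHP’s native integer is too small to hold it.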

Full Example Source

<?php

 // Twitter uses 64-bit IDs, so you may need to tell the PHP MongoDB driver to use native long ints
 //ini_set('mongo.native_long', 1);

 // Connect to Mongo and set DB and Collection
 $mongo = new Mongo();
 $db = $mongo->twitter;
 $collection = $db->tweets;

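 // Get the highest tweet ID stored so far (null if the collection is empty).
 // findOne() alone returns the first document in natural order, not the newest tweet.
 $cursor = $collection->find()->sort(array("id" => -1))->limit(1);
 $obj = $cursor->getNext();
 $mongo_since_id = $obj ? $obj["id"] : null;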

 //Set Twitter API options
 $screen_name = "learnmongo";
 $since_id = $mongo_since_id;

 // Create call to twitter API
 $twitter_apicall = "http://api.twitter.com/1/statuses/user_timeline.json?screen_name=" . $screen_name;
 if ($mongo_since_id != null) {
     $twitter_apicall .= "&since_id=" . $since_id;
 }

 // Search for newer tweets
 $curl = curl_init($twitter_apicall);

 // Connect and retrieve tweets via curl
 curl_setopt($curl,CURLOPT_RETURNTRANSFER,1);
 $usertimeline = curl_exec($curl);
 curl_close($curl);

 // Workaround to accommodate long numbers when using json_decode
 // Use if you cannot get the latest MongoDB driver
 //$usertimeline = preg_replace( '/id":(\d+)/',
 //                       'id":"\1"', $usertimeline );

 // Convert JSON to a PHP array (true = associative arrays rather than objects)
 $usertimeline = json_decode($usertimeline, true);

 echo "<pre>";

 // Loop array and create separate documents for each tweet
 $tweetcount = 0;
 foreach ($usertimeline as $id => $item) {
     $collection->insert($item);
     $tweetcount++;
 }

 echo "<br />";
 echo "<b>url</b>: " . $twitter_apicall;
 echo "<br />";
 echo "<b>tweets</b>: " .$tweetcount;
 echo "</pre>";

?>

Displaying Your Tweets

Lastly, to pull back and display your newly stored tweets from MongoDB, use this bit of code …

<?php
// Connect to Mongo and set DB and Collection
$mongo = new Mongo();
$db = $mongo->twitter;
$collection = $db->tweets;

// Return a cursor of tweets from MongoDB
$cursor = $collection->find();

// Convert cursor to an array
$array = iterator_to_array($cursor);

// Loop and print out tweets ...
foreach ($array as $value) {
   echo "<p>" . $value["text"];
   echo " @ <b><i>" . $value["created_at"] . "</i></b></p>";
}
?>

You should now have a nice print out of the tweets in your collection.
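
Tweet text comes from users, so it is worth escaping before echoing it into a page. Here is a small sketch, assuming each document has the text and created_at fields used above; render_tweet is a hypothetical helper name.

```php
<?php

// Hypothetical helper: build the HTML for one tweet with escaped values
function render_tweet(array $tweet)
{
    // Escape <, >, & and quotes so tweet content cannot inject markup
    $text = htmlspecialchars($tweet["text"], ENT_QUOTES, "UTF-8");
    $date = htmlspecialchars($tweet["created_at"], ENT_QUOTES, "UTF-8");

    return "<p>" . $text . " @ <b><i>" . $date . "</i></b></p>";
}

// Made-up tweet for illustration
echo render_tweet(array(
    "text" => "MongoDB <3 PHP & JSON",
    "created_at" => "Wed Aug 18 14:05:00 +0000 2010",
)) . "\n";
```

Inside the display loop, the two echo statements could then be replaced by a single echo render_tweet($value).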

Note on Large Integers

Sites like Twitter and Facebook have so many users that simple 32-bit integers (int) are not enough; the numbers are so long that you need a 64-bit integer. Up until version 1.0.9, MongoDB’s PHP driver did not handle these large numbers properly, and if you tried to insert a 64-bit integer you would end up with a truncated number in your document (which is of no help, of course).

You can read more about this on our post here.

Derick Rethans also discussed this problem at length here, and submitted a fix that is now available in MongoDB PHP driver version 1.0.9 and up.
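
To see the problem outside of MongoDB, the sketch below decodes an ID that is too large for PHP’s native integer and compares two ways of keeping it intact: the regex workaround from the full example, and json_decode’s JSON_BIGINT_AS_STRING option (PHP 5.4 and later). The ID is made up and deliberately larger than a signed 64-bit integer so the effect shows on any platform.

```php
<?php

// Made-up ID, larger than a signed 64-bit integer can hold
$json = '[{"id":12345678901234567890,"text":"big id"}]';

// Plain decode: the ID overflows PHP's integer type and becomes a float
$plain = json_decode($json, true);

// Workaround 1: quote the digits before decoding (same regex as the full example)
$quoted = preg_replace('/id":(\d+)/', 'id":"\1"', $json);
$viaRegex = json_decode($quoted, true);

// Workaround 2: ask json_decode to return oversized integers as strings
$viaFlag = json_decode($json, true, 512, JSON_BIGINT_AS_STRING);

var_dump(is_float($plain[0]["id"]));
echo $viaRegex[0]["id"] . "\n";
echo $viaFlag[0]["id"] . "\n";
```

On a 64-bit build of PHP, real tweet IDs fit in a native integer, so the flag only changes the result on 32-bit builds. The regex turns every matching ID into a string regardless of platform.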

Next In This Series

Sorry to say none of this will help much when Twitter goes down, again … for a few hours … but hey … you’ll have a nice local copy of all the tweets you need in your MongoDB so you don’t need to rely on Twitter!

Next we’ll dive into creating a “mash up”: taking the tweets we gathered above, pulling in data from an external site, and “mashing” it all together in MongoDB. Stay tuned.
