MongoSV Pics

in MongoSV

Really enjoyed MongoSV this last week … a few posts (in progress) inspired by it!

In the meantime here are some pics, the first few are by yours truly.

(Note: I’m no photographer! Ha.)



Getting Started With MongoDB GridFS

in GridFS, PHP

One really useful built-in feature of MongoDB is its GridFS.

This filesystem within MongoDB was designed for … well, holding files, especially files over 4MB … why 4MB?

Well, BSON objects are limited to 4MB in size (BSON is the format that MongoDB uses to store its database information), so GridFS helps store files across multiple chunks.

As Kristina Chodorow of 10gen puts it:

GridFS breaks large files into manageable chunks. It saves the chunks to one collection (fs.chunks) and then metadata about the file to another collection (fs.files). When you query for the file, GridFS queries the chunks collection and returns the file one piece at a time.

Why would you want to break large files into “chunks”? A lot of it comes down to efficient memory & disk usage.

Chunks ‘O Random-Access Memory

Gee, mister. You’re even hungrier than I am.

Say you want to store a larger file (maybe a 2GB video). When you perform a query on that file, all 2GB needs to be loaded into memory … and if you have an even bigger file, 10GB, 25GB, etc., it’s quite likely you’d run out of usable RAM or not have that much RAM available at all!

So, GridFS solves this problem by streaming the data back (in chunks) to the client … this way you’d never need to use more than 4MB of RAM.

Other Reasons to Use GridFS

Some other niceties of GridFS are …

  • If you are using replication or autosharding, your GridFS files will be seamlessly sharded or replicated for you.
  • Since MongoDB datafiles are broken into 2GB chunks, MongoDB will automatically break your files into OS-manageable pieces.
  • You won’t have to worry about OS limitations like ‘weird’ filenames or a large number of files in one directory, etc.
  • MongoDB will auto-generate the MD5 hash of your file and store it in the file’s document. This is useful for comparing the stored file against its MD5 hash to see if it was uploaded correctly, or whether it already exists in your database (see the sketch just after this list).
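
For example, here’s a minimal sketch (the path and database name are just made up for illustration) of using that md5 field to check whether a local file is already stored in GridFS before uploading it again …

<?php

// Connect to Mongo and grab the GridFS object (assumes a local mongod and a "myfiles" DB)
$mongo = new Mongo();
$grid = $mongo->myfiles->getGridFS();

// MD5 of the local file we are about to store
$localMd5 = md5_file("/tmp/03-smbd-menu-screen.mp3");

// Look for an existing GridFS file with the same hash
$existing = $grid->findOne(array("md5" => $localMd5));

if ($existing === null) {
    echo "Not stored yet, safe to upload.";
} else {
    echo "Already in GridFS as " . $existing->getFilename();
}

?>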

Command Line: mongofiles

An easy way to get started and see how GridFS works is to use the mongofiles command line utility (if you downloaded the binaries of MongoDB you should find this tool in the bin directory.)

To make things easy, mongofiles accepts RESTful looking commands, for example …

$ ./mongofiles -d myfiles put 03-smbd-menu-screen.mp3
connected to: 127.0.0.1

added file: {
   _id: ObjectId('4ce9ddcb45d74ecaa7f5a029'),
   filename: "03-smbd-menu-screen.mp3",
   chunkSize: 262144,
   uploadDate: new Date(1290395084166),
   md5: "7872291d4e67ae8b8bf7aea489ab52c1",
   length: 1419631 }

done!

This uploaded (PUT) the 03-smbd-menu-screen.mp3 file to a database called myfiles (it could be any database.)

This file now resides in the myfiles DB in the fs.files Collection. We can confirm this by passing the list command.

$ ./mongofiles -d myfiles list
connected to: 127.0.0.1
03-smbd-menu-screen.mp3 1419631
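
You can also pull a file back out with the get command … by default mongofiles writes it to your current directory under its stored filename.

$ ./mongofiles -d myfiles get 03-smbd-menu-screen.mp3
connected to: 127.0.0.1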

Hurrah! We have our files in there … you can also query them via the MongoDB Shell like so …

> use myfiles;
> db.fs.files.find({});
{
   "_id" : ObjectId("4ce9ddcb45d74ecaa7f5a029"),
   "filename" : "03-smbd-menu-screen.mp3",
   "chunkSize" : 262144,
   "uploadDate" : "Mon Nov 22 2010 03:04:44 GMT+0000 (UTC)",
   "md5" : "7872291d4e67ae8b8bf7aea489ab52c1",
   "length" : 1419631
}

Note: the size, upload date & md5 are all produced for you, which is pretty handy.
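
If you’re curious about the chunks themselves, you can peek at the fs.chunks Collection too … leaving out the binary data field keeps the output readable:

> use myfiles;
> db.fs.chunks.find({}, {data: 0});

Each chunk document carries a files_id (pointing back at the fs.files document), its sequence number n, and the binary data itself.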

Uploading a File (or Data) via MongoDB Driver

Likely a more realistic way of storing files in GridFS will be via one of the many available language drivers. Each driver handles GridFS a little differently, but the concepts are the same.

The first thing you need to sort out is whether you are going to upload actual files or create files from strings of data.

For example, an application that allows a user to upload a video directly from their computer to your application would use the file method … however, an application that takes a profile image (for example) and compresses and resizes it for use in your application would likely use the string-of-data method.

The File Method

In this example we’ll assume the file is already in your filesystem in the /tmp/ dir, but the file could be anywhere your web server/PHP is configured to access.

To work with GridFS files in PHP you use the MongoGridFS class; more information can be found in the documentation.

We will use MongoGridFS::storeFile but you could also use MongoGridFS::put (which works like the command line example.)

<?php

// Connect to Mongo and set DB and Collection
$mongo = new Mongo();
$db = $mongo->myfiles;

// GridFS
$grid = $db->getGridFS();

// The file's location in the File System
$path = "/tmp/";
$filename = "03-smbd-menu-screen.mp3";
// Note metadata field & filename field
$storedfile = $grid->storeFile(
                $path . $filename,
                array("metadata" => array("filename" => $filename),
                      "filename" => $filename)
              );

// Return newly stored file's Document ID
echo $storedfile;

?>

The String of Data Method

The string-of-data method is very similar, only we’ll pass a string instead of a file/path … so use the code above, but call storeBytes instead.

$storedfile = $grid->storeBytes("This is test file data!",
                 array("metadata" => array("filename" => $filename),
                 "filename" => $filename));

You could of course pass any string, such as the string representation of an image (or an encoded file as a string), where we’ve put “This is test file data!” …

A Little About Metadata

For PHP it doesn’t really matter, but since other drivers handle things slightly differently it’s best to write any metadata to its own metadata field as well as a separate filename field, as we have done in the example above.

You can put any file metadata that makes sense for your use in the metadata field.
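
For instance, once the metadata sits in its own field you can query against it with dot notation … a quick sketch using the $grid object and $filename from the example above:

// Find a GridFS file by a field stored under "metadata"
$file = $grid->findOne(array("metadata.filename" => $filename));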

Stream Back Files

Now that our file or files are loaded into GridFS, streaming back the file is fairly simple …

  • Connect to MongoDB
  • Do a findOne() on the file
  • Load it into memory using getBytes()
  • Set the proper headers
  • Stream the file back to the browser

So, here is how we’d stream back an image in PHP …

Stream an Image from GridFS to the Browser

Warning: this will load the file into memory. If the file is bigger than your memory, this will cause problems!

<?php
// Connect to Mongo and set DB and Collection
$mongo = new Mongo();
$db = $mongo->myfiles;     

// GridFS
$gridFS = $db->getGridFS();     

// Find image to stream
$image = $gridFS->findOne("chunk-screaming.jpg");

// Stream image to browser
header('Content-type: image/jpeg');
echo $image->getBytes();

?>

With a little adjustment you could stream back an mp3, or video, or prompt for a file download, etc.
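
If your files really are too big to pull into memory with getBytes(), one workaround (just a sketch here, assuming the default fs collection prefix) is to read the fs.chunks Collection directly, sorted by chunk number, and echo each chunk as you go …

<?php
// Connect to Mongo and set DB and Collection
$mongo = new Mongo();
$db = $mongo->myfiles;

// GridFS
$gridFS = $db->getGridFS();

// Find the file's document so we know its _id
$image = $gridFS->findOne("chunk-screaming.jpg");

// Stream the chunks back one at a time, in order
header('Content-type: image/jpeg');

$chunks = $db->selectCollection("fs.chunks")
             ->find(array("files_id" => $image->file["_id"]))
             ->sort(array("n" => 1));

foreach ($chunks as $chunk) {
    // Each chunk's data field is a MongoBinData object; ->bin holds the raw bytes
    echo $chunk["data"]->bin;
}

?>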

Other Ways to Search for GridFS Files

You could also use the Document’s ID …

$image = $gridFS->findOne(
         array("_id" => new MongoId("4ceb167810f1d50a80e1c71c"))
         );

That will likely be how your application would look up a file in a real world system.

You can use any valid MongoDB findOne() query in its place as well, or use find() to get back a GridFS cursor of files; you can find out more about that here.
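
For example, a quick way to list everything currently sitting in GridFS (same $gridFS object as above) …

// List every file stored in GridFS along with its size
foreach ($gridFS->find() as $file) {
    echo $file->getFilename() . " - " . $file->getSize() . " bytes<br />";
}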

Deleting Files

You delete files in the same way. There are actually a couple of ways to remove GridFS files, but we’ll just use one of the easiest …

Be really careful about passing the correct query to remove or you might just find yourself removing all your files! You can also use MongoGridFS::delete and pass the Document’s ID only.

<?php

// Connect to Mongo and set DB and Collection
$mongo = new Mongo();
$db = $mongo->myfiles;

// GridFS
$gridFS = $db->getGridFS();

// Remove file by its Document ID
$removeFile = $gridFS->remove(
                array("_id" => new MongoId("4ceb167810f1d50a80e1c71c"))
              );

?>
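
And here’s the MongoGridFS::delete variation mentioned above … it only needs the Document’s ID:

// Remove the file (and its chunks) by _id only
$gridFS->delete(new MongoId("4ceb167810f1d50a80e1c71c"));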

Wrap Up

Hopefully you can now get started with GridFS and see if it will work well for your application … remember, if you stream back the files (using the image example above) they will be loaded into memory and not streamed in 4MB chunks … so be careful!

Have fun.


Twitter 101: Store Tweets with MongoDB

in Drivers, PHP, Twitter

Twitter has been changing their API and features so quickly that even planning out a structure to store tweets can be quite difficult.

Luckily (compared to other, relational DBs) MongoDB makes that much less of a problem with its schema-less documents.

To start things off let’s get a few prerequisites checked off our list …

For this intro you will need:

  • PHP installed with the MongoDB driver (1.0.9 or greater recommended.)
  • The CURL extension for PHP.
  • A basic knowledge of querying and inserting with PHP and MongoDB.

If you have not done so, see our earlier post on the subject: MongoDB+PHP: Install and Connect

Getting Started

Twitter’s API is freely available, and while some operations require using OAuth or xAuth, getting a basic public “user_timeline” does not require authentication (however, it is subject to rate limiting.)

Much more information can be found here, but for our purposes we will be accessing a basic public timeline for a user.

To do so you will need to grab the contents of the URL below, replacing “wsj” (for @wsj) with the user of your choice.

http://api.twitter.com/1/statuses/user_timeline.json?screen_name=wsj

This will return a JSON object which contains the last 20 (by default) tweets by the requested user. Each tweet also contains the user’s information (which fits nicely into MongoDB’s document structure.)

Example JSON Response

Importing Twitter Into MongoDB

Since we get back JSON from Twitter inserting into MongoDB is pretty straightforward …

  • Connect to MongoDB.
  • Convert the JSON into a PHP array (the PHP Mongo functions accept PHP arrays.)
  • Loop through each tweet.
  • Insert into MongoDB.

PHP Code Example

<?php

// Connect to Mongo and set DB and Collection
$mongo = new Mongo();
$db = $mongo->twitter;
$collection = $db->tweets;

//Set Twitter API options
$screen_name = "learnmongo";

// Create call to twitter API
$twitter = "http://api.twitter.com/1/statuses/user_timeline.json?";
$twitter .= "screen_name=" . $screen_name;

// Search for tweets
$curl = curl_init($twitter);

// Connect and retrieve tweets via curl
curl_setopt($curl,CURLOPT_RETURNTRANSFER,1);
$usertimeline = curl_exec($curl);
curl_close($curl);

// Convert JSON to a PHP array
$usertimeline = json_decode($usertimeline);

// Loop array and create separate documents for each tweet
foreach ($usertimeline as $id => $item) {
   $collection->insert($item);
}

?>

You will now have a number of documents in your collection, each one representing a separate tweet.
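
A quick way to double check is to count the documents (using the same $collection object from above) …

// How many tweets do we have stored?
echo $collection->count() . " tweets stored";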

Only Inserting New Tweets

That’s all well and good, but what if you want to only insert tweets that are newer than the ones you already have in your MongoDB Collection?

First you will need to query your Collection to find the latest tweet; luckily, MongoDB makes this very easy for us … just get the latest document entered into your Collection.

This will be the “latest” tweet you inserted (assuming you used code like the example above of course!)

Find Most Recently Inserted Document

<?php

// Connect to Mongo and set DB and Collection
$mongo = new Mongo();
$db = $mongo->twitter;
$collection = $db->tweets;

// Get Last ID
$obj = $collection->findOne();
$mongo_since_id = $obj["id"];

//Set Twitter API options
$screen_name = "learnmongo";
$since_id = $mongo_since_id;

?>
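
Note: findOne() with no arguments grabs the first document in natural order, so it relies on how your documents happen to be laid out … if you’d rather be explicit about it, a sketch like this sorts on the tweet id and takes the highest one:

// Explicitly fetch the stored tweet with the highest id
$cursor = $collection->find()->sort(array("id" => -1))->limit(1);
$obj = $cursor->getNext();
$mongo_since_id = ($obj !== null) ? $obj["id"] : null;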

Now tie both pieces of code together, and you have a PHP page to query and capture all the latest tweets for a user in MongoDB.

Full Example Source

<?php

 // Twitter uses long ints, so you may need to instruct the MongoDB PHP driver to use long ints
 //ini_set('mongo.native_long', 1);

 // Connect to Mongo and set DB and Collection
 $mongo = new Mongo();
 $db = $mongo->twitter;
 $collection = $db->tweets;

 // Get Last ID
 $obj = $collection->findOne();
 $mongo_since_id = $obj["id"];

 //Set Twitter API options
 $screen_name = "learnmongo";
 $since_id = $mongo_since_id;

 // Create call to twitter API
 $twitter_apicall = "http://api.twitter.com/1/statuses/user_timeline.json?screen_name=" . $screen_name;
 if ($mongo_since_id != null) {
     $twitter_apicall .= "&since_id=" . $since_id;
 }

 // Search for newer tweets
 $curl = curl_init($twitter_apicall);

 // Connect and retrieve tweets via curl
 curl_setopt($curl,CURLOPT_RETURNTRANSFER,1);
 $usertimeline = curl_exec($curl);
 curl_close($curl);

 // Weird fix to accommodate long numbers when using json_decode
 // Use if you cannot get the latest MongoDB driver
 //$usertimeline = preg_replace( '/id":(\d+)/',
 //                       'id":"\1"', $usertimeline );

 // Convert JSON to a PHP array
 $usertimeline = json_decode($usertimeline);

 echo "<pre>";

 // Loop array and create separate documents for each tweet
 $tweetcount = 0;
 foreach ($usertimeline as $id => $item) {
     $collection->insert($item);
     $tweetcount++;
 }

 echo "<br />";
 echo "<b>url</b>: " . $twitter_apicall;
 echo "<br />";
 echo "<b>tweets</b>: " .$tweetcount;
 echo "</pre>";

?>

Displaying Your Tweets

Lastly, to pull back and display your newly stored tweets in Mongo, use this bit of code …

<?php
// Connect to Mongo and set DB and Collection
$mongo = new Mongo();
$db = $mongo->twitter;
$collection = $db->tweets;

// Return a cursor of tweets from MongoDB
$cursor = $collection->find();

// Convert cursor to an array
$array = iterator_to_array($cursor);

// Loop and print out tweets ...
foreach ($array as $value) {
   echo "<p>" . $value[text];
   echo " @ <b><i>" . $value[created_at] . "</i></b>";
}
?>

You should now have a nice print out of the tweets in your collection.

Note on Large Integers

Sites like Twitter and Facebook have so many users that they can’t use simple 32-bit integers (int); the numbers are so long you need a 64-bit integer. Up until 1.0.9, Mongo’s PHP drivers did not handle these large numbers properly, and if you tried to insert a 64-bit int you would end up with a truncated number in your document (which is of no help, of course.)

You can read more about this on our post here.

Derick Rethans also discussed this problem at length here, and submitted a fix that is now available in MongoDB PHP driver version 1.0.9 and up.

Next In This Series

Sorry to say none of this will help much when Twitter goes down, again … for a few hours … but hey … you’ll have a nice local copy of all the tweets you need in your MongoDB so you don’t need to rely on Twitter!

Next we’ll dive into creating a “mash up” using the tweets we gathered above and then pulling in data from an external site and “mashing” it together in MongoDB, stay tuned.


Q&A: MongoHQ (MongoDB Hosting)

in Q & A

As the interest in MongoDB heats up in the development community you might be asking yourself …

“I’m no DBA, I don’t want to maintain my own server … do any web hosts offer MongoDB?”

The answer generally is no; however, there are a few smaller start-ups seeking to fill that void … one of them is MongoHQ, “… the hosted database solution for easily getting your apps up and running with MongoDB.”

They also sport quite a nice web-based UI for MongoDB, and best of all you can try it all for free or for as low as $5 a month (great for getting your feet wet with MongoDB.)

Jason McCay of MongoHQ was nice enough to answer a few questions for us about their motivations in starting the company, the challenges of hosting on a large scale, creating a UI and their long term goals ….

Who is MongoHQ?

LM: Tell us a little about you guys

MongoHQ (while we have had many talented people help us along the way) consists of three guys: myself, Ben Wyrosdick and Anthony Crumley. We formed a Ruby on Rails consultancy called CommonThread about four years ago and started working on the beginnings of MongoHQ about a year ago. So far, it has been a great experience and we have learned much.

LM: What inspired you to create MongoHQ?

Originally our interest was in CouchDB, but after seeing a couple of tweets, namely one by Nic Williams (@drnic), our interest in MongoDB and the technology around the NoSQL/NoRM movement really grew. From there, we started experimenting and picked up some traction by discovering there was an interest in a hosted solution.

At first, 10gen (the creators of MongoDB) wasn’t so sure about our idea, but they have been amazing along the way…assisting us and even making minor adjustments to their platform to make hosting MongoDB instances easier.

Large Scale MongoDB Hosting

LM: What is the most interesting problem you have encountered while creating MongoHQ?

I think that the most interesting problem that we had to solve was a more general one: hosting multiple MongoDB instances in a shared environment, creating plan levels and managing resources and quotas. We continue to work through better strategies for this and we are very excited about how upcoming releases of MongoDB could assist us.

MongoDB is an amazing technology that continues to prove itself to be extremely reliable under heavy load.

LM: You use Amazon EC2’s cloud to power MongoHQ. What sort of challenges has managing so many MongoDB Databases/Servers, etc. caused? How do you manage on that scale?

For us, the challenges that have taken the most time are creating effective strategies for optimizing I/O performance on physical disks. We have had situations where a server CPU would be almost 100% idle, yet the I/O would be completely utilized.

These problems are not easily solved, and once you do have a workable solution, they take time to implement. Hundreds and thousands of gigabytes of data are not moved around quickly, especially when you cannot degrade performance on the production servers. Luckily, we are getting smarter about this.

Creating a MongoDB UI

LM: From a web development perspective, what have you learned from creating a web interface for MongoDB?

Honestly, I think we have learned that, while it is a nice visual to the actual MongoDB instance, it needs to be more than just that. We need the interface to assist people by automating some of the more manual tasks that they are forced to do in the shell, especially in terms of backups, monitoring, and logging.

The growth and support needs that have hit us over the last few months have really forced us to slow development of the MongoHQ web interface. This is unfortunate and we plan to make some much needed improvements as well as feature additions soon.

What’s to Come …

LM: What are some of your longer term goals for MongoHQ?

At this time, our longer term plans tend to focus on the creation of custom plans and dedicated plans. We want to provide our users with more control over their environments while, at the same time, providing them useful tools, control and support.

Also, we are focused on expanding MongoHQ into global availability zones as well as expanding into additional cloud infrastructures.

LM: How did you come up with your awesome rocket logo?

We wanted to do something fun and engaging, so we threw out the idea of a rocket to Von Glitschaka (@vonster), the designer of our logo and he provided us with a great, scalable brand that we are really pleased with. The stereotypical database motif as the different stages of the rocket was totally Von. He did an awesome job.

So go ahead and check out MongoHQ for yourself; you can even use their web-based UI to connect to your own MongoDB server.


MongoDB and 64-bit Integers in PHP

in PHP, Twitter

If you’ve ever needed to work with PHP’s MongoDB driver and large integers (which Twitter and Facebook use for IDs) you might have run into a problem …

Derick Rethans documents this problem at length in his post here

[A] Facebook UserID … [uses] a “64-bit int datatype”.

Unfortunately, the MongoDB PHP Driver only had support for 32-bit integers causing problems for newer users of Facebook. For those users, their nice long UserID was truncated to only 32 bits which didn’t quite make the application work.

Basically, while MongoDB itself supports very large 64-bit numbers, the PHP driver only support(ed) up to 32 bits … so when you inserted into MongoDB via PHP you’d get a truncated (cut-off) number.

The good news is this has now been fixed in driver version 1.0.9 (released 8/6/2010) and up, however you might be wondering:

  • How can I tell which driver I have?
  • How do I upgrade?
  • How do I configure my PHP settings to allow for 64 bit numbers?

Check Your Version Number

One very quick and easy way to check your Mongo PHP driver version is to create a .php file on your server, add the following and save …

<?php
    phpinfo();
?>

Then open the page in your web browser and look for “Mongo” (or use ctrl+f to find it) … you should find something like this …

If you see Version 1.0.8 or lower you need to upgrade …
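
You can also check from inside a script … phpversion() accepts an extension name, so something like this makes a quick sanity check:

<?php
    // Prints the installed mongo extension version (or nothing if it isn't loaded)
    echo phpversion("mongo");
?>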

Keep It Simple, Upgrade

If you already have the PHP Mongo driver installed the easiest thing to do is simply run an upgrade. If you are running Linux you can use pecl, on Windows you can download and replace the .dll file.

If you have not done so, see our earlier post on the subject: MongoDB+PHP: Install and Connect

Via your Linux command line run …

# sudo pecl upgrade mongo

This should fetch the latest Mongo PHP driver and install it for you; if you already have the latest it will just exit. You will need to restart your webserver for the changes to take effect.
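
For example, on a typical Debian/Ubuntu box running Apache that restart looks something like this (adjust for your distro and web server) …

# sudo service apache2 restart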

To see if it worked you can use the phpinfo() method above.

Oh No, It Errored

If your server does not allow execution on /tmp you might get an error like this …

/usr/bin/phpize: 209:
/tmp/pear/temp/mongo/build/shtool: Permission denied

If so, you need to download and run the upgrade manually (in some dir other than /tmp/ …) Download the latest drivers here and follow the straightforward directions. It’s pretty easy, don’t worry!
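
For reference, the manual build is the usual PECL routine … roughly something like this, run from a directory you can execute in (your version number will differ):

$ tar zxvf mongo-1.0.9.tgz
$ cd mongo-1.0.9
$ phpize
$ ./configure
$ make
$ sudo make install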

How To Set PHP .ini Properly

Depending on the version of the driver you have installed (newer versions will have this set by default) you may need to turn on support for 64-bit long integers.

You can either do this in your php.ini or inline in your code. Edit your php.ini file and add the following line:

[mongodb]
mongo.native_long = 1

Or, do this in code like so:

<?php
ini_set('mongo.native_long', 1);

//code here
?>

I Can’t Upgrade? :(

For some users upgrading might not be an option.

Luckily there is a workaround: you can simply insert the 64-bit integers as strings, using quotes when you insert. You can also use a regular expression to help you.

Say you have a JSON array from Twitter with these large numbers … you can use the following regex to change them to strings …

$usertimeline = preg_replace('/id":(\d+)/', 'id":"\1"', $usertimeline);

While not the most optimized approach … if upgrading is not an option it may be your best bet.
