Very likely to be one of the next "big battles" for those trying to scale social networking and related tools/applications. Anyone have other candidates to keep track of? The one's below might be the start of a watch-list of sorts:
Large-scale graph computing at Google
In order to achieve that, we have created scalable infrastructure, named Pregel, to mine a wide range of graphs. In Pregel, programs are expressed as a sequence of iterations. In each iteration, a vertex can, independently of other vertices, receive messages sent to it in the previous iteration, send messages to other vertices, modify its own and its outgoing edges' states, and mutate the graph's topology (experts in parallel processing will recognize that the Bulk Synchronous Parallel Model inspired Pregel).
Currently, Pregel scales to billions of vertices and edges, but this limit will keep expanding. Pregel's applicability is harder to quantify, but so far we haven't come across a type of graph or a practical graph computing problem which is not solvable with Pregel. It computes over large graphs much faster than alternatives, and the application programming interface is easy to use. Implementing PageRank, for example, takes only about 15 lines of code. Developers of dozens of Pregel applications within Google have found that "thinking like a vertex," which is the essence of programming in Pregel, is intuitive.
We've been using Pregel internally for a while now, but we are beginning to share information about it outside of Google. Greg Malewicz will be speaking at the joint industrial track between ACM PODC and ACM SPAA this August on the very subject. In case you aren't able to join us there, here's a spoiler: The seven bridges of Königsberg — inspiration for Leonhard Euler's famous theorem that established the basics of graph theory — spanned the Pregel river.
Neo4j - a Graph Database that Kicks Buttox | High Scalability
A graph is a collection nodes (things) and edges (relationships) that connect pairs of nodes. Slap properties (key-value pairs) on nodes and relationships and you have a surprisingly powerful way to represent most anything you can think of. In a graph database "relationships are first-class citizens. They connect two nodes and both nodes and relationships can hold an arbitrary amount of key-value pairs. So you can look at a graph database as a key-value store, with full support for relationships."
A graph looks something like:
For more lovely examples take a look at the Graph Image Gallery.
Here's a good summary by Emil Eifrem, founder of the Neo4j, making the case for why graph databases rule:
Most applications today handle data that is deeply associative, i.e. structured as graphs (networks). The most obvious example of this is social networking sites, but even tagging systems, content management systems and wikis deal with inherently hierarchical or graph-shaped data.
This turns out to be a problem because it’s difficult to deal with recursive data structures in traditional relational databases. In essence, each traversal along a link in a graph is a join, and joins are known to be very expensive. Furthermore, with user-driven content, it is difficult to pre-conceive the exact schema of the data that will be handled. Unfortunately, the relational model requires upfront schemas and makes it difficult to fit this more dynamic and ad-hoc data.
A graph database uses nodes, relationships between nodes and key-value properties instead of tables to represent information. This model is typically substantially faster for associative data sets and uses a schema-less, bottoms-up model that is ideal for capturing ad-hoc and rapidly changing data.
HeartProposal – Heart Project
Heart will develop a Hadoop subsystem for RDF data store and a distributed processing engine which use Hbase + MapReduce to store and to process RDF data.
We can store very sparse RDF data in a single table in Hbase, with as many columns as they need. For example, we might make a row for each RDF subject in a table and store all the properties and their values as columns in the table. This reduces costly self-joins in answering queries asking questions on the same subject, which results in efficient processing of queries, although we still need self-joins to answer RDF path queries.
We can further accelerate query performance by using MapReduce for parallel, distributed query processing.