Archive for the ‘Networking’ Category

Social Graph & Beyond: Tim Berners-Lee’s Graph is The Next Level

Written by Richard MacManus / November 22, 2007 5:55 PM / 12 Comments


Tim Berners-Lee, inventor of the World Wide Web, today published a blog post about what he terms the Graph, which is similar (if not identical) to his Semantic Web vision. Referencing both Brad Fitzpatrick’s influential post earlier this year on Social Graph, and our own Alex Iskold’s analysis of Social Graph concepts, Berners-Lee went on to position the Graph as the third main “level” of computer networks. First there was the Internet, then the Web, and now the Graph – which Sir Tim labeled (somewhat tongue in cheek) the Giant Global Graph!

Note that Berners-Lee wasn’t specifically talking about the Social Graph, which is the term Facebook has been heavily promoting, but something more general. In a nutshell, this is how Berners-Lee envisions the 3 levels (a.k.a. layers of abstraction):

1. The Internet: links computers
2. Web: links documents
3. Graph: links relationships between people and/or documents — “the things documents are about” as Berners-Lee put it.

The Graph is all about connections and re-use of data. Berners-Lee wrote that Semantic Web technologies will enable this:

“So, if only we could express these relationships, such as my social graph, in a way that is above the level of documents, then we would get re-use. That’s just what the graph does for us. We have the technology — it is Semantic Web technology, starting with RDF OWL and SPARQL. Not magic bullets, but the tools which allow us to break free of the document layer.”
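The idea of relationships expressed "above the level of documents" can be sketched in a few lines of code. This is our illustration, not Berners-Lee's: plain Python tuples stand in for RDF triples, and a pattern-matching function stands in for a SPARQL query. All names and URIs are hypothetical.

```python
# Minimal sketch of graph-level data: facts stored as
# (subject, predicate, object) triples, the model RDF uses.
triples = {
    ("http://example.org/alice", "knows",     "http://example.org/bob"),
    ("http://example.org/alice", "knows",     "http://example.org/carol"),
    ("http://example.org/bob",   "worksWith", "http://example.org/carol"),
}

def query(subject=None, predicate=None, obj=None):
    """Return triples matching the pattern (None = wildcard),
    much as a SPARQL pattern like { ?s foaf:knows ?o } matches a graph."""
    return [
        (s, p, o) for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# Who does Alice know, regardless of which document asserted it?
friends = [o for (_, _, o) in query("http://example.org/alice", "knows")]
```

The point of the sketch is re-use: once the facts live in the graph rather than in any one page, any application can ask the same question of the same data.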

Sir Tim also notes that as we go up each level, we lose more control but gain more benefits: “…at each layer — Net, Web, or Graph — we have ceded some control for greater benefits.” The benefits are what happens when documents and data are connected – for example being able to re-use our personal and friends data across multiple social networks, which is what Google’s OpenSocial aims to achieve.

What’s more, says Berners-Lee, the Graph has major implications for the Mobile Web. He said that longer term “thinking in terms of the graph rather than the web is critical to us making best use of the mobile web, the zoo of wildly differing devices which will give us access to the system.” The following scenario sums it up very nicely:

“Then, when I book a flight it is the flight that interests me. Not the flight page on the travel site, or the flight page on the airline site, but the URI (issued by the airlines) of the flight itself. That’s what I will bookmark. And whichever device I use to look up the bookmark, phone or office wall, it will access a situation-appropriate view of an integration of everything I know about that flight from different sources. The task of booking and taking the flight will involve many interactions. And all throughout them, that task and the flight will be primary things in my awareness, the websites involved will be secondary things, and the network and the devices tertiary.”

Conclusion

I’m very pleased Tim Berners-Lee has appropriated the concept of the Social Graph and married it to his own vision of the Semantic Web. What Berners-Lee wrote today goes way beyond Facebook, OpenSocial, or social networking in general. It is about how we interact with data on the Web (whether it be mobile or PC or a device like the Amazon Kindle) and the connections that we can take advantage of using the network. This is also why Semantic Apps are so interesting right now, as they take data connection to the next level on the Web.

Overall, unlike Nick Carr, I’m not concerned whether mainstream people accept the term ‘Graph’ or ‘Social Graph’. It really doesn’t matter, so long as the web apps that people use enable them to participate in this ‘next level’ of the Web. That’s what Google, Facebook, and a lot of other companies are trying to achieve.

Incidentally, it’s great to see Tim Berners-Lee ‘re-using’ concepts like the Social Graph, or simply taking inspiration from them. He never really took to the Web 2.0 concept, perhaps because it became too hyped and commercialized, but the fact is that the Consumer Web has given us many innovations over the past few years. Everything from Google to YouTube to MySpace to Facebook. So even though Sir Tim has always been about graphs (as he noted in his post, the Graph is essentially the same as the Semantic Web), it’s fantastic he is reaching out to the ‘web 2.0’ community and citing people like Brad Fitzpatrick and Alex Iskold.

Related: check out Alex Iskold’s Social Graph: Concepts and Issues for an overview of the theory behind Social Graph. This is the post Tim Berners-Lee referenced. Also check out Alex’s latest post today: R/WW Thanksgiving: Thank You Google for Open Social (Or, Why Open Social Really Matters).



Report: Semantic Web Companies Are, or Will Soon Begin, Making Money

Written by Marshall Kirkpatrick / October 3, 2008 5:13 PM / 14 Comments


Semantic Web entrepreneur David Provost has published a report about the state of business in the Semantic Web, and it’s a good read for anyone interested in the sector. It’s titled On the Cusp: A Global Review of the Semantic Web Industry. We also mentioned it in our post Where Are All The RDF-based Semantic Web Apps?.

The Semantic Web is a collection of technologies that makes the meaning of content online understandable by machines. After surveying 17 Semantic Web companies, Provost concludes that Semantic science is being productized, differentiated, invested in by mainstream players and increasingly sought after in the business world.

Provost aims to use real-world examples to articulate the value proposition of the Semantic Web in accessible, non-technical language. That there are enough examples available for him to do this is great. His conclusions don’t always seem as well supported by his evidence as he’d like – but the profiles he writes of 17 Semantic Web companies are very interesting to read.

What are these companies doing? Provost writes:

“…some companies are beginning to focus on specific uses of Semantic technology to create solutions in areas like knowledge management, risk management, content management and more. This is a key development in the Semantic Web industry because until fairly recently, most vendors simply sold development tools.”


The report surveys companies ranging from the innovative but unlaunched Anzo for Excel from Cambridge Semantics, to well-known big players like Dow Jones Client Solutions and RWW sponsor Reuters Calais Initiative, to relatively unknown big players like the already very commercialized Expert System. 10 of the companies were from the US, 6 from Europe and 1 from South Korea.

Above: Chart from Provost’s report. We’ve been wanting to learn more about “under the radar” but commercialized Semantic Web companies ever since doing a briefing with Expert System a few months ago. We had never heard of the Italian company before, but they believe they already have a richer, deeper semantic index than anyone else online. They told us their database at the time contained 350k English words and 2.8m relationships between them, including geographic representations. They power Microsoft’s spell checker and the natural language processing (NLP) in the BlackBerry. They also sell NLP software to the US military and Department of Homeland Security, which didn’t seem like anything to brag about to us but presumably makes up a significant part of the $12 million+ in revenue they told Provost they made last year.

And some people say the Semantic Web only exists inside the laboratories of Web 3.0 eggheads!

Shortcomings of the Report

Provost writes that “the vendors [in] this report have all the appearances of thriving, emerging technology companies and they have shown their readiness to cross borders, continents, and oceans to reach customers.” You’d think they turned water into wine. Those are strong words for a study in which only 4 of 17 companies were willing to report their revenue and several hadn’t launched products yet.

The logic here is sometimes pretty amazing.

The above examples [there were two discussed – RWW] are just a brief sampling of the commercial success that the Semantic Web has been experiencing. In broad terms, it’s easy to point out the longevity of many companies in this industry and use that as a proxy for commercial success [wow – RWW]. With more time (and space in this report), additional examples could be described but the most interesting prospect pertains to what the industry landscape will look like in twelve months. [hmmm…-RWW]


In fact, while Provost has glowingly positive things to say about all the companies he surveyed, the absence of engagement with any of their shortcomings makes the report read more like marketing material than an objective take on what’s supposed to be world-changing technology.

This is a Fun Read

The fact is, though, that Provost writes a great introduction to many companies working to sell software in a field still too widely believed to be ephemeral. The stories of each of the 17 companies profiled are fun to read and many of Provost’s points of analysis are both intuitive and thought provoking.

He says the sector is “on the cusp” of major penetration into existing markets currently served by non-semantic software. Provost argues that the Semantic Web struggles to explain itself because the World Wide Web is so intensely visual and semantics are not. He says that reselling business partners in specific distribution channels are combining their domain knowledge with the science of the software developers to bring these tools to market. He tells a great, if unattributed, story about what Linked Data could mean to the banking industry.

We hadn’t heard of several of the companies profiled in the report, and a handful of them had never been mentioned by the 34 semantic web specialist blogs we track, either.

There’s something here for everyone. You can read the full report here.


10 Semantic Apps to Watch

Written by Richard MacManus / November 29, 2007 12:30 AM / 39 Comments


One of the highlights of October’s Web 2.0 Summit in San Francisco was the emergence of ‘Semantic Apps’ as a force. Note that we’re not necessarily talking about the Semantic Web, the W3C-led initiative of Tim Berners-Lee that touts technologies like RDF, OWL and other standards for metadata. Semantic Apps may use those technologies, but not necessarily. This was a point made by the founder of one of the Semantic Apps listed below, Danny Hillis of Freebase (who is as much a tech legend as Berners-Lee).

The purpose of this post is to highlight 10 Semantic Apps. We’re not touting this as a ‘Top 10’, because there is no way to rank these apps at this point – many are still non-public apps, e.g. in private beta. It reflects the nascent status of this sector, even though people like Hillis and Spivack have been working on their apps for years now.

What is a Semantic App?

Firstly let’s define “Semantic App”. A key element is that the apps below all try to determine the meaning of text and other data, and then create connections for users. Another of the founders mentioned below, Nova Spivack of Twine, noted at the Summit that data portability and connectibility are keys to these new semantic apps – i.e. using the Web as platform.

In September Alex Iskold wrote a great primer on this topic, called Top-Down: A New Approach to the Semantic Web. In that post, Alex Iskold explained that there are two main approaches to Semantic Apps:

1) Bottom Up – involves embedding semantic annotations (metadata) right into the data.
2) Top Down – relies on analyzing existing information; the ultimate top-down solution would be a full-blown natural language processor, which is able to understand text like people do.
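The top-down approach can be illustrated with a toy sketch (ours, not Iskold's): derive structure from ordinary text by matching it against a small gazetteer of known entities. Real top-down systems use full natural language processing rather than string matching, and the entity list here is invented.

```python
# Hedged sketch of "top-down" semantics: structure is extracted from
# existing plain text rather than embedded in it by the author.
# The gazetteer below is a toy stand-in for a real NLP pipeline.
GAZETTEER = {
    "Tim Berners-Lee": "Person",
    "W3C": "Organization",
    "RDF": "Technology",
}

def annotate(text):
    """Return (entity, type) pairs for known entities found in the text."""
    return [(name, etype) for name, etype in GAZETTEER.items() if name in text]

found = annotate("Tim Berners-Lee leads the W3C work on RDF.")
```

A bottom-up system, by contrast, would expect the page author to have embedded those same annotations (as RDFa or microformat markup, say) so that no analysis step is needed.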

Now that we know what Semantic Apps are, let’s take a look at some of the current leading (or promising) products…

Freebase

Freebase aims to “open up the silos of data and the connections between them”, according to founder Danny Hillis at the Web 2.0 Summit. Freebase is a database that has all kinds of data in it, plus an API. Because it’s an open database, anyone can enter new data into Freebase. An example page in the Freebase db looks pretty similar to a Wikipedia page. When you enter new data, the app can make suggestions about content. The topics in Freebase are organized by type, and you can connect pages with links and semantic tags. So in summary, Freebase is all about shared data and what you can do with it.

Powerset

Powerset (see our coverage here and here) is a natural language search engine. The system relies on semantic technologies that have only become available in the last few years. It can make “semantic connections”, which help build its semantic database. The idea is that meaning and knowledge get extracted automatically by Powerset. The product isn’t yet public, but it has been riding a wave of publicity over 2007.

Twine

Twine claims to be the first mainstream Semantic Web app, although it is still in private beta. See our in-depth review. Twine automatically learns about you and your interests as you populate it with content – a “Semantic Graph”. When you put in new data, Twine picks out and tags certain content with semantic tags – e.g. the name of a person. An important point is that Twine creates new semantic and rich data. But it’s not all user-generated. They’ve also done machine learning against Wikipedia to ‘learn’ about new concepts. And they will eventually tie into services like Freebase. At the Web 2.0 Summit, founder Nova Spivack compared Twine to Google, saying it is a “bottom-up, user generated crawl of the Web”.

AdaptiveBlue

AdaptiveBlue are makers of the Firefox plugin, BlueOrganizer. They also recently launched a new version of their SmartLinks product, which allows web site publishers to add semantically charged links to their site. SmartLinks are browser ‘in-page overlays’ (similar to popups) that add additional contextual information to certain types of links, including links to books, movies, music, stocks, and wine. AdaptiveBlue supports a large list of top web sites, automatically recognizing and augmenting links to those properties.

SmartLinks works by understanding specific types of information (in this case links) and wrapping them with additional data. SmartLinks takes unstructured information and turns it into structured information by understanding a basic item on the web and adding semantics to it.

[Disclosure: AdaptiveBlue founder and CEO Alex Iskold is a regular RWW writer]

Hakia

Hakia is one of the more promising Alt Search Engines around, with a focus on natural language processing methods to try and deliver ‘meaningful’ search results. Hakia attempts to analyze the concept of a search query, in particular by doing sentence analysis. Most other major search engines, including Google, analyze keywords. The company told us in a March interview that the future of search engines will go beyond keyword analysis – search engines will talk back to you and in effect become your search assistant. One point worth noting here is that, currently, Hakia has limited post-editing/human interaction for the editing of hakia Galleries, but the rest of the engine is 100% computer powered.

Hakia has two main technologies:

1) QDEX Infrastructure (which stands for Query Detection and Extraction) – this does the heavy lifting of analyzing search queries at a sentence level.

2) SemanticRank Algorithm – this is essentially the science they use, made up of ontological semantics that relate concepts to each other.

Talis

Talis is a 40-year-old UK software company which has created a semantic web application platform. They are a bit different from the other 9 companies profiled here, as Talis has released a platform and not a single product. The Talis platform is kind of a mix between Web 2.0 and the Semantic Web, in that it enables developers to create apps that allow for sharing, remixing and re-using data. Talis believes that Open Data is a crucial component of the Web, yet there is also a need to license data in order to ensure its openness. Talis has developed its own content license, called the Talis Community License, and recently they funded some legal work around the Open Data Commons License.

According to Dr Paul Miller, Technology Evangelist at Talis, the company’s platform emphasizes “the importance of context, role, intention and attention in meaningfully tracking behaviour across the web.” To find out more about Talis, check out their regular podcasts – the most recent one features Kaila Colbin (an occasional AltSearchEngines correspondent) and Branton Kenton-Dau of VortexDNA.

UPDATE: Marshall Kirkpatrick published an interview with Dr Miller the day after this post. Check it out here.

TrueKnowledge

Venture funded UK semantic search engine TrueKnowledge unveiled a demo of its private beta earlier this month. It reminded Marshall Kirkpatrick of the still-unlaunched Powerset, but it’s also reminiscent of the very real Ask.com “smart answers”. TrueKnowledge combines natural language analysis, an internal knowledge base and external databases to offer immediate answers to various questions. Instead of just pointing you to web pages where the search engine believes it can find your answer, it will offer you an explicit answer and explain the reasoning path by which that answer was arrived at. There’s also an interesting looking API at the center of the product. “Direct answers to human and machine questions” is the company’s tagline.

Founder William Tunstall-Pedoe said he’s been working on the software for the past 10 years, really putting time into it since coming into initial funding in early 2005.

TripIt

TripIt is an app that manages your travel planning. Emre Sokullu reviewed it when it presented at TechCrunch40 in September. With TripIt, you forward incoming bookings to plans@tripit.com and the system manages the rest. Their patent-pending “itinerator” technology is a baby step in the semantic web – it extracts useful information from these emails and builds a well-structured, organized presentation of your travel plan. It pulls in information from Wikipedia for the places that you visit. It uses microformats – the iCal format, which is well integrated into Google Calendar and other calendar software.

The company claimed at TC40 that “instead of dealing with 20 pages of planning, you just print out 3 pages and everything is done for you”. Their future plans include a recommendation engine which will tell you where to go and who to meet.
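The general shape of that extraction step, pulling structured fields out of a booking confirmation, can be sketched as follows. This is purely illustrative: TripIt's "itinerator" is proprietary and patent-pending, and the email format and field patterns here are invented.

```python
import re

# Toy parser for a hypothetical confirmation email; real extraction
# must cope with hundreds of airline and hotel email formats.
EMAIL = """Confirmation: ZX123
Flight: UA 889
Depart: SFO 2008-11-21 09:15
Arrive: JFK 2008-11-21 17:40"""

def parse_booking(text):
    """Extract a few known fields into a structured itinerary dict."""
    fields = {}
    for key, pattern in [
        ("confirmation", r"Confirmation:\s*(\S+)"),
        ("flight",       r"Flight:\s*(.+)"),
        ("depart",       r"Depart:\s*(.+)"),
        ("arrive",       r"Arrive:\s*(.+)"),
    ]:
        m = re.search(pattern, text)
        if m:
            fields[key] = m.group(1).strip()
    return fields

itinerary = parse_booking(EMAIL)
```

Once the data is in structured form like this, emitting it as iCal for calendar software, or enriching it with Wikipedia data about the destination, becomes straightforward.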

ClearForest

ClearForest is one of the companies in the top-down camp. We profiled the product in December ’06 and at that point ClearForest was applying its core natural language processing technology to facilitate next generation semantic applications. In April 2007 the company was acquired by Reuters. The company has both a Web Service and a Firefox extension that leverages an API to deliver the end-user application.

The Firefox extension is called Gnosis and it enables you to “identify the people, companies, organizations, geographies and products on the page you are viewing.” With one click from the menu, a webpage you view via Gnosis is filled with various types of annotations. For example, it recognizes Companies, Countries, Industry Terms, Organizations, People, Products and Technologies. Each word that Gnosis recognizes gets colored according to its category.

Also, ClearForest’s Semantic Web Service offers a SOAP interface for analyzing text, documents and web pages.

Spock

Spock is a people search engine that got a lot of buzz when it launched. Alex Iskold went so far as to call it “one of the best vertical semantic search engines built so far.” According to Alex, there are four things that make their approach special:

  • The person-centric perspective of a query
  • Rich set of attributes that characterize people (geography, birthday, occupation, etc.)
  • Usage of tags as links or relationships between people
  • Self-correcting mechanism via user feedback loop

As a vertical engine, Spock knows important attributes that people have: name, gender, age, occupation and location just to name a few. Perhaps the most interesting aspect of Spock is its usage of tags – all frequent phrases that Spock extracts via its crawler become tags; and also users can add tags. So Spock leverages a combination of automated tags and people power for tagging.

Conclusion

What have we missed? 😉 Please use the comments to list other Semantic Apps you know of. It’s an exciting sector right now, because Semantic Web and Web 2.0 technologies alike are being used to create new semantic applications. One gets the feeling we’re only at the beginning of this trend.


10 More Semantic Apps to Watch

Written by Richard MacManus / November 20, 2008 10:00 AM / 16 Comments


In November 2007, we listed 10 Semantic apps to watch and yesterday we published an update on what each had achieved over the past year. All of them are still alive and well – a couple are thriving, some are experimenting and a few are still finding their way.

Now we’re going to list 10 more Semantic apps to watch. These are all apps that have gotten onto our radar over 2008. We’ve reviewed all but one of them, so click through to the individual reviews for more detail. It should go without saying, but this is by no means an exhaustive list – so if we haven’t mentioned your favorite, please add it in the comments.

BooRah

BooRah is a restaurant review site that we first reviewed earlier this year. One of BooRah’s most interesting aspects is that it uses semantic analysis and natural language processing to aggregate reviews from food blogs. Because of this, BooRah can recognize praise and criticism in these reviews and then rates restaurants accordingly. BooRah also gathers reviews from Citysearch, Tripadvisor and other large review sites.

BooRah also announced last month the availability of an API that will allow other web sites and businesses to offer online reviews and ratings from BooRah to their customers. The API will surface most of BooRah’s data about a given restaurant, including ratings, menus, discounts, and coupons.

Swotti

Swotti is a semantic search engine that aggregates opinions about products to help you make purchasing decisions. We reviewed the product back in March. Swotti aggregates opinions from product review sites, forums and discussion boards, web sites and blogs, then categorizes each review by the feature or aspect of the product being discussed, tags it accordingly, and rates it as positive or negative.

Dapper MashupAds

Earlier this month we wrote about the recent improvement in Dapper MashupAds, a product we first spotted over a year ago. The idea is that publishers can tell Dapper: this is the place on my web page where the title of a movie will appear, now serve up a banner ad that’s related to whatever movie this page happens to be about. That could be movies, books, travel destinations – anything. We remarked that the UI for this has grown much more sophisticated in the past year.

How this works: in the back end, Dapper will be analyzing the fields that publishers identify and will apply a layer of semantic classification on top of them. The company believes that its new ad network will provide monetary incentive for publishers to have their websites marked up semantically. Dapper also has a product called Semantify, for SEO – see our review of that.

For more on Semantic advertising, see our write-up of a panel on this topic from the Web 3.0 Conference.

Inform.com

Inform.com analyzes content from online publishers and inserts links from a publisher’s own content archives, affiliated sites, or the web at large, to augment content being published. We reviewed it in January, when the company had more than 100 clients – including CNN.com, WashingtonPost.com and the Economist.

Inform says its technology determines the semantic meaning of key words in millions of news stories around the web every day in order to recommend related content. The theory is that by automating the process of relevant link discovery and inclusion, Inform can easily add substantial value to a publisher’s content. Inform also builds out automatic topic pages, something you can see around WashingtonPost and CNN.com.

Siri

We have met our share of secretive startups over the years, but few have been as secretive about their plans as Siri, which was founded in December 2007 and did not even have an official name until October this year. Siri was spun out of SRI International and its core technology is based on the highly ambitious CALO artificial intelligence project.

In our October post on Siri, we discovered that Siri is working on a “personalized assistant that learns.” We expect Siri to have a strong information management aspect, combined with some novel interface ideas. Based on our discussion with founders Dag Kittlaus and Adam Cheyer in October, we think that there will be a strong mobile aspect to Siri’s product and at least some emphasis on location awareness. Siri plans to launch in the first half of 2009.

Evri

Evri is a semantic search engine, backed by Paul Allen (of Microsoft fame), that launched into a limited beta in June. Evri is a search engine, though it adds a very sophisticated semantic layer on top of its results that emphasizes the relationships between different search terms. It especially prides itself on having developed a system that can distinguish between grammatical roles such as subjects, verbs, and objects to create these connections. You can check out a tour of Evri here.

UpTake

Semantic search startup UpTake (formerly Kango) aims to make the process of booking travel online easier. In our review in May, we explained that UpTake is a vertical search engine that has assembled what it says is the largest database of US hotels and activities – over 400,000 of them – from more than 1,000 different travel sites. Using a top-down approach, UpTake looks at its database of over 20 million reviews, opinions, and descriptions of hotels and activities in the US and semantically extracts information about those destinations.

Imindi

Imindi is essentially a mind mapping tool, although it markets itself as a “Thought Engine”. Imindi was recommended to us in the comments to our previous post by Yihong Ding, who called it “an untraditional Semantic Web service”. Yihong said that traditionally Semantic Web services employ machines to understand humans, however Imindi’s approach is to encourage humans to better understand each other via machines.

Imindi has met with a fair amount of skepticism so far – and indeed it appears to be reaching big with its AI associations. However we think it’s worth watching, if for no other reason than to see if it can live up to the description on its About page: “By capturing the free form associations of user’s logic and intuition, IMINDI is building a global mind index which is an entirely new resource for building collective intelligence and leveraging human creativity and subjectivity on the web.”

See also: Thinkbase: Mapping the World’s Brain

Juice

We’ve all been there. You started reading something on the Web, saw something interesting in the article, searched for it, wound up somewhere else, and after about 12 hops you’ve forgotten exactly what it was you were looking for. If only there were some way to select that topic midstream and have the information automagically appear for you, without disrupting your workflow or sending you traipsing off into the wilds of the Web.

If that sounds familiar, you may need a shot of Juice, a new Firefox 3 add-on currently in public beta from Linkool Labs that makes researching Web content as easy as click-and-drag. In our review of Juice, we concluded that it avoids some of the more traditional stumbling blocks of Semantic apps by taking a very top-down approach focused on a distinct data set.

Faviki

Faviki is a new social bookmarking tool which we reviewed back in May. It offers something that services like Ma.gnolia, del.icio.us and Diigo do not – semantic tagging capabilities. What this means is that instead of having users haphazardly enter tags to describe the links they save, Faviki will suggest tags to be used instead. However, unlike other services, Faviki’s suggestions don’t just come from a community of users and their tagging history, but from structured information extracted straight out of the Wikipedia database.

Because Faviki uses structured tagging, there is more that can be learned about a particular tag, its properties, and its connections to other tags. The system will automatically know what tags belong together and how they relate to others.
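The mechanics of structured tagging can be sketched like so. This is our own toy illustration, not Faviki's code: the alias table is invented, standing in for the concept and redirect data the real service draws from Wikipedia.

```python
# Sketch of structured tag suggestion: free-form input is resolved to a
# canonical concept, so different spellings land on one shared tag.
# The alias table is invented for illustration.
CANONICAL = {
    "nyc": "New_York_City",
    "new york": "New_York_City",
    "big apple": "New_York_City",
    "semweb": "Semantic_Web",
    "semantic web": "Semantic_Web",
}

def suggest_tag(user_input):
    """Return the canonical tag for a user's free-form input, if known."""
    return CANONICAL.get(user_input.strip().lower())

# Three different spellings resolve to one linkable concept.
tags = {suggest_tag(t) for t in ["NYC", "new york", "Big Apple"]}
```

Because every bookmark tagged this way points at the same concept, the system can follow that concept's properties and connections to other tags, which haphazard free-text tags cannot support.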

Conclusion

The Semantic Web continues to inch closer to reality, by being used in products such as BooRah, Inform.com and Juice. Let us know your thoughts on the above 10 products, and of course any that we missed this time round.



50+ Semantic Web Pros to Follow on Twitter

Written by Marshall Kirkpatrick / January 19, 2009 6:48 PM / 27 Comments


Here at ReadWriteWeb, we find the Semantic Web fascinating. We write about it a lot. What is the semantic web? The way we explain it is that it’s a paradigm advocating that the meaning of content on the web be made machine-readable.

Why would you want to do that? Because once the “meaning” of text is automatically discernible, there’s a whole new world of things we can do with content on the web. Far out things that full text search for the mere presence of keywords would never be able to accomplish. Who’s working on the semantic web and how can you meet them? Read on.

In November, 2007 we published a list of 10 Semantic Web companies to watch. Then, one year later, we published a new list for 2008 of Semantic Web companies to watch.

Based on those lists, and reader suggestions in comments of other companies that should be watched, we present to you a list of 50+ Twitter users who work at Semantic Web companies. If you find this sector as interesting as we do, you might want to add some of these people to your microblogging community. You can click through the arrows in the iframe below to scroll through all the accounts and add the people listed. RSS readers who’d like to see the list should click through to the full post.


A handful of these are company accounts, but most are accounts from individual employees. Want to suggest anyone we missed? (We know there are lots we’ve missed!) Let us know in comments. You can also meet the RWW crew on Twitter.

If this iFrame is driving you batty, see also this old list of links to all the accounts displayed below.


Google: “We’re Not Doing a Good Job with Structured Data”

Written by Sarah Perez / February 2, 2009 7:32 AM / 9 Comments


During a talk at the New England Database Day conference at the Massachusetts Institute of Technology, Google’s Alon Halevy admitted that the search giant has “not been doing a good job” presenting the structured data found on the web to its users. By “structured data,” Halevy was referring to the databases of the “deep web” – those internet resources that sit behind forms and site-specific search boxes, unable to be indexed through passive means.

Google’s Deep Web Search

Halevy, who heads the “Deep Web” search initiative at Google, described the “Shallow Web” as containing about 5 million web pages while the “Deep Web” is estimated to be 500 times the size. This hidden web is currently being indexed in part by Google’s automated systems that submit queries to various databases, retrieving the content found for indexing. In addition to that aspect of the Deep Web – dubbed “vertical searching” – Halevy also referenced two other types of Deep Web Search: semantic search and product search.

Google wants to also be able to retrieve the data found in structured tables on the web, said Halevy, citing a table on a page listing the U.S. presidents as an example. There are 14 billion such tables on the web, and, after filtering, about 154 million of them are interesting enough to be worth indexing.
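Filtering 14 billion tables down to the 154 million worth indexing implies heuristics for telling data tables apart from layout tables. The sketch below is our guess at the kind of signal involved; Google's actual criteria are not public.

```python
# Hypothetical heuristics for separating data tables (like a list of US
# presidents) from layout tables. A table is a list of rows, each a list
# of cell strings. Thresholds are invented for illustration.
def looks_like_data_table(rows, min_rows=3, min_cols=2):
    if len(rows) < min_rows:
        return False
    widths = {len(r) for r in rows}
    if len(widths) != 1 or widths.pop() < min_cols:
        return False  # ragged or too narrow: likely page layout
    # A plausible header row: short, non-empty, non-numeric labels.
    header = rows[0]
    return all(cell and not cell.isdigit() for cell in header)

presidents = [
    ["Name", "Took office"],
    ["George Washington", "1789"],
    ["John Adams", "1797"],
    ["Thomas Jefferson", "1801"],
]
layout = [["", "nav sidebar"], ["main content", ""]]
```

Real filters would weigh many more signals (cell length distributions, column type consistency, surrounding markup), but the same accept-or-reject shape applies.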

Can Google Dig into the Deep Web?

The question that remains is whether or not Google’s current search engine technology is going to be adept at doing all the different types of Deep Web indexing or if they will need to come up with something new. As of now, Google uses the BigTable database and MapReduce framework for everything search-related, notes Alex Esterkin, Chief Architect at Infobright, Inc., a company delivering open source data warehousing solutions. During the talk, Halevy listed a number of analytical database application challenges that Google is currently dealing with: schema auto-complete, synonym discovery, creating entity lists, association between instances and aspects, and data-level synonym discovery. These challenges are addressed by Infobright’s technology, said Esterkin, but “Google will have to solve these problems the hard way.”

Also mentioned during the speech was how Google plans to organize “aspects” of search queries. The company wants to be able to separate exploratory queries (e.g., “Vietnam travel”) from ones where a user is in search of a particular fact (“Vietnam population”). The former query should deliver information about visa requirements, weather and tour packages, etc. In a way, this is like what the search service offered by Kosmix is doing. But Google wants to go further, said Halevy. “Kosmix will give you an ‘aspect,’ but it’s attached to an information source. In our case, all the aspects might be just Web search results, but we’d organize them differently.”
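A crude version of that exploratory-versus-fact split might look like the following. This is purely illustrative: Google's approach is surely statistical rather than a hand-written keyword list, and the word list here is invented.

```python
# Toy classifier for the query "aspects" idea: fact-seeking queries tend
# to name a specific attribute of an entity; exploratory ones don't.
FACT_WORDS = {"population", "capital", "height", "area", "gdp", "birthday"}

def query_kind(query):
    """Classify a query as 'fact' or 'exploratory' by attribute words."""
    words = set(query.lower().split())
    return "fact" if words & FACT_WORDS else "exploratory"

kinds = [query_kind("Vietnam travel"), query_kind("Vietnam population")]
```

An exploratory classification would trigger the aspect-organized results Halevy describes (visas, weather, tours), while a fact classification would favor a direct answer.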

Yahoo Working on Similar Structured Data Retrieval

The challenges facing Google today are also being addressed by their nearest competitor in search, Yahoo. In December, Yahoo announced that they were taking their SearchMonkey technology in-house to automate the extraction of structured information from large classes of web sites. The results of that in-house extraction technique will allow Yahoo to augment their Yahoo Search results with key information returned alongside the URLs.

In this aspect of web search, it’s clear that no single company has come to dominate. However, even if a non-Google company surges ahead, it may not be enough to get people to switch engines. Today, “Google” has become synonymous with web search, just like “Kleenex” is a tissue, “Band-Aid” is an adhesive bandage, and “Xerox” is a way to make photocopies. Once that psychological mark has been made on our collective psyche and the habit formed, people tend to stick with what they know, regardless of who does it better. That’s a bit troublesome – if better search technology for indexing the Deep Web comes into existence outside of Google, the world may not end up using it until Google either duplicates or acquires the invention.

Still, it’s far too soon to write Google off yet. They clearly have a lead when it comes to search and that came from hard work, incredibly smart people, and innovative technical achievements. No doubt they can figure out this Deep Web thing, too. (We hope).

