What If IBM’s Watson Dethroned the King of Search?

Google continues to top the search game with the mission of “organiz[ing] the world’s information and mak[ing] it universally accessible and useful.” But that mission now looks limited, given how rapidly artificial intelligence has pushed the boundaries of what’s possible and raised our expectations of what computers can do. Even Siri has. In that light, Google is basically a gigantic database with rich access and retrieval mechanisms but no ability to create new knowledge.

via Google in Jeopardy: What If IBM’s Watson Dethroned the King of Search? | Wired Opinion | Wired.com.

In other words: Google can retrieve, but Watson can create.

Solr: The Most Important Open Source Project You’ve Never Heard Of

Lucene is used by many companies and groups as the foundation for their search engines. These organizations include AOL, Disney, and Eclipse. Lucene’s chief selling point is that the indexing engine, with a footprint of a mere megabyte of RAM, can index up to 150 GB of text per hour on commercial off-the-shelf hardware. That’s darn good!

Solr comes into the picture as the search platform front-end for Lucene. It provides full-text search, including the ability to handle formats such as Microsoft Word and PDF via Apache Tika; hit highlighting; and faceted search, which combines free-text searching with topic-taxonomy indexing.

via Solr: The Most Important Open Source Project You’ve Never Heard Of.
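The indexing-speed claim is easier to picture next to Lucene’s API. Below is a minimal sketch of feeding documents to an IndexWriter, assuming a recent Lucene release and a hypothetical local “index” directory; it shows where the small RAM buffer fits in, and is not an attempt to reproduce the benchmark.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class LuceneIndexSketch {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // The in-memory buffer Lucene fills before flushing a segment to disk;
        // even a small value keeps the indexing engine's footprint tiny.
        config.setRAMBufferSizeMB(16.0);
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("index")), config)) {
            Document doc = new Document();
            doc.add(new TextField("title", "Solr: the search platform on top of Lucene", Field.Store.YES));
            doc.add(new TextField("body", "Lucene provides the core indexing and search library.", Field.Store.NO));
            writer.addDocument(doc); // repeat for each document in the collection
        }
    }
}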

Under the hood, Solr is written in Java and relies on Lucene for its core search functionality. It usually runs inside a servlet container such as Jetty, using the standard javax.servlet API.
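Because Solr exposes everything over HTTP from that Jetty instance, a faceted query is just a request against the running server. The sketch below uses the SolrJ client (assuming a SolrJ 6-or-later release) against a hypothetical “articles” core with a “category” field; both names are illustrative assumptions, not part of any default setup.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SolrFacetSketch {
    public static void main(String[] args) throws Exception {
        // Talk to the Solr instance running inside Jetty on its default port.
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build()) {
            SolrQuery query = new SolrQuery("open source search"); // free-text part
            query.setFacet(true);
            query.addFacetField("category");                       // taxonomy part
            QueryResponse response = solr.query(query);
            FacetField categories = response.getFacetField("category");
            for (FacetField.Count c : categories.getValues()) {
                System.out.println(c.getName() + " (" + c.getCount() + ")");
            }
        }
    }
}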

Information Extraction and Synthesis Laboratory

Cross-document coreference resolution is the task of grouping the entity mentions in a collection of documents into sets that each represent a distinct entity. It is central to knowledge base construction and also useful for joint inference with other NLP components. Obtaining large, organic labeled datasets for training and testing cross-document coreference has previously been difficult. We use a method for automatically gathering massive amounts of naturally-occurring cross-document reference data to create the Wikilinks dataset, comprising 40 million mentions over 3 million entities. Our method is based on finding hyperlinks to Wikipedia from a web crawl and using anchor text as mentions. In addition to providing large-scale labeled data without human effort, we are able to include many styles of text beyond newswire and many entity types beyond people.

via Wikilinks – Information Extraction and Synthesis Laboratory.
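The core of the method, treating the anchor text of Wikipedia links as labeled entity mentions, can be sketched in a few lines. The example below uses the jsoup HTML parser over an inline snippet; it is only an illustration of the idea, not the lab’s actual extraction pipeline, and the sample page is made up.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class WikilinkMentionSketch {
    public static void main(String[] args) {
        // A crawled page in which two entities are mentioned via Wikipedia links.
        String html = "<p><a href=\"https://en.wikipedia.org/wiki/IBM_Watson\">Watson</a> "
                    + "famously beat humans at <a href=\"https://en.wikipedia.org/wiki/Jeopardy!\">the quiz show</a>.</p>";
        Document page = Jsoup.parse(html);
        // Each anchor pointing at Wikipedia yields one labeled mention:
        // the anchor text is the mention, the linked article is the entity.
        for (Element link : page.select("a[href*=en.wikipedia.org/wiki/]")) {
            String mention = link.text();
            String entity = link.attr("href").replaceAll("^.*/wiki/", "");
            System.out.println(mention + " -> " + entity);
        }
    }
}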

Google Must Pay For Libelous Search Result, Says Court

The jury at the Supreme Court of Victoria agreed with Google up to a point. The company wasn’t responsible for the results until Trkulja asked it to take them down, it said. (Read the decision in full here.) Because it stuck to its guns, Google must pay $200,000 in damages.

via Google Must Pay For Libelous Search Result, Says Court.

SHODAN – Computer Search Engine

So what does SHODAN index then? Good question. The bulk of the data is taken from ‘banners’, which are meta-data the server sends back to the client. This can be information about the server software, what options the service supports, a welcome message or anything else that the client would like to know before interacting with the server.

via SHODAN – Computer Search Engine.
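To make the “banner” idea concrete, here is a minimal sketch of grabbing one: connect to a service and read the greeting it volunteers before you send anything. The host below is a placeholder, and this is plain socket code for illustration, not how SHODAN’s crawler is actually built.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.InetSocketAddress;
import java.net.Socket;

public class BannerGrabSketch {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress("scanme.example.org", 22), 5000);
            socket.setSoTimeout(5000);
            BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
            // SSH, FTP, SMTP and Telnet servers all announce themselves first;
            // that first line is the banner, e.g. "SSH-2.0-OpenSSH_8.9".
            System.out.println(in.readLine());
        }
    }
}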

What ports does SHODAN index?

The majority of data is collected from web servers at the moment (port 80), but there is also some data from FTP (port 21), SSH (port 22) and Telnet (port 23) services. There are plans underway to expand the index to other services. Let me know if there are specific ports you would like to see included.
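For the web servers that make up most of the index, the closest equivalent of a banner is the set of response headers, in particular the Server header. A quick sketch, again against a placeholder host rather than anything SHODAN-specific:

import java.net.HttpURLConnection;
import java.net.URL;

public class ServerHeaderSketch {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL("http://example.com/").openConnection();
        conn.setRequestMethod("HEAD"); // headers only, no body needed
        conn.connect();
        // The Server header identifies the web server software, much like a banner.
        System.out.println("Server: " + conn.getHeaderField("Server"));
        conn.disconnect();
    }
}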