
Track 2, Day 1:

Use of Solr at Trovit, A Leading Search Engine For Classified Ads


Marc Sturlese

Slides
This presentation will describe how Trovit S.L. uses Solr to handle high-volume searching of classified ads. It begins with a summary of the architecture, the different types of indexes used, and how they are replicated and load balanced. Every user search triggers several concurrent requests to the indexes: one to the organic index and one to the "sponsored ads" index, and for each organic result a further request is made to the "similar user searches" index. At the same time, the user's search is saved and later processed to decide whether it should be indexed in the "similar user searches" index. The talk will then cover the key features Trovit relies on, some out of the box and some custom-developed. This session is aimed both at people getting started with Solr and at more experienced users looking for ideas on developing custom components or plugins. Topics covered include:
  • Types of index and architecture
  • Tuning and custom features
  • Sharding
  • Future directions including Zookeeper and distributed IDF using SOLR-1632
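The fan-out of one user search to several indexes described above can be sketched in plain Java. This is a minimal, hypothetical illustration of the pattern, not Trovit's actual code: the index names are taken from the abstract, and search() stands in for what would really be an HTTP request to a Solr core.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentSearch {
    // Stub: a real deployment would issue an HTTP query to a Solr core here.
    static String search(String index, String query) {
        return index + " results for '" + query + "'";
    }

    // Fire one request per index concurrently, then collect all answers.
    public static Map<String, String> fanOut(String query) {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        Map<String, Future<String>> pending = new LinkedHashMap<>();
        for (String index : new String[] {"organic", "sponsored", "similar-searches"}) {
            pending.put(index, pool.submit(() -> search(index, query)));
        }
        Map<String, String> results = new LinkedHashMap<>();
        try {
            for (Map.Entry<String, Future<String>> e : pending.entrySet()) {
                results.put(e.getKey(), e.getValue().get()); // block until each index answers
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(fanOut("flat in Barcelona"));
    }
}
```

The point of the pattern is that the overall latency is roughly that of the slowest index rather than the sum of all three.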
Speaker bio: Marc Sturlese is a backend developer working for Trovit in Barcelona, Spain. He has a special interest in search engine development and has been working with Solr for a year and a half. Previously he worked on Trovit's search engine built on top of Lucene. Marc has a degree in multimedia engineering from Enginyeria i Arquitectura La Salle in Barcelona.


Implementing Solr in Online Media As An Alternative to Commercial Search Products


Bo Raun

Slides
Nordjyske Medier is a Danish media company publishing on the web, radio, TV, and in print. For online search, Nordjyske Medier has a long history with relational databases and spent a few years using the Google Search Appliance. They discovered Solr when it became clear that these technologies could not fulfill new business needs.

Nordjyske Medier's entire IT strategy rests on Microsoft technologies (and other closed source solutions such as Citrix, VMware, etc.). Discovering an open source alternative like Solr has been an interesting journey for the whole IT staff. To ease the process, Nordjyske Medier made good use of services from Lucid Imagination and Findwise; for example, Lucid Imagination's Expert Link gave the IT server department some peace of mind when it came to ensuring support, high availability, and so on.

Going into Solr with a long relational database background gave Bo Raun a few surprises and experiences that newcomers may benefit from.

Generally speaking, Nordjyske Medier uses Solr in a fairly standard way (DIH, replication, etc.), which is precisely the point of this presentation: many other new or would-be users can learn from the mistakes made, and perhaps share some experience of their own.

Today Nordjyske Medier uses Solr for an editorial archive and for an online yellow pages directory, and more sites are expected to benefit from Solr in the near future. In addition, Nordjyske Medier is currently diving into more advanced subjects such as geosearch and integrating Solr with an ontology engine.



Speaker bio: Bo Raun has worked for almost 20 years in the software industry, mainly as a developer of Windows applications, web solutions, and various integration software. For about two years he has worked with Solr, implementing online search for the news media editorial archive and the yellow pages directory.


Bringing Solr to Drupal: A General and a Library-Specific Use Case


Péter Király

Slides
The presentation has two sections. The first covers integrating Apache Solr into the Drupal content management system and is based on Robert Douglass' apache_solr module and its extensions. The purpose of this module is to replace Drupal's default MySQL-based search by bringing Solr features into Drupal. It is used by top Drupal sites such as drupal.org and WhiteHouse.gov, and relies on a Drupal-specific Solr schema configuration. The second section is a use case covering the incorporation of library metadata (catalog records) into Drupal. It is based on the eXtensible Catalog's Drupal Toolkit, a next-generation discovery interface for libraries. The toolkit provides a sophisticated web interface for preparing Solr parameters for highlighting, facets, and "more like this" queries, for mapping metadata fields to Solr dynamic fields, and for templating search results. This approach uses Solr's dynamic fields and is not based on a fixed schema.

Speaker bio: Péter Király has been a library developer for more than 10 years and has used Lucene for 5 years. He was involved in projects with the Hungarian National Library, Hungarian Electronic Library, Hungarian National Archives, American Library Association and others. His previous company made Anacleto Digital Library based on Lucene, which was chosen by Project Gutenberg as the preferred search engine (over Google and Yahoo!). Since the beginning of 2008 he has worked for the eXtensible Catalog project as a Java and PHP developer. He has held several presentations at Hungarian library and archives related conferences, organizes the events of the Hungarian branch of the code4lib community, written articles and book chapters, and has translated technology standards (UAAG, EAD). He lives in Szentendre (which he feels is the nicest small town in Hungary) with his wife, and daughters. He is also an "old boy" water polo player.


Integration of Natural Language Processing tools with Solr


Joan Codina-Filbà

Slides
This presentation describes an implementation and discusses how the results of Natural Language Processing tools can be effectively used with Solr.

Solr ships with analyzers that allow some syntactic transformation of text, but it lacks tools such as lemmatizers, part-of-speech taggers, or named-entity recognizers, which NLP toolkits provide. Conversely, NLP tools are good at processing single documents but are not designed to manage large collections. Using NLP tools such as UIMA together with Solr can enhance the possibilities of both. We will present how semantic information can be introduced into Solr using payloads; how filters can then operate on these payloads (for example, selecting nouns, adjectives, or verbs); how different kinds of bi-words can be generated; and how Solr can be modified to cluster with Carrot2 using the output of the analyzers instead of the raw input to Solr. The talk will include the analyzers developed for Solr that enable these tasks.
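The core idea of payload-based filtering can be sketched without the Lucene API. In Lucene, a payload is a per-token byte array carried through the analysis chain; here a plain String plays that role, and a filter keeps only the part-of-speech categories of interest. The Token type and POS tags below are illustrative, not the speaker's actual analyzer code.

```java
import java.util.ArrayList;
import java.util.List;

public class PosFilterSketch {
    // "pos" stands in for what Lucene would carry as a per-token payload.
    record Token(String term, String pos) {}

    // Keep only tokens whose tag matches one of the wanted categories.
    static List<Token> keep(List<Token> in, String... wantedPos) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            for (String p : wantedPos) {
                if (t.pos().equals(p)) { out.add(t); break; }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = List.of(
            new Token("quick", "ADJ"), new Token("fox", "NOUN"),
            new Token("jumps", "VERB"), new Token("dog", "NOUN"));
        // Restrict a query or an index view to nouns only.
        System.out.println(keep(tokens, "NOUN"));
    }
}
```

In a real Solr analyzer chain, the POS tag produced by the NLP tool would be attached via a payload attribute at index time, and a query-time filter would inspect it the same way.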

Speaker bio: After completing a PhD on neural networks, Joan Codina-Filbà moved on to developing data mining algorithms for private software companies, and from there to information retrieval and text mining. In text mining, he has worked on integrating different search paradigms into a single search engine, adding the results of natural language processing to search engines, and dealing with very different but difficult texts such as patents and user-generated content. He has found in Lucene, and Solr in particular, a powerful platform that is easy to customize, integrate, and extend. At the same time, he teaches open source development and information retrieval systems at Pompeu Fabra University.


Modular Document Processing for Solr/Lucene


Max Charas


Karl Jansson

Slides
"Garbage in, garbage out" is a common expression, and it is especially true for search engines. People expect to find what they want and generally don't care about the underlying data quality.

Most large commercial search platforms today have some sort of data processing pipeline for:
  • Filtering out content
  • Encoding/decoding to the right format
  • Classification/language detection
  • Federating sources (SQL metadata + binary content = one combined document)
Findwise has developed extensive knowledge in this field and has used a number of techniques that will be covered in this talk, including Open Pipeline, where content is fetched through one of the built-in or custom-made pipelines. Connectors currently used in our projects include a web crawler, file crawler, SQL crawler, REST crawler, and RSS crawler, plus a number of special-purpose (less generic) ones. It is also possible to use Open Pipeline as "index as a service," with support for receiving documents through SOAP requests.

Tika is used to transform all supported formats into the internal Open Pipeline document model. Once a document has been transformed into an Open Pipeline document, a great deal of specialized processing can be applied depending on the information, source, format, and business rules: for example, automatic keyword extraction, regexp extraction of interesting data from the content, field mappings, field merges, addition of static fields, categorization, clustering, tagging, and filtering, either through code directly in the pipeline or by integrating with third-party systems.
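The stage-by-stage processing described above can be sketched as a chain of functions over a document, here modeled simply as a field map. This is a minimal illustration of the pipeline idea under assumed names; it is not Open Pipeline's actual API, and the stage and field names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

public class PipelineSketch {
    // Stage: add a constant field to every document (e.g., its source).
    static UnaryOperator<Map<String, String>> addStaticField(String k, String v) {
        return doc -> {
            Map<String, String> d = new HashMap<>(doc);
            d.put(k, v);
            return d;
        };
    }

    // Stage: rename a field, e.g., map extractor output onto the index schema.
    static UnaryOperator<Map<String, String>> mapField(String from, String to) {
        return doc -> {
            Map<String, String> d = new HashMap<>(doc);
            if (d.containsKey(from)) d.put(to, d.remove(from));
            return d;
        };
    }

    // Run the document through each stage in order.
    static Map<String, String> run(Map<String, String> doc,
                                   List<UnaryOperator<Map<String, String>>> stages) {
        for (UnaryOperator<Map<String, String>> s : stages) doc = s.apply(doc);
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> out = run(Map.of("dc:title", "Annual report"), List.of(
            mapField("dc:title", "title"),          // field mapping
            addStaticField("source", "filesystem")  // static field
        ));
        System.out.println(out);
    }
}
```

Real pipeline stages (categorization, clustering, filtering) have the same shape: document in, enriched or rejected document out, which is what makes them composable and reorderable.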

Speaker Bio: Mr. Max Charas is a consultant working for Findwise Sweden. He has broad experience working with FAST ESP, Autonomy IDOL and Apache Solr in large enterprise environments. He focuses primarily on the architectural and technical aspects of large scale installations such as performance, security and data processing. Mr Charas is also an active Open Source evangelist and a certified Lucid Imagination Solr instructor.

Speaker Bio: Mr. Karl Jansson is a search consultant working for Findwise Sweden. He has worked in the enterprise search industry for three years since graduating from Chalmers University of Technology with a master's degree in Engineering Physics. He has broad experience with different search engines, having worked with FAST ESP, Autonomy IDOL, IBM OmniFind, Google GSA, and Apache Solr in small, medium, and large enterprise environments. He usually takes the technical lead role and designs the overall architecture of enterprise search solutions, with deep technical knowledge in areas such as server architecture, performance, security, content retrieval, and data processing. Mr. Jansson also has a great interest in open source solutions, Solr and Lucene in particular.



Social and Network Discovery (SaND) over Lucene


Shai Erera

Slides
The amount of online information is overwhelming, and Internet technology has evolved accordingly. For example, search engine algorithms now often include credibility measures to rank their results. But traditional search technologies are still based on only two factors and the relationships between them: documents and terms. In this talk, Shai will present IBM's search tool, which scans documents, social networks, and interactive sites to provide insightful results based on how users react to data and interact with one another. Searches done with SaND find and rank the documents, people, and tags connected with the entered query. Users can further narrow the results by selecting a source or date, or view additional aspects of the results by selecting or hovering over various items on the page.

Speaker bio: Shai Erera is a researcher at the IBM Research Lab in Haifa, Israel. Shai earned his M.Sc. in Computer Science from the University of Haifa in 2007. His work experience includes the development of several search solutions (embedded and large-scale) over Lucene, and he is also a Lucene/Solr committer.


Query by Document: When "More Like This" Is Insufficient


Dusan Omercevic

Slides

In this session we will look closely at the challenges of querying by document in general, and at Zemanta's approach in particular. We will show how a combination of contextual analysis, natural language processing, and information retrieval techniques can be used to suggest a set of blog posts, news articles, and pictures in real time, while an author is writing an article inside a text editor. We will also discuss the advantages and shortcomings of Apache Lucene/Solr with regard to querying by document that we have experienced while developing our system. Finally, we will demonstrate the methods and tools we use to evaluate query-by-document performance.
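The simplest form of querying by document, which "More Like This" also builds on, is to extract the most prominent terms of the input document and turn them into an ordinary keyword query. The sketch below shows that baseline using raw term frequency; a real system would weight terms by tf-idf and, as the talk argues, go well beyond this. The stopword list is illustrative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class QueryByDocument {
    // Tiny illustrative stopword list; a production system would use a fuller one.
    static final Set<String> STOP = Set.of("the", "a", "of", "and", "is", "in");

    // Return the k most frequent non-stopword terms of the text.
    static List<String> topTerms(String text, int k) {
        Map<String, Long> freq = Arrays.stream(text.toLowerCase().split("\\W+"))
            .filter(w -> !w.isEmpty() && !STOP.contains(w))
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        return freq.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(k)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String doc = "Solr search is fast. Solr scales search across many servers.";
        // Join the extracted terms into a disjunctive keyword query.
        System.out.println(String.join(" OR ", topTerms(doc, 3)));
    }
}
```

The shortcomings the talk addresses start exactly here: frequent terms are not necessarily the ones that characterize the document, which is why Zemanta layers contextual analysis and NLP on top of this kind of retrieval.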

Speaker bio: Dušan Omercevic is a software engineer at Zemanta - your blogging assistant - where he is responsible for developing a real-time content recommendation service based on Apache Lucene/Solr. Before joining Zemanta he headed software development at Najdi.si - the largest Internet search engine in Slovenia - and led several large software development projects (e.g., a highway traffic management system and electronic toll collection). He is pursuing a PhD in computer vision at the University of Ljubljana, and has been a speaker/presenter at several conferences, including JavaBlend'09, CHI'08, ICCV'07, and a Dagstuhl seminar.





