Track 2, Day 2:
Building Multilingual Search-Based Applications
Steve Kearns
Slides
Lucene and Solr are power tools for building search-based applications – and they are commonly used to search text created by people in their native languages. If the tools do not recognize the languages they are working with, and understand the nuances of those languages, the result will be poor matching and decreased recall and precision. Fortunately, there are components that simulate human analysis of natural language, allowing you to build systems with improved search accuracy by segmenting Asian text, breaking down compound words, and going beyond simple stemming approaches.
This talk will review some of the major problems that can result from unprocessed text, and show how tools can normalize the text to improve results. We will demonstrate these issues within a Lucene/Solr implementation, using examples from several of the world's major languages, including Chinese, Japanese, German, English and Arabic.
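Decompounding is one of the normalization steps mentioned above: a German compound such as "Fussballweltmeisterschaft" will only match a query for "Welt" or "Meisterschaft" if it is split into its parts at index time. As a rough illustration (not the talk's actual implementation), here is a minimal dictionary-based, longest-match decompounder; the tiny word list is invented for the example, and Lucene's DictionaryCompoundWordTokenFilter follows a broadly similar idea:

```python
# Illustrative dictionary-based decompounding sketch. The word list and
# the greedy longest-match strategy are assumptions for demonstration only.

WORDLIST = {"fussball", "fuss", "ball", "welt", "meisterschaft"}

def decompound(word, wordlist=WORDLIST):
    """Split a compound into known dictionary parts, longest match first."""
    word = word.lower()
    parts = []
    i = 0
    while i < len(word):
        # try the longest dictionary word starting at position i
        for j in range(len(word), i, -1):
            if word[i:j] in wordlist:
                parts.append(word[i:j])
                i = j
                break
        else:
            return [word]  # no full decomposition found: keep the original
    return parts

print(decompound("Fussballweltmeisterschaft"))
# → ['fussball', 'welt', 'meisterschaft']
```

Indexing these parts alongside the original token is what lets a query for one component recall the whole compound.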
Speaker bio: Steve is the product manager for the Rosette Platform and the subject matter expert for the international compliance market within Basis Technology. Prior to Basis Technology, Steve worked at BBN Technologies on the Broadcast and Web Monitoring Systems, which capture and extract open-source intelligence from live television and internet news websites. He has experience in information visualization and distributed systems architecture, and received his MS in Information Technology and BS in Computer Information Systems from Bentley University.
Unlocking a Broadcaster Video Archive Using Solr
Karel Braeckman
Slides
The Flemish TV broadcaster VRT has an extensive 80-year-old audiovisual archive documenting the most memorable events in Belgian history. Archive material is reused on a daily basis in TV productions; however, a lot of material is never reused, simply because it cannot be found in the archive’s search application. This application is just a thin layer on top of a relational database, in which the user must enter the exact search term in the correct search field (out of 67 available) to avoid the dreaded “zero results”.
In our quest to open up this valuable archive, we combine the existing information in the search application with additional information sources (subtitles, rundowns, …) and index it all with Apache Solr. Using Solr’s built-in features such as faceting, synonym support and copy fields, our custom-built demo application drastically improves searching and navigating the archive.
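For readers unfamiliar with the parameters involved, a faceted request to Solr's /select handler can be sketched as below. The q, facet, facet.field and fq parameters are standard Solr; the field names (genre, broadcast_year) are hypothetical stand-ins for the archive's real fields:

```python
# Sketch: building the query string for a faceted Solr /select request.
# Field names are invented; the parameter names are standard Solr.
from urllib.parse import urlencode

def faceted_query(text, facet_fields, filters=None):
    """Build a Solr query string with faceting and optional filter queries."""
    params = [("q", text), ("facet", "true")]
    params += [("facet.field", f) for f in facet_fields]
    params += [("fq", f) for f in (filters or [])]
    return urlencode(params)

qs = faceted_query("coronation", ["genre", "broadcast_year"],
                   filters=["genre:news"])
print(qs)
# → q=coronation&facet=true&facet.field=genre&facet.field=broadcast_year&fq=genre%3Anews
```

Each facet.field in the response comes back with per-value counts, which is what drives the navigation described above.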
But we want more. Finding the video we are looking for is nice, but finding the right fragment within the video is even better. We compare several Solr-based strategies for searching within a media file, and use them to create a video search application that takes the user straight to the correct playout position within the video.
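One such strategy can be sketched as follows: index every subtitle fragment as its own document carrying the video id and a start offset in seconds, so that a hit immediately yields a playout position. The data and the naive term-overlap scoring below are invented for illustration, standing in for Solr's real relevance scoring:

```python
# Sketch: fragment-level indexing for "search within a video".
# Each subtitle fragment is its own document with a start offset in seconds.

fragments = [
    {"video": "news-1969-07-21", "start": 0,   "text": "studio introduction"},
    {"video": "news-1969-07-21", "start": 95,  "text": "apollo moon landing footage"},
    {"video": "news-1969-07-21", "start": 240, "text": "weather forecast"},
]

def search_fragment(query, docs):
    """Return (video, start) of the fragment sharing most terms with the query."""
    terms = set(query.lower().split())
    best = max(docs, key=lambda d: len(terms & set(d["text"].split())))
    return best["video"], best["start"]

print(search_fragment("moon landing", fragments))
# → ('news-1969-07-21', 95)
```

The returned offset is exactly what a player needs to seek to the matching moment rather than the start of the file.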
Speaker bio: VRT-medialab is the technological research department of the VRT, Flanders’ public service broadcaster. Its focus is on the Flemish media market, but VRT-medialab also pursues collaboration at a European and worldwide level. Its research topics fall into four areas: media production environment, distribution, information front-end and information management. The author is part of this last research group, which investigates how to handle information in order to optimize industrial media production processes. Important areas are the automatic creation of metadata (feature extraction, region of interest, etc.), the reuse of existing metadata and the efficiency of search mechanisms.
Karel Braeckman has been working with VRT-medialab since 2008. Previous public speaking experience includes the EuroIMSA 2009 conference and several broadcaster events.
The Path to Discovery: Facets and the Scent of Information
Tyler Tate
&
H. Stefan Olafsson
Slides
Grizzly bears can smell food from over 25 kilometres away, and will doggedly track the scent for hours through rough terrain to claim the prize. People are no different. But rather than scavenging for food, we search for information.
Like the bear, humans need constant feedback and an ever-strengthening "information scent" in order to know we're headed in the right direction. Fortunately, Solr provides us with an ideal starting point: facets.
There are myriad ways in which facets can be utilised to provide helpful cues to the user. In this talk, we'll cover everything from tag clouds and faceted navigation, to histograms and geospatial visualisations, to network trees and augmented reality. We'll discuss applications for information-driven websites, private web applications, e-commerce, and even the intelligence community. We aim to inspire, push the limits, and maybe even spark your next big idea.
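As a small taste of the facet-driven visualisations above, a tag cloud can be built directly from facet counts by mapping each value's count to a font size. The counts and pixel range here are made up for the sketch:

```python
# Sketch: turning facet counts into tag-cloud font sizes by linear scaling.
# The facet counts and pixel bounds are invented for illustration.

def tag_cloud(facet_counts, min_px=12, max_px=32):
    """Map each facet value's count to a font size between min_px and max_px."""
    lo, hi = min(facet_counts.values()), max(facet_counts.values())
    span = hi - lo or 1  # avoid division by zero when all counts are equal
    return {term: round(min_px + (count - lo) * (max_px - min_px) / span)
            for term, count in facet_counts.items()}

sizes = tag_cloud({"sport": 120, "news": 40, "culture": 80})
print(sizes)
# → {'sport': 32, 'news': 12, 'culture': 22}
```

The same count-to-weight mapping generalises to histogram bar heights or marker sizes on a map.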
Speaker bios: H. Stefan Olafsson and Tyler Tate are the founders of TwigKit, a company focused on building truly usable interfaces for search. Stefan has an M.Sc. (Distinction) in Information Systems from the London School of Economics and has spent the last seven years working with search – first at FAST, then as an independent search consultant at Microsoft, and now at TwigKit. Over the last decade, Tyler has led the user experience design of enterprise applications from CMS to CRM, created the popular 1KB CSS Grid, and is now focused on the user experience of search. Stefan and Tyler organise a monthly Enterprise Search meetup in London and are frequent bloggers.
Rapid Prototyping
Erik Hatcher
Slides
Got data? Let's make it searchable! This interactive presentation will demonstrate getting documents into Solr quickly, provide some tips on adjusting Solr's schema to better match your needs, and discuss how to showcase your data in a flexible search user interface. We'll see how to rapidly leverage faceting, highlighting, spell checking, and debugging. Even after all that, there will be enough time left to outline the next steps in developing your search application and taking it to production.
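To give a flavour of the "getting documents into Solr quickly" step, the sketch below builds the JSON body accepted by Solr's /update handler. No server is contacted here and the example records are invented; in a live session this body would be POSTed to the stock http://localhost:8983/solr endpoint for the target core:

```python
# Sketch: serializing raw records into the JSON array format that Solr's
# /update handler accepts. Records and field names are invented examples.
import json

records = [
    {"id": "1", "title": "Lucene in Action", "cat": "book"},
    {"id": "2", "title": "Solr quick start", "cat": "article"},
]

def to_update_body(docs):
    """Serialize documents as a JSON array, ready to POST to /update."""
    return json.dumps(docs)

body = to_update_body(records)
print(body)
```

Once the documents are committed, faceting on a field like cat and highlighting on title work with nothing more than query parameters.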
Speaker bio: Erik Hatcher dedicates his time to open source software. He is a committer on the Apache Lucene and Solr projects, co-authored "Lucene in Action", and has spoken at numerous conferences around the world, including ApacheCon, Devoxx, JavaZone, and many others.
European Language Analysis with Hunspell
Chris Male
Slides
Currently in Solr and Lucene, the Snowball algorithmic analysers are often used to stem European languages. However, in many cases these algorithms are unable to handle the complex inflection rules found in some European languages, such as Hungarian. Hunspell addresses this limitation by extending Ispell/MySpell dictionaries with grammatical rules describing how to remove inflections and decompound words. Originally intended for Hungarian, the Hunspell algorithm is now used by a number of applications, including OpenOffice. This has led to the creation of Hunspell dictionaries for most European languages, including some not supported by Snowball, such as Basque and Frisian.
This presentation will introduce the Hunspell algorithm and show how it can be used in Solr and Lucene through a specifically developed Java port of Hunspell.
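To convey the basic dictionary-plus-affix-rules idea behind Hunspell (without reproducing the Java port itself), here is a toy sketch. Real .aff/.dic files encode far richer rule sets with affix classes and conditions; the word list and suffixes below are invented:

```python
# Toy illustration of dictionary-plus-suffix-rules stemming in the spirit
# of Hunspell. The dictionary and suffix list are invented for this sketch.

DICTIONARY = {"walk", "jump", "ház"}        # base forms (ház = Hungarian "house")
SUFFIX_RULES = ["ed", "ing", "s", "ak"]     # strippable inflections

def stem(word, dictionary=DICTIONARY, suffixes=SUFFIX_RULES):
    """Return the dictionary base form if stripping one suffix reaches it."""
    if word in dictionary:
        return word
    for suf in suffixes:
        if word.endswith(suf) and word[:-len(suf)] in dictionary:
            return word[:-len(suf)]
    return word  # unknown word: leave unchanged

print(stem("walked"), stem("házak"))
# → walk ház
```

Because the rules are validated against a dictionary rather than applied blindly, this approach avoids the over-stemming that purely algorithmic stemmers can produce.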
Speaker bio: Chris Male is a 25-year-old New Zealander working as a software developer for JTeam in Amsterdam. He has been using Solr and Lucene for a number of years and has been heavily involved in the development of Lucene's spatial search support, which led to him becoming a Lucene contrib committer. He is also one of the authors of the Hunspell Java port used in this presentation.