General Sessions | Track 1: Day 1 | Track 1: Day 2 | Track 2: Day 1 | Track 2: Day 2

Track 1, Day 1:

Text and Metadata Extraction with Apache Tika

Jukka Zitting

Apache Tika is a toolkit for extracting text and metadata from digital documents. It's the perfect companion to search engines and any other applications where it's useful to know more than just the name and size of a file. Powered by parser libraries like Apache POI and PDFBox, Tika offers a simple and unified way to access content in dozens of document formats. This presentation introduces Apache Tika and shows how it's being used in projects like Apache Solr and Apache Jackrabbit. You will learn how to integrate Tika with your application and how to configure and extend Tika to best suit your needs. The presentation also summarizes the key characteristics of the more widely used file formats and metadata standards, and shows how Tika can help deal with that complexity. The audience is expected to have a basic understanding of Java programming and MIME media types.

Speaker bio: Jukka Zitting is an active open source developer focused on content management. He is the chairman of the Apache Jackrabbit project and an active participant in various other open source projects at Apache and elsewhere. He sees software development as modern day craftsmanship and believes in open source as the best way to practice the craft. Jukka works as a Senior Developer for the Swiss CMS vendor Day Software, and spends his free time exploring Switzerland or visiting his fiancée and their two cats back in Finland. Jukka has been a frequent speaker at recent ApacheCon conferences and has in recent years also presented at the Jazoon and J. Boye conferences and at various smaller meetups and get-togethers.

Lucene Forecast: Version, Unicode, Flex and Modules

Simon Willnauer, Uwe Schindler

Since Apache Lucene moved to Java 5 in November 2009, several new features and concepts have been introduced. From maintaining version-by-version backwards compatibility to fully enabled Unicode 4.0 support and the recently merged "Flexible-Indexing" branch, future versions are ushering in a new era of Open Source full-text search. During the spring of 2010, Lucene and Solr development merged, leading to even closer development and more flexible modularization. This talk starts with a brief history and an introduction to new features like Version and Unicode, while the main parts outline the vision of Lucene's next version, including "Flexible-Indexing", Automaton Query, and modularization.

Speaker bio: Simon Willnauer is a committer of Apache Lucene Java (Contrib), OpenRelevance and Connectors. During the last couple of years he has worked on the design and implementation of scalable software systems and search infrastructure. He studied Computer Science at the University of Applied Science Berlin. Currently, he works as a consultant for Apache Solr, Lucene Java and Hadoop, and is a co-organizer of the "BerlinBuzzwords" conference on scalability, held in June 2010 in Berlin.

Speaker bio: Uwe Schindler is a committer and PMC member of Apache Lucene and Solr. His main focus is on the development of Lucene Java. He implemented fast numerical search and maintains the new attribute-based text analysis API. He studied Physics at the University of Erlangen-Nuremberg. He works as a software architect and consultant for PANGAEA (Publishing Network for Geoscientific & Environmental Data) in Bremen, Germany. Uwe has given talks about Lucene at various international conferences and meetups.

Munching and Crunching: Lucene Index Post-processing

Andrzej Bialecki

The index at the heart of a Lucene/Solr implementation is critical to the performance of your search application and the quality of your results -- and not just at indexing time. Attention to the index rewards the search application developer after the index has been created. In this talk, Andrzej Bialecki, Lucene Committer and inventor of the Luke index utility, will discuss strategies for index post-processing, including: single-pass index splitting -- reshaping indexes for flexible deployment; index pruning, filtering and multi-tiered search, or how to serve indexes (mostly) from RAM; bit-wise search -- finding the best bit-wise matches, and applications in text fingerprinting; and MapReduce indexing -- map-side analysis and reduce-side index writing.

Speaker bio: Andrzej Bialecki, Apache Lucene PMC Member, also serves as the project lead for Nutch, and as a committer in the Lucene-java, Nutch and Hadoop projects. He has broad expertise across domains as diverse as information retrieval, systems architecture, embedded systems kernels, networking and business process/e-commerce modeling. He's also the author of the popular Luke index inspection utility. Andrzej holds a master's degree in Electronics from Warsaw Technical University, speaks four languages and programs in many, many more. Andrzej serves on the Lucid Imagination Technical Advisory Board.

Solr Schema Demystified

Uri Boness

The index schema stands as one of the core constructs that Solr is built on. As such, schema design is an important (and sometimes crucial) factor in the success of every Solr deployment. In this session, we will take a closer look at the schema, delve into its structure and review the elements contained within. We will discuss the different issues you have to deal with when designing a schema for your search domain. Some of the topics that will be covered:

    The role of the schema
    Field types
    Fields
    Dynamic fields
    Copy fields
    Configuration per functionality
    And more...

Speaker bio: Uri Boness is a software architect at JTeam with over 10 years of Java development experience. For the last 5 years he has been developing search solutions using open source search technologies based on the Lucene stack. He initially developed an in-house framework for faceted navigation, which at a later stage was abandoned in favor of Solr. Uri is very active in the Solr/Lucene community, developing patches and complementary software around these products (see Solr Explorer - http://search.jteam.nl/explorer). In the last several years, he has been giving trainings and speaking at numerous conferences promoting Solr/Lucene, including NL-JUG and ApacheCon Europe, and he is also the founder of the Dutch Lucene User Group.

Solr in the Wild: The Guardian’s Open Platform Content API

Graham Tackley

The Guardian's Open Platform initiative enables partners to build applications with The Guardian. As part of this initiative, The Guardian provides the Content API - a rich interface to all of The Guardian's content and metadata back to 1991 - over 1 million documents. This talk starts with a brief overview of the latest iteration of the Content API. It will then cover how we implemented this in Scala using Solr, addressing real-world problems in creating an index of content:

    how we represented a complex relational database model in Solr
    how we keep the index up to date, meeting a sub-5 minute end-to-end update requirement
    how we update the schema as the API evolves, with zero downtime
    how we scale in response to unpredictable demand, using cloud services

Speaker bio: Graham Tackley is the Web Platform Development Team Lead for guardian.co.uk. As well as ensuring the software running guardian.co.uk can scale to meet ever increasing demand, he and his team evolve the platform to enable reader engagement in new and innovative ways. Recently, most of his time has been occupied with the Open Platform Content API. Before joining the Guardian he worked as a consultant for ThoughtWorks and Avanade.

Make Your Domain Objects Searchable with Hibernate Search

Gustavo Fernandes

This talk introduces Hibernate Search, an extension of the popular open source object-relational mapping framework Hibernate that offers a flexible and easy way to add full text search capabilities to your Java application. Hibernate Search uses Lucene and Solr libraries internally, and does for indexes the same things that Hibernate does for databases: you can retrieve domain objects from queries, map objects to index documents using annotations and also synchronise the search index and database transparently. The talk will also show how to scale Hibernate Search beyond a single node and single JVM, highlighting the advantages and disadvantages of each modality.

Speaker bio: Gustavo Nalle Fernandes is a seasoned software engineer who has been working with JEE, Java and Open Source technologies for the last 10 years, in 3 different countries. He has previously spoken in Brazil at SouJava, the world's largest Java User Group, in the Netherlands at Apache Cocoon GetTogether and in Italy at Alfresco Meetup. Currently he is working for Sourcesense, a leading European Open Source services company, and is responsible for Solr and Lucene consultancy and training. He has made several contributions to Open Source projects, and lately has been contributing to Hibernate Search. He lives and works in London.

Key Topics When Migrating From FAST to Solr

Jan Høydahl

This talk covers key topics to be aware of if you are planning a migration from FAST to Solr. We will not cover the decision making process itself or judge which search engine is "better", but will give a high-level comparison and show how things you are used to doing with FAST can be solved with Solr. We will also give examples of when you should consider 3rd party additions to add functionality where Solr is weak. The presentation is functionally orientated, with technical deep-dives once in a while.

Speaker bio: Jan Høydahl has 10 years of professional hands-on experience within enterprise search. He runs the independent consulting company Cominvent AS, delivering consulting services and training within mission critical search for Apache Solr and FAST ESP. With a multitude of hands-on projects, Jan is in a unique position for this talk. Jan has spoken at several enterprise search events and workshops and has delivered training courses around the world on both Solr and FAST.
