Strata + Hadoop World
With over 70 sponsors (SAP being one of them), over 150 exhibitors and approximately 3,000 attendees, the conference is really impressive. It took place at the Javits Center in Manhattan, New York, between September 28 and October 2. The conference offered half-day tutorials on day 1 and massively parallel sessions on days 2 and 3; on those two days the mornings were filled with very short (10 to 15 minute) keynotes. Besides the official programme, the biggest benefit of the conference is the top-class attendees you can mingle with.
Why Open Source is so important for data platforms
During discussions, conference attendees provided insights into their data platforms. The general theme is that the Open Source products in use are essential parts of the overall solution for coping with the complexity and dynamics of the Big Data market. A consequence of the speed at which individual tools must adapt to business requirements is that these tools are not per se enterprise-ready. Even companies with long experience in running Big Data platforms ask questions in sessions in order to optimize the cost of operating their tools.
This is especially true for data platforms with extreme quality requirements like SAP XM:
- elastic and scalable infrastructure,
- global expansion,
- 100% availability,
- single-digit ms latency,
- fault tolerance,
- data local analytics,
- real-time insights.
Developing all needed components within one company has become unrealistic. Instead, data platforms evolve into pluggable, micro-service-oriented architectures with specialized tools interacting only loosely with each other via REST-based APIs. In such an environment it becomes comparatively easy to substitute a component with a newer, more powerful tool.
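The loose coupling described above can be sketched in a few lines of Python. This is a toy illustration (all names and the statistics backends are invented for the example): two interchangeable components sit behind the same REST contract, so one can be swapped for the other without touching any client code.

```python
# Toy sketch of a pluggable, loosely coupled architecture: two
# interchangeable backends exposed behind one REST endpoint.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def simple_mean(values):
    """First-generation component: a plain average."""
    return sum(values) / len(values)


def trimmed_mean(values):
    """Drop-in replacement: same contract, more robust statistic."""
    trimmed = sorted(values)[1:-1] or values
    return sum(trimmed) / len(trimmed)


def make_handler(backend):
    """Bind whichever backend is currently plugged in to the REST route."""
    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
            payload = json.dumps({"mean": backend(body["values"])}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, *args):  # keep the demo quiet
            pass
    return Handler


def query(port, values):
    """Client code: depends only on the REST contract, not the backend."""
    req = urllib.request.Request(
        f"http://127.0.0.1:{port}/mean",
        data=json.dumps({"values": values}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["mean"]


if __name__ == "__main__":
    for backend in (simple_mean, trimmed_mean):  # swap components freely
        server = HTTPServer(("127.0.0.1", 0), make_handler(backend))
        threading.Thread(target=server.serve_forever, daemon=True).start()
        print(backend.__name__, query(server.server_port, [1, 2, 3, 100]))
        server.shutdown()
```

The client never changes when the backend is substituted, which is exactly the property that makes individual tools replaceable in such a platform.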
The stability of individual tools is today measured less by the maturity of their producers and more by the size of the community they attract. The acceptance of a tool by the "crowd" can best be seen at a conference like Hadoop World: even though it is a Hadoop conference, the concentration of the audience around topics such as leveraging Spark on top of the Cassandra database shows a clear trend.
The power of LinkedIn to shape the trend of the industry
Among the many topics discussed and seen at the conference, these were by far the five hottest:
- Apache Kafka – a high-throughput distributed messaging system – was by far the hottest topic of the conference. Kafka has seen a tremendous development cycle: starting at LinkedIn, it scaled from zero to a trillion processed messages per day. Companies like Cisco, Netflix, Yahoo, Airbnb, eBay or Goldman Sachs (just to name a few) are using Kafka to massively accelerate and simplify their data platforms.
- Apache Spark – a lightning-fast cluster computing engine for large-scale data – is in use, or planned for use, in virtually all data platforms. Its ability to scale proportionally with the underlying data and its usability from Java, Scala, R and especially Python make it a fixture in the Big Data zoo.
- Python – its fast and easy access to both Kafka streams and Spark analytics, together with its vast collection of Data Science libraries, makes it an essential programming language. Applications on top of Python like the Jupyter Notebook open up new possibilities for data scientists to create and share code, equations or visualizations within their community.
- Apache Avro – a data serialization system – is a very technical but essential topic. It addresses the ever-growing need for network communication and the necessity of making the communication language robust against future schema changes.
- Apache Cassandra – a scalable and highly available database. Interestingly enough, despite the venue being a Hadoop conference, speakers and audience placed Cassandra as their tool of choice. The trend goes from separate batch and real-time components towards a single component capable of serving both use cases: Cassandra.
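The schema-evolution idea behind Avro can be illustrated with a toy, stdlib-only sketch. This is not the Avro library itself (Avro uses a compact binary encoding and full JSON schemas); the schemas and field names below are invented for the example. The point it demonstrates is Avro's core resolution rule: a newer reader schema that declares defaults can still decode records written under an older schema.

```python
# Toy illustration of Avro-style schema resolution: fields missing from
# an old record are filled in from the reader schema's defaults.
import json

# Older writer schema: records carry only these two fields.
WRITER_SCHEMA = {"fields": [{"name": "user_id"}, {"name": "event"}]}

# Newer reader schema adds a field; the default keeps old data readable.
READER_SCHEMA = {
    "fields": [
        {"name": "user_id"},
        {"name": "event"},
        {"name": "region", "default": "unknown"},
    ]
}


def resolve(record, reader_schema):
    """Decode a record against the reader schema, applying defaults."""
    out = {}
    for field in reader_schema["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value or default for {field['name']}")
    return out


old_record = json.loads('{"user_id": 42, "event": "click"}')
print(resolve(old_record, READER_SCHEMA))
# {'user_id': 42, 'event': 'click', 'region': 'unknown'}
```

This robustness against future schema changes is what makes a serialization format like Avro viable as the shared communication language between loosely coupled components.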
5 Hottest Sessions
- Architecting a data platform – a tutorial provided by Silicon Valley Data Science
- Data 101 – a tutorial provided by O’Reilly, DataStax, Silicon Valley Data Science, Impact Lab, LinkedIn and Galvanize
- Process, store, and analyze like a boss with Team Apache: Kafka, Spark, and Cassandra – a tutorial provided by DataStax
- Data Liberation – a talk held by Martin Kleppmann (Independent). The slides shown can be downloaded here.
- Big data at Netflix: Faster and easier – a talk held by Kurt Brown (Netflix). The slides shown can be downloaded here.
5 Hottest Keynotes
- What 50 million users in 7 days can teach us about big data – Joseph Sirosh (CVP Machine Learning, Microsoft) – a very illustrative talk on everything that can go wrong when a service is launched globally, and on the necessity of trial and error in Big Data projects.
- Context Computing – Jeff Jonas (IBM Fellow and Chief Scientist)
- Unleashing the power of big data today (a Splunk story) – Jim McHugh (VP Marketing Cisco)
- Declarative machine learning – Shivakumar Vaithyanathan (IBM)
- Haunted by Data – Maciej Ceglowski (Idle Words) – some deep thoughts on technological boundaries vs. ethical boundaries.
Trial is not an option
Visiting the conference highlighted two major aspects:
- SAP XM is facing an incredibly interesting opportunity in the market and has tremendous advantages over early adopters of data platforms: the community has developed extremely powerful and stable tools, and starting from scratch without the burden of any legacy allows a free choice of best-of-breed solutions.
- The number of available tools covering every aspect of the platform is immense, making the combinatorics of all possible setups seem infinite. Our goal now is to find a small set of setups best suited to our use case, test them and choose the best.
Trial is not an option – it’s a design pattern for data platforms as there is always another bottleneck.