in4diary : Strata Conference (Day 1)
The Strata + Hadoop World conference is the first technology conference I have attended in London since my return to this beautiful city. Time flies so fast that it feels like not so long ago that I attended Cisco Live in San Francisco. Overall, the conference was a mix of technical and business talks, which makes sense because Strata + Hadoop World is aimed at both engineers and managers. It ran for three days at the Hilton Hotel near Edgware Road Station, London. I had the privilege of attending two of the three days. The first day was mainly focussed on training programmes, so we skipped it.
In this blog post, I will outline some of the takeaways I personally took from the conference. The write-up comes in two parts: this article covers day 1 of the talks, and the next one covers day 2.
From an engineer’s point of view, most of the keynotes did not carry a lot of weight in terms of technical detail. Still, I enjoyed some of the valuable messages delivered by the heads of large corporations and found them useful for learning how to manage innovation and for seeing the business perspective of things. Among day one’s keynotes, I loved the emphasis on the fact that today’s computer systems have transformed from methodical computer systems into complex human-machine systems. This is the same argument I tried to make in my last blog post about Complex Systems. The keynotes also talked about the new wave of startups and creative innovators that are revolutionizing both the Computer Science and Big Data landscapes. One of the most innovative companies I came across is Brytlyt, a company that uses GPUs to run relational database engines. I was also fascinated by Julie Mayer’s perspective on startup culture.
Her talk emphasised why new startups should think about how to coexist with giants like Google and Facebook rather than how to take them down.
Once the keynotes were finished, the sessions and tutorials started. Several sessions stimulated me a lot. The one I was most excited about was the session on multi-model databases. In today’s IT world, data is king, and data comes in all sizes, shapes and formats. For this reason, the full plate of data in any organization tends to be stored in multiple silos in different formats. Some data performs best in document format, when it needs to be searchable and relatable. Some data has connections that represent a network or a graph. Some information needs fast access. ArangoDB is one of a new generation of NoSQL databases that allow document, key-value and graph data to be stored in one unified datastore.
ArangoDB also allows you to seamlessly mix and match query components, so that all of this data can be joined within one query language. This also supports polyglot persistence, where each data model is kept in the storage that performs best for it. Putting the product itself aside, I believe the concept of housing multiple data models under one database engine is very important for data management and information enrichment. If scalability can be built into this solution, the result would be a very strong data tool.
Another talk that was very popular among the attendees was Martin Kleppmann’s (LinkedIn) talk on data agility. This talk was mainly about how LinkedIn uses Apache Kafka, a distributed commit log service, to keep its data agile. The three main points discussed were:
- Data should be accessible: All data should be accessible and available. LinkedIn facilitates this by exposing all data (including databases) as streams.
- Data should be composable: Data should consist of loosely coupled, atomic pieces that are informative on their own, without depending on other pieces.
- Data should have the ability to rebuild state: By keeping logs, we can preserve the source of truth. Historic logs can be used to replay and rebuild stateful data.
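The third point can be sketched in a few lines of Python. This is only an illustration of the idea, not Kafka’s API or LinkedIn’s implementation: the in-memory list stands in for a durable log, and the event shape (`key`, `value`) and the `rebuild_state` helper are hypothetical names chosen for this example.

```python
from typing import Dict, List, Optional, Tuple

# An event is (key, value); a value of None marks a deletion.
Event = Tuple[str, Optional[str]]


def rebuild_state(log: List[Event], upto: Optional[int] = None) -> Dict[str, str]:
    """Replay the first `upto` events of an append-only log into a key-value view."""
    state: Dict[str, str] = {}
    for key, value in log[:upto]:
        if value is None:
            state.pop(key, None)  # deletion event
        else:
            state[key] = value    # insert or update
    return state


# Hypothetical append-only log of profile changes (illustrative data only).
log: List[Event] = [
    ("alice", "engineer"),
    ("bob", "analyst"),
    ("alice", "data scientist"),
    ("bob", None),
]

current = rebuild_state(log)          # state after replaying the whole log
earlier = rebuild_state(log, upto=2)  # state as it was after the second event
```

Because the log is never modified in place, the same replay can produce the current view or any earlier one, which is what makes the log usable as the source of truth.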
The talk about monitoring and productionizing machine learning models was another of the most interesting talks from day 1 of the conference. It had some key take-home lessons for us data scientists:
- When choosing offline metrics to measure the performance of a model, the best metrics are the ones that correspond closely to the business metrics of the organization.
- For example, if you are evaluating an article recommender, even though the model predicts a score for each article, it is better to use a ranking metric than a regression-type metric. This is because the end business goal of the engine is to put relevant articles at the top.
- When doing A/B testing, always do the math to work out how long an experiment should run before you can draw a conclusion.
- Also in A/B testing, check whether the assumptions of your statistical test are met, and change the test to match reality if they are not.
- When you have rare classes in a multi-label classification problem, make sure you pick an appropriate averaging scheme (weighted, micro, macro) for the metric you use to evaluate the model.
- Beware of the shock of newness: Everyone hates change, so it is ideal to leave a burn-in period before you start measuring the reaction to a change in the system.
- Models go out of date: One should be aware that trends change and models go out of date.
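To make the article recommender point concrete, here is a minimal sketch comparing a regression-type metric (mean squared error) with a ranking metric (NDCG). The relevance scores and both sets of predictions are made-up numbers for illustration, and NDCG is just one possible ranking metric; the talk did not prescribe a specific one.

```python
import math
from typing import List


def mse(y_true: List[float], y_pred: List[float]) -> float:
    """Mean squared error, a regression-type metric."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def ndcg_at_k(y_true: List[float], y_pred: List[float], k: int) -> float:
    """Normalised discounted cumulative gain over the top-k predicted items."""
    def dcg(relevances: List[float]) -> float:
        return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

    # Order the true relevances by predicted score, highest first.
    ranked = [t for _, t in sorted(zip(y_pred, y_true), key=lambda pair: -pair[0])]
    ideal_dcg = dcg(sorted(y_true, reverse=True))
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical true relevance of four articles and two models' predicted scores.
relevance = [5.0, 4.0, 1.0, 0.0]
model_a = [3.9, 4.1, 1.0, 0.0]  # close to the true scores, but swaps the top two
model_b = [9.0, 7.0, 3.0, 1.0]  # badly calibrated scores, but a perfect ordering

for name, pred in [("A", model_a), ("B", model_b)]:
    print(name, "MSE:", round(mse(relevance, pred), 3),
          "NDCG@3:", round(ndcg_at_k(relevance, pred, 3), 3))
```

Model A wins on mean squared error while model B wins on NDCG, so the two metrics disagree about which model is better. If the business goal is to show the most relevant article first, the ranking metric is the one that reflects it.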
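The "do the math" advice for A/B tests can also be sketched. The snippet below uses the standard normal-approximation sample size formula for comparing two conversion rates; this is one common choice and not necessarily what the speaker used. The conversion rates and the traffic figure are assumptions picked for the example, and `sample_size_per_arm` and `days_to_run` are hypothetical helper names.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_control: float, p_variant: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed in each arm to detect p_control vs p_variant (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_variant) ** 2)


def days_to_run(n_per_arm: int, daily_visitors: int, arms: int = 2) -> int:
    """Days needed if daily traffic is split evenly across all arms."""
    return math.ceil(arms * n_per_arm / daily_visitors)


# Assumed numbers: 10% baseline click-through, hoping to detect a lift to 12%,
# with 1000 eligible visitors per day.
n = sample_size_per_arm(0.10, 0.12)
days = days_to_run(n, daily_visitors=1000)
print(f"{n} users per arm, about {days} days")
```

Smaller expected lifts need many more users, so running the numbers before launch helps avoid stopping an experiment too early.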
These were the most interesting sessions from my first day at the Strata conference in London. The second day was also pretty exciting, with more scalable machine learning and a lot of Apache Spark, which I will cover in the next blog post. I hope you enjoyed reading this post. Do not hesitate to start a conversation around the post to refine and improve it. Follow me on WordPress to stay in the loop for my latest blog posts.
SKIMLINKS IS HIRING A NEW DATA SCIENTIST! If you know machine learning and are dying to work on large-scale distributed data with Apache Spark, you might be working with me and my exciting team in the days to come. Get in touch with me by sending your CV to email@example.com