Fast Data vs. Big Data
I presented at Big Data Analytics Europe on Thursday 8th March. Here's my 15-minute keynote, summarised.
As we're in Amsterdam, I thought I'd talk about one of the city's most famous sons, Baruch Spinoza. Here is his statue, located in front of Amsterdam City Hall.
Spinoza was a 17th-century philosopher, born in 1632 during Amsterdam's Golden Age.
This was a period in which Amsterdam was the leading financial centre in the world. It had the world's first stock exchange and was probably the wealthiest city in the world. This period was also the Age of Reason, the start of the Age of Enlightenment and the first part of the Scientific Revolution.
So who was Spinoza, and what’s he got to do with big data? Arguably he was one of the first data scientists. Spinoza believed in mathematical reasoning, or deductive logic, and was one of the first people in the 17th century to study science as we know it today. It is fair to say his philosophy changed our view of the nature and scope of explanation.
Whilst deductive logic seems obvious today, that wasn’t always the case. As an example, before this period the Earth was described as being at the centre of the universe. The explanation? Because it is divine and perfect, and God put it there. There was little to no attempt at deductive explanation based on measurements or data. Spinoza, building on Copernicus and Galileo and alongside many other Enlightenment thinkers, started to challenge this.
In short, this group looked to explain the nature and activity of bodies by concentrating on their measurable aspects. Spinoza helped form a new science, the science of mathematical physics, or what we would know today as data science.
Spinoza was most famous for his views on Ethics, but it's his views on the mind-body problem I want to quickly cover here. This is probably the hardest philosophical problem of all. It goes to the heart of two questions: what is consciousness, and what is the mind?
Descartes famously said “Cogito ergo sum” ("I think, therefore I am"), proving we exist with this statement. In other words, because we can think, we must exist. We are not in a dream, or some form of AI simulation. But Descartes couldn’t explain how the mind and body interact. He believed in mind-body dualism: the mind and body are completely distinct, and the mind can exist independently of the body.
Spinoza opposed Descartes' philosophy of mind-body dualism. He held that the two are the same. This monism is a fundamental quality of his philosophy. I couldn’t find a good picture of monism, so I chose a picture of Yoda, as he thought mind and body were the same, so it’s good enough.
But what has this dualism vs. monism got to do with data?
Well… there’s currently a dualism in thinking about data:
On the left, data systems mostly focus on the passive storage of data. Phrases like “data warehouse” or “data lake” or “data store” all evoke places data goes to sit. Movement of data tends to work in batches. By its very nature, it’s slow.
Just as Descartes said the mind and body are separate, in this model, big data and speed are separate things - hard to reconcile.
At Confluent, we think streaming data is bigger than big data. We are turning the database on its side, or some say, inside out. If Spinoza were alive today, he’d argue for data monism: big data and real-time insights, or fast data, can be one and the same thing.
We believe streaming platforms are challenging old assumptions.
With big data, it was "the more the better". With streaming data, it’s about speed: more recent data is more valuable. It’s not just about how much you can analyze at once; it’s about data in flow, in real time. The faster you can respond to your data, the more valuable your response, especially in the age of real-time customer experience.
And this isn't just a model. We have real examples of streaming data fuelling the economy of the future.
Streaming data architectures have become a central element of Silicon Valley’s technology companies. Many of the largest of these have built themselves around real-time streams as a kind of central nervous system that connects applications with everything happening in the business. You can find examples of how these architectures are used at companies like Uber, eBay, Netflix, Yelp, or virtually any other modern technology company.
So, how does data streaming work - and why is it different to a traditional relational database?
First of all, we use Apache Kafka, an open-source stream-processing platform developed by the founders of Confluent and maintained as an Apache Software Foundation project. Whilst Kafka is often categorized as a messaging system (it serves a similar role), it provides a fundamentally different abstraction.
The key abstraction in Kafka is a structured commit log of updates. A producer of data sends a stream of records, which are appended to this log, and any number of consumers can continually stream these updates off the tail of the log with millisecond latency.
Importantly, Kafka is built as a modern distributed system: it is fault-tolerant, high-throughput and horizontally scalable, and it allows geographically distributed data streams and stream processing applications.
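To make the commit-log idea concrete, here is a minimal in-memory sketch in Python. It is a toy for illustration only: the `CommitLog` and `Consumer` classes and their method names are made up for this post and are not Kafka's API. It shows the two properties described above: producers only ever append, and each consumer keeps its own offset, so any number of readers can follow the same log independently.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class CommitLog:
    """Toy append-only log (hypothetical; not Kafka's API)."""
    records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Append a record to the tail of the log and return its offset."""
        self.records.append(record)
        return len(self.records) - 1

    def read_from(self, offset: int) -> List[Any]:
        """Return every record at or after the given offset."""
        return self.records[offset:]


@dataclass
class Consumer:
    """Each consumer tracks its own position, so readers never interfere."""
    log: CommitLog
    offset: int = 0

    def poll(self) -> List[Any]:
        """Fetch records not yet seen by this consumer and advance its offset."""
        new_records = self.log.read_from(self.offset)
        self.offset += len(new_records)
        return new_records


log = CommitLog()
billing = Consumer(log)
analytics = Consumer(log)

log.append({"order": 1})
log.append({"order": 2})
first_billing = billing.poll()    # billing reads the first two records
log.append({"order": 3})
second_billing = billing.poll()   # billing now sees only the new record
all_analytics = analytics.poll()  # analytics starts late but still sees all three
```

Because nothing is removed when a record is read, a consumer that joins late (like `analytics` here) can replay the log from the beginning.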
Data is replicated and partitioned over a cluster of machines that can grow and shrink transparently to the applications using the cluster. Consumers of data can be scaled out over a pool of machines as well and automatically adapt to failures in the consuming processes.
A key aspect of the Kafka architecture is that it handles persistence well. A Kafka broker can store many TBs of data, allowing usage patterns that would be impossible in a traditional database. Kafka’s storage layer is essentially a "massively scalable pub/sub message queue architected as a distributed transaction log".
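The partitioning, replication and consumer scale-out described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how Kafka implements it: the function names, the CRC32 hash and the round-robin placement are stand-ins chosen for this post, and Kafka's own partitioner and assignment strategies differ in detail.

```python
import zlib
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash, so the same key always lands in the same partition.
    CRC32 is an assumption made for this sketch."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def replica_brokers(partition: int, brokers: List[str], replication_factor: int) -> List[str]:
    """Place each partition's copies on consecutive brokers, round-robin."""
    return [brokers[(partition + i) % len(brokers)] for i in range(replication_factor)]


def assign_partitions(num_partitions: int, consumers: List[str]) -> Dict[str, List[int]]:
    """Spread partitions round-robin across whichever consumers are alive."""
    assignment: Dict[str, List[int]] = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


brokers = ["broker-a", "broker-b", "broker-c"]   # hypothetical broker names
replicas = replica_brokers(1, brokers, replication_factor=2)

before = assign_partitions(6, ["c1", "c2", "c3"])
after = assign_partitions(6, ["c1", "c3"])       # c2 has failed; its partitions are reassigned
```

When `c2` drops out, re-running the assignment hands its partitions to the surviving consumers, which is the "automatically adapt to failures" behaviour in miniature.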
A streaming platform doesn’t have to replace your data warehouse (yet). In fact, quite the opposite - it feeds it data. It acts as a conduit for data to quickly flow into the warehouse environment for long-term retention, ad hoc analysis, and batch processing. That same pipeline can run in reverse to publish out derived results from nightly or hourly batch processing.
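As a rough illustration of the stream-to-warehouse direction, the sketch below buffers records from a stream and hands them to a warehouse loader in small batches. Everything here is hypothetical: `load_in_batches`, the `write_batch` callback and the list standing in for the warehouse are assumptions made for this post, not part of Kafka or the Confluent Platform.

```python
from typing import Any, Callable, Iterable, List


def load_in_batches(
    stream: Iterable[Any],
    write_batch: Callable[[List[Any]], None],
    batch_size: int,
) -> int:
    """Buffer records from a stream and pass them to the loader in batches.
    Returns the total number of records written."""
    buffer: List[Any] = []
    written = 0
    for record in stream:
        buffer.append(record)
        if len(buffer) >= batch_size:
            write_batch(buffer)
            written += len(buffer)
            buffer = []          # start a fresh batch
    if buffer:                   # flush the final partial batch
        write_batch(buffer)
        written += len(buffer)
    return written


# A plain list stands in for the warehouse; each element is one bulk load.
warehouse_loads: List[List[int]] = []
total = load_in_batches(range(7), warehouse_loads.append, batch_size=3)
```

The warehouse still receives data in bulk loads, which is what it is good at, but the batches can be as small and as frequent as you like.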
Our vision is for Kafka to act as the central nervous system of the modern company, across any vertical.
We are building the Confluent Platform, a distribution of Kafka aimed at helping companies adopt and use it as a streaming platform. We think the Confluent Platform represents the best place to get started if you are thinking about putting streaming data to use in your organization, whether for a single app or at company-wide scale.
Our mission is to build this streaming platform and put it at the heart of every modern company.