<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Dataheimer Newsletter]]></title><description><![CDATA[Data Engineering/Science]]></description><link>https://dataheimer.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!kayI!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F737e0768-239b-442b-bc27-b066513283cc_1024x1024.png</url><title>Dataheimer Newsletter</title><link>https://dataheimer.substack.com</link></image><generator>Substack</generator><lastBuildDate>Sun, 21 Jun 2026 16:03:06 GMT</lastBuildDate><atom:link href="https://dataheimer.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Subhan Hagverdiyev]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[dataheimer@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[dataheimer@substack.com]]></itunes:email><itunes:name><![CDATA[Subhan Hagverdiyev]]></itunes:name></itunes:owner><itunes:author><![CDATA[Subhan Hagverdiyev]]></itunes:author><googleplay:owner><![CDATA[dataheimer@substack.com]]></googleplay:owner><googleplay:email><![CDATA[dataheimer@substack.com]]></googleplay:email><googleplay:author><![CDATA[Subhan Hagverdiyev]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Are watermarks a lie?]]></title><description><![CDATA[the watermark delusion: scaling stream processing in a messy world]]></description><link>https://dataheimer.substack.com/p/are-watermarks-a-lie</link><guid isPermaLink="false">https://dataheimer.substack.com/p/are-watermarks-a-lie</guid><dc:creator><![CDATA[Subhan Hagverdiyev]]></dc:creator><pubDate>Wed, 11 
Mar 2026 17:45:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!MYQa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F718e0e1d-6d47-450f-9c7a-a7611c7067fa_2550x1640.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the clean, controlled world of a Flink or Spark tutorial, watermarks are magic. You set an allowed lateness of five seconds, your data flows through a well-synced Kafka cluster, and your windows close with satisfying precision.</p><p>But for many teams, the &#8220;controlled environment&#8221; is a luxury they don&#8217;t have. Once your data touches the edge, mobile devices, or third-party webhooks, watermarks transition from a strict guarantee to a useful&#8212;but incomplete&#8212;heuristic.</p><p>Here is why watermarks eventually hit a wall, and the architectural patterns used to fill the gap.</p><blockquote><p><em>I&#8217;m Subhan Hagverdiyev and welcome to Dataheimer - where we explore the atomic impact of data.</em></p><p><em>We have already reached 100+ subscribers. Thank you all for your support. If you would like to receive more deep dives like this one, subscribe to the newsletter and I will send one each week.</em></p></blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataheimer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataheimer.substack.com/subscribe?"><span>Subscribe now</span></a></p><h2>First: What is a Watermark, anyway?</h2><p>Before we break them, we have to understand their purpose. 
In stream processing, there is a fundamental tension between <strong>Event Time</strong> (when something happened) and <strong>Processing Time</strong> (when your engine sees it).</p><p>Because data can arrive out of order, a streaming engine needs a way to know when to stop waiting for data and actually perform a calculation (like a sum or average for a specific minute).</p><p><strong>A watermark is a &#8220;completeness&#8221; signal.</strong> It is a timestamp flowing through the stream that effectively says: <em>&#8220;I am reasonably confident that no more events with a timestamp earlier than X are going to arrive.&#8221;</em></p><ul><li><p>If a watermark for 10:05 AM arrives, the engine &#8220;closes&#8221; the 10:00&#8211;10:05 window and emits the result.</p></li><li><p>The buffer you add (Flink&#8217;s out-of-orderness bound, Spark&#8217;s watermark delay) tells the engine to hold the watermark 10 or 60 seconds behind the latest event time it has seen. Flink&#8217;s separate &#8220;Allowed Lateness&#8221; setting keeps a window&#8217;s state around after the watermark has passed, so that late events can still update the result.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MYQa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F718e0e1d-6d47-450f-9c7a-a7611c7067fa_2550x1640.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MYQa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F718e0e1d-6d47-450f-9c7a-a7611c7067fa_2550x1640.png 424w, https://substackcdn.com/image/fetch/$s_!MYQa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F718e0e1d-6d47-450f-9c7a-a7611c7067fa_2550x1640.png 848w, 
https://substackcdn.com/image/fetch/$s_!MYQa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F718e0e1d-6d47-450f-9c7a-a7611c7067fa_2550x1640.png 1272w, https://substackcdn.com/image/fetch/$s_!MYQa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F718e0e1d-6d47-450f-9c7a-a7611c7067fa_2550x1640.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MYQa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F718e0e1d-6d47-450f-9c7a-a7611c7067fa_2550x1640.png" width="1456" height="936" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/718e0e1d-6d47-450f-9c7a-a7611c7067fa_2550x1640.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:936,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:486355,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/190516735?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F718e0e1d-6d47-450f-9c7a-a7611c7067fa_2550x1640.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MYQa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F718e0e1d-6d47-450f-9c7a-a7611c7067fa_2550x1640.png 424w, 
https://substackcdn.com/image/fetch/$s_!MYQa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F718e0e1d-6d47-450f-9c7a-a7611c7067fa_2550x1640.png 848w, https://substackcdn.com/image/fetch/$s_!MYQa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F718e0e1d-6d47-450f-9c7a-a7611c7067fa_2550x1640.png 1272w, https://substackcdn.com/image/fetch/$s_!MYQa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F718e0e1d-6d47-450f-9c7a-a7611c7067fa_2550x1640.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Red dots</strong> (above the watermark) are late events &#8212; they carry an old event time, but by the time they showed up, the watermark had already moved beyond them. For example, an event that happened at 10:01 but only arrived when the watermark was already at 10:03.</p><p><strong>Windows close</strong> when the watermark crosses their boundary. Once Window 1&#8217;s time range ends and the watermark passes 10:02, the system emits results for that window. Any event for Window 1 arriving after that is considered late and may be dropped or handled separately.</p><h4>The &#8220;Controlled Environment&#8221; Illusion</h4><p>Watermarks rely on a fundamental assumption: that the gap between <strong>event time</strong> and <strong>processing time</strong> is small and reasonably predictable. This holds when:</p><ul><li><p>All producers are on synchronized cloud clocks.</p></li><li><p>The network is stable.</p></li><li><p>There is a single, linear ingestion point (like a single Kafka topic).</p></li></ul><p>In these scenarios, a watermark is a reliable signal of &#8220;completeness.&#8221; But the real world is rarely this polite.</p><h2>Filling the Gap: From Heuristics to Layered Accuracy</h2><p>In practice, there are many cases where the heuristic breaks:</p><h4>1. The mobile/edge reality</h4><p>Mobile apps are the number one watermark killer.</p><p>Let&#8217;s say you set an allowed lateness of 30 seconds. Sounds generous, right? But then look at what actually happens when you check your event arrival distribution:</p><ul><li><p>p90 arrival delay is 2&#8211;3 seconds. 
Fine.</p></li><li><p>p95 stretches to about 15 seconds &#8212; still within the buffer.</p></li><li><p>p99 hits 4&#8211;8 minutes.</p></li><li><p>And then there&#8217;s the long tail beyond p99 (the last 1% or so of events) arriving <strong>hours</strong> late.</p></li></ul><p>Why? Users go into subway tunnels. They board flights. Their phone OS backgrounds the app. The device buffers events locally and dumps them all at once when it reconnects.</p><p>The problem is that by the time these events arrive, the watermark has already moved on. The window closed. The aggregate was emitted. Your real-time dashboard is now showing a number that&#8217;s wrong, and nobody knows it until someone compares streaming totals with the batch report and asks &#8220;why don&#8217;t these match?&#8221;</p><p>I think this is important to understand: in any company with a mobile user base, late data is not an edge case. It&#8217;s a structural property of the data. You can&#8217;t just set a bigger allowed lateness buffer and hope for the best &#8212; you need an architecture that handles it.</p><h4>2. Multi-Region Clock Skew</h4><p>This one is more subtle and it specifically hurts when you&#8217;re doing stream-to-stream joins.</p><p>Let&#8217;s say you have two services. Service A runs in US-East, Service B runs in EU-West. Both emit events with <code>event_time</code> stamped from their local system clock, and you&#8217;re joining them on a shared key like <code>session_id</code>.</p><p>In theory, Network Time Protocol (NTP) keeps cloud instances synchronized to within milliseconds. In practice, you can see clock skew anywhere from 50ms to 500ms across regions. During instance migrations or high CPU load it can be worse.</p><p>Here&#8217;s a concrete example. You have 1-minute tumbling windows. A user action triggers an event in Service A at <strong>10:04:59.800</strong> (US-East clock) and a corresponding event in Service B at <strong>10:05:00.200</strong> (EU-West clock). 
These events describe the same user action, and their timestamps are only 400ms apart, a gap that clock skew alone can produce. But because they straddle the minute boundary, Service A&#8217;s event goes into the <strong>10:04&#8211;10:05</strong> window and Service B&#8217;s event goes into the <strong>10:05&#8211;10:06</strong> window.</p><p>Your join now produces a null on both sides. The 10:04 window has an A event with no matching B. The 10:05 window has a B event with no matching A. Neither event is &#8220;late&#8221; from the watermark&#8217;s perspective &#8212; they&#8217;re just in the wrong windows.</p><p>Unfortunately this won&#8217;t show up in your late-event metrics. Everything looks healthy. You will only notice when join hit rates drop a few percentage points and you can&#8217;t explain why.</p><p>The usual fixes are widening your windows (which increases latency), using session windows with a generous gap (which adds complexity), or pre-aligning timestamps to a shared reference clock before windowing. None of them are free.</p><h4>3. The &#8220;Completeness&#8221; Join</h4><p>This one is probably the least talked about, but I think it causes the most confusion in microservice architectures.</p><p>In many systems, a single &#8220;event&#8221; doesn&#8217;t come from one place. Its full context depends on joining multiple streams. Let&#8217;s say you have an e-commerce platform where Service A emits <code>order_placed</code> events and Service B emits <code>payment_confirmed</code> events. You need to join them on <code>order_id</code> inside a time window to produce a complete <code>order_completed</code> record.</p><p>Service A is high-throughput &#8212; it processes thousands of events per second and they land in your stream almost instantly. Service B depends on a third-party payment gateway, so events arrive anywhere from 500ms to 30 seconds later, sometimes more.</p><p>I guess now you see the problem? 
If not, don&#8217;t worry, I am going to explain :) The trouble starts when your watermark advances based on the fastest stream. Engines such as Flink normally take the minimum watermark across a join&#8217;s inputs, but that protection disappears if watermarks are assigned after the streams are merged, if the slow source is marked idle, or if the engine is configured to follow the fastest input. Service A&#8217;s events are flowing in steadily, so the watermark moves forward at near real-time speed. But this watermark tells you nothing about whether the matching Service B event has arrived yet.</p><p>So what happens? The window closes because Stream A pushed the watermark past the boundary. The <code>payment_confirmed</code> event from Service B shows up 15 seconds later &#8212; technically on time by its own event timestamp, but the window is already gone. Your join emits a partial record or no record at all.</p><p>The result: you get <code>order_placed</code> events with no matching payment, which downstream looks like failed transactions. In reports, orders appear to be dropping, and only when you dig in do you find that the payments were simply slow.</p><p>I think this is the scenario where watermarks are most misleading. They give you a completeness signal for a single stream, but in a multi-stream join the completeness of one stream is irrelevant without the other. 
You need to either set your watermark strategy per-source (which most engines support but few teams configure properly), use event-time based session windows with wide gaps, or move the join out of the streaming layer entirely and do it in a reconciliation step.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G_-5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc7c91e-3931-4ac0-88e8-92f38d42fe93_2266x1260.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G_-5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc7c91e-3931-4ac0-88e8-92f38d42fe93_2266x1260.png 424w, https://substackcdn.com/image/fetch/$s_!G_-5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc7c91e-3931-4ac0-88e8-92f38d42fe93_2266x1260.png 848w, https://substackcdn.com/image/fetch/$s_!G_-5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc7c91e-3931-4ac0-88e8-92f38d42fe93_2266x1260.png 1272w, https://substackcdn.com/image/fetch/$s_!G_-5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc7c91e-3931-4ac0-88e8-92f38d42fe93_2266x1260.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G_-5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc7c91e-3931-4ac0-88e8-92f38d42fe93_2266x1260.png" width="1456" height="810" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7cc7c91e-3931-4ac0-88e8-92f38d42fe93_2266x1260.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:810,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:222117,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/190516735?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc7c91e-3931-4ac0-88e8-92f38d42fe93_2266x1260.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!G_-5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc7c91e-3931-4ac0-88e8-92f38d42fe93_2266x1260.png 424w, https://substackcdn.com/image/fetch/$s_!G_-5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc7c91e-3931-4ac0-88e8-92f38d42fe93_2266x1260.png 848w, https://substackcdn.com/image/fetch/$s_!G_-5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc7c91e-3931-4ac0-88e8-92f38d42fe93_2266x1260.png 1272w, https://substackcdn.com/image/fetch/$s_!G_-5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc7c91e-3931-4ac0-88e8-92f38d42fe93_2266x1260.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Before jumping into fixes &#8212; how do you even know your watermarks are wrong? The two metrics to watch are <strong>late-event ratio per window</strong> (what percentage of events arrive after the window closes) and <strong>watermark lag</strong> (the gap between your watermark timestamp and wall-clock time). If your late-event ratio keeps climbing or your watermark lag is consistently high, that's your signal.</p><h2>The Patterns That Fill the Gap</h2><p>When watermarks alone can't guarantee accuracy, you need a layered approach. Here are the four main patterns teams use &#8212; and more importantly, when to pick which one.</p><h4>Late-Data Side Outputs</h4><p>The first thing to do is stop throwing away late data. 
Instead of dropping events that miss the watermark, you route them to a side output &#8212; basically a secondary stream that catches everything the main window missed.</p><p>In Flink this looks something like:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;java&quot;,&quot;nodeId&quot;:&quot;76aa79c4-baf8-4ef0-8283-9fea7e271cbd&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-java">OutputTag&lt;Event&gt; lateTag = new OutputTag&lt;Event&gt;("late-data"){};

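// Assumes `stream` already has event-time timestamps and watermarks assigned upstream.
SingleOutputStreamOperator&lt;Result&gt; result = stream
    .keyBy(e -&gt; e.getKey())
    // Newer Flink releases take java.time.Duration here instead of Time.
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    // Keep window state for 30s after the watermark passes the window end,
    // so slightly late events still update the result.
    .allowedLateness(Time.seconds(30))
    // Events later than that go to the side output instead of being dropped.
    .sideOutputLateData(lateTag)
    .aggregate(new MyAggregateFunction());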

DataStream&lt;Event&gt; lateStream = result.getSideOutput(lateTag);</code></pre></div><p>You write these late events to a separate table (or topic) and let your query layer merge them with the main results. Your real-time view stays fast, and the late data doesn't disappear.</p><h4>The &#8220;Close and Reopen&#8221; Strategy</h4><p>We often think of a window as a one-time thing. It fires, emits a result, done. But it doesn&#8217;t have to be.</p><p>Many teams use a multi-pass approach:</p><p>The first pass closes the window based on the watermark and powers the real-time dashboard. It&#8217;s maybe 90-95% accurate and available in seconds.</p><p>The second pass runs hours later. It reopens the same window logic, incorporates the side output data, and emits a corrected version of the aggregate. Your dashboard number from 2 PM gets quietly updated by 5 PM.</p><p>I think this pattern works really well for analytics dashboards where &#8220;close enough now, exact later&#8221; is acceptable. If you&#8217;re building something where users see totals and trends &#8212; daily active users, revenue dashboards, funnel metrics &#8212; this is probably what you want.</p><h4>Reconciliation Batch Jobs (Lambda Redux)</h4><p>I know we are talking about streaming jobs, but in real life streaming and batch usually complement each other. What does that mean for watermarks?</p><p>The streaming job handles the &#8220;now&#8221; &#8212; low-latency aggregates for dashboards and alerts. But alongside it, a nightly batch job (running in Snowflake, BigQuery, Spark, whatever you have) scans the raw event logs in S3 or GCS and recomputes everything from scratch. These raw logs eventually contain everything, no matter how late it was. The batch results overwrite the streaming aggregates.</p><p>Now I know what you&#8217;re thinking &#8212; this sounds like Lambda Architecture. And yes, conceptually it is. 
But the reason the original Lambda got so much hate was the &#8220;two codebases&#8221; problem: you had to maintain completely separate logic for the batch layer and the speed layer, and keeping them in sync was painful.</p><p>The modern version largely avoids this. You can often use the same dbt models or the same SQL for both your streaming and batch paths, so Flink SQL and your batch SQL share the same transformation logic. The architecture is similar but the operational pain is significantly lower.</p><p>I think this pattern is non-negotiable for anything financial. Your dashboard can be slightly off at 2 PM, but your revenue reports need to be 100% accurate by next morning.</p><h4>Designing for Idempotency</h4><p>This is the one I think more teams should adopt. The idea is simple: instead of trying to make your windows perfect, make the downstream consumers not care if they receive updates multiple times.</p><p>If your writes are idempotent, you don&#8217;t need to &#8220;close&#8221; a window at all. Every time a late event arrives, you just re-emit the updated aggregate. The database overwrites the previous value.</p><p>In practice this means using <code>MERGE</code> or upsert semantics instead of <code>INSERT</code>:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:&quot;7b993fe3-e2e1-4d61-92e6-bfc8ee354ed5&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">-- Instead of this (breaks on late data):
INSERT INTO hourly_metrics (window_start, dimension, total)
VALUES ('2026-01-15 10:00', 'checkout', 542);
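
-- One option is an upsert (a sketch in PostgreSQL syntax; it assumes a unique
-- constraint on (window_start, dimension), which this post does not show):
INSERT INTO hourly_metrics (window_start, dimension, total)
VALUES ('2026-01-15 10:00', 'checkout', 547)
ON CONFLICT (window_start, dimension)
DO UPDATE SET total = EXCLUDED.total;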

-- Do this (handles late data automatically):
MERGE INTO hourly_metrics AS target
USING (SELECT '2026-01-15 10:00' AS window_start, 
              'checkout' AS dimension, 
              547 AS total) AS source
ON target.window_start = source.window_start 
   AND target.dimension = source.dimension
WHEN MATCHED THEN UPDATE SET total = source.total
WHEN NOT MATCHED THEN INSERT (window_start, dimension, total) 
                      VALUES (source.window_start, source.dimension, source.total);</code></pre></div><p>Every late event just updates the row. Your downstream consumers always see the latest truth, without a separate side output or reprocessing path, as long as the streaming job still keeps the state for that window.</p><p>The catch is that this only works when your consumers can handle updates. If something downstream already read the old value and acted on it (sent a notification, charged a customer), an update won&#8217;t undo that. So this pattern fits best for read-heavy analytical systems, not transactional workflows.</p><h2>Conclusion</h2><p>Watermarks are an optimization for latency, not a guarantee of accuracy. I think this is the most important thing to understand before you build any streaming pipeline.</p><p>In a controlled environment &#8212; synchronized clocks, stable network, single ingestion point &#8212; watermarks work great. But once your data touches mobile devices, crosses regions, or depends on joining multiple services, your streaming windows will be wrong sometimes. That&#8217;s not a bug. That&#8217;s just how distributed systems work.</p><p>The goal is not to make the stream perfect. The goal is to build a system that handles late arrivals without losing data. Use watermarks to get your low-latency metrics out the door. Use side outputs and idempotent writes to catch what the watermark missed. And use batch reconciliation to make sure the final numbers are actually correct.</p><p>"Thank you for reading this far. See you in my next articles. 
Don't forget to subscribe to get more of this interesting data engineering content!"</p>]]></content:encoded></item><item><title><![CDATA[Building a simple RAG pipeline in 2026: a local-first approach]]></title><description><![CDATA[Code a simple RAG using ollama]]></description><link>https://dataheimer.substack.com/p/building-a-simple-rag-pipeline-in</link><guid isPermaLink="false">https://dataheimer.substack.com/p/building-a-simple-rag-pipeline-in</guid><dc:creator><![CDATA[Subhan Hagverdiyev]]></dc:creator><pubDate>Mon, 02 Mar 2026 18:27:20 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/6e70b9cb-5d27-4e98-8717-bf3ddf904dde_2400x1350.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><em>I&#8217;m Subhan Hagverdiyev and welcome to Dataheimer - where we explore the atomic impact of data.</em></p><p><em>Just like splitting an atom releases enormous energy, the right data engineering decisions can transform entire organizations.</em></p><p><em>This is where I break down complex concepts and share all the fascinating discoveries from my journey.</em></p><p><em>Want to join the adventure? 
Here you go: <a href="https://dataheimer.substack.com/subscribe?">subscribe to the newsletter</a>.</em></p></blockquote><p>We hear these words quite often these days: RAG, LLM, vector databases and, of course, MCP servers (though we don&#8217;t need those for this article). If you are planning to become an AI software engineer, or simply don&#8217;t want to fall behind in AI development, you need to understand what RAG is and how it helps LLMs. <br><br>In this article we will build a functional RAG system from scratch using <strong>Python</strong> and <strong>Ollama</strong> to run high-performance models locally on your machine.</p><h2>What is RAG?</h2><p>I think we already know what an LLM is: the model behind a chatbot (like ChatGPT or Claude) that understands and generates human language. An LLM is like a pre-compiled binary with no internet access: it only knows what was in its training data. <strong>RAG (Retrieval-Augmented Generation)</strong> is a process where you find a specific piece of text from your own data and "paste" it into the prompt you send to the LLM. This allows the LLM to answer questions using information it wasn't originally trained on.<br><br>A real-world example would be asking Claude &#8220;How much do I owe the bank?&#8221;. 
Claude cannot answer this question because it doesn&#8217;t have access to external knowledge, such as your financial information.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yroW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5df07be9-ed24-4c16-99c8-a64e928c09a8_1496x870.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yroW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5df07be9-ed24-4c16-99c8-a64e928c09a8_1496x870.png 424w, https://substackcdn.com/image/fetch/$s_!yroW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5df07be9-ed24-4c16-99c8-a64e928c09a8_1496x870.png 848w, https://substackcdn.com/image/fetch/$s_!yroW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5df07be9-ed24-4c16-99c8-a64e928c09a8_1496x870.png 1272w, https://substackcdn.com/image/fetch/$s_!yroW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5df07be9-ed24-4c16-99c8-a64e928c09a8_1496x870.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yroW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5df07be9-ed24-4c16-99c8-a64e928c09a8_1496x870.png" width="1456" height="847" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5df07be9-ed24-4c16-99c8-a64e928c09a8_1496x870.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:847,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:172498,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/189649133?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5df07be9-ed24-4c16-99c8-a64e928c09a8_1496x870.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yroW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5df07be9-ed24-4c16-99c8-a64e928c09a8_1496x870.png 424w, https://substackcdn.com/image/fetch/$s_!yroW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5df07be9-ed24-4c16-99c8-a64e928c09a8_1496x870.png 848w, https://substackcdn.com/image/fetch/$s_!yroW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5df07be9-ed24-4c16-99c8-a64e928c09a8_1496x870.png 1272w, https://substackcdn.com/image/fetch/$s_!yroW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5df07be9-ed24-4c16-99c8-a64e928c09a8_1496x870.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To address this limitation, we need to provide external knowledge to the model (in this example, a financial record)</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!B9fS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf56984b-e834-4bb0-8d29-afc0ef4b752c_317x443.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!B9fS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf56984b-e834-4bb0-8d29-afc0ef4b752c_317x443.png 424w, 
https://substackcdn.com/image/fetch/$s_!B9fS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf56984b-e834-4bb0-8d29-afc0ef4b752c_317x443.png 848w, https://substackcdn.com/image/fetch/$s_!B9fS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf56984b-e834-4bb0-8d29-afc0ef4b752c_317x443.png 1272w, https://substackcdn.com/image/fetch/$s_!B9fS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf56984b-e834-4bb0-8d29-afc0ef4b752c_317x443.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!B9fS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf56984b-e834-4bb0-8d29-afc0ef4b752c_317x443.png" width="317" height="443" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf56984b-e834-4bb0-8d29-afc0ef4b752c_317x443.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:443,&quot;width&quot;:317,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:28964,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/189649133?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf56984b-e834-4bb0-8d29-afc0ef4b752c_317x443.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!B9fS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf56984b-e834-4bb0-8d29-afc0ef4b752c_317x443.png 
424w, https://substackcdn.com/image/fetch/$s_!B9fS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf56984b-e834-4bb0-8d29-afc0ef4b752c_317x443.png 848w, https://substackcdn.com/image/fetch/$s_!B9fS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf56984b-e834-4bb0-8d29-afc0ef4b752c_317x443.png 1272w, https://substackcdn.com/image/fetch/$s_!B9fS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf56984b-e834-4bb0-8d29-afc0ef4b752c_317x443.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" 
x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A RAG system consists of two key components:</p><ul><li><p>A <strong>retrieval model</strong> that fetches relevant information from an external knowledge source, which could be a database, search engine, or any other information repository.</p></li><li><p>A <strong>language model</strong> that generates responses based on the retrieved knowledge</p></li></ul><p>Let&#8217;s create a simple RAG system that retrieves information from a predefined dataset and generates responses based on the retrieved knowledge. The system will comprise the following components:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XuuR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71c6ea25-3eff-44b7-bca6-25ba8539055c_320x540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XuuR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71c6ea25-3eff-44b7-bca6-25ba8539055c_320x540.png 424w, https://substackcdn.com/image/fetch/$s_!XuuR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71c6ea25-3eff-44b7-bca6-25ba8539055c_320x540.png 848w, https://substackcdn.com/image/fetch/$s_!XuuR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71c6ea25-3eff-44b7-bca6-25ba8539055c_320x540.png 1272w, https://substackcdn.com/image/fetch/$s_!XuuR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71c6ea25-3eff-44b7-bca6-25ba8539055c_320x540.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XuuR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71c6ea25-3eff-44b7-bca6-25ba8539055c_320x540.png" width="320" height="540" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/71c6ea25-3eff-44b7-bca6-25ba8539055c_320x540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:540,&quot;width&quot;:320,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:36861,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/189649133?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71c6ea25-3eff-44b7-bca6-25ba8539055c_320x540.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XuuR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71c6ea25-3eff-44b7-bca6-25ba8539055c_320x540.png 424w, https://substackcdn.com/image/fetch/$s_!XuuR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71c6ea25-3eff-44b7-bca6-25ba8539055c_320x540.png 848w, https://substackcdn.com/image/fetch/$s_!XuuR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71c6ea25-3eff-44b7-bca6-25ba8539055c_320x540.png 1272w, https://substackcdn.com/image/fetch/$s_!XuuR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71c6ea25-3eff-44b7-bca6-25ba8539055c_320x540.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol><li><p><strong>Embedding model</strong>: A pre-trained language model that converts input text into embeddings - vector representations that capture semantic meaning. These vectors will be used to search for relevant information in the dataset.</p></li><li><p><strong>Vector database</strong>: A storage system for knowledge and its corresponding embedding vectors. 
While there are many vector database technologies like <strong><a href="https://qdrant.tech/">Qdrant</a></strong>, <strong><a href="https://www.pinecone.io/">Pinecone</a></strong>, and <strong><a href="https://github.com/pgvector/pgvector">pgvector</a></strong>, we&#8217;ll implement a simple in-memory database from scratch.</p></li><li><p><strong>Chatbot</strong>: A language model that generates responses based on retrieved knowledge. This can be any language model, such as Llama, Gemma, or GPT.</p></li></ol><h2>Indexing phase</h2><p>Before we can ask questions, we must "index" our data. This involves breaking documents into <strong>chunks</strong> and converting them into vectors.<br></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iWFg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2467d614-8309-4b9f-b572-6d8a3dbdffb5_1240x344.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iWFg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2467d614-8309-4b9f-b572-6d8a3dbdffb5_1240x344.png 424w, https://substackcdn.com/image/fetch/$s_!iWFg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2467d614-8309-4b9f-b572-6d8a3dbdffb5_1240x344.png 848w, https://substackcdn.com/image/fetch/$s_!iWFg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2467d614-8309-4b9f-b572-6d8a3dbdffb5_1240x344.png 1272w, 
https://substackcdn.com/image/fetch/$s_!iWFg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2467d614-8309-4b9f-b572-6d8a3dbdffb5_1240x344.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iWFg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2467d614-8309-4b9f-b572-6d8a3dbdffb5_1240x344.png" width="1240" height="344" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2467d614-8309-4b9f-b572-6d8a3dbdffb5_1240x344.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:344,&quot;width&quot;:1240,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:39969,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/189649133?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2467d614-8309-4b9f-b572-6d8a3dbdffb5_1240x344.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iWFg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2467d614-8309-4b9f-b572-6d8a3dbdffb5_1240x344.png 424w, https://substackcdn.com/image/fetch/$s_!iWFg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2467d614-8309-4b9f-b572-6d8a3dbdffb5_1240x344.png 848w, 
https://substackcdn.com/image/fetch/$s_!iWFg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2467d614-8309-4b9f-b572-6d8a3dbdffb5_1240x344.png 1272w, https://substackcdn.com/image/fetch/$s_!iWFg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2467d614-8309-4b9f-b572-6d8a3dbdffb5_1240x344.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The embedding vectors can be later used to retrieve relevant information based on a given query. 
Think of it as a <em>SQL WHERE</em> clause, but instead of matching exact text, we query the chunks by how similar their vector representations are to the query&#8217;s.<br></p><h2>Implementation</h2><p>We will use Python to implement the RAG system with <strong>Ollama</strong> because it allows us to run these models locally, without API fees and without sending our data to a third party.</p><h3>Models we will use</h3><ul><li><p>Embedding model: https://ollama.com/library/nomic-embed-text</p></li><li><p>Language model: https://ollama.com/library/llama4:scout</p></li></ul><p>For the dataset, we&#8217;ll use a simple list of facts about cats. Each fact will be treated as a chunk during the indexing phase.</p><h3>Install Ollama and pull models</h3><p>Start by downloading Ollama from <a href="https://ollama.com">ollama.com</a>, then pull the two models:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;add07b97-dadf-4396-bb0d-9a00a18f43fc&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">ollama pull nomic-embed-text
ollama pull llama4:scout

# If everything goes well, you'll see output like this:

pulling manifest
...
verifying sha256 digest
writing manifest
success</code></pre></div><p><br>Next, install the Ollama Python package:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;e2692627-1d90-4292-b908-305fc409f115&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">pip install ollama</code></pre></div><h3>Load the dataset</h3><p>Create a new Python script and load the dataset. The dataset is a plain text file where each line is a cat fact &#8212; each line becomes a chunk for indexing.</p><p>You can download the example dataset from <a href="https://github.com/anthropics/rag-tutorial/blob/main/cat-facts.txt">here</a>. Here&#8217;s how to load it:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;89792fbf-03a4-4049-8d98-4e3b462c61ce&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">dataset = []
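with open('cat-facts.txt', 'r', encoding='utf-8') as file:
    # Strip the trailing newline from each line and skip blank lines
    dataset = [line.strip() for line in file if line.strip()]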
    print(f'Loaded {len(dataset)} entries')</code></pre></div><h3>Build the vector database</h3><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;874412db-7b7b-47af-8b3d-3526b3c95355&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import ollama

EMBEDDING_MODEL = 'nomic-embed-text'
LANGUAGE_MODEL = 'llama4:scout'

# Each element in VECTOR_DB is a tuple: (chunk, embedding)
# An embedding is a list of floats, e.g. [0.1, 0.04, -0.34, 0.21, ...]
VECTOR_DB = []

def add_chunk_to_database(chunk):
    embedding = ollama.embed(model=EMBEDDING_MODEL, input=chunk)['embeddings'][0]
    VECTOR_DB.append((chunk, embedding))</code></pre></div><p>For simplicity, each line in the dataset is treated as its own chunk:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;4aa5610b-6c77-437f-8c92-94af1b8ebe5d&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">for i, chunk in enumerate(dataset):
    add_chunk_to_database(chunk)
    print(f'Added chunk {i+1}/{len(dataset)} to the database')</code></pre></div><h3>Implement the retrieval function</h3><p>Next, we need a way to find the most relevant chunks for a given query. We&#8217;ll compute the cosine similarity between the query&#8217;s embedding and every chunk embedding in our database, then return the top matches.</p><p>Cosine similarity measures how &#8220;close&#8221; two vectors are in the embedding space &#8212; a higher value means the texts are more semantically similar.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;3b53d6bc-6b13-4dd1-844d-8720a71eec11&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def cosine_similarity(a, b):
    dot_product = sum([x * y for x, y in zip(a, b)])
    norm_a = sum([x ** 2 for x in a]) ** 0.5
    norm_b = sum([x ** 2 for x in b]) ** 0.5
    return dot_product / (norm_a * norm_b)</code></pre></div><p>And the retrieval function:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;7734aedb-d141-4a89-a4fd-9ccd6fe3af72&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def retrieve(query, top_n=3):
    query_embedding = ollama.embed(model=EMBEDDING_MODEL, input=query)['embeddings'][0]
    similarities = []
    for chunk, embedding in VECTOR_DB:
        similarity = cosine_similarity(query_embedding, embedding)
        similarities.append((chunk, similarity))
    # Sort descending &#8212; higher similarity = more relevant
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_n]</code></pre></div><h3>Generate a response</h3><p>Now comes the generation phase. We take the retrieved chunks, inject them into a prompt as context, and let the language model produce an answer grounded in that context.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;00b7c248-8312-49ff-809c-534682085251&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">input_query = input('Ask me a question: ')
retrieved_knowledge = retrieve(input_query)

print('Retrieved knowledge:')
for chunk, similarity in retrieved_knowledge:
    print(f' - (similarity: {similarity:.2f}) {chunk}')

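# Build the context outside the f-string: a backslash inside an f-string
# expression is a SyntaxError before Python 3.12
context = '\n'.join(f' - {chunk}' for chunk, _ in retrieved_knowledge)
instruction_prompt = f'''You are a helpful chatbot.
Use only the following pieces of context to answer the question. Don't make up any new information:
{context}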
'''</code></pre></div><p>Then pass it to Ollama for generation:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;58f14f0a-591e-4cd4-99c3-e43b5702f4c0&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">stream = ollama.chat(
    model=LANGUAGE_MODEL,
    messages=[
        {'role': 'system', 'content': instruction_prompt},
        {'role': 'user', 'content': input_query},
    ],
    stream=True,
)

print('Chatbot response:')
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)</code></pre></div><h3>Putting it all together</h3><p>Save the complete code to a file called demo.py and run it using the following command:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;06900429-3863-4752-a029-a6e293566e63&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">python demo.py</code></pre></div><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;32b0682e-2ad6-4396-a4fa-34c5dbb967b2&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">Ask me a question: tell me about cat speed
Retrieved knowledge: ...
Chatbot response:
According to the given context, cats can travel at approximately 31 mph (49 km) over a short distance. This is their top speed.</code></pre></div><h2>Improvements</h2><p>This implementation is intentionally minimal. Here are some meaningful ways to improve it:</p><p><strong>Smarter query construction.</strong> If the user&#8217;s question spans multiple topics, a single retrieval pass may miss important context. One approach is to have the language model rewrite or decompose the user&#8217;s question into multiple targeted queries before retrieval.</p><p><strong>Reranking.</strong> The top-N results from cosine similarity aren&#8217;t always the most useful. A dedicated reranking model can re-score the retrieved chunks based on deeper relevance to the query, which often improves answer quality.</p><p><strong>Use a proper vector database.</strong> Our in-memory list won&#8217;t scale: every query is a linear scan over all stored embeddings, and nothing persists between runs. For real applications, consider a purpose-built vector store like Qdrant, Pinecone, pgvector, or ChromaDB. These offer fast approximate nearest-neighbor search, persistence, and filtering.</p><p><strong>Better chunking strategies.</strong> We&#8217;re treating each line as a chunk, which is simplistic. For longer documents, you&#8217;ll want to experiment with overlapping chunks, semantic chunking, or recursive splitting to capture more context per chunk.</p><p><strong>Upgrade the language model.</strong> We used Llama 4 Scout here because it runs locally through Ollama. Larger models like Llama 4 Maverick, Qwen3, or DeepSeek V3 may produce more coherent and accurate responses, especially for complex questions, at the cost of more memory and slower generation.</p><p></p><h2>Conclusion</h2><p>RAG remains one of the most practical techniques for making language models useful with your own data. 
By building a simple RAG system from scratch, we&#8217;ve walked through the core concepts: embedding text into vectors, retrieving relevant context via similarity search, and grounding generation in that context.</p><p>The ecosystem has matured substantially: running high-quality open models locally is now straightforward with tools like Ollama, and the range of available models and vector databases keeps growing. Whether you&#8217;re building a quick prototype or a production system, the fundamentals covered here are the foundation everything else builds on.</p><blockquote><p>Thank you for reading this far. See you in the next article. Don&#8217;t forget to subscribe to get more of this interesting data engineering content!</p></blockquote>]]></content:encoded></item><item><title><![CDATA[Context Engineering in Data Engineering]]></title><description><![CDATA[How context engineering will affect the data engineers]]></description><link>https://dataheimer.substack.com/p/context-engineering-in-data-engineering</link><guid isPermaLink="false">https://dataheimer.substack.com/p/context-engineering-in-data-engineering</guid><dc:creator><![CDATA[Subhan Hagverdiyev]]></dc:creator><pubDate>Fri, 06 Feb 2026 20:07:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!aH4b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a14d62c-24e8-49e5-a17d-1f6f99b42d11_1186x558.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><em>I&#8217;m Subhan Hagverdiyev and welcome to Dataheimer - where we explore the atomic impact of data.</em></p><p><em>Just like splitting an atom releases enormous energy, the right data engineering decisions can transform entire organizations.</em></p><p><em>This is where I break down complex concepts and share all the fascinating discoveries from my journey.</em></p><p><em>Want to join the adventure? 
Here you go:</em></p></blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataheimer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataheimer.substack.com/subscribe?"><span>Subscribe now</span></a></p><p>A few years ago, most conversations about data systems revolved around scale: more rows, faster queries, cheaper storage. Lately, the tone has shifted. Teams are still chasing performance, but they are also wrestling with a subtler problem&#8212;<strong>how systems understand the situation they&#8217;re operating in</strong>.</p><p>That&#8217;s where <em>context engineering</em> comes in. </p><h3>The Tweet That Started It All</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aH4b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a14d62c-24e8-49e5-a17d-1f6f99b42d11_1186x558.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aH4b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a14d62c-24e8-49e5-a17d-1f6f99b42d11_1186x558.png 424w, https://substackcdn.com/image/fetch/$s_!aH4b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a14d62c-24e8-49e5-a17d-1f6f99b42d11_1186x558.png 848w, https://substackcdn.com/image/fetch/$s_!aH4b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a14d62c-24e8-49e5-a17d-1f6f99b42d11_1186x558.png 1272w, 
https://substackcdn.com/image/fetch/$s_!aH4b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a14d62c-24e8-49e5-a17d-1f6f99b42d11_1186x558.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aH4b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a14d62c-24e8-49e5-a17d-1f6f99b42d11_1186x558.png" width="1186" height="558" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2a14d62c-24e8-49e5-a17d-1f6f99b42d11_1186x558.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:558,&quot;width&quot;:1186,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:107159,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/187125185?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a14d62c-24e8-49e5-a17d-1f6f99b42d11_1186x558.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aH4b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a14d62c-24e8-49e5-a17d-1f6f99b42d11_1186x558.png 424w, https://substackcdn.com/image/fetch/$s_!aH4b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a14d62c-24e8-49e5-a17d-1f6f99b42d11_1186x558.png 848w, 
https://substackcdn.com/image/fetch/$s_!aH4b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a14d62c-24e8-49e5-a17d-1f6f99b42d11_1186x558.png 1272w, https://substackcdn.com/image/fetch/$s_!aH4b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a14d62c-24e8-49e5-a17d-1f6f99b42d11_1186x558.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>In June 2025, Shopify CEO Tobi L&#252;tke posted something that made waves in the AI community. Andrej Karpathy&#8212;yes,
<em>that</em> Karpathy&#8212;immediately co-signed it. His take was more technical but hit the same nerve:</p><blockquote><p>"People associate prompts with short task descriptions you'd give an LLM in your day-to-day use. When in every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step."</p></blockquote><h2>So What Is Context Engineering, Really?</h2><p>Here&#8217;s my working definition: <strong>Context engineering is the discipline of designing and optimizing the entire information environment that an LLM operates in&#8212;not just the prompt, but everything the model sees at inference time.</strong></p><p>That includes:</p><ul><li><p><strong>System instructions</strong> (the static stuff that tells the model how to behave)</p></li><li><p><strong>Retrieved knowledge</strong> (documents, database records, whatever you pulled from your RAG pipeline)</p></li><li><p><strong>Tool definitions</strong> (what functions the model can call)</p></li><li><p><strong>Conversation history</strong> (what&#8217;s been said before)</p></li><li><p><strong>User metadata</strong> (who&#8217;s asking, what permissions they have, what timezone they&#8217;re in)</p></li><li><p><strong>State</strong> (what step we&#8217;re on, what&#8217;s already been tried)</p></li></ul><h2>How Context Shows Up Inside Data Engineering</h2><p>For data engineering teams, context engineering is not merely an extension of existing ETL (Extract, Transform, Load) processes but a fundamental re-evaluation of what constitutes "data readiness". Traditionally, data engineering focused on moving raw digital signals into passive storage for human consumption.
In the context-engineered enterprise, pipelines are redesigned to deliver "executable understanding"&#8212;data that is enriched with the semantic, structural, and operational metadata required for machines to act autonomously.</p><h3><strong>Semantic Context and the Unified Meaning Layer</strong></h3><p>The first pillar of data-centric context engineering is semantic context, which ensures that machines and humans share a canonical understanding of business concepts. In complex enterprise environments, metric logic&#8212;such as the definition of "revenue" or "customer churn"&#8212;often varies across departments. Context engineering mandates that these definitions be centralized, versioned, and programmatically discoverable. Without this semantic grounding, AI agents frequently suffer from reasoning failures caused by mismatches in metric logic rather than model flaws.</p><h3><strong>Structural Context: Lineage as a Reasoning Graph</strong></h3><p>Lineage has transitioned from a passive compliance requirement to an active reasoning backbone. Agents require structural context to navigate the interconnected landscape of enterprise data, using lineage graphs to trace anomalies upstream, estimate the &#8220;blast radius&#8221; of potential actions, and choose alternative data paths when a primary source is unavailable. This transformation often involves the use of knowledge graphs that link physical data assets (tables and columns) to business metrics and transformations.</p><h3><strong>Operational Context and Probabilistic Trust</strong></h3><p>Unlike traditional data quality monitoring, which treats trust as a binary (certified or uncertified), context engineering views trust as a dynamic, use-case-dependent signal. Operational context includes real-time telemetry on data freshness, distribution shifts, and historical reliability patterns.
For an agentic system, a dataset may be considered &#8220;good enough&#8221; for an internal forecast but &#8220;unsafe&#8221; for regulatory reporting; context engineering encodes this nuance, allowing agents to explain why a decision is safe or unsafe based on the current state of the underlying data.</p><h3><strong>Policy Context and Enforceable Constraints</strong></h3><p>The final data pillar is policy context, which ensures that agents operate within the legal, ethical, and regulatory boundaries of the organization. This involves embedding sensitivity classifications, regional constraints, and purpose limitations directly into the data context. These constraints must be machine-readable and enforced at decision time, providing an auditable trail of how data was used and for what purpose.</p><p>Building context-aware systems requires a significant upgrade to the traditional data stack, moving beyond simple relational tables to vector databases and semantic graphs. These technologies provide the foundational "RAM" required for contextual understanding at scale. This is an important topic in its own right, and I will cover it in a later article.</p><h2>Wrapping Up</h2><p>For data engineers, this is both a challenge and an opportunity. A challenge because the expectations are higher than ever. An opportunity because the skills you've built over years of pipeline work translate directly to this new domain.</p><p>Have thoughts on context engineering in your data stack? I'd love to hear what patterns you're seeing.
The field is moving fast and none of us have all the answers.</p>]]></content:encoded></item><item><title><![CDATA[The resources that actually get you hired as a data engineer in 2025]]></title><description><![CDATA[YouTube channels and books that help you become a data engineer in 2025]]></description><link>https://dataheimer.substack.com/p/the-resources-that-actually-get-you</link><guid isPermaLink="false">https://dataheimer.substack.com/p/the-resources-that-actually-get-you</guid><dc:creator><![CDATA[Subhan Hagverdiyev]]></dc:creator><pubDate>Sun, 20 Jul 2025 15:49:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!VGks!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F513f40eb-08a6-4449-9339-05149d00b160_1011x900.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VGks!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F513f40eb-08a6-4449-9339-05149d00b160_1011x900.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VGks!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F513f40eb-08a6-4449-9339-05149d00b160_1011x900.heic 424w, https://substackcdn.com/image/fetch/$s_!VGks!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F513f40eb-08a6-4449-9339-05149d00b160_1011x900.heic 848w, https://substackcdn.com/image/fetch/$s_!VGks!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F513f40eb-08a6-4449-9339-05149d00b160_1011x900.heic 1272w,
https://substackcdn.com/image/fetch/$s_!VGks!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F513f40eb-08a6-4449-9339-05149d00b160_1011x900.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VGks!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F513f40eb-08a6-4449-9339-05149d00b160_1011x900.heic" width="658" height="585.7566765578636" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/513f40eb-08a6-4449-9339-05149d00b160_1011x900.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:900,&quot;width&quot;:1011,&quot;resizeWidth&quot;:658,&quot;bytes&quot;:36547,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/168783055?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F513f40eb-08a6-4449-9339-05149d00b160_1011x900.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VGks!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F513f40eb-08a6-4449-9339-05149d00b160_1011x900.heic 424w, https://substackcdn.com/image/fetch/$s_!VGks!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F513f40eb-08a6-4449-9339-05149d00b160_1011x900.heic 848w, 
https://substackcdn.com/image/fetch/$s_!VGks!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F513f40eb-08a6-4449-9339-05149d00b160_1011x900.heic 1272w, https://substackcdn.com/image/fetch/$s_!VGks!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F513f40eb-08a6-4449-9339-05149d00b160_1011x900.heic 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>There's a moment when you realize most data engineering courses teach you tools, not thinking.</p><p>You memorize Spark
syntax. You follow Airflow tutorials. You build toy pipelines that would crumble under real load.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://dataheimer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Dataheimer Newsletter! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>But then you find the real teachers. The ones who've processed billions of records at 3am when everything's on fire. The ones who explain <em>why</em> your pipeline failed, not just how to restart it.</p><p>Here are the resources that rewired my brain. They didn't just teach me data engineering. They taught me how to think like someone who builds systems that don't break.</p><h2><strong><a href="https://www.youtube.com/@andreaskayy">Andreas Kretz - Learn Data Engineering</a></strong></h2><p>I found Andreas's channel while debugging a Kafka issue at work.</p><p>No fancy animations. No "DESTROYING DATA PIPELINES WITH THIS ONE TRICK." Just a guy who's been doing this for 10+ years showing you what actually works.</p><p>Andreas doesn't teach like other YouTubers. He tears down real architectures&#8212;like how Nielsen processes 55 TB daily. <em>Real systems</em> with real constraints and real failures.</p><p>His videos changed how I approach problems. ETL vs. ELT wasn't abstract anymore.
I understood <em>why</em> you'd pick Kafka over Kinesis, when batch beats streaming, and why some architectures cost millions to fix later.</p><p>Every video made me sharper. Made me question my designs. And most importantly: made me less terrified of building at scale.</p><h2><strong><a href="https://www.youtube.com/@SeattleDataGuy">Seattle Data Guy - Ben Rogojan</a></strong></h2><p>Ben's content hit different because he's been on both sides&#8212;building at Meta and consulting for Fortune 500s.</p><p>There's a gap most channels never bridge: the one between <em>pipelines that run</em>... and <em>pipelines that make money</em>. That's where Ben shines.</p><p>His channel shows what actually happens in big companies. Not the sanitized conference talks. The real mistakes that cost millions. The architectural decisions that haunt you for years. The politics behind technical choices.</p><p>He covers the modern stack&#8212;dbt, Trino, Airflow&#8212;but always ties it back to business impact. That's what makes him essential.</p><p>After watching Ben, I started seeing the bigger picture. The organizational debt. The resume-driven development. The difference between being right and being effective.</p><h2><strong><a href="https://www.youtube.com/@EcZachly_">Data with Zach (Zach Wilson)</a></strong></h2><p>Zach's channel found me when I needed it most&#8212;transitioning from software engineering to data engineering.</p><p>Most data content feels like it's made by people who've never written production code. Zach's different. He brings the software engineering lens to data problems. And that perspective changes <em>everything</em>.</p><p>He doesn't just show you how to build pipelines. He shows you how to build pipelines that other engineers won't curse you for. How to write tests that actually catch failures. How to structure code that humans can maintain.</p><p>His deep dives make you think. Why do we accept 99% accuracy? When is "good enough" actually good enough?
How do you balance perfection with shipping?</p><p>After watching Zach, I started treating data code like <em>real</em> code. Version control became sacred. Tests became non-negotiable. Documentation became an act of kindness to future me.</p><h2><strong>The Books That Actually Matter</strong></h2><h3>1. Fundamentals of Data Engineering (Reis &amp; Housley)</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RCl4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89e5da0-714b-43db-9ae9-9907576aa62a_1951x2560.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RCl4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89e5da0-714b-43db-9ae9-9907576aa62a_1951x2560.jpeg 424w, https://substackcdn.com/image/fetch/$s_!RCl4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89e5da0-714b-43db-9ae9-9907576aa62a_1951x2560.jpeg 848w, https://substackcdn.com/image/fetch/$s_!RCl4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89e5da0-714b-43db-9ae9-9907576aa62a_1951x2560.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!RCl4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89e5da0-714b-43db-9ae9-9907576aa62a_1951x2560.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RCl4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89e5da0-714b-43db-9ae9-9907576aa62a_1951x2560.jpeg" width="383" height="502.42445054945057" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d89e5da0-714b-43db-9ae9-9907576aa62a_1951x2560.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1910,&quot;width&quot;:1456,&quot;resizeWidth&quot;:383,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Amazon.com: Fundamentals of Data Engineering: Plan and Build Robust Data  Systems: 9781098108304: Reis, Joe, Housley, Matt: Books&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Amazon.com: Fundamentals of Data Engineering: Plan and Build Robust Data  Systems: 9781098108304: Reis, Joe, Housley, Matt: Books" title="Amazon.com: Fundamentals of Data Engineering: Plan and Build Robust Data  Systems: 9781098108304: Reis, Joe, Housley, Matt: Books" srcset="https://substackcdn.com/image/fetch/$s_!RCl4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89e5da0-714b-43db-9ae9-9907576aa62a_1951x2560.jpeg 424w, https://substackcdn.com/image/fetch/$s_!RCl4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89e5da0-714b-43db-9ae9-9907576aa62a_1951x2560.jpeg 848w, https://substackcdn.com/image/fetch/$s_!RCl4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89e5da0-714b-43db-9ae9-9907576aa62a_1951x2560.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!RCl4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89e5da0-714b-43db-9ae9-9907576aa62a_1951x2560.jpeg 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><p>You could read 20 tool-specific books and still not understand data engineering like you will after reading this one. Reis and Housley don't teach you Spark or Airflow. They teach you how to <em>think</em>.</p><p>They introduce this concept called "undercurrents"&#8212;security, DataOps, orchestration, architecture&#8212;the invisible forces that affect everything you build. Once you see them, you can't unsee them.</p><p>But what really got me was their approach to the data engineering lifecycle. They map out every stage from generation to serving. Not as a checklist.
As a way of thinking about how data flows through systems.</p><p>After reading it, I stopped chasing shiny tools. Started asking better questions. <em>What problem does this solve? What happens at 3am when this breaks? What are we optimizing for&#8212;cost, speed, or simplicity?</em></p><p>This book doesn't age. The tools it mentions might become obsolete. The thinking never will.</p><h3><strong>2. Designing Data-Intensive Applications (Kleppmann)</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v6yo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd62c070a-5d25-495f-bdaa-482dadb3b490_400x525.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v6yo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd62c070a-5d25-495f-bdaa-482dadb3b490_400x525.jpeg 424w, https://substackcdn.com/image/fetch/$s_!v6yo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd62c070a-5d25-495f-bdaa-482dadb3b490_400x525.jpeg 848w, https://substackcdn.com/image/fetch/$s_!v6yo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd62c070a-5d25-495f-bdaa-482dadb3b490_400x525.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!v6yo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd62c070a-5d25-495f-bdaa-482dadb3b490_400x525.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!v6yo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd62c070a-5d25-495f-bdaa-482dadb3b490_400x525.jpeg" width="354" height="464.625" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d62c070a-5d25-495f-bdaa-482dadb3b490_400x525.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:525,&quot;width&quot;:400,&quot;resizeWidth&quot;:354,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Designing Data-Intensive Applications [Book]&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Designing Data-Intensive Applications [Book]" title="Designing Data-Intensive Applications [Book]" srcset="https://substackcdn.com/image/fetch/$s_!v6yo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd62c070a-5d25-495f-bdaa-482dadb3b490_400x525.jpeg 424w, https://substackcdn.com/image/fetch/$s_!v6yo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd62c070a-5d25-495f-bdaa-482dadb3b490_400x525.jpeg 848w, https://substackcdn.com/image/fetch/$s_!v6yo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd62c070a-5d25-495f-bdaa-482dadb3b490_400x525.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!v6yo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd62c070a-5d25-495f-bdaa-482dadb3b490_400x525.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Everyone in data engineering tells you to read this book. And they're right.</p><p>Kleppmann breaks down how Kafka, Cassandra, and all those distributed systems actually work under the hood. Not the marketing version. The real version&#8212;with all the ways they can fail.</p><p>The book's famous for its consistency models chapter. Apparently it's the moment when everything clicks. When you finally understand why distributed systems are hard. Why your data disappears. Why "it works on my machine" means nothing at scale.</p><p>What people love: Kleppmann explains the <em>why</em> behind every design choice. No tool worship. 
Just trade-offs. He shows you when MongoDB makes sense and when it'll ruin your life.</p><p>One small warning: it's dense. Like textbook dense. Multiple engineers told me they had to read chapters twice. But they all said it was worth it.</p><p>The consensus? This book turns you from someone who uses distributed systems into someone who understands them. You'll start seeing problems before they happen. You'll make better choices. You'll finally get why senior engineers are so paranoid about network partitions.</p><h2><strong>To Sum It Up Briefly</strong></h2><p>We're drowning in content but starving for wisdom. Everyone's selling courses. Everyone's promising to make you job-ready in 12 weeks.</p><p>These resources are different. They come from people who've built things that matter. Who've felt the weight of bad decisions. Who care more about making you <em>think</em> better than making you <em>code</em> faster.</p><p>Let the learning compound. Let it change how you see systems. Let it make you dangerous in the best way.</p><p>The tools will change. The platforms will evolve. But the ability to think clearly about data at scale? That's yours forever.</p><blockquote><p>Thank you for reading this far. See you in my next articles. Don't forget to subscribe to get more of this interesting data engineering content!</p></blockquote><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://dataheimer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Dataheimer Newsletter!
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[HyperLogLog: The Probabilistic Algorithm That's Faster Than Exact Counting]]></title><description><![CDATA[The smarter way to count distinct values in massive datasets]]></description><link>https://dataheimer.substack.com/p/hyperloglog-the-probabilistic-algorithm</link><guid isPermaLink="false">https://dataheimer.substack.com/p/hyperloglog-the-probabilistic-algorithm</guid><dc:creator><![CDATA[Subhan Hagverdiyev]]></dc:creator><pubDate>Mon, 14 Jul 2025 18:48:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!G1TC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9688787b-4779-4838-b0a2-15ce4a66e303_2560x1700.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><em>I'm Subhan Hagverdiyev and welcome to Dataheimer - where we explore the atomic impact of data.</em></p><p><em>Just like splitting an atom releases enormous energy, the right data engineering decisions can transform entire organizations.</em></p><p><em>This is where I break down complex concepts and share all the fascinating discoveries from my journey.</em></p><p><em>Want to join the adventure? 
Here you go:</em></p></blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataheimer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataheimer.substack.com/subscribe?"><span>Subscribe now</span></a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G1TC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9688787b-4779-4838-b0a2-15ce4a66e303_2560x1700.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G1TC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9688787b-4779-4838-b0a2-15ce4a66e303_2560x1700.heic 424w, https://substackcdn.com/image/fetch/$s_!G1TC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9688787b-4779-4838-b0a2-15ce4a66e303_2560x1700.heic 848w, https://substackcdn.com/image/fetch/$s_!G1TC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9688787b-4779-4838-b0a2-15ce4a66e303_2560x1700.heic 1272w, https://substackcdn.com/image/fetch/$s_!G1TC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9688787b-4779-4838-b0a2-15ce4a66e303_2560x1700.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G1TC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9688787b-4779-4838-b0a2-15ce4a66e303_2560x1700.heic" width="582" 
height="386.53434065934067" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9688787b-4779-4838-b0a2-15ce4a66e303_2560x1700.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:967,&quot;width&quot;:1456,&quot;resizeWidth&quot;:582,&quot;bytes&quot;:125567,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/168303035?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9688787b-4779-4838-b0a2-15ce4a66e303_2560x1700.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!G1TC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9688787b-4779-4838-b0a2-15ce4a66e303_2560x1700.heic 424w, https://substackcdn.com/image/fetch/$s_!G1TC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9688787b-4779-4838-b0a2-15ce4a66e303_2560x1700.heic 848w, https://substackcdn.com/image/fetch/$s_!G1TC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9688787b-4779-4838-b0a2-15ce4a66e303_2560x1700.heic 1272w, https://substackcdn.com/image/fetch/$s_!G1TC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9688787b-4779-4838-b0a2-15ce4a66e303_2560x1700.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the world of big data analytics, counting distinct elements efficiently is a fundamental challenge. </p><p>Traditional methods like <strong>COUNT(DISTINCT)</strong> in SQL can be computationally expensive and memory-intensive when dealing with massive datasets. This is where HyperLogLog (HLL) comes to the rescue&#8212;a probabilistic data structure that can estimate the cardinality (<em>number of unique elements)</em> of large datasets with remarkable accuracy while using minimal memory.</p><p>The algorithm has been widely adopted across the database ecosystem, from traditional SQL databases to NoSQL systems, distributed query engines, and cloud data warehouses. 
This widespread adoption demonstrates its universal value for cardinality estimation in modern data processing.</p><h2>Origins and History</h2><p>The algorithm has its roots in academic research dating back to the 1980s. Philippe Flajolet and G. Nigel Martin introduced the foundational concepts in their 1984 paper "Probabilistic Counting Algorithms for Data Base Applications", which led to the Flajolet-Martin algorithm.</p><p>The evolution continued with Flajolet developing the LogLog algorithm in 2003, followed by SuperLogLog. Finally, HyperLogLog emerged as an extension of the earlier LogLog algorithm, offering improved accuracy and efficiency.</p><h2>How HyperLogLog Works</h2><p>HyperLogLog operates on a clever mathematical principle: it analyzes the patterns in hashed values to estimate cardinality. The algorithm works by:</p><h3>1.  Hashing: </h3><p>Each input element is hashed to produce a uniform bit pattern</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!A9XD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20177003-fd2f-42e8-bb0f-c9f669ca0291_3549x1208.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!A9XD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20177003-fd2f-42e8-bb0f-c9f669ca0291_3549x1208.png 424w, https://substackcdn.com/image/fetch/$s_!A9XD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20177003-fd2f-42e8-bb0f-c9f669ca0291_3549x1208.png 848w, 
https://substackcdn.com/image/fetch/$s_!A9XD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20177003-fd2f-42e8-bb0f-c9f669ca0291_3549x1208.png 1272w, https://substackcdn.com/image/fetch/$s_!A9XD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20177003-fd2f-42e8-bb0f-c9f669ca0291_3549x1208.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!A9XD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20177003-fd2f-42e8-bb0f-c9f669ca0291_3549x1208.png" width="1456" height="496" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/20177003-fd2f-42e8-bb0f-c9f669ca0291_3549x1208.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:496,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:237756,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/168303035?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20177003-fd2f-42e8-bb0f-c9f669ca0291_3549x1208.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!A9XD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20177003-fd2f-42e8-bb0f-c9f669ca0291_3549x1208.png 424w, 
https://substackcdn.com/image/fetch/$s_!A9XD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20177003-fd2f-42e8-bb0f-c9f669ca0291_3549x1208.png 848w, https://substackcdn.com/image/fetch/$s_!A9XD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20177003-fd2f-42e8-bb0f-c9f669ca0291_3549x1208.png 1272w, https://substackcdn.com/image/fetch/$s_!A9XD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20177003-fd2f-42e8-bb0f-c9f669ca0291_3549x1208.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>2. Leading Zero Analysis</h3><p>Here's where HyperLogLog gets sophisticated. Instead of using the entire hash, we split it into two parts:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yz_s!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c4c0009-d29b-4834-afd0-66d94e918c76_3507x1075.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yz_s!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c4c0009-d29b-4834-afd0-66d94e918c76_3507x1075.png 424w, https://substackcdn.com/image/fetch/$s_!yz_s!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c4c0009-d29b-4834-afd0-66d94e918c76_3507x1075.png 848w, https://substackcdn.com/image/fetch/$s_!yz_s!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c4c0009-d29b-4834-afd0-66d94e918c76_3507x1075.png 1272w, https://substackcdn.com/image/fetch/$s_!yz_s!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c4c0009-d29b-4834-afd0-66d94e918c76_3507x1075.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yz_s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c4c0009-d29b-4834-afd0-66d94e918c76_3507x1075.png" width="1456" height="446" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1c4c0009-d29b-4834-afd0-66d94e918c76_3507x1075.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:446,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:215864,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/168303035?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c4c0009-d29b-4834-afd0-66d94e918c76_3507x1075.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yz_s!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c4c0009-d29b-4834-afd0-66d94e918c76_3507x1075.png 424w, https://substackcdn.com/image/fetch/$s_!yz_s!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c4c0009-d29b-4834-afd0-66d94e918c76_3507x1075.png 848w, https://substackcdn.com/image/fetch/$s_!yz_s!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c4c0009-d29b-4834-afd0-66d94e918c76_3507x1075.png 1272w, https://substackcdn.com/image/fetch/$s_!yz_s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c4c0009-d29b-4834-afd0-66d94e918c76_3507x1075.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Part 1: Bucket Selection (first b bits)</strong></p><ul><li><p>Use the first <code>b</code> bits to determine which bucket this element belongs to</p></li><li><p>This creates 2^b buckets (typically b=4 to b=16)</p></li><li><p>Each bucket will maintain its own statistics</p></li></ul><p><strong>Part 2: Rank Calculation (remaining bits)</strong></p><ul><li><p>Use the remaining (64-b) bits to calculate the rank</p></li></ul><p>For each hash, count the number of leading zeros in those remaining bits. This becomes your "rank" for that element. (Many implementations store this count plus one, i.e. the position of the first 1-bit.)</p><blockquote><p><strong>The mathematical insight</strong>: In a uniform random distribution, the probability of seeing exactly k leading zeros is 2^(-k-1). 
So if the longest run of leading zeros you have observed is k, you've likely seen on the order of 2^k unique elements.</p></blockquote><p>Each bucket keeps track of the maximum rank it has seen:</p><pre><code><code>Bucket 0: max_rank = 3
Bucket 1: max_rank = 7  
Bucket 2: max_rank = 2
...
Bucket 15: max_rank = 5</code></code></pre><h3>3. Final Cardinality Estimate</h3><p>The final estimate uses the harmonic mean of all bucket estimates to reduce variance:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_G0k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d0481e6-ccc8-428b-b96c-69c1c626d142_2461x1453.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_G0k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d0481e6-ccc8-428b-b96c-69c1c626d142_2461x1453.png 424w, https://substackcdn.com/image/fetch/$s_!_G0k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d0481e6-ccc8-428b-b96c-69c1c626d142_2461x1453.png 848w, https://substackcdn.com/image/fetch/$s_!_G0k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d0481e6-ccc8-428b-b96c-69c1c626d142_2461x1453.png 1272w, https://substackcdn.com/image/fetch/$s_!_G0k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d0481e6-ccc8-428b-b96c-69c1c626d142_2461x1453.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_G0k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d0481e6-ccc8-428b-b96c-69c1c626d142_2461x1453.png" width="535" height="316.00274725274727" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4d0481e6-ccc8-428b-b96c-69c1c626d142_2461x1453.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:860,&quot;width&quot;:1456,&quot;resizeWidth&quot;:535,&quot;bytes&quot;:204390,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/168303035?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d0481e6-ccc8-428b-b96c-69c1c626d142_2461x1453.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_G0k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d0481e6-ccc8-428b-b96c-69c1c626d142_2461x1453.png 424w, https://substackcdn.com/image/fetch/$s_!_G0k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d0481e6-ccc8-428b-b96c-69c1c626d142_2461x1453.png 848w, https://substackcdn.com/image/fetch/$s_!_G0k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d0481e6-ccc8-428b-b96c-69c1c626d142_2461x1453.png 1272w, https://substackcdn.com/image/fetch/$s_!_G0k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d0481e6-ccc8-428b-b96c-69c1c626d142_2461x1453.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p><strong>Why harmonic mean?</strong> Regular averaging would be dominated by buckets with very high ranks (outliers). The harmonic mean is more robust to outliers and provides better estimates</p></blockquote><h2>Key Advantages</h2><h3>Memory Efficiency</h3><p>The Redis HyperLogLog implementation uses up to 12 KB and provides a standard error of 0.81%. This means you can estimate the cardinality of billions of unique elements using just 12 KB of memory&#8212;a remarkable feat compared to traditional exact counting methods.</p><h3>Scalability</h3><p>Traditional exact unique counting in SQL uses O(unique elements) additional memory, making it impractical for large datasets. 
HyperLogLog uses constant memory regardless of the dataset size.</p><h3>Accuracy</h3><p>Despite being probabilistic, HyperLogLog provides highly accurate estimates. The error for very small cardinalities tends to be very small, making it suitable for real-world applications.</p><h2>HyperLogLog Across Database Systems</h2><p><strong>PostgreSQL</strong>: One of the first major SQL databases to embrace HyperLogLog through extensions like postgresql-hll, enabling efficient cardinality estimation for analytical workloads.</p><p><strong>Microsoft SQL Server</strong>: Implements approximate distinct counting capabilities that leverage HyperLogLog principles in functions like APPROX_COUNT_DISTINCT.</p><p><strong>Oracle Database</strong>: Provides HyperLogLog functionality through various approximate query processing features for large-scale analytics.</p><h3>Distributed Query Engines</h3><p><strong>Presto/Trino</strong>: Facebook implemented HyperLogLog in Presto with the APPROX_DISTINCT function, reducing query times from days to hours for large-scale cardinality estimation.</p><p><strong>Apache Spark</strong>: Integrates HyperLogLog through various libraries and built-in functions for distributed data processing.</p><p><strong>Apache Drill</strong>: Supports HyperLogLog for efficient approximate distinct counting in distributed environments.</p><h3>Cloud Data Warehouses</h3><p><strong>Google BigQuery</strong>: Provides comprehensive HLL functions (HLL_COUNT.EXTRACT, HLL_COUNT.MERGE) for large-scale analytics workloads.</p><p><strong>Amazon Redshift</strong>: Implements approximate distinct counting using HyperLogLog-based algorithms.</p><p><strong>Snowflake</strong>: Offers HLL functions for efficient cardinality estimation in cloud data warehousing scenarios.</p><p><strong>Azure Synapse Analytics</strong>: Includes HyperLogLog-based approximate functions for big data analytics.</p><h3>NoSQL and Specialized Systems</h3><p><strong>Redis</strong>: Popularized HyperLogLog 
with simple PFADD, PFCOUNT, and PFMERGE commands, making it accessible to application developers.</p><p><strong>Apache Druid</strong>: Uses HyperLogLog for real-time analytics and approximate cardinality queries.</p><p><strong>ClickHouse</strong>: Implements various HyperLogLog variants optimized for analytical workloads.</p><p><strong>Elasticsearch</strong>: Provides cardinality aggregations based on HyperLogLog++ for search analytics.</p><h3>SQL Usage Examples</h3><p>Here is a typical SQL pattern you could use in the real world with Snowflake. Let&#8217;s say we want to track active users over time. First, we generate some sample activity data:</p><pre><code>CREATE OR REPLACE TABLE user_activity AS
SELECT 
    DATEADD(day, UNIFORM(0, 29, RANDOM()), '2024-01-01')::DATE as activity_date,
    UNIFORM(1000, 9999, RANDOM()) as user_id,
    UNIFORM(1, 10, RANDOM()) as page_id
FROM TABLE(GENERATOR(ROWCOUNT =&gt; 100000));</code></pre><p>Next, we build one HLL sketch per day. Each sketch is a compact, mergeable summary of that day's distinct users:</p><pre><code>CREATE OR REPLACE TABLE daily_active_users AS
SELECT 
    activity_date,
    HLL_ACCUMULATE(user_id) as user_hll_sketch
FROM user_activity 
GROUP BY activity_date
ORDER BY activity_date;</code></pre><p>To calculate weekly active users, merge the daily sketches (an HLL union) and compare the estimate with the exact count:</p><pre><code>SELECT 
    DATE_TRUNC('week', activity_date) as week,
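    HLL_ESTIMATE(HLL_COMBINE(user_hll_sketch)) as estimated_wau,
    -- exact count for the same week, for comparison with the estimate
    (SELECT COUNT(DISTINCT ua.user_id)
     FROM user_activity ua 
     WHERE DATE_TRUNC('week', ua.activity_date) = DATE_TRUNC('week', dau.activity_date)) as exact_wau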
FROM daily_active_users dau
GROUP BY DATE_TRUNC('week', activity_date)
ORDER BY week;</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GJSs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5205676-6ab2-48c2-8a18-4e256c8fbb5f_1764x1292.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GJSs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5205676-6ab2-48c2-8a18-4e256c8fbb5f_1764x1292.heic 424w, https://substackcdn.com/image/fetch/$s_!GJSs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5205676-6ab2-48c2-8a18-4e256c8fbb5f_1764x1292.heic 848w, https://substackcdn.com/image/fetch/$s_!GJSs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5205676-6ab2-48c2-8a18-4e256c8fbb5f_1764x1292.heic 1272w, https://substackcdn.com/image/fetch/$s_!GJSs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5205676-6ab2-48c2-8a18-4e256c8fbb5f_1764x1292.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GJSs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5205676-6ab2-48c2-8a18-4e256c8fbb5f_1764x1292.heic" width="549" height="401.94642857142856" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e5205676-6ab2-48c2-8a18-4e256c8fbb5f_1764x1292.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1066,&quot;width&quot;:1456,&quot;resizeWidth&quot;:549,&quot;bytes&quot;:71998,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/168303035?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5205676-6ab2-48c2-8a18-4e256c8fbb5f_1764x1292.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GJSs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5205676-6ab2-48c2-8a18-4e256c8fbb5f_1764x1292.heic 424w, https://substackcdn.com/image/fetch/$s_!GJSs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5205676-6ab2-48c2-8a18-4e256c8fbb5f_1764x1292.heic 848w, https://substackcdn.com/image/fetch/$s_!GJSs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5205676-6ab2-48c2-8a18-4e256c8fbb5f_1764x1292.heic 1272w, https://substackcdn.com/image/fetch/$s_!GJSs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5205676-6ab2-48c2-8a18-4e256c8fbb5f_1764x1292.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Limitations and Considerations</h2><h3>Approximation Nature</h3><p>HyperLogLog provides estimates, not exact counts. While the error rate is typically low (around 0.81%), applications requiring exact counts should use traditional methods.</p><h3>Small Dataset Accuracy</h3><p>For very small datasets (fewer than 100 unique elements), exact counting might be more appropriate as the memory savings of HyperLogLog are minimal.</p><h3>Hash Function Dependency</h3><p>The accuracy of HyperLogLog depends on the quality of the hash function used. 
Poor hash functions can lead to biased estimates.</p><h2>Use Cases and Applications</h2><h3>Web Analytics</h3><ul><li><p>Counting unique visitors to websites</p></li><li><p>Tracking distinct page views</p></li><li><p>Measuring user engagement across different time periods</p></li></ul><h3>Database Analytics</h3><ul><li><p>Estimating table cardinalities for query optimization</p></li><li><p>Monitoring data distribution in large datasets</p></li><li><p>Real-time analytics dashboards</p></li></ul><h3>Stream Processing</h3><ul><li><p>Counting unique events in real-time data streams</p></li><li><p>Monitoring distinct users in live applications</p></li><li><p>Fraud detection systems</p></li></ul><h3>Data Warehousing</h3><ul><li><p>ETL pipeline optimization</p></li><li><p>Data quality assessment</p></li><li><p>Historical trend analysis</p></li></ul><h2>Conclusion</h2><p>Depending on your use case (some of them can't stand approximation!), HyperLogLog can be an incredibly useful thing in your toolbox!</p><p>HyperLogLog is particularly efficient at estimating the cardinality of large datasets:</p><ul><li><p>with relatively high accuracy <em>(typically ~0.81% error rate)</em></p></li><li><p>while using a small amount of memory and CPU <em>(~12KB memory vs gigabytes for exact counting)</em></p></li><li><p>enabling real-time analytics that would otherwise be impossible <em>(queries finishing in minutes instead of hours)</em></p></li></ul><p>Whether you're tracking daily active users, analyzing click streams, or optimizing database queries, HyperLogLog offers that sweet spot between accuracy and performance that makes big data analytics actually feasible. Just remember: when you absolutely need exact counts, stick with traditional methods&#8212;but for everything else, HLL is your friend!</p><blockquote><p>Thank you for reading this far. See you in my next articles. 
Don't forget to subscribe to get more of this interesting data engineering content!</p></blockquote>]]></content:encoded></item><item><title><![CDATA[Hash Tables vs B+ Trees: Why Databases Choose "Good Enough" Over Perfect]]></title><description><![CDATA[Why every major database defaults to B+ trees]]></description><link>https://dataheimer.substack.com/p/hash-tables-vs-b-trees-why-databases</link><guid isPermaLink="false">https://dataheimer.substack.com/p/hash-tables-vs-b-trees-why-databases</guid><dc:creator><![CDATA[Subhan Hagverdiyev]]></dc:creator><pubDate>Tue, 08 Jul 2025 17:26:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!vHIV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc483dea-4002-42d4-a882-bb185fd6f3ab_2433x1876.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><em>I'm Subhan Hagverdiyev and welcome to Dataheimer - where we explore the atomic impact of data. </em></p><p><em>Just like splitting an atom releases enormous energy, the right data engineering decisions can transform entire organizations.</em></p><p><em>This is where I break down complex concepts and share all the fascinating discoveries from my journey.</em></p><p><em>Want to join the adventure? 
Here you go:</em></p></blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataheimer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataheimer.substack.com/subscribe?"><span>Subscribe now</span></a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vHIV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc483dea-4002-42d4-a882-bb185fd6f3ab_2433x1876.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vHIV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc483dea-4002-42d4-a882-bb185fd6f3ab_2433x1876.png 424w, https://substackcdn.com/image/fetch/$s_!vHIV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc483dea-4002-42d4-a882-bb185fd6f3ab_2433x1876.png 848w, https://substackcdn.com/image/fetch/$s_!vHIV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc483dea-4002-42d4-a882-bb185fd6f3ab_2433x1876.png 1272w, https://substackcdn.com/image/fetch/$s_!vHIV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc483dea-4002-42d4-a882-bb185fd6f3ab_2433x1876.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vHIV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc483dea-4002-42d4-a882-bb185fd6f3ab_2433x1876.png" width="568" 
height="438.0934065934066" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cc483dea-4002-42d4-a882-bb185fd6f3ab_2433x1876.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1123,&quot;width&quot;:1456,&quot;resizeWidth&quot;:568,&quot;bytes&quot;:448387,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/167736000?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc483dea-4002-42d4-a882-bb185fd6f3ab_2433x1876.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vHIV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc483dea-4002-42d4-a882-bb185fd6f3ab_2433x1876.png 424w, https://substackcdn.com/image/fetch/$s_!vHIV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc483dea-4002-42d4-a882-bb185fd6f3ab_2433x1876.png 848w, https://substackcdn.com/image/fetch/$s_!vHIV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc483dea-4002-42d4-a882-bb185fd6f3ab_2433x1876.png 1272w, https://substackcdn.com/image/fetch/$s_!vHIV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc483dea-4002-42d4-a882-bb185fd6f3ab_2433x1876.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>At first glance, hash tables seem like the obvious winner when choosing a data structure for database indexing. After all, they offer average-case O(1) lookup time for exact matches. How can anything compete with that? B+ trees, by contrast, offer O(log n) performance, which feels sluggish in comparison.</p><p>But here's what's puzzling: PostgreSQL, MySQL, Oracle, and virtually every other major database system default to B+ trees for their primary indexes. Are the teams behind these systems making a fundamental performance mistake?</p><p>Spoiler alert: They're not.</p><h2>Intro</h2><p>Despite their O(1) lookup time, hash maps are not preferred for database indexing. 
In the real world, databases are asked to do more than just find a single record by its exact key.</p><p>Hash tables excel at one thing: fast, exact key lookups. But they fall flat when we need:</p><ul><li><p>Range queries (e.g., "find all users registered between January and March")</p></li><li><p>Sorting (e.g., "order by last name")</p></li><li><p>Prefix or partial matches (e.g., "name starts with 'Al'")</p></li></ul><p>B+ trees, on the other hand, gracefully support all of the above. Their structure maintains data in sorted order, enabling efficient in-order traversal, range filtering, and even multi-level indexing with minimal overhead.</p><p>We will now look at the strengths and internal structure of each.</p><blockquote><p><em>Disclaimer: We won&#8217;t go into too much detail while explaining each of them, since our goal is to find out why databases prefer one over the other for indexing. If there is interest, I will write about each of them in depth in future articles.</em></p></blockquote><h2>Hash table</h2><p>Let's start with what everyone learns in algorithms class:</p><p>A <strong>hash table</strong> is a data structure that maps keys to values using a <strong>hash function</strong> to compute an index into an array of buckets or slots.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T5zz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ed56bbe-d758-4575-ac55-3b11b2638ed6_2069x1068.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T5zz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ed56bbe-d758-4575-ac55-3b11b2638ed6_2069x1068.png 424w, 
https://substackcdn.com/image/fetch/$s_!T5zz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ed56bbe-d758-4575-ac55-3b11b2638ed6_2069x1068.png 848w, https://substackcdn.com/image/fetch/$s_!T5zz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ed56bbe-d758-4575-ac55-3b11b2638ed6_2069x1068.png 1272w, https://substackcdn.com/image/fetch/$s_!T5zz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ed56bbe-d758-4575-ac55-3b11b2638ed6_2069x1068.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T5zz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ed56bbe-d758-4575-ac55-3b11b2638ed6_2069x1068.png" width="1456" height="752" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ed56bbe-d758-4575-ac55-3b11b2638ed6_2069x1068.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:752,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:173832,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/167736000?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ed56bbe-d758-4575-ac55-3b11b2638ed6_2069x1068.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!T5zz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ed56bbe-d758-4575-ac55-3b11b2638ed6_2069x1068.png 424w, https://substackcdn.com/image/fetch/$s_!T5zz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ed56bbe-d758-4575-ac55-3b11b2638ed6_2069x1068.png 848w, https://substackcdn.com/image/fetch/$s_!T5zz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ed56bbe-d758-4575-ac55-3b11b2638ed6_2069x1068.png 1272w, https://substackcdn.com/image/fetch/$s_!T5zz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ed56bbe-d758-4575-ac55-3b11b2638ed6_2069x1068.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Internals of Hash Table</figcaption></figure></div><p>Let&#8217;s say we have 2 key-value pairs: <em>(Banana, 1)</em> and <em>(Apple, 2)</em>. We pass each key to the hash function, which returns a numeric hash code (for example, "Banana" &#8594; 102 and "Apple" &#8594; 155). The hash code is then converted to a valid array index by taking it modulo the array size. In our case the array size is 6:</p><ul><li><p>102 % 6 = 0 &#8594; index for "Banana"</p></li><li><p>155 % 6 = 5 &#8594; index for "Apple"</p></li></ul><p>Each index stores key-value pairs:</p><ul><li><p>Index 0 &#8594; ("Banana", 1)</p></li><li><p>Index 5 &#8594; ("Apple", 2)</p></li></ul><p>Since the hash codes map to different indices (0 and 5), there is no collision.</p><p>However, collisions happen when two keys hash to the same index:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mZla!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d73e811-b056-48b2-ab52-33fe1ced94c7_2349x1124.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mZla!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d73e811-b056-48b2-ab52-33fe1ced94c7_2349x1124.png 424w, 
https://substackcdn.com/image/fetch/$s_!mZla!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d73e811-b056-48b2-ab52-33fe1ced94c7_2349x1124.png 848w, https://substackcdn.com/image/fetch/$s_!mZla!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d73e811-b056-48b2-ab52-33fe1ced94c7_2349x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!mZla!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d73e811-b056-48b2-ab52-33fe1ced94c7_2349x1124.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mZla!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d73e811-b056-48b2-ab52-33fe1ced94c7_2349x1124.png" width="728" height="348.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2d73e811-b056-48b2-ab52-33fe1ced94c7_2349x1124.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:697,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:197616,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/167736000?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d73e811-b056-48b2-ab52-33fe1ced94c7_2349x1124.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!mZla!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d73e811-b056-48b2-ab52-33fe1ced94c7_2349x1124.png 424w, https://substackcdn.com/image/fetch/$s_!mZla!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d73e811-b056-48b2-ab52-33fe1ced94c7_2349x1124.png 848w, https://substackcdn.com/image/fetch/$s_!mZla!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d73e811-b056-48b2-ab52-33fe1ced94c7_2349x1124.png 1272w, https://substackcdn.com/image/fetch/$s_!mZla!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d73e811-b056-48b2-ab52-33fe1ced94c7_2349x1124.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Hash collision</figcaption></figure></div><p>We now add the pair (Mango, 3). Suppose "Mango" hashes to 204: since 204 % 6 = 0, it lands on index 0, which already holds ("Banana", 1). There are several ways of handling collisions. The most common one is simply to store all the entries assigned to the same index in a linked list (separate chaining). In this scenario, each list node must store the full key together with its value, so that a lookup can tell apart the keys that share an index.</p><h4>Performance:</h4><ul><li><p><strong>Average Case Lookup</strong>: <strong>O(1)</strong></p><ul><li><p>With a good hash function and a low collision rate, lookup, insert, and delete operations take constant time.</p></li></ul></li><li><p><strong>Best Case</strong>: <strong>O(1)</strong></p><ul><li><p>If there are no collisions, an operation touches a single bucket and is almost instantaneous.</p></li></ul></li><li><p><strong>Worst Case</strong>: <strong>O(n)</strong></p><ul><li><p>If the hash function causes many collisions or the table is poorly sized, most keys end up in the same bucket and performance degrades to linear time.</p></li></ul></li></ul><h2>B+ Trees</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!olrQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feee8bab9-7791-40ee-832c-91e5fc606bec_2289x1050.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!olrQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feee8bab9-7791-40ee-832c-91e5fc606bec_2289x1050.png 424w, https://substackcdn.com/image/fetch/$s_!olrQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feee8bab9-7791-40ee-832c-91e5fc606bec_2289x1050.png 848w, https://substackcdn.com/image/fetch/$s_!olrQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feee8bab9-7791-40ee-832c-91e5fc606bec_2289x1050.png 1272w, https://substackcdn.com/image/fetch/$s_!olrQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feee8bab9-7791-40ee-832c-91e5fc606bec_2289x1050.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!olrQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feee8bab9-7791-40ee-832c-91e5fc606bec_2289x1050.png" width="705" height="323.4478021978022" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eee8bab9-7791-40ee-832c-91e5fc606bec_2289x1050.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:668,&quot;width&quot;:1456,&quot;resizeWidth&quot;:705,&quot;bytes&quot;:143864,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/167736000?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feee8bab9-7791-40ee-832c-91e5fc606bec_2289x1050.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!olrQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feee8bab9-7791-40ee-832c-91e5fc606bec_2289x1050.png 424w, https://substackcdn.com/image/fetch/$s_!olrQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feee8bab9-7791-40ee-832c-91e5fc606bec_2289x1050.png 848w, https://substackcdn.com/image/fetch/$s_!olrQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feee8bab9-7791-40ee-832c-91e5fc606bec_2289x1050.png 1272w, https://substackcdn.com/image/fetch/$s_!olrQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feee8bab9-7791-40ee-832c-91e5fc606bec_2289x1050.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A <strong>B+ tree</strong> is a type of self-balancing tree data structure. Unlike a regular B-tree, a B+ tree separates its &#8220;table of contents&#8221; from its actual &#8220;content pages&#8221;.</p><p>In a B+ tree we keep all our actual data at the bottom level (the leaf nodes) and use the upper levels purely for navigation.</p><p>[7][17][25] is our <strong>root node</strong>. Its entries are pure navigation aids: they are <strong>key values</strong> that guide our search but don't store actual data. Think of them as highway signs pointing us in the right direction. </p><p>[1][4] &#8594; [7][10] &#8594; [17][19][21] &#8594; [25][31] are our leaf nodes. These contain the actual data (or pointers to the actual data), and each leaf links to the next one.</p><h3>The Magic of B+ Tree Search</h3><p>Here's where it gets interesting. 
When we search for a value, the B+ tree doesn't check every single piece of data. Instead, it uses those key values as a roadmap:</p><p><strong>Looking for the number 19?</strong></p><ol><li><p>Start at the root: [7][17][25] </p></li><li><p>Since 19 is at least 17 but less than 25, follow the path between 17 and 25</p></li><li><p>Land directly at the leaf node [17][19][21]</p></li><li><p>Found it! Only 2 node visits instead of potentially searching through dozens of records.</p></li></ol><p>The linked leaf nodes in the diagram are the reason B+ trees can answer range queries so quickly. If we need all records between 10 and 21, we just find the leaf containing the starting point and follow the links until we pass 21. No further descents from the root are needed!</p><h4>Performance:</h4><ul><li><p><strong>All Operations (Search, Insert, Delete)</strong>: <strong>O(log n)</strong></p><ul><li><p>Each operation takes logarithmic time because the tree remains balanced, and each node can have many children, which keeps the tree shallow.</p></li></ul></li></ul><h2>What Real Applications Actually Need</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ts5w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85d59524-4c46-411e-ac87-77e90d5d8e25_360x267.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ts5w!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85d59524-4c46-411e-ac87-77e90d5d8e25_360x267.gif 424w, https://substackcdn.com/image/fetch/$s_!ts5w!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85d59524-4c46-411e-ac87-77e90d5d8e25_360x267.gif 848w, 
https://substackcdn.com/image/fetch/$s_!ts5w!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85d59524-4c46-411e-ac87-77e90d5d8e25_360x267.gif 1272w, https://substackcdn.com/image/fetch/$s_!ts5w!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85d59524-4c46-411e-ac87-77e90d5d8e25_360x267.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ts5w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85d59524-4c46-411e-ac87-77e90d5d8e25_360x267.gif" width="360" height="267" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/85d59524-4c46-411e-ac87-77e90d5d8e25_360x267.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:267,&quot;width&quot;:360,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1714265,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/167736000?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85d59524-4c46-411e-ac87-77e90d5d8e25_360x267.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ts5w!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85d59524-4c46-411e-ac87-77e90d5d8e25_360x267.gif 424w, https://substackcdn.com/image/fetch/$s_!ts5w!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85d59524-4c46-411e-ac87-77e90d5d8e25_360x267.gif 
848w, https://substackcdn.com/image/fetch/$s_!ts5w!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85d59524-4c46-411e-ac87-77e90d5d8e25_360x267.gif 1272w, https://substackcdn.com/image/fetch/$s_!ts5w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85d59524-4c46-411e-ac87-77e90d5d8e25_360x267.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let's look at what a typical e-commerce application needs:</p><pre><code>-- Exact lookup (Hash tables excel here)
SELECT * FROM users WHERE user_id = 12345;

-- Range query (B+ tree is faster)
SELECT * FROM orders WHERE order_date &gt;= '2025-01-01' AND order_date &lt; '2025-02-01';

-- Sorting (B+ tree is faster)
SELECT * FROM products ORDER BY price ASC;

-- Partial matching (B+ tree is faster)
SELECT * FROM users WHERE email LIKE 'john%';
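-- Sketch, PostgreSQL-specific (index name is hypothetical; assumes email is
-- a text column): outside the C locale, a prefix match like the one above
-- can use a B-tree index only if it is built with a pattern operator class.
CREATE INDEX idx_users_email_prefix ON users (email text_pattern_ops);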

-- Top-N queries (B+ tree is faster)
SELECT * FROM products ORDER BY rating DESC LIMIT 10;</code></pre><p>Out of these five common query patterns, hash tables only help with one. B+ trees handle all of them efficiently.</p><h3>1. Range Queries</h3><p>B+ trees maintain sorted order, making range scans efficient. Need all orders from January? The B+ tree can find the first January record and scan the linked leaf pages sequentially until it passes the last one.</p><p>Hash tables? A hash function scatters neighboring keys across unrelated buckets, so we'd have to probe every possible date in January individually, or scan the whole table.</p><h3>2. Sorting</h3><p>Since B+ tree leaves are already sorted, getting ordered results is essentially free: an in-order walk of the leaves returns the data in the order we need.</p><p>Hash tables provide no ordering guarantees. To sort hash table results, we need a separate sorting step.</p><h3>3. Partial Matches</h3><p>In the query example we are looking for all emails starting with "john". Because every such email sorts next to the others, a B+ tree can find the first match and scan forward. The hash of "john" says nothing about where "johnny" lives, so hash tables require scanning the entire table.</p><h3>4. Memory Efficiency</h3><p>B+ trees have good locality for ordered access. Keys that are close in sort order are stored in the same or neighboring pages, which reduces disk I/O and improves cache performance.</p><p>Hash tables can have poor cache performance for the same workloads, because consecutive keys land in unrelated buckets and produce random access patterns.</p><h2>How Major Databases Use B+ Trees</h2><h3>PostgreSQL's Approach</h3><p>PostgreSQL offers both hash and B+ tree indexes but defaults to B+ trees for good reason:</p><pre><code>-- PostgreSQL's default
CREATE INDEX idx_user_id ON users(user_id);  -- Uses B+ tree
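-- Sketch (assumes PostgreSQL and the users table above): EXPLAIN shows whether
-- the planner picks idx_user_id for an equality lookup. The plan depends on
-- table size and statistics, so no output is shown here.
EXPLAIN SELECT * FROM users WHERE user_id = 12345;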

-- You can force hash indexes, but rarely should
CREATE INDEX idx_user_id_hash ON users USING HASH(user_id);</code></pre><p>The hash index is only useful for equality comparisons. Before PostgreSQL 10, hash indexes were also not WAL-logged, which made them unsafe after a crash and unusable on replicas. Newer versions do log them, but the equality-only restriction remains.</p><h3>MySQL's InnoDB Strategy</h3><p>MySQL's InnoDB engine uses B+ trees for its primary key and regular secondary indexes, and it does not let us create a hash index ourselves (its adaptive hash index is an internal optimization built on top of B+ tree pages). The design favors the consistency and versatility of B+ trees over the potential O(1) benefits of hash tables.</p><h3>Oracle's Balanced Approach</h3><p>Oracle uses B+ trees for most indexes but implements hash clusters for specific use cases where we genuinely only need exact lookups.</p><h2>When Hash Tables Actually Make Sense</h2><p>So far we have seen that B+ trees clearly win for general-purpose indexing. However, hash tables aren't useless. They excel in specific scenarios:</p><h3>1. Redis and In-Memory Stores</h3><p>Redis uses hash tables for O(1) key-value lookups because:</p><ul><li><p>Data fits in memory</p></li><li><p>Access patterns are primarily exact lookups</p></li><li><p>Range queries are handled by other data structures (sorted sets, lists)</p></li></ul><h3>2. Database Hash Joins</h3><p>During query execution, databases often build temporary hash tables for join operations:</p><pre><code>-- The query optimizer might build a hash table 
-- for the smaller table to speed up this join
SELECT * FROM small_table s
JOIN large_table l ON s.id = l.small_id;</code></pre><h3>3. Distributed Systems</h3><p>Systems like Cassandra use hash-based partitioning to distribute data across nodes, but still use other structures for local storage.</p><h2>Conclusion</h2><p>The choice between hash tables and B+ trees in database indexing isn't really about which data structure is "better" in isolation. It's about which one better serves the diverse needs of real-world applications.</p><p>While hash tables offer O(1) lookup time, databases aren't just glorified key-value stores. They're complex systems that need to handle everything from exact matches to range queries, from sorting operations to partial text searches. In this broader context, B+ trees emerge as the clear winner for general-purpose indexing.</p><p>The logarithmic O(log n) performance of B+ trees might seem inferior to hash tables' constant-time operations, but remember: each node holds hundreds of keys, so the tree stays only a few levels deep. In a well-balanced B+ tree with thousands or even millions of records, we&#8217;re typically looking at just 3-4 disk reads to find any piece of data.</p><p>This is why PostgreSQL, MySQL, Oracle, and other major database systems have converged on B+ trees as their default indexing strategy. They've chosen versatility and consistency over the narrow performance advantage that hash tables provide for exact lookups.</p><p>Hash tables still have their place in computing, particularly in in-memory systems, caching layers, and specialized use cases. But for the foundational task of database indexing, B+ trees reign supreme, and now you know why.</p><blockquote><p>Thank you for reading this far. See you in my next articles. 
Don't forget to subscribe to get more of this interesting data engineering content!"</p></blockquote>]]></content:encoded></item><item><title><![CDATA[Top 10 Data Engineering Research Papers to read]]></title><description><![CDATA[Must read data engineering research papers including MapReduce, Google Filesystem, Spark]]></description><link>https://dataheimer.substack.com/p/top-10-data-engineering-research</link><guid isPermaLink="false">https://dataheimer.substack.com/p/top-10-data-engineering-research</guid><dc:creator><![CDATA[Subhan Hagverdiyev]]></dc:creator><pubDate>Tue, 01 Jul 2025 18:27:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!eDBm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4bba6ff-ed22-42e0-907d-c53e133f189c_2288x961.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eDBm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4bba6ff-ed22-42e0-907d-c53e133f189c_2288x961.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eDBm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4bba6ff-ed22-42e0-907d-c53e133f189c_2288x961.png 424w, https://substackcdn.com/image/fetch/$s_!eDBm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4bba6ff-ed22-42e0-907d-c53e133f189c_2288x961.png 848w, https://substackcdn.com/image/fetch/$s_!eDBm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4bba6ff-ed22-42e0-907d-c53e133f189c_2288x961.png 
1272w, https://substackcdn.com/image/fetch/$s_!eDBm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4bba6ff-ed22-42e0-907d-c53e133f189c_2288x961.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eDBm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4bba6ff-ed22-42e0-907d-c53e133f189c_2288x961.png" width="1456" height="612" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e4bba6ff-ed22-42e0-907d-c53e133f189c_2288x961.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:612,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:81228,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/167279434?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4bba6ff-ed22-42e0-907d-c53e133f189c_2288x961.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eDBm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4bba6ff-ed22-42e0-907d-c53e133f189c_2288x961.png 424w, https://substackcdn.com/image/fetch/$s_!eDBm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4bba6ff-ed22-42e0-907d-c53e133f189c_2288x961.png 848w, 
https://substackcdn.com/image/fetch/$s_!eDBm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4bba6ff-ed22-42e0-907d-c53e133f189c_2288x961.png 1272w, https://substackcdn.com/image/fetch/$s_!eDBm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4bba6ff-ed22-42e0-907d-c53e133f189c_2288x961.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>I have seen quite a lot of AI papers circulating around the internet, and I am curious why data engineering papers are not shared as much. 
Today I am going to share the top data engineering papers that are must-reads.</p><ol><li><p><a href="https://research.google/pubs/mapreduce-simplified-data-processing-on-large-clusters/">MapReduce</a></p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ozn0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19350a4-278a-4504-a219-6fcb62be3ebf_1059x344.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ozn0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19350a4-278a-4504-a219-6fcb62be3ebf_1059x344.png 424w, https://substackcdn.com/image/fetch/$s_!ozn0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19350a4-278a-4504-a219-6fcb62be3ebf_1059x344.png 848w, https://substackcdn.com/image/fetch/$s_!ozn0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19350a4-278a-4504-a219-6fcb62be3ebf_1059x344.png 1272w, https://substackcdn.com/image/fetch/$s_!ozn0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19350a4-278a-4504-a219-6fcb62be3ebf_1059x344.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ozn0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19350a4-278a-4504-a219-6fcb62be3ebf_1059x344.png" width="631" height="204.9707271010387" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a19350a4-278a-4504-a219-6fcb62be3ebf_1059x344.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:344,&quot;width&quot;:1059,&quot;resizeWidth&quot;:631,&quot;bytes&quot;:113276,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/167279434?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19350a4-278a-4504-a219-6fcb62be3ebf_1059x344.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ozn0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19350a4-278a-4504-a219-6fcb62be3ebf_1059x344.png 424w, https://substackcdn.com/image/fetch/$s_!ozn0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19350a4-278a-4504-a219-6fcb62be3ebf_1059x344.png 848w, https://substackcdn.com/image/fetch/$s_!ozn0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19350a4-278a-4504-a219-6fcb62be3ebf_1059x344.png 1272w, https://substackcdn.com/image/fetch/$s_!ozn0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19350a4-278a-4504-a219-6fcb62be3ebf_1059x344.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><p>This paper from Google introduced a programming model and execution framework for processing large-scale data across distributed systems. 
By abstracting parallelization, fault tolerance, and load balancing, MapReduce made distributed computing accessible to engineers and sparked the big data revolution. It is probably fair to say that a large share of later systems research has been heavily influenced by MapReduce.</p><ol start="2"><li><p><a href="https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf">Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yXyv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3843cb67-1395-4135-8a63-a0bcead7645a_467x302.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yXyv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3843cb67-1395-4135-8a63-a0bcead7645a_467x302.heic 424w, https://substackcdn.com/image/fetch/$s_!yXyv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3843cb67-1395-4135-8a63-a0bcead7645a_467x302.heic 848w, https://substackcdn.com/image/fetch/$s_!yXyv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3843cb67-1395-4135-8a63-a0bcead7645a_467x302.heic 1272w, https://substackcdn.com/image/fetch/$s_!yXyv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3843cb67-1395-4135-8a63-a0bcead7645a_467x302.heic 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!yXyv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3843cb67-1395-4135-8a63-a0bcead7645a_467x302.heic" width="469" height="303.2933618843683" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3843cb67-1395-4135-8a63-a0bcead7645a_467x302.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:302,&quot;width&quot;:467,&quot;resizeWidth&quot;:469,&quot;bytes&quot;:19767,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/167279434?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3843cb67-1395-4135-8a63-a0bcead7645a_467x302.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yXyv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3843cb67-1395-4135-8a63-a0bcead7645a_467x302.heic 424w, https://substackcdn.com/image/fetch/$s_!yXyv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3843cb67-1395-4135-8a63-a0bcead7645a_467x302.heic 848w, https://substackcdn.com/image/fetch/$s_!yXyv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3843cb67-1395-4135-8a63-a0bcead7645a_467x302.heic 1272w, https://substackcdn.com/image/fetch/$s_!yXyv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3843cb67-1395-4135-8a63-a0bcead7645a_467x302.heic 
1456w" sizes="100vw"></picture></div></a></figure></div><p>This is the research paper behind the Spark cluster computing project at Berkeley. It presents RDDs as a key abstraction enabling fast, in-memory computations while maintaining fault tolerance by recomputing lost partitions from their lineage. 
It revolutionized batch and iterative data processing by offering better performance and easier programming models than MapReduce.</p></li><li><p><a href="https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec2024.pdf">What Goes Around Comes Around... And Around...</a></p><p>This paper revisits two decades of attempts to replace the relational model and SQL, from MapReduce and key-value stores to document, graph, and vector databases, and argues that the relational model remains dominant because SQL keeps absorbing the best ideas from its challengers. It also reviews architectural trends such as columnar storage and cloud-native systems, showing how past ideas in database architecture are being revived and optimized for modern analytical workloads.</p></li><li><p><a href="https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf">The Google File System</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QqiQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0bfb3b3-f5cb-428f-a394-a9ac6a603c82_1490x920.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QqiQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0bfb3b3-f5cb-428f-a394-a9ac6a603c82_1490x920.heic 424w, https://substackcdn.com/image/fetch/$s_!QqiQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0bfb3b3-f5cb-428f-a394-a9ac6a603c82_1490x920.heic 848w, https://substackcdn.com/image/fetch/$s_!QqiQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0bfb3b3-f5cb-428f-a394-a9ac6a603c82_1490x920.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!QqiQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0bfb3b3-f5cb-428f-a394-a9ac6a603c82_1490x920.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QqiQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0bfb3b3-f5cb-428f-a394-a9ac6a603c82_1490x920.heic" width="624" height="385.2857142857143" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f0bfb3b3-f5cb-428f-a394-a9ac6a603c82_1490x920.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:899,&quot;width&quot;:1456,&quot;resizeWidth&quot;:624,&quot;bytes&quot;:67584,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/167279434?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0bfb3b3-f5cb-428f-a394-a9ac6a603c82_1490x920.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QqiQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0bfb3b3-f5cb-428f-a394-a9ac6a603c82_1490x920.heic 424w, https://substackcdn.com/image/fetch/$s_!QqiQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0bfb3b3-f5cb-428f-a394-a9ac6a603c82_1490x920.heic 848w, 
https://substackcdn.com/image/fetch/$s_!QqiQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0bfb3b3-f5cb-428f-a394-a9ac6a603c82_1490x920.heic 1272w, https://substackcdn.com/image/fetch/$s_!QqiQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0bfb3b3-f5cb-428f-a394-a9ac6a603c82_1490x920.heic 1456w" sizes="100vw"></picture></div></a><figcaption class="image-caption">Google File System (source: <a href="https://brabalawuka.cc/posts/study/gfs/">brabalawuka's 
blog</a>)</figcaption></figure></div><p>A foundational work that details a scalable, distributed file system built to handle Google&#8217;s internal data needs. GFS is designed for fault tolerance, high throughput, and large files, and it laid the groundwork for systems like HDFS and Bigtable.</p></li><li><p><a href="https://people.csail.mit.edu/matei/papers/2013/sigmod_shark.pdf">Shark: SQL and Rich Analytics at Scale</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VUWm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba3dda2f-c226-41b0-b92a-d6c1e61a8a2b_1024x615.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VUWm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba3dda2f-c226-41b0-b92a-d6c1e61a8a2b_1024x615.heic 424w, https://substackcdn.com/image/fetch/$s_!VUWm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba3dda2f-c226-41b0-b92a-d6c1e61a8a2b_1024x615.heic 848w, https://substackcdn.com/image/fetch/$s_!VUWm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba3dda2f-c226-41b0-b92a-d6c1e61a8a2b_1024x615.heic 1272w, https://substackcdn.com/image/fetch/$s_!VUWm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba3dda2f-c226-41b0-b92a-d6c1e61a8a2b_1024x615.heic 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!VUWm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba3dda2f-c226-41b0-b92a-d6c1e61a8a2b_1024x615.heic" width="586" height="351.943359375" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ba3dda2f-c226-41b0-b92a-d6c1e61a8a2b_1024x615.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:615,&quot;width&quot;:1024,&quot;resizeWidth&quot;:586,&quot;bytes&quot;:27353,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/167279434?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba3dda2f-c226-41b0-b92a-d6c1e61a8a2b_1024x615.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VUWm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba3dda2f-c226-41b0-b92a-d6c1e61a8a2b_1024x615.heic 424w, https://substackcdn.com/image/fetch/$s_!VUWm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba3dda2f-c226-41b0-b92a-d6c1e61a8a2b_1024x615.heic 848w, https://substackcdn.com/image/fetch/$s_!VUWm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba3dda2f-c226-41b0-b92a-d6c1e61a8a2b_1024x615.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!VUWm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba3dda2f-c226-41b0-b92a-d6c1e61a8a2b_1024x615.heic 1456w" sizes="100vw"></picture></div></a></figure></div><p>Shark extended Apache Spark to support SQL queries alongside complex analytics. 
It bridged the gap between fast in-memory computation and declarative query languages, making it possible to unify ETL, analytics, and machine learning workloads. More importantly, the paper discusses why previous SQL-on-Hadoop/MapReduce query engines were slow.</p></li><li><p><a href="https://sites.cs.ucsb.edu/~rich/class/cs293b-cloud/papers/brewer-cap.pdf">CAP Twelve Years Later: How the "Rules" Have Changed</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XaEy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff9b285c-3366-475e-918d-da2b0c9f01f4_960x956.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XaEy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff9b285c-3366-475e-918d-da2b0c9f01f4_960x956.heic 424w, https://substackcdn.com/image/fetch/$s_!XaEy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff9b285c-3366-475e-918d-da2b0c9f01f4_960x956.heic 848w, https://substackcdn.com/image/fetch/$s_!XaEy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff9b285c-3366-475e-918d-da2b0c9f01f4_960x956.heic 1272w, https://substackcdn.com/image/fetch/$s_!XaEy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff9b285c-3366-475e-918d-da2b0c9f01f4_960x956.heic 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!XaEy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff9b285c-3366-475e-918d-da2b0c9f01f4_960x956.heic" width="307" height="305.72083333333336" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ff9b285c-3366-475e-918d-da2b0c9f01f4_960x956.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:956,&quot;width&quot;:960,&quot;resizeWidth&quot;:307,&quot;bytes&quot;:27910,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/167279434?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff9b285c-3366-475e-918d-da2b0c9f01f4_960x956.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XaEy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff9b285c-3366-475e-918d-da2b0c9f01f4_960x956.heic 424w, https://substackcdn.com/image/fetch/$s_!XaEy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff9b285c-3366-475e-918d-da2b0c9f01f4_960x956.heic 848w, https://substackcdn.com/image/fetch/$s_!XaEy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff9b285c-3366-475e-918d-da2b0c9f01f4_960x956.heic 1272w, https://substackcdn.com/image/fetch/$s_!XaEy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff9b285c-3366-475e-918d-da2b0c9f01f4_960x956.heic 
1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The CAP theorem, proposed by Eric Brewer, asserts that any networked shared-data system can have only two of three desirable properties: consistency, availability, and partition tolerance. A number of NoSQL stores reference CAP to justify their decision to sacrifice consistency. This article revisits the theorem twelve years after its original formulation. 
It clarifies misconceptions, explores real-world implications, and offers a more nuanced understanding of trade-offs in distributed systems.</p></li><li><p><a href="https://dsf.berkeley.edu/papers/fntdb07-architecture.pdf">Architecture of a Database System</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!U0l9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e3c041-5d70-493a-aee3-8166112c127a_1398x928.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!U0l9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e3c041-5d70-493a-aee3-8166112c127a_1398x928.heic 424w, https://substackcdn.com/image/fetch/$s_!U0l9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e3c041-5d70-493a-aee3-8166112c127a_1398x928.heic 848w, https://substackcdn.com/image/fetch/$s_!U0l9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e3c041-5d70-493a-aee3-8166112c127a_1398x928.heic 1272w, https://substackcdn.com/image/fetch/$s_!U0l9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e3c041-5d70-493a-aee3-8166112c127a_1398x928.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!U0l9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e3c041-5d70-493a-aee3-8166112c127a_1398x928.heic" width="535" height="355.13590844062946" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c8e3c041-5d70-493a-aee3-8166112c127a_1398x928.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:928,&quot;width&quot;:1398,&quot;resizeWidth&quot;:535,&quot;bytes&quot;:73749,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/167279434?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e3c041-5d70-493a-aee3-8166112c127a_1398x928.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!U0l9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e3c041-5d70-493a-aee3-8166112c127a_1398x928.heic 424w, https://substackcdn.com/image/fetch/$s_!U0l9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e3c041-5d70-493a-aee3-8166112c127a_1398x928.heic 848w, https://substackcdn.com/image/fetch/$s_!U0l9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e3c041-5d70-493a-aee3-8166112c127a_1398x928.heic 1272w, https://substackcdn.com/image/fetch/$s_!U0l9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e3c041-5d70-493a-aee3-8166112c127a_1398x928.heic 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>This survey paper dissects the major components and design principles of modern database systems, from storage and indexing to query processing and transaction management. 
It's a comprehensive blueprint for understanding how databases work under the hood.</p></li><li><p><a href="https://notes.stephenholiday.com/Kafka.pdf">Kafka: a Distributed Messaging System for Log Processing</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UxH4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8c62595-f22f-40c8-9b32-9303a248ca0b_3401x2311.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UxH4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8c62595-f22f-40c8-9b32-9303a248ca0b_3401x2311.heic 424w, https://substackcdn.com/image/fetch/$s_!UxH4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8c62595-f22f-40c8-9b32-9303a248ca0b_3401x2311.heic 848w, https://substackcdn.com/image/fetch/$s_!UxH4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8c62595-f22f-40c8-9b32-9303a248ca0b_3401x2311.heic 1272w, https://substackcdn.com/image/fetch/$s_!UxH4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8c62595-f22f-40c8-9b32-9303a248ca0b_3401x2311.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UxH4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8c62595-f22f-40c8-9b32-9303a248ca0b_3401x2311.heic" width="568" height="385.81868131868134" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f8c62595-f22f-40c8-9b32-9303a248ca0b_3401x2311.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:989,&quot;width&quot;:1456,&quot;resizeWidth&quot;:568,&quot;bytes&quot;:120386,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/167279434?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8c62595-f22f-40c8-9b32-9303a248ca0b_3401x2311.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UxH4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8c62595-f22f-40c8-9b32-9303a248ca0b_3401x2311.heic 424w, https://substackcdn.com/image/fetch/$s_!UxH4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8c62595-f22f-40c8-9b32-9303a248ca0b_3401x2311.heic 848w, https://substackcdn.com/image/fetch/$s_!UxH4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8c62595-f22f-40c8-9b32-9303a248ca0b_3401x2311.heic 1272w, https://substackcdn.com/image/fetch/$s_!UxH4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8c62595-f22f-40c8-9b32-9303a248ca0b_3401x2311.heic 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>LinkedIn introduced Kafka as a high-throughput, distributed messaging system designed for log data. This paper outlines its architecture and how it decouples producers and consumers while providing fault tolerance and scalability; Kafka has since become the backbone of many real-time data pipelines.</p></li><li><p><a href="https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf">Bigtable: A Distributed Storage System for Structured Data</a></p><p>Google&#8217;s Bigtable paper describes a distributed, sparse, and scalable data storage system that supports structured data. 
It underpins services like Google Analytics and Search, and inspired open-source projects such as HBase and Cassandra.</p><p></p></li><li><p><a href="https://www.amazon.science/publications/dynamo-amazons-highly-available-key-value-store">Dynamo: Amazon&#8217;s highly available key-value store</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ryf1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1592d670-fc4f-4ee6-b7c5-649c84522566_1400x688.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ryf1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1592d670-fc4f-4ee6-b7c5-649c84522566_1400x688.heic 424w, https://substackcdn.com/image/fetch/$s_!ryf1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1592d670-fc4f-4ee6-b7c5-649c84522566_1400x688.heic 848w, https://substackcdn.com/image/fetch/$s_!ryf1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1592d670-fc4f-4ee6-b7c5-649c84522566_1400x688.heic 1272w, https://substackcdn.com/image/fetch/$s_!ryf1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1592d670-fc4f-4ee6-b7c5-649c84522566_1400x688.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ryf1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1592d670-fc4f-4ee6-b7c5-649c84522566_1400x688.heic" width="657" height="322.86857142857144" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1592d670-fc4f-4ee6-b7c5-649c84522566_1400x688.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:688,&quot;width&quot;:1400,&quot;resizeWidth&quot;:657,&quot;bytes&quot;:61215,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/167279434?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1592d670-fc4f-4ee6-b7c5-649c84522566_1400x688.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ryf1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1592d670-fc4f-4ee6-b7c5-649c84522566_1400x688.heic 424w, https://substackcdn.com/image/fetch/$s_!ryf1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1592d670-fc4f-4ee6-b7c5-649c84522566_1400x688.heic 848w, https://substackcdn.com/image/fetch/$s_!ryf1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1592d670-fc4f-4ee6-b7c5-649c84522566_1400x688.heic 1272w, https://substackcdn.com/image/fetch/$s_!ryf1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1592d670-fc4f-4ee6-b7c5-649c84522566_1400x688.heic 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Amazon&#8217;s Dynamo paper presents a distributed key-value storage system optimized for high availability and scalability. 
It combined techniques such as eventual consistency, consistent hashing, and vector clocks, influencing systems like Cassandra, Riak, and DynamoDB.</p><p></p><p><br>If you would like to read more papers of this type, check out the GitHub repository created by Reynold Xin, cofounder of Databricks: <a href="https://github.com/rxin/db-readings?tab=readme-ov-file">https://github.com/rxin/db-readings?tab=readme-ov-file</a></p><p></p><p><br>If you enjoyed this, consider subscribing to help this newsletter grow!</p><p></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataheimer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataheimer.substack.com/subscribe?"><span>Subscribe now</span></a></p><p></p></li></ol><p></p>]]></content:encoded></item><item><title><![CDATA[The Ultimate Guide to Change Data Capture Tools: Choosing the Right CDC Solution in 2025]]></title><description><![CDATA[Comparison of different CDC tools]]></description><link>https://dataheimer.substack.com/p/the-ultimate-guide-to-change-data</link><guid isPermaLink="false">https://dataheimer.substack.com/p/the-ultimate-guide-to-change-data</guid><dc:creator><![CDATA[Subhan Hagverdiyev]]></dc:creator><pubDate>Fri, 27 Jun 2025 18:33:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!f79q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e2c2d94-1a18-4cd6-b498-a5e8eb70275d_1614x948.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Change Data Capture (CDC) has become the backbone of modern data architectures, enabling real-time data synchronization between systems. But with so many options available, how do you choose the right tool for your organization? 
I&#8217;ve analyzed the leading CDC solutions to help you make an informed decision.</p><h3>The CDC Revolution</h3><p>Traditional data integration relied on batch processing &#8211; ETL jobs that ran nightly or hourly, leaving businesses operating on yesterday's data. CDC changed that by monitoring database transaction logs and streaming changes as they happen. Instead of asking "What changed since last night?", CDC answers "What just changed?" in real time.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f79q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e2c2d94-1a18-4cd6-b498-a5e8eb70275d_1614x948.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f79q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e2c2d94-1a18-4cd6-b498-a5e8eb70275d_1614x948.webp 424w, https://substackcdn.com/image/fetch/$s_!f79q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e2c2d94-1a18-4cd6-b498-a5e8eb70275d_1614x948.webp 848w, https://substackcdn.com/image/fetch/$s_!f79q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e2c2d94-1a18-4cd6-b498-a5e8eb70275d_1614x948.webp 1272w, https://substackcdn.com/image/fetch/$s_!f79q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e2c2d94-1a18-4cd6-b498-a5e8eb70275d_1614x948.webp 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!f79q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e2c2d94-1a18-4cd6-b498-a5e8eb70275d_1614x948.webp" width="1456" height="855" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e2c2d94-1a18-4cd6-b498-a5e8eb70275d_1614x948.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:855,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:29606,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/166989905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e2c2d94-1a18-4cd6-b498-a5e8eb70275d_1614x948.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!f79q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e2c2d94-1a18-4cd6-b498-a5e8eb70275d_1614x948.webp 424w, https://substackcdn.com/image/fetch/$s_!f79q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e2c2d94-1a18-4cd6-b498-a5e8eb70275d_1614x948.webp 848w, https://substackcdn.com/image/fetch/$s_!f79q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e2c2d94-1a18-4cd6-b498-a5e8eb70275d_1614x948.webp 1272w, https://substackcdn.com/image/fetch/$s_!f79q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e2c2d94-1a18-4cd6-b498-a5e8eb70275d_1614x948.webp 1456w" 
sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: <a href="https://www.striim.com/blog/change-data-capture-cdc-what-it-is-and-how-it-works/">Striim</a></figcaption></figure></div><p>So what happens when you make a change in the database? Let&#8217;s say:</p><p>You have a customer database table:</p><pre><code>ID    Name    Email 
1     Alice   <a href="mailto:alice@email.com">alice@email.com</a> 
2     Bob     <a href="mailto:bob@email.com">bob@email.com</a></code></pre><h3>Changes:</h3><ol><li><p>Bob updates his email.</p></li><li><p>A new customer, Carol, is added.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e1c1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8285709-4e6e-49a0-a5d6-9c832ced672b_1300x740.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e1c1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8285709-4e6e-49a0-a5d6-9c832ced672b_1300x740.heic 424w, https://substackcdn.com/image/fetch/$s_!e1c1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8285709-4e6e-49a0-a5d6-9c832ced672b_1300x740.heic 848w, https://substackcdn.com/image/fetch/$s_!e1c1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8285709-4e6e-49a0-a5d6-9c832ced672b_1300x740.heic 1272w, https://substackcdn.com/image/fetch/$s_!e1c1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8285709-4e6e-49a0-a5d6-9c832ced672b_1300x740.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e1c1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8285709-4e6e-49a0-a5d6-9c832ced672b_1300x740.heic" width="513" height="292.0153846153846" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d8285709-4e6e-49a0-a5d6-9c832ced672b_1300x740.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:740,&quot;width&quot;:1300,&quot;resizeWidth&quot;:513,&quot;bytes&quot;:58804,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/166989905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8285709-4e6e-49a0-a5d6-9c832ced672b_1300x740.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!e1c1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8285709-4e6e-49a0-a5d6-9c832ced672b_1300x740.heic 424w, https://substackcdn.com/image/fetch/$s_!e1c1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8285709-4e6e-49a0-a5d6-9c832ced672b_1300x740.heic 848w, https://substackcdn.com/image/fetch/$s_!e1c1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8285709-4e6e-49a0-a5d6-9c832ced672b_1300x740.heic 1272w, https://substackcdn.com/image/fetch/$s_!e1c1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8285709-4e6e-49a0-a5d6-9c832ced672b_1300x740.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This output can be sent to analytics systems, caches, or data lakes for real-time updates.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://dataheimer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Dataheimer Newsletter! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h3>Why Low Latency Matters for Streaming</h3><p>Modern dashboards need live metrics, not hour-old snapshots. A 5-second delay in detecting a system outage could mean millions in lost revenue. Organizations using real-time CDC report:</p><ul><li><p>60% faster decision-making with live dashboards</p></li><li><p>40% fewer customer service issues, thanks to consistent data</p></li><li><p>25% improvement in fraud detection accuracy</p></li><li><p>Significant cost savings from preventing data errors</p></li></ul><p>But with great power comes great complexity. The CDC tool you choose will determine whether you achieve these benefits or struggle with operational overhead. Let's explore your options.</p><h3>&#128295; <strong>Open Source Champions</strong></h3><h4><strong>Debezium + Kafka</strong> <em>The streaming powerhouse</em></h4><p>Debezium turns your database's transaction log into Kafka topics of change events, delivering near-real-time CDC with latency that typically ranges from under a second to a couple of seconds. 
Built on Kafka Connect, it supports most major databases and integrates directly with the Kafka ecosystem.</p><p><strong>Best for:</strong> Teams already invested in Kafka infrastructure <br><strong>Strengths:</strong> Zero licensing costs, highly flexible, mature community <br><strong>Watch out for:</strong> Operational complexity, requires Kafka expertise, at-least-once delivery by default (consumers should handle duplicates) <br><strong>Latency:</strong> ~0.5 &#8211; 2 seconds<br><strong>Cost:</strong> Free (infrastructure costs apply)</p><p><strong>Architecture:</strong> Debezium runs as a set of source connectors on a Kafka Connect cluster. Each connector (MySQL, Postgres, MongoDB, etc.) reads its source database&#8217;s change log (for example, the MySQL binlog or a Postgres logical replication stream) and writes change events to Kafka topics. Kafka Connect offset topics (or an external store) track connector state. Debezium also offers a standalone Debezium Server (no Kafka) that sends events to Kinesis, Pub/Sub, Pulsar, etc. Users often build full pipelines by pairing Debezium (CDC) sources with Kafka Connect sinks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!259Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3270198-8592-4aba-b91b-5759c5651aae_1400x549.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!259Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3270198-8592-4aba-b91b-5759c5651aae_1400x549.png 424w, https://substackcdn.com/image/fetch/$s_!259Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3270198-8592-4aba-b91b-5759c5651aae_1400x549.png 848w, 
https://substackcdn.com/image/fetch/$s_!259Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3270198-8592-4aba-b91b-5759c5651aae_1400x549.png 1272w, https://substackcdn.com/image/fetch/$s_!259Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3270198-8592-4aba-b91b-5759c5651aae_1400x549.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!259Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3270198-8592-4aba-b91b-5759c5651aae_1400x549.png" width="1400" height="549" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3270198-8592-4aba-b91b-5759c5651aae_1400x549.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:549,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:122296,&quot;alt&quot;:&quot;Architecture of Debezium + Kafka CDC method&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/166989905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3270198-8592-4aba-b91b-5759c5651aae_1400x549.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Architecture of Debezium + Kafka CDC method" title="Architecture of Debezium + Kafka CDC method" srcset="https://substackcdn.com/image/fetch/$s_!259Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3270198-8592-4aba-b91b-5759c5651aae_1400x549.png 424w, 
https://substackcdn.com/image/fetch/$s_!259Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3270198-8592-4aba-b91b-5759c5651aae_1400x549.png 848w, https://substackcdn.com/image/fetch/$s_!259Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3270198-8592-4aba-b91b-5759c5651aae_1400x549.png 1272w, https://substackcdn.com/image/fetch/$s_!259Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3270198-8592-4aba-b91b-5759c5651aae_1400x549.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Architecture of Debezium + Kafka CDC method</figcaption></figure></div><h4><strong>Apache Flink CDC</strong> <em>The exactly-once guarantee</em></h4><p>Flink CDC combines the power of Apache Flink's stream processing with robust CDC capabilities. Among the open-source options covered here, it stands out for offering exactly-once processing out of the box, backed by Flink's checkpointing (end-to-end guarantees still depend on the sink).</p><p><strong>Best for:</strong> Organizations needing complex real-time transformations<br><strong>Strengths:</strong> Exactly-once processing, automatic schema evolution, extremely low latency <br><strong>Watch out for:</strong> Steep learning curve, requires Flink cluster management<br><strong>Latency:</strong> Sub-second (often &lt;100ms) <br><strong>Cost:</strong> Free (infrastructure costs apply)</p><p><strong>Architecture:</strong> A Flink cluster executes Flink CDC jobs defined via YAML or code. Each job uses Flink&#8217;s <strong>DataStream API</strong> (or Table API) with a CDC source connector. The job can apply transformations and write to sinks (e.g. Kafka, JDBC, Iceberg). Flink CDC normally runs in streaming mode for continuous flow. It also supports <strong>batch</strong> operation by scanning static data or performing periodic snapshots. State (offsets, transformation state) is managed by Flink&#8217;s checkpointing and state backends (e.g. 
RocksDB).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kzz9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45360017-f102-40b4-8331-fa6e854daab4_1589x552.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kzz9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45360017-f102-40b4-8331-fa6e854daab4_1589x552.heic 424w, https://substackcdn.com/image/fetch/$s_!kzz9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45360017-f102-40b4-8331-fa6e854daab4_1589x552.heic 848w, https://substackcdn.com/image/fetch/$s_!kzz9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45360017-f102-40b4-8331-fa6e854daab4_1589x552.heic 1272w, https://substackcdn.com/image/fetch/$s_!kzz9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45360017-f102-40b4-8331-fa6e854daab4_1589x552.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kzz9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45360017-f102-40b4-8331-fa6e854daab4_1589x552.heic" width="1456" height="506" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/45360017-f102-40b4-8331-fa6e854daab4_1589x552.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:506,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:57202,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/166989905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45360017-f102-40b4-8331-fa6e854daab4_1589x552.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kzz9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45360017-f102-40b4-8331-fa6e854daab4_1589x552.heic 424w, https://substackcdn.com/image/fetch/$s_!kzz9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45360017-f102-40b4-8331-fa6e854daab4_1589x552.heic 848w, https://substackcdn.com/image/fetch/$s_!kzz9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45360017-f102-40b4-8331-fa6e854daab4_1589x552.heic 1272w, https://substackcdn.com/image/fetch/$s_!kzz9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45360017-f102-40b4-8331-fa6e854daab4_1589x552.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">An example of Flink CDC architecture </figcaption></figure></div><h4><strong>Airbyte</strong> <em>The connector ecosystem</em></h4><p>With 400+ connectors, Airbyte democratizes data integration. While not purely streaming (it runs frequent batch jobs), it's user-friendly and cost-effective for many use cases. 
However, I think it is a poor fit for pure streaming, because every sync adds minutes of latency.</p><p><strong>Best for:</strong> Startups and teams wanting low-cost, easy setup<br><strong>Strengths:</strong> Massive connector library, user-friendly UI, open-source with cloud option <br><strong>Watch out for:</strong> Batch-oriented approach, 5+ minute minimum intervals <br><strong>Latency:</strong> 5+ minutes<br><strong>Cost:</strong> Free self-hosted, usage-based cloud pricing</p><p><strong>Architecture:</strong> A central Airbyte server (cloud service or self-hosted) orchestrates jobs. Connectors run in Docker containers. Users configure sources and destinations in the UI (or via API). For CDC sources, Airbyte schedules frequent sync jobs (it treats CDC sources as regular syncs at high frequency). There&#8217;s no persistent streaming &#8211; each sync job pulls the changes made since the last sync.</p><h3>&#127970; <strong>Enterprise Solutions</strong></h3><h4>Fivetran (Commercial SaaS)</h4><p><strong>Architecture:</strong> Fully managed SaaS. Fivetran separates the user&#8217;s environment, Fivetran cloud, and customer cloud for security and performance. When a connector sync is due, Fivetran&#8217;s backend spins up transient worker processes (in Fivetran-managed infrastructure) to extract and load data. In practice, users log in to Fivetran&#8217;s dashboard to create connectors. The Fivetran control plane stores configuration, and its execution engine (workers) handles the actual data movement. Fivetran parallelizes work on demand to maximize throughput. 
</p><p><strong>Best for:</strong> Organizations prioritizing reliability and support over cost<br><strong>Strengths:</strong> Zero maintenance, automatic schema handling, comprehensive connector catalog<br><strong>Watch out for:</strong> High costs at scale, vendor lock-in, black-box operations<br><strong>Latency:</strong> Under 5 minutes <br><strong>Cost:</strong> Usage-based (can be expensive)</p><p>I think Fivetran may not be ideal for very low-latency (sub-1-minute) or custom pipelines.</p><h4>Qlik Replicate (formerly Attunity)</h4><p>A mature, GUI-driven solution designed for mission-critical enterprise replication. It offers exactly-once delivery and even supports mainframe systems.</p><p><strong>Best for:</strong> Large enterprises with complex, heterogeneous environments <br><strong>Strengths:</strong> Exactly-once guarantees, minimal source impact, comprehensive enterprise features <br><strong>Watch out for:</strong> High licensing costs, Windows-centric, heavyweight for small projects <br><strong>Latency:</strong> Sub-second to seconds <br><strong>Cost:</strong> High (perpetual licenses)</p><p><strong>Architecture:</strong> Typically installed as a server with agent components. Qlik has a <strong>listener</strong> that taps DB transaction logs (or triggers) and an <strong>in-memory engine</strong> that stages changes. It features two data paths: an ongoing CDC stream for updates and a batch loader for full loads/backfills. A central <em>checkpoint store</em> tracks progress for exactly-once delivery. 
Configuration is done via a GUI or CLI.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iBWN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc33bd274-8e9b-466c-ad18-9b2f5576ec8a_1335x753.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iBWN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc33bd274-8e9b-466c-ad18-9b2f5576ec8a_1335x753.heic 424w, https://substackcdn.com/image/fetch/$s_!iBWN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc33bd274-8e9b-466c-ad18-9b2f5576ec8a_1335x753.heic 848w, https://substackcdn.com/image/fetch/$s_!iBWN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc33bd274-8e9b-466c-ad18-9b2f5576ec8a_1335x753.heic 1272w, https://substackcdn.com/image/fetch/$s_!iBWN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc33bd274-8e9b-466c-ad18-9b2f5576ec8a_1335x753.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iBWN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc33bd274-8e9b-466c-ad18-9b2f5576ec8a_1335x753.heic" width="1335" height="753" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c33bd274-8e9b-466c-ad18-9b2f5576ec8a_1335x753.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:753,&quot;width&quot;:1335,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:38763,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/166989905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc33bd274-8e9b-466c-ad18-9b2f5576ec8a_1335x753.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iBWN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc33bd274-8e9b-466c-ad18-9b2f5576ec8a_1335x753.heic 424w, https://substackcdn.com/image/fetch/$s_!iBWN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc33bd274-8e9b-466c-ad18-9b2f5576ec8a_1335x753.heic 848w, https://substackcdn.com/image/fetch/$s_!iBWN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc33bd274-8e9b-466c-ad18-9b2f5576ec8a_1335x753.heic 1272w, https://substackcdn.com/image/fetch/$s_!iBWN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc33bd274-8e9b-466c-ad18-9b2f5576ec8a_1335x753.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Qlik architecture</figcaption></figure></div><h3>&#9729;&#65039; <strong>Cloud-Native Options</strong></h3><h4><strong>AWS DMS</strong></h4><p>AWS DMS is a <strong>managed</strong> cloud service for migrating and replicating databases. It supports both full load and ongoing CDC for many sources and targets.</p><p><strong>Best for:</strong> AWS migrations and simple replications <br><strong>Strengths:</strong> Fully managed, good for AWS ecosystems, handles legacy sources <br><strong>Watch out for:</strong> At least one endpoint (source or target) must be in AWS, higher latency, frequent task failures reported <br><strong>Latency:</strong> <strong>Not</strong> sub-second; typically seconds to minutes<br><strong>Cost:</strong> Per-hour instance pricing</p><p><strong>Architecture:</strong> AWS-managed. 
You create &#8220;tasks&#8221; that run on DMS replication instances (AWS EC2 under the hood). The instance connects to a source DB and a target. DMS handles log reading (CDC) internally. AWS operates the underlying servers, but with standard DMS you still choose and size the replication instance yourself; the separate DMS Serverless option scales capacity automatically.</p><p>AWS itself notes that DMS &#8220;is not a streaming system&#8221; and that latency can vary and is hard to monitor.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Aeuz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F783f8847-f8c1-4ae2-bbe9-a9ed9b0f7309_1000x302.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Aeuz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F783f8847-f8c1-4ae2-bbe9-a9ed9b0f7309_1000x302.heic 424w, https://substackcdn.com/image/fetch/$s_!Aeuz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F783f8847-f8c1-4ae2-bbe9-a9ed9b0f7309_1000x302.heic 848w, https://substackcdn.com/image/fetch/$s_!Aeuz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F783f8847-f8c1-4ae2-bbe9-a9ed9b0f7309_1000x302.heic 1272w, https://substackcdn.com/image/fetch/$s_!Aeuz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F783f8847-f8c1-4ae2-bbe9-a9ed9b0f7309_1000x302.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Aeuz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F783f8847-f8c1-4ae2-bbe9-a9ed9b0f7309_1000x302.heic" 
width="1000" height="302" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/783f8847-f8c1-4ae2-bbe9-a9ed9b0f7309_1000x302.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:302,&quot;width&quot;:1000,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19968,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/166989905?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F783f8847-f8c1-4ae2-bbe9-a9ed9b0f7309_1000x302.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Aeuz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F783f8847-f8c1-4ae2-bbe9-a9ed9b0f7309_1000x302.heic 424w, https://substackcdn.com/image/fetch/$s_!Aeuz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F783f8847-f8c1-4ae2-bbe9-a9ed9b0f7309_1000x302.heic 848w, https://substackcdn.com/image/fetch/$s_!Aeuz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F783f8847-f8c1-4ae2-bbe9-a9ed9b0f7309_1000x302.heic 1272w, https://substackcdn.com/image/fetch/$s_!Aeuz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F783f8847-f8c1-4ae2-bbe9-a9ed9b0f7309_1000x302.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>&#127881; BONUS</h3><h4><strong>Estuary Flow</strong> <em>The exactly-once innovator</em></h4><p>Estuary Flow is a unified streaming/batch data platform. It uses a custom DSL (Flow) to define pipelines that capture, transform, and load data. Its core is <strong>Gazette</strong> (open-source) which manages the data streams (Collections). 
Flow can do CDC, batch, and real-time analytics in one model.</p><p><strong>Best for:</strong> Organizations needing guaranteed exactly-once delivery with multiple targets <br><strong>Strengths:</strong> True exactly-once end-to-end, automatic schema evolution, unified real-time/batch <br><strong>Watch out for:</strong> Newer ecosystem, primarily commercial offering <br><strong>Latency:</strong> &lt;100ms <br><strong>Cost:</strong> Usage-based with free tier</p><p><strong>Architecture:</strong> Flow pipelines run on Estuary&#8217;s control plane (SaaS) and data plane (can be public or private cloud, or on-prem). A pipeline attaches to source systems (e.g. database tables or logs) and continuously reads changes. Internally, every source stream is stored in a <strong>Collection</strong>, an append-only log. There is one collection per source table, and multiple sinks can read from it, so a single Flow pipeline can write to several targets from the same source collection. Underneath, Estuary provides connectors for many sources/targets, and a &#8220;Kafka emulation&#8221; (Dekaf) if needed.</p><p>I think this one is truly designed for real-time streaming. It captures every change exactly once as it happens (via DB WAL or log streaming). It also supports batch backfills: you can initiate a snapshot that will emit historical data into the collection. Crucially, real-time and historical data are unified in the same pipeline.</p><h3><strong>Conclusion</strong></h3><p>Choosing the right CDC solution is all about balancing your organization&#8217;s priorities: latency, complexity, cost, and ecosystem fit. Ultimately, real-time CDC is no longer a luxury but a necessity for data-driven businesses. 
By aligning your CDC tool with your team&#8217;s skills, infrastructure, and performance targets, you&#8217;ll unlock faster insights, more dependable reporting, and a decisive edge in a world where &#8220;what just changed?&#8221; matters more than ever.</p>]]></content:encoded></item><item><title><![CDATA["Databricks Adds Another 'Lake' to the Family: Meet Lakebase (Because Why Stop at Lakehouse?)]]></title><description><![CDATA[How a $800M Neon acquisition just became Databricks' secret weapon for AI-powered Postgres]]></description><link>https://dataheimer.substack.com/p/databricks-adds-another-lake-to-the</link><guid isPermaLink="false">https://dataheimer.substack.com/p/databricks-adds-another-lake-to-the</guid><dc:creator><![CDATA[Subhan Hagverdiyev]]></dc:creator><pubDate>Thu, 26 Jun 2025 17:03:37 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!R-4B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c390b8e-2273-41ac-9f3a-a83256a0f469_1200x628.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!R-4B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c390b8e-2273-41ac-9f3a-a83256a0f469_1200x628.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!R-4B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c390b8e-2273-41ac-9f3a-a83256a0f469_1200x628.png 424w, https://substackcdn.com/image/fetch/$s_!R-4B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c390b8e-2273-41ac-9f3a-a83256a0f469_1200x628.png 848w, https://substackcdn.com/image/fetch/$s_!R-4B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c390b8e-2273-41ac-9f3a-a83256a0f469_1200x628.png 1272w, https://substackcdn.com/image/fetch/$s_!R-4B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c390b8e-2273-41ac-9f3a-a83256a0f469_1200x628.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!R-4B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c390b8e-2273-41ac-9f3a-a83256a0f469_1200x628.png" width="1200" height="628" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1c390b8e-2273-41ac-9f3a-a83256a0f469_1200x628.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:628,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:119927,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://dataheimer.substack.com/i/166900557?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c390b8e-2273-41ac-9f3a-a83256a0f469_1200x628.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!R-4B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c390b8e-2273-41ac-9f3a-a83256a0f469_1200x628.png 424w, https://substackcdn.com/image/fetch/$s_!R-4B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c390b8e-2273-41ac-9f3a-a83256a0f469_1200x628.png 848w, https://substackcdn.com/image/fetch/$s_!R-4B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c390b8e-2273-41ac-9f3a-a83256a0f469_1200x628.png 1272w, https://substackcdn.com/image/fetch/$s_!R-4B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c390b8e-2273-41ac-9f3a-a83256a0f469_1200x628.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Hey data folks,</p><p>Remember when Databricks dropped $800 million on Neon back in May and we were all scratching our heads wondering "why the hell does a lakehouse company need a Postgres startup?"</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://dataheimer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Dataheimer Newsletter! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>Well, plot twist: they weren't just buying Neon for their cute serverless database tech. They were playing 4D chess, and we just got to see their master move.</p><p>Lakebase is Databricks' new fully managed Postgres database, built specifically for AI applications and integrated with the Databricks lakehouse platform. It was announced at the Data + AI Summit on June 11, 2025 (<a href="https://www.databricks.com/company/newsroom/press-releases/databricks-launches-lakebase-new-class-operational-database-ai-apps">Databricks Launches Lakebase: a New Class of Operational Database for AI Apps and Agents</a>) and is now in public preview.</p><h2>The "Aha!" Moment We've All Been Waiting For</h2><p>Here's what actually happened: Databricks saw the future of AI applications and realized there was a massive gap in their platform. Sure, you could run your fancy ML models and analytics on their lakehouse, but what about all those AI apps and agents that need to interact with operational data in real time?</p><p>You know the drill &#8211; your AI chatbot needs to query customer data, your recommendation engine needs to hit product catalogs, your intelligent automation needs to read from and write to transactional systems. All of that lives in operational databases, usually Postgres, and it was living completely separate from your analytical lakehouse environment. 
</p><p>Before Lakebase, if you wanted to build an AI app on Databricks, you had to maintain separate infrastructure, deal with multiple security models, juggle different connection patterns, and probably write a bunch of custom ETL to keep everything in sync. That's not just technical debt &#8211; that's <em><strong>technical bankruptcy</strong></em>.</p><h2>The Bigger Picture: Databricks' Master Plan</h2><p>Step back and look at what Databricks is building here. They started with the lakehouse &#8211; unified analytics and ML on one platform. Then they added generative AI capabilities, vector databases, and model serving. Now they're completing the circle with operational databases.</p><p>This isn't just about having more products. This is about owning the entire AI application lifecycle:</p><ol><li><p><strong>Data ingestion and preparation</strong> (existing lakehouse)</p></li><li><p><strong>Model training and fine-tuning</strong> (existing ML platform)</p></li><li><p><strong>Model deployment and serving</strong> (existing model serving)</p></li><li><p><strong>AI application runtime</strong> (new Lakebase)</p></li></ol><p>When your AI application needs to serve a recommendation, it can query operational data in Lakebase, invoke a model served on Databricks, and write the results back to Lakebase &#8211; all within the same platform, same security model, same billing account.</p><h2>Why Millisecond Response Times Matter (And Why Delta Tables Can't Deliver Them)</h2><p>Here's the fundamental problem that's been haunting data teams: your beautiful lakehouse is fantastic for analytics, but it's terrible for serving live applications. 
When your customer-facing API needs to return personalized recommendations in under 50ms, querying a Delta table just isn't going to cut it.</p><p><strong>The Physics Problem:</strong></p><ul><li><p>Delta tables are optimized for <strong>throughput</strong>, not <strong>latency</strong></p></li><li><p>Parquet files require scanning and decompression</p></li><li><p>Even with Z-ordering and liquid clustering, you're still talking hundreds of milliseconds for simple lookups</p></li><li><p>Connection overhead to Spark clusters adds another layer of latency</p></li></ul><p><strong>The Real-World Impact:</strong></p><pre><code><code>Typical Delta Lake query: 200-500ms
API SLA requirement: &lt;50ms
Customer patience: ~3 seconds before abandoning</code></code></pre><p>This mismatch has forced teams into complex, expensive workarounds that Lakebase finally eliminates.</p><h2>Lakebase's Dual-Mode Architecture</h2><p>Lakebase solves this with two distinct operational modes, each optimized for different use cases:</p><h3>Mode 1: Delta-Postgres Sync (The Game Changer)</h3><p>This is where the magic happens. Lakebase maintains near-real-time synchronization between your Delta tables and corresponding Postgres tables:</p><p><strong>How It Works:</strong></p><ol><li><p><strong>Change Data Capture (CDC)</strong>: Lakebase monitors your Delta table's transaction log</p></li><li><p><strong>Incremental Sync</strong>: Only changed rows are propagated to Postgres</p></li><li><p><strong>Schema Evolution</strong>: DDL changes in Delta automatically update the Postgres schema</p></li><li><p><strong>Conflict Resolution</strong>: Built-in handling for concurrent updates</p></li></ol><p><strong>Technical Implementation:</strong></p><pre><code>-- Create a synced table
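-- NOTE: illustrative pseudo-SQL; the SYNC WITH DELTA and REFRESH EVERY clauses sketch the idea and are not confirmed Lakebase syntax
CREATE TABLE customer_features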
SYNC WITH DELTA 'dbfs:/mnt/lakehouse/customer_features'
REFRESH EVERY 30 SECONDS;

-- Now query with sub-10ms response times
SELECT recommendation_score, last_purchase_date 
FROM customer_features 
WHERE customer_id = 12345;</code></pre><h3>Mode 2: Native Postgres Tables (Traditional OLTP)</h3><p>For pure transactional workloads that don't need lakehouse integration:</p><pre><code><code>CREATE TABLE user_sessions (
    session_id UUID PRIMARY KEY,
    user_id BIGINT,
    created_at TIMESTAMP,
    last_activity TIMESTAMP,
    session_data JSONB
);

CREATE INDEX idx_user_sessions_user_id ON user_sessions(user_id);
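
-- Sketch only (standard Postgres; the user_id value is hypothetical):
-- the index above is meant to serve per-user lookups such as this one.
SELECT session_id, last_activity
FROM user_sessions
WHERE user_id = 12345
ORDER BY last_activity DESC
LIMIT 5;
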
CREATE INDEX idx_user_sessions_activity ON user_sessions(last_activity);</code></code></pre><h2>The Bottom Line: Engineering Impact</h2><p>Lakebase solves a fundamental architecture problem that's been forcing teams into complex, expensive workarounds. By providing sub-10ms query performance on lakehouse data, it eliminates the need for:</p><ul><li><p>Complex caching layers</p></li><li><p>Custom ETL pipelines for operational data</p></li><li><p>Separate operational databases</p></li><li><p>Multiple security and governance systems</p></li><li><p>Custom sync mechanisms</p></li></ul><p>For engineering teams, this means faster development cycles, fewer systems to maintain, and the ability to build AI applications that were previously architecturally impossible or prohibitively expensive.</p><p>The real test will be how well it handles the edge cases and scale requirements of production workloads, but the fundamental approach is sound and addresses a genuine gap in the modern data stack.</p>
]]></content:encoded></item><item><title><![CDATA[DataHeimer: Exploring the Atomic Impact of Data]]></title><description><![CDATA[Welcome Everybody!]]></description><link>https://dataheimer.substack.com/p/dataheimer-exploring-the-atomic-impact</link><guid isPermaLink="false">https://dataheimer.substack.com/p/dataheimer-exploring-the-atomic-impact</guid><dc:creator><![CDATA[Subhan Hagverdiyev]]></dc:creator><pubDate>Mon, 09 Sep 2024 19:44:30 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/fa2f38d1-df27-472a-af4c-8ed77f3b5665_800x600.gif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome Everybody! I have finally decided to use Substack to publish my blogs. 
I have previously shared my articles on several websites, including LinkedIn and Medium, but I wanted a single place where I can share detailed, in-depth articles about Data Engineering, Data Science, Software Engineering, and application architecture.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NVj7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68948c1a-c6ed-467c-a438-a4f921946fac_800x600.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NVj7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68948c1a-c6ed-467c-a438-a4f921946fac_800x600.gif 424w, https://substackcdn.com/image/fetch/$s_!NVj7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68948c1a-c6ed-467c-a438-a4f921946fac_800x600.gif 848w, https://substackcdn.com/image/fetch/$s_!NVj7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68948c1a-c6ed-467c-a438-a4f921946fac_800x600.gif 1272w, https://substackcdn.com/image/fetch/$s_!NVj7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68948c1a-c6ed-467c-a438-a4f921946fac_800x600.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NVj7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68948c1a-c6ed-467c-a438-a4f921946fac_800x600.gif" width="800" height="600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/68948c1a-c6ed-467c-a438-a4f921946fac_800x600.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1097289,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NVj7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68948c1a-c6ed-467c-a438-a4f921946fac_800x600.gif 424w, https://substackcdn.com/image/fetch/$s_!NVj7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68948c1a-c6ed-467c-a438-a4f921946fac_800x600.gif 848w, https://substackcdn.com/image/fetch/$s_!NVj7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68948c1a-c6ed-467c-a438-a4f921946fac_800x600.gif 1272w, https://substackcdn.com/image/fetch/$s_!NVj7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68948c1a-c6ed-467c-a438-a4f921946fac_800x600.gif 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://dataheimer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Subhan&#8217;s Substack! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Coming soon]]></title><description><![CDATA[This is Dataheimer Newsletter.]]></description><link>https://dataheimer.substack.com/p/coming-soon</link><guid isPermaLink="false">https://dataheimer.substack.com/p/coming-soon</guid><dc:creator><![CDATA[Subhan Hagverdiyev]]></dc:creator><pubDate>Mon, 09 Sep 2024 19:30:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kayI!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F737e0768-239b-442b-bc27-b066513283cc_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is Dataheimer Newsletter.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataheimer.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataheimer.substack.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item></channel></rss>