ScubaDuck Codex prompts and trajectories

Scraped data from coding agent interactions

159 trajectories, generated on June 02, 2025

Design exploratory data analysis tool

(rejected)
I really like Scuba (Meta's internal real-time database system). The distributed, real-time database part of Scuba is quite difficult (and expensive) to replicate, but I also really like Scuba's UI for doing queries, and I have found myself wishing that I had access to it even for "small" databases, e.g., a sqlite dataset I want to explore.

Pivotal ideas:

* Time series by default. In the dedicated "time series" view, there are many features specifically oriented towards working with tables that represent events that occurred over time: the start, end, compare, aggregate and granularity fields all specially privilege the timestamp field. In fact, you can't log events to Scuba's backing data store without a timestamp; they always come with one. (Scuba also supports other views that don't presuppose a time series, but the time series view is the most beloved and well used.) This is in contrast to typical software, which tries to generalize to arbitrary data first, with time series added on later.
* It's all about exploration. Scuba is predicated on the idea that you don't know what you're looking for, and that you are going to spend time tweaking queries and changing filters/groupings as part of an investigation to figure out why a system behaves the way it does. So the filters/comparisons/groupings you want to edit are always visible in the left sidebar, with the expectation that you're going to tweak the query to look at something else. Similarly, all the parameters of your query get saved into your URL, so your browser history can double as a query history, and you can easily share a query with someone else. This is in contrast to typical software, which is often oriented towards making pretty dashboards and reports. (That function is important too, but it's not what I want in exploration mode!)
* You can fix data problems in the query editor. It's pretty common to have messed up and ended up with a database that doesn't have exactly the columns you need, or has some columns that are corrupted in some way. Scuba has pretty robust support for defining custom columns with arbitrary SQL functions, grouping over them as if they were native columns, and doing so with minimal runtime cost (Scuba aims to turn around your query in milliseconds!). Having to go and run a huge data pipeline to fix your data is a big impediment to exploration; quick and easy custom columns mean you can patch over problems while you're investigating and fix them for real later.

We're going to build an exploratory data analysis tool like Scuba for time series databases (i.e., databases with a mandatory timestamp representing the time an event occurred). We'll use DuckDB as the underlying SQL engine served from a Python server, and render the GUI/results as a webpage with vanilla HTML and JS. We'll use choices.js to support token inputs. We define a token input to mean a text input element where, as you type, a dropdown displays valid values; if you select one or press enter, the selection turns into a token/chip that can only be deleted as one unit.

To start, we are going to support one view: samples. The samples view only lets you view individual samples from the database, subject to a filter. Our main UI concept is a left sidebar that is the query editor, with the right side showing the view. The sidebar is always visible and defaults to the query parameters of the current view. After you make changes to the query, clicking the "Dive" button updates the view. The URL of the page encodes all of the values of the query (and gets updated when you Dive), so the browser's back button lets you view previous queries. The query editor's job is to generate a SQL query, which is then applied to the database, and the result is visualized according to the view.

Here are the settings you can apply to the query. The help text should show up when you mouse over the field name:

* Start/End - Help text: "Sets the start/end of the time range to query. Can be any kind of datetime string. For example: 'April 23, 2014' or 'yesterday'." The UI for this selector supports both relative selections (now, -1 hour, -3 hours, -12 hours, -1 day, -3 days, -1 week, -1 fortnight, -30 days, -90 days) and absolute dates. The field is rendered as a free-form text box, a dropdown arrow (for the relative selectors), and a calendar button (for date selection).
* Order By - Help text: "Choose a column to sort results by." There is an ASC/DESC toggle next to it.
* Limit - Help text: "Choose the maximum number of results to show in the chart after any aggregations have been applied. For example, a limit of 10 will show no more than 10 rows for a table, etc."
* Filters - You can create as many filters as you want. You can either write a filter using a UI or manual SQL. In the UI, a filter consists of a column name, a relation (e.g., =, !=, <, >), and a text field. The text field is a token input. It accepts multiple tokens for the = relation, in which case we match using an OR over all options.

There is also a "Columns" tab which lets you view all fields in the table, organized by their type. You can also define derived columns by specifying a column name and a SQL expression. Derived columns can be used in all parts of the UI, including filters/group by/etc. Columns have checkboxes indicating whether we should SELECT them. Each selected column shows up in the graph. There is an All/None link which can be used to select/deselect all checkboxes.

The query UI constructs a SQL query that intuitively has this form:

```
SELECT column, column, ...
FROM table
WHERE time >= min-timestamp AND time <= max-timestamp [AND condition ...]
ORDER BY aggregate(column)
LIMIT number
```

Write tests for query backend and frontend

(rejected)
Repeats the ScubaDuck design prompt above, then adds: You should write tests for the server backend, demonstrating that at specific query values we get back the correct rows of data. For the frontend, use Cypress to demonstrate that each individual form element works correctly.

Write tests for server backend queries

(rejected)
Repeats the ScubaDuck design prompt above, then adds: You should write tests for the server backend, demonstrating that at specific query values we get back the correct rows of data.
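A backend test of the requested shape could look roughly like this. The sketch uses Python's built-in `sqlite3` so it runs anywhere; the real tests would use `duckdb`, whose `connect`/`execute`/`fetchall` API is analogous. The schema and fixture rows are illustrative assumptions.

```python
import sqlite3  # stand-in for duckdb so the sketch is dependency-free

def make_test_db() -> sqlite3.Connection:
    # tiny fixture table; the schema here is illustrative
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE events (timestamp TEXT, user TEXT)")
    con.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [
            ("2024-01-01 00:00:00", "alice"),
            ("2024-01-02 00:00:00", "bob"),
            ("2024-01-05 00:00:00", "alice"),
        ],
    )
    return con

def test_time_range_filters_rows() -> None:
    con = make_test_db()
    rows = con.execute(
        "SELECT user FROM events "
        "WHERE timestamp >= '2024-01-01' AND timestamp <= '2024-01-03' "
        "ORDER BY timestamp"
    ).fetchall()
    # only the two events inside the window come back, in time order
    assert rows == [("alice",), ("bob",)]
```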

Write tests for server backend queries

Repeats the ScubaDuck design prompt above, then adds: You should write tests for the server backend, demonstrating that at specific query values we get back the correct rows of data.

Improve Filter functionality and styling

We're going to improve the Filter functionality. There should be one unfilled filter displayed by default (rather than zero, as it is right now). The X to delete a filter should be right-aligned and in the first row. Add more styling to make the filter boxes clearer. We should support autocomplete, done by issuing a query to the server to find unique values matching the typed prefix. Issue the query on every keypress and cache the results in memory. Have some reasonable limit on how many options we show.
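The server side of this autocomplete could be sketched as below. This is a hedged sketch: `sqlite3` stands in for duckdb, `autocomplete` and `MAX_OPTIONS` are illustrative names, `lru_cache` provides the in-memory cache, and escaping of `%`/`_` inside the typed prefix is omitted for brevity.

```python
from functools import lru_cache
import sqlite3  # stand-in for duckdb; the query shape is the same

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user TEXT)")
con.executemany("INSERT INTO events VALUES (?)",
                [("alice",), ("alan",), ("bob",)])

MAX_OPTIONS = 20  # reasonable cap on how many suggestions we return

@lru_cache(maxsize=1024)  # in-memory cache keyed by (column, prefix)
def autocomplete(column: str, prefix: str) -> tuple[str, ...]:
    # `column` comes from the table schema, not raw user input, so
    # interpolating it is acceptable; the prefix is a bound parameter
    rows = con.execute(
        f"SELECT DISTINCT {column} FROM events "
        f"WHERE {column} LIKE ? ORDER BY {column} LIMIT ?",
        (prefix + "%", MAX_OPTIONS),
    ).fetchall()
    return tuple(r[0] for r in rows)
```

Because repeated keypresses often re-issue the same prefix (and backspacing revisits earlier prefixes), caching by exact `(column, prefix)` pair already avoids most duplicate queries.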

Add Playwright tests for filters and sorting

(rejected)
Let's add some tests for our web UI. We will use playwright and pytest (with the pytest-playwright plugin) to orchestrate these tests, on the chromium browser (this is the only browser you have available). Here are the things I want to test from the main page:

* Test that changing the Start/End field and then running Dive properly filters out elements based on the range
* Test that changing the Order By field and then running Dive properly sorts rows
* Test that changing Limit properly reduces the number of shown records

The tests will be run in parallel with pytest-xdist, so make sure they are parallelizable. The web UI already exists and I have verified that the requested functionality works, so do NOT change the actual implementation. Instead of directly testing the HTML table, you may publish the raw JSON data pre-rendering in a global variable and test against that.

Add Playwright tests for main page UI

(rejected)
Repeats the Playwright test prompt above verbatim.

Add UI tests with Playwright and pytest

Repeats the Playwright test prompt above verbatim.

Add GitHub Actions configuration for tests and linting

Add a GitHub actions configuration that uses modern uv tooling to run all tests on main and PRs. It should also run lint with ruff.
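A minimal configuration along these lines might look like the following. Treat it as a sketch: the action versions, job layout, and uv commands (`uv sync`, `uv run pytest`, `uvx ruff check`) are reasonable modern defaults, not taken from this repo.

```yaml
name: CI
on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
      - run: uv sync
      - run: uv run pytest

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
      - run: uvx ruff check .
```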

Add type annotations for third-party libraries

We recently enabled type checking in this repo. You can typecheck by running 'pyright'. There are some errors. Many of the errors arise because duckdb does not have accurate typing annotations. What we will do is we will create local stubs for all of duckdb's interfaces to make them well typed. pyright is already configured to look in stubs/ directory for stubs. Do this for any other poorly typed interfaces from third party libraries (or tell me in the PR description that I should install some typeshed libraries). Fix all type errors.

Enable type checking for tests folder

Enable type checking for the tests/ folder (in pyproject.toml pyright config) and then fix errors.

Add and fix filter functionality test

Filter functionality does not work. We will first add a failing test, and then fix it. The failing test should click "Add Filter", then select "user" as the field, and then add an "alice" chip (by typing alice in the text box and pressing ENTER). Then when we dive, we should see two alice rows. Right now, NO request is issued at all when we click Dive. Diagnose and then fix the problem.

Fix race condition in test logic

There's a race condition in the current test logic for matching against table contents in run_query. Specifically, if there were previously valid results in lastResults, and for some reason Dive doesn't do anything, then we will still see the old results. The testing framework should explicitly clear lastResults before attempting an interaction.
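The fix could be sketched as a shared helper like the one below. It is hypothetical (`run_query`, the `window.lastResults` global, and the `Dive` selector are assumptions carried over from these prompts): the key point is that stale results are cleared before the interaction, so the wait can only be satisfied by fresh data.

```python
def run_query(page, url: str):
    # hypothetical test helper; `page` is a playwright Page and
    # `window.lastResults` is the global the frontend publishes
    page.goto(url)
    # clear stale results FIRST, so a Dive that silently does nothing
    # can never be satisfied by the previous query's data
    page.evaluate("window.lastResults = undefined")
    page.click("text=Dive")
    page.wait_for_function("window.lastResults !== undefined")
    return page.evaluate("window.lastResults")
```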

Enhance sample table with sorting and UI updates

We're going to enhance the table that displays samples on the right panel. Make the column headers left-aligned. When you click a column header, it should sort the table by that column: click once and it sorts descending; click again and it sorts ascending; click a third time and it goes back to no sort. When we are sorted by a column, there should be an arrow showing the sort direction, and the header text should be colored blue to indicate this column is sorted. Only LOCAL data is sorted this way; we don't issue a new query.

Improve UI layout and functionality

Let's make some UI improvements. Add a header at the very top, spanning the entire width, which displays the filename and the name of the table we're looking at. Remove the Query heading; it's redundant. Instead, add two tabs above the fields. The default tab is "View Settings", which shows the fields we currently see. The other tab is "Columns", which for now should just list all of the columns in the table when clicked. Make the left pane and the right pane scroll independently. Make the field label (e.g., Start) and the input box aligned: all of the labels should be right-aligned, all the input boxes left-aligned. Add a [?] help indicator next to each label; when moused over, it should display the help text as currently specified in README.md. Make Dive a green button, and place it at the top, floating to the right of the tabs. Write a test for each UI change you make, intuitively testing how I would manually verify that the changes I asked for were implemented correctly.

Enhance table with sortable column headers

(rejected)
Repeats the sortable-table prompt above, then adds: Add tests for the new functionality added.

Update GHA workflow to run pyright

Update the GHA workflow to also run pyright

Swap column sorting to ASC first

When we click a column on the table it goes DESC and then ASC. Swap it so it's ASC first and then DESC. This feature was added in 2c1285e6f45e8627a0b4a4ce245989cf35646f21

Add loading interstitial and query timing info

When you click Dive, replace the right pane with an interstitial saying that we are loading. When it's done replace with the data. Also, at the very bottom of View Settings, add some info about the query: Your query took about X ms, where X is how long it took.

Split GHA workflows for lint and test

Split the GHA workflows into a lint workflow and a test workflow.

Fix button size change on toggle

The View Settings / Columns buttons shouldn't change size when you toggle between them due to bold text changing size. They should be fixed size, like tabs.

Update filter UI and behavior

Make the Filters label an actual dedicated heading (since we can have multiple filters). Have an empty filter show up by default. When a filter has no entries in its value, it's a no-op (add a server-side test for this). The Add Filter button should be at the bottom of the section. The X button to remove a filter should be floated on the top right of the filter box.

Add dropdown selector for relative time periods

We're going to modify the Start/Stop fields to have an optional drop down which is triggered by pressing a new down triangle button on their right side. When pressed, you get a drop down selector for relative time periods: -1 hour, -3 hours, -12 hours, -1 day, -3 days, -1 week, -1 fortnight, -30 days, -90 days. Clicking will populate the text area with this content. Free form entry must keep working. Add a test for this functionality.

Replace ASC/DESC dropdown with toggle button

Instead of ASC/DESC in Order By being a drop down field, instead it should be a button that displays the current state. When you click it toggles to the opposite order. There should also be an arrow in the button text.

Update filter layout for column and input

In the filter, make the column name + relation (=, etc) take the first row, and the input field take a second row. The column name should stretch to fill space (and the relation is small).

Enhance columns tab with checkboxes and toggle

We're going to enhance the columns tab in the left sidebar. It is now going to be a form that is part of the overall Scuba query we send when we Dive. Every column now has a checkbox indicating if it should be part of the result or not. By default, for samples view, all columns are selected. There is also an All/None toggle before the column list which can be used to toggle them all on or off. Add both server and client tests for this functionality.

Implement custom chips input component

We're going to write a small implementation of choices.js style chips for the Filters. We will NOT use any third party dependencies. When we click into the input, a dropdown appears. As we enter text, the dropdown sends a (new) api query to the server to find samples from the column which contain the currently entered substring anywhere in their value. We limit this query to 20 results. This query is only issued for string-like columns, we do not issue it for numeric columns. Only 6 or so are shown and we have a scrollbar. Every time we type a keystroke we trigger a new query. Values should be cached in memory on server side with a sensible invalidation scheme to avoid unbounded memory use. The exact literal value that is typed is also offered as the second choice in the dropdown. There is a highlighted value in the dropdown that can be moved up/down via arrow keys as well as directly moused over. When we press ENTER or we click on a value, the selected value is placed into the text input box as a chip, rendered with a border and solid background, as well as an X on the rhs which can be used to remove the chip. More chips can be added by continuing to type. ONLY chips are considered as part of the filter query, plain text is NOT sent to the server. On the right of the chip input field there is a copy icon, which can be used to copy all of the chips. This will put into the clipboard all of the chip values intercalated with a comma (no spaces). If we paste into the input field, we split the input on commas and convert the resulting list into chips. If we shift-paste into the field, we do NOT split the input on commas and instead convert the entire value verbatim into a chip. Add comprehensive tests on the web side for all of this functionality.
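The cached server-side sample lookup described above could be sketched roughly like this (a minimal sketch, not ScubaDuck's actual implementation: SQLite stands in for DuckDB, the table/column/function names are illustrative, and the LRU bound supplies the "sensible invalidation scheme" by evicting old entries):

```python
from functools import lru_cache
import sqlite3

# Stand-in data store; the real app queries DuckDB.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user TEXT)")
con.executemany("INSERT INTO events VALUES (?)", [("alice",), ("bob",), ("alison",)])

@lru_cache(maxsize=1024)  # bounded cache: old (column, substring) pairs are evicted
def sample_values(column: str, substring: str) -> tuple[str, ...]:
    # `column` must be validated against the table schema before being
    # interpolated; the substring itself is a bound parameter.
    rows = con.execute(
        f"SELECT DISTINCT {column} FROM events "
        f"WHERE {column} LIKE '%' || ? || '%' ORDER BY 1 LIMIT 20",
        (substring,),
    ).fetchall()
    return tuple(r[0] for r in rows)

print(sample_values("user", "ali"))  # ('alice', 'alison')
```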

Improve chip layout and text input behavior

We're going to make some improvements to the chip implementation on filters. We want the chips to be laid out horizontally, line wrapping in the ordinary way. To do this, we need a single div which we put the chips in. To simulate the ability to write into this field, we will put a borderless text input right after the last chip, which gains focus when we click the field so we can type text into it. Furthermore, pressing BACKSPACE when the text input is empty should delete the rightmost chip.

Remove blue border on filter chip selection

When the input in the filter chip input is selected, we should NOT render the blue border.

Close dropdown on click outside

When we click outside of the filter chip selection area, the dropdown box should go away.

Enhance column type handling in UI

Let's improve the treatment of column types in the backend/frontend. From a UI perspective, we need to distinguish between different types of column. In our UI, we will distinguish these main types: (1) time (TIMESTAMPTZ), (2) integers (BIGINT, etc), (3) strings (VARCHAR). This has the following implications:

- In the Columns tab, we group columns by their type, and have separate All/None buttons for each type (as well as one for everything globally). While you're at it, split All and None into separate buttons rather than making it a toggle.
- In the chip autocomplete functionality, we only issue queries for string columns. Integer/time is never considered.
- In filters, string columns have different relational operators: they retain = and !=, but we drop the comparison operators and instead add: matches regex, not matches regex, contains, not contains, empty, not empty, like. These should map to the corresponding string operations supported by DuckDB. When you select empty/not empty we hide the text input entirely. Add server side tests for this new functionality.
- In order by, only time and integer columns are eligible to show up. Do not show string columns.
- In the table view, time columns should be rendered as a timestamp like Tue, May 20, 2025 01:39:26 PM (EDT). Use the local timezone for rendering. Time and integer columns should be right aligned (while string columns are left aligned).
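The three-way split could look something like this on the server (a sketch; the type lists are illustrative, not an exhaustive enumeration of DuckDB's type names):

```python
# Bucket DuckDB column types into the three UI groups the prompt names.
TIME_TYPES = {"TIMESTAMP", "TIMESTAMPTZ", "DATE"}
INTEGER_TYPES = {"TINYINT", "SMALLINT", "INTEGER", "BIGINT", "HUGEINT"}
STRING_TYPES = {"VARCHAR", "TEXT"}

def column_group(duckdb_type: str) -> str:
    t = duckdb_type.upper()
    if t in TIME_TYPES:
        return "time"
    if t in INTEGER_TYPES:
        return "integer"
    if t in STRING_TYPES:
        return "string"
    return "other"
```

Autocomplete and operator selection can then branch on the returned group rather than on raw type strings.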

Enhance table view with styles and interactions

Let's add some small enhancements to the table view. Let's add border lines to all cells in the table. When you hover over a (non-header) row it should be highlighted gray. If you click the row it should turn light blue and then stay that color. The rows should alternate white and light gray color by default. The borders between column header should be draggable so a user can adjust the width of a column. When you mouse over there should be an indication that dragging is possible. Dragging affects the cell to the left. The table should be 100% width, with any extra space distributed to columns evenly. If there are too many columns to display we should overflow to the right with a scroll bar for that pane only.

Update filter UI for query column resizing

In the filter UI, when the relational operator dropdown (e.g., the one that has = in it) has a long string like matches regex, it is covered up by the X. The X should not float, it should always take up its space. When the relational operator dropdown value is large, it should instead expand left, taking up space from the column name. The query column should be 50% bigger by default. Make the border thicker and draggable so the user can adjust the query column size. The size of the query column should be persisted across refresh (via either a cookie or local storage).

Fix table overflow issue and remove column resizing

The current table rendering has a size miscalculation which causes the table to always slightly overflow causing the scrollbar to appear. We should only overflow if there is legitimately no space. Also, let's delete the column resizing functionality, I don't like it.

Fix invalid date rendering in table

With the current test data, the timestamp when rendered in the table shows "Invalid Date". Fix this problem. Add a test specifically for the rendering, since looking at the JSON data won't exhibit this problem.

Fix filter box layout

The X for removing a filter is now at the bottom of the filter box. It should be in the first row: so it's column name, relation, X. The X has fixed size, relation should resize to fit whatever its current value is, and column name is flex (truncating if necessary).

Make All / None buttons links and right-align

Make the All / None buttons in Columns render as links instead. And right-align them.

Persist query information in URL

Right now, the query information in the left sidebar is not persisted. We now want to persist it in the URL. The way this will work is that we will update our URL to have all of the form parameters as URL parameters when we press dive. When we navigate to a page, we set all of our form inputs to match the form parameters. The URL updates should be done so that back button takes you back to the previous query. Also, let's always issue a query on the RHS immediately, based on whatever the current state of the form is (after we've populated based on the URL parameters.)

Convert sub controls to links

The global All / None column control was converted to links. However, the sub controls for each type of column (string, int, etc) were not converted. Convert those too.

Make database configurable with multiple formats

We need to make the database we're going to load configurable. We want to support csv, sqlite and native duckdb database. We should be able to specify the database via filename passed on command line. The type of database should be automatically inferred. Add tests for all of these cases.
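One plausible inference scheme (an illustrative sketch, not the mandated design: check the extension first, then fall back to sniffing the SQLite magic header):

```python
from pathlib import Path

def infer_db_type(path: str) -> str:
    """Guess csv / sqlite / duckdb from the filename, then the file header."""
    ext = Path(path).suffix.lower()
    if ext == ".csv":
        return "csv"
    if ext in {".sqlite", ".db"}:
        return "sqlite"
    if ext == ".duckdb":
        return "duckdb"
    # Unknown extension: SQLite files begin with a fixed 16-byte magic string.
    with open(path, "rb") as f:
        if f.read(16) == b"SQLite format 3\x00":
            return "sqlite"
    return "duckdb"
```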

Float per section column control links to the right

The per section column control links All / None should be floated to the right.

Fix relative time selector UI

The relative time selector for Start / End is problematic. First, end should have Now as an option. Second, when we click the down arrow button, we should immediately see the dropdown options (it shouldn't display as a separate dropdown on the right). When we select something in the dropdown, it should immediately populate the free form text field and we're done.

Add Table display type with new fields

We're going to add support for a new display type for our UI. We will put a drop down selector for graph type to the right of the topmost header as a dropdown. Our current functionality is "Samples". We're now going to add "Table". Table adds support for some new fields in View Settings. These fields should be hidden on the Samples view. We may add some new fields / graph types so don't excessively hard code here. We are adding three new fields: Group By, Aggregate, and Show Hits / Samples. Group By should use the same chip input that we use for Filters (refactor so it's a reusable component if needed). It can select over columns. A user can specify multiple columns to group by. This should translate into the corresponding SQL group by. Aggregate specifies how we should aggregate things that are grouped together. It is a traditional dropdown. The default selection is Avg. We also support Count, Sum, Min, Max, Count Distinct, p5, p25, p50, p70, p75, p90, p95, p99, p99.9, p99.99. Show Hits has a checkbox and is selected by default. It controls whether we add a column "Hits" which specifies the Count aggregate of the group by. If a column is nullable, the NULL entries should get grouped by themselves as well. To display the table, we reuse the same table that displays samples. Columns that are grouped by are always displayed on the left, in the order they were selected by the user. Hits is then displayed next, followed by the rest of the columns. Unlike samples, only Time and Integer columns can be selected for display (string columns are hidden from the Columns view). The Hits column should not only display a count but also a percentage, which is its percentage compared to the sum of all counts.

Report empty data in table query

When we make a query and no data is returned, we should report "Empty data provided to table"

Fix relative time handling and add tests

Relative times in Start/End entry don't work. Here's what happens when I submit a query for them:

Traceback (most recent call last):
  File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/app.py", line 1536, in __call__
    return self.wsgi_app(environ, start_response)
  File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/app.py", line 1514, in wsgi_app
    response = self.handle_exception(e)
  File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/app.py", line 1511, in wsgi_app
    response = self.full_dispatch_request()
  File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/app.py", line 919, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/app.py", line 917, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/app.py", line 902, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
  File "/Users/ezyang/Dev/scubaduck/scubaduck/server.py", line 206, in query
    rows = con.execute(sql).fetchall()
duckdb.duckdb.ConversionException: Conversion Error: invalid timestamp field format: "-1 hour", expected format is (YYYY-MM-DD HH:MM:SS[.US][±HH:MM| ZONE])
LINE 1: ... timestamp, event, value, user FROM events WHERE timestamp >= '-1 hour' AND timestamp <= 'now' ORDER BY timestamp ASC...

Example URL: http://127.0.0.1:5000/?start=-1+hour&end=now&order_by=timestamp&order_dir=ASC&limit=100&columns=timestamp%2Cevent%2Cvalue%2Cuser&filters=%5B%7B%22column%22%3A%22timestamp%22%2C%22op%22%3A%22%3D%22%2C%22value%22%3Anull%7D%5D

Fix this and add a test for this case. Make sure to ensure ALL the relative times in the dropdown are supported. Also, if we pass an invalid time, the server should report the error message to the client and we should show it in the pane, instead of being stuck at "Loading...".

Include full SQL query in error message

When an error like this occurs:

127.0.0.1 - - [20/May/2025 21:42:31] "POST /api/query HTTP/1.1" 500 -
Traceback (most recent call last):
  File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/app.py", line 1536, in __call__
    return self.wsgi_app(environ, start_response)
  File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/app.py", line 1514, in wsgi_app
    response = self.handle_exception(e)
  File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/app.py", line 1511, in wsgi_app
    response = self.full_dispatch_request()
  File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/app.py", line 919, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/app.py", line 917, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/app.py", line 902, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
  File "/Users/ezyang/Dev/scubaduck/scubaduck/server.py", line 206, in query
    rows = con.execute(sql).fetchall()
duckdb.duckdb.BinderException: Binder Error: No function matches the given name and argument types 'avg(VARCHAR)'. You might need to add explicit type casts.
Candidate functions:
  avg(DECIMAL) -> DECIMAL
  avg(SMALLINT) -> DOUBLE
  avg(INTEGER) -> DOUBLE
  avg(BIGINT) -> DOUBLE
  avg(HUGEINT) -> DOUBLE
  avg(DOUBLE) -> DOUBLE
LINE 1: SELECT user, avg(event) AS event, avg(value) AS value FROM events GROUP...

I want the full SQL query that was generated to be in the error message (both seen in the console and on the web UI). On the web UI, I also want to see the traceback.

Fix Group By chip form issues

In the Table View Settings, the Group By chip form has some problems. The copy button renders as \u2398 rather than the actual copy icon. When the URL parameter specifies the value of this field, the input is not properly taking on the value (NB: it should be rendered as chips). When typing in the text input, we do not get an autocomplete drop down (this drop down should be populated with valid column names to group by).

Fix handling of hidden fields in form views

When fields are hidden in the form (e.g., in the Samples view, group by and similar fields are hidden; in Table view, string columns are hidden from the Columns view), those fields are still submitted to the server and included in the query URL, and the server attempts to incorporate them, which causes problems. Concretely, I cannot do a query on the sample dataset with Table view with group by on user; I get this error:

Binder Error: Referenced column "Hits" not found in FROM clause!
Candidate bindings: "timestamp"
LINE 1: SELECT user, count(*) AS Hits, avg(Hits) AS Hits, avg(timestamp) AS timestamp, avg(event) AS...

SQL: SELECT user, count(*) AS Hits, avg(Hits) AS Hits, avg(timestamp) AS timestamp, avg(event) AS event, avg(value) AS value FROM events GROUP BY user ORDER BY timestamp ASC LIMIT 100

Traceback (most recent call last):
  File "/Users/ezyang/Dev/scubaduck/scubaduck/server.py", line 255, in query
    rows = con.execute(sql).fetchall()
duckdb.duckdb.BinderException: Binder Error: Referenced column "Hits" not found in FROM clause!
Candidate bindings: "timestamp"
LINE 1: SELECT user, count(*) AS Hits, avg(Hits) AS Hits, avg(timestamp) AS timestamp, avg(event) AS...

First, add this to the test suite and verify it's failing in the same way. Our fix for this will be a bit involved. Let's establish some system invariants:

- The parameters in the URL only update when we click Dive. However, if a user makes local changes to a field, those changes should persist if the field is hidden and then unhidden. For example, if I specify a Group by in Table view, and then swap to Samples view, and then swap back to Table view, I should still have whatever I previously specified in the Group by field (but note, this is ephemeral; it will go away if I refresh the page).
- When a field is hidden, it is not persisted in the parameters of the URL. So this means that if we set Group By in Table view, swap to Samples view, Dive (updating the URL), load the resulting URL in a new tab, and swap to Table view, we will have lost the Group By setting. But if we DIDN'T explicitly reload the URL, swapping to Table would see the Group By preserved (because of the temporary state on fields).
- When a field is hidden, it is NOT submitted to the server. You'll need to figure out some way to do this, since you probably want to keep the field around so that its temporary value is preserved.

Separately, the server should validate what fields are valid for the view type (Table supports more than Samples) and report an error if fields are incorrectly passed in (so we can fix our client code for this case). It is VERY important this applies to columns, as when we are in table view we need to make sure we don't attempt to apply an aggregate function to column types that don't support it. To make the above test work end-to-end, you may need to make other fixes to the query builder. Add appropriate server tests for these cases as needed.

Fix styling alignment for Group By chip

The styling on the Group By chip field doesn't match the Filter chip field; specifically, the text input box isn't supposed to have a border. Double check that the code is factored so that our chip implementation can be easily reused in these contexts without problems.

Implement custom dropdown with search

Everywhere there is a drop down input, we want to replace it with a custom drop down implementation that is implemented in JavaScript. When you click the dropdown you will get a hand rendered list of drop down possibilities with scrolldown, where the current selection is highlighted blue and your hover over is highlighted grey. However, critically, at the top of the drop down box is a search box (which is focused when you click the drop down). Typing in the search field will narrow down to only elements that contain the string (anywhere in the string). Implement this modularly so we can easily apply it to many inputs. It should be applied at minimum to Aggregate, Group By, the Column selector in a Filter, Order By fields.

Replace X with close symbol

We use X in a number of places to represent deleting a chip/field. Replace this with a more visually distinct Unicode character that clearly denotes close.

Write test for table functionality and fix bugs

Table functionality is broken and our tests aren't catching it. Let's write a test that closely mimics how a user interacts with the UI. This is what the test should do: 1. Load the main page with empty query 2. Switch view to Table 3. Dive and validate the results. Because the Group By field is empty but we have an aggregate function, this should cause us to return only a single row, with the first column being the number of rows in the table (4 on the test dataset), and then averages for all the columns that were selected (there are two). Right now, this display is buggy because Hits is rendered as a timestamp (rather than 4 (100%)), the timestamps appear to have been averaged incorrectly, and the value column is blank. This is because the returned data didn't actually populate hits. DuckDB does not directly support AVG on timestamptz fields, so the fields must be coerced to integers, which can be averaged, and then coerced back into timestamps. After writing the test and showing that it fails, fix the bugs to make the test pass.

Fix custom dropdown UI issues

There are some bugs with the custom dropdown with search. When I hover elements, only that element should turn grey, but the entire dropdown turns grey. There should be a down-facing triangle on the right of the input box to indicate it's a dropdown. The search input should have a gray "Search" text overlaid when the input is empty, disappearing when input is entered. A small enhancement is to underline the substring that was matched in the candidates. The relational operator in filter should be a plain dropdown input, not a special one (search not needed). The view type (Samples, Table) can be a plain dropdown input too. The Unicode X for deleting a filter isn't properly aligned in the button, and it is not flush with the right end of the box.

Add aggregate labels to table columns

When we're in Table view, all columns (except Hits) should have a parenthetical saying what aggregate they are; e.g., not "timestamp" but "timestamp (avg)" (if it's an avg aggregate).

Update chip input behavior and add tests

For the chip input, the invariant should be that if the field is focused, we see the dropdown box. When we press enter to accept a chip, this should defocus the input text field. If we have the field selected (dropdown box visible) and we press DELETE to delete a chip, we should retain input text focus and the dropdown should stay visible. Add tests for this.

Fix aggregate avg for timestamp column

This is broken: 1. Load the main page with empty query 2. Switch view to Table 3. Put "user" chip in Group By field 4. Dive This fails with Aggregate avg cannot be applied to column timestamp. This is incorrect. To average a timestamp we should coerce it to a unix timestamp integer, average, and then coerce it back. Add a test for this and then fix it.
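The coercion the prompt asks for can be sketched as a small helper in the SQL builder (the function name is hypothetical; the epoch round-trip is the expression shape DuckDB accepts, and it matches the SQL visible in later error reports):

```python
def aggregate_expr(column: str, agg: str, is_time: bool) -> str:
    # DuckDB has no avg() over timestamps: convert to epoch seconds,
    # average those, and convert the result back to a timestamp.
    if is_time and agg == "avg":
        return (
            f"TIMESTAMP 'epoch' + INTERVAL '1 second' * "
            f"CAST(avg(epoch({column})) AS BIGINT) AS {column}"
        )
    return f"{agg}({column}) AS {column}"
```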

Fix filter box layout and add test

In the Filter box, the column name field should stretch to fill all of the remaining space after the relational operator and the X button have taken their fixed amount of space. This used to work but is no longer working after adding the custom dropdown implementation. Add a test for this, specifically asserting that the X is flush with the right side of the filter box.

Fix chip input edge cases and add tests

There's still some edge cases for chip input which need to be improved. If I click on an element in the dropdown to select it, this should also blur the text input. If I click anywhere outside of the input, including another dropdown box (like the column drop down box) this should cause the chip input dropdown to disappear. When I backspace to delete a chip, this still causes the dropdown box to go away (it should not, the dropdown should stay). Add tests for all these bugs.

Implement intelligent number formatting

Let's make formatting of numeric integer/float quantities in our table more intelligent. First, rules for formatting big numbers. If a number is greater than 999.999, we will use SI suffixes to display the number compactly. We will only ever display two decimal places. For example, 3.66 M or 815.21 K. For numbers that don't require an SI prefix, we will instead show three decimal places, e.g., 999.999 or 0.001. If a number is extremely small but nonzero we display it as 0.000. But if it's literally zero we display it as simply 0.
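The rules above might be implemented like this (a sketch; the suffix set beyond K and M is an assumption, since the prompt only shows K and M examples):

```python
_SUFFIXES = [(1e12, "T"), (1e9, "B"), (1e6, "M"), (1e3, "K")]

def format_number(x: float) -> str:
    if x == 0:
        return "0"           # literal zero renders bare
    ax = abs(x)
    if ax > 999.999:
        for scale, suffix in _SUFFIXES:
            if ax >= scale:  # two decimals once a suffix applies
                return f"{x / scale:.2f} {suffix}"
    return f"{x:.3f}"        # otherwise three decimals; tiny nonzero -> 0.000

print(format_number(3_660_000))  # 3.66 M
print(format_number(815_210))    # 815.21 K
```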

Refine chip input blur behavior

A slight refinement on the chip input blur rule. If I use ENTER to accept a chip, we should add the chip, keep focus on the text field, and HIDE the dropdown. However, if I type anything, the dropdown should come back, and if I then backspace (so the text input is empty), the dropdown should persist (until I intentionally unblur it via the current mechanisms).

Add time series view with line chart

We're going to add a new View type, to go along with Samples and Table: Time Series. Time Series supports all the fields that Table supports, and a few more:

- X-axis: Main group by dimension, e.g., the x-axis on time series view. This is our custom dropdown selector, but only time columns are populated here. It should prefer a default setting from the following list, most preferred first: "time", "timestamp".
- Granularity: Choose the time interval between data points on the chart. For example, a granularity of 1 hour means there will be a data point every 60 minutes that is aggregated with the chosen Aggregate function over the data for the granularity period before the point. This is a plain drop down. The valid values are: Auto, Fine, 1 second, 5 seconds, 10 seconds, 30 seconds, 1 minute, 4 minutes, 5 minutes, 10 minutes, 15 minutes, 30 minutes, 1 hour, 3 hours, 6 hours, 1 day, 1 week, 30 days. The semantics of the Auto setting is that it sets the interval to whatever would result in a maximum of 100 buckets (if there are not enough data points for that many buckets, it just picks the finest time interval that makes sense), and Fine sets the interval to 500 buckets.
- Fill Missing Buckets: This is a dropdown. For now, it has the settings "Fill with 0 (Per Series)" (default), "Connect (Per Series)" and "Leave blank".

Additionally, the default setting of Limit is 7, as it controls how many elements from group by will be plotted (the actual number of lines plotted could be a multiple of this, as we will plot every selected Column). Unlike Samples and Table, we will instead display a line chart in the right panel. To plot the line chart, we will implement it by hand with JS and SVG, similar to how highcharts implements it. We will not use any third party dependencies. Lines will be plotted as paths, no smoothing, no dots for individual data points.

Each series (as generated by group by) should be plotted with a different color, assigned using a best practices color palette for graph design. There should be a rendering of the x-axis and y-axis; the x-axis should have slanted labels to aid readability. When we mouse over the chart, a vertical line should snap to the center of the time bucket that we are closest to. We should also display a crosshair on all of the series showing us their values at that data point, highlight the closest point we are on, and increase the thickness of the series that point is on. To the left of the graph (still in the right panel), there should be a legend. The legend looks like this:

[GROUP BY VALUE] [AGGREGATE]
  [First Column name, with series color] [Number of samples for the first column]
  [Second Column name, with series color] [Number of samples for the second column]
  ... for all columns
----
... for all group by values (up to the limit)

So for example, if I group by user, I might see:

Alice AVG
  value  4 (samples)

The highlighted series (which has a thicker line) should also be highlighted in the legend.
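The Auto/Fine sizing rule amounts to "pick the finest interval from the fixed list that still fits in the bucket budget." A sketch (interval list reduced to seconds; function name is illustrative):

```python
# Granularity choices from the dropdown, in seconds, finest first.
_INTERVALS = [1, 5, 10, 30, 60, 240, 300, 600, 900, 1800,
              3600, 10800, 21600, 86400, 604800, 2592000]

def auto_granularity(start: float, end: float, target_buckets: int = 100) -> int:
    """Return the finest interval yielding at most target_buckets buckets.

    Auto uses target_buckets=100; Fine uses target_buckets=500.
    """
    span = max(end - start, 1)
    for sec in _INTERVALS:
        if span / sec <= target_buckets:
            return sec
    return _INTERVALS[-1]  # coarsest interval if the range is huge
```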

Add support for derived columns

We're going to add support for Derived Columns in the Columns bar. This is a multi-input similar to the filters input. Each derived column box is composed as follows:

Line 1: Drop down for column type (Aggregated, String, Numeric), text input for column name (defaulting to derived_N, where N is selected to be the first available number)
Line 2: Checkbox to Include in Query
Line 3: Textarea

The derived columns allow us to transmit arbitrary pieces of SQL to DuckDB to add extra columns to our result. They should otherwise behave like ordinary columns, e.g., we can Group by them. Derived columns should be selectable from our dropdowns. This means we must merge the server provided list of columns with whatever our local list of derived columns is (with analogous filtering for appropriate types as the server side had done). Derived columns, like other settings in View Settings / Columns, should be saved to the URL.

Remove chip if selected again

In the chip input, if a chip is already present in the text box, and the user selects it again, we should REMOVE the chip (instead of adding an extra copy of the chip.)

Add support for Derived Columns in Columns bar

We're going to add support for Derived Columns in the Columns bar. This is a multi-input similar to the filters input. Each derived column box is composed as follows:

Line 1: Dropdown for column type (Aggregated, String, Numeric); text input for column name (defaulting to derived_N, where N is the first available number)
Line 2: Checkbox to Include in Query
Line 3: Textarea

The derived columns allow us to transmit arbitrary pieces of SQL to DuckDB to add extra columns to our result. They should otherwise behave like ordinary columns, e.g., we can Group By them. Derived columns should be selectable from our dropdowns. This means we must merge the server-provided list of columns with our local list of derived columns (with analogous filtering for appropriate types as done server-side). Derived columns, like other settings in View Settings / Columns, should be saved to the URL.

Fix Binder error and add test

Bug: 1. Go to main page 2. Switch to Time Series 3. Dive This fails with: Binder Error: No function matches the given name and argument types 'avg(VARCHAR)'. You might need to add explicit type casts. Candidate functions: avg(DECIMAL) -> DECIMAL avg(SMALLINT) -> DOUBLE avg(INTEGER) -> DOUBLE avg(BIGINT) -> DOUBLE avg(HUGEINT) -> DOUBLE avg(DOUBLE) -> DOUBLE LINE 1: ... second' * CAST(avg(epoch(timestamp)) AS BIGINT) AS timestamp, avg(event) AS event, avg(value) AS value, avg(user) AS user... ^ SQL: SELECT TIMESTAMP 'epoch' + INTERVAL '3600 second' * CAST(floor(epoch(timestamp)/3600) AS BIGINT) AS bucket, count(*) AS Hits, bucket, TIMESTAMP 'epoch' + INTERVAL '1 second' * CAST(avg(epoch(timestamp)) AS BIGINT) AS timestamp, avg(event) AS event, avg(value) AS value, avg(user) AS user FROM events GROUP BY bucket ORDER BY timestamp ASC LIMIT 7 Traceback (most recent call last): File "/Users/ezyang/Dev/scubaduck/scubaduck/server.py", line 394, in query rows = con.execute(sql).fetchall() ^^^^^^^^^^^^^^^^ duckdb.duckdb.BinderException: Binder Error: No function matches the given name and argument types 'avg(VARCHAR)'. You might need to add explicit type casts. Candidate functions: avg(DECIMAL) -> DECIMAL avg(SMALLINT) -> DOUBLE avg(INTEGER) -> DOUBLE avg(BIGINT) -> DOUBLE avg(HUGEINT) -> DOUBLE avg(DOUBLE) -> DOUBLE LINE 1: ... second' * CAST(avg(epoch(timestamp)) AS BIGINT) AS timestamp, avg(event) AS event, avg(value) AS value, avg(user) AS user... ^ Add a test for this and then fix it. I'm expecting to see a graph. Also, time columns should be unchecked by default in the Columns field on the time series view.

Fix time series render with NaN values

On the simplest time series test where I have default everything and ask for a time series render, there is nothing in the chart. On inspection, I see d="M50 NaN L70 NaN L530 NaN L590 NaN" in the SVG, which seems bad. Add a test for this and fix it. It might be related to "Fill Missing Buckets" functionality.

Add complex test data generation and validation

Let's make some more complicated test data, beyond the current sample.csv. Here's what we will do:

- Only when scubaduck is invoked with no arguments, the top heading (which usually says sample.csv - events) will now be a dropdown, which we can use to select between data sources. This is incorporated in the URL parameters. We will retain the existing sample.csv as the default.
- We will add a new data source. This will be generated on the fly, with no sample.csv. We will use a deterministic, seeded RNG to make sure that we always get the same dataset. It will generate 10k rows. The rows should obey an appropriate probability distribution over a fixed one-week time interval, with a 'time' column holding the timestamp when the event occurred. The entries will have some string columns and some integer columns with different distributions of values. These rows represent events that happened at some point in time.

Then, let's add a test exercising the table view on this new dataset. We should compute a deterministic aggregate average statistic, do a few group bys, and assert that those values match gold values.
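A deterministic generator along these lines might look like the following sketch (column names, the seed, and the distributions are illustrative assumptions, not the actual fixture):

```python
import random
from datetime import datetime, timedelta

def generate_events(seed=1234, rows=10_000):
    """Generate the same 10k events on every call, spread over one week."""
    rng = random.Random(seed)  # seeded RNG => reproducible dataset
    start = datetime(2024, 1, 1)
    users = ["alice", "bob", "charlie", "dave"]
    events = []
    for _ in range(rows):
        # Triangular distribution clusters events mid-week.
        offset = rng.triangular(0, 7 * 86400, 3.5 * 86400)
        events.append({
            "time": start + timedelta(seconds=offset),
            "user": rng.choice(users),
            "value": rng.randint(1, 100),
        })
    return events
```

Because the RNG is seeded, the aggregate statistics computed over this dataset are stable and can be asserted against gold values.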

Order time series query by x-axis

When doing a time series SQL query, we should order the results by the x-axis for ease of rendering on the chart.

Split chip input JS into separate file

index.html has gotten a bit long. Let's split out the chip input JS implementation into a dedicated script file. You'll need to setup Flask endpoint to serve the JS (we'll be factoring out other bits of JS, so write it so that you can easily support more JS files) and then move all of the chip input related JS into that file.

Show SQL query below chart/table

Below the chart/table in the right pane, also show the SQL query that generated this result.

Split SVG chart JS into separate file

index.html has gotten a bit long. Let's split out the SVG chart JS implementation into a dedicated script file. Chip input was factored in a similar way, use that pattern.

Fix SQL Binder error in derived column

This fails: 1. Go to Time Series 2. Add a numeric Derived Column derived_1 with "value * 2" 3. Dive It fails with: Binder Error: column "value" must appear in the GROUP BY clause or must be part of an aggregate function. Either add it to the GROUP BY list, or use "ANY_VALUE(value)" if the exact value of "value" is not important. LINE 1: ...) AS bucket, user, count(*) AS Hits, avg(value) AS value, value * 2 AS derived_1 FROM events GROUP BY bucket, user... ^ SQL: SELECT TIMESTAMP 'epoch' + INTERVAL '3600 second' * CAST(floor(epoch(timestamp)/3600) AS BIGINT) AS bucket, user, count(*) AS Hits, avg(value) AS value, value * 2 AS derived_1 FROM events GROUP BY bucket, user ORDER BY bucket LIMIT 7 Traceback (most recent call last): File "/Users/ezyang/Dev/scubaduck/scubaduck/server.py", line 408, in query rows = con.execute(sql).fetchall() ^^^^^^^^^^^^^^^^ duckdb.duckdb.BinderException: Binder Error: column "value" must appear in the GROUP BY clause or must be part of an aggregate function. Either add it to the GROUP BY list, or use "ANY_VALUE(value)" if the exact value of "value" is not important. LINE 1: ...) AS bucket, user, count(*) AS Hits, avg(value) AS value, value * 2 AS derived_1 FROM events GROUP BY bucket, user... ^ It seems likely the problem is the avg(value) shouldn't get aliased as value, it needs to get a dedicated, generated name that will never conflict with column names. Add this test, show it fails, and then fix it.

Add delete button to Derived Columns box

Add an X box similar to the Filter X box to the Derived Columns box so we can delete derived columns

Document app startup and support SCUBADUCK_DB envvar

We currently use `flask --app scubaduck.server run --debug` to run our app. Document this in README.md. We would like to be able to specify what database to load when running. We'll do this with an envvar SCUBADUCK_DB. Modify create_app to respect the envvar when the arguments are not passed. (We'll retain the arguments for ease of testing.)

Implement graph-side missing buckets handling

We need to implement graph-side functionality for fill missing buckets. To do this, we need to be a little more structured about how exactly we render the graph. Critically, not all buckets are guaranteed to show up in the returned result. We are still responsible for plotting an element for every bucket in the charted range. When "Fill Missing Buckets" is "Fill with 0", if the element is missing, we should set its value to 0. When it's "Connect", we can omit the element from the path and rely on SVG paths' built-in interpolation. When it's "Leave blank", we use the SVG path "move to" command to jump directly to the next bucket which exists. Let's also improve handling when the Start/End fields are omitted (e.g., blank). In this case, the server should implicitly compute Start/End based on the MIN/MAX bounds of the dataset. This information needs to be communicated to the client with the rest of the results so it knows where to start/end the chart.
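The three fill modes reduce to how the SVG path string is built per bucket. A simplified Python model of that path-building logic (the function name, coordinate scaling, and `values` representation are illustrative assumptions):

```python
def series_path(buckets, values, fill="0", x_step=20, y_scale=1.0):
    """Build an SVG path over every bucket, handling missing values.

    values maps bucket index -> y value; indices absent from it are gaps.
    fill: "0" (plot zero), "connect" (interpolate), "blank" (move-to jump).
    """
    parts = []
    pen_down = False
    for i in buckets:
        if i in values:
            y = values[i] * y_scale
        elif fill == "0":
            y = 0.0
        elif fill == "connect":
            continue  # omit the point; the SVG line segment spans the gap
        else:  # "blank"
            pen_down = False  # next real point starts with a move-to
            continue
        cmd = "L" if pen_down else "M"
        parts.append(f"{cmd}{i * x_step} {y:g}")
        pen_down = True
    return " ".join(parts)
```

"Fill with 0" emits an explicit point at zero, "Connect" draws one segment across the gap, and "Leave blank" breaks the path with a fresh move-to.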

Error on non-existent SCUBADUCK_DB filename

If I pass a filename that doesn't exist in SCUBADUCK_DB, we should error; we shouldn't create a blank sqlite database.

Add table selection and sqlite integration

Right now, the server hardcodes a single table events. When we load a database, there will likely be many tables. We should be able to switch between these tables without having to restart the server. We'll add a new query parameter 'table'. This should be a dropdown selectable in the topmost header (e.g., the "events" in "sample.csv - events".) Queries for columns need to pass the correct table. Instead of manually importing sqlite into our local duckdb, we should use the sqlite extension in duckdb to directly access the sqlite database.

Add table switching and sqlite extension support

Right now, the server hardcodes a single table events. When we load a database, there will likely be many tables. We should be able to switch between these tables without having to restart the server. We'll add a new query parameter 'table'. This should be a dropdown selectable in the topmost header (e.g., the "events" in "sample.csv - events".) Queries for columns need to pass the correct table. Instead of manually importing sqlite into our local duckdb, we should use the sqlite extension in duckdb to directly access the sqlite database.

Fix time series view bucket issue

This is buggy: 1. Switch to time series view. Keep Granularity at the default Auto. 2. Dive I expect to see some non zero data points. But they are all zero. When I inspect the returned rows from the server, they don't look bucketed. It works correctly when I explicitly select a bucket. Add a test for this and fix it. Make sure Fine works too.

Highlight path and legend on hover

When I mouse over a path in the SVG chart, it should get "highlighted" (by making the path more thick) and its corresponding legend entry on the left hand side should also get highlighted grey.

Deselect previous blue selection on field click

When I am viewing a table and I click a field to make it turn blue, any other blue selection (if there are any) should be deselected (no longer blue).

Fix CatalogError for LONGVARCHAR type

I ran this with SCUBADUCK_DB set to a non-trivial sqlite database (representing Chrome's history), and it failed with: File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/cli.py", line 349, in load_app app = locate_app(import_name, name) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/cli.py", line 245, in locate_app __import__(module_name) File "/Users/ezyang/Dev/scubaduck/scubaduck/server.py", line 533, in <module> app = create_app() ^^^^^^^^^^^^ File "/Users/ezyang/Dev/scubaduck/scubaduck/server.py", line 287, in create_app con = _load_database(db_path) ^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/ezyang/Dev/scubaduck/scubaduck/server.py", line 81, in _load_database con.execute(f'CREATE TABLE "{t}" ({col_defs})') duckdb.duckdb.CatalogException: Catalog Error: Type with name LONGVARCHAR does not exist! Did you mean "JSON"? Fix the problem. In general, we should map types like this to just VARCHAR, we don't really care about length specifiers.
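The type normalization described here might be sketched like this (a hedged illustration; the real server code and the exact set of mapped types may differ):

```python
import re

def normalize_sqlite_type(typ):
    """Map a sqlite-declared column type onto a type duckdb understands.

    Length specifiers are dropped, and VARCHAR variants (LONGVARCHAR,
    NVARCHAR, TEXT, CLOB, ...) collapse to plain VARCHAR.
    """
    t = re.sub(r"\(.*\)", "", typ).strip().upper()  # strip e.g. (255)
    if "CHAR" in t or "CLOB" in t or "TEXT" in t:
        return "VARCHAR"
    return t
```

This loosely mirrors sqlite's own type-affinity rules, which also key off substrings like CHAR and TEXT.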

Fix LIMIT calculation for time series

The LIMIT calculation for time series is incorrect. It's directly using the passed limit from the UI, but actually it needs to be multiplied by the number of expected buckets, because the UI limit denotes a limit on the number of series.
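A sketch of the corrected calculation, assuming the bucket count can be derived from the time range and granularity (the function name and parameters are illustrative):

```python
def time_series_limit(series_limit, start_ts, end_ts, bucket_seconds):
    """The SQL LIMIT must cover every bucket of every requested series."""
    buckets = -(-(end_ts - start_ts) // bucket_seconds)  # ceiling division
    return series_limit * max(buckets, 1)
```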

Fix database type conversion error

I loaded a sqlite database with SCUBADUCK_DB and it failed with: File "/Users/ezyang/Dev/scubaduck/scubaduck/server.py", line 309, in create_app con = _load_database(db_path) ^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/ezyang/Dev/scubaduck/scubaduck/server.py", line 106, in _load_database con.execute(f'INSERT INTO "{t}" VALUES ({placeholders})', row) duckdb.duckdb.ConversionException: Conversion Error: Type INT64 with value 13385262862605259 can't be cast because the value is out of range for the destination type INT32 fix this

Fix flat line in time series view

This is buggy: 1. Switch to time series view. Keep Granularity at the default Auto. 2. Dive I expect to see some changes in the y-value of coordinates in the graph (thanks to the four samples in the table). However, the line is completely flat. Add a test for this and then fix it.

Add debugging info for query execution

I am running with SCUBADUCK_DB pointed at a nontrivial sqlite database. It failed with: $ SCUBADUCK_DB=history.sqlite flask --app scubaduck.server run --debug Traceback (most recent call last): File "/Users/ezyang/Dev/scubaduck/.venv/bin/flask", line 8, in <module> sys.exit(main()) ^^^^^^ File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/cli.py", line 1131, in main cli.main() File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/click/core.py", line 1363, in main rv = self.invoke(ctx) ^^^^^^^^^^^^^^^^ File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/click/core.py", line 1830, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/click/core.py", line 1226, in invoke return ctx.invoke(self.callback, **ctx.params) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/click/core.py", line 794, in invoke return callback(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/click/decorators.py", line 93, in new_func return ctx.invoke(f, obj, *args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/click/core.py", line 794, in invoke return callback(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/cli.py", line 979, in run_command raise e from None File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/cli.py", line 963, in run_command app: WSGIApplication = info.load_app() # pyright: ignore ^^^^^^^^^^^^^^^ File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/cli.py", line 349, in load_app app = locate_app(import_name, name) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File 
"/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/cli.py", line 245, in locate_app __import__(module_name) File "/Users/ezyang/Dev/scubaduck/scubaduck/server.py", line 573, in <module> app = create_app() ^^^^^^^^^^^^ File "/Users/ezyang/Dev/scubaduck/scubaduck/server.py", line 312, in create_app con = _load_database(db_path) ^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/ezyang/Dev/scubaduck/scubaduck/server.py", line 106, in _load_database con.execute(f'CREATE TABLE "{t}" ({col_defs})') duckdb.duckdb.ParserException: Parser Error: syntax error at or near "offset" Add more debugging information so we can see the query that was executed. If you are able to tell what the problem is, also fix it.

Implement adaptive resizing for SVG chart

Let's implement adaptive resizing for the time series svg chart. The SVG chart should resize so it always spans 100% of its container's width. When a resize occurs, we need to rescale the x-coordinates for all elements inside the SVG container appropriately. This should preferably be done by first calculating coordinates in bucket space, and then scaling it to pixel space. We do NOT use a top-level transform as this would distort text labels. We never resize the y-axis, it is kept fixed.

Improve legend layout in time series view

Let's make some improvements to the legend in time series view. We should split the right pane into two vertical subpanes, the left subpane is the legend and the right subpane is the chart. The left subpane has a fixed size while the right subpane will resize when we change the size of the browser.

Fix issue with derived column in time series view

This doesn't work: 1. Switch to time series view 2. Go to Columns and add a numeric Derived Column value_2 whose expression is value * 2 3. Dive We expect to see two lines (one for value, one for value_2), where the spikes in value_2 are twice as tall as value. But we only see one (for value). I also notice that the SQL query with a derived column is a nested query, but I don't think this should be necessary, the derived columns can be just included directly in the regular query.

Fix SVG chart overflow issue

The SVG chart in time series is computing a pixel width which extends all the way to the edge of the browser viewport (in fact, it is causing a horizontal scrollbar to appear), but it should respect the margin on the right side of the container it is in.

Update SQL generation indentation

Update the SQL generation in server to be properly indented for ease of reading.

Add time axis labels with dynamic granularity

In the time series SVG chart there are no labels on the x-axis. We should add tick marks at various intervals with labels describing the current time. The algorithm we will use is a little nontrivial, because we want to intelligently show the most useful labels.

- We only ever show label ticks snapped to "nice" time intervals, which are aligned to logical times. Our permitted granularities are: 1 second, 2 seconds, 5 seconds, 10 seconds, 15 seconds, 30 seconds, 1 minute, 2 minutes, 5 minutes, 10 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, 3 hours, 6 hours, 12 hours, 1 day, 2 days, 1 week, 2 weeks, 1 month, 3 months, 6 months, 1 year, 2 years, 5 years, 10 years. We always align to the "zero" of the appropriate granularity. So for example, at 2-second granularity, we will have 00, 02, 04, etc. We will NEVER have 01, 03, 05. We pick the granularity that brings us closest to a "target" number of ticks (for now, let's aim for 15 ticks).
- We intelligently decide how much information to show in the date label. Consider year-month-day-hour-minute-second. We will display either year, month-year, day-month, hour-minute, or minute-second. The choice of label is based on our granularity; for example, if our ticks are per year, we only display the year. There is an exception: suppose our ticks are per hour, but at some point we roll over to the next day. For the 00:00 tick, we instead display the coarser-granularity day-month. This applies at all levels of the hierarchy.
- The labels should be displayed horizontally if possible, but if spacing is tight they should instead be displayed at a 45deg angle.
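The granularity selection and zero-alignment steps can be sketched like this (calendar units are approximated in seconds for illustration; a real implementation would need true month/year alignment):

```python
GRANULARITIES = [
    1, 2, 5, 10, 15, 30,              # seconds
    60, 120, 300, 600, 900, 1800,     # minutes
    3600, 7200, 10800, 21600, 43200,  # hours
    86400, 172800, 604800, 1209600,   # days / weeks
    2592000, 7776000, 15552000,       # 1 / 3 / 6 months (30-day approx.)
    31536000, 63072000, 157680000, 315360000,  # 1/2/5/10 years (365-day approx.)
]

def pick_granularity(start, end, target_ticks=15):
    """Pick the 'nice' interval whose resulting tick count is closest to target."""
    span = end - start
    return min(GRANULARITIES, key=lambda g: abs(span / g - target_ticks))

def tick_positions(start, end, gran):
    """Ticks aligned to the zero of the granularity (e.g. 00, 02, 04, never 01)."""
    first = -(-start // gran) * gran  # first multiple of gran >= start
    return list(range(first, end + 1, gran))
```

Note that ticks are aligned to multiples of the granularity, not snapped to the start of the data range.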

Update mouseover interval detection algorithm

Let's change the algorithm for determining which interval we have moused over. When we hover over the SVG chart, we should render a vertical line snapped to the closest bucket. At every series this vertical line intersects, we display a (temporary) dot (with the appropriate color matching the series). Our mouse lies at some y-coordinate on this vertical line, with potentially a series above and below. The selected series is whichever of these series our mouse is closest to. We only do this matching when the mouse is over the svg element.
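In pixel terms, the snapping and series selection reduce to something like the following (an illustrative sketch, not the chart code itself; all names are hypothetical):

```python
def snap_bucket(mouse_x, x0, x_step, n_buckets):
    """Snap the mouse x-coordinate to the nearest bucket index."""
    i = round((mouse_x - x0) / x_step)
    return max(0, min(n_buckets - 1, i))

def closest_series(mouse_y, ys_on_line):
    """ys_on_line: series name -> y where the vertical line intersects it.
    The selected series is whichever point the mouse is vertically closest to."""
    return min(ys_on_line, key=lambda name: abs(ys_on_line[name] - mouse_y))
```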

Fix time series label offsets and add test

Do this: - Go to time series view - Dive For the example dataset, the hour interval was chosen to be 4 hours. We should only see labels 04:00, 08:00, 12:00, etc. But the labels are currently snapped to the starting bucket, which is not correct; the ticks should be offset so they line up with multiples of four. Furthermore, once we fix this, we should NOT see the label 00:00, instead, it should be a Day-Month label indicating the new day. Add a test and fix it.

Augment legend with hierarchical structure

We're going to augment the legend in the left subpanel of the time series right-panel view. We are going to give it a hierarchical structure: the first level of hierarchy is the group by, and the second level is the columns. The first level has a (non-bold) heading with the group by value (e.g., alice) and the aggregate (avg). The second level has the colored column entries which identify the relevant series, and a number showing the current value of that series per the mouseover crosshair.

Add time column selection and heuristic

Currently, we assume that a table has a 'time' column and use this to apply the Start/End filters. However, this column does not always exist, and even if it does exist we do not necessarily know its semantics. To get around this problem, we will apply two strategies: (1) we will use heuristics to guess the time column, and (2) we will always let the user select between all potential time columns. There are a few things that must be done:

- We look for any column whose name contains any of these substrings: timestamp, created, created_at, event_time, time, date, occurred, happened, logged and which is an appropriate type (timestamp, datetime, date, or an integer).
- Columns that could be time columns can either be a true "time" type, or just an integer. If it's an integer, we interpret it as a Unix timestamp in seconds. Any numeric/time column is eligible to be a time column. We add a new column selector dropdown above Start/End which specifies which column is treated as time (and default it to whatever the heuristic picked). The choice of time column is recorded in the URL params. We must also modify the server code to no longer hardcode the time column name and instead receive it from the query.
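The guessing heuristic in strategy (1) might be sketched as follows (the substring list comes from the prompt; the concrete type set and function shape are assumptions):

```python
TIME_HINTS = ("timestamp", "created", "created_at", "event_time",
              "time", "date", "occurred", "happened", "logged")
TIME_TYPES = {"TIMESTAMP", "DATETIME", "DATE", "INTEGER", "BIGINT"}

def guess_time_column(columns):
    """columns: list of (name, type) pairs. Return the first heuristic
    match, or None if no column looks like a time column."""
    for name, typ in columns:
        if typ.upper() in TIME_TYPES and any(h in name.lower() for h in TIME_HINTS):
            return name
    return None
```

The guess only seeds the default; the dropdown still lets the user pick any eligible column.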

Modify chart visual parameters

Let's modify some of the visual parameters on the chart. Let's increase the default stroke width to 1.3 (make the stroke width when focused 2.5). Let's increase the height of the svg chart so it's 600px. Let's make the ticks and axis line grey rather than black.

Add y-axis labels to time series chart

Let's add y-axis labels to the time series svg chart. Use a heuristic to get "nice" round looking label values subject to a target number of ticks (maybe 10 or so). For every label we add, render a horizontal grey line across the entire chart. These lines should be BEHIND the series line (z-index wise).
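One common heuristic for "nice" round label values is to round the raw step up to a 1/2/5 multiple of a power of ten; a sketch of that idea (not the app's actual code):

```python
import math

def nice_step(span, target_ticks=10):
    """Round span/target_ticks up to a 'nice' 1, 2, or 5 multiple of 10^k."""
    raw = span / target_ticks
    mag = 10 ** math.floor(math.log10(raw))
    for m in (1, 2, 5, 10):
        if m * mag >= raw:
            return m * mag
    return 10 * mag

def y_ticks(lo, hi, target_ticks=10):
    """Tick values between lo and hi, aligned to multiples of the nice step."""
    step = nice_step(hi - lo, target_ticks)
    first = math.ceil(lo / step) * step
    ticks = []
    v = first
    while v <= hi + 1e-9:
        ticks.append(v)
        v += step
    return ticks
```

Each tick value then gets a horizontal grey gridline rendered before (behind) the series paths.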

Add freezing feature for time series crosshair

When we click on the SVG time series chart, the vertical line crosshair and the values on the left hand column should stay frozen. Clicking again in the SVG area unfreezes and reverts back to the hovering crosshair.

Add dropdown for time unit selection

For Time Column we need another dropdown (put it next to it on the right) which lets you select if an integer column is s/ms/us/ns. And then interpret this appropriately in the backend.

Remove sqlite fallback in _load_database

In scubaduck/server.py there is a fallback path in _load_database where we manually import sqlite into duckdb. We do NOT ever want to do this. Remove the fallback path and stop suppressing the exception. This will result in a failure. I'm not sure what the test failure is, but on my real sqlite database it is: duckdb.duckdb.CatalogException: Catalog Error: Table with name sqlite_master does not exist! Did you mean "main.sqlite_master or memory.sqlite_master"? LINE 1: SELECT name FROM db.sqlite_master WHERE type='table' ^ Fix the problem.

Add helper links for drilling down/up in time series view

We're going to add some helper links to the left subpanel in the time series view to assist in drilling down / drilling up. When we are viewing a time series chart with NO group by set, at the bottom (after the legend) we add a Group by heading and then a link for every column which is eligible for group by (same algo as the Group by chip input autocomplete). Clicking that link gives you the same query as before but with Group by set to that column. It is probably best for the link to be implemented by filling in the Group by input and then triggering a Dive. If a Group By is already set, instead, the heading is Drill up and there's a single link Aggregate, which clears ALL group by columns.

Fix ValueError in SCUBADUCK_DB query

When I use SCUBADUCK_DB to load a sqlite database representing Chrome history, it fails when I query with: 127.0.0.1 - - [22/May/2025 00:26:48] "POST /api/query HTTP/1.1" 500 - Traceback (most recent call last): File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/app.py", line 1536, in __call__ return self.wsgi_app(environ, start_response) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/app.py", line 1514, in wsgi_app response = self.handle_exception(e) ^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/app.py", line 1511, in wsgi_app response = self.full_dispatch_request() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/app.py", line 919, in full_dispatch_request rv = self.handle_user_exception(e) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/app.py", line 917, in full_dispatch_request rv = self.dispatch_request() ^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/app.py", line 902, in dispatch_request return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args) # type: ignore[no-any-return] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/ezyang/Dev/scubaduck/scubaduck/server.py", line 584, in query mn = datetime.fromtimestamp(int(mn), tz=timezone.utc) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ValueError: year 424141527 is out of range It's hard to tell exactly what the context of this problem is from the traceback only. If you know how to fix it, write a test that exercises the problem and fix it directly. Otherwise, enhance the error information with extra information about the context of the request / useful information about the database in question.

Remove fallback path in _load_database

In scubaduck/server.py there is a fallback path in _load_database where we manually import sqlite into duckdb. We do NOT ever want to do this. Remove the fallback path and stop suppressing the exception. This will result in a failure. I'm not sure what the test failure is, but on my real sqlite database it is: duckdb.duckdb.CatalogException: Catalog Error: Table with name sqlite_master does not exist! Did you mean "main.sqlite_master or memory.sqlite_master"? LINE 1: SELECT name FROM db.sqlite_master WHERE type='table' ^ Fix the problem.

Fix INT64 to INT32 casting issue in sqlite query

I have a sqlite database with a visit_time column that is an integer representing time after the unix epoch, and I am trying to view it with the 'us' unit. However, this fails because of this error: Invalid Input Error: Type INT64 with value 13384551652 can't be cast because the value is out of range for the destination type INT32 SQL: SELECT id, url, visit_time, from_visit, external_referrer_url, transition, segment_id, visit_duration, incremented_omnibox_typed_score, opener_visit, originator_cache_guid, originator_visit_id, originator_from_visit, originator_opener_visit, is_known_to_sync, consider_for_ntp_most_visited, visited_link_id, app_id FROM "visits" WHERE TIMESTAMP 'epoch' + INTERVAL '1 second' * CAST(visit_time / 1000000 AS BIGINT) >= '2394-02-20 19:00:52' AND TIMESTAMP 'epoch' + INTERVAL '1 second' * CAST(visit_time / 1000000 AS BIGINT) <= '2394-05-21 03:40:15' LIMIT 7 Traceback (most recent call last): File "/Users/ezyang/Dev/scubaduck/scubaduck/server.py", line 653, in query rows = con.execute(sql).fetchall() ^^^^^^^^^^^^^^^^ duckdb.duckdb.InvalidInputException: Invalid Input Error: Type INT64 with value 13384551652 can't be cast because the value is out of range for the destination type INT32 Add a test for this and fix it.

Fix boolean field aggregation error

Boolean fields should be treated as integral, and coerced into either 0/1 before going into aggregation functions. Concretely, if we load a sqlite database with a bool column in it, we should be able to view table on it and not get this error: Binder Error: No function matches the given name and argument types 'avg(BOOLEAN)'. You might need to add explicit type casts. Candidate functions: avg(DECIMAL) -> DECIMAL avg(SMALLINT) -> DOUBLE avg(INTEGER) -> DOUBLE avg(BIGINT) -> DOUBLE avg(HUGEINT) -> DOUBLE avg(DOUBLE) -> DOUBLE LINE 1: ...) AS segment_id, avg(visit_duration) AS visit_duration, avg(incremented_omnibox_typed_score) AS incremented_omnibox... ^ SQL: SELECT count(*) AS Hits, avg(id) AS id, avg(url) AS url, avg(visit_time) AS visit_time, avg(from_visit) AS from_visit, avg(transition) AS transition, avg(segment_id) AS segment_id, avg(visit_duration) AS visit_duration, avg(incremented_omnibox_typed_score) AS incremented_omnibox_typed_score, avg(opener_visit) AS opener_visit, avg(originator_visit_id) AS originator_visit_id, avg(originator_from_visit) AS originator_from_visit, avg(originator_opener_visit) AS originator_opener_visit, avg(is_known_to_sync) AS is_known_to_sync, avg(consider_for_ntp_most_visited) AS consider_for_ntp_most_visited, avg(visited_link_id) AS visited_link_id FROM "visits" WHERE TIMESTAMP 'epoch' + INTERVAL '1 second' * CAST(visit_time / 1000000 AS BIGINT) >= '2394-02-20 19:00:52' AND TIMESTAMP 'epoch' + INTERVAL '1 second' * CAST(visit_time / 1000000 AS BIGINT) <= '2394-05-21 03:40:15' LIMIT 7 Traceback (most recent call last): File "/Users/ezyang/Dev/scubaduck/scubaduck/server.py", line 653, in query rows = con.execute(sql).fetchall() ^^^^^^^^^^^^^^^^ duckdb.duckdb.BinderException: Binder Error: No function matches the given name and argument types 'avg(BOOLEAN)'. You might need to add explicit type casts. 
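The coercion could be applied at SQL-generation time, along these lines (a sketch; the function name and signature are hypothetical, not the server's actual API):

```python
def agg_expr(col, typ, agg="avg"):
    """Build an aggregate expression, coercing booleans to 0/1 first
    so numeric aggregates like avg() accept them."""
    if typ.upper() in ("BOOLEAN", "BOOL"):
        col = f"CAST({col} AS BIGINT)"
    return f"{agg}({col})"
```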
Candidate functions: avg(DECIMAL) -> DECIMAL avg(SMALLINT) -> DOUBLE avg(INTEGER) -> DOUBLE avg(BIGINT) -> DOUBLE avg(HUGEINT) -> DOUBLE avg(DOUBLE) -> DOUBLE LINE 1: ...) AS segment_id, avg(visit_duration) AS visit_duration, avg(incremented_omnibox_typed_score) AS incremented_omnibox... ^ Add a test that triggers this and fix it.

Add default entry to X-axis dropdown

In time series view, the X-axis dropdown should also include a (default) entry. This default entry will use the time column setting, even if it's not a timestamp. We should respect the unit specified in time column as well. Add a test for this and fix it.

Prevent number-SI suffix break in samples view

In numeric columns in the table/samples view, we should never allow a break between the number and the SI suffix. Implement this using CSS, do NOT use a nbsp.

Add dedicated input field for Limit by chart type

Let's have a dedicated input field for Limit for each chart type (samples, table, time series) (despite them all having the same name) so that we retain distinct values for them when switching chart types. Specifically, test that switching from Time Series to Samples doesn't leave the limit at 7; it should go back to Samples' larger default.

Add test for column settings persistence

Let's modify how columns are represented in the query URL string. For each graph view type (samples, table, time series), we save a separate column setting of these types. So for example, we should add a test that if we modify the checked columns in samples, and then switch to time series, we do NOT see the new checked columns, we have the default selection. The default selection for columns is NO columns selected, except for samples, which is ALL columns selected.

Update aggregate to default to count

Make the default aggregate count. It should work in time series even if no columns are present (in fact, we just ignore columns entirely when the aggregate is count).

Prevent date line wrapping in table

We previously made numeric quantities in the table never line wrap before their SI suffix. Make dates also never line wrap (with CSS, not nbsp).

Remove decimals from integer fields

In table display (table and samples), integer fields should never render with a decimal (e.g., 30 not 30.000)

Add 'none' option for Time Column

Make it so that Time Column can be set to (none). When it is set this way, the Start/End fields are hidden and we don't apply a filter on time range.

Add column count to Columns tab

The Columns tab should have a parenthesized number saying how many columns are selected

Replace table selector with searchable dropdown

Make the table selector dropdown in the topmost header a searchable dropdown using our dropdown implementation, instead of a native one.

Make table select dropdown wider

Make the table select dropdown autocomplete bigger; it should be wide enough to contain all table names without truncating.

Fix dropdown refresh issue in Order By

Bug:
1. Open the app on a nontrivial sqlite database with multiple tables
2. Change the table
An extra dropdown in Order By pops up. The dropdown SHOULD refresh (to reflect the new column names of the new table), but we shouldn't keep adding new dropdowns lol.

Create sqlite test fixture and add table selection bug test

tests/test_web.py currently uses a very simple sample csv as the test database. We need a more complicated test database: a sqlite database that has multiple tables and exercises various column types. First, create some code that can generate this sqlite database on-the-fly as a test fixture in a new dedicated test file. Then, we'll use this test data to add a test case for a bug: when we load ScubaDuck with a sqlite database with multiple tables, change the table dropdown to select a new table, and then Dive, the 'table' URL parameter doesn't change. Add a test that triggers this problem, and then fix it.
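A sketch of what such an on-the-fly fixture could look like (the table and column names here, like events, extra, flag, and desc, are assumptions drawn from later prompts, not the actual fixture):

```python
import sqlite3

def create_test_database(path: str = ":memory:") -> sqlite3.Connection:
    """Generate a small multi-table sqlite database exercising several
    column types: integer, epoch timestamp, boolean, keyword-named text."""
    con = sqlite3.connect(path)
    con.execute(
        'CREATE TABLE events (id INTEGER PRIMARY KEY, ts INTEGER, '
        'flag BOOLEAN, "desc" TEXT, val REAL)'
    )
    con.execute(
        "INSERT INTO events VALUES (1, 1704067200, 1, 'first', 1.5), "
        "(2, 1704153600, 0, 'second', 2.5)"
    )
    con.execute("CREATE TABLE extra (num INTEGER, name TEXT)")
    con.execute("INSERT INTO extra VALUES (30, 'a'), (40, 'b')")
    con.commit()
    return con
```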

Rename test file and refactor fixture

Rename tests/test_table_dropdown.py to tests/test_multi_table_web.py. In this file there's a test fixture which generates a sqlite database on the fly and loads it. Move this somewhere so that when we run scubaduck by hand we can select this database as our test database, using the special value TEST.

Write test for table dropdown bug

Write a test case for this bug:
1. Use TEST database
2. Navigate to the extra table and Dive
3. Refresh the page
The table dropdown is expected to be extra, but it is actually events.

Quote columns in SQL generation

With the TEST db, all queries that involve desc fail because desc is a keyword. Modify the SQL generation to properly quote all columns so this is not a problem.
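The quoting itself can be a one-liner; a sketch (the helper name is assumed), doubling any embedded double quotes per the SQL standard:

```python
def quote_ident(name: str) -> str:
    """Quote a column/table identifier so keywords like `desc` are safe
    in DuckDB/SQLite; embedded double quotes are doubled."""
    return '"' + name.replace('"', '""') + '"'
```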

Factor out sidebar resizing logic

Factor out sidebar resizing logic JS in scubaduck/static/index.html into a dedicated JS file.

Update timestamp to unix epoch time

In _create_test_database let's represent the timestamp as a unix epoch time rather than a string.

Refactor time column heuristic

The heuristic for selecting a default time column currently lives in loadColumns in index.html. Let's factor it out into a dedicated JavaScript function with a logical input/output signature. It should take something like the names and types of all columns in the table, and return a time column to be selected, or null if no suitable time column was found and we should use (none).

Refactor JS for table rendering

In scubaduck/static/index.html there is JS for rendering tables (for the table and samples view), e.g., sorting them and coloring. Factor this into a dedicated JS file.

Refactor and split test_web.py

tests/test_web.py is quite long. Read through it, come up with a good organization into separate test files, and split it up.

Split test_server.py into multiple files

tests/test_server.py is quite long. Read over its contents, come up with a logical subdivision scheme, and then split it into multiple files.

Update guessTimeColumn function logic

We're going to make some modifications to guessTimeColumn. We will NOT propose a numeric column as the timestamp if it does not match the heuristic name. So we will only return a column from this function if (1) it is a timestamp, or (2) it has a heuristic name AND it is numeric or timestamp. A column that matches the heuristic name takes precedence over a generic timestamp column.
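A Python sketch of the updated heuristic (the real function is JavaScript; the list of heuristic names and the type checks are assumptions):

```python
# Hypothetical set of "heuristic" time-like name fragments
HEUR_NAMES = ("time", "timestamp", "ts", "created", "visited")

def guess_time_column(columns: list):
    """columns: list of (name, type) pairs. Prefer a heuristically named
    numeric or timestamp column; fall back to any timestamp; else None."""
    def is_ts(t):
        return "TIMESTAMP" in t.upper()
    def is_num(t):
        return t.upper() in ("SMALLINT", "INTEGER", "BIGINT", "REAL", "DOUBLE")
    def heur(n):
        return any(h in n.lower() for h in HEUR_NAMES)
    # Heuristic-named columns win, even over generic timestamp columns
    for name, typ in columns:
        if heur(name) and (is_num(typ) or is_ts(typ)):
            return name
    for name, typ in columns:
        if is_ts(typ):
            return name
    return None
```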

Factor out View Settings logic to JS file

In scubaduck/static/index.html, factor out all of the logic that controls View Settings / Columns to a dedicated JS file. This includes the logic for updating the URL parameters and propagating the URL parameters setting to the fields, and the behavior of Dive button. It also includes the columns logic.

Hide unit selector when time is none

When time column is (none), we should not show the unit selector (s/ms/us/ns)

Implicitly select column in order by

If Order By specifies a column that is not selected in Columns, implicitly select that column and show it in the results. It shouldn't actually show up as checked in the columns list, but we will include it in the query and in the columns we show returned. Add a test for this case (right now we fail with an exception).

Ignore unselected columns in order by

If Order by is specified on a column that isn't actually selected in Columns, just silently ignore it when constructing the query (it doesn't do anything).

Update table view for count aggregate

In Table view, when the aggregate is count, we should NOT show headings for columns, since we don't get any useful info on them. There should only be the group by columns and the Hits column. You can repro this on the TEST dataset by doing a table query with all columns checked, group by id, aggregate count. Add a test that asserts we only have an id column and a Hits column, and that the table has correct dimensions.

Left-align table headers

Headers in a table (for Table or Samples) should always be left aligned

Fix Time Column Start/End bugs

There are some bugs with Time Column / Start / End. Here is how to repro with the TEST dataset.
1. Go to Samples, set Time Column's time unit to s. It says: Out of Range Error: Overflow in multiplication of INT32 (1704067200 * 1000000)! First, we should have done this in INT64 precision.
2. Go to Samples, set Time Column's time unit to us (bypassing the problem above). Dive. We expect two entries, but there are none. It appears the Start/End is not calculated correctly.
Add a test for these two cases and then fix it.

Support relative month/year suffixes

In Start/End fields, "-1000 month" fails with Unknown string format: -1000 month (same with year). Support these relative suffixes. Add a test.
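Day/second offsets are plain timedeltas, but month/year offsets need calendar arithmetic. A stdlib sketch of that arithmetic (the helper name is assumed), clamping the day to the target month's length:

```python
import calendar
from datetime import datetime

def apply_relative(base: datetime, amount: int, unit: str) -> datetime:
    """Shift `base` by a signed number of months or years, clamping
    the day-of-month so e.g. Mar 31 - 1 month lands on Feb 29/28."""
    if unit == "year":
        amount *= 12
    total = base.year * 12 + (base.month - 1) + amount
    year, month = divmod(total, 12)
    month += 1
    day = min(base.day, calendar.monthrange(year, month)[1])
    return base.replace(year=year, month=month, day=day)
```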

Fix avg aggregate for boolean column

Bug: Aggregate avg cannot be applied to column flag.
Repro: Use TEST db. Events table, table view, start = -100000 day, end = now, group by id, aggregate avg. It fails with "Aggregate avg cannot be applied to column flag". But a bool column should be treated as a 0/1 integer when computing averages. Don't overzealously error out here, do the conversion!

Refactor tests to set default Order By

We are going to change the default setting for Order By on Table and Time Series view. To avoid breaking existing tests, we need to go through all tests and explicitly set what the current default is (currently, we seem to just pick the FIRST column in the database). Please do this preparatory refactor.

Add 'Samples' order by option

In the Time Series and Table view, we will add a new entry for Order by which is "Samples". This would cause us to order by COUNT(1), i.e., the number of hits we got for the row / series. This should work even if there are no columns selected. We should already be computing this quantity to display as a Hits column in our table, so you just need to make use of it. Also double check that the "Show Hits / Samples" is purely client side; it shouldn't cause the SQL query to stop issuing a COUNT(1) query (we just don't render the column). Add tests for this.
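One way the mapping might look in the SQL builder (a sketch; the function and parameter names are hypothetical):

```python
def order_clause(order_by, descending: bool = False) -> str:
    """Map the UI's Order By selection to a SQL fragment; the special
    'Samples' entry orders by the hit count we already compute as COUNT(1)."""
    if not order_by:
        return ""
    expr = "COUNT(1)" if order_by == "Samples" else f'"{order_by}"'
    return f" ORDER BY {expr}" + (" DESC" if descending else "")
```

Because COUNT(1) is already in the SELECT list for the Hits column, ordering by it adds no extra scan.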

Fix table column alignment bug

Bug: With TEST db, extra table, table view, group by num, aggregate count, the headers and data in the table don't line up; there's one more column in data. The headers are correct: we should have num (group by column), hits, and nothing else (because this is count aggregate).

Fix time series group by bug

Bug: with TEST db, time series view, group by num, we see nothing on the graph. When we remove the group by, we see two spikes (corresponding to the two rows in the table), so those spikes should be present in the SVG when we group by (ideally two distinct series.) Add a test for this and fix it.

Reset view settings on table switch

When we switch tables, we should reset all view settings to their defaults (as the user may have previously selected columns that no longer exist).

Fix time series aggregation bug

Bug: use TEST db, table events, time series, time column ts, time unit s, aggregate count, no columns selected. I expect to see a single series showing how many entries are per time bucket, but it's blank. Add a test and fix.

Run and fix lint, format, typecheck

rejected
Run and fix lint/format/typecheck.

Add test for bytes serialization to JSON

If a SQL database has a bytes column with binary data, it will not work if we try to view samples from it:

Traceback (most recent call last):
  File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/app.py", line 902, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
  File "/Users/ezyang/Dev/scubaduck/scubaduck/server.py", line 721, in query
    return jsonify(result)
  File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/json/__init__.py", line 170, in jsonify
    return current_app.json.response(*args, **kwargs)  # type: ignore[return-value]
  File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/json/provider.py", line 214, in response
    f"{self.dumps(obj, **dump_args)}\n", mimetype=self.mimetype
  File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/json/provider.py", line 179, in dumps
    return json.dumps(obj, **kwargs)
  File "/Users/ezyang/Library/Application Support/uv/python/cpython-3.12.4-macos-aarch64-none/lib/python3.12/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/Users/ezyang/Library/Application Support/uv/python/cpython-3.12.4-macos-aarch64-none/lib/python3.12/json/encoder.py", line 202, in encode
    chunks = list(chunks)
  File "/Users/ezyang/Library/Application Support/uv/python/cpython-3.12.4-macos-aarch64-none/lib/python3.12/json/encoder.py", line 432, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/Users/ezyang/Library/Application Support/uv/python/cpython-3.12.4-macos-aarch64-none/lib/python3.12/json/encoder.py", line 406, in _iterencode_dict
    yield from chunks
  File "/Users/ezyang/Library/Application Support/uv/python/cpython-3.12.4-macos-aarch64-none/lib/python3.12/json/encoder.py", line 326, in _iterencode_list
    yield from chunks
  File "/Users/ezyang/Library/Application Support/uv/python/cpython-3.12.4-macos-aarch64-none/lib/python3.12/json/encoder.py", line 326, in _iterencode_list
    yield from chunks
  File "/Users/ezyang/Library/Application Support/uv/python/cpython-3.12.4-macos-aarch64-none/lib/python3.12/json/encoder.py", line 439, in _iterencode
    o = _default(o)
  File "/Users/ezyang/Dev/scubaduck/.venv/lib/python3.12/site-packages/flask/json/provider.py", line 121, in _default
    raise TypeError(f"Object of type {type(o).__name__} is not JSON serializable")
TypeError: Object of type bytes is not JSON serializable

Add a test for this case, and when we serialize to JSON, we should serialize it to an escaped ASCII representation, the way byte literals in Python would be repped.
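A sketch of the requested serialization as a plain json default hook (Flask's actual hook lives in its JSON provider; the function name here is hypothetical):

```python
import json

def bytes_default(o):
    """Fallback encoder: render bytes the way a Python byte literal is
    repr'd, e.g. b"abc" -> "b'abc'", with non-ASCII as \\x.. escapes."""
    if isinstance(o, (bytes, bytearray)):
        return repr(bytes(o))
    raise TypeError(f"Object of type {type(o).__name__} is not JSON serializable")
```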

Add test for 45deg x-axis labels truncation

When x-axis labels in time series chart are angled 45deg and they are Day-Month, there is not enough space for the labels and they get truncated off of the SVG. Add a test for this, and then adjust the margin so they are not truncated.

Scroll dropdown to current selection

When we click on a dropdown with search (our custom dropdown implementation), we should scroll the dropdown so that the current selection is visible (nearly, but not quite, at the top of the viewport), rather than starting at the top.

Suggest correct time unit for errors

If we try to interpret an integral column as a time, but with the wrong unit, we currently get an error like "Invalid time value 13384414800000000 for column time_slot with time_unit s". In this case, we should look at the value and heuristically determine what the correct unit is, and suggest it to the user. (We WON'T select it by default, because that would require us to snoop data, and we don't want to snoop data unless the user Dives.)
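The magnitude-based guess could look like this (the thresholds are assumptions: epoch seconds are currently ~2e9, and each finer unit adds three orders of magnitude):

```python
def suggest_time_unit(value: int) -> str:
    """Heuristically guess whether an epoch-like integer is in
    s/ms/us/ns, based purely on its magnitude."""
    v = abs(value)
    if v < 10**11:
        return "s"
    if v < 10**14:
        return "ms"
    if v < 10**17:
        return "us"
    return "ns"
```

The example value from the error, 13384414800000000, would be suggested as us (microseconds).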

Filter series beyond limit on server side

When we plot time series where buckets are very sparse, we render too many series, because our limit assumes buckets are densely populated and we end up with extra series if they are not. On the server side, we should filter out series beyond the limit so we only include series within the limit.
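One server-side shape for this (a sketch; the identifiers are hypothetical): restrict the outer query to the top-N groups by total hits via a subquery, so sparse buckets cannot produce series beyond the limit.

```python
def top_series_filter(group_col: str, table: str, limit: int) -> str:
    """WHERE fragment restricting a time-series query to the top-N
    groups by total hit count."""
    return (
        f'"{group_col}" IN (SELECT "{group_col}" FROM "{table}" '
        f'GROUP BY "{group_col}" ORDER BY COUNT(1) DESC LIMIT {limit})'
    )
```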

Fix 'Unknown column: Samples' error

Bug: Go to Table view, add a Group By, and then Order By Samples. It says "Unknown column: Samples". The Samples column is actually the computed "hits", i.e., COUNT(1). Add a test for this and fix it. I think there might be some overzealous server validation, as we never hit the query.

Add test for SCUBADUCK_DB with parquet file

Add a test showing that SCUBADUCK_DB can be used with a parquet file. Add any missing functionality if needed.