[Tech Talk] What Developers Should Know About SQL Optimization, Part 1: Fundamentals

Whenever an application includes features that require database access, the developers building them will likely need SQL optimization to keep the app running smoothly and reliably. At SaM Solutions, our developers understand that writing efficient and optimized SQL queries results in higher-quality, scalable software that delivers a good user experience.

To share internal best practices and proven solutions in this area, we have recently initiated a series of online meetups titled “SQL Optimization in the Context of Our Projects: Challenges and Successful Cases.” We specifically targeted professionals who are not database administrators or domain experts. The series aims to provide hands-on tips for software engineers with different experience levels who may confront query performance challenges throughout the development lifecycle.

Nearly forty developers from the company’s offices across the globe participated in the inaugural SQL meetup. The event’s main speaker, Dzmitry Verasau, a developer and team lead with 16 years of experience, invited colleagues to explore specific problems and solutions encountered in real-life projects and to transfer that practical knowledge to other work in the company. Read more about our knowledge-sharing initiatives in the side note below.

This first blog post offers an onboarding guide with several essential lessons discussed at the meeting, and we will then dive into the details in future posts.

Why Optimize?

SQL optimization is crucial in database management and application development for several reasons. SQL (Structured Query Language) is a powerful tool for interacting with relational databases, but inefficiently written SQL queries can lead to performance bottlenecks, increased resource consumption, and poor user experiences. Here are some key reasons why you need to optimize:

  • Improved Performance. Optimized SQL queries can execute much faster than poorly written ones. This means your applications and systems can respond to requests more quickly, providing a better user experience.
  • Resource Efficiency. Inefficient SQL queries can consume excessive server resources, such as CPU and memory. This is also why, for reliable and predictable operation of an SQL server, it is recommended not to deploy other applications on the same machine; in rare cases this is acceptable with additional resource allocation configuration (see Lesson 1).
  • Scalability. As your application grows, the database workload increases. Optimized SQL queries can handle a higher load with the same hardware, delaying the need for costly hardware upgrades. This scalability is crucial for accommodating more users and data without compromising performance.
  • Cost Reduction. Inefficient queries can increase operational costs, especially in cloud-based database services where you pay for resource usage. By optimizing SQL queries, you can reduce the number of computing resources required, leading to cost savings.
  • Consistent Response Times. Unoptimized SQL queries can exhibit erratic response times, making it challenging to provide a consistent user experience. Optimized queries ensure that response times remain stable even under varying workloads.
  • Data Integrity. Poorly written SQL queries can introduce data integrity issues by not properly enforcing constraints, performing incorrect updates, or causing data corruption. Optimized ones are more likely to operate on data safely and accurately.

When to Optimize?

We can divide all databases into three general groups by total size and record number.

Basic SQL optimization is most useful for medium-sized databases

  1. Small: It is hard to slow down an application running on a database that is less than 1 GB in size, so you rarely need to optimize these.
  2. Medium: Once your data volume reaches at least a million records across multiple tables, it becomes easy to construct nonoptimal queries (especially with joins) that will ruin your app performance. This type of database is most common in our projects.
  3. Big: For multi-terabyte databases with the number of records approaching billions, there are typically specific upfront requirements and designated database administrators employing advanced tools not covered here.

There are no strict limits in the given classification. Databases in the intermediate range of 100 GB – 1 TB can fall into either the ‘medium’ or ‘big’ category depending on other factors, such as average load and data structure. Yet, the medium-sized databases are where the tips offered in our meetup would be most relevant and helpful.

In the rest of this post, we attempt to explain some basic elements essential for query processing speed as applied to such databases. We focus on Microsoft SQL, but the principles for optimizing queries are similar across different database types.

Of course, there are many more challenges and solutions related to SQL performance, and what we discuss here is only the tip of the iceberg. In particular, we omit any issues related to SQL Server configuration, which can be crucial if not set up correctly.

How to Optimize?

Lesson 1: Set a minimum amount of memory for SQL Server

Databases contain files that store all data in blocks of 8 KB each, called “data pages” (see the diagram below). Thus, when a server needs to retrieve some data, it loads it from disk into memory in these small page-sized chunks.

Let’s say you have a query requiring you to read 8 GB of data while your RAM size is just 4 GB. In this case, each request will demand accessing data on disk drives, a much slower operation than reading from main memory. A SQL Server short on RAM may, instead of continuing to process the request, wait for memory to be released by other consumers, which can add multiple seconds of waiting time.

Thus, you must know how much RAM is available on the production server. Sometimes, development teams have no information about the database environment setup on the client’s side; that lack of visibility should itself be a warning sign.

Furthermore, in some projects, the production environment, including the database and web server, is located on a single computer with one OS, so they share the machine’s memory. The problem: SQL Server (at least in the case of Microsoft SQL) assumes ownership of virtually all the RAM available on the machine, trying to reserve it for performance purposes. If a web server also needs memory, a slowdown of the whole system is inevitable.

Dzmitry’s recommendation: When setting up an SQL Server that shares resources with other programs, including a development environment, reserve a guaranteed minimum amount of memory for it. With a few gigabytes of RAM secured, you can be confident the server has room for caching data pages in memory and executing queries quickly.
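For Microsoft SQL Server, this memory floor (and, if needed, a ceiling) can be set with sp_configure. Below is a minimal sketch; the 2048/6144 MB figures are placeholder values, so choose numbers that match your own hardware and workload.

-- Reserve a guaranteed minimum of RAM for SQL Server and cap its maximum
-- so other applications on the same machine keep some memory.
-- The values below are placeholders, not recommendations.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'min server memory (MB)', 2048;
EXEC sp_configure 'max server memory (MB)', 6144;
RECONFIGURE;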

Lesson 2: Understand what happens when you query data

What exactly is the SQL Server doing to process a query? It passes through several stages, as outlined in the original diagram from Microsoft’s guide below.

  1. Parsing: Based on the user’s query, the server tries to make sense of the request and construct a query tree (this process tends to be slow)
  2. Plan building: The server creates an optimized query plan unless it is already available in the plan cache, in which case it skips this stage, speeding up the whole process
  3. Execution: The query plan gets interpreted by the query execution engine
  4. Data access: The execution engine requests respective data rows
  5. Data transfer to the client: Loading of data pages into the buffer pool completes the process

Plan building is the step to pay attention to. Here, the server distinguishes between fast and slow operations for selecting the required data. For instance, if you query a single record by ID, it should be relatively quick, whereas a request for a data range would hint at the need to optimize.

Using a fast profiling tool such as Express Profiler (see recommended tools at the end of this post) for Microsoft SQL Server or EXPLAIN / EXPLAIN PLAN FOR commands in PostgreSQL / Oracle databases, you can quickly check query performance data.

Sample screen from Express Profiler v2.1

In particular, keep an eye on the duration and the number of reads and writes for different queries. The more complex the statement, the higher the number of read/write operations performed. Dzmitry has developed a rule of thumb for his projects: if the logical read/write operations exceed 5,000–10,000, the query has potential optimization problems and requires further investigation.
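If you do not have a profiler at hand, one quick way to see similar numbers in SQL Server Management Studio is to enable the session-level I/O and time statistics; the SELECT below is only a placeholder for the statement you actually want to measure.

-- Print logical reads, physical reads and CPU/elapsed time
-- to the Messages tab for every statement that follows
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*) FROM [dbo].[tblEmployee];  -- placeholder query; substitute your own

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;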

Lesson 3: Update statistics for query optimization

As explained above, the server starts responding to a request by building a query tree and deciding what action and in which order to take. At the same time, it is looking into ‘statistics’ – information about the distribution of column values in tables or indexed views.

The statistics help the SQL Server estimate the number of rows in the query result, which significantly speeds up query processing. While the server itself may update statistics based on particular triggers, Dzmitry recommends manually running the updating procedure (sample commands follow the list):

  • before any optimizations, 
  • periodically on the live server, 
  • after extensive data updates.
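As a minimal illustration (the table name here is only an example), statistics can be refreshed for a single table or for every user table in the database:

-- Refresh statistics for one table...
UPDATE STATISTICS [dbo].[tblEmployee];
-- ...or for all user tables in the current database
EXEC sp_updatestats;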

Sample statistics for query optimization

Let’s say your data contains a date of birth with a range of possible values (e.g., a minimum point 100 years ago and a maximum at today’s date). That gives the server an approximation of how large the result of a given range query will be. In a hypothetical query for records of persons born later than today, the server will return an empty result immediately, without even accessing the table and executing the query.
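As an illustration, assuming a hypothetical Employee table with a DateOfBirth column, the histogram tells the optimizer that no stored value exceeds today’s date, so it can estimate an (almost) empty result for a query like this one:

-- The optimizer estimates the row count for this range from the
-- DateOfBirth statistics before any data pages are touched
SELECT [Id], [FirstName], [LastName]
FROM [dbo].[Employee]
WHERE [DateOfBirth] > GETDATE();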

Once the SQL Server has estimated the number of records, it can allocate a corresponding memory size for that amount of rows. And what if something goes wrong while processing the query, and the result is significantly larger than estimated? The server will try to request additional memory or write your data into TempDB (a system database).

This example explains why the profiler can display write operations for SELECT statements intended only to fetch data, and it is another indication of a non-optimally constructed query. Dzmitry’s recommendation is to pay close attention to such cases.


On a Side Note: Knowledge Sharing at SaM Solutions

As a pioneer in the software development industry that has been building innovative corporate technology solutions for more than three decades, we understand that professional growth and knowledge sharing are not only essential to employee engagement. They are also the only way for us to stay relevant in the ever-evolving IT world and continue to deliver exceptional value to our clients.

One of the tools we have developed internally to provide our specialists with opportunities to share their experiences and best practices seamlessly and regularly is the SaM MeetApp platform. It helps us facilitate the activities of our various technical communities (.NET, Java, Frontend, QA, BA, etc.) across the entire company and organize their knowledge exchange through online and offline meetups. 

Since we launched the platform in 2019, more than 250 events have taken place, and technical meetups have also inspired our employees to hold sessions on soft-skill development, stress management, and other topics. In addition to our internal speakers, several internationally acclaimed experts have participated in the webinars.


Lesson 4: Choose your indexes wisely

An SQL index is a performance structure, essentially a lookup table, that helps the server speed up data retrieval through pointers that refer to the stored rows. Several index types exist, and choosing between an unordered heap and a table with a clustered index is a common dilemma. Each index type has pros and cons: analyze your unique data requirements to make the right choice.

The clustered index, automatically created with a primary key, is well-suited for range reads. However, it can present problems with random inserts, which require rebalancing the B-tree. The situation is the opposite with a heap: range operations get slower because the entire table has to be scanned and sorted, but inserts tend to be faster.

Non-clustered indexes can be helpful for fast access based on filters or for maintaining database consistency. They usually refer to the clustered indexes through row locators pointing to the latter’s data rows.

If your data contains globally unique identifiers (GUIDs), a widespread practice today for security and other reasons, Dzmitry recommends avoiding the standard schema with a clustered index. 

With random GUIDs, inserting every new record will restructure the B-tree, slowing the process and causing performance issues. An alternative is to use sequential GUIDs provided by the SQL Server, mitigating the inserts issue, though it brings the security issue back, as non-random IDs are not as safe.
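A minimal sketch of the sequential-GUID approach (the table and column names are made up for illustration) uses SQL Server’s NEWSEQUENTIALID() as the key’s default value, so new rows are appended to the end of the B-tree instead of landing at random positions:

-- Sequential GUIDs keep inserts append-only for the clustered index
CREATE TABLE [dbo].[Orders](
    [Id] uniqueidentifier NOT NULL
        CONSTRAINT [DF_Orders_Id] DEFAULT NEWSEQUENTIALID(),
    [CreatedAt] datetime2 NOT NULL,
    CONSTRAINT [PK_Orders] PRIMARY KEY CLUSTERED ([Id] ASC)
);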

Other index types worth knowing:

  • Unique – Use it to guarantee database consistency and facilitate plan generation for speeding up query processing.
  • Filtered – Indexing a portion of rows in a table instead of a full table is often helpful for reducing the space occupied by the index and improving query performance.
  • Index with included columns – Add non-key columns referenced by a query for faster data access, as the query optimizer would then locate everything within the non-clustered index without accessing the clustered one. 

Important note: the latter index needs to cover the query; if even one of the requested columns is missing from it, you may not see the performance gains. A combined sketch follows below.
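To make the last two items concrete, here is a hedged sketch (the table and column names are assumptions for illustration) that combines a filter with included columns so a typical “active employees” query is fully covered by the non-clustered index:

-- Filtered, covering index: only rows with Status = 1 are indexed, and the
-- included name columns let the query avoid touching the clustered index
CREATE NONCLUSTERED INDEX [IX_Employee_Active_Names]
ON [dbo].[Employee] ([Status])
INCLUDE ([FirstName], [LastName])
WHERE [Status] = 1;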

Lesson 5: Keep row size as small as possible

One of the meetup’s activities was an exercise in creating a table optimized for SQL querying, where all participants could comment on the table design and recommend tweaks to its fields. Here is the original SQL statement:

CREATE TABLE [dbo].[tblEmployee](
[Id] [uniqueidentifier] NOT NULL,
[FirstName] [nvarchar](max) NULL,
[Last_Name] [nvarchar](max) NULL,
[Full_Name] [nvarchar](max) NULL,
[Phone] [nvarchar](50) NULL,
[DateOfBirth] datetime NULL,
[Status] int NOT NULL,
CONSTRAINT [PK_Employee] PRIMARY KEY CLUSTERED ([Id] ASC)
)

Our developers voiced and submitted in the chat a variety of solutions to make the table support higher query performance. As a possible optimized variant, consider the following revised statement:

CREATE TABLE [dbo].[Employee](
[Id] [uniqueidentifier] NOT NULL,
[FirstName] [nvarchar](50) NULL,
[LastName] [nvarchar](50) NULL,
[FullName] AS [FirstName] + ' ' + [LastName],
[Phone] [varchar](15) NULL,
[DateOfBirth] date NULL,
[Status] tinyint NOT NULL,
CONSTRAINT [PK_Employee] PRIMARY KEY NONCLUSTERED ([Id] ASC)
)

Without going into the details of each change, a critical design recommendation is to shorten the row size and avoid including “just-in-case” extras. If you have a wide table with hundreds of columns, less data will fit into each page, and SQL Server will need to perform more operations to process it.

Remember that just 1 byte per data row for a table containing 100 million rows adds up to roughly 100 megabytes. And, of course, you need to consider the additional size of any indexes, which can be larger than the table itself. Also, Microsoft SQL Server currently limits an index key to 900 bytes for a clustered index and 1,700 bytes for a non-clustered one.
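A quick way to see how these decisions play out in practice is to ask SQL Server how much space a table and its indexes actually occupy (the table name is the example from above):

-- Reports the number of rows and the space used by the data and the indexes
EXEC sp_spaceused N'dbo.Employee';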

As for the name fields in the table example above, there is no single correct approach. Dzmitry noted some common misconceptions about persons’ names, summarized in a famous post by Patrick McKenzie (see recommended readings at the end of this post).

Lesson 6: Use stored procedures, but remember to recompile

A stored procedure, or a grouped batch of SQL statements saved in the database for further reuse, is another commonly used tool for speeding up query processing. But why are procedures faster than regular SQL queries? 

When you execute a stored procedure for the first time, SQL Server caches its execution plan by default. As noted in Lesson 2, the parsing and compilation stage is relatively slow, which makes rebuilding the plan on every request inefficient. With a stored procedure, the query processor does not need to construct a new plan each time, so the processing time is typically shorter.

There is one caveat, though. SQL Server uses parameter sniffing to optimize the execution plan of a query based on the values of the parameters passed to it. Sometimes, this feature can backfire and lead to suboptimal plans, especially when the data distribution or statistics are skewed or outdated. If there are significant changes in the tables or data, Dzmitry advises recompiling the stored procedure, forcing a new execution plan:

ALTER PROCEDURE [dbo].[spDoIt]
@dateTo AS datetime
WITH RECOMPILE -- force a fresh execution plan on every call
AS
BEGIN
  SET NOCOUNT ON;

  DELETE FROM [dbo].[Table]
  WHERE [ProcessedDate] IS NOT NULL AND [ProcessedDate] < @dateTo
  --OPTION (RECOMPILE) -- alternative: recompile only this statement
END

However, note that this is a workaround, and you shouldn’t simply add recompilation to every stored procedure. Make sure the benefits of this solution outweigh its drawback: a lengthier planning phase on every execution.

Lesson 7: Use a test database identical to the production

Regardless of the optimization techniques you are trying to apply, these efforts may become useless if your database differs from the one you will deploy in the production environment. The reason for that is apparent: the query speed depends on how much data you have, its distribution, and other unique characteristics of the given database.

What if the project involves critical or sensitive data that the client is not inclined to disclose to the development team? In that case, Dzmitry recommends anonymizing the data or replacing it with random values while keeping the total row count and the row sizes unchanged.
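One minimal anonymization sketch (the table and columns are assumptions borrowed from the Lesson 5 example) replaces name values with random characters of the same length, so row counts and row sizes stay comparable to production:

-- NEWID() yields a random GUID per row; trimming it to the original length
-- keeps the row size roughly unchanged while removing the sensitive value
UPDATE [dbo].[Employee]
SET [FirstName] = LEFT(CONVERT(varchar(36), NEWID()), LEN([FirstName])),
    [LastName]  = LEFT(CONVERT(varchar(36), NEWID()), LEN([LastName]));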

Final Notes

During the meetup, Dzmitry used SQL Server Management Studio to show several real-life examples of nonoptimal queries and suggestions for improving them. For instance, what he dubbed a “left-outer-join problem,” often causing troubles in development projects, could be effectively resolved by splitting the original query into two separate ones, resulting in a significant performance increase.

While these cases are beyond the scope of the current blog post, many meetup participants involved in the discussions expressed their strong interest in holding a follow-up session to dive deeper into the examples and details of SQL optimization. Watch out for future posts in this series that will provide further practical tips.

Recommended by SaM Solutions Meetup Participants

Tools

  • ExpressProfiler – a simple and fast tool for Microsoft SQL Server profiling with a basic GUI. It supports SQL optimization by letting you quickly find performance information about your queries.

Further Readings

  • “Falsehoods Programmers Believe About Names” by Patrick McKenzie – the post on common misconceptions about persons’ names referenced in Lesson 5.
