One of the best products of aws.amazon.com’s is undoubtedly Redshift.
Amazon Redshift as used as production data warehouse for our clients Analytical needs.
Amazon has tuned Redshift to much to the orders of magnitude for faster performance especially for COPY commands over INSERT commands by using its massively parallel processing (MPP) architecture to load (and read) data in parallel from multiple data sources.
We created a resilient(we thought) data pipeline that utilizes Amazon’s S3 service before copying the data into our Redshift cluster(As recommended by aws engineers). Additionally, Redshift has enabled us to quickly generate materialized views to provide near real-time analytics to our clients while maintaining high availability and throughput with our data pipeline.
Until we make our services Micro-service architecture to be better performant every thing was fine from Redshift side, but when we scale up the Micro-service based architecture in a full swing for production then slowly we are getting errors(Not actually errors but applying the consistency level expected from Redshift) from Redshift.
The error states that
Serializable isolation violation on table – xxxxx, transactions forming the cycle are: 56xxxxx, 56yyyyy (pid:29zzzzz)
Serializable isolation violation – Is the main cause for error.
let us dig in to it… For that we must know what it meant by Isolation level in a Database.Database isolation refers to the ability of a database to allow a transaction to execute as if there are no other concurrently running transactions (even though in reality there can be a large number of concurrently running transactions as in case of Redshift). The main goal is to prevent reads and writes of temporary, aborted, or otherwise incorrect data written by concurrent transactions.
Here the term isolation borrowed from ACID .
If we give the definition for term Isolation in the context of ACID , then it will be some thing like this
“Concurrently executed transactions are isolated from each other”
Why? — In order to provide consistency(C in ACID).
How? — Isolation was mainly achieved by serializing each transaction,which means that each transaction can pretend to be the only transaction running on in the entire database(If we do that then we can’t have MPP architecture in Redshift).The Database ensures that when the transactions are committed, the result is same as that it had run serially in reality it might have run concurrently and hence we achieve “Serializable isolation”
In practice actual Serializable isolation is rarely used because of serious performance penalty.
Even though we don’t use Redshift for OLTP, still there will be problem if there is no proper isolation levels while copying data in to it.
Implementation of Serializable isolation in Redshift
There are different ways to implement Isolation, Redshift uses the concept of serializable isolation to implement concurrent write operations. Since concurrently running transactions are invisible to each other, Redshift achieves “serializable isolation” by using write locks on tables to ensure that no two transactions are manipulating the same subset of data.
The basic premise is that if one transaction reads another transaction’s uncommitted writes than it will be a violating of isolation, so in order to avoid that there should be some kind of lock mechanism in which the subset of data in which the transaction was operating on will be locked until the current transaction was committed.
For the sake of better understanding let us consider a hypothetical rare scenario in which Two transactions A and B try to manipulate same set of data(same row), then here clearly Serializable isolation condition was violated.So If an application code tries to do concurrent operations in Redshift then Serializable isolation condition was violated to avoid that the developer of the app should explicitly declare redshift LOCK command. It works basically making the next operation wait until the previous one is closed. Note that, using this command the code will naturally get a little bit slower.
LOCK table_name; SELECT * from …
Now that we better understand what a serializable isolation error actually is, what sort of information is Redshift giving us with this error? We’re getting a table_id (91xxxx), a process_id (29xxx) and two transaction_id‘s (566xxxx1, 56xxxx2). Using these three pieces of information we can start to piece together information about where in our application code we have conflicting transactions. Luckily for us, Redshift automatically generates quite a few system tables and views that we can readily use. From Amazon’s documentation, svv_table_info shows summary information for tables in our Redshift cluster.
First, let’s find out exactly in which table the error occurred in by using the table_id in the error
-- Get table_name from table_id SELECT svv_table_info.table AS table_name FROM svv_table_info WHERE table_id='91xxxx';
Nice, we have a table name — that’ll be helpful later when looking through our application code for places where we manipulate that table. What about system tables for conflicts?
Redshift has a solution for that too: stl_tr_conflict. According to documentation:
Every time a transaction conflict occurs, Amazon Redshift writes a data row to the STL_TR_CONFLICT system table containing details about the aborted transaction. For more information, see Serializable Isolation.
That sounds promising — and those docs show that this table has a column for both process_id and table_id. We have both of those, so we can list our aborted transactions with the following query
— Aborted Transactions
SELECT * FROM stl_tr_conflict WHERE table_id=’91xxxx’ AND process_id=29xxxx;
will return xact_id = 56xxxxx1
That xact_id is looking mighty familiar — it matches up with one of our transaction_i‘s from above. So far so good. But what about the other concurrent transaction that was running simultaneously? How do we know what the queries were for the aborted transaction 56xxxxx1 and the successful transaction56xxxx2? We can get that information from svl_transactiontext as AWS docs say that this table has an xid(In Zookeeper it is zxid(transaction id)), which is the “transaction ID associated with the statement.”
-- Concurrent transaction queries SELECT xid, text AS query_text, starttime, endtime, CASE xid WHEN xid=56xxxxx1 THEN 1 ELSE 0 END AS aborted FROM svl_statementtext WHERE xid IN (56xxxx1, 56xxxx2) ORDER BY starttime;
Using these query_text substrings (Redshift inserts queries into svl_statementtext in 200 character increments), we can begin to dig into our application code to understand where these two queries are happening, why they’re happening concurrently and how we should go about fixing them.
As mentioned above, serializable isolation level comes with a performance cost( in the form of latency(tail latency amplification)), As we know that in redshift there are some important factors affecting query performance
- Total # of nodes → # of processors or slices
- Node type → storage capacity, memory and CPU
- Data distribution → try avoiding broadcast or redistribution for good
- Data sort order → use sort key
- Dataset size → vacuum & analyze
- Concurrent operations → can be improved through WLM rule
- Query structure → process and return as little data as will meet your need
- Code compilation → remember to always run a query more than once to assess its performance because the first time always includes compilation.
Even with all that power, it’s possible that you’ll see uneven query performance, or challenges in scaling workloads. Performance optimization for Amazon Redshift is a matter of doing some thoughtful up-front planning and ongoing monitoring as your data volume, users and cluster grow.Hope my thoughts will help you. Thanks for reading.