onsdag, september 20, 2006

MySQL, some concrete suggestions!

After my post Rails, Databases, ActiveRecord and the path towards damnation, I got an e-mail from rten Mickos, the CEO of MySQL. He asked me to provide concrete suggestions on how to improve MySQL (since the other post just contained some unspecified not-like vibes), so that's the rationale for this post. I'm going to point at a few things I see as a problem for using MySQL as a production database right now. Standard disclaimer stands: these are my opinions, my own only, and my employer doesn't necessarily agree or disagree with them on any level.

Let us jump into the fray:
  • Sequences. I would like real, nice and sweet sequences. I really don't like to have no control of my primary key generation, and I especially don't like that I can't have sequences for anything else. The recommended solution according to the manual is to create a table with one auto-increment column in it, and use this as a sequence. That's not acceptable, especially since I cannot tie this so-called sequence to the generation of id's on other tables with subselects and other fun things.
  • OK, I really don't like the auto-increment feature. Why not provide an IDENTITY keyword like the non-core feature ID T174+T175 specifies?
  • Real, honest-to-god, boolean types. Real ones. Not tinyint(1)s. Not enums. Not tinyint's hidden behind the word boolean (like JDBC). Real boolean types.
  • I would like table1 and Table1 to be different (as per the spec). Oh yes, we seem to live in an insensitive world (case and otherwise) with Windows all over the place. But in my database I want that kind of control.
  • Limiting the return values of result sets. Now, I have no problem with LIMIT and friends, but since there is a spec, and that spec has a feature for this functionality too (T611), why can't that be in MySQL?
  • Time-types should be able to store fractional seconds and time zones.
  • And what's the matter with the TIMESTAMP type? That doesn't really do what the standard says it should do. Please give it a name not in the standard.
  • And for Pete's sake, double bars is for concatenation in SQL. || is for 'or' in programming, but SQL is a DSL. This screams leaky abstractions and is very annoying.
  • Stability of 5.0 features. I know triggers, foreign keys and stored procedures are all there now. But frankly, I don't trust my referential integrity with them yet. Not from a database vendor that a few years ago wrote in their manual that the only reason for foreign keys was to be able to let GUI's diagram relationships between database objects. Not from a vendor that said that you don't need transactions to ensure data integrity. All in all, I want these features to be around a few hours, get the bugs hashed out, let them be pounded on for a while. But that's not going to happen if people move to Rails, since Rails doesn't believe in data integrity or foreign keys.
Well, that's that. Only my opinions, remember? Anyway, for small and fast development, MySQL is really useful. I'm just arguing that a big production system should choose something else.

16 kommentarer:

Anonym sa...

Just a few notes:

"Real, honest-to-god, boolean types. Real ones. Not tinyint(1)s. Not enums. Not tinyint's hidden behind the word boolean (like JDBC). Real boolean types."

-- BIT(1) might be what you're looking for?

"And what's the matter with the TIMESTAMP type? That doesn't really do what the standard says it should do. Please give it a name not in the standard."

-- and break 10s or 100s of thousands, perhaps 1mio+ applications that have been developed to depend on it over the years? not very likely to happen :-)

"And for Pete's sake, double bars is for concatenation in SQL."

-- SET sql_mode = 'PIPES_AS_CONCAT';

"I don't trust my referential integrity with them yet. Not from a database vendor that a few years ago wrote in their manual..."

-- Referential integrity has been available for more than five years now. How long until you'd consider it stable? :-)

Ola Bini sa...

Bit(1) is not what I'm looking for. Boolean types which I can set true or false on. Explicitly. No conversion underneath. And Bit(1) doesn't really read like Boolean in my CREATE TABLE.

Regarding TIMESTAMP, I didn't say it would be doable. This is just things I really don't like about MySQL. I don't expect they will be able to do anything about it. I know as well as you about backwards compatibility. =)

Pipes: That's probably the best way available. But that's not going to cut it. To have SQL compliant behaviour only after setting something is much like not having it at all.

RI: Well, some parts have been 'sort of' available for five yours. Triggers? Will my triggers affect RI? Will my data integrity worsen because of cursors? Since these are some new features, and since data integrity is a real holistic issue, I don't think that 5 years for _some parts of it_ is enough.

SwitchBL8 sa...

Sequences
=========
1 word: AMEN! Auto-increment columns are for T-SQL adepts.

Booleans
========
[pseudo]
fakebool BIT(1);

if fakebool = 0 then
print 'Right'
else
print 'Wrong'
end if;
[/pseudo]
You can wait for this to show up on theDailyWTF. Not going to happen with REAL (as: non-fake) boolean-types.

RI
==
@anonymous: I'm an Oracle guy (I make a living with it, I don't work at Oracle). Oracle has declarative RI since Oracle 7 (1992). Even when it has been around for some time, Oracle still finds new ways and tricks to maintain RI. Don't compare 5 years of catching-up to almost 15 years of dedication.

As Ola states: MySQL is good for a fast read-most, update-hardly db backend for a website. I have yet to see a large production site (100's of databases in the range of 100GB or larger) that uses anything less than Oracle (or DB2, but don't mention that when my boss can hear it).

Roland Bouman sa...

"I really don't like to have no control of my primary key generation, and I especially don't like that I can't have sequences for anything else."

Seems very odd to me. I mean, either your primary key is a surrogate key, in which case I don't see why you should care, or the primary key is some sort of natural key in which case you have full control.

It really makes me curious, what would you want to control, and formost: why do you want to do it? (this is not a rhetorical question - please explain)

"The recommended solution according to the manual is to create a table with one auto-increment column in it, and use this as a sequence. That's not acceptable, especially since I cannot tie this so-called sequence to the generation of id's on other tables with subselects and other fun things."

Same thing here: why? what fun things? what problems do you solve with this kind of functionality?

"Pipes: That's probably the best way available. But that's not going to cut it. To have SQL compliant behaviour only after setting something is much like not having it at all."

Why not? you can control the entire sql mode at server startup, either from the command line or by specifying so in the configuration file (my.cnf or my.ini)

Roland Bouman

Ola Bini sa...

Seems very odd to me. I mean, either your primary key is a surrogate key, in which case I don't see why you should care, or the primary key is some sort of natural key in which case you have full control.
The situation could be inbetween. Or, I maybe want to generate values for something that isn't surrogate. Sequences, as a tool, is not available to me in MySQL, and since sequences is very useful for a variety of situations, this makes me frustrated.
Having auto-increment as an option for columns is fine, but I want the SQL standard way of IDENTITY columns. I would like control over my primary key generation. The world isn't always black and white, and sometimes surrogate keys isn't good enough.

Why not? you can control the entire sql mode at server startup, either from the command line or by specifying so in the configuration file (my.cnf or my.ini)But this only works if you have control over your database. In production, most organizations have an operations team which babysits the database. Or you may deploy your application to a third-party site where the database team isn't even part of your organization. In those situations it is highly doubtful that you will get them to change that flag, especially since that may have consequences for other applications running on the same DB.

Anonym sa...

So what do db experts think about www.db4o.com (as an alternative to MySQL)!

Roland Bouman sa...

"The situation could be inbetween. Or, I maybe want to generate values for something that isn't surrogate."

"Sequences, as a tool, is not available to me in MySQL, and since sequences is very useful for a variety of situations, this makes me frustrated."

Well, it seems to me you are repeating your earlier desire to have sequences. Please give an example of such an "in between natural and surrogate"-case. Please give an example of a variety of situations (or even one) why a sequence feature is very useful.

"Having auto-increment as an option for columns is fine, but I want the SQL standard way of IDENTITY columns. I would like control over my primary key generation."

Well, what kind of control would that be then? As far as I understand it, SQL Standard sequence generators can be controlled in DDL by specifying the base, minimum, maximum and increment value, and whether the sequence should cycle. I'm really curious how that would improve primary key generation. Again, could you give an example of how these features would be useful?

"The world isn't always black and white, and sometimes surrogate keys isn't good enough."

Well, that's when you should use natural keys, but apparently you want to generate natural keys...Sticking to the black and white: I'd really appreciate it if you could add the shades of grey to my world with an example.

"But this only works if you have control over your database. In production, most organizations have an operations team which babysits the database. Or you may deploy your application to a third-party site where the database team isn't even part of your organization. In those situations it is highly doubtful that you will get them to change that flag, especially since that may have consequences for other applications running on the same DB."

And that is exactly why you can set it at the session level too, and why views, triggers and stored procedures are executed with the sql mode that was active at DDL time.

Ola Bini sa...

Roland!

I don't have an good examples of this right now, unfortunately. One example of where sequences is useful is, just as you say, to help generate natural keys.

I didn't know you could set the sql_mode in your session. I stand corrected. But I do feel that the behaviour should be the other way around, but that's a small issue.

Mikey126 sa...

IIRC, the problem with boolean columns in SQL databases was realted to NULL, boolean being intrinsically bi-state and True/False/Null being, er, tri-state. Something like that.

I'm happy for you to have case-sensitivity, provided it's configurable. I spent a couple of years in a Sybase (case-sensitive by default) installation where some genius had tables with "TradeID", "TradeId", "tradeId", "tradeID" that meant he didn't have to use table aliases, since the column names were unique. I never got to meet him, which meant I carried that baseball bat for nothing...

Adam Sanderson sa...

I believe that valid use for sequences would be something like selecting from a temporary sequence.

IE: left join onto a sequence of dates incremented by one.

This way you get results as:
2006-01-01 null
2006-01-02 'joined data'
2006-01-03 null

as opposed to:
2006-01-02 'joined data'

I've never really needed to do it, but I would say it has its uses.

Jay Pipes sa...

Hi Oli!

Thanks for your suggestions. Roland has already touched on some key points I would make about sequences and SQL_MODE. I'll elaborate a bit.

On sequences, much of the complaint I hear regarding them in reality is due to folks being accustomed to the Oracle way of doing things. Relatedly, this is often why folks used to Oracle typically proclaim that PostgreSQL is a "vastly superior" product. The reason is that PostgreSQL *feels* very similar to Oracle. Why? Because it was built as an Oracle clone. MySQL was not, and while there have been various features added recently that make it behave a little more Oracle-like, it will never be Oracle. It doesn't want to be. The important thing that Roland was saying is "what are the advantages that the sequence functionality gives you". I challenge folks to demonstrate explicit scenarios where sequences provide substantial benefit over auto-incremented integer keys. Remember that sequences have been around for ages because they represent an ancient and outdated method of tracking order and SKU numbers for various schema entities where the SKU number was maintained on COBOL/mainframe applications and could not be changed. In modern applications, the use of sequences has become almost pedantic. The *old* use of sequences is easily emulated using a unique foreign key relationship and application logic, where, arguably, that code belonged in the first place, as it is not integral to the data itself, but rather application rules.

As for SQL_MODE, yes, you should investigate the various mode values and their affects. You will find at least a couple of your complaints (piping and date-behaviour) are affected by the sql mode.

Regarding the BIT/Boolean type, I am fairly confident that such a thing does not, in your words, mean MySQL cannot be used in a "big production system". Perhaps this is just a gripe? I'll give you that; but don't we all have gripes about all our favorite (and not so favorite) applications...?

@switchbl8: You need only look at the top 100 companies on the net to see large MySQL production databases well over the 100GB mark. YouTube, Yahoo!, Google, Travelocity, GoDaddy, Wikipedia, Friendster, Technorati, etc. All MySQL. All the time.

Am I saying MySQL is the best for everything? Course not! MySQL has its strengths and its weaknesses. PostgreSQL and Oracle are great databases. The points I raise are intended to dispel the FUD and point out the facts.

Cheers,

Jay Pipes
Community Relations Manager, North America, MySQL

I would like to add a

Eivind sa...

I've needed sequences for handling ids inside transactions, where I use several rolling IDs. I had to sort of re-implement sequences to make that code work with MySQL, which was an utter pain.

As for PostgreSQL feeling better for Oracle-people: I have never touched Oracle. Most of my experience with databases has been with MySQL, and my first experience with databases was with MySQL. The only reason I'm on MySQL instead of PostgreSQL today is the migration cost/risks - my experiences with running PostgreSQL has been much, much better than MySQL.

For MySQL to grow up and for me be anywhere near as pleasant to use as PostgreSQL, I believe it has to get a change of developer mentality, doing things properly instead of trying to meet a marketing checklist.

Example: Subselects has to work fully instead of working in some but not all cases. Roughly 1/3 of my subselects do not work in mysql due to psycho restrictions (cannot reference a table several places in the same query, for instance).

Example: Referential integrity has to work by default. Not work "when you add an index and set the table type to InnoDB and the server happens to be configured to work that way".

Example: MySQL has to fail instead of making up random behaviour for cases where it cannot do what I ask it to. One example is above, where there's at least 3 different ways I know for referential integrity checking to fail to work, and MySQL fails this SILENTLY, just dropping the checks. Another example is inserting bad data into enums, where this is turned into the empty string instead of generating an error. Even if the empty string isn't allowed in that enum.

All of this has to be fixed by DEFAULT, and be tunable for compatibility. Screwing up the user's data by default is unprofessional, and make me have no trust in MySQL. (MySQL regularly messing up and needing my personal touch to get going again does not help this trust.)

Jay Pipes sa...

Hi Eivind,

I appreciate your comments. Regarding this:

"I've needed sequences for handling ids inside transactions, where I use several rolling IDs. I had to sort of re-implement sequences to make that code work with MySQL, which was an utter pain."

still, it is not clear what *exactly* is the point of the sequences. Could you elaborate, perhaps with some code, so that we can learn more about the use case?

Thanks!

Jay

David sa...

One need we've had for keys is date+sequence that is reset everyday. So you'd have 2006091600001, 2006091600002 etc ...

Roland Bouman sa...

See, that's what I don't get. Why reset every day? Or, put another way, why reset *at all*?

I'm guessing this is in order to be able to identify the records in a chronological order within a day. That can certainly be useful.
However you could do so equally well if not *better* using only a unique auto_increment value. If you need a date too, well, no problem. You can still select per day, and you can still know how the records are ordered within that day.

The problem is, I'm totally missing *why* it is so important to have that sequence start again the next day. Please, explain...I'm curious, and I want to know what *real world* problems are solved with this technique.

(BTW - this particular setup is supported in MySQL too. It is not a well-know feature, but it has been there ever since:

mysql> create table inc (
-> d timestamp
-> , s tinyint unsigned auto_increment
-> , n char(10)
-> , primary key (d,s)
-> ) engine=MyIsam
-> ;
Query OK, 0 rows affected (0.01 sec)

mysql> insert into inc(n)values ('a'),('b');
Query OK, 2 rows affected (0.00 sec)
Records: 2 Duplicates: 0 Warnings: 0

mysql> select * from inc;
+---------------------+---+------+
| d | s | n |
+---------------------+---+------+
| 2006-09-28 02:32:39 | 1 | a |
| 2006-09-28 02:32:39 | 2 | b |
+---------------------+---+------+
2 rows in set (0.00 sec)

mysql> insert into inc(n)values ('c'),('d');
Query OK, 2 rows affected (0.00 sec)
Records: 2 Duplicates: 0 Warnings: 0

mysql> select * from inc;
+---------------------+---+------+
| d | s | n |
+---------------------+---+------+
| 2006-09-28 02:32:39 | 1 | a |
| 2006-09-28 02:32:39 | 2 | b |
| 2006-09-28 02:32:51 | 1 | c |
| 2006-09-28 02:32:51 | 2 | d |
+---------------------+---+------+
4 rows in set (0.00 sec)

As you can see, the auto_increment is reset automatically for each new timestamp value.

For MyISAM tables, the behaviour of auto_increment columns in a composite primary key is to increment within the combination of values in the preceding columns of the primary key, and to reset for each new combination
)

David sa...

The main reason was "the user was adament about it". Some of this goes back to the old days in the financial services business where forms where numbered and accounted/audited that way.

Another reason is readability when confirming transaction numbers. It's easier to say 153, 154 with the date implied vs. reading back a randam 10 digit number to a customer over the phone.